E3Former

An online ensemble Transformer for workload forecasting in large-scale predictive auto-scaling

Chen et al. (2025)


TL;DR

Cloud platforms want to scale a service’s machines before a traffic surge, not after — which means forecasting how busy each service will be in the next minutes-to-hours. The twist this paper attacks is that the workload keeps changing: a forecasting model that was accurate last week slowly goes stale as user behavior drifts and the system gets updated. E3Former is an online forecasting model — it keeps learning from each new observation as it streams in, instead of being frozen after training. Three ideas make it work and keep it cheap enough for a “serverless” (pay-per-use, minimal-overhead) setting: (1) it slices the same input at several patch sizes at once so different copies (“subnetworks”) each capture cycles at a different time scale, but all share one Transformer backbone so the extra copies cost almost no extra parameters; (2) an Adapter nudges the network’s attention weights on the fly to track drift; and (3) an Ensembler continuously re-weights the subnetworks’ forecasts using online-learning theory (with provable “no-regret” guarantees) so the combined forecast stays robust when the data distribution shifts. It comes in two flavors — E3Former-OS (more accurate) and E3Former-FTPL (more efficient). On twelve datasets it beats the previous best online model (OneNet) by roughly 14% MSE / 12% MAE / 19% WMAPE while using only ~36k–41k parameters (about a sixth of OneNet’s). It is deployed in ByteDance’s IHPA auto-scaling platform across 30+ applications and 600,000+ CPU cores, cutting resource usage by over 40% while essentially preserving service quality. E3Former only produces the forecast; ByteDance’s IHPA platform does the actual scaling.


The Problem (and why simple autoscaling isn’t enough)

Autoscaling means automatically adjusting how many resources (machines, containers / “pods”, CPU cores) a cloud application gets, so it can handle its load without wasting money.

The naive approach is reactive autoscaling: watch a metric like CPU, and when it crosses a threshold, add machines. The problem is that allocating physical resources takes time (a provisioning / “cold start” latency). So a reactive scaler is always lagging — by the time the new capacity is ready, users have already suffered.

Proactive (predictive) autoscaling fixes this by forecasting the future workload and scaling ahead of time, hiding the provisioning delay. The forecast is the hard part — and that is exactly what E3Former provides. The paper calls the forecasting model “the core” of a predictive auto-scaling system. The stakes are concrete: the paper reports that “elaborate auto-scaling systems with accurate workload forecasting can enhance service quality by 20% while reducing resource waste by 15%,” and notes in a footnote that at hundreds-of-thousands-of-CPU-core scale “every 1% of resource wastage translates to an annual revenue loss of tens of thousands, if not hundreds of thousands, of dollars.” That is why a better forecast (not a fancier scaler) is worth chasing.

What makes forecasting hard here specifically? The paper lists four workload challenges the model must handle:

  1. Complex periodic patterns. Real cloud workloads repeat on multiple cycles at once — hourly, daily, weekly, even seasonal — and these are intertwined with irregular bursts.

  2. Long prediction length. Because physically allocating resources has latency, the forecast must reach minutes-to-hours into the future, not just one step.

  3. Changing dynamics (the crux of this paper). User behavior shifts and systems get updated, so the statistics of the workload itself drift over time. A model trained once and frozen will slowly degrade. The model must adapt online.

  4. Robustness. It must reliably meet SLAs even as conditions change.

There is a second, easy-to-miss subtlety the paper emphasizes: fine-grained, high-frequency periodicity. The workload is sampled at minute-level granularity, yet it still carries broader daily/weekly/monthly cycles. In the paper’s words: “Despite the data being minutely granular, it also captures broader cycles on daily, weekly, and monthly scales.” Capturing this multi-granularity periodic pattern (many nested cycles spanning very different time scales) is the central representation challenge, and the paper argues that recent methods cannot capture such complicated dynamics.


Background

Time series. A sequence of numbers measured over time, e.g. queries-per-second every minute. The model’s job: given recent history, predict the next stretch.

Lookback window (L) and forecast horizon (H). The model reads the last L time steps (the lookback) and predicts the next H time steps (the horizon). E3Former uses L = 1440 (= 24 hours at 1-minute granularity) for every benchmark, and varies H over {1, 10, 30, 60} (i.e. 1 to 60 minutes ahead). (Notation note: the paper occasionally writes the horizon as T instead of H; the two symbols are used interchangeably.)
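As a minimal illustration of the lookback/horizon split, the sketch below cuts one (history, target) pair out of a synthetic minute-level series. The helper name `make_window` and the synthetic data are assumptions made for illustration; only L = 1440 and H = 60 come from the text.

```python
import numpy as np

def make_window(series: np.ndarray, t: int, lookback: int = 1440, horizon: int = 60):
    """Return (history, target) for a forecast issued at time index t.

    history: the `lookback` points strictly before t.
    target:  the `horizon` points starting at t.
    """
    if t < lookback or t + horizon > len(series):
        raise ValueError("not enough data around t")
    return series[t - lookback:t], series[t:t + horizon]

# Synthetic two-day, single-channel series at 1-minute granularity (illustrative only).
series = np.arange(2 * 1440, dtype=float)
history, target = make_window(series, t=1440, lookback=1440, horizon=60)
print(history.shape, target.shape)
```

With H = 1 the same helper returns a single next-minute target, which is the shortest horizon the paper evaluates.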

Transformer & “attention”. A Transformer is a neural network whose core trick, attention (more precisely self-attention), lets every point in a sequence weigh every other point to decide what matters. Multi-head self-attention (MHSA) runs several attention computations (“heads”) in parallel and concatenates them.

Patching (and PatchTST). Instead of feeding the Transformer one time step at a time, modern forecasters chop the series into patches (short contiguous chunks of P consecutive values) and treat each patch as one “token.” This is the PatchTST design. E3Former is built directly on PatchTST: the paper states that “the removal of MIMO and the online adaptor from our model results in its simplification to a PatchTST,” where MIMO covers the multi-resolution patching plus ensemble and the adaptor is the online Adapter. So think of E3Former as PatchTST + multi-resolution ensembling + online adaptation.

Why “online” forecasting? (the single most important idea here)

Most forecasting models are trained offline: you collect a fixed dataset, train until convergence, freeze the weights, and deploy. That is fine if the world stays still. But cloud workloads drift — a phenomenon called concept drift or distribution shift: the relationship the model learned (history -> future) slowly stops holding because user behavior and the system itself change. A frozen (“static”) model therefore degrades over time on a drifting stream.

Online learning is the alternative: the model never stops training. It processes the stream one observation at a time, and every time the truth arrives and the prediction was wrong, it updates its own parameters a little. The paper frames this formally in terms of regret.

How do you even define “good” when the model and the data both keep changing? Online learning uses regret. Regret compares the model’s actual cumulative error against the error of the single best fixed parameter setting you could have chosen in hindsight (Eq. 1):

regret = sum_t loss( f(Xt; theta_t), x_t )  -  inf_theta  sum_t loss( f(Xt; theta), x_t )

A learner is “good” if its regret is sub-linear in time, regret = o(T), which means regret / T -> 0 as T -> infinity. In plain words: averaged over time, the online model does asymptotically as well as the best fixed model chosen with perfect hindsight. This “no-regret” property is what the paper proves for both of its ensemble variants (both achieve O(sqrt(T)) regret — see How It Works). It is the theoretical safety net that justifies updating the model continuously.
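To make the regret definition concrete, here is a toy sketch that measures regret for a deliberately simple online learner: a single scalar parameter updated by a gradient step with decaying step size under squared loss. This is not E3Former's ensembler; the learner, the data, and the name `toy_regret` are assumptions chosen only to show average regret shrinking as T grows.

```python
import numpy as np

def toy_regret(targets: np.ndarray) -> float:
    """Regret of a running-mean online learner under squared loss."""
    theta = 0.0                 # arbitrary starting parameter
    online_loss = 0.0
    for t, y in enumerate(targets, start=1):
        online_loss += (theta - y) ** 2   # predict first, then observe the truth y
        theta += (y - theta) / t          # gradient step on (theta - y)^2 with step size 1/(2t)
    best_fixed = targets.mean()           # best single parameter in hindsight
    hindsight_loss = float(np.sum((best_fixed - targets) ** 2))
    return online_loss - hindsight_loss

rng = np.random.default_rng(0)
targets = rng.uniform(0.0, 1.0, size=10_000)
for T in (100, 1_000, 10_000):
    print(T, toy_regret(targets[:T]) / T)   # average regret per step
```

The printed ratio is regret / T, the quantity that must tend to zero for a no-regret learner.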

Why two storage tiers matter (and limit history). The paper notes a practical reason history is bounded: cloud monitoring keeps data in two tiers — cold storage (e.g. ClickHouse / OLAP databases, holding months of data but slow to read) and hot storage (e.g. Redis / in-memory, fast but small, e.g. only the past week). Querying long histories online is expensive, which is exactly why the lookback is capped at L = 1440 — “the maximum reasonable query length that the cloud computing system we are using can handle online.”

The “serverless” cost constraint. Serverless systems demand a lean operational ethos: any forecasting model must add minimal compute and parameter overhead. This is why E3Former works hard to be tiny (~36k–41k params) and to make the ensemble nearly free — see How it stays cheap.


Contribution in Simple Terms

E3Former (the “Ensembled, Efficient, online Former”, an online ensemble Transformer) is an online workload forecaster built from four synergistic components. The paper’s own one-line summary of each:

  1. Representer — “Extracts multi-resolution periodic features through multi-resolution patching operation, explicitly modeling nested cycles.” (It slices the input at several patch sizes so each scale exposes a different cycle.)

  2. Transformer — “Leverages self-attention modules to capture long-range dependencies.” (A single, shared PatchTST-style encoder.)

  3. Adapter — “Implements online adaptation via cumulative gradient-based parameter updates, enabling rapid refining to workload dynamics.” (Tracks drift at inference time.)

  4. Ensembler — “Combines results from different sub-networks through adaptive weighting, enhancing robustness against distribution shifts.” (Continuously re-weights the subnetworks using no-regret online learning.)

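The genuinely new pieces, in plain terms, are the multi-resolution patching with its shared-backbone (MIMO) ensemble and the online Adapter; without them the model reduces to PatchTST.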

The headline result: better accuracy than every online baseline on twelve datasets, with a fraction of the parameters and far higher throughput, plus a real production deployment in ByteDance’s IHPA auto-scaling platform.


How It Works, Step by Step

Here is the end-to-end journey, from raw history to forecast. The data flow differs between offline training and online inference, so both are shown.

The four Representer operations (Section 4.1)

The Representer turns the raw window into the multi-resolution patch groups the backbone consumes. Four operations, in order:

(a) Channel Independence. The input Xt is multivariate — L time steps of M metrics (shape L x M). Channel independence treats this as an M-sized mini-batch of M separate univariate series {Xt,i}. Why: it multiplies the effective training set by M, exploits GPU parallelism, and avoids overfitting to spurious correlations between channels. Net effect: forecasting is effectively univariate per channel, even though the raw input is multivariate. (From here on the paper abbreviates each univariate series as Xt in R^L.)

(b) RevIN (Reversible Instance Normalization). Each input series is instance-normalized by subtracting its mean mu and dividing by its std sigma (X_hat = (Xt - mu) / sigma), then passed through a small learnable affine transform with parameters r1, r2 (X_hat = r1 * X_hat + r2). At the output, this is reversed (denormalized) to restore the original scale. Why: it regularizes and handles distribution shift in the input window — instances at wildly different magnitudes get mapped to a comparable distribution.

(c) Multi-resolution patching — the core periodicity trick. Given a patch size P, the normalized series of length L is cut into N = ceil(L/P) patches of size P (the last value is repeated L mod P times to pad the end so dimensions stay consistent). The key insight, citing prior work: “patch-based projections with larger patch sizes can capture high-frequency features and vice versa” — i.e. different patch sizes expose periodicities at different levels. So instead of one patch size, E3Former uses a group of patch sizes {P1, ..., Pd} (default {16, 32, 64, 128}, giving d = 4), producing d groups of patches that explicitly model the nested cycles at multiple resolutions.

(d) Alignment (the efficiency enabler). Different patch sizes yield different patch dimensions, so the d groups can’t be fed to one shared model as-is. Each group is projected through its own small linear layer to a common hidden dimension D. After this alignment, all d groups share a uniform shape and can be processed in parallel by a single backbone. (Extraction note: the paper’s text here contains a typo — “produce f outpus” — which almost certainly means “produce d outputs,” one per input group.)
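The Representer operations (b) to (d) can be sketched in a few lines of NumPy. This is a sketch under stated assumptions, not the authors' implementation: the hidden dimension D = 32 is hypothetical (the paper does not report it), the alignment matrices are random stand-ins for learned weights, `eps` is added for numerical safety, and the padding here repeats the last value as many times as needed to fill the final patch.

```python
import numpy as np

def revin_normalize(x, r1=1.0, r2=0.0, eps=1e-5):
    """Instance-normalize one series, then apply the affine (r1, r2)."""
    mu, sigma = x.mean(), x.std() + eps      # eps avoids division by zero (assumption)
    return r1 * (x - mu) / sigma + r2, (mu, sigma)

def revin_denormalize(y, stats, r1=1.0, r2=0.0):
    """Undo the affine and the normalization to restore original units."""
    mu, sigma = stats
    return (y - r2) / r1 * sigma + mu

def patchify(x, patch_size):
    """Cut a series into ceil(L / P) patches, padding the tail with the last value."""
    n_patches = -(-len(x) // patch_size)     # ceiling division
    pad = n_patches * patch_size - len(x)    # values needed to fill the last patch
    if pad:
        x = np.concatenate([x, np.full(pad, x[-1])])
    return x.reshape(n_patches, patch_size)

rng = np.random.default_rng(0)
L, D = 1440, 32                              # D = 32 is hypothetical
patch_sizes = (16, 32, 64, 128)

# Synthetic single-channel workload with a daily cycle (illustrative only).
series = 50.0 + 10.0 * np.sin(np.arange(L) * 2 * np.pi / 1440) + rng.normal(size=L)
x_hat, stats = revin_normalize(series)

# One alignment matrix per patch size maps P_i-dimensional patches to dimension D.
align = {p: rng.normal(scale=p ** -0.5, size=(p, D)) for p in patch_sizes}
groups = {p: patchify(x_hat, p) @ align[p] for p in patch_sizes}

for p, g in groups.items():
    print(p, g.shape)    # (N_i, D) with N_i = ceil(1440 / P_i)
```

After alignment every group has the same trailing dimension D, which is what lets a single shared backbone process all of them.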

The shared Transformer backbone (Section 4.2)

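A standard Transformer-encoder-style module providing long-range dependency modeling. Each attention layer combines multi-head self-attention (MHSA), Batch Normalization, and a ReLU feed-forward network (FFN); positional encoding is applied before attention.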

The crucial efficiency point — how it stays cheap enough for serverless: using d independent models would scale compute linearly with d. The MIMO mechanism instead routes all d aligned patch groups through one parameter-shared encoder. The paper states “in E3Former there are d independent subnetworks with a part of shared parameters.” The only per-subnetwork weights are the small alignment linear layers (one per patch size) and the per-group forecast heads; everything heavy (the attention blocks) is shared. That is why the full 4-subnetwork ensemble adds almost nothing over a single PatchTST, landing at just ~36k–41k total parameters.

The Adapter — online drift tracking (Section 4.3)

The Adapter is active in inference only (Figure 5 labels it “Parameter Adaptation (Inference Only)”). It adjusts the self-attention layer’s weights at the parameter level so the model bends toward current conditions. Inspired by FSNet, it works in three moves:

  1. Smoothed gradient (EMA). Single-step gradients on a drifting stream are noisy. The Adapter keeps an Exponential Moving Average (EMA) of the attention layer’s gradient (Eq. 6): nabla_hat_new = gamma * nabla_hat_old + (1 - gamma) * nabla_t, where nabla_t is the current gradient, gamma the decay coefficient (its numeric value is not reported). This gives a stable “gradient memory” of which way the model has consistently needed to move.

  2. Two adaptation coefficients. The flattened EMA gradient is mapped by a single linear layer to two vectors (Eq. 7): [alpha, beta] = Linear(Flatten(nabla_hat)). alpha is the weight adaptation coefficient; beta is the feature adaptation coefficient.

  3. Element-wise adaptation. Adaptation is a simple element-wise (Hadamard) multiply, chosen for simplicity. For the query path (Eq. 8): W_tilde^Q = alpha (*) W^Q; Q = X^P W_tilde^Q; Q_tilde = beta (*) Q. The adapted Q_tilde replaces Q in the attention. The same weight adaptation is applied to W^Q, W^K, W^V; feature adaptation is applied to the hidden feature embeddings.
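The three moves above can be sketched for the query path as follows. Several details are assumptions for illustration: the tiny layer sizes, gamma = 0.9 (the paper does not report it), the shapes of alpha (one entry per weight) and beta (one entry per output feature), and the bias of ones that makes a zero gradient memory equivalent to no adaptation. The gradient is a random stand-in for a real backpropagated gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 8        # hypothetical tiny sizes
gamma = 0.9               # hypothetical decay; the paper's value is not reported

W_q = rng.normal(size=(d_in, d_out))
n_alpha, n_beta = W_q.size, d_out   # assumed shapes: alpha per weight, beta per feature

# Linear layer of Eq. 7. The bias of ones is an assumption so that a zero
# gradient memory yields alpha = beta = 1, i.e. no adaptation.
A = rng.normal(scale=0.01, size=(W_q.size, n_alpha + n_beta))
b = np.ones(n_alpha + n_beta)

def adapter_step(ema_grad, grad, x):
    """One adaptation step for the query path; returns (new EMA, adapted Q)."""
    ema_grad = gamma * ema_grad + (1.0 - gamma) * grad    # Eq. 6: smoothed gradient
    coeffs = ema_grad.ravel() @ A + b                     # Eq. 7: [alpha, beta]
    alpha = coeffs[:n_alpha].reshape(W_q.shape)
    beta = coeffs[n_alpha:]
    q = x @ (alpha * W_q)                                 # Eq. 8: weight adaptation
    return ema_grad, beta * q                             # Eq. 8: feature adaptation

x = rng.normal(size=(5, d_in))                 # 5 patch tokens
ema = np.zeros_like(W_q)
grad = rng.normal(size=W_q.shape)              # stand-in for a real gradient
ema, q_tilde = adapter_step(ema, grad, x)
print(q_tilde.shape)
```

The same pattern would apply to the key and value projections; only the query path is shown to keep the sketch short.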

The ablation shows this matters a lot — and matters far more under drift / cold starts: removing the Adapter raises in-domain error by ~29% (MSE) but raises transfer error by ~54% (MSE). (See Evaluation.)

The Ensembler — combining the subnetworks (Section 4.4)

The Ensembler takes the d subnetwork forecasts f_1(X), ..., f_d(X) and produces one final forecast, continuously re-weighting them as the stream evolves. This is where the online-learning theory lives. Two interchangeable variants are offered:

Variant A — Online Scaling (E3Former-OS), the accurate one. Its base layer is Exponentiated Gradient Descent (EGD), a classic online-learning algorithm. The decision space is a d-dimensional simplex: a set of nonnegative weights w_i >= 0 that sum to 1 (a “soft” blend). The goal (Eq. 9) is to pick w minimizing || sum_i w_i f_i(X) - Y ||^2. EGD starts uniform (w_{1,i} = 1/d for every i) and updates each weight multiplicatively by how badly that expert just did (Eq. 10): w_{t+1,i} = w_{t,i} * exp(-eta * ||f_i(X) - Y||^2) / Z_t, where Z_t normalizes the weights to sum to 1, and eta is a learning rate (numeric value not reported). Theorem 1 guarantees that for T > 2 log(d), EGD has sublinear regret O(sqrt(T)), so its average regret vanishes as T grows.
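The EGD update of Eq. 10 is short enough to sketch directly. The learning rate eta = 0.5 and the toy forecasts are assumptions for illustration (the paper does not report eta); the four "subnetworks" here are just the truth plus fixed offsets.

```python
import numpy as np

def egd_update(w, forecasts, y, eta=0.5):
    """One EGD step. forecasts: (d, H) array, y: (H,) truth, w: (d,) simplex weights."""
    losses = np.sum((forecasts - y) ** 2, axis=1)   # squared error per subnetwork
    w_new = w * np.exp(-eta * losses)               # multiplicative update (Eq. 10)
    return w_new / w_new.sum()                      # Z_t: renormalize onto the simplex

d, H = 4, 3
y = np.array([1.0, 2.0, 3.0])
offsets = np.array([0.0, 0.5, 1.0, 2.0])            # subnetwork 0 is the most accurate
forecasts = y + offsets[:, None]                    # shape (d, H)

w = np.full(d, 1.0 / d)                             # uniform start, w_{1,i} = 1/d
for _ in range(3):
    w = egd_update(w, forecasts, y)
blend = w @ forecasts                               # combined forecast, shape (H,)
print(w, blend)
```

Because the update is multiplicative, weight drains away from consistently poor subnetworks exponentially fast, but it also takes many rounds to recover weight once it has been lost.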

But pure EGD has a known flaw — the “slow switch phenomenon”: an exponentially-weighted forecaster “tends to react sluggishly to significant shifts in data distribution.” To fix that lag, E3Former-OS adds a small learned Online Scaling module on top: an input embedding layer + a multi-head self-attention layer equipped with the same Adapter + an output linear layer. It stacks the d forecasts and the ground truth into F, maps the horizon dimension H to hidden dim D, runs (adapter-equipped) self-attention, and produces scaling weights (Eq. 12): s = Softmax(Linear(Flatten(MHSA(F))) + w). Note the EGD weights w are added as a residual/bias before the softmax — so the learned attention net corrects the EGD policy rather than replacing it. The final output is f = s^T F. Figure 9 visualizes these weights rapidly re-shuffling across clusters during inference, directly countering the slow-switch problem.

Variant B — Follow-The-Perturbed-Leader (E3Former-FTPL), the efficient one. FTPL is a more general, non-parametric online method that works in both convex and non-convex settings (EGD’s guarantee leans on convexity). Its decision space is discrete one-hot: pi_i in {0,1} with exactly one 1 — i.e. it hard-picks a single subnetwork’s output rather than blending. The update (Eq. 13) selects the subnetwork with the best perturbed cumulative loss: pi_{t+1} = argmin_pi ( sum_tau ||f_pi(X_tau) - Y_tau||^2 + sigma(pi) ), where sigma(pi) is a random perturbation drawn from a Uniform distribution. The perturbation injects exploration and mitigates “short-term myopia.” Algorithm 1 repeats this m times to build an empirical distribution over choices, then samples the target subnetwork from it. Theorem 2 gives sublinear expected regret O(sqrt(T)). Because FTPL is non-parametric, it has fewer parameters and faster inference than OS, at a slight accuracy cost.
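A minimal sketch of the FTPL selection step follows. The perturbation scale, the repeat count m = 100, and the helper name `ftpl_choose` are assumptions (the paper reports neither the Uniform range nor m); the cumulative losses are made-up numbers.

```python
import numpy as np

def ftpl_choose(cum_losses, rng, scale=1.0, m=100):
    """Pick one subnetwork by repeated perturbed-leader draws.

    cum_losses: (d,) cumulative squared error of each subnetwork so far.
    Returns (chosen index, empirical distribution over subnetworks).
    """
    d = len(cum_losses)
    counts = np.zeros(d)
    for _ in range(m):
        noise = rng.uniform(0.0, scale, size=d)        # one perturbation per subnetwork
        counts[np.argmin(cum_losses + noise)] += 1     # leader under this perturbation
    probs = counts / m                                 # empirical distribution
    return int(rng.choice(d, p=probs)), probs

rng = np.random.default_rng(0)
cum_losses = np.array([5.0, 0.1, 4.0, 6.0])            # subnetwork 1 leads by a wide margin
choice, probs = ftpl_choose(cum_losses, rng)
print(choice, probs)
```

When the leader's margin exceeds the perturbation scale, as here, the choice is deterministic; when cumulative losses are close, the perturbation spreads probability across the near-leaders, which is the exploration effect described above.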

OS vs FTPL trade-off (the paper’s own framing): “E3Former-FTPL marginally underperforms relative to E3Former-OS. Nonetheless, it offers the benefits of reduced parameter count and expedited inference speeds, contributing to greater throughput ... and facilitating flexible deployment in environments limited by resources.”

Where the diversity comes from. Importantly, the ensemble’s diversity is not from independent weights or an explicit diversity-promoting loss (the paper reports none). It comes solely from the different patch sizes — each subnetwork sees the same series patched at a different scale, so each naturally specializes in a different periodicity. The paper does not quantify this inter-subnetwork diversity.

Non-Transformer ingredients used: RevIN normalization, channel-independence, multi-resolution patching, the MIMO weight-sharing trick, an EMA-of-gradient adapter, and two online-learning algorithms (EGD/Online-Scaling or FTPL). There is no explicit seasonal-trend (STL-style) decomposition — periodicity is captured purely by multi-resolution patching, not by additive trend/seasonal splitting. There is no diffusion, GAN, or frequency/Fourier transform.


Workload Modeling & Prediction Pipeline (full-text deep read)

This section zooms in on two narrow questions: (A) what is “the workload” as a number, and (B) the exact path from that raw number to a prediction. Everything here is grounded in this paper’s full text.

(A) How the workload is modeled and characterized

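What real-world quantity is “the workload”? It depends on the cloud service type, but it is always one resource/demand metric per computing instance, sampled over time. Concretely, across the ByteDance datasets it is queries per second (QPS) for the private-cloud FaaS traces and CPU for the public-cloud IaaS traces.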

Formally (Section 3.2), the workload is a stream X = {x1, x2, ..., xt, ...} where each xt is in R^M — a vector of M metrics — sampled at uniform intervals (e.g. every minute).

One signal or many (univariate vs multivariate)? The raw input is multivariate (M metrics, where M ranges from 7 to 321 across datasets — e.g. FaaS_Small M=7, FaaS_Large M=226, IaaS_Medium M=58, Electricity M=321). But the model is channel-independent: each of the M metrics is forecast univariately through the shared pipeline, as a member of an M-sized mini-batch. So in practice the forecasting is univariate-per-channel, even though the system ingests many metrics.

At what time granularity is it sampled? The ByteDance cloud datasets are minute-level (1-minute sampling). Crucially, even at this fine granularity the workload still exhibits daily, weekly, and monthly cycles — the “fine-grained high-frequency periodicity” the model is designed to capture. Public datasets vary (ETTm at 15 min, ETTh / Electricity / Weather at hourly per Table 3 — though the Weather frequency is internally inconsistent; see uncertainties).

What makes it hard? The four challenges from The Problem: (1) nested multi-scale cycles (hourly/daily/weekly/seasonal) tangled with irregular bursts; (2) the need for long horizons because resource allocation has latency; (3) drifting dynamics demanding online adaptation; (4) robustness to meet SLAs. The paper stresses that minute-granular data carrying broad cycles is exactly what recent methods fail to model (Figure 3).

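How is it represented numerically inside the model? Not as a raw curve fed whole. The representation stack is: M channel-independent univariate series, each RevIN-normalized, cut into patches at several patch sizes, and linearly projected to a common hidden dimension D.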

There are no frequency-domain transforms, no STL-style trend/seasonal/residual split; patches at multiple sizes are the entire mechanism for capturing nested cycles.

(B) The step-by-step prediction pipeline (raw signal -> forecast)

The core idea in one line: slice the recent window at several patch sizes, push all slices through one shared Transformer to get several forecasts, then (online) adapt the weights and ensemble the forecasts into one. Numbers below are the paper’s main-experiment settings: lookback L = 1440 (24h at 1-min), horizon H in {1, 10, 30, 60} (1-60 min ahead).

  1. Collect & window. A monitoring system records each instance’s metric (QPS or CPU) every minute. The most recent L = 1440 points form the input window. Channel-independence splits the M-metric window into M univariate series.

  2. RevIN normalize. Each series is instance-normalized (subtract mean, divide by std, learnable affine). The statistics are stored for restoration at the end.

  3. Multi-resolution patching. Each normalized series is cut into patches at four patch sizes {16, 32, 64, 128}, giving d = 4 patch groups. Each group i has N_i = ceil(L / P_i) patches; the end is padded by repeating the last value. Why: larger patches surface higher-frequency features and smaller patches lower-frequency, so the four groups jointly cover the nested cycles.

  4. Alignment. Each group is projected through its own linear layer to a shared hidden dim D, so all four can be fed in parallel to one backbone.

  5. Shared MIMO Transformer. A single parameter-shared encoder (MHSA + BatchNorm + FFN, positional encoding up front) processes all four aligned groups and emits four H-length forecasts — one per subnetwork. Offline, each forecast trains its own subnetwork independently with MSE loss. Online, all four feed the Ensembler.

  6. (Online) Adapter. At inference, the Adapter uses an EMA of the attention-layer gradient (Eq. 6) to compute element-wise weight (alpha) and feature (beta) adaptation coefficients (Eqs. 7-8), nudging W^Q/W^K/W^V and the hidden features toward current conditions.

  7. (Online) Ensembler. The four forecasts are combined into one. E3Former-OS: EGD simplex weights, refined by a small adapter-equipped attention net via s = Softmax(Linear(Flatten(MHSA(F))) + w), output f = s^T F. E3Former-FTPL: hard-pick one subnetwork via perturbed-cumulative-loss argmin (Eq. 13), sampling from m perturbed draws.

  8. Denormalize & emit. RevIN is reversed to return the forecast to original units — H future values per channel (QPS or CPU).

Prediction cadence. The model forecasts at fixed, non-overlapping intervals (e.g. once every 60 minutes) to conserve compute and match the scaling horizon. The paper argues the ideal interval equals the forecast-horizon length, which makes the online setting “immune” to the H-step feedback-delay gap that the PROCEED method identifies (you never forecast over a window whose truth hasn’t yet arrived).
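The fixed, non-overlapping cadence can be illustrated with a tiny schedule generator. The function name and the minute-index convention are assumptions for illustration.

```python
def forecast_schedule(start, end, horizon):
    """Yield (issue_time, (window_start, window_end)) for back-to-back forecasts.

    Each forecast is issued exactly when the previous forecast window ends,
    so the ground truth for the previous window has fully arrived.
    """
    t = start
    while t + horizon <= end:
        yield t, (t, t + horizon)
        t += horizon

# Three hours of operation with a 60-minute horizon (times in minutes).
schedule = list(forecast_schedule(start=0, end=180, horizon=60))
print(schedule)
```

Because each issue time coincides with the end of the previous window, the online update for one forecast can complete before the next forecast is made.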

What is finally predicted — exact shape and type. A multi-step, point forecast (no probability intervals; trained with plain MSE loss). Horizon H in {1, 10, 30, 60} steps; main-table results are averaged over all four horizons. Output units match the input metric: a future QPS curve (FaaS) or CPU curve (IaaS), per channel.

How the result is consumed (briefly). The forecast is E3Former’s only output — it does not emit a pod count or scaling action itself. In the deployed system, the predicted workload is fed to ByteDance’s IHPA platform, which makes the scaling decision; see MAPE-K.


Model Parameters & How They Were Chosen

This section collects every hyperparameter the paper reports and states, per parameter, how it was chosen. A recurring theme: the paper is explicit about windowing, patch sizes, and the ensemble/online machinery, but silent on the conventional Transformer sizing numbers (hidden dimensions, layer count, head count are defined symbolically but never given numerically). Where the paper gives no number, it is marked “not reported”; no defaults are imported.

(A) What are the model parameters?

Architecture. E3Former is a channel-independent, patch-based Transformer (a PatchTST backbone) augmented with multi-resolution patching + a MIMO shared-backbone ensemble + an online Adapter. Structural settings:

| Parameter | Value | Notes |
| --- | --- | --- |
| Architecture family | Online ensemble Transformer (PatchTST backbone + MIMO + online Adapter) | Removing MIMO and the Adapter reduces it to PatchTST. |
| Number of subnetworks (d) | 4 (default) | One per patch size; the d subnetworks share the backbone via MIMO. |
| Patch sizes | {16, 32, 64, 128} (default); expanded to {16, 32, 64, 128, 256, 512} (6 subnetworks) for the parameter-analysis sweep | Figure 4 illustrates {8, 16, 32, 64}, which differs from the experimental set; treat the figure as illustrative only (see uncertainties). |
| Patches per group (N) | N_i = ceil(L / P_i) | End-padded by repeating the last value L mod P times. |
| Normalization | RevIN (reversible instance norm) with learnable affine r1, r2 | Whether r1, r2 are shared across patch groups/channels is not stated. |
| Backbone norm | Batch Normalization (not LayerNorm) | Matches PatchTST. |
| Attention layers | not reported | Text says “several attention layers” but gives no count. |
| Attention heads (h) | not reported numerically | Present symbolically (W^O in R^{hD' x D}) but no value. |
| Hidden dimension (D) | not reported | Symbolic only. |
| Per-head attention dim (D’) | not reported | Symbolic only. |
| Feed-forward dimension (Dff) | not reported | Symbolic only (W1 in R^{D x Dff}). |
| Positional encoding | applied before attention; type not specified | Sinusoidal vs learned not stated. |
| Dropout / activation | ReLU in the FFN; dropout not reported | FFN is max(0, ...) (ReLU). |

Adapter (online, inference-only).

| Parameter | Value |
| --- | --- |
| EMA decay (gamma) | not reported (Eq. 6) |
| Adaptation coefficients | alpha (weight adaptation), beta (feature adaptation), produced by a linear layer on the flattened EMA gradient (Eq. 7) |
| Adaptation operator | element-wise (Hadamard) multiply (Eq. 8) |
| Applied to | weights W^Q, W^K, W^V (via alpha) and hidden feature embeddings (via beta) |

Ensembler.

| Parameter | Value |
| --- | --- |
| Variant A: EGD init weights | w_1,i = 1/d (uniform over the d subnetworks) |
| EGD learning rate (eta) | not reported (Eq. 10) |
| EGD regret condition | T > 2 log(d) gives O(sqrt(T)) regret (Theorem 1) |
| Online Scaling module | input embedding + adapter-equipped MHSA + output linear; weights via s = Softmax(Linear(Flatten(MHSA(F))) + w) (Eq. 12) |
| Variant B: FTPL perturbation | Uniform distribution; range/scale not reported |
| FTPL repeat count (m) | not reported (Algorithm 1) |
| FTPL regret bound | expected O(sqrt(T)) (Theorem 2) |

Training.

| Parameter | Value |
| --- | --- |
| Loss | l2 / MSE |
| Optimizer | AdamW (following Informer) |
| Learning rate | not reported |
| Epochs | not reported (experiments repeated 3 times) |
| Batch size | 32 (GPT4TS used 8 to avoid OOM) |
| Offline pre-training data | ~1 week of archived data (2 weeks for the Kubernetes HPA test) |
| Framework / hardware | PyTorch; NVIDIA Tesla V100 32GB |

Data / windowing.

| Parameter | Value |
| --- | --- |
| Lookback window (L) | 1440 (= 24h at 1-min; “maximum reasonable query length the cloud system can handle online”) |
| Forecast horizon (H) | {1, 10, 30, 60} steps; main results averaged over all four |
| Sampling interval | 1 minute (ByteDance FaaS/IaaS) |
| Prediction cadence | fixed, non-overlapping intervals (e.g. every 60 min) |

Total parameter count / model size. E3Former-OS = 41k, E3Former-FTPL = 36k parameters (Figure 8). For comparison: OneNet 216k, FSNet 103k, Time-FSNet 113k. E3Former-FTPL is 16.7% of OneNet’s parameters, with >800% higher throughput and 39.6% less inference time. (The paper does not explicitly break down whether the ~5k OS-vs-FTPL delta is exactly the Online Scaling network, but that is the plausible source — see uncertainties.)
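The size comparison is simple arithmetic on the reported counts, reproduced below as a quick check. The counts are the rounded "k" figures quoted from Figure 8, so the derived ratios are approximate.

```python
# Parameter counts as quoted above (rounded to the nearest thousand).
params = {
    "E3Former-OS": 41_000,
    "E3Former-FTPL": 36_000,
    "OneNet": 216_000,
    "FSNet": 103_000,
    "Time-FSNet": 113_000,
}

# Each model's size as a percentage of OneNet's.
pct_of_onenet = {name: 100.0 * n / params["OneNet"] for name, n in params.items()}

# Size gap between the two E3Former variants.
os_minus_ftpl = params["E3Former-OS"] - params["E3Former-FTPL"]

for name, pct in pct_of_onenet.items():
    print(f"{name}: {pct:.1f}% of OneNet")
print("OS minus FTPL:", os_minus_ftpl)
```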

(B) How were the parameters chosen?


Inputs (what it consumes)


Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop is the standard reference model for self-adaptive systems: Monitor -> Analyze -> Plan -> Execute, over shared Knowledge. E3Former is a forecaster feeding ByteDance’s IHPA (Intelligent Horizontal Pod Auto-scaling) platform.

What the autoscaling tests showed (controlled Kubernetes HPA, replaying a real FaaS workload):

| Strategy | Avg latency | Max latency | 99-pct latency | Avg pods | Max pods |
| --- | --- | --- | --- | --- | --- |
| Naive HPA (reactive, built-in) | 0.231 s | 91.022 s | 0.689 s | 16.574 | 34 |
| OneNet (predictive) | 0.267 s | 52.906 s | 1.217 s | 14.949 | 22 |
| FSNet (predictive) | 0.236 s | 7.347 s | 1.039 s | 14.942 | 29 |
| E3Former-OS | 0.218 s | 7.535 s | 0.731 s | 15.368 | 21 |
| E3Former-FTPL | 0.223 s | 7.734 s | 0.767 s | 15.071 | 22 |
| Ideal HPA (perfect foresight) | 0.219 s | 9.419 s | 0.739 s | 14.608 | 24 |

Versus the built-in Naive HPA, E3Former cut average / maximum Pod occupation by 7.3% / 29.4% and average / maximum latency by 5.6% / 91.7% (the maximum-latency drop is dominated by Naive HPA’s 91 s tail). The paper also reports roughly 30% lower p99 latency, and E3Former-FTPL gives a 16.5% average-latency improvement over OneNet without a significant pod increase. E3Former offers QoS similar to the Ideal (perfect-foresight) HPA, albeit with a slightly higher average pod count.

A separate cold-start / transfer HPA test (Table 9) pretrains on other FaaS clusters then fine-tunes online on the target. There, E3Former-FTPL lands closest to Ideal (attributed to FTPL’s simplicity giving better generalizability), and all models show larger max-latency/max-pods early on while the online learner warms up.

Production deployment. E3Former is deployed in ByteDance’s IHPA platform, supporting 30+ applications (including Douyin E-Commerce, TouTiao, and Volcano Engine), with predictive auto-scaling capacity over 600,000 CPU cores, reducing resource utilization by over 40% while “essentially ensuring service quality.” (Honesty note: the conclusion names only “Douyin E-Commerce and Toutiao”; the abstract/intro additionally name Volcano Engine.)


Evaluation (datasets & metrics, briefly)

Datasets (twelve in total). Six ByteDance cloud-workload datasets plus six public benchmarks. The ByteDance set (all 1-minute granularity, open-sourced at huggingface.co/datasets/ByteDance/CloudTimeSeriesData):

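| Dataset | Cloud | Metric | Variates (M) | Samples |
| --- | --- | --- | --- | --- |
| FaaS_Small | Private | QPS | 7 | 23,041 |
| FaaS_Medium | Private | QPS | 93 | 23,041 |
| FaaS_Large | Private | QPS | 226 | 23,041 |
| IaaS_Small | Public | CPU | 7 | 69,764 |
| IaaS_Medium | Public | CPU | 58 | 69,764 |
| IaaS_Large | Public | CPU | 93 | 69,764 |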

Public datasets: ETTh1, ETTh2 (hourly), ETTm1, ETTm2 (15 min), Electricity/ECL (321 vars, hourly), Weather (21 vars). (Table 3 lists Weather as hourly but the text says it is sampled every 10 minutes — an internal inconsistency.) The ByteDance traces span 1-2 months; experiments run 16 days (private cloud) and 49 days (public cloud), with online prediction lasting 1 week on FaaS and 6 weeks on IaaS.

Baselines (four families). Statistical (ETS, SARIMA, STL); deep nets tested with retraining (DLinear, TimesNet, iTransformer, PatchTST); LLM-based (GPT4TS, plus Online-GPT4TS); and online-prediction nets (FSNet, Time-FSNet, and OneNet, the leading baseline).

Metrics. MSE and MAE (computed on normalized data) and WMAPE (Weighted Mean Absolute Percentage Error, computed after normalization+denormalization; WMAPE = sum_j |x_hat^j - x^j| / sum_k x^k, multiplied by 100 for display on FaaS). For the HPA tests: latency percentiles (Average, Max, 99.9, 99, 90) and Pod occupation (Average, Max). The HPA comparison deliberately excludes AHPA “as it is not open-sourced.”
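The three accuracy metrics are straightforward to compute. The sketch below follows the WMAPE formula given above (sum of absolute errors divided by the sum of true values); the sample arrays are made-up numbers, and the choice of which data are normalized before each metric is left to the caller.

```python
import numpy as np

def mse(pred, true):
    """Mean squared error."""
    return float(np.mean((pred - true) ** 2))

def mae(pred, true):
    """Mean absolute error."""
    return float(np.mean(np.abs(pred - true)))

def wmape(pred, true):
    """Weighted MAPE: total absolute error over total true volume."""
    return float(np.sum(np.abs(pred - true)) / np.sum(true))

true = np.array([100.0, 200.0, 300.0])   # illustrative QPS values
pred = np.array([110.0, 190.0, 300.0])
print(mse(pred, true), mae(pred, true), wmape(pred, true))
```

Unlike plain MAPE, WMAPE divides by the total volume, not by each individual value, so near-zero true values do not blow up the metric.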

Headline accuracy.

| Task | Result |
| --- | --- |
| Online forecasting (vs OneNet) | E3Former-OS reduces MSE 13.9% / MAE 11.7% / WMAPE 19.3% on average; E3Former-FTPL by 9.2% / 8.5% / 13.3%. |
| Online transfer / cold-start (vs OneNet) | E3Former-OS reduces MSE 15.3% / MAE 14.1% / WMAPE 26.3% across six transfer tasks; E3Former-FTPL by 3.1% / 7.9% / 13.1%. |
| Efficiency (Figure 8) | E3Former-OS 41k / E3Former-FTPL 36k params vs OneNet 216k; FTPL = 16.7% of OneNet’s params, >800% higher throughput, 39.6% less inference time. |
| Predictive auto-scaling (Tables 8-9) | vs Naive HPA: avg/max Pod -7.3% / -29.4%, avg/max latency -5.6% / -91.7%, ~30% lower p99. |

Main per-dataset forecasting results (Tables 4-5, L=1440, averaged over horizons {1,10,30,60}). Lower is better; E3Former-OS shown against the leading online baseline OneNet, in [MSE / MAE / WMAPE]:

| Dataset | E3Former-OS | E3Former-FTPL | OneNet (leading baseline) |
| --- | --- | --- | --- |
| FaaS_Small | 0.203 / 0.263 / 1.419 | 0.208 / 0.270 / 1.477 | 0.228 / 0.294 / 1.709 |
| FaaS_Medium | 0.208 / 0.267 / 1.281 | 0.214 / 0.274 / 1.381 | 0.229 / 0.293 / 1.781 |
| FaaS_Large | 0.217 / 0.277 / 1.445 | 0.214 / 0.274 / 1.385 | 0.229 / 0.293 / 1.781 |
| IaaS_Small | 0.634 / 0.563 / 0.665 | 0.646 / 0.565 / 0.669 | 0.674 / 0.599 / 0.705 |
| IaaS_Medium | 0.733 / 0.689 / 0.746 | 0.741 / 0.692 / 0.751 | 0.777 / 0.718 / 0.785 |
| IaaS_Large | 0.734 / 0.682 / 0.695 | 0.753 / 0.691 / 0.707 | 0.761 / 0.709 / 0.725 |
| ETTh1 | 0.275 / 0.352 / 0.365 | 0.285 / 0.362 / 0.373 | n/r in this table extract |
| ETTm2 | 0.328 / 0.371 / 0.071 | 0.333 / 0.374 / 0.071 | n/r in this table extract |
| Electricity (ECL) | 0.171 / 0.242 / 0.083 | 0.180 / 0.246 / 0.087 | n/r in this table extract |

(GPT4TS / Online-GPT4TS show “-” on Electricity due to CUDA out-of-memory.)

Ablation (Table 7) — both new pieces matter. Removing MIMO collapses E3Former to a single subnetwork (no ensemble); removing the Online Adaptor collapses the backbone to a basic Transformer; removing both = “Online PatchTST.” For E3Former-OS: removing MIMO raises MSE/MAE/WMAPE by 15.0% / 8.2% / 13.4%; removing the Adapter raises them by 29.4% / 14.3% / 13.5% in-domain — and by a striking 54.2% / 25.4% / 21.0% in transfer tasks, underscoring the Adapter’s role for cold starts and domain shift.

Honesty flags on the numbers:


Training & pre-training

Trained from scratch, then continuously updated online — no external foundation model.

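E3Former (~36k-41k parameters) is trained from random initialization as a task-specific forecaster, then deployed in an online regime where it never stops learning. The lifecycle: offline pre-training on roughly one week of archived data (two weeks for the Kubernetes HPA test), followed by continuous online updating on the live stream.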

Training details: MSE (l2) loss, AdamW optimizer (following Informer), batch size 32 (GPT4TS used 8 to avoid OOM), implemented in PyTorch on NVIDIA Tesla V100 32GB GPUs, with experiments repeated 3 times. The number of epochs and the learning rate are not reported.

The paper motivates the brevity of pre-training by citing commercial cold-start requirements: AWS ECS needs at least 24 hours of history (two weeks ideal), Google Cloud 3 days, Azure 7 days, Alibaba Cloud 7 days — and shows (via the transfer/cold-start tests) that E3Former’s online adaptation lets it perform well even from a thin pre-training base.


Strengths

Limitations


Glossary

References
  1. Chen, J., He, X., Ye, H., Jiang, F., Zhang, T., Chen, J., & Gao, X. (2025). Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling. arXiv. 10.48550/ARXIV.2508.12773