E3Former
An online ensemble Transformer for workload forecasting in large-scale predictive auto-scaling
TL;DR
Cloud platforms want to scale a service’s machines before a traffic surge, not after — which means forecasting how busy each service will be in the next minutes-to-hours. The twist this paper attacks is that the workload keeps changing: a forecasting model that was accurate last week slowly goes stale as user behavior drifts and the system gets updated. E3Former is an online forecasting model — it keeps learning from each new observation as it streams in, instead of being frozen after training. Three ideas make it work and keep it cheap enough for a “serverless” (pay-per-use, minimal-overhead) setting: (1) it slices the same input at several patch sizes at once so different copies (“subnetworks”) each capture cycles at a different time scale, but all share one Transformer backbone so the extra copies cost almost no extra parameters; (2) an Adapter nudges the network’s attention weights on the fly to track drift; and (3) an Ensembler continuously re-weights the subnetworks’ forecasts using online-learning theory (with provable “no-regret” guarantees) so the combined forecast stays robust when the data distribution shifts. It comes in two flavors — E3Former-OS (more accurate) and E3Former-FTPL (more efficient). On twelve datasets it beats the previous best online model (OneNet) by roughly 14% MSE / 12% MAE / 19% WMAPE while using only ~36k–41k parameters (about a sixth of OneNet’s). It is deployed in ByteDance’s IHPA auto-scaling platform across 30+ applications and 600,000+ CPU cores, cutting resource usage by over 40% while essentially preserving service quality. E3Former only produces the forecast; ByteDance’s IHPA platform does the actual scaling.
The Problem (and why simple autoscaling isn’t enough)
Autoscaling means automatically adjusting how many resources (machines, containers / “pods”, CPU cores) a cloud application gets, so it can handle its load without wasting money.
Give it too little -> requests slow down, time out, and SLAs (Service Level Agreements — the promises made to users about speed/uptime) are violated. (Under-provisioning.)
Give it too much -> you pay for idle machines. (Over-provisioning.)
The naive approach is reactive autoscaling: watch a metric like CPU, and when it crosses a threshold, add machines. The problem is that allocating physical resources takes time (a provisioning / “cold start” latency). So a reactive scaler is always lagging — by the time the new capacity is ready, users have already suffered.
Proactive (predictive) autoscaling fixes this by forecasting the future workload and scaling ahead of time, hiding the provisioning delay. The forecast is the hard part — and that is exactly what E3Former provides. The paper calls the forecasting model “the core” of a predictive auto-scaling system. The stakes are concrete: the paper reports that “elaborate auto-scaling systems with accurate workload forecasting can enhance service quality by 20% while reducing resource waste by 15%,” and notes in a footnote that at hundreds-of-thousands-of-CPU-core scale “every 1% of resource wastage translates to an annual revenue loss of tens of thousands, if not hundreds of thousands, of dollars.” That is why a better forecast (not a fancier scaler) is worth chasing.
What makes forecasting hard here specifically? The paper lists four workload challenges the model must handle:
Complex periodic patterns. Real cloud workloads repeat on multiple cycles at once — hourly, daily, weekly, even seasonal — and these are intertwined with irregular bursts.
Long prediction length. Because physically allocating resources has latency, the forecast must reach minutes-to-hours into the future, not just one step.
Changing dynamics (the crux of this paper). User behavior shifts and systems get updated, so the statistics of the workload itself drift over time. A model trained once and frozen will slowly degrade. The model must adapt online.
Robustness. It must reliably meet SLAs even as conditions change.
There is a second, easy-to-miss subtlety the paper emphasizes: fine-grained, high-frequency periodicity. The workload is sampled at minute-level granularity, yet it still carries broader daily/weekly/monthly cycles. In the paper’s words: “Despite the data being minutely granular, it also captures broader cycles on daily, weekly, and monthly scales.” Capturing this multi-granularity periodic pattern (many nested cycles spanning very different time scales) is the central representation challenge, and the paper argues that recent methods cannot capture such complicated dynamics.
Background
Time series. A sequence of numbers measured over time, e.g. queries-per-second every minute. The model’s job: given recent history, predict the next stretch.
Lookback window (L) and forecast horizon (H). The model reads the last L time steps (the lookback) and predicts the next H time steps (the horizon). E3Former uses L = 1440 (= 24 hours at 1-minute granularity) for every benchmark, and varies H over {1, 10, 30, 60} (i.e. 1 to 60 minutes ahead). (Notation note: the paper occasionally writes the horizon as T instead of H; the two symbols are used interchangeably.)
Transformer & “attention”. A Transformer is a neural network whose core trick, attention (more precisely self-attention), lets every point in a sequence weigh every other point to decide what matters. Multi-head self-attention (MHSA) runs several attention computations (“heads”) in parallel and concatenates them.
Patching (and PatchTST). Instead of feeding the Transformer one time step at a time, modern forecasters chop the series into patches (short contiguous chunks of P consecutive values) and treat each patch as one “token.” This is the PatchTST design. E3Former is built directly on PatchTST. In the paper’s words, “the removal of MIMO and the online adaptor from our model results in its simplification to a PatchTST,” where MIMO covers the multi-resolution patching plus ensemble and the adaptor is the online Adapter. So think of E3Former as PatchTST + multi-resolution ensembling + online adaptation.
Why “online” forecasting? (the single most important idea here)
Most forecasting models are trained offline: you collect a fixed dataset, train until convergence, freeze the weights, and deploy. That is fine if the world stays still. But cloud workloads drift — a phenomenon called concept drift or distribution shift: the relationship the model learned (history -> future) slowly stops holding because user behavior and the system itself change. A frozen (“static”) model therefore degrades over time on a drifting stream.
Online learning is the alternative: the model never stops training. It processes the stream one observation at a time, and every time the truth arrives and the prediction was wrong, it updates its own parameters a little. The paper frames this formally:
The workload is a never-ending stream X = {x1, x2, ..., xt, ...}, with xt a vector of M metrics, sampled at uniform intervals (e.g. every minute).
At time t, the model uses the recent history Xt = {x_{t-L}, ..., x_{t-1}} to forecast the next H steps, using the current parameters theta_t: Y_hat_t = f(Xt; theta_t).
There is no train/eval split: every step is both a test (predict) and a training opportunity (update). Parameters theta are updated each step when the loss is nonzero.
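The predict-then-update loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's training code: `BiasModel`, its learning rate, and the toy ramp stream are hypothetical stand-ins, and the loop scores each forecast immediately, ignoring the delay before the truth for an H-step window actually arrives.

```python
from typing import Callable, List


class BiasModel:
    """Hypothetical toy forecaster: repeat the last value plus a learned offset theta."""

    def __init__(self, H: int, lr: float) -> None:
        self.theta = 0.0
        self.H = H
        self.lr = lr

    def predict(self, history: List[float]) -> List[float]:
        return [history[-1] + self.theta] * self.H

    def update(self, history: List[float], truth: List[float]) -> None:
        # One gradient step on the mean squared error with respect to theta.
        mean_err = sum(history[-1] + self.theta - y for y in truth) / self.H
        self.theta -= self.lr * 2.0 * mean_err


def online_loop(stream: List[float], L: int, H: int,
                predict: Callable[[List[float]], List[float]],
                update: Callable[[List[float], List[float]], None]) -> List[float]:
    """Run predict-then-update over the stream; return the per-step MSE."""
    losses = []
    for t in range(L, len(stream) - H + 1):
        history = stream[t - L:t]        # X_t = {x_{t-L}, ..., x_{t-1}}
        forecast = predict(history)      # Y_hat_t = f(X_t; theta_t)
        truth = stream[t:t + H]
        loss = sum((f - y) ** 2 for f, y in zip(forecast, truth)) / H
        losses.append(loss)
        if loss > 0:                     # update only when the loss is nonzero
            update(history, truth)
    return losses


# Toy stream: a ramp that rises by 2 per step, so the ideal offset is 2.
stream = [2.0 * i for i in range(30)]
model = BiasModel(H=1, lr=0.25)
losses = online_loop(stream, L=3, H=1, predict=model.predict, update=model.update)
print(losses[0], round(model.theta, 4))  # 4.0 2.0
```

Every step is scored before the model sees the answer, so the loss sequence doubles as the evaluation curve.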
How do you even define “good” when the model and the data both keep changing? Online learning uses regret. Regret compares the model’s actual cumulative error against the error of the single best fixed parameter setting you could have chosen in hindsight (Eq. 1):
regret = sum_t loss( f(Xt; theta_t), x_t ) - inf_theta sum_t loss( f(Xt; theta), x_t )
A learner is “good” if its regret is sub-linear in time, regret = o(T), which means regret / T -> 0 as T -> infinity. In plain words: averaged over time, the online model does asymptotically as well as the best fixed model chosen with perfect hindsight. This “no-regret” property is what the paper proves for both of its ensemble variants (both achieve O(sqrt(T)) regret; see How It Works). It is the theoretical safety net that justifies updating the model continuously.
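As a small worked example of the regret definition, the sketch below compares an online learner's cumulative squared error against the best of a finite set of fixed predictors. Replacing the infimum over all parameters with a minimum over a hand-picked candidate list is an assumption made to keep the example computable; the numbers are invented.

```python
from typing import Sequence


def squared_error(pred: float, truth: float) -> float:
    return (pred - truth) ** 2


def regret(online_preds: Sequence[float],
           fixed_preds: Sequence[Sequence[float]],
           truths: Sequence[float]) -> float:
    """Cumulative online loss minus the loss of the best fixed candidate.

    fixed_preds[k][t] is the prediction of fixed candidate k at step t.
    """
    online_loss = sum(squared_error(p, y) for p, y in zip(online_preds, truths))
    best_fixed = min(
        sum(squared_error(p, y) for p, y in zip(candidate, truths))
        for candidate in fixed_preds
    )
    return online_loss - best_fixed


truths = [1.0, 2.0, 3.0, 4.0]
online_preds = [0.0, 2.0, 3.0, 4.0]      # wrong once, then tracks the stream
fixed_preds = [[2.0] * 4, [3.0] * 4]     # two constant predictors
r = regret(online_preds, fixed_preds, truths)
print(r, r / len(truths))  # -5.0 -1.25
```

Regret can be negative on a drifting stream: a learner that adapts can beat every fixed setting, which is the situation online forecasting aims for.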
Why two storage tiers matter (and limit history). The paper notes a practical reason history is bounded: cloud monitoring keeps data in two tiers — cold storage (e.g. ClickHouse / OLAP databases, holding months of data but slow to read) and hot storage (e.g. Redis / in-memory, fast but small, e.g. only the past week). Querying long histories online is expensive, which is exactly why the lookback is capped at L = 1440 — “the maximum reasonable query length that the cloud computing system we are using can handle online.”
The “serverless” cost constraint. Serverless systems demand a lean operational ethos: any forecasting model must add minimal compute and parameter overhead. This is why E3Former works hard to be tiny (~36k–41k params) and to make the ensemble nearly free — see How it stays cheap.
Contribution in Simple Terms
E3Former (the “Ensembled, Efficient, online Former”, an online ensemble Transformer) is an online workload forecaster built from four synergistic components. The paper’s own one-line summary of each:
Representer — “Extracts multi-resolution periodic features through multi-resolution patching operation, explicitly modeling nested cycles.” (It slices the input at several patch sizes so each scale exposes a different cycle.)
Transformer — “Leverages self-attention modules to capture long-range dependencies.” (A single, shared PatchTST-style encoder.)
Adapter — “Implements online adaptation via cumulative gradient-based parameter updates, enabling rapid refining to workload dynamics.” (Tracks drift at inference time.)
Ensembler — “Combines results from different sub-networks through adaptive weighting, enhancing robustness against distribution shifts.” (Continuously re-weights the subnetworks using no-regret online learning.)
The genuinely new pieces, in plain terms:
Multi-resolution patching + a shared backbone (MIMO). Different patch sizes capture periodicities at different scales. Naively you’d train one model per patch size, costing d times the compute. Instead E3Former uses a Multi-Input Multi-Output (MIMO) trick: a single, parameter-shared Transformer processes all d patch streams in parallel and emits d forecasts. The d “subnetworks” are therefore not separate networks; they are d input streams routed through one shared backbone. This is what makes ensembling nearly free.
An online Adapter that, only at inference time, adjusts the attention layer’s weights based on a smoothed (moving-average) gradient signal, letting the model bend toward whatever the workload is doing right now.
An online-learning Ensembler that combines the d forecasts with provable no-regret guarantees, in two interchangeable forms: Online Scaling (OS) for accuracy and Follow-The-Perturbed-Leader (FTPL) for efficiency.
The headline result: better accuracy than every online baseline on twelve datasets, with a fraction of the parameters and far higher throughput, plus a real production deployment in ByteDance’s IHPA auto-scaling platform.
How It Works, Step by Step
Here is the end-to-end journey, from raw history to forecast. The data flow differs between offline training and online inference, so both are shown.
The four Representer operations (Section 4.1)
The Representer turns the raw window into the multi-resolution patch groups the backbone consumes. Four operations, in order:
(a) Channel Independence. The input Xt is multivariate — L time steps of M metrics (shape L x M). Channel independence treats this as an M-sized mini-batch of M separate univariate series {Xt,i}. Why: it multiplies the effective training set by M, exploits GPU parallelism, and avoids overfitting to spurious correlations between channels. Net effect: forecasting is effectively univariate per channel, even though the raw input is multivariate. (From here on the paper abbreviates each univariate series as Xt in R^L.)
(b) RevIN (Reversible Instance Normalization). Each input series is instance-normalized by subtracting its mean mu and dividing by its std sigma (X_hat = (Xt - mu) / sigma), then passed through a small learnable affine transform with parameters r1, r2 (X_hat = r1 * X_hat + r2). At the output, this is reversed (denormalized) to restore the original scale. Why: it regularizes and handles distribution shift in the input window — instances at wildly different magnitudes get mapped to a comparable distribution.
(c) Multi-resolution patching — the core periodicity trick. Given a patch size P, the normalized series of length L is cut into N = ceil(L/P) patches of size P (the last value is repeated L mod P times to pad the end so dimensions stay consistent). The key insight, citing prior work: “patch-based projections with larger patch sizes can capture high-frequency features and vice versa” — i.e. different patch sizes expose periodicities at different levels. So instead of one patch size, E3Former uses a group of patch sizes {P1, ..., Pd} (default {16, 32, 64, 128}, giving d = 4), producing d groups of patches that explicitly model the nested cycles at multiple resolutions.
(d) Alignment (the efficiency enabler). Different patch sizes yield different patch dimensions, so the d groups can’t be fed to one shared model as-is. Each group is projected through its own small linear layer to a common hidden dimension D. After this alignment, all d groups share a uniform shape and can be processed in parallel by a single backbone. (Extraction note: the paper’s text here contains a typo — “produce f outpus” — which almost certainly means “produce d outputs,” one per input group.)
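Operations (b) through (d) can be sketched for a single channel as follows. This is an illustration under stated assumptions, not the authors' implementation: the hidden size `D = 4`, the `eps` stabilizer, the identity affine (`r1 = 1`, `r2 = 0`), and the random untrained alignment weights are all hypothetical. The sketch pads each series up to N * P values so that N = ceil(L / P) patches fit exactly; this matches "repeat the last value L mod P times" for patch sizes 16, 32 and 64 at L = 1440, but for P = 128 it needs 96 repeats rather than 32.

```python
import math
import random
from typing import List, Tuple


def revin_normalize(x: List[float], r1: float = 1.0, r2: float = 0.0,
                    eps: float = 1e-5) -> Tuple[List[float], Tuple[float, float]]:
    """Instance-normalize one series, then apply the affine r1 * x + r2."""
    mu = sum(x) / len(x)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / len(x) + eps)
    return [r1 * (v - mu) / sigma + r2 for v in x], (mu, sigma)


def revin_denormalize(y: List[float], stats: Tuple[float, float],
                      r1: float = 1.0, r2: float = 0.0) -> List[float]:
    """Undo the affine and the normalization to restore original units."""
    mu, sigma = stats
    return [((v - r2) / r1) * sigma + mu for v in y]


def make_patches(series: List[float], P: int) -> List[List[float]]:
    """Cut a series into N = ceil(L / P) patches, padding with the last value."""
    n = math.ceil(len(series) / P)
    padded = list(series) + [series[-1]] * (n * P - len(series))
    return [padded[i * P:(i + 1) * P] for i in range(n)]


def align(patches: List[List[float]], weight: List[List[float]],
          bias: List[float]) -> List[List[float]]:
    """Per-group linear layer: map each P-sized patch to the shared dimension D."""
    return [[sum(p[k] * weight[k][j] for k in range(len(p))) + bias[j]
             for j in range(len(bias))] for p in patches]


rng = random.Random(0)
L, D = 1440, 4  # D is a hypothetical hidden size
series = [100.0 + 50.0 * math.sin(2 * math.pi * i / L) for i in range(L)]
x_hat, stats = revin_normalize(series)

aligned = {}
for P in (16, 32, 64, 128):
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(D)] for _ in range(P)]
    aligned[P] = align(make_patches(x_hat, P), weight, [0.0] * D)

shapes = {P: (len(a), len(a[0])) for P, a in aligned.items()}
print(shapes)  # {16: (90, 4), 32: (45, 4), 64: (23, 4), 128: (12, 4)}
```

After alignment the four groups differ only in their number of tokens, which is what lets one shared encoder consume all of them.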
The shared Transformer backbone (Section 4.2)
A standard Transformer-encoder-style module providing long-range dependency modeling. Per attention layer:
Multi-head self-attention (MHSA): embedding matrices W^Q, W^K, W^V map the input to Queries, Keys, Values; head_i = Softmax(Q_i K_i^T / sqrt(D')) V_i; outputs are concatenated and projected by W^O. (D' is the per-head attention dimension, h the number of heads.)
Layer structure (Eqs. 3-5): X^A = BatchNorm(X^P + MHSA(X^P)), then FFN(X^A) = max(0, X^A W1 + b1) W2 + b2 (a ReLU MLP), then X^O = BatchNorm(X^A + FFN(X^A)). Note it uses Batch Normalization, not LayerNorm, matching PatchTST’s design. Positional encoding is applied before the attention block.
Forecast head: the last layer’s output X^O is flattened and mapped by a single linear layer to the H-length forecast Yt = Linear(X^O).
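To make the head formula concrete, here is a single attention head in plain Python. It is a toy sketch of head = Softmax(Q K^T / sqrt(D')) V only: the two-token input and the identity projection matrices are invented for illustration, and the multi-head concatenation, BatchNorm, FFN, and positional encoding are omitted.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def softmax(row: List[float]) -> List[float]:
    m = max(row)                              # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(x: Matrix, w_q: Matrix, w_k: Matrix,
                   w_v: Matrix) -> Tuple[Matrix, Matrix]:
    """Return (output, attention weights) for one head."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_head = len(q[0])                        # D' in the text
    scores = [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d_head)
               for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0]]             # two toy patch embeddings
out, attn = attention_head(tokens, identity, identity, identity)
print([round(s, 3) for s in attn[0]])
```

Each row of `attn` sums to 1, so every output token is a weighted average of the value vectors.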
The crucial efficiency point — how it stays cheap enough for serverless: using d independent models would scale compute linearly with d. The MIMO mechanism instead routes all d aligned patch groups through one parameter-shared encoder. The paper states “in E3Former there are d independent subnetworks with a part of shared parameters.” The only per-subnetwork weights are the small alignment linear layers (one per patch size) and the per-group forecast heads; everything heavy (the attention blocks) is shared. That is why the full 4-subnetwork ensemble adds almost nothing over a single PatchTST, landing at just ~36k–41k total parameters.
The Adapter: online drift tracking (Section 4.3)
The Adapter is active in inference only (Figure 5 labels it “Parameter Adaptation (Inference Only)”). It adjusts the self-attention layer’s weights at the parameter level so the model bends toward current conditions. Inspired by FSNet, it works in three moves:
Smoothed gradient (EMA). Single-step gradients on a drifting stream are noisy. The Adapter keeps an Exponential Moving Average (EMA) of the attention layer’s gradient (Eq. 6): nabla_hat_new = gamma * nabla_hat_old + (1 - gamma) * nabla_t, where nabla_t is the current gradient and gamma is the decay coefficient (its numeric value is not reported). This gives a stable “gradient memory” of which way the model has consistently needed to move.
Two adaptation coefficients. The flattened EMA gradient is mapped by a single linear layer to two vectors (Eq. 7): [alpha, beta] = Linear(Flatten(nabla_hat)). alpha is the weight adaptation coefficient; beta is the feature adaptation coefficient.
Element-wise adaptation. Adaptation is a simple element-wise (Hadamard) multiply, chosen for simplicity. For the query path (Eq. 8): W_tilde^Q = alpha (*) W^Q; Q = X^P W_tilde^Q; Q_tilde = beta (*) Q. The adapted Q_tilde replaces Q in the attention. The same weight adaptation is applied to W^Q, W^K, W^V; feature adaptation is applied to the hidden feature embeddings.
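The three moves above (Eqs. 6-8) can be sketched for the query path as below. Several details are assumptions made for the sake of a runnable example: `gamma = 0.9` is a placeholder because the paper reports no value, alpha and beta are treated as one coefficient per output column and broadcast across rows, and the linear layer's weights are hand-set rather than learned.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def ema_update(ema: List[float], grad: List[float], gamma: float) -> List[float]:
    """Eq. 6: blend the stored gradient memory with the current gradient."""
    return [gamma * e + (1.0 - gamma) * g for e, g in zip(ema, grad)]


def coefficients(ema: List[float], weight: Matrix, bias: List[float],
                 width: int) -> Tuple[List[float], List[float]]:
    """Eq. 7: one linear layer on the flattened EMA gradient, split into alpha and beta."""
    out = [sum(w * e for w, e in zip(row, ema)) + b for row, b in zip(weight, bias)]
    return out[:width], out[width:]


def adapt_query(x: Matrix, w_q: Matrix, alpha: List[float],
                beta: List[float]) -> Matrix:
    """Eq. 8: scale W^Q by alpha, project, then scale the queries by beta."""
    cols = len(alpha)
    w_tilde = [[alpha[j] * w_q[i][j] for j in range(cols)] for i in range(len(w_q))]
    q = [[sum(row[i] * w_tilde[i][j] for i in range(len(w_tilde)))
          for j in range(cols)] for row in x]
    return [[beta[j] * v for j, v in enumerate(r)] for r in q]


gamma = 0.9  # hypothetical decay; the source reports no numeric value
ema = ema_update([0.0, 0.0, 0.0, 0.0], [1.0, -1.0, 0.5, 0.0], gamma)

# A zero-weight, unit-bias layer yields alpha = beta = 1, i.e. no adaptation.
zero_w = [[0.0] * 4 for _ in range(4)]
alpha, beta = coefficients(ema, zero_w, [1.0] * 4, width=2)

x = [[1.0, 2.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
print(adapt_query(x, w_q, alpha, beta))            # [[1.0, 2.0]]
print(adapt_query(x, w_q, [2.0, 1.0], [1.0, 0.5]))  # [[2.0, 1.0]]
```

The second call shows the two coefficients acting at different points: alpha rescales the weights before the projection, beta rescales the resulting features.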
The ablation shows this matters a lot — and matters far more under drift / cold starts: removing the Adapter raises in-domain error by ~29% (MSE) but raises transfer error by ~54% (MSE). (See Evaluation.)
The Ensembler: combining the subnetworks (Section 4.4)
The Ensembler takes the d subnetwork forecasts f_1(X), ..., f_d(X) and produces one final forecast, continuously re-weighting them as the stream evolves. This is where the online-learning theory lives. Two interchangeable variants are offered:
Variant A — Online Scaling (E3Former-OS), the accurate one.
Its base layer is Exponentiated Gradient Descent (EGD), a classic online-learning algorithm. The decision space is a d-dimensional simplex: a set of nonnegative weights w_i >= 0 that sum to 1 (a “soft” blend). The goal (Eq. 9) is to pick w minimizing || sum_i w_i f_i(X) - Y ||^2. EGD starts uniform (w_{1,i} = 1/d for every i) and updates multiplicatively by how badly each expert just did (Eq. 10): w_{t+1,i} = w_{t,i} * exp(-eta * ||f_i(X) - Y||^2) / Z_t, where Z_t normalizes the weights to sum to 1, and eta is a learning rate (numeric value not reported). Theorem 1 guarantees that for T > 2 log(d), EGD has regret O(sqrt(T)), so its average regret vanishes as T grows. (Here T counts online rounds, not the forecast horizon.)
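The EGD update in Eq. 10 is short enough to write out directly. In this sketch the four toy forecasts and `eta = 0.5` are invented for illustration, since the paper does not report its learning rate.

```python
import math
from typing import List


def egd_update(weights: List[float], forecasts: List[List[float]],
               truth: List[float], eta: float) -> List[float]:
    """Eq. 10: shrink each weight by exp(-eta * squared error), then renormalize."""
    losses = [sum((f - y) ** 2 for f, y in zip(fc, truth)) for fc in forecasts]
    unnorm = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    z = sum(unnorm)                      # Z_t keeps the weights on the simplex
    return [u / z for u in unnorm]


def blend(weights: List[float], forecasts: List[List[float]]) -> List[float]:
    """Weighted sum of the d subnetwork forecasts."""
    horizon = len(forecasts[0])
    return [sum(w * fc[h] for w, fc in zip(weights, forecasts))
            for h in range(horizon)]


d = 4
weights = [1.0 / d] * d                  # uniform start, w_{1,i} = 1/d
forecasts = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
truth = [2.0, 2.0]

new_weights = egd_update(weights, forecasts, truth, eta=0.5)
print([round(w, 3) for w in new_weights])
print(blend(weights, forecasts), [round(v, 3) for v in blend(new_weights, forecasts)])
```

After one round the subnetwork that matched the truth carries the largest weight, and the blended forecast moves from the uniform average 2.5 toward 2.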
But pure EGD has a known flaw — the “slow switch phenomenon”: an exponentially-weighted forecaster “tends to react sluggishly to significant shifts in data distribution.” To fix that lag, E3Former-OS adds a small learned Online Scaling module on top: an input embedding layer + a multi-head self-attention layer equipped with the same Adapter + an output linear layer. It stacks the d forecasts and the ground truth into F, maps the horizon dimension H to hidden dim D, runs (adapter-equipped) self-attention, and produces scaling weights (Eq. 12): s = Softmax(Linear(Flatten(MHSA(F))) + w). Note the EGD weights w are added as a residual/bias before the softmax — so the learned attention net corrects the EGD policy rather than replacing it. The final output is f = s^T F. Figure 9 visualizes these weights rapidly re-shuffling across clusters during inference, directly countering the slow-switch problem.
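The residual structure of Eq. 12 can be illustrated without the learned network. In the sketch below, `correction` is a hypothetical stand-in for the output of Linear(Flatten(MHSA(F))), and the blend is taken over the d forecasts only; how the ground-truth row of F enters the final product is not spelled out here, so that simplification is an assumption.

```python
import math
from typing import List, Tuple


def softmax(z: List[float]) -> List[float]:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def online_scaling(correction: List[float], egd_weights: List[float],
                   forecasts: List[List[float]]) -> Tuple[List[float], List[float]]:
    """s = Softmax(correction + w); the output is the s-weighted sum of forecasts."""
    s = softmax([c + w for c, w in zip(correction, egd_weights)])
    horizon = len(forecasts[0])
    out = [sum(si * fc[h] for si, fc in zip(s, forecasts)) for h in range(horizon)]
    return s, out


forecasts = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [4.0, 4.0]]
egd_weights = [0.25, 0.25, 0.25, 0.25]

# With no learned correction, the uniform EGD weights give a plain average.
s_flat, out_flat = online_scaling([0.0] * 4, egd_weights, forecasts)
# A strong correction toward subnetwork 1 switches the blend in a single step.
s_fast, out_fast = online_scaling([0.0, 5.0, 0.0, 0.0], egd_weights, forecasts)
print(out_flat, round(s_fast[1], 3))
```

The second call is the point of the design: the learned term can move almost all the weight to one subnetwork immediately, while EGD alone would need many rounds of multiplicative updates to get there.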
Variant B — Follow-The-Perturbed-Leader (E3Former-FTPL), the efficient one.
FTPL is a more general, non-parametric online method that works in both convex and non-convex settings (EGD’s guarantee leans on convexity). Its decision space is discrete one-hot: pi_i in {0,1} with exactly one 1 — i.e. it hard-picks a single subnetwork’s output rather than blending. The update (Eq. 13) selects the subnetwork with the best perturbed cumulative loss: pi_{t+1} = argmin_pi ( sum_tau ||f_pi(X_tau) - Y_tau||^2 + sigma(pi) ), where sigma(pi) is a random perturbation drawn from a Uniform distribution. The perturbation injects exploration and mitigates “short-term myopia.” Algorithm 1 repeats this m times to build an empirical distribution over choices, then samples the target subnetwork from it. Theorem 2 gives sublinear expected regret O(sqrt(T)). Because FTPL is non-parametric, it has fewer parameters and faster inference than OS, at a slight accuracy cost.
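A minimal sketch of the FTPL selection step follows. The repeat count `m`, the perturbation range `[0, scale]`, and the cumulative losses are hypothetical, since the paper reports neither the scale nor m; the structure (perturb the cumulative losses, take the argmin, repeat m times, sample from the resulting empirical distribution) follows the description above.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple


def ftpl_select(cum_losses: List[float], m: int, scale: float,
                rng: random.Random) -> Tuple[int, Dict[int, int]]:
    """Pick one subnetwork index from m perturbed-leader draws."""
    votes: Counter = Counter()
    for _ in range(m):
        # Add an independent Uniform(0, scale) perturbation to each cumulative loss.
        perturbed = [loss + rng.uniform(0.0, scale) for loss in cum_losses]
        leader = min(range(len(perturbed)), key=perturbed.__getitem__)
        votes[leader] += 1
    # Sample the final choice from the empirical distribution of leaders.
    experts, counts = zip(*sorted(votes.items()))
    choice = rng.choices(experts, weights=counts, k=1)[0]
    return choice, dict(votes)


rng = random.Random(0)
cum_losses = [10.0, 1.0, 10.0, 10.0]   # subnetwork 1 is far ahead
choice, votes = ftpl_select(cum_losses, m=20, scale=2.0, rng=rng)
print(choice, votes)  # 1 {1: 20}

# When two subnetworks are close, the perturbation lets either one win a draw.
_, close_votes = ftpl_select([1.0, 1.5, 10.0, 10.0], m=200, scale=2.0, rng=rng)
print(sorted(close_votes))
```

With a clear leader the perturbation changes nothing; when losses are close it spreads the picks, which is the exploration the text attributes to FTPL.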
OS vs FTPL trade-off (the paper’s own framing): “E3Former-FTPL marginally underperforms relative to E3Former-OS. Nonetheless, it offers the benefits of reduced parameter count and expedited inference speeds, contributing to greater throughput ... and facilitating flexible deployment in environments limited by resources.”
Where the diversity comes from. Importantly, the ensemble’s diversity is not from independent weights or an explicit diversity-promoting loss (the paper reports none). It comes solely from the different patch sizes — each subnetwork sees the same series patched at a different scale, so each naturally specializes in a different periodicity. The paper does not quantify this inter-subnetwork diversity.
Non-Transformer ingredients used: RevIN normalization, channel-independence, multi-resolution patching, the MIMO weight-sharing trick, an EMA-of-gradient adapter, and two online-learning algorithms (EGD/Online-Scaling or FTPL). There is no explicit seasonal-trend (STL-style) decomposition — periodicity is captured purely by multi-resolution patching, not by additive trend/seasonal splitting. There is no diffusion, GAN, or frequency/Fourier transform.
Workload Modeling & Prediction Pipeline (full-text deep read)
This section zooms in on two narrow questions: (A) what is “the workload” as a number, and (B) the exact path from that raw number to a prediction. Everything here is grounded in this paper’s full text.
(A) How the workload is modeled and characterized
What real-world quantity is “the workload”? It depends on the cloud service type, but it is always one resource/demand metric per computing instance, sampled over time. Concretely, across the ByteDance datasets:
QPS = Queries Per Second — for FaaS (Function-as-a-Service) instances: how many requests hit the function each second.
CPU usage — for IaaS (Infrastructure-as-a-Service) instances.
Formally (Section 3.2), the workload is a stream X = {x1, x2, ..., xt, ...} where each xt is in R^M — a vector of M metrics — sampled at uniform intervals (e.g. every minute).
One signal or many (univariate vs multivariate)? The raw input is multivariate (M metrics, where M ranges from 7 to 321 across datasets — e.g. FaaS_Small M=7, FaaS_Large M=226, IaaS_Medium M=58, Electricity M=321). But the model is channel-independent: each of the M metrics is forecast univariately through the shared pipeline, as a member of an M-sized mini-batch. So in practice the forecasting is univariate-per-channel, even though the system ingests many metrics.
At what time granularity is it sampled? The ByteDance cloud datasets are minute-level (1-minute sampling). Crucially, even at this fine granularity the workload still exhibits daily, weekly, and monthly cycles — the “fine-grained high-frequency periodicity” the model is designed to capture. Public datasets vary (ETTm at 15 min, ETTh / Electricity / Weather at hourly per Table 3 — though the Weather frequency is internally inconsistent; see uncertainties).
What makes it hard? The four challenges from The Problem: (1) nested multi-scale cycles (hourly/daily/weekly/seasonal) tangled with irregular bursts; (2) the need for long horizons because resource allocation has latency; (3) drifting dynamics demanding online adaptation; (4) robustness to meet SLAs. The paper stresses that minute-granular data carrying broad cycles is exactly what recent methods fail to model (Figure 3).
How is it represented numerically inside the model? Not as a raw curve fed whole. The representation stack is:
Channel independence: factor the M-metric input into M univariate series.
RevIN: instance-normalize each series (mean/std + learnable affine), reversed at the output.
Multi-resolution patching: cut each series into patches at several patch sizes {16, 32, 64, 128}, so each scale exposes a different periodicity. This is the core representation idea, and it replaces explicit trend/seasonal decomposition.
Alignment: project each patch group through its own linear layer to a shared dimension D, so a single MIMO backbone can process them all.
There are no frequency-domain transforms, no STL-style trend/seasonal/residual split; patches at multiple sizes are the entire mechanism for capturing nested cycles.
(B) The step-by-step prediction pipeline (raw signal -> forecast)
The core idea in one line: slice the recent window at several patch sizes, push all slices through one shared Transformer to get several forecasts, then (online) adapt the weights and ensemble the forecasts into one. Numbers below are the paper’s main-experiment settings: lookback L = 1440 (24h at 1-min), horizon H in {1, 10, 30, 60} (1-60 min ahead).
Collect & window. A monitoring system records each instance’s metric (QPS or CPU) every minute. The most recent L = 1440 points form the input window. Channel-independence splits the M-metric window into M univariate series.
RevIN normalize. Each series is instance-normalized (subtract mean, divide by std, learnable affine). The statistics are stored for restoration at the end.
Multi-resolution patching. Each normalized series is cut into patches at four patch sizes {16, 32, 64, 128}, giving d = 4 patch groups. Each group i has N_i = ceil(L / P_i) patches; the end is padded by repeating the last value. Why: larger patches surface higher-frequency features and smaller patches lower-frequency, so the four groups jointly cover the nested cycles.
Alignment. Each group is projected through its own linear layer to a shared hidden dim D, so all four can be fed in parallel to one backbone.
Shared MIMO Transformer. A single parameter-shared encoder (MHSA + BatchNorm + FFN, positional encoding up front) processes all four aligned groups and emits four H-length forecasts, one per subnetwork. Offline, each forecast trains its own subnetwork independently with MSE loss. Online, all four feed the Ensembler.
(Online) Adapter. At inference, the Adapter uses an EMA of the attention-layer gradient (Eq. 6) to compute element-wise weight (alpha) and feature (beta) adaptation coefficients (Eqs. 7-8), nudging W^Q/W^K/W^V and the hidden features toward current conditions.
(Online) Ensembler. The four forecasts are combined into one. E3Former-OS: EGD simplex weights, refined by a small adapter-equipped attention net via s = Softmax(Linear(Flatten(MHSA(F))) + w), output f = s^T F. E3Former-FTPL: hard-pick one subnetwork via perturbed-cumulative-loss argmin (Eq. 13), sampling from m perturbed draws.
Denormalize & emit. RevIN is reversed to return the forecast to original units: H future values per channel (QPS or CPU).
Prediction cadence. The model forecasts at fixed, non-overlapping intervals (e.g. once every 60 minutes) to conserve compute and match the scaling horizon. The paper argues the ideal interval equals the forecast-horizon length, which makes the online setting “immune” to the H-step feedback-delay gap that the PROCEED method identifies (you never forecast over a window whose truth hasn’t yet arrived).
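The cadence argument is easy to check with a schedule. The sketch below lists forecast windows when the model fires once every H steps; the stream length is a made-up example. Because each window starts exactly where the previous one ends, the truth for the previous forecast is complete by the time the next forecast is issued.

```python
from typing import List, Tuple


def forecast_schedule(stream_len: int, L: int, H: int) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges covered by each forecast, one every H steps."""
    return [(t, t + H) for t in range(L, stream_len - H + 1, H)]


# One day of warm-up history followed by three hours of live data (hypothetical).
windows = forecast_schedule(stream_len=1440 + 180, L=1440, H=60)
print(windows)  # [(1440, 1500), (1500, 1560), (1560, 1620)]
```

A schedule that fired every minute with H = 60 would instead issue 59 further forecasts before the first one could be scored, which is the feedback-delay gap the text refers to.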
What is finally predicted — exact shape and type. A multi-step, point forecast (no probability intervals; trained with plain MSE loss). Horizon H in {1, 10, 30, 60} steps; main-table results are averaged over all four horizons. Output units match the input metric: a future QPS curve (FaaS) or CPU curve (IaaS), per channel.
How the result is consumed (briefly). The forecast is E3Former’s only output — it does not emit a pod count or scaling action itself. In the deployed system, the predicted workload is fed to ByteDance’s IHPA platform, which makes the scaling decision; see MAPE-K.
Model Parameters & How They Were Chosen
This section collects every hyperparameter the paper reports and states, per parameter, how it was chosen. A recurring theme: the paper is explicit about windowing, patch sizes, and the ensemble/online machinery, but silent on the conventional Transformer sizing numbers (hidden dimensions, layer count, head count are defined symbolically but never given numerically). Where the paper gives no number, it is marked “not reported”; no defaults are imported.
(A) What are the model parameters?
Architecture. E3Former is a channel-independent, patch-based Transformer (a PatchTST backbone) augmented with multi-resolution patching + a MIMO shared-backbone ensemble + an online Adapter. Structural settings:
| Parameter | Value | Notes |
|---|---|---|
| Architecture family | Online ensemble Transformer (PatchTST backbone + MIMO + online Adapter) | Removing MIMO and the Adapter reduces it to PatchTST. |
| Number of subnetworks (d) | 4 (default) | One per patch size; the d subnetworks share the backbone via MIMO. |
| Patch sizes | {16, 32, 64, 128} (default); expanded to {16, 32, 64, 128, 256, 512} (6 subnetworks) for the parameter-analysis sweep | Figure 4 illustrates {8, 16, 32, 64}, which differs from the experimental set — treat the figure as illustrative only (see uncertainties). |
| Patches per group (N) | N_i = ceil(L / P_i) | End-padded by repeating the last value L mod P times. |
| Normalization | RevIN (reversible instance norm) with learnable affine r1, r2 | Whether r1, r2 are shared across patch groups/channels is not stated. |
| Backbone norm | Batch Normalization (not LayerNorm) | Matches PatchTST. |
| Attention layers | not reported | Text says “several attention layers” but gives no count. |
| Attention heads (h) | not reported numerically | Present symbolically (W^O in R^{hD' x D}) but no value. |
| Hidden dimension (D) | not reported | Symbolic only. |
| Per-head attention dim (D’) | not reported | Symbolic only. |
| Feed-forward dimension (Dff) | not reported | Symbolic only (W1 in R^{D x Dff}). |
| Positional encoding | applied before attention; type not specified | Sinusoidal vs learned not stated. |
| Dropout / activation | ReLU in the FFN; dropout not reported | FFN is max(0, ...) (ReLU). |
Adapter (online, inference-only).
| Parameter | Value |
|---|---|
| EMA decay (gamma) | not reported (Eq. 6) |
| Adaptation coefficients | alpha (weight adaptation), beta (feature adaptation), produced by a linear layer on the flattened EMA gradient (Eq. 7) |
| Adaptation operator | element-wise (Hadamard) multiply (Eq. 8) |
| Applied to | weights W^Q, W^K, W^V (via alpha) and hidden feature embeddings (via beta) |
Ensembler.
| Parameter | Value |
|---|---|
| Variant A — EGD init weights | w_1,i = 1/d (uniform over the d subnetworks) |
| EGD learning rate (eta) | not reported (Eq. 10) |
| EGD regret condition | T > 2 log(d) gives O(sqrt(T)) regret (Theorem 1) |
| Online Scaling module | input embedding + adapter-equipped MHSA + output linear; weights via s = Softmax(Linear(Flatten(MHSA(F))) + w) (Eq. 12) |
| Variant B — FTPL perturbation | Uniform distribution; range/scale not reported |
| FTPL repeat count (m) | not reported (Algorithm 1) |
| FTPL regret bound | expected O(sqrt(T)) (Theorem 2) |
Training.
| Parameter | Value |
|---|---|
| Loss | l2 / MSE |
| Optimizer | AdamW (following Informer) |
| Learning rate | not reported |
| Epochs | not reported (experiments repeated 3 times) |
| Batch size | 32 (GPT4TS used 8 to avoid OOM) |
| Offline pre-training data | ~1 week of archived data (2 weeks for the Kubernetes HPA test) |
| Framework / hardware | PyTorch; NVIDIA Tesla V100 32GB |
Data / windowing.
| Parameter | Value |
|---|---|
| Lookback window (L) | 1440 (= 24h at 1-min; “maximum reasonable query length the cloud system can handle online”) |
| Forecast horizon (H) | {1, 10, 30, 60} steps; main results averaged over all four |
| Sampling interval | 1 minute (ByteDance FaaS/IaaS) |
| Prediction cadence | fixed, non-overlapping intervals (e.g. every 60 min) |
Total parameter count / model size. E3Former-OS = 41k, E3Former-FTPL = 36k parameters (Figure 8). For comparison: OneNet 216k, FSNet 103k, Time-FSNet 113k. E3Former-FTPL is 16.7% of OneNet’s parameters, with >800% higher throughput and 39.6% less inference time. (The paper does not explicitly break down whether the ~5k OS-vs-FTPL delta is exactly the Online Scaling network, but that is the plausible source — see uncertainties.)
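The size comparison can be reproduced from the quoted parameter counts alone. The snippet below uses only the figures stated above (in thousands of parameters) and expresses each model as a percentage of OneNet.

```python
# Parameter counts in thousands, as quoted in the text (Figure 8).
params_k = {
    "E3Former-OS": 41,
    "E3Former-FTPL": 36,
    "OneNet": 216,
    "FSNet": 103,
    "Time-FSNet": 113,
}

# Each model's size as a percentage of OneNet's.
ratio = {name: round(100 * p / params_k["OneNet"], 1) for name, p in params_k.items()}
delta_k = params_k["E3Former-OS"] - params_k["E3Former-FTPL"]

print(ratio["E3Former-FTPL"], ratio["E3Former-OS"], delta_k)  # 16.7 19.0 5
```

The 16.7% figure matches the “about a sixth” description, and the OS variant costs roughly 5k parameters more than FTPL.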
(B) How were the parameters chosen?
Number of subnetworks d = 4, patch sizes {16, 32, 64, 128}: chosen by a sensitivity study (Section 5.4.4, Figure 10). The study sweeps the patch-size group from 1 to 6 subnetworks. Patch group {16} alone equals PatchTST (one subnetwork). Error declines slowly as subnetworks are added, but going from 4 subnetworks {16,32,64,128} to 6 {16,32,64,128,256,512} (a 50% increase in count) yields only comparable performance, and the 4-subnetwork model even beats the 6-subnetwork one on the FaaS->IaaS transfer task. This is diminishing marginal returns as network capacity saturates, so 4 is chosen as the efficiency/accuracy sweet spot.
Lookback L = 1440: chosen by a system constraint, not tuned. It is “the maximum reasonable query length that the cloud computing system we are using can handle online”, i.e. 24 hours of minute-level data, bounded by hot-storage limits.
Horizon H in {1, 10, 30, 60}: chosen to span the practical scaling range (1 to 60 minutes ahead); all main results are averaged over these four.
Training hyperparameters (MSE loss, AdamW, batch 32): held uniform across models for a fair comparison. “We follow the optimization details in Informer ... batch size for all models is set to 32, except for GPT4TS, which is set to 8 to avoid GPU memory overflow.” Experiments are repeated 3 times.
The numeric Transformer sizes (D, D’, Dff, layer count, head count h), the EMA decay gamma, EGD learning rate eta, and FTPL repeat count m: not reported, and the paper is silent on how they were set. They cannot be recovered from the text.
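To make the patch-size group concrete, the sketch below cuts one 1440-step lookback window at each of the four patch sizes. It is a minimal illustration under two assumptions that the summary does not confirm: patches do not overlap (stride equals patch size), and a short tail is padded by repeating the last value. `patchify` is a hypothetical helper.

```python
import math


def patchify(window, patch_size):
    """Split a window into non-overlapping patches (assumed stride = patch size).

    Assumption: if the window length is not a multiple of patch_size, the tail
    is padded by repeating the last observed value.
    """
    n_patches = math.ceil(len(window) / patch_size)
    pad = n_patches * patch_size - len(window)
    padded = list(window) + [window[-1]] * pad
    return [padded[i * patch_size:(i + 1) * patch_size] for i in range(n_patches)]


# Synthetic 24 h window with an hourly cycle (stand-in for a QPS series).
window = [float(i % 60) for i in range(1440)]

# One view of the same window per subnetwork.
views = {p: patchify(window, p) for p in (16, 32, 64, 128)}
counts = {p: len(v) for p, v in views.items()}
print(counts)  # {16: 90, 32: 45, 64: 23, 128: 12}
```

Under these assumptions the finest view yields 90 tokens and the coarsest only 12, so each subnetwork sees the same day of data at a different temporal granularity while sharing one backbone.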
Inputs (what it consumes)¶
A multivariate workload stream of M metrics per system, processed channel-independently (each metric forecast univariately). Metric meaning depends on service type:
QPS (Queries Per Second): FaaS function instances.
CPU usage: IaaS instances.
Lookback window (L): the recent history, fixed at L = 1440 (24 hours at 1-minute granularity) for every benchmark.
Ground truth feedback (online): as truth arrives, it is used to (a) update parameters via the Adapter, and (b) re-weight the Ensembler. E3Former-OS’s Online Scaling module even stacks recent ground truth into its input. (The exact lag at which ground truth becomes available to the OS module is not stated explicitly.)
No hand-engineered features required: patching, normalization, and the online machinery are built into the model.
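Channel independence and built-in normalization can be illustrated together. The sketch below is a simplified stand-in, not the model: it standardizes each metric on its own window (RevIN without the learnable affine parameters), forecasts in normalized space with a placeholder `repeat_last` forecaster, and maps the result back to the original scale. All function names are hypothetical.

```python
from statistics import fmean, pstdev


def normalize(channel, eps=1e-5):
    """Per-window standardization (RevIN minus its learnable affine part)."""
    mean = fmean(channel)
    std = pstdev(channel) + eps  # eps guards against a constant channel
    return [(v - mean) / std for v in channel], mean, std


def denormalize(values, mean, std):
    """Reverse the normalization so the forecast is back in original units."""
    return [v * std + mean for v in values]


def repeat_last(normalized, horizon):
    """Placeholder forecaster: repeats the last observed (normalized) value."""
    return [normalized[-1]] * horizon


def forecast_channels(window, horizon, model=repeat_last):
    """Forecast every metric separately; no information crosses channels."""
    forecasts = {}
    for name, channel in window.items():
        z, mean, std = normalize(channel)
        forecasts[name] = denormalize(model(z, horizon), mean, std)
    return forecasts


# Two metrics on very different scales share one code path.
window = {"qps": [100.0, 120.0, 140.0, 160.0], "cpu": [0.2, 0.3, 0.4, 0.5]}
forecasts = forecast_channels(window, horizon=2)
print(forecasts)
```

Because each channel is normalized with its own statistics, a QPS series in the hundreds and a CPU series below 1 reach the forecaster on a comparable scale.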
Outputs (what it produces)¶
A forecasted future workload series Y_hat of length H per channel: predicted QPS (FaaS) or CPU usage (IaaS) for each future step.
Horizon (H): {1, 10, 30, 60} steps (1-60 minutes ahead); a point forecast (no prediction intervals).
That forecast is the only output. E3Former does not itself output a pod count or scaling action — it feeds the downstream IHPA auto-scaler, which converts the forecast into scaling decisions.
How It Fits the Autoscaling Framework (MAPE-K)¶
The MAPE-K loop is the standard reference model for self-adaptive systems: Monitor -> Analyze -> Plan -> Execute, over shared Knowledge. E3Former is a forecaster feeding ByteDance’s IHPA (Intelligent Horizontal Pod Auto-scaling) platform.
Where E3Former sits: the ANALYZE stage. It is the predictive brain that turns monitored history into a forecast of future demand. It does not do Monitor, Plan, or Execute itself.
Makes scaling PROACTIVE. By forecasting ahead, it lets the autoscaler provision capacity before the load changes, hiding the resource-allocation delay. This is the whole point versus reactive (threshold) scaling.
Horizontal scaling. The deployment is horizontal pod auto-scaling (changing the number of pods), via ByteDance’s IHPA platform and, in the controlled test, Kubernetes’ HPA mechanism. (No vertical CPU/RAM resizing.)
A distinctive online twist on the loop. Unlike a static forecaster, E3Former’s Analyze stage updates itself using the feedback (realized truth) flowing back from Monitor — the Adapter and Ensembler both consume recent error to adjust. In the deployed system these two cadences are decoupled: scaling decisions execute every scaling interval (default 1 minute), while online model updates run every update interval (default 10 minutes) — “the system review[s] forecasting errors based on real-time metrics and performs online model updates.”
Plan/Execute are left to IHPA. E3Former is honest about its role: it supplies the prediction; turning that into a replica count and applying it is IHPA’s job. It is a drop-in predictive front-end for a proactive horizontal autoscaler.
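The split of responsibilities and the two decoupled cadences can be sketched as below. This is an illustration only: IHPA's actual planning policy is not described in this summary, so `plan_replicas` (size the deployment for the forecast peak) and its `qps_per_pod` parameter are hypothetical, and `run_loop` merely counts how often each kind of step would fire at the default intervals.

```python
import math


def plan_replicas(forecast_qps, qps_per_pod, min_pods=1):
    """Hypothetical Plan step: enough pods to serve the forecast peak."""
    peak = max(forecast_qps)
    return max(min_pods, math.ceil(peak / qps_per_pod))


def run_loop(total_minutes, scaling_interval=1, update_interval=10):
    """Count scaling decisions and online model updates over a time span."""
    counts = {"scale": 0, "update": 0}
    for minute in range(1, total_minutes + 1):
        if minute % scaling_interval == 0:
            counts["scale"] += 1   # Plan/Execute: the autoscaler acts on the forecast
        if minute % update_interval == 0:
            counts["update"] += 1  # Analyze: review errors, update the model online
    return counts


replicas = plan_replicas([120.0, 180.0, 90.0], qps_per_pod=50.0)
counts = run_loop(60)
print(replicas, counts)
```

Over one hour at the defaults, the autoscaler acts 60 times while the forecaster is updated 6 times, which is why the two intervals can be tuned independently.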
What the autoscaling tests showed (controlled Kubernetes HPA, replaying a real FaaS workload):
| Strategy | Avg latency | Max latency | 99-pct latency | Avg pods | Max pods |
|---|---|---|---|---|---|
| Naive HPA (reactive, built-in) | 0.231 s | 91.022 s | 0.689 s | 16.574 | 34 |
| OneNet (predictive) | 0.267 s | 52.906 s | 1.217 s | 14.949 | 22 |
| FSNet (predictive) | 0.236 s | 7.347 s | 1.039 s | 14.942 | 29 |
| E3Former-OS | 0.218 s | 7.535 s | 0.731 s | 15.368 | 21 |
| E3Former-FTPL | 0.223 s | 7.734 s | 0.767 s | 15.071 | 22 |
| Ideal HPA (perfect foresight) | 0.219 s | 9.419 s | 0.739 s | 14.608 | 24 |
Versus the built-in Naive HPA, E3Former cut average / maximum Pod occupation by 7.3% / 29.4% and average / maximum latency by 5.6% / 91.7% (the maximum-latency drop is dominated by Naive HPA’s 91s tail). It reports roughly 30% lower p99 latency, and E3Former-FTPL gives a 16.5% average-latency improvement over OneNet without a significant pod increase. E3Former offers QoS similar to the Ideal (perfect-foresight) HPA, albeit with a slightly higher average pod count.
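The quoted percentages can be recomputed from the table rows. In this sketch (row values copied from the table above; helper names are illustrative), the E3Former-OS row reproduces the 7.3% average-pod, 5.6% average-latency, and 91.7% maximum-latency reductions against Naive HPA, and the E3Former-FTPL row reproduces the 16.5% average-latency gain over OneNet. The 29.4% maximum-pod and ~30% p99 figures do not follow from the E3Former-OS versus Naive HPA rows as transcribed here (those rows give 38.2% and a 6.1% increase); the ~30% is close to the p99 gap against FSNet (29.6%), so those two figures may rest on a different baseline or on values not shown in this table.

```python
# Rows transcribed from the HPA table above.
FIELDS = ("avg_latency", "max_latency", "p99", "avg_pods", "max_pods")
hpa = {
    "Naive HPA": (0.231, 91.022, 0.689, 16.574, 34),
    "OneNet": (0.267, 52.906, 1.217, 14.949, 22),
    "FSNet": (0.236, 7.347, 1.039, 14.942, 29),
    "E3Former-OS": (0.218, 7.535, 0.731, 15.368, 21),
    "E3Former-FTPL": (0.223, 7.734, 0.767, 15.071, 22),
}


def reduction_pct(baseline, model, field):
    """Percentage by which `model` is below `baseline` (negative = higher)."""
    i = FIELDS.index(field)
    base, new = hpa[baseline][i], hpa[model][i]
    return round(100.0 * (base - new) / base, 1)


checks = {
    "avg pods, OS vs Naive": reduction_pct("Naive HPA", "E3Former-OS", "avg_pods"),
    "max pods, OS vs Naive": reduction_pct("Naive HPA", "E3Former-OS", "max_pods"),
    "avg latency, OS vs Naive": reduction_pct("Naive HPA", "E3Former-OS", "avg_latency"),
    "max latency, OS vs Naive": reduction_pct("Naive HPA", "E3Former-OS", "max_latency"),
    "p99, OS vs Naive": reduction_pct("Naive HPA", "E3Former-OS", "p99"),
    "p99, OS vs FSNet": reduction_pct("FSNet", "E3Former-OS", "p99"),
    "avg latency, FTPL vs OneNet": reduction_pct("OneNet", "E3Former-FTPL", "avg_latency"),
}
for name, value in checks.items():
    print(f"{name}: {value}%")
```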
A separate cold-start / transfer HPA test (Table 9) pretrains on other FaaS clusters then fine-tunes online on the target. There, E3Former-FTPL lands closest to Ideal (attributed to FTPL’s simplicity giving better generalizability), and all models show larger max-latency/max-pods early on while the online learner warms up.
Production deployment. E3Former is deployed in ByteDance’s IHPA platform, supporting 30+ applications (including Douyin E-Commerce, TouTiao, and Volcano Engine), with predictive auto-scaling capacity over 600,000 CPU cores, reducing resource utilization by over 40% while “essentially ensuring service quality.” (Honesty note: the conclusion names only “Douyin E-Commerce and Toutiao”; the abstract/intro additionally name Volcano Engine.)
Evaluation (datasets & metrics, briefly)¶
Datasets (twelve in total). Six ByteDance cloud-workload datasets plus six public benchmarks. The ByteDance set (all 1-minute granularity, open-sourced on Hugging Face):
| Dataset | Cloud | Metric | Variates (M) | Samples |
|---|---|---|---|---|
| FaaS_Small | Private | QPS | 7 | 23,041 |
| FaaS_Medium | Private | QPS | 93 | 23,041 |
| FaaS_Large | Private | QPS | 226 | 23,041 |
| IaaS_Small | Public | CPU | 7 | 69,764 |
| IaaS_Medium | Public | CPU | 58 | 69,764 |
| IaaS_Large | Public | CPU | 93 | 69,764 |
Public datasets: ETTh1, ETTh2 (hourly), ETTm1, ETTm2 (15 min), Electricity/ECL (321 vars, hourly), Weather (21 vars). (Table 3 lists Weather as hourly but the text says it is sampled every 10 minutes — an internal inconsistency.) The ByteDance traces span 1-2 months; experiments run 16 days (private cloud) and 49 days (public cloud), with online prediction lasting 1 week on FaaS and 6 weeks on IaaS.
Baselines (four families). Statistical (ETS, SARIMA, STL); deep nets tested with retraining (DLinear, TimesNet, iTransformer, PatchTST); LLM-based (GPT4TS, plus Online-GPT4TS); and online-prediction nets (FSNet, Time-FSNet, and OneNet, the leading baseline).
Metrics. MSE and MAE (computed on normalized data) and WMAPE (Weighted Mean Absolute Percentage Error, computed after normalization+denormalization; WMAPE = sum_j |x_hat^j - x^j| / sum_k x^k, multiplied by 100 for display on FaaS). For the HPA tests: latency percentiles (Average, Max, 99.9, 99, 90) and Pod occupation (Average, Max). The HPA comparison deliberately excludes AHPA “as it is not open-sourced.”
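The three accuracy metrics are short enough to write out. This is a minimal sketch following the WMAPE definition given above (sum of absolute errors divided by the sum of true values); for brevity it scores one raw series with all three functions, whereas the paper computes MSE and MAE on normalized data and WMAPE on denormalized data. The sample values are made up.

```python
def mse(pred, true):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def wmape(pred, true):
    """Weighted MAPE: total absolute error relative to total true volume."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / sum(true)


# Made-up QPS values for illustration.
true_qps = [100.0, 200.0, 300.0]
pred_qps = [110.0, 190.0, 330.0]

scores = {
    "mse": mse(pred_qps, true_qps),
    "mae": mae(pred_qps, true_qps),
    "wmape_x100": 100 * wmape(pred_qps, true_qps),  # x100 as in the FaaS display
}
print(scores)
```

Unlike a plain MAPE, WMAPE divides by the total true volume rather than per-step values, so a few near-zero observations cannot blow the metric up.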
Headline accuracy.
| Task | Result |
|---|---|
| Online forecasting (vs OneNet) | E3Former-OS reduces MSE 13.9% / MAE 11.7% / WMAPE 19.3% on average; E3Former-FTPL by 9.2% / 8.5% / 13.3%. |
| Online transfer / cold-start (vs OneNet) | E3Former-OS reduces MSE 15.3% / MAE 14.1% / WMAPE 26.3% across six transfer tasks; E3Former-FTPL by 3.1% / 7.9% / 13.1%. |
| Efficiency (Figure 8) | E3Former-OS 41k / E3Former-FTPL 36k params vs OneNet 216k; FTPL = 16.7% of OneNet’s params, >800% higher throughput, 39.6% less inference time. |
| Predictive auto-scaling (Tables 8-9) | vs Naive HPA: avg/max Pod -7.3% / -29.4%, avg/max latency -5.6% / -91.7%, ~30% lower p99. |
Main per-dataset forecasting results (Tables 4-5, L=1440, averaged over horizons {1,10,30,60}). Lower is better; E3Former-OS shown against the leading online baseline OneNet, in [MSE / MAE / WMAPE]:
| Dataset | E3Former-OS | E3Former-FTPL | OneNet (leading baseline) |
|---|---|---|---|
| FaaS_Small | 0.203 / 0.263 / 1.419 | 0.208 / 0.270 / 1.477 | 0.228 / 0.294 / 1.709 |
| FaaS_Medium | 0.208 / 0.267 / 1.281 | 0.214 / 0.274 / 1.381 | 0.229 / 0.293 / 1.781 |
| FaaS_Large | 0.217 / 0.277 / 1.445 | 0.214 / 0.274 / 1.385 | 0.229 / 0.293 / 1.781 |
| IaaS_Small | 0.634 / 0.563 / 0.665 | 0.646 / 0.565 / 0.669 | 0.674 / 0.599 / 0.705 |
| IaaS_Medium | 0.733 / 0.689 / 0.746 | 0.741 / 0.692 / 0.751 | 0.777 / 0.718 / 0.785 |
| IaaS_Large | 0.734 / 0.682 / 0.695 | 0.753 / 0.691 / 0.707 | 0.761 / 0.709 / 0.725 |
| ETTh1 | 0.275 / 0.352 / 0.365 | 0.285 / 0.362 / 0.373 | n/r in this table extract |
| ETTm2 | 0.328 / 0.371 / 0.071 | 0.333 / 0.374 / 0.071 | n/r in this table extract |
| Electricity (ECL) | 0.171 / 0.242 / 0.083 | 0.180 / 0.246 / 0.087 | n/r in this table extract |
(GPT4TS / Online-GPT4TS show “-” on Electricity due to CUDA out-of-memory.)
Ablation (Table 7): both new pieces matter. Removing MIMO collapses E3Former to a single subnetwork (no ensemble); removing the Online Adapter collapses the backbone to a basic Transformer; removing both = “Online PatchTST.” For E3Former-OS: removing MIMO raises MSE/MAE/WMAPE by 15.0% / 8.2% / 13.4%; removing the Adapter raises them by 29.4% / 14.3% / 13.5% in-domain and by a striking 54.2% / 25.4% / 21.0% in transfer tasks, underscoring the Adapter’s role for cold starts and domain shift.
Honesty flags on the numbers:
The abstract’s “~10% average forecast-error reduction” is a coarse headline; the precise per-metric body figures are 13.9% / 11.7% / 19.x%.
WMAPE online gain is reported inconsistently: 19.3% in Section 5.2 vs 19.1% in the Introduction/abstract.
Dataset-count ambiguity: Section 5.2 says the online averages are “across seven datasets,” but Table 4 lists six cloud datasets and Table 5 six public ones — the “seven” is unreconciled (possibly a per-domain cluster count of 7) and should be treated as ambiguous.
No standard deviations / confidence intervals are reported in the main accuracy tables despite the 3 repeats; only point estimates are given. Some baseline cells appear garbled by PDF extraction (e.g. a possibly mis-ordered Time-FSNet row in Table 6, an implausibly low DLinear FaaS_Large MSE); E3Former and OneNet headline rows are internally consistent.
Training & pre-training¶
Trained from scratch, then continuously updated online — no external foundation model.
E3Former (~36k-41k parameters) is trained from random initialization as a task-specific forecaster, then deployed in an online regime where it never stops learning. The lifecycle:
Offline pre-training: a short warm-up on archived history — about 1 week of data for the forecasting experiments (and 2 weeks for the Kubernetes HPA test). This is just enough to give the model a sensible starting point.
Online phase: from deployment onward there is no train/eval separation. Each step the model forecasts with its current weights, observes the truth, and updates: the Adapter adjusts attention weights (EMA-gradient based) and the Ensembler re-weights the subnetworks. Both achieve provable O(sqrt(T)) regret.
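The forecast-observe-update cycle of the online phase can be sketched on a toy model. This is not the paper's Adapter, whose update rule is not reproduced in this summary: the "model" here is a single scalar level, and the update is a gradient step smoothed by an exponential moving average. The step size `lr`, the decay `gamma`, and the synthetic stream are assumptions chosen for illustration.

```python
def online_level_tracker(stream, lr=0.1, gamma=0.9):
    """Test-then-train loop on a one-parameter stand-in model.

    Assumption: an EMA of past gradients drives each update, as a loose
    analogue of an EMA-gradient adapter; lr and gamma are illustrative.
    """
    level, grad_ema, errors = 0.0, 0.0, []
    for truth in stream:
        forecast = level                         # 1. forecast with current weights
        errors.append((forecast - truth) ** 2)   # 2. observe the truth, score the forecast
        grad = 2.0 * (forecast - truth)          # gradient of the squared error
        grad_ema = gamma * grad_ema + (1.0 - gamma) * grad
        level -= lr * grad_ema                   # 3. update before the next step
    return level, errors


# Synthetic drift: the workload level jumps from 0.5 to 0.9 at step 100.
stream = [0.5] * 100 + [0.9] * 200
level, errors = online_level_tracker(stream)
print(f"final level {level:.4f}, error at shift {errors[100]:.3f}, final error {errors[-1]:.2e}")
```

The error spikes at the shift and then decays as the updates pull the model toward the new regime, which is the behavior a frozen model cannot show.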
Training details: MSE (l2) loss, AdamW optimizer (following Informer), batch size 32 (GPT4TS used 8 to avoid OOM), implemented in PyTorch on NVIDIA Tesla V100 32GB GPUs, with experiments repeated 3 times. The number of epochs and the learning rate are not reported.
The paper motivates the brevity of pre-training by citing commercial cold-start requirements: AWS ECS needs at least 24 hours of history (two weeks ideal), Google Cloud 3 days, Azure 7 days, Alibaba Cloud 7 days — and shows (via the transfer/cold-start tests) that E3Former’s online adaptation lets it perform well even from a thin pre-training base.
Strengths¶
Built for drift (online learning with guarantees). It keeps learning on the live stream rather than going stale, and the ensemble’s re-weighting carries provable no-regret (O(sqrt(T))) bounds, a principled answer to concept drift that static forecasters lack.
Captures fine-grained multi-scale periodicity. Multi-resolution patching at {16,32,64,128} lets different subnetworks specialize in different nested cycles, addressing the minute-granular-yet-daily/weekly-periodic structure that the paper says recent methods miss.
Extremely lightweight (the serverless win). Just ~36k-41k parameters (E3Former-FTPL’s 36k is 16.7% of OneNet’s 216k), with >800% higher throughput and 39.6% lower inference time, thanks to the MIMO shared backbone making the ensemble nearly free.
Two deployment-tunable variants. OS for maximum accuracy, FTPL for maximum efficiency/throughput and (per the cold-start test) best generalization.
Strong, broad accuracy gains. Beats the best online baseline by ~14% MSE / 12% MAE / 19% WMAPE online, and by even more on transfer/cold-start tasks.
Real production validation + open data. Deployed in ByteDance IHPA across 30+ apps and 600,000+ CPU cores (>40% resource reduction), with the six cloud datasets open-sourced.
Limitations¶
Only the prediction, not the actuation. E3Former stops at the forecast; converting it to a pod count and applying it is left to IHPA. Nonlinear scaling behavior, multi-resource bottlenecks, and pod startup dynamics are not modeled by the forecaster.
Univariate / channel-independent by design. Despite ingesting M metrics, it forecasts each one independently and ignores cross-metric / cross-instance correlations, leaving information on the table where services genuinely interact.
Diversity is implicit and unmeasured. Ensemble diversity is assumed to emerge solely from different patch sizes; there is no explicit diversity loss, and inter-subnetwork diversity is never quantified. Returns diminish quickly past 4 subnetworks.
Many internal numbers undisclosed. Hidden dimensions (D, D', Dff), layer/head counts, EMA decay gamma, EGD learning rate eta, FTPL repeat count m, learning rate, and epoch count are all unreported, limiting reproducibility from the paper alone.
Reporting inconsistencies. The abstract’s “~10%” vs body’s 13.9/11.7/19.x%, WMAPE 19.1% vs 19.3%, the unreconciled “seven datasets” claim, a Figure-4-vs-experiment patch-size mismatch, and the Weather-frequency conflict are all present and unresolved in the text.
No uncertainty quantification. Point forecasts only (no prediction intervals), and no confidence intervals on the main tables despite 3 repeats.
Controlled HPA test, not a long-term production study of the scaling loop. The end-to-end auto-scaling numbers come from a controlled Kubernetes test replaying a derived FaaS workload; the production deployment is described in aggregate (cores, app count, % savings) rather than measured end-to-end in the paper.
Glossary¶
Workload forecasting: predicting a cloud service’s future demand (QPS, CPU, etc.).
Autoscaling: automatically changing the resources allocated to an app to match demand.
Reactive vs. proactive scaling: react after a threshold is crossed (always late) vs. forecast and scale ahead (hides provisioning delay).
Online learning: the model never stops training — it updates its parameters from each new streaming observation. Contrast with static / offline training (train once, then freeze).
Concept drift / distribution shift: the statistical relationship the model learned changes over time, degrading a frozen model.
Regret / no-regret / sub-linear regret: cumulative error of the online model minus that of the best fixed model chosen in hindsight; “no-regret” means this grows slower than time (o(T)), so the average gap vanishes. E3Former’s ensembles achieve O(sqrt(T)).
Ensemble / subnetwork: here, d forecasts from one shared backbone fed the same series at d different patch sizes; the Ensembler blends or picks among them.
MIMO (Multi-Input Multi-Output): one parameter-shared model that learns from several inputs and emits several outputs; this is the trick that makes the ensemble nearly free.
Patch / patching / PatchTST: chopping a series into short chunks (“patches”) used as tokens; PatchTST is the patch-based Transformer forecaster E3Former is built on.
Multi-resolution patching: patching the same series at several patch sizes so each exposes a different periodicity.
RevIN (Reversible Instance Norm): normalize each input (mean/std + learnable affine), reversed at the output, to handle differing magnitudes / distribution shift.
Channel independence: each metric/series forecast separately, ignoring cross-series correlations.
Adapter / EMA-gradient: the inference-only module that nudges attention weights using an Exponential Moving Average of gradients (a smoothed “gradient memory”).
EGD (Exponentiated Gradient Descent): online algorithm that keeps simplex weights over experts and updates them multiplicatively by recent error.
Online Scaling (OS): E3Former-OS’s ensembler — EGD refined by a small adapter-equipped attention net; the accurate variant.
FTPL (Follow-The-Perturbed-Leader): a non-parametric online method that hard-picks one expert via perturbed cumulative loss; E3Former-FTPL is the efficient variant.
Slow switch phenomenon: an exponentially-weighted forecaster’s sluggish reaction to a sudden distribution shift; Online Scaling exists to counter it.
Self-attention / MHSA / Transformer: neural mechanism where each element weighs all others; MHSA runs several attention “heads” in parallel.
MAPE-K: Monitor-Analyze-Plan-Execute over shared Knowledge; E3Former lives in Analyze.
Horizontal scaling / HPA / IHPA: adding/removing instances (pods); HPA is Kubernetes’ Horizontal Pod Autoscaler; IHPA is ByteDance’s Intelligent Horizontal Pod Auto-scaling platform where E3Former is deployed.
SLA: Service Level Agreement — the promise made to users about speed/uptime; violating it is the cost of under-provisioning.
FaaS / IaaS: cloud service types — Function-as-a-Service (workload = QPS) and Infrastructure-as-a-Service (workload = CPU).
MSE / MAE / WMAPE: error metrics (lower is better); WMAPE is a weighted absolute-percentage error.
Serverless / lean operational ethos: pay-per-use cloud model demanding minimal compute/parameter overhead — the constraint that pushes E3Former to be tiny.
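The EGD, FTPL, and regret entries can be tied together in one small sketch. It is a textbook-style illustration under stated assumptions, not the paper's Algorithm 1: the EGD step uses each expert's loss directly as its gradient signal (Hedge-style), the FTPL step draws a single exponential perturbation per expert (the paper's repeat count m is not modeled), and the learning rate, noise scale, and per-round losses are made up. The attention-based refinement in Online Scaling is not shown.

```python
import math
import random


def egd_update(weights, losses, eta=0.5):
    """Multiplicative update that keeps the weights on the simplex."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    return [s / total for s in scaled]


def ftpl_pick(cumulative_losses, rng, noise_mean=1.0):
    """Hard-pick the expert with the lowest randomly perturbed cumulative loss."""
    perturbed = [c - rng.expovariate(1.0 / noise_mean) for c in cumulative_losses]
    return min(range(len(perturbed)), key=perturbed.__getitem__)


patch_sizes = [16, 32, 64, 128]        # one expert (subnetwork) per patch size
round_losses = [0.9, 0.6, 0.2, 0.7]    # made-up losses; expert at index 2 is best
rounds = 50

weights = [1.0 / len(patch_sizes)] * len(patch_sizes)
cumulative = [0.0] * len(patch_sizes)
learner_loss = 0.0
for _ in range(rounds):
    # Loss of the weighted blend (linear in the weights for this illustration).
    learner_loss += sum(w * loss for w, loss in zip(weights, round_losses))
    weights = egd_update(weights, round_losses)
    cumulative = [c + loss for c, loss in zip(cumulative, round_losses)]

# Regret: learner's cumulative loss minus the best fixed expert in hindsight.
regret = learner_loss - min(cumulative)
leader = ftpl_pick(cumulative, random.Random(0))
print(f"EGD weights {[round(w, 3) for w in weights]}, regret {regret:.2f}")
print(f"FTPL picks patch size {patch_sizes[leader]}")
```

EGD blends all experts and shifts weight gradually toward the best one, while FTPL commits to a single expert each round; in this toy run the regret stays far below the number of rounds, which is what a sub-linear bound promises.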
- Chen, J., He, X., Ye, H., Jiang, F., Zhang, T., Chen, J., & Gao, X. (2025). Online Ensemble Transformer for Accurate Cloud Workload Forecasting in Predictive Auto-Scaling. arXiv. 10.48550/ARXIV.2508.12773