Fremer - Luca's Research @ PEG

Ye et al. (2025)

TL;DR

Cloud platforms constantly ask the question: “how much traffic / CPU / requests will this service get over the next day, so I can spin up the right number of machines ahead of time?” Answering it well is called workload forecasting, and it is what lets autoscaling be proactive (scale before the spike) instead of reactive (scale only after the spike has already hurt users). Modern forecasters use Transformers (attention-based neural networks), but at ByteDance’s scale (over 100,000 forecasts per hour) standard Transformers are too slow and too heavy. Fremer is a redesigned, much smaller Transformer that does its work in the frequency domain: it looks at the repeating cycles / periods in the data instead of the raw time series. It is roughly 3x faster and ~12x smaller than the best competing model while being more accurate. When plugged into a real Kubernetes autoscaler, it cut average request latency by ~19% while using ~2% fewer pods than the same autoscaler driven by PatchTST forecasts. The model only produces the forecast; a standard Kubernetes autoscaler still does the actual scaling.

The Problem (and why simple autoscaling isn’t enough)

Autoscaling means automatically adjusting how many resources (virtual machines, containers/“pods”, CPU, memory) a cloud application gets, so it can handle its load without wasting money.

Give it too little -> requests slow down, time out, SLAs/SLOs are violated. (Under-provisioning.)
Give it too much -> you pay for idle machines. (Over-provisioning.)

The naive approach is reactive autoscaling: watch a metric like CPU, and when it crosses a threshold (e.g. CPU > 70%), add machines. The problem is that booting a new machine or container takes time (the “cold start” / provisioning delay, often tens of seconds to minutes). So a reactive scaler is always lagging behind — by the time the new capacity is ready, users have already suffered the slowdown.

Proactive (predictive) autoscaling fixes this by forecasting the future workload and scaling ahead of time, hiding the provisioning delay. The forecast is the hard part — and that is exactly what Fremer provides.
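To make the lag concrete, here is a toy simulation (not from the paper). The load trace, the 2-step provisioning delay, and the helper name `underprovisioned_steps` are all made up for illustration, and the proactive scaler is handed a perfect forecast, which is the best case.

```python
def underprovisioned_steps(load, requested, delay):
    """Count time steps where ready capacity is below the load.

    requested[t] is the capacity asked for at step t; it only becomes
    usable `delay` steps later (the provisioning delay).
    """
    short = 0
    for t, demand in enumerate(load):
        ready = requested[max(t - delay, 0)]
        if ready < demand:
            short += 1
    return short


load = [10, 10, 10, 30, 30, 30, 10, 10]   # hypothetical demand per step
delay = 2                                 # hypothetical provisioning delay

# Reactive: ask for whatever the load is right now.
reactive = list(load)
# Proactive: ask now for the load expected `delay` steps ahead (perfect forecast).
proactive = [load[min(t + delay, len(load) - 1)] for t in range(len(load))]

reactive_short = underprovisioned_steps(load, reactive, delay)
proactive_short = underprovisioned_steps(load, proactive, delay)
print(reactive_short, proactive_short)
```

In this toy trace the reactive scaler is short for the first two steps of the spike, while the forecasting scaler is never short. A real forecaster is imperfect, so the gap will be smaller in practice.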

Why is forecasting hard here?

Scale. ByteDance Cloud runs 100,000+ forecasting tasks per hour across thousands of instances. The model must be tiny and fast, not just accurate.
Complex, overlapping cycles. Real cloud workloads repeat on multiple periods at once — hourly, daily, and weekly (think: a website busy every weekday lunchtime, busiest on weekends, quiet at 3am). Standard Transformers struggle to untangle these overlapping rhythms.
Noise and overfitting. Raw time series are noisy, and big neural nets easily memorize noise instead of learning the real pattern.

Background

Time series. Just a sequence of numbers measured over time, e.g. CPU usage every 10 minutes. Fremer’s job: given the recent history, predict the next stretch.

Lookback window (L) and forecast horizon (T). The model reads the last L time steps (the lookback) and predicts the next T time steps (the horizon). In the paper’s main experiments L = 5 days of history and T = 1 day ahead.
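As a small illustration of how L and T turn a long trace into training examples, the sketch below slices a series into (history, future) pairs. The helper `make_windows`, the one-day stride, and the placeholder series are assumptions for illustration; the paper's actual data loader may differ.

```python
POINTS_PER_DAY = 24 * 60 // 5   # 288 samples per day at 5-minute granularity
L = 5 * POINTS_PER_DAY          # lookback: 5 days -> 1440 points
T = 1 * POINTS_PER_DAY          # horizon: 1 day  -> 288 points


def make_windows(series, lookback, horizon, stride):
    """Slice a series into (history, future) pairs of the given lengths."""
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(0, last_start + 1, stride):
        history = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        pairs.append((history, future))
    return pairs


series = list(range(7 * POINTS_PER_DAY))   # one week of placeholder samples
pairs = make_windows(series, L, T, stride=POINTS_PER_DAY)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))
```

One week of 5-minute data yields only two non-overlapping-by-a-day examples here, which is why real pipelines usually use a much smaller stride.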

Transformer & “attention”. A Transformer is a neural network whose core trick, attention, lets every point in a sequence “look at” and weigh every other point to decide what matters. It is powerful but expensive: its cost grows with the square of the sequence length (O(L²)). For long minute-level series, that is brutal.

Time domain vs. frequency domain. This is the key idea.

The time domain is the raw curve: “value at each minute.”
The frequency domain is the same data re-expressed as a set of repeating waves of different speeds: a slow weekly wave, a faster daily wave, an even faster hourly wave, plus tiny fast wiggles that are mostly noise. The tool that converts between them is the Fourier Transform (computed efficiently by the FFT, Fast Fourier Transform).
Analogy: a musical chord. In the time domain you see a messy wiggling waveform; in the frequency domain you see “this is a C, an E, and a G played together.” Cloud workloads are similarly a “chord” of an hourly note + a daily note + a weekly note. The frequency domain makes those notes easy to read separately, and the whole signal becomes compact (a few strong frequencies carry most of the meaning) — which is what makes Fremer both fast and accurate.

Why frequency helps specifically here: periods, trends, and noise get cleanly separated; two workloads that look different in time often look similar in frequency (good for generalizing across instances); and a few frequency “peaks” summarize a long history.
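The "chord" picture can be checked with a few lines of NumPy. The synthetic signal below (hourly plus daily sine waves, a constant base load, and some noise) is invented for illustration and is not one of the paper's datasets; 10-minute sampling is also an assumption.

```python
import numpy as np

steps_per_hour = 6                       # 10-minute samples (assumed)
n = 7 * 24 * steps_per_hour              # one week -> 1008 samples
t = np.arange(n)

hourly = 1.0 * np.sin(2 * np.pi * t / steps_per_hour)
daily = 3.0 * np.sin(2 * np.pi * t / (24 * steps_per_hour))
noise = 0.3 * np.random.default_rng(0).standard_normal(n)
signal = 10.0 + hourly + daily + noise   # 10.0 is a constant base load

spectrum = np.fft.rfft(signal)
freqs = np.fft.rfftfreq(n)               # bin frequencies in cycles per sample
magnitude = np.abs(spectrum)
magnitude[0] = 0.0                       # ignore the mean (zero-frequency) bin

top2 = np.argsort(magnitude)[-2:]        # the two strongest frequency bins
periods_in_samples = sorted(1.0 / freqs[top2])
print(periods_in_samples)
```

The two strongest bins correspond to periods of 6 samples (one hour) and 144 samples (one day): out of 505 frequency bins, two carry almost all of the structure, and the noise is spread thinly over the rest.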

Contribution in Simple Terms

Fremer is an encoder-only Transformer that operates on the frequency spectrum of the workload instead of the raw time series. The genuinely new pieces:

It forecasts a spectrum, not a curve. Fremer predicts the frequency spectrum of the complete series (history + future) from the spectrum of the history, then converts that back to a time curve. Working in frequency makes it naturally good at multi-period (hourly+daily+weekly) workloads.
Learnable Linear Padding (LLP) — fixes a subtle, mostly-ignored bug called frequency mis-alignment. (Explained step-by-step below. The authors claim to be the first to address it.)
Complex-valued Spectrum Attention (CSA) — instead of paying attention to individual frequencies one by one, it groups frequencies into combinations and attends over those. A single frequency rarely means much on its own (real periodicity shows up as harmonics — combinations), so this both improves accuracy and slashes compute (attention cost drops to roughly 1/200 of a standard Transformer’s).
Frequency Filters — a high-pass and low-pass filter that throw away the noisiest top frequencies and protect the model from overfitting to the dominant low frequencies.
Four open-source real-world workload datasets from ByteDance (FaaS, IaaS, PaaS, RDS) covering thousands of instances over 1–2 months — a contribution to the research community in its own right.

The headline result: better accuracy than every state-of-the-art baseline on workload forecasting, while being smaller and faster, and a demonstrated win in a real Kubernetes autoscaling test.

How It Works, Step by Step

Here is the journey from raw history to forecast. (L = lookback length, T = horizon length.)

Step 1 — Learnable Linear Padding (LLP): fix frequency mis-alignment. The FFT’s frequency “ruler” depends on the series length: a length-L series and a length-L+T series measure frequencies at different spacings. So a true period in the future (e.g. period 24) may simply not exist as a sampled frequency in the history’s spectrum — the history can’t “see” it cleanly. Fremer’s fix: before doing any FFT, extend the length-L input to length L+T using a small learnable linear layer (X_p = concat(X, WᵀX + b)). This makes the history’s frequency ruler match the full series’s ruler, so frequencies line up. Ablation shows this is the single most important component — remove it and accuracy drops sharply.
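The mis-alignment is easy to see with small numbers. A length-N DFT only samples frequencies k/N, so a period shows up as an exact bin only when N is a multiple of it. The sketch below uses made-up sizes (L = 60, T = 12, period 24) and plain Python lists; the helper names are hypothetical, and the fixed averaging weights stand in for parameters that Fremer would learn.

```python
def has_exact_bin(n, period):
    """True if a length-n DFT samples the frequency 1/period exactly."""
    return n % period == 0


def sampled_periods_near(n, period):
    """Periods (in samples) of the two DFT bins that bracket 1/period."""
    k = n // period                  # bin just below the target frequency
    return n / k, n / (k + 1)


def linear_pad(x, W, b):
    """Extend x (length L) to length L+T as concat(x, W^T x + b).

    W is an L-by-T weight matrix and b a length-T bias. In Fremer both
    are learned; here they are fixed values supplied by the caller.
    """
    extension = [
        sum(W[i][j] * x[i] for i in range(len(x))) + b[j]
        for j in range(len(b))
    ]
    return list(x) + extension


L, T, period = 60, 12, 24
history = [float(i % period) for i in range(L)]

# Stand-in weights: every padded value is the mean of the history.
W = [[1.0 / L] * T for _ in range(L)]
b = [0.0] * T
padded = linear_pad(history, W, b)

print(has_exact_bin(L, period), sampled_periods_near(L, period))
print(has_exact_bin(len(padded), period))
```

With L = 60 the history's spectrum has bins for periods 30 and 20 but none for 24; after padding to 72 points, period 24 lands exactly on bin 3, the same grid the full history-plus-future series uses.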

Step 2 — rFFT to the frequency domain. Apply the real-valued FFT to the padded series, producing a complex-valued spectrum of about (L+T)/2 + 1 frequency points (real FFT exploits symmetry to store only the non-redundant half).

Step 3 — Frequency Filters.

Low-Pass Filter (drop highest ~1% of frequencies): the very top frequencies are mostly noise/artifacts; removing them lowers training, validation, and test error.
High-Pass handling of the lowest ~3%: the lowest frequencies hold the trend/mean and have large magnitude, so the model tends to overfit them. Fremer keeps them but removes them from the attention backbone’s input, then re-adds them to the output at the end. This both preserves the trend and curbs overfitting (and gives a simple knob to tune generalization).
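A minimal sketch of the two filters, assuming the ~3% and ~1% fractions are taken over the rFFT bins and rounded up (the rounding rule and the helper names `split_spectrum` / `merge_spectrum` are assumptions, not taken from the paper):

```python
import numpy as np


def split_spectrum(spectrum, low_frac=0.03, high_frac=0.01):
    """Split an rFFT spectrum into (held-out low band, backbone band).

    The highest `high_frac` of bins is discarded (low-pass filter); the
    lowest `low_frac` is returned separately so it can bypass the backbone.
    """
    n_bins = len(spectrum)
    n_low = int(np.ceil(low_frac * n_bins))
    n_high = int(np.ceil(high_frac * n_bins))
    low_band = spectrum[:n_low]
    mid_band = spectrum[n_low:n_bins - n_high]
    return low_band, mid_band


def merge_spectrum(low_band, mid_band, n_bins):
    """Re-insert the held-out low band and zero-fill the discarded top bins."""
    full = np.zeros(n_bins, dtype=complex)
    full[:len(low_band)] = low_band
    full[len(low_band):len(low_band) + len(mid_band)] = mid_band
    return full


L, T = 720, 144
spectrum = np.fft.rfft(np.random.default_rng(0).standard_normal(L + T))
low_band, mid_band = split_spectrum(spectrum)
restored = merge_spectrum(low_band, mid_band, len(spectrum))
print(len(spectrum), len(low_band), len(mid_band))
```

For L + T = 864 this gives 433 bins: 13 are set aside as the trend band, 415 go to the backbone, and the top 5 are dropped. In the model the backbone would transform the middle band before the merge; here it is passed through unchanged.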

Step 4 — Frequency Reversible Instance Norm (F-RIN). RevIN is a normalization trick (subtract mean, divide by std) originally used in the time domain to handle “distribution shift” (when the scale/level of a series drifts). Fremer applies it to the spectrum so that workloads with very different magnitudes get mapped to a similar distribution. The mean and std are stored and used to undo the normalization at the end (that’s the “reversible” part).
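The reversible part can be sketched as a normalize/denormalize pair. How the paper computes statistics over complex values is not spelled out here, so this sketch assumes a complex mean and a real standard deviation per instance, with no learnable affine parameters; the function names are hypothetical.

```python
import numpy as np


def frin_normalize(spectrum, eps=1e-8):
    """Normalize one instance's spectrum and return the stats needed to undo it."""
    mean = spectrum.mean()        # complex mean over frequency bins
    std = spectrum.std() + eps    # np.std of complex input is real-valued
    return (spectrum - mean) / std, (mean, std)


def frin_denormalize(normalized, stats):
    """Invert frin_normalize using the stored statistics."""
    mean, std = stats
    return normalized * std + mean


rng = np.random.default_rng(0)
small = np.fft.rfft(rng.standard_normal(256))                   # low-magnitude workload
large = np.fft.rfft(1000.0 * rng.standard_normal(256) + 500.0)  # high-magnitude workload

norm_small, stats_small = frin_normalize(small)
norm_large, stats_large = frin_normalize(large)
print(abs(norm_small).max(), abs(norm_large).max())
```

Both spectra end up with unit spread even though the raw magnitudes differ by about three orders of magnitude, and `frin_denormalize` recovers the original values.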

Step 5 — Project to frequency combinations. A trainable linear projection maps the many individual frequency points down to a smaller set of L' frequency combinations (by default L' ≈ L/5). This is the compression that makes attention cheap, and it reflects the insight that meaning lives in combinations (harmonics), not single frequencies.

Step 6 — Complex-valued Spectrum Attention (CSA). This is the heart of the model — a multi-head attention mechanism (8 heads by default) adapted to complex numbers and run over the frequency combinations.

It builds Query/Key/Value projections from the combined spectrum and computes attention using the magnitude of the complex Query·Key product: Attention = Softmax(|Q·Kᵀ|)·V.
The “inverted” multi-head design (splitting across the combination dimension) cuts complexity to O(L'²/H) (with H heads), making CSA dramatically cheaper than ordinary O(L²) attention.
Different heads specialize: the paper’s visualization shows one head focusing on the hourly period and another on the sub-hourly period.
Standard Transformer touches (LayerNorm, FeedForward, residual connections) follow, extended to complex numbers.
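The attention formula above, Softmax(|Q·Kᵀ|)·V, can be written as a single-head NumPy sketch. The token layout, the absence of a scaling factor, and the use of a plain transpose are assumptions carried over from the formula as quoted; the multi-head split, LayerNorm, and feed-forward parts are left out.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax over a real-valued array."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def complex_attention(Z, Wq, Wk, Wv):
    """Single-head attention over complex tokens: Softmax(|Q K^T|) V."""
    Q, K, V = Z @ Wq, Z @ Wk, Z @ Wv
    scores = np.abs(Q @ K.T)             # magnitudes, so scores are real
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # real weights mix complex values


rng = np.random.default_rng(0)
n_tokens, d = 6, 4                       # hypothetical sizes


def random_complex(shape):
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)


Z = random_complex((n_tokens, d))
Wq, Wk, Wv = (random_complex((d, d)) for _ in range(3))
out, weights = complex_attention(Z, Wq, Wk, Wv)
print(out.shape, out.dtype)
```

Taking the magnitude is what lets an ordinary softmax be reused: the weights stay real and non-negative while the values they mix remain complex.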

Contrast with FEDformer (the closest prior frequency-Transformer): FEDformer attends over individual sampled frequencies and mixes channels together. Fremer attends over frequency combinations and treats each channel independently (because the global periodic patterns of one instance rarely correlate with another’s). Channel independence is also what gives Fremer its strong transfer/zero-shot ability.

Steps 7–9 — Back to time and slice off the forecast. De-normalize (undo F-RIN), re-insert the low-frequency trend that was set aside, recover the full spectrum, run the inverse rFFT to get a time-domain series of length L+T, and keep the last T points. That is the predicted future workload Ŷ.
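The shapes along the round trip can be traced with NumPy. This sketch uses zero padding as a stand-in for LLP and an identity function as a stand-in for the backbone, so the "forecast" it produces is trivially zero; it only illustrates the bookkeeping, not the model.

```python
import numpy as np

L, T = 720, 144                                   # 5 days / 1 day at 10-min steps
rng = np.random.default_rng(0)
history = rng.standard_normal(L)

padded = np.concatenate([history, np.zeros(T)])   # stand-in for the LLP output
spectrum = np.fft.rfft(padded)                    # (L+T)//2 + 1 complex bins

# The backbone would transform `spectrum` here; this sketch leaves it unchanged.
predicted_spectrum = spectrum

full_series = np.fft.irfft(predicted_spectrum, n=L + T)   # back to length L+T
forecast = full_series[-T:]                               # keep the last T points
print(spectrum.shape, full_series.shape, forecast.shape)
```

Passing `n=L + T` to `irfft` matters: without it NumPy infers the output length from the number of bins, which gives the wrong answer when L + T is odd.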

Non-Transformer ingredients used: FFT/rFFT and inverse FFT (signal processing), frequency filters (signal processing), a linear padding layer, RevIN normalization, and channel-independence. There is no diffusion, GAN, or similarity search; the only convolution-like operation is the FFT machinery.

Inputs (what it consumes)

A univariate workload time series per computing instance (channel-independent: each instance/metric forecast on its own). Depending on dataset the metric is:
- CPU usage (IaaS virtual machines; PaaS services in milli-cores; the public Materna VM traces),
- QPS = Queries Per Second (FaaS function instances; RDS / MySQL instances).
Lookback window (L): the recent history. Main experiments use 5 days = 1440 points at 5-min granularity, or 720 points at 10-min granularity. Sensitivity tests sweep L from 36 up to 1008; accuracy jumps once L ≥ 144, which is one full day at 10-min granularity.
No hand-engineered features are required — the FFT/frequency machinery is built into the model. (For general-domain tests it also handles multivariate datasets like Traffic and Electricity, but always channel-independently.)

Outputs (what it produces)

A forecasted future workload series Ŷ of length T — i.e. the predicted CPU%/QPS for each future time step.
Horizon (T): main experiments predict 1 day ahead (288 points at 5-min, 144 at 10-min). Sensitivity tests go from 18 steps up to 720 (hours to days).
That forecast is the only output. Fremer does not itself output a number of pods or a scaling action — it feeds a downstream autoscaler. In the Kubernetes experiment, the predicted next-day workload is converted into a pod-count recommendation under the assumption that workload is linearly related to resource consumption, and the standard HPA logic does the rest.
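Under the linear workload-to-resource assumption, turning a forecast into a pod recommendation is a division and a round-up. The sketch below is an illustration only: `load_per_pod` (how much load one pod is assumed to handle), the `min_pods` floor, and the sample QPS values are hypothetical and do not come from the paper.

```python
import math


def pods_needed(predicted_load, load_per_pod, min_pods=1):
    """Map each forecast value to a pod count, assuming load scales linearly with pods."""
    return [
        max(min_pods, math.ceil(value / load_per_pod))
        for value in predicted_load
    ]


predicted_qps = [120.0, 480.0, 950.0, 300.0]   # hypothetical forecast values
recommendation = pods_needed(predicted_qps, load_per_pod=50.0)
print(recommendation)
```

Rounding up rather than to the nearest integer errs toward over-provisioning, which is usually the cheaper mistake when the alternative is timeouts.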

How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop is the standard reference model for self-adaptive systems: Monitor -> Analyze -> Plan -> Execute, over shared Knowledge.

Where Fremer sits: the ANALYZE stage. It is a time-series forecaster — the predictive brain that turns monitored history into a forecast of future demand. It does not do Monitor, Plan, or Execute itself.
Makes scaling PROACTIVE. By forecasting the next day’s load, it lets the autoscaler provision capacity before the spike, hiding cold-start/provisioning delay. This is the whole point versus reactive (threshold) scaling.
Horizontal scaling. The evaluation uses the Kubernetes Horizontal Pod Autoscaler (HPA) — scaling out/in by changing the number of pods. (It does not do vertical CPU/RAM resizing.)
Plan/Execute are left to the standard HPA. Fremer is honest about its role: it supplies the prediction; converting that prediction into a replica count and applying it is done by Kubernetes’ existing HPA mechanism (using the linear workload->resource assumption). So Fremer is a drop-in predictive front-end for an off-the-shelf proactive autoscaler.

What the autoscaling test showed (24-hour Kubernetes HPA simulation, replaying a real FaaS workload; 10s timeout):

Strategy	Avg latency	99-pct latency	Timeout rate	Avg pods
Naïve HPA (reactive)	0.996 s	3.764 s	0.386%	29.18
PatchTST (predictive)	1.017 s	3.644 s	0.132%	22.48
DLinear (predictive)	1.081 s	4.108 s	0.22%	19.25
Fremer (predictive)	0.826 s	2.292 s	0.102%	21.95
Ideal (perfect foresight)	0.789 s	2.063 s	0.026%	21.38

Versus PatchTST, Fremer cut average latency by 18.78% and used 2.35% fewer pods — and it lands close to the “Ideal” upper bound that uses the true future values. (Note: DLinear used the fewest pods but had the worst latency, because it under-predicted load and starved the service — a good illustration that under-provisioning hurts.)
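The relative improvements can be recomputed directly from the table. The numbers below are copied from it; only the helper name `reduction_pct` is new. Because the table values are rounded, the pod figure comes out near 2.36%, in line with the reported 2.35%.

```python
# (average latency in seconds, average pods), copied from the table above
results = {
    "Naive HPA": (0.996, 29.18),
    "PatchTST": (1.017, 22.48),
    "DLinear": (1.081, 19.25),
    "Fremer": (0.826, 21.95),
    "Ideal": (0.789, 21.38),
}


def reduction_pct(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


base_latency, base_pods = results["PatchTST"]
fremer_latency, fremer_pods = results["Fremer"]

latency_cut = reduction_pct(base_latency, fremer_latency)
pod_cut = reduction_pct(base_pods, fremer_pods)
print(f"{latency_cut:.2f}% lower latency, {pod_cut:.2f}% fewer pods")
```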

Strengths

Accuracy: beats all SOTA baselines on workload forecasting, lowering error by an average of 5.5% (MSE), 4.7% (MAE), and 8.6% (SMAPE) against SOTA models overall, and by roughly 2.9% / 2.5% / 3.5% against the best baseline on each dataset.
Efficiency / lightweight: only ~0.57M parameters (vs. 6.85M for PatchTST, 14M for FEDformer, 111M+ for Fredformer). On IaaS it cut training time 82%, inference 28%, and parameter count 92% versus PatchTST. Attention cost is ~1/200 of a standard Transformer — crucial for 100k+ forecasts/hour.
Multi-period robustness: the frequency-domain design naturally disentangles hourly/daily/weekly cycles where time-domain Transformers struggle.
Generalizes well: strong intra-dataset transfer (predict for unseen instances) and even cross-dataset / zero-shot transfer (train on PaaS, predict RDS), thanks to channel independence — useful as a foundation for large shared forecasting models.
Beyond cloud: also tops baselines on general periodic benchmarks (Traffic, Electricity, PEMS) across horizons 96–720.
Real-world validation + open data/code: demonstrated end-to-end on Kubernetes HPA; four ByteDance datasets and source code released.
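The efficiency figures quoted in this list hang together arithmetically. The check below uses only numbers stated in this summary; reading the ~1/200 attention cost as (1/5)² × 1/8 (from L' ≈ L/5 and 8 heads) is an interpretation that happens to match, not a derivation quoted from the paper.

```python
fremer_params = 0.57e6
patchtst_params = 6.85e6

size_ratio = patchtst_params / fremer_params             # how many times smaller
param_reduction = 1 - fremer_params / patchtst_params    # fraction of parameters removed

compression = 5    # L' is about L / 5
heads = 8          # default head count
# O(L'^2 / H) relative to O(L^2): (1/5)^2 * (1/8)
attention_cost_ratio = 1 / (compression ** 2 * heads)

print(f"{size_ratio:.1f}x smaller, {param_reduction:.0%} fewer parameters")
print(f"attention cost ratio: 1/{round(1 / attention_cost_ratio)}")
```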

Training & pre-training

Trained from scratch — no external pretrained or foundation model.

Fremer (~0.57M parameters) is trained from random initialization as a task-specific forecaster. It is implemented in PyTorch and trained on an NVIDIA A100-SXM 80GB GPU, using the Adam optimizer (initial learning rate 1e-3) and MSE loss. Each dataset is split into its own train/test partition (e.g. 8:2), with normalization statistics computed only on the training set to avoid test leakage. The training data spans the four ByteDance traces (FaaS, IaaS, PaaS, RDS) plus public benchmarks (Materna, Traffic, Electricity, PEMS).
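The leakage point is worth a concrete sketch: the normalization statistics come from the training portion only and are then reused on the test portion. The chronological 8:2 cut, the z-score normalization, and the helper name `split_and_normalize` are assumptions for illustration; the paper's exact preprocessing may differ.

```python
import statistics


def split_and_normalize(series, train_frac=0.8):
    """Split chronologically, then z-score both parts with train-only statistics."""
    cut = int(len(series) * train_frac)
    train, test = series[:cut], series[cut:]

    mean = statistics.fmean(train)
    std = statistics.pstdev(train) or 1.0   # guard against a constant series

    def scale(values):
        return [(v - mean) / std for v in values]

    return scale(train), scale(test), (mean, std)


series = [float(i) for i in range(100)]     # placeholder upward-trending workload
train_n, test_n, stats = split_and_normalize(series)
print(len(train_n), len(test_n), stats[0])
```

Because the placeholder series trends upward, the normalized test values sit above the training range, which is exactly the kind of shift that would be hidden if the statistics had been computed over the full series.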

One clarification to head off a false friend: the paper’s own wording of “zero-shot” and “foundational backbone” refers to Fremer itself — a single model trained once and then transferring across datasets via channel independence, and its potential to serve as a future shared backbone. It does not use, fine-tune, or borrow any external pretrained time-series foundation model; there is no separate pretraining stage.

Limitations

Only the prediction, not the actuation. Fremer stops at the forecast; the scaling decision relies on a simple linear “workload -> pods” assumption and the stock HPA. Nonlinear scaling behavior, multi-resource bottlenecks (CPU vs. memory vs. I/O), and pod startup times are not modeled.
Univariate / channel-independent by design. It deliberately ignores correlations between instances or between metrics. The paper argues these are weak in their data, but in settings where cross-service correlation matters (cascading microservices), that information is left on the table.
Best on periodic workloads. Its edge comes from frequency/periodicity; on weakly periodic series (e.g. some weather/air-quality datasets in their general tests) its advantage shrinks or disappears. Bursty, aperiodic, or one-off spikes (a viral event with no historical precedent) are inherently hard to forecast.
The autoscaling test is a 24-hour simulation on a single FaaS workload, replaying recorded traffic — a strong proof of concept, not a long-term production deployment.
Hyperparameters need some tuning — notably the frequency-filter thresholds, the number of combinations L' (≈L/5), and head count (≈8) — though the paper shows these are fairly stable.
Fixed input/output lengths. Like most such models, a trained Fremer targets specific L and T; very short inputs (under a day) degrade because periodicity can’t be seen.

Glossary

Workload forecasting: predicting a cloud service’s future demand (CPU%, QPS, etc.).
Autoscaling: automatically changing the resources allocated to an app to match demand.
Reactive vs. proactive scaling: react after a threshold is crossed (always late) vs. forecast and scale ahead (hides cold-start delay).
Horizontal scaling / HPA: adding/removing instances (Kubernetes pods); the Horizontal Pod Autoscaler does this. (Vertical = resizing CPU/RAM of one instance — not used here.)
MAPE-K: Monitor–Analyze–Plan–Execute over shared Knowledge; Fremer lives in Analyze.
Time domain / frequency domain: the raw value-over-time curve vs. its decomposition into repeating waves (periods).
FFT / rFFT / irFFT: Fast Fourier Transform, an efficient conversion to the frequency domain; rFFT is the variant for real-valued input, and irFFT (its inverse) converts back to the time domain.
Spectrum: the set of (complex-valued) frequency components produced by the FFT.
Harmonics / frequency combinations: real periodicity shows up as several related frequencies together; CSA attends over these combinations rather than single points.
Attention / Transformer: neural mechanism where each element weighs all others; cost normally grows as O(L²).
Encoder-only: uses only the Transformer’s encoder half (no separate decoder).
LLP (Learnable Linear Padding): learnable extension of the input so the history’s and full series’s frequency rulers align.
CSA (Complex-valued Spectrum Attention): Fremer’s cheap, complex-number, combination-level attention.
F-RIN (Frequency Reversible Instance Norm): normalization applied to the spectrum to handle differing magnitudes/distribution shift.
Frequency filter (low-/high-pass): drops top noise frequencies; sets aside the lowest trend frequencies to prevent overfitting.
Channel independence: each series forecast separately, ignoring cross-series correlations.
SMAPE / MSE / MAE: error metrics; SMAPE (Symmetric Mean Absolute Percentage Error) is scale-independent, useful when QPS and CPU% live on wildly different scales.
FaaS / IaaS / PaaS / RDS: ByteDance cloud service types providing the datasets (Functions, Infrastructure/VMs, Platform, Relational Database Service).

References

Ye, H., Chen, J., Jiang, F., He, X., Zhang, T., Chen, J., & Gao, X. (2025). Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services. Proceedings of the VLDB Endowment, 18(11), 3812–3825. https://doi.org/10.14778/3749646.3749656