Fremer

A lightweight frequency-domain transformer for multi-period cloud workload forecasting

Ye et al. (2025)


TL;DR

Cloud platforms constantly ask the question: “how much traffic / CPU / requests will this service get over the next day, so I can spin up the right number of machines ahead of time?” Answering it well is called workload forecasting, and it is what lets autoscaling be proactive (scale before the spike) instead of reactive (scale only after the spike has already hurt users). Modern forecasters use Transformers (attention-based neural networks), but at ByteDance’s scale — over 100,000 forecasts per hour — standard Transformers are too slow and too heavy. Fremer is a redesigned, much smaller Transformer that does its work in the frequency domain (it looks at the repeating cycles / periods in the data instead of the raw time series). It is roughly 3x faster and ~12x smaller than the best competing model while being more accurate, and when plugged into a real Kubernetes autoscaler it cut average request latency by ~19% while using ~2% fewer machines. The model only produces the forecast; a standard Kubernetes autoscaler still does the actual scaling.


The Problem (and why simple autoscaling isn’t enough)

Autoscaling means automatically adjusting how many resources (virtual machines, containers/“pods”, CPU, memory) a cloud application gets, so it can handle its load without wasting money.

The naive approach is reactive autoscaling: watch a metric like CPU, and when it crosses a threshold (e.g. CPU > 70%), add machines. The problem is that booting a new machine or container takes time (the “cold start” / provisioning delay, often tens of seconds to minutes). So a reactive scaler is always lagging behind — by the time the new capacity is ready, users have already suffered the slowdown.

Proactive (predictive) autoscaling fixes this by forecasting the future workload and scaling ahead of time, hiding the provisioning delay. The forecast is the hard part — and that is exactly what Fremer provides.
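A toy sketch of the difference between the two strategies. The reactive rule follows the proportional formula documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)). The load values, `TARGET_CPU`, `PER_POD_CAPACITY`, and the "perfect forecast" are illustrative assumptions, not numbers from the paper.

```python
import math

TARGET_CPU = 0.70          # assumed target utilization per pod
PER_POD_CAPACITY = 100.0   # assumed requests/s one pod serves at 100% CPU

def reactive_replicas(current_replicas: int, current_cpu: float) -> int:
    """HPA-style proportional rule, driven by the metric observed right now."""
    return max(1, math.ceil(current_replicas * current_cpu / TARGET_CPU))

def proactive_replicas(forecast_load: float) -> int:
    """Size the fleet for the load predicted for when new pods become ready."""
    return max(1, math.ceil(forecast_load / (PER_POD_CAPACITY * TARGET_CPU)))

# Load in requests/s per step; a spike arrives at step 3.
load = [200.0, 210.0, 220.0, 600.0, 620.0]
replicas = 3
for t in range(len(load) - 1):
    cpu_now = load[t] / (replicas * PER_POD_CAPACITY)
    reactive = reactive_replicas(replicas, cpu_now)
    proactive = proactive_replicas(load[t + 1])  # pretend the forecast is perfect
    print(f"t={t} load={load[t]:.0f} reactive->{reactive} proactive->{proactive}")
```

At the step just before the spike, the reactive rule still sees moderate CPU and asks for only a small increase, while the forecast-driven rule already requests enough pods for the spike.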

Why is forecasting hard here?

  1. Scale. ByteDance Cloud runs 100,000+ forecasting tasks per hour across thousands of instances. The model must be tiny and fast, not just accurate.

  2. Complex, overlapping cycles. Real cloud workloads repeat on multiple periods at once — hourly, daily, and weekly (think: a website busy every weekday lunchtime, busiest on weekends, quiet at 3am). Standard Transformers struggle to untangle these overlapping rhythms.

  3. Noise and overfitting. Raw time series are noisy, and big neural nets easily memorize noise instead of learning the real pattern.


Background

Time series. Just a sequence of numbers measured over time, e.g. CPU usage every 10 minutes. Fremer’s job: given the recent history, predict the next stretch.

Lookback window (L) and forecast horizon (T). The model reads the last L time steps (the lookback) and predicts the next T time steps (the horizon). In the paper’s main experiments L = 5 days of history and T = 1 day ahead.
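To make L and T concrete, here is a minimal sketch of how a series can be cut into (history, future) pairs. The 10-minute sampling interval (144 points per day) and the helper name `make_windows` are assumptions for illustration; the actual sampling rate differs per dataset.

```python
import numpy as np

def make_windows(series: np.ndarray, lookback: int, horizon: int):
    """Slice a 1-D series into (history, future) training pairs."""
    n_pairs = len(series) - lookback - horizon + 1
    X = np.stack([series[i : i + lookback] for i in range(n_pairs)])
    Y = np.stack(
        [series[i + lookback : i + lookback + horizon] for i in range(n_pairs)]
    )
    return X, Y

POINTS_PER_DAY = 144          # assumed: one sample every 10 minutes
L = 5 * POINTS_PER_DAY        # 5 days of history
T = 1 * POINTS_PER_DAY        # 1 day ahead

series = np.arange(10 * POINTS_PER_DAY, dtype=float)  # 10 days of dummy data
X, Y = make_windows(series, L, T)
print(X.shape, Y.shape)       # (577, 720) (577, 144)
```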

Transformer & “attention”. A Transformer is a neural network whose core trick, attention, lets every point in a sequence “look at” and weigh every other point to decide what matters. It is powerful but expensive: its cost grows with the square of the sequence length (O(L²)). For long minute-level series, that is brutal.

Time domain vs. frequency domain. This is the key idea.

Why frequency helps specifically here: periods, trends, and noise get cleanly separated; two workloads that look different in time often look similar in frequency (good for generalizing across instances); and a few frequency “peaks” summarize a long history.
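A small NumPy sketch of the "few peaks summarize a long history" point. The workload is synthetic (a daily and a weekly sine wave plus noise, with made-up amplitudes), so it only illustrates the idea and says nothing about the paper's data.

```python
import numpy as np

hours = np.arange(24 * 28)  # four weeks of hourly samples (synthetic)
rng = np.random.default_rng(0)
workload = (
    10.0
    + 3.0 * np.sin(2 * np.pi * hours / 24)        # daily cycle
    + 1.5 * np.sin(2 * np.pi * hours / (24 * 7))  # weekly cycle
    + 0.3 * rng.standard_normal(hours.size)       # noise
)

# Remove the mean so the zero-frequency bin does not dominate.
spectrum = np.fft.rfft(workload - workload.mean())
freqs = np.fft.rfftfreq(hours.size, d=1.0)  # cycles per hour

# The two largest magnitudes sit at the daily and weekly frequencies.
top2 = np.argsort(np.abs(spectrum))[-2:]
periods = sorted(float(round(1.0 / freqs[k])) for k in top2)
print(periods)  # [24.0, 168.0]
```

In the time domain this is 672 noisy numbers; in the frequency domain two bins carry almost all of the structure.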


Contribution in Simple Terms

Fremer is an encoder-only Transformer that operates on the frequency spectrum of the workload instead of the raw time series. The genuinely new pieces:

  1. It forecasts a spectrum, not a curve. Fremer predicts the frequency spectrum of the complete series (history + future) from the spectrum of the history, then converts that back to a time curve. Working in frequency makes it naturally good at multi-period (hourly+daily+weekly) workloads.

  2. Learnable Linear Padding (LLP) — fixes a subtle, mostly-ignored bug called frequency mis-alignment. (Explained step-by-step below. The authors claim to be the first to address it.)

  3. Complex-valued Spectrum Attention (CSA) — instead of paying attention to individual frequencies one by one, it groups frequencies into combinations and attends over those. A single frequency rarely means much on its own (real periodicity shows up as harmonics — combinations), so this both improves accuracy and slashes compute (attention cost drops to roughly 1/200 of a standard Transformer’s).

  4. Frequency Filters — a low-pass filter that discards the noisiest high frequencies, and a high-pass filter that sets the dominant low frequencies aside so the model does not overfit to them.

  5. Four open-source real-world workload datasets from ByteDance (FaaS, IaaS, PaaS, RDS) covering thousands of instances over 1–2 months — a contribution to the research community in its own right.

The headline result: better accuracy than every state-of-the-art baseline on workload forecasting, while being smaller and faster, and a demonstrated win in a real Kubernetes autoscaling test.


How It Works, Step by Step

Here is the journey from raw history to forecast. (L = lookback length, T = horizon length.)

Step 1 — Learnable Linear Padding (LLP): fix frequency mis-alignment. The FFT’s frequency “ruler” depends on the series length: a length-L series and a length-L+T series measure frequencies at different spacings. So a true period in the future (e.g. period 24) may simply not exist as a sampled frequency in the history’s spectrum — the history can’t “see” it cleanly. Fremer’s fix: before doing any FFT, extend the length-L input to length L+T using a small learnable linear layer (X_p = concat(X, WᵀX + b)). This makes the history’s frequency ruler match the full series’s ruler, so frequencies line up. Ablation shows this is the single most important component — remove it and accuracy drops sharply.
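The mis-alignment and the padding formula can be sketched in a few lines of NumPy. The sizes (L = 60, T = 12, period 24) are chosen only so the arithmetic is easy to check, and `W` is random here, whereas in Fremer it would be learned during training. This shows the forward pass of the formula above, not the authors' implementation.

```python
import numpy as np

L, T, period = 60, 12, 24

def on_grid(n: int, p: int) -> bool:
    """A period p falls exactly on a length-n FFT bin only if n / p is an integer."""
    return n % p == 0

# Period 24 is missing from the history's grid but present on the full grid.
print(on_grid(L, period), on_grid(L + T, period))  # False True

# Learnable Linear Padding, forward pass only: X_p = concat(X, W^T X + b).
rng = np.random.default_rng(0)
W = rng.standard_normal((L, T)) * 0.01  # placeholder for trained weights
b = np.zeros(T)

def llp(x: np.ndarray) -> np.ndarray:
    """Extend a length-L history to length L + T with a linear layer."""
    return np.concatenate([x, x @ W + b], axis=-1)

x = np.sin(2 * np.pi * np.arange(L) / period)
x_p = llp(x)
print(x_p.shape, np.fft.rfft(x_p).shape)  # (72,) (37,)
```

After padding, the history's spectrum has the same 37 bins as the spectrum of the full length-72 series, so bin k means the same frequency in both.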

Step 2 — rFFT to the frequency domain. Apply the real-valued FFT to the padded series, producing a complex-valued spectrum of about (L+T)/2 + 1 frequency points (real FFT exploits symmetry to store only the non-redundant half).
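A quick check of the "non-redundant half" claim, using `numpy.fft`. The length 864 corresponds to L + T = 720 + 144 under an assumed 10-minute sampling interval; the input is random noise.

```python
import numpy as np

n = 864  # assumed L + T = 720 + 144
x = np.random.default_rng(0).standard_normal(n)

full = np.fft.fft(x)    # n complex bins
half = np.fft.rfft(x)   # n // 2 + 1 complex bins
print(full.shape, half.shape)  # (864,) (433,)

# For real input, bin k equals the complex conjugate of bin n - k,
# so the upper half of the full spectrum carries no new information.
is_mirror = np.allclose(full[1:], np.conj(full[:0:-1]))
same_lower_half = np.allclose(half, full[: n // 2 + 1])
print(is_mirror, same_lower_half)  # True True
```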

Step 3 — Frequency Filters.
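As a rough illustration of filtering a spectrum, the sketch below sets the lowest bins aside, drops the highest bins, and later re-inserts the low bins. The cut-off counts (`n_low`, `n_drop_high`) and both helper names are hypothetical; how Fremer actually chooses its cut-offs is not specified in this summary.

```python
import numpy as np

def split_spectrum(spec: np.ndarray, n_low: int, n_drop_high: int):
    """Return (low bins set aside, mid band handed to the model)."""
    low = spec[:n_low]                           # dominant low frequencies
    mid = spec[n_low : len(spec) - n_drop_high]  # band kept for modeling
    return low, mid

def merge_spectrum(low: np.ndarray, mid: np.ndarray, n_total: int) -> np.ndarray:
    """Rebuild a full-length spectrum; the dropped high bins stay zero."""
    out = np.zeros(n_total, dtype=complex)
    out[: len(low)] = low
    out[len(low) : len(low) + len(mid)] = mid
    return out

x = np.random.default_rng(0).standard_normal(72)
spec = np.fft.rfft(x)                       # 37 bins
low, mid = split_spectrum(spec, n_low=2, n_drop_high=10)
merged = merge_spectrum(low, mid, len(spec))
print(low.shape, mid.shape, merged.shape)   # (2,) (25,) (37,)
```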

Step 4 — Frequency Reversible Instance Norm (F-RIN). RevIN is a normalization trick (subtract mean, divide by std) originally used in the time domain to handle “distribution shift” (when the scale/level of a series drifts). Fremer applies it to the spectrum so that workloads with very different magnitudes get mapped to a similar distribution. The subtracted statistics are stored and added back at the end (that’s the “reversible” part).
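A minimal sketch of a reversible normalization applied to a complex spectrum. The choice of statistics (complex mean, root-mean-square deviation, per instance) and the function names are assumptions made for illustration; the paper's exact definition may differ.

```python
import numpy as np

def frin_forward(spec: np.ndarray, eps: float = 1e-5):
    """Normalize one instance's spectrum and return the stats needed to undo it."""
    mean = spec.mean(axis=-1, keepdims=True)
    std = np.sqrt((np.abs(spec - mean) ** 2).mean(axis=-1, keepdims=True)) + eps
    return (spec - mean) / std, (mean, std)

def frin_inverse(spec_norm: np.ndarray, stats):
    """Restore the original scale and offset."""
    mean, std = stats
    return spec_norm * std + mean

rng = np.random.default_rng(0)
small = rng.standard_normal(72)
large = 100.0 * small                  # same shape, very different magnitude

s_norm, s_stats = frin_forward(np.fft.rfft(small))
l_norm, l_stats = frin_forward(np.fft.rfft(large))

print(np.allclose(s_norm, l_norm, atol=1e-4))                            # True
print(np.allclose(frin_inverse(l_norm, l_stats), np.fft.rfft(large)))    # True
```

The two workloads differ by a factor of 100, yet their normalized spectra are nearly identical, and the stored statistics recover the original spectrum exactly.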

Step 5 — Project to frequency combinations. A trainable linear projection maps the many individual frequency points down to a smaller set of L' frequency combinations (by default L' ≈ L/5). This is the compression that makes attention cheap, and it reflects the insight that meaning lives in combinations (harmonics), not single frequencies.
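A shape-level sketch of the projection, assuming a single complex-valued weight matrix and small illustrative sizes (L = 60, T = 12). The weights are random placeholders, the filters from Step 3 are ignored, and the real model may attach a feature dimension to each combination; only the reduction in token count is the point here.

```python
import numpy as np

rng = np.random.default_rng(0)
L, T = 60, 12
n_freq = (L + T) // 2 + 1   # 37 individual frequency points
n_comb = L // 5             # 12 frequency combinations (L' = L / 5)

# Complex linear projection; would be trained in practice.
W = (
    rng.standard_normal((n_freq, n_comb))
    + 1j * rng.standard_normal((n_freq, n_comb))
) / np.sqrt(n_freq)

spec = np.fft.rfft(rng.standard_normal(L + T))
combos = spec @ W           # each output mixes all input frequencies
print(combos.shape, combos.dtype)   # (12,) complex128

# Entries in an attention score matrix before and after the projection.
print(n_freq ** 2, n_comb ** 2)     # 1369 144
```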

Step 6 — Complex-valued Spectrum Attention (CSA). This is the heart of the model — a multi-head attention mechanism (8 heads by default) adapted to complex numbers and run over the frequency combinations.
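One common way to extend attention to complex inputs is to score tokens by the real part of the Hermitian inner product and apply the resulting real weights to complex values. The single-head sketch below follows that recipe. It is an assumption for illustration, not necessarily the formulation used in the paper; `d_model = 16` and all weights are hypothetical.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def complex_attention(tokens, Wq, Wk, Wv):
    """Single-head attention over complex tokens of shape (n_tokens, d_model)."""
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    d = Q.shape[-1]
    # Real part of the Hermitian inner product gives real-valued scores.
    scores = np.real(Q @ K.conj().T) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ V, weights              # complex output, real weights

rng = np.random.default_rng(0)
n_tokens, d_model = 12, 16                   # 12 frequency combinations

def crandn(*shape):
    """Random complex matrix, standing in for trained parameters or inputs."""
    return rng.standard_normal(shape) + 1j * rng.standard_normal(shape)

tokens = crandn(n_tokens, d_model)
Wq, Wk, Wv = (crandn(d_model, d_model) / np.sqrt(d_model) for _ in range(3))
out, weights = complex_attention(tokens, Wq, Wk, Wv)
print(out.shape, weights.shape)              # (12, 16) (12, 12)
```

A multi-head version would split `d_model` across heads (8 by default in Fremer) and concatenate the per-head outputs.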

Contrast with FEDformer (the closest prior frequency-Transformer): FEDformer attends over individual sampled frequencies and mixes channels together. Fremer attends over frequency combinations and treats each channel independently (because the global periodic patterns of one instance rarely correlate with another’s). Channel independence is also what gives Fremer its strong transfer/zero-shot ability.

Steps 7–9 — Back to time and slice off the forecast. De-normalize (undo F-RIN), re-insert the low-frequency trend that was set aside, recover the full spectrum, run the inverse rFFT to get a time-domain series of length L+T, and keep the last T points: that is the predicted future workload Ŷ.
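The final conversion can be sketched with `numpy.fft.irfft`. Here the "predicted" spectrum is simply the true spectrum of a synthetic series, standing in for the model output, so the recovered forecast matches the ground truth exactly; the sizes are illustrative.

```python
import numpy as np

L, T = 60, 12
t = np.arange(L + T)
full_series = 5.0 + np.sin(2 * np.pi * t / 24)   # synthetic history + future

# Stand-in for the spectrum the model would predict for the full series.
spec_pred = np.fft.rfft(full_series)

# Pass n explicitly: without it, irfft assumes an even output length.
recovered = np.fft.irfft(spec_pred, n=L + T)
y_hat = recovered[-T:]                            # keep only the forecast part

print(y_hat.shape)                                # (12,)
print(np.allclose(y_hat, full_series[-T:]))       # True
```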

Non-Transformer ingredients used: FFT/rFFT and inverse FFT (signal processing), frequency filters (signal processing), a linear padding layer, RevIN normalization, and channel-independence. There is no diffusion, GAN, or similarity search; the only convolution-like operation is the FFT machinery.


Inputs (what it consumes)


Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop is the standard reference model for self-adaptive systems: Monitor -> Analyze -> Plan -> Execute, over shared Knowledge.

What the autoscaling test showed (24-hour Kubernetes HPA simulation, replaying a real FaaS workload; 10s timeout):

| Strategy | Avg latency | 99th-percentile latency | Timeout rate | Avg pods |
|---|---|---|---|---|
| Naïve HPA (reactive) | 0.996 s | 3.764 s | 0.386% | 29.18 |
| PatchTST (predictive) | 1.017 s | 3.644 s | 0.132% | 22.48 |
| DLinear (predictive) | 1.081 s | 4.108 s | 0.22% | 19.25 |
| Fremer (predictive) | 0.826 s | 2.292 s | 0.102% | 21.95 |
| Ideal (perfect foresight) | 0.789 s | 2.063 s | 0.026% | 21.38 |

Versus PatchTST, Fremer cut average latency by 18.78% and used 2.35% fewer pods — and it lands close to the “Ideal” upper bound that uses the true future values. (Note: DLinear used the fewest pods but had the worst latency, because it under-predicted load and starved the service — a good illustration that under-provisioning hurts.)
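The two percentages can be recomputed from the table. The helper `pct_reduction` is just illustrative arithmetic; because the table values are rounded, the pod reduction comes out at about 2.36%, within rounding of the reported 2.35%.

```python
def pct_reduction(baseline: float, value: float) -> float:
    """Relative reduction of `value` versus `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline

# Values copied from the PatchTST and Fremer rows of the table.
latency_cut = pct_reduction(baseline=1.017, value=0.826)  # average latency, s
pod_cut = pct_reduction(baseline=22.48, value=21.95)      # average pods

print(f"latency: {latency_cut:.2f}%  pods: {pod_cut:.2f}%")
```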


Strengths

Training & pre-training

Trained from scratch — no external pretrained or foundation model.

Fremer (~0.57M parameters) is trained from random initialization as a task-specific forecaster. It is implemented in PyTorch and trained on an NVIDIA A100-SXM 80GB GPU, using the Adam optimizer (initial learning rate 1e-3) and MSE loss. Each dataset is split into its own train/test partition (e.g. 8:2), with normalization statistics computed only on the training set to avoid test leakage. The training data spans the four ByteDance traces (FaaS, IaaS, PaaS, RDS) plus public benchmarks (Materna, Traffic, Electricity, PEMS).
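The leakage-avoidance detail is easy to get wrong, so here is a minimal sketch of a chronological 8:2 split with statistics taken from the training part only. The function name and the dummy series are assumptions; the optimizer and loss settings mentioned above are not shown.

```python
import numpy as np

def split_and_normalize(series: np.ndarray, train_frac: float = 0.8):
    """Chronological split; mean and std come from the training part only."""
    split = int(len(series) * train_frac)
    train, test = series[:split], series[split:]
    mean, std = train.mean(), train.std() + 1e-8
    # The test part reuses the training statistics, so nothing about the
    # test distribution leaks into preprocessing.
    return (train - mean) / std, (test - mean) / std, (mean, std)

series = np.arange(100, dtype=float)        # dummy upward-trending workload
train_n, test_n, (mean, std) = split_and_normalize(series)

print(len(train_n), len(test_n))            # 80 20
print(mean)                                 # 39.5, the mean of the first 80 points
```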

One clarification to head off a false friend: the paper’s own wording of “zero-shot” and “foundational backbone” refers to Fremer itself — a single model trained once and then transferring across datasets via channel independence, and its potential to serve as a future shared backbone. It does not use, fine-tune, or borrow any external pretrained time-series foundation model; there is no separate pretraining stage.

Limitations


Glossary

References
  1. Ye, H., Chen, J., Jiang, F., He, X., Zhang, T., Chen, J., & Gao, X. (2025). Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services. Proceedings of the VLDB Endowment, 18(11), 3812–3825. 10.14778/3749646.3749656