FELT

One similarity-aware transformer that forecasts workload for thousands of cloud containers at once

Ding et al. (2025)


TL;DR

Modern cloud apps are split into hundreds or thousands of tiny pieces (“microservices”) that each run inside a container (a lightweight, isolated box that holds one piece of an application). To run them efficiently you want to know in advance how busy each container will get, so you can add or remove resources before the demand actually hits. The usual way to forecast this is to train one prediction model per container — but with 20,000+ containers, that means 20,000 models to store, train, and babysit, which is wildly expensive. FELT solves this with a single Transformer model that predicts the future busyness of ALL containers at once. Its trick is to (a) compress each container’s recent history into just 6 cheap summary numbers instead of raw data, and (b) figure out, in real time, which containers behave alike and let the Transformer’s attention focus only on those look-alike groups. The result: comparable or better accuracy than state-of-the-art methods, but with up to ~96% less training time and up to ~85% fewer parameters.


The Problem (and why simple autoscaling isn’t enough)

Cloud providers sign service-level agreements (SLAs) with customers and want to honor them at the lowest cost. The enemy is a mismatch between resources and demand: too few resources risks SLA violations, while too many wastes money.

The naive fix is reactive autoscaling: watch a metric like CPU usage and, when it crosses a threshold (say CPU > 70%), add capacity. The problem is that booting a new container or VM takes time (the “cold start” / provisioning delay, typically 30+ seconds). By the time the new capacity is ready, the spike may already have hurt users. So you always lag behind.

The better approach is proactive (predictive) autoscaling: forecast the workload a few steps into the future, and scale ahead of time so the new capacity is warm and ready when demand arrives. FELT is a forecasting engine built for exactly this — it is the “predict the future” brain that a proactive autoscaler needs.
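To make the lag concrete, here is a small sketch (not from the paper) of when extra capacity becomes usable under each strategy. The function names, the 30-second delay, and the 60-second forecast lead are illustrative assumptions.

```python
def capacity_ready_time(spike_time, provisioning_delay, forecast_lead=0.0):
    """Time at which new capacity is usable.

    forecast_lead is how many seconds before the spike the scaler acts:
    0 for a reactive scaler, positive for a predictive one (assumed model).
    """
    trigger_time = spike_time - forecast_lead
    return trigger_time + provisioning_delay


def uncovered_seconds(spike_time, provisioning_delay, forecast_lead=0.0):
    """Seconds during which the spike is running without the extra capacity."""
    ready = capacity_ready_time(spike_time, provisioning_delay, forecast_lead)
    return max(0.0, ready - spike_time)


# Hypothetical spike at t = 600 s with a 30 s provisioning delay.
reactive = uncovered_seconds(spike_time=600.0, provisioning_delay=30.0)
proactive = uncovered_seconds(spike_time=600.0, provisioning_delay=30.0,
                              forecast_lead=60.0)
print(reactive, proactive)  # 30.0 0.0
```

In this toy model the gap closes whenever the forecast lead is at least as long as the provisioning delay, which is why the forecast horizon matters.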

But forecasting at cloud scale has two hard challenges the paper calls out:

  1. Heterogeneity — Different containers behave completely differently. One might be a steady background job; another spikes every evening; another is bursty and noisy. A model that fits one won’t fit another. Worse, each container mixes long-term trends, short bursts, and random noise.

  2. Accuracy vs. overhead conflict — More accuracy usually means bigger, more complex models, and (in the per-container approach) more models. At 20,000+ containers the storage/training/management cost explodes. Simpler, cheaper models tend to be too coarse and inaccurate. You need both accuracy and low cost.

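The field has tried three structural approaches: one model per container, one model per group of similar containers, and a single model for all containers. FELT positions itself as the best version of the third.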


Background


Contribution in Simple Terms

FELT’s headline idea: don’t train one model per container, and don’t even train one per group — train a single Transformer for everything, and teach it to internally figure out which containers resemble each other right now so it can borrow signal across them. Three concrete innovations make this work:

  1. Cheap feature summaries instead of raw data. Rather than feeding the Transformer long raw CPU time series, FELT compresses each container’s behavior in each time window into just 6 numbers describing its “value” (the distribution of levels) and its “waveform” (the shape of change — rising, falling, flat, or fluctuating). This slashes storage and model complexity while still capturing both the big-picture and fine-grained behavior.

  2. Similarity that blends history AND the recent moment. Earlier group-based methods cluster containers using only long-term history. But clouds are volatile: two containers that usually behave alike may diverge this hour. FELT computes historical similarity (offline, stable groups) and recent similarity (online, recomputed every prediction window), then combines them. This keeps the model adaptive to sudden shifts.

  3. Feeding that similarity into the Transformer itself, via custom position encoding (places similar containers near each other) and an attention mask (forbids attention to dissimilar containers). So the single shared model can specialize per-group on the fly without needing separate models.

Headline results: up to 28.06% better accuracy, up to 95.75% less training time, and up to 84.55% fewer parameters than state-of-the-art baselines across three real datasets.

Analogy: Imagine forecasting how busy 20,000 shops will be tonight. Approach (A) hires one analyst per shop — accurate but absurdly expensive. FELT instead hires ONE smart analyst who first notices “these 500 shops are all coffee shops near offices, these 300 are nightclubs,” and — crucially — re-checks each evening (“tonight there’s a football match, so the sports bars suddenly behave like nightclubs”). Then when forecasting any one shop, the analyst only looks at the currently similar shops and ignores the rest. Same brain, group-aware focus.


How It Works, Step by Step

FELT has four stages (matching Fig. 1 in the paper), broken into seven steps below. The first stage does feature extraction; the next two compute “who is similar to whom”; the last is the Transformer that produces the forecast.

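Step 1 — Pre-processing & feature extraction. Raw CPU usage is collected from the monitoring database, outliers are handled, and values are normalized. The series is chopped into fixed prediction periods T, each holding p timestamps (a timestamp is the monitor’s sampling interval, typically ≥30 s). For every container in every period, FELT extracts two kinds of features: value features (5 quantiles summarizing the distribution of usage levels) and a waveform feature (1 class describing the shape of change: rising, falling, flat, or fluctuating), for 6 numbers in total.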
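A minimal sketch of what the six per-window features might look like. The exact quantiles and the waveform rule are not given here, so two things are assumptions: the five value features are taken as min, the three quartiles, and max, and the waveform class comes from a hypothetical rule comparing net change to the window's spread (`flat_tol` and `trend_ratio` are made-up parameters).

```python
from statistics import quantiles


def value_features(window):
    """Five summary numbers: min, Q1, median, Q3, max (assumed choice)."""
    q1, q2, q3 = quantiles(window, n=4, method="inclusive")
    return [min(window), q1, q2, q3, max(window)]


def waveform_class(window, flat_tol=0.05, trend_ratio=0.6):
    """Label the shape of change in one window (hypothetical rule)."""
    spread = max(window) - min(window)
    if spread <= flat_tol:
        return "flat"
    net = window[-1] - window[0]          # overall change across the window
    if net >= trend_ratio * spread:
        return "rising"
    if net <= -trend_ratio * spread:
        return "falling"
    return "fluctuating"


def extract_features(window):
    """Return (value_vector, waveform_label) for one container-window."""
    return value_features(window), waveform_class(window)


# Normalized CPU usage for one window of p = 5 timestamps.
print(extract_features([0.1, 0.2, 0.3, 0.4, 0.5]))
```

The window needs at least two samples for `statistics.quantiles` to work.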

Step 2 — Offline historical similarity (stable groups). Using all the historical feature sequences, FELT runs an improved K-Medoids clustering to sort containers into K groups. The improvement is a custom distance: two containers are “close” if their value-feature vectors are near (Euclidean distance) and their waveform types match; a penalty α (= 4√5) is added when waveforms differ, balancing value vs. waveform. The result is a binary Historical Similarity Matrix (HSM): HSM[i][l] = 1 if containers i and l landed in the same group, else 0. This is recomputed only occasionally — when accuracy drifts below a confidence interval for a while, signaling the groups have evolved.
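The distance and the resulting HSM can be sketched as follows. This is a simplification under stated assumptions: each container is reduced to a single (value vector, waveform) pair rather than a full historical sequence, the clustering is a plain alternating k-medoids rather than the paper's improved variant, and the initial medoids are simply the first k items.

```python
import math

ALPHA = 4 * math.sqrt(5)  # waveform-mismatch penalty quoted in the text


def felt_distance(a, b, alpha=ALPHA):
    """Euclidean distance on value vectors, plus alpha if waveforms differ."""
    (va, wa), (vb, wb) = a, b
    d = math.dist(va, vb)
    return d if wa == wb else d + alpha


def k_medoids(items, k, dist, iters=20):
    """Plain alternating k-medoids (illustrative, not the paper's variant)."""
    medoids = list(range(k))  # assumption: first k items seed the clusters
    labels = [0] * len(items)
    for _ in range(iters):
        # Assign every item to its nearest medoid.
        labels = [min(range(k), key=lambda c: dist(x, items[medoids[c]]))
                  for x in items]
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster emptied out
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes total distance to its cluster members.
            new_medoids.append(min(
                members,
                key=lambda m: sum(dist(items[m], items[j]) for j in members)))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return labels


def similarity_matrix(labels):
    """HSM[i][l] = 1 if containers i and l share a cluster, else 0."""
    n = len(labels)
    return [[1 if labels[i] == labels[l] else 0 for l in range(n)]
            for i in range(n)]


containers = [([0.10] * 5, "flat"), ([0.12] * 5, "flat"),
              ([0.90] * 5, "rising"), ([0.88] * 5, "rising")]
hsm = similarity_matrix(k_medoids(containers, k=2, dist=felt_distance))
print(hsm)
```

Because value vectors are normalized to a small range, the penalty dominates whenever waveforms differ, so containers with different shapes rarely share a group.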

Step 3 — Online recent similarity (this-moment groups). Because the cloud is volatile, FELT also checks, for the current window T, which containers look alike right now. Two containers get RSM[i][l] = 1 (Recent Similarity Matrix) if, in this window, their waveform types match AND the Euclidean distance between their value-feature vectors ≤ a threshold θ. This is recomputed every period — it’s cheap (the paper measures ~31.5 ms, well under the 30 s budget).
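The RSM rule is simple enough to write out directly. This sketch assumes each container's current window is already summarized as a (value vector, waveform) pair; the threshold value used in the example is made up, since θ is not specified here.

```python
import math


def recent_similarity(features, theta):
    """Build the RSM for one window.

    features: list of (value_vector, waveform_label), one per container.
    RSM[i][l] = 1 only if waveforms match AND the value vectors are
    within Euclidean distance theta.
    """
    n = len(features)
    rsm = [[0] * n for _ in range(n)]
    for i in range(n):
        for l in range(i, n):  # the relation is symmetric, fill both halves
            (vi, wi), (vl, wl) = features[i], features[l]
            similar = wi == wl and math.dist(vi, vl) <= theta
            rsm[i][l] = rsm[l][i] = int(similar)
    return rsm


window = [
    ([0.1, 0.2, 0.3, 0.4, 0.5], "rising"),
    ([0.1, 0.2, 0.3, 0.4, 0.6], "rising"),   # close values, same shape
    ([0.1, 0.2, 0.3, 0.4, 0.5], "falling"),  # same values, different shape
    ([0.9, 0.9, 0.9, 0.9, 0.9], "rising"),   # same shape, distant values
]
print(recent_similarity(window, theta=0.2))
```

This naive version is quadratic in the number of containers; how the paper reaches its measured latency at scale is not covered by this sketch.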

Step 4 — Combine. The final per-window grouping is CSM = HSM AND RSM: two containers count as similar only if they are both historically similar AND recently similar. This filters out spurious matches (look-alike right now but normally different) and stale matches (usually similar but diverging today).
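The combination is an element-wise logical AND. The toy matrices below are invented to show one stale match and one spurious match being filtered out.

```python
def combine(hsm, rsm):
    """CSM = HSM AND RSM, element by element (both are 0/1 matrices)."""
    return [[h & r for h, r in zip(h_row, r_row)]
            for h_row, r_row in zip(hsm, rsm)]


# Three containers. Historically, 0 and 1 share a group; 2 is on its own.
hsm = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
# Right now, 0 looks like 2 (spurious) and has drifted away from 1 (stale).
rsm = [[1, 0, 1],
       [0, 1, 0],
       [1, 0, 1]]

print(combine(hsm, rsm))  # [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

In this window every container ends up similar only to itself, so neither the stale pair (0, 1) nor the spurious pair (0, 2) survives.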

Step 5 — Build the Transformer inputs. Two inputs go in:

Step 6 — Attention with a similarity mask. Inside the Transformer encoder, multi-head self-attention learns cross-container relationships. To stop dissimilar containers from polluting each other, FELT adds an attention mask from CSM: if two containers are similar the mask is 0 (attention allowed); if not, the mask is −∞, which after softmax becomes 0 attention. So each container effectively attends only to its current look-alikes. (The architecture also uses ResNet layers on the inputs, a 1D-CNN for feature extraction, add&norm layers, and a feed-forward block — standard Transformer-encoder machinery. Notably FELT is encoder-only (no decoder), which cuts out cross-attention cost.)
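The masking mechanics can be shown without any deep-learning library. This is a sketch of the additive-mask idea only, not FELT's actual attention layer; it assumes the CSM diagonal is 1 (every container is similar to itself), so each row keeps at least one finite score.

```python
import math


def mask_from_csm(csm):
    """0.0 where attention is allowed, -inf where it is forbidden."""
    return [[0.0 if s else -math.inf for s in row] for row in csm]


def masked_softmax(scores, mask):
    """Row-wise softmax of (scores + mask); masked entries get weight 0."""
    out = []
    for s_row, m_row in zip(scores, mask):
        z = [s + m for s, m in zip(s_row, m_row)]
        peak = max(z)                      # subtract the max for stability
        e = [math.exp(v - peak) for v in z]
        total = sum(e)
        out.append([v / total for v in e])
    return out


csm = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
# Raw attention scores (made up). Container 0 scores container 2 highly,
# but the mask removes that link because they are not currently similar.
scores = [[1.0, 1.0, 5.0],
          [0.0, 0.0, 0.0],
          [2.0, 3.0, 1.0]]

for row in masked_softmax(scores, mask_from_csm(csm)):
    print(row)
```

Adding the mask before the softmax, rather than zeroing weights afterwards, keeps each row normalized over the allowed containers only.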

Step 7 — Output. A final linear layer turns the encoder output into the prediction: the workload level (one of 5 discrete classes) for every container at the next timestamp t. The whole thing trains in parallel via matrix multiplication.

Why predict a level (1 of 5) instead of an exact CPU%? The paper deliberately discretizes 0–100% CPU into 5 levels, for three reasons: integers use less memory than floats; fewer output classes means far fewer parameters (key at scale); and 5 levels are enough to drive resource-allocation decisions (you don’t need to distinguish 73.4% from 73.5% to decide how many replicas to run). The authors cite 5-level quantization as a sweet spot and note that industrial systems (Azure’s VM scheduler) discretize similarly.
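As an illustration of the discretization, the sketch below maps a CPU percentage to one of five levels. The equal-width 20-point bins and the 1-to-5 numbering are assumptions; the paper's actual bin edges are not stated here.

```python
def workload_level(cpu_percent, n_levels=5):
    """Map CPU usage in [0, 100] to an integer level in 1..n_levels.

    Assumes equal-width bins; 100% is folded into the top level.
    """
    if not 0.0 <= cpu_percent <= 100.0:
        raise ValueError("cpu_percent must be between 0 and 100")
    width = 100.0 / n_levels
    return min(int(cpu_percent // width), n_levels - 1) + 1


print(workload_level(73.4), workload_level(73.5))  # 4 4
print(workload_level(0.0), workload_level(100.0))  # 1 5
```

Both 73.4% and 73.5% land in level 4, which is the point of the text: the scaling decision is the same either way.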


Inputs (what it consumes)

Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

Concrete example: An online store the night before a big sale. A reactive autoscaler waits until checkout-service CPU is already pegged, then boots more pods — too late, customers see errors. With FELT feeding the loop, the system forecasts that the checkout and payment containers (currently behaving like the look-alike “spiky-evening” group) will jump to level 4–5 in the next window, and warms up extra pods before the rush. FELT supplies that early warning for thousands of services at once using a single, lightweight model.
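To show how a predicted level could feed the planning step of such a loop, here is a deliberately simple and entirely hypothetical policy: a fixed number of replicas per workload level. Neither the function nor the numbers come from the paper.

```python
def plan_replicas(predicted_level, current_replicas,
                  replicas_per_level=2, min_replicas=1):
    """Return (target_replicas, change) for one service.

    Hypothetical policy: target = predicted_level * replicas_per_level,
    never below min_replicas. A positive change means scale out ahead
    of the forecast demand; a negative change means scale in.
    """
    target = max(min_replicas, predicted_level * replicas_per_level)
    return target, target - current_replicas


# Checkout service runs 4 pods; the forecast says level 5 next window.
print(plan_replicas(predicted_level=5, current_replicas=4))  # (10, 6)
```

A real autoscaler would add cooldowns and cost limits; the sketch only shows where the forecast plugs in.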


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch — no pretrained or foundation model.

FELT trains a single custom encoder-only Transformer from random initialization on cloud workload traces, learning to classify the next-step CPU workload level (1 of 5 classes) for all containers at once. There is no pretrained backbone, foundation model, fine-tuning, or zero-shot transfer — its efficiency comes entirely from the single-model-plus-similarity-mask design, not from transfer learning. Each container-window is reduced to 6 engineered features (5 value quantiles + 1 waveform class). Training uses Alibaba cluster-trace-microservices v2021 (20,000+ microservices) and v2022 (28,000+), plus Fisher (10 containers, 30 days), with CPU usage split 7:1:2 train/val/test; the generalization test additionally uses the Azure Public Dataset. Overhead is reported as training time, inference time, and parameter count (~0.57M for FELT-r).
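A 7:1:2 split can be sketched as below. Splitting chronologically (oldest data for training, newest for testing) is an assumption; the text gives only the ratio.

```python
def split_7_1_2(series):
    """Split a time-ordered sequence into train/val/test at 70% / 10% / 20%.

    Assumes a chronological split so the test set is strictly later
    than the training data.
    """
    n = len(series)
    train_end = n * 7 // 10
    val_end = n * 8 // 10
    return series[:train_end], series[train_end:val_end], series[val_end:]


train, val, test = split_7_1_2(list(range(100)))
print(len(train), len(val), len(test))  # 70 10 20
```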


Strengths

Limitations


Glossary

References
  1. Ding, Z., Feng, B., & Yu, W. (2025). FELT: Large-Scale Cloud Workload Prediction Through Adaptive Feature-Enhanced and Similarity-Aware Transformer. Tsinghua Science and Technology. doi:10.26599/tst.2025.9010102