FELT
One similarity-aware transformer that forecasts workload for thousands of cloud containers at once
TL;DR
Modern cloud apps are split into hundreds or thousands of tiny pieces (“microservices”) that each run inside a container (a lightweight, isolated box that holds one piece of an application). To run them efficiently you want to know in advance how busy each container will get, so you can add or remove resources before the demand actually hits. The usual way to forecast this is to train one prediction model per container — but with 20,000+ containers, that means 20,000 models to store, train, and babysit, which is wildly expensive. FELT solves this with a single Transformer model that predicts the future busyness of ALL containers at once. Its trick is to (a) compress each container’s recent history into just 6 cheap summary numbers instead of raw data, and (b) figure out, in real time, which containers behave alike and let the Transformer’s attention focus only on those look-alike groups. The result: comparable or better accuracy than state-of-the-art methods, but with up to ~96% less training time and up to ~85% fewer parameters.
The Problem (and why simple autoscaling isn’t enough)
Cloud providers sign SLAs with customers, and want to honor them at the lowest cost. The enemy is mismatch between resources and demand:
Under-provisioning → too few resources → slow responses, dropped requests, SLA violations.
Over-provisioning → too many idle resources → wasted money.
The naive fix is reactive autoscaling: watch a metric like CPU usage and, when it crosses a threshold (say CPU > 70%), add capacity. The problem is that booting a new container or VM takes time (the “cold start” / provisioning delay, typically 30+ seconds). By the time the new capacity is ready, the spike may already have hurt users. So you always lag behind.
The better approach is proactive (predictive) autoscaling: forecast the workload a few steps into the future, and scale ahead of time so the new capacity is warm and ready when demand arrives. FELT is a forecasting engine built for exactly this — it is the “predict the future” brain that a proactive autoscaler needs.
But forecasting at cloud scale has two hard challenges the paper calls out:
Heterogeneity — Different containers behave completely differently. One might be a steady background job; another spikes every evening; another is bursty and noisy. A model that fits one won’t fit another. Worse, each container mixes long-term trends, short bursts, and random noise.
Accuracy vs. overhead conflict — More accuracy usually means bigger, more complex models, and (in the per-container approach) more models. At 20,000+ containers the storage/training/management cost explodes. Simpler, cheaper models tend to be too coarse and inaccurate. You need both accuracy and low cost.
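The field has tried three structural approaches: (A) one model per container, (B) one model per group of similar containers, and (C) a single model shared by all containers. FELT positions itself as the best version of the third.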
Background
Container: a lightweight virtualized box running one piece of an app (think a Kubernetes “pod”). “Workload” here means how busy a container is — measured by CPU utilization over time.
Time series: a sequence of measurements over time (e.g. CPU% sampled every 30 seconds). Forecasting = predicting the next values.
Transformer: a type of neural network originally built for language (it powers things like ChatGPT). Its core idea is self-attention: for each item in a sequence, the model learns which other items are relevant and pays more “attention” to those. Three pieces of jargon you’ll meet:
Position encoding: because attention by itself is order-blind, the model is given extra signals telling it where each item sits. In language that’s word order; FELT cleverly repurposes position to encode which containers are similar, not time order.
Attention mask: a way to forbid attention between certain items (set their attention weight to minus-infinity so it becomes zero). FELT uses this to make a container ignore all the containers that are not similar to it.
Multi-head self-attention: running several attention computations in parallel; this parallelism is also what makes Transformers fast to train (matrix multiplications instead of step-by-step loops, unlike LSTMs).
LSTM / GRU / CNN: older neural nets for sequences (LSTM/GRU process step-by-step; CNN slides small filters). The paper argues Transformers capture long-range and cross-container relationships better while training in parallel.
Clustering (K-Medoids): an algorithm that groups similar items. K-Medoids is like K-Means but each group’s “center” is an actual real container (a medoid), which makes it robust to outliers. FELT uses it to find groups of similar containers from history.
MAPE-K loop: the standard blueprint for self-managing systems — Monitor (collect metrics) → Analyze (detect/forecast) → Plan (decide the action) → Execute (apply it), over shared Knowledge. FELT lives in the Analyze stage.
Contribution in Simple Terms
FELT’s headline idea: don’t train one model per container, and don’t even train one per group — train a single Transformer for everything, and teach it to internally figure out which containers resemble each other right now so it can borrow signal across them. Three concrete innovations make this work:
Cheap feature summaries instead of raw data. Rather than feeding the Transformer long raw CPU time series, FELT compresses each container’s behavior in each time window into just 6 numbers describing its “value” (the distribution of levels) and its “waveform” (the shape of change — rising, falling, flat, or fluctuating). This slashes storage and model complexity while still capturing both the big-picture and fine-grained behavior.
Similarity that blends history AND the recent moment. Earlier group-based methods cluster containers using only long-term history. But clouds are volatile: two containers that usually behave alike may diverge this hour. FELT computes historical similarity (offline, stable groups) and recent similarity (online, recomputed every prediction window), then combines them. This keeps the model adaptive to sudden shifts.
Feeding that similarity into the Transformer itself, via custom position encoding (places similar containers near each other) and an attention mask (forbids attention to dissimilar containers). So the single shared model can specialize per-group on the fly without needing separate models.
Headline results: up to 28.06% better accuracy, up to 95.75% less training time, and up to 84.55% fewer parameters than state-of-the-art baselines across three real datasets.
Analogy: Imagine forecasting how busy 20,000 shops will be tonight. Approach (A) hires one analyst per shop — accurate but absurdly expensive. FELT instead hires ONE smart analyst who first notices “these 500 shops are all coffee shops near offices, these 300 are nightclubs,” and — crucially — re-checks each evening (“tonight there’s a football match, so the sports bars suddenly behave like nightclubs”). Then when forecasting any one shop, the analyst only looks at the currently similar shops and ignores the rest. Same brain, group-aware focus.
How It Works, Step by Step
FELT has four stages (matching Fig. 1 in the paper), walked through below as seven steps. Step 1 does feature extraction; Steps 2 to 4 compute “who is similar to whom”; Steps 5 to 7 cover the Transformer that produces the forecast.
Step 1 — Pre-processing & feature extraction. Raw CPU usage is collected from the monitoring database, outliers are handled, and values are normalized. The series is chopped into fixed prediction periods T (each holding p timestamps; a timestamp is typically ≥30 s — the monitor’s sampling interval). For every container in every period, FELT extracts two kinds of features:
Value features (5 numbers): five quantiles of the workload distribution — minimum, lower quartile, median, upper quartile, maximum. These capture the level/magnitude and its long-tail shape.
Waveform features (the shape of change): computed from two statistics, the slope of a best-fit line (least-squares trend) and the standard deviation (volatility) within the window. Two thresholds (kthres for slope, sthres for std) classify the window into one of 4 waveform types: 0 = smooth/flat, 1 = rising, 2 = falling, 3 = fluctuating.
So each container-window becomes a compact vector of dimension 6 (this is the Transformer’s feature dimension, dmodel = 6).
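The feature extraction in Step 1 can be sketched in plain Python. This is only an illustration under stated assumptions: the threshold values (kthres = 0.01, sthres = 0.05), the order in which the slope and standard-deviation tests are applied, and all function names are hypothetical, since the summary above does not give the paper's exact rule.

```python
import statistics

def split_into_periods(series, p):
    """Chop a series into consecutive, non-overlapping windows of p samples."""
    return [series[i:i + p] for i in range(0, len(series) - p + 1, p)]

def value_features(window):
    """Five-number summary of the window: min, Q1, median, Q3, max."""
    q1, med, q3 = statistics.quantiles(window, n=4, method="inclusive")
    return [min(window), q1, med, q3, max(window)]

def slope(window):
    """Least-squares slope of the window against sample index 0..p-1."""
    n = len(window)
    x_mean = (n - 1) / 2
    y_mean = sum(window) / n
    num = sum((i - x_mean) * (y - y_mean) for i, y in enumerate(window))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den

def waveform_class(window, kthres, sthres):
    """0 = flat, 1 = rising, 2 = falling, 3 = fluctuating.

    Assumed rule: test the trend first, then the volatility.
    """
    k = slope(window)
    if k > kthres:
        return 1
    if k < -kthres:
        return 2
    if statistics.pstdev(window) > sthres:
        return 3
    return 0

def extract_features(window, kthres=0.01, sthres=0.05):
    """6-dim vector: 5 value features followed by 1 waveform class."""
    return value_features(window) + [waveform_class(window, kthres, sthres)]

# Toy normalized CPU series holding two periods of p = 6 samples each.
series = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35,   # steadily rising
          0.30, 0.10, 0.50, 0.50, 0.10, 0.30]   # no trend, high volatility
features = [extract_features(w) for w in split_into_periods(series, p=6)]
```

Each entry of features is one container-window vector; in a real deployment the thresholds would need tuning per dataset, as the Limitations section notes.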
Step 2 — Offline historical similarity (stable groups). Using all the historical feature sequences, FELT runs an improved K-Medoids clustering to sort containers into K groups. The improvement is a custom distance: two containers are “close” if their value-feature vectors are near (Euclidean distance) and their waveform types match; a penalty α (= 4√5) is added when waveforms differ, balancing value vs. waveform. The result is a binary Historical Similarity Matrix (HSM): HSM[i][l] = 1 if containers i and l landed in the same group, else 0. This is recomputed only occasionally — when accuracy drifts below a confidence interval for a while, signaling the groups have evolved.
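A minimal sketch of the Step 2 idea follows, assuming each container is represented by a single 6-dim feature vector (the paper clusters whole feature sequences) and using a naive, deterministic K-Medoids loop rather than the paper's improved variant. The function names and the initialization are hypothetical; only the Euclidean-plus-penalty distance and the value α = 4√5 come from the text above.

```python
import math

ALPHA = 4 * math.sqrt(5)  # waveform-mismatch penalty quoted in the text

def felt_distance(a, b, alpha=ALPHA):
    """Euclidean gap on the 5 value features, plus alpha if waveform types differ."""
    value_gap = math.dist(a[:5], b[:5])
    return value_gap + (alpha if a[5] != b[5] else 0.0)

def assign(dist, medoids):
    """Label every container with the index of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def k_medoids(feats, k, max_iter=20):
    """Naive alternating K-Medoids (assumed init: the first k containers)."""
    n = len(feats)
    dist = [[felt_distance(feats[i], feats[j]) for j in range(n)] for i in range(n)]
    medoids = list(range(k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        updated = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:              # keep the old medoid for an empty group
                updated.append(medoids[c])
                continue
            # The new medoid is the member with the smallest total distance to its group.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return assign(dist, medoids), medoids

def historical_similarity_matrix(labels):
    """HSM[i][l] = 1 when containers i and l share a group, else 0."""
    n = len(labels)
    return [[int(labels[i] == labels[l]) for l in range(n)] for i in range(n)]

# Toy data: containers 0 and 2 are low and flat, 1 and 3 are high and rising.
feats = [
    [0.10, 0.12, 0.15, 0.18, 0.20, 0],
    [0.60, 0.65, 0.70, 0.75, 0.80, 1],
    [0.11, 0.13, 0.16, 0.19, 0.21, 0],
    [0.62, 0.66, 0.71, 0.76, 0.82, 1],
]
labels, medoids = k_medoids(feats, k=2)
hsm = historical_similarity_matrix(labels)
```

Because the penalty is much larger than any achievable value-feature gap on normalized data, containers with different waveform types are pushed into different groups in this toy setup.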
Step 3 — Online recent similarity (this-moment groups). Because the cloud is volatile, FELT also checks, for the current window T, which containers look alike right now. Two containers get RSM[i][l] = 1 (Recent Similarity Matrix) if, in this window, their waveform types match AND the Euclidean distance between their value-feature vectors ≤ a threshold θ. This is recomputed every period — it’s cheap (the paper measures ~31.5 ms, well under the 30 s budget).
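The recent-similarity rule in Step 3 is simple enough to write down directly. The sketch below assumes the 6-dim layout described earlier (5 value features, then the waveform class) and an illustrative threshold θ = 0.1; the function name and the toy data are hypothetical, and a production version would be vectorized rather than looped.

```python
import math

def recent_similarity_matrix(feats, theta):
    """RSM[i][l] = 1 if waveform types match AND value distance <= theta."""
    n = len(feats)
    rsm = [[0] * n for _ in range(n)]
    for i in range(n):
        for l in range(i, n):            # the relation is symmetric
            same_wave = feats[i][5] == feats[l][5]
            close = math.dist(feats[i][:5], feats[l][:5]) <= theta
            rsm[i][l] = rsm[l][i] = int(same_wave and close)
    return rsm

# Current-window features for three containers (toy values).
feats = [
    [0.10, 0.20, 0.30, 0.40, 0.50, 1],   # rising
    [0.12, 0.22, 0.31, 0.41, 0.52, 1],   # rising, nearly the same levels
    [0.10, 0.20, 0.30, 0.40, 0.50, 3],   # same levels, but fluctuating
]
rsm = recent_similarity_matrix(feats, theta=0.1)
```

Containers 0 and 1 come out as similar, while container 2 is excluded purely because its waveform type differs, even though its value features are identical to container 0.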
Step 4 — Combine. The final per-window grouping is CSM = HSM AND RSM: two containers count as similar only if they are both historically similar AND recently similar. This filters out spurious matches (look-alike right now but normally different) and stale matches (usually similar but diverging today).
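The combination in Step 4 is an element-wise logical AND. A small sketch with made-up matrices (the values below are illustrative, not from the paper) shows both filtering effects:

```python
def combine(hsm, rsm):
    """CSM[i][l] = HSM[i][l] AND RSM[i][l], element by element."""
    return [[h & r for h, r in zip(h_row, r_row)] for h_row, r_row in zip(hsm, rsm)]

# Containers 0 and 1 share a historical group but diverge right now (stale match).
hsm = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
# Containers 1 and 2 look alike right now but are normally different (spurious match).
rsm = [[1, 0, 0],
       [0, 1, 1],
       [0, 1, 1]]
csm = combine(hsm, rsm)
```

Here both off-diagonal matches are dropped, so each container ends up similar only to itself; the diagonal always survives because every container is both historically and recently similar to itself.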
Step 5 — Build the Transformer inputs. Two inputs go in:
Feature Encoding (FE): the 6-dim value+waveform vectors for all containers.
Position Encoding (PE): derived from the similarity relationship (NOT from time). FELT offers two flavors:
Absolute position encoding (FELT-a): containers are ordered so that similar ones sit next to each other (sort each group’s members by distance to their medoid; sort groups by distance to a chosen “base group medoid”), then apply the classic sine/cosine position formula. Simpler, slightly cheaper, but can’t directly express the relation between any two arbitrary containers, and the ordering changes if the container count changes.
Relative position encoding (FELT-r): instead of one global order, it builds a Relative Position Matrix (RPM) giving each container’s rank-distance to every other, using a weighted blend of historical and recent distance (rela_dist = β·historical + (1−β)·recent, with β = 0.4, so recent matters more). This lets the model reason about any pair directly and stays consistent regardless of how many containers there are. Slightly heavier per step but more accurate and more general.
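Both position-encoding flavors can be illustrated with a short sketch. The sine/cosine formula is the classic Transformer one; how the blended distance is turned into ranks is an assumption here (a plain per-row sort), and the distance matrices are made-up toy values rather than anything from the paper.

```python
import math

def sinusoidal_pe(position, d_model=6):
    """Classic sine/cosine encoding for one slot in the similarity ordering."""
    pe = []
    for k in range(d_model):
        angle = position / (10000 ** (2 * (k // 2) / d_model))
        pe.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return pe

def relative_position_matrix(hist_dist, recent_dist, beta=0.4):
    """Row i ranks every container by beta*historical + (1-beta)*recent distance to i."""
    n = len(hist_dist)
    rpm = []
    for i in range(n):
        blended = [beta * hist_dist[i][l] + (1 - beta) * recent_dist[i][l]
                   for l in range(n)]
        order = sorted(range(n), key=lambda l: blended[l])
        rank = [0] * n
        for r, l in enumerate(order):
            rank[l] = r                  # rank 0 is the container itself
        rpm.append(rank)
    return rpm

# FELT-a style: containers already sorted so that look-alikes are adjacent
# (the ordering below is hypothetical).
similarity_order = [0, 2, 1, 3]
absolute_pe = {c: sinusoidal_pe(pos) for pos, c in enumerate(similarity_order)}

# FELT-r style: toy pairwise distances for three containers.
hist_dist = [[0, 1, 5], [1, 0, 4], [5, 4, 0]]
recent_dist = [[0, 6, 2], [6, 0, 3], [2, 3, 0]]
rpm = relative_position_matrix(hist_dist, recent_dist)
```

In the toy data, container 1 is historically the closest to container 0, but because β = 0.4 weights recent distance more heavily, container 2 ends up ranked nearer in the first row of rpm.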
Step 6 — Attention with a similarity mask. Inside the Transformer encoder, multi-head self-attention learns cross-container relationships. To stop dissimilar containers from polluting each other, FELT adds an attention mask from CSM: if two containers are similar the mask is 0 (attention allowed); if not, the mask is −∞, which after softmax becomes 0 attention. So each container effectively attends only to its current look-alikes. (The architecture also uses ResNet layers on the inputs, a 1D-CNN for feature extraction, add&norm layers, and a feed-forward block — standard Transformer-encoder machinery. Notably FELT is encoder-only (no decoder), which cuts out cross-attention cost.)
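The masking mechanism in Step 6 can be demonstrated with a single attention head in plain Python. This sketch assumes no learned projections (queries and keys are the raw toy vectors) and hypothetical function names; it shows only how a CSM-derived additive mask zeroes out attention to dissimilar containers.

```python
import math

NEG_INF = float("-inf")

def mask_from_csm(csm):
    """0 where attention is allowed, -inf where containers are dissimilar."""
    return [[0.0 if similar else NEG_INF for similar in row] for row in csm]

def softmax(row):
    """Numerically stable softmax; exp(-inf) evaluates to exactly 0.0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention_weights(q, k, mask):
    """Scaled dot-product scores plus the additive mask, then a row-wise softmax."""
    d = len(q[0])
    weights = []
    for i, qi in enumerate(q):
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) + mask[i][j]
                  for j, kj in enumerate(k)]
        weights.append(softmax(scores))
    return weights

# Three containers; CSM says 0 and 1 are similar, 2 stands alone.
x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
csm = [[1, 1, 0],
       [1, 1, 0],
       [0, 0, 1]]
w = masked_attention_weights(x, x, mask_from_csm(csm))
```

The diagonal of CSM must stay 1 for this to work: a row that is masked everywhere would make the softmax undefined. Since every container is similar to itself, that condition holds by construction.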
Step 7 — Output. A final linear layer turns the encoder output into the prediction: the workload level (one of 5 discrete classes) for every container at the next timestamp t. The whole thing trains in parallel via matrix multiplication.
Why predict a level (1 of 5) instead of an exact CPU%? The paper deliberately discretizes 0–100% CPU into 5 levels. Reasons: integers use less memory than floats; fewer output classes means far fewer parameters (key at scale); and 5 levels are enough to drive resource-allocation decisions (you don’t need 73.4% vs 73.5% to decide how many replicas to run). They cite that 5-level quantization is a sweet spot and that industrial systems (Azure’s VM scheduler) discretize similarly.
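As a small illustration of the discretization, the sketch below maps a CPU percentage to one of 5 levels. Equal-width 20% bands and 1-based level numbering are assumptions made for this example; the summary above does not state the paper's exact band boundaries.

```python
def cpu_to_level(cpu_percent, n_levels=5):
    """Map 0-100% CPU to a level in 1..n_levels using equal-width bands (assumed)."""
    if not 0.0 <= cpu_percent <= 100.0:
        raise ValueError("cpu_percent must be within [0, 100]")
    width = 100.0 / n_levels
    # Exactly 100% would otherwise fall into a sixth band, so clamp it.
    return min(int(cpu_percent // width) + 1, n_levels)

levels = [cpu_to_level(v) for v in (0.0, 19.9, 20.0, 73.4, 73.5, 100.0)]
```

Both 73.4% and 73.5% land in the same band, which is the point made above: the allocation decision does not change between them.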
Inputs (what it consumes)
Per-container CPU utilization time series, sampled at the monitor’s interval (≥30 s; e.g. 30 s in Alibaba-2021, 60 s in Alibaba-2022).
A lookback of historical windows for the offline grouping, plus the current window for recent similarity.
These raw series are never fed raw to the model. Each container-window is reduced to 6 engineered features: 5 value-distribution quantiles (min, Q1, median, Q3, max) + a waveform class derived from slope and standard deviation.
Derived structures built from those features: the historical (HSM), recent (RSM), and combined (CSM) similarity matrices, and the position encoding (absolute order or relative-position matrix).
The paper uses CPU usage only, though the value/waveform feature recipe is multivariate-capable in principle.
Outputs (what it produces)
For every container simultaneously, the predicted workload level (1 of 5 discrete CPU-utilization bands) at the next timestamp t (single-step-ahead in the main setup).
It does not directly emit a number of pods/replicas or a scaling command. It produces the forecast; the actuation (deciding and applying scaling) is left to a downstream resource manager. The authors explicitly note as future work that they will explore “how the proposed FELT can effectively guide resource allocation... to enable proactive optimization.”
How It Fits the Autoscaling Framework (MAPE-K)
Stage: FELT is squarely an Analyze-stage component — a time-series workload forecaster. It consumes monitored metrics and produces a forward-looking signal.
Reactive vs. proactive: FELT is the enabler of proactive / predictive autoscaling. By forecasting the next window’s workload level before it happens, it lets a scaler act ahead of the provisioning delay rather than after a threshold is breached.
Horizontal vs. vertical: FELT itself is scaling-agnostic — it just predicts demand. That prediction can feed either a horizontal scaler (add/remove replicas, e.g. the Kubernetes HPA) or a vertical one (resize CPU/RAM, e.g. a VPA). The 5-level output is coarse but, the authors argue, sufficient to guide such allocation decisions.
Honest scope: This paper delivers the prediction half of proactive autoscaling. The Plan and Execute steps (turning “container X will be at level 4” into “run 6 replicas” and applying it) are out of scope here and acknowledged as future work. So think of FELT as a drop-in, low-overhead forecaster that a standard autoscaler would call.
Concrete example: An online store the night before a big sale. A reactive autoscaler waits until checkout-service CPU is already pegged, then boots more pods — too late, customers see errors. With FELT feeding the loop, the system forecasts that the checkout and payment containers (currently behaving like the look-alike “spiky-evening” group) will jump to level 4–5 in the next window, and warms up extra pods before the rush. FELT supplies that early warning for thousands of services at once using a single, lightweight model.
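To make the division of labor concrete, here is a purely hypothetical Plan step that consumes FELT-style level forecasts. Nothing in it comes from the paper: the level-to-replica table, the service names, and the function are invented for illustration, and a real planner would also account for cooldowns, limits, and cost.

```python
# Hypothetical policy: how many replicas to run for each predicted level.
REPLICAS_FOR_LEVEL = {1: 1, 2: 2, 3: 4, 4: 6, 5: 8}

def plan(predicted_levels, current_replicas):
    """Return {service: target_replicas} only for services that need a change."""
    actions = {}
    for name, level in predicted_levels.items():
        target = REPLICAS_FOR_LEVEL[level]
        if target != current_replicas.get(name, 0):
            actions[name] = target
    return actions

# Analyze stage output (what a forecaster like FELT would supply).
predicted_levels = {"checkout": 4, "payment": 5, "search": 2}
# Knowledge: what is currently running.
current_replicas = {"checkout": 2, "payment": 8, "search": 2}

actions = plan(predicted_levels, current_replicas)
```

Only checkout needs action here (scale from 2 to 6 replicas), and the Execute stage would apply it before the predicted rush rather than after a CPU threshold is breached.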
Evaluation (datasets & metrics, briefly)
Datasets: Alibaba cluster-trace-microservices v2021 (20,000+ microservices, 12 h, 30 s sampling), v2022 (28,000+ microservices, 13 days, 60 s sampling), and Fisher (a real Kubernetes system, 10 containers, 30 days). CPU usage extracted; split 7:1:2 (train/val/test). A bonus generalization test runs on the Azure Public Dataset (edge-computing benchmark).
Metric: Accuracy = correctly-predicted levels / total (since output is a 5-class level). Overhead measured as training time, inference time, and parameter count.
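The accuracy metric is just the fraction of exact level matches. A minimal sketch, with made-up predictions and a hypothetical function name:

```python
def level_accuracy(predicted, actual):
    """Share of containers whose predicted level equals the true level."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

predicted = [1, 3, 4, 5, 2]
actual = [1, 3, 5, 5, 2]
acc = level_accuracy(predicted, actual)
```

Note that this metric treats a miss by one level (4 instead of 5) the same as a miss by four levels, so it says nothing about how far off a wrong prediction is.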
Baselines: VSBG (per-container, VMD+BiLSTM/GridLSTM), NBeats (per-group, K-Means+DTW), MSCNet (multi-scale CNN), GTN-LA (graph Transformer + LSTM), TimeXer (Transformer), STGCN (spatio-temporal graph CNN). FELT comes in two variants: FELT-r (relative PE) and FELT-a (absolute PE).
Headline numbers: FELT-r tops accuracy on all three datasets (e.g. up to +28.06% on Alibaba-2021); FELT-a is second. FELT-a has the lowest training time on all datasets; FELT-r is third-lowest while being most accurate, with the third-fewest parameters (0.57×10⁶). Inference latency (~6.4 ms GPU, ~16–20 ms CPU-only) is far under the 30 s allocation interval, so it’s fine for online use.
Ablations confirm each piece pulls weight: adding recent similarity raised accuracy (51.42% → 58.17%); adding the attention mask jumped it to 68.92% and cut training time ~24% / inference ~23% (it suppresses noise from dissimilar containers). Value+Waveform together (68.92%) beat value-only (61.43%) and waveform-only (28.36%) — value features carry most of the signal, but the combo is best. The encoder-only design beats a vanilla Transformer (6.43 ms / 0.57M params vs 10.72 ms / 0.84M).
Training & pre-training
Trained from scratch — no pretrained or foundation model.
FELT trains a single custom encoder-only Transformer from random initialization on cloud workload traces, learning to classify the next-step CPU workload level (1 of 5 classes) for all containers at once. There is no pretrained backbone, foundation model, fine-tuning, or zero-shot transfer — its efficiency comes entirely from the single-model-plus-similarity-mask design, not from transfer learning. Each container-window is reduced to 6 engineered features (5 value quantiles + 1 waveform class). Training uses Alibaba cluster-trace-microservices v2021 (20,000+ microservices) and v2022 (28,000+), plus Fisher (10 containers, 30 days), with CPU usage split 7:1:2 train/val/test; the generalization test additionally uses the Azure Public Dataset. Overhead is reported as training time, inference time, and parameter count (~0.57M for FELT-r).
The paper does not state optimizer, epochs, learning rate, or loss explicitly.
The hand-tuned knobs (group count K, thresholds, blend weight β = 0.4, penalty α) are algorithm hyperparameters, not pretraining artifacts.
Strengths
Genuinely scalable: one model for all containers — no per-container or per-group model explosion. Big drops in parameters (up to ~85%) and training time (up to ~96%).
Adaptive to volatility: blending historical + recent similarity, recomputed every window, keeps it accurate when containers suddenly change behavior — a real weakness of prior group-based methods.
Smart use of the Transformer: repurposing position encoding and attention masks to encode container similarity (not time order) is an elegant way to get group-specialization from a single shared model.
Cheap, expressive features: 6 summary numbers cut storage/compute yet capture both magnitude (value) and shape (waveform); ablation shows both matter.
Deployment-ready latency: inference comfortably under the 30 s allocation interval, even CPU-only, and shown to generalize to an edge (Azure) dataset.
Two variants for different needs: FELT-r (more accurate, count-invariant, fewer params) vs FELT-a (fastest training).
Limitations
It only forecasts; it doesn’t scale. No Plan/Execute — turning the prediction into an actual replica count or scaling action is left to a downstream autoscaler and acknowledged as future work. End-to-end SLA/cost impact isn’t measured.
Coarse 5-level output. Discretizing CPU into 5 bands saves cost but throws away fine-grained magnitude that some allocation policies might want; the paper assumes 5 levels are “enough.”
CPU-only, single-step-ahead in the main experiments. Memory, network, request rate, latency, and longer multi-step horizons aren’t the focus, even though the feature recipe could extend to them.
Several hand-tuned knobs: number of groups K, slope/std thresholds (kthres, sthres), recent-similarity threshold θ, blend weight β, the α penalty. Accuracy is sensitive to thresholds (it peaks then drops), so tuning per deployment may be needed.
Group re-clustering is reactive: historical groups are only refreshed after accuracy has degraded “for a relatively long period,” so a regime change can hurt accuracy until the system notices.
Quadratic recent-similarity step: the online similarity computation is O(n²) in container count; vectorization keeps it fast at tested scales, but the authors note the true scalability ceiling depends on workload complexity, not just container count.
Glossary
Container / pod: a lightweight isolated box running one piece of an app.
Workload: how busy a container is; here, CPU utilization over time.
SLA/SLO: the promised service quality (latency/availability) the provider must meet.
Proactive (predictive) autoscaling: scaling based on a forecast of future demand, ahead of the provisioning delay — vs. reactive, which waits for a threshold to be crossed.
Horizontal scaling (HPA): add/remove instances. Vertical scaling (VPA): resize an instance’s CPU/RAM.
MAPE-K: Monitor–Analyze–Plan–Execute over shared Knowledge; FELT sits in Analyze.
Transformer: attention-based neural net; here used as a multivariate time-series forecaster.
Self-attention: mechanism letting each item weigh how relevant every other item is. Multi-head = several in parallel (enables fast parallel training).
Position encoding: signal telling the model where an item sits; FELT uses it to encode container similarity, not time order. Absolute = global order; relative = pairwise rank-distance.
Attention mask: forces attention between chosen items to zero; FELT masks out dissimilar containers.
Value features: 5 quantiles (min, Q1, median, Q3, max) describing the workload’s distribution.
Waveform features: slope + standard deviation → a class (flat / rising / falling / fluctuating) describing the shape of change.
K-Medoids: clustering where each group center is a real data point (robust to outliers); used to form historical groups.
HSM / RSM / CSM: Historical / Recent / Combined Similarity Matrices, i.e. binary “are these two containers alike?” tables, the last being HSM AND RSM.
Encoder-only Transformer: drops the decoder (and its cross-attention), cutting compute and memory; this is FELT’s design choice for efficiency.
- Ding, Z., Feng, B., & Yu, W. (2025). FELT: Large-Scale Cloud Workload Prediction Through Adaptive Feature-Enhanced and Similarity-Aware Transformer. Tsinghua Science and Technology. 10.26599/tst.2025.9010102