CloudFormer - Luca's Research @ PEG

Shahbazinia et al. (2025) Citations

TL;DR

When many virtual machines (VMs) share one physical server in the cloud, they fight over hidden shared hardware (cache, memory bandwidth, network). This “noisy neighbor” contention silently slows applications down even though each VM was promised its own CPUs and memory. CloudFormer is a neural network that predicts how much an application will be slowed down, using only metrics the cloud provider can see from outside the VM (a “black box” view that respects privacy). It is a dual-branch Transformer: one branch reads the time series of metrics (how things change second by second), and the other reads the mix of metrics (how 206 different sensors relate to each other). It outputs a single number between 0 and 1: the ratio of ideal performance to actual performance, where 1 means no slowdown. On unseen applications it predicts slowdown with a mean absolute error (MAE) of 7.8 percent, beating the best baseline by at least 28 percent. This prediction is the “Analyze” brain that lets a cloud manager scale, migrate, or reschedule VMs before users notice a slowdown, instead of reacting after the damage is done.

The Problem (and why simple autoscaling isn’t enough)

Cloud providers pack (“consolidate”) many customer VMs onto one physical server to save money and energy. Virtualization (Intel VT, AMD-V) cleanly splits up dedicated resources: you get your own vCPUs, your own RAM quota, your own disk partition. The catch: it cannot split up shared resources. All VMs on the box still share:

the Last-Level Cache (LLC) — the big on-chip memory cache,
memory bandwidth — the pipe between CPU and RAM,
the network interface.

When a greedy neighbor VM hammers these, your VM slows down even though, on paper, it has everything it was promised. This is performance interference (the “noisy neighbor” problem). A request that should take 10 ms might take 25 ms; throughput drops, and Service Level Agreements (SLAs) are violated.

Two things make this very hard to predict in a public cloud:

Black-box constraint. For privacy, the provider cannot look inside the VM — no application source code, no internal app metrics. They only see host-level hardware counters observed from the hypervisor.
Dynamic, unknown workloads. Real traffic rises and falls (think an online store quiet at 3am, slammed during a flash sale). A slowdown can come either from a noisy neighbor or from the app’s own workload spiking, and you must tell those two apart.

Why plain threshold-based autoscaling isn’t enough here. Classic reactive autoscaling says “CPU went over 70 percent, add a VM.” But interference slowdown does not show up cleanly as one metric crossing a line — your CPU can look fine while cache contention quietly doubles your latency. By the time a Quality-of-Service (QoS) violation is visible, users are already suffering, and booting/migrating a VM takes time (the provisioning delay). You need to forecast the slowdown ahead of time so the orchestrator can act early. CloudFormer provides exactly that forecasting signal.

Prior work fell short in recognizable ways: many methods only handle one fixed scenario type (CPU-heavy or network-heavy), or only output a yes/no “is there interference?” classification, or require peeking inside the app, or model the metric mix but ignore how things change over time. CloudFormer is built to cover all of these gaps at once.

Background

Virtual machine (VM): a software computer running on shared physical hardware. The cloud rents these out.

Multi-tenant / consolidation: many customers’ VMs on one server. Great for utilization, bad for isolation.

Hypervisor / host: the supervising layer the provider controls. It can read hardware counters about every VM without entering the VM. This is the only vantage point that respects tenant privacy in a public cloud.

Performance degradation (the target): how much slower the app runs versus running alone. Defined as a ratio P = (ideal performance) / (actual performance), where 0 < P ≤ 1. P = 1 means no slowdown; P near 0 means severe slowdown. (“Ideal” is measured by replaying the exact same request pattern on a VM running alone, so workload changes are not mistaken for interference.)
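As a minimal sketch of this definition, assume performance is a lower-is-better measurement such as request latency, so that ideal ÷ actual lands in (0, 1]. The function name `degradation_ratio` and the clamp at 1.0 are illustrative choices, not something the paper specifies. The numbers reuse the 10 ms vs. 25 ms example from the problem description.

```python
def degradation_ratio(ideal: float, actual: float) -> float:
    """Return P = ideal / actual for a lower-is-better measure such as latency."""
    if ideal <= 0 or actual <= 0:
        raise ValueError("performance measurements must be positive")
    # Assumption: clamp at 1.0, since measurement noise can make a
    # co-located run look slightly faster than the isolated replay.
    return min(1.0, ideal / actual)


p = degradation_ratio(ideal=10.0, actual=25.0)  # 10 ms alone, 25 ms co-located
print(f"P = {p:.2f}")  # P = 0.40
```

Under this reading, P = 0.40 means the co-located run takes 2.5 times as long as the isolated one.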

Transformer (the model family): a neural network built around self-attention. Attention lets every element in a set look at (“attend to”) every other element and decide which ones matter for the task. Originally built for language (“Attention Is All You Need”, 2017), it is now widely used for time-series data. Here it is used as a performance predictor, not a text generator. A handy trick it uses is a class token: an extra learnable slot added to the input whose job is to soak up a summary of everything via attention, giving one clean vector to make the final prediction from.
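The class-token idea can be sketched in a few lines of plain Python. This is a toy, single-head version of scaled dot-product attention without the learned query/key/value projections a real Transformer uses; the vectors and the `attend` helper are made up for illustration only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # Weighted average of the value vectors, one output dimension at a time.
    summary = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, summary


# Toy sequence: slot 0 plays the class token, slots 1-3 are time steps.
tokens = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
cls_query = [1.0, 1.0]  # hypothetical value; in a trained model this is learned
weights, summary = attend(cls_query, tokens, tokens)
print([round(w, 3) for w in weights], [round(s, 3) for s in summary])
```

The class token's output is a weighted average of every slot, with the largest weight on the slot most similar to its query. That single vector is what the prediction head reads.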

Where this sits relative to autoscaling forecasting: Many papers in this area forecast future workload (e.g., next 10 minutes of CPU or request rate). CloudFormer is slightly different: it predicts how badly performance is degraded given the observed system state. It is a performance predictor feeding the decision logic, not a raw workload forecaster.

Contribution in Simple Terms

Three concrete things:

A new model, CloudFormer. A lightweight (~228k parameters) dual-branch Transformer that predicts VM performance degradation in black-box clouds. Splitting the job across two branches and then fusing them lets the model handle both static and dynamic workloads without retuning per scenario, a key generalization win. The two branches are:
- Temporal branch: “how does it change over time?” (catches transient spikes, periodic I/O bursts).
- System branch: “how do the 206 sensors relate to each other right now?” (catches the overall contention fingerprint, e.g., high LLC misses + high backend-bound = trouble).
A new dataset, CloudPerfTrace. Two months of traces, 206 metrics at 1-second resolution, spanning 11 real cloud applications (Redis, HBase, Web Search, Flink, MLPerf, etc.) and four workload shapes (static, monotonic, periodic, random). Roughly 317 days of recorded data total, ~18 GB compressed, openly released. This is far finer-grained and richer than existing public datasets (Google Borg, Alibaba, Azure Resource Central) that only log CPU/memory every few minutes.
Evidence it generalizes. Tested on entirely unseen applications, CloudFormer hits 7.8 percent MAE, beating Random Forest, LSTM, Decision Trees, and linear/gamma regression by at least 28 percent — and an ablation shows each branch genuinely contributes.

The honest framing: this paper delivers the prediction primitive. It does not itself implement the scheduler or autoscaler; it produces the accurate degradation signal that such systems need.

How It Works, Step by Step

Collect input. For one application run, build a matrix of shape S × T: S = 206 system metrics (fixed), T = number of 1-second time steps (varies by run length). Metrics come from libvirt (53 VM-level, e.g., CPU%, free memory), Linux perf (38 hardware counters, e.g., cycles, LLC misses, retired instructions), and Intel Top-Down analysis (12 bottleneck metrics, e.g., frontend-bound%, backend-bound%). Half describe the target VM; half are the averaged metrics of its neighbor VMs (so the model can “see” the noisy neighbors).
Normalize all metrics.
Temporal branch (models change over time):
- A dense layer with ReLU compresses each time step’s 206 features into a small embedding (dimension d_t = 64).
- A learnable class token is prepended to summarize the whole sequence.
- Sinusoidal positional encoding is added so the model knows the order of time steps.
- The sequence passes through B_T = 4 Transformer encoder blocks using masked multi-head self-attention (masking lets it handle variable-length runs with padding), plus residual connections, layer norm, dropout, and feedforward layers.
- Output: one vector (the class token) summarizing the temporal story.
System branch (models relationships among metrics):
- First, average each metric over time (mean pooling), collapsing the time axis. Now order in time is gone — what remains is the overall contention profile.
- A 1D convolution projects pooled metrics into a system embedding (dimension d_s = 64).
- A learnable class token is prepended. No positional encoding here — metrics have no natural order.
- It passes through B_S = 4 Transformer encoder blocks with standard (unmasked) attention, letting every metric attend to every other metric (e.g., LLC occupancy influencing cache misses).
- Output: one vector summarizing cross-metric interactions.
Fuse and predict. Concatenate the two class-token vectors into one joint embedding. Feed it through a small MLP (dense layers, Swish activation, layer norm, dropout 0.4) ending in a sigmoid, which produces a single scalar strictly between 0 and 1: the predicted performance ratio P̂.
Train. Adam optimizer, low learning rate (1e-5), log-cosh loss (behaves like MSE for small errors but is robust to outliers/noise), with linear warm-up + cosine decay scheduling. The whole model is tiny (~228k params: ~222k temporal, ~6k system).
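Two ingredients of the temporal branch, sinusoidal positional encoding and padding-aware attention, can be sketched as follows. The encoding uses the standard formula from the original Transformer paper, which is assumed (not confirmed) to match the variant used in CloudFormer. `masked_softmax` and the toy scores are hypothetical and only show how padded time steps end up with zero attention weight.

```python
import math


def sinusoidal_encoding(num_steps, dim):
    """PE[pos][2i] = sin(pos / 10000^(2i/dim)); PE[pos][2i+1] = cos(same angle)."""
    table = []
    for pos in range(num_steps):
        row = []
        for j in range(dim):
            # j - j % 2 equals 2i for both the sine (even) and cosine (odd) slot.
            angle = pos / (10000 ** ((j - j % 2) / dim))
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def masked_softmax(scores, valid):
    """Softmax in which padded positions (valid=False) receive zero weight."""
    m = max(s for s, ok in zip(scores, valid) if ok)
    exps = [math.exp(s - m) if ok else 0.0 for s, ok in zip(scores, valid)]
    total = sum(exps)
    return [e / total for e in exps]


pe = sinusoidal_encoding(num_steps=4, dim=64)
# The last time step is padding, so its large raw score must be ignored.
weights = masked_softmax([0.2, 1.5, 0.7, 9.9], valid=[True, True, True, False])
print(pe[0][:4], [round(w, 3) for w in weights])
```

Masking lets runs of different lengths share one padded batch without the padding influencing the class-token summary.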

The two non-Transformer pieces worth noting: a 1D convolution (system-branch projection) and mean pooling over time (system-branch summarization). Everything else is attention.
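These two operations can be illustrated on a toy trace. The sketch assumes a kernel size of 1 and a single input channel for the 1D convolution (the kernel size is not stated in these notes). It uses 2 metrics, 3 time steps, and a 3-dimensional embedding in place of the real sizes; all numbers and helper names are made up.

```python
def mean_pool_over_time(trace):
    """trace[s][t] -> one average per metric, collapsing the time axis."""
    return [sum(row) / len(row) for row in trace]


def pointwise_conv1d(pooled, weights, bias):
    """Kernel-size-1 convolution with one input channel: every scalar metric
    is mapped to a vector using the same weights and bias."""
    return [[w * x + b for w, b in zip(weights, bias)] for x in pooled]


trace = [
    [1.0, 3.0, 5.0],  # metric 0 over three seconds
    [0.2, 0.2, 0.8],  # metric 1 over three seconds
]
pooled = mean_pool_over_time(trace)  # one value per metric
tokens = pointwise_conv1d(pooled, weights=[0.5, -1.0, 2.0], bias=[0.0, 0.1, 0.0])
print(pooled, tokens)
```

After this step each metric is one token, so attention in the system branch runs across metrics rather than across time.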

Inputs (what it consumes)

206 host-observable system metrics per run, sampled at 1-second resolution, as a time series of shape S=206 × T:
- 53 VM-level metrics via libvirt (CPU utilization %, unused memory, etc.)
- 38 Linux perf hardware counters (cycles, LLC misses, retired instructions, etc.)
- 12 Intel Top-Down bottleneck metrics (frontend-bound %, backend-bound %, etc.)
- These 103 metrics are duplicated: 103 for the target VM + 103 averaged across neighbor VMs = 206.
Variable lookback length: T is the full duration of the run (variable-length, handled by attention masking) — not a fixed window.
Black-box only: no app source code, no in-VM telemetry. All sensing is from the hypervisor.
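The feature layout for a single time step can be sketched as below. The helper `build_features`, the zero-fill when there are no neighbors, and the constant metric values are assumptions for illustration; only the counts (53 + 38 + 12 per VM, doubled to 206) come from the description above.

```python
def build_features(target, neighbors):
    """Concatenate target-VM metrics with the per-metric mean over neighbor VMs."""
    n = len(target)
    if any(len(vm) != n for vm in neighbors):
        raise ValueError("every VM must report the same metrics")
    if neighbors:
        neighbor_avg = [sum(vm[i] for vm in neighbors) / len(neighbors)
                        for i in range(n)]
    else:
        neighbor_avg = [0.0] * n  # assumption: zeros when the VM runs alone
    return target + neighbor_avg


PER_VM = 53 + 38 + 12  # libvirt + perf + Top-Down metrics per VM
target = [1.0] * PER_VM
neighbors = [[2.0] * PER_VM, [4.0] * PER_VM]
features = build_features(target, neighbors)
print(len(features))  # 206
```

Averaging the neighbors keeps the input width fixed at 206 regardless of how many VMs share the host.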

Outputs (what it produces)

A single scalar P̂ ∈ (0, 1] — the predicted performance ratio (ideal ÷ actual).
- P̂ = 1 → no degradation; the VM runs at full speed.
- P̂ near 0 → severe slowdown (e.g., Redis and Minio in the data degraded to ~6–10 percent of ideal performance).
It is a regression output (a continuous magnitude of slowdown), not a yes/no classification — which is what distinguishes it from prior “is there interference?” detectors.
Horizon: the paper frames this as predicting the degradation of a run from its observed system trace (a quantified degradation index), rather than a fixed-minutes-ahead forecast of a future metric. The value is in turning noisy host counters into one actionable slowdown number.

How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop: Monitor → Analyze → Plan → Execute, over shared Knowledge.

Monitor: CloudFormer consumes the host-level metrics gathered in the Monitor stage; it does not collect them itself.
Analyze (this is CloudFormer’s home): it is the analytical brain that turns raw counters into a predicted performance-degradation signal. This is precisely where a predictive model lives in MAPE-K.
Plan / Execute: the paper leaves these to existing systems. CloudFormer outputs the prediction; a separate orchestrator/scheduler decides what to do — migrate the VM to a less crowded host, reschedule placement to avoid interference, throttle the noisy neighbor, or trigger scaling — and then carries it out. The authors explicitly position the model as the signal that “guides resource management decisions” and enables “proactive migration” and “SLA compliance.”
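To make the hand-off from Analyze to Plan concrete, here is a minimal sketch of a downstream planner. The paper does not define one: the `Prediction` class, the thresholds 0.6 and 0.85, and the action names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    vm_id: str
    p_hat: float  # predicted performance ratio from the Analyze stage


def plan(pred: Prediction, migrate_below: float = 0.6,
         watch_below: float = 0.85) -> str:
    """Map a predicted ratio to an action (thresholds are hypothetical)."""
    if not 0.0 < pred.p_hat <= 1.0:
        raise ValueError("p_hat must lie in (0, 1]")
    if pred.p_hat < migrate_below:
        return "migrate"            # severe predicted slowdown: move the VM
    if pred.p_hat < watch_below:
        return "throttle-neighbor"  # moderate slowdown: curb the noisy VM
    return "no-action"


for p in (0.95, 0.75, 0.40):
    print(p, plan(Prediction("vm-7", p)))
```

Because the output is a continuous ratio rather than a yes/no flag, the planner can grade its response instead of applying one fixed reaction.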

Reactive vs. proactive: CloudFormer is designed to make management proactive: it predicts degradation so the orchestrator can act before users notice. The whole motivation is to stop “reacting only after QoS violations have already happened.”

Horizontal vs. vertical: The paper does not bind itself to one. Its prediction can feed either: scheduling/migration and placement decisions (a form of horizontal placement/scale management) or throttling/resource provisioning for a noisy VM (vertical-style adjustment). It is scaler-agnostic — it provides the prediction; the actuation is whatever the cloud operator already uses.

Analogy: Think of an online store the night before a big sale. Reactive autoscaling is a fire alarm — it rings only once the building is already smoking (the site is already slow). CloudFormer is more like a smoke-and-heat sensor wired to a forecaster: it watches subtle signals (cache misses creeping up on a crowded server) and says “this VM is about to run at 60 percent speed” — so the operations team moves it to a quieter server before customers ever hit a slow checkout.

Strengths

Black-box friendly: needs only host-visible counters; respects public-cloud privacy. No app instrumentation.
Generalizes to unseen apps: trained on 7 apps, tested on 4 never-seen apps; 7.8 percent MAE, ≥28 percent better than the best baseline (Random Forest at 10.78 MAE; LSTM 15.42; linear/gamma far worse).
Handles both static and dynamic workloads without per-scenario retuning — a direct answer to prior methods’ biggest weakness.
Quantifies slowdown (regression), not just detects it (classification), enabling fine-grained planning.
Two complementary views proven useful: ablation shows temporal-only (8.47 MAE) and system-only (8.65 MAE) each beat baselines, and fusing them (7.80) is best. Notably the system branch achieves this with only ~6k parameters.
Lightweight: ~228k parameters total — cheap to run, suitable for online use.
Releases a strong open dataset (CloudPerfTrace: 206 metrics, 1-sec, 317 days, 11 apps, CC-BY-4.0), valuable to the whole field.
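The improvement figures can be checked directly from the MAE values quoted above. This is plain arithmetic on the reported numbers; the helper name `relative_improvement` is an illustrative choice.

```python
def relative_improvement(baseline_mae: float, model_mae: float) -> float:
    """Fractional MAE reduction of the model relative to a baseline."""
    return (baseline_mae - model_mae) / baseline_mae


cloudformer = 7.80
for name, mae_value in {"Random Forest": 10.78, "LSTM": 15.42}.items():
    print(f"{name}: {relative_improvement(mae_value, cloudformer):.1%}")
```

With these rounded MAE values the gain over Random Forest comes out at about 27.6 percent, which matches the quoted 28 percent after rounding, and the gain over LSTM is about 49 percent.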

Limitations

Prediction only — no closed loop. It does not implement Plan/Execute; actuation (scheduling, migration, scaling) is delegated to external systems and is not evaluated end-to-end.
Not a forward-time forecaster in the usual sense. It outputs a degradation ratio from an observed trace rather than an explicit “value N minutes from now,” so plugging it into a strictly predictive autoscaler still needs design work (the authors list online adaptation as future work).
Single-node scope. Evaluated on one physical server (up to 9 server VMs per socket). Multi-node/cluster settings are listed as future work.
Dataset realism caveats: generated on one testbed (dual Intel Xeon Gold 6240) with synthetic workload shapes; real public-cloud heterogeneity (varied hardware, hypervisors, tenant mixes) may differ.
Needs neighbor metrics: the 206 features include averaged neighbor-VM metrics, which assumes the provider can observe co-located VMs (true for an operator, but the feature design leans on that).
No energy or cost objective yet — purely accuracy-focused; energy-aware extensions are proposed but not done.

Evaluation at a glance

Dataset: CloudPerfTrace — 11 apps, 206 metrics, 1-sec resolution, ~317 days total, static + monotonic + periodic + random workloads.
Split: 7 training apps / 4 unseen test apps; randomized 6 times (6 seeds), results aggregated.
Metrics: MAE and MSE. CloudFormer: MAE 7.80 ± 1.55, MSE 142.67 ± 49.71. Best baseline Random Forest: MAE 10.78, MSE 205.67. LSTM: MAE 15.42. Linear/Gamma regression far worse.
Error distribution: CloudFormer has the largest share of predictions within 5 percent and 10 percent error and the fewest large (>25 percent) errors. Static scenarios are easier to predict than dynamic ones.
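For readers unfamiliar with these metrics, the sketch below computes MAE, MSE, and the share of predictions within an error tolerance. It assumes errors are expressed on a 0–100 percent scale (an inference from the magnitude of the reported MSE, not something stated above), and the sample values are made up purely for illustration.

```python
def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error; penalizes large misses more heavily than MAE."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def share_within(y_true, y_pred, tol):
    """Fraction of predictions whose absolute error is at most tol."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if abs(t - p) <= tol)
    return hits / len(y_true)


# Made-up ratios on a 0-100 percent scale, for illustration only.
y_true = [100.0, 80.0, 40.0, 10.0]
y_pred = [96.0, 82.0, 52.0, 40.0]
print(mae(y_true, y_pred), mse(y_true, y_pred), share_within(y_true, y_pred, 5.0))
```

A single 30-point miss dominates the MSE here, which is why the two metrics are reported together with the error-bucket shares.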

Training & pre-training

Trained from scratch — no pretrained or foundation model.

CloudFormer is a compact (~228k-parameter) dual-branch Transformer trained from random initialization on the authors’ own CloudPerfTrace dataset (11 real cloud apps, 206 host-level metrics at 1-second resolution, ~317 days of traces). There is no pretrained backbone, no foundation model, no fine-tuning, and no zero-shot transfer — every weight is learned for this task on this data.

Crucially, the “generalization” claim is not transfer from a pretrained model. It is measured by an application-level split: 7 apps are used for training and 4 entirely unseen apps for testing, randomized over 6 seeds. The 7.8 percent MAE therefore reflects accuracy on held-out, unseen apps, not the benefit of a borrowed backbone.

The training recipe itself is modest: inputs are normalized, optimization uses Adam with a deliberately low initial learning rate (1e-5), and the objective is log-cosh loss (chosen over MSE for robustness to outliers), under a learning-rate schedule of linear warm-up followed by cosine decay.
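The loss and the learning-rate schedule can be sketched in plain Python. The sketch treats 1e-5 as the peak rate reached at the end of warm-up and decays to zero. The warm-up length, the total step count, and the final rate are assumptions, since these notes do not give them.

```python
import math


def log_cosh(error: float) -> float:
    """log(cosh(e)), computed stably as |e| + log1p(exp(-2|e|)) - log(2)."""
    a = abs(error)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def learning_rate(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Small errors behave like e^2 / 2; large errors grow roughly like |e|.
print(log_cosh(0.01), log_cosh(100.0))
print([learning_rate(s, total_steps=100, warmup_steps=10) for s in (0, 9, 55, 100)])
```

The near-linear growth for large errors is what makes log-cosh less sensitive to outliers than MSE, whose penalty grows quadratically.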

Glossary

Autoscaling: automatically adjusting resources (VMs, CPU/RAM) to match demand without over- or under-provisioning.
Reactive vs. proactive scaling: react after a threshold is crossed (always lagging) vs. forecast demand/problems and act ahead of time.
MAPE-K: the self-adaptive control loop — Monitor, Analyze, Plan, Execute, over shared Knowledge. CloudFormer is an Analyze-stage component.
Performance interference / noisy neighbor: slowdown caused by co-located VMs competing for shared hardware.
Black-box environment: the provider sees only external/host metrics, never inside the VM.
Performance degradation ratio P: ideal ÷ actual performance, in (0, 1]; 1 = healthy, near 0 = badly slowed.
Transformer / self-attention: a neural net where each element weighs the relevance of all others; used here as a performance predictor over time-series and over metrics.
Class token: an extra learnable input slot that aggregates a summary of everything via attention, used as the final feature for prediction.
Positional encoding: signal added so the Transformer knows the order of time steps (used only in the temporal branch).
LLC (Last-Level Cache): the large shared on-chip cache; a major source of interference.
MAE / MSE: Mean Absolute Error and Mean Squared Error, the average absolute and average squared prediction error respectively; lower is better.
libvirt / perf / Top-Down: the tools providing the 206 host-level metrics (VM stats, hardware counters, CPU bottleneck breakdown).

References

Shahbazinia, A., Huang, D., Costero, L., & Atienza, D. (2025). CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload. arXiv:2509.03394. https://doi.org/10.48550/arXiv.2509.03394