CloudFormer

Predicting a cloud VM's performance-degradation ratio from black-box host metrics with a dual-branch transformer

Shahbazinia et al. (2025)


TL;DR

When many virtual machines (VMs) share one physical server in the cloud, they compete for hidden shared hardware (cache, memory bandwidth, network). This “noisy neighbor” contention silently slows applications down even though each VM was promised its own CPUs and memory. CloudFormer is a neural network that predicts how much an application will be slowed down, using only metrics the cloud provider can see from outside the VM (a “black box” view that respects privacy). It is a dual-branch Transformer: one branch reads the time series of metrics (how things change second by second), and the other reads the mix of metrics (how 206 different sensors relate to each other). It outputs a single number between 0 and 1, the ratio of ideal performance to actual performance, where 1 means no slowdown and values near 0 mean severe slowdown. On unseen applications it predicts slowdown with a mean absolute error (MAE) of 7.8 percent, beating the best baseline by at least 28 percent. This prediction is the “Analyze” brain that lets a cloud manager scale, migrate, or reschedule VMs before users notice a slowdown, instead of reacting after the damage is done.


The Problem (and why simple autoscaling isn’t enough)

Cloud providers pack (“consolidate”) many customer VMs onto one physical server to save money and energy. Virtualization (Intel VT, AMD-V) cleanly splits up dedicated resources: you get your own vCPUs, your own RAM quota, your own disk partition. The catch: it cannot split up shared resources. All VMs on the box still share hardware such as the cache, memory bandwidth, and the network.

When a greedy neighbor VM hammers these, your VM slows down even though, on paper, it has everything it was promised. This is performance interference (the “noisy neighbor” problem). A request that should take 10 ms might take 25 ms, throughput drops, and Service Level Agreements (SLAs) get violated.

Two things make this very hard to predict in a public cloud:

  1. Black-box constraint. For privacy, the provider cannot look inside the VM — no application source code, no internal app metrics. They only see host-level hardware counters observed from the hypervisor.

  2. Dynamic, unknown workloads. Real traffic rises and falls (think an online store quiet at 3am, slammed during a flash sale). A slowdown can come either from a noisy neighbor or from the app’s own workload spiking, and you must tell those two apart.

Why plain threshold-based autoscaling isn’t enough here. Classic reactive autoscaling says “CPU went over 70 percent, add a VM.” But interference slowdown does not show up cleanly as one metric crossing a line — your CPU can look fine while cache contention quietly doubles your latency. By the time a Quality-of-Service (QoS) violation is visible, users are already suffering, and booting/migrating a VM takes time (the provisioning delay). You need to forecast the slowdown ahead of time so the orchestrator can act early. CloudFormer provides exactly that forecasting signal.
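The contrast between the two triggers can be sketched in a few lines. This is an illustration only: the function names, the 0.80 floor on the predicted ratio, and the sample values are hypothetical and do not come from the paper.

```python
def reactive_rule(cpu_util: float, threshold: float = 0.70) -> bool:
    """Classic reactive rule: scale out only once CPU crosses the threshold."""
    return cpu_util > threshold


def proactive_rule(predicted_ratio: float, min_ratio: float = 0.80) -> bool:
    """Act when the predicted performance ratio falls below a tolerated floor.

    min_ratio is a hypothetical operator-chosen value, not one from the paper.
    """
    return predicted_ratio < min_ratio


# Cache-contention case: CPU utilization looks healthy, yet a predictor
# (hypothetical output here) expects the VM to run at half its ideal speed.
cpu_util = 0.45
predicted_ratio = 0.50

print(reactive_rule(cpu_util))          # False: the CPU threshold never fires
print(proactive_rule(predicted_ratio))  # True: the predicted slowdown triggers action
```

The point of the sketch is that the two rules look at different signals, so a healthy CPU reading cannot mask a predicted slowdown.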

Prior work fell short in recognizable ways: many methods only handle one fixed scenario type (CPU-heavy or network-heavy), or only output a yes/no “is there interference?” classification, or require peeking inside the app, or model the metric mix but ignore how things change over time. CloudFormer is built to cover all of these gaps at once.


Background

Virtual machine (VM): a software computer running on shared physical hardware. The cloud rents these out.

Multi-tenant / consolidation: many customers’ VMs on one server. Great for utilization, bad for isolation.

Hypervisor / host: the supervising layer the provider controls. It can read hardware counters about every VM without entering the VM. This is the only legal vantage point in a public cloud.

Performance degradation (the target): how much slower the app runs versus running alone. Defined as a ratio P = (ideal performance) / (actual performance), where 0 < P ≤ 1. P = 1 means no slowdown; P near 0 means severe slowdown. (“Ideal” is measured by replaying the exact same request pattern on a VM running alone, so workload changes are not mistaken for interference.)
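A minimal sketch of the ratio, assuming performance is measured with a lower-is-better quantity such as latency (the summary above does not say which metric each application uses, so this is an assumption). For a higher-is-better quantity such as throughput, the ratio would have to be inverted to stay in (0, 1]. The clamp at 1.0 is also an assumption, added so measurement noise cannot report a speed-up.

```python
def degradation_ratio(ideal_latency_ms: float, actual_latency_ms: float) -> float:
    """P = ideal / actual for a lower-is-better metric such as latency."""
    if ideal_latency_ms <= 0 or actual_latency_ms <= 0:
        raise ValueError("latencies must be positive")
    # Clamp at 1.0 (assumption): the co-located run is never scored as faster
    # than the isolated run.
    return min(1.0, ideal_latency_ms / actual_latency_ms)


# A request that should take 10 ms but takes 25 ms under interference.
print(degradation_ratio(10.0, 25.0))  # 0.4
# No interference: actual matches ideal.
print(degradation_ratio(10.0, 10.0))  # 1.0
```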

Transformer (the model family): a neural network built around self-attention. Attention lets every element in a set look at (“attend to”) every other element and decide which ones matter for the task. Originally built for language (“Attention Is All You Need”, 2017), it is now widely used for time-series data. Here it is used as a performance predictor, not a text generator. A handy trick it uses is a class token: an extra learnable slot added to the input whose job is to soak up a summary of everything via attention, giving one clean vector to make the final prediction from.
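To make the class-token idea concrete, here is a single-head self-attention sketch in NumPy. It is a simplified illustration, not CloudFormer's implementation: the weights are random, the class token is a fixed zero vector rather than a learned one, and there is no multi-head split, output projection, or feedforward layer.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a (n_tokens, d) array."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # every token scores every other token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(5, d))   # five input elements
cls = np.zeros((1, d))             # stand-in for the learnable class token
seq = np.vstack([cls, tokens])     # class token goes first: shape (6, d)

w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
out, weights = self_attention(seq, w_q, w_k, w_v)

summary = out[0]      # the class-token row is the sequence summary
print(summary.shape)  # (8,)
```

Row 0 of `weights` shows how much the class token draws from each element, which is what lets one vector stand in for the whole input.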

Where this sits relative to autoscaling forecasting: Many papers in this area forecast future workload (e.g., next 10 minutes of CPU or request rate). CloudFormer is slightly different: it predicts how badly performance is, or is about to be, degraded given the observed system state. It is a performance predictor feeding the decision logic, not a raw workload forecaster.


Contribution in Simple Terms

Three concrete things:

  1. A new model, CloudFormer. A lightweight (~228k parameters) dual-branch Transformer that predicts VM performance degradation in black-box clouds. The two branches deliberately split the job:

    • Temporal branch — “how does it change over time?” (catches transient spikes, periodic I/O bursts).

    • System branch — “how do the 206 sensors relate to each other right now?” (catches the overall contention fingerprint, e.g., high LLC misses + high backend-bound = trouble). Splitting them and then fusing lets the model handle both static and dynamic workloads without retuning per scenario — a key generalization win.

  2. A new dataset, CloudPerfTrace. Two months of traces, 206 metrics at 1-second resolution, spanning 11 real cloud applications (Redis, HBase, Web Search, Flink, MLPerf, etc.) and four workload shapes (static, monotonic, periodic, random). Roughly 317 days of recorded data total, ~18 GB compressed, openly released. This is far finer-grained and richer than existing public datasets (Google Borg, Alibaba, Azure Resource Central) that only log CPU/memory every few minutes.

  3. Evidence it generalizes. Tested on entirely unseen applications, CloudFormer hits 7.8 percent MAE, beating Random Forest, LSTM, Decision Trees, and linear/gamma regression by at least 28 percent — and an ablation shows each branch genuinely contributes.
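For readers unfamiliar with the metric, the sketch below computes MAE on performance ratios and a relative improvement over a baseline. It assumes that “beating by at least 28 percent” means a relative reduction in MAE, which the summary above does not state explicitly. All numbers are made up for illustration and are not results from the paper.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between true and predicted performance ratios."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and equally long")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_improvement(baseline_mae, model_mae):
    """Fraction by which the model reduces the baseline's MAE."""
    return (baseline_mae - model_mae) / baseline_mae


# Hypothetical ratios, not values from the paper.
y_true = [1.00, 0.80, 0.50, 0.90]
y_pred = [0.90, 0.85, 0.60, 0.90]

mae = mean_absolute_error(y_true, y_pred)
print(round(mae, 4))  # 0.0625, i.e. about 6 percentage points of ratio
```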

The honest framing: this paper delivers the prediction primitive. It does not itself implement the scheduler or autoscaler; it produces the accurate degradation signal that such systems need.


How It Works, Step by Step

  1. Collect input. For one application run, build a matrix of shape S × T: S = 206 system metrics (fixed), T = number of 1-second time steps (varies by run length). Metrics come from libvirt (53 VM-level, e.g., CPU%, free memory), Linux perf (38 hardware counters, e.g., cycles, LLC misses, retired instructions), and Intel Top-Down analysis (12 bottleneck metrics, e.g., frontend-bound%, backend-bound%), for 53 + 38 + 12 = 103 metrics per VM. Half of the 206 describe the target VM; the other half are the same 103 metrics averaged over its neighbor VMs (so the model can “see” the noisy neighbors).

  2. Normalize all metrics.

  3. Temporal branch (models change over time):

    • A dense layer with ReLU compresses each time step’s 206 features into a small embedding (dimension dt = 64).

    • A learnable class token is prepended to summarize the whole sequence.

    • Sinusoidal positional encoding is added so the model knows the order of time steps.

    • The sequence passes through BT = 4 Transformer encoder blocks using masked multi-head self-attention (masking lets it handle variable-length runs with padding), plus residual connections, layer norm, dropout, and feedforward layers.

    • Output: one vector (the class token) summarizing the temporal story.

  4. System branch (models relationships among metrics):

    • First, average each metric over time (mean pooling), collapsing the time axis. Now order in time is gone — what remains is the overall contention profile.

    • A 1D convolution projects pooled metrics into a system embedding (dimension ds = 64).

    • A learnable class token is prepended. No positional encoding here — metrics have no natural order.

    • It passes through BS = 4 Transformer encoder blocks with standard (unmasked) attention, letting every metric attend to every other metric (e.g., LLC occupancy influencing cache misses).

    • Output: one vector summarizing cross-metric interactions.

  5. Fuse and predict. Concatenate the two class-token vectors into one joint embedding. Feed it through a small MLP (dense layers, Swish activation, layer norm, dropout 0.4) ending in a sigmoid, producing a single scalar in (0, 1): the predicted performance ratio P̂. (A sigmoid approaches 1 but never reaches it, so a target of exactly P = 1 is matched only asymptotically.)

  6. Train. Adam optimizer, low learning rate (1e-5), log-cosh loss (behaves like MSE for small errors but is robust to outliers/noise), with linear warm-up + cosine decay scheduling. The whole model is tiny (~228k params: ~222k temporal, ~6k system).
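The input handling in the temporal branch (steps 3a to 3d) can be sketched as follows: pad variable-length runs, prepend a class token, add sinusoidal positional encoding, and build a mask so attention ignores padding. This is a sketch under assumptions, not the authors' code. The zero class token stands in for a learned one, adding the positional encoding to the class token is an assumption, and the random inputs stand in for the already-embedded time steps.

```python
import numpy as np


def sinusoidal_encoding(length, dim):
    """Standard sine/cosine positional encoding, shape (length, dim); dim even."""
    pos = np.arange(length)[:, None]
    i = np.arange(0, dim, 2)[None, :]
    angles = pos / np.power(10000.0, i / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


def pad_batch(runs):
    """Pad (T_i, d) runs to a common length; mask is True at real time steps."""
    t_max = max(r.shape[0] for r in runs)
    d = runs[0].shape[1]
    batch = np.zeros((len(runs), t_max, d))
    mask = np.zeros((len(runs), t_max), dtype=bool)
    for n, r in enumerate(runs):
        batch[n, : r.shape[0]] = r
        mask[n, : r.shape[0]] = True
    return batch, mask


D_T = 64  # temporal embedding size from the text
rng = np.random.default_rng(0)
# Two runs of different length (5 and 3 steps), already embedded to D_T.
runs = [rng.normal(size=(5, D_T)), rng.normal(size=(3, D_T))]
batch, mask = pad_batch(runs)

cls = np.zeros((len(runs), 1, D_T))             # learnable in a real model
tokens = np.concatenate([cls, batch], axis=1)   # (2, 6, 64)
tokens = tokens + sinusoidal_encoding(tokens.shape[1], D_T)
mask = np.concatenate([np.ones((len(runs), 1), dtype=bool), mask], axis=1)

# Additive attention bias: padded keys get -inf so softmax gives them weight 0.
attn_bias = np.where(mask, 0.0, -np.inf)[:, None, :]  # (2, 1, 6)
print(tokens.shape, attn_bias.shape)
```

Adding `attn_bias` to the attention scores before the softmax is one common way to realize the masking described in step 3; the class token is always left unmasked so every run has a valid summary slot.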

The two non-Transformer pieces worth noting: a 1D convolution (system-branch projection) and mean pooling over time (system-branch summarization). Everything else is attention.
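Both pieces are small enough to sketch directly. The text only says “1D convolution”, so the kernel size of 1 and the single input channel used here are assumptions, chosen because they turn each pooled metric into one token of dimension ds. The weights are random stand-ins for learned parameters, and the padding mask is added so padded time steps do not bias the average.

```python
import numpy as np


def masked_mean_over_time(x, mask):
    """x: (S, T) metrics, mask: (T,) True at real steps -> (S,) time average."""
    return (x * mask).sum(axis=1) / mask.sum()


def pointwise_conv1d(pooled, weight, bias):
    """Kernel-size-1 Conv1d with one input channel: (S,) -> (S, d_s).

    Each metric value is scaled by the same d_s filters, giving one token
    per metric. Kernel size and channel layout are assumptions.
    """
    return pooled[:, None] * weight[None, :] + bias[None, :]


S, T, D_S = 206, 10, 64
rng = np.random.default_rng(0)
x = rng.normal(size=(S, T))
mask = np.arange(T) < 7  # the last 3 time steps are padding

pooled = masked_mean_over_time(x, mask)              # time axis collapsed
weight, bias = rng.normal(size=D_S), np.zeros(D_S)   # learnable in practice
tokens = pointwise_conv1d(pooled, weight, bias)

print(pooled.shape, tokens.shape)  # (206,) (206, 64)
```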


Inputs (what it consumes)


Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop: Monitor → Analyze → Plan → Execute, over shared Knowledge.
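A skeleton of one loop iteration shows where a degradation predictor would plug in. Everything here is hypothetical scaffolding: the function names, the 0.8 floor, the sample metrics, and the stub predictor are illustrative and are not part of the paper or of any real orchestrator API.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared state for the loop (hypothetical fields)."""
    min_ratio: float = 0.8  # tolerated floor on the predicted ratio
    history: list = field(default_factory=list)


def monitor():
    """Stand-in for reading host-level metrics from the hypervisor."""
    return {"vm": "vm-1", "metrics": [0.2, 0.9, 0.7]}


def analyze(sample, predictor):
    """The Analyze step: turn metrics into a predicted performance ratio."""
    return predictor(sample["metrics"])


def plan(ratio, knowledge):
    """Choose an action; a real planner would weigh cost and alternatives."""
    return "migrate" if ratio < knowledge.min_ratio else "none"


def execute(action, vm):
    """Stand-in for calling the orchestrator."""
    return f"{action}:{vm}"


def mape_k_step(predictor, knowledge):
    sample = monitor()
    ratio = analyze(sample, predictor)
    action = plan(ratio, knowledge)
    knowledge.history.append((sample["vm"], ratio, action))
    return execute(action, sample["vm"])


def stub_predictor(metrics):
    """Toy rule standing in for a trained model; NOT CloudFormer."""
    return 1.0 - max(metrics) / 2


print(mape_k_step(stub_predictor, Knowledge()))  # migrate:vm-1
```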

Reactive vs. proactive: CloudFormer is designed to make management proactive. Instead of reacting only after a QoS violation has already occurred, the orchestrator receives a degradation prediction and can act before users notice.

Horizontal vs. vertical: The paper does not bind itself to one. Its prediction can feed either scheduling, migration, and placement decisions (horizontal-style management) or throttling and resource provisioning for a noisy VM (vertical-style adjustment). It is scaler-agnostic: it provides the prediction, and the actuation is whatever the cloud operator already uses.

Analogy: Think of an online store the night before a big sale. Reactive autoscaling is a fire alarm — it rings only once the building is already smoking (the site is already slow). CloudFormer is more like a smoke-and-heat sensor wired to a forecaster: it watches subtle signals (cache misses creeping up on a crowded server) and says “this VM is about to run at 60 percent speed” — so the operations team moves it to a quieter server before customers ever hit a slow checkout.


Strengths

Limitations


Evaluation at a glance


Training & pre-training

Trained from scratch — no pretrained or foundation model.

CloudFormer is a compact (~228k-parameter) dual-branch Transformer trained from random initialization on the authors’ own CloudPerfTrace dataset (11 real cloud apps, 206 host-level metrics at 1-second resolution, ~317 days of traces). No fine-tuning or zero-shot transfer is involved: every weight is learned for this task on this data.

Crucially, the “generalization” claim is not transfer from a pretrained model. It is measured by an application-level split: 7 apps are used for training and 4 entirely unseen apps for testing, randomized over 6 seeds. Held-out unseen apps, not a borrowed backbone, are what the 7.8 percent MAE demonstrates.
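An application-level split can be sketched as below. The placeholder app names and the use of Python's `random.Random` for shuffling are assumptions; the summary above only states the 7/4 split and the 6 seeds, not how the authors draw them.

```python
import random

# Placeholders for the 11 applications in the dataset.
APPS = [f"app_{i}" for i in range(11)]


def app_level_split(apps, n_train=7, seed=0):
    """Split whole applications, so no test app is ever seen in training."""
    shuffled = apps[:]
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


for seed in range(6):
    train, test = app_level_split(APPS, seed=seed)
    # Train and test share no application for any seed.
    assert not set(train) & set(test)
    print(seed, len(train), len(test))
```

Splitting by application rather than by trace is what makes the reported error a measure of generalization to unseen workloads instead of interpolation within known ones.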

The training recipe itself is modest: inputs are normalized, optimization uses Adam with a deliberately low initial learning rate (1e-5), and the objective is log-cosh loss (chosen over MSE for robustness to outliers), under a learning-rate schedule of linear warm-up followed by cosine decay.
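The loss and the schedule are both short formulas. In the sketch below, treating 1e-5 as the peak rate reached at the end of warm-up is an assumption, and the step counts are hypothetical; the summary above does not give them.

```python
import math


def log_cosh(error):
    """Numerically stable log(cosh(e)) = |e| + log1p(exp(-2|e|)) - log(2).

    Roughly e**2 / 2 for small errors (MSE-like) and |e| - log(2) for large
    ones (MAE-like), which is why outliers pull on it less than on MSE.
    """
    a = abs(error)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)


def learning_rate(step, total_steps, warmup_steps, peak_lr=1e-5):
    """Linear warm-up to peak_lr, then cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


# Hypothetical schedule: 100 steps in total, 10 of them warm-up.
for step in (0, 9, 55, 100):
    print(step, learning_rate(step, total_steps=100, warmup_steps=10))
```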


Glossary

References
  1. Shahbazinia, A., Huang, D., Costero, L., & Atienza, D. (2025). CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload. arXiv:2509.03394. https://doi.org/10.48550/arXiv.2509.03394