MV-Transformer - Luca's Research @ PEG

Kumar et al. (2025)

TL;DR

Cloud apps run on virtual machines (VMs) or containers that cost money. Give the app too few and it slows down or drops requests; give it too many and you waste money. Autoscaling is the automatic adding or removing of those machines as demand rises and falls. This paper builds an autoscaler that does not just react to load that has already arrived: it predicts the near-future load and scales ahead of time. The prediction engine is a Transformer, used here as a short-term forecaster of cloud workload. Crucially, it is multivariate: instead of looking at one signal like CPU alone, it looks at several metrics together (CPU, memory, disk, network) and learns how they interact. The authors wrap this predictor inside the classic MAPE control loop (Monitor → Analyze → Plan → Execute) and wire it to real Docker + Kubernetes so it actually spins VMs up and down. On three public workload datasets it forecasts more accurately than LSTM, Bi-LSTM, WGAN-GP transformer, and DynEformer, and its autoscaling reduces under-provisioning and reaches a ~2.98x “elastic speedup” over running with no autoscaler (on GWA-T-12).

The Problem (and why simple autoscaling isn’t enough)

Imagine an online store the night before a big sale. Traffic is about to explode at midnight.

Reactive autoscaling is like a manager who only adds cashiers after the queue is already out the door. It waits for a metric to cross a threshold (e.g., “CPU > 70% → add a server”). The problem: booting a new VM or container takes time (the provisioning delay / “cold start”). By the time the new capacity is ready, customers have already waited, timed out, or left. Reactive scaling is always one step behind, and during choppy, fluctuating traffic it constantly over- or under-corrects.
Proactive (predictive) autoscaling is the manager who looks at last year’s sale, sees the pattern, and staffs up before midnight. It forecasts demand and scales in advance, hiding the provisioning delay.

Two specific gaps this paper targets:

Most prior predictors are univariate — they forecast from one signal (just CPU, or just request rate). But cloud metrics are entangled: a spike in network traffic usually drags memory and CPU with it. Looking at only one signal throws away information.
The popular predictive models have weaknesses. LSTM/Bi-LSTM (recurrent neural nets) are slow on long sequences, hungry for memory, and suffer the vanishing gradient problem (they “forget” patterns from far back). Some transformer rivals (WGAN-GP transformer, DynEformer) are tuned for single-variable input and don’t fully exploit multiple metrics together.

So the research question is: can a multivariate transformer forecast cloud demand more accurately and cheaply, and can we plug it into a full, real-world autoscaling loop that actually moves VMs?

Background

Define the jargon once:

VM / container / pod: units of compute you rent. Adding more instances is “scaling out” (often loosely called “scaling up”); removing them is “scaling in” (or “scaling down”). (Kubernetes calls a unit a pod; this paper mostly says “VM”.)
Horizontal scaling: change the number of instances. Vertical scaling: change the size (CPU/RAM) of one instance. This paper does horizontal scaling — it adjusts the number of VMs.
Under-provisioning: too few resources → SLA violations, dropped requests. Over-provisioning: too many → wasted cost. A good autoscaler minimizes both.
MAPE-K loop: the textbook blueprint for self-managing systems. Monitor (collect metrics) → Analyze (detect/forecast) → Plan (decide the action) → Execute (apply it), all sharing a common Knowledge base. The predictive model almost always lives in the Analyze stage.
Time-series forecasting: predicting future values of a signal from its history. Here the signal is workload (specifically network throughput in KB/s).
Transformer: a neural network built on self-attention. Instead of reading a sequence step-by-step like an RNN, attention lets every time-step look directly at every other time-step and decide which ones matter (“which past moments best predict the next one?”). This avoids recurrence, so it trains faster, uses fewer parameters, and captures long-range patterns without forgetting them.
Multivariate vs univariate: univariate = one input series; multivariate = several series fed in together so the model learns their interdependencies.
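
As a rough illustration of the self-attention idea described above, the following minimal sketch computes scaled dot-product attention, softmax(QKᵀ/√d_k)·V, in plain Python. The function names and the toy inputs are illustrative assumptions, not the paper's code; a real model would add learned Query/Key/Value projections and several heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of rows, one row per time-step.
    """
    d_k = len(K[0])
    # Score every query time-step against every key time-step.
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(d_k) for k_row in K]
        for q_row in Q
    ]
    # Each row of weights says how much one time-step attends to the others.
    weights = [softmax(row) for row in scores]
    # Output is the attention-weighted mix of the value rows.
    n_cols = len(V[0])
    output = [
        [sum(w * v_row[c] for w, v_row in zip(w_row, V)) for c in range(n_cols)]
        for w_row in weights
    ]
    return output, weights


# Toy example: two time-steps with two features each (made-up numbers).
Q = [[1.0, 0.0], [0.0, 1.0]]
out, attn = scaled_dot_product_attention(Q, Q, [[10.0, 0.0], [0.0, 10.0]])
```

Each row of `attn` sums to 1, so every output row is a weighted average of the value rows; here each time-step attends most strongly to itself because its query matches its own key best.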

Contribution in Simple Terms

The paper’s genuine novelty is the combination, not any single brand-new neural net:

A multivariate transformer (“MV-Transformer”) as the forecaster. It ingests several cloud metrics at once (CPU, memory, disk read/write, network throughput) and predicts the next value of the chosen target metric. Using self-attention, it captures long-term dependencies and cross-metric interactions while using less memory than LSTM/Bi-LSTM and sidestepping vanishing gradients/overfitting.
Feature selection up front. Before forecasting, they use the Pearson correlation coefficient (a simple statistic measuring how linearly two signals move together, from −1 to +1) to pick the most informative pair of metrics. This trims noise, speeds up training, and reduces overfitting.
A complete, working MAPE autoscaling framework around the predictor — not just an offline forecast. The forecast feeds a planning algorithm that computes how many VMs are needed, with guards against thrashing, and an execute stage that actually scales VMs.
A real Kubernetes + Docker Desktop deployment. They don’t just simulate — they run the loop on Minikube/kubectl with real container scaling, and show it beats default Kubernetes behavior.

In plain terms: they took a smarter, multi-signal forecaster and embedded it in a real control loop that proactively right-sizes a cloud cluster.
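
The Pearson-based feature selection mentioned above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the helper names, the toy series, and the choice to rank candidates by the absolute value of the coefficient are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series (-1 to +1)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        # Constant series: correlation is undefined; treated as 0 here (assumption).
        return 0.0
    return cov / (spread_x * spread_y)


def pick_partner(target, candidates):
    """Return (name, r) of the candidate most correlated with the target.

    Ranking by |r| is an assumption; the summary only says the
    "strongest pair" is kept.
    """
    scores = {name: pearson(target, series) for name, series in candidates.items()}
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores[best]


# Made-up readings, purely for illustration.
network = [10, 20, 30, 40, 50]
candidates = {"memory": [1, 2, 3, 4, 6], "cpu": [5, 3, 6, 2, 4]}
partner, r = pick_partner(network, candidates)
```

With these toy numbers `partner` is `"memory"`, mirroring the network throughput + memory pairing reported for GWA-T-12, although the real coefficients in the paper are far weaker.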

How It Works, Step by Step

Collect. A collector module continuously gathers resource-utilization data — CPU usage C(t), memory M(t), disk read/write D(t), network throughput N(t) — over a sliding time window [t−w, t].
Monitor. These readings are stacked into a matrix X (rows = timestamps, columns = the four metrics) and stored. In the live deployment, metrics can come from Docker/Kubernetes built-ins or Prometheus; in the experiments they used default API-gathered data. The monitor also decides if an immediate (reactive) action is needed; otherwise it forwards data on for proactive analysis.
Analyze — feature selection. Compute Pearson correlation between candidate metric pairs and keep the strongest pair. The chosen target metric to forecast is network throughput; the partner feature is the metric most correlated with it (e.g., on GWA-T-12 they pair network throughput with memory).
Analyze — MV-Transformer forecast. The selected multivariate window is fed into the transformer:
- The input has shape [Batch, Sequence length, Features].
- Self-attention computes Query/Key/Value matrices and attention scores softmax(QKᵀ/√d_k)·V to weigh which past time-steps matter most; multi-head attention does this several ways in parallel.
- Architecture: an encoder (2 layers: masked multi-head attention → feed-forward → Add & Norm) and a decoder (1 layer: masked multi-head attention → cross-attention with encoder output → feed-forward → Add & Norm). Model size (d_model) 512, feed-forward dimension 2048, 50 hidden units, dropout 0.2, ReLU/SoftMax activations.
- Output: the predicted next-step value, N(t+1) = f(Xt) — i.e., the forecasted network throughput for the upcoming interval.
Plan. Convert the forecast into a resource decision. Estimate VMs needed as:
```
VMs(t+1) = predicted_workload(t+1) / workload_per_VM
```
where workload_per_VM is each VM’s max handling capacity (admin-set). Compare VMs(t+1) to the current count and decide scale-up, scale-down, or hold. Two safety mechanisms:
- CPT (Control Period Time): a minimum gap (~1 minute) between scaling actions to stop oscillation (rapid flip-flopping up/down that wastes resources).
- RRS (Remove Resources Strategy): when scaling down, remove only a fraction of surplus VMs at a time (VMs_surplus = (VMs_t − VMs_{t+1}) × RRS), never dropping below VMs_min. This stabilizes the system and keeps headroom for sudden spikes.
Execute. Issue the scale-up/scale-down command to the effector, which tells Docker/Kubernetes to actually create or remove VMs/containers, then schedules them into the cluster for the next interval. If a scaling action fails or is delayed, Execute talks back to Monitor to re-plan (fault tolerance). The loop repeats indefinitely.
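
The Plan step above (VM estimate, CPT guard, RRS scale-in) can be sketched as a single function. This is a hedged illustration rather than the paper's algorithm: the function name, the rounding choices (`ceil` for both the VM estimate and the number of VMs removed), and the default values for `rrs` and `vms_min` are assumptions; only the 60-second CPT reflects the "~1 minute" figure quoted above.

```python
import math


def plan_vm_count(predicted_workload, workload_per_vm, current_vms,
                  seconds_since_last_action, cpt_seconds=60.0,
                  rrs=0.5, vms_min=1):
    """Return the VM count to run in the next control period."""
    if not 0.0 < rrs <= 1.0:
        raise ValueError("rrs must be in (0, 1]")

    # CPT: hold if the previous scaling action was too recent.
    if seconds_since_last_action < cpt_seconds:
        return current_vms

    # VMs(t+1) = predicted_workload / workload_per_VM, rounded up (assumption)
    # and never below the configured minimum.
    needed = max(vms_min, math.ceil(predicted_workload / workload_per_vm))

    if needed > current_vms:
        return needed  # scale up straight to the estimate

    if needed < current_vms:
        # RRS: remove only a fraction of the surplus in one step.
        to_remove = math.ceil((current_vms - needed) * rrs)
        return max(vms_min, current_vms - to_remove)

    return current_vms  # hold


# Forecast 950 KB/s, each VM handles 100 KB/s, 4 VMs running, 2 min since last action.
scale_up = plan_vm_count(950, 100, 4, seconds_since_last_action=120)
# Forecast drops to 200 KB/s while 10 VMs are running.
scale_down = plan_vm_count(200, 100, 10, seconds_since_last_action=120)
```

In this sketch the scale-up jumps directly to 10 VMs, while the scale-down removes only half of the 8 surplus VMs and lands on 6, leaving headroom in case the load returns.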

Non-transformer pieces to note: Pearson correlation (classical statistics) for feature selection; the analytic Plan algorithms (CPT + RRS) that turn a forecast into a discrete VM count; and Docker/Kubernetes as the actuation layer. There is no GAN, diffusion, or FFT in the proposed model itself (WGAN-GP, which uses adversarial generator/discriminator losses, appears only as a baseline it’s compared against).

Inputs (what it consumes)

Multivariate time series of cloud resource metrics: CPU usage, memory usage, disk read/write, and network throughput, sampled over a lookback window [t−w, t].
The user/workload side: request rates, pending/dropped requests, request size (gathered by the monitor’s user module).
A selected feature pair chosen by Pearson correlation (e.g., network throughput + memory on GWA-T-12; max_per_machine + priority on Google Cluster; SampleCount + AverageAllocatedMb on Azure).
Config: lookback/sequence length, 2 features, batch size 64, etc.
Target variable to forecast: network throughput (KB/s) — used as the proxy for “workload.”
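
To make the input layout concrete, here is a minimal sketch of how a metric history could be cut into lookback windows with a next-step target, giving data shaped like [Batch, Sequence length, Features]. The function name, the column order, and the toy readings are assumptions for illustration; the paper's preprocessing code is not reproduced here.

```python
def make_windows(rows, seq_len, target_col=0):
    """Slice a multivariate series into (windows, next-step targets).

    rows: list of per-timestamp feature lists, oldest first.
    Returns windows with shape [num_windows][seq_len][num_features]
    and, for each window, the target metric at the following timestamp.
    """
    if seq_len < 1 or len(rows) <= seq_len:
        raise ValueError("need more rows than seq_len")
    windows, targets = [], []
    for start in range(len(rows) - seq_len):
        end = start + seq_len
        windows.append([list(r) for r in rows[start:end]])
        targets.append(rows[end][target_col])
    return windows, targets


# Columns: [network throughput (KB/s), memory]; values are made up.
history = [[10.0, 1.0], [20.0, 2.0], [30.0, 3.0], [40.0, 4.0], [50.0, 5.0]]
X, y = make_windows(history, seq_len=3)
```

Five timestamps with a lookback of 3 yield two training samples; batching them in groups of 64 would give the [Batch, Sequence length, Features] tensor described above.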

Outputs (what it produces)

Primary model output: a one-step-ahead forecast of network throughput N(t+1) (the workload for the next interval).
Derived planning output: a recommended number of VMs for the next control period, and a concrete scaling action (scale-up / scale-down / hold).
Final effect: actual VMs/containers created or removed in Kubernetes/Docker.
Horizon: short-term / next-interval. Evaluation horizons shown are 0–2 hours of workload and 0–100 minute scaling traces; forecast granularities tested include 5/10/30-minute intervals depending on dataset.

How It Fits the Autoscaling Framework (MAPE-K)

This paper is one of the rare ones that implements the whole loop, not just the predictor:

| MAPE stage | What the paper does |
| --- | --- |
| Monitor | Collector + monitor modules gather CPU/mem/disk/net and user-request metrics into matrix `X`. |
| Analyze | Where the transformer lives. Pearson feature selection → MV-Transformer forecasts next-interval workload. |
| Plan | Converts forecast to VM count; applies CPT (anti-oscillation) and RRS (gradual scale-in) policies. |
| Execute | Effector issues commands; Docker/Kubernetes actually scales VMs; loops back to Monitor on failure. |

Proactive or reactive? Primarily proactive (forecast-then-scale). The framework can also run reactively, but reactive is used only as a baseline for comparison.
Horizontal or vertical? Horizontal — it changes the number of VMs/containers, not their size.
Does it actuate, or just predict? It actuates. The forecast directly drives the VMs = workload / workload_per_VM calculation, which is executed in a real Kubernetes cluster — so unlike papers that stop at the forecast and hand off to a stock HPA, this one closes the loop end-to-end (though the Plan logic is custom, not the default Kubernetes HPA).

Analogy: Monitor = the store’s traffic counters; Analyze = the forecaster predicting the midnight rush; Plan = deciding “we need 12 cashiers, but add them gradually and don’t yank them the second the line dips”; Execute = actually clocking cashiers in and out.
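
For a sense of what the Execute stage might hand to Kubernetes, the sketch below builds a `kubectl scale` command for a planned replica count. It assumes the workload runs as a Deployment (the name `web` is hypothetical) and that replicas stand in for the paper's "VMs"; the paper's actual effector code is not shown in this summary.

```python
import subprocess


def kubectl_scale_args(deployment, replicas, namespace="default"):
    """Build the argument list for `kubectl scale` on a Deployment."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    return [
        "kubectl", "scale", f"deployment/{deployment}",
        f"--replicas={replicas}", f"--namespace={namespace}",
    ]


def execute_scaling(deployment, replicas, namespace="default", dry_run=True):
    """Apply the planned replica count; with dry_run=True only return the command."""
    args = kubectl_scale_args(deployment, replicas, namespace)
    if not dry_run:
        # Raises CalledProcessError if kubectl fails, so the caller can re-plan.
        subprocess.run(args, check=True)
    return args


# "web" is a hypothetical Deployment name; 12 is the planned count from the analogy.
command = execute_scaling("web", 12)
```

Keeping `dry_run=True` as the default lets the loop be exercised without a cluster; a failure raised by `subprocess.run` corresponds to the case where Execute reports back to Monitor.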

Evaluation (datasets & metrics, briefly)

Datasets: GWA-T-12 (Bitbrains, ~1,250–1,750 VMs), Google Cluster Traces 2011 (~12,500 machines, 29 days), Azure Functions 2019. 80/20 train/validation split.
Forecast-accuracy metrics: MSE, MAE, RMSE, MAPE (lower = better). At 50–100 epochs the MV-Transformer had the lowest errors vs LSTM, Bi-LSTM, and DynEformer. On MAPE it beat the WGAN-GP transformer on every dataset (e.g., GWA-T-12: 10.53 vs 12.28).
Autoscaling/elasticity metrics (SPEC): under-provisioning Θ_U, over-provisioning Θ_O, time-under T_U, time-over T_O (lower = better), and elastic speedup E_na (higher = better).
- GWA-T-12: MV-Transformer Θ_U = 0.289, T_U = 10.67, E_na = 2.98 (E_na for comparison: Bi-LSTM 1.32, LSTM 1.02, no autoscaling 1.00).
- Google Cluster: Θ_U = 0.340, T_U = 13.58, E_na = 2.78.
Real deployment: Minikube + kubectl (v1.28.2), Docker Desktop (v25.0.1) on WSL Ubuntu 22.04, 2 vCPUs / 8 GB RAM. Dashboard screenshots show real scale-up/scale-down events with 1-minute CPT.
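
The four forecast-accuracy metrics listed above have standard definitions, sketched here in plain Python. Note that MAPE in this context is mean absolute percentage error, not the control loop. The function name and sample values are illustrative assumptions, and the paper's tables may apply scaling or normalization that is not reproduced here.

```python
import math


def forecast_errors(actual, predicted):
    """Return MSE, MAE, RMSE and MAPE (in percent) for two equal-length series."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("series must be non-empty and of equal length")
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    n = len(actual)
    diffs = [p - a for a, p in zip(actual, predicted)]
    mse = sum(d * d for d in diffs) / n
    mae = sum(abs(d) for d in diffs) / n
    mape = 100.0 * sum(abs(d) / abs(a) for d, a in zip(diffs, actual)) / n
    return {"MSE": mse, "MAE": mae, "RMSE": math.sqrt(mse), "MAPE": mape}


# Made-up throughput values (KB/s): actual vs forecast.
errors = forecast_errors([100.0, 200.0], [110.0, 180.0])
```

For these two points the absolute errors are 10 and 20, so MAE is 15, MSE is 250, and both relative errors are 10%, giving a MAPE of about 10.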

Training & pre-training

Trained from scratch — no pretrained or foundation model.

The MV-Transformer is a task-specific encoder-decoder forecaster trained from random initialization directly on cloud workload traces (Google Cluster, GWA-T-12, Azure Functions 2019). There is no pretraining, transfer learning, foundation model, or zero-shot component — the whole network is learned end-to-end on these datasets.

Training configuration (paper Table 1):

Architecture/size: 2 encoder layers + 1 decoder layer, d_model 512, linear (feed-forward) dim 2048, 50 hidden units, 2 input features (after Pearson feature selection), dropout 0.2.
Optimization: batch size 64; loss/metrics MSE/MAE/RMSE/MAPE; trained for 10 / 50 / 100 epochs (error keeps dropping as epochs increase); 80/20 train/validation split. (The paper states no optimizer name or learning rate.)
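
The configuration above can be collected into one small object, as in the sketch below. The values come from the list above; the class name, field names, and the chronological (unshuffled) 80/20 split are assumptions, since this summary does not say how the split was drawn.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MVTransformerConfig:
    """Hyperparameters as summarized from the paper's Table 1 (field names are hypothetical)."""
    encoder_layers: int = 2
    decoder_layers: int = 1
    d_model: int = 512
    feedforward_dim: int = 2048
    hidden_units: int = 50
    input_features: int = 2   # after Pearson feature selection
    dropout: float = 0.2
    batch_size: int = 64
    epochs: int = 100         # the paper also reports 10 and 50


def chronological_split(samples, train_fraction=0.8):
    """Split time-ordered samples into (train, validation) without shuffling.

    Keeping time order is an assumption made for this sketch.
    """
    cut = int(len(samples) * train_fraction)
    return samples[:cut], samples[cut:]


config = MVTransformerConfig()
train, val = chronological_split(list(range(10)))
```

A frozen dataclass keeps the settings immutable during a run; the optimizer and learning rate are deliberately left out because the paper does not state them.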

Note: the only “pre-trained” wording anywhere in the paper refers to a cited related work (a diffusion model), not to the proposed MV-Transformer.

Strengths

End-to-end and real: actually scales VMs in Kubernetes/Docker, not just an offline forecast.
Multivariate — exploits cross-metric correlations that univariate models miss.
Efficient transformer: captures long-range dependencies with less memory than LSTM/Bi-LSTM and avoids vanishing gradients; the reported loss curves show no sign of overfitting.
Practical anti-thrash design: CPT and RRS directly address oscillation and abrupt scale-in, a common real-world autoscaler pain point.
Broad evaluation: three real datasets, several strong baselines, both forecast-accuracy and elasticity metrics.
Reproducible: public code and public datasets.

Limitations

Forecast horizon is short (essentially one step / next interval). It doesn’t demonstrate long-horizon planning, so a very sudden, unprecedented spike could still outrun it (the authors flag real-time spike/anomaly handling as future work).
The “multivariate” win rests on weak correlations. The selected feature pairs had low Pearson values (e.g., 0.096 on GWA-T-12, 0.015 on Google Cluster), so how much the second feature truly helps is unclear, and only 2 features are used despite the multivariate framing.
Reported error tables are noisy/inconsistent (e.g., some RMSE entries look mis-transcribed, and Google Cluster MSEs barely differ across models), which makes the accuracy margins hard to fully trust from the paper alone.
Dataset- and environment-specific. Authors note results depend on the particular GWA-T-12/Google datasets and on specific Docker/Kubernetes configs; generalization across cloud providers is unvalidated.
Horizontal only. No vertical scaling; workload_per_VM is a fixed admin-set capacity rather than learned.
No energy/cost-of-training analysis beyond the elasticity metrics; sustainability is left to future work.
Custom Plan logic, not standard HPA, so portability of the exact policies to other orchestrators isn’t shown.

Glossary

Autoscaling: automatically adding/removing compute resources to match demand.
Reactive vs proactive: react after a threshold is crossed vs forecast and scale ahead.
Horizontal scaling: change the number of instances (what this paper does). Vertical: change instance size.
MAPE-K loop: Monitor → Analyze → Plan → Execute over shared Knowledge; the self-adaptive-systems blueprint.
Transformer / self-attention: neural net where each time-step attends to all others to weigh relevance; no recurrence, trains fast, remembers long-range patterns.
Multivariate forecasting: predicting using several input series at once to capture their interactions.
Pearson correlation: statistic (−1 to +1) measuring linear co-movement of two signals; used here for feature selection.
Under-/over-provisioning (Θ_U/Θ_O): too few / too many resources; lower is better.
Elastic speedup (E_na): how much better the autoscaler performs vs no autoscaling; >1 means it helps.
CPT (Control Period Time): minimum wait between scaling actions to prevent oscillation/flapping.
RRS (Remove Resources Strategy): scale in gradually, removing only a fraction of surplus VMs.
LSTM / Bi-LSTM: recurrent neural nets used as baselines; slower and more memory-hungry on long sequences.
WGAN-GP transformer / DynEformer: rival transformer-based forecasters (baselines); largely univariate / edge-focused respectively.
Provisioning delay / cold start: the lag before a newly requested VM/container is ready to serve.

References

Kumar, B., Verma, A., & Verma, P. (2025). A multivariate transformer-based monitor-analyze-plan-execute (MAPE) autoscaling framework for dynamic resource allocation in cloud environment. Computing, 107(3). https://doi.org/10.1007/s00607-025-01426-x