MV-Transformer

A multivariate encoder-decoder transformer running a full MAPE loop to proactively scale real VMs

Kumar et al. (2025) Citations


TL;DR

Cloud apps run on virtual machines (VMs) or containers that cost money. If you give the app too few, it slows down or drops requests; too many, and you waste money. Autoscaling is the automatic act of adding or removing those machines as demand rises and falls. This paper builds an autoscaler that does not just react to load that has already arrived: it predicts the near-future load and scales ahead of time. The prediction engine is a Transformer, used here as a short-term forecaster of cloud workload. Crucially, it is multivariate: instead of looking at one signal like CPU alone, it looks at several metrics together (CPU, memory, disk, network) and learns how they interact. The authors wrap this predictor inside the classic MAPE control loop (Monitor → Analyze → Plan → Execute) and wire it to real Docker + Kubernetes so it actually spins VMs up and down. On three public workload datasets it forecasts more accurately than LSTM, Bi-LSTM, WGAN-GP transformer, and DynEformer, and its autoscaling yields less under-provisioning and a roughly 2.98x “elastic speedup” over running with no autoscaler.


The Problem (and why simple autoscaling isn’t enough)

Imagine an online store the night before a big sale. Traffic is about to explode at midnight.

Two specific gaps this paper targets:

  1. Most prior predictors are univariate — they forecast from one signal (just CPU, or just request rate). But cloud metrics are entangled: a spike in network traffic usually drags memory and CPU with it. Looking at only one signal throws away information.

  2. The popular predictive models have weaknesses. LSTM/Bi-LSTM (recurrent neural nets) are slow on long sequences, hungry for memory, and suffer the vanishing gradient problem (they “forget” patterns from far back). Some transformer rivals (WGAN-GP transformer, DynEformer) are tuned for single-variable input and don’t fully exploit multiple metrics together.

So the research question is: can a multivariate transformer forecast cloud demand more accurately and cheaply, and can we plug it into a full, real-world autoscaling loop that actually moves VMs?


Background

Define the jargon once:


Contribution in Simple Terms

The paper’s genuine novelty is the combination, not any single brand-new neural net:

  1. A multivariate transformer (“MV-Transformer”) as the forecaster. It ingests several cloud metrics at once (CPU, memory, disk read/write, network throughput) and predicts the next value of the chosen target metric. Using self-attention, it captures long-term dependencies and cross-metric interactions while using less memory than LSTM/Bi-LSTM and sidestepping vanishing gradients/overfitting.

  2. Feature selection up front. Before forecasting, they use the Pearson correlation coefficient (a simple statistic measuring how linearly two signals move together, from −1 to +1) to pick the most informative pair of metrics. This trims noise, speeds up training, and reduces overfitting.

  3. A complete, working MAPE autoscaling framework around the predictor — not just an offline forecast. The forecast feeds a planning algorithm that computes how many VMs are needed, with guards against thrashing, and an execute stage that actually scales VMs.

  4. A real Kubernetes + Docker Desktop deployment. They don’t just simulate — they run the loop on Minikube/kubectl with real container scaling, and show it beats default Kubernetes behavior.
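
The feature-selection idea in item 2 can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the readings are made up, the helper names `pearson` and `pick_partner` are hypothetical, and ranking candidates by the absolute value of the coefficient is an assumption rather than something the summary above confirms.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant signal carries no linear information
    return cov / math.sqrt(var_x * var_y)

def pick_partner(metrics, target):
    """Return (name, r) of the metric most correlated with the target.

    Assumption: candidates are ranked by |r|, so a strong negative
    correlation counts as informative too.
    """
    scores = {
        name: pearson(series, metrics[target])
        for name, series in metrics.items()
        if name != target
    }
    best = max(scores, key=lambda name: abs(scores[name]))
    return best, scores[best]

# Made-up readings for illustration only (not taken from the paper's datasets).
metrics = {
    "network": [10.0, 12.0, 15.0, 20.0, 26.0],
    "memory":  [40.0, 44.0, 50.0, 60.0, 72.0],  # exactly 2 * network + 20
    "cpu":     [30.0, 28.0, 35.0, 31.0, 33.0],
    "disk":    [5.0, 5.0, 6.0, 4.0, 5.0],
}
partner, r = pick_partner(metrics, "network")
print(partner, round(r, 3))
```

With these toy numbers memory is a linear function of network throughput, so it is selected as the partner feature with a coefficient of essentially 1.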

In plain terms: they took a smarter, multi-signal forecaster and embedded it in a real control loop that proactively right-sizes a cloud cluster.


How It Works, Step by Step

  1. Collect. A collector module continuously gathers resource-utilization data — CPU usage C(t), memory M(t), disk read/write D(t), network throughput N(t) — over a sliding time window [t−w, t].

  2. Monitor. These readings are stacked into a matrix X (rows = timestamps, columns = the four metrics) and stored. In the live deployment, metrics can come from Docker/Kubernetes built-ins or Prometheus; in the experiments they used default API-gathered data. The monitor also decides if an immediate (reactive) action is needed; otherwise it forwards data on for proactive analysis.

  3. Analyze — feature selection. Compute Pearson correlation between candidate metric pairs and keep the strongest pair. The chosen target metric to forecast is network throughput; the partner feature is the metric most correlated with it (e.g., on GWA-T-12 they pair network throughput with memory).

  4. Analyze — MV-Transformer forecast. The selected multivariate window is fed into the transformer:

    • The input has shape [Batch, Sequence length, Features].

    • Self-attention computes Query/Key/Value matrices and the attention output softmax(QKᵀ/√dk)·V, whose softmax weights determine which past time-steps matter most; multi-head attention does this several ways in parallel.

    • Architecture: an encoder (2 layers: masked multi-head attention → feed-forward → Add & Norm) and a decoder (1 layer: masked multi-head attention → cross-attention with encoder output → feed-forward → Add & Norm). Hidden size 512, 50 hidden units, dropout 0.2, ReLU/SoftMax activations.

    • Output: the predicted next-step value, N(t+1) = f(Xt) — i.e., the forecasted network throughput for the upcoming interval.

  5. Plan. Convert the forecast into a resource decision. Estimate VMs needed as:

    VMs(t+1) = predicted_workload(t+1) / workload_per_VM

    where workload_per_VM is each VM’s max handling capacity (admin-set). Compare VMs(t+1) to the current count and decide scale-up, scale-down, or hold. Two safety mechanisms:

    • CPT (Control Period Time): a minimum gap (~1 minute) between scaling actions to stop oscillation (rapid flip-flopping up/down that wastes resources).

    • RRS (Remove Resources Strategy): when scaling down, remove only a fraction of surplus VMs at a time (VMs_surplus = (VMs_t − VMs_{t+1}) × RRS), never dropping below VMs_min. This stabilizes the system and keeps headroom for sudden spikes.

  6. Execute. Issue the scale-up/scale-down command to the effector, which tells Docker/Kubernetes to actually create or remove VMs/containers, then schedules them into the cluster for the next interval. If a scaling action fails or is delayed, Execute talks back to Monitor to re-plan (fault tolerance). The loop repeats indefinitely.
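
The attention formula from step 4 can be made concrete with a small single-head sketch in plain Python. It is a teaching aid under stated assumptions, not the paper's model: there are no learned projection matrices (the toy window is used directly as Q, K, and V), no masking, and no multiple heads, and the helper names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def transpose(m):
    return [list(col) for col in zip(*m)]

def matmul(a, b):
    """Plain nested-list matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Returns (output, weights); weights[i][j] is how much time-step i
    attends to time-step j.
    """
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights

# Toy window: 3 time-steps x 2 features (values are made up).
# Assumption: identity projections, so Q = K = V = x.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = attention(x, x, x)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` sums to 1, so every output row is a weighted average of the value rows; a real model would additionally learn the projections that produce Q, K, and V.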

Non-transformer pieces to note: Pearson correlation (classical statistics) for feature selection; the analytic Plan algorithms (CPT + RRS) that turn a forecast into a discrete VM count; and Docker/Kubernetes as the actuation layer. There is no GAN, diffusion, or FFT in the proposed model itself (WGAN-GP, which uses adversarial generator/discriminator losses, appears only as a baseline it’s compared against).
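
The Plan step (VM estimate plus the CPT and RRS guards) can be sketched as a single decision function. Several details here are assumptions rather than facts from the paper: the estimate is rounded up to a whole VM, the surplus to remove is rounded down, the RRS value of 0.5 is arbitrary, and the names `PlanConfig` and `plan_step` are hypothetical.

```python
import math
from dataclasses import dataclass

@dataclass
class PlanConfig:
    workload_per_vm: float      # admin-set maximum capacity of one VM
    vms_min: int = 1            # never scale in below this floor
    rrs: float = 0.5            # fraction of surplus removed per scale-in (assumed value)
    cpt_seconds: float = 60.0   # minimum gap between scaling actions (~1 minute)

def plan_step(predicted_workload, current_vms, now, last_action_time, cfg):
    """Return (new_vm_count, action) for the next interval."""
    # CPT guard: too soon after the previous scaling action, so hold.
    if last_action_time is not None and now - last_action_time < cfg.cpt_seconds:
        return current_vms, "hold"

    # VMs(t+1) = predicted_workload(t+1) / workload_per_VM, rounded up (assumption).
    needed = max(cfg.vms_min, math.ceil(predicted_workload / cfg.workload_per_vm))

    if needed > current_vms:
        return needed, "scale_up"

    if needed < current_vms:
        # RRS: remove only a fraction of the surplus, rounded down (assumption).
        surplus = math.floor((current_vms - needed) * cfg.rrs)
        target = max(cfg.vms_min, current_vms - surplus)
        if target == current_vms:
            return current_vms, "hold"
        return target, "scale_down"

    return current_vms, "hold"

cfg = PlanConfig(workload_per_vm=100.0, vms_min=2)
print(plan_step(950.0, 4, now=120.0, last_action_time=0.0, cfg=cfg))
print(plan_step(200.0, 10, now=120.0, last_action_time=0.0, cfg=cfg))
print(plan_step(950.0, 4, now=30.0, last_action_time=0.0, cfg=cfg))
```

Scale-up jumps straight to the estimated count while scale-in is gradual, which mirrors the asymmetry described above: under-provisioning is treated as the costlier mistake.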


Inputs (what it consumes)


Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

This paper is one of the rare ones that implements the whole loop, not just the predictor:

| MAPE stage | What the paper does |
| --- | --- |
| Monitor | Collector + monitor modules gather CPU/mem/disk/net and user-request metrics into matrix X. |
| Analyze | Where the transformer lives. Pearson feature selection → MV-Transformer forecasts next-interval workload. |
| Plan | Converts forecast to VM count; applies CPT (anti-oscillation) and RRS (gradual scale-in) policies. |
| Execute | Effector issues commands; Docker/Kubernetes actually scales VMs; loops back to Monitor on failure. |

Analogy: Monitor = the store’s traffic counters; Analyze = the forecaster predicting the midnight rush; Plan = deciding “we need 12 cashiers, but add them gradually and don’t yank them the second the line dips”; Execute = actually clocking cashiers in and out.
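
To show how the four stages hand data to one another, here is a minimal loop skeleton. It is a sketch under heavy assumptions: the forecaster is a persistence stand-in (it simply repeats the last observed network throughput) rather than the MV-Transformer, the "cluster" is a dictionary instead of Docker/Kubernetes, the CPT and RRS guards are omitted, and all function names and readings are hypothetical.

```python
import math
from collections import deque

def monitor(window, reading):
    """Append the newest reading to the sliding window and return its contents."""
    window.append(reading)
    return list(window)

def analyze(history):
    """Stand-in forecaster: predict that the next value equals the last one.

    Assumption: this replaces the transformer purely to keep the sketch runnable.
    """
    return history[-1]["network"]

def plan_vms(forecast, workload_per_vm, vms_min):
    """Convert a workload forecast into a VM count (rounded up, with a floor)."""
    return max(vms_min, math.ceil(forecast / workload_per_vm))

def execute(cluster, target):
    """Pretend to scale: a real effector would call Docker/Kubernetes here."""
    cluster["vms"] = target
    return cluster["vms"]

def run_loop(readings, window_size=3, workload_per_vm=100.0, vms_min=1):
    window = deque(maxlen=window_size)  # sliding window [t-w, t]
    cluster = {"vms": vms_min}
    trace = []
    for reading in readings:
        history = monitor(window, reading)
        forecast = analyze(history)
        target = plan_vms(forecast, workload_per_vm, vms_min)
        trace.append(execute(cluster, target))
    return trace

# Made-up network-throughput readings, one per control interval.
readings = [{"network": v} for v in (80.0, 250.0, 410.0, 120.0)]
trace = run_loop(readings)
print(trace)
```

Swapping `analyze` for a trained multivariate model would leave the other three stages unchanged, which is the point of structuring the autoscaler as a MAPE loop.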


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch — no pretrained or foundation model.

The MV-Transformer is a task-specific encoder-decoder forecaster trained from random initialization directly on cloud workload traces (Google Cluster, GWA-T-12, Azure Functions 2019). There is no pretraining, transfer learning, foundation model, or zero-shot component — the whole network is learned end-to-end on these datasets.

Training configuration (paper Table 1):

Note: the only “pre-trained” wording anywhere in the paper refers to a cited related work (a diffusion model), not to the proposed MV-Transformer.


Strengths


Limitations


Glossary

References
  1. Kumar, B., Verma, A., & Verma, P. (2025). A multivariate transformer-based monitor-analyze-plan-execute (MAPE) autoscaling framework for dynamic resource allocation in cloud environment. Computing, 107(3). https://doi.org/10.1007/s00607-025-01426-x