AdaptiveAutoScaling

An encoder-only transformer forecaster with a cost-optimizing scaling controller, evaluated in simulation

G et al. (2026)


TL;DR

Cloud applications need different amounts of computing power at different times: an online store is quiet at 3 a.m. and slammed during a flash sale. “Auto-scaling” means automatically adding or removing servers to match that demand. Most clouds today do this reactively: they wait until servers are already overloaded (e.g. CPU passes 70%), then react. But booting a new virtual machine (VM) takes time, so the system stays overloaded during that delay - users see slow responses, and the provider breaks its service promises. This paper builds a Transformer to forecast how busy the cloud will be a few steps into the future. Those forecasts feed a small “scaling controller” that adds or removes VMs ahead of time, so capacity is ready before the surge hits. Tested on real Google data-center traces, the approach cut service-level-agreement (SLA) violations by up to 28% and compute cost by 16% under bursty conditions, beating older forecasting methods (ARIMA, Prophet, LSTM).


The Problem (and why simple autoscaling isn’t enough)

Imagine a thermostat that only turns on the heat after the room is already freezing, and the furnace takes 10 minutes to warm up. You’d be cold for those 10 minutes every time. That is exactly how reactive, threshold-based auto-scaling works in clouds like AWS, Google Cloud, and Azure:

The naive fix - just keep lots of extra servers running “just in case” - wastes money (over-provisioning). So there is a genuine tension: too few servers and the SLA breaks during surges; too many and money is wasted on idle capacity.

The way out is proactive (predictive) scaling: forecast demand and scale before it arrives, hiding the boot delay. That only works if the forecast is good. Older forecasters fall short:

This paper argues a Transformer fixes those weaknesses, and - crucially - wires the forecast all the way into an actual scaling decision.


Background

Time-series forecasting. Given a history of numbers measured over time (e.g. CPU usage every 5 seconds), predict the next few values. That’s the core task here.
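To make the task concrete, here is a minimal sketch of how a multivariate history could be cut into (history, future) training pairs. The function name `make_windows`, the window and horizon sizes, and the toy readings are all illustrative assumptions, not values from the paper.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def make_windows(
    series: Sequence[Vector], window: int, horizon: int
) -> List[Tuple[List[Vector], List[Vector]]]:
    """Slice a multivariate series into (history, future) pairs."""
    pairs = []
    # The last usable start index leaves room for `window` inputs
    # followed by `horizon` target steps.
    for start in range(len(series) - window - horizon + 1):
        history = list(series[start:start + window])
        future = list(series[start + window:start + window + horizon])
        pairs.append((history, future))
    return pairs


# Toy series (made up): (cpu, memory) at six consecutive time steps.
series = [(0.30, 0.50), (0.35, 0.52), (0.40, 0.55),
          (0.60, 0.58), (0.80, 0.60), (0.75, 0.61)]
pairs = make_windows(series, window=3, horizon=2)
```

With six samples, a window of 3 and a horizon of 2, this yields two training pairs; a series shorter than `window + horizon` yields none.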

What a Transformer is, in one breath. A neural network built around self-attention. Instead of reading a sequence strictly left-to-right like an RNN/LSTM, self-attention lets every time step “look at” every other time step at once and decide which ones matter. Analogy: when planning today’s staffing for a coffee shop, you don’t just look at yesterday - you also recall “last year’s holiday rush” and weigh it heavily. Self-attention learns those weights automatically. Two more pieces you’ll see:

MAPE-K loop (the standard blueprint for self-managing systems): Monitor (collect metrics) -> Analyze (understand / forecast) -> Plan (decide the action) -> Execute (apply it), all sharing Knowledge. Keep this in mind; we map the paper onto it later.

Horizontal vs vertical scaling. Horizontal = add/remove whole instances (more VMs). Vertical = give one instance more CPU/RAM. This paper does horizontal scaling (changing the number of VMs).


Contribution in Simple Terms

The genuinely useful idea is not “a Transformer” by itself - Transformers for time series already existed. The contribution is an end-to-end pipeline that:

  1. Uses a Transformer to forecast multi-dimensional cloud workload (not just one metric, but several at once: CPU, memory, request rate, queue length).

  2. Feeds those forecasts directly into a scaling controller that turns a predicted load into a concrete decision: “run this many VMs next.”

  3. Wraps the decision in a small optimization that balances three costs - missing the SLA, paying for idle servers, and scaling too jumpily.

  4. Keeps the whole thing lightweight enough to imagine deploying in a real cloud.

The authors position this as among the first works to demonstrate Transformer-driven proactive auto-scaling on a real production trace (Google Cluster Trace), rather than only reporting forecast accuracy on a toy dataset and stopping there. In plain terms: they close the loop from “good prediction” to “good resource decision.”


How It Works, Step by Step

The system has four stages: data -> forecast -> decide -> act.

1. Collect raw metrics (Monitor). Monitoring agents in the data center stream multi-dimensional workload measurements: CPU utilization, memory usage, task/request arrival rate, and queue length. Each time point is a vector x_t (several numbers, not one).

2. Preprocess.

3. Embed + add position. Each input vector is projected into a 128-dimensional embedding (z_i = W_e x_i + b_e), then a sinusoidal positional encoding is added (h_i = z_i + p_i) so the model knows the time order.
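A small sketch of this step, assuming the standard sinusoidal formula from the original Transformer design (the summary only says "sinusoidal", so the exact variant is an assumption). The weights, the input vector, and the 4-dimensional embedding are made up to keep the example readable; the paper uses 128 dimensions.

```python
import math
from typing import List


def sinusoidal_encoding(position: int, d_model: int) -> List[float]:
    """Standard sinusoidal positional encoding for a single position."""
    encoding = []
    for k in range(d_model):
        # Dimensions 2i and 2i+1 share one frequency: sin on even, cos on odd.
        angle = position / (10000 ** ((2 * (k // 2)) / d_model))
        encoding.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return encoding


def embed(x: List[float], W_e: List[List[float]], b_e: List[float]) -> List[float]:
    """Linear projection z = W_e x + b_e (one row of W_e per output dimension)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(W_e, b_e)]


d_model = 4                      # illustrative; the paper uses 128
x_i = [0.6, 0.4, 0.2]            # hypothetical (cpu, memory, request rate)
W_e = [[0.1, 0.0, 0.0],          # made-up weights, normally learned
       [0.0, 0.1, 0.0],
       [0.0, 0.0, 0.1],
       [0.1, 0.1, 0.1]]
b_e = [0.0, 0.0, 0.0, 0.0]

z_i = embed(x_i, W_e, b_e)                           # z_i = W_e x_i + b_e
p_i = sinusoidal_encoding(position=0, d_model=d_model)
h_i = [z + p for z, p in zip(z_i, p_i)]              # h_i = z_i + p_i
```

At position 0 the encoding alternates between 0 and 1, so only the odd dimensions of `h_i` are shifted; later positions get a different pattern, which is what lets the model tell time steps apart.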

4. Transformer forecasting (Analyze). The sequence passes through an encoder-only Transformer: 4 layers, each with multi-head self-attention (8 heads) followed by a position-wise feed-forward network (inner size 512), plus layer normalization and dropout. Multi-head attention means the model runs several attention “views” in parallel and concatenates them - one head might track the daily trend while another watches for sudden bursts. A final linear “prediction head” outputs the forecast: x̂_{t+τ} = f_θ(X_t) - the predicted workload τ steps into the future (multi-step, up to a horizon H).
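The core of each layer is scaled dot-product self-attention. The sketch below shows a single head in plain Python and, as a simplifying assumption, uses the inputs directly as queries, keys, and values; a real layer would first apply learned projection matrices and run 8 such heads in parallel. It is an illustration of the mechanism, not the authors' implementation.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(Q: Matrix, K: Matrix, V: Matrix) -> Tuple[Matrix, Matrix]:
    """Single-head scaled dot-product attention; returns (output, weights)."""
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        # Similarity of this time step to every time step, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, V))
                 for j in range(len(V[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights


# Three toy time steps; steps 0 and 2 look alike, step 1 differs.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
output, weights = self_attention(H, H, H)
```

In this toy input, step 0 attends equally to itself and to the similar step 2, and less to step 1, which is the "decide which ones matter" behaviour described above.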

Training detail: Adam optimizer, learning rate 1e-4, batch size 64, up to 100 epochs with early stopping; loss is multi-step mean squared error (the squared differences between predicted and actual values, aggregated over the forecast horizon).
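As an illustration of the loss, the sketch below averages squared errors over every horizon step and every metric. Whether the paper normalizes by the number of terms or keeps a plain sum is not stated here, so the averaging is an assumption; the two differ only by a constant factor. The function name and sample numbers are hypothetical.

```python
from typing import Sequence


def multi_step_mse(
    predicted: Sequence[Sequence[float]], actual: Sequence[Sequence[float]]
) -> float:
    """Mean squared error over all horizon steps and all metrics."""
    total, count = 0.0, 0
    for pred_step, true_step in zip(predicted, actual):
        for p, y in zip(pred_step, true_step):
            total += (p - y) ** 2
            count += 1
    if count == 0:
        raise ValueError("need at least one predicted value")
    return total / count


# Two forecast steps, two metrics each (made-up values).
predicted = [[0.5, 0.4], [0.7, 0.5]]
actual = [[0.6, 0.4], [0.9, 0.5]]
loss = multi_step_mse(predicted, actual)  # (0.01 + 0 + 0.04 + 0) / 4
```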

5. Map forecast to required capacity (Plan, part A). A function g(·) converts the predicted load into a target number of VMs/containers: c_{t+1} = g(x̂_{t+τ}). (Conceptually: “this much predicted load needs about this many servers given each server’s capacity.”)
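The summary does not spell out g(·), so the following is only one plausible form: divide the predicted load by a per-VM capacity and round up. The `headroom` and `min_vms` parameters are hypothetical additions for illustration.

```python
import math


def required_vms(
    predicted_load: float,
    per_vm_capacity: float,
    headroom: float = 0.0,
    min_vms: int = 1,
) -> int:
    """Hypothetical g(.): VMs needed to serve a predicted load."""
    if per_vm_capacity <= 0:
        raise ValueError("per_vm_capacity must be positive")
    # Optional safety margin on top of the forecast, then round up to whole VMs.
    needed = predicted_load * (1.0 + headroom) / per_vm_capacity
    return max(min_vms, math.ceil(needed))


# 950 requests/s at 100 requests/s per VM needs 10 VMs.
vms_plain = required_vms(950, 100)
# With 25% headroom, 1000 requests/s needs 13 VMs.
vms_padded = required_vms(1000, 100, headroom=0.25)
```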

6. Choose the scaling action by optimization (Plan, part B). The controller picks an action a_t (how many VMs to add (+) or remove (-)) that minimizes a combined cost:

cost = α · (SLA violation penalty)   <- being under-provisioned
     + β · (provisioning cost)       <- paying for VMs
     + γ · (size of the scaling move) <- penalize jumpy, frequent scaling

A dampening rule caps how much capacity can change in one step (|c_{t+1} - c_t| <= δ_max) to prevent oscillation (rapid scale-out/scale-in thrashing).
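Written compactly, the decision could be expressed as the constrained minimization below. The functional forms of the SLA penalty and provisioning cost are not given in this summary, so P_SLA and C_prov are placeholder symbols, and taking the "size of the scaling move" to be |a_t| is an assumption.

```latex
a_t^{*} = \arg\min_{a_t}\;
    \alpha \, P_{\mathrm{SLA}}\!\left(c_t + a_t,\ \hat{x}_{t+\tau}\right)
  + \beta \, C_{\mathrm{prov}}\!\left(c_t + a_t\right)
  + \gamma \, \lvert a_t \rvert
\qquad \text{subject to} \qquad
\lvert c_{t+1} - c_t \rvert \le \delta_{\max}
```

Raising α makes the controller scale out more eagerly, raising β makes it more frugal, and raising γ makes it change capacity less often.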

7. Actuate (Execute). The cloud resource manager applies the action - boots or releases VMs - before the predicted surge. Because the forecast gave a head start, the new VMs finish their cold-start boot delay in time to be useful.

This forms a closed loop: prediction -> decision -> scaling, repeated every interval.

The whole algorithm (their pseudocode), in words:

  1. Take the recent history window X_t and current VM count c_t.

  2. x̂_{t+τ} = TransformerPredict(X_t) - forecast.

  3. c_{t+1} = g(x̂_{t+τ}) - convert to needed capacity.

  4. Compute the SLA-penalty and provisioning-cost terms.

  5. Pick a_t that minimizes the combined cost.

  6. Clamp the new capacity to a max (c_{t+1} = min(c_t + a_t, c_max)).

  7. Return the scaling action a_t.
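The steps above can be sketched as a single function. Everything concrete here is an assumption made for illustration: `naive_trend_predict` is a trivial stand-in for the Transformer, the SLA penalty is taken to be the VM shortfall, the provisioning cost is the VM count, and the weights and limits are made up.

```python
import math
from typing import Callable, Sequence, Tuple


def naive_trend_predict(history: Sequence[float]) -> float:
    """Stand-in for TransformerPredict: last value plus the last change."""
    return history[-1] + (history[-1] - history[-2])


def scaling_step(
    history: Sequence[float],
    current_vms: int,
    predict: Callable[[Sequence[float]], float],
    per_vm_capacity: float,
    alpha: float = 10.0,
    beta: float = 1.0,
    gamma: float = 0.5,
    delta_max: int = 3,
    c_max: int = 20,
) -> Tuple[int, int]:
    """One control interval; returns (action a_t, resulting VM count)."""
    predicted_load = predict(history)                      # step 2
    target = math.ceil(predicted_load / per_vm_capacity)   # step 3
    best_action, best_cost = 0, float("inf")
    # Dampening: only moves of at most delta_max VMs are considered.
    for action in range(-delta_max, delta_max + 1):
        vms = current_vms + action
        if vms < 1 or vms > c_max:
            continue
        sla_penalty = max(0, target - vms)   # step 4: assumed shortfall penalty
        provisioning = vms                   # step 4: assumed per-VM cost
        cost = alpha * sla_penalty + beta * provisioning + gamma * abs(action)
        if cost < best_cost:                 # step 5
            best_action, best_cost = action, cost
    new_vms = min(current_vms + best_action, c_max)         # step 6
    return best_action, new_vms                             # step 7


ramp = scaling_step([400, 500, 650], 5, naive_trend_predict, 100)
flat = scaling_step([400, 400, 400], 5, naive_trend_predict, 100)
spike = scaling_step([400, 500, 1200], 5, naive_trend_predict, 100)
```

On the rising history the step adds 3 VMs to reach the 8 it forecasts it needs; on the flat history it releases one idle VM; on the spike it still adds only 3, because the dampening limit caps the move even though the forecast calls for far more.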


Inputs (what it consumes)

So: a multivariate history in, not a single CPU line.


Outputs (what it produces)

In short: it outputs both the forecast and the concrete “run N VMs” decision that follows from it.


How It Fits the Autoscaling Framework (MAPE-K)

This paper touches the full MAPE-K loop, but its heart is Analyze + Plan, and it makes scaling proactive rather than reactive.

Proactive vs reactive: Firmly proactive/predictive - the entire point is to scale before load arrives, hiding the cold-start gap. This is contrasted directly against reactive threshold scaling.

Horizontal vs vertical: Horizontal - it changes the number of VMs/containers (scale out / scale in). It does not do vertical (resizing a single VM’s CPU/RAM).

Does it drive a real scaler, or just predict? Unlike papers that only predict and hand off to a stock autoscaler (e.g. Kubernetes HPA), this work includes its own scaling controller and optimization, so the forecast genuinely produces the scaling action. Caveat for honesty: the actuation is evaluated in a custom discrete-time simulator (which models VM pools, per-VM capacity κ, SLA/cost penalties, and enforced boot delays), not on a live production cluster or a real Kubernetes/cloud autoscaler. So the prediction-to-decision loop is complete and tested end-to-end, but in simulation.
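To show what a discrete-time simulator with enforced boot delays can look like, here is a toy sketch. It is not the authors' simulator: the load trace, the capacity `KAPPA`, the two-step boot delay, and both policies are made up, and the "proactive" policy simply peeks at the future load as a perfect-foresight stand-in for a forecaster.

```python
import math
from typing import Callable, Dict, List


def simulate(
    load: List[float],
    policy: Callable[[int, List[float]], int],
    per_vm_capacity: float,
    boot_delay: int,
    initial_vms: int = 1,
) -> Dict[str, int]:
    active = initial_vms
    pending: List[int] = []   # one boot countdown per VM still starting up
    violations = 0
    vm_steps = 0
    for t, demand in enumerate(load):
        # Advance boot countdowns; finished VMs join the active pool.
        pending = [remaining - 1 for remaining in pending]
        active += sum(1 for remaining in pending if remaining <= 0)
        pending = [remaining for remaining in pending if remaining > 0]
        if demand > active * per_vm_capacity:
            violations += 1                  # under-provisioned this step
        vm_steps += active + len(pending)    # booting VMs are billed too
        desired = policy(t, load)
        total = active + len(pending)
        if desired > total:
            pending.extend([boot_delay] * (desired - total))  # scale out, delayed
        elif desired < active:
            active = max(1, desired)         # scale in takes effect immediately
    return {"violations": violations, "vm_steps": vm_steps}


KAPPA = 100.0      # assumed per-VM capacity
BOOT_DELAY = 2     # assumed boot delay, in simulation steps


def reactive(t: int, load: List[float]) -> int:
    return math.ceil(load[t] / KAPPA)        # sizes for the load seen right now


def proactive(t: int, load: List[float]) -> int:
    window = load[t:t + BOOT_DELAY + 1]      # perfect-foresight lookahead
    return math.ceil(max(window) / KAPPA)


trace = [100, 100, 100, 400, 400, 400, 100, 100]
reactive_result = simulate(trace, reactive, KAPPA, BOOT_DELAY)
proactive_result = simulate(trace, proactive, KAPPA, BOOT_DELAY)
```

On this toy trace the reactive policy is under-provisioned for the first two steps of the surge while its VMs boot, whereas the lookahead policy has capacity ready in time. It also uses more VM-steps here, so the toy illustrates only the boot-delay effect and says nothing about the cost figures reported in the paper.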


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch - no pretrained or foundation model.

The encoder-only Transformer is trained from random initialization directly on the Google Cluster Trace data; there is no pretraining, fine-tuning, foundation model, or zero-shot transfer. The baselines (ARIMA, Prophet, LSTM-64) are likewise fit from scratch on the same trace, so the comparison is apples-to-apples.


Strengths


Limitations


Glossary

References
  1. G, C. N., O, B. C., R, J. K., Raj B N, P., Naik, P. N., & G, M. B. (2026). Transformer-Based Workload Prediction and Adaptive Auto-Scaling in Cloud Data Centers. 2026 IEEE International Conference for Convergence in Computing Technology (I3CTCON), 1–6. doi:10.1109/i3ctcon68242.2026.11507247