AdaptiveAutoScaling
An encoder-only transformer forecaster with a cost-optimizing scaling controller, evaluated in simulation
TL;DR
Cloud applications need different amounts of computing power at different times: an online store is quiet at 3 a.m. and slammed during a flash sale. “Auto-scaling” means automatically adding or removing servers to match that demand. Most clouds today do this reactively: they wait until servers are already overloaded (e.g. CPU passes 70%), then react. But booting a new virtual machine (VM) takes time, so the system stays overloaded during that delay - users see slow responses, and the provider breaks its service promises. This paper builds a Transformer to forecast how busy the cloud will be a few steps into the future. Those forecasts feed a small “scaling controller” that adds or removes VMs ahead of time, so capacity is ready before the surge hits. Evaluated in simulation on real Google data-center traces, the approach cut service-level-agreement (SLA) violations by up to 28% and compute cost by 16% under bursty conditions, beating older forecasting methods (ARIMA, Prophet, LSTM).
The Problem (and why simple autoscaling isn’t enough)
Imagine a thermostat that only turns on the heat after the room is already freezing, and the furnace takes 10 minutes to warm up. You’d be cold for those 10 minutes every time. That is exactly how reactive, threshold-based auto-scaling works in clouds like AWS, Google Cloud, and Azure:
A rule says “if CPU > 70%, add a server.”
By the time CPU crosses 70%, demand is already high.
Booting a new VM has a cold-start / provisioning delay (it takes seconds to minutes to become useful).
During that gap - the paper calls it the “scaling gap” - the system is under-provisioned: requests queue up, latency spikes, and SLA violations occur (an SLA is a contractual promise like “99.9% of requests answered within 200 ms”).
The naive fix - just keep lots of extra servers running “just in case” - wastes money (over-provisioning). So there is a genuine tension: too few servers means SLA violations, too many means wasted spend.
The way out is proactive (predictive) scaling: forecast demand and scale before it arrives, hiding the boot delay. That only works if the forecast is good. Older forecasters fall short:
ARIMA / Holt-Winters (classical statistics): assume mostly linear, regular patterns; they break on the spiky, irregular bursts that real clouds show.
Prophet (a trend/seasonality decomposition tool): captures daily/weekly cycles but not sudden spikes.
LSTM / GRU (recurrent neural networks): better at sequences, but they process time steps one after another, which makes them slow and weak at connecting events that are far apart in time (long-range dependencies).
This paper argues a Transformer fixes those weaknesses, and - crucially - wires the forecast all the way into an actual scaling decision.
Background
Time-series forecasting. Given a history of numbers measured over time (e.g. CPU usage every 5 seconds), predict the next few values. That’s the core task here.
What a Transformer is, in one breath. A neural network built around self-attention. Instead of reading a sequence strictly left-to-right like an RNN/LSTM, self-attention lets every time step “look at” every other time step at once and decide which ones matter. Analogy: when planning today’s staffing for a coffee shop, you don’t just look at yesterday - you also recall “last year’s holiday rush” and weigh it heavily. Self-attention learns those weights automatically. Two more pieces you’ll see:
Embedding: each raw input vector is mapped into a richer numeric space the model works in (here, 128 numbers wide).
Positional encoding: because attention itself ignores order, the model adds a “timestamp signal” so it knows which measurement came first. This paper uses classic sinusoidal (sine-wave-based) positional encodings.
Encoder-only Transformer: this work uses just the encoder half of the original Transformer (good for reading a sequence and producing a prediction), not the decoder half used for generating text.
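To make the “every time step looks at every other time step” idea concrete, here is a minimal single-head sketch of scaled dot-product self-attention in plain Python. It is a simplification, not the paper's model: queries, keys and values are the raw inputs themselves (no learned projections, no multiple heads), and the tiny two-feature `history` is made up for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(seq):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys and values are the inputs themselves,
    with no learned projection matrices.
    """
    d = len(seq[0])
    outputs, weights = [], []
    for q in seq:
        # Similarity of this time step to every time step, scaled by sqrt(d).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        w = softmax(scores)
        weights.append(w)
        # Output is the attention-weighted average of all time steps.
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, seq)) for i in range(d)])
    return outputs, weights

# Hypothetical normalized (cpu, memory) history: quiet, burst, quiet, burst.
history = [[0.2, 0.1], [0.9, 0.8], [0.3, 0.2], [0.95, 0.85]]
outputs, attn = self_attention(history)
```

In this toy input the last (bursty) step puts more weight on the earlier burst than on the quiet steps, which is the behaviour the coffee-shop analogy describes.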
MAPE-K loop (the standard blueprint for self-managing systems): Monitor (collect metrics) -> Analyze (understand / forecast) -> Plan (decide the action) -> Execute (apply it), all sharing Knowledge. Keep this in mind; we map the paper onto it later.
Horizontal vs vertical scaling. Horizontal = add/remove whole instances (more VMs). Vertical = give one instance more CPU/RAM. This paper does horizontal scaling (changing the number of VMs).
Contribution in Simple Terms
The genuinely useful idea is not “a Transformer” by itself - Transformers for time series already existed. The contribution is an end-to-end pipeline that:
Uses a Transformer to forecast multi-dimensional cloud workload (not just one metric, but several at once: CPU, memory, request rate, queue length).
Feeds those forecasts directly into a scaling controller that turns a predicted load into a concrete decision: “run this many VMs next.”
Wraps the decision in a small optimization that balances three costs - missing the SLA, paying for idle servers, and scaling too jumpily.
Keeps the whole thing lightweight enough to imagine deploying in a real cloud.
The authors position this as among the first works to demonstrate Transformer-driven proactive auto-scaling on a real production trace (Google Cluster Trace), rather than only reporting forecast accuracy on a toy dataset and stopping there. In plain terms: they close the loop from “good prediction” to “good resource decision.”
How It Works, Step by Step
The system has four stages: data -> forecast -> decide -> act.
1. Collect raw metrics (Monitor).
Monitoring agents in the data center stream multi-dimensional workload measurements: CPU utilization, memory usage, task/request arrival rate, and queue length. Each time point is a vector x_t (several numbers, not one).
2. Preprocess.
Clean out incomplete/noisy log entries.
Fill missing values by forward interpolation, then smooth out anomalous spikes.
Min-max normalize every feature to the range [0, 1] so no metric dominates.
Aggregate everything into fixed 5-second intervals (matching how often real autoscalers make decisions).
Apply a sliding window of length L (the look-back window): the model sees the last L time steps {x_{t-L+1}, ..., x_t} to predict ahead.
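The preprocessing steps above can be sketched on a single feature as follows. This is an illustrative sketch under assumptions: “forward interpolation” is read here as a plain forward fill, the helper names are hypothetical, and the real pipeline works on four features at once rather than one.

```python
def forward_fill(values):
    """Replace None with the most recent observed value.

    A leading None stays None because there is nothing to carry forward.
    """
    filled, last = [], None
    for v in values:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled

def min_max(values):
    """Scale values linearly into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def sliding_windows(series, L, tau=1):
    """Pair each L-step window ending at t with the value at t + tau."""
    pairs = []
    for end in range(L, len(series) - tau + 1):
        pairs.append((series[end - L:end], series[end + tau - 1]))
    return pairs

# Hypothetical CPU readings (percent) at 5-second intervals, with two gaps.
cpu = [40.0, None, 60.0, 80.0, None, 100.0]
filled = forward_fill(cpu)
scaled = min_max(filled)
pairs = sliding_windows(scaled, L=3, tau=1)
```

In practice the min and max would normally be taken from the training split only, so that the normalization does not leak information from later data.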
3. Embed + add position.
Each input vector is projected into a 128-dimensional embedding (z_i = W_e x_i + b_e), then a sinusoidal positional encoding is added (h_i = z_i + p_i) so the model knows the time order.
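A small sketch of this step, assuming the standard sinusoidal formula from the original Transformer (sine on even dimensions, cosine on odd ones). The weights `W_e` are random placeholders standing in for learned values, and the 12-step `window` is invented for the example.

```python
import math
import random

D_MODEL = 128      # embedding width stated in the text
N_FEATURES = 4     # CPU, memory, arrival rate, queue length

def positional_encoding(length, d_model):
    """Sinusoidal encodings: sin on even dimensions, cos on odd dimensions."""
    pe = []
    for pos in range(length):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        pe.append(row)
    return pe

def embed(x, W_e, b_e):
    """z = W_e x + b_e for a single input vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_e, b_e)]

rng = random.Random(0)
# Placeholder weights; in the real model these are learned during training.
W_e = [[rng.uniform(-0.1, 0.1) for _ in range(N_FEATURES)] for _ in range(D_MODEL)]
b_e = [0.0] * D_MODEL

# Hypothetical look-back window of 12 normalized metric vectors.
window = [[rng.random() for _ in range(N_FEATURES)] for _ in range(12)]
pe = positional_encoding(len(window), D_MODEL)

# h_i = z_i + p_i
h = [[z + p for z, p in zip(embed(x, W_e, b_e), pe_row)]
     for x, pe_row in zip(window, pe)]
```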
4. Transformer forecasting (Analyze).
The sequence passes through an encoder-only Transformer: 4 layers, each with multi-head self-attention (8 heads) followed by a position-wise feed-forward network (inner size 512), plus layer normalization and dropout. Multi-head attention means the model runs several attention “views” in parallel and concatenates them - one head might track the daily trend while another watches for sudden bursts. A final linear “prediction head” outputs the forecast:
x̂_{t+τ} = f_θ(X_t) - the predicted workload τ steps into the future (multi-step, up to a horizon H).
Training detail: Adam optimizer, learning rate 1e-4, batch size 64, up to 100 epochs with early stopping; loss is multi-step mean squared error (sum of squared differences between predicted and actual over the horizon).
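The loss and the stopping rule can be illustrated in a few lines. The exact normalization of the paper's multi-step loss is not given here, so this sketch assumes a per-step mean over features, summed over the horizon; the `patience` value is likewise an assumption, since the text only says “early stopping”.

```python
def multi_step_mse(pred, actual):
    """Sum over horizon steps of the mean squared error across features.

    pred and actual are lists of H steps, each a list of feature values.
    """
    total = 0.0
    for p_step, a_step in zip(pred, actual):
        total += sum((p - a) ** 2 for p, a in zip(p_step, a_step)) / len(p_step)
    return total

def early_stop_epoch(val_losses, patience=5):
    """Return the epoch index at which training would stop.

    Stops once `patience` epochs have passed without a new best loss.
    """
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch
    return len(val_losses) - 1

# Hypothetical two-step forecast over two features.
pred = [[0.5, 0.5], [0.6, 0.4]]
actual = [[0.5, 0.7], [0.8, 0.4]]
loss = multi_step_mse(pred, actual)

# Hypothetical validation curve: best at epoch 2, then no improvement.
stop_at = early_stop_epoch([0.9, 0.5, 0.4, 0.41, 0.42, 0.43], patience=3)
```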
5. Map forecast to required capacity (Plan, part A).
A function g(·) converts the predicted load into a target number of VMs/containers: c_{t+1} = g(x̂_{t+τ}). (Conceptually: “this much predicted load needs about this many servers given each server’s capacity.”)
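The summary does not spell out g(·), so the following is only one plausible form: divide the predicted load by a per-VM capacity and round up. The function name, the optional `headroom` safety margin and the `c_min` floor are assumptions for illustration, and the load is treated as a single scalar rather than the full multivariate forecast.

```python
import math

def required_vms(predicted_load, per_vm_capacity, headroom=0.0, c_min=1):
    """Hypothetical g(.): VMs needed to serve the predicted load.

    headroom adds a fractional safety margin (0.1 = 10% extra demand).
    """
    demand = predicted_load * (1.0 + headroom)
    return max(c_min, math.ceil(demand / per_vm_capacity))

# 950 requests/s predicted, each VM assumed to handle 100 requests/s.
vms_plain = required_vms(950, 100)
vms_padded = required_vms(950, 100, headroom=0.1)
vms_idle = required_vms(0, 100)
```

Rounding up rather than to the nearest integer errs on the side of avoiding SLA breaches, at the price of a little extra capacity.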
6. Choose the scaling action by optimization (Plan, part B).
The controller picks an action a_t (how many VMs to add (+) or remove (-)) that minimizes a combined cost:
cost = α · (SLA violation penalty)    <- being under-provisioned
     + β · (provisioning cost)        <- paying for VMs
     + γ · (size of the scaling move) <- penalize jumpy, frequent scaling
α punishes letting demand exceed capacity (SLA breaches).
β punishes running too many VMs (waste).
γ discourages constant flip-flopping (instability).
A dampening rule caps how much capacity can change in one step (|c_{t+1} - c_t| <= δ_max) to prevent oscillation (rapid scale-out/scale-in thrashing).
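One way to read the cost minimization and the dampening rule together is as a search over the few actions the cap allows. The sketch below assumes specific forms that the summary does not confirm: the SLA penalty is the capacity shortfall measured in VMs, the provisioning cost is the VM count, and the weights and limits are made-up defaults.

```python
def choose_action(c_t, predicted_demand, kappa, alpha=10.0, beta=1.0, gamma=0.5,
                  delta_max=3, c_min=1, c_max=50):
    """Pick the scaling action a_t with the lowest combined cost.

    Only actions with |a_t| <= delta_max are considered (dampening),
    and the resulting capacity must stay within [c_min, c_max].
    kappa is the load one VM can serve.
    """
    best_action, best_cost = 0, float("inf")
    for a in range(-delta_max, delta_max + 1):
        c_next = c_t + a
        if not c_min <= c_next <= c_max:
            continue
        # Assumed SLA penalty: unmet demand expressed in VM units.
        shortfall = max(0.0, predicted_demand - c_next * kappa) / kappa
        cost = alpha * shortfall + beta * c_next + gamma * abs(a)
        if cost < best_cost:
            best_action, best_cost = a, cost
    return best_action

scale_out = choose_action(c_t=5, predicted_demand=720, kappa=100)
scale_in = choose_action(c_t=5, predicted_demand=300, kappa=100)
capped = choose_action(c_t=5, predicted_demand=2000, kappa=100)
```

With these weights, a forecast of 720 against 5 VMs yields +3, a forecast of 300 yields -2, and a forecast of 2000 is still limited to +3 by the cap, so a very large surge takes several intervals to cover.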
7. Actuate (Execute). The cloud resource manager applies the action - boots or releases VMs - before the predicted surge. Because the forecast gave a head start, the new VMs finish their cold-start boot delay in time to be useful.
This forms a closed loop: prediction -> decision -> scaling, repeated every interval.
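The benefit of the head start can be seen in a toy discrete-time simulation with an enforced boot delay. This is not the authors' simulator: the demand trace is invented, per-VM capacity `kappa` and the delay are arbitrary, and a perfect look-ahead stands in for the Transformer forecast.

```python
import math

def simulate(demand, lookahead, boot_delay=2, kappa=100, start_vms=2):
    """Count steps where demand exceeds ready capacity.

    lookahead=0 mimics a reactive policy (sizes for the current load);
    lookahead=boot_delay mimics a proactive policy with a perfect forecast.
    """
    active = start_vms
    pending = {}  # step at which VMs become ready -> how many
    violations = 0
    for t, load in enumerate(demand):
        active += pending.pop(t, 0)          # VMs finishing their boot now
        if load > active * kappa:
            violations += 1
        # Size for the peak load visible within the look-ahead window.
        seen = max(demand[t:t + lookahead + 1])
        target = math.ceil(seen / kappa)
        in_flight = sum(pending.values())
        if target > active + in_flight:      # scale out, ready after the delay
            ready = t + boot_delay
            pending[ready] = pending.get(ready, 0) + target - active - in_flight
        elif target < active:                # scale in, effective next step
            active = target
    return violations

# Hypothetical trace: quiet, a three-step burst, quiet again.
demand = [150, 150, 150, 600, 600, 600, 150, 150]
reactive_violations = simulate(demand, lookahead=0)
proactive_violations = simulate(demand, lookahead=2)
```

On this trace the reactive policy is under-provisioned for the first two burst steps while its VMs boot, whereas the look-ahead policy has them ready when the burst starts. A real forecaster is imperfect, so its result would fall somewhere between the two.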
The whole algorithm (their pseudocode), in words:
Take the recent history window X_t and current VM count c_t.
x̂_{t+τ} = TransformerPredict(X_t) - forecast.
c_{t+1} = g(x̂_{t+τ}) - convert to needed capacity.
Compute the SLA-penalty and provisioning-cost terms.
Pick a_t that minimizes the combined cost.
Clamp the new capacity to a max (c_{t+1} = min(c_t + a_t, c_max)).
Return the scaling action a_t.
Inputs (what it consumes)
Multi-dimensional workload time series from monitoring agents, sampled at 5-second intervals:
CPU utilization
Memory usage
Task / request arrival rate
Queue length (queue depth / occupancy)
A look-back window of the last L time steps (sliding window) as the model’s context.
The current capacity c_t (how many VMs are running now), used by the scaling controller.
(Training only) historical labeled sequences split 70% train / 15% validation / 15% test, kept in time order to avoid leakage.
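A time-ordered split simply cuts the sequence at two points instead of shuffling it. A minimal sketch, with a hypothetical helper name and integer percentages to avoid floating-point rounding at the boundaries:

```python
def temporal_split(samples, train_pct=70, val_pct=15):
    """Split a time-ordered list into train/val/test without shuffling."""
    n = len(samples)
    i = n * train_pct // 100
    j = n * (train_pct + val_pct) // 100
    return samples[:i], samples[i:j], samples[j:]

# Stand-in for 200 time-ordered samples (the integers play the role of timestamps).
train, val, test = temporal_split(list(range(200)))
```

Because every validation sample comes after every training sample, and every test sample after those, the model is never evaluated on data older than what it trained on.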
So: a multivariate history in, not a single CPU line.
Outputs (what it produces)
Primary model output: a multi-step forecast of future workload x̂_{t+τ} for horizons τ = 1, 2, ..., H (the next several 5-second steps of CPU/memory/request-rate/queue-length).
Derived planning output: a target VM count c_{t+1} for the next interval.
Final actionable output: a scaling action a_t (a signed integer: positive = scale out / add VMs, negative = scale in / remove VMs, zero = do nothing).
In short: it outputs both the forecast and the concrete “run N VMs” decision that follows from it.
How It Fits the Autoscaling Framework (MAPE-K)
This paper touches the full MAPE-K loop, but its heart is Analyze + Plan, and it makes scaling proactive rather than reactive.
Monitor: monitoring agents collect the multivariate metrics.
Analyze: the Transformer lives here - it forecasts upcoming demand. This is the paper’s main contribution.
Plan: the scaling controller maps the forecast to a VM count and runs the α/β/γ cost optimization (plus damping) to choose the action.
Execute: the cloud resource manager applies the action, subject to realistic VM boot delays.
Proactive vs reactive: Firmly proactive/predictive - the entire point is to scale before load arrives, hiding the cold-start gap. This is contrasted directly against reactive threshold scaling.
Horizontal vs vertical: Horizontal - it changes the number of VMs/containers (scale out / scale in). It does not do vertical (resizing a single VM’s CPU/RAM).
Does it drive a real scaler, or just predict? Unlike papers that only predict and hand off to a stock autoscaler (e.g. Kubernetes HPA), this work includes its own scaling controller and optimization, so the forecast genuinely produces the scaling action. Caveat for honesty: the actuation is evaluated in a custom discrete-time simulator (which models VM pools, per-VM capacity κ, SLA/cost penalties, and enforced boot delays), not on a live production cluster or a real Kubernetes/cloud autoscaler. So the prediction-to-decision loop is complete and tested end-to-end, but in simulation.
Evaluation (datasets & metrics, briefly)
Dataset: Google Cluster Trace (real production data center traces; bursty, diurnal, irregular). Features extracted: CPU, memory, task arrival rate, queue length; 5-second aggregation.
Baselines: ARIMA (statistical), Prophet (trend/seasonality), LSTM (recurrent deep net, 64 hidden units).
Metrics:
Forecast accuracy: MAE and RMSE (lower = better).
SLA violation rate (% of time steps where demand exceeded capacity).
Operational cost (total compute units provisioned).
Headline results: lowest MAE/RMSE among all methods (gap grows at longer horizons, where LSTM degrades); up to 28% fewer SLA violations and 16% lower compute cost under bursty conditions versus baselines/threshold scaling.
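For reference, the four metrics can be computed as below. MAE and RMSE follow their standard definitions; the SLA violation rate and operational cost follow the one-line descriptions above, with `unit_cost` an assumed parameter. All the numbers are made up for the example and are not results from the paper.

```python
import math

def mae(pred, actual):
    """Mean absolute error."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)

def rmse(pred, actual):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(pred, actual)) / len(actual))

def sla_violation_rate(demand, capacity):
    """Percentage of time steps where demand exceeded capacity."""
    breaches = sum(1 for d, c in zip(demand, capacity) if d > c)
    return 100.0 * breaches / len(demand)

def operational_cost(vm_counts, unit_cost=1.0):
    """Total compute units provisioned across all time steps."""
    return unit_cost * sum(vm_counts)

# Hypothetical values for illustration only.
pred, actual = [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 5.0, 8.0]
forecast_mae = mae(pred, actual)
forecast_rmse = rmse(pred, actual)
violation_pct = sla_violation_rate([100, 250, 300, 120], [200, 200, 300, 200])
cost = operational_cost([2, 2, 3, 2])
```

RMSE squares the errors before averaging, so it penalizes the single large miss here more heavily than MAE does, which matters for bursty workloads.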
Training & pre-training
Trained from scratch - no pretrained or foundation model.
The encoder-only Transformer is trained from random initialization directly on the Google Cluster Trace data; there is no pretraining, fine-tuning, foundation model, or zero-shot transfer. The baselines (ARIMA, Prophet, LSTM-64) are likewise fit from scratch on the same trace, so the comparison is apples-to-apples.
Architecture: d_model = 128, sinusoidal positional encodings, 4 encoder layers, 8 attention heads, feed-forward inner size 512, layer norm + dropout, and a final linear prediction head.
Optimization: Adam, initial learning rate 1e-4, batch size 64, up to 100 epochs with early stopping on validation loss; the loss is a multi-step MSE (L2) summed over the forecast horizon.
Data pipeline: four features (CPU, memory, task arrival rate, queue length) aggregated to 5-second intervals, min-max normalized to [0, 1], cut into sliding windows, and split 70/15/15 train/val/test with temporal ordering preserved (no leakage). The scaling controller is then evaluated in a discrete-time simulator.
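To put “lightweight” in numbers, the architecture listed above can be turned into a rough parameter count. This is back-of-the-envelope arithmetic, not a figure from the paper: it assumes a standard encoder layer with bias terms and two layer norms, a linear embedding from 4 features, and a linear head that emits one step of 4 features (`horizon=1`); the real head may differ.

```python
def encoder_layer_params(d_model, d_ff):
    """Parameters in one standard Transformer encoder layer (with biases)."""
    attention = 4 * (d_model * d_model + d_model)    # Q, K, V and output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    layer_norms = 2 * (2 * d_model)                  # scale and shift, two norms
    return attention + feed_forward + layer_norms

def model_params(n_features=4, d_model=128, d_ff=512, n_layers=4, horizon=1):
    """Embedding + encoder stack + linear head (head shape is an assumption)."""
    embedding = n_features * d_model + d_model
    head = d_model * (n_features * horizon) + n_features * horizon
    return embedding + n_layers * encoder_layer_params(d_model, d_ff) + head

per_layer = encoder_layer_params(128, 512)
total = model_params()
```

Under these assumptions each layer has 198,272 parameters and the whole model roughly 0.8 million. The number of attention heads does not change the count (8 heads of width 16 share the same projections), and sinusoidal positional encodings add no parameters.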
Strengths
Closes the loop. Goes beyond “we predicted well” to actually choosing VM counts via a clear cost model - the prediction feeds a real decision.
Right tool for the pattern. Self-attention captures long-range dependencies and sudden bursts that ARIMA/Prophet/LSTM miss; multi-head attention can track several patterns at once.
Multivariate input. Uses CPU, memory, request rate, and queue length together, not a single signal.
Stability built in. The α/β/γ optimization plus the damping cap explicitly discourage thrashing and over-provisioning - practical concerns, not just accuracy.
Realistic test conditions. Simulator enforces VM boot delays (cold starts), so the proactive advantage is measured fairly; real production trace used.
Modest, approachable model. A small, fully specified architecture (4 layers, 8 heads, d=128) rather than an oversized black box.
Limitations
Evaluated in simulation, not on a live cloud. The scaling controller runs against the authors’ own discrete-time simulator; no deployment on real Kubernetes/HPA or a production cluster is shown.
Single dataset. Only Google Cluster Trace; generalization to other workloads (web, IoT, ML training) is untested. The authors note accuracy depends on trace quality and granularity.
No exact numbers for accuracy. MAE/RMSE improvements are shown as figures/relative claims rather than a precise table in the text provided; the headline 28%/16% gains are stated without full per-baseline breakdowns in the body text.
Higher training cost. Transformers are heavier to train than ARIMA/Prophet/LSTM; the authors acknowledge this overhead (argued worthwhile only at large scale).
Static, offline model. Trained once; no online learning, so it may drift if workload patterns change. The authors flag online learning and reinforcement-learning-based decisions as future work.
Horizontal only. No vertical scaling, no container-level (pod) specifics, and the g(·) load->VM mapping is assumed rather than learned.
Short-horizon focus. Designed for short-term forecasting; very long-range planning isn’t its target.
Glossary
Auto-scaling: automatically adding/removing compute resources to match demand.
Reactive scaling: scale only after a threshold (e.g. CPU > 70%) is crossed - always lagging.
Proactive / predictive scaling: forecast demand and scale before it arrives, hiding boot delay.
Horizontal scaling (scale out/in): change the number of instances/VMs. (This paper.)
Vertical scaling (scale up/down): change CPU/RAM of one instance. (Not this paper.)
Scaling gap: the lag between demand rising and new capacity becoming ready.
Cold start / provisioning delay: time for a new VM to boot and become useful.
SLA (Service Level Agreement): a performance promise (e.g. latency/availability target); an SLA violation is breaking it.
Over-/under-provisioning: running too many (wasteful) / too few (SLA-breaking) resources.
Transformer: attention-based neural network; here used as a time-series forecaster.
Self-attention: mechanism letting each time step weigh every other time step’s relevance.
Multi-head attention: several attention computations in parallel, capturing different patterns.
Encoder-only Transformer: uses just the “reading” half of the original architecture.
Positional encoding: added signal telling the model the order of time steps (sinusoidal here).
Look-back window (L): how many past steps the model sees as input.
Horizon (H, τ): how many steps into the future it predicts.
MAE / RMSE: average prediction-error metrics (lower is better).
MAPE-K: Monitor-Analyze-Plan-Execute over shared Knowledge - the self-adaptive control loop.
ARIMA / Prophet / LSTM: baseline forecasters (statistical / decomposition / recurrent neural net).
Google Cluster Trace: a public real-world record of resource usage from Google’s data centers.
- G, C. N., O, B. C., R, J. K., Raj B N, P., Naik, P. N., & G, M. B. (2026). Transformer-Based Workload Prediction and Adaptive Auto-Scaling in Cloud Data Centers. 2026 IEEE International Conference for Convergence in Computing Technology (I3CTCON), 1–6. 10.1109/i3ctcon68242.2026.11507247