PredictiveAutoscaling - Luca's Research @ PEG

Shrestha & Tuz Sabiha (2025)

TL;DR

Cloud apps run on a variable number of “instances” (here, Kubernetes pods) that get added or removed as traffic rises and falls. The usual way to do this is reactive: wait until CPU usage crosses a threshold, then scramble to add pods. The problem is that booting a new pod takes time (the “cold start”), so by the time the new capacity is ready, users have already been hit with slow responses. This paper builds a proactive autoscaler: a small transformer neural network looks at the last several minutes of incoming request rate and forecasts the request rate one minute into the future. A controller turns that forecast into a target that tells KEDA (a Kubernetes autoscaler) how many pods of the ingress controller (the cluster’s front-door traffic router) to run, before the traffic actually arrives. Trained on real Alibaba Cloud trace data and tested on a live Kubernetes cluster with simulated traffic surges, the system reacted to spikes much faster than the standard Kubernetes autoscaler (HPA) and kept response times stable and low (~16 ms), though it did not beat HPA on every single metric.

The Problem (and why simple autoscaling isn’t enough)

Autoscaling = automatically adjusting how many resources an app gets so it can serve demand without (a) falling over when busy or (b) burning money on idle capacity when quiet.

Too few resources (under-provisioning) → slow responses, dropped requests, broken SLAs (Service Level Agreements — the promises made to users about speed/uptime).
Too many resources (over-provisioning) → you pay for servers doing nothing.

Kubernetes’ built-in tool, the Horizontal Pod Autoscaler (HPA), is reactive: you set a rule like “keep average CPU at 70%,” and when CPU climbs past it, HPA adds pods. Two issues:

It always lags. It only reacts after load has already risen. Spinning up a new pod takes time (cold start / provisioning delay), so there is a window where the app is overloaded and users suffer.
Thresholds are hard to set. Picking the right CPU/memory threshold needs expertise and still may not match real, shifting traffic patterns.
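To make the reactive rule concrete, here is a minimal sketch of the replica calculation documented for HPA: desired replicas = ceil(current replicas × current metric / target metric). It deliberately leaves out HPA's tolerance band and stabilization windows, and the floor of one replica is an assumption standing in for a configured minReplicas.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric).

    Simplified: no tolerance band, no stabilization window, and a floor of
    one replica assumed in place of a configured minReplicas.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return max(1, math.ceil(current_replicas * current_metric / target_metric))


# 4 pods averaging 90% CPU against a 70% target -> scale out to 6 pods.
print(hpa_desired_replicas(4, 90.0, 70.0))  # 6
```

The input is the metric as measured right now, so the rule can only respond once load has already risen.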

To avoid the lag, operators often just keep extra pods running all the time (over-provisioning) — which wastes money.

The fix this paper pursues: stop reacting, start predicting. If you can forecast that a surge is coming a minute from now, you can add pods ahead of time so capacity is ready when the traffic hits. This is proactive (predictive) autoscaling.

Analogy: an online store the night before a big sale. Reactive scaling is like hiring extra cashiers only after the checkout line is already out the door. Predictive scaling is like looking at last year’s sale data, seeing the rush is coming at 9 a.m., and having the extra cashiers clocked in at 8:55.
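The reactive/proactive difference can be illustrated with a toy simulation. This is a sketch under strong assumptions, not the paper's experiment: time advances in whole minutes, every pod takes exactly one minute to become ready, the proactive variant has a perfect forecast, and the demand trace is made up. Only the figure of 70 RPS per pod is borrowed from the paper.

```python
def overloaded_minutes(demand_rps, rps_per_pod, proactive,
                       startup_minutes=1, initial_pods=1):
    """Count the minutes in which ready capacity is below demand.

    A pod count decided at minute t only takes effect at t + startup_minutes.
    Reactive sizing uses the demand seen at minute t; proactive sizing uses
    the demand expected at t + startup_minutes (a perfect forecast here).
    """
    ready_at = {}          # minute -> pod count that becomes ready then
    pods = initial_pods
    overloaded = 0
    last = len(demand_rps) - 1
    for t, demand in enumerate(demand_rps):
        pods = ready_at.pop(t, pods)
        if pods * rps_per_pod < demand:
            overloaded += 1
        look = min(t + startup_minutes, last) if proactive else t
        needed = max(1, -(-demand_rps[look] // rps_per_pod))  # ceiling division
        ready_at[t + startup_minutes] = needed
    return overloaded


demand = [50, 50, 300, 300, 300, 50]   # made-up RPS per minute with one surge
print(overloaded_minutes(demand, 70, proactive=False))  # 1
print(overloaded_minutes(demand, 70, proactive=True))   # 0
```

The reactive run is overloaded during the first minute of the surge, which is exactly the cold-start window; the proactive run has the pods ready when the surge begins.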

Background

A few terms, defined once:

Pod — the smallest deployable unit in Kubernetes (the dominant system for running containerized apps). Think of a pod as one running copy/instance of a service. “Scale out” = add more pods; “scale in” = remove pods. This is horizontal scaling.
Ingress controller — the “front door” of a Kubernetes cluster. All incoming web requests hit it first, and it routes each one to the right internal service (it acts as a load balancer). This paper scales the ingress controller itself, which is unusual — most prior work scales the application pods. Scaling the front door gives centralized control over all incoming traffic.
HPA (Horizontal Pod Autoscaler) — Kubernetes’ standard, reactive, threshold-based pod scaler.
KEDA (Kubernetes Event-Driven Autoscaler) — a more flexible add-on autoscaler. Unlike HPA it can scale based on many custom signals (e.g. a metric from a database) and can even scale to zero pods when idle. In this paper, KEDA is the component that actually executes the scaling. It still drives the underlying HPA, but lets the authors feed it a custom, prediction-driven target.
Transformer — a type of neural network originally built for language (it powers modern NLP and LLMs). Its key trick is attention: when processing a sequence, each element can “look at” and weigh every other element to decide what matters. That makes transformers good at sequences — including time series (a sequence of numbers over time, like “requests per second, minute by minute”). Here the transformer is used purely as a time-series forecaster of future workload.
RPS (requests per second) — the workload signal being predicted.
P99 latency — the 99th-percentile response time. If P99 is 100 ms, 99% of requests finished in under 100 ms. It captures the “worst typical” experience, not just the average — a key quality metric.
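As a quick illustration of the P99 definition, the sketch below computes a percentile with the nearest-rank method. Monitoring systems often estimate percentiles from histograms instead, so treat this as a way to see the idea rather than as what Prometheus or Grafana compute. The latency samples are invented.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Smallest sample such that at least pct% of samples are <= it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# 100 invented request latencies in ms: mostly fast, two slow outliers.
latencies_ms = [12.0] * 98 + [40.0, 900.0]

print(percentile_nearest_rank(latencies_ms, 50))  # 12.0
print(percentile_nearest_rank(latencies_ms, 99))  # 40.0
```

The median hides both outliers, while P99 reports the 40 ms request and still ignores the single worst one.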

Contribution in Simple Terms

The genuinely new pieces:

Use a transformer to forecast cloud workload. Transformers are famous in language tasks but, the authors note, are underexplored for cloud workload prediction. They apply one as a short-horizon forecaster of request rate.
Scale the ingress controller, not the app. Almost all prior predictive-autoscaling work scales application pods. Scaling the ingress controller (the centralized traffic gateway) is the unusual, identified gap they target.
Wire the forecast into a real, working autoscaler (KEDA) on a real Kubernetes cluster. This is not just a prediction-accuracy paper — they build the full loop (monitor → predict → compute target → scale) and load-test it live against the standard baselines (no autoscaling, HPA, KEDA-with-fixed-threshold).

In one sentence: they turn a transformer’s one-minute-ahead request-rate forecast into a live, proactive scaling target for the Kubernetes ingress controller.

What they honestly do not claim: that it beats HPA everywhere. HPA remained a strong, balanced baseline. The transformer approach’s win is speed of reaction to surges and promise for volatile traffic, not across-the-board superiority.

How It Works, Step by Step

The system has three blocks: the Application block (the app + ingress controller serving traffic), the Autoscaling-Controller (AC) (the brain: monitor + predictor + calculator), and the Autoscaler (KEDA, the muscle that applies the change).

Training the transformer (offline, done once):

Get data. Use the public Alibaba Cloud trace logs — real traffic from a large microservice-based e-commerce platform. Specifically the MS_MCR_RT_Table, which holds call/request and response-rate metrics (24 files, each = 30 minutes of trace; fields include timestamp, microservice name, instance ID, metric type, call-rate value).
Preprocess. Filter to one specific microservice; keep only timestamp and value (requests per second); sort chronologically; format as a univariate time series (one number per time step — just RPS).
Train. Build the transformer with the Darts Python time-series library. Split data 70% train / 30% validation. Key settings: input_chunk_length=12 (look back 12 steps), output_chunk_length=4 (predict 4 steps ahead, which corresponds to one minute ahead), d_model=16, nhead=8 (8 attention heads), 2 encoder + 2 decoder layers, dim_feedforward=128, dropout=0.1, batch_size=32, n_epochs=200. (These are Darts’ recommended defaults; the authors note that tuning them could improve results.)
Evaluate forecast accuracy on the validation set with MAE, RMSE, and MAPE (standard error metrics; lower = better).
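The data-shaping part of the steps above can be sketched in plain Python. This is an illustration, not the authors' code: the row keys (`timestamp`, `service`, `value`) are hypothetical stand-ins for the trace columns, and Darts builds the (12 in, 4 out) training windows internally, so `sliding_windows` only shows what those windows look like.

```python
def to_univariate_series(rows, service):
    """Filter to one microservice, sort by time, keep only the RPS values."""
    picked = sorted((r["timestamp"], r["value"])
                    for r in rows if r["service"] == service)
    return [value for _, value in picked]


def split_train_val(series, train_percent=70):
    """Chronological split: the first 70% trains, the rest validates."""
    cut = len(series) * train_percent // 100
    return series[:cut], series[cut:]


def sliding_windows(series, input_len=12, output_len=4):
    """(past, future) pairs: input_len steps in, output_len steps out."""
    last_start = len(series) - input_len - output_len
    return [(series[i:i + input_len],
             series[i + input_len:i + input_len + output_len])
            for i in range(last_start + 1)]


# Made-up rows, deliberately out of order and mixed with another service.
rows = [{"timestamp": t, "service": "svc-a", "value": float(t)}
        for t in reversed(range(20))]
rows.append({"timestamp": 5, "service": "svc-b", "value": 999.0})

series = to_univariate_series(rows, "svc-a")
train, val = split_train_val(series)
print(len(train), len(val))          # 14 6
print(len(sliding_windows(series)))  # 5
```

The split is chronological rather than shuffled, so validation always uses data that comes after the training period.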

Running live (the proactive loop, repeating every minute):

Monitor. Prometheus continuously collects the ingress controller’s incoming request-rate metric and stores it as a time series. (Grafana dashboards visualize CPU, memory, pod count, response rate, latency.)
Predict. A Kubernetes CronJob fires every minute, running a Python script that pulls the past hour of request-rate metrics from Prometheus, feeds the recent window to the trained transformer, gets the one-minute-ahead RPS forecast, and writes it to predicted_values.csv.
Calculate the scaling target (Algorithm 1). A script converts the forecast into a KEDA target:
- Constants: max_rps_per_pod = 70 (capacity of one pod, found by benchmarking), desired_utilization = 0.70.
- max_predicted_rps = max(predicted values).
- required_pods = max_predicted_rps / max_rps_per_pod.
- Query Prometheus for current_pods.
- scale_value = (current_pods / required_pods) × desired_utilization.
Execute. The script writes scale_value into the value field of KEDA’s ScaledObject YAML and applies it with kubectl apply. KEDA then drives the underlying HPA to add/remove ingress pods to match — before the predicted traffic arrives.
Loop. One minute later, the CronJob fires again and the Predict, Calculate, and Execute steps repeat with fresh data, so the target continuously re-adjusts to the latest forecast and the actual observed load (runtime feedback).
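The arithmetic of Algorithm 1, as summarized in the Calculate step, fits in a few lines. The sketch below follows the formula exactly as written in these notes; the function name and the guards against empty or non-positive forecasts are additions for illustration, and the forecast values are made up.

```python
MAX_RPS_PER_POD = 70.0       # per-pod capacity from the paper's benchmarking
DESIRED_UTILIZATION = 0.70   # target utilization from the paper


def compute_scale_value(predicted_rps, current_pods,
                        max_rps_per_pod=MAX_RPS_PER_POD,
                        desired_utilization=DESIRED_UTILIZATION):
    """scale_value = (current_pods / required_pods) * desired_utilization."""
    if not predicted_rps:
        raise ValueError("need at least one predicted value")
    max_predicted_rps = max(predicted_rps)
    required_pods = max_predicted_rps / max_rps_per_pod
    if required_pods <= 0:
        # Guard added for this sketch; the notes do not say how the
        # original script handles a zero forecast.
        raise ValueError("forecast must be positive")
    return (current_pods / required_pods) * desired_utilization


forecast = [210.0, 280.0, 350.0, 330.0]   # made-up 4-step forecast
print(f"{compute_scale_value(forecast, current_pods=2):.2f}")  # 0.28
print(f"{compute_scale_value(forecast, current_pods=5):.2f}")  # 0.70
```

With a peak forecast of 350 RPS, required_pods is 5. When the cluster already runs 5 pods the result equals desired_utilization, and it shrinks as the current pod count falls further below what the forecast requires.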

Note on architecture: the only neural network is the transformer (a standard encoder–decoder transformer for univariate time-series forecasting). There is no GAN, diffusion, FFT/frequency, or similarity-search component. Everything else (Prometheus, KEDA, the calculator) is conventional cloud/control plumbing.

Inputs (what it consumes)

Primary signal: the ingress controller’s incoming request rate (requests per second / RPS) as a univariate time series.
Lookback window: input_chunk_length = 12 time steps fed to the model; at inference the script pulls the past hour of metrics from Prometheus.
Training data: Alibaba Cloud MS_MCR_RT_Table trace logs (real e-commerce microservice call/response-rate data).
Runtime state for the calculator: current pod count (from Prometheus), plus the constants max_rps_per_pod = 70 and desired_utilization = 0.70.

It is univariate — only request rate drives the prediction. CPU/memory/latency are monitored for evaluation but are not inputs to the transformer.
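A rough sketch of how the past hour of request-rate samples could be fetched and trimmed to the model's 12-step window, using Prometheus's HTTP API (`/api/v1/query_range`). Several details are assumptions rather than facts from the paper: the base URL is a placeholder, the PromQL expression is one plausible way to get ingress RPS from the NGINX ingress controller's request counter, and the 15-second step is inferred from 4 output steps covering one minute. No network call is made here; only URL construction and response parsing are shown.

```python
import json
from urllib.parse import urlencode

# Assumed PromQL; the query the authors actually used is not given in the notes.
INGRESS_RPS_QUERY = "sum(rate(nginx_ingress_controller_requests[1m]))"


def build_range_query_url(base_url, promql, end_ts,
                          lookback_s=3600, step_s=15):
    """URL for a Prometheus range query covering the last lookback_s seconds."""
    params = {"query": promql, "start": end_ts - lookback_s,
              "end": end_ts, "step": step_s}
    return f"{base_url}/api/v1/query_range?{urlencode(params)}"


def latest_window(response_text, window=12):
    """Extract the most recent `window` samples from a matrix response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError("Prometheus query failed")
    result = payload["data"]["result"]
    if not result:
        return []
    values = [float(v) for _, v in result[0]["values"]]
    return values[-window:]


url = build_range_query_url("http://prometheus.example:9090",
                            INGRESS_RPS_QUERY, end_ts=1700003600)
print(url)
```

Prometheus returns sample values as strings inside `[timestamp, "value"]` pairs, which is why `latest_window` converts them to floats before handing them to a forecaster.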

Outputs (what it produces)

From the transformer: a forecast of request rate (RPS) one minute ahead (output_chunk_length = 4 steps), saved to predicted_values.csv.
From the calculator (Algorithm 1): a single scale_value — the target metric-per-pod that KEDA uses to decide how many ingress pods to run.
End result (after KEDA acts): a changed number of ingress-controller pods (horizontal scale-out / scale-in), applied proactively.

Horizon: one minute ahead. The forecast refreshes every minute via the CronJob.

How It Fits the Autoscaling Framework (MAPE-K)

This is a near-textbook MAPE-K loop, with the transformer living in the Analyze stage.

Monitor: Prometheus scrapes the ingress controller’s request rate into a time-series store.
Analyze: the transformer forecasts the next minute’s RPS — this is the heart of the contribution and what makes the whole loop proactive instead of reactive.
Plan: the Calculator / Algorithm 1 converts the forecast (plus current pods and capacity constants) into a concrete scaling target.
Execute: KEDA applies the target via the ScaledObject, driving the underlying HPA to add/remove ingress-controller pods.
Knowledge: historical metrics, the prediction file, and the capacity constants shared across stages.
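The stage mapping above can be written as a small skeleton in which each MAPE-K stage is a pluggable function. This is only a structural sketch: the `Knowledge` class, `run_cycle`, and the stand-in lambdas are hypothetical names for illustration, with Prometheus, the transformer, Algorithm 1, and KEDA being what would sit behind them in the real system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Knowledge:
    """Shared state: capacity constants plus the latest data and forecast."""
    max_rps_per_pod: float = 70.0
    desired_utilization: float = 0.70
    history: List[float] = field(default_factory=list)
    last_forecast: List[float] = field(default_factory=list)


def run_cycle(knowledge, monitor, analyze, plan, execute):
    """One pass of Monitor -> Analyze -> Plan -> Execute."""
    knowledge.history = monitor()                          # Monitor
    knowledge.last_forecast = analyze(knowledge.history)   # Analyze
    target = plan(knowledge.last_forecast, knowledge)      # Plan
    execute(target)                                        # Execute
    return target


# Stand-ins so the skeleton runs without a cluster or a trained model.
applied = []
k = Knowledge()
target = run_cycle(
    k,
    monitor=lambda: [100.0, 120.0],
    analyze=lambda history: [history[-1] * 2],   # naive placeholder forecast
    plan=lambda forecast, kn: max(forecast) / kn.max_rps_per_pod,
    execute=applied.append,
)
print(applied == [target], k.last_forecast)
```

Swapping the `analyze` stand-in for a real forecaster is the only change needed to turn a reactive loop into a proactive one, which is why the notes call the Analyze stage the heart of the contribution.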

Reactive vs proactive: Proactive. The system scales on forecast demand, so capacity is in place before the surge — directly addressing HPA’s lag and the cold-start delay.

Horizontal vs vertical: Horizontal (adds/removes pods of the ingress controller). No vertical (per-pod CPU/RAM) scaling.

Does it do the actuation itself, or feed a standard scaler? A bit of both. The transformer + calculator produce the decision; KEDA (event-driven, and ultimately HPA) performs the actual pod changes. So the novel part is the predictive signal and target, while actuation rides on standard Kubernetes autoscaling machinery — a clean, deployable design.

Evaluation (datasets & metrics, briefly)

Training data: Alibaba Cloud microservice trace logs (MS_MCR_RT_Table).
Forecast-accuracy metrics: MAE, RMSE, MAPE on a 30% validation split.
Live testbed: university private OpenStack cloud; production-grade Kubernetes via kubeadm (1 master: 16 GB/8 vCPU; 2 workers: 4 GB/2 vCPU each). App = Podinfo (lightweight Go microservice) behind an Nginx ingress controller. Monitoring via kube-prometheus-stack (Prometheus + Grafana). Load generated with Locust.
Baselines compared: (1) no autoscaling, (2) HPA at 70% CPU, (3) KEDA with a fixed threshold (value 0.877), (4) the proposed transformer method.
System metrics watched: CPU %, memory (MB), active pod count, P99 latency.
Key results:
- Experiment 1 (low rate, ramp to 1000 users, up to roughly 330 RPS): the proposed method scaled aggressively (15 pods within the first minute), kept response time consistently low (~16 ms), then scaled back to 2 pods after ~7 min. HPA lagged: it added a pod only ~2 minutes after the surge. KEDA with a fixed threshold was CPU-efficient but slow to scale, causing a latency spike.
- Experiment 2 (high rate, 1000 users, >300 RPS, cluster resized): HPA vs proposed only. Both stayed within acceptable latency; the proposed method scaled more conservatively/gradually here and showed more latency fluctuation, adapting to observed load and runtime feedback.
Honest bottom line: the proposed method reacted to surges faster and used resources efficiently, but did not beat HPA on every metric; HPA stayed a solid, balanced baseline.

Training & pre-training

Trained from scratch — no pretrained or foundation model.

The forecaster, a plain univariate encoder–decoder transformer, is built with the Darts Python library and trained on the public Alibaba Cloud microservice trace (one microservice; just timestamp + RPS). There is no pretraining, fine-tuning, transfer learning, or zero-shot use anywhere in the method. (“Transfer learning” surfaces only in a cited related-work reference, not in this paper’s own approach.)

Setup: 70/30 train/validation split; Darts default hyperparameters — input_chunk_length=12, output_chunk_length=4 (1-minute-ahead), batch_size=32, n_epochs=200, d_model=16, nhead=8, 2+2 encoder/decoder layers, dim_feedforward=128, dropout=0.1. Forecast accuracy judged by MAE/RMSE/MAPE.
Optimizer, learning rate, and loss are left at Darts’ defaults and not explicitly stated.
The model is trained once offline, then served live via a per-minute CronJob that feeds its forecasts to KEDA.
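For reference, the three forecast-accuracy metrics can be computed by hand as below. Darts ships its own implementations, so this is only a sketch of the definitions, and the actual/predicted values are invented.

```python
import math


def mae(actual, predicted):
    """Mean absolute error, in the same unit as the data (RPS here)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted))
                     / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    if any(a == 0 for a in actual):
        raise ValueError("MAPE is undefined when an actual value is zero")
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) \
        / len(actual)


actual = [100.0, 200.0, 400.0]      # invented RPS observations
predicted = [110.0, 190.0, 380.0]   # invented forecasts

print(f"{mae(actual, predicted):.2f}")   # 13.33
print(f"{rmse(actual, predicted):.2f}")  # 14.14
print(f"{mape(actual, predicted):.2f}")  # 6.67
```

MAPE is worth reading with care for request-rate data: quiet periods with near-zero RPS inflate it even when the absolute error is small.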

Strengths

Genuinely proactive: forecasts demand and scales ahead of surges, hiding cold-start delay — its standout behavior in Experiment 1 (15 pods in the first minute, ~16 ms latency).
Real, end-to-end deployment: not just offline accuracy — a full monitor→predict→plan→execute loop running live on Kubernetes against real baselines.
Tackles an under-served target: scaling the ingress controller (centralized traffic gateway), which most prior work ignores.
Practical integration: rides on standard KEDA/HPA actuation, so it slots into existing Kubernetes stacks rather than replacing them.
Claimed transformer advantage over RNNs: faster inference and better responsiveness than the sequential LSTM/Bi-LSTM models that the related work relies on (a transformer attends to the whole input window at once rather than step by step).
Trained on real-world data (Alibaba traces), not synthetic load.

Limitations

Did not consistently outperform HPA — promising, not yet superior across the board.
Single application tested (Podinfo) → limited evidence of generality across diverse microservices/workloads.
Univariate input only (request rate); ignores CPU, memory, and other signals that could sharpen predictions.
Default hyperparameters (Darts recommendations); the authors expect tuning would improve results but didn’t do it.
Short horizon: one-minute-ahead forecasts only — limited foresight for slow-to-provision resources.
Inference overhead and cold-start latency of the model itself may hinder real-time use; lighter models (GRUs, temporal CNNs) are flagged as future comparisons.
No fault-tolerance / fallback yet — a misprediction can mis-scale; the authors suggest adding a fallback to HPA or fixed thresholds.
Inconsistent scaling behavior between experiments (aggressive vs conservative), driven by capacity assumptions, hinting at sensitivity to the max_rps_per_pod constant.
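The fallback the authors suggest is not implemented in the paper, but one way it might look is a guard in front of the calculator: use the predictive target only when the forecast is fresh and well-formed, otherwise revert to a fixed threshold. Everything here is hypothetical, including the function names and the 120-second staleness limit; the value 0.877 is simply the fixed KEDA threshold from the paper's baseline.

```python
import math


def forecast_is_usable(forecast, age_s, max_age_s=120.0):
    """Reject empty, stale, negative, or non-finite forecasts."""
    if not forecast or age_s > max_age_s:
        return False
    return all(math.isfinite(v) and v >= 0 for v in forecast)


def pick_target(forecast, age_s, predictive_target, fixed_target):
    """Return (target, source): predictive when trusted, fixed otherwise."""
    if forecast_is_usable(forecast, age_s):
        return predictive_target(forecast), "predictive"
    return fixed_target, "fallback"


FIXED_THRESHOLD = 0.877   # fixed KEDA threshold used as a baseline in the paper


def stub(forecast):
    """Placeholder for the Algorithm 1 calculation."""
    return 0.28


print(pick_target([300.0], 30.0, stub, FIXED_THRESHOLD))         # (0.28, 'predictive')
print(pick_target([float("nan")], 30.0, stub, FIXED_THRESHOLD))  # (0.877, 'fallback')
print(pick_target([300.0], 600.0, stub, FIXED_THRESHOLD))        # (0.877, 'fallback')
```

A guard like this catches missing or malformed forecasts, not plausible-looking wrong ones, so it covers only part of the misprediction risk listed above.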

Glossary

Autoscaling — automatically changing how many resources/instances an app uses to match demand.
Reactive scaling — scale after a metric crosses a threshold (lags behind load).
Proactive / predictive scaling — forecast future load and scale ahead of it.
Horizontal scaling (scale out/in) — add/remove whole instances (pods). (This paper.)
Vertical scaling (scale up/down) — give one instance more/less CPU or RAM. (Not used here.)
Cold start / provisioning delay — the time a new pod/VM needs before it can serve traffic; the main reason reactive scaling lags.
Kubernetes — platform for running and orchestrating containerized apps.
Pod — one running instance/unit in Kubernetes.
Ingress controller — the cluster’s front-door router/load balancer for incoming requests; the thing scaled here.
HPA (Horizontal Pod Autoscaler) — Kubernetes’ built-in reactive, threshold-based pod scaler.
KEDA (Kubernetes Event-Driven Autoscaler) — flexible autoscaler that can scale on custom signals and to zero; executes the scaling in this paper.
ScaledObject — KEDA’s YAML config defining how/when to scale; the system edits its value field each minute.
Prometheus / Grafana — metric collection / dashboard visualization tools.
Locust — load-testing tool that simulates many concurrent users.
Transformer — attention-based neural network; here a univariate time-series forecaster of request rate.
Attention — the mechanism letting a transformer weigh how much each part of a sequence matters to each other part.
RPS (requests per second) — the workload signal being predicted.
P99 latency — the 99th-percentile response time (the “worst typical” user experience).
MAE / RMSE / MAPE — forecast error metrics (lower is better).
SLA — Service Level Agreement; the promised performance/uptime targets.
MAPE-K — the Monitor–Analyze–Plan–Execute (+ shared Knowledge) reference loop for self-adaptive systems.
Darts — Python library for time-series forecasting, used to build/train the transformer.
Alibaba Cloud trace logs — real-world microservice traffic dataset used for training.

References

Shrestha, R., & Tuz Sabiha, F. (2025). Enhancing Cloud Resource Utilization with Predictive Autoscaling Using Transformer Models. 2025 9th International Conference on Cloud and Big Data Computing (ICCBDC), pp. 24–29. doi:10.1109/iccbdc67784.2025.00011