Predictive Autoscaling

Transformer-forecast request rates feeding KEDA to proactively scale a live ingress controller

Shrestha & Tuz Sabiha (2025)


TL;DR

Cloud apps run on a variable number of “instances” (here, Kubernetes pods) that get added or removed as traffic rises and falls. The usual way to do this is reactive: wait until CPU usage crosses a threshold, then scramble to add pods. The problem is that booting a new pod takes time (the “cold start”), so by the time the new capacity is ready, users have already been hit with slow responses. This paper builds a proactive autoscaler: a small transformer neural network looks at the last several minutes of incoming request rate and forecasts the request rate one minute into the future. A controller turns that forecast into a target that tells KEDA (a Kubernetes autoscaler) how many pods of the ingress controller (the cluster’s front-door traffic router) to run, before the traffic actually arrives. Trained on real Alibaba Cloud trace data and tested on a live Kubernetes cluster with simulated traffic surges, the system reacted to spikes much faster than the standard Kubernetes autoscaler (HPA) and kept response times stable and low (~16 ms), though it did not beat HPA on every single metric.


The Problem (and why simple autoscaling isn’t enough)

Autoscaling = automatically adjusting how many resources an app gets so it can serve demand without (a) falling over when busy or (b) burning money on idle capacity when quiet.

Kubernetes’ built-in tool, the Horizontal Pod Autoscaler (HPA), is reactive: you set a rule like “keep average CPU at 70%,” and when CPU climbs past it, HPA adds pods. Two issues:

  1. It always lags. It only reacts after load has already risen. Spinning up a new pod takes time (cold start / provisioning delay), so there is a window where the app is overloaded and users suffer.

  2. Thresholds are hard to set. Picking the right CPU/memory threshold needs expertise and still may not match real, shifting traffic patterns.
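
To make the lag concrete, here is a minimal sketch of the replica calculation documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric). The function name and the example numbers are illustrative assumptions, and the sketch ignores HPA's tolerance band and stabilization windows. The point is that the only input is the metric as it stands now, so the result cannot rise until the load already has.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Replica count from the documented HPA ratio formula.

    Simplified: no tolerance band, no min/max bounds, no stabilization.
    """
    return math.ceil(current_replicas * (current_metric / target_metric))


# Hypothetical numbers: 3 pods averaging 90% CPU against a 70% target.
print(hpa_desired_replicas(3, 90.0, 70.0))
```

With these made-up inputs the rule asks for one more pod, but only after average CPU has already overshot the target.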

To avoid the lag, operators often just keep extra pods running all the time (over-provisioning) — which wastes money.

The fix this paper pursues: stop reacting, start predicting. If you can forecast that a surge is coming a minute from now, you can add pods ahead of time so capacity is ready when the traffic hits. This is proactive (predictive) autoscaling.

Analogy: an online store the night before a big sale. Reactive scaling is like hiring extra cashiers only after the checkout line is already out the door. Predictive scaling is like looking at last year’s sale data, seeing the rush is coming at 9 a.m., and having the extra cashiers clocked in at 8:55.


Background

A few terms, defined once:


Contribution in Simple Terms

The genuinely new pieces:

  1. Use a transformer to forecast cloud workload. Transformers are famous in language tasks but, the authors note, are underexplored for cloud workload prediction. They apply one as a short-horizon forecaster of request rate.

  2. Scale the ingress controller, not the app. Almost all prior predictive-autoscaling work scales application pods. Scaling the ingress controller (the centralized traffic gateway) is the unusual, identified gap they target.

  3. Wire the forecast into a real, working autoscaler (KEDA) on a real Kubernetes cluster. This is not just a prediction-accuracy paper — they build the full loop (monitor → predict → compute target → scale) and load-test it live against the standard baselines (no autoscaling, HPA, KEDA-with-fixed-threshold).

In one sentence: they turn a transformer’s one-minute-ahead request-rate forecast into a live, proactive scaling target for the Kubernetes ingress controller.

What they honestly do not claim: that it beats HPA everywhere. HPA remained a strong, balanced baseline. The transformer approach’s win is speed of reaction to surges and promise for volatile traffic, not across-the-board superiority.


How It Works, Step by Step

The system has three blocks: the Application block (the app + ingress controller serving traffic), the Autoscaling-Controller (AC) (the brain: monitor + predictor + calculator), and the Autoscaler (KEDA, the muscle that applies the change).

Training the transformer (offline, done once):

  1. Get data. Use the public Alibaba Cloud trace logs: real traffic from a large microservice-based e-commerce platform. Specifically the MS_MCR_RT_Table, which holds call-rate and response-time metrics (24 files, each = 30 minutes of trace; fields include timestamp, microservice name, instance ID, metric type, call-rate value).

  2. Preprocess. Filter to one specific microservice; keep only timestamp and value (requests per second); sort chronologically; format as a univariate time series (one number per time step — just RPS).

  3. Train. Build the transformer with the Darts Python time-series library. Split data 70% train / 30% validation. Key settings: input_chunk_length=12 (look back 12 steps), output_chunk_length=4 (predict 4 steps ahead, which corresponds to the 1-minute horizon), d_model=16, nhead=8 (8 attention heads), 2 encoder + 2 decoder layers, dim_feedforward=128, dropout=0.1, batch_size=32, n_epochs=200. (These are Darts’ recommended defaults; the authors note tuning them could improve results.)

  4. Evaluate forecast accuracy on the validation set with MAE, RMSE, and MAPE (standard error metrics; lower = better).
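
The data handling around the model can be sketched in plain Python. This is not the authors' code and does not use Darts. It only illustrates, under stated assumptions, a chronological 70/30 split, the 12-in / 4-out windowing implied by input_chunk_length and output_chunk_length, and the three error metrics. All function names are hypothetical.

```python
import math


def chronological_split(series, train_fraction=0.7):
    """Split a time-ordered list without shuffling (train first, validation last)."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def sliding_windows(series, input_len=12, output_len=4):
    """Pair each run of `input_len` past values with the `output_len` values after it."""
    windows = []
    for start in range(len(series) - input_len - output_len + 1):
        past = series[start:start + input_len]
        future = series[start + input_len:start + input_len + output_len]
        windows.append((past, future))
    return windows


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)
```

MAPE is undefined when an actual RPS value is zero, so a trace with idle periods would need those points handled separately.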

Running live (the proactive loop, repeating every minute):

  1. Monitor. Prometheus continuously collects the ingress controller’s incoming request-rate metric and stores it as a time series. (Grafana dashboards visualize CPU, memory, pod count, response rate, latency.)

  2. Predict. A Kubernetes CronJob fires every minute, running a Python script that pulls the past hour of request-rate metrics from Prometheus, feeds the recent window to the trained transformer, gets the one-minute-ahead RPS forecast, and writes it to predicted_values.csv.

  3. Calculate the scaling target (Algorithm 1). A script converts the forecast into a KEDA target:

    • Constants: max_rps_per_pod = 70 (capacity of one pod, found by benchmarking), desired_utilization = 0.70.

    • max_predicted_rps = max(predicted values).

    • required_pods = max_predicted_rps / max_rps_per_pod.

    • Query Prometheus for current_pods.

    • scale_value = (current_pods / required_pods) × desired_utilization.

  4. Execute. The script writes scale_value into the value field of KEDA’s ScaledObject YAML and applies it with kubectl apply. KEDA then drives the underlying HPA to add/remove ingress pods to match — before the predicted traffic arrives.

  5. Loop. One minute later, the cycle runs again from step 1 with fresh data, so the target continuously re-adjusts to the latest forecast and the actual observed load (runtime feedback).
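
The target calculation in step 3 can be sketched as a small function. It follows the formulas exactly as listed above, using the two constants given in the text. The function name and the input checks are assumptions added for illustration, and no rounding of required_pods is applied because none is stated.

```python
MAX_RPS_PER_POD = 70.0       # per-pod capacity, from the benchmarking described above
DESIRED_UTILIZATION = 0.70   # utilization constant from the text


def compute_scale_value(predicted_rps, current_pods,
                        max_rps_per_pod=MAX_RPS_PER_POD,
                        desired_utilization=DESIRED_UTILIZATION):
    """Turn a list of forecast RPS values into the value written to KEDA."""
    if not predicted_rps:
        raise ValueError("need at least one predicted value")
    max_predicted_rps = max(predicted_rps)
    required_pods = max_predicted_rps / max_rps_per_pod
    if required_pods <= 0:
        # Guard added for this sketch; the text does not say how zero load is handled.
        raise ValueError("forecast must be positive to compute a ratio")
    return (current_pods / required_pods) * desired_utilization


# Hypothetical forecast for the next four steps, with 2 ingress pods running.
print(compute_scale_value([100.0, 140.0, 120.0, 90.0], current_pods=2))
```

Under these formulas the value equals desired_utilization when the running pod count matches required_pods, and falls below it when the forecast calls for more pods than are currently running.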

Note on architecture: the only neural network is the transformer (a standard encoder–decoder transformer for univariate time-series forecasting). There is no GAN, diffusion, FFT/frequency, or similarity-search component. Everything else (Prometheus, KEDA, the calculator) is conventional cloud/control plumbing.


Inputs (what it consumes)

It is univariate — only request rate drives the prediction. CPU/memory/latency are monitored for evaluation but are not inputs to the transformer.
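
Because the only model input is a single request-rate series, the monitoring side reduces to turning a Prometheus range-query response into (timestamp, RPS) pairs. The sketch below parses the documented JSON shape of a Prometheus `query_range` result (a "matrix" whose samples are [timestamp, "value-as-string"] pairs). The function name is hypothetical, and taking only the first returned series is an assumption. The text does not say which query or metric name the authors used.

```python
import json


def parse_range_response(body: str):
    """Extract (unix_timestamp, value) pairs from a query_range JSON body.

    Assumes the query returns one series; extra series are ignored here.
    """
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    result = payload["data"]["result"]
    if not result:
        return []
    # Prometheus encodes sample values as strings, so convert explicitly.
    return [(float(ts), float(value)) for ts, value in result[0]["values"]]
```

The resulting list can be trimmed to the most recent window before being handed to the forecaster.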


Outputs (what it produces)

Horizon: one minute ahead. The forecast refreshes every minute via the CronJob.


How It Fits the Autoscaling Framework (MAPE-K)

This is a near-textbook MAPE-K loop (Monitor, Analyze, Plan, Execute over shared Knowledge), with the transformer living in the Analyze stage.

Reactive vs proactive: Proactive. The system scales on forecast demand, so capacity is in place before the surge — directly addressing HPA’s lag and the cold-start delay.

Horizontal vs vertical: Horizontal (adds/removes pods of the ingress controller). No vertical (per-pod CPU/RAM) scaling.

Does it do the actuation itself, or feed a standard scaler? A bit of both. The transformer + calculator produce the decision; KEDA (event-driven, and ultimately HPA) performs the actual pod changes. So the novel part is the predictive signal and target, while actuation rides on standard Kubernetes autoscaling machinery — a clean, deployable design.
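
As a rough illustration of how the stages line up, one pass of the loop can be written as four pluggable callables. The mapping used here (Prometheus as Monitor, the forecaster as Analyze, the target calculator as Plan, KEDA as Execute) is one reading of the description above, and every name in the sketch is hypothetical. The stubs in the usage lines stand in for the real components. In particular the Analyze stub is a naive repeat-last-value forecast, not the transformer.

```python
from typing import Callable, Dict, List


def run_cycle(monitor: Callable[[], List[float]],
              analyze: Callable[[List[float]], List[float]],
              plan: Callable[[List[float]], float],
              execute: Callable[[float], None],
              knowledge: Dict[str, object]) -> float:
    """Run one Monitor -> Analyze -> Plan -> Execute pass and record it."""
    history = monitor()            # Monitor: recent request-rate samples
    forecast = analyze(history)    # Analyze: short-horizon forecast
    target = plan(forecast)        # Plan: convert forecast into a scaling target
    execute(target)                # Execute: hand the target to the autoscaler
    knowledge["last_forecast"] = forecast   # Knowledge: state shared across passes
    knowledge["last_target"] = target
    return target


# Stub components for illustration only.
applied: List[float] = []
knowledge: Dict[str, object] = {}
target = run_cycle(
    monitor=lambda: [50.0, 60.0, 70.0],
    analyze=lambda history: [history[-1]] * 4,    # naive stand-in forecast
    plan=lambda forecast: max(forecast) / 70.0,   # stand-in target rule
    execute=applied.append,
    knowledge=knowledge,
)
```

Keeping the stages as separate callables mirrors the design point made above: the predictive signal can change without touching the actuation path.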


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch: no pretrained or foundation model.

The forecaster is a plain univariate encoder–decoder transformer built with the Darts Python library and trained on the public Alibaba Cloud microservice trace (one microservice; just timestamp + RPS). There is no pretraining, fine-tuning, transfer learning, or zero-shot use anywhere in the method. (“Transfer learning” surfaces only in a cited related-work reference, not in this paper’s own approach.)


Strengths

Limitations


Glossary

References
  1. Shrestha, R., & Tuz Sabiha, F. (2025). Enhancing Cloud Resource Utilization with Predictive Autoscaling Using Transformer Models. 2025 9th International Conference on Cloud and Big Data Computing (ICCBDC), 24–29. doi:10.1109/iccbdc67784.2025.00011