PredictiveK8s

An early Informer-based predictive auto-scaler for Kubernetes, validated by request-log replay

Shim et al. (2023)


TL;DR

Cloud apps run inside small, fast-to-start units called containers (in Kubernetes these are grouped into pods). When more users show up, you need more pods; when they leave, you want fewer so you don’t pay for idle machines. Kubernetes’ built-in autoscaler is reactive: it waits until the system is already overloaded (e.g. CPU passes 70%) and only then adds pods. Starting a pod takes time, so users feel the lag. This paper builds a proactive (forecast-ahead) autoscaler instead. It trains a Transformer neural network, specifically a memory-efficient variant called Informer, to look at the last 10 minutes of incoming web-request traffic and predict the traffic for the next minute. A small script then converts that predicted traffic into “how many pods we’ll need” and scales before the load arrives. The authors compared it against older models (ARIMA, LSTM, Bi-LSTM) on two real web-traffic datasets (NASA web server logs and the 1998 FIFA World Cup website logs). On NASA the Transformer predicted future load more accurately on every metric and led to noticeably less over- and under-provisioning of pods; on the smoother FIFA series LSTM and Bi-LSTM were slightly more accurate.


The Problem (and why simple autoscaling isn’t enough)

Imagine an online store the night before a big sale. If you wait until customers flood in before adding servers, the first wave hits an under-powered system: pages load slowly, requests get dropped, and you violate your SLA (Service Level Agreement — the promise you made customers about speed and uptime). If instead you keep tons of servers running “just in case” all year, you burn money and power on idle machines. The autoscaler’s job is to hit the sweet spot automatically.

Kubernetes ships with a tool called the Horizontal Pod Autoscaler (HPA) that does this by reacting to thresholds: “if average CPU goes above X%, add pods.” Two things make this inadequate:

  1. It’s always late. A threshold only trips after load has already climbed. And because spinning up a new pod takes real time (the “provisioning delay” or cold start), by the time the new capacity is ready the spike may already have hurt users.

  2. It’s blind to the future. It only knows “right now.” It can’t see that a predictable rush is 2 minutes away.
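To make the "always late" point concrete, here is a minimal sketch of the reactive rule, assuming the replica formula given in the Kubernetes HPA documentation (desired = ceil(current replicas × current metric / target metric)). The function name and the sample numbers are illustrative only, and the real HPA adds a tolerance band and stabilization windows that are left out here.

```python
import math

def hpa_desired_replicas(current_replicas, current_metric, target_metric):
    """Reactive replica count: uses only the metric reading observed right now."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)

# Average CPU is already at 90% against a 70% target, so the rule asks for more
# pods only after the overload has been observed.
print(hpa_desired_replicas(4, 90.0, 70.0))  # 6
```

The input is a measurement of the present, so the extra pods are requested after the metric has already crossed the target, and they still need their start-up time before they serve traffic.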

Another common style is feedback-based scaling, which sizes resources from recent past usage. The authors lump this in with reactive approaches: it still over- or under-provisions because it never actually forecasts what’s coming.

The fix the field has converged on is proactive (predictive) autoscaling: use machine learning to forecast the upcoming workload, then scale ahead of time so the new pods are already warm when the load hits. This paper’s specific bet is that a Transformer is the best forecaster for this job.


Background

A few terms, defined once:

The models being compared (so the results table makes sense):


Contribution in Simple Terms

The paper’s core claim: a Transformer (Informer) beats the previous best models at forecasting cloud web-traffic, and that better forecast translates into a better autoscaler. Concretely, the contributions are:

  1. Bringing the Transformer/Informer model into cloud autoscaling. Transformers were mostly used in language and vision; the authors show they also excel at predicting how much workload a cloud app will face next. This is the genuinely new ingredient versus the prior Bi-LSTM-based work.

  2. A complete proactive autoscaling framework, not just a model: a Kubernetes setup where Prometheus collects metrics, the Informer model forecasts the next minute’s load inside the MAPE loop’s Analyze stage, and a simple Plan script turns the forecast into a concrete number of pods and issues the scale command.

  3. A head-to-head comparison of ARIMA vs LSTM vs Bi-LSTM vs Transformer on two real-world datasets, measured both by raw prediction accuracy and by how well the resulting pod counts match actual need (provisioning accuracy).

In plain terms: instead of waiting for the system to choke and then scrambling, the autoscaler looks at the recent traffic pattern, predicts the next minute, and pre-orders exactly the number of pods it expects to need.


How It Works, Step by Step

  1. Collect metrics (Monitor). In the deployed system, a Prometheus metrics server continuously gathers metrics from the Kubernetes cluster — request throughput, response time, autoscaling metrics — at regular intervals and stores them in a metrics database. (In the offline experiments, the “metrics” are real web-server logs instead — see Inputs.)

  2. Turn raw logs into a clean time series. Each HTTP request log line (host, timestamp, method/route, reply code, bytes) is aggregated by minute: count how many requests arrived in each one-minute bucket. The result is a single number per minute = the workload. For the FIFA dataset, values are normalized to the [0, 1] range because the raw counts vary wildly and that makes learning hard.

  3. Frame the forecasting task. It’s set up as univariate -> univariate, one-step-ahead: feed in the last 10 minutes of workload (w_{t-10}, ..., w_{t-1}) and predict the single next minute w_t. (“Univariate” = one input variable, the request count, predicting that same one variable.)

  4. Predict with the Informer Transformer (Analyze). The sequence of 10 values goes into Informer, which outputs the predicted request count for the next minute. Internally Informer is an encoder–decoder Transformer with two efficiency tricks:

    • ProbSparse self-attention (encoder): ordinary self-attention compares every position with every other position, so its cost grows quadratically with sequence length and it eats memory. ProbSparse instead computes full attention only for the few dominant queries (the ones that actually carry most of the attention weight) and skips the long tail of queries that contribute little, slashing cost.

    • Self-attention distilling: between encoder layers, the model trims down to just the strongest attention features, shrinking the data as it flows deeper.

    • Decoder (generative): produces the forecast in one shot rather than step-by-step. It reuses ProbSparse attention with masking (future positions set to negative infinity) so it can’t “cheat” by looking ahead — and so it stays fast even for long predictions.

    • Training uses Mean Squared Error loss with early stopping (halt when the training loss stops improving) to avoid overfitting. NASA: 10 epochs; FIFA: 4 epochs.

  5. Convert the forecast into a pod count (Plan). A short Python script (the paper’s Pods Requirement Manager, Algorithm 1) runs every 60 seconds and computes:

    pods_{t+1} = predicted_requests_{t+1} / requests_per_pod

    i.e. divide the predicted incoming requests by how many requests a single pod can handle, giving the number of pods needed for the next interval.

  6. Apply the scaling (Execute). The script issues the Kubernetes scale command with that pod count, so the cluster grows or shrinks before the predicted load arrives. This loop repeats continuously while the system runs.
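Steps 5 and 6 can be sketched as follows. This is not the paper's Algorithm 1, only an illustration of the division it describes. Rounding up to a whole pod and keeping at least one pod are assumptions, the value of requests_per_pod is not reported in the paper, and the deployment name "web" is hypothetical. The command is built as an argument list for `kubectl scale` but is not executed here.

```python
import math

def pods_required(predicted_requests, requests_per_pod, min_pods=1):
    """Predicted requests for the next minute divided by per-pod capacity.

    Rounding up and the min_pods floor are assumptions, not from the paper.
    """
    if requests_per_pod <= 0:
        raise ValueError("requests_per_pod must be positive")
    return max(min_pods, math.ceil(predicted_requests / requests_per_pod))

def scale_command(deployment, replicas):
    """Argument list for `kubectl scale`; the deployment name is hypothetical."""
    return ["kubectl", "scale", f"deployment/{deployment}", f"--replicas={replicas}"]

replicas = pods_required(predicted_requests=950, requests_per_pod=100)
print(replicas)                        # 10
print(scale_command("web", replicas))
```

In a live loop the list could be handed to `subprocess.run` once per 60-second interval; whether the authors shell out to kubectl or call the Kubernetes API directly is not stated.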


Workload Modeling & Prediction Pipeline (full-text deep read)

This section zooms in on two things only: (A) what “the workload” actually is and how it’s turned into numbers, and (B) the exact path from raw data to a prediction. Everything here is from this paper’s full text; where the paper is vague, that’s flagged.

(A) How the workload is modeled and characterized

What real-world quantity is “the workload”? It is the number of HTTP requests arriving at a web server, counted per minute — i.e. the arrival/request rate of incoming traffic. This is the demand the app must serve. (The paper mentions CPU usage in passing — it tuned the ARIMA baseline on CPU data — but every reported forecasting experiment uses request counts, not CPU or memory. So treat “workload = requests per minute” as the real signal.)

One signal or many? (univariate vs multivariate.) It is univariate — a single number per time step (requests in that minute). The paper states the forecasting task explicitly as “univariate to univariate”: one input variable in, the same one variable out. There is no joint modeling of CPU + memory + latency together; it is purely the request-count series predicting itself.

Time granularity. The series is sampled at one value per minute. Raw logs are bucketed into one-minute bins and the requests in each bin are counted, producing one workload number per minute. In the live system the Monitor phase collects Prometheus metrics “at regular intervals,” but the experiments run on the per-minute aggregated series.
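The one-minute bucketing can be illustrated with a short sketch. It assumes text log lines with a bracketed timestamp such as `[01/Jul/1995:00:00:01 -0400]` (the Common Log Format layout); the sample lines are made up for illustration and are not taken from either dataset. Minutes with no requests are simply absent from the result, and how the paper treats such gaps is not stated.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout: the timestamp is the bracketed field of each log line.
TIMESTAMP = re.compile(r"\[([^\]]+)\]")

def requests_per_minute(log_lines):
    """Count requests in each one-minute bucket; returns {minute: count}."""
    counts = Counter()
    for line in log_lines:
        match = TIMESTAMP.search(line)
        if match is None:
            continue  # skip lines without a parsable timestamp field
        stamp = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[stamp.replace(second=0)] += 1
    return dict(sorted(counts.items()))

# Illustrative lines only (hypothetical hosts and routes).
sample = [
    'host-a.example - - [01/Jul/1995:00:00:01 -0400] "GET /a HTTP/1.0" 200 100',
    'host-b.example - - [01/Jul/1995:00:00:42 -0400] "GET /b HTTP/1.0" 200 250',
    'host-a.example - - [01/Jul/1995:00:01:05 -0400] "GET /c HTTP/1.0" 404 0',
]
print(list(requests_per_minute(sample).values()))  # [2, 1]
```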

What the raw data looks like, and what makes it hard. The raw input is web-server request logs — one line per HTTP request, each line carrying host/IP, a timestamp, the request method and route, the HTTP reply code, and the bytes returned. Two real datasets are used:

What makes this hard, per the paper: the traffic is bursty and highly variable. The paper is actually self-contradictory about FIFA: when describing the raw data it says the FIFA workload “has high variation in values, causing predictability to be hard” (which is why it normalizes it), yet when explaining the results it calls the aggregated FIFA workload “simple” with “less variation” — and on that smoother aggregated series the simpler LSTM/Bi-LSTM models matched or beat the Transformer, whereas the Transformer’s clear win came on NASA. (The paper does not directly state which dataset is “more variable” overall, so treat the NASA-is-harder reading as an inference from where the Transformer helps, not a stated fact.) So the difficulty is mainly burstiness / wide swings in request volume plus real gaps/outages in the raw logs. (The paper does not do an explicit periodicity or seasonality decomposition of the workload, and it does not discuss long-sequence length as a problem for its 10-minute window — long-sequence concerns are only mentioned as the motivation for choosing Informer in general.)

How the workload is represented numerically. Three concrete moves, and nothing fancier:

  1. Aggregation into a per-minute count — the logs become a clean 1-dimensional time series of integers (requests per minute).

  2. Normalization to the [0, 1] range — done for the FIFA dataset specifically, because its values vary so widely that learning is hard; scaling everything into 0–1 keeps the numbers in a friendly range. (NASA results are reported in raw request units; FIFA results are reported in the normalized scale, which is why FIFA’s error numbers look tiny.)

  3. Sliding window of the last 10 values — the model is fed a window of 10 consecutive per-minute counts as its input. There is no patching, no frequency/Fourier transform, no trend/seasonal/residual decomposition, and no learned token embedding described in this paper; the “representation” is simply the raw (or normalized) 10-number window handed to the network. This is a key faithfulness point: the paper does not apply the kind of decomposition or patch-embedding tricks seen in other Transformer-forecasting work.
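Moves 2 and 3 can be sketched in a few lines. The paper does not say which scaling formula it used to reach [0, 1]; min-max scaling is assumed here, and the function names are illustrative.

```python
def min_max_scale(series):
    """Rescale values into [0, 1] (min-max scaling is an assumption)."""
    lo, hi = min(series), max(series)
    if hi == lo:
        return [0.0 for _ in series]  # constant series: nothing to spread out
    return [(x - lo) / (hi - lo) for x in series]

def sliding_windows(series, lookback=10):
    """Pair each run of `lookback` values with the value that follows it."""
    return [
        (series[i - lookback:i], series[i])
        for i in range(lookback, len(series))
    ]

counts = list(range(12))           # stand-in for 12 minutes of request counts
pairs = sliding_windows(counts)
print(len(pairs))                  # 2
print(pairs[0])                    # ([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 10)
print(min_max_scale([10, 20, 30])) # [0.0, 0.5, 1.0]
```

Each pair is one training example in the 10-in, 1-out framing: ten consecutive per-minute counts as input, the next minute's count as the target.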

(B) The step-by-step prediction pipeline

  1. Collect / preprocess the raw signal. Start from raw web-server request logs. Group every log line into the one-minute bucket its timestamp falls in, and count requests per minute. This converts millions of irregular log lines into one tidy number-per-minute series — the workload. (Analogy: instead of recording every single customer walking through a shop door, you just write down “how many came in this minute” — far easier to forecast from.)

  2. Split into train and test by date. For NASA, train on 1 Jul–14 Aug 1995 and test on the last 16 days. For FIFA, train on 30 Apr–30 Jun 1998 and test on the last 16 days. (The split is by calendar date, not random shuffling, which is the correct way to evaluate a forecaster — you only ever predict the future from the past.)

  3. Normalize where needed. For FIFA, rescale the per-minute counts into [0, 1] because of the wide value range; this is what makes the model trainable on that noisy series. NASA is used in raw request units.

  4. Frame the learning problem: 10-in, 1-out, univariate. Define the task as input sequence length = 10, output sequence length = 1. Each training example is “here are the last 10 minutes of request counts (w_{t-10} ... w_{t-1}); predict the count for the next minute (w_t).” This is one-step-ahead forecasting (horizon = a single future minute), and it is univariate -> univariate (request count predicting request count). The choice of a 10-minute lookback is taken directly from the prior Bi-LSTM work the authors compare against.

  5. Feed the window into the Informer encoder, and why. The 10-value window enters Informer, an efficient Transformer for time series. The encoder uses ProbSparse self-attention. Ordinary self-attention (a mechanism that lets every time step compare itself with every other to find which past moments matter) has a cost that grows quadratically with sequence length and uses a lot of memory; ProbSparse keeps only the handful of dominant “queries” (the few comparisons that actually carry most of the weight) and ignores the long tail, so it does much the same job far more cheaply. Between encoder layers it also applies self-attention distilling, which trims the data down to just the strongest features as it flows deeper. Why: it shrinks memory and keeps only the signal that matters. (Intuition: rather than letting all 10 minutes shout at each other equally, the model lets only the few most informative minutes drive the prediction.)

  6. Decode to a forecast, and why this style. The decoder is generative: it produces the forecast in one shot rather than re-running step-by-step, which keeps it fast even when predictions get long. It reuses ProbSparse attention but with masking — future positions are set to negative infinity so the model cannot peek ahead at answers it shouldn’t see yet. Why: one-shot generation plus masking avoids the slow, error-accumulating loop of step-by-step decoding.

  7. Train with MSE and early stopping. The network is trained from scratch using Mean Squared Error (the average squared gap between predicted and actual counts) as the loss, with early stopping (halt when training loss stops improving) to prevent overfitting. NASA: 10 epochs; FIFA: 4 epochs. (For reference, the Bi-LSTM baseline used input size 10, 30 hidden units, one output cell, batch size 64, 50 epochs, ReLU, early-stopping patience 5; ARIMA used p=5, d=1, q=0.)

  8. The prediction result. The model outputs a single number: the predicted count of HTTP requests in the very next minute, w_t. It is a point forecast (one value, not a probability distribution or confidence interval), a one-step / one-minute horizon, in units of requests per minute (raw count for NASA; normalized 0–1 value for FIFA). On the harder NASA data this prediction was markedly more accurate than ARIMA/LSTM/Bi-LSTM (e.g. MSE 77.3 vs Bi-LSTM’s 186.1, R² 0.83 vs 0.71); on the smoother FIFA data it was competitive but not the best.
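The ProbSparse idea from step 5 can be shown with a simplified, pure-Python sketch. It follows the max-minus-mean query sparsity score described in the original Informer paper, but it scores every query against every key (the real method samples keys to avoid that cost), and the choice of u, the vector sizes, and the toy inputs are all assumptions. The paper summarized here reports none of these settings.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, u):
    """Full attention for the u most 'active' queries, mean(V) for the rest."""
    scale = math.sqrt(len(Q[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Sparsity score per query: peaked score rows (max far above mean) rank high.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    ranked = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)
    active = set(ranked[:u])
    width = len(V[0])
    mean_v = [sum(v[c] for v in V) / len(V) for c in range(width)]
    out = []
    for i, row in enumerate(scores):
        if i in active:
            w = softmax(row)
            out.append([sum(w[j] * V[j][c] for j in range(len(V))) for c in range(width)])
        else:
            out.append(list(mean_v))  # near-uniform query: approximate by the mean
    return out, sorted(active)

Q = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.0]]   # only the first query is peaked
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0], [2.0], [3.0]]
out, active = probsparse_attention(Q, K, V, u=1)
print(active)   # [0]
print(out[1])   # [2.0], the mean of V
```

The two all-zero queries would attend uniformly anyway, so replacing their output with the mean of V loses nothing in this toy case; in general it is an approximation.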

How the result is consumed (briefly, outside the workload->prediction focus). A Plan script that runs every 60 seconds divides the predicted request count for the next minute (the w_t of step 8) by a fixed per-pod capacity, requests_per_pod, to get the pod count for that minute, and issues a Kubernetes scale command, so the cluster is resized before the predicted traffic arrives. But the prediction result itself ends at step 8: the next minute’s request count.


Model Parameters & How They Were Chosen

This section reports only the hyperparameters the paper states, and the basis on which each was selected. A recurring theme: the paper is detailed about the task framing and the baselines, but almost silent about the internal architecture and the optimizer settings of its own Informer/Transformer. Where the paper gives no value, that is marked “not reported”; no typical defaults are imported.

(A) What are the model parameters?

Architecture (Transformer / Informer). The paper describes the Informer structure only qualitatively (an encoder built from stacked ProbSparse self-attention blocks with self-attention distilling between layers, plus a generative masked decoder that also uses ProbSparse attention), citing the original Informer paper for the mechanism. It reports no concrete numeric architecture hyperparameters.

| Parameter | Value in paper |
| --- | --- |
| Encoder layers | not reported |
| Decoder layers | not reported |
| Attention heads | not reported (only “masked multi head attention” mentioned, no count) |
| Model / hidden dimension (d_model) | not reported |
| Feed-forward dimension | not reported |
| Dropout | not reported |
| Embedding dimension / token embedding | not reported (no embedding scheme described) |
| Activation | not reported (for the Transformer) |
| ProbSparse factor (the setting that controls how many “dominant queries” u are kept) | not reported (described in words only: it “focuses on those u dominant queries which belong in the long tail”) |
| Self-attention distilling | present between encoder layers (no numeric setting given) |
| Decoder style | generative, one-shot; future positions masked to negative infinity (“masked dot-products are set to negative infinity”) |
| Diffusion / GAN / RevIN / FFT settings | not applicable (none of these are used) |
| Number of output values | 1 (single next-minute value; see Data/windowing) |

Architecture (baselines). The baselines are specified more concretely than the proposed model.

| Model | Parameters reported |
| --- | --- |
| ARIMA | autoregressive order p = 5, degree of differencing d = 1, moving-average order q = 0 |
| LSTM | nothing reported (described conceptually as an RNN with forget/input/output gates; no layer count, hidden size, or training settings given) |
| Bi-LSTM | input size 10, 30 hidden units, 1 output neural cell, ReLU activation |

Training.

| Parameter | Transformer (Informer) | Bi-LSTM | ARIMA / LSTM |
| --- | --- | --- | --- |
| Loss function | Mean Squared Error (MSE) | MSE | not reported |
| Epochs | 10 (NASA), 4 (FIFA) | 50 | not reported |
| Early stopping | yes, “if training loss is consistent” (no patience value given for the Transformer) | yes, patience 5 | not reported |
| Batch size | not reported | 64 | not applicable / not reported |
| Optimizer | not reported | not reported | not applicable (ARIMA) |
| Learning rate (and schedule) | not reported | not reported | not reported |
| Weight decay | not reported | not reported | not reported |
| Loss weights | not applicable (single MSE term) | not applicable | not reported |
| Hardware | not reported | not reported | not reported |

(The MSE loss with early stopping is stated to apply to LSTM, Bi-LSTM, and the Transformer collectively: “Mean Squared Error loss function is used in LSTM, Bi-LSTM and transformer preventing over-fitting and early stopping if training loss is consistent.”)
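A patience-based reading of "early stopping if training loss is consistent" might look like the sketch below. The patience of 5 is the value reported for the Bi-LSTM baseline; the Transformer's patience is not reported, and `min_delta` is a hypothetical parameter added for illustration. The loss values are made up.

```python
def stopping_epoch(train_losses, patience=5, min_delta=0.0):
    """Return the 1-based epoch at which training would halt.

    Training stops once the training loss has failed to improve on its best
    value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(train_losses, start=1):
        if loss < best - min_delta:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(train_losses)  # never triggered: ran all epochs

# Illustrative losses: improvement stalls after epoch 3.
losses = [1.0, 0.8, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.6]
print(stopping_epoch(losses, patience=5))  # 8
```

Because the criterion watches training loss rather than a held-out validation loss, it detects a plateau in fitting rather than the onset of overfitting as such.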

Data / windowing.

| Parameter | Value |
| --- | --- |
| Forecasting task | univariate to univariate (one input variable, the request count, predicting that same variable) |
| Lookback window (input sequence length) | 10 minutes (w_{t-10} ... w_{t-1}) |
| Forecast horizon (output sequence length) | 1 (one step, the next minute w_t) |
| Sampling interval | 1 minute (raw logs aggregated into one-minute buckets) |
| Normalization | FIFA workload scaled to [0, 1] (because of its wide value range); NASA used in raw request units |
| Train/test split (NASA) | train 1 Jul–14 Aug 1995; test the last 16 days |
| Train/test split (FIFA) | train 30 Apr–30 Jun 1998; test the last 16 days |
| Validation split | not reported (no separate validation set is described; early stopping is stated against training loss) |
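A date-based split of this kind can be sketched as below. The function name and the toy series are illustrative; treating midnight on 15 Aug 1995 as the first test instant for NASA is an assumption derived from the stated training range, not a boundary the paper gives explicitly.

```python
from datetime import datetime

def chronological_split(series, test_start):
    """Split (timestamp, value) pairs by time: no shuffling, past before future."""
    train = [(ts, v) for ts, v in series if ts < test_start]
    test = [(ts, v) for ts, v in series if ts >= test_start]
    return train, test

# Toy per-minute series straddling the assumed NASA boundary.
series = [
    (datetime(1995, 8, 14, 23, 58), 31),
    (datetime(1995, 8, 14, 23, 59), 28),
    (datetime(1995, 8, 15, 0, 0), 35),
    (datetime(1995, 8, 15, 0, 1), 40),
]
train, test = chronological_split(series, datetime(1995, 8, 15))
print(len(train), len(test))  # 2 2
```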

Total parameter count / model size. Not reported for any model. The paper notes that “the balance between the quality of prediction and the computation time” matters in practice but gives no parameter counts, memory footprints, or training/inference times.

Planning constant (downstream of the model). requests_per_pod (per-pod request capacity) is used by the Plan script to convert predicted requests into a pod count; its numeric value is not reported.

(B) How were the parameters chosen?

The paper does not describe any systematic hyperparameter search (no grid search, no random search, no cross-validation, no reported search ranges) for its proposed Transformer. The selection basis, where stated, is as follows:


Inputs (what it consumes)


Outputs (what it produces)

So the chain is: forecasted requests/minute -> recommended pod count -> issued scaling action. The paper carries the idea all the way to an actuation command (not just a prediction), though the heavy lifting and the novelty are in the forecast.


How It Fits the Autoscaling Framework (MAPE-K)

The whole design is explicitly a MAPE loop (adapted from prior Bi-LSTM work [11]), and the paper is clear about which stage each piece lives in:

Mapping to the framing questions:


Evaluation (datasets & metrics, briefly)

Datasets (both real-world web traffic, processed to per-minute workload):

Accuracy metrics: MSE, RMSE, MAE (lower = better) and R² (closer to 1 = better).
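For reference, the four metrics can be computed as in this sketch, using their standard definitions (R² as one minus the ratio of residual to total sum of squares). The sample values are made up and are not from the paper.

```python
import math

def forecast_metrics(actual, predicted):
    """MSE, RMSE, MAE and R² for paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    # R² is undefined for a constant actual series (ss_tot == 0).
    r2 = 1.0 - (mse * n) / ss_tot
    return {"MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae, "R2": r2}

metrics = forecast_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(metrics["MSE"], metrics["MAE"])  # 4.5 2.0
```

RMSE is the square root of MSE, so it is in the same units as the workload (requests per minute for NASA, the normalized scale for FIFA).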

NASA results — Transformer wins on every metric:

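| Metric | ARIMA | LSTM | Bi-LSTM | Transformer |
| --- | --- | --- | --- | --- |
| MSE | 196.91 | 184.29 | 186.07 | 77.34 |
| RMSE | 14.03 | 13.58 | 13.64 | 8.79 |
| MAE | 10.57 | 10.27 | 10.38 | 6.59 |
| R² | 0.693 | 0.712 | 0.710 | 0.830 |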

FIFA results: here LSTM and Bi-LSTM beat the Transformer on every metric. The Transformer’s MSE is about 3.6 times LSTM’s, although all three models reach R² above 0.99, so the absolute gap is small. The paper attributes this to the aggregated FIFA workload being simpler and smoother. (Note: the paper’s FIFA table reports only LSTM, Bi-LSTM, and Transformer, with no ARIMA column, and all values are on the normalized [0,1] scale, which is why the errors look tiny.)

| Metric | LSTM | Bi-LSTM | Transformer |
| --- | --- | --- | --- |
| MSE | 0.0000106190 | 0.0000111431 | 0.0000380358 |
| RMSE | 0.0032586753 | 0.0033381351 | 0.0061673189 |
| MAE | 0.0017331059 | 0.0017018933 | 0.0025280349 |
| R² | 0.9987104535 | 0.9986467957 | 0.9950739941 |

Honest finding: the Transformer’s advantage shows up on NASA, which the results suggest is the harder, more variable workload, and not on FIFA.

Provisioning accuracy (SPEC-inspired): Θ_U = under-provisioning %, Θ_O = over-provisioning % (0 is best). On NASA the Transformer’s forecasts gave the best pod provisioning:

| Metric | ARIMA | LSTM | Bi-LSTM | Transformer |
| --- | --- | --- | --- | --- |
| Θ_U [%] (under) | 9.96 | 9.03 | 8.22 | 4.49 |
| Θ_O [%] (over) | 22.73 | 23.92 | 25.84 | 17.27 |

So the better forecast translated into both fewer SLA-risking shortfalls and less wasted over-provisioning.
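The text above names the two provisioning metrics but not their formulas. The sketch below assumes the usual SPEC-style elasticity definitions with equal-length time steps: Θ_U averages the shortfall relative to demand, Θ_O averages the surplus relative to supply. Whether the paper computes them exactly this way is an assumption, and the pod counts are made up.

```python
def provisioning_accuracy(demand, supply):
    """Return (theta_U, theta_O) in percent for equal-length time steps.

    demand[t]: pods actually needed at step t; supply[t]: pods provisioned.
    Both must be positive. Formulas are assumed, not quoted from the paper.
    """
    n = len(demand)
    under = sum(max(d - s, 0) / d for d, s in zip(demand, supply))
    over = sum(max(s - d, 0) / s for d, s in zip(demand, supply))
    return 100.0 * under / n, 100.0 * over / n

demand = [10, 10, 10, 10]
supply = [10, 8, 12, 10]   # one step short by 2 pods, one step over by 2 pods
theta_u, theta_o = provisioning_accuracy(demand, supply)
print(round(theta_u, 2), round(theta_o, 2))  # 5.0 4.17
```

A perfect forecast gives 0 for both values, which matches the "0 is best" reading of the table.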


Training & pre-training

Trained from scratch — no pretrained or foundation model.

The authors borrow only the architecture of Informer (from Zhou et al. 2021) and train it from scratch — together with the ARIMA, LSTM, and Bi-LSTM baselines — directly on the per-minute web-request traces (NASA Kennedy Space Center 1995 and FIFA World Cup 1998). The paper’s conclusion frames this plainly as supervised learning: there is no pretraining, fine-tuning, foundation model, or zero-shot transfer anywhere in the pipeline.

Training setup, in brief:


Strengths

Limitations


Glossary

References
  1. Shim, S., Dhokariya, A., Doshi, D., Upadhye, S., Patwari, V., & Park, J.-Y. (2023). Predictive Auto-scaler for Kubernetes Cloud. 2023 IEEE International Systems Conference (SysCon), 1–8. 10.1109/syscon53073.2023.10131106