PredictiveK8s - Luca's Research @ PEG

Shim et al. (2023)

TL;DR

Cloud apps run inside small, fast-to-start units called containers (in Kubernetes these are grouped into pods). When more users show up, you need more pods; when they leave, you want fewer so you don’t pay for idle machines. Kubernetes’ built-in autoscaler is reactive: it waits until the system is already overloaded (e.g. CPU passes 70%) and only then adds pods, but starting a pod takes time, so users feel the lag. This paper builds a proactive (forecast-ahead) autoscaler instead. It trains a Transformer neural network, specifically a memory-efficient variant called Informer, to look at the last 10 minutes of incoming web-request traffic and predict the traffic for the next minute. A small script then converts that predicted traffic into “how many pods we’ll need” and scales before the load arrives. The evaluation uses two real web-traffic datasets (NASA web server logs and the 1998 FIFA World Cup website logs). On NASA, the Transformer predicted future load more accurately than the older models the authors compared against (ARIMA, LSTM, Bi-LSTM) and led to noticeably less over- and under-provisioning of pods; on the smoother FIFA series, LSTM and Bi-LSTM were slightly more accurate than the Transformer.

The Problem (and why simple autoscaling isn’t enough)

Imagine an online store the night before a big sale. If you wait until customers flood in before adding servers, the first wave hits an under-powered system: pages load slowly, requests get dropped, and you violate your SLA (Service Level Agreement — the promise you made customers about speed and uptime). If instead you keep tons of servers running “just in case” all year, you burn money and power on idle machines. The autoscaler’s job is to hit the sweet spot automatically.

Kubernetes ships with a tool called the Horizontal Pod Autoscaler (HPA) that does this by reacting to thresholds: “if average CPU goes above X%, add pods.” Two things make this inadequate:

It’s always late. A threshold only trips after load has already climbed. And because spinning up a new pod takes real time (the “provisioning delay” or cold start), by the time the new capacity is ready the spike may already have hurt users.
It’s blind to the future. It only knows “right now.” It can’t see that a predictable rush is 2 minutes away.
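
To make the reactive rule concrete, here is a minimal sketch of the replica formula documented for the Kubernetes HPA: desired replicas = ceil(current replicas × current metric / target metric). The function name and the sample numbers are illustrative assumptions, and the real controller also applies a tolerance band, stabilization windows, and min/max bounds, which are left out here.

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Replica count from the documented HPA ratio rule (tolerance and bounds omitted)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Illustrative numbers: 4 pods averaging 90% CPU against a 70% target.
print(hpa_desired_replicas(4, 90.0, 70.0))  # 4 * 90 / 70 = 5.14..., rounded up to 6
```

Note that the only input is the metric as it stands right now, so the rule can only respond once load has already risen.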

Another common style is feedback-based scaling, which sizes resources from recent past usage. The authors lump this in with reactive approaches: it still over- or under-provisions because it never actually forecasts what’s coming.

The fix the field has converged on is proactive (predictive) autoscaling: use machine learning to forecast the upcoming workload, then scale ahead of time so the new pods are already warm when the load hits. This paper’s specific bet is that a Transformer is the best forecaster for this job.

Background

A few terms, defined once:

Container / pod: A container is a lightweight, self-contained package of an app plus everything it needs to run. Kubernetes runs them in pods (a pod is one or more containers treated as a unit). Containers start faster and use fewer resources than full virtual machines, which is why fast autoscaling is even feasible.
Horizontal scaling (scale out/in): add or remove whole pods. (Contrast: vertical scaling gives a single pod more CPU/RAM.) This paper does horizontal scaling.
Elasticity: the system’s ability to grow and shrink its resources in real time as load changes.
Time-series forecasting: predicting the next value(s) of a quantity that changes over time (here: web requests per minute) from its recent history.
Transformer: a neural-network architecture (from the 2017 paper “Attention Is All You Need”) built around self-attention — a mechanism that lets the model weigh which earlier points in a sequence matter most for predicting the next one. Famous for language (NLP) and vision, here it’s repurposed as a workload forecaster. The authors specifically use Informer, a Transformer variant designed for long sequences that is cheaper in memory and computation than a vanilla Transformer.
MAPE loop (MAPE-K): the standard blueprint for self-managing systems — Monitor (collect metrics) -> Analyze (detect/forecast) -> Plan (decide the action) -> Execute (apply it), all sharing a common Knowledge base. Keep this in mind; the paper’s custom autoscaler is built directly on it.

The models being compared (so the results table makes sense):

ARIMA: a classic statistical forecaster (no neural network). It predicts the future as a weighted combination of past values and past forecast errors, with knobs p (autoregression order), d (differencing, to remove trends), q (moving-average order). The authors used p=5, d=1, q=0.
LSTM: a Recurrent Neural Network with internal “memory” (a cell state) and gates (forget/input/output) that decide what to remember or discard. Good at sequences; largely mitigates the “vanishing gradient” problem that plagued plain RNNs.
Bi-LSTM: an LSTM that reads the sequence both forward and backward. This was the prior state of the art for Kubernetes workload prediction (from reference [11], which this paper directly builds on and compares against).
Transformer (Informer): the paper’s proposed approach.
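
As a rough illustration of what the ARIMA setting p=5, d=1, q=0 means: with d=1 the model works on first differences, with p=5 it combines the last five differences, and with q=0 there are no moving-average terms. The sketch below computes a one-step forecast under those settings. The coefficients `phi`, the constant, and the sample counts are hypothetical; in practice the coefficients are fitted from data, and the paper does not report them.

```python
def arima_510_forecast(history, phi, const=0.0):
    """One-step ARIMA(5,1,0) forecast: AR(5) on first differences, then undo the differencing."""
    if len(phi) != 5 or len(history) < 6:
        raise ValueError("need 5 AR coefficients and at least 6 past values")
    diffs = [b - a for a, b in zip(history[:-1], history[1:])]
    recent = diffs[-5:][::-1]  # most recent difference first
    next_diff = const + sum(p * d for p, d in zip(phi, recent))
    return history[-1] + next_diff

# Hypothetical coefficients and request counts, for illustration only.
phi = [0.5, 0.0, 0.0, 0.0, 0.0]
print(arima_510_forecast([100, 102, 104, 108, 110, 116], phi))  # 116 + 0.5 * 6 = 119.0
```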

Contribution in Simple Terms

The paper’s core claim: a Transformer (Informer) beats the previous best models at forecasting cloud web-traffic, and that better forecast translates into a better autoscaler. Concretely, the contributions are:

Bringing the Transformer/Informer model into cloud autoscaling. Transformers were mostly used in language and vision; the authors show they also excel at predicting how much workload a cloud app will face next. This is the genuinely new ingredient versus the prior Bi-LSTM-based work.
A complete proactive autoscaling framework, not just a model: a Kubernetes setup where Prometheus collects metrics, the Informer model forecasts the next minute’s load inside the MAPE loop’s Analyze stage, and a simple Plan script turns the forecast into a concrete number of pods and issues the scale command.
A head-to-head comparison of ARIMA vs LSTM vs Bi-LSTM vs Transformer on two real-world datasets, measured both by raw prediction accuracy and by how well the resulting pod counts match actual need (provisioning accuracy).

In plain terms: instead of waiting for the system to choke and then scrambling, the autoscaler looks at the recent traffic pattern, predicts the next minute, and pre-orders exactly the number of pods it expects to need.

How It Works, Step by Step

Collect metrics (Monitor). In the deployed system, a Prometheus metrics server continuously gathers metrics from the Kubernetes cluster — request throughput, response time, autoscaling metrics — at regular intervals and stores them in a metrics database. (In the offline experiments, the “metrics” are real web-server logs instead — see Inputs.)
Turn raw logs into a clean time series. Each HTTP request log line (host, timestamp, method/route, reply code, bytes) is aggregated by minute: count how many requests arrived in each one-minute bucket. The result is a single number per minute = the workload. For the FIFA dataset, values are normalized to the [0, 1] range because the raw counts vary wildly and that makes learning hard.
Frame the forecasting task. It’s set up as univariate -> univariate, one-step-ahead: feed in the last 10 minutes of workload (w_{t-10}, ..., w_{t-1}) and predict the single next minute w_t. (“Univariate” = one input variable, the request count, predicting that same one variable.)
Predict with the Informer Transformer (Analyze). The sequence of 10 values goes into Informer, which outputs the predicted request count for the next minute. Internally Informer is an encoder–decoder Transformer with three efficiency features (the last bullet covers training):
- ProbSparse self-attention (encoder): ordinary self-attention compares every position with every other position, so its cost grows quadratically with sequence length and eats memory. ProbSparse instead keeps only the few dominant queries (the ones that actually carry most of the attention weight) and skips the long tail of queries that contribute little, slashing cost.
- Self-attention distilling: between encoder layers, the model trims down to just the strongest attention features, shrinking the data as it flows deeper.
- Decoder (generative): produces the forecast in one shot rather than step-by-step. It reuses ProbSparse attention with masking (future positions set to negative infinity) so it can’t “cheat” by looking ahead — and so it stays fast even for long predictions.
- Training uses Mean Squared Error loss with early stopping (halt when the training loss stops improving) to avoid overfitting. NASA: 10 epochs; FIFA: 4 epochs.
Convert the forecast into a pod count (Plan). A short Python script (the paper’s Pods Requirement Manager, Algorithm 1) runs every 60 seconds and computes:
```
pods_{t+1} = predicted_requests_{t+1} / requests_per_pod
```
i.e. divide the predicted incoming requests by how many requests a single pod can handle, giving the number of pods needed for the next interval.
Apply the scaling (Execute). The script issues the Kubernetes scale command with that pod count, so the cluster grows or shrinks before the predicted load arrives. This loop repeats continuously while the system runs.
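
The Plan step above can be sketched in a few lines. This is a minimal illustration, not the paper’s Algorithm 1 verbatim: rounding up with `math.ceil` and the `min_pods` floor are assumptions added here so the result is a usable whole number, the sample values are made up, and the paper does not report its actual `requests_per_pod` value. Issuing the Kubernetes scale command is left out.

```python
import math

def pods_required(predicted_requests: float, requests_per_pod: float, min_pods: int = 1) -> int:
    """Pods for the next interval: predicted load divided by per-pod capacity, rounded up."""
    if requests_per_pod <= 0:
        raise ValueError("requests_per_pod must be positive")
    needed = math.ceil(max(predicted_requests, 0.0) / requests_per_pod)
    return max(min_pods, needed)  # min_pods is an assumed floor, not from the paper

print(pods_required(950, 100))  # 9.5 pods of work, rounded up to 10
print(pods_required(0, 100))    # no predicted load, so the assumed floor of 1 applies
```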

Workload Modeling & Prediction Pipeline (full-text deep read)

This section zooms in on two things only: (A) what “the workload” actually is and how it’s turned into numbers, and (B) the exact path from raw data to a prediction. Everything here is from this paper’s full text; where the paper is vague, that’s flagged.

(A) How the workload is modeled and characterized

What real-world quantity is “the workload”? It is the number of HTTP requests arriving at a web server, counted per minute — i.e. the arrival/request rate of incoming traffic. This is the demand the app must serve. (The paper mentions CPU usage in passing — it tuned the ARIMA baseline on CPU data — but every reported forecasting experiment uses request counts, not CPU or memory. So treat “workload = requests per minute” as the real signal.)

One signal or many? (univariate vs multivariate.) It is univariate — a single number per time step (requests in that minute). The paper states the forecasting task explicitly as “univariate to univariate”: one input variable in, the same one variable out. There is no joint modeling of CPU + memory + latency together; it is purely the request-count series predicting itself.

Time granularity. The series is sampled at one value per minute. Raw logs are bucketed into one-minute bins and the requests in each bin are counted, producing one workload number per minute. In the live system the Monitor phase collects Prometheus metrics “at regular intervals,” but the experiments run on the per-minute aggregated series.

What the raw data looks like, and what makes it hard. The raw input is web-server request logs — one line per HTTP request, each line carrying host/IP, a timestamp, the request method and route, the HTTP reply code, and the bytes returned. Two real datasets are used:

NASA Kennedy Space Center logs (1 Jul–31 Aug 1995): 3,461,612 requests. The data has real-world gaps: logs are missing for 28–31 Jul, and there is a multi-day outage on 1–3 Aug because the servers were shut down for Hurricane Erin.
FIFA World Cup 1998 site logs (30 Apr–26 Jul 1998): 1,352,804,107 requests, spread across 245 raw files over 88 days.

What makes this hard, per the paper: the traffic is bursty and highly variable. The paper is actually self-contradictory about FIFA: when describing the raw data it says the FIFA workload “has high variation in values, causing predictability to be hard” (which is why it normalizes it), yet when explaining the results it calls the aggregated FIFA workload “simple” with “less variation” — and on that smoother aggregated series the simpler LSTM/Bi-LSTM models matched or beat the Transformer, whereas the Transformer’s clear win came on NASA. (The paper does not directly state which dataset is “more variable” overall, so treat the NASA-is-harder reading as an inference from where the Transformer helps, not a stated fact.) So the difficulty is mainly burstiness / wide swings in request volume plus real gaps/outages in the raw logs. (The paper does not do an explicit periodicity or seasonality decomposition of the workload, and it does not discuss long-sequence length as a problem for its 10-minute window — long-sequence concerns are only mentioned as the motivation for choosing Informer in general.)

How the workload is represented numerically. Three concrete moves, and nothing fancier:

Aggregation into a per-minute count — the logs become a clean 1-dimensional time series of integers (requests per minute).
Normalization to the [0, 1] range — done for the FIFA dataset specifically, because its values vary so widely that learning is hard; scaling everything into 0–1 keeps the numbers in a friendly range. (NASA results are reported in raw request units; FIFA results are reported in the normalized scale, which is why FIFA’s error numbers look tiny.)
Sliding window of the last 10 values — the model is fed a window of 10 consecutive per-minute counts as its input. There is no patching, no frequency/Fourier transform, no trend/seasonal/residual decomposition, and no learned token embedding described in this paper; the “representation” is simply the raw (or normalized) 10-number window handed to the network. This is a key faithfulness point: the paper does not apply the kind of decomposition or patch-embedding tricks seen in other Transformer-forecasting work.
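
The first two moves above (per-minute aggregation and [0, 1] scaling) can be sketched as follows. This is an illustration under stated assumptions: the log lines are invented samples in Common Log Format, min-max scaling is assumed as the normalization method (the paper only says values are normalized to [0, 1]), and minutes with no requests are simply absent rather than zero-filled. Parsing the month abbreviation with `%b` also assumes an English locale.

```python
from collections import Counter
from datetime import datetime

def requests_per_minute(log_lines):
    """Count requests per one-minute bucket from Common Log Format lines."""
    counts = Counter()
    for line in log_lines:
        stamp = line.split("[", 1)[1].split("]", 1)[0]  # e.g. 01/Jul/1995:00:00:01 -0400
        ts = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.replace(second=0)] += 1               # truncate to the minute
    return dict(sorted(counts.items()))

def min_max_scale(values):
    """Rescale values linearly into [0, 1]; a constant series maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

# Invented sample lines, for illustration only.
sample = [
    'host1 - - [01/Jul/1995:00:00:01 -0400] "GET /a HTTP/1.0" 200 6245',
    'host2 - - [01/Jul/1995:00:00:59 -0400] "GET /b HTTP/1.0" 200 3985',
    'host3 - - [01/Jul/1995:00:01:05 -0400] "GET /c HTTP/1.0" 304 0',
]
series = list(requests_per_minute(sample).values())
print(series)                 # [2, 1]
print(min_max_scale(series))  # [1.0, 0.0]
```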

(B) The step-by-step prediction pipeline

Collect / preprocess the raw signal. Start from raw web-server request logs. Group every log line into the one-minute bucket its timestamp falls in, and count requests per minute. This converts millions of irregular log lines into one tidy number-per-minute series — the workload. (Analogy: instead of recording every single customer walking through a shop door, you just write down “how many came in this minute” — far easier to forecast from.)
Split into train and test by date. For NASA, train on 1 Jul–14 Aug 1995 and test on the last 16 days. For FIFA, train on 30 Apr–30 Jun 1998 and test on the last 16 days. (The split is by calendar date, not random shuffling, which is the correct way to evaluate a forecaster — you only ever predict the future from the past.)
Normalize where needed. For FIFA, rescale the per-minute counts into [0, 1] because of the wide value range; this is what makes the model trainable on that noisy series. NASA is used in raw request units.
Frame the learning problem: 10-in, 1-out, univariate. Define the task as input sequence length = 10, output sequence length = 1. Each training example is “here are the last 10 minutes of request counts (w_{t-10} ... w_{t-1}); predict the count for the next minute (w_t).” This is one-step-ahead forecasting (horizon = a single future minute), and it is univariate -> univariate (request count predicting request count). The choice of a 10-minute lookback is taken directly from the prior Bi-LSTM work the authors compare against.
Feed the window into the Informer encoder, and why. The 10-value window enters Informer, an efficient Transformer for time series. The encoder uses ProbSparse self-attention. Ordinary self-attention (a mechanism that lets every time step compare itself with every other to find which past moments matter) has a cost that grows quadratically with sequence length and uses a lot of memory; ProbSparse keeps only the handful of dominant “queries” (the few comparisons that actually carry most of the weight) and ignores the long tail, so it does a similar job far more cheaply. Between encoder layers it also applies self-attention distilling, which trims the data down to just the strongest features as it flows deeper. Why: it shrinks memory and keeps only the signal that matters. (Intuition: rather than letting all 10 minutes shout at each other equally, the model lets only the few most informative minutes drive the prediction.)
Decode to a forecast, and why this style. The decoder is generative: it produces the forecast in one shot rather than re-running step-by-step, which keeps it fast even when predictions get long. It reuses ProbSparse attention but with masking — future positions are set to negative infinity so the model cannot peek ahead at answers it shouldn’t see yet. Why: one-shot generation plus masking avoids the slow, error-accumulating loop of step-by-step decoding.
Train with MSE and early stopping. The network is trained from scratch using Mean Squared Error (the average squared gap between predicted and actual counts) as the loss, with early stopping (halt when training loss stops improving) to prevent overfitting. NASA: 10 epochs; FIFA: 4 epochs. (For reference, the Bi-LSTM baseline used input size 10, 30 hidden units, one output cell, batch size 64, 50 epochs, ReLU, early-stopping patience 5; ARIMA used p=5, d=1, q=0.)
The prediction result. The model outputs a single number: the predicted count of HTTP requests in the very next minute, w_t. It is a point forecast (one value, not a probability distribution or confidence interval), a one-step / one-minute horizon, in units of requests per minute (raw count for NASA; normalized 0–1 value for FIFA). On the harder NASA data this prediction was markedly more accurate than ARIMA/LSTM/Bi-LSTM (e.g. MSE 77.3 vs Bi-LSTM’s 186.1, R² 0.83 vs 0.71); on the smoother FIFA data it was competitive but not the best.
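
The 10-in, 1-out framing from the steps above amounts to sliding a window along the per-minute series. A minimal sketch follows; the function name and the toy series are assumptions for illustration.

```python
def make_windows(series, lookback=10):
    """Pair each run of `lookback` values with the value that follows it."""
    inputs, targets = [], []
    for t in range(lookback, len(series)):
        inputs.append(series[t - lookback:t])  # w_{t-10} ... w_{t-1}
        targets.append(series[t])              # w_t
    return inputs, targets

counts = list(range(1, 14))  # 13 toy per-minute counts: 1, 2, ..., 13
X, y = make_windows(counts)
print(len(X), X[0], y[0])    # 3 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] 11
```

A series of N values yields N − 10 training examples, each a point-forecast target for the minute right after its window.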

How the result is consumed (briefly, outside the workload->prediction focus). A Plan script that runs every 60 seconds divides the predicted requests by a fixed per-pod capacity (pods_{t+1} = predicted_requests_{t+1} / requests_per_pod, where the predicted requests are the forecast written w_t above) and issues a Kubernetes scale command, so the cluster is resized before the predicted traffic arrives. The prediction result itself ends at the eighth and final step above: the next minute’s request count.

Model Parameters & How They Were Chosen

This section reports only the hyperparameters the paper states, and the basis on which each was selected. A recurring theme: the paper is detailed about the task framing and the baselines, but almost silent about the internal architecture and the optimizer settings of its own Informer/Transformer. Where the paper gives no value, that is marked “not reported”; no typical defaults are imported.

(A) What are the model parameters?

Architecture (Transformer / Informer). The paper describes the Informer structure only qualitatively (an encoder built from stacked ProbSparse self-attention blocks with self-attention distilling between layers, plus a generative masked decoder that also uses ProbSparse attention), citing the original Informer paper for the mechanism. It reports no concrete numeric architecture hyperparameters.

Parameter	Value in paper
Encoder layers	not reported
Decoder layers	not reported
Attention heads	not reported (only “masked multi head attention” mentioned, no count)
Model / hidden dimension (d_model)	not reported
Feed-forward dimension	not reported
Dropout	not reported
Embedding dimension / token embedding	not reported (no embedding scheme described)
Activation	not reported (for the Transformer)
ProbSparse factor (the setting that controls how many “dominant queries” `u` are kept)	not reported (described in words only: it “focuses on those `u` dominant queries which belong in the long tail”)
Self-attention distilling	present between encoder layers (no numeric setting given)
Decoder style	generative, one-shot; future positions masked to negative infinity (“masked dot-products are set to negative infinity”)
Diffusion / GAN / RevIN / FFT settings	not applicable (none of these are used)
Number of output values	1 (single next-minute value; see Data/windowing)
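
Since the ProbSparse row above is described only in words, here is a simplified single-head sketch of the idea as described in the original Informer paper: score each query by "max attention score minus mean attention score", run full softmax attention only for the top `u` queries, and give every other query the plain mean of the values. This is an assumption-laden illustration, not the paper’s configuration: the real Informer estimates the score on a random sample of keys, uses multiple heads, and treats the masked decoder differently, and the value of `u` used by Shim et al. is not reported. The toy matrices are made up.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, u):
    """Simplified ProbSparse attention: full attention for the top-u queries only."""
    d = len(Q[0])
    scores = [[sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K] for q in Q]
    # Sparsity measure per query: max score minus mean score (large = dominant query).
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    top = set(sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)[:u])
    mean_v = [sum(col) / len(V) for col in zip(*V)]
    out = []
    for i, row in enumerate(scores):
        if i in top:
            weights = softmax(row)
            out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(mean_v))])
        else:
            out.append(list(mean_v))  # non-dominant query: fall back to the mean of V
    return out, sorted(top)

# Toy inputs: query 0 is strongly peaked, queries 1 and 2 are nearly uniform.
Q = [[4.0, 0.0], [0.0, 0.0], [0.0, 0.1]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, top = probsparse_attention(Q, K, V, u=1)
print(top)  # [0]
```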

Architecture (baselines). The baselines are specified more concretely than the proposed model.

Model	Parameters reported
ARIMA	autoregressive order p = 5, degree of differencing d = 1, moving-average order q = 0
LSTM	nothing reported (described conceptually as an RNN with forget/input/output gates; no layer count, hidden size, or training settings given)
Bi-LSTM	input size 10, 30 hidden units, 1 output neural cell, ReLU activation

Training.

Parameter	Transformer (Informer)	Bi-LSTM	ARIMA / LSTM
Loss function	Mean Squared Error (MSE)	MSE	not reported
Epochs	10 (NASA), 4 (FIFA)	50	not reported
Early stopping	yes, “if training loss is consistent” (no patience value given for the Transformer)	yes, patience 5	not reported
Batch size	not reported	64	not applicable / not reported
Optimizer	not reported	not reported	not applicable (ARIMA)
Learning rate (and schedule)	not reported	not reported	not reported
Weight decay	not reported	not reported	not reported
Loss weights	not applicable (single MSE term)	not applicable	not reported
Hardware	not reported	not reported	not reported

(The MSE loss with early stopping is stated to apply to LSTM, Bi-LSTM, and the Transformer collectively: “Mean Squared Error loss function is used in LSTM, Bi-LSTM and transformer preventing over-fitting and early stopping if training loss is consistent.”)
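
For readers unfamiliar with patience-based early stopping, a minimal sketch follows. It watches the training loss, as the paper states, rather than a validation loss. The function name, the `min_delta` parameter, and the loss values are assumptions for illustration; patience 5 is the reported Bi-LSTM setting, and the Transformer’s trigger is not reported.

```python
def epochs_until_early_stop(epoch_losses, patience=5, min_delta=0.0):
    """Return the epoch at which training halts, given per-epoch losses."""
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(epoch_losses, start=1):
        if loss < best - min_delta:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return len(epoch_losses)  # never triggered: all epochs ran

# Hypothetical per-epoch training losses.
losses = [0.90, 0.50, 0.40, 0.40, 0.41, 0.40]
print(epochs_until_early_stop(losses, patience=2))  # 5
```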

Data / windowing.

Parameter	Value
Forecasting task	univariate to univariate (one input variable, the request count, predicting that same variable)
Lookback window (input sequence length)	10 minutes (`w_{t-10} ... w_{t-1}`)
Forecast horizon (output sequence length)	1 (one step, the next minute `w_t`)
Sampling interval	1 minute (raw logs aggregated into one-minute buckets)
Normalization	FIFA workload scaled to [0, 1] (because of its wide value range); NASA used in raw request units
Train/test split (NASA)	train 1 Jul–14 Aug 1995; test the last 16 days
Train/test split (FIFA)	train 30 Apr–30 Jun 1998; test the last 16 days
Validation split	not reported (no separate validation set is described; early stopping is stated against training loss)
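
The date-based split in the table can be expressed as a simple cut on the timestamp, with no shuffling. A small sketch, using the NASA cut-off date from the table and made-up values (shown at daily granularity only to keep the example short):

```python
from datetime import date

def split_by_date(samples, last_train_day):
    """Chronological split: everything up to `last_train_day` trains, the rest tests."""
    train = [(d, v) for d, v in samples if d <= last_train_day]
    test = [(d, v) for d, v in samples if d > last_train_day]
    return train, test

# Toy (date, value) pairs around the NASA cut-off; the values are invented.
samples = [(date(1995, 8, 13), 120), (date(1995, 8, 14), 95), (date(1995, 8, 15), 130)]
train, test = split_by_date(samples, date(1995, 8, 14))
print(len(train), len(test))  # 2 1
```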

Total parameter count / model size. Not reported for any model. The paper notes that “the balance between the quality of prediction and the computation time” matters in practice but gives no parameter counts, memory footprints, or training/inference times.

Planning constant (downstream of the model). requests_per_pod (per-pod request capacity) is used by the Plan script to convert predicted requests into a pod count; its numeric value is not reported.

(B) How were the parameters chosen?

The paper does not describe any systematic hyperparameter search (no grid search, no random search, no cross-validation, no reported search ranges) for its proposed Transformer. The selection basis, where stated, is as follows:

Lookback window = 10, horizon = 1, univariate-to-univariate framing. Taken from prior work: “Inspired by the authors of [11], we have experimented the our model using the last 10 minutes workload for one-step prediction.” Reference [11] is the Bi-LSTM Kubernetes autoscaler of Dang-Quang and Yoo (2021), which this paper directly builds on and compares against. The 10-minute window is therefore inherited, not tuned here.
Informer architecture (ProbSparse attention, distilling, generative decoder). Adopted wholesale from the original Informer paper, reference [5] (Zhou et al., AAAI 2021), with the implementation pointed at the authors’ public Informer2020 repository (reference [14]). The specific internal dimensions are neither restated nor reported, so the chosen values cannot be recovered from this paper.
Bi-LSTM configuration (input 10, 30 hidden units, 1 output cell, batch 64, 50 epochs, early-stopping patience 5, ReLU). Stated to be “inspired by the implementation in [11],” i.e. taken from the same prior Bi-LSTM work rather than independently tuned.
ARIMA (p = 5, d = 1, q = 0). Chosen by informal manual observation, not a documented search: “We have used P-value as 5, D value as 1, and Q value as 0 ... We observed that it gives better prediction for our data set (which is CPU usage).” Note the stated tuning signal here is CPU usage, whereas the forecasting experiments themselves use request counts; the paper does not reconcile this.
Epochs (Transformer: 10 for NASA, 4 for FIFA). No rationale is given for the specific epoch counts or for why FIFA uses fewer; the paper simply states the values. Early stopping (“if training loss is consistent”) is the stated overfitting control, but its trigger threshold/patience for the Transformer is not specified.
All remaining Transformer hyperparameters (layers, heads, d_model, feed-forward size, dropout, ProbSparse factor, optimizer, learning rate, batch size). The paper is silent on both their values and how they were chosen. No rationale is invented here.

Inputs (what it consumes)

Primary signal: a univariate time series of HTTP request volume per minute (workload), derived by aggregating raw web-server request logs into one-minute buckets.
Lookback window: the last 10 minutes of that series (w_{t-10} ... w_{t-1}) per prediction.
In the live system (Monitor phase): Prometheus-collected cluster metrics — application throughput, response time, autoscaling metrics — stored in a metrics database at regular intervals. (The experiments, however, are driven purely by the historical request-count series, not CPU/memory.)
One configuration constant for planning: requests_per_pod — how many requests a single pod can serve — used to convert predicted requests into a pod count.
Note: although the paper mentions CPU usage (and even used CPU data when tuning ARIMA), the reported forecasting experiments are about request counts, not multivariate CPU/memory metrics.

Outputs (what it produces)

From the model: a one-step-ahead forecast of the number of HTTP requests in the next minute (w_t). Horizon = 1 minute ahead; single value (one-step prediction).
From the Plan script: a recommended number of pods for the next interval (pods_{t+1}).
From the Execute step: an actual Kubernetes scale command that adds or removes pods.

So the chain is: forecasted requests/minute -> recommended pod count -> issued scaling action. The paper carries the idea all the way to an actuation command (not just a prediction), though the heavy lifting and the novelty are in the forecast.

How It Fits the Autoscaling Framework (MAPE-K)

The whole design is explicitly a MAPE loop (adapted from prior Bi-LSTM work [11]), and the paper is clear about which stage each piece lives in:

Monitor: Prometheus gathers and stores cluster metrics at intervals. (Collect.)
Analyze: This is where the Transformer lives. The Informer model takes the stored recent workload and predicts the next minute. (Forecast.)
Plan: Algorithm 1 converts the predicted requests into a pod count every 60 seconds. (Decide.)
Execute: the script issues the Kubernetes scale command. (Act.)
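
The four stages above can be wired together as one pass of a control loop. The sketch below is a skeleton under stated assumptions: each stage is a plain callable, the forecaster is a naive "repeat the last value" stand-in rather than Informer, the capacity of 50 requests per pod is invented, and the Execute stage only records the request instead of calling Kubernetes. In the described system such a pass would repeat every 60 seconds.

```python
from typing import Callable, List

def mape_cycle(
    monitor: Callable[[], List[float]],
    analyze: Callable[[List[float]], float],
    plan: Callable[[float], int],
    execute: Callable[[int], None],
) -> int:
    """Run one Monitor -> Analyze -> Plan -> Execute pass and return the planned pod count."""
    recent = monitor()          # last 10 per-minute request counts
    forecast = analyze(recent)  # predicted requests for the next minute
    pods = plan(forecast)       # pod count for the next interval
    execute(pods)               # apply the scaling decision
    return pods

# Stand-ins for illustration only.
issued = []
pods = mape_cycle(
    monitor=lambda: [100, 110, 120, 130, 140, 150, 160, 170, 180, 190],
    analyze=lambda window: window[-1],              # naive forecast, not Informer
    plan=lambda forecast: -(-int(forecast) // 50),  # ceiling division, 50 requests/pod assumed
    execute=issued.append,                          # records the request; no cluster call
)
print(pods, issued)  # 4 [4]
```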

Mapping to the framing questions:

Reactive or proactive? Proactive. The defining move is forecasting the next interval and scaling ahead of the load, hiding the pod startup delay that cripples Kubernetes’ default reactive HPA.
Horizontal or vertical? Horizontal — it adds/removes whole pods (replicas), not CPU/RAM per pod.
Does it actuate, or just predict? It does both, but lightly on the actuation side: the contribution is the forecaster, and the planning rule (requests ÷ per-pod capacity) and scaling command are deliberately simple. It is best understood as a predictive front-end that replaces the threshold-trigger in the Analyze stage, feeding a straightforward horizontal scaler. It is a custom autoscaler rather than the stock Kubernetes HPA, but the planning logic is much simpler than the model.

Evaluation (datasets & metrics, briefly)

Datasets (both real-world web traffic, processed to per-minute workload):

NASA Kennedy Space Center web-server logs (Jul–Aug 1995, ~3.46M requests). Train: through 14 Aug; test: last 16 days.
FIFA World Cup 1998 website access logs (Apr–Jul 1998, ~1.35 billion requests). Normalized to [0,1]. Train through 30 Jun; test: last 16 days.

Accuracy metrics: MSE, RMSE, MAE (lower = better) and R² (closer to 1 = better).
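
For reference, the four accuracy metrics can be computed as below. This is a plain-Python sketch with toy values that are not taken from the paper; it assumes the actual series is not constant (otherwise R² is undefined).

```python
import math

def regression_metrics(actual, predicted):
    """Return MSE, RMSE, MAE and R^2 for two equal-length sequences."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_actual = sum(actual) / n
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)  # total variance of the actuals
    r2 = 1.0 - (mse * n) / ss_tot
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}

# Toy values, not from the paper.
m = regression_metrics([10, 20, 30, 40], [12, 18, 33, 39])
print(m["mse"], m["mae"])  # 4.5 2.0
```

RMSE is just the square root of MSE, which is a quick sanity check on the tables that follow: for the NASA Transformer column, the square root of 77.34 is about 8.79.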

NASA results — Transformer wins on every metric:

Metric	ARIMA	LSTM	Bi-LSTM	Transformer
MSE	196.91	184.29	186.07	77.34
RMSE	14.03	13.58	13.64	8.79
MAE	10.57	10.27	10.38	6.59
R²	0.693	0.712	0.710	0.830

FIFA results: here the data is simpler/smoother, and LSTM and Bi-LSTM beat the Transformer on every metric (their MSE is roughly 3.5 times lower), although all three models reach R² above 0.995. (Note: the paper’s FIFA table reports only LSTM, Bi-LSTM, and Transformer, with no ARIMA column, and all values are on the normalized [0,1] scale, which is why the errors look tiny.)

Metric	LSTM	Bi-LSTM	Transformer
MSE	0.0000106190	0.0000111431	0.0000380358
RMSE	0.0032586753	0.0033381351	0.0061673189
MAE	0.0017331059	0.0017018933	0.0025280349
R²	0.9987104535	0.9986467957	0.9950739941

Honest finding: the Transformer’s advantage shows up on the harder, more variable workload (NASA), not the easy one (FIFA).

Provisioning accuracy (SPEC-inspired): Θ_U = under-provisioning %, Θ_O = over-provisioning % (0 is best). On NASA the Transformer’s forecasts gave the best pod provisioning:

Metric	ARIMA	LSTM	Bi-LSTM	Transformer
Θ_U [%] (under)	9.96	9.03	8.22	4.49
Θ_O [%] (over)	22.73	23.92	25.84	17.27

So the better forecast translated into both fewer SLA-risking shortfalls and less wasted over-provisioning.
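
These notes name Θ_U and Θ_O without giving a formula. The sketch below assumes the usual SPEC-style definition of provisioning accuracy: for each time step, the shortfall (or surplus) of supplied pods relative to required pods, averaged over the run and expressed as a percentage. Whether the paper uses exactly this form is an assumption, and the pod counts are toy values.

```python
def provisioning_accuracy(required, supplied):
    """Under- and over-provisioning percentages, averaged over equal-length time steps."""
    if len(required) != len(supplied) or not required:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(required)
    under = sum(max(r - s, 0) / r for r, s in zip(required, supplied))
    over = sum(max(s - r, 0) / r for r, s in zip(required, supplied))
    return 100.0 * under / n, 100.0 * over / n

# Toy pod counts, not from the paper: required vs. supplied per minute.
theta_u, theta_o = provisioning_accuracy([4, 5, 10, 8], [4, 4, 10, 10])
print(theta_u, theta_o)  # 5.0 6.25
```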

Training & pre-training

Trained from scratch — no pretrained or foundation model.

The authors borrow only the architecture of Informer (from Zhou et al. 2021) and train it from scratch — together with the ARIMA, LSTM, and Bi-LSTM baselines — directly on the per-minute web-request traces (NASA Kennedy Space Center 1995 and FIFA World Cup 1998). The paper’s conclusion frames this plainly as supervised learning: there is no pretraining, fine-tuning, foundation model, or zero-shot transfer anywhere in the pipeline.

Training setup, in brief:

Task: univariate -> univariate, one-step-ahead (input length 10, output length 1).
Transformer: 10 epochs (NASA) / 4 epochs (FIFA); MSE loss with early stopping to curb overfitting.
Bi-LSTM: 50 epochs, batch size 64, ReLU activation, early stopping (patience 5).
ARIMA: p=5, d=1, q=0.
Metrics: MSE / RMSE / MAE / R² plus provisioning accuracy (Θ_U, Θ_O).
(No optimizer or learning rate is reported.)

Strengths

Clear win for the core idea: on the harder NASA dataset the Transformer cut MSE by about 58% versus the prior Bi-LSTM state of the art (186.07 to 77.34; RMSE and MAE fell by roughly 36%), and reduced both under- and over-provisioning.
End-to-end story: not just a model in isolation — it’s slotted into a concrete MAPE-based Kubernetes autoscaler with monitoring (Prometheus) and a real scaling action.
Uses Informer, not vanilla Transformer, so it’s mindful of the memory/compute cost that usually makes attention expensive on long sequences.
Fair comparison across four model families (ARIMA, LSTM, Bi-LSTM, Transformer) on two well-known public datasets, with both ML metrics and provisioning-accuracy metrics.
Proactive by design, directly addressing the pod-startup-lag weakness of Kubernetes’ default reactive HPA.

Limitations

Not always the best model: on the smooth FIFA dataset, plain LSTM/Bi-LSTM beat the Transformer. The Transformer only clearly wins on more variable workloads — so it isn’t a universal upgrade.
Very short horizon, one step: it predicts only 1 minute ahead, one value at a time. Whether 60 seconds is enough lead time to fully hide pod cold-start under a sharp spike is not deeply analyzed.
Univariate only: it forecasts from request counts alone. It does not use multivariate signals (CPU, memory, latency together), even though the live monitor collects them — so it may miss resource dynamics that request count doesn’t capture.
Simplistic planning rule: pods = predicted requests ÷ per-pod capacity assumes a fixed, known capacity per pod and ignores warm-up, minimum replicas, cool-down, and request heterogeneity.
Offline / log-replay evaluation: results come from historical datasets, not a live cluster under real traffic; no end-to-end latency, SLA, or cost numbers from a running deployment.
Single-machine view: the authors themselves note real clouds are distributed across many machines/microservices, and the model would need extending to capture inter- and intra-machine behavior (they sketch a future “action vector” per machine).
Compute cost vs accuracy trade-off is acknowledged but not quantified — Transformers are heavier to train/run than ARIMA or LSTM.

Glossary

Autoscaling — automatically adjusting how many resources an app gets so it meets demand without waste.
Reactive scaling — scale only after a metric crosses a threshold; always lagging.
Proactive (predictive) scaling — forecast future demand and scale ahead of it.
Kubernetes — the dominant system for running containerized apps across a cluster.
Pod — Kubernetes’ smallest deployable unit (one or more containers); horizontal scaling adds/removes pods.
HPA (Horizontal Pod Autoscaler) — Kubernetes’ built-in, threshold-based reactive autoscaler.
Horizontal scaling — add/remove whole instances (pods). Vertical scaling — give one instance more/less CPU/RAM.
Provisioning delay / cold start — the time it takes a new pod to become ready; the main reason reactive scaling feels slow.
SLA / SLO — the promised (Agreement) / targeted (Objective) quality of service, e.g. max response time; violating it is the cost of under-provisioning.
Over-/under-provisioning — having too many pods (wasted money) / too few pods (slow, dropped requests).
MAPE loop — Monitor -> Analyze -> Plan -> Execute control cycle for self-managing systems.
Prometheus — the metrics-collection tool used in the Monitor phase.
Time-series forecasting — predicting future values of a quantity from its past.
Self-attention — the Transformer mechanism that weighs which past points matter most for the prediction.
Transformer — attention-based neural network; here, a workload forecaster.
Informer — an efficient Transformer for long time series, using ProbSparse attention + distilling to cut memory/compute.
ProbSparse self-attention — keeps only the few dominant attention “queries,” making attention cheaper than the usual quadratic cost.
ARIMA — classic statistical time-series forecaster (no neural net).
LSTM / Bi-LSTM — recurrent neural nets with memory; Bi-LSTM reads sequences both directions and was the prior state of the art here.
Univariate, one-step-ahead — one input variable predicting that same variable, for exactly the next time step.
MSE / RMSE / MAE / R² — accuracy metrics (first three: lower is better; R²: closer to 1 is better).

References

Shim, S., Dhokariya, A., Doshi, D., Upadhye, S., Patwari, V., & Park, J.-Y. (2023). Predictive Auto-scaler for Kubernetes Cloud. 2023 IEEE International Systems Conference (SysCon), 1–8. 10.1109/syscon53073.2023.10131106