InformerAutoScale

Proactive Kubernetes pod autoscaling driven by an Informer forecaster and a custom scaling manager

Kumar et al. (2025)


TL;DR

Imagine an online store that runs in the cloud. When lots of users show up, it needs more “workers” (copies of the app) to keep things fast; when traffic drops, it should release workers so it stops paying for idle capacity. Most systems do this reactively — they wait until the servers are already overloaded and only then start adding workers, which is slow because spinning up a new worker takes time. This paper builds a proactive system: it uses a modern AI time-series forecasting model called Informer (an efficient cousin of the Transformer) to predict the traffic a minute ahead and add or remove workers before the spike hits. In Kubernetes, those workers are called pods. The headline result: to handle the same web traffic, the old reactive method needed up to 150 pods, while InformerAutoScale needed only 14 — a claimed 90.66% improvement in scaling efficiency, meaning far less wasted money while still keeping the app responsive.


The Problem (and why simple autoscaling isn’t enough)

Cloud applications face constantly changing demand. You want to allocate just enough computing resources: too few and the application slows down under load, too many and you pay for idle capacity.

The default tool in Kubernetes is the Horizontal Pod Autoscaler (HPA), which is reactive: it watches a metric like CPU usage and adds pods once it crosses a threshold (e.g. “CPU > 70%, add a pod”). Two problems:

  1. It always lags. By the time CPU is high, users are already experiencing slowness.

  2. Booting a pod isn’t instant (the “cold-start” delay). So even after the autoscaler decides to act, there’s a gap before the new pod can take traffic.

In the paper’s real experiment, this reactive lag caused wildly unstable behavior — pod counts swinging between 0 and 150 — because the system kept over-reacting to spikes.
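To make the reactive rule concrete, here is a minimal sketch of the proportional rule that the Kubernetes documentation gives for the HPA: desired replicas = ceil(current replicas × current metric / target metric). The 70% target and the sample readings are illustrative values, not figures from the paper, and the sketch leaves out the HPA's tolerance band and stabilization window.

```python
import math


def hpa_desired_replicas(current_replicas: int,
                         current_metric: float,
                         target_metric: float) -> int:
    """Core HPA rule: scale the replica count in proportion to metric / target."""
    return math.ceil(current_replicas * current_metric / target_metric)


if __name__ == "__main__":
    # Illustrative only: 4 pods averaging 95% CPU against a 70% target.
    print(hpa_desired_replicas(4, 95.0, 70.0))
```

The input is a metric that has already been observed, so the rule can only respond after load has risen, which is the lag described above.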

The fix is proactive (predictive) autoscaling: forecast the near-future workload and scale ahead of time, hiding the cold-start delay. To forecast well you need a good time-series model. Prior work tried LSTMs, Bi-LSTMs, and standard Transformers, but those struggle with long sequences (LSTMs forget; standard Transformers get slow and memory-hungry because their attention cost grows with the square of the sequence length). This paper’s bet: the Informer model fixes exactly those weaknesses.


Background

A few terms, defined once:

Why Informer over a plain Transformer?

The paper’s Table 2 contrasts them. In plain terms:

| Issue | Plain Transformer | Informer |
|---|---|---|
| Attention cost | Full attention, O(n²): slow and memory-heavy on long inputs | ProbSparse attention, O(n log n): only attends to the most informative time steps |
| How far it predicts | One step at a time | Many steps at once (multi-step / long-range) |
| Encoding of time | Generic positional encoding | Time2Vec + temporal embedding, better at periodic/daily patterns |
| Decoder input | Needs the full previous target sequence | Uses a distilled (compressed) summary from the encoder |
| Memory use | High | Low |

The “many steps ahead” property is the one that matters for autoscaling: you want to look into the future, not just one tick ahead.
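The attention-cost row can be illustrated by counting query-key score computations. This is a rough sketch, assuming ProbSparse keeps about c · ln(n) active queries that each attend to all n keys. The sampling factor c = 5 is an assumption made here for illustration, since the summary notes that the paper does not report it.

```python
import math


def full_attention_scores(n: int) -> int:
    """Every query attends to every key: n * n scores."""
    return n * n


def probsparse_attention_scores(n: int, c: float = 5.0) -> int:
    """About c * ln(n) active queries, each attending to all n keys.

    c is an assumed sampling factor; the paper does not report one.
    """
    active_queries = max(1, min(n, math.ceil(c * math.log(n))))
    return active_queries * n


if __name__ == "__main__":
    for n in (10, 96, 720):
        print(n, full_attention_scores(n), probsparse_attention_scores(n))
```

Under this assumed c, a 10-step window keeps every query active, so the saving only shows up for longer windows.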


Contribution in Simple Terms

The genuinely new thing is bringing the Informer model into a working Kubernetes autoscaler and wiring it into a MAPE control loop, then proving on real workloads that it scales with far fewer pods than reactive HPA and beats other predictors (ARIMA, LSTM, Bi-LSTM, vanilla Transformer).

Concretely the paper contributes:

  1. InformerAutoScale — an Informer-based proactive autoscaler that forecasts the next interval’s workload and pre-computes how many pods are needed.

  2. A real implementation on Docker Desktop + Kubernetes (not just a simulation), driven by two small control algorithms:

    • an Adaptation Manager Service that converts a workload forecast into a scale-out / scale-in decision, and

    • a Pod Requirement Management Service that recomputes the needed pod count every 60 seconds from the observed request rate.

  3. A Resource Removal Strategy (RRS) that removes surplus pods gradually (60% at a time) rather than all at once, to avoid oscillation and keep a safety reserve against sudden spikes.

  4. An evaluation showing the headline 90.66% efficiency gain (150 -> 14 max pods) plus better forecasting error and provisioning accuracy than four baseline models.

Think of it as the difference between a shop manager who only calls in extra staff after a queue forms (reactive) versus one who looks at last week’s pattern and rosters extra staff the night before a big sale (proactive InformerAutoScale).


How It Works, Step by Step

The system runs a continuous loop. Here is the full pipeline from raw logs to actual pods being created or removed (following the paper’s Figs. 3 and 4):

  1. Collect input data (Monitor). Pull historical and live metrics — CPU usage, memory, and request patterns — from Kubernetes-native tools (the Metrics Server and cAdvisor), exposed via an API and written into .yaml config files. Two sub-streams are gathered: a resource module (CPU, memory, etc. from worker nodes) and a user module (request size, pending/dropped requests, request arrival rate).

  2. Queue and pre-process. Incoming workload data goes through a central queue that organizes and forwards it for scheduling and forecasting.

  3. Forecast with the Informer (Analyze). The cleaned time series is fed into the Informer encoder–decoder:

    • Time encoding: raw inputs are augmented with Time2Vec + temporal embeddings to capture periodic/daily rhythms, then concatenated with the input.

    • ProbSparse self-attention: the input X_t is projected into Query/Key/Value matrices (Q = X_t W_q, K = X_t W_k, V = X_t W_v); attention is computed as Softmax(QKᵀ/√d_k)·V, but only for the most informative queries, cutting cost from O(n²) to O(n log n).

    • Distilled decoder input: the encoder’s output is compressed and handed to the decoder, which produces the forecast N(t+1) = f(X_t), the predicted workload/resource demand for the next interval. (ReLU adds non-linearity in hidden layers; SoftMax normalizes the final output.)

  4. Evaluate provisioning (still Analyze). Compare predicted vs. actual to compute forecasting error (MSE, RMSE, MAE) and the provisioning metrics ΘU (under) and ΘO (over).

  5. Decision engine. Combine the forecast with provisioning insights, scheduling logic, and demand/supply observers to finalize a scaling action.

  6. Compute required pods (Plan). Turn the forecast into a pod count:

    • P_desired = D̂(t+Δt) / C_pod: forecasted demand divided by the capacity of one pod.

    • Equivalently, Pod Count = N(t+1) / (Pod Resource Allocation).

    • Clamp to limits: P_new = max(P_min, min(P_desired, P_max)).

  7. Adaptation Manager Service (the scaling brain). For each deployment target, compare predicted pods pods_{t+1} with current pods pods_t:

    • If more are needed -> scale out (add pods).

    • If fewer are needed -> scale in, but only gradually using the Resource Removal Strategy (RRS): pods_surplus = (pods_t − pods_{t+1}) × RRS, with RRS = 0.60 (remove 60% of the surplus, keep the rest as a buffer). A floor pods_min guarantees baseline availability.

  8. Pod Requirement Management Service (every 60 s). On a fixed Container Deployment Target (CDT) cadence of 60 seconds, recompute pods from the live request rate: pods_{t+1} = Requests_{t+1} / requestsPerPod, and issue the scaling command. This 60 s delay also damps oscillation and helps absorb cold-start latency.

  9. Execute. The resource manager applies the decision: pods are added/removed in real time via Docker containers, Kubernetes, and Python scripts, with .yaml definitions. Changes are visible live in the Docker Desktop dashboard.

  10. Loop / feedback. The system continuously compares desired vs. current resources: ΔR(t) = R_desired(t) − R_current(t). It scales up if ΔR > 0, scales down if ΔR < 0, and does nothing if ΔR = 0, so it only acts when genuinely needed, avoiding flapping.
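The Plan and feedback steps (6 and 10) reduce to a few lines of arithmetic. The sketch below is a minimal illustration. The function names, the round-up to whole pods, and the sample numbers are assumptions, because the summary gives the formulas but not the rounding rule or concrete values for P_min and P_max.

```python
import math


def desired_pods(forecast_demand: float, pod_capacity: float) -> int:
    """P_desired = forecast demand / per-pod capacity, rounded up (rounding is assumed)."""
    return math.ceil(forecast_demand / pod_capacity)


def clamp_pods(p_desired: int, p_min: int, p_max: int) -> int:
    """P_new = max(P_min, min(P_desired, P_max))."""
    return max(p_min, min(p_desired, p_max))


def scaling_action(r_desired: int, r_current: int) -> str:
    """Decide from delta R = R_desired - R_current."""
    delta = r_desired - r_current
    if delta > 0:
        return "scale_out"
    if delta < 0:
        return "scale_in"
    return "hold"


if __name__ == "__main__":
    # Hypothetical numbers: 1300 requests forecast for the next minute, 100 per pod.
    target = clamp_pods(desired_pods(1300, 100), p_min=1, p_max=20)
    print(target, scaling_action(target, r_current=10))
```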

Non-transformer pieces worth noting: there is no diffusion/GAN/FFT here. The “extra” machinery beyond the Informer is classical control logic: the queue/scheduler, the two pod-sizing formulas, the RRS gradual-removal heuristic, the 60-second CDT timer, and the MAPE feedback loop.
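The RRS gradual-removal heuristic can be sketched as follows. This is an illustration under stated assumptions: the function name is hypothetical, the removal amount is rounded down to whole pods (the summary does not say how fractions are handled), and the fraction is written as an integer percentage to avoid floating-point surprises.

```python
def adapt_pod_count(current_pods: int,
                    predicted_pods: int,
                    pods_min: int = 1,
                    rrs_percent: int = 60) -> int:
    """Scale out at once to the forecast; scale in by only RRS percent of the surplus."""
    if predicted_pods >= current_pods:
        return max(pods_min, predicted_pods)
    surplus = current_pods - predicted_pods
    removed = surplus * rrs_percent // 100  # round down: an assumption
    return max(pods_min, current_pods - removed)


if __name__ == "__main__":
    # Hypothetical trajectory: demand drops so that only 4 pods are predicted.
    pods = 14
    for _ in range(4):
        pods = adapt_pod_count(pods, predicted_pods=4)
        print(pods)
```

With round-down, the last surplus pod is never removed, which keeps a small reserve. A different rounding choice would converge all the way to the forecast.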


Workload Modeling & Prediction Pipeline (full-text deep read)

This section zooms in on two things only: (A) what the paper actually treats as “the workload” and how it turns that into numbers a model can eat, and (B) the exact path from raw server logs to a prediction. Where the paper is vague or self-contradictory, that is called out rather than smoothed over.

(A) How the workload is modeled and characterized

What real-world quantity is “the workload”? The primary signal is the request arrival rate: how many web requests hit the application per unit time. Everything downstream keys off this. The core forecast is described as predicting “the workload” Workload(t+1), also written N(t+1), and the pod math divides a request count by a per-pod capacity (pods_{t+1} = Requests_{t+1} / requestsPerPod). Alongside the request rate, the paper also monitors CPU usage and memory utilization (and mentions bandwidth and network throughput as resources of interest). So the thing being managed is resource demand, but the thing being counted and predicted in the concrete experiments is the request volume per minute.

One signal or many (univariate vs multivariate)? It is set up as multivariate in principle but thin in practice. The Monitor stage explicitly splits into a resource module (CPU, memory, other node-level metrics) and a user module (request size, pending requests, dropped requests, request arrival rate), which is many signals. But the model’s configured input size is just 3 features (Table 3 says “Input size 3 / Features per input”; elsewhere the prose says “two features” and “an input size of three neural cells and two features,” which is internally inconsistent; the conflict is flagged under Model Parameters). So the design is multivariate, the runtime configuration is a handful of features, and the quantity being forecast and acted on is effectively the single request-rate series.

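Time granularity. Sampling and prediction happen on a one-minute grid. Concretely, logs are bucketed into one-minute intervals and the scaling decision is recomputed every 60 s (the CDT cadence).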

So one time step = one minute, and the model reasons in minutes, not seconds or hours.
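Turning raw request logs into a per-minute series can be sketched as below. It assumes Common Log Format timestamps of the form `[01/Jul/1995:00:00:01 -0400]`, the layout used by the public NASA-HTTP trace. The sample lines are made up for illustration, and the function name is hypothetical.

```python
from collections import Counter
from datetime import datetime

LOG_TIME_FORMAT = "%d/%b/%Y:%H:%M:%S %z"


def requests_per_minute(log_lines):
    """Count requests in each one-minute bucket, keyed by the minute's start time."""
    counts = Counter()
    for line in log_lines:
        start = line.find("[")
        end = line.find("]", start)
        if start == -1 or end == -1:
            continue  # skip lines without a bracketed timestamp
        stamp = datetime.strptime(line[start + 1:end], LOG_TIME_FORMAT)
        counts[stamp.replace(second=0, microsecond=0)] += 1
    return dict(sorted(counts.items()))


if __name__ == "__main__":
    # Made-up sample lines, not real trace data.
    sample = [
        'host-a - - [01/Jul/1995:00:00:01 -0400] "GET /a HTTP/1.0" 200 100',
        'host-b - - [01/Jul/1995:00:00:45 -0400] "GET /b HTTP/1.0" 200 200',
        'host-c - - [01/Jul/1995:00:01:10 -0400] "GET /c HTTP/1.0" 404 0',
    ]
    for minute, count in requests_per_minute(sample).items():
        print(minute.isoformat(), count)
```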

What the raw data looks like, and what makes it hard.

How it is represented numerically for the model. This paper is light on heavy signal processing — there is no explicit trend/seasonal/residual decomposition, no Fourier/frequency transform, no patching. The representation is:

  1. Windowing. Raw per-minute metrics are turned into an input sequence X_t of shape L × d_model, where the lookback window length L = 10 time steps (Table 3: “Sequence length 10 / Input window size”). That is a deliberately short context that “captures recent trends.”

  2. Time encoding / embedding. Rather than raw clock values, timestamps are turned into learned vectors using Time2Vec (a learnable encoding of time that captures periodicity — e.g. recurring cycles; the paper says “periodicity and time-related patterns” without specifying the period) plus a temporal embedding (timeF). These embeddings are concatenated with the raw input before the encoder, so the model sees both the values and when they occurred.

  3. Linear projections into attention space. Inside the model the windowed input is multiplied by trainable weight matrices to form Query/Key/Value matrices (Q = X_t W_q, K = X_t W_k, V = X_t W_v), each of shape L × d_k.

  4. Normalization-ish steps. Attention scores are scaled by √d_k (a numerical-stability divisor), and the final layer uses SoftMax (turns raw scores into a probability-like distribution) while hidden layers use ReLU (passes only positive values, adds non-linearity). The paper does not describe a separate min-max/z-score normalization of the raw workload values; if it happened it is not stated, so treat input normalization as unspecified.

In short: the workload is modeled as a short (10-step), minute-resolution, mostly-request-rate time series, dressed with learned time embeddings, and fed straight into an attention model — no decomposition or spectral tricks.
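The windowing and time-encoding steps above can be sketched in plain Python. The window builder pairs each 10-step lookback with the next value (a one-step target). The `time2vec` function follows the published Time2Vec form: one linear component plus sinusoidal components. In a trained model the frequencies and phases are learned, so the fixed values used here are placeholders, and both function names are hypothetical.

```python
import math


def sliding_windows(series, lookback=10):
    """Return (window, next_value) pairs for one-step-ahead forecasting."""
    return [(series[i:i + lookback], series[i + lookback])
            for i in range(len(series) - lookback)]


def time2vec(tau, omegas, phis):
    """Time2Vec: first component is linear in tau, the rest are sin(omega * tau + phi)."""
    linear = omegas[0] * tau + phis[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omegas[1:], phis[1:])]
    return [linear] + periodic


if __name__ == "__main__":
    per_minute = list(range(15))  # stand-in for a request-rate series
    window, target = sliding_windows(per_minute)[0]
    print(window, target)
    # Placeholder parameters; a real model would learn these.
    print(time2vec(3.0, omegas=[1.0, 2 * math.pi / 60], phis=[0.0, 0.0]))
```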

(B) Step-by-step prediction pipeline (raw signal -> prediction result)

  1. Collect / preprocess the raw signal. Pull live and historical metrics through Kubernetes-native monitoring — the Metrics Server and cAdvisor (container metrics agent) — exposed via an API and written into .yaml config. The Monitor stage gathers the resource module (CPU, memory) and the user module (request size, pending/dropped requests, arrival rate). Historical traces come from NASA-HTTP and Google Cluster. Granularity: per-minute after bucketing.

  2. Queue and organize. Incoming workload data passes through a central queue that holds and forwards it for scheduling and forecasting (Step 2 of the paper’s block diagram). This is plumbing, not modeling — it just serializes the stream for the next stage.

  3. Build the input window. Form the input sequence X_t from the last L = 10 minutes of metrics, shaped L × d_model, using ~3 features per step. Why: the Informer needs a fixed-length recent context to attend over; 10 steps is chosen to capture recent trend cheaply.

  4. Encode time. Concatenate Time2Vec + temporal embeddings onto the window so the model knows the periodic/time structure, not just the raw magnitudes. Why: web traffic has periodic rhythms (the paper’s stated motivation for Time2Vec) that plain values do not expose.

  5. Encoder with ProbSparse attention + distilling. The sequence enters the Informer encoder (the paper configures 2 encoder layers). It uses ProbSparse self-attention — instead of letting every minute attend to every other minute (full attention, cost grows with the square of the window), it keeps only the most informative queries, cutting cost from O(n²) to O(n log n). A distilling step then compresses the encoder’s representation to cut redundancy and memory. Why: this is the whole point of Informer — handle long sequences cheaply and keep only the signal that matters.

  6. Decoder consumes a distilled encoder summary. The decoder (the paper configures 1 decoder layer) receives a distilled (compressed) version of the encoder output rather than the full previous target sequence. Why: this lets Informer emit the forecast in essentially one shot (multi-step capable) instead of slow step-by-step autoregression, which is what makes it fast enough for real-time autoscaling.

  7. Produce the prediction. The model outputs N(t+1) = f(X_t), the predicted workload (request demand) for the next time interval. It is a point forecast (a single number per step, not a probability distribution or confidence interval), in units of workload / request volume for the next minute. The configured output length is 1 (single-step, next minute): the paper repeatedly notes the architecture can do multi-step / long-range horizons, but the experiments use a one-step output. So the headline “long-sequence” strength refers to the architecture’s ability to handle long input contexts efficiently; in the reported configuration the lookback is only 10 steps and the output is one step ahead.

  8. (Prediction result reached.) Everything after this is consumption, not prediction: the forecast N(t+1) is divided by per-pod capacity to get a target pod count (P_desired = D̂(t+Δt) / C_pod, or Pod Count = N(t+1) / Pod Resource Allocation), clamped to [P_min, P_max], and turned into a scale-out/scale-in action by the Adaptation Manager Service (with the 60% RRS gradual-removal rule), executed against Kubernetes via YAML/Python on the 60-second CDT cadence. The forecast’s quality is judged by MSE / RMSE / MAE against actuals (NASA-HTTP: MSE 9.185, RMSE 3.03, MAE 2.42), and provisioning by ΘU / ΘO.
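The ProbSparse idea in step 5 can be illustrated with a small pure-Python sketch. It scores each query by a sparsity measure (max score minus mean score), gives the top-u queries ordinary softmax attention, and lets the remaining queries fall back to the mean of V. This is a simplification for readability, not the paper's code: it computes every score, so it shows the selection logic but not the speedup, whereas the original Informer estimates the measure from a sample of keys. All names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def probsparse_attention(Q, K, V, u):
    """Q, K, V are lists of row vectors; only the top-u queries get softmax attention."""
    scale = math.sqrt(len(K[0]))
    scores = [[sum(a * b for a, b in zip(q, k)) / scale for k in K] for q in Q]
    # Sparsity measure per query: max score minus mean score.
    sparsity = [max(row) - sum(row) / len(row) for row in scores]
    ranked = sorted(range(len(Q)), key=lambda i: sparsity[i], reverse=True)
    active = set(ranked[:u])
    mean_v = [sum(col) / len(V) for col in zip(*V)]
    output = []
    for i, row in enumerate(scores):
        if i in active:
            weights = softmax(row)
            output.append([sum(w * v[c] for w, v in zip(weights, V))
                           for c in range(len(V[0]))])
        else:
            output.append(list(mean_v))  # inactive query: mean of the values
    return output


if __name__ == "__main__":
    Q = [[1.0, 0.0], [0.0, 0.0]]
    K = [[1.0, 0.0], [0.0, 1.0]]
    V = [[1.0, 2.0], [3.0, 4.0]]
    print(probsparse_attention(Q, K, V, u=1))
```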

Concrete numbers in one place: lookback window 10 steps; sampling/decision interval 60 s (1 min); output horizon 1 step; 2 encoder + 1 decoder layers; 4 attention heads; ProbSparse attention; result horizon shown over 0–200 min; datasets 3.46M NASA-HTTP requests and ~41 GB / 29-day / ~12,500-machine Google Cluster trace; 10/20/50 epochs; Adam @ lr 0.001.
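The three forecast-error metrics quoted for NASA-HTTP are standard and can be computed as below. The sample series is invented for illustration, and the function name is hypothetical.

```python
import math


def forecast_errors(actual, predicted):
    """Return MSE, RMSE and MAE between two equal-length sequences."""
    errors = [a - p for a, p in zip(actual, predicted)]
    mse = sum(e * e for e in errors) / len(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae}


if __name__ == "__main__":
    # Invented values, not results from the paper.
    print(forecast_errors([10, 12, 14], [11, 12, 16]))
```

Because RMSE is the square root of MSE, the reported pair is internally consistent: the square root of 9.185 rounds to 3.03.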

Caveats the paper carries about its own configuration (flagged, not smoothed over):


Model Parameters & How They Were Chosen

This section reports only the concrete hyperparameters the paper states and the method used to choose them. The paper’s authoritative source for the numeric configuration is Table 3 (“Hyperparameter configuration used in experiments”); where the narrative prose (Sects. 3.3 and 5.1) gives a different number for the same parameter, both values are shown and the conflict is flagged. Table 3, as the explicit configuration table, is treated as the configured value.

(A) What are the model parameters?

Architecture (Informer forecaster). These are the structural settings of the network (how deep, how wide, how the attention works).

| Parameter | Value | Source / note |
|---|---|---|
| Encoder layers | 2 | Table 3 (“Layers: 2/1 Informer Encoder/Decoder”) and prose agree |
| Decoder layers | 1 | Table 3 and prose agree |
| Attention heads (parallel attention sub-units) | 4 | Table 3 (“Multi-head attention, Transformer/Informer”) |
| d_model (embedding/model dimension, the width of each token’s vector) | 64 (Table 3) vs. 512 (prose, Sects. 3.3/5.1) | Internal conflict in the paper; Table 3 = 64 is the explicit config. The justification paragraph also says “embedding size (d_model) of 64” |
| Feed-forward / linear transformation dimension | 2048 (both Table 3 “Dimension 2048” and prose) | The hidden width of the position-wise feed-forward block. Note this 2048 sits beside a d_model of 64 in Table 3, an unusually large ratio that the paper does not comment on |
| Dropout (fraction of activations randomly zeroed for regularization) | 0.1 (Table 3, justification text) vs. 0.2 (prose Sect. 5.1) | Internal conflict; Table 3 = 0.1 |
| Attention type | ProbSparse self-attention (attends only to the most informative queries, cost O(n log n) instead of O(n²)) | Table 3 (“Informer-specific attention”); the ProbSparse sparsity/sampling factor c is not reported |
| Time encoding | Time2Vec + temporal embedding (timeF), concatenated with the raw input | Sect. 4.1; the Time2Vec output dimension and the assumed period are not reported |
| Distillation | Encoder uses a distilling step; decoder consumes a distilled encoder summary | Sect. 3.2/4.1; no distilling-specific hyperparameter (e.g. pooling stride) is reported |
| Activations | ReLU in hidden layers, SoftMax at the output layer | Table 3 and Sect. 4.1 |
| Input size (features per step) | 3 (Table 3 “Features per input”) vs. 2 (“two features”) and “three neural cells and two features” (prose Sect. 5.1) | Internal conflict; the predicted signal is effectively the single request-rate series regardless |
| Number of output classes | not reported (the task is point regression of next-interval workload; SoftMax-over-classes is described generically and no class count is given) | Sect. 4.1 |
| Total parameter count / model size | not reported | The paper gives no parameter count, FLOPs, or model-size figure for any model |

Model-specific structural parameters of the baselines (reported for completeness, since they share the experimental table):

| Baseline | Reported settings |
|---|---|
| ARIMA | order (p, d, q) = (2, 1, 2) |
| LSTM | hidden size 50, 1 layer |
| Bi-LSTM | hidden size 50, bidirectional = True |
| Transformer | 2 layers, d_model 64, 4 heads (shares the Transformer/Informer columns of Table 3) |

There is no diffusion, GAN, RevIN, or explicit FFT/frequency module in this model; the only non-standard structural pieces are ProbSparse attention, Time2Vec, and encoder distillation. The hidden size of 50 applies to the LSTM-family baselines, not to the Informer (whose width is d_model).

Training. How the model’s weights are fitted.

| Parameter | Value | Source / note |
|---|---|---|
| Optimizer | Adam | Table 3, justification text |
| Learning rate | 0.001 (fixed) | Table 3; no schedule (warmup/decay) is reported |
| Batch size | 16 (Table 3, “Fixed for all DL models” and justification text) vs. 64 (prose Sect. 5.1) | Internal conflict; Table 3 = 16 |
| Epochs | 10, 20, 50 (all three run and compared) | Table 3 (“Tested for performance”) |
| Loss function(s) | MSE, MAE, RMSE (also used as evaluation metrics) | Table 3; no loss weighting is reported (the three are reported side by side, not summed into a weighted objective) |
| Early-stopping rule | not reported | No patience/criterion is described; the epoch counts are swept rather than early-stopped |
| Weight decay | not reported | Not mentioned for Adam |
| Hardware | 2 vCPU, 8 GB RAM, integrated GPU (Intel UHD Graphics 770); host CPU Intel at 2.7 GHz (3.7 GHz turbo) or 2.6 GHz | Sect. 5.1; no dedicated training GPU/TPU |
| Software stack | TensorFlow 2.11, PyTorch 1.12; Kubernetes Python client, Prometheus API client, NumPy, Pandas, APScheduler; served via FastAPI; kubectl v1.28.2, Docker 25.0.1, on WSL/Hyper-V | Sect. 5.1 |

Data / windowing.

| Parameter | Value | Source / note |
|---|---|---|
| Lookback window (sequence length) | 10 time steps | Table 3 (“Input window size”) |
| Forecast horizon (output length) | 1 step (single-step; “can be extended for longer horizons”) | Table 3 (“Forecast steps”) |
| Sampling / decision interval | 60 s (1 minute); logs bucketed into one-minute intervals; scaling recomputed every 60 s (the Container Deployment Target, CDT) | Sects. 4.2, 5.2 |
| Train/test split (NASA-HTTP) | Train: Jul 1, 1995 00:00:00 to Aug 14, 1995 23:59:59; Test: the remaining 16 days (to Aug 31). A separate validation split is not reported (the prose mentions “training and validation” but does not give validation dates or fractions) | Sect. 5.2 |
| Train/test split (Google Cluster) | not reported as explicit dates/fractions; the full 29-day, ~41 GB 2011-2 trace is used to assess accuracy offline | Sect. 5.2 |

Autoscaling-control parameters (not network weights, but part of the configured system):

| Parameter | Value | Source |
|---|---|---|
| RRS (Resource Removal Strategy fraction) | 0.60 (remove 60% of surplus pods per step) | Sect. 4.2 |
| CDT cadence | 60 s | Sect. 4.2 |
| Pod bounds P_min / P_max | referenced as system-defined constraints (Eq. 2); specific numeric values not reported | Sect. 3.4 |
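For anyone reproducing the setup, the Table 3 values listed above can be gathered into one configuration object. The sketch below is only a convenient container. The class and field names are hypothetical. Where the prose and Table 3 disagree, it uses the Table 3 value. Parameters the paper does not report (P_min, P_max, the ProbSparse factor, Time2Vec settings) are deliberately left out rather than guessed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformerAutoScaleConfig:
    # Architecture (Table 3 values)
    encoder_layers: int = 2
    decoder_layers: int = 1
    attention_heads: int = 4
    d_model: int = 64            # prose elsewhere says 512
    feed_forward_dim: int = 2048
    dropout: float = 0.1         # prose elsewhere says 0.2
    input_features: int = 3      # prose elsewhere says 2
    # Windowing
    lookback_steps: int = 10
    forecast_steps: int = 1
    # Training
    optimizer: str = "adam"
    learning_rate: float = 0.001
    batch_size: int = 16         # prose elsewhere says 64
    epoch_settings: tuple = (10, 20, 50)
    # Autoscaling control
    rrs_fraction: float = 0.60
    cdt_seconds: int = 60


if __name__ == "__main__":
    cfg = InformerAutoScaleConfig()
    # Multi-head attention needs d_model to split evenly across heads.
    print(cfg.d_model // cfg.attention_heads, cfg.feed_forward_dim // cfg.d_model)
```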

(B) How were the parameters chosen?

The paper devotes one paragraph (Sect. 5.1, immediately before Table 3) to justifying the configuration, and the dominant method is adoption from prior work / common forecasting defaults rather than a fresh search on this problem. Table 3’s caption explicitly cites references [5, 8, 16, 22, 34] as the provenance of “several key hyperparameters,” i.e. the values are carried over from the authors’ own earlier autoscaling papers and related forecasting work, not tuned de novo here.

Per-parameter selection method, with the paper’s stated rationale quoted where given:

Summary of method. Aside from the ARIMA order (AIC/BIC selection) and the epoch sweep (10/20/50, a coarse sensitivity comparison), no grid search, random search, or cross-validation over the Informer’s hyperparameters is reported. The Informer configuration is manual / default-driven, taken largely from prior work (refs [5, 8, 16, 22, 34]), with each value accompanied by a qualitative rationale rather than an empirical search over ranges. The paper does not report search ranges for any deep-learning hyperparameter, and is silent on how P_min/P_max and the Time2Vec/ProbSparse internal settings were chosen.


Inputs (what it consumes)

So in MAPE terms, the raw fuel is multivariate monitoring data, with request rate as the main predicted signal.


Outputs (what it produces)

In the real run, the output that matters is the live pod count: it stayed in a tight, efficient 0–14 pod band versus the reactive method’s chaotic 0–150 band.


How It Fits the Autoscaling Framework (MAPE-K)

InformerAutoScale is an end-to-end proactive autoscaler mapped cleanly onto MAPE — it does not stop at prediction, it also actuates:

Reactive or proactive? Firmly proactive — it scales on forecasts, not on current thresholds. (The paper explicitly benchmarks against a reactive baseline.)

Horizontal or vertical? Horizontal: it changes the number of pods. It does not do vertical scaling (the paper mentions integrating a Vertical Pod Autoscaler only as a possible improvement, when discussing related work).

Does it actuate, or just predict? Unlike papers that only output a forecast and hand off to a stock HPA, InformerAutoScale drives the actuation itself through its own Adaptation Manager Service and Pod Requirement Service rather than the default Kubernetes HPA — it runs the whole loop and writes the scaling actions to Kubernetes via YAML and Python scripts.
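The summary does not say which API call the paper's Python scripts use to apply a scaling decision. One plausible route, assumed here, is the official Kubernetes Python client's `AppsV1Api.patch_namespaced_deployment_scale`. The sketch takes the API object as a parameter so it can run without a cluster. The `RecordingApi` stand-in, the deployment name `web-app`, and the function name are all hypothetical.

```python
def scale_deployment(apps_api, name: str, namespace: str, replicas: int):
    """Patch a Deployment's scale subresource to the requested replica count."""
    if replicas < 0:
        raise ValueError("replicas must be non-negative")
    body = {"spec": {"replicas": replicas}}
    return apps_api.patch_namespaced_deployment_scale(
        name=name, namespace=namespace, body=body)


class RecordingApi:
    """Stand-in for kubernetes.client.AppsV1Api; records calls instead of sending them."""

    def __init__(self):
        self.calls = []

    def patch_namespaced_deployment_scale(self, name, namespace, body):
        self.calls.append((name, namespace, body))
        return body


if __name__ == "__main__":
    # Against a real cluster one would instead build the API object with:
    #   from kubernetes import client, config
    #   config.load_kube_config()
    #   api = client.AppsV1Api()
    api = RecordingApi()
    scale_deployment(api, "web-app", "default", 5)
    print(api.calls)
```

Passing the API object in keeps the scaling logic testable offline and makes it straightforward to swap in the real client.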


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch — no pretrained or foundation model.

The Informer forecaster is trained from random initialization on cloud workload traces (NASA-HTTP and Google Cluster 2011) as ordinary supervised forecasting, and the deep-learning baselines (LSTM, Bi-LSTM, vanilla Transformer) are trained the same way under identical conditions. There is no pretraining, transfer learning, foundation model, or zero-shot component anywhere in the pipeline.

Training setup, for the record (numbers taken from Table 3, the explicit hyperparameter table; where the narrative prose disagrees, the conflict is flagged — see Model Parameters):


Strengths

Limitations


Glossary

References
  1. Kumar, B., Verma, A., Verma, P., & Bennour, A. (2025). Optimizing resource allocation in cloud-native applications through proactive autoscaling with the InformerAutoScale model. The Journal of Supercomputing, 81(9). https://doi.org/10.1007/s11227-025-07500-7