InformerAutoScale
Proactive Kubernetes pod autoscaling driven by an Informer forecaster and a custom scaling manager
TL;DR
Imagine an online store that runs in the cloud. When lots of users show up, it needs more “workers” (copies of the app) to keep things fast; when traffic drops, it should release workers so it stops paying for idle capacity. Most systems do this reactively — they wait until the servers are already overloaded and only then start adding workers, which is slow because spinning up a new worker takes time. This paper builds a proactive system: it uses a modern AI time-series forecasting model called Informer (an efficient cousin of the Transformer) to predict the traffic a minute ahead and add or remove workers before the spike hits. In Kubernetes, those workers are called pods. The headline result: to handle the same web traffic, the old reactive method needed up to 150 pods, while InformerAutoScale needed only 14 — a claimed 90.66% improvement in scaling efficiency, meaning far less wasted money while still keeping the app responsive.
The Problem (and why simple autoscaling isn’t enough)
Cloud applications face constantly changing demand. You want to allocate just enough computing resources:
Under-provisioning (too few resources) -> slow responses, dropped requests, broken service-level promises (SLAs/SLOs). The paper writes this as ΘU.
Over-provisioning (too many resources) -> you are paying for machines doing nothing. The paper writes this as ΘO.
The default tool in Kubernetes is the Horizontal Pod Autoscaler (HPA), which is reactive: it watches a metric like CPU usage and adds pods once it crosses a threshold (e.g. “CPU > 70%, add a pod”). Two problems:
It always lags. By the time CPU is high, users are already experiencing slowness.
Booting a pod isn’t instant (the “cold-start” delay). So even after the autoscaler decides to act, there’s a gap before the new pod can take traffic.
In the paper’s real experiment, this reactive lag caused wildly unstable behavior — pod counts swinging between 0 and 150 — because the system kept over-reacting to spikes.
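For contrast with the proactive approach, the reactive rule can be sketched in a few lines. The formula and the 10% default tolerance come from the Kubernetes HPA documentation, not from the paper; the function and parameter names are illustrative.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """Reactive HPA rule: scale in proportion to how far the observed
    metric is from its target, and only after the metric has moved."""
    ratio = current_metric / target_metric
    # Inside the tolerance band the autoscaler leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# CPU at 90% against a 70% target with 4 pods asks for 6 pods,
# but only once the load is already there.
print(hpa_desired_replicas(4, 90.0, 70.0))
```

The rule has no notion of the future: the replica count changes only after `current_metric` has drifted, which is the lag the two points above describe.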
The fix is proactive (predictive) autoscaling: forecast the near-future workload and scale ahead of time, hiding the cold-start delay. To forecast well you need a good time-series model. Prior work tried LSTMs, Bi-LSTMs, and standard Transformers, but those struggle with long sequences (LSTMs forget; standard Transformers get slow and memory-hungry because their attention cost grows with the square of the sequence length). This paper’s bet: the Informer model fixes exactly those weaknesses.
Background
A few terms, defined once:
Container / pod. A container packages an app plus everything it needs to run. In Kubernetes, the smallest deployable unit is a pod (a wrapper around one or more containers). “Scaling out” = run more pods; “scaling in” = remove pods. Docker is the tool that builds and runs containers.
Horizontal vs. vertical scaling. Horizontal = add/remove whole pods (what this paper does). Vertical = give an existing pod more CPU/RAM. This paper is horizontal only.
Reactive vs. proactive. Reactive = respond after load crosses a threshold. Proactive = forecast future load and act in advance. This paper is proactive.
Transformer. A neural network built on an “attention” mechanism that learns which past time steps matter for predicting the future. Originally for language; here it’s used as a time-series forecaster of workload.
Informer. A Transformer variant designed for long-sequence forecasting. Its key tricks (explained below) make it faster and lighter than a vanilla Transformer while predicting many steps ahead instead of just one.
MAPE(-K) loop. The standard blueprint for self-managing systems: Monitor -> Analyze -> Plan -> Execute over shared Knowledge. The forecasting model lives in the Analyze stage.
Why Informer over a plain Transformer?
The paper’s Table 2 contrasts them. In plain terms:
| Issue | Plain Transformer | Informer |
|---|---|---|
| Attention cost | Full attention, O(n²) — slow & memory-heavy on long inputs | ProbSparse attention, O(n log n) — only attends to the most informative time steps |
| How far it predicts | One step at a time | Many steps at once (multi-step / long-range) |
| Encoding of time | Generic positional encoding | Time2Vec + temporal embedding — better at periodic/daily patterns |
| Decoder input | Needs the full previous target sequence | Uses a distilled (compressed) summary from the encoder |
| Memory use | High | Low |
The “many steps ahead” property is the one that matters for autoscaling: you want to look into the future, not just one tick ahead.
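The ProbSparse row of the table can be made concrete with a simplified single-head sketch in NumPy. It follows the published Informer idea (score each query by "max minus mean" of its attention scores, keep the top `c · ln L` queries, and give the remaining queries the mean of V). This is not the paper's code: the function names and the default `c` are assumptions, and for readability it computes all scores up front, whereas the real Informer samples keys to avoid that quadratic step.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def probsparse_attention(Q, K, V, c=5.0):
    """Simplified ProbSparse self-attention for one head.

    Q, K have shape (L, d_k); V has shape (L, d_v).
    Returns the attention output and the indices of the active queries.
    """
    L, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)                      # (L, L)
    # Sparsity measure: a query whose scores are far from uniform
    # (large max relative to mean) is considered informative.
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    u = min(L, max(1, int(np.ceil(c * np.log(L)))))      # number of active queries
    top = np.argsort(sparsity)[-u:]
    # "Lazy" queries fall back to the mean of V.
    out = np.tile(V.mean(axis=0), (L, 1))
    # Active queries get ordinary scaled dot-product attention.
    out[top] = softmax(scores[top], axis=-1) @ V
    return out, top


rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(64, 8)) for _ in range(3))
out, active = probsparse_attention(Q, K, V, c=1.0)
print(out.shape, len(active))
```

With a lookback of only 10 steps and a typical `c`, every query would be active, so the saving only appears on longer windows.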
Contribution in Simple Terms
The genuinely new thing is bringing the Informer model into a working Kubernetes autoscaler and wiring it into a MAPE control loop, then proving on real workloads that it scales with far fewer pods than reactive HPA and beats other predictors (ARIMA, LSTM, Bi-LSTM, vanilla Transformer).
Concretely the paper contributes:
InformerAutoScale — an Informer-based proactive autoscaler that forecasts the next interval’s workload and pre-computes how many pods are needed.
A real implementation on Docker Desktop + Kubernetes (not just a simulation), driven by two small control algorithms:
an Adaptation Manager Service that converts a workload forecast into a scale-out / scale-in decision, and
a Pod Requirement Management Service that recomputes the needed pod count every 60 seconds from the observed request rate.
A Resource Removal Strategy (RRS) that removes surplus pods gradually (60% at a time) rather than all at once, to avoid oscillation and keep a safety reserve against sudden spikes.
An evaluation showing the headline 90.66% efficiency gain (150 -> 14 max pods) plus better forecasting error and provisioning accuracy than four baseline models.
Think of it as the difference between a shop manager who only calls in extra staff after a queue forms (reactive) versus one who looks at last week’s pattern and rosters extra staff the night before a big sale (proactive InformerAutoScale).
How It Works, Step by Step
The system runs a continuous loop. Here is the full pipeline from raw logs to actual pods being created or removed (following the paper’s Figs. 3 and 4):
Collect input data (Monitor). Pull historical and live metrics (CPU usage, memory, and request patterns) from Kubernetes-native tools (the Metrics Server and cAdvisor), exposed via an API and written into `.yaml` config files. Two sub-streams are gathered: a resource module (CPU, memory, etc. from worker nodes) and a user module (request size, pending/dropped requests, request arrival rate).
Queue and pre-process. Incoming workload data goes through a central queue that organizes and forwards it for scheduling and forecasting.
Forecast with the Informer (Analyze). The cleaned time series is fed into the Informer encoder–decoder:
Time encoding: raw inputs are augmented with Time2Vec + temporal embeddings to capture periodic/daily rhythms, then concatenated with the input.
ProbSparse self-attention: the input `X_t` is projected into Query/Key/Value matrices (`Q = X_t W_q`, `K = X_t W_k`, `V = X_t W_v`); attention is computed as `Softmax(QKᵀ/√d_k)·V`, but only for the most informative queries, cutting cost from O(n²) to O(n log n).
Distilled decoder input: the encoder’s output is compressed and handed to the decoder, which produces the forecast `N(t+1) = f(X_t)`, the predicted workload/resource demand for the next interval. (ReLU adds non-linearity in hidden layers; SoftMax normalizes the final output.)
Evaluate provisioning (still Analyze). Compare predicted vs. actual to compute forecasting error (MSE, RMSE, MAE) and the provisioning metrics ΘU (under) and ΘO (over).
Decision engine. Combine the forecast with provisioning insights, scheduling logic, and demand/supply observers to finalize a scaling action.
Compute required pods (Plan). Turn the forecast into a pod count:
`P_desired = D̂(t+Δt) / C_pod`: forecasted demand divided by the capacity of one pod.
Equivalently `Pod Count = N(t+1) / (Pod Resource Allocation)`.
Clamp to limits: `P_new = max(P_min, min(P_desired, P_max))`.
Adaptation Manager Service (the scaling brain). For each deployment target, compare predicted pods `pods_{t+1}` with current pods `pods_t`:
If more are needed -> scale out (add pods).
If fewer are needed -> scale in, but only gradually using the Resource Removal Strategy (RRS): `pods_surplus = (pods_t − pods_{t+1}) × RRS`, with RRS = 0.60 (remove 60% of the surplus, keep the rest as a buffer). A floor `pods_min` guarantees baseline availability.
Pod Requirement Management Service (every 60 s). On a fixed Container Deployment Target (CDT) cadence of 60 seconds, recompute pods from the live request rate: `pods_{t+1} = Requests_{t+1} / requestsPerPod`, and issue the scaling command. This 60 s delay also damps oscillation and helps absorb cold-start latency.
Execute. The resource manager applies the decision: pods are added/removed in real time via Docker containers, Kubernetes, and Python scripts, with `.yaml` definitions. Changes are visible live in the Docker Desktop dashboard.
Loop / feedback. The system continuously compares desired vs. current resources: `ΔR(t) = R_desired(t) − R_current(t)`, and scales up if ΔR > 0, down if ΔR < 0, and does nothing if ΔR = 0, so it only acts when genuinely needed, avoiding flapping.
Non-transformer pieces worth noting: there is no diffusion/GAN/FFT here. The “extra” machinery beyond the Informer is classical control logic: the queue/scheduler, the two pod-sizing formulas, the RRS gradual-removal heuristic, the 60-second CDT timer, and the MAPE feedback loop.
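The control logic in the Plan and Adaptation Manager steps can be sketched as two small functions: one for the pod-sizing formula with the min/max clamp, one for the RRS gradual scale-in. This is a minimal sketch, not the paper's code. The function names are hypothetical, and rounding up (both the pod count and the number of pods removed) is an assumption, since the summary above does not say how fractions are handled.

```python
import math


def desired_pods(forecast_requests: float, requests_per_pod: float,
                 p_min: int, p_max: int) -> int:
    """P_desired = forecast / per-pod capacity, clamped to [p_min, p_max]."""
    raw = math.ceil(forecast_requests / requests_per_pod)  # assumption: round up
    return max(p_min, min(raw, p_max))


def next_pod_count(current: int, desired: int,
                   rrs: float = 0.60, p_min: int = 1) -> int:
    """Scale out immediately; scale in by only a fraction (RRS) of the surplus."""
    if desired >= current:
        return desired                      # scale out, or no-op when equal
    surplus = current - desired
    removed = math.ceil(surplus * rrs)      # assumption: round the removal up
    return max(p_min, current - removed)


# A drop from 14 pods to a desired 4 is spread over several 60 s cycles.
pods = 14
for _ in range(3):
    pods = next_pod_count(pods, desired=4)
    print(pods)
```

Rounding the removal up means the surplus eventually drains to the desired count; rounding down would instead keep a permanent buffer whenever the surplus is one pod.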
Workload Modeling & Prediction Pipeline (full-text deep read)
This section zooms in on two things only: (A) what the paper actually treats as “the workload” and how it turns that into numbers a model can eat, and (B) the exact path from raw server logs to a prediction. Where the paper is vague or self-contradictory, that is called out rather than smoothed over.
(A) How the workload is modeled and characterized
What real-world quantity is “the workload”? The primary signal is the request arrival rate, i.e. how many web requests hit the application per unit time. Everything downstream keys off this: the core forecast is described as predicting “the workload” Workload(t+1) / N(t+1), and the pod math literally divides a request count by a per-pod capacity (`pods_{t+1} = Requests_{t+1} / requestsPerPod`). Alongside the request rate, the paper also monitors CPU usage and memory utilization (and mentions bandwidth and network throughput as resources of interest). So the thing being managed is resource demand, but the thing being counted and predicted in the concrete experiments is the request volume per minute.
One signal or many (univariate vs multivariate)? It is set up as multivariate in principle but thin in practice. The Monitor stage explicitly splits into a resource module (CPU, memory, other node-level metrics) and a user module (request size, pending requests, dropped requests, request arrival rate) — that is many signals. But the model’s configured input size is just 3 features (Table 3 says “Input size 3 / Features per input”; elsewhere the prose says “two features” and “an input size of three neural cells and two features,” which is internally inconsistent — see the conflict note below). So the design is multivariate, the runtime configuration is a handful of features, and the quantity being forecast and acted on is effectively the single request-rate series.
Time granularity. Sampling and prediction happen on a one-minute grid. Concretely:
The NASA-HTTP logs are grouped into one-minute buckets by timestamp before modeling.
Scaling decisions are recomputed on a fixed 60-second cadence called the Container Deployment Target (CDT) — “captures data every minute.”
A “time step” in the results = “a fixed interval (in minutes)”; result plots run over a 0–200 minute window with 20 plotted points.
So one time step = one minute, and the model reasons in minutes, not seconds or hours.
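The one-minute bucketing can be sketched with the standard library alone. The sketch assumes log lines in Common Log Format with a bracketed timestamp such as `[01/Jul/1995:00:00:01 -0400]`; the sample lines and host names are made up for illustration, and the paper's own preprocessing script is not shown in the source.

```python
import re
from collections import Counter
from datetime import datetime

# Timestamp is the bracketed field of a Common Log Format line.
LOG_TIME = re.compile(r"\[([^\]]+)\]")


def requests_per_minute(lines):
    """Count requests per one-minute bucket; returns sorted (minute, count) pairs."""
    counts = Counter()
    for line in lines:
        match = LOG_TIME.search(line)
        if not match:
            continue  # skip malformed lines
        ts = datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")
        counts[ts.replace(second=0)] += 1   # truncate to the minute
    return sorted(counts.items())


sample = [
    'host1.example - - [01/Jul/1995:00:00:01 -0400] "GET /images/ksclogo-medium.gif HTTP/1.0" 200 5866',
    'host2.example - - [01/Jul/1995:00:00:42 -0400] "GET /index.html HTTP/1.0" 200 1024',
    'host1.example - - [01/Jul/1995:00:01:05 -0400] "GET /index.html HTTP/1.0" 304 0',
]
buckets = requests_per_minute(sample)
print([count for _, count in buckets])
```

Minutes with no requests simply do not appear in the result, so a real pipeline would need to decide whether to fill such gaps with zeros before windowing.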
What the raw data looks like, and what makes it hard.
NASA-HTTP (the dataset used for the real deployment): raw HTTP access logs from NASA’s Kennedy Space Center web server, 3,461,612 requests spanning Jul 1 to Aug 31, 1995. Each log line is a request: method+path (e.g. `GET /images/ksclogo-medium.gif HTTP/1.0`), hostname/IP (source address), HTTP status code, timestamp, and response size in bytes. The paper indexes these as “source addresses” numbered roughly 0 to 140,000 for its scaling demonstration.
Google Cluster Traces 2011-2 (used for offline accuracy assessment): 29 days of trace from a ~12,500-machine Borg cell, ~41 GB. From it the authors pull scheduling class, priority, type, scheduler, collection type/ID, vertical-scaling indicators, and start-after collection IDs.
What makes it hard: the paper repeatedly stresses long sequences (the whole reason for choosing Informer) and burstiness / traffic surges / CPU spikes that reactive systems over-react to. It also flags data gaps in NASA-HTTP (missing Jul 28–31, and a chunk of Aug 1–3) — i.e. a non-uniform real-world series with holes. The web traffic is bursty and spiky; that unpredictability is exactly what they say makes accurate monitoring/forecasting difficult.
How it is represented numerically for the model. This paper is light on heavy signal processing — there is no explicit trend/seasonal/residual decomposition, no Fourier/frequency transform, no patching. The representation is:
Windowing. Raw per-minute metrics are turned into an input sequence `X_t` of shape L × d_model, where the lookback window length L = 10 time steps (Table 3: “Sequence length 10 / Input window size”). That is a deliberately short context (“captures recent trends”).
Time encoding / embedding. Rather than raw clock values, timestamps are turned into learned vectors using Time2Vec (a learnable encoding of time that captures periodicity, e.g. recurring cycles; the paper says “periodicity and time-related patterns” without specifying the period) plus a temporal embedding (timeF). These embeddings are concatenated with the raw input before the encoder, so the model sees both the values and when they occurred.
Linear projections into attention space. Inside the model the windowed input is multiplied by trainable weight matrices to form Query/Key/Value matrices (`Q = X_t W_q`, `K = X_t W_k`, `V = X_t W_v`), each of shape L × d_k.
Normalization-ish steps. Attention scores are scaled by `√d_k` (a numerical-stability divisor), and the final layer uses SoftMax (turns raw scores into a probability-like distribution) while hidden layers use ReLU (passes only positive values, adds non-linearity). The paper does not describe a separate min-max/z-score normalization of the raw workload values; if it happened it is not stated, so treat input normalization as unspecified.
In short: the workload is modeled as a short (10-step), minute-resolution, mostly-request-rate time series, dressed with learned time embeddings, and fed straight into an attention model — no decomposition or spectral tricks.
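The windowing step can be sketched as a plain sliding window over the per-minute series, using the Table 3 values (lookback 10, horizon 1) as defaults. The function name is hypothetical and the sketch is univariate for brevity, whereas the configured model takes about 3 features per step.

```python
def make_windows(series, lookback=10, horizon=1):
    """Slice a series into (input window, target) pairs.

    Each input holds `lookback` consecutive values and each target holds
    the `horizon` values that immediately follow it.
    """
    inputs, targets = [], []
    for end in range(lookback, len(series) - horizon + 1):
        inputs.append(series[end - lookback:end])
        targets.append(series[end:end + horizon])
    return inputs, targets


requests = list(range(15))            # stand-in for requests per minute
X, y = make_windows(requests)
print(len(X), X[0], y[0])
```

A 15-minute series yields 5 training pairs, the first mapping minutes 0 to 9 onto minute 10.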
(B) Step-by-step prediction pipeline (raw signal -> prediction result)
Collect / preprocess the raw signal. Pull live and historical metrics through Kubernetes-native monitoring, namely the Metrics Server and cAdvisor (container metrics agent), exposed via an API and written into `.yaml` config. The Monitor stage gathers the resource module (CPU, memory) and the user module (request size, pending/dropped requests, arrival rate). Historical traces come from NASA-HTTP and Google Cluster. Granularity: per-minute after bucketing.
Queue and organize. Incoming workload data passes through a central queue that holds and forwards it for scheduling and forecasting (Step 2 of the paper’s block diagram). This is plumbing, not modeling: it just serializes the stream for the next stage.
Build the input window. Form the input sequence `X_t` from the last L = 10 minutes of metrics, shaped L × d_model, using ~3 features per step. Why: the Informer needs a fixed-length recent context to attend over; 10 steps is chosen to capture recent trend cheaply.
Encode time. Concatenate Time2Vec + temporal embeddings onto the window so the model knows the periodic/time structure, not just the raw magnitudes. Why: web traffic has periodic rhythms (the paper’s stated motivation for Time2Vec) that plain values do not expose.
Encoder with ProbSparse attention + distilling. The sequence enters the Informer encoder (the paper configures 2 encoder layers). It uses ProbSparse self-attention: instead of letting every minute attend to every other minute (full attention, cost grows with the square of the window), it keeps only the most informative queries, cutting cost from O(n²) to O(n log n). A distilling step then compresses the encoder’s representation to cut redundancy and memory. Why: this is the whole point of Informer, which is to handle long sequences cheaply and keep only the signal that matters.
Decoder consumes a distilled encoder summary. The decoder (the paper configures 1 decoder layer) receives a distilled (compressed) version of the encoder output rather than the full previous target sequence. Why: this lets Informer emit the forecast in essentially one shot (multi-step capable) instead of slow step-by-step autoregression, which is what makes it fast enough for real-time autoscaling.
Produce the prediction. The model outputs `N(t+1) = f(X_t)`, the predicted workload (request demand) for the next time interval. It is a point forecast (a single number per step, not a probability distribution or confidence interval), in units of workload / request volume for the next minute. The configured output length = 1 (single-step, next minute); the paper repeatedly notes the architecture can do multi-step / long-range horizons but the experiments use a one-step output. So the headline “long-sequence” strength here refers to a long input context being handled efficiently, while the output used is one step ahead.
(Prediction result reached.) Everything after this is consumption, not prediction: the forecast `N(t+1)` is divided by per-pod capacity to get a target pod count (`P_desired = D̂(t+Δt)/C_pod`, or `Pod Count = N(t+1)/Pod Resource Allocation`), clamped to `[P_min, P_max]`, and turned into a scale-out/scale-in action by the Adaptation Manager Service (with the 60% RRS gradual-removal rule), executed against Kubernetes via YAML/Python on the 60-second CDT cadence. The forecast’s quality is judged by MSE / RMSE / MAE against actuals (NASA-HTTP: MSE 9.185, RMSE 3.03, MAE 2.42), and provisioning by ΘU / ΘO.
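The time-encoding step in the pipeline above relies on Time2Vec, which in its published form maps a scalar time to one linear component plus several sine components. The sketch below shows that mapping only. In a real model `omega` and `phi` are learned during training; the fixed values here, including the daily period of 1440 minutes, are assumptions for illustration, since the paper does not report the Time2Vec dimension or period.

```python
import math


def time2vec(tau, omega, phi):
    """Time2Vec encoding of a scalar time `tau`.

    Component 0 is linear (captures trend); components 1..k are sines
    (capture periodic behaviour). `omega` and `phi` are equal-length
    sequences of frequencies and phase shifts.
    """
    linear = omega[0] * tau + phi[0]
    periodic = [math.sin(w * tau + p) for w, p in zip(omega[1:], phi[1:])]
    return [linear] + periodic


# Assumed daily rhythm: one full sine cycle every 1440 minutes.
daily = 2 * math.pi / 1440
print(time2vec(360, omega=[0.001, daily], phi=[0.0, 0.0]))
```

Minute 360 is a quarter of the way through the assumed daily cycle, so the periodic component sits at its peak.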
Concrete numbers in one place: lookback window 10 steps; sampling/decision interval 60 s (1 min); output horizon 1 step; 2 encoder + 1 decoder layers; 4 attention heads; ProbSparse attention; result horizon shown over 0–200 min; datasets 3.46M NASA-HTTP requests and ~41 GB / 29-day / ~12,500-machine Google Cluster trace; 10/20/50 epochs; Adam @ lr 0.001.
Caveats the paper carries about its own configuration (flagged, not smoothed over):
Model dimension and batch size conflict inside the paper itself. The prose (Sects. 3.3/5.1) says d_model = 512, batch size 64, dropout 0.2; but Table 3 (the explicit hyperparameter table) says d_model = 64, batch size 16, dropout 0.1 (sequence length 10, output length 1, 4 attention heads). Both versions agree on feed-forward / linear-transformation dimension 2048. Because Table 3 is the explicit hyperparameter table, its values are treated as the configured ones (d_model 64, batch 16, dropout 0.1); the 512 / batch-64 / dropout-0.2 figures appear only in narrative text and are most plausibly inherited defaults. This is a genuine inconsistency in the paper, not a transcription slip. Note also that a feed-forward width of 2048 beside a d_model of 64 is an unusually large ratio that the paper never comments on.
Feature count conflict. Table 3 says input size 3 features; the prose says “two features” and, confusingly, “an input size of three neural cells and two features.” The series being predicted is effectively the single request-rate signal regardless.
“Long sequence” wording. The paper sells Informer for “long-sequence forecasting,” but the configured output length is 1 (single-step). The long-sequence benefit as used here is mainly about efficiently ingesting a long input context, not emitting a long multi-step horizon.
Normalization of raw workload values is not described; do not assume z-score/min-max.
Model Parameters & How They Were Chosen
This section reports only the concrete hyperparameters the paper states and the method used to choose them. The paper’s authoritative source for the numeric configuration is Table 3 (“Hyperparameter configuration used in experiments”); where the narrative prose (Sects. 3.3 and 5.1) gives a different number for the same parameter, both values are shown and the conflict is flagged. Because it is the explicit hyperparameter table, Table 3 is treated as the configured value.
(A) What are the model parameters?
Architecture (Informer forecaster). These are the structural settings of the network (how deep, how wide, how the attention works).
| Parameter | Value | Source / note |
|---|---|---|
| Encoder layers | 2 | Table 3 (“Layers: 2/1 Informer Encoder/Decoder”) and prose agree |
| Decoder layers | 1 | Table 3 and prose agree |
| Attention heads (parallel attention sub-units) | 4 | Table 3 (“Multi-head attention, Transformer/Informer”) |
| d_model (embedding/model dimension, the width of each token’s vector) | 64 (Table 3) vs. 512 (prose, Sects. 3.3/5.1) | Internal conflict in the paper; Table 3 = 64 is the explicit config. The justification paragraph also says “embedding size (d_model) of 64” |
| Feed-forward / linear transformation dimension | 2048 (both Table 3 “Dimension 2048” and prose) | The hidden width of the position-wise feed-forward block. Note this 2048 sits beside a d_model of 64 in Table 3, an unusually large ratio that the paper does not comment on |
| Dropout (fraction of activations randomly zeroed for regularization) | 0.1 (Table 3, justification text) vs. 0.2 (prose Sect. 5.1) | Internal conflict; Table 3 = 0.1 |
| Attention type | ProbSparse self-attention (attends only to the most informative queries, cost O(n log n) instead of O(n²)) | Table 3 (“Informer-specific attention”); the ProbSparse sparsity/sampling factor c is not reported |
| Time encoding | Time2Vec + temporal embedding (timeF), concatenated with the raw input | Sect. 4.1; the Time2Vec output dimension and the assumed period are not reported |
| Distillation | Encoder uses a distilling step; decoder consumes a distilled encoder summary | Sect. 3.2/4.1; no distilling-specific hyperparameter (e.g. pooling stride) is reported |
| Activations | ReLU in hidden layers, SoftMax at the output layer | Table 3 and Sect. 4.1 |
| Input size (features per step) | 3 (Table 3 “Features per input”) vs. 2 (“two features”) and “three neural cells and two features” (prose Sect. 5.1) | Internal conflict; the predicted signal is effectively the single request-rate series regardless |
| Number of output classes | not reported (the task is point regression of next-interval workload; SoftMax-over-classes is described generically and no class count is given) | Sect. 4.1 |
| Total parameter count / model size | not reported | The paper gives no parameter count, FLOPs, or model-size figure for any model |
Model-specific structural parameters of the baselines (reported for completeness, since they share the experimental table):
| Baseline | Reported settings |
|---|---|
| ARIMA | order (p, d, q) = (2, 1, 2) |
| LSTM | hidden size 50, 1 layer |
| Bi-LSTM | hidden size 50, bidirectional = True |
| Transformer | 2 layers, d_model 64, 4 heads (shares the Transformer/Informer columns of Table 3) |
There is no diffusion, GAN, RevIN, or explicit FFT/frequency module in this model; the only non-standard structural pieces are ProbSparse attention, Time2Vec, and encoder distillation. The hidden size of 50 applies to the LSTM-family baselines, not to the Informer (whose width is d_model).
Training. How the model’s weights are fitted.
| Parameter | Value | Source / note |
|---|---|---|
| Optimizer | Adam | Table 3, justification text |
| Learning rate | 0.001 (fixed) | Table 3; no schedule (warmup/decay) is reported |
| Batch size | 16 (Table 3, “Fixed for all DL models” and justification text) vs. 64 (prose Sect. 5.1) | Internal conflict; Table 3 = 16 |
| Epochs | 10, 20, 50 (all three run and compared) | Table 3 (“Tested for performance”) |
| Loss function(s) | MSE, MAE, RMSE (also used as evaluation metrics) | Table 3; no loss weighting is reported (the three are reported side by side, not summed into a weighted objective) |
| Early-stopping rule | not reported | No patience/criterion is described; the epoch counts are swept rather than early-stopped |
| Weight decay | not reported | Not mentioned for Adam |
| Hardware | 2 vCPU, 8 GB RAM, integrated GPU (Intel UHD Graphics 770); host CPU Intel at 2.7 GHz (3.7 GHz turbo) or 2.6 GHz | Sect. 5.1; no dedicated training GPU/TPU |
| Software stack | TensorFlow 2.11, PyTorch 1.12; Kubernetes Python client, Prometheus API client, NumPy, Pandas, APScheduler; served via FastAPI; kubectl v1.28.2, Docker 25.0.1, on WSL/Hyper-V | Sect. 5.1 |
Data / windowing.
| Parameter | Value | Source / note |
|---|---|---|
| Lookback window (sequence length) | 10 time steps | Table 3 (“Input window size”) |
| Forecast horizon (output length) | 1 step (single-step; “can be extended for longer horizons”) | Table 3 (“Forecast steps”) |
| Sampling / decision interval | 60 s (1 minute); logs bucketed into one-minute intervals; scaling recomputed every 60 s (the Container Deployment Target, CDT) | Sects. 4.2, 5.2 |
| Train/test split (NASA-HTTP) | Train: Jul 1, 1995 00:00:00 to Aug 14, 1995 23:59:59; Test: the remaining days to Aug 31 (given as 16 days, although Aug 15 to Aug 31 inclusive is 17 calendar days). A separate validation split is not reported (the prose mentions “training and validation” but does not give validation dates or fractions) | Sect. 5.2 |
| Train/test split (Google Cluster) | not reported as explicit dates/fractions; the full 29-day, ~41 GB 2011-2 trace is used to assess accuracy offline | Sect. 5.2 |
Autoscaling-control parameters (not network weights, but part of the configured system):
| Parameter | Value | Source |
|---|---|---|
| RRS (Resource Removal Strategy fraction) | 0.60 (remove 60% of surplus pods per step) | Sect. 4.2 |
| CDT cadence | 60 s | Sect. 4.2 |
| Pod bounds P_min / P_max | referenced as system-defined constraints (Eq. 2); specific numeric values not reported | Sect. 3.4 |
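The values in the tables above can be gathered into a single configuration object. This is a sketch using the Table 3 figures where the paper conflicts with itself; the class and field names are hypothetical, and `P_min` / `P_max` are omitted because no numeric values are reported.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformerAutoScaleConfig:
    # Architecture
    encoder_layers: int = 2
    decoder_layers: int = 1
    attention_heads: int = 4
    d_model: int = 64             # Table 3; the prose says 512
    feed_forward_dim: int = 2048
    dropout: float = 0.1          # Table 3; the prose says 0.2
    input_features: int = 3       # Table 3; the prose says 2
    # Data / windowing
    sequence_length: int = 10     # lookback, in one-minute steps
    output_length: int = 1        # single-step forecast
    # Training
    batch_size: int = 16          # Table 3; the prose says 64
    learning_rate: float = 0.001  # Adam, no schedule reported
    epochs: tuple = (10, 20, 50)  # all three were run and compared
    # Autoscaling control
    rrs: float = 0.60             # fraction of surplus pods removed per step
    cdt_seconds: int = 60         # scaling decision cadence


cfg = InformerAutoScaleConfig()
print(cfg.d_model // cfg.attention_heads)  # per-head width
```

With d_model 64 and 4 heads, each attention head works on a 16-dimensional slice.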
(B) How were the parameters chosen?
The paper devotes one paragraph (Sect. 5.1, immediately before Table 3) to justifying the configuration, and the dominant method is adoption from prior work / common forecasting defaults rather than a fresh search on this problem. Table 3’s caption explicitly cites references [5, 8, 16, 22, 34] as the provenance of “several key hyperparameters,” i.e. the values are carried over from the authors’ own earlier autoscaling papers and related forecasting work, not tuned de novo here.
Per-parameter selection method, with the paper’s stated rationale quoted where given:
ARIMA order (2, 1, 2): chosen by an information-criterion search, the only formal model-selection procedure in the paper. The text states it was “selected via AIC and BIC” (Akaike and Bayesian Information Criterion) to “capture short-term dependencies and handle non-stationarity.” This is a genuine search/selection step, but it applies only to the ARIMA baseline, not to the Informer.
Epochs (10, 20, 50): chosen by a small sweep / sensitivity comparison: “Training across 10, 20, and 50 epochs allows assessment of convergence over different forecasting lengths.” All three are run and reported rather than one being selected as final.
d_model = 64 and 4 attention heads: justified as a manual design trade-off, “to balance expressiveness and efficiency.” No search range is given.
Sequence length 10 and output length 1: justified manually by purpose, sequence length 10 “captures recent trends” and output length 1 “enables single-step predictions, which can be extended for longer horizons.” No alternatives were searched.
Hidden size 50 (LSTM/Bi-LSTM): justified manually, “provides enough capacity to learn meaningful patterns without overfitting.” No search.
Number of layers (1 LSTM / 2 Transformer / 2-1 Informer): justified manually as “the complexity needed for each model to effectively process temporal data.” No search.
Dropout 0.1 and Adam @ lr 0.001: justified manually as standard regularization/stabilization choices (“help regularize and stabilize training”). No range, schedule, or search.
Batch size 16: justified manually, “ensures manageable training time and generalization.” No search.
Loss = MSE/MAE/RMSE: chosen “to provide balanced performance insights”; they double as the evaluation metrics. No learned or weighted combination.
d_model 512 / feed-forward 2048 (prose figures): the prose presents these as giving “expressive capacity for complex patterns,” but, as noted, they conflict with Table 3’s d_model of 64 and are most plausibly inherited defaults; the paper gives no search justifying 512.
RRS = 0.60: stated as the configured experimental value (“the RRS value is configured at 0.60”) with a qualitative rationale (keep a buffer for spikes, remove surplus gradually). No sensitivity study or search over RRS is reported, so the specific 0.60 figure is presented without a tuning procedure.
Summary of method. Aside from the ARIMA order (AIC/BIC selection) and the epoch sweep (10/20/50, a coarse sensitivity comparison), no grid search, random search, or cross-validation over the Informer’s hyperparameters is reported. The Informer configuration is manual / default-driven, taken largely from prior work (refs [5, 8, 16, 22, 34]), with each value accompanied by a qualitative rationale rather than an empirical search over ranges. The paper does not report search ranges for any deep-learning hyperparameter, and is silent on how P_min/P_max and the Time2Vec/ProbSparse internal settings were chosen.
Inputs (what it consumes)
Time-series workload metrics, primarily the request arrival rate / HTTP request volume, plus CPU usage and memory utilization sampled from worker nodes.
User-side signals: request size, pending and dropped requests, arrival rate.
Dataset-specific features. From NASA-HTTP: source address, timestamp, offset, HTTP request, status code, domain name, response size. From Google Cluster: scheduling class, priority, type, scheduler, collection type/ID, vertical-scaling indicators, start-after collection IDs.
Lookback window: the Informer uses a sequence (input) length of 10 time steps; each time step is a fixed interval in minutes (logs are grouped into one-minute buckets).
Engineered features: Time2Vec + temporal embeddings derived from timestamps. The configured input uses ~3 features.
So in MAPE terms, the raw fuel is multivariate monitoring data, with request rate as the main predicted signal.
Outputs (what it produces)
A workload forecast `N(t+1)`: the predicted resource demand / request volume for the next time interval (the configured output length is 1 step, i.e. the next minute, though the Informer architecture supports longer multi-step horizons).
A recommended pod count (`P_desired` / `Pod Count`) derived from that forecast.
A concrete scaling action: scale-out, scale-in (gradual via RRS), or no-op, actually executed against Kubernetes every 60 seconds.
Provisioning/quality metrics as evaluation outputs: ΘU, ΘO, MSE, RMSE, MAE.
In the real run, the output that matters is the live pod count: it stayed in a tight, efficient 0–14 pod band versus the reactive method’s chaotic 0–150 band.
How It Fits the Autoscaling Framework (MAPE-K)
InformerAutoScale is an end-to-end proactive autoscaler mapped cleanly onto MAPE — it does not stop at prediction, it also actuates:
Monitor: collects CPU, memory, and request-rate metrics from Kubernetes (Metrics Server, cAdvisor).
Analyze (★ where the transformer lives): the Informer forecasts next-interval demand and the system computes ΘU/ΘO.
Plan: the Adaptation Manager + Pod Requirement services turn the forecast into a target pod count and decide scale-out/in.
Execute: Kubernetes/Docker actually create or delete pods.
Reactive or proactive? Firmly proactive — it scales on forecasts, not on current thresholds. (The paper explicitly benchmarks against a reactive baseline.)
Horizontal or vertical? Horizontal: it changes the number of pods. It does not do vertical scaling (when discussing related work, the paper mentions integrating a Vertical Pod Autoscaler as a possible future improvement).
Does it actuate, or just predict? Unlike papers that only output a forecast and hand off to a stock HPA, InformerAutoScale drives the actuation itself through its own Adaptation Manager Service and Pod Requirement Service rather than the default Kubernetes HPA — it runs the whole loop and writes the scaling actions to Kubernetes via YAML and Python scripts.
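One way a Python script might apply pod targets on a 60-second cadence is sketched below. It builds a `kubectl scale` invocation, which is a real command, but the function names, the deployment name `web`, and the dry-run default are assumptions for illustration. The paper's own YAML and scripts are not reproduced here.

```python
import subprocess
import time


def build_scale_command(deployment, replicas, namespace="default"):
    """Build the kubectl invocation that sets a deployment's replica count."""
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


def apply_scale(deployment, replicas, namespace="default", dry_run=True):
    """Run the scale command, or only return it when dry_run is True."""
    cmd = build_scale_command(deployment, replicas, namespace)
    if not dry_run:
        subprocess.run(cmd, check=True)  # raises if kubectl exits non-zero
    return cmd


def run_ticks(targets, deployment, interval_s=60, sleep=time.sleep, dry_run=True):
    """Apply one pod target per tick, waiting interval_s between ticks."""
    issued = []
    for i, replicas in enumerate(targets):
        issued.append(apply_scale(deployment, replicas, dry_run=dry_run))
        if i < len(targets) - 1:
            sleep(interval_s)
    return issued
```

Calling `run_ticks([3, 5, 4], "web", sleep=lambda s: None)` returns the three commands without touching a cluster, which makes the cadence logic easy to check before pointing it at a real deployment.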
Evaluation (datasets & metrics, briefly)
Datasets: NASA-HTTP (3.46M HTTP request logs from NASA’s Kennedy Space Center web server, Jul–Aug 1995; used for the real Docker/Kubernetes deployment) and Google Cluster Traces 2011 (29 days, ~12,500 machines, ~41 GB; used to simulate/assess predictive accuracy).
Setup: Docker Desktop + Kubernetes (kubectl v1.28.2) on WSL/Hyper-V, 2 vCPU / 8 GB RAM; models built with TensorFlow 2.11 / PyTorch 1.12 and served via FastAPI.
Forecasting error (NASA-HTTP): Informer MSE 9.185, RMSE 3.03, MAE 2.42, far below the Transformer (77.3 / 8.79 / 6.59), Bi-LSTM, LSTM, and ARIMA (worst, MSE ~197).
Provisioning accuracy (NASA-HTTP): Informer ΘO 4.28, ΘU 11.205 — lowest (best) of all models.
Scale pod count: across 0–200 minutes Informer stayed in a 10–40 band vs. Transformer 20–60, Bi-LSTM 30–70, LSTM 40–80, ARIMA 60–90.
Headline: reactive max 150 pods vs. proactive max 14 pods -> 90.66% scaling-efficiency gain, computed as (X−Y)/X × 100, where X is the reactive maximum and Y the proactive maximum.
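The error metrics and the headline formula are simple enough to check by hand. The sketch below uses hypothetical helper names and a made-up three-point series; only the 150 and 14 pod maxima and the reported MSE values come from the summary above.

```python
import math


def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of MSE."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def scaling_efficiency(reactive_max, proactive_max):
    """Percentage reduction in peak pods: (X - Y) / X * 100."""
    return (reactive_max - proactive_max) / reactive_max * 100


# Toy series, purely illustrative.
actual = [10, 12, 14]
predicted = [11, 12, 12]
toy_mse = mse(actual, predicted)
toy_mae = mae(actual, predicted)

# Headline figure: (150 - 14) / 150 * 100.
gain = scaling_efficiency(150, 14)
```

`gain` evaluates to 90.666..., so the reported 90.66% appears to be truncated rather than rounded. The reported RMSE values are also consistent with the MSE values, since the square roots of 9.185 and 77.3 are about 3.03 and 8.79.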
Training & pre-training
Trained from scratch — no pretrained or foundation model.
The Informer forecaster is trained from random initialization on cloud workload traces (NASA-HTTP and Google Cluster 2011) as ordinary supervised forecasting, and the deep-learning baselines (LSTM, Bi-LSTM, vanilla Transformer) are trained the same way under identical conditions. There is no pretraining, transfer learning, foundation model, or zero-shot component anywhere in the pipeline.
Training setup, for the record (numbers taken from Table 3, the explicit hyperparameter table; where the narrative prose disagrees, the conflict is flagged — see Model Parameters):
Optimization: Adam, learning rate 0.001 (no schedule reported), batch size 16 (Table 3, held fixed across all DL models; the prose says 64), dropout 0.1 (Table 3; the prose says 0.2); trained for 10 / 20 / 50 epochs (all three run and compared).
Architecture: d_model 64 (Table 3; the prose says 512), feed-forward / linear-transformation dimension 2048, 4 attention heads, ProbSparse attention, input length 10, output length 1 (single-step, though the architecture is multi-step capable).
Metrics: MSE, MAE, RMSE plus the over/under-provisioning Θ measures.
Stack: built with TensorFlow 2.11 / PyTorch 1.12; served via FastAPI on Docker + Kubernetes.
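For readers who want the Table 3 values in one place, they can be collected in a small configuration object. The class and field names below are hypothetical; the values are the ones listed above, with the prose conflicts noted in comments.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InformerConfig:
    """Table 3 settings as summarized above; field names are illustrative."""
    seq_len: int = 10             # input (lookback) length in time steps
    pred_len: int = 1             # output length: single-step forecast
    d_model: int = 64             # Table 3; the prose says 512
    d_ff: int = 2048              # feed-forward / linear-transformation dimension
    n_heads: int = 4
    dropout: float = 0.1          # Table 3; the prose says 0.2
    batch_size: int = 16          # Table 3; the prose says 64
    learning_rate: float = 0.001  # Adam, no schedule reported
    optimizer: str = "adam"
    epoch_sweep: tuple = (10, 20, 50)


cfg = InformerConfig()
head_dim = cfg.d_model // cfg.n_heads  # per-head width implied by d_model and n_heads
```

Freezing the dataclass keeps the reported settings from being changed by accident when the three epoch budgets are run in turn.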
Strengths
Genuinely proactive and end-to-end: forecasts and actuates, validated on a real Docker/Kubernetes deployment, not only in simulation.
Big, concrete efficiency win: ~10x fewer pods for the same traffic (150 -> 14), with the lowest forecasting error and best ΘU/ΘO among the compared models.
Right tool for long sequences: Informer’s ProbSparse attention (O(n log n)) and multi-step forecasting fit autoscaling better than O(n²) Transformers or forgetful LSTMs.
Stability built in: the 60-second CDT cadence + gradual RRS removal + ΔR(t) decision logic prevent oscillation/flapping and reserve a buffer for sudden spikes.
Practical stack: uses standard Kubernetes-native monitoring (Metrics Server, cAdvisor) and a FastAPI serving layer.
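To make the ProbSparse idea concrete, here is a simplified single-head NumPy sketch. It follows the general recipe of the Informer design: score each query on a random sample of keys, keep only the top-scoring queries for full attention, and let the remaining queries fall back to the mean of the values. The function name, the sampling factor `c=5`, and the use of the sample mean in the score are assumptions. Masking, multiple heads, and encoder distillation are omitted.

```python
import numpy as np


def probsparse_attention(Q, K, V, c=5, rng=None):
    """Simplified ProbSparse self-attention for one head, without masking."""
    rng = np.random.default_rng(0) if rng is None else rng
    L_q, d = Q.shape
    L_k = K.shape[0]
    n_sample = min(L_k, int(np.ceil(c * np.log(L_k))))  # keys sampled per query
    n_top = min(L_q, int(np.ceil(c * np.log(L_q))))     # active queries kept

    # Sparsity score per query: max minus mean of dot products over sampled keys.
    idx = rng.integers(0, L_k, size=(L_q, n_sample))
    sampled = np.einsum("qd,qsd->qs", Q, K[idx])
    score = sampled.max(axis=1) - sampled.mean(axis=1)
    top = np.argsort(score)[-n_top:]

    # Lazy queries output the mean of V; active queries get softmax attention.
    out = np.tile(V.mean(axis=0), (L_q, 1))
    logits = Q[top] @ K.T / np.sqrt(d)
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    out[top] = weights @ V
    return out, top
```

With a sequence of 64 steps and `c=5`, only 21 queries receive full attention, which is where the savings over dense attention come from. For very short sequences the cap `min(L_q, ...)` keeps every query active, so the sketch then behaves like ordinary attention.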
Limitations
Leans on historical patterns: poor at truly unpredictable events (viral spikes, DDoS) where a fast reactive system might respond sooner; the authors suggest a hybrid predictive+reactive design as future work.
Tuned for long, smooth sequences: the model may be less effective for short-duration, bursty workloads.
Lab environment: evaluated in a virtualized Docker Desktop / WSL setup with no real physical-hardware cluster, which may limit generalizability.
Operational fragility: frequent Kubernetes config changes can affect stability and reproducibility.
Horizontal only: no vertical scaling (CPU/RAM resizing of existing pods); single-step output length (1) was used even though the architecture supports longer horizons.
Authors’ own caveat: may struggle with highly unpredictable, real-world workloads.
Glossary
Pod: smallest deployable unit in Kubernetes; wraps one or more containers (an app instance).
HPA (Horizontal Pod Autoscaler): Kubernetes’ built-in, reactive, threshold-based pod scaler.
Reactive / Proactive autoscaling: act after load crosses a threshold vs. forecast and act in advance.
Cold start: the delay before a freshly created pod is ready to serve traffic.
ΘU / ΘO: under-provisioning (too few resources -> SLA risk) / over-provisioning (too many -> wasted cost); 0 is ideal.
MAPE(-K) loop: Monitor -> Analyze -> Plan -> Execute over shared Knowledge — the control loop for self-adaptive systems.
Transformer / attention: neural net that learns which past time steps matter for predicting the future.
Informer: long-sequence-optimized Transformer variant using ProbSparse attention (O(n log n)), multi-step forecasting, Time2Vec time encoding, and encoder distillation.
ProbSparse self-attention: attention that focuses only on the most informative queries, cutting cost from O(n²) to O(n log n).
Time2Vec: a learnable way to encode time so the model captures periodicity (e.g. daily cycles).
CDT (Container Deployment Target): the fixed 60-second cadence on which pod counts are recomputed and applied.
RRS (Resource Removal Strategy): remove only a fraction (here 60%) of surplus pods per step to stay stable and keep a buffer.
Metrics Server / cAdvisor / Prometheus: Kubernetes monitoring tools that expose CPU/memory/request metrics.
MSE / RMSE / MAE: standard forecasting error metrics (lower is better).
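As a small illustration of the Time2Vec entry, the encoding is commonly defined with one linear component plus sine components whose frequencies and phases are learned. In the sketch below the weights are fixed by hand rather than learned, and the daily period of 1440 minutes is an assumed example, not a value from the paper.

```python
import numpy as np


def time2vec(tau, omega, phi):
    """Encode scalar times: first component linear, the rest passed through sin."""
    tau = np.asarray(tau, dtype=float)[..., None]  # shape (..., 1)
    v = omega * tau + phi                          # shape (..., k + 1)
    return np.concatenate([v[..., :1], np.sin(v[..., 1:])], axis=-1)


# Hand-picked weights: a linear trend term and one daily cycle (1440 minutes).
omega = np.array([1.0, 2 * np.pi / 1440])
phi = np.array([0.0, 0.0])

# Two timestamps exactly one day apart, in minutes.
encoded = time2vec([0, 1440], omega, phi)
```

The two timestamps differ in the linear component but share the same periodic component, which is how the encoding lets a model recognize "same time of day" across different days.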
- Kumar, B., Verma, A., Verma, P., & Bennour, A. (2025). Optimizing resource allocation in cloud-native applications through proactive autoscaling with the InformerAutoScale model. The Journal of Supercomputing, 81(9). https://doi.org/10.1007/s11227-025-07500-7