Suggested Reading Order
A pedagogical path for a newcomer.
The logic of this ordering
The order is pedagogical, not chronological: concepts → clearest end-to-end example → forecaster variants grouped by technique → specialized/advanced ideas. We start with the papers that teach the core mental model (proactive vs. reactive, the MAPE-K loop, transformer-as-forecaster) in the simplest possible setting, then add one new idea at a time. Forecaster “tricks” (Informer, GAN, diffusion, frequency-domain, similarity-aware, convolution-augmented) are grouped so related ideas reinforce each other, and the most specialized or conceptually surprising papers come last. Years are noted but are not the sorting key.
The ordered list
1. Predictive Auto-scaler for Kubernetes — Shim et al., 2023
Why here: The clearest, simplest end-to-end story in the set — the ideal on-ramp. It frames the whole problem (reactive HPA is too slow → forecast the next minute → scale ahead) with a clean univariate, one-step setup and an explicit MAPE loop. You will learn: the core proactive-vs-reactive argument, the MAPE-K loop in practice, transformer-as-workload-forecaster, and your first efficient variant (Informer, with ProbSparse attention) — plus an honest result (the transformer only clearly wins on the harder, more variable dataset).
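The reactive-versus-proactive argument can be made concrete with a minimal sketch. Everything here is an illustrative assumption rather than the paper's method: the helper names `replicas_for` and `naive_forecast`, the per-pod capacity of 50 requests/s, the pod limits, and the made-up load history. The linear extrapolation is only a stand-in for the transformer forecaster.

```python
import math


def replicas_for(load_rps: float, per_pod_rps: float,
                 min_pods: int = 1, max_pods: int = 20) -> int:
    """Pods needed to serve load_rps, clamped to [min_pods, max_pods]."""
    needed = math.ceil(load_rps / per_pod_rps)
    return max(min_pods, min(max_pods, needed))


def naive_forecast(history: list[float]) -> float:
    """Stand-in one-step forecaster: extrapolate the last observed change.

    Expects a non-empty history. A real system would call a trained model here.
    """
    if len(history) < 2:
        return history[-1]
    return max(0.0, history[-1] + (history[-1] - history[-2]))


# Requests/s observed over the last three minutes (made-up numbers).
history = [100.0, 150.0, 220.0]

# Reactive: size the deployment for the load that has already arrived.
reactive = replicas_for(history[-1], per_pod_rps=50.0)

# Proactive: size the deployment for the load predicted for the next minute.
proactive = replicas_for(naive_forecast(history), per_pod_rps=50.0)

print(reactive, proactive)
```

On a rising load, the proactive path asks for the extra pod one interval earlier, which is the head start the reactive path lacks.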
2. Predictive Autoscaling with Transformers — Shrestha & Tuz Sabiha, 2025
Why here: It comes second because it is another near-textbook MAPE loop, but now truly live on Kubernetes via KEDA, and it adds a twist: it scales the ingress controller, the cluster’s front door. It cements the loop while showing real deployment plumbing. You will learn: how a forecast becomes a real scaling target through standard tooling (Prometheus → transformer → calculator → KEDA → HPA), and the honest finding that a predictive scaler doesn’t always beat plain HPA.
3. InformerAutoScale — Kumar, Verma, Verma & Bennour, 2025
Why here: Builds directly on the Informer idea from #1 but completes the loop into a full deployable autoscaler that actuates pods itself. Natural next step from “forecast” to “forecast + act.” You will learn: how Informer’s long-sequence, multi-step strengths fit autoscaling; how a Plan/Execute layer (Adaptation Manager, gradual RRS removal, 60s cadence) is built; and how to read a flashy headline number critically (the 90.66% gain is a single-trace max-pod comparison).
4. MV-Transformer MAPE Framework — Kumar, Verma & Verma, 2025
Why here: A sibling of #3 (shared authors, same Docker/Kubernetes testbed) whose new lesson is multivariate input rather than efficient attention: it forecasts from several entangled metrics at once, using Pearson correlation for feature selection. You will learn: why multivariate forecasting can beat univariate (with the caveat that the correlations here were weak), and practical anti-thrashing controls (CPT, RRS). Reading #3 and #4 back-to-back is efficient because they overlap heavily.
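Pearson-based feature selection is easy to picture with a small sketch: keep only the metrics whose correlation with the forecast target clears a threshold. The metric names, the sample values, the 0.5 threshold, and the helper `select_features` are all assumptions for illustration; the paper's actual metrics and cutoff are not given here.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0  # a constant series carries no linear signal
    return cov / (spread_x * spread_y)


def select_features(metrics: dict[str, list[float]], target: str,
                    threshold: float = 0.5) -> list[str]:
    """Names of metrics whose |correlation| with the target meets the threshold."""
    return [
        name for name, series in metrics.items()
        if name != target and abs(pearson(series, metrics[target])) >= threshold
    ]


# Made-up samples: memory tracks CPU closely, disk I/O does not.
metrics = {
    "cpu": [10.0, 20.0, 30.0, 40.0, 50.0],
    "memory": [12.0, 19.0, 33.0, 41.0, 48.0],
    "disk_io": [5.0, 1.0, 4.0, 2.0, 5.0],
}
selected = select_features(metrics, target="cpu")
print(selected)
```

When correlations are weak across the board, as the caveat above notes, a filter like this keeps few or no extra inputs, and the multivariate model has little to gain over the univariate one.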
5. Transformer Workload Prediction & Adaptive Auto-Scaling — Naik et al., 2026
Why here: Rounds out the “end-to-end framework” cluster by adding the missing Plan idea: a real cost optimizer (α SLA-penalty + β provisioning-cost + γ scaling-jumpiness) plus a damping cap, instead of a trivial division of forecast load by per-pod capacity. You will learn: how to turn a forecast into a principled scaling decision that balances SLA, cost, and stability, with the caveat that this one is validated in a simulator, not a live cluster.
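A weighted-cost Plan step of this shape might look like the sketch below. The three terms mirror the α/β/γ structure named above, but how each term is measured, the weight values, and the `max_step` damping cap are assumptions for illustration, not the paper's formulation.

```python
def plan_replicas(forecast_rps: float, current: int, per_pod_rps: float,
                  alpha: float = 10.0, beta: float = 1.0, gamma: float = 0.5,
                  max_step: int = 2, max_pods: int = 20) -> int:
    """Pick the replica count with the lowest weighted cost.

    Candidates are limited to current +/- max_step, which acts as the damping cap.
    """
    lowest = max(1, current - max_step)
    highest = min(max_pods, current + max_step)

    def cost(n: int) -> float:
        # SLA penalty: unmet demand, expressed in pod-equivalents (assumed measure).
        shortfall = max(0.0, forecast_rps - n * per_pod_rps) / per_pod_rps
        provisioning = n              # cost grows with every pod kept running
        jump = abs(n - current)       # penalize large changes between intervals
        return alpha * shortfall + beta * provisioning + gamma * jump

    return min(range(lowest, highest + 1), key=cost)


# Forecast of 290 requests/s, 50 requests/s per pod, currently 3 pods.
step_one = plan_replicas(290.0, current=3, per_pod_rps=50.0)
# Next interval, same forecast, starting from the previous decision.
step_two = plan_replicas(290.0, current=step_one, per_pod_rps=50.0)
print(step_one, step_two)
```

With these assumed weights the optimizer wants 6 pods, but the cap only lets it move from 3 to 5 in the first interval and reach 6 in the second. That two-step approach is the stability side of the trade-off.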
6. CATScaler — Meng, Dai, Cong, Zhu & Zhao, 2025
Why here: The capstone of the framework camp and the most complete Plan stage. It introduces convolution-augmented attention (catch local spikes + global trends) and replaces the naive forecast→pod formula with a learned LightGBM mapping. You will learn: how convolution complements attention, why the nonlinear forecast→pod-count mapping matters, and what a polished, retraining, live-cluster autoscaler looks like (with the caveat that abstract numbers differ from per-experiment ones).
7. WGAN-gp Transformer — Arbat, Jayakumar, Lee, Wang & Kim, 2022
Why here: Now we shift from systems to forecasting techniques. This is the gateway to the special-technique forecasters: same job-arrival-rate target as several others, but trained adversarially (GAN). The oldest paper, placed late on purpose — its value is the technique, not the loop. You will learn: adversarial training (generator vs. critic), why WGAN-gp is more stable than plain GANs, that the transformer is ~5x faster than LSTM, and the honest finding that it loses to LSTM on strongly seasonal workloads.
8. Diffusion Autoformer (DAF) — Kumar, Chauhan et al., 2025
Why here: The natural successor to #7 in the “generative forecasters” thread — it even uses WGAN-gp as a baseline. It raises the bar from a single guess to a probabilistic forecast with a confidence band, fusing trend/seasonal decomposition + diffusion + context. You will learn: series decomposition (Autoformer), diffusion models for forecasting, why uncertainty matters (provision to the upper band for SLA safety or the mean for cost), and the trend-guided trick that keeps diffusion fast enough (~68 ms).
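The choice between provisioning to the upper band and provisioning to the mean can be sketched from forecast samples. This assumes the probabilistic forecaster can be sampled repeatedly for the same future step; the sample values, the 0.9 quantile, and the helper names are hypothetical.

```python
import math
import statistics


def quantile(samples: list[float], q: float) -> float:
    """Nearest-rank quantile of the samples, for 0 < q <= 1."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def provisioning_target(samples: list[float], policy: str, q: float = 0.9) -> float:
    """Load level to provision for under a given policy.

    "sla"  -> upper band: cover most sampled futures, at higher cost.
    "cost" -> mean: cheaper, but under-provisions when load lands high.
    """
    if policy == "sla":
        return quantile(samples, q)
    if policy == "cost":
        return statistics.fmean(samples)
    raise ValueError(f"unknown policy: {policy}")


# Ten sampled forecasts (requests/s) for the same future step, made up.
samples = [180.0, 200.0, 210.0, 190.0, 260.0, 205.0, 195.0, 220.0, 240.0, 200.0]
sla_target = provisioning_target(samples, "sla")
cost_target = provisioning_target(samples, "cost")
print(sla_target, cost_target)
```

The gap between the two targets is the price of SLA safety for that step; a point forecast gives no way to quantify it.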
9. Fremer — Ye, Chen et al. (ByteDance), 2025
Why here: A third distinct forecasting paradigm — the frequency domain (FFT). Placed after the generative pair because it’s a bigger conceptual leap (forecast a spectrum, not a curve), and it reintroduces the scale/efficiency concern at industrial volume (100k+ forecasts/hour). You will learn: time vs. frequency domain, why periodic cloud workloads are a “chord” of cycles, complex-valued attention, and how a tiny model (~0.57M params) can beat heavy SOTA — validated end-to-end on Kubernetes HPA.
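The “chord of cycles” picture can be checked with a plain discrete Fourier transform. The sketch below is only an illustration of the time-versus-frequency idea, not Fremer's architecture: the synthetic signal, its 24-step and 12-step cycles, and the helper `dft_magnitudes` are assumptions.

```python
import cmath
import math


def dft_magnitudes(signal: list[float]) -> list[float]:
    """Normalized magnitude of each DFT bin from 0 up to the Nyquist bin.

    Naive O(n^2) transform, fine for a short illustrative series.
    """
    n = len(signal)
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        magnitudes.append(abs(coeff) / n)
    return magnitudes


# Synthetic workload: a baseline plus a 24-step cycle and a weaker 12-step cycle.
n = 48
signal = [100 + 30 * math.sin(2 * math.pi * t / 24)
          + 10 * math.sin(2 * math.pi * t / 12)
          for t in range(n)]

magnitudes = dft_magnitudes(signal)

# Skip bin 0 (the baseline) and pick the two strongest cycles.
top_bins = sorted(range(1, len(magnitudes)),
                  key=lambda k: magnitudes[k], reverse=True)[:2]
periods = sorted(n / k for k in top_bins)
print(top_bins, periods)
```

In the time domain this signal takes 48 numbers to describe; in the frequency domain it is essentially three: the baseline and two cycle strengths. That compactness is what makes forecasting a spectrum attractive for strongly periodic workloads.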
10. FELT — Ding, Feng & Yu, 2025
Why here: The scale-out-the-model idea: one shared transformer for thousands of containers via similarity-aware attention. It’s a clever, less-obvious repurposing of transformer machinery (position encoding and attention masks encode container similarity, not time), so it rewards the context built by earlier papers. You will learn: the per-container model-explosion problem, blending historical + recent similarity, attention masks for grouping, and discretizing output into coarse “levels” — a different axis of innovation (efficiency at fleet scale) than accuracy alone.
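Three of these ideas fit in a short sketch: blending historical and recent similarity, turning the blend into an attention mask, and discretizing outputs into levels. This is a loose illustration under stated assumptions, not FELT's actual formulation: the blend weight, the 0.5 threshold, the bucket edges, and all function names are hypothetical.

```python
import bisect


def blend_similarity(historical: list[list[float]], recent: list[list[float]],
                     weight: float = 0.5) -> list[list[float]]:
    """Weighted mix of two pairwise similarity matrices of the same shape."""
    return [
        [weight * h + (1 - weight) * r for h, r in zip(h_row, r_row)]
        for h_row, r_row in zip(historical, recent)
    ]


def similarity_mask(similarity: list[list[float]],
                    threshold: float) -> list[list[bool]]:
    """mask[i][j] is True when container i may attend to container j."""
    n = len(similarity)
    return [
        [i == j or similarity[i][j] >= threshold for j in range(n)]
        for i in range(n)
    ]


def to_level(value: float, edges: list[float]) -> int:
    """Index of the coarse bucket that value falls into (edges ascending)."""
    return bisect.bisect_right(edges, value)


# Made-up pairwise similarities for three containers.
historical = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
recent = [[1.0, 0.7, 0.3], [0.7, 1.0, 0.6], [0.3, 0.6, 1.0]]

mask = similarity_mask(blend_similarity(historical, recent), threshold=0.5)
level = to_level(37.0, edges=[25.0, 50.0, 75.0])
print(mask, level)
```

Here containers 0 and 1 end up in one group and container 2 stays alone, so a single shared model can treat the first two as context for each other without mixing in the third.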
11. CloudFormer — Shahbazinia, Huang, Costero & Atienza, 2026
Why here: Deliberately last because it bends the frame: it predicts performance degradation (slowdown from noisy neighbors), not future workload. Once you’ve internalized the standard “forecast demand → scale” pattern, this paper makes you see the Analyze stage more broadly. You will learn: the noisy-neighbor / interference problem, black-box (host-only) prediction, a dual-branch design (time view + cross-metric view), regression-of-slowdown vs. yes/no detection, and how a non-workload predictor still slots into MAPE-K’s Analyze stage to drive migration/placement/scaling.
If you only read three, read these
Shim (#1) for the clearest end-to-end mental model, CATScaler (#6) for the most complete real framework (best Plan/Execute), and Fremer (#9) for the most distinctive, efficient forecasting idea.