Papers Comparison
A map for newcomers. Read this alongside the individual paper summaries.
Common backbone: the MAPE-K loop
Almost every paper in this review innovates in Analyze (a better forecaster) and reuses or hand-waves Plan and Execute. A minority build the whole loop.
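As a rough illustration of how the four stages fit together, the following is a minimal sketch of one MAPE-K pass. Every name in it (`Knowledge`, `mape_k_step`, `naive_forecast`, `capacity_plan`, the 100 requests/s per replica figure) is hypothetical and not taken from any of the reviewed papers; the point is only that a paper can swap in a better `analyze` while leaving the other stages untouched.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Knowledge:
    """Shared state (the K in MAPE-K): metric history and current replica count."""
    history: List[float] = field(default_factory=list)
    replicas: int = 1


def mape_k_step(
    knowledge: Knowledge,
    monitor: Callable[[], float],
    analyze: Callable[[List[float]], float],
    plan: Callable[[float, int], int],
    execute: Callable[[int], None],
) -> int:
    """Run one Monitor -> Analyze -> Plan -> Execute pass; return the new replica count."""
    knowledge.history.append(monitor())          # Monitor: collect the latest sample
    forecast = analyze(knowledge.history)        # Analyze: forecast upcoming load
    target = plan(forecast, knowledge.replicas)  # Plan: turn the forecast into a target
    execute(target)                              # Execute: hand the target to the actuator
    knowledge.replicas = target
    return target


def naive_forecast(history: List[float]) -> float:
    # Placeholder forecaster (assumption): repeat the last observation.
    return history[-1]


def capacity_plan(forecast: float, current: int) -> int:
    # Assumption: each replica serves 100 requests/s.
    return max(1, math.ceil(forecast / 100.0))


samples = iter([120.0, 180.0, 260.0])  # stand-in for a metrics source
applied: List[int] = []                # stand-in for an actuator

state = Knowledge()
for _ in range(3):
    mape_k_step(state, lambda: next(samples), naive_forecast, capacity_plan, applied.append)

print(applied)  # [2, 2, 3]
```

In this framing, a "pure forecaster" paper replaces only `naive_forecast`, while an end-to-end framework supplies all four callables.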
Taxonomy
Pure forecasters
Analyze only; hand off to an external scaler.
Contribution is forecast accuracy/efficiency, not actuation.
DAF (probabilistic job-arrival)
FELT (one model, all containers)
Fremer (frequency-domain)
WGAN-gp (adversarial; thin “1 job = 1 VM” scaler).
The forecast is then handed off on the assumption that “a standard HPA does the rest”.
End-to-end frameworks
Implement the full MAPE-K loop.
MV-Transformer (scales VMs, live)
AdaptiveAutoScaling (cost-optimizing controller, simulator)
PredictiveAutoscaling (feeds KEDA, live, scales ingress)
PredictiveK8s (Informer + simple rule, log-replay)
CATScaler (+ LightGBM scaler, live pods)
InformerAutoScale (Informer + own manager, live).
These papers carry all four MAPE steps and loop every 30–60 s.
Performance predictors
Predict slowdown, not workload.
CloudFormer: VM degradation ratio from black-box host metrics; feeds placement/migration/scaling.
Papers notable for a special technique
The technique is often the whole reason to read it.
| Technique | Paper(s) | What it provides |
|---|---|---|
| Diffusion (generative, probabilistic) | DAF | Outputs a confidence band, not a point forecast; tune for SLA-safety vs. cost. |
| Adversarial / GAN (WGAN-gp) | WGAN-gp | A critic pushes forecasts to look realistic; better on bursty traffic, ~5x faster than LSTM. |
| Frequency-domain (FFT) | Fremer | Reads repeating cycles instead of the raw curve; ~12x smaller, ~3x faster, handles multi-period workloads. |
| Convolution-augmented | CATScaler (and CloudFormer’s system branch) | Convolution catches short local spikes that attention smooths over. |
| Similarity-aware (shared model) | FELT | One model serves thousands of containers by grouping look-alikes via attention masks. |
| Efficient long-sequence attention (Informer / ProbSparse) | PredictiveK8s, InformerAutoScale | Cuts attention cost from O(n²) to ~O(n log n) for long inputs and multi-step horizons. |
| Dual-branch (time + cross-metric) | CloudFormer | One branch reads time, one reads the metrics; generalizes to unseen apps. |
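To make the "confidence band, not a point forecast" row concrete, here is a small sketch of how sampled forecasts could be turned into a replica count by choosing a quantile. It is not DAF's method: the sample values, the per-replica capacity, and the function names are all assumptions made for illustration. A higher quantile provisions for the upper edge of the band (SLA-safe), a lower one for the middle (cheaper).

```python
import math
from typing import Sequence


def quantile(samples: Sequence[float], q: float) -> float:
    """Empirical quantile with linear interpolation between order statistics."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be in [0, 1]")
    ordered = sorted(samples)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return ordered[lo] * (1.0 - frac) + ordered[hi] * frac


def replicas_for_quantile(samples: Sequence[float], q: float, capacity: float) -> int:
    """Provision enough replicas to cover the q-th quantile of forecast load."""
    return max(1, math.ceil(quantile(samples, q) / capacity))


# Hypothetical forecast samples (requests/s) drawn from a probabilistic model.
forecast_samples = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0,
                    170.0, 190.0, 220.0, 260.0, 400.0]

# Assumption: one replica serves 100 requests/s.
cost_lean = replicas_for_quantile(forecast_samples, 0.5, capacity=100.0)
sla_safe = replicas_for_quantile(forecast_samples, 0.9, capacity=100.0)

print(cost_lean, sla_safe)  # 2 3
```

The single knob `q` is what "tune for SLA-safety vs. cost" amounts to in this simplified setting.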
Comparison table
CloudFormer predicts degradation of an ongoing run (to pre-empt a QoS violation), not a fixed time-ahead forecast.
Model Training
Every paper in this review trains its model from scratch on cloud workload/trace data. None uses a pretrained time-series foundation model: no borrowed backbone, no fine-tuning, no zero-shot transfer. The contribution is always a task-specific architecture fit directly to the traces.
| Paper | Optimizer (LR) | Loss | Epochs | Train/val/test split |
|---|---|---|---|---|
| MV-Transformer | not stated | MSE/MAE/RMSE/MAPE | 10 / 50 / 100 | 80 / 20 |
| AdaptiveAutoScaling | Adam (1e-4) | multi-step MSE | ≤100, early stop | 70 / 15 / 15 |
| PredictiveAutoscaling | Darts defaults (unstated) | unstated | 200 | 70 / 30 |
| PredictiveK8s | not stated | MSE | 10 (NASA) / 4 (FIFA), early stop | train + last-16-days test |
| DAF | AdamW (1e-4) | denoising + MSE recon | early stop | train + val split |
| FELT | not stated | (classification) | not stated | 7 : 1 : 2 |
| Fremer | Adam (1e-3) | MSE | not stated | 8 : 2 |
| WGAN-gp | MADGRAD (1e-3) | MAE + WGAN-gp critic | 1000 | 60 / 20 / 20 |
| CATScaler | Adam | SmoothL1 (Alibaba) / MSE (Huawei) | early stop (5) | 80 / 10 / 10 |
| InformerAutoScale | Adam (1e-3) | MSE/MAE/RMSE | 10 / 20 / 50 | not stated |
| CloudFormer | Adam (1e-5) | log-cosh | not stated | 7 train / 4 unseen apps |
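The less common entries in the Loss column (SmoothL1 and log-cosh) behave like MSE for small errors and like MAE for large ones, which makes them less sensitive to bursty outliers. The sketch below writes the four losses in plain Python using their standard textbook definitions; the `beta=1.0` threshold for SmoothL1 is an assumed default, and the example values are made up rather than taken from any paper.

```python
import math
from typing import List, Sequence


def _errors(pred: Sequence[float], target: Sequence[float]) -> List[float]:
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    return [p - t for p, t in zip(pred, target)]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    errs = _errors(pred, target)
    return sum(e * e for e in errs) / len(errs)


def mae(pred: Sequence[float], target: Sequence[float]) -> float:
    errs = _errors(pred, target)
    return sum(abs(e) for e in errs) / len(errs)


def smooth_l1(pred: Sequence[float], target: Sequence[float], beta: float = 1.0) -> float:
    """Quadratic below beta, linear above it (beta=1.0 is an assumed default)."""
    errs = _errors(pred, target)
    total = 0.0
    for e in errs:
        a = abs(e)
        total += 0.5 * a * a / beta if a < beta else a - 0.5 * beta
    return total / len(errs)


def log_cosh(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean of log(cosh(error)), computed in a form that avoids overflow."""
    errs = _errors(pred, target)
    # log(cosh(x)) = |x| + log(1 + exp(-2|x|)) - log(2)
    return sum(abs(e) + math.log1p(math.exp(-2.0 * abs(e))) - math.log(2.0)
               for e in errs) / len(errs)


# Made-up example: two small errors and one large, burst-like miss.
prediction = [1.0, 2.0, 10.0]
observed = [1.0, 1.5, 4.0]

for loss in (mse, mae, smooth_l1, log_cosh):
    print(loss.__name__, round(loss(prediction, observed), 4))
```

On this example the single large miss dominates MSE far more than it dominates SmoothL1 or log-cosh, which is the usual reason for choosing them on spiky traces.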
Two “false friends” worth flagging — wording that looks like pretraining but isn’t:
Fremer uses the words “zero-shot” and “foundational backbone”, but these describe Fremer itself being trained once and transferring across datasets (via channel independence), and its potential as a future shared backbone — not the use of any external pretrained model.
DAF calls itself a “hybrid” model, but “hybrid” means combining decomposition + diffusion + exogenous attention — not mixing pretrained and from-scratch training.
Other notables:
WGAN-gp is the odd one out: trained adversarially, per workload (it does not generalize across traces), and with MADGRAD rather than Adam — an ablation shows that optimizer choice is the key accuracy driver.
The closest thing to “train once, then serve” is the deploy pattern in CATScaler and PredictiveAutoscaling: train offline, serve, then periodically retrain (e.g. a Kubernetes CronJob) to fight distribution shift — still from-scratch retraining, not foundation-model fine-tuning.
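For readers unfamiliar with the adversarial training mentioned for WGAN-gp, the standard WGAN-GP critic objective is reproduced below as a reference point. This is the generic formulation, not a transcription from the paper: the symbols ($D$ for the critic, $P_r$ and $P_g$ for real and generated series, $\lambda$ for the penalty weight) are the conventional ones, and how the paper weights this term against its MAE loss is not shown here.

$$
L_{\text{critic}} = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big] - \mathbb{E}_{x \sim P_r}\big[D(x)\big] + \lambda \, \mathbb{E}_{\hat{x} \sim P_{\hat{x}}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big]
$$

Here $\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}$ with $\epsilon \sim U[0, 1]$, so the penalty is evaluated on points interpolated between real and generated series. The last term pushes the critic's gradient norm toward 1, which is how the 1-Lipschitz constraint is enforced in practice.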
MAPE Stage-by-stage
Monitor
No paper innovates in this stage, but the richness of the inputs varies. The table below lists the metrics each paper actually feeds the model and where they come from.
| Paper | Input metrics | Uni/Multi | Source |
|---|---|---|---|
| MV-Transformer | CPU, memory, disk R/W, network throughput (Pearson selects throughput↔memory; predicts throughput) | Multi | Trace files (Bitbrains, Google, Azure Functions) |
| AdaptiveAutoScaling | CPU, memory, request arrival rate, queue length | Multi | Google Cluster trace, in simulator |
| CATScaler | CPU, memory, RPS per API + machine specs | Multi | Prometheus; Alibaba & Huawei traces |
| InformerAutoScale | CPU, memory, request rate (→ one aggregated workload target) | Multi | Metrics Server + cAdvisor (Prometheus secondary) |
| CloudFormer | 206 host metrics = 103 × (target VM + neighbors); each 103 = 53 VM (libvirt) + 38 Linux perf counters + 12 Intel Top-Down | Multi | libvirt API, Linux perf, Intel Top-Down |
| PredictiveAutoscaling | Ingress request rate (RPS) — CPU/mem/latency tracked but not model inputs | Uni | Prometheus + Grafana (live, via KEDA) |
| PredictiveK8s | HTTP requests/min | Uni | Prometheus architecturally; experiments on offline NASA/FIFA logs |
| DAF | Job-arrival rate + hour/day context | Uni | Traces (Google, Azure, Alibaba, Facebook, Wikipedia) |
| WGAN-gp | Job-arrival rate | Uni | Traces (Facebook, Alibaba, Google, Wiki, Azure) |
| Fremer | CPU or QPS per instance (CPU for IaaS/PaaS, QPS for FaaS/RDS) | Uni per-series | ByteDance + Materna traces |
| FELT | CPU only, compressed into 6 features/container: min, Q1, median, Q3, max + 1 waveform class | Uni (per-source) | Alibaba microservices, Fisher |
Takeaways:
Prometheus recurs (PredictiveAutoscaling, CATScaler, and architecturally PredictiveK8s); InformerAutoScale instead reads Metrics Server + cAdvisor.
Most “univariate” papers are workload-only (request/job-arrival rate); the multivariate ones add resource counters (CPU, memory, ...).
CloudFormer is the outlier — black-box host hardware counters (206), not workload at all.
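The FELT row compresses each container's CPU series into six features. The first five are an ordinary five-number summary, sketched below with the standard library. The quartile convention (`method="inclusive"`) is an assumption, since the table does not say which one FELT uses, and the sixth feature (the waveform class) is passed in as an opaque label because its derivation is not described here. Function names are hypothetical.

```python
import statistics
from typing import List, Sequence, Tuple


def five_number_summary(cpu: Sequence[float]) -> Tuple[float, float, float, float, float]:
    """Return (min, Q1, median, Q3, max) for one container's CPU series."""
    if len(cpu) < 2:
        raise ValueError("need at least two samples to compute quartiles")
    # Assumption: inclusive quartiles; other conventions give slightly different Q1/Q3.
    q1, median, q3 = statistics.quantiles(cpu, n=4, method="inclusive")
    return (min(cpu), q1, median, q3, max(cpu))


def container_features(cpu: Sequence[float], waveform_class: int) -> List[float]:
    """Six features per container: five-number summary plus a waveform class label.

    `waveform_class` is supplied by the caller; how it is computed is out of scope.
    """
    return [*five_number_summary(cpu), float(waveform_class)]


cpu_series = [10, 20, 30, 40, 50]  # made-up CPU utilisation samples (%)
print(five_number_summary(cpu_series))                  # (10, 20.0, 30.0, 40.0, 50)
print(container_features(cpu_series, waveform_class=2))
```

Reducing a long series to a handful of order statistics is what lets one shared model handle thousands of containers cheaply, at the cost of discarding the shape of the curve (hence the separate waveform class).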
Analyze
Plain transformer, made multivariate/looped: MV-Transformer, AdaptiveAutoScaling.
Efficient long-sequence (Informer): PredictiveK8s, InformerAutoScale.
Generative/probabilistic (diffusion): DAF, the only one with uncertainty.
Adversarial (GAN): WGAN-gp, the only one with a critic on forecast realism.
Frequency-domain (FFT): Fremer, the only one forecasting a spectrum.
Convolution-augmented: CATScaler, local-spike + global-trend.
Similarity-aware shared model: FELT, one model for thousands.
Performance predictor: CloudFormer, predicts slowdown, not workload.
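The frequency-domain entry above is easier to picture with a toy example. The sketch below is not Fremer's architecture (which is a learned model operating on the spectrum); it only shows, with a plain discrete Fourier transform, why a workload with several repeating cycles is compact in the frequency domain. The synthetic series and the function names are assumptions.

```python
import cmath
import math
from typing import List, Sequence


def dft_magnitudes(series: Sequence[float]) -> List[float]:
    """Magnitudes of DFT bins 0..n/2 of the mean-centred series (naive O(n^2) DFT)."""
    n = len(series)
    mean = sum(series) / n
    centred = [x - mean for x in series]
    magnitudes = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        magnitudes.append(abs(coeff))
    return magnitudes


def dominant_periods(series: Sequence[float], top: int = 2) -> List[float]:
    """Periods (in samples) of the `top` strongest non-zero frequency bins."""
    mags = dft_magnitudes(series)
    n = len(series)
    strongest = sorted(range(1, len(mags)), key=lambda k: mags[k], reverse=True)[:top]
    return [n / k for k in strongest]


# Synthetic workload: a 24-sample cycle plus a weaker 8-sample cycle, 96 samples long.
series = [10.0 + 5.0 * math.sin(2 * math.pi * t / 24) + 2.0 * math.sin(2 * math.pi * t / 8)
          for t in range(96)]

periods = dominant_periods(series, top=2)
print(periods)  # [24.0, 8.0]
```

Ninety-six raw samples collapse to two significant frequency bins here, which hints at why a model that reads the spectrum can be much smaller than one that reads the raw curve.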
Plan
Where approaches diverge most sharply:
Trivial / assumed
WGAN-gp (1 job = 1 VM)
InformerAutoScale (pods = load / capacity)
DAF (linear, left to HPA).
Designed controllers
AdaptiveAutoScaling (α/β/γ cost optimization + damping cap)
CATScaler (LightGBM learns the nonlinear forecast→pod mapping, the most sophisticated Plan)
PredictiveAutoscaling (RPS → KEDA target).
“Anti-flapping” logic
Cooldown / stabilization window: MV-Transformer’s CPT (Cool-down Period Time), CATScaler’s cooldown.
Reactive rescaling guard (RRS): only apply a scale-in if the lower demand persists or passes a check, rather than reacting to a single dip. Used by MV-Transformer and InformerAutoScale (on a 60 s cadence).
Queue + cooldown: CATScaler runs scale requests through a FIFO queue plus a cooldown so bursts of decisions collapse into one.
Damping cap: AdaptiveAutoScaling limits how much capacity can change per step, so it can’t swing wildly.
Not implemented
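The Plan-stage ideas listed above (the `pods = load / capacity` rule, a scale-in persistence guard, a cooldown, and a damping cap) can be combined in a few lines. The sketch below is a composite made up for illustration and does not reproduce any single paper's controller; the class name, field names, and all numeric values are assumptions.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Planner:
    capacity_per_pod: float   # assumed load one pod can serve
    max_step: int             # damping cap: max replica change per decision
    cooldown_s: float         # minimum time between applied changes
    scale_in_patience: int    # consecutive low-demand decisions needed to scale in
    replicas: int = 1
    last_change_s: float = -math.inf
    low_streak: int = 0

    def decide(self, forecast_load: float, now_s: float) -> int:
        # Trivial rule: pods = load / capacity, rounded up, at least one pod.
        desired = max(1, math.ceil(forecast_load / self.capacity_per_pod))

        # Scale-in guard: ignore a dip until it has persisted long enough.
        if desired < self.replicas:
            self.low_streak += 1
            if self.low_streak < self.scale_in_patience:
                return self.replicas
        else:
            self.low_streak = 0

        # Cooldown: do nothing if the last change was too recent.
        if now_s - self.last_change_s < self.cooldown_s:
            return self.replicas

        # Damping cap: clamp the change to +/- max_step replicas.
        delta = max(-self.max_step, min(self.max_step, desired - self.replicas))
        if delta != 0:
            self.replicas += delta
            self.last_change_s = now_s
            self.low_streak = 0
        return self.replicas


planner = Planner(capacity_per_pod=100.0, max_step=2, cooldown_s=60.0, scale_in_patience=2)
timeline = [(450.0, 0.0), (450.0, 30.0), (450.0, 60.0), (120.0, 120.0), (120.0, 180.0)]
decisions: List[int] = [planner.decide(load, t) for load, t in timeline]
print(decisions)  # [3, 3, 5, 5, 3]
```

In the trace, the jump to 450 is reached in two capped steps separated by the cooldown, and the drop to 120 is only acted on once it has been seen twice.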
Execute
Live K8s/Docker
PredictiveAutoscaling (KEDA → HPA)
CATScaler (pod replicas)
InformerAutoScale & MV-Transformer (Docker Desktop + K8s)
Simulator / log-replay
AdaptiveAutoScaling (discrete-time sim)
PredictiveK8s (offline replay)
Real cloud VMs
WGAN-gp (Google Cloud e2-medium)
No actuation (forecast-only)
Fremer (except 24h HPA demo)
In one sentence each
MV-Transformer — multivariate forecaster, full loop, scales real VMs.
AdaptiveAutoScaling — forecaster + cost-optimizing controller, in simulation.
PredictiveAutoscaling — scales the ingress controller via KEDA, live.
PredictiveK8s — first to bring Informer to K8s autoscaling.
DAF — probabilistic forecast with a confidence band.
FELT — one tiny model for thousands of containers via similarity-aware attention.
Fremer — frequency-domain, ~12x smaller, multi-period workloads.
WGAN-gp — adversarially-trained, fast, bursty-traffic forecaster.
CATScaler — convolution-augmented forecast + LightGBM pod model, full live loop.
InformerAutoScale — Informer forecaster + own actuator, ~10x fewer pods than reactive.
CloudFormer — predicts slowdown from black-box host metrics, not workload.
Acronyms & short names
Models / methods named in this review
| Short name | Stands for | What it is |
|---|---|---|
| DAF | Diffusion Autoformer | Autoformer (trend+seasonal decomposition) with a diffusion decoder → probabilistic JAR forecast with a confidence band. |
| MV-Transformer | Multivariate Transformer | Encoder-decoder transformer fed several metrics at once (CPU, mem, disk, net) to exploit cross-metric correlations. |
| WGAN-gp | Wasserstein Generative Adversarial Network with gradient penalty | A stable GAN variant: a transformer generator forecasts the series; an MLP critic scores realism via Wasserstein (earth-mover’s) distance, kept 1-Lipschitz by a gradient penalty. |
| FELT | (model name) | Encoder-only, similarity-aware transformer; one shared model classifies workload for thousands of containers. |
| Fremer | Frequency transformer | Frequency-domain (FFT) transformer that forecasts a spectrum instead of the raw curve. |
| CATScaler | Convolution-Augmented Transformer Scaler | Convolution-augmented transformer forecaster + LightGBM pod calculator, full live loop. |
| CloudFormer | (model name) | Dual-branch transformer predicting performance-degradation ratio from black-box host metrics. |
| Informer | (model name) | Efficient long-sequence transformer using ProbSparse attention (O(n log n)). |
| LightGBM | Light Gradient-Boosting Machine | A fast gradient-boosting decision-tree model (not a neural net), strong at learning nonlinear tabular input→output mappings — used as CATScaler’s “how many pods?” calculator. |
| RevIN | Reversible Instance Normalization | Normalize each input window, then exactly reverse it on the output; cheap defense against distribution shift. |
| Time2Vec | Time-to-Vector | Learnable encoding of timestamps as features. |
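The RevIN entry ("normalize each input window, then exactly reverse it on the output") can be sketched without a deep-learning framework. The version below omits the learnable affine parameters that the full method includes, and the stand-in model and all names are hypothetical. It shows the key property: two windows with the same shape but different levels look identical to the model.

```python
import math
from typing import Callable, List, Sequence


def revin_forecast(
    window: Sequence[float],
    model: Callable[[List[float]], List[float]],
    eps: float = 1e-5,
) -> List[float]:
    """Normalize the window, forecast in normalized space, then undo the normalization."""
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    std = math.sqrt(var + eps)                    # eps guards against constant windows
    normalized = [(x - mean) / std for x in window]
    normalized_forecast = model(normalized)
    return [y * std + mean for y in normalized_forecast]  # reverse step


seen: List[List[float]] = []  # records what the model actually receives


def last_value_model(normalized: List[float]) -> List[float]:
    # Hypothetical stand-in forecaster: repeat the last normalized value three times.
    seen.append(list(normalized))
    return [normalized[-1]] * 3


low = revin_forecast([100.0, 110.0, 120.0], last_value_model)
high = revin_forecast([1100.0, 1110.0, 1120.0], last_value_model)

print([round(v, 3) for v in low])   # [120.0, 120.0, 120.0]
print([round(v, 3) for v in high])  # [1120.0, 1120.0, 1120.0]
```

Because the model only ever sees the normalized shape, a shift in the overall level of the workload does not push its inputs out of the range it was trained on.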
Domain & infrastructure terms
| Acronym | Stands for | Meaning |
|---|---|---|
| JAR | Job Arrival Rate | Incoming jobs/requests per unit time. |
| HPA | Horizontal Pod Autoscaler | Kubernetes’ built-in autoscaler that adds/removes pods. |
| KEDA | Kubernetes Event-Driven Autoscaling | Extends HPA to scale on custom/external metrics. |
| RPS / QPS | Requests / Queries Per Second | Traffic-rate inputs. |
| SLA / QoS | Service-Level Agreement / Quality of Service | The performance guarantees autoscaling tries not to violate. |
| RRS / CPT | Reactive Rescaling System / Cool-down Period Time | Anti-thrashing guards in the Plan stage. |
- Kumar, B., Verma, A., Verma, P., & Bennour, A. (2025). Optimizing resource allocation in cloud-native applications through proactive autoscaling with the InformerAutoScale model. The Journal of Supercomputing, 81(9). 10.1007/s11227-025-07500-7
- Ding, Z., Feng, B., & Yu, W. (2025). FELT: Large-Scale Cloud Workload Prediction Through Adaptive Feature-Enhanced and Similarity-Aware Transformer. Tsinghua Science and Technology. 10.26599/tst.2025.9010102
- Ye, H., Chen, J., Jiang, F., He, X., Zhang, T., Chen, J., & Gao, X. (2025). Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services. Proceedings of the VLDB Endowment, 18(11), 3812–3825. 10.14778/3749646.3749656
- Arbat, S., Jayakumar, V. K., Lee, J., Wang, W., & Kim, I. K. (2022). Wasserstein Adversarial Transformer for Cloud Workload Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 36(11), 12433–12439. 10.1609/aaai.v36i11.21509
- G, C. N., O, B. C., R, J. K., Raj B N, P., Naik, P. N., & G, M. B. (2026). Transformer-Based Workload Prediction and Adaptive Auto-Scaling in Cloud Data Centers. 2026 IEEE International Conference for Convergence in Computing Technology (I3CTCON), 1–6. 10.1109/i3ctcon68242.2026.11507247
- Shrestha, R., & Tuz Sabiha, F. (2025). Enhancing Cloud Resource Utilization with Predictive Autoscaling Using Transformer Models. 2025 9th International Conference on Cloud and Big Data Computing (ICCBDC), 24–29. 10.1109/iccbdc67784.2025.00011
- Shim, S., Dhokariya, A., Doshi, D., Upadhye, S., Patwari, V., & Park, J.-Y. (2023). Predictive Auto-scaler for Kubernetes Cloud. 2023 IEEE International Systems Conference (SysCon), 1–8. 10.1109/syscon53073.2023.10131106
- Meng, F., Dai, H., Cong, G., Zhu, B., & Zhao, H. (2025). CATScaler: A Convolution-Augmented Transformer Scaling Framework for Cloud-Native Applications. IEEE Transactions on Services Computing, 18(5), 2659–2672. 10.1109/tsc.2025.3592383
- Shahbazinia, A., Huang, D., Costero, L., & Atienza, D. (2025). CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload. arXiv. 10.48550/ARXIV.2509.03394