Papers Comparison - Luca's Research @ PEG

Common backbone: the MAPE-K loop

MAPE-K stands for Monitor, Analyze, Plan, Execute over shared Knowledge. Almost every paper in this review innovates in Analyze (a better forecaster) and reuses or hand-waves Plan and Execute. A minority build the whole loop.
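
To make the four stages concrete, here is a minimal sketch of one loop iteration. It is not taken from any of the reviewed papers: the names `Knowledge`, `mape_k_step` and `capacity_per_replica` are hypothetical, the forecaster is a naive last-value stand-in for a transformer, and Execute only records the decision instead of calling an orchestrator.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Knowledge:
    """Shared state (the K in MAPE-K): metric history and current replica count."""
    history: List[float] = field(default_factory=list)
    replicas: int = 1


def mape_k_step(
    knowledge: Knowledge,
    monitor: Callable[[], float],
    analyze: Callable[[List[float]], float],
    capacity_per_replica: float,
) -> int:
    """Run one Monitor -> Analyze -> Plan -> Execute iteration."""
    # Monitor: append the newest observation to the shared history.
    knowledge.history.append(monitor())
    # Analyze: forecast the next-interval load from the history.
    predicted_load = analyze(knowledge.history)
    # Plan: translate the forecast into a replica count (at least one).
    target = max(1, math.ceil(predicted_load / capacity_per_replica))
    # Execute: only record the decision here (assumption: no real orchestrator).
    knowledge.replicas = target
    return target


# Toy run: a naive "last value" forecaster stands in for the transformer.
observations = iter([120.0, 260.0, 90.0])
state = Knowledge()
decisions = [
    mape_k_step(state, lambda: next(observations), lambda h: h[-1], capacity_per_replica=100.0)
    for _ in range(3)
]
print(decisions)  # [2, 3, 1]
```

In this framing, a pure forecaster replaces only the `analyze` callable, while an end-to-end framework also owns the Plan and Execute lines.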

Taxonomy

Pure forecasters

Analyze only; hand off to an external scaler.

Contribution is forecast accuracy/efficiency, not actuation.

DAF (probabilistic job-arrival)
FELT (one model, all containers)
Fremer (frequency-domain)
WGAN-gp (adversarial; thin “1 job = 1 VM” scaler).

The implicit assumption is that “a standard HPA does the rest”.

End-to-end frameworks

Implement the full MAPE-K loop.

MV-Transformer (scales VMs, live)
AdaptiveAutoScaling (cost-optimizing controller, simulator)
PredictiveAutoscaling (feeds KEDA, live, scales ingress)
PredictiveK8s (Informer + simple rule, log-replay)
CATScaler (+ LightGBM scaler, live pods)
InformerAutoScale (Informer + own manager, live).

These papers carry all four MAPE steps and loop every 30–60 s.

Performance predictors

Predict slowdown, not workload.

CloudFormer: VM degradation ratio from black-box host metrics; feeds placement/migration/scaling.

Papers notable for a special technique

In these papers, the technique is often the whole reason to read the paper.

Technique	Paper(s)	What it provides
Diffusion (generative, probabilistic)	DAF	Outputs a confidence band, not a point forecast; tune for SLA-safety vs. cost.
Adversarial / GAN (WGAN-gp)	WGAN-gp	A critic pushes forecasts to look realistic; better on bursty traffic, ~5x faster than LSTM.
Frequency-domain (FFT)	Fremer	Reads repeating cycles instead of the raw curve; ~12x smaller, ~3x faster, handles multi-period workloads.
Convolution-augmented	CATScaler (and CloudFormer’s system branch)	Convolution catches short local spikes the attention smooths over.
Similarity-aware (shared model)	FELT	One model serves thousands of containers by grouping look-alikes via attention masks.
Efficient long-sequence attention (Informer / ProbSparse)	PredictiveK8s, InformerAutoScale	Cuts attention cost from O(n²) to ~O(n log n) for long inputs and multi-step horizons.
Dual-branch (time + cross-metric)	CloudFormer	One branch reads time, one reads the metrics; generalizes to unseen apps.
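
The frequency-domain idea in the table can be illustrated with a plain FFT. This is only a sketch of "reading repeating cycles instead of the raw curve", not Fremer's architecture; the function name `dominant_periods` and the synthetic workload are assumptions made for the example.

```python
from typing import List

import numpy as np


def dominant_periods(series: np.ndarray, top_k: int = 2) -> List[int]:
    """Return the top_k strongest periods (in samples) found in the series."""
    centered = series - series.mean()          # remove the constant offset
    spectrum = np.abs(np.fft.rfft(centered))   # magnitude per frequency bin
    # Skip bin 0 (zero frequency), then rank the remaining bins by magnitude.
    bins = np.argsort(spectrum[1:])[::-1][:top_k] + 1
    n = len(series)
    return [int(round(n / k)) for k in bins]   # bin k corresponds to period n / k


# Synthetic workload: a daily cycle (288 samples) plus a weaker 24-sample cycle,
# observed over a 5-day lookback.
t = np.arange(288 * 5)
workload = 10 + 5 * np.sin(2 * np.pi * t / 288) + 2 * np.sin(2 * np.pi * t / 24)
periods = dominant_periods(workload)
print(periods)  # [288, 24]
```

A few spectral peaks summarize a multi-period workload far more compactly than the 1,440 raw samples, which is the intuition behind forecasting in the frequency domain.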

Comparison table

Paper (short name)	Model / transformer variant	Inputs	Outputs	Forecast horizon	MAPE-K stage(s)	Proactive/Reactive · Horizontal/Vertical (H/V)	Year
MV-Transformer	Multivariate encoder-decoder transformer + Pearson feature selection	Multivariate metrics (CPU, mem, disk, net); target = network throughput	Next-interval throughput → VM count → scale action	1 step (next interval)	Full MAPE	Proactive · Horizontal	2025
AdaptiveAutoScaling	Encoder-only transformer (4 layers, 8 heads) + cost optimizer	Multivariate (CPU, mem, req rate, queue len) @ 5s	Multi-step forecast → VM count → signed scale action	Several steps (H)	Full MAPE (in simulator)	Proactive · Horizontal	2026
PredictiveAutoscaling	Informer	Ingress request rate (RPS)	1-min-ahead RPS → KEDA scale_value → pods	1 minute	Full MAPE (live, via KEDA)	Proactive · Horizontal	2025
PredictiveK8s	Informer	Univariate HTTP requests/min, 10-min lookback	Next-min request count → pod count → scale cmd	1 minute	Full MAPE (log-replay)	Proactive · Horizontal	2023
DAF	Autoformer encoder + diffusion decoder + exogenous attention	Univariate job-arrival rate (96 steps) + hour/day context	Probabilistic JAR forecast + confidence band	24 steps (~2 h)	Analyze only	Proactive · Horizontal (HPA target)	2025
FELT	Encoder-only, similarity-aware transformer (2 variants)	6 engineered features/container (CPU); similarity matrices	5-level workload class for every container	1 step	Analyze only	Proactive · H or V (agnostic)	2025
Fremer	Encoder-only frequency-domain transformer (complex attention)	Univariate CPU/QPS per instance, 5-day lookback	Forecast spectrum → next-day workload curve	1 day (288 steps)	Analyze only	Proactive · Horizontal	2025
WGAN-gp	Encoder-decoder transformer as GAN generator + MLP critic	Univariate job-arrival rate, lookback n	1-step JAR forecast (= target VM count)	1 step (5–60 min)	Analyze only	Proactive · Horizontal	2022
CATScaler	Convolution-augmented transformer + MLP decoder; LightGBM scaler	Multivariate per-API (CPU, mem, RPS, specs)	144-step forecast → pod counts → scale action	144 steps (hours)	Full MAPE (live)	Proactive · Horizontal	2025
InformerAutoScale	Informer	Multivariate (req rate, CPU, mem), 10-step lookback	Next-min demand → pod count → scale action	1 step (multi-step capable)	Full MAPE (live)	Proactive · Horizontal	2025
CloudFormer	Dual-branch encoder-only transformer (temporal + system)	206 black-box host metrics @ 1s (target + neighbors)	Performance-degradation ratio P ∈ (0,1]	Degradation of observed run (not time-ahead)	Analyze only	Proactive* · Scaler-agnostic (H or V)	2026

* CloudFormer predicts the degradation of an ongoing run (to pre-empt a QoS violation), not a fixed time-ahead forecast.

Model Training

Every paper in this review trains its model from scratch on cloud workload/trace data. None uses a pretrained time-series foundation model: no borrowed backbone, no fine-tuning, no zero-shot transfer. The contribution is always a task-specific architecture fit directly to the traces.

Paper	Optimizer (LR)	Loss	Epochs	Train/val/test split
MV-Transformer	not stated	MSE/MAE/RMSE/MAPE	10 / 50 / 100	80 / 20
AdaptiveAutoScaling	Adam (1e-4)	multi-step MSE	≤100, early stop	70 / 15 / 15
PredictiveAutoscaling	Darts defaults (unstated)	unstated	200	70 / 30
PredictiveK8s	not stated	MSE + early stop	10 (NASA) / 4 (FIFA)	train + last-16-days test
DAF	AdamW (1e-4)	denoising + MSE recon	early stop	train + val split
FELT	not stated	(classification)	not stated	7 : 1 : 2
Fremer	Adam (1e-3)	MSE	not stated	8 : 2
WGAN-gp	MADGRAD (1e-3)	MAE + WGAN-gp critic	1000	60 / 20 / 20
CATScaler	Adam	SmoothL1 (Alibaba) / MSE (Huawei)	early stop (5)	80 / 10 / 10
InformerAutoScale	Adam (1e-3)	MSE/MAE/RMSE	10 / 20 / 50	not stated
CloudFormer	Adam (1e-5)	log-cosh	not stated	7 train / 4 unseen apps

Two “false friends” worth flagging — wording that looks like pretraining but isn’t:

Fremer uses the words “zero-shot” and “foundational backbone”, but these describe Fremer itself being trained once and transferring across datasets (via channel independence), and its potential as a future shared backbone — not the use of any external pretrained model.
DAF calls itself a “hybrid” model, but “hybrid” means combining decomposition + diffusion + exogenous attention — not mixing pretrained and from-scratch training.

Other notables:

WGAN-gp is the odd one out: trained adversarially, per workload (it does not generalize across traces), and with MADGRAD rather than Adam — an ablation shows that optimizer choice is the key accuracy driver.
The closest thing to “train once, then serve” is the deploy pattern in CATScaler and PredictiveAutoscaling: train offline, serve, then periodically retrain (e.g. a Kubernetes CronJob) to fight distribution shift — still from-scratch retraining, not foundation-model fine-tuning.
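
For readers unfamiliar with the adversarial training that sets WGAN-gp apart, the critic objective in the standard WGAN-GP formulation is shown below. The symbols are assumptions chosen for this note ($D$ for the critic, $y$ for a real workload window, $\hat{y}$ for the generator's forecast, $\lambda$ for the penalty weight); the paper's exact notation and weighting may differ.

$$
L_{\text{critic}} = \mathbb{E}\big[D(\hat{y})\big] - \mathbb{E}\big[D(y)\big] + \lambda \, \mathbb{E}\Big[\big(\lVert \nabla_{\tilde{y}} D(\tilde{y}) \rVert_2 - 1\big)^2\Big],
\qquad \tilde{y} = \epsilon \, y + (1 - \epsilon) \, \hat{y}, \quad \epsilon \sim U[0, 1]
$$

The first two terms estimate the Wasserstein distance between real and forecast windows; the third is the gradient penalty that pushes the critic toward being 1-Lipschitz. The generator is then trained to minimize $-\mathbb{E}[D(\hat{y})]$, which per the table above is combined with an MAE term.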

MAPE Stage-by-stage

Monitor

No paper innovates in Monitor, but the richness of the inputs varies. The table below lists the metrics each paper actually feeds the model and where they come from.

Paper	Input metrics	Uni/Multi	Source
MV-Transformer	CPU, memory, disk R/W, network throughput (Pearson selects throughput↔memory; predicts throughput)	Multi	Trace files (Bitbrains, Google, Azure Functions)
AdaptiveAutoScaling	CPU, memory, request arrival rate, queue length	Multi	Google Cluster trace, in simulator
CATScaler	CPU, memory, RPS per API + machine specs	Multi	Prometheus; Alibaba & Huawei traces
InformerAutoScale	CPU, memory, request rate (→ one aggregated workload target)	Multi	Metrics Server + cAdvisor (Prometheus secondary)
CloudFormer	206 host metrics = 103 × (target VM + neighbors); each 103 = 53 VM (libvirt) + 38 Linux `perf` counters + 12 Intel Top-Down	Multi	libvirt API, Linux perf, Intel Top-Down
PredictiveAutoscaling	Ingress request rate (RPS) — CPU/mem/latency tracked but not model inputs	Uni	Prometheus + Grafana (live, via KEDA)
PredictiveK8s	HTTP requests/min	Uni	Prometheus architecturally; experiments on offline NASA/FIFA logs
DAF	Job-arrival rate + hour/day context	Uni	Traces (Google, Azure, Alibaba, Facebook, Wikipedia)
WGAN-gp	Job-arrival rate	Uni	Traces (Facebook, Alibaba, Google, Wiki, Azure)
Fremer	CPU or QPS per instance (CPU for IaaS/PaaS, QPS for FaaS/RDS)	Uni per-series	ByteDance + Materna traces
FELT	CPU only, compressed into 6 features/container: min, Q1, median, Q3, max + 1 waveform class	Uni (per-source)	Alibaba microservices, Fisher
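
The five statistical features listed for FELT (min, Q1, median, Q3, max) are a five-number summary of a CPU window. The sketch below shows one way to compute them with NumPy; the function name and the toy window are assumptions, the quantile interpolation is NumPy's default rather than anything stated in the paper, and the sixth feature (the waveform class) is deliberately left out because its definition is not given here.

```python
import numpy as np


def five_number_summary(cpu: np.ndarray) -> dict:
    """Compress a CPU-usage window into min, Q1, median, Q3 and max."""
    q1, median, q3 = np.percentile(cpu, [25, 50, 75])
    return {
        "min": float(cpu.min()),
        "q1": float(q1),
        "median": float(median),
        "q3": float(q3),
        "max": float(cpu.max()),
    }


# Toy CPU window (percent utilization), purely illustrative.
cpu_window = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
features = five_number_summary(cpu_window)
print(features)
```

Compressing each container to a handful of numbers is what lets a single shared model cover thousands of containers cheaply.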

Takeaways:

Prometheus recurs (PredictiveAutoscaling, CATScaler, and architecturally PredictiveK8s); InformerAutoScale instead reads Metrics Server + cAdvisor.
Most “univariate” papers are workload-only (request/job-arrival rate); the multivariate ones add resource counters (CPU, memory, ...).
CloudFormer is the outlier — black-box host hardware counters (206), not workload at all.

Analyze

Plain transformer, made multivariate/looped: MV-Transformer, AdaptiveAutoScaling.
Efficient long-sequence (Informer): PredictiveK8s, InformerAutoScale.
Generative/probabilistic (diffusion): DAF, the only one that quantifies forecast uncertainty.
Adversarial (GAN): WGAN-gp, the only one with a critic on forecast realism.
Frequency-domain (FFT): Fremer, the only one forecasting a spectrum.
Convolution-augmented: CATScaler, local-spike + global-trend.
Similarity-aware shared model: FELT, one model for thousands.
Performance predictor: CloudFormer, predicts slowdown, not workload.
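
To show what a probabilistic forecast adds over a point forecast, the sketch below turns a set of sampled forecast trajectories into a confidence band. It assumes the forecaster can be sampled repeatedly, as a generative model can; the Gaussian samples are synthetic stand-ins, and `confidence_band` and its quantile defaults are hypothetical choices, not DAF's.

```python
import numpy as np


def confidence_band(samples: np.ndarray, lower_q: float = 0.05, upper_q: float = 0.95):
    """samples has shape (num_samples, horizon); returns (lower, median, upper) per step."""
    lower = np.quantile(samples, lower_q, axis=0)
    median = np.quantile(samples, 0.5, axis=0)
    upper = np.quantile(samples, upper_q, axis=0)
    return lower, median, upper


# Synthetic stand-in for 200 sampled trajectories over a 24-step horizon.
rng = np.random.default_rng(0)
samples = rng.normal(loc=100.0, scale=10.0, size=(200, 24))
lower, median, upper = confidence_band(samples)
print(lower.shape, median.shape, upper.shape)  # (24,) (24,) (24,)
```

Provisioning against `upper` favors SLA-safety, while provisioning against `median` favors cost; moving `upper_q` is the tuning knob between the two.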

Plan

Where approaches diverge most sharply:

Trivial / assumed
- WGAN-gp (1 job = 1 VM)
- PredictiveK8s
- MV-Transformer
- InformerAutoScale (pods = load / capacity)
- Fremer
- DAF (linear, left to HPA).
Designed controllers
- AdaptiveAutoScaling (α/β/γ cost optimization + damping cap)
- CATScaler (LightGBM learns the nonlinear forecast→pod mapping, the most sophisticated Plan)
- PredictiveAutoscaling (RPS → KEDA target).
“Anti-flapping” logic
- Cooldown / stabilization window: MV-Transformer’s CPT (Cool-down Period Time), CATScaler’s cooldown.
- Reactive Rescaling System (RRS): applies a scale-in only if the lower demand persists or passes a check, rather than reacting to a single dip. Used by MV-Transformer and InformerAutoScale (on a 60 s cadence).
- Queue + cooldown: CATScaler runs scale requests through a FIFO queue plus a cooldown so bursts of decisions collapse into one.
- Damping cap: AdaptiveAutoScaling limits how much capacity can change per step, $|c_{t+1} - c_t| \le \delta_{\max}$, so it can’t swing wildly.
Not implemented
- FELT
- CloudFormer
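
The Plan ideas above can be combined in a few lines. The sketch below is a hypothetical planner, not any paper's controller: it uses the simple pods = load / capacity rule, holds a scale-in until the lower demand has persisted for `patience` consecutive steps, and caps the change per step at `max_step` (the damping cap). All names and default values are assumptions, and a time-based cooldown is left out for brevity.

```python
import math
from typing import Tuple


def plan_replicas(
    predicted_load: float,
    capacity_per_pod: float,
    current: int,
    low_streak: int,
    max_step: int = 2,
    patience: int = 3,
) -> Tuple[int, int]:
    """Return (new_replica_count, new_low_streak)."""
    # Trivial mapping: pods = ceil(load / capacity), never below one pod.
    desired = max(1, math.ceil(predicted_load / capacity_per_pod))
    if desired < current:
        # Scale-in guard: count consecutive low forecasts and hold until they persist.
        low_streak += 1
        if low_streak < patience:
            return current, low_streak
    # Either scaling out, holding steady, or the dip has persisted long enough.
    low_streak = 0
    # Damping cap: |c_{t+1} - c_t| <= max_step.
    delta = max(-max_step, min(max_step, desired - current))
    return current + delta, low_streak


# Toy run: one burst followed by three low forecasts.
replicas, streak = 4, 0
trace = []
for load in [950.0, 300.0, 300.0, 300.0]:
    replicas, streak = plan_replicas(load, 100.0, replicas, streak)
    trace.append(replicas)
print(trace)  # [6, 6, 6, 4]
```

The burst raises the count by only two pods (the cap), and the scale-in happens only on the third consecutive low forecast, again limited to two pods.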

Execute

Live K8s/Docker
- PredictiveAutoscaling (KEDA → HPA)
- CATScaler (pod replicas)
- InformerAutoScale & MV-Transformer (Docker Desktop + K8s)
Simulator / log-replay
- AdaptiveAutoScaling (discrete-time sim)
- PredictiveK8s (offline replay)
Real cloud VMs
- WGAN-gp (Google Cloud e2-medium)
No actuation (forecast-only)
- DAF
- FELT
- Fremer (except 24h HPA demo)
- CloudFormer

In one sentence each

MV-Transformer — multivariate forecaster, full loop, scales real VMs.
AdaptiveAutoScaling — forecaster + cost-optimizing controller, in simulation.
PredictiveAutoscaling — scales the ingress controller via KEDA, live.
PredictiveK8s — first to bring Informer to K8s autoscaling.
DAF — probabilistic forecast with a confidence band.
FELT — one tiny model for thousands of containers via similarity-aware attention.
Fremer — frequency-domain, ~12x smaller, multi-period workloads.
WGAN-gp — adversarially-trained, fast, bursty-traffic forecaster.
CATScaler — convolution-augmented forecast + LightGBM pod model, full live loop.
InformerAutoScale — Informer forecaster + own actuator, ~10x fewer pods than reactive.
CloudFormer — predicts slowdown from black-box host metrics, not workload.

Acronyms & short names

Models / methods named in this review

Short name	Stands for	What it is
DAF	Diffusion Autoformer	Autoformer (trend+seasonal decomposition) with a diffusion decoder → probabilistic JAR forecast with a confidence band.
MV-Transformer	Multivariate Transformer	Encoder-decoder transformer fed several metrics at once (CPU, mem, disk, net) to exploit cross-metric correlations.
WGAN-gp	Wasserstein Generative Adversarial Network with gradient penalty	A stable GAN variant: a transformer generator forecasts the series; an MLP critic scores realism via Wasserstein (earth-mover’s) distance, kept 1-Lipschitz by a gradient penalty.
FELT	(model name)	Encoder-only, similarity-aware transformer; one shared model classifies workload for thousands of containers.
Fremer	Frequency transformer	Frequency-domain (FFT) transformer that forecasts a spectrum instead of the raw curve.
CATScaler	Convolution-Augmented Transformer Scaler	Convolution-augmented transformer forecaster + LightGBM pod calculator, full live loop.
CloudFormer	(model name)	Dual-branch transformer predicting performance-degradation ratio from black-box host metrics.
Informer	(model name)	Efficient long-sequence transformer using ProbSparse attention (O(n log n)).
LightGBM	Light Gradient-Boosting Machine	A fast gradient-boosting decision-tree model (not a neural net), strong at learning nonlinear tabular input→output mappings — used as CATScaler’s “how many pods?” calculator.
RevIN	Reversible Instance Normalization	Normalize each input window, then exactly reverse it on the output; cheap defense against distribution shift.
Time2Vec	Time-to-Vector	Learnable encoding of timestamps as features.
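
The RevIN entry above describes a normalize-then-reverse wrapper, sketched below. This simplified version omits the learnable affine parameters of the full method, and the names `revin_forecast` and `last_value` are hypothetical; the forecaster is a trivial stand-in for a trained model.

```python
from typing import Callable

import numpy as np


def revin_forecast(
    window: np.ndarray,
    forecaster: Callable[[np.ndarray], np.ndarray],
    eps: float = 1e-5,
) -> np.ndarray:
    """Normalize one input window, forecast in normalized space, then undo the scaling."""
    mean = window.mean()
    std = window.std() + eps              # eps guards against a constant window
    normalized = (window - mean) / std
    prediction = forecaster(normalized)   # the model only ever sees normalized data
    return prediction * std + mean        # reverse the normalization on the output


# Stand-in forecaster: repeat the last (normalized) value.
def last_value(z: np.ndarray) -> np.ndarray:
    return z[-1:]


window = np.array([1.0, 2.0, 3.0, 4.0])
base = revin_forecast(window, last_value)
shifted = revin_forecast(window + 1000.0, last_value)
print(base, shifted)
```

Because the model sees the same normalized window in both calls, a level shift of 1000 in the input passes straight through to the output, which is the sense in which the wrapper defends against distribution shift.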

Domain & infrastructure terms

Acronym	Stands for	Meaning
JAR	Job Arrival Rate	Incoming jobs/requests per unit time.
HPA	Horizontal Pod Autoscaler	Kubernetes’ built-in autoscaler that adds/removes pods.
KEDA	Kubernetes Event-Driven Autoscaling	Extends HPA to scale on custom/external metrics.
RPS / QPS	Requests / Queries Per Second	Traffic-rate inputs.
SLA / QoS	Service-Level Agreement / Quality of Service	The performance guarantees autoscaling tries not to violate.
RRS / CPT	Reactive Rescaling System / Cool-down Period Time	Anti-thrashing guards in the Plan stage.

References

Kumar, B., Verma, A., Verma, P., & Bennour, A. (2025). Optimizing resource allocation in cloud-native applications through proactive autoscaling with the InformerAutoScale model. The Journal of Supercomputing, 81(9). 10.1007/s11227-025-07500-7
Ding, Z., Feng, B., & Yu, W. (2025). FELT: Large-Scale Cloud Workload Prediction Through Adaptive Feature-Enhanced and Similarity-Aware Transformer. Tsinghua Science and Technology. 10.26599/tst.2025.9010102
Ye, H., Chen, J., Jiang, F., He, X., Zhang, T., Chen, J., & Gao, X. (2025). Fremer: Lightweight and Effective Frequency Transformer for Workload Forecasting in Cloud Services. Proceedings of the VLDB Endowment, 18(11), 3812–3825. 10.14778/3749646.3749656
Arbat, S., Jayakumar, V. K., Lee, J., Wang, W., & Kim, I. K. (2022). Wasserstein Adversarial Transformer for Cloud Workload Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 36(11), 12433–12439. 10.1609/aaai.v36i11.21509
G, C. N., O, B. C., R, J. K., Raj B N, P., Naik, P. N., & G, M. B. (2026). Transformer-Based Workload Prediction and Adaptive Auto-Scaling in Cloud Data Centers. 2026 IEEE International Conference for Convergence in Computing Technology (I3CTCON), 1–6. 10.1109/i3ctcon68242.2026.11507247
Shrestha, R., & Tuz Sabiha, F. (2025). Enhancing Cloud Resource Utilization with Predictive Autoscaling Using Transformer Models. 2025 9th International Conference on Cloud and Big Data Computing (ICCBDC), 24–29. 10.1109/iccbdc67784.2025.00011
Shim, S., Dhokariya, A., Doshi, D., Upadhye, S., Patwari, V., & Park, J.-Y. (2023). Predictive Auto-scaler for Kubernetes Cloud. 2023 IEEE International Systems Conference (SysCon), 1–8. 10.1109/syscon53073.2023.10131106
Meng, F., Dai, H., Cong, G., Zhu, B., & Zhao, H. (2025). CATScaler: A Convolution-Augmented Transformer Scaling Framework for Cloud-Native Applications. IEEE Transactions on Services Computing, 18(5), 2659–2672. 10.1109/tsc.2025.3592383
Shahbazinia, A., Huang, D., Costero, L., & Atienza, D. (2025). CloudFormer: An Attention-based Performance Prediction for Public Clouds with Unknown Workload. arXiv. 10.48550/ARXIV.2509.03394