CATScaler
A convolution-augmented transformer forecaster paired with a LightGBM pod calculator for a full, live autoscaling loop
TL;DR
Cloud apps run inside lots of small “containers”, and the platform that manages them, Kubernetes, normally adds or removes copies only after it notices the app is overloaded. That’s like only starting to cook more food after the restaurant is already full of hungry customers: you’re always late, customers wait, and some leave. CATScaler fixes this by forecasting how busy each service will be over the next few hours and adding/removing container copies ahead of time. It does the forecasting with a neural network called a Transformer that the authors upgraded with a convolution module (good at spotting short, sharp spikes) and a normalization trick called RevIN (keeps predictions accurate even when live traffic looks different from the data the model was trained on). A second model, LightGBM, then translates the forecast into a concrete answer: “you’ll need 7 container copies in 30 minutes.” On real traces from Alibaba Cloud and Huawei Cloud, CATScaler cut the rate of broken service promises (SLA violations) by about 76% and response time by about 52% versus Kubernetes’ default autoscaler.
The Problem (and why simple autoscaling isn’t enough)
Cloud applications must serve wildly varying demand. Allocate too few resources → requests slow down or fail and you violate the SLA (Service Level Agreement, the promise of, say, “responses under 200 ms”). Allocate too many → you pay for idle machines. The job of an autoscaler is to keep this balance automatically.
Kubernetes ships with a built-in autoscaler, the HPA (Horizontal Pod Autoscaler). It works reactively: you set a threshold like “CPU > 60%,” and when the measured CPU crosses it, the HPA starts adding pods (a pod = one running copy of your container). Two problems:
It’s always lagging. It reacts only after load is already high. Worse, spinning up a new pod takes time (the provisioning delay / cold start). So during the gap, real users get slow or failed responses. The paper measured HPA latency peaking at 650 ms while load ramped up.
Workloads are bursty and shift over time. Real API traffic spikes suddenly and its statistical character drifts from day to day, so a model trained on old data can mispredict — a problem called distribution shift.
Existing proactive (predictive) approaches help, but the paper argues they often have one of two weaknesses: (a) forecasts aren’t accurate enough, especially for short sharp spikes, or (b) the logic that turns a forecast into a pod count is too simplistic — e.g. assuming load and pod-count scale linearly when the real relationship is nonlinear.
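For reference, the reactive rule described above can be sketched in a few lines. The formula and the default tolerance of 0.1 follow the Kubernetes HPA documentation; the function name and example numbers are illustrative. The rule uses only the current measurement and is linear in the metric ratio, which is the behavior the paper argues against.

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, tolerance: float = 0.1) -> int:
    """Proportional rule documented for the Kubernetes HPA."""
    ratio = current_metric / target_metric
    # The HPA skips scaling when the ratio is within the tolerance of 1.0.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)

# 4 pods averaging 90% CPU against a 60% target.
scaled_out = hpa_desired_replicas(4, 90.0, 60.0)
print(scaled_out)  # 6
```

Because the input is the utilization measured right now, the extra pods are requested only after the overload exists, and they still have to start before they help.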
Background
Container / Pod: a lightweight, isolated copy of your app. Scaling “horizontally” = changing how many copies run.
Horizontal scaling (scale out/in): add/remove whole pod copies. Vertical scaling (scale up/down): give one pod more/less CPU/RAM. CATScaler does horizontal scaling (vertical is mentioned as future work).
Serverless / cloud-native: an app architecture of many small services that the platform scales automatically; the developer doesn’t manage servers directly.
Time-series forecasting: predicting future values (e.g. next 144 CPU readings) from a window of recent past values.
Transformer: a neural network built on self-attention — a mechanism that lets every time step “look at” every other time step to learn long-range patterns (e.g. “traffic always rises around 9 a.m.”). Great at global patterns, weaker at tiny local wiggles.
Convolution (CNN): a sliding filter that focuses on local neighborhoods — excellent at catching short, sudden spikes the Transformer might smooth over.
RevIN (Reversible Instance Normalization): normalize each input window (subtract its own mean, divide by its own std) before the model, then reverse that exactly on the output. Cheap insurance against distribution shift.
LightGBM: a fast gradient-boosting decision-tree model (not a neural net) that’s strong at learning complex nonlinear input→output mappings from tabular features.
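The RevIN entry above can be made concrete with a minimal sketch. It keeps only the normalize/reverse pair on a single univariate window; the learnable scale and shift parameters of the original RevIN method are left out, and the `EPS` constant and function names are assumptions for illustration.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # assumed small constant that avoids division by zero

def revin_normalize(window):
    """Normalize one window with its own mean and standard deviation."""
    mean = fmean(window)
    std = pstdev(window) + EPS
    normed = [(x - mean) / std for x in window]
    return normed, (mean, std)

def revin_denormalize(values, stats):
    """Exactly reverse revin_normalize using the saved statistics."""
    mean, std = stats
    return [v * std + mean for v in values]

quiet_day = [10.0, 12.0, 11.0, 13.0]
busy_day = [70.0, 72.0, 71.0, 73.0]  # same shape, much higher level

n_quiet, s_quiet = revin_normalize(quiet_day)
n_busy, s_busy = revin_normalize(busy_day)
restored = revin_denormalize(n_busy, s_busy)
```

After normalization the two windows look the same to the model, even though their raw levels differ by 60 points. The saved statistics then put a prediction back on the original scale.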
Contribution in Simple Terms
CATScaler is a complete proactive autoscaling framework built on two tightly integrated ideas:
A better workload forecaster. A Transformer is augmented with a convolution module so it captures both the big-picture trend (Transformer’s strength) and sudden local fluctuations (convolution’s strength), plus RevIN to stay robust when live data drifts from training data. Many prior works used only one of these tricks; CATScaler unifies them and adds a speed optimization (replacing the heavy Transformer decoder with a lightweight MLP).
A smarter “how many pods?” calculator. Instead of a naive linear formula, it uses LightGBM to learn the nonlinear relationship between predicted workload, history, and the target utilization, outputting the exact pod count needed.
The result is an end-to-end system: monitor → forecast → decide pod count → actually scale Kubernetes pods, all before demand hits.
How It Works, Step by Step
Forecasting module (the Convolution-Augmented Transformer):
Collect metrics. Prometheus scrapes per-API time series (CPU %, memory %, requests/sec, machine specs) at fixed intervals. A sliding window of the last T time steps is the model’s input (data split 80% train / 10% validation / 10% test in time order; categorical fields are numerically encoded; Z-score standardization applied for training).
RevIN normalize. Each input window is normalized using its own statistics, so the model isn’t thrown off by drift between training and live data. (This step is exactly reversed at the very end.)
Decompose into trend + residual. A moving-average kernel splits the normalized series into a smooth trend component (overall direction, up or down) and a residual component (the detailed local wiggles).
Model the trend cheaply. The trend is smooth and stable, with little fine-grained structure left to learn, so a simple linear layer fits it.
Model the residual richly. The residual (nonlinear, multi-feature interactions) goes through the convolution-augmented Transformer encoder. In parallel branches, the residual is passed through (a) pointwise + depthwise convolution for local features and (b) self-attention for global features; each uses a ResNet-style residual/skip connection (output = input + transform) to ease training and avoid vanishing gradients. The two are summed into a hidden representation.
Lightweight decoding. The usual Transformer decoder is replaced by an MLP (multi-layer perceptron), cutting the decoding cost from quadratic O(n²) to linear O(n) in the sequence length. The encoder’s self-attention branch is still quadratic, but the cheaper decoder gives faster inference and lower latency, which matters for real-time scaling.
Recombine + de-normalize. Trend output and residual output are merged via a learnable weight matrix W, then RevIN’s normalization is reversed to produce the final forecast: the next P workload values (P = 144 chosen).
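The decomposition and the two parallel encoder branches from the steps above can be sketched with NumPy. This is an untrained toy forward pass, not the authors’ implementation. The kernel widths, the single attention head, the random weights, and all function names are assumptions. Layer normalization, the linear trend head, and the MLP decoder are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def moving_average_trend(x, kernel=5):
    """Smooth each feature over time; edges are padded by repeating end values."""
    pad_l, pad_r = (kernel - 1) // 2, kernel // 2
    padded = np.pad(x, ((pad_l, pad_r), (0, 0)), mode="edge")
    return np.stack([padded[i:i + kernel].mean(axis=0) for i in range(len(x))])

def depthwise_conv(x, filters):
    """Local branch: each feature is filtered along time by its own odd-width kernel."""
    k = filters.shape[0]
    padded = np.pad(x, ((k // 2, k // 2), (0, 0)))  # zero padding keeps length T
    return np.stack([(padded[i:i + k] * filters).sum(axis=0) for i in range(len(x))])

def self_attention(x, wq, wk, wv):
    """Global branch: single-head scaled dot-product attention over time steps."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

T, D = 48, 3                      # lookback steps, features (e.g. CPU, memory, RPS)
x = rng.normal(size=(T, D))       # stands in for one RevIN-normalized window

trend = moving_average_trend(x)
residual = x - trend

w_point = rng.normal(size=(D, D)) * 0.1   # pointwise (1x1) conv mixes features
filters = rng.normal(size=(3, D)) * 0.1   # depthwise kernel of width 3
wq, wk, wv = (rng.normal(size=(D, D)) * 0.1 for _ in range(3))

# Each branch adds its transform to its input (skip connection), then the two are summed.
local_branch = residual + depthwise_conv(residual @ w_point, filters)
global_branch = residual + self_attention(residual, wq, wk, wv)
hidden = local_branch + global_branch
print(hidden.shape)  # (48, 3)
```

The convolution only ever sees three neighboring time steps, while every row of the attention weights spans all 48 steps. That is the local/global split the design relies on.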
Scaling module (the Proactive Scaler):
Compute expected pod count with LightGBM. Four inputs are fed in: the historical workload, the predicted future workload, the historical pod count, and the target utilization (CPU/memory). Feature engineering adds periodicity cues (hour, week), totals (CPUtotal, MEMtotal), and an interaction feature (CPU × memory) to expose nonlinear structure. LightGBM regresses these into the expected number of pods for each future step.
Smooth the decisions. Predicted counts are clamped to a configured min/max range and pushed into a FIFO queue. Two guards prevent thrashing: a cooling phase (no scaling for a set time after an action, to ignore momentary blips) and a temporal interval judgment (avoid scaling too early or too late).
Execute the scaling. If the predicted count differs from the live count and the system isn’t cooling down, CATScaler adjusts the Kubernetes pod replica count up or down automatically. The whole service is wrapped via Flask, and a Kubernetes CronJob periodically retrains/updates the model so it tracks changing workloads.
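The smoothing and execution steps above can be sketched as a small planner. The class name, the rounding rule, and the choice to drop a queued target that arrives during cooldown are assumptions. The temporal interval judgment is left out because the summary does not specify it.

```python
from collections import deque

class ScalingPlanner:
    """Clamp predicted pod counts, queue them FIFO, and respect a cooldown."""

    def __init__(self, min_pods, max_pods, cooldown_s):
        self.min_pods, self.max_pods = min_pods, max_pods
        self.cooldown_s = cooldown_s
        self.queue = deque()
        self.last_action_at = None

    def push(self, predicted_counts):
        """Clamp each prediction to [min_pods, max_pods] and enqueue it."""
        for count in predicted_counts:
            clamped = max(self.min_pods, min(self.max_pods, round(count)))
            self.queue.append(clamped)

    def decide(self, now_s, live_pods):
        """Return the new replica count, or None when no action is taken."""
        if not self.queue:
            return None
        target = self.queue.popleft()
        cooling = (self.last_action_at is not None
                   and now_s - self.last_action_at < self.cooldown_s)
        if cooling or target == live_pods:
            return None
        self.last_action_at = now_s
        return target

planner = ScalingPlanner(min_pods=2, max_pods=10, cooldown_s=120)
planner.push([1.2, 6.6, 14.0])          # raw regressor outputs, made up
first = planner.decide(now_s=0, live_pods=3)     # 2: clamped up to the minimum
second = planner.decide(now_s=60, live_pods=2)   # None: still cooling down
third = planner.decide(now_s=180, live_pods=2)   # 10: clamped down to the maximum
```

In a real deployment the returned count would be applied to the Deployment’s replica field through the Kubernetes API, a step this sketch deliberately leaves out.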
Inputs (what it consumes)
Per-API multivariate time series from Prometheus: CPU utilization, memory utilization, requests-per-second, plus machine/function specs.
Lookback window of T past steps (T chosen from {48, 96, 144, 196, 240}); data sampled at 30-second (Alibaba) / 1-minute (deployment) intervals.
For the scaler/LightGBM specifically: historical workload sequence, predicted workload sequence, historical pod count, and the target utilization thresholds (CPUexp, Memexp — set to 60% in experiments), plus engineered time/interaction features.
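The scaler inputs listed above can be assembled into one tabular row per decision, as in the hypothetical helper below. The feature names, the use of day-of-week for the “week” cue, and the definition of the totals as per-pod utilization times pod count are assumptions, since the summary does not define them. Fitting the LightGBM regressor on such rows is not shown.

```python
from datetime import datetime

def build_features(ts, cpu_hist, mem_hist, cpu_pred, mem_pred, pods_hist,
                   cpu_target=60.0, mem_target=60.0):
    """Assemble one feature row for a pod-count regressor (illustrative only)."""
    pods_now = pods_hist[-1]
    return {
        "hour": ts.hour,                        # daily periodicity cue
        "weekday": ts.weekday(),                # weekly cue, 0 = Monday
        "cpu_last": cpu_hist[-1],
        "mem_last": mem_hist[-1],
        "cpu_pred": cpu_pred[0],                # next-step forecast
        "mem_pred": mem_pred[0],
        "pods_last": pods_now,
        "cpu_total": cpu_hist[-1] * pods_now,   # assumed definition of CPUtotal
        "mem_total": mem_hist[-1] * pods_now,   # assumed definition of MEMtotal
        "cpu_x_mem": cpu_pred[0] * mem_pred[0], # interaction feature
        "cpu_target": cpu_target,
        "mem_target": mem_target,
    }

row = build_features(
    ts=datetime(2023, 5, 15, 9, 30),
    cpu_hist=[40.0, 55.0], mem_hist=[30.0, 35.0],
    cpu_pred=[70.0, 80.0], mem_pred=[40.0, 45.0],
    pods_hist=[3, 4],
)
print(row["cpu_total"], row["cpu_x_mem"])  # 220.0 2800.0
```

A tree model can split on the product and the time cues directly, which is how a nonlinear load-to-pods relationship becomes learnable from a flat table.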
Outputs (what it produces)
Forecasting module: a multi-step forecast of future workload (e.g. CPU%, memory%) for the next P = 144 time steps (multivariate, multi-step).
Scaling module: a sequence of expected pod counts [R(t+1), …, R(t+n)] that keep utilization at the target, turned into concrete scale-out / scale-in actions on Kubernetes pods.
Horizon: up to 144 steps ahead, which is 2.4 hours at 1-minute sampling or 72 minutes at 30-second sampling, enabling capacity changes before demand arrives.
How It Fits the Autoscaling Framework (MAPE-K)
CATScaler is a full proactive, horizontal autoscaler — it covers the entire MAPE-K control loop, not just the prediction part:
Monitor: Prometheus gathers metrics (Grafana visualizes, Alibaba SLS logs decisions).
Analyze: the convolution-augmented Transformer forecasts future workload — this is where the transformer lives, as a time-series workload forecaster.
Plan: LightGBM converts the forecast into an exact pod count; the FIFO queue, cooldown, and interval checks turn that into stable scaling decisions.
Execute: CATScaler itself changes the Kubernetes pod replica count — it does not merely hand a number to the default HPA; it replaces the reactive HPA with its own proactive actuation.
Scaling type: horizontal (pod replicas). Vertical/hybrid scaling is explicitly left to future work.
Reactive vs proactive: firmly proactive — it scales ahead of demand, hiding the provisioning delay, whereas the default HPA only reacts after a threshold is crossed.
Concrete analogy: an online store the night before a big sale. The default HPA waits until checkout queues are already overflowing, then scrambles to open more registers (too late). CATScaler reads the trend, predicts the morning rush, and opens exactly the right number of registers ahead of time — then closes them as the rush fades, so it isn’t paying idle cashiers.
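One pass through the loop described in this section can be sketched as a function over four pluggable stages. The stage signatures and the toy lambdas are assumptions for illustration. In CATScaler the stages would be Prometheus, the forecaster, the LightGBM planner, and the Kubernetes actuation.

```python
import math
from typing import Callable, Optional, Sequence

def mape_k_step(monitor: Callable[[], Sequence[float]],
                forecast: Callable[[Sequence[float]], Sequence[float]],
                plan: Callable[[Sequence[float], Sequence[float]], Optional[int]],
                execute: Callable[[int], None]) -> Optional[int]:
    """Run Monitor, Analyze, Plan, Execute once and return the applied count."""
    history = monitor()                   # Monitor: recent metrics
    predicted = forecast(history)         # Analyze: future workload
    replicas = plan(history, predicted)   # Plan: pod count, or None for no change
    if replicas is not None:
        execute(replicas)                 # Execute: apply the replica count
    return replicas

applied = []
result = mape_k_step(
    monitor=lambda: [120.0, 150.0, 180.0],             # requests/sec, made up
    forecast=lambda h: [h[-1] + (h[-1] - h[-2])] * 2,  # naive linear extrapolation
    plan=lambda h, p: math.ceil(max(p) / 50.0),        # assumes 50 req/s per pod
    execute=applied.append,
)
print(result, applied)  # 5 [5]
```

The planner here sizes for the forecast 210 requests/sec rather than the measured 180, which is the proactive part: capacity follows the prediction, not the current reading.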
Evaluation (briefly)
Datasets: Huawei Cloud Private Functions Trace 2023 (234 days, 8 metrics, 200 functions — a public benchmark) and an Alibaba Cloud production Kubernetes trace (CPU%, memory%, RPS at 30-s intervals, May–June 2023, noisier).
Forecasting metrics: MSE, MAE, RMSE vs baselines Transformer, pCNN-LSTM, CNN-GRU, DLinear. On Huawei, CATScaler hit state-of-the-art, cutting errors by ~50% (and up to 80% MSE at 48 steps) vs the plain Transformer; near-SOTA on the harder Alibaba data.
Ablation: both the convolution-augmented module (CAM) and RevIN contribute. Versus the backbone on Huawei, CAM alone cut MSE/MAE/RMSE by 67.7%/36.4%/42.5%, and the full model cut them by 77.0%/49.7%/52.1%. LightGBM beat XGBoost/linear/random-forest for the pod-count regression (MSE 0.017).
Live cluster (4 servers, ClickHouse data service): vs default HPA, SLA violation rate −76.2%, response time −51.7%, average CPU utilization +17.5%; peak latency cut 5× vs HPA and ~1× vs the proactive baseline K-AGRUED. (The abstract’s headline figures of “1.1× latency” and “3.2× violation” reflect a different comparison framing.)
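The metrics quoted above are simple to compute, and a short sketch shows how the percentage reductions relate to each other. All numbers below are made up. The SLA check follows the glossary definition (response within 200 ms with HTTP 200), and the function names are illustrative.

```python
import math

def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))

def reduction_pct(baseline, ours):
    """Relative improvement over a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline

def sla_violation_rate(responses, limit_ms=200.0):
    """Share of (latency_ms, http_status) pairs that are too slow or not HTTP 200."""
    bad = sum(1 for latency_ms, status in responses
              if latency_ms > limit_ms or status != 200)
    return bad / len(responses)

actual = [50.0, 60.0, 70.0, 80.0]
baseline_pred = [54.0, 56.0, 74.0, 76.0]   # hypothetical baseline forecast
improved_pred = [52.0, 58.0, 72.0, 78.0]   # hypothetical improved forecast

mse_cut = reduction_pct(mse(actual, baseline_pred), mse(actual, improved_pred))
rmse_cut = reduction_pct(rmse(actual, baseline_pred), rmse(actual, improved_pred))
rate = sla_violation_rate([(120.0, 200), (250.0, 200), (90.0, 500), (180.0, 200)])
print(mse_cut, rmse_cut, rate)  # 75.0 50.0 0.5
```

Because RMSE is the square root of MSE, the same improvement shows up as a larger percentage cut in MSE than in RMSE, which is why the three figures in each ablation triple differ.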
Training & pre-training
Trained from scratch — no pretrained or foundation model.
Both halves of CATScaler learn from random initialization on the cloud workload traces themselves (the Alibaba Cloud production trace and the Huawei Cloud Private Functions Trace 2023). There is no pretrained backbone, no time-series foundation model, no fine-tuning, and no zero-shot use — the convolution-augmented Transformer forecaster and the LightGBM pod-count regressor are each fit directly on the data described in Evaluation above.
Setup: Python 3.10.8 + PyTorch on an RTX 4090; Adam optimizer; early stopping (halt when validation loss doesn’t improve for 5 consecutive epochs); hyperparameters chosen by grid search.
Data handling: sliding-window inputs with a chronological 80/10/10 train/val/test split per service (see step 1); categorical fields encoded and Z-score standardized. The loss is dataset-dependent — SmoothL1 for the noisier Alibaba trace, MSE for the cleaner Huawei trace.
In deployment: a Kubernetes CronJob periodically retrains the model to fight distribution shift — but this is still from-scratch retraining on fresh data, not foundation-model fine-tuning.
(For the record, the only “Foundation” mentions in the paper refer to funding agencies, not foundation models.)
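The data handling and stopping rule described in this section can be sketched as follows. The series is split in time order before windowing so that no window straddles a split boundary. The helper names, the toy series, and the loss values are assumptions, and the model training itself is not shown.

```python
def chronological_split(series, train=0.8, val=0.1):
    """Split in time order: earliest part trains, latest part tests."""
    n = len(series)
    a, b = int(n * train), int(n * (train + val))
    return series[:a], series[a:b], series[b:]

def sliding_windows(series, lookback, horizon):
    """Pair each lookback window with the horizon values that follow it."""
    return [(series[i:i + lookback], series[i + lookback:i + lookback + horizon])
            for i in range(len(series) - lookback - horizon + 1)]

class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

series = [float(i) for i in range(100)]          # stand-in for one metric trace
train, val, test = chronological_split(series)
train_windows = sliding_windows(train, lookback=8, horizon=2)

stopper = EarlyStopping(patience=5)
val_losses = [1.0, 0.8, 0.81, 0.82, 0.83, 0.84, 0.85]   # made-up curve
stopped_at = next(e for e, loss in enumerate(val_losses) if stopper.step(loss))
print(len(train), len(val), len(test), len(train_windows), stopped_at)  # 80 10 10 71 6
```

A random shuffle would leak future traffic into training, so the time-ordered split is what makes the reported test errors a fair proxy for live forecasting.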
Strengths
End-to-end and deployable: real Prometheus/Grafana/Flask/Kubernetes stack, not just an offline forecasting study; it actually actuates scaling.
Best-of-both forecasting: convolution (local spikes) + attention (global trend) + trend/residual decomposition + RevIN (distribution-shift robustness) — measurable accuracy gains.
Nonlinear, learned pod-count mapping via LightGBM, instead of a naive linear formula.
Latency-conscious design: MLP replaces the Transformer decoder (O(n²)→O(n)); cooldown/FIFO logic avoids scaling thrashing.
Strong empirical results on two real cloud traces and in a live cluster.
Limitations
Cold start for brand-new services: no history means nothing to forecast from (acknowledged as future work).
Rare, never-before-seen traffic surges are hard to predict — the model learns from patterns it has seen.
Horizontal scaling only so far; no vertical or hybrid scaling yet.
Tested on a small cluster (4 servers); behavior at large scale is unverified (authors plan larger-cluster evaluation).
Periodic retraining required (via CronJob), adding operational overhead and a dependency on continued metric quality.
Workloads studied are mostly CPU-intensive, so generalization to memory- or I/O-bound services is less explored.
The abstract’s headline numbers differ from the detailed experiment numbers, depending on baseline framing — read the per-experiment results for the precise gains.
Glossary
SLA / SLO: Service Level Agreement/Objective — the performance promise (here, response time ≤ 200 ms with HTTP 200).
HPA / VPA: Horizontal / Vertical Pod Autoscaler. Both are Kubernetes’ reactive scalers; the HPA is built in, while the VPA is installed separately as an add-on.
Pod: one running copy of a containerized app; horizontal scaling changes the pod count.
Provisioning delay / cold start: the lag before a newly added pod is ready to serve.
Distribution shift: when live data’s statistics drift from the training data, hurting prediction accuracy.
Self-attention / Transformer: neural mechanism/network that relates all time steps to each other; good at global patterns.
Convolution / CNN: sliding local filter; good at short, sharp patterns.
RevIN: Reversible Instance Normalization — normalize input, reverse it on output, to counter distribution shift.
Trend / residual decomposition: splitting a series into a smooth direction (trend) and detailed fluctuations (residual).
MLP: multi-layer perceptron, a basic feed-forward neural net (used here instead of a Transformer decoder for speed).
LightGBM: fast gradient-boosting decision-tree model; here it maps forecast+state → pod count.
MAPE-K: Monitor–Analyze–Plan–Execute over shared Knowledge — the reference loop for self-adaptive/autoscaling systems.
Prometheus / Grafana / SLS: metric collection / visualization / logging tools.
Meng, F., Dai, H., Cong, G., Zhu, B., & Zhao, H. (2025). CATScaler: A Convolution-Augmented Transformer Scaling Framework for Cloud-Native Applications. IEEE Transactions on Services Computing, 18(5), 2659–2672. doi:10.1109/tsc.2025.3592383