CATScaler

A convolution-augmented transformer forecaster paired with a LightGBM pod calculator for a full, live autoscaling loop

Meng et al. (2025) Citations


TL;DR

Cloud apps run inside lots of small “containers”, and the platform that manages them, Kubernetes, normally adds or removes copies only after it notices the app is overloaded. That’s like only starting to cook more food after the restaurant is already full of hungry customers: you’re always late, customers wait, and some leave. CATScaler fixes this by forecasting how busy each service will be over the next few hours and adding/removing container copies ahead of time. It does the forecasting with a neural network called a Transformer that the authors upgraded with a convolution module (good at spotting short, sharp spikes) and a normalization trick called RevIN (keeps predictions accurate even when live traffic looks different from the data the model was trained on). A second model, LightGBM, then translates the forecast into a concrete answer: “you’ll need 7 container copies in 30 minutes.” On real traces from Alibaba Cloud and Huawei Cloud, CATScaler cut the rate of broken service promises (SLA violations) by about 76% and response time by about 52% versus Kubernetes’ default autoscaler.


The Problem (and why simple autoscaling isn’t enough)

Cloud applications must serve wildly varying demand. Allocate too few resources → requests slow down or fail and you violate the SLA (Service Level Agreement, the promise of, say, “responses under 200 ms”). Allocate too many → you pay for idle machines. The job of an autoscaler is to keep this balance automatically.

Kubernetes ships with a built-in autoscaler, the HPA (Horizontal Pod Autoscaler). It works reactively: you set a threshold like “CPU > 60%,” and when the measured CPU crosses it, the HPA starts adding pods (a pod = one running copy of your container). Two problems:

  1. It’s always lagging. It reacts only after load is already high. Worse, spinning up a new pod takes time (the provisioning delay / cold start). So during the gap, real users get slow or failed responses. The paper measured HPA latency peaking at 650 ms while load ramped up.

  2. Workloads are bursty and shift over time. Real API traffic spikes suddenly and its statistical character drifts from day to day, so a model trained on old data can mispredict — a problem called distribution shift.
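The reactive rule behind the first problem can be sketched in a few lines. The formula is the one documented for the Kubernetes HPA (desired = ceil(current replicas × current metric / target metric)). The function name, the 60% target, and the sample CPU readings are illustrative assumptions, and the sketch leaves out the HPA's tolerance band and stabilization window.

```python
import math

def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Reactive HPA rule: scale in proportion to the measured/target ratio."""
    return math.ceil(current_replicas * current_metric / target_metric)

# Hypothetical CPU readings (%) during a ramp, with a 60% target.
replicas = 4
for cpu in [55.0, 90.0]:
    desired = hpa_desired_replicas(replicas, cpu, 60.0)
    print(f"cpu={cpu:.0f}% replicas={replicas} -> desired={desired}")
```

The rule only asks for more pods once the measured value is already above target, so the extra pods arrive one provisioning delay after the overload began.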

Existing proactive (predictive) approaches help, but the paper argues they often have one of two weaknesses: (a) forecasts aren’t accurate enough, especially for short sharp spikes, or (b) the logic that turns a forecast into a pod count is too simplistic — e.g. assuming load and pod-count scale linearly when the real relationship is nonlinear.


Background


Contribution in Simple Terms

CATScaler is a complete proactive autoscaling framework that combines two ideas:

  1. A better workload forecaster. A Transformer is augmented with a convolution module so it captures both the big-picture trend (Transformer’s strength) and sudden local fluctuations (convolution’s strength), plus RevIN to stay robust when live data drifts from training data. Many prior works used only one of these tricks; CATScaler unifies them and adds a speed optimization (replacing the heavy Transformer decoder with a lightweight MLP).

  2. A smarter “how many pods?” calculator. Instead of a naive linear formula, it uses LightGBM to learn the nonlinear relationship between predicted workload, history, and the target utilization, and outputs the expected pod count.

The result is an end-to-end system: monitor → forecast → decide pod count → actually scale Kubernetes pods, all before demand hits.
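As a rough skeleton of that loop, the four stages can be wired together as plain callables. Everything here is a hypothetical stand-in: `autoscale_once`, the lambda bodies, and the numbers are assumptions chosen so the example runs, not the paper's forecaster or its LightGBM regressor.

```python
from typing import Callable, Sequence

def autoscale_once(
    monitor: Callable[[], Sequence[float]],
    forecast: Callable[[Sequence[float]], Sequence[float]],
    decide: Callable[[Sequence[float], Sequence[float]], int],
    scale: Callable[[int], None],
) -> int:
    """Run one monitor -> forecast -> decide -> scale pass; return the pod count."""
    history = monitor()                # recent workload samples
    predicted = forecast(history)      # future workload values
    pods = decide(history, predicted)  # pod count for the predicted load
    scale(pods)                        # applied before the demand arrives
    return pods

# Toy stand-ins (all hypothetical) so the skeleton runs end to end.
applied = []
pods = autoscale_once(
    monitor=lambda: [100.0, 120.0, 150.0],
    forecast=lambda h: [h[-1] * 1.2],                # placeholder forecaster
    decide=lambda h, p: max(1, round(max(p) / 50)),  # placeholder pod calculator
    scale=applied.append,                            # records instead of calling Kubernetes
)
print(pods, applied)
```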


How It Works, Step by Step

Forecasting module (the Convolution-Augmented Transformer):

  1. Collect metrics. Prometheus scrapes per-API time series (CPU %, memory %, requests/sec, machine specs) at fixed intervals, and a sliding window of the last T time steps forms the model’s input. For training, the data is split chronologically into 80% train / 10% validation / 10% test, categorical fields are numerically encoded, and Z-score standardization is applied.

  2. RevIN normalize. Each input window is normalized using its own statistics, so the model isn’t thrown off by drift between training and live data. (This step is exactly reversed at the very end.)

  3. Decompose into trend + residual. A moving-average kernel splits the normalized series into a smooth trend component (overall direction, up or down) and a residual component (the detailed local wiggles).

  4. Model the trend cheaply. The trend is stable and carries little information, so a single linear layer is enough to fit it.

  5. Model the residual richly. The residual component (nonlinear, multi-feature interactions) goes through the convolution-augmented Transformer encoder. Two parallel branches process it: (a) pointwise + depthwise convolution for local features and (b) self-attention for global features. Each branch uses a ResNet-style skip connection (output = input + transform) to ease training and avoid vanishing gradients, and the two branch outputs are summed into a hidden representation.

  6. Lightweight decoding. The usual Transformer decoder is replaced by an MLP (multi-layer perceptron), cutting compute from quadratic O(n²) to linear O(n) — faster inference, lower latency, important for real-time scaling.

  7. Recombine + de-normalize. Trend output and residual output are merged via a learnable weight matrix W, then RevIN’s normalization is reversed to produce the final forecast: the next P workload values (the authors chose P = 144).
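Steps 2, 3, and 7 can be illustrated with a small round trip in plain Python. This is a minimal sketch under stated assumptions: the original RevIN also includes learnable affine parameters, which are omitted here; the kernel size of 3 and the edge padding are illustrative choices, not values from the paper; and the two model branches are replaced by an identity so that only the normalize, decompose, recombine, and de-normalize plumbing is shown.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # avoids division by zero on flat windows

def revin_normalize(window):
    """Normalize a window with its own mean/std; return the stats for reversal."""
    mean, std = fmean(window), pstdev(window)
    return [(x - mean) / (std + EPS) for x in window], (mean, std)

def revin_denormalize(values, stats):
    """Undo revin_normalize using the stored per-window statistics."""
    mean, std = stats
    return [v * (std + EPS) + mean for v in values]

def decompose(series, kernel=3):
    """Split a series into a moving-average trend and a residual.

    Assumes an odd kernel; the ends are padded by repeating the edge values.
    """
    pad = kernel // 2
    padded = [series[0]] * pad + list(series) + [series[-1]] * pad
    trend = [fmean(padded[i:i + kernel]) for i in range(len(series))]
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual

# Hypothetical requests/sec window containing one sharp spike.
window = [10.0, 12.0, 11.0, 30.0, 13.0, 12.0]
normed, stats = revin_normalize(window)
trend, residual = decompose(normed, kernel=3)

# Identity stand-in for the linear layer and the encoder: recombine unchanged.
recombined = [t + r for t, r in zip(trend, residual)]
restored = revin_denormalize(recombined, stats)
```

Because the statistics come from the window itself, the same code handles a live window whose level differs from the training data, and the spike ends up mostly in `residual`, the part the convolution branch is meant to model.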

Scaling module (the Proactive Scaler):

  1. Compute expected pod count with LightGBM. Four inputs are fed in: the historical workload, the predicted future workload, the historical pod count, and the target utilization (CPU/memory). Feature engineering adds periodicity cues (hour, week), totals (CPUtotal, MEMtotal), and an interaction feature (CPU × memory) to expose nonlinear structure. LightGBM regresses these into the expected number of pods for each future step.

  2. Smooth the decisions. Predicted counts are clamped to a configured min/max range and pushed into a FIFO queue. Two guards prevent thrashing: a cooling phase (no scaling for a set time after an action, to ignore momentary blips) and a temporal interval judgment (avoid scaling too early or too late).

  3. Execute the scaling. If the predicted count differs from the live count and the system isn’t cooling down, CATScaler adjusts the Kubernetes pod replica count up or down automatically. The whole service is wrapped as a Flask application, and a Kubernetes CronJob periodically retrains/updates the model so it tracks changing workloads.
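The smoothing and execution steps can be sketched as a small guard object. The class name `ScalingGuard`, its parameters, and the choice to discard a prediction that arrives during the cooling phase are assumptions made for illustration. The temporal interval judgment is not modeled, and the call that would actually change the replica count in Kubernetes is left to the caller.

```python
from collections import deque

class ScalingGuard:
    """Clamp predicted pod counts, queue them FIFO, and enforce a cooldown."""

    def __init__(self, min_pods, max_pods, cooldown_s):
        self.min_pods, self.max_pods = min_pods, max_pods
        self.cooldown_s = cooldown_s
        self.queue = deque()        # FIFO of clamped predictions
        self.last_action_at = None  # time of the last scaling action

    def push(self, predicted_pods):
        """Clamp a prediction to [min_pods, max_pods] and enqueue it."""
        clamped = max(self.min_pods, min(self.max_pods, predicted_pods))
        self.queue.append(clamped)

    def next_action(self, live_pods, now_s):
        """Return the new replica count, or None if no scaling should happen."""
        if not self.queue:
            return None
        target = self.queue.popleft()
        cooling = (self.last_action_at is not None
                   and now_s - self.last_action_at < self.cooldown_s)
        if target == live_pods or cooling:
            return None
        self.last_action_at = now_s
        return target

guard = ScalingGuard(min_pods=2, max_pods=10, cooldown_s=300)
for predicted in [7, 15, 3]:  # hypothetical regressor outputs; 15 is clamped to 10
    guard.push(predicted)

first = guard.next_action(live_pods=4, now_s=0)     # scales to 7
second = guard.next_action(live_pods=7, now_s=60)   # still cooling, so no action
third = guard.next_action(live_pods=7, now_s=400)   # cooldown over, scales to 3
```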


Inputs (what it consumes)

Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

CATScaler is a full proactive, horizontal autoscaler: it covers the entire MAPE-K control loop (Monitor, Analyze, Plan, Execute over shared Knowledge), not just the prediction part.

Concrete analogy: an online store the night before a big sale. The default HPA waits until checkout queues are already overflowing, then scrambles to open more registers (too late). CATScaler reads the trend, predicts the morning rush, and opens exactly the right number of registers ahead of time — then closes them as the rush fades, so it isn’t paying idle cashiers.


Evaluation (briefly)


Training & pre-training

Trained from scratch — no pretrained or foundation model.

Both halves of CATScaler learn from random initialization on the cloud workload traces themselves (the Alibaba Cloud production trace and the Huawei Cloud Private Functions Trace 2023). There is no pretrained backbone, no time-series foundation model, no fine-tuning, and no zero-shot use — the convolution-augmented Transformer forecaster and the LightGBM pod-count regressor are each fit directly on the data described in Evaluation above.

(For the record, the only “Foundation” mentions in the paper refer to funding agencies, not foundation models.)


Strengths

Limitations


Glossary

References
  1. Meng, F., Dai, H., Cong, G., Zhu, B., & Zhao, H. (2025). CATScaler: A Convolution-Augmented Transformer Scaling Framework for Cloud-Native Applications. IEEE Transactions on Services Computing, 18(5), 2659–2672. doi:10.1109/TSC.2025.3592383