WGAN-gp - Luca's Research @ PEG

Arbat et al. (2022) Citations

TL;DR¶

Cloud applications run on rented machines (VMs). If you rent too few, the app gets overwhelmed and slow; rent too many and you waste money. The smart move is to forecast how much traffic is coming and rent machines ahead of time. This paper builds a forecaster that predicts the number of incoming jobs/requests in the next time interval from the recent history of job counts. The new idea is to combine two modern AI techniques: a Transformer and an adversarial training trick called WGAN-gp (one network tries to forecast, a second “critic” network judges how realistic the forecast looks, and they train against each other). The result, WGAN-gp Transformer, predicts cloud workloads up to 5.1% more accurately and about 5x faster at inference than the previous best method (an LSTM-based system called LoadDynamics). Plugged into a real Google Cloud autoscaler, it sharply cut under-provisioning (too few machines) on the Facebook workload, though over-provisioning (wasted machines) rose there, and it modestly reduced both on the Azure-VM-2019 workload.

The Problem (and why simple autoscaling isn’t enough)¶

Autoscaling is the cloud feature that automatically adds machines when an app gets busy (“scale-out”) and removes them when it goes quiet (“scale-in”). The goal is to always have just enough capacity.

The naive way to autoscale is reactive: watch the current load, and once it crosses a threshold (say, CPU > 70%), start booting more machines. The fatal flaw is timing. Booting a VM takes time (the “startup delay” or cold start). By the time the new machine is ready, the traffic spike may have already caused slow responses or dropped requests. Reactive autoscaling is always running a step behind.

The fix is predictive (proactive) autoscaling: forecast the demand for the next interval and start booting machines before the spike hits, so they’re warm and ready in time. A predictive autoscaler has two parts:

A workload predictor — forecasts future demand (here, the job arrival rate, i.e. how many jobs/requests will show up).
An autoscaler — turns that forecast into a decision about how many VMs to run.

This paper is about building a better workload predictor. The state of the art at the time was LoadDynamics, an LSTM-based forecaster. LSTMs have two weaknesses the authors target:

Slow at inference. An LSTM is recurrent: it reads a sequence one time step at a time, in order. The longer the input history, the slower it gets. In autoscaling, the forecast must be produced quickly each interval, so this overhead matters.
Struggles with burstiness. Real cloud traffic isn’t smooth. It has sudden, random spikes (think a news site when a story goes viral). LSTMs are good at seasonal/repeating patterns but less good at these unpredictable bursts.

Background¶

A few terms, defined once:

Job Arrival Rate (JAR): how many jobs/user-requests arrive in a time interval. This is the single number the model forecasts. (The paper uses “workload,” “user requests,” and “job arrival rate” interchangeably.)
Time series / univariate: a sequence of numbers measured over time, e.g. job counts every 10 minutes: [120, 134, 98, 210, ...]. Univariate means a single value per time step (just the job count), not multiple metrics.
Lookback / history window (n): how many past time steps the model reads to make a prediction. The authors slide this window along the series.
Transformer: a neural network that uses attention — a mechanism that lets the model look at all positions in the input at once and decide which past moments matter most for the prediction. Unlike an LSTM, it does not process the sequence step-by-step, so it can run in one shot (faster) and directly relate distant points in time. It has an encoder (digests the input history) and a decoder (produces the forecast).
Positional encoding: because a Transformer reads everything “at once,” it has no built-in sense of order. Positional encoding is extra information added to each input value telling the model when in the sequence it occurred. Essential for time series, where order is everything.
GAN (Generative Adversarial Network): two networks trained against each other. A generator produces outputs; a discriminator/critic tries to tell real data from generated data. Competition pushes the generator to make increasingly realistic outputs. Here the “realistic output” is a believable continuation of the workload time series.
WGAN (Wasserstein GAN) + gradient penalty (gp): plain GANs are notoriously unstable to train. WGAN replaces the discriminator with a critic that estimates the Wasserstein distance (an “earth-mover’s distance” — how far apart two distributions are) between real and generated data, which gives smoother, more reliable training signals. WGAN needs the critic to be 1-Lipschitz (mathematically “not too steep”). The original WGAN enforced this by crudely clipping weights, which caused problems. WGAN-gp instead adds a gradient penalty: a soft nudge in the loss that keeps the critic’s gradients near magnitude 1. Cleaner and more stable.
MLP (Multi-Layer Perceptron): the plainest neural network — a stack of fully connected layers. Used here as the critic.
MADGRAD: a relatively modern optimizer (the algorithm that adjusts the network’s weights during training). The authors find it converges better than the popular Adam optimizer, and they show it is the key factor in their accuracy gains.
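For reference, the WGAN-gp critic objective described above can be written out as follows. This is the standard form from Gulrajani et al. (2017); the symbols D (critic), P_r (real sequences), P_g (generated sequences) and ε are notation chosen for this note and may not match the paper's Equation 5 symbol for symbol.

```latex
\mathcal{L}_{\text{critic}}
  = \mathbb{E}_{\tilde{x} \sim P_g}\big[D(\tilde{x})\big]
  - \mathbb{E}_{x \sim P_r}\big[D(x)\big]
  + \lambda \, \mathbb{E}_{\hat{x}}\Big[\big(\lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{x} = \epsilon x + (1 - \epsilon)\tilde{x}, \quad \epsilon \sim U[0, 1]
```

The first two terms are the Wasserstein estimate (generated minus real critic scores). The last term is the gradient penalty, evaluated at random points between a real and a generated sequence, with λ = 10 in this paper.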

Contribution in Simple Terms¶

The paper’s contribution is a new time-series forecaster, WGAN-gp Transformer, purpose-built for predicting bursty cloud workloads, that is both more accurate and far faster than the prior LSTM state of the art. Concretely, what’s genuinely new:

It uses a Transformer instead of an LSTM as the forecaster. Because the Transformer processes the whole history in one shot (no step-by-step recurrence), inference is ~5x faster, and its attention mechanism captures the relevant past moments for predicting sudden spikes.
It trains that Transformer adversarially with WGAN-gp. The Transformer is the generator; a small MLP critic judges how realistic its forecasts look. This adversarial pressure makes forecasts of dynamic, bursty workloads more accurate than training the Transformer alone. (Prior adversarial-Transformer work, “Adversarial Sparse Transformer,” used a sparse attention that tended to lose long-term info; this paper uses standard attention + WGAN-gp to keep that long-range information.)
It plugs MADGRAD in as the optimizer for both generator and critic, and shows experimentally that this — not just the architecture — is what unlocks the accuracy improvement over an Adam-trained version.
It validates the predictor inside a real autoscaler on Google Cloud, not just on offline error metrics, showing that the better forecasts mostly translate into less under- and over-provisioning in practice (the exception is Facebook over-provisioning, which rose).

In one sentence: take a Transformer, train it as the generator in a stable WGAN-gp adversarial setup with the MADGRAD optimizer, and you get a fast, accurate cloud-workload forecaster that improves real autoscaling.

How It Works, Step by Step¶

Training (offline):

Collect & format data. Take a single stream of job counts over time (univariate series). Split chronologically into 60% train / 20% validation / 20% test.
Slide a window. Use a sliding window of history length n (stride of 1 time step). Each window of past values x[1..t0] is an input; the very next value x[t0+1] is the target. (Prediction range τ = 1 — i.e. one-step-ahead forecasting.)
Add positional encoding to the input so the Transformer knows the order of the history. Positional encoding is also applied to the decoder input, which is the last value of the history window.
Generator forecasts. The Transformer generator (one encoder layer + one decoder layer, with multi-head attention) reads the history. The encoder compresses it into a latent “memory” vector h; the decoder uses h to produce the predicted next value x̂.
Critic judges. The MLP critic (3 fully-connected layers, LeakyReLU activations) is fed two things: the real full sequence (history + actual future) and the fake one (history + the generator’s forecast, concatenated). It outputs a scalar score for each; the gap between its scores on real and generated sequences approximates the Wasserstein distance between the two. The critic’s loss adds the gradient penalty term that keeps it 1-Lipschitz (penalty coefficient λ = 10).
Adversarial loop (Algorithm 1). Each round: update the critic 5 times (ncritic = 5) so it becomes a good judge, then update the generator once to fool the critic and minimize forecast error. The generator’s loss combines (a) mean absolute error between forecast and truth and (b) the critic’s score — so it’s pushed to be both accurate and realistic. Both networks are optimized with MADGRAD (learning rate 0.001), trained for 1000 epochs.
Hyperparameter search. A grid search tunes history length n, batch size m, model dimension dmodel, and number of attention heads nhead per workload.
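The first two training steps (chronological split, then sliding window) can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: the function names and the toy series are made up here, and windowing each split separately is an assumption, since the notes do not say whether the paper windows before or after splitting.

```python
def chronological_split(series, train_frac=0.6, val_frac=0.2):
    """Split in time order (no shuffling): train / validation / test."""
    n_train = int(len(series) * train_frac)
    n_val = int(len(series) * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


def sliding_windows(series, n):
    """Build (history, target) pairs with stride 1 and horizon tau = 1."""
    pairs = []
    for start in range(len(series) - n):
        history = series[start:start + n]   # the previous n job counts
        target = series[start + n]          # the single next job count
        pairs.append((history, target))
    return pairs


# Toy univariate series of job counts per interval (illustrative values).
job_counts = [120, 134, 98, 210, 180, 175, 160, 220, 240, 190]
train, val, test = chronological_split(job_counts)
pairs = sliding_windows(train, n=3)
```

With n = 3, the six training values yield three (history, target) pairs, the first being ([120, 134, 98], 210).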

Inference (online, per interval):

Feed the most recent n job counts (with positional encoding) through the generator only. Because the Transformer reads the whole window at once (no recurrence), it returns the next-step forecast in a single fast pass (~4.85 ms vs ~25.57 ms for the LSTM baseline).

Using it to autoscale (the deployment experiment):

The forecast Pi = predicted number of jobs arriving in interval i. Under the paper’s assumption that one job needs one VM, the autoscaler pre-creates Pi VMs before interval i begins.
Compare to actual arrivals Ti: if Ti > Pi → under-provisioned (need more VMs, incur startup delay); if Ti < Pi → over-provisioned (idle VMs waste money). Better forecasts shrink both error types.
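The comparison of Pi against Ti can be made concrete with a small sketch. The function name and the example numbers are hypothetical, and normalizing by total actual arrivals is an assumption made for illustration; the notes do not give the paper's exact formula for its provisioning rates.

```python
def provisioning_report(predicted, actual):
    """Count missing and idle VMs under the one-job-per-VM assumption."""
    # Under-provisioning: jobs that arrived without a pre-created VM (T_i > P_i).
    under = sum(max(t - p, 0) for p, t in zip(predicted, actual))
    # Over-provisioning: pre-created VMs that stayed idle (T_i < P_i).
    over = sum(max(p - t, 0) for p, t in zip(predicted, actual))
    total = sum(actual)
    return {
        "under_vms": under,
        "over_vms": over,
        # Assumed normalization: share of total actual arrivals.
        "under_rate_pct": 100.0 * under / total,
        "over_rate_pct": 100.0 * over / total,
    }


predicted = [10, 12, 8, 15]   # P_i: VMs pre-created per interval
actual = [12, 11, 8, 20]      # T_i: jobs that actually arrived
report = provisioning_report(predicted, actual)
```

In this toy run, intervals 1 and 4 are under-provisioned (7 VMs short in total) and interval 2 is over-provisioned by 1 VM.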

Workload Modeling & Prediction Pipeline (full-text deep read)¶

This section drills into exactly two things: (A) what the paper treats as “the workload” and how it turns that into numbers a neural network can read, and (B) the precise path from raw data to a prediction.

(A) How the workload is modeled and characterized¶

What real-world quantity is “the workload”? It is the Job Arrival Rate (JAR) — literally how many jobs (or user requests) arrive at the cloud application during one fixed time interval. The paper says outright that it “use[s] the terms workloads, user requests, and job arrival rates interchangeably.” So the workload is a count of incoming jobs per interval, not CPU%, not memory, not latency, not a performance-degradation ratio. (Note: the paper is framed around VM auto-scaling — its title is “Cloud Workload Prediction” and it repeatedly talks about provisioning/de-provisioning VMs — but the actual signal it models and forecasts is purely the job/request count, never VM utilization metrics like CPU or memory.)

One signal or many? Univariate. The paper is explicit and formal: “A univariate time-series is defined as a sequence of measurements of the same variable collected over time. We study univariate time-series data of JARs.” Univariate means there is exactly one number per time step — the job count — and nothing else (no parallel CPU, memory, or disk channels). Formally the series is x = [x1, x2, ..., xT] where each xt is a single real number, the job count at time t.

Time granularity (how often it is sampled). This varies by dataset, and the authors deliberately test several granularities because “different time granularities can exhibit subtle variations in the time-series workload patterns.” From Table 1:

Facebook and Alibaba-2018 (data-center traces): 5 and 10 minute intervals.
Google and Wikipedia: 10 and 30 minute intervals.
Azure-VM-2017: 10, 30, and 60 minute intervals.
Azure-VM-2019: 10, 30, and 60 minute intervals.
Azure-Func-2019 (serverless functions): 30 and 60 minute intervals.

So one job-count sample is recorded every 5 to 60 minutes depending on the trace. Across 7 real workloads and these interval choices, the authors build 15 distinct workload configurations.
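A quick tally helps reconcile the configuration count with the granularities listed above. The sketch below uses only numbers from these notes: the Table 1 listing as transcribed here, and the Table 3 granularities for Azure-VM-2019 and Azure-Func-2019 reported in the evaluation notes further down this page. It assumes the other five traces are evaluated at exactly the granularities listed above.

```python
# Sampling intervals (minutes) per trace, as listed for Table 1 in these notes.
table1 = {
    "Facebook": [5, 10],
    "Alibaba-2018": [5, 10],
    "Google": [10, 30],
    "Wikipedia": [10, 30],
    "Azure-VM-2017": [10, 30, 60],
    "Azure-VM-2019": [10, 30, 60],
    "Azure-Func-2019": [30, 60],
}

# Granularities actually evaluated in Table 3, per the evaluation notes.
table3_overrides = {
    "Azure-VM-2019": [30, 60],
    "Azure-Func-2019": [5, 10],
}


def count_configs(granularities):
    """Number of (trace, interval) workload configurations."""
    return sum(len(intervals) for intervals in granularities.values())


table3 = {**table1, **table3_overrides}
n_table1 = count_configs(table1)
n_table3 = count_configs(table3)
```

Under these assumptions the Table 1 listing gives 16 combinations, while the Table 3 granularities give the 15 configurations the paper reports.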

What the raw data looks like, and what makes it hard. Figure 1 plots four of the traces as plain line charts of job counts over interval index — e.g. Google Cluster job counts (in millions, ~0 to 2.5x10^6) over ~1200 thirty-minute intervals; Facebook (0 to ~150 jobs) over ~120 ten-minute intervals; Azure-VM-2017 (0 to ~6000 jobs) over ~600 sixty-minute intervals; Wikipedia (~2 to 6.5 million) over ~500 thirty-minute intervals. The hard parts the paper calls out:

Random, dynamic burstiness. “The majority of real-world cloud workloads have random and dynamic burstiness.” Sudden spikes (“unprecedented changes in user request patterns over time”) are the core difficulty — think a news/social spike that no calendar predicts.
Mixed seasonality. Some traces (notably Wikipedia, and the Azure function trace) have strong seasonal/repeating patterns; data-center traces (Facebook, Google) are dominated by spikes; Azure VM traces are a mixture of both. No single regime fits all 15 configurations — which is why the authors tune a separate model per trace.
Long sequences. Histories can be long (the history-length search space reaches 676 steps for Google and 682 for Azure-VM-2017, per Table 2), and the paper’s whole speed argument is that long inputs slow an LSTM but not a one-shot Transformer.

How the workload is represented numerically. Three concrete moves, and notably no frequency transform or explicit trend/seasonal/residual decomposition:

Sliding window into fixed-length sequences. The long raw series x1:T is sliced into N overlapping sub-sequences of length S (the paper writes the training set as X ∈ R^{N×S}). A window of history x[1..t0] is the input; the next value x[t0+1] is what to predict. The window slides forward with a stride of one time step.
History length n is a tuned hyperparameter, not a fixed constant. Search ranges per workload run from [3-46] (Facebook) up to [28-676] (Google) and [14-682] (Azure-VM-2017); see Table 2.
Positional encoding is added to each input value so the order-agnostic Transformer (which reads everything at once) knows when each job-count occurred. The paper does not describe any extra normalization scheme, embedding lookup, patching, or trend/seasonal decomposition beyond windowing + positional encoding. Internally the encoder turns the window into a latent “memory” vector, but the input representation is just: a window of raw job counts + positional encodings.
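To make the positional-encoding step concrete, here is a sketch of the sinusoidal encoding from Vaswani et al. (2017). The paper cites that work but, as far as these notes can tell, does not state which form it uses, so the sinusoidal choice is an assumption. How the scalar job count is widened to d_model features before the encoding is added is also not described, and is left out here.

```python
import math


def sinusoidal_positional_encoding(length, d_model):
    """Return a length x d_model table: sin on even columns, cos on odd ones."""
    pe = [[0.0] * d_model for _ in range(length)]
    for pos in range(length):
        for i in range(0, d_model, 2):
            # Wavelengths grow geometrically with the column index.
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


# One encoding row per time step in a history window of length n = 4,
# with an illustrative model width d_model = 8.
pe = sinusoidal_positional_encoding(length=4, d_model=8)
```

Each row is added element-wise to the feature vector of the matching time step, so two identical job counts at different positions in the window look different to the attention layers.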

(B) The step-by-step prediction pipeline (raw signal -> prediction result)¶

The detailed training/inference mechanics are already given above in “How It Works, Step by Step.” Here is the tight raw-signal-to-result walkthrough, with the numbers the paper states:

Collect the raw workload. Obtain one trace’s stream of job counts at a fixed interval (5-60 min). This is a single univariate time series of length T.
Split chronologically. First 60% for training, next 20% for cross-validation, last 20% for testing — kept in time order (no shuffling across the split boundaries), because it is forecasting.
Window the series. Apply the sliding window (stride 1) to cut the series into input/target pairs: input = the previous n job counts x[1..t0]; target = the single next job count x[t0+1]. The paper fixes the prediction range to τ = 1, i.e. strictly one step ahead.
Encode position and feed the generator. Add positional encoding to the window and pass it to the Transformer generator — a compact one encoder layer + one decoder layer with multi-head attention (number of heads nhead ∈ {4, 8}; model width dmodel ∈ {8,16,32,64,128,512}, both grid-searched). Why a Transformer: attention can directly relate any past moment to the prediction (good for catching the run-up to a spike) and processes the whole window in a single pass (fast).
Encoder -> latent memory. The encoder compresses the history window into a latent vector h[1..t0]. Why: this h is the condensed “what the recent past looks like” that the decoder reads from.
Decoder -> raw forecast. The decoder, seeded with the last value of the history window (x[t0]) plus its own positional encoding and the encoder memory h, emits the predicted next value x_hat[t0+1]. Why this stage exists: it converts the encoded context into the actual forecast number.
(Training only) Critic judges realism. A 3-layer MLP critic (LeakyReLU, α=0.2) scores the real full sequence x[1..S] against the generated one [x[1..t0] ⊕ x_hat[t0+1..S]] (history concatenated with forecast), approximating the Wasserstein distance between real and generated workloads, plus a gradient penalty (λ=10) keeping it 1-Lipschitz. Why: this adversarial pressure makes forecasts of bursty data look like plausible real continuations, not just least-squares-flat guesses.
(Training only) Adversarial update loop. Per round: update the critic ncritic = 5 times, then the generator once; the generator’s loss = mean absolute error (L1) to the true next value plus the critic’s score. Optimizer is MADGRAD (learning rate 0.001, momentum 0.9), run 1000 epochs. Why MADGRAD: an ablation (Figure 3) shows it, not the architecture alone, is the key accuracy driver versus Adam.
Inference (online, the part that runs each interval). Use the generator only: feed the most recent n job counts (with positional encoding) in one forward pass and read out the single predicted job count for the next interval. This is a point forecast (one number), not a probabilistic interval, and the unit is jobs/requests per interval. Measured latency: ~4.85 ms (vs ~25.57 ms for the LSTM baseline).
PREDICTION RESULT. The output is P_i = the predicted number of jobs arriving in the next interval i, in raw job-count units, one interval (5-60 min) ahead. (How it’s consumed downstream, briefly: under the paper’s “1 job = 1 VM” assumption, the autoscaler pre-creates P_i VMs before interval i; accuracy is scored with MAPE, and on Google Cloud the better forecasts cut under-/over-provisioning. The pipeline’s job ends at producing P_i.)
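The two training losses in the walkthrough above can be illustrated with a toy calculation. To stay dependency-free, the sketch swaps the paper's 3-layer MLP critic for a linear critic f(x) = w·x + b. This is a loud simplification: for a linear critic the gradient with respect to the input is just w everywhere, so the gradient penalty has a closed form. A real implementation would compute that gradient with autograd at random points between real and generated sequences. All names and numbers below are hypothetical.

```python
import math

LAMBDA_GP = 10.0  # gradient-penalty coefficient reported in the notes


def critic_score(w, b, seq):
    """Linear stand-in critic (NOT the paper's MLP): f(x) = w . x + b."""
    return sum(wi * xi for wi, xi in zip(w, seq)) + b


def mean_score(w, b, seqs):
    return sum(critic_score(w, b, s) for s in seqs) / len(seqs)


def critic_loss(w, b, real_seqs, fake_seqs, lam=LAMBDA_GP):
    """Wasserstein term (generated minus real) plus lambda * gradient penalty."""
    # For a linear critic the input gradient is w, so its norm is ||w||_2.
    grad_norm = math.sqrt(sum(wi * wi for wi in w))
    penalty = (grad_norm - 1.0) ** 2
    return mean_score(w, b, fake_seqs) - mean_score(w, b, real_seqs) + lam * penalty


def generator_loss(w, b, histories, forecasts, targets):
    """L1 error to the true next value minus the critic's score on the fake."""
    mae = sum(abs(f - t) for f, t in zip(forecasts, targets)) / len(targets)
    fake_seqs = [h + [f] for h, f in zip(histories, forecasts)]
    return mae - mean_score(w, b, fake_seqs)


histories, targets, forecasts = [[1.0, 2.0]], [3.0], [2.5]
real_seqs = [h + [t] for h, t in zip(histories, targets)]
fake_seqs = [h + [f] for h, f in zip(histories, forecasts)]
w, b = [0.0, 0.0, 2.0], 0.0

c_loss = critic_loss(w, b, real_seqs, fake_seqs)
g_loss = generator_loss(w, b, histories, forecasts, targets)
```

Here the critic's gradient norm is 2, so the penalty term contributes 10 × (2 − 1)² = 10 and pushes w back toward unit norm. In the full loop this critic loss would be minimized 5 times for every single generator update.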

Model Parameters & How They Were Chosen¶

This section consolidates every hyperparameter the paper reports for the WGAN-gp Transformer and states, for each, how the value was selected.

(A) What are the model parameters?¶

Architecture. The model has two networks: a Transformer generator (the forecaster) and an MLP critic (the adversarial judge). “Encoder/decoder layers” are the stacked attention blocks; “attention heads” are parallel attention computations; d_model is the width of the internal feature vectors; “gradient penalty” is the soft constraint that keeps the critic mathematically well-behaved.

Parameter	Value	Notes
Generator encoder layers	1	“one layer of an encoder”
Generator decoder layers	1	“subsequent one layer of a decoder”
Attention heads (`nhead`)	grid-searched over {4, 8}	same value used in encoder and decoder
Model dimension (`dmodel`)	grid-searched over {8, 16, 32, 64, 128, 512}	number of input features for encoder and decoder; the critic’s linear-layer width is set to the same value
Feed-forward dimension	not reported	the encoder/decoder contain a “Feed Forward” sublayer (Figure 2), but its width is not stated
Dropout	not reported
Positional encoding	per Vaswani et al. (2017); specific form (e.g. sinusoidal) not stated	applied to both the encoder input and the decoder input (the decoder is seeded with the last value of the history window, `x_{i,t0}`)
Decoder attention	masked multi-head attention, then a second multi-head attention over the encoder output	per Figure 2 (standard Transformer decoder); the paper does not name the second block
Embedding dimension	not reported separately	inputs are raw scalar job counts plus positional encoding; no learned token embedding table is described
Critic structure	3 fully connected (linear) layers	a multi-layer perceptron (MLP)
Critic activation	LeakyReLU, slope `α = 0.2`	`f(x) = max(αx, x)`
Critic linear-layer width	equal to `dmodel`	stated explicitly
Gradient-penalty coefficient (`λ`)	10	the WGAN-gp default; keeps the critic 1-Lipschitz
Critic iterations per generator step (`ncritic`)	5	critic updated 5 times, then generator once
Convolution kernel sizes / RevIN / ProbSparse / FFT settings / diffusion timesteps / similarity matrices / output classes	not applicable	the model uses standard (dense) attention, not sparse attention, convolutions, frequency transforms, or a classification head; it is a single-value regressor

Training.

Parameter	Value	Notes
Optimizer	MADGRAD	used for both generator and critic; chosen over Adam
Learning rate (`α`)	0.001	same for generator and critic
Learning-rate schedule	not reported	no decay/warmup described
MADGRAD momentum	0.9	“default configurations”
MADGRAD weight decay	0
MADGRAD epsilon	1e-6
Batch size (`m`)	grid-searched; ranges [16–256] (Facebook) to [16–1024] (Alibaba, Google, Wiki, Azure-VM) and [16–512] (Azure-Func)	per Table 2
Epochs	1000	fixed; “works well for our proposed method”
Early stopping	not reported	a fixed 1000-epoch budget is used instead
Generator loss	mean absolute error (L1) to the true future, minus the critic’s score	Equation 4
Critic loss	Wasserstein term (generated minus real critic scores) plus `λ` × gradient penalty	Equation 5
Loss weights	only `λ = 10` on the gradient penalty; the L1 and critic terms in the generator loss are unweighted (coefficient 1)	per Equations 4–5
Hardware	single NVIDIA GeForce RTX 2080 Ti GPU
Software stack	PyTorch + scikit-learn

Adam comparison settings (ablation only). For the optimizer ablation (Figure 3), the Adam variant used β1 = 0, β2 = 0.9, and learning rate 0.0001 for both generator and critic. These are not the deployed model’s settings.

Data / windowing.

Parameter	Value	Notes
Lookback / history length (`n`)	grid-searched per workload: [3–46] Facebook, [20–324] Alibaba-2018, [28–676] Google, [12–274] Wikipedia, [14–682] Azure-VM-2017, [14–230] Azure-VM-2019, [7–108] Azure-Func-2019	per Table 2
Forecast horizon / prediction range (`τ`)	1 (one step ahead)	fixed
Sliding-window stride	1 time step
Sampling interval	5, 10, 30, or 60 min depending on dataset	7 workloads, 15 configurations (Table 1)
Train / validation / test split	60% / 20% / 20%, chronological (no shuffling)	first 60% train, next 20% cross-validation, last 20% test

Total parameter count / model size. Not reported. The paper gives no parameter count or memory footprint; it reports only inference latency (4.85 ms average for WGAN-gp Transformer vs 25.57 ms for the LSTM baseline).

(B) How were the parameters chosen?¶

Grid search (per workload). The four primary hyperparameters, history length n, batch size m, model dimension dmodel, and number of attention heads nhead, were selected by an “effective grid search” over the explicit ranges in Table 2 (reproduced above), run separately for each of the workloads. The search spaces differ by trace (for example, history length spans [3–46] for Facebook but [28–676] for Google), reflecting the authors’ position that no single configuration fits all 15 workload settings and that a separate model is tuned per trace. The paper does not state which specific value within each range was chosen for each workload, nor whether the search was exhaustive over a discrete product grid or coarser.
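A per-workload grid search of this kind can be sketched as below. The discrete candidate values for n and m are invented for illustration (the paper gives only ranges), and `toy_validation_mape` is a made-up stand-in for "train a model with this configuration and return its validation MAPE".

```python
from itertools import product


def grid_search(search_space, evaluate):
    """Try every combination and keep the one with the lowest score."""
    keys = list(search_space)
    best_config, best_score = None, float("inf")
    for values in product(*(search_space[k] for k in keys)):
        config = dict(zip(keys, values))
        score = evaluate(config)  # e.g. validation MAPE, lower is better
        if score < best_score:
            best_config, best_score = config, score
    return best_config, best_score


# Hypothetical grid inside the Facebook ranges from Table 2.
facebook_space = {
    "n": [3, 12, 24, 46],                    # history length, range [3-46]
    "m": [16, 64, 256],                      # batch size, range [16-256]
    "d_model": [8, 16, 32, 64, 128, 512],    # model dimension
    "n_head": [4, 8],                        # attention heads
}


def toy_validation_mape(config):
    """Stand-in objective; a real run would train and validate a model."""
    return (abs(config["n"] - 24)
            + abs(config["d_model"] - 32) / 32
            + (0.5 if config["n_head"] == 4 else 0.0)
            + config["m"] / 1024)


best_config, best_score = grid_search(facebook_space, toy_validation_mape)
```

Even this small grid has 4 × 3 × 6 × 2 = 144 candidates, each of which would mean a full 1000-epoch adversarial training run, which is why the tuning cost is listed under Limitations.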

Cross-validation. The middle 20% split is explicitly designated for cross-validation, the basis on which grid-search candidates are compared. Selection is by Mean Absolute Percentage Error (MAPE), the paper’s accuracy metric.

Taken from prior work / defaults (not searched).

Gradient-penalty coefficient λ = 10: explicitly “the default values suggested by the authors of WGAN-GP” (Gulrajani et al. 2017).
Critic iterations ncritic = 5 and the MADGRAD defaults (momentum 0.9, weight decay 0, eps 1e-6): stated as fixed configuration values, described as “default configurations,” not tuned.
Learning rate 0.001 and prediction range τ = 1: stated as fixed implementation choices, with no search range or justification given beyond the one-step-ahead problem framing.
Positional encoding, encoder-decoder structure, LeakyReLU α = 0.2: adopted from prior work (Vaswani et al. 2017; the AST critic design and Xu et al. 2015 for LeakyReLU), not independently tuned.

Chosen by ablation / sensitivity study.

Optimizer (MADGRAD vs Adam): selected by an ablation (Figure 3) across six workload configurations showing MADGRAD yields substantially lower MAPE; the authors identify the optimizer as “the critical factor” for accuracy.

Fixed by judgment, not searched.

Number of epochs = 1000: fixed because it “works well for our proposed method”; no schedule or early-stopping criterion. No tuning procedure is described.
Single encoder layer and single decoder layer: stated as the architecture, with no reported search over depth.

Unspecified. Feed-forward dimension, dropout, embedding dimension, learning-rate schedule, weight initialization, and any early-stopping rule are not reported, and no selection method is given for them.

Inputs (what it consumes)¶

A univariate time series of job arrival rates — the count of jobs/user-requests per fixed time interval, from one workload. This is the only signal; the model is univariate (no CPU, memory, latency, or multivariate features).
A lookback window of length n of recent job counts (tuned per workload; e.g. ranges like 3-46 for Facebook up to 28-676 for Google in the search space).
Positional encodings added to the inputs so the Transformer knows time order.
Time granularities tested: 5, 10, 30, or 60 minute intervals depending on the dataset.

Outputs (what it produces)¶

A one-step-ahead forecast (τ = 1) of the job arrival rate for the next time interval — a single number, the predicted job count.
The forecast horizon is therefore one interval ahead (5-60 minutes depending on configuration).
In the autoscaling deployment, that predicted job count is directly interpreted as the number of VMs to pre-provision (one job ↔ one VM by assumption), i.e. it effectively yields a target VM count that the autoscaler executes.

How It Fits the Autoscaling Framework (MAPE-K)¶

The MAPE-K loop is the standard blueprint for self-managing systems: Monitor → Analyze → Plan → Execute over shared Knowledge.

Primary stage: ANALYZE. The WGAN-gp Transformer is the predictive brain that turns recent history into a forecast of future demand. This is the paper’s core.
Monitor: assumed/upstream — the system must be collecting per-interval job counts to feed the model. The paper doesn’t focus on this.
Plan & Execute: demonstrated but deliberately simple. With the “1 job = 1 VM” rule, the Plan step is trivial arithmetic (target VMs = predicted jobs), and Execute is handled by a standard VM autoscaler creating/terminating Google Cloud e2-medium VMs. The paper’s contribution is the forecaster; the actuation is a thin wrapper around it.
Makes autoscaling PROACTIVE. By forecasting the next interval and pre-creating VMs, it hides VM startup delay — the whole point of predictive autoscaling.
Horizontal scaling. It decides how many VM instances to run (scale-out/scale-in), not how much CPU/RAM per instance. So this is horizontal, not vertical, scaling.
Honest scope: the deliverable is the predictor. It feeds a basic autoscaler rather than introducing a sophisticated planner or controller, and the experiments use a clean one-job-per-VM mapping rather than a general resource-allocation model.

Analogy: think of an online store the night before a big sale. A reactive system only adds servers once checkout requests are already piling up and customers are seeing spinning wheels. The WGAN-gp Transformer is like a planner who studies recent traffic, predicts tomorrow’s surge, and spins up the extra servers in advance so they’re warm when shoppers arrive — while not leaving expensive servers running idle afterward.

Evaluation (datasets & metrics, briefly)¶

Datasets: 7 real-world cloud traces, expanded by sampling interval into 15 workload configurations (Table 1). Data-center traces: Facebook (Hadoop, 5/10 min), Alibaba-2018 (5/10 min), Google Cluster (10/30 min). Web: Wikipedia (Wikibench, 10/30 min, strongly seasonal). Cloud VM/serverless: Azure-VM-2017 (10/30/60 min), Azure-VM-2019, and Azure-Func-2019 (serverless functions). (Internal inconsistency: Table 1 lists Azure-VM-2019 at 10/30/60 min and Azure-Func-2019 at 30/60 min, but the actual evaluated configurations in Table 3 are Azure-VM-2019 at 30m/60m and Azure-Func-2019 at 5m/10m — the granularities reported in the two tables do not match.)
Accuracy metric: MAPE (Mean Absolute Percentage Error), 100 × (1/n) Σ |(ŷᵢ − yᵢ)/yᵢ|; lower is better. Overhead metric: average inference time (ms).
Baseline: LoadDynamics (Jayakumar et al. 2020), the state-of-the-art LSTM forecaster, run in its “brute-force” grid-search form on the same RTX 2080 Ti.
Headline accuracy (Table 3): WGAN-gp Transformer wins on most of the 15 configs, up to 5.1% lower MAPE (an absolute gap in MAPE points), with the largest gains on dynamic data-center/Azure-VM traces, e.g. Facebook-5m 47.20→42.11 (a 5.09-point gap), Alibaba-2018-5m 17.95→15.76, Azure-VM-2017-60m 16.11→12.77, Azure-VM-2019-30m 19.74→15.19 (LoadDynamics→WGAN-gp). It loses on the strongly seasonal cases (see Limitations).
Inference speed (Figure 4): ~5× faster — average 4.85 ms vs 25.57 ms for LoadDynamics, because the Transformer processes the whole window in one pass rather than step-by-step.
Optimizer ablation (Figure 3): across six configs (FB-10m, Alibaba-2018-5m, Google-10m, Azure-VM-2017-60m, Azure-VM-2019-30m/-60m), MADGRAD beats Adam (β₁=0, β₂=0.9, lr=0.0001) substantially — the authors call the optimizer “the critical factor.”
Live autoscaling on Google Cloud (Table 4): e2-medium VMs, running CloudSuite Data Analytics (MapReduce) for Facebook and In-Memory Analytics (Apache Spark) for Azure-2019. Compared with LoadDynamics, WGAN-gp cut Facebook under-provisioning by 27.95 percentage points (40.22%→12.27%), though Facebook over-provisioning rose (10.33%→16.13%), and reduced Azure-VM-2019 under- and over-provisioning by 2.56 and 1.92 percentage points (9.63%→7.07% and 8.60%→6.68%).
Code: publicly available (github.com/shivaniarbat/wgan-gp-transformer).
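The MAPE metric used throughout the evaluation is straightforward to compute. The sketch below follows the formula quoted in the accuracy-metric line above; the function name and sample values are illustrative.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent (lower is better)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    # Each term is undefined when the actual value is 0; intervals with
    # zero arrivals would need separate handling.
    total = sum(abs((p - a) / a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


actual = [100, 200, 400]      # true job counts per interval
predicted = [110, 180, 400]   # forecast job counts
score = mape(actual, predicted)
```

Because every error is divided by the actual value, a miss of 10 jobs costs far more on a quiet interval than on a busy one, which is worth remembering when comparing MAPE across traces with very different scales (Facebook in the hundreds, Google in the millions).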

Training & pre-training¶

Trained from scratch (adversarially), per workload — no pretrained or foundation model.

The WGAN-gp Transformer is trained from random initialization, and a separate model is fit for every workload trace — there is no pretraining, fine-tuning, foundation model, or zero-shot transfer. Each trace gets its own trained-and-tuned model. Training is the adversarial WGAN-gp loop of Algorithm 1 — a Transformer generator against a 3-layer MLP critic, where the critic is updated n_critic = 5 times per generator step, the generator loss combines MAE (L1) with the critic’s score, and the gradient-penalty coefficient is λ = 10. The optimizer is MADGRAD (not Adam) — learning rate 0.001, momentum 0.9, eps 1e-6, run for 1000 epochs — and an ablation pins MADGRAD as the key accuracy driver over an Adam variant. The setup is univariate one-step-ahead (τ = 1) on the job-arrival-rate series, split chronologically 60/20/20 train/CV/test with a sliding window (stride 1), with a per-workload grid search over history length n, batch size m, d_model, and n_head.

From scratch, per trace: random init, one model per workload; no cross-workload transfer is claimed or demonstrated.
Stack & hardware: PyTorch + scikit-learn, single RTX 2080 Ti.
Traces trained on: Facebook, Alibaba-2018, Google, Wikipedia, and Azure (VM-2017, VM-2019, Func-2019).

Strengths¶

Faster inference (~5x): ~4.85 ms vs ~25.57 ms for the LSTM baseline, because the Transformer processes the whole window in one pass instead of step-by-step. Important for a predictor that must run every interval.
More accurate on bursty workloads: up to 5.1% lower MAPE (Mean Absolute Percentage Error) than LoadDynamics across most of the 15 workload configurations, especially on highly dynamic data-center and Azure-VM traces (Facebook, Google, Azure VM).
Stable adversarial training: WGAN-gp’s gradient penalty avoids the instability of plain GANs and the weight-clipping problems of vanilla WGAN.
MADGRAD matters, and the paper shows it: an ablation (Figure 3) confirms MADGRAD beats Adam for this task, and the authors identify the optimizer as the critical accuracy factor.
Real-world validation: deployed on Google Cloud, it cut Facebook-workload under-provisioning by 27.95 percentage points (40.22% → 12.27%) and improved Azure-VM-2019 under- and over-provisioning by ~2.56% and ~1.92% (though Facebook over-provisioning rose, 10.33% → 16.13%).
Broad evaluation: 7 real workloads (Facebook, Alibaba, Google, Wikipedia, three Azure traces) across 15 interval configurations; code is public.
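The accuracy metric and the headline numbers in the list above can be checked with a few lines of Python. The `mape` helper and the toy `actual`/`forecast` values are illustrative assumptions; only the latency and provisioning figures come from the summary above.

```python
def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent.

    Assumes equal-length inputs and no zero in `actual`.
    """
    if len(actual) != len(forecast):
        raise ValueError("actual and forecast must have the same length")
    errors = [abs(a - f) / abs(a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(errors) / len(errors)


# Toy data (made up): per-step errors are 10%, 10%, and 0%.
actual = [100, 200, 400]
forecast = [110, 180, 400]
score = mape(actual, forecast)

# Figures quoted above: inference latency and Facebook under-provisioning.
speedup = 25.57 / 4.85                               # LSTM ms / Transformer ms
under_drop_points = 40.22 - 12.27                    # percentage points
under_drop_relative = 100.0 * under_drop_points / 40.22  # relative reduction, %
```

With the quoted figures, the latency ratio works out to roughly 5.3x, and the Facebook under-provisioning drop is 27.95 percentage points, or about a 69% relative reduction.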

Limitations

Univariate only: uses just the job-count series. It ignores CPU, memory, latency, and other signals that could sharpen forecasts; no multivariate inputs.
One-step-ahead only (τ = 1): forecasts a single next interval. Longer multi-step horizons aren’t demonstrated, which can matter if VM startup spans more than one interval.
Weaker on strongly seasonal workloads: the paper’s text says it is less accurate than the LSTM on three strongly periodic cases (Wiki-10m, Azure-Func-2019-5m, and Azure-Func-2019-10m), where the LSTM’s memory better stores repeating patterns, so it is not a universal winner. FLAG: the text contradicts Table 3. By the numbers, WGAN-gp actually beats LoadDynamics on Azure-Func-2019-10m (1.85 vs 2.06) and loses on Wiki-30m (3.43 vs 1.75), a configuration the text omits. Per Table 3, the configurations where WGAN-gp loses are therefore {Wiki-10m, Wiki-30m, Azure-Func-2019-5m}, not the three the text names.
Over-provisioning can worsen: the headline autoscaling win is on under-provisioning; on the Facebook workload over-provisioning actually rose (10.33% → 16.13%), so the deployment trade-off is not uniformly positive.
Oversimplified scaling model: the autoscaling experiment assumes one job = one VM, which sidesteps real resource-allocation complexity (jobs of different sizes, packing multiple jobs per VM, vertical scaling, cost models).
Prediction-only contribution: the Plan/Execute side is minimal; integrating with sophisticated schedulers/controllers (e.g. Kubernetes HPA, cost-aware planners) is left open.
Training cost & tuning: adversarial training for 1000 epochs plus per-workload grid search is heavier to set up than a single straightforward regressor (though inference is cheap).
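To make the "one job = one VM" simplification concrete, here is a minimal sketch of how over- and under-provisioning could be scored from a forecast under that assumption. The rate definition (surplus or missing VMs as a share of total actual demand) is a hypothetical choice for illustration; the paper's exact definition is not reproduced here, and the job counts are made up.

```python
def provisioning_rates(actual_jobs, predicted_jobs):
    """Return (over_rate, under_rate) in percent of total actual demand.

    Assumes one job needs exactly one VM, so the forecast job count is
    the number of VMs booted for that interval.
    """
    pairs = list(zip(actual_jobs, predicted_jobs))
    over = sum(max(p - a, 0) for a, p in pairs)   # idle VMs (wasted cost)
    under = sum(max(a - p, 0) for a, p in pairs)  # missing VMs (unserved jobs)
    total = sum(actual_jobs)
    return 100.0 * over / total, 100.0 * under / total


# Hypothetical per-interval job counts and forecasts.
actual_jobs = [10, 20, 30, 40]
predicted_jobs = [12, 18, 30, 35]
over_rate, under_rate = provisioning_rates(actual_jobs, predicted_jobs)
```

A forecaster that is biased upward trades under-provisioning for over-provisioning, which matches the Facebook result above where one rate fell while the other rose. Real deployments would also need job sizes, packing, and cost, which this model ignores.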

Glossary

Autoscaling: automatically adding/removing cloud resources to match demand. Scale-out/in = add/remove instances (horizontal). Scale-up/down = resize an instance’s CPU/RAM (vertical).
Reactive vs. proactive autoscaling: react after load crosses a threshold (always lagging) vs. forecast demand and scale ahead of time.
Provisioning delay / VM startup time: the lag between requesting a VM and it being ready — the gap proactive scaling tries to hide.
Over-/under-provisioning: too many idle VMs (wasted cost) vs. too few VMs (slow app, SLA violations).
Job Arrival Rate (JAR): jobs/requests arriving per interval — the quantity forecast here.
Time series (univariate): a single value measured at regular time steps.
Lookback/history window (n): number of past steps fed to the model.
Transformer: attention-based neural network that reads a whole sequence at once; has an encoder (reads input) and decoder (produces output).
Attention / multi-head attention: mechanism that weighs which past time steps matter most for the prediction; “multi-head” runs several such weightings in parallel.
Positional encoding: added signal telling the order-agnostic Transformer when each value occurred.
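GAN (Generative Adversarial Network): a generator and a discriminator/critic trained against each other; the generator produces samples and the discriminator/critic judges them.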
WGAN-gp: a stable GAN variant using Wasserstein distance (earth-mover’s distance between distributions) with a gradient penalty to keep the critic well-behaved (1-Lipschitz).
Critic: the judging network (here a 3-layer MLP) estimating Wasserstein distance instead of a yes/no real-vs-fake label.
MLP (Multi-Layer Perceptron): a basic fully-connected neural network.
LSTM (Long Short-Term Memory): a recurrent network that reads sequences step-by-step; the prior state-of-the-art baseline (in the LoadDynamics system).
MADGRAD (Momentumized, Adaptive, Dual Averaged Gradient method): the optimizer used here; it outperformed Adam for this task.
MAPE (metric): Mean Absolute Percentage Error — average % gap between forecast and actual; lower is better. (Not to be confused with the MAPE-K loop.)
MAPE-K loop: Monitor–Analyze–Plan–Execute over Knowledge; the reference model for self-adaptive systems. The workload predictor in this paper sits in the Analyze stage.
MapReduce / Apache Spark: big-data processing frameworks used in the cloud benchmarks (CloudSuite Data Analytics, In-Memory Analytics) during the Google Cloud autoscaling experiment.
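The attention entry in the glossary above can be made concrete with a single-head, scaled dot-product example in plain Python. This is a generic textbook sketch, not the paper's model: the function names and the tiny key, value, and query vectors are made up, and a real Transformer learns these vectors through projection matrices.

```python
import math


def attention_weights(query, keys):
    """Softmax over scaled dot products: one weight per past time step."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Weighted average of the values, using the attention weights."""
    weights = attention_weights(query, keys)
    return sum(w * v for w, v in zip(weights, values))


# Three past time steps, each with a 2-d key and a scalar value (all made up).
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [10.0, 20.0, 30.0]
query = [1.0, 1.0]

weights = attention_weights(query, keys)
context = attend(query, keys, values)
```

The third time step's key aligns best with the query, so it receives the largest weight. Multi-head attention runs several such weightings in parallel, each with its own learned projections.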

References

Arbat, S., Jayakumar, V. K., Lee, J., Wang, W., & Kim, I. K. (2022). Wasserstein Adversarial Transformer for Cloud Workload Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 36(11), 12433–12439. https://doi.org/10.1609/aaai.v36i11.21509