
WGAN-gp

An adversarially-trained Wasserstein-GAN transformer for fast forecasting of bursty job-arrival workloads

Arbat et al. (2022)


TL;DR

Cloud applications run on rented machines (VMs). If you rent too few, the app gets overwhelmed and slow; rent too many and you waste money. The smart move is to forecast how much traffic is coming and rent machines ahead of time. This paper builds a forecaster that predicts the number of incoming jobs/requests in the next time interval from the recent history of job counts. The new idea is to combine two modern AI techniques: a Transformer and an adversarial training trick called WGAN-gp (one network tries to forecast, a second “critic” network judges how realistic the forecast looks, and they train against each other). The result, WGAN-gp Transformer, predicts cloud workloads up to 5.1% more accurately and 5x faster than the previous best method (an LSTM-based system called LoadDynamics), and when plugged into a real Google Cloud autoscaler it cut both over-provisioning (wasted machines) and under-provisioning (too few machines) substantially.


The Problem (and why simple autoscaling isn’t enough)

Autoscaling is the cloud feature that automatically adds machines when an app gets busy (“scale-out”) and removes them when it goes quiet (“scale-in”). The goal is to always have just enough capacity.

The naive way to autoscale is reactive: watch the current load, and once it crosses a threshold (say, CPU > 70%), start booting more machines. The fatal flaw is timing. Booting a VM takes time (the “startup delay” or cold start). By the time the new machine is ready, the traffic spike may have already caused slow responses or dropped requests. Reactive autoscaling is always running a step behind.

The fix is predictive (proactive) autoscaling: forecast the demand for the next interval and start booting machines before the spike hits, so they’re warm and ready in time. A predictive autoscaler has two parts:

  1. A workload predictor — forecasts future demand (here, the job arrival rate, i.e. how many jobs/requests will show up).

  2. An autoscaler — turns that forecast into a decision about how many VMs to run.

This paper is about building a better workload predictor. The state of the art at the time was LoadDynamics, an LSTM-based forecaster. LSTMs have two weaknesses the authors target:


Background

A few terms, defined once:


Contribution in Simple Terms

The paper’s contribution is a new time-series forecaster, WGAN-gp Transformer, purpose-built for predicting bursty cloud workloads, that is both more accurate and far faster than the prior LSTM state of the art. Concretely, what’s genuinely new:

  1. It uses a Transformer instead of an LSTM as the forecaster. Because the Transformer processes the whole history in one shot (no step-by-step recurrence), inference is ~5x faster, and its attention mechanism captures the relevant past moments for predicting sudden spikes.

  2. It trains that Transformer adversarially with WGAN-gp. The Transformer is the generator; a small MLP critic judges how realistic its forecasts look. This adversarial pressure makes forecasts of dynamic, bursty workloads more accurate than training the Transformer alone. (Prior adversarial-Transformer work, “Adversarial Sparse Transformer,” used a sparse attention that tended to lose long-term info; this paper uses standard attention + WGAN-gp to keep that long-range information.)

  3. It plugs MADGRAD in as the optimizer for both generator and critic, and shows experimentally that this — not just the architecture — is what unlocks the accuracy improvement over an Adam-trained version.

  4. It validates the predictor inside a real autoscaler on Google Cloud, not just on offline error metrics — showing the better forecasts translate into fewer over- and under-provisioned VMs in practice.

In one sentence: take a Transformer, train it as the generator in a stable WGAN-gp adversarial setup with the MADGRAD optimizer, and you get a fast, accurate cloud-workload forecaster that improves real autoscaling.


How It Works, Step by Step

Training (offline):

  1. Collect & format data. Take a single stream of job counts over time (univariate series). Split chronologically into 60% train / 20% validation / 20% test.

  2. Slide a window. Use a sliding window of history length n (stride of 1 time step). Each window of past values x[1..t0] is an input; the very next value x[t0+1] is the target. (Prediction range τ = 1 — i.e. one-step-ahead forecasting.)

  3. Add positional encoding to the input so the Transformer knows the order of the history. Positional encoding is also applied to the decoder input, which is the last value of the history window.

  4. Generator forecasts. The Transformer generator (one encoder layer + one decoder layer, with multi-head attention) reads the history. The encoder compresses it into a latent “memory” vector h; the decoder uses h to produce the predicted next value x_hat[t0+1].

  5. Critic judges. The MLP critic (3 fully-connected layers, LeakyReLU activations) is fed two things: the real full sequence (history + actual future) and the fake one (history + the generator’s forecast, concatenated). It outputs a score approximating the Wasserstein distance between real and generated, plus the gradient penalty term that keeps it 1-Lipschitz (penalty coefficient λ = 10).

  6. Adversarial loop (Algorithm 1). Each round: update the critic 5 times (ncritic = 5) so it becomes a good judge, then update the generator once to fool the critic and minimize forecast error. The generator’s loss combines (a) mean absolute error between forecast and truth and (b) the critic’s score — so it’s pushed to be both accurate and realistic. Both networks are optimized with MADGRAD (learning rate 0.001), trained for 1000 epochs.

  7. Hyperparameter search. A grid search tunes history length n, batch size m, model dimension dmodel, and number of attention heads nhead per workload.
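The data-preparation steps above (chronological split, then a stride-1 sliding window with τ = 1) can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names and the toy job counts are hypothetical, and windows are cut only from the training portion for brevity.

```python
from typing import List, Tuple


def chronological_split(series: List[float],
                        train_frac: float = 0.6,
                        val_frac: float = 0.2):
    """Split a series into train/validation/test in time order (no shuffling)."""
    n_train = int(round(len(series) * train_frac))
    n_val = int(round(len(series) * val_frac))
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test


def sliding_windows(series: List[float], n: int) -> List[Tuple[List[float], float]]:
    """Return (history, target) pairs: n past values and the single next value."""
    pairs = []
    for start in range(len(series) - n):  # stride of one time step
        history = series[start:start + n]
        target = series[start + n]        # one-step-ahead target (tau = 1)
        pairs.append((history, target))
    return pairs


# Toy job counts per interval (made-up numbers, not from any trace).
job_counts = [12, 15, 11, 30, 42, 18, 16, 14, 55, 60]
train, val, test = chronological_split(job_counts)
pairs = sliding_windows(train, n=3)
```

With a history length of 3, the six training values yield three overlapping (history, target) pairs; in the paper, n is grid-searched per workload rather than fixed.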

Inference (online, per interval):

  1. Feed the most recent n job counts (with positional encoding) through the generator only. Because the Transformer reads the whole window at once (no recurrence), it returns the next-step forecast in a single fast pass (~4.85 ms vs ~25.57 ms for the LSTM baseline).

Using it to autoscale (the deployment experiment):

  1. The forecast Pi = predicted number of jobs arriving in interval i. Under the paper’s assumption that one job needs one VM, the autoscaler pre-creates Pi VMs before interval i begins.

  2. Compare to actual arrivals Ti: if Ti > Pi, the application is under-provisioned (it needs more VMs and incurs startup delay); if Ti < Pi, it is over-provisioned (idle VMs waste money). Better forecasts shrink both error types.
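The comparison between Pi and Ti can be made concrete with a short tally. This is a sketch under the paper's one-job-one-VM assumption; the function name and the sample numbers are hypothetical, and the paper's exact reporting of these quantities may differ.

```python
from typing import List, Tuple


def provisioning_errors(predicted: List[int], actual: List[int]) -> Tuple[int, int]:
    """Total over- and under-provisioned VMs, assuming one job needs one VM."""
    over = 0   # idle VMs that were created but not needed (T_i < P_i)
    under = 0  # VMs that were missing when jobs arrived (T_i > P_i)
    for p_i, t_i in zip(predicted, actual):
        if t_i > p_i:
            under += t_i - p_i
        else:
            over += p_i - t_i
    return over, under


# Made-up forecasts and arrivals for three intervals.
predicted = [10, 20, 15]
actual = [12, 18, 15]
over, under = provisioning_errors(predicted, actual)
```

Here the first interval is short by two VMs, the second has two idle VMs, and the third is provisioned exactly.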


Workload Modeling & Prediction Pipeline (full-text deep read)

This section drills into exactly two things: (A) what the paper treats as “the workload” and how it turns that into numbers a neural network can read, and (B) the precise path from raw data to a prediction.

(A) How the workload is modeled and characterized

What real-world quantity is “the workload”? It is the Job Arrival Rate (JAR): how many jobs (or user requests) arrive at the cloud application during one fixed time interval. The paper says outright that it “use[s] the terms workloads, user requests, and job arrival rates interchangeably.” So the workload is a count of incoming jobs per interval, not CPU%, not memory, not latency, not a performance-degradation ratio. (Note: the paper is framed around VM auto-scaling, with a title ending in “Cloud Workload Prediction” and repeated discussion of provisioning/de-provisioning VMs, but the actual signal it models and forecasts is purely the job/request count, never VM utilization metrics like CPU or memory.)

One signal or many? Univariate. The paper is explicit and formal: “A univariate time-series is defined as a sequence of measurements of the same variable collected over time. We study univariate time-series data of JARs.” Univariate means there is exactly one number per time step — the job count — and nothing else (no parallel CPU, memory, or disk channels). Formally the series is x = [x1, x2, ..., xT] where each xt is a single real number, the job count at time t.

Time granularity (how often it is sampled). This varies by dataset, and the authors deliberately test several granularities because “different time granularities can exhibit subtle variations in the time-series workload patterns.” From Table 1:

So one job-count sample is recorded every 5 to 60 minutes depending on the trace. Across 7 real workloads and these interval choices, the authors build 15 distinct workload configurations.

What the raw data looks like, and what makes it hard. Figure 1 plots four of the traces as plain line charts of job counts over interval index — e.g. Google Cluster job counts (in millions, ~0 to 2.5x10^6) over ~1200 thirty-minute intervals; Facebook (0 to ~150 jobs) over ~120 ten-minute intervals; Azure-VM-2017 (0 to ~6000 jobs) over ~600 sixty-minute intervals; Wikipedia (~2 to 6.5 million) over ~500 thirty-minute intervals. The hard parts the paper calls out:

How the workload is represented numerically. Three concrete moves, and notably no frequency transform or explicit trend/seasonal/residual decomposition:

  1. Sliding window into fixed-length sequences. The long raw series x1:T is sliced into N overlapping sub-sequences of length S (the paper writes the training set as X ∈ R^{N×S}). A window of history x[1..t0] is the input; the next value x[t0+1] is what to predict. The window slides forward with a stride of one time step.

  2. History length n is a tuned hyperparameter, not a fixed constant — search ranges per workload run from [3-46] (Facebook) up to [28-676] (Google) — see Table 2.

  3. Positional encoding is added to each input value so the order-agnostic Transformer (which reads everything at once) knows when each job-count occurred. The paper does not describe any extra normalization scheme, embedding lookup, patching, or trend/seasonal decomposition beyond windowing + positional encoding. Internally the encoder turns the window into a latent “memory” vector, but the input representation is just: a window of raw job counts + positional encodings.
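To illustrate the positional-encoding step, the sketch below builds the sinusoidal table from Vaswani et al. (2017). This is an assumption: the paper cites that work but does not state which form it uses, and it does not say how a scalar job count is lifted to d_model features, so the code stops at the encoding table itself. The function name is hypothetical.

```python
import math
from typing import List


def sinusoidal_positional_encoding(length: int, d_model: int) -> List[List[float]]:
    """Return a length x d_model table of sinusoidal position codes."""
    table = []
    for pos in range(length):
        row = []
        for k in range(d_model):
            # Dimensions 2i and 2i+1 share one frequency: 1 / 10000^(2i / d_model).
            angle = pos / (10000 ** ((2 * (k // 2)) / d_model))
            # Even dimensions use sine, odd dimensions use cosine.
            row.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


# One row per position in a history window of 4 values, with d_model = 8.
pe = sinusoidal_positional_encoding(length=4, d_model=8)
```

Each row is a distinct pattern, which is what lets an order-agnostic attention layer tell the oldest job count in the window from the newest.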

(B) The step-by-step prediction pipeline (raw signal -> prediction result)

The detailed training/inference mechanics are already given above in “How It Works, Step by Step.” Here is the tight raw-signal-to-result walkthrough, with the numbers the paper states:

  1. Collect the raw workload. Obtain one trace’s stream of job counts at a fixed interval (5-60 min). This is a single univariate time series of length T.

  2. Split chronologically. First 60% for training, next 20% for cross-validation, last 20% for testing — kept in time order (no shuffling across the split boundaries), because it is forecasting.

  3. Window the series. Apply the sliding window (stride 1) to cut the series into input/target pairs: input = the previous n job counts x[1..t0]; target = the single next job count x[t0+1]. The paper fixes the prediction range to τ = 1, i.e. strictly one step ahead.

  4. Encode position and feed the generator. Add positional encoding to the window and pass it to the Transformer generator — a compact one encoder layer + one decoder layer with multi-head attention (number of heads nhead ∈ {4, 8}; model width dmodel ∈ {8,16,32,64,128,512}, both grid-searched). Why a Transformer: attention can directly relate any past moment to the prediction (good for catching the run-up to a spike) and processes the whole window in a single pass (fast).

  5. Encoder -> latent memory. The encoder compresses the history window into a latent vector h[1..t0]. Why: this h is the condensed “what the recent past looks like” that the decoder reads from.

  6. Decoder -> raw forecast. The decoder, seeded with the last value of the history window (x[t0]) plus its own positional encoding and the encoder memory h, emits the predicted next value x_hat[t0+1]. Why this stage exists: it converts the encoded context into the actual forecast number.

  7. (Training only) Critic judges realism. A 3-layer MLP critic (LeakyReLU, α=0.2) scores the real full sequence x[1..S] against the generated one [x[1..t0] ⊕ x_hat[t0+1..S]] (history concatenated with forecast), approximating the Wasserstein distance between real and generated workloads, plus a gradient penalty (λ=10) keeping it 1-Lipschitz. Why: this adversarial pressure makes forecasts of bursty data look like plausible real continuations, not just least-squares-flat guesses.

  8. (Training only) Adversarial update loop. Per round: update the critic ncritic = 5 times, then the generator once; the generator’s loss = mean absolute error (L1) to the true next value minus the critic’s score on the generated sequence (Equation 4), so lowering the loss means both shrinking the forecast error and raising the critic’s score. Optimizer is MADGRAD (learning rate 0.001, momentum 0.9), run 1000 epochs. Why MADGRAD: an ablation (Figure 3) shows it, not the architecture alone, is the key accuracy driver versus Adam.

  9. Inference (online, the part that runs each interval). Use the generator only: feed the most recent n job counts (with positional encoding) in one forward pass and read out the single predicted job count for the next interval. This is a point forecast (one number), not a probabilistic interval, and the unit is jobs/requests per interval. Measured latency: ~4.85 ms (vs ~25.57 ms for the LSTM baseline).

  10. Prediction result. The output is P_i = the predicted number of jobs arriving in the next interval i, in raw job-count units, one interval (5-60 min) ahead. (How it’s consumed downstream, briefly: under the paper’s “1 job = 1 VM” assumption, the autoscaler pre-creates P_i VMs before interval i; accuracy is scored with MAPE, and on Google Cloud the better forecasts cut under-/over-provisioning. The pipeline’s job ends at producing P_i.)
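The two training objectives from steps 7 and 8 can be written out as below. This is a sketch in the standard WGAN-gp form, written to match how this summary describes Equations 4 and 5; it is not a verbatim copy of the paper's equations. The symbols D (critic), m (batch size), and the interpolation variables ε and x̃ are notation assumed here. The sign of the critic term follows the usual WGAN convention, in which the generator is rewarded for raising the critic's score on its own output.

```latex
\begin{aligned}
\mathcal{L}_G &= \frac{1}{m}\sum_{i=1}^{m}\bigl|x_{i,t_0+1} - \hat{x}_{i,t_0+1}\bigr|
  \;-\; \frac{1}{m}\sum_{i=1}^{m} D\bigl(x_{i,1:t_0} \oplus \hat{x}_{i,t_0+1}\bigr) \\[4pt]
\mathcal{L}_D &= \frac{1}{m}\sum_{i=1}^{m}\Bigl[D\bigl(x_{i,1:t_0} \oplus \hat{x}_{i,t_0+1}\bigr) - D\bigl(x_{i,1:S}\bigr)\Bigr]
  \;+\; \lambda\,\frac{1}{m}\sum_{i=1}^{m}\Bigl(\bigl\lVert\nabla_{\tilde{x}_i} D(\tilde{x}_i)\bigr\rVert_2 - 1\Bigr)^{2},
  \qquad \lambda = 10 \\[4pt]
\tilde{x}_i &= \epsilon\, x_{i,1:S} + (1-\epsilon)\,\bigl(x_{i,1:t_0} \oplus \hat{x}_{i,t_0+1}\bigr),
  \qquad \epsilon \sim U[0,1]
\end{aligned}
```

The first term of the generator loss is the L1 forecast error; the penalty term in the critic loss pushes the critic's gradient norm toward 1 at points between real and generated sequences, which is the soft 1-Lipschitz constraint.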


Model Parameters & How They Were Chosen

This section consolidates every hyperparameter the paper reports for the WGAN-gp Transformer and states, for each, how the value was selected.

(A) What are the model parameters?

Architecture. The model has two networks: a Transformer generator (the forecaster) and an MLP critic (the adversarial judge). “Encoder/decoder layers” are the stacked attention blocks; “attention heads” are parallel attention computations; d_model is the width of the internal feature vectors; “gradient penalty” is the soft constraint that keeps the critic mathematically well-behaved.

| Parameter | Value | Notes |
| --- | --- | --- |
| Generator encoder layers | 1 | “one layer of an encoder” |
| Generator decoder layers | 1 | “subsequent one layer of a decoder” |
| Attention heads (n_head) | grid-searched over {4, 8} | same value used in encoder and decoder |
| Model dimension (d_model) | grid-searched over {8, 16, 32, 64, 128, 512} | number of input features for encoder and decoder; the critic’s linear-layer width is set to the same value |
| Feed-forward dimension | not reported | the encoder/decoder contain a “Feed Forward” sublayer (Figure 2), but its width is not stated |
| Dropout | not reported | |
| Positional encoding | per Vaswani et al. (2017); specific form (e.g. sinusoidal) not stated | applied to both the encoder input and the decoder input (the decoder is seeded with the last value of the history window, x_{i,t0}) |
| Decoder attention | masked multi-head attention, then a second multi-head attention over the encoder output | per Figure 2 (standard Transformer decoder); the paper does not name the second block |
| Embedding dimension | not reported separately | inputs are raw scalar job counts plus positional encoding; no learned token embedding table is described |
| Critic structure | 3 fully connected (linear) layers | a multi-layer perceptron (MLP) |
| Critic activation | LeakyReLU, slope α = 0.2 | f(x) = max(αx, x) |
| Critic linear-layer width | equal to d_model | stated explicitly |
| Gradient-penalty coefficient (λ) | 10 | the WGAN-gp default; keeps the critic 1-Lipschitz |
| Critic iterations per generator step (n_critic) | 5 | critic updated 5 times, then generator once |
| Convolution kernel sizes / RevIN / ProbSparse / FFT settings / diffusion timesteps / similarity matrices / output classes | not applicable | the model uses standard (dense) attention, not sparse attention, convolutions, frequency transforms, or a classification head; it is a single-value regressor |
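To show how the critic settings fit together (three linear layers, LeakyReLU with α = 0.2, penalty coefficient λ = 10), here is a dependency-free sketch of a critic score and a single-sample gradient penalty. Several things are assumptions made for illustration: the weights are random, biases are omitted, and the input gradient is estimated by central finite differences rather than the automatic differentiation a PyTorch implementation would use. All function names are hypothetical.

```python
import math
import random
from typing import List

ALPHA = 0.2     # LeakyReLU slope reported for the critic
LAMBDA = 10.0   # gradient-penalty coefficient


def leaky_relu(v: float) -> float:
    """f(x) = max(alpha * x, x)."""
    return max(ALPHA * v, v)


def make_critic(seq_len: int, width: int, seed: int = 0) -> List[List[List[float]]]:
    """Random weights for three linear layers: seq_len -> width -> width -> 1."""
    rng = random.Random(seed)
    shapes = [(width, seq_len), (width, width), (1, width)]
    return [[[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]
            for rows, cols in shapes]


def critic_score(layers, x: List[float]) -> float:
    """Forward pass; LeakyReLU after every layer except the final score."""
    h = list(x)
    for idx, weights in enumerate(layers):
        h = [sum(w * v for w, v in zip(row, h)) for row in weights]
        if idx < len(layers) - 1:
            h = [leaky_relu(v) for v in h]
    return h[0]


def gradient_penalty(layers, real: List[float], fake: List[float],
                     eps: float, step: float = 1e-5) -> float:
    """lambda * (||grad D(x_tilde)||_2 - 1)^2 at a point between real and fake."""
    x_tilde = [eps * r + (1.0 - eps) * f for r, f in zip(real, fake)]
    grad = []
    for j in range(len(x_tilde)):
        up = list(x_tilde)
        down = list(x_tilde)
        up[j] += step
        down[j] -= step
        # Central finite difference stands in for autograd (an approximation).
        grad.append((critic_score(layers, up) - critic_score(layers, down)) / (2 * step))
    norm = math.sqrt(sum(g * g for g in grad))
    return LAMBDA * (norm - 1.0) ** 2


history = [12.0, 15.0, 11.0]
real = history + [30.0]   # history + actual next value
fake = history + [26.5]   # history + generator forecast (made-up number)
critic = make_critic(seq_len=4, width=8)
penalty = gradient_penalty(critic, real, fake, eps=0.5)
```

The penalty is zero only when the critic's input gradient has norm exactly 1 at the interpolated point, which is how the soft constraint steers the critic toward being 1-Lipschitz.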

Training.

| Parameter | Value | Notes |
| --- | --- | --- |
| Optimizer | MADGRAD | used for both generator and critic; chosen over Adam |
| Learning rate (α) | 0.001 | same for generator and critic |
| Learning-rate schedule | not reported | no decay/warmup described |
| MADGRAD momentum | 0.9 | “default configurations” |
| MADGRAD weight decay | 0 | |
| MADGRAD epsilon | 1e-6 | |
| Batch size (m) | grid-searched; ranges [16–256] (Facebook) to [16–1024] (Alibaba, Google, Wiki, Azure-VM) and [16–512] (Azure-Func) | per Table 2 |
| Epochs | 1000 | fixed; “works well for our proposed method” |
| Early stopping | not reported | a fixed 1000-epoch budget is used instead |
| Generator loss | mean absolute error (L1) to the true future, minus the critic’s score | Equation 4 |
| Critic loss | Wasserstein term (generated minus real critic scores) plus λ × gradient penalty | Equation 5 |
| Loss weights | only λ = 10 on the gradient penalty; the L1 and critic terms in the generator loss are unweighted (coefficient 1) | per Equations 4–5 |
| Hardware | single NVIDIA GeForce RTX 2080 Ti GPU | |
| Software stack | PyTorch + scikit-learn | |

Adam comparison settings (ablation only). For the optimizer ablation (Figure 3), the Adam variant used β1 = 0, β2 = 0.9, and learning rate 0.0001 for both generator and critic. These are not the deployed model’s settings.

Data / windowing.

| Parameter | Value | Notes |
| --- | --- | --- |
| Lookback / history length (n) | grid-searched per workload: [3–46] Facebook, [20–324] Alibaba-2018, [28–676] Google, [12–274] Wikipedia, [14–682] Azure-VM-2017, [14–230] Azure-VM-2019, [7–108] Azure-Func-2019 | per Table 2 |
| Forecast horizon / prediction range (τ) | 1 (one step ahead) | fixed |
| Sliding-window stride | 1 time step | |
| Sampling interval | 5, 10, 30, or 60 min depending on dataset | 7 workloads, 15 configurations (Table 1) |
| Train / validation / test split | 60% / 20% / 20%, chronological (no shuffling) | first 60% train, next 20% cross-validation, last 20% test |

Total parameter count / model size. Not reported. The paper gives no parameter count or memory footprint; it reports only inference latency (4.85 ms average for WGAN-gp Transformer vs 25.57 ms for the LSTM baseline).

(B) How were the parameters chosen?

Grid search (per workload). The four primary hyperparameters, history length n, batch size m, model dimension dmodel, and number of attention heads nhead, were selected by an “effective grid search” over the explicit ranges in Table 2 (reproduced above), run separately for each of the workloads. The search spaces differ by trace (for example, history length spans [3–46] for Facebook but [28–676] for Google), reflecting the authors’ position that no single configuration fits all 15 workload settings and that a separate model is tuned per trace. The paper does not state which specific value within each range was chosen for each workload, nor whether the search was exhaustive over a discrete product grid or coarser.

Cross-validation. The middle 20% split is explicitly designated for cross-validation, the basis on which grid-search candidates are compared. Selection is by Mean Absolute Percentage Error (MAPE), the paper’s accuracy metric.
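Selection by validation MAPE can be sketched as follows. The MAPE formula is the standard one; everything else is illustrative: the function names, the (d_model, n_head) candidate keys, and the forecasts attached to them are made up, and in practice each candidate's forecasts would come from a model trained with that configuration.

```python
from typing import Dict, List, Tuple


def mape(actual: List[float], predicted: List[float]) -> float:
    """Mean Absolute Percentage Error in percent; assumes no actual value is zero."""
    terms = [abs((a - p) / a) for a, p in zip(actual, predicted)]
    return 100.0 * sum(terms) / len(terms)


def select_best(val_actual: List[float],
                candidates: Dict[Tuple[int, int], List[float]]):
    """Return the configuration with the lowest validation MAPE and its score."""
    scores = {cfg: mape(val_actual, preds) for cfg, preds in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores[best]


# Made-up validation targets and forecasts for two hypothetical configurations,
# keyed by (d_model, n_head).
val_actual = [100.0, 200.0, 400.0]
candidates = {
    (8, 4): [110.0, 180.0, 400.0],    # per-step errors: 10%, 10%, 0%
    (16, 8): [100.0, 150.0, 300.0],   # per-step errors: 0%, 25%, 25%
}
best_cfg, best_score = select_best(val_actual, candidates)
```

Because MAPE divides by the actual value, intervals with zero arrivals need special handling; this sketch simply assumes they do not occur.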

Taken from prior work / defaults (not searched).

Chosen by ablation / sensitivity study.

Fixed by judgment, not searched.

Unspecified. Feed-forward dimension, dropout, embedding dimension, learning-rate schedule, weight initialization, and any early-stopping rule are not reported, and no selection method is given for them.


Inputs (what it consumes)


Outputs (what it produces)


How It Fits the Autoscaling Framework (MAPE-K)

The MAPE-K loop is the standard blueprint for self-managing systems: Monitor → Analyze → Plan → Execute over shared Knowledge.

Analogy: think of an online store the night before a big sale. A reactive system only adds servers once checkout requests are already piling up and customers are seeing spinning wheels. The WGAN-gp Transformer is like a planner who studies recent traffic, predicts tomorrow’s surge, and spins up the extra servers in advance so they’re warm when shoppers arrive — while not leaving expensive servers running idle afterward.


Evaluation (datasets & metrics, briefly)


Training & pre-training

Trained from scratch (adversarially), per workload — no pretrained or foundation model.

The WGAN-gp Transformer is trained from random initialization, and a separate model is fit for every workload trace — there is no pretraining, fine-tuning, foundation model, or zero-shot transfer. Each trace gets its own trained-and-tuned model. Training is the adversarial WGAN-gp loop of Algorithm 1 — a Transformer generator against a 3-layer MLP critic, where the critic is updated n_critic = 5 times per generator step, the generator loss combines MAE (L1) with the critic’s score, and the gradient-penalty coefficient is λ = 10. The optimizer is MADGRAD (not Adam) — learning rate 0.001, momentum 0.9, eps 1e-6, run for 1000 epochs — and an ablation pins MADGRAD as the key accuracy driver over an Adam variant. The setup is univariate one-step-ahead (τ = 1) on the job-arrival-rate series, split chronologically 60/20/20 train/CV/test with a sliding window (stride 1), with a per-workload grid search over history length n, batch size m, d_model, and n_head.


Strengths

Limitations


Glossary

References
  1. Arbat, S., Jayakumar, V. K., Lee, J., Wang, W., & Kim, I. K. (2022). Wasserstein Adversarial Transformer for Cloud Workload Prediction. Proceedings of the AAAI Conference on Artificial Intelligence, 36(11), 12433–12439. doi:10.1609/aaai.v36i11.21509