DAF
A diffusion Autoformer for probabilistic cloud job-arrival forecasting with uncertainty-aware confidence bands
TL;DR
Cloud data centers want to add or remove servers before a traffic surge hits, not after. To do that, they need to forecast how many incoming jobs or requests will arrive in the near future (the job arrival rate, or JAR). This paper proposes the Diffusion Autoformer (DAF), a neural-network forecaster that does three things at once: (1) it splits the past workload signal into a smooth long-term trend plus a repeating seasonal pattern (the “Autoformer” idea), (2) it generates the future not as one single guessed line but as a range of plausible futures with probabilities attached, using a diffusion model (the same family of generative AI that powers image generators), and (3) it mixes in context like time-of-day and day-of-week. The result is a forecast that is both more accurate (up to ~13% lower MAPE than strong baselines) and uncertainty-aware (it reports a confidence band, not just a point), while still being fast enough (~68 milliseconds per prediction) to feed a live autoscaler. The model only does the prediction part; a standard autoscaler (e.g. Kubernetes HPA) would act on its forecast.
The Problem (and why simple autoscaling isn’t enough)
Imagine an online store the night before a big sale. Traffic is calm now, but at midnight thousands of shoppers arrive at once. If the store only adds servers after it notices CPU is overloaded (this is reactive autoscaling), it is already too late — new servers take time to boot (the “cold start” / provisioning delay), so customers see slow pages or errors in the meantime. The opposite mistake — keeping tons of servers running “just in case” — wastes money.
The fix is proactive (predictive) autoscaling: forecast the demand spike and add servers ahead of time so they are warm and ready when the wave arrives. That makes the forecast the heart of the whole system. But forecasting real cloud workloads is hard:
Bursty and irregular: sudden spikes and drops with no clean pattern.
Long-range patterns: daily cycles, weekly routines, scheduled jobs — structure that stretches across long time spans.
Uncertainty matters: knowing the range of possible demand is as important as the single most-likely number. Under-guessing peak demand causes SLA violations (slow/dropped requests); over-guessing wastes money.
Context matters: time-of-day and day-of-week strongly shape demand.
Speed matters: the forecast must be produced fast enough to drive real-time scaling.
Older tools fall short on at least one axis:
ARIMA / Holt-Winters (classical statistics): assume tidy, linear, stationary data — they cannot handle bursts.
LSTM (a recurrent neural net): better, but struggles with very long-range dependencies and can be slow for real-time inference.
Plain Transformers / Autoformer: strong accuracy, but deterministic — they output a single line and no uncertainty band.
GAN-based transformers (e.g. WGAN-gp): more robust, but still effectively deterministic point forecasts.
DAF is built to hit all of these targets together: accuracy + uncertainty + context + low latency.
Background
A few terms, defined once:
Time series: a sequence of numbers measured over time (here, jobs arriving per 5/10/30-minute slot).
Job Arrival Rate (JAR): how many tasks/requests come in per unit time. This is the thing being forecast.
Transformer: a neural network built on attention — a mechanism that lets the model look across an entire input sequence at once and decide which past time steps matter most for predicting the future. Unlike older RNN/LSTM models that read step-by-step, transformers process the sequence in parallel, which makes them fast and good at long-range patterns.
Autoformer: a transformer variant specialized for forecasting. Its trick is series decomposition — it separates a signal into a smooth trend (the slow drift) and a repeating seasonal part (the cycles), and forecasts them in a structured way. (Analogy: separating a song into its steady bass line and its repeating melody.)
Diffusion model: a generative technique (famous from AI image generators). During training it takes real data and gradually adds random noise until it is pure static; it learns to reverse that — to start from noise and denoise it back into realistic data. Because it starts from random noise each time, running it multiple times yields many different plausible outputs, which naturally gives you a probability distribution over the future — i.e., uncertainty.
Exogenous features / context: extra inputs that aren’t the signal itself but influence it — here, hour-of-day and day-of-week.
Probabilistic vs. deterministic forecast: deterministic = one number per future step. Probabilistic = a distribution / confidence band (“most likely 500 jobs, but plausibly 350–700”).
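To make the probabilistic/deterministic distinction concrete, here is a minimal sketch of how a set of sampled future trajectories can be collapsed into a central line plus a band. The function name `summarize_samples`, the 5%/95% quantile levels, and the Gaussian toy sampler are illustrative assumptions, not details from the paper; any stochastic forecaster that can be sampled repeatedly could stand in for the toy generator.

```python
import random
import statistics

def summarize_samples(samples, lower_q=0.05, upper_q=0.95):
    """Collapse sampled trajectories into (lower, median, upper) per future step.

    `samples` is a list of trajectories, each a list with one value per step.
    The quantile levels are illustrative defaults, not values from the paper.
    """
    horizon = len(samples[0])
    band = []
    for t in range(horizon):
        values = sorted(s[t] for s in samples)
        lo = values[int(lower_q * (len(values) - 1))]
        hi = values[int(upper_q * (len(values) - 1))]
        band.append((lo, statistics.median(values), hi))
    return band

# Toy stand-in for a stochastic forecaster: 200 sampled futures of 3 steps each.
rng = random.Random(0)
samples = [[500 + rng.gauss(0, 60) for _ in range(3)] for _ in range(200)]
band = summarize_samples(samples)

for lo, mid, hi in band:
    print(f"most likely ~{mid:.0f} jobs, plausibly {lo:.0f} to {hi:.0f}")
```

A deterministic forecaster would return only the middle value of each tuple; the lower and upper entries are the extra information a probabilistic forecaster provides.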
Contribution in Simple Terms
The genuinely new idea is fusing three previously separate techniques into one forecaster for cloud autoscaling:
Autoformer-style trend/seasonal decomposition in the encoder — to cleanly capture long-term structure and cycles.
A diffusion-based decoder — to turn forecasting into generation, so the model outputs a probabilistic forecast (a confidence band, not a single line). This is what gives operators uncertainty quantification.
Exogenous attention — to condition the forecast on context (time-of-day, day-of-week).
Earlier work had pieces of this but not the combination: Autoformer decomposes but is deterministic; TimeGrad uses diffusion but without decomposition or external-feature conditioning; Temporal Fusion Transformer uses context but isn’t a diffusion model. DAF stitches them together and adds a practical speed trick (trend-guided initialization, explained below) so the diffusion process — normally slow — runs fast enough (~68 ms) for live autoscaling.
In one line: a transformer that forecasts cloud job arrivals as a probability band instead of a single guess, while staying fast enough to drive a real autoscaler.
How It Works, Step by Step
Training and inference, walked through:
1. Decompose the input. Take the recent workload window X (length 96 time steps). Apply a moving-average filter (kernel size K) to extract the smooth Trend. Subtract it to get the Seasonal component (Seasonal = X − Trend). Trend models long-term structure; seasonal captures short-term cycles.
2. Encode with the Autoformer encoder. Feed the seasonal component through multi-head self-attention (3 encoder layers). This produces a hidden embedding H_E summarizing the workload’s patterns. (The trend is kept aside for later, in step 5.)
3. Encode the context. Project the exogenous features C (hour-of-day, day-of-week, encoded as sine/cosine waves so the model understands their cyclic nature) into an embedding.
4. Fuse. Combine the workload embedding and the context embedding via cross-attention (E_C = Attention(C, H_E)), then concatenate into one conditioning vector Z_E = Concat(H_E, E_C). This Z_E is the “everything we know about the situation” summary that guides the generator.
5. Generate the future with the diffusion decoder (2 decoder layers):
   Training: take the real future Y, progressively add Gaussian noise over many steps (the forward diffusion process) until it is noise. The model is trained to predict the noise that was added at each step, conditioned on Z_E. Loss = how well it predicts the noise (denoising loss) plus a forecast-reconstruction term (MSE between predicted and true future), balanced by a weight λ.
   Inference (the speed trick, Trend-Guided Initialization): instead of starting the denoising from pure random noise, start from the extrapolated future trend plus a little noise (y_N = Trend_future + noise). Because the starting point is already close to a sensible forecast, far fewer denoising steps are needed; the paper uses just 20 steps, which is what keeps latency low.
6. Produce the probabilistic forecast. Run the reverse (denoising) process to generate the next 24 time steps. Because diffusion is stochastic, sampling it multiple times yields a distribution of futures, giving both a central prediction and a calibrated confidence band (its prediction intervals cover the truth >90% of the time).
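The decomposition step and the trend-guided starting point can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the paper's implementation: the kernel size of 25, the edge padding by repeating end values, the least-squares line used to extrapolate the trend (`extrapolate_trend` is a hypothetical helper), and the noise scale are all placeholders, since the summary above does not specify them.

```python
import math
import random

def moving_average_trend(x, kernel_size):
    """Centered moving average; edges are padded by repeating the end values."""
    front = (kernel_size - 1) // 2
    back = kernel_size - 1 - front
    padded = [x[0]] * front + list(x) + [x[-1]] * back
    return [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(x))]

def decompose(x, kernel_size=25):
    """Decomposition step: Seasonal = X - Trend (kernel size is a placeholder)."""
    trend = moving_average_trend(x, kernel_size)
    seasonal = [xi - ti for xi, ti in zip(x, trend)]
    return trend, seasonal

def extrapolate_trend(trend, horizon, fit_points=12):
    """Hypothetical extrapolation: fit a line to the last trend values, extend it."""
    tail = trend[-fit_points:]
    n = len(tail)
    mean_x = (n - 1) / 2
    mean_y = sum(tail) / n
    denom = sum((i - mean_x) ** 2 for i in range(n))
    slope = 0.0
    if denom:
        slope = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(tail)) / denom
    intercept = mean_y - slope * mean_x
    return [intercept + slope * (n + h) for h in range(horizon)]

def trend_guided_init(trend, horizon=24, noise_std=0.1, rng=None):
    """Inference start point: y_N = Trend_future + noise."""
    rng = rng or random.Random()
    future = extrapolate_trend(trend, horizon)
    return [v + rng.gauss(0.0, noise_std) for v in future]

# Synthetic lookback window (not a real trace): slow ramp plus a 12-step cycle.
x = [100 + 0.5 * t + 20 * math.sin(2 * math.pi * t / 12) for t in range(96)]
trend, seasonal = decompose(x)
y_N = trend_guided_init(trend, rng=random.Random(0))
print(len(trend), len(seasonal), len(y_N))
```

In this sketch the denoiser would start from `y_N` instead of pure Gaussian noise, which is the intuition behind needing fewer reverse steps.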
The whole training loop is given as Algorithm 1 in the paper: decompose → encode workload → attend over context → fuse → sample a diffusion timestep → noise the target → predict the noise → compute combined loss → update weights.
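The "sample a diffusion timestep, noise the target, predict the noise" part of that loop can be illustrated with the closed-form noising used in standard DDPM-style diffusion. The sketch below assumes a linear beta schedule with 1000 training steps and uses a dummy "perfect" noise predictor; the paper's actual schedule, step count during training, and network are not given in this summary, so every name and default here is an assumption.

```python
import math
import random

def alpha_bar_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear beta schedule (assumed)."""
    alpha_bars, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def noise_target(y, n, alpha_bars, rng):
    """Return (y_n, eps): the target noised to timestep index n, and the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in y]
    a = math.sqrt(alpha_bars[n])
    b = math.sqrt(1.0 - alpha_bars[n])
    return [a * yi + b * e for yi, e in zip(y, eps)], eps

def denoising_loss(eps, eps_pred):
    """Mean squared error between the true and predicted noise."""
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)

rng = random.Random(0)
alpha_bars = alpha_bar_schedule()
y = [0.4, 0.5, 0.7, 0.6]                  # min-max scaled toy future, 4 steps
n = rng.randrange(len(alpha_bars))        # sample a diffusion timestep
y_n, eps = noise_target(y, n, alpha_bars, rng)
loss = denoising_loss(eps, eps)           # a perfect predictor would score 0.0
print(n, loss)
```

A real training step would replace the dummy predictor with the network's output conditioned on Z_E and add the λ-weighted reconstruction term before updating weights.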
Inputs (what it consumes)
Historical workload window: the job arrival rate time series, a lookback of 96 time steps.
Exogenous context features: hour-of-day and day-of-week, encoded with sine/cosine transforms.
Preprocessing: all sequences normalized with min-max scaling.
Scope note: the model is univariate — it forecasts the single JAR signal (not CPU + memory + I/O together). Extending to multivariate metrics is named as future work.
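The two input transforms listed above are simple enough to sketch directly. The feature layout (four values per timestamp), the function names, and the use of fractional hours are assumptions for illustration; the paper only states that sine/cosine encodings and min-max scaling are used.

```python
import math
from datetime import datetime

def cyclic_features(ts: datetime):
    """Encode hour-of-day and day-of-week as points on a circle.

    Sine/cosine pairs keep 23:55 close to 00:00 and Sunday close to Monday,
    which a raw integer encoding would not.
    """
    hour_angle = 2 * math.pi * (ts.hour + ts.minute / 60) / 24
    dow_angle = 2 * math.pi * ts.weekday() / 7
    return [math.sin(hour_angle), math.cos(hour_angle),
            math.sin(dow_angle), math.cos(dow_angle)]

def min_max_scale(values):
    """Scale a sequence to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]

print(cyclic_features(datetime(2024, 1, 1, 0, 0)))   # Monday midnight
print(min_max_scale([10, 20, 30]))
```

In a real pipeline the minimum and maximum would normally be computed on the training split only and reused for validation and test data, so that no future information leaks into the scaling.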
Outputs (what it produces)
A probabilistic forecast of the job arrival rate for the next 24 time steps (e.g. the next 2 hours at a 5-minute granularity).
Crucially, this is not a single line but a distribution / prediction interval — a most-likely trajectory plus an uncertainty band that, in experiments, contains the true value >90% of the time (PICP > 90%).
It does not output a number of VMs/pods or a scaling action directly. It outputs the demand forecast; a separate scaling policy/autoscaler turns that into resource decisions.
How It Fits the Autoscaling Framework (MAPE-K)
DAF lives squarely in the ANALYZE stage of the MAPE-K loop — it is the forecasting “brain” that turns monitored metrics into a look-ahead prediction. It makes scaling proactive rather than reactive.
Which stage: primarily Analyze (forecasting). It feeds Plan/Execute, but the paper does not implement the planner or actuator itself.
Proactive vs reactive: Proactive — the whole point is to predict demand ahead of time and hide the provisioning/cold-start delay.
Horizontal vs vertical: The paper frames it for VM auto-scaling and explicitly mentions Kubernetes Horizontal Pod Autoscaler (HPA) and serverless function management as integration targets — so the natural fit is horizontal scaling (scale out/in). But DAF only produces the forecast; the actual scaling is left to a standard autoscaler.
How the output drives a decision (the uncertainty payoff): because DAF gives a confidence band, an operator can choose a policy. A conservative policy provisions for the upper confidence bound to avoid SLA violations during spikes; a cost-optimized policy provisions closer to the mean forecast. This risk-tunable behavior is the practical advantage over deterministic forecasters.
In short: DAF is the predictive engine that makes an otherwise-reactive autoscaler proactive — and adds a confidence band so the resulting scaling policy can be tuned for safety vs. cost. Closed-loop control (using reinforcement learning to actually act on the forecast) is listed as future work.
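The conservative versus cost-optimized choice can be sketched as a small planning function that sits outside DAF. Everything here is hypothetical: the paper does not implement a planner, and `capacity_per_replica` (jobs one replica can absorb per time step), the policy names, and the band values are made up for illustration.

```python
import math

def replicas_needed(jobs_per_step, capacity_per_replica, min_replicas=1):
    """Smallest replica count whose combined capacity covers the demand."""
    return max(min_replicas, math.ceil(jobs_per_step / capacity_per_replica))

def plan_replicas(band, capacity_per_replica, policy="conservative"):
    """Turn a forecast band into replica counts per future step.

    `band` is a list of (lower, center, upper) job-arrival forecasts.
    "conservative" provisions for the upper bound, "cost_optimized" for the center.
    """
    index = {"conservative": 2, "cost_optimized": 1}[policy]
    return [replicas_needed(step[index], capacity_per_replica) for step in band]

# Illustrative band for two future steps, assuming 100 jobs per replica per step.
band = [(350, 500, 700), (400, 560, 790)]
safe = plan_replicas(band, 100, "conservative")
cheap = plan_replicas(band, 100, "cost_optimized")
print(safe, cheap)   # [7, 8] [5, 6]
```

The gap between the two plans is the price of safety: the wider the band, the more extra capacity the conservative policy reserves.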
Evaluation (datasets & metrics, briefly)
Datasets: Google Cluster (Borg) job-arrival traces (5-min sampling), Microsoft Azure (VM 2017/2019 and Functions 2019, serverless), plus Facebook, Alibaba, and Wikipedia traffic workloads — at 5-, 10-, and 30-minute granularities.
Baselines: LoadDynamics (LSTM), WGAN-gp Transformer, and Autoformer.
Metrics: MAPE, MAE, RMSE (accuracy); PICP = Prediction Interval Coverage Probability (uncertainty calibration); and inference latency.
Headline results: up to ~13% lower MAPE than strong baselines (lowest MAPE on several datasets incl. Alibaba, Google-5m, Azure-VM-2019, Azure-Func-2019); PICP > 90% across all datasets; ~68 ms average inference latency per sequence. Hardware: Intel i9-13900K + NVIDIA RTX 4090, PyTorch 2.2, AdamW (lr 1e-4), batch 64, 3 encoder / 2 decoder layers, 20 diffusion steps; each experiment repeated 3 times.
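For readers unfamiliar with the metrics above, the following sketch computes them with their standard textbook definitions on made-up numbers. The function names and the toy values are illustrative, and the paper's exact evaluation code may differ in details such as averaging over sequences.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; assumes no true value is zero."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def picp(y_true, lower, upper):
    """Fraction of true values that fall inside [lower, upper]."""
    inside = sum(1 for t, lo, hi in zip(y_true, lower, upper) if lo <= t <= hi)
    return inside / len(y_true)

# Made-up example: three future steps.
y_true = [100, 200, 400]
y_pred = [110, 180, 400]
lower = [90, 190, 410]
upper = [120, 210, 450]

print(mae(y_true, y_pred), rmse(y_true, y_pred),
      mape(y_true, y_pred), picp(y_true, lower, upper))
```

The accuracy metrics only look at the central prediction, while PICP only looks at the band, which is why the evaluation reports both.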
Training & pre-training
Trained from scratch — no pretrained or foundation model.
DAF is initialized randomly and trained end-to-end on the cloud job-arrival traces themselves (Google, Azure, Alibaba, Facebook, Wikipedia). It is not adapted from any pretrained or foundation time-series model. The supervised loop (Algorithm 1, “Training the Diffusion Autoformer”) optimizes a combined objective: the diffusion denoising loss ||ε − ε_θ||² plus a λ-weighted forecast-reconstruction MSE. Optimization uses AdamW (lr 1e-4), batch 64, mixed precision, and early stopping on a validation split; sequences are min-max normalized with lookback 96, horizon 24, and 20 denoising steps, and every experiment is repeated 3 times with different seeds.
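Written out, the combined objective described above takes roughly the following form. The first line is the standard DDPM closed-form noising and is an assumption about how the target is noised; the arguments passed to ε_θ and the symbol Ŷ for the predicted future are likewise notational assumptions. Only the two loss terms and the weight λ come from the description above.

```latex
\begin{aligned}
y_n &= \sqrt{\bar{\alpha}_n}\, Y + \sqrt{1 - \bar{\alpha}_n}\, \varepsilon,
      \qquad \varepsilon \sim \mathcal{N}(0, I) \\
\mathcal{L} &= \bigl\lVert \varepsilon - \varepsilon_\theta(y_n, n, Z_E) \bigr\rVert^2
      + \lambda \, \mathrm{MSE}\bigl(\hat{Y}, Y\bigr)
\end{aligned}
```

Here Y is the true future window, n the sampled diffusion timestep, and Z_E the conditioning vector built from the workload and context embeddings. A larger λ pushes the model toward accurate point forecasts, while a smaller λ leaves more weight on the denoising term.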
One clarification on terminology: the paper calls DAF a “hybrid forecasting model”, but “hybrid” refers only to combining decomposition + diffusion + exogenous attention; it does not mean mixing pretrained and from-scratch components. There is no pretraining, fine-tuning, foundation model, or zero-shot transfer anywhere in the pipeline.
Strengths
Uncertainty-aware: outputs a calibrated confidence band (PICP > 90%), not just a point — operators can tune for SLA-safety vs. cost.
Accurate: up to ~13% lower MAPE than LSTM/Autoformer/GAN baselines; best on several real traces.
Fast enough for production: ~68 ms latency, thanks to trend-guided initialization cutting diffusion to 20 steps.
Handles structure + bursts: decomposition captures cycles/trend; diffusion handles the irregular residual spikes.
Context-aware: integrates time-of-day / day-of-week via attention.
Generalizes across services (Google, Azure, Alibaba, Facebook, Wiki) and sampling rates (5–30 min).
More interpretable than a black box: explicitly separates trend from seasonality.
Limitations
Univariate only: forecasts just the job-arrival rate, not multivariate metrics (CPU, memory, I/O together) — yet real autoscaling often needs several signals.
Higher training complexity: diffusion + decomposition + attention is heavier to train than a plain LSTM/transformer.
Struggles on extremely volatile traces: e.g. Facebook-5m remains the hardest case (still ~40% MAPE), even if it beats baselines there.
Prediction only, not a full autoscaler: it stops at the forecast; the Plan and Execute stages (turning the forecast into pod/VM counts) are left to an external scaler (e.g. HPA). No closed-loop control is implemented.
No end-to-end autoscaling experiment: results are forecasting metrics, not measured SLA/cost improvements from real scaling actions.
Future work acknowledged: distilled/faster diffusion, multivariate support, and reinforcement-learning closed-loop autoscaling are all still to come.
Glossary
Job Arrival Rate (JAR): number of incoming tasks/requests per unit time — the forecast target.
Proactive autoscaling: scaling before demand changes, based on a forecast (vs. reactive = after a threshold is crossed).
MAPE-K: the autoscaling control loop — Monitor, Analyze, Plan, Execute over shared Knowledge. DAF sits in Analyze.
HPA (Horizontal Pod Autoscaler): Kubernetes component that adds/removes pods (horizontal scaling). DAF’s forecast could feed it.
Transformer / attention: neural network that weighs which past time steps matter most, processing the whole sequence in parallel.
Autoformer: forecasting transformer that decomposes a signal into trend + seasonal parts.
Decomposition (trend / seasonal): splitting a time series into a slow drift (trend) and repeating cycles (seasonal). Here via a moving-average filter.
Diffusion model: generative method that learns to turn noise into realistic data; sampling it repeatedly yields a probability distribution (= uncertainty).
Trend-guided initialization: starting the diffusion denoising from the extrapolated trend (not pure noise) so far fewer steps are needed → low latency.
Exogenous features: external context inputs (time-of-day, day-of-week) that influence demand.
Probabilistic forecast / prediction interval: a forecast expressed as a distribution / confidence band rather than a single value.
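PICP (Prediction Interval Coverage Probability): fraction of true values that fall inside the predicted band. The >90% reported here means the band contains the truth in more than 9 of 10 steps; coverage is read against the band's nominal level, since an overly wide band can also score high.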
MAPE / MAE / RMSE: standard forecast-error metrics (lower is better).
SLA / SLO: Service-Level Agreement/Objective — performance promises (e.g. latency) that under-provisioning would violate.
Cold start / provisioning delay: time it takes a new VM/container to boot — the lag that proactive forecasting hides.
Kumar, S., Chauhan, M. K., Priyadarshni, Tripathi, S., Misra, R., & Singh, T. N. (2025). Predicting Cloud Workload Job Arrival Rates Using a Diffusion Autoformer Model. 2025 IEEE International Conference on Big Data (BigData), 6102–6107. doi:10.1109/bigdata66926.2025.11401693