Autoformer

Wu, Xu, Wang & Long (Tsinghua University), NeurIPS 2021. Paper: https://proceedings.neurips.cc/paper/2021/hash/bcc0d400288793e8bdcd7c19a8ac0c2b-Abstract.html — Code: https://github.com/thuml/Autoformer

TL;DR (one paragraph a newcomer can grasp)

Imagine you want to predict the next two weeks of a city’s electricity demand from the last week of readings. Standard Transformer models (the architecture behind ChatGPT-style systems) can do this, but they compare every time point to every other time point one by one. Over long horizons this is both slow (cost grows with the square of the series length) and unreliable, because the meaningful signal gets buried under noise. Autoformer redesigns the Transformer in two ways. (1) It decomposes the series inside the model, repeatedly splitting it into a smooth trend (“the demand is slowly rising this month”) and a repeating seasonal pattern (“every day peaks around 6 pm”), instead of doing this once as pre-processing. (2) It replaces the usual point-by-point “attention” with an Auto-Correlation mechanism that works at the level of whole sub-series: it finds the dominant periods in the data and aligns and averages the chunks of history that sit at the same phase of those periods. The result is a model that is faster (cost grows as L·log L instead of L²) and more accurate, giving a ~38% average error reduction over the previous best models across six real-world datasets (energy, traffic, economics, weather, disease).

The Problem

Long-term time series forecasting means predicting far into the future. Formally the task is input-I-predict-O: given the past I time steps, predict the next O time steps. “Long-term” means O is large (the paper tests O = 96, 192, 336, 720 steps; e.g. 720 hours ≈ a month ahead).
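The input-I-predict-O setup can be made concrete with a sliding window over a recorded series. The sketch below is illustrative only: `make_windows` and its parameter names are hypothetical and are not taken from the Autoformer code base.

```python
def make_windows(series, input_len, horizon):
    """Cut a series into (past, future) pairs for input-I-predict-O training."""
    pairs = []
    last_start = len(series) - input_len - horizon
    for start in range(last_start + 1):
        past = series[start:start + input_len]                 # the I observed steps
        future = series[start + input_len:start + input_len + horizon]  # the O target steps
        pairs.append((past, future))
    return pairs


# Toy series of 10 readings, input-4-predict-3.
series = list(range(10))
pairs = make_windows(series, input_len=4, horizon=3)
```

Each pair holds I consecutive observations followed by the O values the model is asked to predict. In the paper's settings the same idea applies with, for example, I = 96 and O = 720.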

This is genuinely hard for two reasons:

Tangled patterns. A long future is a messy mix of slow drifts, daily cycles, weekly cycles, weather shocks, etc. Trying to read dependencies straight off the raw series is unreliable because the useful structure is “obscured by entangled temporal patterns.”
Computational cost. The Transformer’s core operation, self-attention, compares every pair of time points. For a length-L series that is L² comparisons, i.e. quadratic complexity. Double the sequence length and you quadruple the memory and time. This becomes prohibitive for long series.
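To get a feel for the gap between L² and L·log L, the rough comparison below counts abstract "operations" at the sequence lengths the paper uses as horizons. It ignores constant factors and hardware effects, so the ratios indicate scaling only. The function names are made up for this illustration.

```python
import math


def pairwise_cost(length):
    """Operation count proportional to comparing every pair of positions."""
    return length * length


def fft_style_cost(length):
    """Operation count proportional to L * log2(L), as for an FFT."""
    return length * math.log2(length)


ratios = {}
for length in (96, 192, 336, 720):
    ratios[length] = pairwise_cost(length) / fft_style_cost(length)
    print(f"L={length:4d}  L^2={pairwise_cost(length):7d}  ratio={ratios[length]:.1f}")
```

The ratio between the two counts keeps growing with L, which is why the difference matters most at long horizons.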

Earlier Transformer variants (Informer, LogTrans, Reformer) attacked the cost problem by making attention sparse: each position attends only to a cleverly chosen subset of points. That helps speed, but it keeps the point-wise view: it still aggregates information one isolated time point at a time, and throwing points away creates an “information utilization bottleneck.” Autoformer argues the point-wise view itself is the wrong abstraction for periodic data.

Background a newcomer needs

A few terms, defined once:

Time series. A sequence of measurements over time, e.g. hourly electricity load. May be univariate (one variable) or multivariate (many variables tracked together, e.g. temperature + humidity + pressure).
Lookback window / input length I. How much history the model sees.
Horizon / output length O. How far ahead it predicts.
Attention. A mechanism that, for each position, computes a weighted blend of information from other positions, with weights (“how much should I listen to you?”) learned from the data. Self-attention is attention of a sequence onto itself.
Query / Key / Value (Q, K, V). Attention’s three learned projections of the input. Loosely: each position emits a query (“what am I looking for?”), every position offers a key (“here’s what I am”), Q·K similarity sets the weights, and the values are what actually gets blended.
Encoder / Decoder. An encoder reads the input and builds a rich internal representation; a decoder uses that representation to generate the output. Autoformer keeps this two-part structure.
Stationarity. A series is stationary if its statistical properties (mean, variance) don’t drift over time. Real series usually aren’t — they trend. Classical methods like ARIMA force stationarity by differencing; Autoformer instead peels off the trend explicitly.
Seasonality / period. A pattern that repeats every fixed interval (the period). Daily electricity load has a 24-hour period; traffic has both 24-hour and 168-hour (weekly) periods.
Series decomposition. A classic idea: split a series into a smooth trend-cyclical component (long-term progression) plus a seasonal component (the repeating part). Traditionally done once, as pre-processing, because you can’t decompose a future you haven’t predicted yet.
Autocorrelation. How similar a series is to a time-shifted copy of itself. If shifting by 24 hours lines the series up almost perfectly, that’s strong evidence of a 24-hour period.
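The autocorrelation definition above can be checked on a toy signal. The sketch below uses a normalized, circular (wrap-around) autocorrelation computed directly from the definition; the circular shift and the normalization are choices made for this illustration, and `autocorrelation` is a hypothetical helper name.

```python
import math


def autocorrelation(series, lag):
    """Normalized circular autocorrelation of a series at a single lag."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    variance = sum(c * c for c in centered)
    # Compare each point with the point `lag` steps earlier, wrapping around.
    shifted = sum(centered[t] * centered[(t - lag) % n] for t in range(n))
    return shifted / variance


# Four days of a perfectly daily (24-step) pattern.
series = [math.sin(2 * math.pi * t / 24) for t in range(96)]
r24 = autocorrelation(series, 24)  # shift by one full period
r12 = autocorrelation(series, 12)  # shift by half a period
```

A shift by the full period lines the series up with itself, so `r24` sits at (numerically) 1, while a half-period shift puts peaks on troughs and `r12` sits at -1.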

Key Idea / Innovation

Autoformer keeps the residual encoder–decoder skeleton of the Transformer but makes two replacements:

Decomposition becomes an inner block, used repeatedly (not one-shot pre-processing). The model progressively separates trend from seasonal at every layer — even on the predicted intermediate values, not just the past. This lets trend and seasonal components interact and get refined as the forecast is built.
Auto-Correlation replaces self-attention. Instead of relating individual points, it relates whole sub-series based on the data’s periodicity. It (a) discovers the most likely periods via autocorrelation and (b) aligns and averages the sub-series sitting at matching phases of those periods. This gives O(L·log L) complexity and richer, series-level information use — improving efficiency and accuracy at the same time, which sparse attentions could not do.

The intuition behind Auto-Correlation: the same phase position across different periods tends to look alike. Today’s 6 pm demand resembles yesterday’s 6 pm and the day-before’s 6 pm. So rather than asking “which scattered past instants matter?”, ask “what is the period, and which past chunks sit at the same phase?” and then reuse those chunks.

How It Works, Step by Step (the execution flow)

Building block A: the Series Decomposition block (`SeriesDecomp`)

Given a hidden series X, extract a smooth trend by a moving average (an average pooling over a sliding window, with padding so the length is unchanged); the seasonal part is whatever’s left over:

trend     X_t = AvgPool(Padding(X))        # smooth long-term progression
seasonal  X_s = X - X_t                     # the repeating fluctuations
X_s, X_t = SeriesDecomp(X)

This block is sprinkled throughout the network, so decomposition happens again and again on evolving hidden states.
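A minimal pure-Python sketch of this block on a single 1-D series follows. Two details are assumptions of the sketch rather than statements from the text above: the padding repeats the first and last values, and the default window of 25 steps is just an illustrative odd kernel size.

```python
import math


def series_decomp(series, kernel_size=25):
    """Split a series into (seasonal, trend) using a length-preserving moving average."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the output length matches the input")
    half = (kernel_size - 1) // 2
    # Padding: repeat the edge values so the pooled output keeps the input length.
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    # AvgPool: mean over a sliding window centred on each position.
    trend = [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(series))]
    # The seasonal part is whatever the smooth trend does not explain.
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


# Slow upward drift plus a daily cycle, sampled hourly for four days.
series = [0.05 * t + math.sin(2 * math.pi * t / 24) for t in range(96)]
seasonal, trend = series_decomp(series)
```

By construction `seasonal[t] + trend[t]` reproduces the input at every step, so the block only redistributes the signal between the two components and does not discard information on its own.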

Building block B: the Auto-Correlation mechanism

Replaces self-attention. From the input it forms query Q, key K, value V (same as attention). Then:

Find the periods. Compute the autocorrelation R(τ) — the similarity between the series and itself shifted by lag τ, for all lags at once. This is done efficiently with the Fast Fourier Transform (FFT) via the Wiener–Khinchin theorem (autocorrelation = inverse FFT of the power spectrum). Computing all lags this way is the source of the O(L·log L) speed.
Pick the top-k lags. Take the k = ⌊c · log L⌋ lags with the highest autocorrelation: τ₁,…,τ_k. These are the most probable period lengths. Their autocorrelation scores become confidences, normalized with softmax into weights.
Time Delay Aggregation. For each chosen lag τ_i, Roll the value series V by τ_i — a circular shift that re-introduces elements pushed off one end at the other end. Rolling aligns the sub-series that sit at the same phase. Then sum the rolled copies, weighted by the softmax confidences:

τ₁,…,τ_k = argTopk( R_{Q,K}(τ) )            # most likely periods
weights   = SoftMax( R(τ₁),…,R(τ_k) )        # confidence per period
AutoCorrelation(Q,K,V) = Σ_i  Roll(V, τ_i) · weight_i

Contrast with self-attention, which would do a dot-product blend of individual points. Auto-Correlation blends aligned chunks. It runs in multi-head form (several Q/K/V projections in parallel) exactly like standard attention, so it is a drop-in replacement.
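The three steps above can be sketched end to end for one channel and one head. This is a simplified illustration, not the reference implementation: it skips the learned Q/K/V projections, uses a small hand-written radix-2 FFT (so the length must be a power of two) where a real implementation would call a library FFT, assumes the natural logarithm in k = ⌊c·log L⌋, and shifts left in `roll`. All function names are hypothetical.

```python
import cmath
import math


def fft(values):
    """Recursive radix-2 FFT; the length must be a power of two."""
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    even, odd = fft(values[0::2]), fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def ifft(values):
    """Inverse FFT via the conjugate trick."""
    n = len(values)
    return [v.conjugate() / n for v in fft([x.conjugate() for x in values])]


def correlation_by_fft(q, k):
    """R(tau) = mean over t of q[t] * k[t - tau], for all lags at once (Wiener-Khinchin)."""
    n = len(q)
    if n & (n - 1):
        raise ValueError("this toy FFT needs a power-of-two length")
    spectrum = [a * b.conjugate() for a, b in zip(fft(q), fft(k))]
    return [v.real / n for v in ifft(spectrum)]


def roll(values, lag):
    """Circular left shift: entries pushed off the front reappear at the end."""
    n = len(values)
    return [values[(t + lag) % n] for t in range(n)]


def auto_correlation(q, k, v, c=1):
    """Top-k lag selection followed by time delay aggregation."""
    n = len(q)
    scores = correlation_by_fft(q, k)
    top_k = max(1, int(c * math.log(n)))  # natural log is an assumption here
    lags = sorted(range(n), key=lambda lag: scores[lag], reverse=True)[:top_k]
    peak = max(scores[lag] for lag in lags)
    exps = [math.exp(scores[lag] - peak) for lag in lags]  # numerically stable softmax
    weights = [e / sum(exps) for e in exps]
    out = [0.0] * n
    for lag, weight in zip(lags, weights):
        out = [o + weight * r for o, r in zip(out, roll(v, lag))]
    return out, lags, weights


# A clean 16-step cycle repeated four times, used as q, k and v.
series = [math.sin(2 * math.pi * t / 16) for t in range(64)]
output, lags, weights = auto_correlation(series, series, series)
```

On this perfectly periodic toy input the selected lags are the multiples of the period (lag 0 included, since nothing in this sketch excludes it), and because every rolled copy coincides with the original, the aggregated output matches the input.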

The Encoder (`N` layers): models seasonality

Each encoder layer interleaves Auto-Correlation and decomposition. The trend that gets peeled off is discarded ("_"); the encoder deliberately focuses on the seasonal part and produces a clean seasonal representation passed to the decoder.

S¹, _ = SeriesDecomp( AutoCorrelation(X) + X )
S², _ = SeriesDecomp( FeedForward(S¹)   + S¹ )

(FeedForward is a small per-position neural net, and + X is a residual connection — adding the input back so layers learn refinements.)
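The data flow of one encoder layer can be sketched with the sub-blocks passed in as plain functions. Everything here is a stand-in chosen for illustration: the compact `series_decomp`, the kernel size of 5, and the zero-output `auto_corr` and `feed_forward` used in the demo are assumptions, and a real layer operates on d-dimensional hidden states with learned weights.

```python
def series_decomp(series, kernel_size=5):
    """Moving-average split into (seasonal, trend), with edge-repeat padding."""
    half = (kernel_size - 1) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(series))]
    return [x - t for x, t in zip(series, trend)], trend


def add(a, b):
    """Element-wise sum, used for the residual connections."""
    return [x + y for x, y in zip(a, b)]


def encoder_layer(x, auto_corr, feed_forward, kernel_size=5):
    """One encoder layer: each sub-block is followed by a decomposition."""
    s1, _ = series_decomp(add(auto_corr(x), x), kernel_size)         # trend dropped
    s2, _ = series_decomp(add(feed_forward(s1), s1), kernel_size)    # trend dropped
    return s2


def no_update(x):
    """Stand-in sub-block that contributes nothing, isolating the decomposition."""
    return [0.0] * len(x)


# A pure ramp is all trend, so almost nothing should survive the encoder.
ramp = [2.0 * t for t in range(20)]
encoded = encoder_layer(ramp, no_update, no_update)
```

With the stand-ins contributing nothing, the ramp is removed as trend and the interior of `encoded` is zero; only the padded edges leave a small residue. This is the sense in which the encoder keeps the seasonal part and drops the trend.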

Decoder input initialization

The decoder needs a starting guess for the future. The latter half of the encoder’s input is decomposed into seasonal + trend, and each is padded out to the forecast horizon:

Seasonal init: recent seasonal part, with the horizon filled by zeros (a clean slate to be predicted).
Trend init: recent trend part, with the horizon filled by the mean of the input (a flat baseline to be accumulated onto).
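The initialization described above might look as follows for a single 1-D series. The helper names, the edge-repeat padding, and the window of 25 steps are assumptions of this sketch.

```python
def series_decomp(series, kernel_size=25):
    """Moving-average split into (seasonal, trend), with edge-repeat padding."""
    half = (kernel_size - 1) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(series))]
    seasonal = [x - t for x, t in zip(series, trend)]
    return seasonal, trend


def decoder_init(encoder_input, horizon, kernel_size=25):
    """Build the seasonal and trend starting points for the decoder."""
    recent = encoder_input[len(encoder_input) // 2:]       # latter half of the input
    seasonal, trend = series_decomp(recent, kernel_size)
    input_mean = sum(encoder_input) / len(encoder_input)   # mean of the whole input
    seasonal_init = seasonal + [0.0] * horizon             # zeros: to be predicted
    trend_init = trend + [input_mean] * horizon            # flat baseline to build on
    return seasonal_init, trend_init


# 96 hourly readings with a simple daily sawtooth, forecasting 336 steps ahead.
encoder_input = [float(t % 24) for t in range(96)]
seasonal_init, trend_init = decoder_init(encoder_input, horizon=336)
```

Both initial sequences have length I/2 + O (here 48 + 336 = 384): the first 48 positions carry the decomposed recent history and the remaining 336 are the placeholders the decoder refines.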

The Decoder (`M` layers): two accumulating pathways

Each decoder layer runs two Auto-Correlations: an inner one (which refines the decoder’s own seasonal estimate) and an encoder–decoder one (where Q comes from the decoder and K, V come from the encoder’s seasonal output; this is how past seasonal information flows in). After each Auto-Correlation and after the FeedForward, a SeriesDecomp splits off trend. Crucially, the trend pieces are not thrown away: they are projected and accumulated into a running trend estimate:

S, T = SeriesDecomp( AutoCorrelation(...) + input )   # seasonal kept, trend extracted
trend_total += W * T                                   # accumulate trend across all blocks
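A sketch of one decoder layer shows how the seasonal and trend pathways separate. The sub-blocks are passed in as functions, the projection W is reduced to one scalar weight per trend piece, and the kernel size of 5 is arbitrary; all of these are simplifying assumptions for illustration.

```python
import math


def series_decomp(series, kernel_size=5):
    """Moving-average split into (seasonal, trend), with edge-repeat padding."""
    half = (kernel_size - 1) // 2
    padded = [series[0]] * half + list(series) + [series[-1]] * half
    trend = [sum(padded[i:i + kernel_size]) / kernel_size for i in range(len(series))]
    return [x - t for x, t in zip(series, trend)], trend


def add(a, b):
    """Element-wise sum."""
    return [x + y for x, y in zip(a, b)]


def decoder_layer(seasonal, trend_total, enc_seasonal,
                  self_corr, cross_corr, feed_forward,
                  trend_weights=(1.0, 1.0, 1.0)):
    """One decoder layer: refine the seasonal part, accumulate every trend piece."""
    s1, t1 = series_decomp(add(self_corr(seasonal), seasonal))
    s2, t2 = series_decomp(add(cross_corr(s1, enc_seasonal), s1))
    s3, t3 = series_decomp(add(feed_forward(s2), s2))
    for weight, piece in zip(trend_weights, (t1, t2, t3)):
        trend_total = add(trend_total, [weight * p for p in piece])
    return s3, trend_total


def no_update(x, *unused):
    """Stand-in sub-block that contributes nothing."""
    return [0.0] * len(x)


seasonal_in = [math.sin(2 * math.pi * t / 8) + 0.1 * t for t in range(32)]
trend_in = [1.0] * 32
seasonal_out, trend_out = decoder_layer(
    seasonal_in, trend_in, enc_seasonal=[0.0] * 32,
    self_corr=no_update, cross_corr=no_update, feed_forward=no_update,
)
```

With stand-ins that add nothing and unit weights, `seasonal_out + trend_out` equals `seasonal_in + trend_in`: the layer has only moved the slow drift out of the seasonal pathway and into the running trend.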

Final prediction

forecast = W_S · (final seasonal Xde)  +  (accumulated trend Tde)

The two refined components are summed. As decomposition blocks are added, the model captures rising trends and sharp seasonal peaks/troughs that a no-decomposition baseline misses.

A bird’s-eye sketch:

            ┌──────────────  ENCODER (Nx)  ──────────────┐
 past I  ──▶│ AutoCorr → Decomp → FeedFwd → Decomp        │ (trend discarded,
            └──────────────┬─────────────────────────────┘  keeps SEASONAL)
                           │ seasonal memory (K,V)
                           ▼
            ┌──────────────  DECODER (Mx)  ──────────────┐
 seasonal   │ AutoCorr → Decomp                          │   trend pieces
 init (0s)  │ EncDec-AutoCorr → Decomp                   │   accumulate ──┐
 ──────────▶│ FeedFwd → Decomp                           │                │
 trend init │                                            │                ▼
 (means) ───┼────────────────────────────────────────────┼──▶ trend_total
            └──────────────┬─────────────────────────────┘
                           ▼
            forecast = seasonal_part  +  trend_total   ──▶  next O steps

A Worked Example

Suppose hourly electricity load. You feed I = 96 hours (4 days) and want O = 336 hours (2 weeks) ahead.

Init. Take the last 2 days of the 96-hour input. Decompose it: the slow upward drift becomes the trend seed; the daily up-and-down becomes the seasonal seed. Pad both to cover the 336-hour horizon — seasonal slots filled with zeros, trend slots filled with the 4-day average.
Auto-Correlation finds the rhythm. Via FFT, the model computes self-similarity at every lag and keeps the strongest ones: on hourly load data these are typically τ = 24 (daily) and, wherever the sequence spans more than a week (as the 48 + 336 = 384-step decoder sequence here does), τ = 168 (weekly). (The paper reports these learned lags, 24h and 168h, on real traffic/electricity data, which makes the forecast human-interpretable.)
Time Delay Aggregation. It Rolls the history by 24h and by 168h so that “6 pm chunks” stack onto “6 pm chunks,” then averages them weighted by how confident each period is. Yesterday’s evening peak and last week’s evening peak both inform tonight’s predicted evening peak — as whole sub-series, not isolated points.
Progressive decomposition. Across layers, the smooth “demand is creeping up this month” trend is repeatedly skimmed off and accumulated, while the seasonal pathway sharpens the daily peaks and troughs.
Output. The final 2-week forecast = sharpened daily/weekly seasonal pattern + accumulated rising trend.

Notably, even the Exchange dataset, which has no obvious periodicity, still improves — Auto-Correlation degrades gracefully toward using the most-correlated lags it can find.

Input / Output Summary

Input: the past I time steps of a (univariate or multivariate) numeric time series, embedded into d-dimensional vectors. Tested on energy (ETT, Electricity), traffic, economics (Exchange), weather, and disease (ILI) data.
Output: the predicted next O steps (point forecasts). Trained with L2 loss (mean-squared error), Adam optimizer, learning rate 1e-4. The paper’s configuration uses 2 encoder layers, 1 decoder layer.

Strengths

State-of-the-art accuracy. ~38% average MSE reduction across six benchmarks and four horizons vs. prior bests (e.g. on input-96-predict-336: 74% lower MSE on ETT, 61% on Exchange, 21% on Weather).
Efficiency. O(L·log L) time and memory (vs. L² for full attention), via the FFT-based autocorrelation. It can handle very long settings (input-336-predict-1440) where full and sparse attentions run out of memory.
Long-horizon robustness. Accuracy degrades gently as the horizon O grows — valuable for real planning and early-warning use.
Interpretability. Learned top lags correspond to genuine real-world periods (24h daily, 168h weekly, monthly/quarterly/yearly on Exchange), so the model offers a human-readable explanation of why it predicts what it does.
Modular and transferable. The progressive decomposition architecture, dropped into vanilla Transformer / Informer / LogTrans / Reformer, consistently improves them — and it beats one-shot pre-decomposition (which can even hurt at long horizons because it ignores future component interactions).

Limitations

Leans on periodicity. The core mechanism assumes meaningful repeating periods. It still works on weakly-periodic data (Exchange) but its biggest wins are on clearly periodic series; very irregular or regime-shifting signals get less benefit.
Choosing how many periods (k). k = ⌊c·log L⌋ with hyper-parameter c (the paper uses 1–3) trades accuracy for speed; too few lags can miss phases, too many adds cost.
Moving-average window is a design choice. The decomposition’s pooling kernel size must be set; an ill-fitting window can blur or under-separate trend vs. seasonal.
Point forecasts, not probabilistic. It outputs single best-guess values, not uncertainty intervals (unlike e.g. DeepAR), so it doesn’t natively express forecast confidence.
Still a heavyweight deep model. It needs GPU training and a reasonable amount of data; for short-horizon or tiny-data problems, classical methods (ARIMA was best on Exchange’s shortest horizon) can match or beat it.
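For the top-k rule k = ⌊c·log L⌋ mentioned in the list above, a quick calculation shows how few lags are actually kept. The base of the logarithm is not stated here, so the natural log used below is an assumption; `top_k_lags` is a hypothetical helper.

```python
import math


def top_k_lags(length, c):
    """Number of lags kept: floor(c * ln(length)), natural log assumed."""
    return int(c * math.log(length))


# Sequence lengths 96 and 720 with c in the 1-3 range mentioned above.
table = {(length, c): top_k_lags(length, c)
         for length in (96, 720)
         for c in (1, 2, 3)}

for (length, c), k in sorted(table.items()):
    print(f"L={length:3d}  c={c}  k={k}")
```

Under this assumption, a 720-step sequence with c = 3 keeps 19 of its 720 possible lags, so the aggregation step stays cheap while the FFT dominates the cost.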

Glossary

Attention / self-attention: mechanism that blends information across positions using learned, data-dependent weights; self-attention applies this within one sequence. Naïvely costs O(L²).
Query/Key/Value (Q/K/V): the three learned projections attention uses to decide weights (Q·K) and what to blend (V).
Encoder / Decoder: the read-and-represent half and the generate-output half of the network.
Lookback window (I) / horizon (O): length of history seen / length of future predicted. “Long-term” = large O.
Quadratic complexity: cost growing as L²; here reduced to L·log L.
Stationarity: time-invariance of a series’ statistics; non-stationary (trending) series are the norm.
Period / seasonality: the fixed interval over which a pattern repeats.
Series decomposition: splitting a series into smooth trend-cyclical + repeating seasonal parts.
Autocorrelation R(τ): similarity of a series to its own copy shifted by lag τ; peaks reveal periods.
FFT (Fast Fourier Transform): fast algorithm to move between time and frequency domains; here used to compute all autocorrelation lags at once in O(L·log L).
Time Delay Aggregation / Roll: circularly shift sub-series to align matching phases, then weighted-sum them.
Residual connection: adding a block’s input back to its output so layers learn refinements, easing deep training.