Autoformer

Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

Wu, Xu, Wang & Long (Tsinghua University), NeurIPS 2021. Paper: https://proceedings.neurips.cc/paper/2021/hash/bcc0d400288793e8bdcd7c19a8ac0c2b-Abstract.html — Code: https://github.com/thuml/Autoformer

TL;DR (one paragraph a newcomer can grasp)

Imagine you want to predict the next two weeks of a city’s electricity demand from the last week of readings. Standard Transformer models (the architecture behind ChatGPT-style systems) can do this, but they compare every time point to every other time point one by one. Over long horizons this is both slow (cost grows with the square of the series length) and unreliable, because the meaningful signal gets buried under noise. Autoformer redesigns the Transformer in two ways. (1) It decomposes the series inside the model, repeatedly splitting it into a smooth trend (“the demand is slowly rising this month”) and a repeating seasonal pattern (“every day peaks around 6 pm”), instead of doing this once as pre-processing. (2) It replaces the usual point-by-point “attention” with an Auto-Correlation mechanism that works at the level of whole sub-series: it finds the dominant periods in the data and aligns and averages the chunks of history that sit at the same phase of those periods. The result is a model that is faster (cost grows as L·log L instead of L²) and more accurate: the paper reports a 38% relative improvement over the previous best models on six benchmarks covering five application areas (energy, traffic, economics, weather, disease).


The Problem

Long-term time series forecasting means predicting far into the future. Formally the task is input-I-predict-O: given the past I time steps, predict the next O time steps. “Long-term” means O is large (the paper tests O = 96, 192, 336, 720 steps; e.g. 720 hours ≈ a month ahead).
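To make the input-I-predict-O setup concrete, here is a minimal sketch of how a long series could be cut into (past, future) training pairs. The function name `make_windows` and the synthetic `hourly` array are illustrative assumptions, not part of the paper or its code.

```python
import numpy as np

def make_windows(series, input_len, horizon):
    """Slice a 1-D series into (past, future) pairs for input-I-predict-O."""
    series = np.asarray(series, dtype=float)
    n_pairs = len(series) - input_len - horizon + 1
    if n_pairs <= 0:
        raise ValueError("series is shorter than input_len + horizon")
    # Row s of `past` holds steps s .. s+I-1; row s of `future` holds the next O steps.
    past = np.stack([series[s:s + input_len] for s in range(n_pairs)])
    future = np.stack([series[s + input_len:s + input_len + horizon]
                       for s in range(n_pairs)])
    return past, future

hourly = np.arange(1000.0)  # stand-in for 1000 hourly readings (assumed data)
past, future = make_windows(hourly, input_len=96, horizon=336)
print(past.shape, future.shape)  # (569, 96) (569, 336)
```

Each row of `future` starts exactly one step after the matching row of `past` ends, which is all the task definition requires.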

This is genuinely hard for two reasons:

  1. Tangled patterns. A long future is a messy mix of slow drifts, daily cycles, weekly cycles, weather shocks, etc. Trying to read dependencies straight off the raw series is unreliable because the useful structure is “obscured by entangled temporal patterns.”

  2. Computational cost. The Transformer’s core operation, self-attention, compares every pair of time points. For a length-L series that is L² comparisons, i.e. quadratic complexity. Double the series length and you quadruple the memory and time, which becomes prohibitive for long series.
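The growth rates can be compared with a few lines of arithmetic. This is only a rough operation count that ignores constant factors; the helper names `pairwise_cost` and `fft_cost` are made up for this illustration.

```python
import math

def pairwise_cost(length):
    """Number of point-to-point comparisons in full self-attention."""
    return length * length

def fft_cost(length):
    """Order-of-magnitude cost of an FFT-based pass, L * log2(L)."""
    return length * math.log2(length)

for length in (96, 192, 336, 720):
    print(length, pairwise_cost(length), round(fft_cost(length)))

# Doubling the length quadruples the pairwise cost ...
print(pairwise_cost(192) / pairwise_cost(96))  # 4.0
# ... but only slightly more than doubles the L * log L cost.
print(fft_cost(192) / fft_cost(96))
```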

Earlier Transformer variants (Informer, LogTrans, Reformer) attacked the cost problem by making attention sparse, attending only to a cleverly chosen subset of points. That helps speed, but it keeps the point-wise view: it still aggregates information one isolated time point at a time, and throwing points away creates an “information utilization bottleneck.” Autoformer argues the point-wise view itself is the wrong abstraction for periodic data.


Background a newcomer needs

A few terms, defined once:


Key Idea / Innovation

Autoformer keeps the residual encoder–decoder skeleton of the Transformer but makes two replacements:

  1. Decomposition becomes an inner block, used repeatedly (not one-shot pre-processing). The model progressively separates trend from seasonal at every layer — even on the predicted intermediate values, not just the past. This lets trend and seasonal components interact and get refined as the forecast is built.

  2. Auto-Correlation replaces self-attention. Instead of relating individual points, it relates whole sub-series based on the data’s periodicity. It (a) discovers the most likely periods via autocorrelation and (b) aligns and averages the sub-series sitting at matching phases of those periods. This gives O(L·log L) complexity and richer, series-level information use — improving efficiency and accuracy at the same time, which sparse attentions could not do.

The intuition behind (2): the same phase position across different periods tends to look alike. Today’s 6 pm demand resembles yesterday’s 6 pm and the day-before’s 6 pm. So rather than asking “which scattered past instants matter?”, ask “what is the period, and which past chunks sit at the same phase?” — then reuse those chunks.


How It Works, Step by Step (the execution flow)

Building block A — the Series Decomposition block (SeriesDecomp)

Given a hidden series X, extract a smooth trend by a moving average (an average pooling over a sliding window, with padding so the length is unchanged); the seasonal part is whatever’s left over:

trend     X_t = AvgPool(Padding(X))        # smooth long-term progression
seasonal  X_s = X - X_t                     # the repeating fluctuations
X_s, X_t = SeriesDecomp(X)

This block is sprinkled throughout the network, so decomposition happens again and again on evolving hidden states.
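A minimal NumPy sketch of the decomposition block on a single univariate series follows. Two details are assumptions made for the sketch: the window length of 25 and the choice to pad by repeating the first and last values. The text above only says that padding keeps the length unchanged.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a 1-D series into (seasonal, trend) with a moving average."""
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd so the length is preserved")
    x = np.asarray(x, dtype=float)
    half = kernel_size // 2
    # Repeat the edge values so the averaged series has the same length as x.
    padded = np.pad(x, half, mode="edge")
    window = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, window, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(240)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)  # slow drift plus a daily cycle
seasonal, trend = series_decomp(x)
print(np.allclose(seasonal + trend, x))  # True
```

Away from the edges the extracted trend follows the 0.05·t drift closely, while the 24-step cycle ends up almost entirely in the seasonal part.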

Building block B — the Auto-Correlation mechanism

Replaces self-attention. From the input it forms query Q, key K, value V (same as attention). Then:

  1. Find the periods. Compute the autocorrelation R(τ) — the similarity between the series and itself shifted by lag τ, for all lags at once. This is done efficiently with the Fast Fourier Transform (FFT) via the Wiener–Khinchin theorem (autocorrelation = inverse FFT of the power spectrum). Computing all lags this way is the source of the O(L·log L) speed.

  2. Pick the top-k lags. Take the k = ⌊c · log L⌋ lags with the highest autocorrelation: τ₁,…,τ_k. These are the most probable period lengths. Their autocorrelation scores become confidences, normalized with softmax into weights.

  3. Time Delay Aggregation. For each chosen lag τ_i, Roll the value series V by τ_i — a circular shift that re-introduces elements pushed off one end at the other end. Rolling aligns the sub-series that sit at the same phase. Then sum the rolled copies, weighted by the softmax confidences:

τ₁,…,τ_k = argTopk( R_{Q,K}(τ) )            # most likely periods
weights   = SoftMax( R(τ₁),…,R(τ_k) )        # confidence per period
AutoCorrelation(Q,K,V) = Σ_i  Roll(V, τ_i) · weight_i
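Steps 1 and 2 can be sketched for a single univariate series as below. The sketch checks the FFT route against a direct O(L²) computation and then picks the top-k lags. The function names and the value `c=2` are illustrative assumptions; multi-head projections and batching are left out.

```python
import math
import numpy as np

def autocorrelation_fft(q, k):
    """Circular R(tau) = mean_t q[t] * k[t - tau] for every lag, via FFT."""
    length = len(q)
    spectrum = np.fft.rfft(q) * np.conj(np.fft.rfft(k))
    return np.fft.irfft(spectrum, n=length) / length

def autocorrelation_direct(q, k):
    """Same quantity computed lag by lag, O(L^2), for comparison only."""
    return np.array([np.mean(q * np.roll(k, tau)) for tau in range(len(q))])

def top_k_lags(corr, c=2):
    """Keep the k = floor(c * log L) strongest lags and softmax their scores."""
    k = int(c * math.log(len(corr)))
    lags = np.argsort(corr)[::-1][:k]
    scores = corr[lags]
    weights = np.exp(scores - scores.max())  # subtract max for stability
    weights /= weights.sum()
    return lags, weights

rng = np.random.default_rng(0)
q = rng.standard_normal(96)
k = rng.standard_normal(96)
fast = autocorrelation_fft(q, k)
slow = autocorrelation_direct(q, k)
print(np.allclose(fast, slow))  # True

lags, weights = top_k_lags(autocorrelation_fft(q, q))
print(len(lags), lags[0])  # 9 0
```

When a series is correlated with itself, lag 0 always scores highest, as the last line shows. In the model Q and K come from separate learned projections, so the quantity is a cross-correlation and the strongest lag need not be zero.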

Contrast with self-attention, which would do a dot-product blend of individual points. Auto-Correlation blends aligned chunks. It runs in multi-head form (several Q/K/V projections in parallel) exactly like standard attention, so it is a drop-in replacement.
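The aggregation step can be sketched in the same single-head, univariate setting. The shift direction is an assumption of this sketch: `np.roll(v, lag)` places `v[t - lag]` at position `t`, which matches the definition of R(τ) used here, and the reference implementation may use a different convention. `c=2` is again an illustrative value.

```python
import math
import numpy as np

def auto_correlation(q, k, v, c=2):
    """Single-head Auto-Correlation on 1-D arrays (no learned projections)."""
    length = len(q)
    # Step 1: correlation at every lag via FFT.
    corr = np.fft.irfft(np.fft.rfft(q) * np.conj(np.fft.rfft(k)), n=length) / length
    # Step 2: top-k lags and their softmax confidences.
    top_k = int(c * math.log(length))
    lags = np.argsort(corr)[::-1][:top_k]
    scores = corr[lags]
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Step 3: time delay aggregation, a weighted sum of circularly shifted copies.
    out = np.zeros(length)
    for lag, weight in zip(lags, weights):
        out += weight * np.roll(v, lag)
    return out, lags, weights

t = np.arange(96)
daily = np.sin(2 * np.pi * t / 24)
out, lags, weights = auto_correlation(daily, daily, daily)
flat, _, _ = auto_correlation(daily, daily, np.full(96, 3.0))
print(np.allclose(flat, 3.0))  # True, because the weights sum to 1
```

Every output position is a blend of whole shifted copies of `v`, so for a series with a clean 24-step cycle the output keeps that cycle instead of mixing unrelated points.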

The Encoder (N layers) — models seasonality

Each encoder layer interleaves Auto-Correlation and decomposition. The trend that gets peeled off is discarded ("_"); the encoder deliberately focuses on the seasonal part and produces a clean seasonal representation passed to the decoder.

S¹, _ = SeriesDecomp( AutoCorrelation(X) + X )
S², _ = SeriesDecomp( FeedForward(S¹)   + S¹ )

(FeedForward is a small per-position neural net, and + X is a residual connection — adding the input back so layers learn refinements.)
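The two encoder equations can be traced with a toy layer on one univariate series. Everything learned is replaced by a stand-in: Q, K and V are the input itself, and `feed_forward` is just `tanh`, a placeholder for the real per-position network. The kernel size of 25 and `c=2` are also assumptions of this sketch.

```python
import math
import numpy as np

def series_decomp(x, kernel_size=25):
    """Moving-average split into (seasonal, trend); edge padding keeps the length."""
    half = kernel_size // 2
    padded = np.pad(x, half, mode="edge")
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    return x - trend, trend

def auto_correlation(x, c=2):
    """Single-head Auto-Correlation with Q = K = V = x (no learned projections)."""
    length = len(x)
    spectrum = np.fft.rfft(x)
    corr = np.fft.irfft(spectrum * np.conj(spectrum), n=length) / length
    lags = np.argsort(corr)[::-1][: int(c * math.log(length))]
    weights = np.exp(corr[lags] - corr[lags].max())
    weights /= weights.sum()
    return sum(w * np.roll(x, lag) for lag, w in zip(lags, weights))

def feed_forward(x):
    """Placeholder for the learned per-position network."""
    return np.tanh(x)

def encoder_layer(x):
    s1, _ = series_decomp(auto_correlation(x) + x)  # trend discarded
    s2, _ = series_decomp(feed_forward(s1) + s1)    # trend discarded again
    return s2

rng = np.random.default_rng(0)
t = np.arange(96)
x = np.sin(2 * np.pi * t / 24) + 0.1 * rng.standard_normal(96)
out = encoder_layer(x)
shifted = encoder_layer(x + 5.0)  # same series with a large constant offset
print(out.shape, abs(shifted.mean()) < 1.0)
```

Because each decomposition throws the moving average away, the constant offset of 5 does not survive to the output, which stays centred near zero. That is the sense in which the encoder keeps only the seasonal part.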

Decoder input initialization

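The decoder needs a starting guess for the future. The latter half of the encoder’s input is decomposed into seasonal and trend parts, and each is extended to cover the forecast horizon: the seasonal part is padded with zeros and the trend part with the mean of the encoder input.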
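A sketch of this initialization for a univariate input is shown below. The function name `decoder_init`, the kernel size of 25, and the edge-replicating padding are assumptions made for illustration.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Moving-average split into (seasonal, trend); edge padding keeps the length."""
    half = kernel_size // 2
    padded = np.pad(x, half, mode="edge")
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    return x - trend, trend

def decoder_init(x_enc, horizon, kernel_size=25):
    """Build the seasonal and trend inputs of the decoder from the encoder input."""
    x_enc = np.asarray(x_enc, dtype=float)
    half_len = len(x_enc) // 2
    # Decompose only the latter half of the past window.
    seasonal, trend = series_decomp(x_enc[-half_len:], kernel_size)
    # Future slots: zeros for the seasonal part, the input mean for the trend part.
    seasonal_init = np.concatenate([seasonal, np.zeros(horizon)])
    trend_init = np.concatenate([trend, np.full(horizon, x_enc.mean())])
    return seasonal_init, trend_init

rng = np.random.default_rng(0)
x_enc = 10.0 + rng.standard_normal(96)  # stand-in for 96 past steps
seasonal_init, trend_init = decoder_init(x_enc, horizon=336)
print(seasonal_init.shape, trend_init.shape)  # (384,) (384,)
```

With I = 96 and O = 336 both decoder inputs have length 48 + 336 = 384: the first 48 steps carry real history, and the remaining 336 are placeholders that the decoder refines.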

The Decoder (M layers) — two accumulating pathways

Each decoder layer runs two Auto-Correlations: an inner one (refining the decoder’s own seasonal estimate) and an encoder–decoder one (where Q comes from the decoder and K, V come from the encoder’s seasonal output; this is how past seasonal information flows in). After each Auto-Correlation and FeedForward, a SeriesDecomp splits off trend. Crucially, the trend pieces are not thrown away: they are projected and accumulated into a running trend estimate:

S, T = SeriesDecomp( AutoCorrelation(...) + input )   # seasonal kept, trend extracted
trend_total += W * T                                   # accumulate trend across all blocks

Final prediction

forecast = W_S · X_de (final seasonal)  +  T_de (accumulated trend)

The two refined components are summed. As decomposition blocks are added, the model captures rising trends and sharp seasonal peaks/troughs that a no-decomposition baseline misses.
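The bookkeeping of the two pathways can be checked in isolation. The sketch below strips out every learned part (Auto-Correlation, FeedForward, and the projections W and W_S are all treated as identity, an assumption made purely for illustration) and keeps only the repeated decomposition with trend accumulation.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Moving-average split into (seasonal, trend); edge padding keeps the length."""
    half = kernel_size // 2
    padded = np.pad(x, half, mode="edge")
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    return x - trend, trend

def progressive_decomposition(x, n_blocks=3, kernel_size=25):
    """Repeatedly peel trend off the seasonal pathway and accumulate it."""
    seasonal = np.asarray(x, dtype=float)
    trend_total = np.zeros_like(seasonal)
    for _ in range(n_blocks):
        seasonal, trend = series_decomp(seasonal, kernel_size)
        trend_total += trend  # W is the identity in this sketch
    return seasonal, trend_total

t = np.arange(240)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend_total = progressive_decomposition(x)
forecast_like = seasonal + trend_total  # W_S is the identity in this sketch
print(np.allclose(forecast_like, x))  # True
```

Nothing is lost along the way: whatever leaves the seasonal pathway lands in the accumulated trend, so summing the two at the end recovers the full signal.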

A bird’s-eye sketch:

            ┌──────────────  ENCODER (Nx)  ──────────────┐
 past I  ──▶│ AutoCorr → Decomp → FeedFwd → Decomp        │ (trend discarded,
            └──────────────┬─────────────────────────────┘  keeps SEASONAL)
                           │ seasonal memory (K,V)
                           ▼
            ┌──────────────  DECODER (Mx)  ──────────────┐
 seasonal   │ AutoCorr → Decomp                          │   trend pieces
 init (0s)  │ EncDec-AutoCorr → Decomp                   │   accumulate ──┐
 ──────────▶│ FeedFwd → Decomp                           │                │
 trend init │                                            │                ▼
 (means) ───┼────────────────────────────────────────────┼──▶ trend_total
            └──────────────┬─────────────────────────────┘
                           ▼
            forecast = seasonal_part  +  trend_total   ──▶  next O steps

A Worked Example

Suppose hourly electricity load. You feed I = 96 hours (4 days) and want O = 336 hours (2 weeks) ahead.

  1. Init. Take the last 2 days of the 96-hour input. Decompose it: the slow upward drift becomes the trend seed; the daily up-and-down becomes the seasonal seed. Pad both to cover the 336-hour horizon — seasonal slots filled with zeros, trend slots filled with the 4-day average.

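  2. Auto-Correlation finds the rhythm. Via FFT, the model computes self-similarity at every lag and finds the strongest at τ = 24 (daily) and τ = 168 (weekly). Note that a lag of 168 is longer than the 96-hour encoder input, so in this setup the weekly lag is only available in the decoder, whose sequence spans 48 + 336 = 384 hours. (The paper reports learned lags of 24h and 168h on real traffic/electricity data, which makes the forecast human-interpretable.)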

  3. Time Delay Aggregation. It Rolls the history by 24h and by 168h so that “6 pm chunks” stack onto “6 pm chunks,” then averages them weighted by how confident each period is. Yesterday’s evening peak and last week’s evening peak both inform tonight’s predicted evening peak — as whole sub-series, not isolated points.

  4. Progressive decomposition. Across layers, the smooth “demand is creeping up this month” trend is repeatedly skimmed off and accumulated, while the seasonal pathway sharpens the daily peaks and troughs.

  5. Output. The final 2-week forecast = sharpened daily/weekly seasonal pattern + accumulated rising trend.
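Step 2 of the example can be reproduced on synthetic data. The series below is an invented hourly load with a daily and a weaker weekly cycle. It spans four weeks, longer than the 96-hour input in the example, so that the weekly lag fits inside the window. Lag 0 is skipped because a series always matches itself perfectly there.

```python
import numpy as np

hours = np.arange(24 * 7 * 4)  # four weeks of hourly steps (assumed data)
load = np.sin(2 * np.pi * hours / 24) + 0.5 * np.sin(2 * np.pi * hours / 168)

# Autocorrelation at every lag via FFT (Wiener-Khinchin).
length = len(load)
spectrum = np.fft.rfft(load)
corr = np.fft.irfft(spectrum * np.conj(spectrum), n=length) / length

# Search lags 1..200 for the strongest self-similarity.
candidate_lags = np.arange(1, 201)
best = candidate_lags[np.argmax(corr[1:201])]
print(best)  # 168

# The daily lag is also a strong peak, while the half-day lag is anti-correlated.
print(corr[24] > 0, corr[12] < 0)  # True True
```

The weekly lag wins because shifting by 168 hours lines up both the daily and the weekly cycle at once, while shifting by 24 hours lines up only the daily one.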

Notably, even the Exchange dataset, which has no obvious periodicity, still improves — Auto-Correlation degrades gracefully toward using the most-correlated lags it can find.


Input / Output Summary


Strengths

Limitations


Glossary