Forecasting with Transformers

Four models for long-term forecasting: Informer, Autoformer, PatchTST, and TimesFM

Transformers in forecasting

Four long-horizon forecasters, compared stage by stage through a single pipeline to show where their designs agree, where they diverge, and why.

Scope. Informer, Autoformer, and PatchTST are per-dataset architectures (trained on the series to be forecast), so they compare cleanly stage by stage and are treated together first. TimesFM (2024) is a pretrained, zero-shot foundation model; because it represents a paradigm shift, it is mapped onto the same stages in a separate section.

Glossary. Lookback L — past steps fed in (672). Horizon T — steps predicted (168). Channel — one series (one household). Token — one input unit the Transformer processes (the central disagreement). Self-attention — lets each token weigh every other; naively O(L²).


Lineage

All three per-dataset models are Transformer-based point forecasters trained with MSE, attacking the same problem: the vanilla Transformer’s O(L²) attention is too costly for long series and its step-by-step decoder too slow and error-prone for long horizons. They diverge in what they rethink.

| Year | Model | What it rethinks | Thesis |
| --- | --- | --- | --- |
| 2021 (AAAI) | Informer | The attention computation: skip unimportant query-key pairs | Most attention is wasted; compute only the selective queries. |
| 2021 (NeurIPS) | Autoformer | The operator: period-aware Auto-Correlation + built-in decomposition | Align and average whole sub-series at matching phases, not points. |
| 2023 (ICLR) | PatchTST | The input representation: patch the series, forecast each channel independently | Redesign the token, not the attention. |
| 2024 (ICML) | TimesFM | The training paradigm: pretrain once, forecast zero-shot | Don’t train per dataset at all. |

Informer makes attention cheaper; Autoformer makes it periodicity-aware; PatchTST makes redesigning it unnecessary by changing the tokens. PatchTST is partly a reaction to the first two: by 2022 a plain linear model (DLinear) had outperformed the Informer/Autoformer family, and PatchTST’s aim was to rescue the Transformer by simplifying it. TimesFM changes the question — pretrain one model on billions of timepoints and forecast unseen series zero-shot — so it is compared separately.


The common pipeline

Seven stages, applied to the 321-household problem:

  1. Input & framing — what is one training example?

  2. Normalization — how is distribution drift handled?

  3. Tokenization & embedding — how do raw numbers become tokens?

  4. Core mechanism — the attention (or its replacement) and its cost.

  5. Decomposition — is trend/seasonality separated?

  6. Generation — one-shot or step-by-step?

  7. Output — what is emitted?


Stage 1 — Input & framing

How the multivariate history is sliced into an example: one multivariate window, or many univariate windows?

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Multivariate handling | Channel-mixing: all 321 values per hour stacked into one vector | Channel-mixing: one vector per step holds all channels | Channel-independence: split into 321 univariate series |
| One example | 672 × 321 window + timestamps + start-token history | 672 × 321 window (latter half seeds the decoder) | 321 separate 672 × 1 series through one shared-weight model |
| Affordable lookback | long; each step is a token | long | very long, thanks to cheap patch tokens (Stage 3) |

Trade-off. Channel-mixing is cheap and can exploit cross-channel correlation (a neighbor’s spike informs yours). PatchTST refuses to mix: one household’s signal is not smeared by 320 others, overfitting drops, and one model handles datasets of any channel count. The cost is that PatchTST ignores cross-channel correlation (flagged as future work) — yet it still won on benchmarks.
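The two framings differ only in how the same window is reshaped. The following is a minimal NumPy sketch, not any model's actual data loader: the random `history` array, its 1000-step length, and the `make_window` helper are illustrative assumptions, while L = 672, T = 168, and 321 channels come from the text.

```python
import numpy as np

L, T, C = 672, 168, 321          # lookback, horizon, channels (from the glossary)

# Hypothetical toy history: 1000 hourly steps for all 321 households.
rng = np.random.default_rng(0)
history = rng.normal(size=(1000, C))

def make_window(series, start, lookback=L, horizon=T):
    """Slice one (input, target) pair that begins at index `start`."""
    x = series[start : start + lookback]                        # (L, C)
    y = series[start + lookback : start + lookback + horizon]   # (T, C)
    return x, y

x, y = make_window(history, start=0)

# Channel-mixing (Informer, Autoformer): one example, each step is a C-vector.
mixed_example = x                                # shape (672, 321)

# Channel-independence (PatchTST): C univariate examples for one shared model.
independent_examples = x.T[:, :, None]           # shape (321, 672, 1)
```

In the channel-independent layout the 321 series behave like extra batch entries, which is why the same weights can serve a dataset with any number of channels.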


Stage 2 — Normalization & stationarization

Series drift — winter demand ≠ summer demand. This distribution shift can wreck a model trained on one regime.

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Global scaling | zero-mean/unit-variance | same | same, plus per-instance norm |
| Per-window | none | implicit via decomposition (Stage 5) | instance norm: each window re-centered to mean 0 / std 1 before patching; mean & std added back to the forecast |
| Philosophy | relies on network capacity | removes trend so the seasonal pathway sees a near-stationary signal | neutralizes level/scale per window so the encoder models only shape |

Trade-off. Autoformer and PatchTST reach the same insight from opposite directions: Autoformer subtracts a moving-average trend; PatchTST re-centers each window so winter and summer look comparable in scale. Informer does neither explicitly, trusting capacity and data — simpler, but more exposed to shift. Under instance norm, a 2 kW household and a 9 kW household present comparable shapes to the shared model, with real scale reattached at the end.
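The instance-norm idea can be illustrated with two synthetic households that share a daily shape but differ in level and scale. This is a sketch under stated assumptions: the helper names `instance_norm` and `denorm`, the `eps` constant, and the sine-wave loads are hypothetical, chosen only to mirror the 2 kW versus 9 kW example.

```python
import numpy as np

def instance_norm(window, eps=1e-5):
    """Re-center one univariate window to mean 0 / std 1; keep stats for reversal."""
    mean = window.mean()
    std = window.std() + eps
    return (window - mean) / std, mean, std

def denorm(forecast, mean, std):
    """Reattach the original level and scale to a normalized forecast."""
    return forecast * std + mean

hours = np.arange(672)
shape = np.sin(2 * np.pi * hours / 24)       # shared daily shape (synthetic)
small = 2.0 + 0.5 * shape                    # a ~2 kW household
large = 9.0 + 2.0 * shape                    # a ~9 kW household

small_n, m_s, s_s = instance_norm(small)
large_n, m_l, s_l = instance_norm(large)

# After normalization both windows look (almost) identical to the model;
# the stored mean and std restore the real scale afterwards.
restored = denorm(small_n, m_s, s_s)
```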


Stage 3 — Tokenization & embedding (the crux)

A Transformer processes a sequence of tokens, each embedded as a vector. The largest conceptual fork is what one token represents.

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| One token = | one time step (one hour) | one time step (one hour) | one patch (e.g. 16 hours) |
| Embedding | value (Conv1d) + positional + learnable timestamp (hour/day/week/holiday) | value + positional | linear projection of the length-P patch + learnable position |
| Tokens for L=672 | ~672 | ~672 | ⌊(672−16)/8⌋ + 2 ≈ 84 patches |
| Calendar features | yes, heavily: tells the one-shot decoder “this slot is Monday 9am” | periodicity discovered from data | none; relies on patch position |

Trade-off. This is PatchTST’s headline (“A Time Series is Worth 64 Words”, echoing ViT image patches). A single hour, like a single letter, carries little meaning; a 16-hour patch encodes “a morning ramp” or “an evening peak.” Patching with patch length P and stride S does three things at once: (1) preserves local shape inside the token, (2) cuts token count from L to ~L/S, reducing the quadratic attention cost by a factor of ~S², and (3) frees budget for a longer lookback. Informer and Autoformer, fixed at one token per step, never gain this leverage, which is why a near-trivial change mattered. For the grid, with P = 16 and S = 8, each 672-hour series becomes 84 patch-tokens versus 672 step-tokens, so attention cost drops by roughly 64×.
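The patch count from the table can be checked with a small sketch. It assumes patch length 16 and stride 8 as in the token table, and it pads the series by repeating the last value `stride` times, which is one common way to obtain the "+ 2" in the count; the `patchify` helper and the `arange` stand-in series are illustrative, not the reference implementation.

```python
import numpy as np

def patchify(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches.

    The last value is repeated `stride` times at the end so the tail is
    covered; this padding accounts for the "+ 2" in the patch count.
    """
    padded = np.concatenate([series, np.repeat(series[-1], stride)])
    n_patches = (len(padded) - patch_len) // stride + 1
    starts = np.arange(n_patches) * stride
    return np.stack([padded[s : s + patch_len] for s in starts])

series = np.arange(672, dtype=float)     # stand-in for one household's lookback
patches = patchify(series)               # shape (84, 16)

n_step_tokens = len(series)              # one token per hour
n_patch_tokens = patches.shape[0]        # one token per patch
attention_saving = (n_step_tokens / n_patch_tokens) ** 2
```

With these settings the token count drops by 8× and the quadratic attention term by 64×.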


Stage 4 — Core sequence mechanism

How tokens exchange information, and at what cost — the deepest technical divergence.

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Mechanism | ProbSparse self-attention | Auto-Correlation (replaces attention) | vanilla full self-attention over patches |
| Core trick | sample ~L·lnL query-key pairs, keep Top-u = c·lnL most selective queries, assign the rest mean(V) | FFT autocorrelation R(τ); pick top-k = c·logL lags; Roll values to align same-phase chunks and weighted-sum (Time Delay Aggregation) | standard softmax(QKᵀ/√d)·V; efficiency came at tokenization |
| Granularity | point-wise (sparse subset) | series-wise (aligned sub-series) | patch-wise (chunks via ordinary attention) |
| Interpretability | low (heuristic sampling) | high: top lags match real periods (24h, 168h) | medium (attention over patches) |

Complexity, for token count L:

| Model | Attention cost | Mechanism |
| --- | --- | --- |
| Vanilla | O(L²) | every token attends to every token |
| Informer | O(L·logL) | full attention only for ~logL selective queries |
| Autoformer | O(L·logL) | FFT computes all lags at once (Wiener–Khinchin) |
| PatchTST | O((L/S)²) effective | full attention on N≈L/S patch tokens |
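The Informer row can be made concrete with a simplified, single-head, non-causal sketch of ProbSparse attention. Several details are assumptions made for brevity: the sampling factor `c = 5`, one key sample shared by all queries, and the max-minus-mean score used to rank queries. It is meant to show the idea (score cheaply, attend fully for only ~c·lnL queries, give the rest mean(V)), not to reproduce the published implementation.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5, seed=0):
    """Simplified single-head ProbSparse attention (non-causal sketch)."""
    L_q, d = Q.shape
    L_k = K.shape[0]
    rng = np.random.default_rng(seed)

    n_sample = min(L_k, int(np.ceil(c * np.log(L_k))))   # keys sampled for scoring
    u = min(L_q, int(np.ceil(c * np.log(L_q))))          # queries that get attention

    # Score every query against a random subset of keys only.
    idx = rng.choice(L_k, size=n_sample, replace=False)
    sampled = Q @ K[idx].T / np.sqrt(d)                    # (L_q, n_sample)
    sparsity = sampled.max(axis=1) - sampled.mean(axis=1)  # max-minus-mean measure
    top = np.argsort(sparsity)[-u:]                        # most selective queries

    # Default output: every query receives mean(V) ...
    out = np.tile(V.mean(axis=0), (L_q, 1))
    # ... and only the top-u queries receive real softmax attention.
    scores = Q[top] @ K.T / np.sqrt(d)                     # (u, L_k)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out[top] = weights @ V
    return out, top

rng = np.random.default_rng(1)
Q, K, V = (rng.normal(size=(96, 8)) for _ in range(3))
out, top = probsparse_attention(Q, K, V)
```

For 96 tokens only 23 queries run full attention; the other 73 rows of `out` are identical copies of mean(V).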

Trade-off. Three philosophies. Informer keeps point-wise attention but prunes it: cheap and effective, but a heuristic resting on a sparsity assumption and a sampling factor c. Autoformer argues point-wise attention is wrong for periodic data and swaps in a period-aware operator (today’s 6pm resembles yesterday’s and last week’s): faster and richer when the data is periodic, at the cost of leaning on real periodicity and a well-chosen moving-average window. PatchTST keeps vanilla attention but feeds it patches: since cost is quadratic in token count and patching shrank that ~8×, it gains a ~64× reduction with none of the others’ approximations. This is the “simplicity wins” thesis.
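Autoformer's operator can be sketched in a few lines: compute the autocorrelation for all lags with one FFT, keep the top-k lags, roll the series by each lag, and blend the rolled copies. The code below is a single-series illustration under assumptions: the factor `c = 1`, the softmax weighting, the restriction to lags below L/2, and the synthetic daily-plus-weekly signal are choices made here, not details taken from the text.

```python
import numpy as np

def autocorrelation_fft(x):
    """Circular autocorrelation for all lags at once (Wiener-Khinchin)."""
    x = x - x.mean()
    spectrum = np.fft.rfft(x)
    power = spectrum * np.conj(spectrum)
    return np.fft.irfft(power, n=len(x))          # R(tau), tau = 0 .. L-1

def time_delay_aggregation(x, c=1):
    """Roll the series by its top-k lags and blend with softmax weights."""
    L = len(x)
    k = max(1, int(c * np.log(L)))
    r = autocorrelation_fft(x) / L
    lags = np.argsort(r[1 : L // 2])[-k:] + 1     # skip lag 0; ascending by score
    w = np.exp(r[lags] - r[lags].max())
    w /= w.sum()
    rolled = np.stack([np.roll(x, -lag) for lag in lags])   # align same-phase chunks
    return w @ rolled, lags

hours = np.arange(672)
x = np.sin(2 * np.pi * hours / 24) + 0.5 * np.sin(2 * np.pi * hours / 168)
aggregated, lags = time_delay_aggregation(x)
strongest_lag = int(lags[-1])                     # 168: one week back
```

On this synthetic signal the strongest lag is the weekly one, because at 168 hours both the daily and the weekly cycle line up; the remaining top lags cluster around it and around multiples of 24.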


Stage 5 — Decomposition (trend & seasonality)

A long forecast tangles a slow trend with repeating seasonal cycles. Does the model separate them?

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Built-in decomposition | no | yes, the central idea | no |
| Handling | left to the network | progressive: a SeriesDecomp block (moving-average trend + seasonal remainder) applied inside every layer, even on intermediate predictions | instance norm removes level; encoder learns patterns implicitly |
| Trend pathway | n/a | encoder discards trend (keeps seasonal); decoder accumulates it | n/a |

Trade-off. This is Autoformer’s signature. Classical decomposition is one-shot pre-processing, but a future cannot be decomposed before it is predicted. Autoformer makes decomposition an inner, repeated block, so trend and seasonal components are refined as the forecast is built; the output is literally seasonal + accumulated_trend. The payoff is long-horizon robustness and interpretability; the cost is dependence on a good moving-average window and genuinely separable structure. Informer and PatchTST bet they don’t need it — Informer trusts capacity, PatchTST trusts that patches plus instance norm suffice.
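A single decomposition block of the kind described here can be sketched as a moving average plus a remainder. The window length of 25, the edge padding, and the synthetic ramp-plus-daily-cycle input are assumptions for illustration; in Autoformer the block is applied repeatedly inside the layers, which this standalone sketch does not show.

```python
import numpy as np

def series_decomp(x, kernel=25):
    """Split x into (seasonal, trend) using an edge-padded moving average."""
    pad = (kernel - 1) // 2
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    return x - trend, trend

hours = np.arange(672)
x = 0.01 * hours + np.sin(2 * np.pi * hours / 24)   # slow ramp + daily cycle
seasonal, trend = series_decomp(x)

# The two parts always sum back to the input, which is why a forecast can be
# emitted as seasonal + trend.
reconstruction = seasonal + trend
```

Away from the edges the recovered trend tracks the 0.01-per-hour ramp closely, and the seasonal part keeps the daily oscillation. A window that is badly matched to the true period would leak more of the cycle into the trend, which is the moving-average dependence noted above.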


Stage 6 — Encoder/decoder & generation

How the architecture is wired, and whether the forecast emerges all at once or step by step.

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Architecture | encoder–decoder | encoder–decoder | encoder-only + linear head |
| Distinctive piece | self-attention distilling: Conv → ELU → stride-2 MaxPool halves length between layers (L→L/2→L/4) | encoder keeps seasonal memory; decoder runs inner + cross Auto-Correlation, accumulating trend | vanilla encoder, then flatten |
| Decoder input | start-token + zero placeholders for each future step, each with its timestamp | seasonal seed (zeros) + trend seed (input mean) | none |
| Generation | one-shot generative: 168 values in one pass | one-shot: seasonal + trend over the horizon | one-shot direct: linear head emits all T |
| Autoregressive | no (avoids error accumulation) | no | no |

Trade-off. All three avoid slow, error-accumulating step-by-step decoding — but by different routes. Informer keeps the heaviest machinery (encoder-decoder, distilling for long inputs, a generative decoder fed timestamped zero-placeholders), powerful but knob-heavy. Autoformer reframes the decoder as maintaining two pathways — a sharpening seasonal estimate and an accumulating trend — then summing them. PatchTST drops the decoder entirely: an encoder plus one linear layer to all T values; maximal simplicity, the only catch being that the head grows with T.
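The remark that the PatchTST head grows with T is easy to quantify. The sketch below flattens a stand-in encoder output and applies one linear map to all T steps; the model width `d_model = 128`, the random untrained weights, and the helper name `flatten_head_params` are assumptions for illustration only.

```python
import numpy as np

def flatten_head_params(n_patches, d_model, horizon):
    """Weights plus biases of one Linear(n_patches * d_model -> horizon)."""
    return n_patches * d_model * horizon + horizon

N, d_model, T = 84, 128, 168          # d_model = 128 is an assumed width
rng = np.random.default_rng(0)

encoded = rng.normal(size=(N, d_model))            # stand-in encoder output
W = rng.normal(size=(N * d_model, T)) * 0.01       # untrained head weights
b = np.zeros(T)

# One-shot direct generation: all T future steps come from a single matmul.
forecast = encoded.reshape(-1) @ W + b             # shape (168,)

params_168 = flatten_head_params(N, d_model, 168)  # 1,806,504
params_720 = flatten_head_params(N, d_model, 720)  # 7,742,160
```

Under these assumed sizes, stretching the horizon from 168 to 720 steps roughly quadruples the head while the encoder stays unchanged.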


Stage 7 — Output

| | Informer | Autoformer | PatchTST |
| --- | --- | --- | --- |
| Output | 168 × 321 point forecast | 168 × 321 (seasonal + trend) | 168 × 321 (per-channel, concatenated) |
| Loss | MSE | MSE | MSE (per-channel) |
| Uncertainty intervals | no | no | no |
| Uni↔multivariate switch | resize final FC | re-embed channels | per-channel by construction |
| Self-supervised pretraining | no | no | yes: mask 40% of patches, reconstruct; transfers across datasets |

Trade-off. All three emit point forecasts trained with MSE; none gives native uncertainty intervals (unlike DeepAR). The standout is PatchTST’s masked pretraining: because a patch is a meaningful unit, masking and reconstructing whole patches forces real understanding (a single point could just be interpolated), and the pretrained model transfers across datasets — a step toward time-series foundation models.
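The masked-pretraining objective can be sketched without a model: hide 40% of the patches, then score a reconstruction only on the hidden ones. The non-overlapping 16-step patches, zero-filling of masked patches, and the helper names are assumptions of this sketch; the encoder that would produce the reconstruction is omitted.

```python
import numpy as np

def mask_patches(patches, ratio=0.4, seed=0):
    """Zero out a random `ratio` of whole patches; return corrupted copy and mask."""
    rng = np.random.default_rng(seed)
    n = patches.shape[0]
    n_masked = int(round(ratio * n))
    masked_idx = rng.choice(n, size=n_masked, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    corrupted = patches.copy()
    corrupted[mask] = 0.0
    return corrupted, mask

def masked_mse(reconstruction, target, mask):
    """MSE computed on the masked patches only."""
    return float(np.mean((reconstruction[mask] - target[mask]) ** 2))

hours = np.arange(672)
series = np.sin(2 * np.pi * hours / 24)
patches = series.reshape(42, 16)          # 42 non-overlapping 16-hour patches

corrupted, mask = mask_patches(patches)
loss_if_ignored = masked_mse(corrupted, patches, mask)   # > 0: zeros are wrong
loss_if_perfect = masked_mse(patches, patches, mask)     # 0.0
```

Because a whole 16-hour patch is missing, neighbouring points inside the gap cannot be used for simple interpolation, which is the point made above about masking meaningful units.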


Summary: the per-dataset trio

| Stage | Informer (2021) | Autoformer (2021) | PatchTST (2023) |
| --- | --- | --- | --- |
| 1. Framing | channel-mixing | channel-mixing | channel-independent |
| 2. Normalization | global | decomposition removes trend | global + instance norm |
| 3. Token | time step | time step | patch (~16 steps) |
| 4. Mechanism | ProbSparse, O(L logL) | Auto-Correlation (FFT), O(L logL) | vanilla attention, few tokens |
| 5. Decomposition | none | progressive, every layer | none (instance norm only) |
| 6. Generation | enc-dec + distilling, one-shot | enc-dec, seasonal+trend, one-shot | encoder + linear head, one-shot |
| 7. Output | point | point | point + pretraining |

TimesFM (2024): the foundation-model turn

The trio disagree on tokens, attention, and decomposition, but silently agree on one premise: train on the dataset you want to forecast. TimesFM rejects it.

Borrowing the NLP recipe, TimesFM pretrains one ~200M-parameter decoder-only Transformer on ~100 billion timepoints across domains (Google Trends, Wikipedia, traffic, weather, synthetic), then forecasts a never-seen series zero-shot — no per-dataset training. It is GPT for time series. Crucially, it inherits PatchTST’s patching and bolts it onto GPT’s decoder-only pretraining. Mapped onto the seven stages:

  1. Framing. Univariate, like PatchTST. But “one example” is no longer from your dataset — during pretraining it is a window among billions of points; at inference, your series handed over cold. Random-length masking teaches it to forecast from any context length.

  2. Normalization. Per-context rescaling (megawatts vs. web clicks), reversed on output — PatchTST’s instance norm in spirit.

  3. Tokenization. Patches, but each non-overlapping 32-point patch is tokenized by a small residual MLP, not a single linear projection.

  4. Mechanism. Causal self-attention (GPT-style), ~20 layers — each token sees only the past. The sharp contrast with PatchTST’s bidirectional encoder: a decoder that generates the future must be causal.

  5. Decomposition. None explicit — it leans on a vast, diverse corpus (including synthetic trend/seasonal series) to learn structure implicitly.

  6. Generation. The only autoregressive model of the four: emit a patch, append, roll forward. It limits error accumulation via the output-patch trick — each step emits a 128-long patch (4× the 32-long input patch), so a 168-hour week needs only 2 roll-forward steps.

  7. Output. Still a point forecast with MSE — but the headline is provenance: pretrained once, applied zero-shot, the axis none of the others move on.
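The roll-forward loop in stage 6 can be sketched with a stand-in model. Everything model-related here is hypothetical: `repeat_last_day` simply tiles the last 24 observed values and is not TimesFM, and `autoregressive_forecast` is an illustrative helper rather than a library API. Only the output patch length of 128 and the 168-step horizon come from the text.

```python
import math

def autoregressive_forecast(context, horizon, predict_patch, output_patch_len=128):
    """Roll a patch-emitting model forward until `horizon` steps are covered."""
    history = list(context)
    forecast = []
    n_calls = 0
    while len(forecast) < horizon:
        patch = predict_patch(history)        # one call emits a whole output patch
        assert len(patch) == output_patch_len
        forecast.extend(patch)
        history.extend(patch)                 # append, then roll forward
        n_calls += 1
    return forecast[:horizon], n_calls

def repeat_last_day(history, output_patch_len=128):
    """Hypothetical stand-in model: tile the last 24 observed values."""
    last_day = history[-24:]
    return [last_day[i % 24] for i in range(output_patch_len)]

context = [float(h % 24) for h in range(672)]      # synthetic daily pattern
forecast, n_calls = autoregressive_forecast(context, 168, repeat_last_day)

expected_calls = math.ceil(168 / 128)              # 2 roll-forward steps
```

A decoder that emitted one step per call would need 168 calls for the same week, each feeding on its own previous output; long output patches cut that to two.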

TimesFM vs. the trio:

| | Informer / Autoformer / PatchTST | TimesFM |
| --- | --- | --- |
| Training | per dataset | pretrain once, zero-shot |
| Channels | mixing / mixing / independent | univariate |
| Token | step / step / patch | patch (residual-MLP) |
| Attention | sparse / Auto-Correlation / full, encoder(-decoder) | causal, decoder-only |
| Decomposition | no / yes / no | no (implicit) |
| Generation | one-shot | autoregressive, long output patches |
| Output | point | point |
| Cost on a new dataset | collect + train + tune | just run it |

PatchTST’s lesson: how you represent the input matters more than attention cleverness. TimesFM accepts that lesson and stacks a second one from NLP on top: how much you pretrain matters more than per-dataset fitting. It is the patching idea, scaled into a foundation model.


Choosing a model

The arc: cheaper attention (Informer) → decomposed, periodicity-aware attention (Autoformer) → smarter tokens (PatchTST) → no per-dataset training (TimesFM). The meta-lesson compounds: first representation beat architecture, then scale and pretraining beat per-dataset fitting — the two forces that reshaped NLP, now arriving for time series.