Forecasting with Transformers
Four models for long-term forecasting: Informer, Autoformer, PatchTST, and TimesFM
Transformers in forecasting
Four long-horizon forecasters, compared stage by stage through a single pipeline to show where their designs agree, where they diverge, and why.
Scope. Informer, Autoformer, and PatchTST are per-dataset architectures — trained on the series to be forecast — so they compare cleanly stage by stage and are treated together first. TimesFM (2024) is a pretrained, zero-shot foundation model; it is mapped onto the same stages in a separate section as the paradigm shift it represents.
Glossary. Lookback L — past steps fed in (672). Horizon T — steps predicted (168). Channel — one series (one household). Token — one input unit the Transformer processes (the central disagreement). Self-attention — lets each token weigh every other; naively O(L²).
Lineage
All three per-dataset models are Transformer-based point forecasters trained with MSE, attacking the same problem: the vanilla Transformer’s O(L²) attention is too costly for long series and its step-by-step decoder too slow and error-prone for long horizons. They diverge in what they rethink.
| Year | Model | What it rethinks | Thesis |
|---|---|---|---|
| 2021 (AAAI) | Informer | The attention computation — skip unimportant query-key pairs | Most attention is wasted; compute only the selective queries. |
| 2021 (NeurIPS) | Autoformer | The operator — period-aware Auto-Correlation + built-in decomposition | Align and average whole sub-series at matching phases, not points. |
| 2023 (ICLR) | PatchTST | The input representation — patch the series, forecast each channel independently | Redesign the token, not the attention. |
| 2024 (ICML) | TimesFM | The training paradigm — pretrain once, forecast zero-shot | Don’t train per dataset at all. |
Informer makes attention cheaper; Autoformer makes it periodicity-aware; PatchTST makes redesigning it unnecessary by changing the tokens. PatchTST is partly a reaction to the first two: by 2022 a plain linear model (DLinear) had outperformed the Informer/Autoformer family, and PatchTST’s aim was to rescue the Transformer by simplifying it. TimesFM changes the question — pretrain one model on billions of timepoints and forecast unseen series zero-shot — so it is compared separately.
The common pipeline
Seven stages, applied to the running example: hourly electricity demand for 321 households, with a 672-hour lookback and a 168-hour horizon.
1. Input & framing: what is one training example?
2. Normalization: how is distribution drift handled?
3. Tokenization & embedding: how do raw numbers become tokens?
4. Core mechanism: the attention (or its replacement) and its cost.
5. Decomposition: is trend/seasonality separated?
6. Generation: one-shot or step-by-step?
7. Output: what is emitted?
Stage 1: Input & framing
How the multivariate history is sliced into an example: one multivariate window, or many univariate windows?
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Multivariate handling | Channel-mixing: all 321 values per hour stacked into one vector | Channel-mixing: one vector per step holds all channels | Channel-independence: split into 321 univariate series |
| One example | 672 × 321 window + timestamps + start-token history | 672 × 321 window (latter half seeds the decoder) | 321 separate 672 × 1 series through one shared-weight model |
| Affordable lookback | long; each step is a token | long | very long — cheap patch tokens (Stage 3) |
Trade-off. Channel-mixing is cheap (one sequence per window rather than 321) and can exploit cross-channel correlation (a neighbor’s spike informs yours). PatchTST refuses to mix: one household’s signal is not smeared by 320 others, overfitting drops, and one model handles datasets of any channel count. The cost is that PatchTST ignores cross-channel correlation (flagged as future work), yet it still reported the best benchmark accuracy of the three.
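The two framings differ only in how the same window is reshaped. The following is a minimal NumPy sketch, assuming random numbers as a stand-in for real household data; the names `window`, `mixed_example`, and `independent_examples` are illustrative and do not come from any of the models' code.

```python
import numpy as np

L, C = 672, 321  # lookback steps and channels from the running example

# Hypothetical stand-in for one lookback window of household data.
rng = np.random.default_rng(0)
window = rng.normal(size=(L, C))

# Channel-mixing: one example, each time step is a vector of all 321 channels.
mixed_example = window                        # shape (672, 321)

# Channel-independence: 321 univariate examples that share one model's weights.
independent_examples = window.T[:, :, None]   # shape (321, 672, 1)

print(mixed_example.shape, independent_examples.shape)
```

In the channel-independent layout the channel axis plays the role of a batch axis, which is why the same weights can serve a dataset with any number of channels.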
Stage 2: Normalization & stationarization
Series drift — winter demand ≠ summer demand. This distribution shift can wreck a model trained on one regime.
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Global scaling | zero-mean/unit-variance | same | same, plus per-instance norm |
| Per-window | none | implicit via decomposition (Stage 5) | instance norm: each window re-centered to mean 0 / std 1 before patching; mean & std added back to the forecast |
| Philosophy | relies on network capacity | removes trend so the seasonal pathway sees a near-stationary signal | neutralizes level/scale per window so the encoder models only shape |
Trade-off. Autoformer and PatchTST reach the same insight from opposite directions: Autoformer subtracts a moving-average trend; PatchTST re-centers each window so winter and summer look comparable in scale. Informer does neither explicitly, trusting capacity and data — simpler, but more exposed to shift. Under instance norm, a 2 kW household and a 9 kW household present comparable shapes to the shared model, with real scale reattached at the end.
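The 2 kW versus 9 kW point can be checked numerically. This is a rough sketch of per-window instance normalization under simple assumptions: the two synthetic households share one daily shape, and the small `eps` guard and the function names are illustrative choices rather than anything specified above.

```python
import numpy as np

def instance_normalize(window, eps=1e-5):
    """Re-center one univariate window to mean 0 / std 1 and return its stats."""
    mean = window.mean()
    std = window.std() + eps  # eps (assumed) guards against a constant window
    return (window - mean) / std, mean, std

def denormalize(forecast, mean, std):
    """Reattach the window's level and scale to a normalized forecast."""
    return forecast * std + mean

hours = np.arange(672)
daily_shape = np.sin(2 * np.pi * hours / 24)
small = 2.0 + 0.5 * daily_shape   # household around 2 kW
large = 9.0 + 2.0 * daily_shape   # household around 9 kW

small_norm, small_mean, small_std = instance_normalize(small)
large_norm, large_mean, large_std = instance_normalize(large)

# After normalization the shared model sees nearly identical shapes.
max_gap = np.abs(small_norm - large_norm).max()

# The stored statistics restore the original scale at the end.
restored = denormalize(small_norm, small_mean, small_std)
```

Here `max_gap` is tiny and `restored` matches `small`, so the level and scale are carried around the model rather than through it.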
Stage 3: Tokenization & embedding (the crux)
A Transformer processes a sequence of tokens, each embedded as a vector. The largest conceptual fork is what one token represents.
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| One token = | one time step (one hour) | one time step (one hour) | one patch (e.g. 16 hours) |
| Embedding | value (Conv1d) + positional + learnable timestamp (hour/day/week/holiday) | value + timestamp (no positional term in the reference implementation) | linear projection of the length-P patch + learnable position |
| Tokens for L=672 | ~672 | ~672 | ⌊(672−16)/8⌋ + 2 = 84 patches (patch length P=16, stride S=8) |
| Calendar features | yes, heavily — tells the one-shot decoder “this slot is Monday 9am” | periodicity discovered from data | none; relies on patch position |
Trade-off. This is PatchTST’s headline (“A Time Series is Worth 64 Words”, echoing ViT image patches). A single hour, like a single letter, carries little meaning; a 16-hour patch encodes “a morning ramp” or “an evening peak.” Patching with patch length P and stride S does three things at once: (1) preserves local shape inside the token, (2) cuts token count from L to roughly L/S, reducing attention cost by a factor of about S², and (3) frees budget for a longer lookback. Informer and Autoformer, fixed at one token per step, never gain this leverage, which is why a near-trivial change mattered. For the grid, each 672-hour series becomes 84 patch-tokens instead of 672 step-tokens.
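The token count is easy to verify. Below is a minimal sketch of patching with patch length 16 and stride 8. It assumes the series end is padded by repeating the last value stride-many times, which is one way to arrive at the "+2" in the count; `make_patches` is a hypothetical helper name.

```python
import numpy as np

def make_patches(series, patch_len=16, stride=8):
    """Split a 1-D series into overlapping patches.

    Assumption: the end is padded by repeating the last value `stride` times.
    """
    padded = np.concatenate([series, np.repeat(series[-1], stride)])
    n_patches = (len(padded) - patch_len) // stride + 1
    starts = np.arange(n_patches) * stride
    return np.stack([padded[s:s + patch_len] for s in starts])

series = np.arange(672, dtype=float)   # placeholder for one household's lookback
patches = make_patches(series)

print(patches.shape)                    # (84, 16): 84 tokens of 16 hours each

# Attention cost is quadratic in token count, so the saving is the ratio squared.
reduction = (len(series) / len(patches)) ** 2   # (672 / 84)^2 = 64.0
```

Each row of `patches` would then pass through the linear projection listed in the table to become one token embedding.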
Stage 4: Core sequence mechanism
How tokens exchange information, and at what cost — the deepest technical divergence.
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Mechanism | ProbSparse self-attention | Auto-Correlation (replaces attention) | vanilla full self-attention over patches |
| Core trick | sample ~L·lnL query-key pairs, keep Top-u = c·lnL most selective queries, assign the rest mean(V) | FFT autocorrelation R(τ); pick top-k = c·logL lags; Roll values to align same-phase chunks and weighted-sum (Time Delay Aggregation) | standard softmax(QKᵀ/√d)·V; efficiency came at tokenization |
| Granularity | point-wise (sparse subset) | series-wise (aligned sub-series) | patch-wise (chunks via ordinary attention) |
| Interpretability | low (heuristic sampling) | high — top lags match real periods (24h, 168h) | medium (attention over patches) |
Complexity, for token count L:
| Model | Attention cost | Mechanism |
|---|---|---|
| Vanilla | O(L²) | every token attends to every token |
| Informer | O(L·logL) | full attention only for ~logL selective queries |
| Autoformer | O(L·logL) | FFT computes all lags at once (Wiener–Khinchin) |
| PatchTST | O((L/S)²) effective | full attention on N≈L/S patch tokens |
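The Informer row can be made concrete with a simplified, single-head sketch. It is an illustration under stated assumptions, not Informer's implementation: the full score matrix is computed here for readability, whereas the actual method estimates the sparsity measure from sampled keys, which is where the savings come from. The function name and the default `c=5` are illustrative.

```python
import numpy as np

def probsparse_attention(Q, K, V, c=5):
    """Simplified ProbSparse self-attention for one head (no key sampling)."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)

    # Sparsity measure: a selective query has a peaked score row (max >> mean).
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    u = min(L, int(np.ceil(c * np.log(L))))
    active = np.argsort(sparsity)[-u:]          # the Top-u selective queries

    # Lazy queries all receive the mean of V.
    out = np.tile(V.mean(axis=0), (L, 1))

    # Active queries receive ordinary softmax attention.
    s = scores[active]
    weights = np.exp(s - s.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    out[active] = weights @ V
    return out, active

rng = np.random.default_rng(0)
L, d = 96, 16
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out, active = probsparse_attention(Q, K, V)
print(len(active), "of", L, "queries get full attention")
```

With L=96 and c=5, only 23 queries are treated as selective; every other row of `out` is the same vector, `V.mean(axis=0)`.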
Trade-off. Three philosophies. Informer keeps point-wise attention but prunes it: cheap and effective, but a heuristic resting on a sparsity assumption and a sampling factor c. Autoformer argues point-wise attention is the wrong operator for periodic data and swaps in a period-aware one (today’s 6pm resembles yesterday’s and last week’s): cheaper than full attention and more interpretable, at the cost of leaning on real periodicity and a well-chosen moving-average window. PatchTST keeps vanilla attention but feeds it patches: since cost is quadratic in token count and patching shrank that count about 8× (672 → 84), attention cost falls about 64× with none of the others’ approximations. This is the “simplicity wins” thesis.
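The period-aware operator can be sketched in a few lines of NumPy. The example assumes a noise-free synthetic load with daily and weekly cycles. The three function names, the restriction to lags below L/2, and the roll direction are illustrative choices, not Autoformer's exact code.

```python
import numpy as np

def autocorrelation_fft(x):
    """Circular autocorrelation R(tau) for every lag at once (Wiener-Khinchin)."""
    x = x - x.mean()
    spectrum = np.fft.rfft(x)
    power = spectrum * np.conj(spectrum)
    return np.fft.irfft(power, n=len(x)) / len(x)

def top_k_lags(r, c=1):
    """Pick the k = int(c * ln L) lags with the highest autocorrelation."""
    L = len(r)
    k = int(c * np.log(L))
    candidates = r[1 : L // 2]                 # skip lag 0 and mirrored lags
    return np.argsort(candidates)[::-1][:k] + 1

def time_delay_aggregate(values, r, lags):
    """Roll the series by each chosen lag and take a softmax-weighted sum."""
    weights = np.exp(r[lags] - r[lags].max())
    weights /= weights.sum()
    return sum(w * np.roll(values, -lag) for w, lag in zip(weights, lags))

hours = np.arange(672)
load = np.sin(2 * np.pi * hours / 24) + 0.5 * np.sin(2 * np.pi * hours / 168)

r = autocorrelation_fft(load)
lags = top_k_lags(r)
aggregated = time_delay_aggregate(load, r, lags)
print(lags)
```

On this signal the strongest lag is 168 hours, where both the daily and the weekly cycle line up. This sketch does no peak deduplication, so neighbouring lags around a strong period can also land in the top-k.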
Stage 5: Decomposition (trend & seasonality)
A long forecast tangles a slow trend with repeating seasonal cycles. Does the model separate them?
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Built-in decomposition | no | yes — the central idea | no |
| Handling | left to the network | progressive: a SeriesDecomp block (moving-average trend + seasonal remainder) applied inside every layer, even on intermediate predictions | instance norm removes level; encoder learns patterns implicitly |
| Trend pathway | n/a | encoder discards trend (keeps seasonal); decoder accumulates it | n/a |
Trade-off. This is Autoformer’s signature. Classical decomposition is one-shot pre-processing, but a future cannot be decomposed before it is predicted. Autoformer makes decomposition an inner, repeated block, so trend and seasonal components are refined as the forecast is built; the output is literally seasonal + accumulated_trend. The payoff is long-horizon robustness and interpretability; the cost is dependence on a good moving-average window and genuinely separable structure. Informer and PatchTST bet they don’t need it — Informer trusts capacity, PatchTST trusts that patches plus instance norm suffice.
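A decomposition block of this kind is small. The sketch below is a minimal NumPy version, assuming an odd window of 25 steps and edge padding by repeating the first and last values so the trend keeps the input length; `series_decomp` is a hypothetical name for it.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    """Split a 1-D series into a moving-average trend and a seasonal remainder.

    Assumes an odd kernel_size; edges are padded by repeating the end values.
    """
    pad = (kernel_size - 1) // 2
    padded = np.concatenate([np.repeat(x[0], pad), x, np.repeat(x[-1], pad)])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(padded, kernel, mode="valid")   # same length as x
    seasonal = x - trend
    return seasonal, trend

hours = np.arange(672)
x = 0.01 * hours + np.sin(2 * np.pi * hours / 24)   # slow drift plus daily cycle
seasonal, trend = series_decomp(x)

# The two parts always sum back to the input.
reconstruction_error = np.abs(seasonal + trend - x).max()
```

Because the block is cheap and exact in this sense, it can be applied repeatedly to intermediate results, which is what makes the progressive scheme practical.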
Stage 6: Encoder/decoder & generation
How the architecture is wired, and whether the forecast emerges all at once or step by step.
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Architecture | encoder–decoder | encoder–decoder | encoder-only + linear head |
| Distinctive piece | self-attention distilling: Conv → ELU → stride-2 MaxPool halves length between layers (L→L/2→L/4) | encoder keeps seasonal memory; decoder runs inner + cross Auto-Correlation, accumulating trend | vanilla encoder, then flatten |
| Decoder input | start-token + zero placeholders for each future step, each with its timestamp | seasonal seed (latter half of the input’s seasonal part, then zeros) + trend seed (latter half of its trend part, then the input mean) | none |
| Generation | one-shot generative — 168 values in one pass | one-shot — seasonal + trend over the horizon | one-shot direct — linear head emits all T |
| Autoregressive | no (avoids error accumulation) | no | no |
Trade-off. All three avoid slow, error-accumulating step-by-step decoding — but by different routes. Informer keeps the heaviest machinery (encoder-decoder, distilling for long inputs, a generative decoder fed timestamped zero-placeholders), powerful but knob-heavy. Autoformer reframes the decoder as maintaining two pathways — a sharpening seasonal estimate and an accumulating trend — then summing them. PatchTST drops the decoder entirely: an encoder plus one linear layer to all T values; maximal simplicity, the only catch being that the head grows with T.
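How fast the head grows with T is simple arithmetic. The sketch below counts the parameters of a single linear layer that maps the flattened patch embeddings to the horizon. The embedding width `d_model = 128` is an assumption for illustration, and `flatten_head_params` is a hypothetical helper.

```python
def flatten_head_params(n_patches, d_model, horizon):
    """Weights plus biases of one linear layer: (n_patches * d_model) -> horizon."""
    return n_patches * d_model * horizon + horizon

n_patches = 84    # tokens for L=672 with P=16, S=8
d_model = 128     # assumed embedding width

for horizon in (96, 168, 336, 720):
    print(horizon, flatten_head_params(n_patches, d_model, horizon))

params_168 = flatten_head_params(n_patches, d_model, 168)   # 1,806,504
params_336 = flatten_head_params(n_patches, d_model, 336)
```

The count is linear in the horizon: doubling T from 168 to 336 doubles the head, while the encoder itself is unchanged.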
Stage 7: Output
| | Informer | Autoformer | PatchTST |
|---|---|---|---|
| Output | 168 × 321 point forecast | 168 × 321 (seasonal + trend) | 168 × 321 (per-channel, concatenated) |
| Loss | MSE | MSE | MSE (per-channel) |
| Uncertainty intervals | no | no | no |
| Uni↔multivariate switch | resize final FC | re-embed channels | per-channel by construction |
| Self-supervised pretraining | no | no | yes — mask 40% of patches, reconstruct; transfers across datasets |
Trade-off. All three emit point forecasts trained with MSE; none gives native uncertainty intervals (unlike DeepAR). The standout is PatchTST’s masked pretraining: because a patch is a meaningful unit, masking and reconstructing whole patches forces real understanding (a single point could just be interpolated), and the pretrained model transfers across datasets — a step toward time-series foundation models.
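The masking objective can be sketched as follows. The example assumes non-overlapping 16-step patches, zeroing as the masking value, and a loss computed on masked patches only; the helper names are illustrative, and the reconstruction model itself is left out.

```python
import numpy as np

def mask_patches(patches, mask_ratio=0.4, rng=None):
    """Zero out a random subset of whole patches and return the boolean mask."""
    if rng is None:
        rng = np.random.default_rng()
    n = patches.shape[0]
    n_masked = int(round(mask_ratio * n))
    mask = np.zeros(n, dtype=bool)
    mask[rng.choice(n, size=n_masked, replace=False)] = True
    corrupted = patches.copy()
    corrupted[mask] = 0.0        # assumed masking value
    return corrupted, mask

def masked_mse(prediction, target, mask):
    """Reconstruction loss measured on the masked patches only."""
    return np.mean((prediction[mask] - target[mask]) ** 2)

hours = np.arange(672)
series = np.sin(2 * np.pi * hours / 24)
patches = series.reshape(42, 16)            # 42 non-overlapping patches

corrupted, mask = mask_patches(patches, rng=np.random.default_rng(0))
baseline_loss = masked_mse(corrupted, patches, mask)   # loss of predicting zeros
```

A model trained on this objective has to beat `baseline_loss` by filling in 17 of the 42 patches from their context, which is not possible by interpolating between adjacent points.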
Summary: the per-dataset trio
| Stage | Informer (2021) | Autoformer (2021) | PatchTST (2023) |
|---|---|---|---|
| 1. Framing | channel-mixing | channel-mixing | channel-independent |
| 2. Normalization | global | decomposition removes trend | global + instance norm |
| 3. Token | time step | time step | patch (~16 steps) |
| 4. Mechanism | ProbSparse, O(L logL) | Auto-Correlation (FFT), O(L logL) | vanilla attention, few tokens |
| 5. Decomposition | none | progressive, every layer | none (instance norm only) |
| 6. Generation | enc-dec + distilling, one-shot | enc-dec, seasonal+trend, one-shot | encoder + linear head, one-shot |
| 7. Output | point | point | point + pretraining |
TimesFM (2024): the foundation-model turn
The trio disagree on tokens, attention, and decomposition, but silently agree on one premise: train on the dataset you want to forecast. TimesFM rejects it.
Borrowing the NLP recipe, TimesFM pretrains one ~200M-parameter decoder-only Transformer on ~100 billion timepoints across domains (Google Trends, Wikipedia, traffic, weather, synthetic), then forecasts a never-seen series zero-shot — no per-dataset training. It is GPT for time series. Crucially, it inherits PatchTST’s patching and bolts it onto GPT’s decoder-only pretraining. Mapped onto the seven stages:
Framing. Univariate, like PatchTST. But “one example” is no longer from your dataset — during pretraining it is a window among billions of points; at inference, your series handed over cold. Random-length masking teaches it to forecast from any context length.
Normalization. Per-context rescaling (megawatts vs. web clicks), reversed on output — PatchTST’s instance norm in spirit.
Tokenization. Patches, but each non-overlapping 32-point patch is tokenized by a small residual MLP, not a single linear projection.
Mechanism. Causal self-attention (GPT-style), ~20 layers — each token sees only the past. The sharp contrast with PatchTST’s bidirectional encoder: a decoder that generates the future must be causal.
Decomposition. None explicit — it leans on a vast, diverse corpus (including synthetic trend/seasonal series) to learn structure implicitly.
Generation. The only autoregressive model of the four: emit a patch, append, roll forward. It limits error accumulation via the output-patch trick — each step emits a 128-long patch (4× the 32-long input patch), so a 168-hour week needs only 2 roll-forward steps.
Output. Still a point forecast with MSE — but the headline is provenance: pretrained once, applied zero-shot, the axis none of the others move on.
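The roll-forward loop from the Generation stage can be sketched without the model itself. Here `predict_patch` stands for any function that maps the history to the next 128 values, and `repeat_last_value` is a hypothetical placeholder for it, not TimesFM's API.

```python
import math
import numpy as np

def autoregressive_forecast(context, horizon, predict_patch, output_patch_len=128):
    """Emit one output patch per step, append it to the history, and repeat."""
    history = np.asarray(context, dtype=float)
    pieces = []
    n_steps = math.ceil(horizon / output_patch_len)
    for _ in range(n_steps):
        patch = predict_patch(history)            # next output_patch_len values
        pieces.append(patch)
        history = np.concatenate([history, patch])
    return np.concatenate(pieces)[:horizon], n_steps

def repeat_last_value(history, output_patch_len=128):
    """Placeholder model (assumption): repeats the last observed value."""
    return np.full(output_patch_len, history[-1])

context = np.sin(2 * np.pi * np.arange(672) / 24)
forecast, n_steps = autoregressive_forecast(context, 168, repeat_last_value)
print(len(forecast), n_steps)   # 168 values from 2 roll-forward steps
```

With 32-long output patches the same week would take six steps, so the longer output patch cuts the number of times the model feeds on its own predictions.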
TimesFM vs. the trio:
| | Informer / Autoformer / PatchTST | TimesFM |
|---|---|---|
| Training | per dataset | pretrain once, zero-shot |
| Channels | mixing / mixing / independent | univariate |
| Token | step / step / patch | patch (residual-MLP) |
| Attention | sparse / Auto-Correlation / full — encoder(-decoder) | causal, decoder-only |
| Decomposition | no / yes / no | no (implicit) |
| Generation | one-shot | autoregressive, long output patches |
| Output | point | point |
| Cost on a new dataset | collect + train + tune | just run it |
PatchTST’s lesson: how you represent the input matters more than attention cleverness. TimesFM accepts it and stacks a second from NLP: how much you pretrain matters more than per-dataset fitting. It is the patching idea, scaled into a foundation model.
Choosing a model
PatchTST — the default for standard long-term forecasting: highest reported accuracy (~21% MSE improvement over prior Transformers), genuine long lookbacks, fast training, clean encoder-only design. Also the pick for self-supervised transfer or varying channel counts.
Autoformer — for clearly periodic data (energy, traffic, weather) where interpretability matters; it reports which periods (24h, 168h) drive the forecast and is robust at very long horizons. Weaker on irregular or regime-shifting data.
Informer — for extremely long inputs that strain memory (its distilling mechanism was built for this) and where calendar features are richly meaningful. Also the reference point for the lineage.
TimesFM — for a forecast now, with no training: cold starts, many disparate series, or a strong baseline. Univariate and point-only, and a model trained on your data can still edge it out — but it lands surprisingly close zero-shot.
None of them — for short horizons or tiny data (ARIMA, exponential smoothing compete) or when you need calibrated uncertainty intervals (DeepAR and other probabilistic models).
The arc: cheaper attention (Informer) → decomposed, periodicity-aware attention (Autoformer) → smarter tokens (PatchTST) → no per-dataset training (TimesFM). The meta-lesson compounds: first representation beat architecture, then scale and pretraining beat per-dataset fitting — the two forces that reshaped NLP, now arriving for time series.