Forecasting with Transformers - Luca's Research @ PEG

Transformers in forecasting

Four long-horizon forecasters, compared stage by stage through a single pipeline to show where their designs agree, where they diverge, and why.

Scope. Informer, Autoformer, and PatchTST are per-dataset architectures — trained on the series to be forecast — so they compare cleanly stage by stage and are treated together first. TimesFM (2024) is a pretrained, zero-shot foundation model; it is mapped onto the same stages in a separate section as the paradigm shift it represents.

Glossary. Lookback L — past steps fed in (672). Horizon T — steps predicted (168). Channel — one series (one household). Token — one input unit the Transformer processes (the central disagreement). Self-attention — lets each token weigh every other; naively O(L²).

Lineage

All three per-dataset models are Transformer-based point forecasters trained with MSE, attacking the same problem: the vanilla Transformer’s O(L²) attention is too costly for long series and its step-by-step decoder too slow and error-prone for long horizons. They diverge in what they rethink.

Year	Model	What it rethinks	Thesis
2021 (AAAI)	Informer	The attention computation — skip unimportant query-key pairs	Most attention is wasted; compute only the selective queries.
2021 (NeurIPS)	Autoformer	The operator — period-aware Auto-Correlation + built-in decomposition	Align and average whole sub-series at matching phases, not points.
2023 (ICLR)	PatchTST	The input representation — patch the series, forecast each channel independently	Redesign the token, not the attention.
2024 (ICML)	TimesFM	The training paradigm — pretrain once, forecast zero-shot	Don’t train per dataset at all.

Informer makes attention cheaper; Autoformer makes it periodicity-aware; PatchTST makes redesigning it unnecessary by changing the tokens. PatchTST is partly a reaction to the first two: by 2022 a plain linear model (DLinear) had outperformed the Informer/Autoformer family, and PatchTST’s aim was to rescue the Transformer by simplifying it. TimesFM changes the question — pretrain one model on billions of timepoints and forecast unseen series zero-shot — so it is compared separately.

The common pipeline

Seven stages, applied to the 321-household problem:

1. Input & framing: what is one training example?
2. Normalization: how is distribution drift handled?
3. Tokenization & embedding: how do raw numbers become tokens?
4. Core mechanism: the attention (or its replacement) and its cost.
5. Decomposition: is trend/seasonality separated?
6. Generation: one-shot or step-by-step?
7. Output: what is emitted?

Stage 1 — Input & framing

How the multivariate history is sliced into an example: one multivariate window, or many univariate windows?

	Informer	Autoformer	PatchTST
Multivariate handling	Channel-mixing: all 321 values per hour stacked into one vector	Channel-mixing: one vector per step holds all channels	Channel-independence: split into 321 univariate series
One example	`672 × 321` window + timestamps + start-token history	`672 × 321` window (latter half seeds the decoder)	321 separate `672 × 1` series through one shared-weight model
Affordable lookback	long; each step is a token	long	very long — cheap patch tokens (Stage 3)

Trade-off. Channel-mixing is cheap and can exploit cross-channel correlation (a neighbor’s spike informs yours). PatchTST refuses to mix: one household’s signal is not smeared by 320 others, overfitting drops, and one model handles datasets of any channel count. The cost is that PatchTST ignores cross-channel correlation (flagged as future work), yet it still reported the best accuracy on the standard long-horizon benchmarks.
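The two framings differ only in how the same window is sliced. A minimal sketch, using toy dimensions (6 steps, 3 channels) in place of the 672 × 321 window; the function names are illustrative and do not come from any of the three codebases:

```python
# Toy dimensions stand in for the 672 x 321 window described above.
LOOKBACK, CHANNELS = 6, 3

# window[t][c] is the load of household c at hour t (synthetic values).
window = [[10 * c + t for c in range(CHANNELS)] for t in range(LOOKBACK)]

def channel_mixing_tokens(window):
    """One token per time step; each token holds every channel's value."""
    return [list(step) for step in window]

def channel_independent_series(window):
    """One univariate series per channel, each fed to the same shared model."""
    return [list(series) for series in zip(*window)]

mixed = channel_mixing_tokens(window)
independent = channel_independent_series(window)

print(len(mixed), len(mixed[0]))              # 6 steps, 3 values per step
print(len(independent), len(independent[0]))  # 3 series, 6 steps per series
```

In the channel-independent case the channel axis effectively becomes part of the batch, which is why the same weights can serve a dataset with any number of channels.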

Stage 2 — Normalization & stationarization

Series drift — winter demand ≠ summer demand. This distribution shift can wreck a model trained on one regime.

	Informer	Autoformer	PatchTST
Global scaling	zero-mean/unit-variance	same	same, plus per-instance norm
Per-window	none	implicit via decomposition (Stage 5)	instance norm: each window re-centered to mean 0 / std 1 before patching; mean & std added back to the forecast
Philosophy	relies on network capacity	removes trend so the seasonal pathway sees a near-stationary signal	neutralizes level/scale per window so the encoder models only shape

Trade-off. Autoformer and PatchTST reach the same insight from opposite directions: Autoformer subtracts a moving-average trend; PatchTST re-centers each window so winter and summer look comparable in scale. Informer does neither explicitly, trusting capacity and data — simpler, but more exposed to shift. Under instance norm, a 2 kW household and a 9 kW household present comparable shapes to the shared model, with real scale reattached at the end.
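The per-window re-centering can be sketched in a few lines. This is an illustration of the idea only: the `eps` stabilizer and the function names are assumptions, and the two households are synthetic (the second is the first scaled by 4.5).

```python
import math

def instance_norm(window, eps=1e-5):
    """Re-center one lookback window to mean 0 / std 1; keep the statistics."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)  # eps is an assumed guard against flat windows
    return [(v - mean) / std for v in window], mean, std

def denormalize(forecast, mean, std):
    """Reattach the window's level and scale to a normalized forecast."""
    return [y * std + mean for y in forecast]

small = [2.0, 2.5, 3.0, 2.5]        # roughly a 2 kW household
large = [9.0, 11.25, 13.5, 11.25]   # same shape at roughly 9 kW

small_n, small_mean, small_std = instance_norm(small)
large_n, large_mean, large_std = instance_norm(large)

# After normalization the two households look almost identical to the model.
gap = max(abs(a - b) for a, b in zip(small_n, large_n))
restored = denormalize(large_n, large_mean, large_std)
print(gap < 1e-3, restored)
```

The shared encoder only ever sees the normalized shape; the stored mean and std are used once, at the very end, to put the forecast back in kilowatts.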

Stage 3 — Tokenization & embedding (the crux)

A Transformer processes a sequence of tokens, each embedded as a vector. The largest conceptual fork is what one token represents.

	Informer	Autoformer	PatchTST
One token =	one time step (one hour)	one time step (one hour)	one patch (e.g. 16 hours)
Embedding	value (Conv1d) + positional + learnable timestamp (hour/day/week/holiday)	value + positional	linear projection of the length-`P` patch + learnable position
Tokens for `L=672`	~672	~672	`⌊(672−16)/8⌋ + 2 = 84` patches (`P=16`, stride `S=8`, end-padded)
Calendar features	yes, heavily — tells the one-shot decoder “this slot is Monday 9am”	periodicity discovered from data	none; relies on patch position

Trade-off. This is PatchTST’s headline (“A Time Series is Worth 64 Words”, echoing ViT image patches). A single hour, like a single letter, carries little meaning; a 16-hour patch encodes “a morning ramp” or “an evening peak.” Patching does three things at once: (1) it preserves local shape inside the token, (2) it cuts the token count from L to ~L/S, and with it the quadratic attention cost by a factor of ~S², and (3) it frees budget for a longer lookback. Informer and Autoformer, fixed at one token per step, never gain this leverage, which is why a near-trivial change mattered. For the grid example, each 672-hour series becomes 84 patch tokens (P = 16, S = 8) versus 672 step tokens.
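The patch count in the table can be checked with a short sketch. It assumes the series is padded at the end by repeating its last value `stride` times before windows are cut, which is what produces the `+ 2` in the formula; `make_patches` is an illustrative name, and the learned linear projection of each patch is omitted.

```python
def make_patches(series, patch_len=16, stride=8):
    """Cut a univariate series into overlapping patches after end-padding."""
    padded = list(series) + [series[-1]] * stride
    last_start = len(padded) - patch_len
    return [padded[i:i + patch_len] for i in range(0, last_start + 1, stride)]

series = [float(t % 24) for t in range(672)]  # synthetic daily cycle, 4 weeks
patches = make_patches(series)

print(len(series), "step tokens ->", len(patches), "patch tokens")
print("formula:", (672 - 16) // 8 + 2)
```

Each of the 84 patches would then be projected to a `d_model`-sized vector, so the encoder attends over 84 tokens instead of 672.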

Stage 4 — Core sequence mechanism

How tokens exchange information, and at what cost — the deepest technical divergence.

	Informer	Autoformer	PatchTST
Mechanism	ProbSparse self-attention	Auto-Correlation (replaces attention)	vanilla full self-attention over patches
Core trick	sample ~`L·lnL` query-key pairs, keep Top-u = c·lnL most selective queries, assign the rest `mean(V)`	FFT autocorrelation `R(τ)`; pick top-`k = c·logL` lags; Roll values to align same-phase chunks and weighted-sum (Time Delay Aggregation)	standard `softmax(QKᵀ/√d)·V`; efficiency came at tokenization
Granularity	point-wise (sparse subset)	series-wise (aligned sub-series)	patch-wise (chunks via ordinary attention)
Interpretability	low (heuristic sampling)	high — top lags match real periods (24h, 168h)	medium (attention over patches)
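The Informer column above can be made concrete with a simplified, single-head sketch in plain Python. It follows the recipe in the table (score each query on about ln L sampled keys, keep the top c·ln L queries, give every other query `mean(V)`), but it is not Informer's implementation: the rounding with `ceil`, the per-query sampling with replacement, and all names are assumptions made for illustration.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def probsparse_attention(Q, K, V, c=1.0, seed=0):
    """Simplified single-head ProbSparse attention on lists of vectors."""
    rng = random.Random(seed)
    n_q, n_k, d = len(Q), len(K), len(Q[0])
    scale = math.sqrt(d)
    n_sample = min(n_k, max(1, math.ceil(c * math.log(n_k))))  # keys scored per query
    top_u = min(n_q, max(1, math.ceil(c * math.log(n_q))))     # queries kept

    # Sparsity score per query: max minus mean of a few sampled scaled scores.
    sparsity = []
    for q in Q:
        sampled = [rng.randrange(n_k) for _ in range(n_sample)]
        scores = [dot(q, K[j]) / scale for j in sampled]
        sparsity.append(max(scores) - sum(scores) / len(scores))
    active = sorted(range(n_q), key=lambda i: sparsity[i], reverse=True)[:top_u]

    # Every query defaults to mean(V); only the active ones get full attention.
    mean_v = [sum(col) / n_k for col in zip(*V)]
    out = [list(mean_v) for _ in range(n_q)]
    for i in active:
        weights = softmax([dot(Q[i], k) / scale for k in K])
        out[i] = [sum(w * v[t] for w, v in zip(weights, V)) for t in range(len(mean_v))]
    return out, sorted(active)

rng = random.Random(1)
L, d = 32, 4
Q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
K = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
V = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(L)]
out, active = probsparse_attention(Q, K, V)
print(len(active), "of", L, "queries received full attention")
```

The scoring pass touches about L·ln L query-key pairs and the full-attention pass another ln L · L, which is where the sub-quadratic cost comes from.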

Complexity, for token count L:

Model	Attention cost	Mechanism
Vanilla	`O(L²)`	every token attends to every token
Informer	`O(L·logL)`	full attention only for `~logL` selective queries
Autoformer	`O(L·logL)`	FFT computes all lags at once (Wiener–Khinchin)
PatchTST	`O((L/S)²)` effective	full attention on `N≈L/S` patch tokens
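Autoformer's lag search and Time Delay Aggregation can be sketched on a raw univariate series. Several simplifications are assumed here: the autocorrelation is computed by a direct O(L²) circular sum rather than by FFT, the series is mean-centered, lag 0 and mirrored lags are excluded, and the operator is applied to the series itself rather than to learned query/key/value projections. All function names are illustrative.

```python
import math

def autocorrelation(x):
    """Circular autocorrelation R(tau) of a mean-centered series, tau = 0..n-1."""
    n = len(x)
    mean = sum(x) / n
    xc = [v - mean for v in x]
    return [sum(xc[t] * xc[(t - tau) % n] for t in range(n)) / n for tau in range(n)]

def top_lags(x, k=None, c=1.0):
    """Pick the k lags with the highest autocorrelation (default k = c * ln L)."""
    n = len(x)
    if k is None:
        k = max(1, int(c * math.log(n)))
    r = autocorrelation(x)
    candidates = range(1, n // 2 + 1)  # skip lag 0 and mirrored lags
    lags = sorted(candidates, key=lambda tau: r[tau], reverse=True)[:k]
    return lags, [r[tau] for tau in lags]

def time_delay_aggregation(values, lags, scores):
    """Roll the series by each lag and take a softmax-weighted sum."""
    n = len(values)
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    weights = [e / sum(exps) for e in exps]
    return [sum(w * values[(t + tau) % n] for w, tau in zip(weights, lags))
            for t in range(n)]

# Four synthetic days with an evening peak at 18:00-20:00.
x = [1.0 if t % 24 in (18, 19, 20) else 0.0 for t in range(96)]
lags, scores = top_lags(x, k=2)
aggregated = time_delay_aggregation(x, lags, scores)
print(sorted(lags))  # [24, 48]
```

Because the selected lags are whole multiples of the daily period, every rolled copy lines up with the original, so aggregation averages same-phase sub-series instead of individual points.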

Trade-off. Three philosophies. Informer keeps point-wise attention but prunes it: cheap and effective, but a heuristic resting on a sparsity assumption and a sampling factor c. Autoformer argues that point-wise attention is the wrong operator for periodic data and swaps in a period-aware one (today’s 6pm resembles yesterday’s and last week’s). That buys sub-quadratic cost and series-level aggregation, at the cost of leaning on real periodicity and a well-chosen moving-average window; on weakly periodic data it has less to work with. PatchTST keeps vanilla attention but feeds it patches: cost is quadratic in token count, and patching with stride S = 8 shrinks that count ~8× (672 → 84), so attention cost falls ~64× with none of the others’ approximations. This is the “simplicity wins” thesis.
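The 8× and 64× figures follow directly from the token counts. A small arithmetic check, using pair counts as a rough proxy for attention cost (constant factors, `d_model`, and the number of heads are ignored, and the natural log is an assumed choice of base):

```python
import math

L, P, S = 672, 16, 8
N = (L - P) // S + 2  # patch tokens after end-padding

pair_counts = {
    "vanilla, L^2": L * L,
    "Informer / Autoformer, L*ln(L)": round(L * math.log(L)),
    "PatchTST, N^2": N * N,
}
for name, count in pair_counts.items():
    print(f"{name:32s} {count:>8d}")

print("token reduction:", L / N)
print("attention reduction:", (L * L) / (N * N))
```

These are asymptotic proxies, not a wall-clock ranking: the sub-quadratic methods carry their own constants (sampling, FFTs), while PatchTST pays the plain quadratic cost on a much shorter sequence.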

Stage 5 — Decomposition (trend & seasonality)

A long forecast tangles a slow trend with repeating seasonal cycles. Does the model separate them?

	Informer	Autoformer	PatchTST
Built-in decomposition	no	yes — the central idea	no
Handling	left to the network	progressive: a `SeriesDecomp` block (moving-average trend + seasonal remainder) applied inside every layer, even on intermediate predictions	instance norm removes level; encoder learns patterns implicitly
Trend pathway	n/a	encoder discards trend (keeps seasonal); decoder accumulates it	n/a

Trade-off. This is Autoformer’s signature. Classical decomposition is one-shot pre-processing, but a future cannot be decomposed before it is predicted. Autoformer makes decomposition an inner, repeated block, so trend and seasonal components are refined as the forecast is built; the output is literally seasonal + accumulated_trend. The payoff is long-horizon robustness and interpretability; the cost is dependence on a good moving-average window and genuinely separable structure. Informer and PatchTST bet they don’t need it — Informer trusts capacity, PatchTST trusts that patches plus instance norm suffice.
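The decomposition block itself is small. A minimal sketch, assuming an odd moving-average window (25 here, chosen for illustration) and edge padding by repeating the first and last values so the trend keeps the input length; `series_decomp` mirrors the block's name in the table but is not the authors' code:

```python
import math

def series_decomp(x, kernel=25):
    """Split a series into (seasonal, trend) with a centered moving average."""
    half = (kernel - 1) // 2
    # Repeat the edge values so the trend has the same length as the input.
    padded = [x[0]] * half + list(x) + [x[-1]] * half
    trend = [sum(padded[i:i + kernel]) / kernel for i in range(len(x))]
    seasonal = [v - t for v, t in zip(x, trend)]
    return seasonal, trend

# Two synthetic weeks: slow upward drift plus a daily cycle.
x = [0.01 * t + math.sin(2 * math.pi * t / 24) for t in range(24 * 14)]
seasonal, trend = series_decomp(x)

# The two parts always sum back to the input.
rebuilt = [s + t for s, t in zip(seasonal, trend)]
print(max(abs(a - b) for a, b in zip(rebuilt, x)) < 1e-9)
```

In Autoformer this split is applied repeatedly inside the layers, so each layer passes on a refined seasonal part while the decoder keeps adding the extracted trend to a running total.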

Stage 6 — Encoder/decoder & generation

How the architecture is wired, and whether the forecast emerges all at once or step by step.

	Informer	Autoformer	PatchTST
Architecture	encoder–decoder	encoder–decoder	encoder-only + linear head
Distinctive piece	self-attention distilling: Conv → ELU → stride-2 MaxPool halves length between layers (`L→L/2→L/4`)	encoder keeps seasonal memory; decoder runs inner + cross Auto-Correlation, accumulating trend	vanilla encoder, then flatten
Decoder input	start-token + zero placeholders for each future step, each with its timestamp	seasonal seed (zeros) + trend seed (input mean)	none
Generation	one-shot generative — 168 values in one pass	one-shot — `seasonal + trend` over the horizon	one-shot direct — linear head emits all `T`
Autoregressive	no (avoids error accumulation)	no	no

Trade-off. All three avoid slow, error-accumulating step-by-step decoding — but by different routes. Informer keeps the heaviest machinery (encoder-decoder, distilling for long inputs, a generative decoder fed timestamped zero-placeholders), powerful but knob-heavy. Autoformer reframes the decoder as maintaining two pathways — a sharpening seasonal estimate and an accumulating trend — then summing them. PatchTST drops the decoder entirely: an encoder plus one linear layer to all T values; maximal simplicity, the only catch being that the head grows with T.
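Two of the quantities above are easy to compute: how Informer's distilling shortens the sequence, and how PatchTST's flatten head grows with the horizon. The sketch assumes each stride-2 pooling step rounds the length up, and uses `d_model = 128` purely as an illustrative width; both helper names are hypothetical.

```python
def distilled_lengths(length, n_distil):
    """Sequence length after each stride-2 pooling step (assumed to round up)."""
    lengths = [length]
    for _ in range(n_distil):
        length = (length + 1) // 2
        lengths.append(length)
    return lengths

def flatten_head_params(n_patches, d_model, horizon):
    """Weights plus biases of a Linear(n_patches * d_model -> horizon) head."""
    return n_patches * d_model * horizon + horizon

print(distilled_lengths(672, 2))  # [672, 336, 168]

for horizon in (168, 336, 720):
    print(horizon, flatten_head_params(84, 128, horizon))
```

The head is linear in `T`, so it stays a single matrix multiply at any horizon, but at long horizons it can hold a large share of the model's parameters.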

Stage 7 — Output

	Informer	Autoformer	PatchTST
Output	`168 × 321` point forecast	`168 × 321` (seasonal + trend)	`168 × 321` (per-channel, concatenated)
Loss	MSE	MSE	MSE (per-channel)
Uncertainty intervals	no	no	no
Uni↔multivariate switch	resize final FC	re-embed channels	per-channel by construction
Self-supervised pretraining	no	no	yes — mask 40% of patches, reconstruct; transfers across datasets

Trade-off. All three emit point forecasts trained with MSE; none gives native uncertainty intervals (unlike DeepAR). The standout is PatchTST’s masked pretraining: because a patch is a meaningful unit, reconstructing a whole masked patch requires modeling the surrounding pattern, whereas a single masked point could simply be interpolated from its neighbors. The pretrained model transfers across datasets, a step toward time-series foundation models.
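The masking objective can be sketched without a model. The example assumes non-overlapping patches (so a hidden patch shares no points with a visible one), zero-filling of masked patches, and a loss computed on masked patches only; the helper names and the fixed seed are illustrative.

```python
import random

def non_overlapping_patches(series, patch_len):
    """Cut a series into back-to-back patches, dropping any ragged tail."""
    usable = len(series) - len(series) % patch_len
    return [series[i:i + patch_len] for i in range(0, usable, patch_len)]

def mask_patches(patches, ratio=0.4, seed=0):
    """Zero out a random subset of whole patches; return them and their indices."""
    rng = random.Random(seed)
    n_mask = round(ratio * len(patches))
    masked = sorted(rng.sample(range(len(patches)), n_mask))
    hidden = set(masked)
    corrupted = [[0.0] * len(p) if i in hidden else list(p)
                 for i, p in enumerate(patches)]
    return corrupted, masked

def masked_mse(predicted, target, masked):
    """Mean squared error over the masked patches only."""
    errors = [(a - b) ** 2 for i in masked for a, b in zip(predicted[i], target[i])]
    return sum(errors) / len(errors)

series = [float(t % 24) for t in range(672)]
patches = non_overlapping_patches(series, patch_len=16)
corrupted, masked = mask_patches(patches)

print(len(patches), "patches,", len(masked), "masked")
print(masked_mse(corrupted, patches, masked))  # error of a "predict zeros" baseline
print(masked_mse(patches, patches, masked))    # 0.0 for a perfect reconstruction
```

During pretraining the encoder would receive `corrupted` and be trained to drive the masked-patch error toward zero; the forecasting head is attached only afterwards.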

Summary: the per-dataset trio

Stage	Informer (2021)	Autoformer (2021)	PatchTST (2023)
1. Framing	channel-mixing	channel-mixing	channel-independent
2. Normalization	global	decomposition removes trend	global + instance norm
3. Token	time step	time step	patch (~16 steps)
4. Mechanism	ProbSparse, `O(L logL)`	Auto-Correlation (FFT), `O(L logL)`	vanilla attention, few tokens
5. Decomposition	none	progressive, every layer	none (instance norm only)
6. Generation	enc-dec + distilling, one-shot	enc-dec, seasonal+trend, one-shot	encoder + linear head, one-shot
7. Output	point	point	point + pretraining

TimesFM (2024): the foundation-model turn

The trio disagree on tokens, attention, and decomposition, but silently agree on one premise: train on the dataset you want to forecast. TimesFM rejects it.

Borrowing the NLP recipe, TimesFM pretrains one ~200M-parameter decoder-only Transformer on ~100 billion timepoints across domains (Google Trends, Wikipedia, traffic, weather, synthetic), then forecasts a never-seen series zero-shot — no per-dataset training. It is GPT for time series. Crucially, it inherits PatchTST’s patching and bolts it onto GPT’s decoder-only pretraining. Mapped onto the seven stages:

Framing. Univariate, like PatchTST. But “one example” no longer comes from your dataset: during pretraining it is a window drawn from billions of points; at inference it is your series, handed over cold. Random-length masking teaches the model to forecast from any context length.
Normalization. Per-context rescaling (megawatts vs. web clicks), reversed on output — PatchTST’s instance norm in spirit.
Tokenization. Patches, but each non-overlapping 32-point patch is tokenized by a small residual MLP, not a single linear projection.
Mechanism. Causal self-attention (GPT-style), ~20 layers — each token sees only the past. The sharp contrast with PatchTST’s bidirectional encoder: a decoder that generates the future must be causal.
Decomposition. None explicit — it leans on a vast, diverse corpus (including synthetic trend/seasonal series) to learn structure implicitly.
Generation. The only autoregressive model of the four: emit a patch, append, roll forward. It limits error accumulation via the output-patch trick — each step emits a 128-long patch (4× the 32-long input patch), so a 168-hour week needs only 2 roll-forward steps.
Output. Still a point forecast with MSE — but the headline is provenance: pretrained once, applied zero-shot, the axis none of the others move on.
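The roll-forward loop in the Generation stage can be sketched with a stand-in model. `dummy_model` below is a hypothetical placeholder that just repeats the last observed value; it is not the TimesFM API, and only the patch lengths (32 in, 128 out) are taken from the text above.

```python
import math

INPUT_PATCH, OUTPUT_PATCH = 32, 128

def dummy_model(history):
    """Placeholder for the pretrained decoder: emits one 128-step output patch."""
    return [history[-1]] * OUTPUT_PATCH

def roll_forward(context, horizon, model=dummy_model):
    """Autoregressive decoding: emit a patch, append it, repeat until done."""
    history = list(context)
    forecast = []
    steps = 0
    while len(forecast) < horizon:
        patch = model(history)
        forecast.extend(patch)
        history.extend(patch)  # the model's own output becomes new context
        steps += 1
    return forecast[:horizon], steps

context = [float(t % 24) for t in range(16 * INPUT_PATCH)]  # 512 past hours
forecast, steps = roll_forward(context, horizon=168)

print(len(forecast), "values in", steps, "decoding steps")
print("steps if output patches were 32 long:", math.ceil(168 / INPUT_PATCH))
```

With 128-step output patches a one-week horizon needs 2 passes; emitting only 32 steps at a time would need 6, with correspondingly more opportunities for errors to feed back into the context.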

TimesFM vs. the trio:

	Informer / Autoformer / PatchTST	TimesFM
Training	per dataset	pretrain once, zero-shot
Channels	mixing / mixing / independent	univariate
Token	step / step / patch	patch (residual-MLP)
Attention	sparse / Auto-Correlation / full — encoder(-decoder)	causal, decoder-only
Decomposition	no / yes / no	no (implicit)
Generation	one-shot	autoregressive, long output patches
Output	point	point
Cost on a new dataset	collect + train + tune	just run it

PatchTST’s lesson: how you represent the input matters more than attention cleverness. TimesFM accepts it and adds a second lesson, borrowed from NLP: how much you pretrain matters more than per-dataset fitting. It is the patching idea, scaled into a foundation model.

Choosing a model

PatchTST — the default for standard long-term forecasting: highest reported accuracy (~21% MSE improvement over prior Transformers), genuine long lookbacks, fast training, clean encoder-only design. Also the pick for self-supervised transfer or varying channel counts.
Autoformer — for clearly periodic data (energy, traffic, weather) where interpretability matters; it reports which periods (24h, 168h) drive the forecast and is robust at very long horizons. Weaker on irregular or regime-shifting data.
Informer — for extremely long inputs that strain memory (its distilling mechanism was built for this) and where calendar features are richly meaningful. Also the reference point for the lineage.
TimesFM — for a forecast now, with no training: cold starts, many disparate series, or a strong baseline. Univariate and point-only, and a model trained on your data can still edge it out — but it lands surprisingly close zero-shot.
None of them — for short horizons or tiny data (ARIMA, exponential smoothing compete) or when you need calibrated uncertainty intervals (DeepAR and other probabilistic models).

The arc: cheaper attention (Informer) → decomposed, periodicity-aware attention (Autoformer) → smarter tokens (PatchTST) → no per-dataset training (TimesFM). The meta-lesson compounds: first representation beat architecture, then scale and pretraining beat per-dataset fitting — the two forces that reshaped NLP, now arriving for time series.