Informer
Beyond Efficient Transformer for Long Sequence Time-Series Forecasting
Zhou, Zhang, Peng, Zhang, Li, Xiong & Zhang — AAAI 2021 (arXiv:2012.07436). Best Paper Award winner.
Jargon is defined the first time it appears, with concrete examples (mostly forecasting electricity and weather).
TL;DR (one paragraph)¶
Suppose you want to predict the next 20 days of an electrical transformer’s oil temperature, one reading per hour — that’s 480 numbers into the future, not just the next one. Classic models (and even the original Transformer) choke on this: they’re either too slow, use too much memory, or accumulate error as they crawl forward one step at a time. Informer is a redesigned Transformer built specifically for long-sequence time-series forecasting (LSTF). It keeps the Transformer’s superpower — directly relating any past moment to any future moment — but adds three tricks: (1) ProbSparse attention, which only does the expensive math for the handful of “important” past moments instead of all of them, cutting cost from quadratic to near-linear; (2) self-attention distilling, which progressively shrinks the sequence between layers so a very long history fits in memory; and (3) a generative decoder that spits out the entire future forecast in one shot instead of one step at a time. The result: longer forecasts, lower error, and dramatically faster inference.
The Problem¶
What is time-series forecasting?¶
A time series is a sequence of measurements ordered in time: hourly electricity demand, daily temperature, per-minute stock prices. Forecasting means: given the recent history (the lookback window — e.g. the last 5 days of hourly readings), predict future values over some horizon (e.g. the next 7 days).
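To make the lookback/horizon split concrete, here is a minimal sketch that slices a series into (history, future) pairs. The helper name `make_windows` and the toy readings are illustrative assumptions, not something taken from the paper.

```python
def make_windows(series, lookback, horizon):
    """Slice a series into (history, future) pairs.

    Hypothetical helper for illustration: `lookback` is the number of past
    steps fed to the model, `horizon` the number of future steps to predict.
    """
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        history = series[start:start + lookback]
        future = series[start + lookback:start + lookback + horizon]
        pairs.append((history, future))
    return pairs


# Toy series of 10 readings, 4-step lookback, 2-step horizon.
readings = [20, 21, 23, 22, 24, 25, 27, 26, 28, 29]
pairs = make_windows(readings, lookback=4, horizon=2)
first_history, first_future = pairs[0]
```

In a real LSTF setup the same slicing would use hundreds of steps on each side, for example a 5-day lookback (120 hourly points) and a 7-day horizon (168 points).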
What makes “long” sequence forecasting hard?¶
Most older methods were tuned for short horizons — predicting roughly 48 steps ahead or fewer. The paper’s motivating experiment runs an LSTM (a popular recurrent neural network) to predict an electrical transformer’s hourly temperature. It does fine predicting half a day ahead (12 points), but as the horizon grows past 48 points:
Prediction error (MSE) explodes — the model loses the plot.
Inference speed collapses — because it generates outputs one step at a time.
MSE (Mean Squared Error) and MAE (Mean Absolute Error) are the two error scores used throughout: average squared / absolute gap between prediction and truth. Lower is better.
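A small worked example may help here. The sketch below computes both scores on four made-up temperature readings (the numbers are assumptions chosen for easy arithmetic).

```python
def mse(pred, true):
    """Mean squared error: average of squared gaps."""
    return sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true)


def mae(pred, true):
    """Mean absolute error: average of absolute gaps."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


truth = [30.0, 32.0, 31.0, 29.0]
forecast = [31.0, 32.0, 28.0, 29.0]
# Gaps are 1, 0, -3, 0: the single large miss weighs more in MSE than in MAE.
print(mse(forecast, truth))  # 2.5
print(mae(forecast, truth))  # 1.0
```

Because MSE squares each gap, one 3-degree miss contributes 9 of the 10 squared-error units here, which is why MSE punishes occasional large errors more heavily than MAE does.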
So LSTF (Long Sequence Time-series Forecasting) demands what the authors call high prediction capacity: the ability to (a) capture long-range dependencies — relationships between a future point and a far-away past point — and (b) do it efficiently on long inputs and outputs.
Why not just use a vanilla Transformer?¶
The Transformer (Vaswani et al., 2017) is great at long-range dependencies because of self-attention (explained below). But applied naively to LSTF it has three fatal flaws:
Quadratic cost. Self-attention compares every position to every other position. For a length-L sequence that’s L × L comparisons → O(L²) time and memory per layer. Double the sequence, quadruple the cost.
Memory blows up when you stack layers. With J stacked layers it becomes O(J · L²), so you can’t feed in a long history.
Slow step-by-step decoding. The standard decoder predicts output #2 only after producing output #1 (called dynamic / autoregressive decoding), so a long forecast is as slow as an RNN, and small early mistakes accumulate into big late ones (error accumulation).
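The first two flaws are easy to check with back-of-envelope arithmetic. The sketch below simply counts query-key scores; it ignores attention heads, batch size, and constant factors, and the function names are illustrative assumptions.

```python
def attention_scores_per_layer(L):
    """Number of query-key scores one vanilla self-attention layer computes."""
    return L * L


def attention_scores_stacked(L, J):
    """Scores held across J stacked layers, i.e. the O(J * L^2) term."""
    return J * L * L


# Each doubling of the sequence length quadruples the per-layer count.
for L in (96, 192, 384, 768):
    print(L, attention_scores_per_layer(L))
```

Going from 96 to 768 input steps is an 8× longer history but a 64× larger score matrix per layer, and stacking layers multiplies that again.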
Prior “efficient Transformers” (Sparse Transformer, LogSparse, Longformer, Reformer, Linformer) mostly attacked only flaw #1. Informer tackles all three.
Background a newcomer needs¶
Attention, in one analogy¶
Imagine you’re forecasting tomorrow’s electricity load. Tomorrow is a Monday. Intuitively, last Monday and the Monday before matter far more than last Thursday. Attention is the mechanism that lets a model learn to “look back at” and weight the relevant past moments more heavily.
Mechanically, each position produces three vectors:
Query (Q) — “what am I looking for?”
Key (K) — “what do I offer / what am I about?”
Value (V) — “the actual content I’ll contribute.”
For a given query, you score it against every key (a dot product), turn the scores into weights with softmax (a function that turns numbers into a probability-like distribution summing to 1), and take the weighted average of the values. Self-attention is just attention where the queries, keys, and values all come from the same sequence — every moment attending to every other moment.
score(query_i, key_j) = dot(query_i, key_j) / sqrt(d)
weights_i = softmax over all j of score(query_i, key_j)
output_i = sum_j weights_i[j] * value_j
This is wonderfully expressive (any position can directly reach any other in a single hop, a “path length” of O(1), versus an RNN that must pass information step by step). But computing score(i, j) for all pairs (i, j) is the O(L²) problem.
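The three lines of pseudocode above translate almost directly into NumPy. This is a minimal single-head sketch; as a simplifying assumption it uses the raw input as queries, keys, and values, whereas a real Transformer first applies learned projections.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(Q, K, V):
    """Canonical scaled dot-product attention.

    Q, K: arrays of shape (L, d); V: (L, d_v). Returns (output, weights).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)        # (L, L): every query vs every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights


rng = np.random.default_rng(0)
L, d = 6, 4
X = rng.normal(size=(L, d))
# Assumption: X stands in for Q, K and V (no learned projections).
out, weights = self_attention(X, X, X)
```

The `scores` array has shape (L, L), which is exactly the quadratic term: it is the object ProbSparse attention avoids materialising in full.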
Encoder / decoder¶
The encoder reads the input history and turns it into a rich internal representation (a set of vectors capturing patterns and dependencies).
The decoder uses that representation to produce the forecast.
Stationarity (a quick aside)¶
A series is stationary if its statistical behavior (mean, variance, seasonality structure) doesn’t drift over time. Classical models like ARIMA assume near-stationarity and struggle when long-horizon data trends or shifts. Deep models like Informer don’t assume stationarity, which helps on messy real-world series — but they need lots of data and don’t come with the clean theoretical guarantees classical models offer.
Key Idea / Innovation¶
Informer’s central observation: self-attention is wasteful because it’s sparse in disguise. When the authors inspected a trained vanilla Transformer’s attention scores, they found a long-tail distribution — a few query–key pairs carry almost all the attention weight, and the vast majority contribute almost nothing (essentially a flat, uniform “I’m not really paying attention to anything in particular” pattern).
Attention weight per key (sorted):
█
█
██
███
████████
█████████████████████████████ <- long tail of near-useless pairs
So why spend O(L²) computing all pairs when only a handful matter? That insight drives the three innovations:
ProbSparse self-attention — find and compute only the “important” queries.
Self-attention distilling — shrink the sequence between encoder layers so long inputs fit in memory.
Generative-style decoder — produce the whole forecast in a single forward pass.
How It Works, Step by Step (the execution flow)¶
Step 0 — Input representation (turning raw readings into vectors)¶
A raw time series is just numbers, but a Transformer has no built-in sense of order or calendar context. Informer builds each input vector from three added-together parts:
Scalar projection — the actual measured value(s), projected up to the model’s dimension by a 1-D convolution (kernel width 3).
Local positional embedding — fixed sine/cosine patterns encoding position within the window (so the model knows what came before what).
Global timestamp embeddings — learnable vectors for calendar context: minute, hour, day-of-week, month, holiday flags, etc.
input_vector[i] = α · value_proj[i] + PositionEmbed[i] + Σ TimestampEmbed[i]
                  (the reading)       (order in window)  (Mon? July? holiday?)
Here α is a factor that balances the magnitude of the projected reading against the embeddings. The timestamp embeddings are what let the model know “this future point is a Monday 9am in July” even far into the horizon, which is crucial for the one-shot decoder later.
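The sum of three parts can be sketched in a few lines of NumPy. Several things here are assumptions made for brevity: a plain linear map stands in for the kernel-3 convolution, the sinusoid base of 10000 follows the original Transformer, only hour and weekday tables are included, and the "learnable" tables are just random arrays.

```python
import numpy as np


def sinusoidal_positions(length, d_model):
    """Fixed sine/cosine position embedding (d_model must be even)."""
    pos = np.arange(length)[:, None]                 # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]            # even dimensions
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe


rng = np.random.default_rng(0)
length, d_model = 24, 8
values = rng.normal(size=(length, 1))                # one reading per hour

# Assumption: a linear map replaces the kernel-3 1-D convolution.
W_value = rng.normal(size=(1, d_model))
value_proj = values @ W_value                        # (24, 8)

# Lookup tables that would be learned; random here for illustration.
hour_table = rng.normal(size=(24, d_model))
weekday_table = rng.normal(size=(7, d_model))
hours = np.arange(length) % 24
weekdays = np.zeros(length, dtype=int)               # pretend it is all Monday

alpha = 1.0                                          # balancing factor
inputs = (alpha * value_proj
          + sinusoidal_positions(length, d_model)
          + hour_table[hours] + weekday_table[weekdays])
```

Note that `hours` and `weekdays` depend only on the calendar, so they can be computed for future slots too, even though `values` is unknown there.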
Step 1 — ProbSparse self-attention (the efficiency heart)¶
Goal: identify the few queries that actually attend selectively, and only compute full attention for those.
A query is “important” if its attention distribution is far from uniform — i.e. it strongly prefers some keys over others. A query that attends roughly equally to everything is “lazy” and can be replaced by a cheap average of the values.
The authors measure this “non-uniformity” with a sparsity measurement M(q, K) derived from KL-divergence (a measure of how different two distributions are). In practice they use a cheap, numerically stable approximation:
M(q_i, K) ≈ max_j ( q_i · k_j / √d ) − mean_j ( q_i · k_j / √d )
            (its single strongest match)  (its average match)
A big gap = a peaky, selective query = important.
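To see how the max-minus-mean measurement separates a selective query from a lazy one, here is a toy computation. The keys and queries are hand-picked assumptions (four orthogonal keys) so the arithmetic can be followed by eye.

```python
import numpy as np


def sparsity_measure(q, K):
    """Max-minus-mean scaled dot-product score of one query against all keys."""
    scores = K @ q / np.sqrt(q.shape[0])
    return scores.max() - scores.mean()


K = np.eye(4)                                  # four orthogonal toy keys
selective = np.array([4.0, 0.0, 0.0, 0.0])     # matches key 0 only
lazy = np.array([1.0, 1.0, 1.0, 1.0])          # matches every key equally

print(sparsity_measure(selective, K))  # 1.5
print(sparsity_measure(lazy, K))       # 0.0
```

The selective query scores [2, 0, 0, 0] after scaling by √4, so max minus mean is 2 − 0.5 = 1.5. The lazy query scores 0.5 against every key, so the gap is zero and its attention output would be the plain average of the values anyway.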
The clever shortcut: computing M for every query would itself be O(L²). Instead, Informer randomly samples only ~L·ln L query–key pairs, computes M on that sample, and picks the Top-u queries (with u = c · ln L, where c is a tunable sampling factor, set to 5 in practice).
Only those Top-u “active” queries get full attention computed; all other (“lazy”) query positions are filled with the plain average of the values.
Algorithm (ProbSparse), in plain terms:
1. Randomly sample U = L·ln(L) query-key dot products.
2. Score each query by max(sample) − mean(sample).
3. Keep the Top-u = c·ln(L) highest-scoring queries → Q̄.
4. Compute full softmax attention ONLY for Q̄.
5. Fill every other query's output with mean(V).
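6. Reassemble into the original positions.
Result: time and memory drop from O(L²) to O(L · log L). For a sequence of length 1000 that is L² = 1,000,000 scores versus L·ln L ≈ 6,900, a reduction of about two orders of magnitude in the dominant cost, before constant factors such as c.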
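The six steps above can be sketched for a single head in NumPy. This is a simplified illustration, not the authors' implementation: it assumes no masking, rounds c·ln L up with `ceil`, and draws one shared random subset of keys for all queries rather than sampling separately per query.

```python
import math
import numpy as np


def probsparse_attention(Q, K, V, c=5, seed=0):
    """Sketch of ProbSparse self-attention for a single head (no masking).

    Q, K: (L, d); V: (L, d_v). Returns (output, indices of active queries).
    """
    rng = np.random.default_rng(seed)
    L_q, d = Q.shape
    L_k = K.shape[0]
    n_sample = min(L_k, max(1, math.ceil(c * math.log(L_k))))  # keys sampled
    u = min(L_q, max(1, math.ceil(c * math.log(L_q))))         # queries kept

    # Steps 1-2: score every query on a random subset of keys.
    sampled = rng.choice(L_k, size=n_sample, replace=False)
    sample_scores = Q @ K[sampled].T / math.sqrt(d)     # (L_q, n_sample)
    M = sample_scores.max(axis=1) - sample_scores.mean(axis=1)

    # Step 3: keep the Top-u most selective queries.
    top = np.argsort(M)[-u:]

    # Steps 4-6: full softmax attention for active queries, mean(V) elsewhere.
    out = np.tile(V.mean(axis=0), (L_q, 1))
    scores = Q[top] @ K.T / math.sqrt(d)                # (u, L_k)
    scores -= scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    out[top] = weights @ V
    return out, top


rng = np.random.default_rng(1)
L, d = 96, 16
Q, K, V = (rng.normal(size=(L, d)) for _ in range(3))
out, active = probsparse_attention(Q, K, V)
# With L = 96 and c = 5, u = ceil(5 * ln 96) = 23 queries get full attention.
```

Only a (96, 23) and a (23, 96) score matrix are ever built here instead of the full (96, 96) one; the saving grows with L because both sampled dimensions scale with ln L.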
Step 2 — Encoder with self-attention distilling¶
The encoder ingests a very long history. After each attention block, a distilling operation halves the sequence length, keeping only the most dominant features:
X_{j+1} = MaxPool( ELU( Conv1d( AttentionBlock(X_j) ) ) ) # stride-2 max-pool halves length
Conv1d (1-D convolution, kernel 3) mixes neighboring time steps.
ELU is a nonlinear activation function.
Max-pooling with stride 2 keeps the strongest signal in each pair, halving the length at every layer.
So the sequence pyramids down: L → L/2 → L/4 → .... This brings total encoder memory to about O((2−ε)·L·log L) and lets Informer accept much longer inputs.
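The halving can be illustrated with a small NumPy sketch of one distilling step, leaving out the attention block. Two assumptions keep it short: a single fixed width-3 kernel is applied to every feature channel (a real layer learns a convolution that also mixes channels), and pooling takes the max over non-overlapping pairs of time steps.

```python
import numpy as np


def elu(x):
    """ELU activation: identity for x > 0, exp(x) - 1 otherwise."""
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)


def distill(X, kernel):
    """One distilling step on X of shape (L, d): conv -> ELU -> max-pool."""
    L, d = X.shape
    padded = np.pad(X, ((1, 1), (0, 0)))                 # keep length L
    conv = sum(kernel[k] * padded[k:k + L] for k in range(3))
    act = elu(conv)
    even = act[: L - L % 2]                              # drop odd leftover row
    return even.reshape(-1, 2, d).max(axis=1)            # (L // 2, d)


rng = np.random.default_rng(0)
X = rng.normal(size=(96, 8))
kernel = np.array([0.25, 0.5, 0.25])                     # illustrative weights
lengths = [X.shape[0]]
for _ in range(2):
    X = distill(X, kernel)
    lengths.append(X.shape[0])
print(lengths)  # [96, 48, 24]
```

Two distilling steps take a 96-step input down to 24 steps while the feature width stays at 8, which is the L → L/2 → L/4 pyramid described above.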
For robustness, Informer runs replica stacks in parallel on shortened copies of the input (the main stack sees the full length L, a replica sees L/2 or L/4). Each replica uses fewer distilling layers, so every stack ends at the same output length, and the outputs are concatenated into the final encoder representation. (The paper finds that combining the L and L/4 stacks is the most robust choice.)
Encoder
Full input    L   ──► [attn]──distill──►[attn]──distill──►[attn] ─┐
Quarter input L/4 ──► [attn] ─────────────────────────────────────┤──► concat ──► encoder feature map
Step 3 — Generative-style decoder (one-shot forecasting)¶
This is what makes long forecasts fast and accurate. The decoder input is a concatenation of two pieces:
X_decoder = Concat( start_token ,      placeholders )
                    └ a real slice of  └ zeros for every
                      recent history     future step we want,
                      (gives context)    but WITH their real timestamps
Concrete example from the paper: to predict 168 points (7 days) of temperature, you feed the decoder the known 5 days right before the target week as the start token, plus zero placeholders for all 168 future steps. Each placeholder carries its true timestamp (so the model knows “this slot is Tuesday 3pm”).
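The shapes in that example can be written out directly. In this sketch random numbers stand in for the real readings, and only an hour-of-day stamp is built; both are assumptions for illustration.

```python
import numpy as np

HOURS_PER_DAY = 24
start_len = 5 * HOURS_PER_DAY      # 120 known hours used as the start token
pred_len = 7 * HOURS_PER_DAY       # 168 future hours to forecast

rng = np.random.default_rng(0)
# Stand-in for the last 5 days of real temperature readings.
recent_history = rng.normal(size=(start_len, 1))

placeholders = np.zeros((pred_len, 1))
decoder_values = np.concatenate([recent_history, placeholders], axis=0)

# Hour-of-day stamps continue through the future slots: the calendar is
# known in advance even though the readings are not.
hour_of_day = np.arange(start_len + pred_len) % HOURS_PER_DAY
```

The decoder therefore sees 288 positions: 120 with real values and 168 with zeros, all of them carrying valid timestamps.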
A masked ProbSparse self-attention prevents any position from peeking at later positions (the mask sets forbidden dot products to −∞), so the decoder can’t cheat by looking at the future.
A final fully-connected layer maps to the output. Its width is 1 for univariate forecasting (one target variable) or >1 for multivariate (several variables at once), so switching between them is just changing this last layer.
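The −∞ masking mentioned above can be sketched on a plain (dense) score matrix. This shows only the masking idea; it does not include the ProbSparse query selection, and the function name is an illustrative assumption.

```python
import numpy as np


def causal_softmax(scores):
    """Softmax over keys with positions j > i masked out for query i."""
    L = scores.shape[0]
    future = np.triu(np.ones((L, L), dtype=bool), k=1)   # True above diagonal
    masked = np.where(future, -np.inf, scores)
    masked = masked - masked.max(axis=1, keepdims=True)  # diagonal stays finite
    weights = np.exp(masked)                             # exp(-inf) == 0
    return weights / weights.sum(axis=1, keepdims=True)


# Equal raw scores make the effect of the mask easy to read.
weights = causal_softmax(np.zeros((4, 4)))
```

With equal scores, position 0 can only attend to itself, position 1 splits its weight over positions 0 and 1, and so on; every entry above the diagonal is exactly zero.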
The payoff: unlike a standard Transformer/RNN decoder that emits one value, feeds it back, emits the next, and so on (slow, and errors snowball), Informer produces all 168 future values in a single forward pass. The ablation study shows this also makes the forecast robust to offsets — you can ask it to predict a window starting at an arbitrary future point and it holds up, because each output depends on its own timestamp rather than on a chain of prior predictions.
Step 4 — Training¶
Train end-to-end with plain MSE loss between predictions and the true future values, backpropagated through the whole model. Data is normalized to zero mean / unit standard deviation. (Setup details: Adam optimizer, learning rate 1e-4 with decay, ~8 epochs with early stopping, model dimension 512, 16 attention heads in the encoder.)
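The zero-mean / unit-variance normalization is simple to sketch. One assumption is labeled here: the statistics are fitted on the training split only and reused elsewhere, which is common practice but is not spelled out in the text above. The function names and the synthetic data are illustrative.

```python
import numpy as np


def fit_scaler(train):
    """Per-feature mean and standard deviation from the training split."""
    return train.mean(axis=0), train.std(axis=0)


def transform(x, mean, std):
    """Scale readings to zero mean and unit standard deviation."""
    return (x - mean) / std


def inverse_transform(z, mean, std):
    """Map normalized predictions back to the original units."""
    return z * std + mean


rng = np.random.default_rng(0)
# Synthetic stand-in for 1000 temperature readings around 30 degrees.
train = rng.normal(loc=30.0, scale=5.0, size=(1000, 1))
mean, std = fit_scaler(train)
z = transform(train, mean, std)
```

The MSE loss is then computed on the normalized values; `inverse_transform` is only needed when reporting forecasts in physical units.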
A Worked Example¶
Task: Forecast 7 days (168 hourly points) of an electrical transformer’s oil temperature.
Gather input. Take a long recent history of hourly readings (say the last several hundred hours), each with its value(s) + timestamp.
Embed. Each hour becomes a vector = value projection + position-in-window + (hour, weekday, month, holiday) embeddings.
Encode. Feed the history through the encoder. ProbSparse attention spends effort only on the genuinely informative hours (e.g. the same hour on previous days, recent spikes); distilling halves the sequence layer by layer so even a long history fits in memory. Two parallel stacks (full + quarter length) are concatenated into one feature map.
Set up the decoder input. Concatenate the known last 5 days (start token) with 168 zero placeholders, each stamped with its real future timestamp (Mon 0:00, Mon 1:00, ... through Sun 23:00).
Decode in one shot. Masked ProbSparse self-attention + cross-attention to the encoder feature map → the final FC layer outputs all 168 temperature predictions simultaneously.
Done. No step-by-step loop, no error snowball. Want the 7 days starting a bit later? Just change the placeholder timestamps.
Input and Output¶
Input: A (possibly long) lookback window of a time series: L_x time steps, each a d_x-dimensional reading (univariate d_x=1 or multivariate d_x>1), plus timestamp features (hour/day/week/month/holiday). Plus a short start-token slice of recent history for the decoder.
Output: A long horizon L_y of future values (the paper goes up to 480–960 steps, i.e. 20–40 days of hourly data), univariate or multivariate, produced in a single forward pass.
Strengths¶
Scales to long horizons. On four datasets (two ETT electricity-transformer sets, ECL electricity load, and a Weather set), Informer’s error rises slowly and smoothly as the horizon grows, while baselines (LSTM, DeepAR, ARIMA, Prophet, Reformer, LogTrans, LSTnet) degrade fast. Reported average MSE reductions vs. the LSTM baseline in the univariate setting: ~27% at horizon 168, ~52% at 336, ~60% at 720.
Efficient. Per-layer attention is O(L·log L) time and memory (vs. O(L²) for vanilla Transformer); the generative decoder makes inference take 1 step instead of L steps.
Fits long inputs in memory thanks to distilling: competitors hit out-of-memory on the longest inputs where Informer keeps running.
Robust one-shot decoding. No error accumulation; forecasts hold up even at arbitrary prediction offsets (shown in ablations).
Easy univariate ↔ multivariate switch — just resize the final fully-connected layer.
Ablations confirm each piece pulls its weight: ProbSparse beats canonical attention in win-counts (32 vs 12 univariate), distilling enables longer inputs, and the generative decoder beats dynamic decoding.
Limitations¶
Approximate, not exact. ProbSparse relies on the empirical long-tail/sparsity assumption and a random sampling step to find important queries. That is usually fine, but it’s a heuristic, and the sampling factor c is a hyperparameter to tune.
Multivariate gains are smaller than univariate ones; the authors note an “anisotropy” across feature dimensions they leave to future work (some variables are just harder to predict than others, and treating them uniformly leaves performance on the table).
Needs lots of data and a GPU. Experiments used a 32GB V100. Like all deep models, it lacks the interpretability and theoretical guarantees of classical methods (ARIMA, state-space models).
Still a fixed-window model. It forecasts a chosen horizon from a chosen lookback; it isn’t an online/streaming or probabilistic-distribution forecaster out of the box (it predicts point values via MSE, not full uncertainty intervals like DeepAR).
Several moving parts. Distilling stacks, sampling factor, start-token length, encoder/decoder depths — more knobs than a simpler model.
Glossary¶
Time series: measurements ordered in time (e.g. hourly electricity load).
Lookback window: the stretch of recent history fed in as input.
Horizon: how many future steps you predict.
LSTF: Long Sequence Time-series Forecasting — predicting long horizons (this paper: up to hundreds of steps).
Univariate / multivariate: predicting one target variable vs. several at once.
Attention: mechanism that lets each position weight and pull information from other positions, via Query/Key/Value vectors.
Self-attention: attention where queries, keys, values all come from the same sequence (every moment attends to every moment).
Self-attention complexity: the cost of computing all pairwise comparisons — O(L²) for vanilla self-attention, the bottleneck Informer reduces to O(L·log L).
Query / Key / Value (Q/K/V): “what I want” / “what I offer” / “the content I contribute.”
Softmax: turns raw scores into weights that sum to 1.
Encoder / decoder: the part that reads/compresses the input vs. the part that generates the forecast.
ProbSparse attention: Informer’s attention that computes full scores only for the Top-u most “selective” queries, found via a sampled sparsity measure.
Sparsity / long-tail distribution: the finding that only a few attention pairs matter; most are near-uniform and negligible.
Self-attention distilling: conv + max-pool step that halves the sequence length between encoder layers to save memory.
Generative (non-autoregressive) decoder: produces the entire forecast in one forward pass, rather than one step at a time.
Dynamic / autoregressive decoding: the slow step-by-step alternative where each output is fed back to produce the next (prone to error accumulation).
Start token: a slice of real recent history prepended to the decoder input to give it context.
Error accumulation: the snowballing of small early mistakes during step-by-step decoding.
Stationarity: whether a series’ statistical properties stay constant over time.
MSE / MAE: Mean Squared Error / Mean Absolute Error — lower is better.
Sampling factor (c): hyperparameter controlling how many queries ProbSparse keeps (u = c·ln L); set to 5 in the paper.