
Informer

Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

Zhou, Zhang, Peng, Zhang, Li, Xiong & Zhang — AAAI 2021 (arXiv:2012.07436). Best Paper Award winner.

Jargon is defined the first time it appears, with concrete examples (mostly forecasting electricity and weather).

TL;DR (one paragraph)

Suppose you want to predict the next 20 days of an electrical transformer’s oil temperature, one reading per hour — that’s 480 numbers into the future, not just the next one. Classic models (and even the original Transformer) choke on this: they’re either too slow, use too much memory, or accumulate error as they crawl forward one step at a time. Informer is a redesigned Transformer built specifically for long-sequence time-series forecasting (LSTF). It keeps the Transformer’s superpower — directly relating any past moment to any future moment — but adds three tricks: (1) ProbSparse attention, which only does the expensive math for the handful of “important” past moments instead of all of them, cutting cost from quadratic to near-linear; (2) self-attention distilling, which progressively shrinks the sequence between layers so a very long history fits in memory; and (3) a generative decoder that spits out the entire future forecast in one shot instead of one step at a time. The result: longer forecasts, lower error, and dramatically faster inference.


The Problem

What is time-series forecasting?

A time series is a sequence of measurements ordered in time: hourly electricity demand, daily temperature, per-minute stock prices. Forecasting means: given the recent history (the lookback window — e.g. the last 5 days of hourly readings), predict future values over some horizon (e.g. the next 7 days).
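As a minimal sketch of this framing, a series can be cut into (lookback, horizon) pairs as below. The helper name `make_windows` and the toy readings are illustrative assumptions, not something taken from the paper.

```python
def make_windows(series, lookback, horizon):
    """Slice a series into (history, future) pairs with a sliding window."""
    pairs = []
    last_start = len(series) - lookback - horizon
    for start in range(last_start + 1):
        history = series[start : start + lookback]
        future = series[start + lookback : start + lookback + horizon]
        pairs.append((history, future))
    return pairs


# Toy hourly series (made-up values): look back 4 hours, predict the next 2.
toy_series = [20.0, 21.0, 23.0, 22.0, 24.0, 25.0, 27.0, 26.0, 28.0, 29.0]
windows = make_windows(toy_series, lookback=4, horizon=2)
first_history, first_future = windows[0]
print(len(windows), first_history, first_future)
```

In the long-sequence setting the same slicing applies, only with a lookback and horizon in the hundreds of steps rather than a handful.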

What makes “long” sequence forecasting hard?

Most older methods were tuned for short horizons, typically predicting 48 steps ahead or fewer. The paper’s motivating experiment runs an LSTM (a popular recurrent neural network) to predict an electrical transformer’s hourly temperature. It does fine predicting half a day ahead (12 points), but as the horizon grows past 48 points, the error (MSE) climbs steeply and inference speed drops sharply.

MSE (Mean Squared Error) and MAE (Mean Absolute Error) are the two error scores used throughout: average squared / absolute gap between prediction and truth. Lower is better.
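The two scores can be written out in a few lines. This is a plain-Python sketch with made-up numbers; the function names `mse` and `mae` are just labels chosen here.

```python
def mse(predicted, actual):
    """Mean Squared Error: average of squared gaps."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mae(predicted, actual):
    """Mean Absolute Error: average of absolute gaps."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Made-up temperatures: two good predictions and one miss of 4 degrees.
predicted = [30.0, 32.0, 35.0]
actual = [31.0, 32.0, 31.0]
print(mse(predicted, actual), mae(predicted, actual))
```

Because the gaps are squared, the single 4-degree miss dominates MSE far more than it dominates MAE, which is why the two scores are usually reported together.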

So LSTF (Long Sequence Time-series Forecasting) demands what the authors call high prediction capacity: the ability to (a) capture long-range dependencies — relationships between a future point and a far-away past point — and (b) do it efficiently on long inputs and outputs.

Why not just use a vanilla Transformer?

The Transformer (Vaswani et al., 2017) is great at long-range dependencies because of self-attention (explained below). But applied naively to LSTF it has three fatal flaws:

  1. Quadratic cost. Self-attention compares every position to every other position. For a length-L sequence that’s L × L comparisons → O(L²) time and memory per layer. Double the sequence, quadruple the cost.

  2. Memory blows up when you stack layers. With J stacked layers it becomes O(J · L²), so you can’t feed in a long history.

  3. Slow step-by-step decoding. The standard decoder predicts output #2 only after producing output #1 (called dynamic / autoregressive decoding), so a long forecast is as slow as an RNN — and small early mistakes accumulate into big late ones (error accumulation).
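To make the quadratic growth in flaws #1 and #2 concrete, here is a small counting sketch. It only counts query-key comparisons; the helper name `attention_pairs` and the chosen lengths are assumptions for illustration.

```python
def attention_pairs(seq_len, num_layers=1):
    """Number of query-key comparisons for full self-attention."""
    return num_layers * seq_len * seq_len


# Doubling the sequence length quadruples the per-layer cost.
for seq_len in (96, 192, 384, 768):
    print(seq_len, attention_pairs(seq_len), attention_pairs(seq_len, num_layers=3))
```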

Prior “efficient Transformers” (Sparse Transformer, LogSparse, Longformer, Reformer, Linformer) mostly attacked only flaw #1. Informer tackles all three.


Background a newcomer needs

Attention, in one analogy

Imagine you’re forecasting tomorrow’s electricity load. Tomorrow is a Monday. Intuitively, last Monday and the Monday before matter far more than last Thursday. Attention is the mechanism that lets a model learn to “look back at” and weight the relevant past moments more heavily.

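Mechanically, each position produces three vectors: a query (what this position is looking for), a key (what this position offers to others), and a value (the content it passes along when attended to).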

For a given query, you score it against every key (a dot product), turn the scores into weights with softmax (a function that turns numbers into a probability-like distribution summing to 1), and take the weighted average of the values. Self-attention is just attention where the queries, keys, and values all come from the same sequence — every moment attending to every other moment.

score(query_i, key_j) = dot(query_i, key_j) / sqrt(d)
weights_i             = softmax over all j of score(query_i, key_j)
output_i              = sum_j  weights_i[j] * value_j

This is wonderfully expressive (any position can directly reach any other in a single hop — a “path length” of O(1), versus an RNN that must pass information step by step). But computing score(i, j) for all pairs (i, j) is the O(L²) problem.
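The three formulas above translate almost line for line into code. The following is a pure-Python sketch for tiny inputs (a real model would use batched matrix operations); the helper names are chosen here for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1 (shifted for stability)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def attention(queries, keys, values):
    """Full attention: every query is scored against every key."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one column at a time.
        out = [sum(w * v[col] for w, v in zip(weights, values))
               for col in range(len(values[0]))]
        outputs.append(out)
    return outputs


# Self-attention: queries, keys and values all come from the same sequence.
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = attention(sequence, sequence, sequence)
print(outputs)
```

The nested loop (every query against every key) is exactly where the L × L cost comes from.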

Encoder / decoder

Stationarity (a quick aside)

A series is stationary if its statistical behavior (mean, variance, seasonality structure) doesn’t drift over time. Classical models like ARIMA assume near-stationarity and struggle when long-horizon data trends or shifts. Deep models like Informer don’t assume stationarity, which helps on messy real-world series — but they need lots of data and don’t come with the clean theoretical guarantees classical models offer.


Key Idea / Innovation

Informer’s central observation: self-attention is wasteful because it’s sparse in disguise. When the authors inspected a trained vanilla Transformer’s attention scores, they found a long-tail distribution — a few query–key pairs carry almost all the attention weight, and the vast majority contribute almost nothing (essentially a flat, uniform “I’m not really paying attention to anything in particular” pattern).

Attention weight per key (sorted):
█
█
██
███
████████
█████████████████████████████  <- long tail of near-useless pairs

So why spend O(L²) computing all pairs when only a handful matter? That insight drives the three innovations:

  1. ProbSparse self-attention — find and compute only the “important” queries.

  2. Self-attention distilling — shrink the sequence between encoder layers so long inputs fit in memory.

  3. Generative-style decoder — produce the whole forecast in a single forward pass.


How It Works, Step by Step (the execution flow)

Step 0 — Input representation (turning raw readings into vectors)

A raw time series is just numbers, but a Transformer has no built-in sense of order or calendar context. Informer builds each input vector from three added-together parts:

  1. Scalar projection — the actual measured value(s), projected up to the model’s dimension by a 1-D convolution (kernel width 3).

  2. Local positional embedding — fixed sine/cosine patterns encoding position within the window (so the model knows what came before what).

  3. Global timestamp embeddings: learnable vectors for calendar context (minute, hour, day-of-week, month, holiday flags, etc.).

input_vector[i] = α · value_proj[i]  +  PositionEmbed[i]  +  Σ TimestampEmbed[i]
                   (the reading)        (order in window)     (Mon? July? holiday?)

The timestamp embeddings are what let the model know “this future point is a Monday 9am in July” even far into the horizon — crucial for the one-shot decoder later.
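A toy version of this three-part sum might look like the sketch below. Several simplifications are assumptions made here, not the paper's exact recipe: the Conv1d projection is replaced by a per-dimension scale (`value_weights`), the positional encoding uses the standard Transformer base of 10000, only hour and weekday tables are included, and the "learnable" tables are filled with random numbers instead of being trained.

```python
import math
import random

D_MODEL = 8  # toy size; the setup described in this article uses 512
rng = random.Random(0)


def positional_encoding(position, d_model=D_MODEL):
    """Fixed sine/cosine vector for one position in the window."""
    vec = []
    for k in range(d_model):
        angle = position / (10000 ** ((2 * (k // 2)) / d_model))
        vec.append(math.sin(angle) if k % 2 == 0 else math.cos(angle))
    return vec


def make_table(num_entries, d_model=D_MODEL):
    """Stand-in for a learnable embedding table (random here)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
            for _ in range(num_entries)]


hour_table = make_table(24)
weekday_table = make_table(7)
value_weights = [rng.gauss(0.0, 0.02) for _ in range(D_MODEL)]  # replaces Conv1d


def embed(value, position, hour, weekday, alpha=1.0):
    """input_vector = alpha * value projection + position + timestamp parts."""
    pos = positional_encoding(position)
    return [alpha * value * value_weights[k] + pos[k]
            + hour_table[hour][k] + weekday_table[weekday][k]
            for k in range(D_MODEL)]


# A reading of 30.5 at the first position of the window, Monday 09:00.
vector = embed(value=30.5, position=0, hour=9, weekday=0)
# A future placeholder has value 0 but still gets its calendar context.
placeholder = embed(value=0.0, position=1, hour=10, weekday=0)
print(vector, placeholder)
```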

Step 1 — ProbSparse self-attention (the efficiency heart)

Goal: identify the few queries that actually attend selectively, and only compute full attention for those.

Algorithm (ProbSparse), in plain terms:
  1. Randomly sample U = L·ln(L) query-key dot products.
  2. Score each query by max(sample) − mean(sample).
  3. Keep the Top-u = c·ln(L) highest-scoring queries  → Q̄.
  4. Compute full softmax attention ONLY for Q̄.
  5. Fill every other query's output with mean(V).
  6. Reassemble into the original positions.

Result: time and memory drop from O(L²) to O(L · log L) — for a sequence of length 1000, that’s roughly a 100× reduction in the dominant cost.
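The six steps can be sketched in pure Python as follows. This is an illustrative reading of the algorithm, not the authors' implementation: the function name, the `seed` argument, and the choice of sampling `c·ln(L)` keys per query (so that the total number of sampled dot products is on the order of `L·ln(L)`) are assumptions made for the sketch.

```python
import math
import random


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def probsparse_attention(queries, keys, values, c=5, seed=0):
    rng = random.Random(seed)
    n_q, n_k, d = len(queries), len(keys), len(keys[0])
    scale = math.sqrt(d)
    n_sample = max(1, min(n_k, math.ceil(c * math.log(n_k))))  # keys per query
    top_u = max(1, min(n_q, math.ceil(c * math.log(n_q))))     # active queries

    # Steps 1-2: score each query by max - mean over a random sample of keys.
    sparsity = []
    for q in queries:
        sampled = rng.sample(range(n_k), n_sample)
        scores = [_dot(q, keys[j]) / scale for j in sampled]
        sparsity.append(max(scores) - sum(scores) / len(scores))

    # Step 3: keep the top-u queries with the most "peaked" attention.
    active = sorted(range(n_q), key=lambda i: sparsity[i], reverse=True)[:top_u]

    # Step 5: every other query simply outputs mean(V).
    dim_v = len(values[0])
    mean_v = [sum(v[col] for v in values) / n_k for col in range(dim_v)]
    outputs = [list(mean_v) for _ in range(n_q)]

    # Steps 4 and 6: full softmax attention only for the active queries,
    # written back into their original positions.
    for i in active:
        weights = _softmax([_dot(queries[i], k) / scale for k in keys])
        outputs[i] = [sum(w * v[col] for w, v in zip(weights, values))
                      for col in range(dim_v)]
    return outputs, sorted(active)


# Toy self-attention over 32 random positions of dimension 4.
data_rng = random.Random(1)
x = [[data_rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(32)]
outputs, active = probsparse_attention(x, x, x, c=1)
print(active)
```

With 32 positions and `c=1`, only 4 queries get full attention; the other 28 fall back to the average of the values, which is what a near-uniform attention row would have produced anyway.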

Step 2 — Encoder with self-attention distilling

The encoder ingests a very long history. After each attention block, a distilling operation halves the sequence length, keeping only the most dominant features:

X_{j+1} = MaxPool( ELU( Conv1d( AttentionBlock(X_j) ) ) )    # stride-2 max-pool halves length

So the sequence pyramids down: L → L/2 → L/4 → .... This brings total encoder memory to about O((2−ε)·L·log L) and lets Informer accept much longer inputs.
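The halving can be illustrated on a single channel of numbers. In this sketch the Conv1d is omitted, and the pooling window (kernel 3, stride 2, padding 1) is an assumption about a typical configuration rather than something stated above.

```python
import math


def elu(x, alpha=1.0):
    """ELU activation: identity for positives, smooth saturation for negatives."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def max_pool_stride2(seq, kernel=3, stride=2, padding=1):
    """1-D max-pool; with stride 2 the output is half as long as the input."""
    padded = [-math.inf] * padding + list(seq) + [-math.inf] * padding
    out_len = (len(padded) - kernel) // stride + 1
    return [max(padded[i * stride : i * stride + kernel]) for i in range(out_len)]


def distill(seq):
    """One distilling step on a single channel (Conv1d omitted in this sketch)."""
    return max_pool_stride2([elu(x) for x in seq])


# A 96-step toy signal shrinks 96 -> 48 -> 24 -> 12 over three distilling steps.
signal = [float(i % 7) for i in range(96)]
lengths = [len(signal)]
current = signal
for _ in range(3):
    current = distill(current)
    lengths.append(len(current))
print(lengths)
```

Max-pooling keeps the strongest activation in each small neighbourhood, which is one way to read "keeping only the most dominant features".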

For robustness, Informer runs parallel encoder stacks on shortened copies of the input: the main stack sees the full length L, while a replica sees a shorter slice (for example L/4) and uses fewer layers, so that both outputs end up the same length. The outputs are then concatenated into the final encoder representation. (The paper finds combining the L and L/4 stacks is the most robust choice.)

                       Encoder
Full input     L   ──► [attn]──distill──►[attn]──distill──►[attn] ─┐
Quarter input  L/4 ──► [attn] ─────────────────────────────────────┤──► concat ──► encoder feature map

Step 3 — Generative-style decoder (one-shot forecasting)

This is what makes long forecasts fast and accurate. The decoder input is a concatenation of two pieces:

X_decoder = Concat( start_token ,  placeholders )
            └ a real slice of      └ zeros for every
              recent history          future step we want,
              (gives context)         but WITH their real timestamps

Concrete example from the paper: to predict 168 points (7 days) of temperature, you feed the decoder the known 5 days right before the target week as the start token, plus zero placeholders for all 168 future steps — each placeholder carrying its true timestamp (so the model knows “this slot is Tuesday 3pm”).

The payoff: unlike a standard Transformer/RNN decoder that emits one value, feeds it back, emits the next, and so on (slow, and errors snowball), Informer produces all 168 future values in a single forward pass. The ablation study shows this also makes the forecast robust to offsets — you can ask it to predict a window starting at an arbitrary future point and it holds up, because each output depends on its own timestamp rather than on a chain of prior predictions.
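Building that decoder input is mostly list concatenation. The sketch below follows the 5-day start token and 168-step horizon from the example; the helper name `build_decoder_input`, the synthetic readings, and the calendar dates are hypothetical.

```python
from datetime import datetime, timedelta


def build_decoder_input(history, history_stamps, future_stamps, start_len):
    """Concatenate a slice of real history with zero placeholders."""
    start_values = history[-start_len:]        # start token: known readings
    start_stamps = history_stamps[-start_len:]
    placeholders = [0.0] * len(future_stamps)  # unknown future values
    values = start_values + placeholders
    stamps = start_stamps + future_stamps      # placeholders keep real timestamps
    return values, stamps


# Hypothetical setup: 10 days of hourly history starting on a Monday.
first_hour = datetime(2021, 1, 4)
history_stamps = [first_hour + timedelta(hours=h) for h in range(240)]
history = [20.0 + 0.5 * (h % 24) for h in range(240)]
future_stamps = [history_stamps[-1] + timedelta(hours=h) for h in range(1, 169)]

values, stamps = build_decoder_input(
    history, history_stamps, future_stamps, start_len=5 * 24
)
print(len(values), stamps[120])
```

The model then fills in all 168 placeholder slots in one forward pass; shifting `future_stamps` is all it takes to ask for a window that starts later.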

Step 4 — Training

Train end-to-end with plain MSE loss between predictions and the true future values, backpropagated through the whole model. Data is normalized to zero mean / unit standard deviation. (Setup details: Adam optimizer, learning rate 1e-4 with decay, ~8 epochs with early stopping, model dimension 512, 16 attention heads in the encoder.)
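The normalization and loss can be sketched as below. Fitting the mean and standard deviation on the training split only is a common convention assumed here, not a detail stated above; the function names and numbers are illustrative.

```python
import statistics


def fit_scaler(train_values):
    """Compute mean and standard deviation from the training split."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid dividing by zero on a constant series
    return mean, std


def transform(values, mean, std):
    return [(v - mean) / std for v in values]


def inverse_transform(values, mean, std):
    return [v * std + mean for v in values]


def mse_loss(predicted, target):
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Made-up readings: statistics come from `train`, then get reused for `test`.
train = [28.0, 30.0, 32.0, 34.0, 36.0]
test = [38.0, 40.0]
mean, std = fit_scaler(train)
train_scaled = transform(train, mean, std)
test_scaled = transform(test, mean, std)

# Loss of a naive "always predict the training mean" forecast, in scaled units.
baseline_loss = mse_loss([0.0] * len(test_scaled), test_scaled)
print(mean, std, baseline_loss)
```

Predictions made in the scaled space can be mapped back to real units with `inverse_transform` before they are reported.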


A Worked Example

Task: Forecast 7 days (168 hourly points) of an electrical transformer’s oil temperature.

  1. Gather input. Take a long recent history of hourly readings (say the last several hundred hours), each with its value(s) + timestamp.

  2. Embed. Each hour becomes a vector = value projection + position-in-window + (hour, weekday, month, holiday) embeddings.

  3. Encode. Feed the history through the encoder. ProbSparse attention spends effort only on the genuinely informative hours (e.g. the same hour on previous days, recent spikes); distilling halves the sequence layer by layer so even a long history fits in memory. Two parallel stacks (full + quarter length) are concatenated into one feature map.

  4. Set up the decoder input. Concatenate the known last 5 days (start token) with 168 zero placeholders, each stamped with its real future timestamp (Mon 0:00, Mon 1:00, ... through Sun 23:00).

  5. Decode in one shot. Masked ProbSparse self-attention + cross-attention to the encoder feature map → the final FC layer outputs all 168 temperature predictions simultaneously.

  6. Done. No step-by-step loop, no error snowball. Want the 7 days starting a bit later? Just change the placeholder timestamps.


Input and Output


Strengths

Limitations


Glossary