PatchTST
"A Time Series is Worth 64 Words"
Nie, Nguyen, Sinthong & Kalagnanam — ICLR 2023 (arXiv:2211.14730)
TL;DR (one paragraph)
PatchTST is a Transformer for long-term time series forecasting — predicting, say, the next 720 hours of electricity demand from the past 512 hours. Its two big ideas are simple. First, patching: instead of feeding the network one number per time step, it chops the series into short overlapping windows (“patches”) of, e.g., 16 time steps and treats each patch as a single “word.” This is exactly what Vision Transformers do to images (16x16 pixel patches), hence the title’s nod to “an image is worth 16x16 words.” Second, channel-independence: when you have many sensors/series at once (electricity for 321 households, traffic on 862 roads), PatchTST forecasts each one separately through the same shared model rather than blending them all into one fat input vector. These two changes make the model far cheaper to run, let it look much further back in history, and — surprisingly — make it more accurate than the elaborate Transformer variants that came before it. It even beat the strong “just use a linear model” baseline that had recently embarrassed Transformers in this field.
The Problem
Forecasting means: given the recent history of a quantity, predict its future values.
The history you feed in is the lookback window (also written look-back window, and sometimes loosely called the receptive field), e.g., the last 336 hourly readings.
The stretch you predict is the prediction horizon (or just horizon) — e.g., the next 96, 192, 336, or 720 hours. “Long-term” forecasting means a long horizon, which is hard because errors compound and the far future depends on subtle long-range patterns.
A multivariate time series has several variables measured over time at once. Each variable is a channel (borrowing the word from image RGB channels). Example: a weather dataset with 21 channels (temperature, humidity, pressure, wind, ...), each a separate sequence over 52,696 time steps.
Before this paper, the trend was to build ever-more-complicated Transformer models (Informer, Autoformer, FEDformer) for this task. Then a 2022 paper (DLinear) showed a plain linear model could beat all of them — casting doubt on whether Transformers are even useful here. PatchTST’s mission: rescue the Transformer by fixing how it consumes a time series, rather than by adding more machinery.
Background a newcomer needs
Neural network. A function with millions of tunable numbers (“weights”) that you fit to data by gradient descent so its predictions match known answers.
Transformer. A neural architecture built around attention. It processes a sequence of tokens (in language, tokens are words/sub-words).
Token / embedding. A token is one unit of input. Each token is turned into an embedding — a vector of numbers (length D, e.g., 128) that the network manipulates. “Embedding” just means “represent this thing as a point in a high-dimensional space.”
Attention / self-attention. Attention lets every token look at every other token and decide how much to “pay attention” to each. Concretely, each token produces a query, a key, and a value vector. Token A’s relevance to token B is the dot-product of A’s query with B’s key; these scores are normalized (softmax) into weights, and each token’s output is the weighted sum of all tokens’ values. Self-attention is when the tokens attend to each other within the same sequence. Intuition: in “the river bank,” attention lets “bank” look at “river” to disambiguate its meaning. For a time series, attention lets a Monday-morning pattern look at every prior morning to find the relevant ones.
Encoder vs. decoder. An encoder reads the input and turns it into rich internal representations. A decoder generates an output sequence step by step. Older forecasting Transformers used both. PatchTST uses only a vanilla encoder plus a simple linear output layer — much simpler.
Self-attention complexity. The catch with attention: with N tokens, every token attends to every other, so cost grows as N² (quadratic) in both time and memory. Double the sequence length and you roughly quadruple the cost. This O(N²) blow-up is the bottleneck for long sequences and the thing most prior work tried to patch over with clever sparse approximations.
Stationarity / distribution shift. A series is stationary if its statistics (mean, variance) stay roughly constant over time. Real series usually aren’t: the average electricity load in winter differs from summer. Distribution shift means the training-period statistics differ from the test-period statistics, which can wreck a model. PatchTST addresses this with instance normalization (below).
Self-supervised / masked pretraining. Instead of training only on labeled “predict the future” examples, you can hide (mask) parts of the input and train the model to fill them back in — no labels needed. This is how BERT (language) and masked autoencoders (vision) learn powerful general representations. PatchTST imports this trick.
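To make the attention entry above concrete, here is a minimal pure-Python sketch of single-head scaled dot-product self-attention. It is illustrative only: the function names are made up for this example, and the toy tokens are used directly as queries, keys, and values, whereas a real Transformer first applies learned projection matrices.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head.

    queries, keys, values: lists of N vectors (lists of floats).
    Returns (outputs, weights); weights[a][b] is how much token a attends to token b.
    """
    d_k = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Relevance of every key to this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        w = softmax(scores)
        # Output = weighted sum of all value vectors.
        out = [sum(w[j] * values[j][c] for j in range(len(values)))
               for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy input: three tokens with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
for row in weights:
    print([round(x, 3) for x in row])
```

The weights table has N × N entries (3 × 3 here), which is where the quadratic cost described under self-attention complexity comes from.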
Key Idea / Innovation
Two ingredients, both deliberately simple:
1. Patching: group time steps into subseries-level tokens
A single time step (one number) has no meaning on its own, the way a single letter doesn’t. A patch of consecutive steps does — it can encode “a rising morning ramp” or “a weekend dip.” So PatchTST slices each series into patches and treats each patch as one token.
Raw series (one channel), L = 16 steps:
[ 3 4 6 9 8 7 5 4 3 5 8 9 7 6 4 3 ]
Patch length P = 4, stride S = 2 (patches overlap by 2):
patch1: [3 4 6 9]
patch2: [6 9 8 7]
patch3: [8 7 5 4] ... and so on
Each patch -> one input "word" for the Transformer.
Three free benefits fall out of this:
Local semantic meaning is preserved inside each token (a patch captures a little shape, not a lone point).
Quadratic cost reduction. The number of tokens drops from N ≈ L down to N ≈ L/S. Because attention is O(N²), dividing the token count by the stride S cuts attention cost by about S². In the paper, with L=336, P=16, S=8, training ran up to 22x faster on the Traffic dataset.
The model can look back much further. Because tokens are cheaper, you can afford a longer lookback window for the same compute, and longer history genuinely helps (see “longer is better” below).
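The toy example above can be reproduced with a few lines of Python. This is a sketch, not the authors' code: `make_patches` is a hypothetical helper, and the optional `pad_end` flag (repeat the last value `stride` times before slicing) is an assumption modelled on the end padding the paper describes.

```python
def make_patches(series, patch_len, stride, pad_end=False):
    """Slice a 1-D series into windows of patch_len, one window every `stride` steps.

    pad_end=True first repeats the final value `stride` times (assumed end padding).
    """
    values = list(series)
    if pad_end:
        values = values + [values[-1]] * stride
    last_start = len(values) - patch_len
    return [values[i:i + patch_len] for i in range(0, last_start + 1, stride)]


series = [3, 4, 6, 9, 8, 7, 5, 4, 3, 5, 8, 9, 7, 6, 4, 3]
patches = make_patches(series, patch_len=4, stride=2)

print(patches[:3])  # [[3, 4, 6, 9], [6, 9, 8, 7], [8, 7, 5, 4]]
print(len(series), "time steps ->", len(patches), "tokens")  # 16 time steps -> 7 tokens
```

With `pad_end=True` the same series yields 8 patches, and a 512-step series with P=16, S=8 yields 64.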
2. Channel-independence: one shared model, applied per series
With M channels you have two choices:
Channel-mixing (what Informer/FEDformer do): at each time step, stack all M channel values into one vector and embed that. Information across channels gets blended early.
Channel-independence (PatchTST): split the multivariate series into M separate univariate series. Feed each one through the same Transformer, independently, with shared weights. The weights are learned jointly from all channels, but in every forward pass each channel is processed and forecast on its own.
Channel-mixing: Channel-independence (PatchTST):
[ch1 ch2 ... chM] ─┐ ch1 ──► [shared Transformer] ──► forecast1
├─► one token ch2 ──► [shared Transformer] ──► forecast2
mixes all channels ┘ ... ...
chM ──► [shared Transformer] ──► forecastM
                               (same weights every time)
Why this helps: it’s the same idea that already worked for CNNs and linear models in time series. It keeps each channel’s own dynamics clean (one road’s traffic isn’t smeared by 861 others), reduces overfitting, and lets the same pretrained model handle datasets with different numbers of channels. The price: it ignores correlations between channels, which the authors flag as future work.
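In code, channel-independence amounts to looping one shared model over the channels. The sketch below assumes a hypothetical `shared_model` callable that maps a univariate history to a list of future values; the `repeat_last_value` stand-in is only a placeholder, not the PatchTST network.

```python
def forecast_channel_independent(window, shared_model):
    """Apply one shared forecaster to every channel separately.

    window: list of L time steps, each a list of M channel values.
    shared_model: callable mapping a univariate history (list of floats)
                  to a list of T forecast values (hypothetical stand-in).
    Returns a list of T time steps, each holding M values.
    """
    per_channel = [list(ch) for ch in zip(*window)]       # M series of length L
    forecasts = [shared_model(ch) for ch in per_channel]  # same callable every time
    return [list(step) for step in zip(*forecasts)]       # back to T x M


def repeat_last_value(history, horizon=3):
    """Placeholder model: predicts the last observed value for every future step."""
    return [history[-1]] * horizon


window = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]  # L=3 steps, M=2 channels
forecast = forecast_channel_independent(window, repeat_last_value)
print(forecast)  # [[3.0, 30.0], [3.0, 30.0], [3.0, 30.0]]
```

Because the model only ever sees one series at a time, the same function works unchanged for any number of channels.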
How It Works, Step by Step (the execution flow)
Input: a multivariate window (x₁, ..., x_L), where each x_t is an M-dimensional vector. Goal: predict (x_{L+1}, ..., x_{L+T}).
┌─────────── done once PER CHANNEL, weights shared ───────────┐
multivariate window
(L steps, M channels)
│
▼
① split into M univariate series x⁽ⁱ⁾ ∈ R^(1×L), i = 1..M
│
▼
② instance-normalize each series (subtract its mean, divide by its std)
│
▼
③ patch it: P=16, S=8 → N ≈ L/S patches, each of length P
│
▼
④ linear projection W_p : each patch (length P) → embedding (length D)
+ add learnable position encoding (so the model knows patch order)
│
▼
⑤ vanilla Transformer encoder (n× of: multi-head self-attention →
Add&Norm with BatchNorm → feed-forward → Add&Norm, residual connections)
│
▼
⑥ flatten the encoder output → linear head → T predicted values x̂⁽ⁱ⁾ ∈ R^(1×T)
│
▼
⑦ de-normalize: add back the mean & std removed in ②
└──────────────────────────────────────────────────────────────────┘
│
▼
 ⑧ concatenate the M per-channel forecasts → full multivariate forecast
Detail on each step:
1. Split into channels. The M-channel window becomes M separate length-L univariate series. Each travels through the network independently.
2. Instance normalization. Each series instance is rescaled to zero mean and unit standard deviation before patching, and the stored mean and std are applied in reverse to the final prediction. This neutralizes distribution shift (a winter window and a summer window both get centered), so the Transformer only has to model shape, not absolute level.
3. Patching. Length-L series → a sequence of N = ⌊(L−P)/S⌋ + 2 patches (the last value is repeated S times as end padding so the windowing comes out even). Patches may overlap (supervised) or not (self-supervised).
4. Projection + position embedding. A single trainable linear map W_p turns each length-P patch into a length-D embedding. A learnable position encoding is added so the encoder knows patch #1 comes before patch #2 (attention itself is order-blind).
5. Transformer encoder. Standard multi-head self-attention: each head builds query/key/value matrices, computes softmax(QKᵀ/√d_k)·V, and the heads’ outputs feed a feed-forward block. Residual connections and BatchNorm (which the paper adopts, citing earlier work showing BatchNorm beats LayerNorm for time-series Transformers) stabilize training. Output: a representation z⁽ⁱ⁾ per channel.
6. Flatten + linear head. The encoder output is flattened and a linear layer maps it directly to all T future values at once (a direct multi-step forecast, with no slow step-by-step decoding).
7. De-normalize. Reverse step 2.
8. Concatenate the M univariate forecasts into the multivariate result.
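The normalize / de-normalize pair that wraps the model can be sketched as follows. This is a minimal illustration under stated assumptions: the helper names, the small `eps` guard against division by zero, and the `repeat_last` stand-in model are all hypothetical, not taken from the paper.

```python
import statistics


def instance_normalize(series, eps=1e-5):
    """Rescale one series instance to zero mean and (roughly) unit std."""
    mean = statistics.fmean(series)
    std = statistics.pstdev(series)
    normalized = [(x - mean) / (std + eps) for x in series]
    return normalized, mean, std


def denormalize(values, mean, std, eps=1e-5):
    """Undo instance_normalize on the model's output."""
    return [v * (std + eps) + mean for v in values]


def forecast_one_channel(series, model):
    """Normalize, run the (stand-in) model, then restore the original scale."""
    normalized, mean, std = instance_normalize(series)
    prediction = model(normalized)
    return denormalize(prediction, mean, std)


def repeat_last(history, horizon=2):
    """Placeholder model: repeats the last (normalized) value."""
    return [history[-1]] * horizon


winter = [9.0, 10.0, 11.0, 10.0]  # same shape, higher level
summer = [4.0, 5.0, 6.0, 5.0]     # same shape, lower level

winter_norm, _, _ = instance_normalize(winter)
summer_norm, _, _ = instance_normalize(summer)
print(winter_norm)
print(summer_norm)
print(forecast_one_channel(winter, repeat_last))
print(forecast_one_channel(summer, repeat_last))
```

The two normalized windows are identical, so the model sees the same shape in both cases; the differing levels are restored only after the forecast is made.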
Loss. Mean Squared Error (MSE) between predicted and true future values, averaged over all channels (ℒ denotes the loss, to keep it distinct from the lookback length L):
ℒ = (1/M) Σᵢ ‖x̂⁽ⁱ⁾_{L+1:L+T} − x⁽ⁱ⁾_{L+1:L+T}‖².
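Read literally, the formula sums the squared errors over the horizon for each channel and then averages over the M channels. A short sketch of that computation, with a hypothetical function name and made-up numbers:

```python
def channel_averaged_loss(predicted, actual):
    """Squared L2 error per channel over the horizon, averaged over the channels.

    predicted, actual: lists of M channels, each a list of T values.
    """
    per_channel = [
        sum((p - a) ** 2 for p, a in zip(pred_ch, true_ch))
        for pred_ch, true_ch in zip(predicted, actual)
    ]
    return sum(per_channel) / len(per_channel)


predicted = [[1.0, 2.0], [0.0, 0.0]]  # M=2 channels, T=2 steps
actual = [[1.0, 4.0], [1.0, 1.0]]
loss = channel_averaged_loss(predicted, actual)
print(loss)  # 3.0  (channel errors 4.0 and 2.0, averaged)
```

Dividing additionally by T would give the per-value MSE usually reported in benchmark tables; the two differ only by a constant factor.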
Optional: self-supervised pretraining
Same encoder, but swap the forecasting head for a small reconstruction head (a D×P linear layer). Use non-overlapping patches, randomly mask 40% of them (set to zero), and train the model to reconstruct the masked patches from the visible ones via MSE. Masking whole patches (not single points) forces real understanding — a single missing point could be guessed by interpolation, but a missing patch can’t. Afterward you can:
linear-probe (freeze the encoder, train only a new forecasting head), or
fine-tune (briefly linear-probe, then unfreeze everything).
On large datasets this pretraining beats training from scratch, and a model pretrained on one dataset (e.g., Electricity) transfers to others with strong accuracy — a step toward time-series “foundation models.”
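The masking step described in this section can be sketched in a few lines. The helper names, the fixed seed, and the toy patches are assumptions for illustration; only the 40% ratio, zero-filling, and MSE on the masked patches come from the description above.

```python
import random


def mask_patches(patches, mask_ratio=0.4, seed=0):
    """Zero out a random subset of whole patches.

    Returns (corrupted_patches, masked_indices).
    """
    rng = random.Random(seed)
    n_masked = round(mask_ratio * len(patches))
    masked_idx = sorted(rng.sample(range(len(patches)), n_masked))
    masked_set = set(masked_idx)
    corrupted = [
        [0.0] * len(p) if i in masked_set else list(p)
        for i, p in enumerate(patches)
    ]
    return corrupted, masked_idx


def reconstruction_loss(reconstructed, original, masked_idx):
    """MSE computed only over the masked patches."""
    errors = [
        (r - o) ** 2
        for i in masked_idx
        for r, o in zip(reconstructed[i], original[i])
    ]
    return sum(errors) / len(errors)


# Ten non-overlapping toy patches of length 4 (made-up values, none equal to zero).
patches = [[float(10 * i + j + 1) for j in range(4)] for i in range(10)]
corrupted, masked_idx = mask_patches(patches)

print("masked patch indices:", masked_idx)
print("loss if the model output the corrupted input:",
      reconstruction_loss(corrupted, patches, masked_idx))
```

During pretraining the encoder would receive `corrupted` and be trained so that its reconstruction drives this loss toward zero.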
A Worked Example
Suppose you run a power grid and want to predict the next 96 hours of demand for 321 households from the past 512 hours.
Input: a window of shape 512 × 321 (512 hourly steps, 321 channels).
Channel-independence: treat it as 321 separate 512-long series, each going through the same model.
Instance norm: household #7 averages 2 kW, household #200 averages 9 kW. Each is centered to mean 0 / std 1, so the model sees comparable shapes and re-attaches the real scale at the end.
Patching with P=16, S=8: each 512-long series becomes ⌊(512−16)/8⌋ + 2 = 64 patches. (This is the “64 words” of the title: a time series of 512 readings is encoded as 64 patch-tokens, the PatchTST/64 variant.) Each 16-hour patch captures a meaningful chunk like “the evening cooking-time peak.”
Encoder: self-attention lets, say, this Tuesday-evening patch attend to every previous evening patch in the 512-hour window to learn the recurring daily peak and the weekly weekday/weekend pattern, across about 21 days of history at once.
Head: one linear layer emits all 96 future hourly values for that household in a single shot.
De-normalize + concatenate → a 96 × 321 forecast.
Because patching shrank 512 steps to 64 tokens (8x fewer), the O(N²) attention is about 8² = 64x cheaper than feeding 512 raw points, which is exactly what lets you afford a 512-hour lookback in the first place.
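The arithmetic in this example is easy to check. The sketch below uses the patch-count formula from the step-by-step section and a pure O(N²) cost model; the function names are hypothetical, and real speedups also depend on everything outside the attention layers.

```python
def num_patches(lookback, patch_len, stride):
    """N = floor((L - P) / S) + 2, the patch count with end padding."""
    return (lookback - patch_len) // stride + 2


def attention_cost_ratio(lookback, n_tokens):
    """How many times cheaper N-token attention is than one token per time step,
    under a pure O(N^2) cost model."""
    return (lookback / n_tokens) ** 2


L, P, S = 512, 16, 8
N = num_patches(L, P, S)

print("tokens per series:", N)                              # 64
print("attention cost ratio:", attention_cost_ratio(L, N))  # 64.0
print("days of history:", round(L / 24, 1))                 # 21.3
```

The same formula gives 42 tokens for L=336, which corresponds to the PatchTST/42 configuration.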
“Longer is better,” empirically. On the Traffic case study, extending the lookback from L=96 to L=336 cut MSE from 0.518 to 0.397; adding patching pushed it to 0.367; self-supervised pretraining reached 0.349. Crucially, the older Transformers did not improve with longer lookbacks (a sign they couldn’t actually exploit long history) — PatchTST consistently did.
Strengths
Accuracy. Beats the best prior Transformer-based models by ~21% MSE / ~17% MAE on average across 8 benchmarks, and generally edges out the strong DLinear baseline, especially on large datasets (Weather, Traffic, Electricity) and ILI.
Efficiency. Patching cuts attention cost quadratically — up to 22x faster training on large data — and lets the model fit longer lookbacks in the same GPU memory. (Channel-mixing baselines literally ran out of 48 GB GPU memory in some ablations.)
Actually uses long history. Unlike earlier Transformers, MSE keeps dropping as the lookback window grows.
Strong representation learning. Masked self-supervised pretraining outperforms supervised-from-scratch on big datasets and transfers across datasets — and beats contrastive-learning methods (TS2Vec, BTSF, TNC, TS-TCC) by 34–49% in their comparison.
Simple & modular. A vanilla encoder, a linear head, and two tweaks. Patching is a drop-in operator other models can adopt; channel-independence handles variable channel counts.
Limitations
No cross-channel modeling. Channel-independence deliberately ignores correlations between series (e.g., neighboring roads, or temperature driving electricity load). Capturing these properly is left as future work; for datasets where channels strongly interact, this could leave accuracy on the table.
Fixed-length, regularly-sampled input assumed. Patching expects evenly spaced steps; irregular timestamps or missing data need preprocessing.
Patch hyperparameters matter. Patch length P and stride S (and the lookback L) must be chosen/tuned; the “right” patch size depends on the data’s natural cycle.
Benchmark scope. Results are on standard long-term forecasting benchmarks with MSE/MAE; the paper doesn’t focus on probabilistic/uncertainty forecasts or very high-frequency/streaming settings.
Direct multi-step head. Predicting the whole horizon with one linear layer is fast but ties the output layer size to T; very long horizons grow that head.
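To give a feel for the last limitation, the sketch below counts the parameters of a flatten-plus-linear head. The sizes are assumptions for illustration (N=64 patches as in the worked example, embedding length D=128 as in the background section), and the function name is hypothetical.

```python
def flatten_head_params(n_patches, d_model, horizon):
    """Parameters of one linear layer mapping the flattened encoder output
    (n_patches * d_model values) to `horizon` forecasts: weights plus biases."""
    in_features = n_patches * d_model
    return in_features * horizon + horizon


for horizon in (96, 192, 336, 720):
    print(horizon, flatten_head_params(n_patches=64, d_model=128, horizon=horizon))
```

Under these assumptions the head grows linearly with T, from roughly 0.8 million parameters at T=96 to roughly 5.9 million at T=720.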
Glossary
Attention / self-attention: mechanism letting each token weight and aggregate information from all other tokens; self-attention does this within one sequence. Core of the Transformer.
Channel: one variable/series in a multivariate dataset (one road, one household, one sensor).
Channel-independence: forecasting each channel separately through a single shared model (vs. channel-mixing, which blends channels into one input vector).
Direct multi-step forecast: predicting all T future values at once with one layer, instead of one-step-at-a-time decoding.
Embedding: a learned vector representation of an input unit (here, a patch).
Encoder / decoder: encoder turns input into internal representations; decoder generates output sequences. PatchTST uses only an encoder + linear head.
Horizon (prediction horizon): how many future steps you predict (e.g., 96, 720).
Instance normalization: per-series rescaling to zero mean / unit std before processing, undone afterward, to counter distribution shift.
Lookback window (L): how many past steps you feed in; also called receptive field.
Masked self-supervised pretraining: hiding parts of the input and learning to reconstruct them, no labels required.
MSE / MAE: Mean Squared / Mean Absolute Error; lower is better.
Multivariate time series: multiple variables measured over time together.
Patch: a short window of consecutive time steps treated as a single token; here P=16 with stride S=8.
Position encoding: added signal telling the order-blind attention which token came when.
Self-attention complexity: attention costs O(N²) in time and memory for N tokens; patching shrinks N to ~L/S, cutting cost by about S².
Stationarity / distribution shift: whether series statistics stay constant; shift = train vs. test statistics differ, which instance norm mitigates.
Stride (S): step between consecutive patches; smaller stride = more overlap = more (but costlier) tokens.
Transformer: attention-based neural architecture; PatchTST adapts it to time series by tokenizing patches.