PatchTST

"A Time Series is Worth 64 Words"

Nie, Nguyen, Sinthong & Kalagnanam — ICLR 2023 (arXiv:2211.14730)

TL;DR (one paragraph)

PatchTST is a Transformer for long-term time series forecasting — predicting, say, the next 720 hours of electricity demand from the past 512 hours. Its two big ideas are simple. First, patching: instead of feeding the network one number per time step, it chops the series into short overlapping windows (“patches”) of, e.g., 16 time steps and treats each patch as a single “word.” This is exactly what Vision Transformers do to images (16x16 pixel patches), hence the title’s nod to “an image is worth 16x16 words.” Second, channel-independence: when you have many sensors/series at once (electricity for 321 households, traffic on 862 roads), PatchTST forecasts each one separately through the same shared model rather than blending them all into one fat input vector. These two changes make the model far cheaper to run, let it look much further back in history, and — surprisingly — make it more accurate than the elaborate Transformer variants that came before it. It even beat the strong “just use a linear model” baseline that had recently embarrassed Transformers in this field.


The Problem

Forecasting means: given the recent history of a quantity, predict its future values.

A multivariate time series has several variables measured over time at once. Each variable is a channel (borrowing the word from image RGB channels). Example: a weather dataset with 21 channels (temperature, humidity, pressure, wind, ...), each a separate sequence over 52,696 time steps.

Before this paper, the trend was to build ever-more-complicated Transformer models (Informer, Autoformer, FEDformer) for this task. Then a 2022 paper (DLinear) showed a plain linear model could beat all of them — casting doubt on whether Transformers are even useful here. PatchTST’s mission: rescue the Transformer by fixing how it consumes a time series, rather than by adding more machinery.


Background a newcomer needs

Neural network. A function with millions of tunable numbers (“weights”) that you fit to data by gradient descent so its predictions match known answers.

Transformer. A neural architecture built around attention. It processes a sequence of tokens (in language, tokens are words/sub-words).

Token / embedding. A token is one unit of input. Each token is turned into an embedding — a vector of numbers (length D, e.g., 128) that the network manipulates. “Embedding” just means “represent this thing as a point in a high-dimensional space.”

Attention / self-attention. Attention lets every token look at every other token and decide how much to “pay attention” to each. Concretely, each token produces a query, a key, and a value vector. Token A’s relevance to token B is the dot-product of A’s query with B’s key; these scores are normalized (softmax) into weights, and each token’s output is the weighted sum of all tokens’ values. Self-attention is when the tokens attend to each other within the same sequence. Intuition: in “the river bank,” attention lets “bank” look at “river” to disambiguate its meaning. For a time series, attention lets a Monday-morning pattern look at every prior morning to find the relevant ones.
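
To make the query/key/value recipe concrete, here is a minimal single-head sketch in plain Python. It assumes the query, key, and value vectors are already given (in a real Transformer they come from learned linear projections of the embeddings), and the toy numbers are purely illustrative. The scaling by the square root of the key dimension follows the standard formulation.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Relevance of every token to this one: query . key, scaled.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output = weighted sum of every token's value vector.
        out = [sum(w * v[j] for w, v in zip(weights, values))
               for j in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy tokens with 2-dimensional vectors (illustrative numbers only).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
outputs, weights = self_attention(Q, K, V)
```

Note that `weights` holds one row per token and one column per token, so its size is N by N. That table is where the quadratic cost of attention comes from.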

Encoder vs. decoder. An encoder reads the input and turns it into rich internal representations. A decoder generates an output sequence step by step. Older forecasting Transformers used both. PatchTST uses only a vanilla encoder plus a simple linear output layer — much simpler.

Self-attention complexity. The catch with attention: with N tokens, every token attends to every other, so cost grows as O(N²) (quadratic) in both time and memory. Double the sequence length and you roughly quadruple the cost. This O(N²) blow-up is the bottleneck for long sequences and the thing most prior work tried to patch over with clever sparse approximations.

Stationarity / distribution shift. A series is stationary if its statistics (mean, variance) stay roughly constant over time. Real series usually aren’t: the average electricity load in winter differs from summer. Distribution shift means the training-period statistics differ from the test-period statistics, which can wreck a model. PatchTST addresses this with instance normalization (below).

Self-supervised / masked pretraining. Instead of training only on labeled “predict the future” examples, you can hide (mask) parts of the input and train the model to fill them back in — no labels needed. This is how BERT (language) and masked autoencoders (vision) learn powerful general representations. PatchTST imports this trick.


Key Idea / Innovation

Two ingredients, both deliberately simple:

1. Patching — group time steps into subseries-level tokens

A single time step (one number) has no meaning on its own, the way a single letter doesn’t. A patch of consecutive steps does — it can encode “a rising morning ramp” or “a weekend dip.” So PatchTST slices each series into patches and treats each patch as one token.

Raw series (one channel), L = 16 steps:

  [ 3  4  6  9  8  7  5  4  3  5  8  9  7  6  4  3 ]

Patch length P = 4, stride S = 2 (patches overlap by 2):

  patch1: [3 4 6 9]
  patch2:        [6 9 8 7]
  patch3:                [8 7 5 4]   ... and so on

Each patch -> one input "word" for the Transformer.
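
The slicing in the diagram above can be sketched in a few lines of plain Python. The function name `make_patches` and the `pad_end` flag are illustrative choices, not names from the paper's code; the optional end-padding mirrors the scheme described later under step 3 of the execution flow (repeat the last value S times).

```python
def make_patches(series, patch_len, stride, pad_end=False):
    """Slice a 1-D series into (possibly overlapping) patches."""
    values = list(series)
    if pad_end:
        # Repeat the last value `stride` times so the tail is covered.
        values = values + [values[-1]] * stride
    last_start = len(values) - patch_len
    return [values[i:i + patch_len] for i in range(0, last_start + 1, stride)]

series = [3, 4, 6, 9, 8, 7, 5, 4, 3, 5, 8, 9, 7, 6, 4, 3]
patches = make_patches(series, patch_len=4, stride=2)
padded = make_patches(series, patch_len=4, stride=2, pad_end=True)

for p in patches[:3]:
    print(p)
# [3, 4, 6, 9]
# [6, 9, 8, 7]
# [8, 7, 5, 4]
```

Without padding this toy series yields 7 patches; with padding it yields 8, which matches ⌊(16−4)/2⌋ + 2.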

Three free benefits fall out of this:

  1. Local semantic meaning is preserved inside each token (a patch captures a little shape, not a lone point).

  2. Quadratic cost reduction. The number of tokens drops from N ≈ L down to N ≈ L/S. Because attention is O(N²), dividing token count by the stride S cuts attention cost by roughly a factor of S². In the paper, with L=336, P=16, S=8, training ran up to 22x faster on the Traffic dataset.

  3. The model can look back much further. Because tokens are cheaper, you can afford a longer lookback window for the same compute — and longer history genuinely helps (see “longer is better” below).
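
As a quick sanity check on the second benefit, the arithmetic can be written out for the paper's setting (L=336, P=16, S=8). The helper names are hypothetical, and the ratio only counts pairwise attention scores, so it is a back-of-the-envelope estimate rather than the measured wall-clock speedup quoted above.

```python
def num_patches(lookback, patch_len, stride, pad_end=True):
    """Token count after patching; end-padding adds one extra patch."""
    n = (lookback - patch_len) // stride + 1
    return n + 1 if pad_end else n

def attention_cost_ratio(lookback, n_tokens):
    """How many times smaller the N x N attention table becomes."""
    return (lookback / n_tokens) ** 2

L, P, S = 336, 16, 8
N = num_patches(L, P, S)
print(N, attention_cost_ratio(L, N))  # 42 64.0
```

The estimate equals S² = 64 here because 336 steps collapse to exactly 336/8 = 42 tokens. Other parts of the network (projections, feed-forward blocks, the output head) do not scale quadratically, which is one reason a measured speedup can be smaller.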

2. Channel-independence — one shared model, applied per series

With M channels you have two choices:

Channel-mixing:                       Channel-independence (PatchTST):

 [ch1 ch2 ... chM] ─┐                  ch1 ──► [shared Transformer] ──► forecast1
                    ├─► one token       ch2 ──► [shared Transformer] ──► forecast2
 mixes all channels ┘                   ...                              ...
                                        chM ──► [shared Transformer] ──► forecastM
                                              (same weights every time)

Why this helps: it’s the same idea that already worked for CNNs and linear models in time series. It keeps each channel’s own dynamics clean (one road’s traffic isn’t smeared by 861 others), reduces overfitting, and lets the same pretrained model handle datasets with different numbers of channels. The price: it ignores correlations between channels — which the authors flag as future work.
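
The channel-independent side of the diagram amounts to a loop (or, in practice, a batch dimension) over channels with one shared model. The sketch below uses a deliberately trivial stand-in, `repeat_last_value`, in place of the shared Transformer; both function names are hypothetical and exist only to show the data flow.

```python
def forecast_channel_independent(window, shared_model, horizon):
    """window: list of M channels, each a list of L values.

    The same `shared_model` is called once per channel, so no channel
    ever sees another channel's values.
    """
    return [shared_model(channel, horizon) for channel in window]

def repeat_last_value(channel, horizon):
    """Hypothetical stand-in for the shared Transformer (not PatchTST)."""
    return [channel[-1]] * horizon

window = [
    [1.0, 2.0, 3.0, 4.0],    # channel 1
    [10.0, 9.0, 8.0, 7.0],   # channel 2
    [5.0, 5.0, 5.0, 5.0],    # channel 3
]
forecasts = forecast_channel_independent(window, repeat_last_value, horizon=2)
print(forecasts)  # [[4.0, 4.0], [7.0, 7.0], [5.0, 5.0]]
```

Because the function never looks at M, the same model works unchanged on a window with 3 channels or 862, which is the property that lets one pretrained model serve datasets of different widths.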


How It Works, Step by Step (the execution flow)

Input: a multivariate window (x₁, ..., x_L), where each x_t is an M-dimensional vector. Goal: predict (x_{L+1}, ..., x_{L+T}).

                          ┌─────────── done once PER CHANNEL, weights shared ───────────┐
multivariate window
(L steps, M channels)
        │
        ▼
 ① split into M univariate series   x⁽ⁱ⁾ ∈ R^(1×L),  i = 1..M
        │
        ▼
 ② instance-normalize each series   (subtract its mean, divide by its std)
        │
        ▼
 ③ patch it:  P=16, S=8  →  N ≈ L/S patches,  each of length P
        │
        ▼
 ④ linear projection W_p : each patch (length P) → embedding (length D)
    + add learnable position encoding  (so the model knows patch order)
        │
        ▼
 ⑤ vanilla Transformer encoder  (n× of: multi-head self-attention →
    Add&Norm with BatchNorm → feed-forward → Add&Norm, residual connections)
        │
        ▼
 ⑥ flatten the encoder output  → linear head → T predicted values  x̂⁽ⁱ⁾ ∈ R^(1×T)
        │
        ▼
 ⑦ de-normalize: multiply by the std and add back the mean removed in ②
        └──────────────────────────────────────────────────────────────────┘
        │
        ▼
 ⑧ concatenate the M per-channel forecasts → full multivariate forecast

Detail on each step:

  1. Split into channels. The M-channel window becomes M separate length-L univariate series. Each travels through the network independently.

  2. Instance normalization. Each series instance is rescaled to zero mean and unit standard deviation before patching, and the final prediction is rescaled by that same standard deviation and shifted by that same mean. This neutralizes distribution shift (a winter window and a summer window both get centered), so the Transformer only has to model shape, not absolute level.

  3. Patching. Length-L series → a sequence of N = ⌊(L−P)/S⌋ + 2 patches (the last value is padded S times so the windowing comes out even). Patches may overlap (supervised) or not (self-supervised).

  4. Projection + position embedding. A single trainable linear map W_p turns each length-P patch into a length-D embedding. A learnable position encoding is added so the encoder knows patch #1 comes before patch #2 (attention itself is order-blind).

  5. Transformer encoder. Standard multi-head self-attention: each head builds query/key/value matrices, computes softmax(QKᵀ/√d_k)·V, and the heads’ outputs feed a feed-forward block. Residual connections and BatchNorm (the authors found BatchNorm beats LayerNorm for time series) stabilize training. Output: a representation z⁽ⁱ⁾ per channel.

  6. Flatten + linear head. The encoder output is flattened and a linear layer maps it directly to all T future values at once (a direct multi-step forecast — no slow step-by-step decoding).

  7. De-normalize. Reverse step 2.

  8. Concatenate the M univariate forecasts into the multivariate result.
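
Steps 2 and 7 form a round trip that is easy to check by hand. The sketch below assumes a population standard deviation and a small `eps` term for numerical safety; both are illustrative choices, and the winter and summer numbers are invented for the example.

```python
import math

def instance_normalize(series, eps=1e-5):
    """Return the zero-mean, unit-std series plus the stats needed to undo it."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var + eps)  # eps (an assumption) guards against flat series
    return [(x - mean) / std for x in series], mean, std

def denormalize(values, mean, std):
    """Map model outputs back to the original scale (step 7)."""
    return [v * std + mean for v in values]

# Same shape, very different level (made-up load values).
winter = [900.0, 950.0, 1000.0, 1050.0, 1100.0]
summer = [400.0, 450.0, 500.0, 550.0, 600.0]
w_norm, w_mean, w_std = instance_normalize(winter)
s_norm, s_mean, s_std = instance_normalize(summer)
restored = denormalize(w_norm, w_mean, w_std)
```

After normalization the two windows are numerically the same sequence, so the encoder sees one shape instead of two levels; the stored mean and std put each forecast back on its own scale.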

Loss. Mean Squared Error (MSE) between predicted and true future values, averaged over all channels: L = (1/M) Σᵢ ‖x̂⁽ⁱ⁾_{L+1:L+T} − x⁽ⁱ⁾_{L+1:L+T}‖².
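
Read literally, the formula sums squared errors over the T forecast steps within each channel and then averages over the M channels. A minimal sketch follows; the function name is hypothetical, and many implementations also divide by T, which changes the value by a constant factor but not the minimizer.

```python
def channel_averaged_squared_error(predictions, targets):
    """predictions, targets: M channels, each a list of T values."""
    per_channel = []
    for pred, true in zip(predictions, targets):
        # Squared L2 norm of the error over the forecast horizon.
        per_channel.append(sum((p - t) ** 2 for p, t in zip(pred, true)))
    return sum(per_channel) / len(per_channel)

predictions = [[1.0, 2.0], [3.0, 4.0]]
targets = [[1.0, 3.0], [5.0, 4.0]]
loss = channel_averaged_squared_error(predictions, targets)
print(loss)  # 2.5
```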

Optional: self-supervised pretraining

Same encoder, but swap the forecasting head for a small reconstruction head (a D×P linear layer). Use non-overlapping patches, randomly mask 40% of them (set to zero), and train the model to reconstruct the masked patches from the visible ones via MSE. Masking whole patches (not single points) forces real understanding: a single missing point could be guessed by interpolating its neighbors, but a missing patch can’t. Afterward, the pretrained encoder is fine-tuned for forecasting, either by training only a new linear head on top of the frozen encoder (linear probing) or by updating the whole network (end-to-end fine-tuning).

On large datasets this pretraining beats training from scratch, and a model pretrained on one dataset (e.g., Electricity) transfers to others with strong accuracy — a step toward time-series “foundation models.”
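
The masking step itself is simple enough to sketch without a model. The code below cuts a toy series into non-overlapping patches, zeroes out 40% of them, and scores a reconstruction on the masked patches only. The function names, the fixed seed, and the toy series are assumptions made for illustration.

```python
import random

def mask_patches(patches, mask_ratio=0.4, seed=0):
    """Zero out a random subset of whole patches; return them and their indices."""
    rng = random.Random(seed)
    n_masked = round(len(patches) * mask_ratio)
    masked_idx = sorted(rng.sample(range(len(patches)), n_masked))
    masked_set = set(masked_idx)
    corrupted = [[0.0] * len(p) if i in masked_set else list(p)
                 for i, p in enumerate(patches)]
    return corrupted, masked_idx

def reconstruction_mse(reconstructed, original, masked_idx):
    """Mean squared error computed over the masked patches only."""
    errors = [(r - o) ** 2
              for i in masked_idx
              for r, o in zip(reconstructed[i], original[i])]
    return sum(errors) / len(errors)

series = [float(v) for v in range(1, 41)]             # 40 toy time steps
patches = [series[i:i + 4] for i in range(0, 40, 4)]  # 10 non-overlapping patches
corrupted, masked_idx = mask_patches(patches)

# A perfect reconstruction scores 0; returning the zeroed input does not.
perfect = reconstruction_mse(patches, patches, masked_idx)
lazy = reconstruction_mse(corrupted, patches, masked_idx)
```

During pretraining, the encoder plus reconstruction head would take `corrupted` as input and be trained to drive this loss toward zero.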


A Worked Example

Suppose you run a power grid and want to predict the next 96 hours of demand for 321 households from the past 512 hours.

With P=16 and S=8 (the settings from step ③), patching shrinks each 512-step series to ⌊(512−16)/8⌋ + 2 = 64 tokens, so the O(N²) attention is about (512/64)² = 64x cheaper than feeding 512 raw points. That saving is exactly what lets you afford a 512-hour lookback in the first place.
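
To see how the sizes flow through the pipeline for this scenario, the bookkeeping can be traced in code. This is a sketch under stated assumptions: P=16 and S=8 are taken from the step diagram, D=128 is the example embedding length mentioned in the background section, and the function name `trace_shapes` is hypothetical.

```python
def trace_shapes(lookback, patch_len, stride, d_model, horizon, n_channels):
    """Follow the tensor sizes for one input window (per channel)."""
    n_tokens = (lookback - patch_len) // stride + 2  # with end-padding
    flat = n_tokens * d_model                        # encoder output, flattened
    return {
        "series_per_window": n_channels,             # each handled independently
        "tokens_per_series": n_tokens,
        "patch_matrix": (n_tokens, patch_len),
        "encoder_output": (n_tokens, d_model),
        "flattened": flat,
        "head_weights": flat * horizon,              # linear head, bias excluded
        "attention_speedup": (lookback / n_tokens) ** 2,
    }

shapes = trace_shapes(lookback=512, patch_len=16, stride=8,
                      d_model=128, horizon=96, n_channels=321)
for name, value in shapes.items():
    print(f"{name}: {value}")
```

Under these assumptions each household's 512 hours become a 64 x 16 patch matrix, then a 64 x 128 encoder output, which flattens to 8,192 features feeding a linear head with 8,192 x 96 weights.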

“Longer is better,” empirically. On the Traffic case study, extending the lookback from L=96 to L=336 cut MSE from 0.518 to 0.397; adding patching pushed it to 0.367; self-supervised pretraining reached 0.349. Crucially, the older Transformers did not improve with longer lookbacks (a sign they couldn’t actually exploit long history) — PatchTST consistently did.


Strengths

Limitations


Glossary