TimesFM - Luca's Research @ PEG

Das, Kong, Sen & Zhou — ICML 2024 (arXiv:2310.10688)

TL;DR (one paragraph)

TimesFM is GPT, but for time series. Instead of training a fresh model for each forecasting problem (the way Informer, Autoformer and PatchTST do), the authors at Google pretrain a single ~200-million-parameter decoder-only transformer on a giant pile of time series — about 100 billion timepoints of Google Trends, Wikipedia pageviews, traffic, weather, plus synthetic data. After that one expensive pretraining run, the model can be handed a series it has never seen before and produce a forecast immediately — zero-shot, with no per-dataset training. The clever engineering bits: the input series is chopped into patches (chunks of 32 timepoints) that each become one transformer “token,” and the model emits forecasts in long output patches (128 timepoints) so it can cover a long horizon in just a handful of autoregressive steps. The headline result: out of the box, zero-shot, TimesFM gets close to or matches the accuracy of supervised models that were specifically trained on each test dataset.

The Problem

Classic deep forecasters have a workflow that looks like this: pick a dataset (say, a city’s electricity demand), train a model on that dataset’s history, then forecast that dataset’s future. PatchTST, Informer, Autoformer, DeepAR — all of them are per-dataset models. If tomorrow you get a new dataset (a different city, a retail SKU, server traffic), you start over: collect history, train, tune, validate.

That is slow, expensive, and brittle. It also wastes an obvious opportunity. In NLP, we stopped training a new model per task years ago. We pretrain one foundation model (a large model trained once on a broad corpus, then reused for many downstream tasks) like GPT, and then it answers questions, summarizes, translates — zero-shot, meaning with no task-specific training, just a prompt.

The question TimesFM asks: can we build the GPT of time series? One model, pretrained once, that forecasts any new series out of the box without retraining. The challenges that make this non-obvious:

Time series come at wildly different granularities (hourly, daily, weekly, monthly) and scales (web clicks vs. megawatts).
There is no shared “vocabulary” the way text has words — values are continuous real numbers.
Forecast horizons vary, context lengths (how much past history you feed in) vary, and you cannot assume the new dataset looks like anything in your training set.

Background a newcomer needs

A few terms, defined once:

Time series: a sequence of numbers measured over time, e.g. hourly electricity demand [412, 430, 455, ...].
Context length (lookback): how many past timepoints you feed the model as input. Bigger context = more history to reason from.
Horizon: how many future timepoints you want predicted. “Forecast the next 168 hours” → horizon = 168.
Univariate vs. multivariate: univariate = one signal at a time (just demand). Multivariate = many signals jointly (demand + temperature + price). TimesFM is univariate — it forecasts each series on its own.
Foundation model: a large model pretrained once on a broad corpus, then reused for many downstream tasks. (BERT, GPT in NLP.)
Zero-shot: applying that pretrained model to a brand-new dataset with no further training on it. The opposite of “train a model on this dataset first.”
Tokenization: turning raw input into the discrete units a transformer consumes. In text, tokens are word-pieces. In TimesFM, a token is a patch.
Patch: a contiguous chunk of the series (e.g. 32 consecutive timepoints) treated as a single unit. Patching is borrowed in spirit from PatchTST: instead of one token per timepoint, you get one token per chunk — far fewer tokens, and each carries local shape information.
Decoder-only / autoregressive: a transformer that reads left-to-right with causal attention (each position can only look at the past, never the future) and generates output one step at a time, feeding its own prediction back in to produce the next — exactly how GPT writes text. This is in contrast to one-shot forecasters (Informer/Autoformer/PatchTST) that emit the entire horizon in a single forward pass.
Residual MLP (Residual Block): a small multi-layer perceptron with a skip connection (output = MLP(x) + linear(x)). TimesFM uses these to convert a patch of raw numbers into a token vector, and to convert a token back into predicted numbers.
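To make "patch" and "context length" concrete, here is a minimal sketch of cutting a series into non-overlapping 32-point patches. The function name `to_patches`, the made-up demand values, and the choice to left-pad a short first patch with zeros (tracked by a 0/1 mask) are assumptions for illustration, not the released implementation.

```python
INPUT_PATCH_LEN = 32  # input patch length stated in these notes


def to_patches(series, patch_len=INPUT_PATCH_LEN):
    """Split a series into non-overlapping patches, left-padding with zeros.

    Returns (patches, masks); a mask value of 1 marks a padded position
    that a model should ignore. Left-padding is an assumption of this sketch.
    """
    pad = (-len(series)) % patch_len          # zeros needed to fill the first patch
    padded = [0.0] * pad + list(series)
    mask = [1] * pad + [0] * len(series)
    patches = [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]
    masks = [mask[i:i + patch_len] for i in range(0, len(mask), patch_len)]
    return patches, masks


context = [400.0 + (h % 24) for h in range(256)]   # made-up hourly demand
patches, masks = to_patches(context)
print(len(patches), len(patches[0]))               # 8 32
```

Each of the 8 patches would then become one token, so the transformer attends over 8 positions rather than 256.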

The GPT analogy, and where it breaks. TimesFM is to time series what GPT is to text: pretrain once on a huge corpus, then predict the “next chunk” zero-shot. But the analogy is not perfect:

GPT predicts a discrete token from a fixed vocabulary via classification (a softmax over ~50k words). TimesFM predicts continuous real values via regression — there is no vocabulary and no softmax.
GPT’s loss is cross-entropy over token classes; TimesFM’s loss is mean squared error (MSE) on the predicted numbers.
A GPT token is a word-piece; a TimesFM “token” is a learned embedding of a 32-number patch.

So: same skeleton (causal decoder, next-chunk prediction, pretrain-then-zero-shot), different flesh (continuous regression instead of discrete classification).
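The difference in "flesh" shows up most directly in the loss. The sketch below puts the two side by side: a mean squared error on real-valued predictions, and a softmax cross-entropy on class scores. The toy numbers and the three-word vocabulary are invented for illustration.

```python
import math


def mse(pred, target):
    """Mean squared error between two equal-length lists of real values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def cross_entropy(logits, target_index):
    """Negative log-probability of the true class under a softmax over logits."""
    m = max(logits)                                    # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


# Regression (TimesFM-style): compare predicted numbers with true numbers.
print(mse([410.0, 432.0], [412.0, 430.0]))             # 4.0

# Classification (GPT-style): score the true token in a toy 3-word vocabulary.
print(cross_entropy([2.0, 0.5, -1.0], target_index=0))
```

The regression loss needs no vocabulary at all: the targets are simply the future values themselves.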

Key Idea / Innovation

Three ideas, stacked:

1. Treat forecasting as a foundation-model problem. Pretrain one decoder-only transformer on a massive, diverse corpus so it learns the general “grammar” of time series (trends, seasonality, level shifts, noise) and can then forecast anything zero-shot.
2. Patch the input; use a residual MLP as the tokenizer. Break the context into non-overlapping patches of input_patch_len = 32 timepoints; push each through a residual MLP to get one transformer token. Patching keeps the sequence short (1 token per 32 points instead of 1 per point) and gives each token useful local context.
3. Make output patches longer than input patches. The output residual MLP emits an entire patch of output_patch_len = 128 predicted timepoints at once, four times longer than the input patch. This is the efficiency trick. Because each autoregressive step produces 128 future points (not 32, and certainly not 1), a long horizon is covered in very few roll-forward steps. Forecasting 256 steps takes 2 autoregressive steps with 128-long output patches, versus 8 if output patches matched the 32-long input. Fewer steps means faster inference and less compounding of autoregressive error.
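The step-count arithmetic behind the long-output-patch idea is just a ceiling division. A minimal sketch follows; the function name `autoregressive_steps` is hypothetical.

```python
import math


def autoregressive_steps(horizon, output_patch_len):
    """Number of model calls needed to cover `horizon` future points."""
    return math.ceil(horizon / output_patch_len)


print(autoregressive_steps(256, 128))   # 2
print(autoregressive_steps(256, 32))    # 8
```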

Why does the asymmetry help and not hurt? During pretraining the model is asked to predict the next 128 points from as few as 32 points of context (then the next 128 from 64, and so on), so it learns to make confident long jumps rather than timid one-patch nudges.
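One way to picture this training signal is to list, for a single series, every (context, target) pair the decoder is scored on: one pair per input-patch boundary, each with a 128-point target. The sketch below only enumerates index ranges. The helper name `training_pairs` is hypothetical, and dropping targets that would run past the end of the series is a simplification of this sketch.

```python
INPUT_PATCH_LEN = 32
OUTPUT_PATCH_LEN = 128


def training_pairs(series_len, p_in=INPUT_PATCH_LEN, p_out=OUTPUT_PATCH_LEN):
    """List (context_len, target_start, target_end) for every input-patch boundary.

    Target indices are half-open [start, end) and must fit inside the series.
    """
    pairs = []
    for ctx in range(p_in, series_len - p_out + 1, p_in):
        pairs.append((ctx, ctx, ctx + p_out))
    return pairs


for ctx, start, end in training_pairs(512)[:3]:
    print(f"see first {ctx} points -> predict points [{start}, {end})")
```

Every target spans exactly 128 points regardless of how much context precedes it, which is what makes the long jump a routine task for the model.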

How It Works, Step by Step (the execution flow)

The flow from a raw context series to a forecast:

raw context series  (e.g. 256 past hourly values)
        │
        ▼
[1] (optional) normalize the context  →  zero-mean / unit-scale numbers
        │
        ▼
[2] PATCH:  split into chunks of 32        [p1][p2][p3] ... [p8]
        │
        ▼
[3] TOKENIZE: each patch → Residual MLP → token vector (dim 1280)
        │        + add positional encoding (PE_j)
        │        + optional date-derived features (day-of-week, month, ...)
        ▼
   [t1][t2][t3] ... [t8]
        │
        ▼
[4] STACKED CAUSAL DECODER  (20 transformer layers, 16 heads, causal mask)
        │   each token attends only to earlier tokens
        ▼
   [h1][h2][h3] ... [h8]   (one hidden state per input position)
        │
        ▼
[5] OUTPUT Residual MLP on the LAST hidden state
        │   emits a whole OUTPUT PATCH of 128 future values at once
        ▼
   ŷ[1..128]
        │
        ▼
[6] AUTOREGRESSIVE ROLL-FORWARD: append ŷ to context, re-patch, repeat
        │   until horizon is covered
        ▼
[7] DE-NORMALIZE  →  forecast back in original units (MW, clicks, ...)

Numbered:

1. (Optional) Normalize. The context is rescaled to a stable numeric range (so a series in megawatts and one in web-clicks look comparable to the model); the transformation is reversed on the output so predictions come back in real units.
2. Patch. The context is cut into non-overlapping patches of length 32. A random-length masking strategy is what lets the model handle any context length: during training the first r (a random number, 0…31) timepoints of the first patch are masked out, so the model practices forecasting from 1, 2, 3, … up to the maximum context length. A binary padding mask carries this through.
3. Tokenize. Each patch is fed through an input residual MLP that maps the 32 raw numbers to a 1280-dim token vector. A standard transformer positional encoding is added so the model knows token order; optional date features can be mixed in.
4. Stacked causal transformer. The tokens pass through 20 decoder layers with causal (masked) self-attention, so every token sees only the past. This is the same machinery as a GPT layer; here it learns temporal dependencies across patches.
5. Output patch. The output residual MLP reads a hidden state and emits an entire 128-long output patch (the model’s point forecast for the next 128 timepoints) in one shot.
6. Autoregressive roll-forward. To go beyond 128 steps, append the just-predicted patch to the context, re-patch, and run the model again. Repeat until the horizon is filled. Because each step yields 128 points, few steps are needed.
7. De-normalize. Undo step 1’s scaling so the forecast is in the original physical units.
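The control flow of these steps can be sketched in a few lines of plain Python. Everything model-related here is a stand-in: `placeholder_model` simply repeats the last input patch and is not TimesFM, and normalizing with the whole context's mean and standard deviation is an assumption of this sketch. The sketch also assumes the context holds at least one full patch. Only the loop structure (normalize, patch, emit 128 values, append, repeat, de-normalize) mirrors the description above.

```python
import statistics

INPUT_PATCH_LEN = 32
OUTPUT_PATCH_LEN = 128


def placeholder_model(patches):
    """Stand-in for the pretrained decoder (NOT TimesFM): repeats the last patch.

    It only mimics the interface: a list of input patches in, one output patch out.
    """
    last = patches[-1]
    return last * (OUTPUT_PATCH_LEN // len(last))


def forecast(context, horizon, model=placeholder_model):
    # [1] normalize with the context's own mean and standard deviation (assumption)
    mu = statistics.fmean(context)
    sigma = statistics.pstdev(context) or 1.0   # avoid dividing by zero on flat series
    running = [(x - mu) / sigma for x in context]

    preds = []
    while len(preds) < horizon:
        # [2] patch: keep the most recent whole patches of the running context
        usable = len(running) - len(running) % INPUT_PATCH_LEN
        window = running[len(running) - usable:]
        patches = [window[i:i + INPUT_PATCH_LEN]
                   for i in range(0, usable, INPUT_PATCH_LEN)]
        # [3]-[5] tokenize, decode, emit one output patch (all inside the stand-in)
        out = model(patches)
        # [6] roll forward: append the predictions to the running context
        preds.extend(out)
        running.extend(out)

    # [7] de-normalize and trim to the requested horizon
    return [p * sigma + mu for p in preds[:horizon]]


context = [400.0 + 50.0 * ((h % 24) / 24) for h in range(256)]   # made-up demand
yhat = forecast(context, horizon=168)
print(len(yhat))   # 168
```

Swapping `placeholder_model` for a real pretrained decoder would leave the surrounding loop unchanged, which is the point of the sketch.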

Training objective: minimize MSE between predicted and true future patches, averaged over all output patches in the batch — a pure point-forecast regression loss, with no per-dataset fine-tuning.
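The random-length masking used alongside this objective (see the Patch step) can be sketched as drawing r and hiding the first r points of the context. The function name `random_front_mask` and the uniform draw over 0…31 are assumptions of this sketch; a mask value of 1 means "ignore this point".

```python
import random

INPUT_PATCH_LEN = 32


def random_front_mask(num_points, patch_len=INPUT_PATCH_LEN, rng=random):
    """Return (mask, r): a 0/1 mask hiding the first r points, r in [0, patch_len - 1]."""
    r = rng.randrange(patch_len)            # uniform over 0 .. patch_len - 1 (assumption)
    return [1] * r + [0] * (num_points - r), r


rng = random.Random(0)                      # seeded only to make the sketch repeatable
mask, r = random_front_mask(256, rng=rng)
visible = mask.count(0)
print(r, visible)                           # visible is always 256 - r
```

Because the decoder is scored after every input patch, hiding r points means it sees effective contexts of 32 − r, 64 − r, and so on; across many draws of r that covers every context length.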

A Worked Example

Scenario (shared with the sibling notes): predict the next 7 days = 168 hours of a city’s electricity demand from a recent lookback window.

The PatchTST way (for contrast): you would collect that grid’s historical demand, train a PatchTST model on it (split into train/val/test, tune, etc.), and only then forecast. The model only knows this one grid.

The TimesFM way — zero-shot: TimesFM was pretrained months ago on Google Trends, Wikipedia, traffic, weather and synthetic data. It has never seen this electricity grid. You simply hand it the recent demand history and ask for 168 hours. No training, no tuning.

Concretely, suppose you feed a context of 256 past hourly values:

context = 256 hours  →  256 / 32 = 8 input patches
                        each patch → token → 8 tokens into the decoder

Now decoding to a horizon of 168 hours, with output_patch_len = 128:

step 1:  model emits hours   1 .. 128   (one 128-long output patch)
         append those 128 predictions to the running context
step 2:  re-patch, model emits hours 129 .. 256
         we only needed up to 168, so keep 129 .. 168, discard the rest

So 2 autoregressive steps cover the full 168-hour week. Had the output patch matched the 32-long input patch, you would need ⌈168/32⌉ = 6 steps, with six chances for error to compound. The long-output-patch design is what keeps autoregression cheap here.
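The same bookkeeping, written out as a small script. The helper name `decoding_schedule` is hypothetical; it lists, per roll-forward step, the first hour emitted, the last hour kept, and how many emitted hours are discarded.

```python
def decoding_schedule(horizon, output_patch_len=128):
    """Per-step (first_hour, last_hour_kept, discarded) for a roll-forward forecast."""
    schedule = []
    start = 1
    while start <= horizon:
        emitted_end = start + output_patch_len - 1
        kept_end = min(emitted_end, horizon)
        schedule.append((start, kept_end, emitted_end - kept_end))
        start = emitted_end + 1
    return schedule


print(256 // 32)                                          # 8 input patches
print(decoding_schedule(168))                             # [(1, 128, 0), (129, 168, 88)]
print(len(decoding_schedule(168, output_patch_len=32)))   # 6
```

The second step emits 128 hours but only 40 are kept, so 88 predicted hours are thrown away; that waste is the price of covering the week in two model calls.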

The remarkable part: those 168 predicted hours come from a model that learned “what electricity-demand-shaped daily/weekly seasonality looks like” purely from other time series during pretraining — and it lands close to a PatchTST that was trained directly on this grid.

Strengths

True zero-shot forecasting. No per-dataset training; point at a new series and go. This is the whole foundation-model payoff — and out of the box it comes close to or matches supervised models trained on each dataset.
One model, many domains. Pretrained once on ~100B timepoints across finance, web, traffic, weather, energy and synthetic data; generalizes to held-out benchmarks (Monash, Darts, ETT) it never saw.
Efficient long-horizon decoding. Long output patches (128) mean few autoregressive steps, so inference is fast and accumulated error is limited.
Handles variable context lengths and granularities. The random-length masking strategy trains it to forecast from short or long histories; granularity-aware training context lengths and optional date features cover hourly→monthly data.
Patching is efficient. One token per 32 points keeps attention cheap and gives each token local shape.
Strong, broad benchmarks. Top-3 on Monash (>10% better than the llmtime GPT-3 baseline), within statistical significance of seasonal ARIMA on Darts, and best on the ETT long-horizon tasks (with PatchTST close behind).

Limitations

Univariate only. It forecasts one channel at a time; it does not natively use cross-series / covariate information (e.g. temperature helping demand) the way multivariate models can.
Point forecasts, not full distributions. The headline model emits a single predicted value per step (MSE/regression), not calibrated uncertainty/quantiles out of the box.
Autoregressive error accumulation. Although long output patches reduce the number of steps, very long horizons still roll the model forward and can drift.
Huge pretraining cost and data. The zero-shot magic depends on a ~200M-parameter model trained on ~100B timepoints — out of reach to reproduce casually; you rely on the released weights.
Fixed patch granularity. Patch lengths (32 in, 128 out) and per-granularity max context lengths are design choices baked into pretraining, not adapted per dataset.
Not always the winner. Zero-shot is competitive but a model trained on the target dataset can still edge it out on some tasks; this is “close to supervised,” not “beats everything.”

Glossary

Foundation model — a large model pretrained once on a broad corpus, then reused for many downstream tasks without retraining (GPT, BERT in NLP; TimesFM here).
Zero-shot — applying a pretrained model to a brand-new dataset with no further training on it.
Decoder-only / autoregressive — a left-to-right transformer with causal attention that generates output one chunk at a time, feeding its own predictions back in (GPT-style).
Causal (masked) attention — attention where each position can only attend to earlier positions, never future ones.
Patch — a contiguous chunk of the series (32 timepoints in, 128 out) treated as one unit/token.
Patching — splitting a series into patches before tokenizing; reduces token count and captures local shape (borrowed in spirit from PatchTST).
Residual MLP (Residual Block) — a small MLP with a skip connection; used to turn a patch into a token (input) and a token into predicted values (output).
Input patch length (32) — number of timepoints per input token.
Output patch length (128) — number of future timepoints emitted per autoregressive step; deliberately longer than the input patch for efficiency.
Context length / lookback — how many past timepoints are fed in.
Horizon — how many future timepoints to predict.
Tokenization — turning raw input into the units a transformer consumes (patches here).
Univariate — modeling one signal at a time, with no cross-channel dependencies.
MSE (mean squared error) — the regression loss minimized during pretraining for point forecasts.
Random-length masking — training trick (mask the first r points, r random in 0…patch_len−1) that exposes the model to all possible context lengths.
Granularity / frequency — the temporal resolution of a series (hourly, daily, weekly, monthly).
Positional encoding — added vectors that tell the transformer the order of tokens.