TimesFM

A Decoder-Only Foundation Model for Time-Series Forecasting

Das, Kong, Sen & Zhou — ICML 2024 (arXiv:2310.10688)

TL;DR (one paragraph)

TimesFM is GPT, but for time series. Instead of training a fresh model for each forecasting problem (the way Informer, Autoformer and PatchTST do), the authors at Google pretrain a single ~200-million-parameter decoder-only transformer on a giant pile of time series — about 100 billion timepoints of Google Trends, Wikipedia pageviews, traffic, weather, plus synthetic data. After that one expensive pretraining run, the model can be handed a series it has never seen before and produce a forecast immediately — zero-shot, with no per-dataset training. The clever engineering bits: the input series is chopped into patches (chunks of 32 timepoints) that each become one transformer “token,” and the model emits forecasts in long output patches (128 timepoints) so it can cover a long horizon in just a handful of autoregressive steps. The headline result: out of the box, zero-shot, TimesFM gets close to or matches the accuracy of supervised models that were specifically trained on each test dataset.


The Problem

Classic deep forecasters have a workflow that looks like this: pick a dataset (say, a city’s electricity demand), train a model on that dataset’s history, then forecast that dataset’s future. PatchTST, Informer, Autoformer, DeepAR — all of them are per-dataset models. If tomorrow you get a new dataset (a different city, a retail SKU, server traffic), you start over: collect history, train, tune, validate.

That is slow, expensive, and brittle. It also wastes an obvious opportunity. In NLP, we stopped training a new model per task years ago. We pretrain one foundation model (a large model trained once on a broad corpus, then reused for many downstream tasks) like GPT, and then it answers questions, summarizes, translates — zero-shot, meaning with no task-specific training, just a prompt.

The question TimesFM asks: can we build the GPT of time series? One model, pretrained once, that forecasts any new series out of the box without retraining. The challenges that make this non-obvious:


Background a newcomer needs

A few terms, defined once:

The GPT analogy, and where it breaks. TimesFM is to time series what GPT is to text: pretrain once on a huge corpus, then predict the “next chunk” zero-shot. But the analogy is not perfect:

So: same skeleton (causal decoder, next-chunk prediction, pretrain-then-zero-shot), different flesh (continuous regression instead of discrete classification).


Key Idea / Innovation

Three ideas, stacked:

  1. Treat forecasting as a foundation-model problem. Pretrain one decoder-only transformer on a massive, diverse corpus so it learns the general “grammar” of time series — trends, seasonality, level shifts, noise — and can then forecast anything zero-shot.

  2. Patch the input; use a residual MLP as the tokenizer. Break the context into non-overlapping patches of input_patch_len = 32 timepoints; push each through a residual MLP to get one transformer token. Patching keeps the sequence short (1 token per 32 points instead of 1 per point) and gives each token useful local context.

  3. Make output patches longer than input patches. The output residual MLP emits an entire patch of output_patch_len = 128 predicted timepoints at once — four times longer than the input patch. This is the efficiency trick. Because each autoregressive step produces 128 future points (not 32, and certainly not 1), a long horizon is covered in very few roll-forward steps. Forecasting 256 steps takes 2 autoregressive steps with 128-long output patches, versus 8 if output patches matched the 32-long input. Fewer steps means faster inference and less compounding of autoregressive error.

Why does the asymmetry help and not hurt? During pretraining the model is forced to predict the next 128 points from every prefix of whole input patches: the first 32 points forecast points 33 to 160, the first 64 forecast points 65 to 192, and so on. It therefore learns to make confident long jumps rather than timid one-patch nudges.
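The arithmetic behind ideas 2 and 3 can be sketched in a few lines. This is only an illustration that assumes the patch lengths quoted above (32 in, 128 out); the function names are made up for this note and are not part of any TimesFM API.

```python
import math

INPUT_PATCH_LEN = 32    # timepoints per input patch (from the text above)
OUTPUT_PATCH_LEN = 128  # timepoints per output patch (from the text above)


def autoregressive_steps(horizon: int, output_patch_len: int = OUTPUT_PATCH_LEN) -> int:
    """Number of decode steps needed to cover `horizon` future timepoints."""
    return math.ceil(horizon / output_patch_len)


def training_targets(context_len: int):
    """List (prefix_len, first_target, last_target) for each prefix of whole patches.

    Positions are 1-indexed timepoints, matching the prose above.
    """
    targets = []
    for prefix_len in range(INPUT_PATCH_LEN, context_len + 1, INPUT_PATCH_LEN):
        targets.append((prefix_len, prefix_len + 1, prefix_len + OUTPUT_PATCH_LEN))
    return targets


print(autoregressive_steps(256))      # 2
print(autoregressive_steps(256, 32))  # 8
for prefix_len, start, end in training_targets(96):
    print(f"first {prefix_len} points -> predict points {start}..{end}")
```

The loop prints the ranges 33..160, 65..192 and 97..224, which is the "long jump from a short prefix" pattern the model practices during pretraining.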


How It Works, Step by Step (the execution flow)

The flow from a raw context series to a forecast:

raw context series  (e.g. 256 past hourly values)
        │
        ▼
[1] (optional) normalize the context  →  zero-mean / unit-scale numbers
        │
        ▼
[2] PATCH:  split into chunks of 32        [p1][p2][p3] ... [p8]
        │
        ▼
[3] TOKENIZE: each patch → Residual MLP → token vector (dim 1280)
        │        + add positional encoding (PE_j)
        │        + optional date-derived features (day-of-week, month, ...)
        ▼
   [t1][t2][t3] ... [t8]
        │
        ▼
[4] STACKED CAUSAL DECODER  (20 transformer layers, 16 heads, causal mask)
        │   each token attends only to earlier tokens
        ▼
   [h1][h2][h3] ... [h8]   (one hidden state per input position)
        │
        ▼
[5] OUTPUT Residual MLP on the LAST hidden state
        │   emits a whole OUTPUT PATCH of 128 future values at once
        ▼
   ŷ[1..128]
        │
        ▼
[6] AUTOREGRESSIVE ROLL-FORWARD: append ŷ to context, re-patch, repeat
        │   until horizon is covered
        ▼
[7] DE-NORMALIZE  →  forecast back in original units (MW, clicks, ...)

Numbered:

  1. (Optional) Normalize. The context is rescaled to a stable numeric range (so a series in megawatts and one in web-clicks look comparable to the model); the transformation is reversed on the output so predictions come back in real units.

  2. Patch. The context is cut into non-overlapping patches of length 32. A random-length masking strategy is what lets the model handle any context length: during training the first r (a random number, 0…31) timepoints of the first patch are masked out, so across examples the model practices forecasting from effective context lengths of 1, 2, 3, … up to the maximum. A binary padding mask, passed to the input MLP alongside the values, marks which timepoints are masked.

  3. Tokenize. Each patch is fed through an input residual MLP that maps the 32 raw numbers to a 1280-dim token vector. A standard transformer positional encoding is added so the model knows token order; optional date features can be mixed in.

  4. Stacked causal transformer. The tokens pass through 20 decoder layers with causal (masked) self-attention — every token sees only the past. This is the same machinery as a GPT layer; here it learns temporal dependencies across patches.

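  5. Output patch. The output residual MLP reads the hidden state at the last position and emits an entire 128-long output patch, the model’s point forecast for the next 128 timepoints, in one shot. (During training it is applied at every position, so each prefix gets its own target patch.)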

  6. Autoregressive roll-forward. To go beyond 128 steps, append the just-predicted patch to the context, re-patch, and run the model again. Repeat until the horizon is filled. Because each step yields 128 points, few steps are needed.

  7. De-normalize. Undo step 1’s scaling so the forecast is in the original physical units.
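The control flow of steps 1, 2, 6 and 7 can be sketched with NumPy. This is a rough illustration under loud assumptions: `stub_model` is a hypothetical seasonal-naive stand-in for steps 3 to 5 (it is not the TimesFM network), normalization uses the mean and standard deviation of the whole context (one plausible choice, not confirmed by these notes), and the context length is assumed to be a multiple of 32 so that the masking logic can be left out.

```python
import numpy as np

INPUT_PATCH_LEN = 32
OUTPUT_PATCH_LEN = 128


def patchify(series: np.ndarray) -> np.ndarray:
    """Step 2: split a 1-D series into non-overlapping patches of 32 points.

    Assumes len(series) is a multiple of 32 (padding/masking omitted).
    """
    return series.reshape(-1, INPUT_PATCH_LEN)


def stub_model(patches: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for steps 3-5; NOT the TimesFM network.

    Repeats the last 24 context values (a seasonal-naive guess for hourly
    data) until one 128-long output patch is filled.
    """
    last_day = patches.reshape(-1)[-24:]
    repeats = -(-OUTPUT_PATCH_LEN // 24)  # ceiling division
    return np.tile(last_day, repeats)[:OUTPUT_PATCH_LEN]


def forecast(context: np.ndarray, horizon: int, model=stub_model) -> np.ndarray:
    # Step 1: normalize (assumed: statistics of the whole context).
    mean, std = context.mean(), context.std()
    std = std if std > 0 else 1.0
    running = (context - mean) / std

    predicted = []
    n_predicted = 0
    while n_predicted < horizon:
        out_patch = model(patchify(running))             # steps 2-5
        predicted.append(out_patch)
        n_predicted += len(out_patch)
        running = np.concatenate([running, out_patch])   # step 6: roll forward

    trimmed = np.concatenate(predicted)[:horizon]        # drop the surplus
    return trimmed * std + mean                          # step 7: de-normalize


hours = np.arange(256)
context = 100.0 + 10.0 * np.sin(2 * np.pi * hours / 24)  # synthetic daily cycle
y_hat = forecast(context, horizon=168)
print(y_hat.shape)  # (168,)
```

Swapping `stub_model` for a real pretrained network would leave the surrounding loop unchanged, which is the point of the sketch: the roll-forward logic is independent of what produces each 128-long patch.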

Training objective: minimize the mean squared error (MSE) between predicted and true future patches, averaged over all output patches in the batch. This is a pure point-forecast regression loss, and the zero-shot results involve no per-dataset fine-tuning afterwards.
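As a small illustration of that objective, the loss can be written as a per-patch MSE followed by a mean over patches. The array layout (one row per output patch) and the function name are assumptions made for this sketch, not details taken from the paper’s code.

```python
import numpy as np


def patch_mse_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean over output patches of the per-patch mean squared error.

    pred, target: arrays of shape (num_output_patches, output_patch_len).
    """
    per_patch = np.mean((pred - target) ** 2, axis=-1)
    return float(per_patch.mean())


pred = np.zeros((2, 128))
target = np.stack([np.ones(128), 2.0 * np.ones(128)])
print(patch_mse_loss(pred, target))  # 2.5  (per-patch errors are 1.0 and 4.0)
```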


A Worked Example

Scenario (shared with the sibling notes): predict the next 7 days = 168 hours of a city’s electricity demand from a recent lookback window.

The PatchTST way (for contrast): you would collect that grid’s historical demand, train a PatchTST model on it (split into train/val/test, tune, etc.), and only then forecast. The model only knows this one grid.

The TimesFM way — zero-shot: TimesFM was pretrained months ago on Google Trends, Wikipedia, traffic, weather and synthetic data. It has never seen this electricity grid. You simply hand it the recent demand history and ask for 168 hours. No training, no tuning.

Concretely, suppose you feed a context of 256 past hourly values:

context = 256 hours  →  256 / 32 = 8 input patches
                        each patch → token → 8 tokens into the decoder

Now decoding to a horizon of 168 hours, with output_patch_len = 128:

step 1:  model emits hours   1 .. 128   (one 128-long output patch)
         append those 128 predictions to the running context
step 2:  re-patch, model emits hours 129 .. 256
         we only needed up to 168, so keep 129 .. 168, discard the rest

So 2 autoregressive steps cover the full 168-hour week. Had the output patch matched the 32-long input patch, you would need ⌈168/32⌉ = 6 steps, with six chances for error to compound. The long-output-patch design is what keeps autoregression cheap here.
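The bookkeeping in this example can be reproduced with a short helper. It is only a sketch of the counting, with a made-up function name; no model is involved.

```python
def rollout_plan(context_len, horizon, input_patch_len=32, output_patch_len=128):
    """Describe each autoregressive step: tokens fed in, hours emitted, hours kept."""
    plan = []
    produced = 0  # future hours generated so far
    step = 0
    while produced < horizon:
        step += 1
        first = produced + 1
        last = produced + output_patch_len
        plan.append({
            "step": step,
            # context plus earlier predictions, re-patched into 32-long tokens
            "input_tokens": (context_len + produced) // input_patch_len,
            "emitted": (first, last),
            "kept": (first, min(last, horizon)),
        })
        produced = last
    return plan


for row in rollout_plan(context_len=256, horizon=168):
    print(row)
# {'step': 1, 'input_tokens': 8, 'emitted': (1, 128), 'kept': (1, 128)}
# {'step': 2, 'input_tokens': 12, 'emitted': (129, 256), 'kept': (129, 168)}
```

Calling `rollout_plan(256, 168, output_patch_len=32)` returns six rows, matching the ⌈168/32⌉ = 6 steps mentioned above.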

The remarkable part: those 168 predicted hours come from a model that learned “what electricity-demand-shaped daily/weekly seasonality looks like” purely from other time series during pretraining — and it lands close to a PatchTST that was trained directly on this grid.


Strengths

Limitations


Glossary