Time Series Analysis & Forecasting: signals through time.

Time series are the data type behind every forecasting problem in finance, retail, energy, climate, healthcare, and operations. They look simple — a list of numbers indexed by time — and turn out to be one of the most subtle data shapes in machine learning. This chapter covers the classical statistical foundations, the deep-learning era that began with WaveNet, and the foundation-model wave that has reshaped forecasting since 2023.

Prerequisites & orientation

This chapter assumes basic familiarity with linear algebra (Part I Ch 01), probability and statistics (Part I Ch 04–05), and the deep-learning fundamentals of Part V — particularly CNNs (Part V Ch 04) for the temporal-CNN sections, attention (Part V Ch 06) for the transformer sections, and pretraining paradigms (Part VI Ch 05) for the foundation-model section. Readers from a classical statistics background will find the deep-learning sections the most novel; readers from an ML background will find the stationarity, decomposition, and Box-Jenkins sections the most novel.

Two threads run through the chapter. The first is the long-running tension between classical methods (ARIMA, exponential smoothing, state-space models) and deep-learning methods (TCNs, transformers, foundation models). On the standard benchmarks of 2017–2022, classical methods often won; on the standard benchmarks of 2024–2026, foundation models have started to dominate. The second thread is the difference between point forecasting (predicting one number for the future) and probabilistic forecasting (predicting a distribution). Most production decisions need the distribution; most popular methods produce only the point. This chapter argues for taking probability seriously throughout.

01

Why Time Series Are Different

Most of machine learning assumes that the data points are independent and identically distributed. Time series violate both halves of that assumption: the points are temporally ordered, each one depends on the ones before it, and the underlying distribution often shifts over time. Treating a time series as ordinary tabular data is the most common mistake practitioners make, and it is responsible for most overstated accuracy claims in the field.

Temporal dependence

The defining property of a time series is that consecutive observations are correlated. Today's stock price depends on yesterday's; this hour's electricity demand depends on the last several hours; this month's retail sales depend on the same month a year ago. The classical statistics terminology calls this temporal dependence, and it is the structure that every method in this chapter is trying to model. Methods that ignore it — random splits for cross-validation, shuffling the rows before training — produce optimistic accuracy estimates that collapse when the model is deployed on truly future data.

Two consequences follow immediately. First, the standard ML toolkit needs adaptation: train/test splits must be chronological, cross-validation must be a rolling-window or expanding-window scheme rather than k-fold, and metrics must be computed on truly held-out future data. Second, the order of operations inside the model matters: a recurrent network reads the series left to right; a causal convolution must mask the future; an attention head used for forecasting cannot peek beyond the current time step. Many time-series failures trace back to a single accidentally-leaky operation that gives the model access to information it would not have at inference time.
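The split logic is worth spelling out in code, since this is where most leaks happen. A minimal sketch using scikit-learn's TimeSeriesSplit, which implements the expanding-window scheme; the series here is a stand-in for real data:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

y = np.arange(100)  # stand-in for a real series, ordered by time
# Expanding-window evaluation: each fold trains on an initial segment and
# tests on the block that immediately follows it, so the test data is
# always strictly in the future relative to the training data.
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(y):
    assert train_idx.max() < test_idx.min()  # chronology preserved, no leakage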

Stationarity, briefly

Classical time-series statistics is organised around the concept of stationarity: a series is stationary if its statistical properties (mean, variance, autocovariance) are constant over time. Stationary series are mathematically tractable; non-stationary series — those with a trend, with changing variance, or with structural breaks — require either transformation (differencing, deflation, log scaling) into stationary form or methods designed to handle non-stationarity directly. Section 03 develops this in detail; for now, the point is that stationarity is the central modelling decision in classical methods, and even modern deep-learning forecasters benefit from explicit handling of trends and seasonality.

Forecasting vs. modelling

The two main goals of time-series work are forecasting and modelling, and they call for different techniques. Forecasting aims to predict future values of a series given the past — what will the demand be next quarter, what will the temperature be tomorrow, what will the sensor read in five minutes? Modelling aims to understand the underlying structure of a series — what is the seasonal cycle, what is the trend, what was the effect of a known intervention? Many methods do both; some are specialised. Most of this chapter is about forecasting because it is the dominant practical application, but the modelling sections (decomposition, state-space methods, Bayesian approaches) are essential for problems where understanding matters more than prediction.

Univariate, multivariate, panel

One more axis. Univariate time series have a single value at each time step (the daily closing price of one stock, the hourly temperature at one station). Multivariate time series have multiple values at each time step that may interact (a vector of related stock prices, or temperature plus humidity plus pressure at one station). Panel data is the matrix-like setting where many distinct series are observed in parallel — every stock in an index, every store in a chain, every household in a panel survey — and where the question is often about cross-series patterns and shared dynamics. Modern deep-learning methods (and the foundation models of Section 09) excel especially on the panel case, where they can pool information across series; classical methods tend to fit one model per series and fare worse when individual series are short.

The "It Looked Great in Backtest" Failure

Almost every time-series practitioner has the same career-shaping moment: a model that scored brilliantly in backtest collapses in production. The cause is almost always one of three things — a leak from the future into the training set, a stationarity assumption that broke, or an evaluation metric that was insensitive to the failure mode that mattered. Section 03 covers the stationarity issue, the cross-validation discussion in this section addresses the leakage issue, and Section 10 will return to the metrics problem. Reading the rest of this chapter with that triple in mind will save real money.

02

Decomposition: Trend, Seasonality, Residuals

A useful first move with any time series is to decompose it into structural components: a slow-moving trend, periodic seasonal patterns, and the residual variation that those two together do not explain. The decomposition is rarely the model itself; it is the lens through which everything else gets clearer.

Additive and multiplicative decomposition

The classical decomposition writes a series y_t as a sum or product of three components:

Time-series decomposition
additive:         y_t = T_t + S_t + R_t
multiplicative:   y_t = T_t · S_t · R_t

T_t is the trend (slow-moving level), S_t is the seasonal pattern (a periodic cycle), and R_t is the residual (everything left over). Additive decomposition is appropriate when the seasonal swings have constant amplitude regardless of the trend level; multiplicative is appropriate when the swings scale with the level (typical for retail sales, web traffic, anything growing exponentially). Taking logs converts a multiplicative series into an additive one, which is why log-transforms are so common in time-series practice.

STL and modern decomposition

The classical method estimates the trend by a centred moving average, the seasonal component by averaging within each seasonal position, and the residual by subtraction. STL (Seasonal-Trend decomposition using Loess, Cleveland et al. 1990) is the modern default: it estimates trend and seasonal components by repeated local-regression smoothing, handles missing values robustly, and lets the seasonal pattern itself evolve slowly over time. STL is available in nearly every modern statistical package and has largely replaced the classical approach.
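A minimal STL call via statsmodels, run here on a synthetic monthly series (the series and the parameter choices are illustrative):

import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import STL

# Synthetic monthly series: linear trend + annual cycle + noise.
idx = pd.date_range("2015-01-01", periods=120, freq="MS")
t = np.arange(120)
y = pd.Series(0.5 * t + 10 * np.sin(2 * np.pi * t / 12)
              + np.random.normal(0, 1, 120), index=idx)

res = STL(y, period=12, robust=True).fit()  # robust=True downweights outliers
trend, seasonal, resid = res.trend, res.seasonal, res.resid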

The decomposition serves several uses. It produces seasonally adjusted series for downstream analysis. It surfaces structural breaks as discontinuities in the trend that need to be modelled or accounted for. It separates the interesting question (what is the underlying trend doing?) from the predictable nuisance (the seasonal pattern). And it provides residuals that can be examined for autocorrelation — if the residuals are not white noise, the decomposition has missed something, and the something is the signal worth modelling next.

[Figure: time-series decomposition. The observed series (top) is split into a slow-moving trend, a periodic seasonal pattern, and the residual variation that those two do not explain. STL produces this decomposition automatically and is the modern default. The components can be modelled and forecast separately, then recombined.]
03

Stationarity and Differencing

Stationarity is the central assumption of classical time-series analysis. A series is strictly stationary if its joint distribution at any set of times is invariant to shifting; weak stationarity (the practical definition) requires only that the mean, variance, and autocovariances be time-invariant. Most real-world series are not stationary, and getting them into stationary form via differencing or transformation is the precondition for the methods of the next two sections.

Why stationarity matters

The probabilistic theory underlying ARIMA, exponential smoothing, and most other classical time-series methods is built on the assumption that the series being modelled is stationary. If you fit an AR(1) model to a series with a strong upward trend, the fit will produce parameter estimates that try to capture both the trend and the autocorrelation in a single autoregressive coefficient — and the resulting forecast will systematically underpredict the trend's continuation. The fix is to remove the trend first (by differencing or detrending) and fit the AR model to the stationary residual.

Differencing

The simplest way to make a series stationary is differencing: replace each value with the difference between it and the previous value, Δy_t = y_t − y_{t−1}. A series whose first difference is stationary is called integrated of order 1, written I(1); a series that requires two differencings is I(2); and so on. Most economic and financial series are I(1) — they have a stochastic trend (a unit root) that one differencing removes.

Seasonal differencing handles seasonal patterns: subtract the value from one full season ago instead of from one step ago, Δ_s y_t = y_t − y_{t−s}, where s is the seasonal period (12 for monthly data with annual seasonality, 7 for daily data with weekly seasonality, etc.). Seasonal and ordinary differencing can be combined, and the resulting differencing pattern is what the "I" in ARIMA stands for.

Tests for stationarity

Two standard tests appear in nearly every time-series toolkit. The Augmented Dickey-Fuller (ADF) test has a null hypothesis that the series is non-stationary (has a unit root); rejecting the null gives evidence of stationarity. The KPSS test has the opposite null — that the series is stationary — and is a useful complement when the ADF test is borderline. Modern automated forecasting libraries run both tests, count how many ordinary and seasonal differences are needed to make the series stationary, and use that as the integration order in subsequent ARIMA fitting.
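In Python both tests live in statsmodels, and pmdarima wraps the difference-counting logic; a sketch, assuming y is a one-dimensional series:

from statsmodels.tsa.stattools import adfuller, kpss
from pmdarima.arima import ndiffs

adf_p = adfuller(y)[1]               # H0: unit root (non-stationary)
kpss_p = kpss(y, regression="c")[1]  # H0: stationary around a constant
# Small adf_p together with large kpss_p supports stationarity; the reverse
# supports differencing; disagreement between the tests means look closer.
d = ndiffs(y, test="adf")            # differences needed per the ADF test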

Beyond differencing

Differencing handles series whose mean changes over time. Series whose variance changes over time need a different treatment — typically a Box-Cox transformation, of which the log transform is a special case. Series with structural breaks (a change in regime) need either piecewise modelling or methods explicitly designed to detect and adapt to the breaks. The honest practice is that real series often need several stabilising operations stacked together: log-transform to stabilise variance, seasonally difference to remove the seasonal pattern, ordinary-difference to remove the trend, and only then fit a stationary model to what remains.
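With pandas the stack is three one-liners; a sketch assuming y is a positive-valued monthly pandas Series:

import numpy as np

z = np.log(y)             # stabilise variance (Box-Cox with lambda = 0)
z = z.diff(12)            # seasonal difference: remove the annual pattern
z = z.diff(1).dropna()    # ordinary difference: remove the remaining trend
# Fit a stationary model to z; undo the transforms in reverse order
# (cumulative sums, then exp) to map forecasts back to the original scale.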

04

ARIMA and the Box-Jenkins Method

ARIMA is the classical workhorse of time-series forecasting. It combines three building blocks — an autoregressive component, a moving-average component, and integration via differencing — into a single tractable family of models. For decades, ARIMA was the default forecasting method, and on many small-to-medium-data forecasting problems it remains a strong baseline that more sophisticated methods often fail to beat.

The three building blocks

An autoregressive (AR) model of order p writes the current value as a linear combination of the previous p values plus noise: y_t = c + φ_1 y_{t−1} + φ_2 y_{t−2} + … + φ_p y_{t−p} + ε_t. A moving-average (MA) model of order q writes the current value as a linear combination of the previous q noise terms plus a new noise term: y_t = μ + ε_t + θ_1 ε_{t−1} + … + θ_q ε_{t−q}. Integration means working with a differenced series rather than the raw one. Combining all three gives the ARIMA(p, d, q) family: difference the series d times to make it stationary, then fit AR(p) plus MA(q) to the result.

Identification: ACF and PACF

The classical Box-Jenkins methodology selects orders p, d, q by inspecting two diagnostic plots. The autocorrelation function (ACF) plots the correlation between yt and yt−k for each lag k; the partial autocorrelation function (PACF) plots the same correlation after controlling for the intermediate lags. The shape of these plots reveals the model structure: a pure AR(p) process has a PACF that cuts off after lag p, an MA(q) process has an ACF that cuts off after lag q, and a mixed ARMA process has both decaying gradually. The first step of the Box-Jenkins workflow is plotting both, identifying the cut-off pattern, and reading off the orders.
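statsmodels provides both diagnostic plots directly; a sketch, assuming y is the (already differenced) series under study:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
plot_acf(y, lags=40, ax=ax1)    # MA(q) signature: ACF cuts off after lag q
plot_pacf(y, lags=40, ax=ax2)   # AR(p) signature: PACF cuts off after lag p
plt.show()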

In practice, automated order selection (the auto.arima function in the R forecast package, or pmdarima in Python) has largely replaced manual inspection. The automated methods test all combinations of orders within a reasonable range, score each by AIC or BIC, and return the best fit. They are not always right, but they are good enough that hand-tuning ARIMA is now a niche activity.
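A typical pmdarima call; the seasonal period m=12 assumes monthly data:

import pmdarima as pm

model = pm.auto_arima(y, seasonal=True, m=12,
                      information_criterion="aic",
                      stepwise=True, suppress_warnings=True)
print(model.order, model.seasonal_order)   # the chosen (p, d, q)(P, D, Q, m)
fc, ci = model.predict(n_periods=12, return_conf_int=True)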

SARIMA and exogenous regressors

Seasonal ARIMA (SARIMA) extends the framework with seasonal AR, seasonal MA, and seasonal differencing terms, written ARIMA(p, d, q)(P, D, Q)s. SARIMAX adds exogenous regressors: external variables that may explain part of the series's variation (the day-of-week indicator, the price of a competitor's product, a marketing-campaign indicator). Both extensions are fully automated in modern packages and are the form of ARIMA that most production deployments actually use.
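A SARIMAX sketch with statsmodels; X and X_future are hypothetical exogenous-regressor matrices aligned with the history and the forecast horizon respectively:

from statsmodels.tsa.statespace.sarimax import SARIMAX

# X: exogenous regressors over the history (e.g. promo flags, day-of-week);
# X_future: the same regressors, known or planned, over the horizon.
fit = SARIMAX(y, exog=X, order=(1, 1, 1),
              seasonal_order=(1, 1, 1, 12)).fit(disp=False)
pred = fit.get_forecast(steps=12, exog=X_future)
mean, interval = pred.predicted_mean, pred.conf_int(alpha=0.10)  # 90% band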

What ARIMA does well, and where it stops

ARIMA's strengths are interpretability (the parameters have direct meaning), uncertainty quantification (the model produces calibrated prediction intervals out of the box), and data efficiency (it works on series with a few hundred observations, where deep learning would just memorise). Its weaknesses are also direct: it is linear, it assumes the series is locally stationary after differencing, and it has no good way to share information across many parallel series. For any single short series, ARIMA is hard to beat; for a panel of thousands of series, the panel-pooling methods of Sections 08–09 routinely win.

05

Exponential Smoothing and ETS

Exponential smoothing methods are an alternative classical lineage: instead of modelling autocorrelations explicitly, they update an evolving estimate of the series's level (and optionally trend and seasonality) by exponentially-decaying-weighted averaging. They are simple, fast, and on many practical forecasting problems competitive with ARIMA — sometimes better.

Simple, double, and triple exponential smoothing

Simple exponential smoothing models a series with no trend or seasonality. The forecast is a weighted average of all past observations with weights that decay exponentially into the past, controlled by a smoothing parameter α ∈ [0, 1]. The recursive form is intuitive: the level at time t is α times the new observation plus (1 − α) times the previous level. Larger α puts more weight on recent observations and produces a more responsive (but noisier) forecast.

Double exponential smoothing (Holt's method) adds a trend component, with its own smoothing parameter β. The level update remains as before, plus a trend update: the new trend is β times the change in level plus (1 − β) times the previous trend. Forecasts add the trend extrapolated forward. Triple exponential smoothing (Holt-Winters) adds a seasonal component on top, with smoothing parameter γ. The three components — level, trend, seasonality — each have their own smoothing rate, and the forecast composes them.

Holt-Winters update equations (additive form)
level:    ℓ_t = α(y_t − s_{t−m}) + (1 − α)(ℓ_{t−1} + b_{t−1})
trend:    b_t = β(ℓ_t − ℓ_{t−1}) + (1 − β)b_{t−1}
season:   s_t = γ(y_t − ℓ_{t−1} − b_{t−1}) + (1 − γ)s_{t−m}

h-step forecast:   ŷ_{t+h} = ℓ_t + h·b_t + s_{t+h−m⌈h/m⌉}

m is the seasonal period, h is the forecast horizon. Each component is a weighted average of the new observation's contribution and the previous estimate, with weights α, β, γ controlling responsiveness. The forecast extrapolates the level and trend linearly and re-uses the most recent seasonal estimates.
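The recursions translate directly into code. A deliberately minimal numpy implementation of the additive form above; real libraries estimate α, β, γ and the initial states by maximum likelihood rather than taking them as given, and initialise better than the first-season averages used here:

import numpy as np

def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    y = np.asarray(y, dtype=float)
    # Crude initial states: level from the first season, trend from the
    # average season-to-season change, seasonals as deviations from level.
    level = y[:m].mean()
    trend = (y[m:2 * m].mean() - y[:m].mean()) / m
    season = list(y[:m] - level)
    for t in range(m, len(y)):
        prev_level, prev_trend = level, trend
        level = alpha * (y[t] - season[t - m]) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season.append(gamma * (y[t] - prev_level - prev_trend)
                      + (1 - gamma) * season[t - m])
    # Forecast: extrapolate level and trend, reuse the latest seasonals.
    n = len(y)
    return np.array([level + h * trend
                     + season[n - 1 + h - m * int(np.ceil(h / m))]
                     for h in range(1, horizon + 1)])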

The ETS framework

The exponential smoothing methods can be unified under the ETS framework (Error, Trend, Seasonality), which writes them as state-space models with three modelling choices: error type (additive vs. multiplicative), trend type (none, additive, additive damped, multiplicative, multiplicative damped), and seasonality type (none, additive, multiplicative). The ETS taxonomy enumerates 30 specific models; an automated selector picks among them by maximum likelihood and information criteria (AIC, BIC), much like auto.arima picks among ARIMA orders. The R forecast package's ets() function is the reference implementation.
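In Python, statsmodels exposes the Holt-Winters family directly; a sketch assuming a monthly series with additive trend and seasonality (statsmodels also ships an ETSModel class with the full likelihood-based ETS machinery):

from statsmodels.tsa.holtwinters import ExponentialSmoothing

fit = ExponentialSmoothing(y, trend="add", seasonal="add",
                           seasonal_periods=12).fit()  # MLE for alpha, beta, gamma
forecast = fit.forecast(12)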

The M-competition lessons

The M-competitions (Makridakis competitions, running since 1982 with successive editions of increasing scope) are the long-running empirical benchmark of forecasting methods on heterogeneous business time series. The M3 (2000) and M4 (2018) competitions established a humbling result: simple methods like exponential smoothing and ARIMA, and even simpler ones like the seasonal naive forecast, were extremely hard to beat. Sophisticated machine-learning methods often underperformed. The M5 (2020) competition, on retail data, finally saw machine-learning methods (LightGBM-based ensembles) win — but the margin was narrow and the simple methods remained competitive baselines. The M6 (2022) competition on financial data was more equivocal still.

The lesson is consequential: classical methods are not obsolete. On modest datasets, on heterogeneous business series, on problems where interpretability matters, ETS and ARIMA remain the right starting point. The deep-learning methods of the next sections begin to dominate when the dataset is large, the series share structure across instances, and the scale of the problem rewards a single shared model over thousands of per-series fits.

06

Probabilistic and Bayesian Time Series

A point forecast is rarely what a decision-maker actually needs. The decision is usually "given my uncertainty about the future, what is the best action?" — and answering it requires a forecast distribution, not a single number. Probabilistic and Bayesian time-series methods produce that distribution natively.

State-space models and the Kalman filter

The unifying mathematical framework for probabilistic time-series modelling is the linear Gaussian state-space model, which writes the observed series as a noisy observation of a latent state that evolves according to its own dynamics. The state evolution is a linear transformation plus Gaussian noise; the observation is a linear function of the state plus Gaussian noise. Almost every classical time-series model — ARIMA, ETS, Bayesian structural time series — can be expressed in state-space form, and once expressed there, the optimal forecast and its uncertainty are computed by the Kalman filter (and its smoothing companion, the Kalman smoother). Chapter 01 of Part XII covered the Kalman filter for robotics; the same machinery underlies most probabilistic time-series practice.
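In equations, with x_t the latent state and y_t the observation, the model reads:

Linear Gaussian state-space model
state evolution:   x_t = A x_{t−1} + w_t,   w_t ~ N(0, Q)
observation:       y_t = C x_t + v_t,       v_t ~ N(0, R)

The Kalman filter recursively computes p(x_t | y_1, …, y_t); the forecast distribution for y_{t+h} follows by propagating that state estimate forward through the dynamics.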

Bayesian Structural Time Series

Bayesian Structural Time Series (BSTS, Scott & Varian 2014) is a particular formulation that explicitly decomposes a series into trend, seasonality, regression-on-covariates, and unobserved components, with priors on each. The full posterior is sampled via MCMC or variational methods, and the resulting forecasts come with calibrated uncertainty intervals. Google's CausalImpact library uses BSTS as its underlying engine to estimate causal effects of interventions — a direct application of the modelling-vs-forecasting distinction from Section 01.

Prophet and the engineer-friendly Bayesian approach

Facebook (now Meta) released Prophet in 2017 — a decomposable forecasting model designed to be robust to missing data, outliers, and changes in trend, with a Stan-based fit underneath (MAP estimation by default, full MCMC optionally). Prophet writes the series as a sum of a piecewise-linear trend, a Fourier-series seasonal component, and a holiday/event component, with sensible default priors. The result is a forecasting tool that produces reasonable results out of the box on most business time series, with calibrated uncertainty intervals, and that is interpretable enough that analysts trust it. Prophet became enormously popular because it was the first probabilistic time-series tool that did not require deep statistical expertise to use.
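The API matches that ambition; a minimal sketch of the Python interface, where the input dataframe df must have a datestamp column ds and a value column y:

from prophet import Prophet

m = Prophet()                 # sensible defaults for trend, seasonality, priors
m.fit(df)                     # df: columns "ds" (datestamp) and "y" (value)
future = m.make_future_dataframe(periods=365)
forecast = m.predict(future)  # yhat plus yhat_lower / yhat_upper intervals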

Why probabilistic forecasting matters

Most forecasting consumers — inventory planners, energy traders, capacity engineers, financial risk managers — are making decisions where the cost of being wrong in one direction is different from the cost of being wrong in the other. Inventory shortfalls cost more than excess; under-provisioning cloud capacity is more painful than over-provisioning. The right decision depends not on the point forecast but on a quantile of the forecast distribution. Methods that produce only a point forecast force the consumer to estimate uncertainty themselves, usually badly. The shift toward probabilistic forecasting in the foundation-model era (Section 09) is in part a recognition that consumers want the distribution, and that producing it natively is more reliable than constructing it post-hoc.

Calibration as a Forecasting Goal

A well-calibrated forecast distribution is one where, in the long run, the actual outcomes fall in the predicted 50% interval 50% of the time, in the predicted 90% interval 90% of the time, and so on. Calibration is checked separately from accuracy and is the quality that determines whether a probabilistic forecast is actually usable. Badly-calibrated forecasts that produce 90% intervals containing the truth only 60% of the time look superficially impressive on point-forecast metrics and are dangerous in operations. Modern probabilistic forecasting methods report calibration explicitly; the older methods often did not, and this is one of the underappreciated differences between the eras.
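Checking calibration is mechanically simple; a sketch of the empirical-coverage computation for one nominal level:

import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of realised outcomes inside the predicted interval. For a
    calibrated 90% interval this should be close to 0.90 when averaged
    over a long held-out window."""
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    return float(np.mean((y_true >= lower) & (y_true <= upper)))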

07

Temporal CNNs and TCN Architectures

Before transformers ate everything, the dominant deep-learning architecture for sequence modelling was the recurrent neural network — and before transformers ate sequence modelling, a quieter revolution had already replaced RNNs with convolutions. The Temporal Convolutional Network is the canonical instance: a stack of dilated causal convolutions that achieves long-range temporal modelling without the gradient pathologies of recurrence.

Causal and dilated convolutions

A standard 1D convolution processes a sequence by sliding a kernel across it; the output at time t depends on values in a window centred at t. For forecasting, this is wrong — it leaks future information. A causal convolution shifts the kernel so that the output at time t depends only on values at time t and earlier. Stacking causal convolutions gives a model whose output at any time is a function of only the past.

The remaining problem is the receptive field. A standard kernel of size k sees only the last k time steps; stacking L layers extends the receptive field linearly to L(k − 1) + 1. For forecasting problems where the relevant history might be hundreds of steps long, this requires impractically deep stacks. Dilated convolutions solve this elegantly: instead of looking at the last k consecutive time steps, the kernel looks at k time steps spaced d apart (the dilation factor). Stacking layers with exponentially-increasing dilation produces a receptive field that grows exponentially with depth.

[Figure: dilated causal convolutions. Each layer applies a kernel of size 2 with exponentially-increasing dilation (1, 2, 4, ...). The receptive field of the output grows exponentially with depth, covering many time steps with few layers. The convolutions are causal — the output depends only on past inputs, never on future ones — which makes the architecture appropriate for forecasting.]
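A minimal PyTorch sketch of the pattern; layer sizes and depth are illustrative, and the residual connections and normalisation of a full TCN are omitted:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """Conv1d that pads (k - 1) * d zeros on the left only, so the output at
    time t depends on inputs at times <= t, never on the future."""
    def __init__(self, c_in, c_out, kernel_size, dilation):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(c_in, c_out, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.left_pad, 0)))

# Six layers of kernel size 2 with dilations 1, 2, 4, ..., 32: a receptive
# field of 1 + (1 + 2 + 4 + 8 + 16 + 32) = 64 steps from only six layers.
tcn = nn.Sequential(*[CausalConv1d(1 if i == 0 else 16, 16, 2, dilation=2 ** i)
                      for i in range(6)])
out = tcn(torch.randn(8, 1, 200))                # time length preserved: (8, 16, 200)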

WaveNet and TCN

Two papers established the dilated-causal-convolution architecture for sequence modelling. WaveNet (van den Oord et al. 2016, DeepMind) was the original — a generative model for raw audio waveforms that used dilated causal convolutions to produce 16 kHz speech samples. The architecture was almost immediately recognised as more general than its audio application, and a series of papers showed it could compete with or exceed RNNs on language modelling, machine translation, and time-series forecasting. TCN (Bai, Kolter & Koltun 2018) was the explicit "this is a general sequence-modelling architecture" formulation, with residual connections, weight normalisation, and dropout as standard ingredients. The TCN paper showed that across a wide range of sequence-modelling tasks, dilated causal convolutions matched or exceeded RNNs while being far easier to parallelise during training.

Why TCN was a step forward over RNNs

RNNs have to be unrolled sequentially during training, which limits parallelism and produces vanishing-gradient problems on long sequences. TCNs are convolutions — fully parallelisable across the time dimension, with stable gradients regardless of sequence length, and with explicit and tunable receptive-field sizes via the dilation pattern. For most forecasting tasks at the time, TCN was a strict upgrade over the LSTM and GRU architectures that had dominated through the mid-2010s. The transformers of the next section eventually replaced TCN at the cutting edge, but TCN-style architectures remain common in production today, especially where inference speed matters.

08

Transformers for Time Series

Transformers came late to time-series forecasting. Despite their dominance in language and vision, the first attempts to apply them to time series were inconsistent — sometimes great, often beaten by simpler baselines. The 2022–2024 papers solved most of the issues with careful tokenisation and the right inductive biases, and by 2026 transformers are the dominant architecture for serious time-series modelling.

The early attempts and their problems

The first wave of time-series transformers — Informer (2021), Autoformer (2021), FEDformer (2022), Pyraformer (2022) — were direct adaptations of the language transformer, with various tricks to handle the quadratic attention cost on long sequences (sparse attention, frequency-domain attention, hierarchical attention). Each paper reported strong results on standard benchmarks. Then in 2022 the paper "Are Transformers Effective for Time Series Forecasting?" (Zeng et al.) showed that a simple linear baseline (DLinear) consistently matched or beat all of these transformers on the same benchmarks. The result was deflating but instructive: the transformers had been over-fitting to the benchmarks, and their architectural sophistication was solving problems the benchmarks did not actually contain.

PatchTST and the rehabilitation

PatchTST (Nie et al., ICLR 2023) is the paper that rehabilitated the transformer for time-series forecasting. The key innovation was simple: instead of attending over individual time steps, divide the time series into patches of consecutive values (analogous to image patches in vision transformers) and attend over those patches. The patching dramatically reduces sequence length, makes attention effectively local-then-global, and provides a strong inductive bias toward the structure of time series. PatchTST set a new state of the art across the standard benchmarks and — crucially — also beat the simple linear baselines that had embarrassed the earlier transformers.

The lessons from PatchTST shaped the 2024–2026 literature. The architectural pattern that has stabilised is: tokenise the series into patches, run a transformer encoder over the patches with channel-independent processing for multivariate series (each channel processed separately, attention only across time within a channel), and apply a small per-patch linear head to produce forecasts. This is the architectural backbone of most of the foundation models in Section 09.
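A minimal sketch of that backbone in PyTorch, for a single channel; the hyperparameters are illustrative and this is not the PatchTST authors' implementation:

import torch
import torch.nn as nn

class PatchForecaster(nn.Module):
    """Patch-tokenise, encode, project: split the input window into
    non-overlapping patches, embed each patch, run a transformer encoder
    over the patch sequence, and project to the forecast horizon."""
    def __init__(self, context_len=512, patch_len=16, d_model=128, horizon=96):
        super().__init__()
        assert context_len % patch_len == 0
        self.patch_len = patch_len
        n_patches = context_len // patch_len
        self.embed = nn.Linear(patch_len, d_model)
        self.pos = nn.Parameter(torch.zeros(1, n_patches, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=3)
        self.head = nn.Linear(n_patches * d_model, horizon)

    def forward(self, x):                                  # x: (batch, context_len)
        p = x.unfold(1, self.patch_len, self.patch_len)    # (batch, n_patches, patch_len)
        z = self.encoder(self.embed(p) + self.pos)         # (batch, n_patches, d_model)
        return self.head(z.flatten(1))                     # (batch, horizon)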

The long-context question

One of the supposed advantages of transformers is their ability to handle long contexts. For time series, this turns out to be a more nuanced question than for language. Many time-series benchmarks have only a few hundred time steps in the input window; on those, a transformer's long-context capability is wasted. Other benchmarks (high-frequency financial data, dense IoT sensor streams) have millions of time steps where long-context handling genuinely matters. The PatchTST-style patching helps in both regimes — it handles long contexts efficiently and does not penalise short ones. The transformers that have shipped in 2025–2026 production deployments are almost universally patched, with patch lengths chosen to match the natural frequency of the data.

09

Foundation Models for Time Series

The foundation-model wave that reshaped language, vision, and robotics arrived in time series during 2023–2024. The first wave of models — TimeGPT, Chronos, Lag-Llama, Moirai — demonstrated that a single model trained on huge collections of heterogeneous time series could produce zero-shot forecasts competitive with per-series fitted models. The implications for forecasting practice are still working themselves out.

The recipe

A time-series foundation model is conceptually similar to a language foundation model. Take a transformer architecture with patch-based tokenisation. Train it on a very large, very heterogeneous corpus of time series — millions of series spanning finance, retail, energy, climate, IoT, healthcare, and synthetic generators. Use a forecasting objective: given a context window, predict the next h values, with a probabilistic loss that captures uncertainty. The result is a single model that, given a new series at inference time, can produce forecasts without any series-specific fine-tuning. This is zero-shot forecasting: the kind of cross-task generalisation that made LLMs significant, applied to time series.

The major models

TimeGPT (Nixtla, 2023) was the first commercial foundation model marketed for time-series forecasting. The architecture is undisclosed but described as transformer-based with patch tokenisation; the training corpus is a large mixture of public and proprietary series. TimeGPT is API-accessed rather than open-weight, and its empirical performance on the M-competition splits is competitive with the best per-series-fitted methods.

Chronos (Amazon, 2024) was the first open-weight foundation forecaster. It tokenises time-series values by quantising them into a fixed-size vocabulary and then trains a vanilla T5 language-model architecture on the resulting tokens. The approach is unusually clean: no time-series-specific architectural changes, just standard language-model training applied to quantised time-series tokens. Chronos achieves strong zero-shot performance and is widely used as a baseline for academic foundation-model research.
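Zero-shot inference with Chronos is a few lines; a sketch against the published chronos-forecasting package's interface at the time of writing (model size and quantile levels are illustrative):

import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")
context = torch.tensor(y, dtype=torch.float32)      # y: 1-D history array
samples = pipeline.predict(context, prediction_length=24)  # (1, num_samples, 24)
low, median, high = torch.quantile(
    samples[0].float(), torch.tensor([0.1, 0.5, 0.9]), dim=0)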

Lag-Llama (Rasul et al., 2024) is another open-weight model, focused on probabilistic forecasting with explicit handling of lagged covariates. Moirai (Salesforce, 2024) emphasises multivariate handling and pre-trains on the LOTSA corpus (a 27-billion-observation collection assembled specifically for time-series foundation training). TimesFM (Google, 2024) uses a decoder-only transformer with patches and trains on a 100-billion-time-point corpus that includes Google Trends and Wikipedia pageview data. Each model has its own architectural and training-data choices; the empirical leaderboard between them moves often.

Performance and the new normal

The headline empirical result from the 2024–2025 foundation-model wave is that these models, used zero-shot, are competitive with or better than per-series-fitted classical methods (ARIMA, ETS) and per-series-fitted deep-learning methods (PatchTST, N-BEATS) on the standard forecasting benchmarks. They are not yet better than the best fitted-and-ensembled custom models on every benchmark, but the gap is small and shrinking — and the practical advantage of zero-shot is enormous: no per-series training, no hyperparameter tuning, no model-selection effort. For a forecasting team responsible for tens of thousands of series, this is a step change in operational simplicity.

The forecasting workflow shift

The implications for forecasting practice are still settling, but the direction is clear. The 2018-era workflow was: per-series exploratory analysis, per-series model selection, per-series hyperparameter tuning, per-series fitting, ensembling. The 2026-era workflow is: pick a foundation model, optionally fine-tune on your data if you have a lot of it, deploy. The classical methods are not obsolete — they remain strong baselines, especially on small data — but they are no longer the default starting point in serious teams. The foundation models are the default; everything else is justified by reference to them.

10

Frontier: Probabilistic, Multivariate, and Causal Forecasting

The classical-vs-deep-vs-foundation arc covers most of this chapter. Three further directions are active in 2026 and worth flagging because they will likely define forecasting practice over the next several years: probabilistic forecasting at scale, multivariate generalisation, and the slow integration of causal reasoning.

Probabilistic forecasting at scale

The shift from point forecasts to distribution forecasts — discussed in Section 06 in the context of Bayesian methods — is now spreading through the deep-learning lineage. Modern forecasting models increasingly emit either a parametric distribution (Gaussian, mixture of Gaussians, Student-t) or a quantile prediction (the 10th, 25th, 50th, 75th, 90th percentiles of the future). The foundation models are largely built around probabilistic objectives from the start. The remaining engineering question is calibration: are the predicted distributions actually calibrated against the realised outcomes? Tools for measuring and recalibrating forecast distributions (conformal prediction for time series, isotonic recalibration) are becoming standard parts of the deployment pipeline.
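The workhorse objective for quantile forecasts is the pinball (quantile) loss; a sketch:

import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Pinball loss at quantile level q: under-prediction costs q per unit,
    over-prediction costs (1 - q); the expected loss is minimised by the
    q-th quantile of the outcome distribution."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))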

Multivariate and panel forecasting

A second frontier is the move from univariate to genuinely multivariate forecasting. Most of the methods in this chapter were designed for univariate series; multivariate handling has been a side feature at best. The 2024–2026 foundation models have started to take multivariate seriously — Moirai in particular is built around the multivariate case — but the question of how to model cross-series dependencies efficiently is still open. The related problem is panel forecasting: handling thousands or millions of related but distinct series (every product in a catalogue, every house in a real-estate market, every patient in a clinical trial), with appropriate sharing of statistical strength across them. Classical hierarchical and grouped time-series methods have been around for decades; the foundation-model lineage is starting to absorb them.

Causal forecasting

Most forecasting methods produce conditional forecasts: given the past, what will the future look like under business as usual? But the questions decision-makers actually want answered are counterfactual: what will the future look like if we run a particular promotion, change a price, deploy a new product? Causal forecasting — which connects directly to Chapter 03 on causal inference later in this Part — is the discipline of producing forecasts that are valid under interventions, not just under continuation of historical patterns. The intersection of causal inference and forecasting is one of the most consequential research areas in 2026, and the answers will reshape how decisions get made on top of forecasts. The chapters on causal inference and causal ML in this Part develop the theoretical framework; this chapter has set up the forecasting infrastructure on which causal forecasting builds.

What this chapter does not cover

Anomaly detection on time-series data is the subject of Chapter 02 (Anomaly Detection). The full causal framework — potential outcomes, DAGs, do-calculus, and the methods for estimating causal effects from observational data — is the subject of Chapter 03 (Causal Inference) and Chapter 04 (Causal Machine Learning). Survival analysis, which models time-to-event data with censoring (when did this customer churn? when will this machine fail?), is the subject of Chapter 06 (Survival Analysis & Event Modeling). And the broader question of how foundation models for time series will integrate with the foundation-model trends in language and vision — multimodal time series, language-conditioned forecasting, agents that reason over time-series streams — is the most active research frontier in the field as of 2026.

Time series is the data type behind every forecasting decision in modern operations. The classical methods of this chapter remain the foundation; the deep-learning era refined the architectures; the foundation-model era has fundamentally changed the workflow. A practitioner who understands all three layers — and who knows when to reach for which — is the one who can build forecasting systems that decision-makers actually trust.

Further Reading