Time Series Analysis & Forecasting: signals through time.
Time series are the data type behind every forecasting problem in finance, retail, energy, climate, healthcare, and operations. They look simple — a list of numbers indexed by time — and turn out to be one of the most subtle data shapes in machine learning. This chapter covers the classical statistical foundations, the deep-learning era that began with WaveNet, and the foundation-model wave that has reshaped forecasting since 2023.
Prerequisites & orientation
This chapter assumes basic familiarity with linear algebra (Part I Ch 01), probability and statistics (Part I Ch 04–05), and the deep-learning fundamentals of Part V — particularly CNNs (Part V Ch 04) for the temporal-CNN sections, attention (Part V Ch 06) for the transformer sections, and pretraining paradigms (Part VI Ch 05) for the foundation-model section. Readers from a classical statistics background will find the deep-learning sections the most novel; readers from an ML background will find the stationarity, decomposition, and Box-Jenkins sections the most novel.
Two threads run through the chapter. The first is the long-running tension between classical methods (ARIMA, exponential smoothing, state-space models) and deep-learning methods (TCNs, transformers, foundation models). On the standard benchmarks of 2017–2022, classical methods often won; on the standard benchmarks of 2024–2026, foundation models have started to dominate. The second thread is the difference between point forecasting (predicting one number for the future) and probabilistic forecasting (predicting a distribution). Most production decisions need the distribution; most popular methods produce only the point. This chapter argues for taking probability seriously throughout.
Why Time Series Are Different
Most of machine learning assumes that the data points are independent and identically distributed. Time series violate both halves of that assumption: the points are temporally ordered, each one depends on the ones before it, and the underlying distribution often shifts over time. Treating a time series as ordinary tabular data is the most common mistake practitioners make, and it is responsible for most overstated accuracy claims in the field.
Temporal dependence
The defining property of a time series is that consecutive observations are correlated. Today's stock price depends on yesterday's; this hour's electricity demand depends on the last several hours; this month's retail sales depend on the same month a year ago. The classical statistics terminology calls this temporal dependence, and it is the structure that every method in this chapter is trying to model. Methods that ignore it — random splits for cross-validation, shuffling the rows before training — produce optimistic accuracy estimates that collapse when the model is deployed on truly future data.
Two consequences follow immediately. First, the standard ML toolkit needs adaptation: train/test splits must be chronological, cross-validation must be a rolling-window or expanding-window scheme rather than k-fold, and metrics must be computed on truly held-out future data. Second, the order of operations inside the model matters: a recurrent network reads the series left to right; a causal convolution must mask the future; an attention head used for forecasting cannot peek beyond the current time step. Many time-series failures trace back to a single accidentally-leaky operation that gives the model access to information it would not have at inference time.
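The chronological evaluation described above can be sketched as an expanding-window backtest. This is a minimal illustration, not a library API; the naive last-value forecaster is a placeholder for whatever model is being evaluated, and all function names are invented for this example.

```python
# Expanding-window (rolling-origin) evaluation: train only on the past,
# score only on the future. Names here are illustrative.

def expanding_window_splits(n, initial, horizon, step=1):
    """Yield (train_end, test_indices) pairs with strictly chronological splits."""
    t = initial
    while t + horizon <= n:
        yield t, list(range(t, t + horizon))
        t += step

def naive_forecast(history, horizon):
    """Placeholder forecaster: repeat the last observed value."""
    return [history[-1]] * horizon

def backtest(series, initial, horizon):
    """Mean absolute error computed only on truly held-out future points."""
    errors = []
    for train_end, test_idx in expanding_window_splits(len(series), initial, horizon):
        preds = naive_forecast(series[:train_end], horizon)
        errors += [abs(series[i] - p) for i, p in zip(test_idx, preds)]
    return sum(errors) / len(errors)
```

Replacing `expanding_window_splits` with a random k-fold split is exactly the leakage mistake the text warns against: the model would see "future" rows during training.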
Stationarity, briefly
Classical time-series statistics is organised around the concept of stationarity: a series is stationary if its statistical properties (mean, variance, autocovariance) are constant over time. Stationary series are mathematically tractable; non-stationary series — those with a trend, with changing variance, or with structural breaks — require either transformation (differencing, deflation, log scaling) into stationary form or methods designed to handle non-stationarity directly. Section 03 develops this in detail; for now, the point is that stationarity is the central modelling decision in classical methods, and even modern deep-learning forecasters benefit from explicit handling of trends and seasonality.
Forecasting vs. modelling
The two main goals of time-series work are forecasting and modelling, and they call for different techniques. Forecasting aims to predict future values of a series given the past — what will the demand be next quarter, what will the temperature be tomorrow, what will the sensor read in five minutes? Modelling aims to understand the underlying structure of a series — what is the seasonal cycle, what is the trend, what was the effect of a known intervention? Many methods do both; some are specialised. Most of this chapter is about forecasting because it is the dominant practical application, but the modelling sections (decomposition, state-space methods, Bayesian approaches) are essential for problems where understanding matters more than prediction.
Univariate, multivariate, panel
One more axis. Univariate time series have a single value at each time step (the daily closing price of one stock, the hourly temperature at one station). Multivariate time series have multiple values at each time step that may interact (a vector of related stock prices, or temperature plus humidity plus pressure at one station). Panel data is the matrix-like setting where many distinct series are observed in parallel — every stock in an index, every store in a chain, every household in a panel survey — and where the question is often about cross-series patterns and shared dynamics. Modern deep-learning methods (and the foundation models of Section 09) excel especially on the panel case, where they can pool information across series; classical methods tend to fit one model per series and fare worse when individual series are short.
Almost every time-series practitioner has the same career-shaping moment: a model that scored brilliantly in backtest collapses in production. The cause is almost always one of three things — a leak from the future into the training set, a stationarity assumption that broke, or an evaluation metric that was insensitive to the failure mode that mattered. Section 03 covers the stationarity issue, the cross-validation discussion in this section addresses the leakage issue, and Section 10 will return to the metrics problem. Reading the rest of this chapter with that triple in mind will save real money.
Decomposition: Trend, Seasonality, Residuals
A useful first move with any time series is to decompose it into structural components: a slow-moving trend, periodic seasonal patterns, and the residual variation that those two together do not explain. The decomposition is rarely the model itself; it is the lens through which everything else gets clearer.
Additive and multiplicative decomposition
The classical decomposition writes a series yt as a sum or product of three components:
additive: yt = Tt + St + Rt
multiplicative: yt = Tt · St · Rt
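The classical additive decomposition can be sketched in a few lines. This is a minimal illustration assuming an odd seasonal period m (even periods need a 2×m moving average, omitted here); the function name is invented for this example.

```python
def classical_additive_decompose(y, m):
    """Classical additive decomposition y = T + S + R for an odd period m."""
    n, h = len(y), m // 2
    # Trend: centred moving average (undefined at the series edges).
    trend = [None] * n
    for t in range(h, n - h):
        trend[t] = sum(y[t - h : t + h + 1]) / m
    # Seasonal: average the detrended values at each seasonal position.
    buckets = [[] for _ in range(m)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % m].append(y[t] - trend[t])
    raw = [sum(b) / len(b) for b in buckets]
    mean = sum(raw) / m
    seasonal = [s - mean for s in raw]        # centre so seasonal effects sum to ~0
    # Residual: whatever trend and seasonality do not explain.
    resid = [None if trend[t] is None else y[t] - trend[t] - seasonal[t % m]
             for t in range(n)]
    return trend, seasonal, resid
```

On a noiseless series with a linear trend and a fixed seasonal pattern, the recovered trend and seasonal components are exact and the residual is zero, which is a useful sanity check before applying the method to real data.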
STL and modern decomposition
The classical method estimates the trend by a centred moving average, the seasonal component by averaging within each seasonal position, and the residual by subtraction. STL (Seasonal-Trend decomposition using Loess, Cleveland et al. 1990) is the modern default: it estimates trend and seasonal components by repeated local-regression smoothing, handles missing values robustly, and lets the seasonal pattern itself evolve slowly over time. STL is available in nearly every modern statistical package and has largely replaced the classical approach.
The decomposition serves several uses. It produces seasonally adjusted series for downstream analysis. It surfaces structural breaks as discontinuities in the trend that need to be modelled or accounted for. It separates the interesting question (what is the underlying trend doing?) from the predictable nuisance (the seasonal pattern). And it provides residuals that can be examined for autocorrelation — if the residuals are not white noise, the decomposition has missed something, and the something is the signal worth modelling next.
Stationarity and Differencing
Stationarity is the central assumption of classical time-series analysis. A series is strictly stationary if its joint distribution at any set of times is invariant to shifting; weak stationarity (the practical definition) requires only that the mean, variance, and autocovariances be time-invariant. Most real-world series are not stationary, and getting them into stationary form via differencing or transformation is the precondition for the methods of the next two sections.
Why stationarity matters
The probabilistic theory underlying ARIMA, exponential smoothing, and most other classical time-series methods is built on the assumption that the series being modelled is stationary. If you fit an AR(1) model to a series with a strong upward trend, the fit will produce parameter estimates that try to capture both the trend and the autocorrelation in a single autoregressive coefficient — and the resulting forecast will systematically underpredict the trend's continuation. The fix is to remove the trend first (by differencing or detrending) and fit the AR model to the stationary residual.
Differencing
The simplest way to make a series stationary is differencing: replace each value with the difference between it and the previous value, Δyt = yt − yt−1. A series whose first difference is stationary is called integrated of order 1, written I(1); a series that requires two differencings is I(2); and so on. Most economic and financial series are I(1) — they have a stochastic trend (a unit root) that one differencing removes.
Seasonal differencing handles seasonal patterns: subtract the value from one full season ago instead of from one step ago. Δsyt = yt − yt−s, where s is the seasonal period (12 for monthly data with annual seasonality, 7 for daily data with weekly seasonality, etc.). Seasonal and ordinary differencing can be combined, and the resulting differencing pattern is what the "I" in ARIMA stands for.
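Both kinds of differencing are one-liners, and combining them behaves exactly as the text describes. A small sketch (the function name is illustrative):

```python
def difference(y, lag=1):
    """Delta_lag y: ordinary differencing for lag=1, seasonal for lag=s."""
    return [y[t] - y[t - lag] for t in range(lag, len(y))]
```

On a series with a linear trend plus a weekly pattern, one seasonal difference at lag 7 removes the weekly pattern and turns the trend into a constant; one further ordinary difference removes that constant, leaving zeros — the stationary target that classical fitting wants.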
Tests for stationarity
Two standard tests appear in nearly every time-series toolkit. The Augmented Dickey-Fuller (ADF) test has a null hypothesis that the series is non-stationary (has a unit root); rejecting the null gives evidence of stationarity. The KPSS test has the opposite null — that the series is stationary — and is a useful complement when the ADF test is borderline. Modern automated forecasting libraries run both tests, count how many ordinary and seasonal differences are needed to make the series stationary, and use that as the integration order in subsequent ARIMA fitting.
Beyond differencing
Differencing handles series whose mean changes over time. Series whose variance changes over time need a different treatment — typically a Box-Cox transformation, of which the log transform is a special case. Series with structural breaks (a change in regime) need either piecewise modelling or methods explicitly designed to detect and adapt to the breaks. The honest practice is that real series often need several stabilising operations stacked together: log-transform to stabilise variance, seasonally difference to remove the seasonal pattern, ordinary-difference to remove the trend, and only then fit a stationary model to what remains.
ARIMA and the Box-Jenkins Method
ARIMA is the classical workhorse of time-series forecasting. It combines three building blocks — an autoregressive component, a moving-average component, and integration via differencing — into a single tractable family of models. For decades, ARIMA was the default forecasting method, and on many small-to-medium-data forecasting problems it remains a strong baseline that more sophisticated methods often fail to beat.
The three building blocks
An autoregressive (AR) model of order p writes the current value as a linear combination of the previous p values plus noise: yt = c + φ1yt−1 + φ2yt−2 + … + φpyt−p + εt. A moving-average (MA) model of order q writes the current value as a linear combination of the previous q noise terms plus a new noise term: yt = μ + εt + θ1εt−1 + … + θqεt−q. Integration means working with a differenced series rather than the raw one. Combining all three gives the ARIMA(p, d, q) family: difference the series d times to make it stationary, then fit AR(p) plus MA(q) to the result.
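The AR recursion is concrete enough to simulate directly. A minimal sketch, assuming a stationary coefficient vector (sum of the φi strictly less than 1); function names are invented for this example.

```python
import random

def simulate_ar(phi, c, sigma, n, seed=0):
    """Simulate y_t = c + phi_1*y_{t-1} + ... + phi_p*y_{t-p} + eps_t."""
    rng = random.Random(seed)
    p = len(phi)
    y = [c / (1 - sum(phi))] * p              # start at the stationary mean
    for _ in range(n):
        mean = c + sum(f * y[-i - 1] for i, f in enumerate(phi))
        y.append(mean + rng.gauss(0, sigma))  # eps ~ N(0, sigma^2)
    return y[p:]

def ar_one_step(phi, c, history):
    """One-step-ahead conditional-mean forecast for a fitted AR(p)."""
    return c + sum(f * history[-i - 1] for i, f in enumerate(phi))
```

With σ = 0 the simulated series sits exactly at the stationary mean c / (1 − Σφi), which makes the mean formula easy to verify by hand.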
Identification: ACF and PACF
The classical Box-Jenkins methodology selects orders p, d, q by inspecting two diagnostic plots. The autocorrelation function (ACF) plots the correlation between yt and yt−k for each lag k; the partial autocorrelation function (PACF) plots the same correlation after controlling for the intermediate lags. The shape of these plots reveals the model structure: a pure AR(p) process has a PACF that cuts off after lag p, an MA(q) process has an ACF that cuts off after lag q, and a mixed ARMA process has both decaying gradually. The first step of the Box-Jenkins workflow is plotting both, identifying the cut-off pattern, and reading off the orders.
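The sample ACF that these plots display is a short computation. A minimal sketch (the function name is illustrative):

```python
def acf(y, max_lag):
    """Sample autocorrelation of y at lags 1..max_lag."""
    n = len(y)
    mu = sum(y) / n
    c0 = sum((v - mu) ** 2 for v in y) / n    # lag-0 autocovariance (variance)
    out = []
    for k in range(1, max_lag + 1):
        ck = sum((y[t] - mu) * (y[t - k] - mu) for t in range(k, n)) / n
        out.append(ck / c0)
    return out
```

A strictly alternating series gives autocorrelation near −1 at lag 1 and near +1 at lag 2, a pattern that would immediately suggest autoregressive structure in a Box-Jenkins reading.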
In practice, automated order selection (the auto.arima function in the R forecast package, or pmdarima in Python) has largely replaced manual inspection. The automated methods search over candidate orders within a reasonable range (auto.arima uses a stepwise search by default rather than an exhaustive grid), score each candidate by AIC or BIC, and return the best fit. They are not always right, but they are good enough that hand-tuning ARIMA is now a niche activity.
SARIMA and exogenous regressors
Seasonal ARIMA (SARIMA) extends the framework with seasonal AR, seasonal MA, and seasonal differencing terms, written ARIMA(p, d, q)(P, D, Q)s. SARIMAX adds exogenous regressors: external variables that may explain part of the series's variation (the day-of-week indicator, the price of a competitor's product, a marketing-campaign indicator). Both extensions are fully automated in modern packages and are the form of ARIMA that most production deployments actually use.
What ARIMA does well, and where it stops
ARIMA's strengths are interpretability (the parameters have direct meaning), uncertainty quantification (the model produces calibrated prediction intervals out of the box), and data efficiency (it works on series with a few hundred observations, where deep learning would just memorise). Its weaknesses are also direct: it is linear, it assumes the series is locally stationary after differencing, and it has no good way to share information across many parallel series. For any single short series, ARIMA is hard to beat; for a panel of thousands of series, the panel-pooling methods of Sections 08–09 routinely win.
Exponential Smoothing and ETS
Exponential smoothing methods are an alternative classical lineage: instead of modelling autocorrelations explicitly, they update an evolving estimate of the series's level (and optionally trend and seasonality) by exponentially-decaying-weighted averaging. They are simple, fast, and on many practical forecasting problems competitive with ARIMA — sometimes better.
Simple, double, and triple exponential smoothing
Simple exponential smoothing models a series with no trend or seasonality. The forecast is a weighted average of all past observations with weights that decay exponentially into the past, controlled by a smoothing parameter α ∈ [0, 1]. The recursive form is intuitive: the level at time t is α times the new observation plus (1 − α) times the previous level. Larger α puts more weight on recent observations and produces a more responsive (but noisier) forecast.
Double exponential smoothing (Holt's method) adds a trend component, with its own smoothing parameter β. The level update remains as before, plus a trend update: the new trend is β times the change in level plus (1 − β) times the previous trend. Forecasts add the trend extrapolated forward. Triple exponential smoothing (Holt-Winters) adds a seasonal component on top, with smoothing parameter γ. The three components — level, trend, seasonality — each have their own smoothing rate, and the forecast composes them.
level: ℓt = α(yt − st−m) + (1 − α)(ℓt−1 + bt−1)
trend: bt = β(ℓt − ℓt−1) + (1 − β)bt−1
season: st = γ(yt − ℓt−1 − bt−1) + (1 − γ)st−m
h-step forecast: ŷt+h = ℓt + h·bt + st+h−m·⌈h/m⌉
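The Holt-Winters update equations can be run directly as written. This sketch assumes the caller supplies the initial level, trend, and seasonal terms — initialisation is the fiddly part in practice, and production implementations estimate it from the first seasons of data; the function name is illustrative.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, level0, trend0, season0):
    """Additive Holt-Winters; season0 holds the m initial seasonal terms.
    Returns an h-step forecast function built from the final state."""
    level, trend, season = level0, trend0, list(season0)
    for t, obs in enumerate(y):
        s_prev = season[t % m]                                   # s_{t-m}
        old_level, old_trend = level, trend
        level = alpha * (obs - s_prev) + (1 - alpha) * (old_level + old_trend)
        trend = beta * (level - old_level) + (1 - beta) * old_trend
        season[t % m] = gamma * (obs - old_level - old_trend) + (1 - gamma) * s_prev
    n = len(y)
    def forecast(h):
        # level + extrapolated trend + most recent matching seasonal term
        return level + h * trend + season[(n - 1 + h) % m]
    return forecast
```

On a noiseless series with a linear trend and a fixed seasonal pattern, exact initialisation makes the updates reproduce the true components and the forecasts are exact — a good test of any implementation.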
The ETS framework
The exponential smoothing methods can be unified under the ETS framework (Error, Trend, Seasonality), which writes them as state-space models with three modelling choices: error type (additive vs. multiplicative), trend type (none, additive, additive damped, multiplicative, multiplicative damped), and seasonality type (none, additive, multiplicative). The ETS taxonomy enumerates 30 specific models; an automated selector picks among them by maximum likelihood and information criteria (AIC, BIC), much like auto.arima picks among ARIMA orders. The R forecast package's ets() function is the reference implementation.
The M-competition lessons
The M-competitions (Makridakis competitions, running since 1982 with successive editions of increasing scope) are the long-running empirical benchmark of forecasting methods on heterogeneous business time series. The M3 (2000) and M4 (2018) competitions established a humbling result: simple methods like exponential smoothing and ARIMA, and even simpler ones like the seasonal naive forecast, were extremely hard to beat. Sophisticated machine-learning methods often underperformed. The M5 (2020) competition, on retail data, finally saw machine-learning methods (LightGBM-based ensembles) win — but the margin was narrow and the simple methods remained competitive baselines. The M6 (2022) competition on financial data was more equivocal still.
The lesson is consequential: classical methods are not obsolete. On modest datasets, on heterogeneous business series, on problems where interpretability matters, ETS and ARIMA remain the right starting point. The deep-learning methods of the next sections begin to dominate when the dataset is large, the series share structure across instances, and the scale of the problem rewards a single shared model over thousands of per-series fits.
Probabilistic and Bayesian Time Series
A point forecast is rarely what a decision-maker actually needs. The decision is usually "given my uncertainty about the future, what is the best action?" — and answering it requires a forecast distribution, not a single number. Probabilistic and Bayesian time-series methods produce that distribution natively.
State-space models and the Kalman filter
The unifying mathematical framework for probabilistic time-series modelling is the linear Gaussian state-space model, which writes the observed series as a noisy observation of a latent state that evolves according to its own dynamics. The state evolution is a linear transformation plus Gaussian noise; the observation is a linear function of the state plus Gaussian noise. Almost every classical time-series model — ARIMA, ETS, Bayesian structural time series — can be expressed in state-space form, and once expressed there, the optimal forecast and its uncertainty are computed by the Kalman filter (and its smoothing companion, the Kalman smoother). Chapter 01 of Part XII covered the Kalman filter for robotics; the same machinery underlies most probabilistic time-series practice.
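For the simplest interesting state-space model — the local-level model, where a latent level does a random walk and is observed with noise — the Kalman filter reduces to a few lines. A minimal sketch (function name and the diffuse-prior defaults are illustrative):

```python
def kalman_local_level(y, q, r, m0=0.0, p0=1e6):
    """Kalman filter for the local-level model:
       state: mu_t = mu_{t-1} + w_t,  w ~ N(0, q)
       obs:   y_t  = mu_t     + v_t,  v ~ N(0, r)
    Returns filtered state means and variances."""
    m, p = m0, p0
    means, variances = [], []
    for obs in y:
        p = p + q                  # predict: uncertainty grows by q
        k = p / (p + r)            # Kalman gain: trust in the new observation
        m = m + k * (obs - m)      # update: blend prediction and observation
        p = (1 - k) * p
        means.append(m)
        variances.append(p)
    return means, variances
```

The filtered variance settles to a steady state determined by the ratio q/r, which is exactly the calibrated forecast uncertainty the text describes as coming "for free" from state-space methods.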
Bayesian Structural Time Series
Bayesian Structural Time Series (BSTS, Scott & Varian 2014) is a particular formulation that explicitly decomposes a series into trend, seasonality, regression-on-covariates, and unobserved components, with priors on each. The full posterior is sampled via MCMC or variational methods, and the resulting forecasts come with calibrated uncertainty intervals. Google's CausalImpact library uses BSTS as its underlying engine to estimate causal effects of interventions — a direct application of the modelling-vs-forecasting distinction from Section 01.
Prophet and the engineer-friendly Bayesian approach
Facebook (now Meta) released Prophet in 2017 — a decomposable forecasting model designed to be robust to missing data, outliers, and changes in trend, with a Stan-based fit underneath (MAP optimisation by default, with full MCMC available as an option). Prophet writes the series as a sum of a piecewise-linear trend, a Fourier-series seasonal component, and a holiday/event component, with sensible default priors. The result is a forecasting tool that produces reasonable results out of the box on most business time series, with calibrated uncertainty intervals, and that is interpretable enough that analysts trust it. Prophet became enormously popular because it was the first probabilistic time-series tool that did not require deep statistical expertise to use.
Why probabilistic forecasting matters
Most forecasting consumers — inventory planners, energy traders, capacity engineers, financial risk managers — are making decisions where the cost of being wrong in one direction is different from the cost of being wrong in the other. Inventory shortfalls cost more than excess; under-provisioning cloud capacity is more painful than over-provisioning. The right decision depends not on the point forecast but on a quantile of the forecast distribution. Methods that produce only a point forecast force the consumer to estimate uncertainty themselves, usually badly. The shift toward probabilistic forecasting in the foundation-model era (Section 09) is in part a recognition that consumers want the distribution, and that producing it natively is more reliable than constructing it post-hoc.
A well-calibrated forecast distribution is one where, in the long run, the actual outcomes fall in the predicted 50% interval 50% of the time, in the predicted 90% interval 90% of the time, and so on. Calibration is checked separately from accuracy and is the quality that determines whether a probabilistic forecast is actually usable. Badly-calibrated forecasts that produce 90% intervals containing the truth only 60% of the time look superficially impressive on point-forecast metrics and are dangerous in operations. Modern probabilistic forecasting methods report calibration explicitly; the older methods often did not, and this is one of the underappreciated differences between the eras.
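Checking interval calibration is mechanically simple: count how often the realised outcome landed inside the predicted interval and compare against the nominal level. A minimal sketch (the function name is illustrative):

```python
def interval_coverage(y_true, lower, upper):
    """Empirical coverage: the fraction of outcomes inside [lower, upper].
    For a calibrated 90% interval this should be close to 0.90."""
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)
```

The comparison against the nominal level is the point: a "90% interval" whose empirical coverage is 0.60 is exactly the dangerous case the text describes.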
Temporal CNNs and TCN Architectures
Before transformers ate everything, the dominant deep-learning architecture for sequence modelling was the recurrent neural network — but even before the transformers arrived, a quieter revolution had already begun replacing RNNs with convolutions. The Temporal Convolutional Network is the canonical instance: a stack of dilated causal convolutions that achieves long-range temporal modelling without the gradient pathologies of recurrence.
Causal and dilated convolutions
A standard 1D convolution processes a sequence by sliding a kernel across it; the output at time t depends on values in a window centred at t. For forecasting, this is wrong — it leaks future information. A causal convolution shifts the kernel so that the output at time t depends only on values at time t and earlier. Stacking causal convolutions gives a model whose output at any time is a function of only the past.
The remaining problem is the receptive field. A standard kernel of size k sees only the last k time steps; stacking L layers extends the receptive field linearly to L(k − 1) + 1. For forecasting problems where the relevant history might be hundreds of steps long, this requires impractically deep stacks. Dilated convolutions solve this elegantly: instead of looking at the last k consecutive time steps, the kernel looks at k time steps spaced d apart (the dilation factor). Stacking layers with exponentially-increasing dilation produces a receptive field that grows exponentially with depth.
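Both ideas — causality and dilation — fit in a short sketch. Function names are illustrative; a real TCN layer would add residual connections and learned kernels.

```python
def causal_dilated_conv(x, kernel, dilation):
    """1D causal convolution: output[t] uses only x[t], x[t-d], x[t-2d], ...
    Positions before the series start are treated as zero (left padding)."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for i, w in enumerate(kernel):
            j = t - i * dilation           # only current and past indices
            if j >= 0:
                acc += w * x[j]
        out.append(acc)
    return out

def receptive_field(kernel_size, dilations):
    """Receptive field of a stack of dilated causal layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With kernel size 2 and dilations 1, 2, 4, 8 the receptive field is 16 from just four layers — the exponential growth the text describes — and changing a future input never changes an earlier output, which is the causality property that prevents leakage.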
WaveNet and TCN
Two papers established the dilated-causal-convolution architecture for sequence modelling. WaveNet (van den Oord et al. 2016, DeepMind) was the original — a generative model for raw audio waveforms that used dilated causal convolutions to produce 16 kHz speech samples. The architecture was almost immediately recognised as more general than its audio application, and a series of papers showed it could compete with or exceed RNNs on language modelling, machine translation, and time-series forecasting. TCN (Bai, Kolter & Koltun 2018) was the explicit "this is a general sequence-modelling architecture" formulation, with residual connections, weight normalisation, and dropout as standard ingredients. The TCN paper showed that across a wide range of sequence-modelling tasks, dilated causal convolutions matched or exceeded RNNs while being far easier to parallelise during training.
Why TCN was a step forward over RNNs
RNNs have to be unrolled sequentially during training, which limits parallelism and produces vanishing-gradient problems on long sequences. TCNs are convolutions — fully parallelisable across the time dimension, with stable gradients regardless of sequence length, and with explicit and tunable receptive-field sizes via the dilation pattern. For most forecasting tasks at the time, TCN was a strict upgrade over the LSTM and GRU architectures that had dominated through the mid-2010s. The transformers of the next section eventually replaced TCN at the cutting edge, but TCN-style architectures remain common in production today, especially where inference speed matters.
Transformers for Time Series
Transformers came late to time-series forecasting. Despite their dominance in language and vision, the first attempts to apply them to time series were inconsistent — sometimes great, often beaten by simpler baselines. The 2022–2024 papers solved most of the issues with careful tokenisation and the right inductive biases, and by 2026 transformers are the dominant architecture for serious time-series modelling.
The early attempts and their problems
The first wave of time-series transformers — Informer (2021), Autoformer (2021), FEDformer (2022), Pyraformer (2022) — were direct adaptations of the language transformer, with various tricks to handle the quadratic attention cost on long sequences (sparse attention, frequency-domain attention, hierarchical attention). Each paper reported strong results on standard benchmarks. Then in 2022 the paper "Are Transformers Effective for Time Series Forecasting?" (Zeng et al.) showed that a simple linear baseline (DLinear) consistently matched or beat all of these transformers on the same benchmarks. The result was deflating but instructive: the transformers had been over-fitting to the benchmarks, and their architectural sophistication was solving problems the benchmarks did not actually contain.
PatchTST and the rehabilitation
PatchTST (Nie et al., ICLR 2023) is the paper that rehabilitated the transformer for time-series forecasting. The key innovation was simple: instead of attending over individual time steps, divide the time series into patches of consecutive values (analogous to image patches in vision transformers) and attend over those patches. The patching dramatically reduces sequence length, makes attention effectively local-then-global, and provides a strong inductive bias toward the structure of time series. PatchTST set a new state of the art across the standard benchmarks and — crucially — also beat the simple linear baselines that had embarrassed the earlier transformers.
The lessons from PatchTST shaped the 2024–2026 literature. The architectural pattern that has stabilised is: tokenise the series into patches, run a transformer encoder over the patches with channel-independent processing for multivariate series (each channel processed separately, attention only across time within a channel), and apply a small per-patch linear head to produce forecasts. This is the architectural backbone of most of the foundation models in Section 09.
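The patching step itself is trivial; the architectural weight all sits in what happens to the patches afterwards. A minimal sketch of the tokenisation (the function name is illustrative; for multivariate series, channel-independent processing means calling this once per channel):

```python
def patchify(series, patch_len, stride):
    """Split a series into (possibly overlapping) fixed-length patches.
    Each patch becomes one token for the transformer encoder."""
    return [series[start : start + patch_len]
            for start in range(0, len(series) - patch_len + 1, stride)]
```

With non-overlapping patches (stride equal to patch length) the effective sequence length the attention layers see shrinks by a factor of patch_len, which is where the efficiency gain over per-time-step attention comes from.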
The long-context question
One of the supposed advantages of transformers is their ability to handle long contexts. For time series, this turns out to be a more nuanced question than for language. Many time-series benchmarks have only a few hundred time steps in the input window; on those, a transformer's long-context capability is wasted. Other benchmarks (high-frequency financial data, dense IoT sensor streams) have millions of time steps where long-context handling genuinely matters. The PatchTST-style patching helps in both regimes — it handles long contexts efficiently and does not penalise short ones. The transformers that have shipped in 2025–2026 production deployments are almost universally patched, with patch lengths chosen to match the natural frequency of the data.
Foundation Models for Time Series
The foundation-model wave that reshaped language, vision, and robotics arrived in time series during 2023–2024. The first wave of models — TimeGPT, Chronos, Lag-Llama, Moirai — demonstrated that a single model trained on huge collections of heterogeneous time series could produce zero-shot forecasts competitive with per-series fitted models. The implications for forecasting practice are still working themselves out.
The recipe
A time-series foundation model is conceptually similar to a language foundation model. Take a transformer architecture with patch-based tokenisation. Train it on a very large, very heterogeneous corpus of time series — millions of series spanning finance, retail, energy, climate, IoT, healthcare, and synthetic generators. Use a forecasting objective: given a context window, predict the next h values, with a probabilistic loss that captures uncertainty. The result is a single model that, given a new series at inference time, can produce forecasts without any series-specific fine-tuning. This is zero-shot forecasting: the kind of cross-task generalisation that made LLMs significant, applied to time series.
The major models
TimeGPT (Nixtla, 2023) was the first commercial foundation model marketed for time-series forecasting. The architecture is undisclosed but described as transformer-based with patch tokenisation; the training corpus is a large mixture of public and proprietary series. TimeGPT is API-accessed rather than open-weight, and its empirical performance on the M-competition splits is competitive with the best per-series-fitted methods.
Chronos (Amazon, 2024) was the first open-weight foundation forecaster. It tokenises time-series values by quantising them into a fixed-size vocabulary and then trains a vanilla T5 language-model architecture on the resulting tokens. The approach is unusually clean: no time-series-specific architectural changes, just standard language-model training applied to quantised time-series tokens. Chronos achieves strong zero-shot performance and is widely used as a baseline for academic foundation-model research.
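Chronos's quantisation idea can be sketched as follows. This is an illustration of the general scheme — scale the series, then bin scaled values into a fixed vocabulary — not the released model's exact scaling rule, bin range, or vocabulary size, all of which are assumptions here.

```python
def quantise(series, n_bins, low=-5.0, high=5.0):
    """Chronos-style tokenisation sketch: mean-absolute-scale the series,
    then map each scaled value to one of n_bins uniform bins ('tokens')."""
    scale = sum(abs(v) for v in series) / len(series) or 1.0
    width = (high - low) / n_bins
    tokens = []
    for v in series:
        b = int((v / scale - low) / width)
        tokens.append(max(0, min(n_bins - 1, b)))   # clamp out-of-range values
    return tokens, scale

def dequantise(tokens, scale, n_bins, low=-5.0, high=5.0):
    """Map tokens back to approximate values via bin centres."""
    width = (high - low) / n_bins
    return [(low + (t + 0.5) * width) * scale for t in tokens]
```

Once the series is a token sequence, an off-the-shelf language model can be trained on it unchanged — which is exactly the "unusually clean" property of the Chronos approach.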
Lag-Llama (Rasul et al., 2024) is another open-weight model, focused on probabilistic forecasting with explicit handling of lagged covariates. Moirai (Salesforce, 2024) emphasises multivariate handling and pre-trains on the LOTSA corpus (a 27-billion-observation collection assembled specifically for time-series foundation training). TimesFM (Google, 2024) uses a decoder-only transformer with patches and trains on a 100-billion-time-point corpus that includes Google Trends and Wikipedia pageview data. Each model has its own architectural and training-data choices; the empirical leaderboard between them moves often.
Performance and the new normal
The headline empirical result from the 2024–2025 foundation-model wave is that these models, used zero-shot, are competitive with or better than per-series-fitted classical methods (ARIMA, ETS) and per-series-fitted deep-learning methods (PatchTST, N-BEATS) on the standard forecasting benchmarks. They are not yet better than the best fitted-and-ensembled custom models on every benchmark, but the gap is small and shrinking — and the practical advantage of zero-shot is enormous: no per-series training, no hyperparameter tuning, no model-selection effort. For a forecasting team responsible for tens of thousands of series, this is a step change in operational simplicity.
The forecasting workflow shift
The implications for forecasting practice are still settling, but the direction is clear. The 2018-era workflow was: per-series exploratory analysis, per-series model selection, per-series hyperparameter tuning, per-series fitting, ensembling. The 2026-era workflow is: pick a foundation model, optionally fine-tune on your data if you have a lot of it, deploy. The classical methods are not obsolete — they remain strong baselines, especially on small data — but they are no longer the default starting point in serious teams. The foundation models are the default; everything else is justified by reference to them.
Frontier: Probabilistic, Multivariate, and Causal Forecasting
The classical-vs-deep-vs-foundation arc covers most of this chapter. Three further directions are active in 2026 and worth flagging because they will likely define forecasting practice over the next several years: probabilistic forecasting at scale, multivariate generalisation, and the slow integration of causal reasoning.
Probabilistic forecasting at scale
The shift from point forecasts to distribution forecasts — discussed in Section 06 in the context of Bayesian methods — is now spreading through the deep-learning lineage. Modern forecasting models increasingly emit either a parametric distribution (Gaussian, mixture of Gaussians, Student-t) or a quantile prediction (the 10th, 25th, 50th, 75th, 90th percentiles of the future). The foundation models are largely built around probabilistic objectives from the start. The remaining engineering question is calibration: are the predicted distributions actually calibrated against the realised outcomes? Tools for measuring and recalibrating forecast distributions (conformal prediction for time series, isotonic recalibration) are becoming standard parts of the deployment pipeline.
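The simplest of the recalibration tools mentioned above is split conformal prediction: hold out a calibration window of past residuals, take an empirical quantile of their absolute values, and widen every point forecast by that amount. The sketch below shows this minimal scheme; note that it assumes roughly exchangeable residuals, which real time series often violate, so production variants reweight recent residuals or adapt the quantile over time.

```python
# Minimal split-conformal intervals around point forecasts. Assumes
# calibration residuals are roughly exchangeable -- an assumption real
# series often violate; weighted and adaptive variants address that.
import math

def conformal_interval(residuals, point_forecasts, alpha=0.1):
    scores = sorted(abs(r) for r in residuals)
    # Finite-sample-corrected quantile rank for coverage >= 1 - alpha.
    k = math.ceil((len(scores) + 1) * (1 - alpha)) - 1
    q = scores[min(k, len(scores) - 1)]
    return [(f - q, f + q) for f in point_forecasts]

residuals = [0.4, -1.1, 0.7, -0.2, 0.9, -0.5, 1.3, 0.1, -0.8, 0.6]
intervals = conformal_interval(residuals, [10.0, 11.0], alpha=0.2)
```

The appeal is that this wraps around any forecaster, classical or foundation-model, without retraining: the model supplies the points, the calibration window supplies the width.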
Multivariate and panel forecasting
A second frontier is the move from univariate to genuinely multivariate forecasting. Most of the methods in this chapter were designed for univariate series; multivariate handling has been a side feature at best. The 2024–2026 foundation models have started to take the multivariate case seriously — Moirai in particular is built around it — but the question of how to model cross-series dependencies efficiently is still open. The related problem is panel forecasting: handling thousands or millions of related but distinct series (every product in a catalogue, every house in a real-estate market, every patient in a clinical trial), with appropriate sharing of statistical strength across them. Classical hierarchical and grouped time-series methods have been around for decades; the foundation-model lineage is starting to absorb them.
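The simplest of those classical hierarchical schemes is bottom-up reconciliation: forecast only the leaf series and derive every aggregate by summation, so the hierarchy adds up by construction. A minimal sketch, with illustrative series names:

```python
# Bottom-up reconciliation: forecast the leaves, sum to get aggregates,
# so forecasts are coherent with the hierarchy by construction.
# Series names and values are illustrative.

def bottom_up(leaf_forecasts, hierarchy):
    """hierarchy maps each aggregate name to the leaf names it sums over."""
    recon = dict(leaf_forecasts)
    for agg, leaves in hierarchy.items():
        recon[agg] = [sum(vals)
                      for vals in zip(*(leaf_forecasts[l] for l in leaves))]
    return recon

leaf = {"store_a": [10.0, 12.0], "store_b": [5.0, 6.0]}
recon = bottom_up(leaf, {"total": ["store_a", "store_b"]})
# recon["total"] is [15.0, 18.0], consistent with the leaves
```

More sophisticated reconciliation methods (trace minimisation, for instance) forecast every level and then project onto the coherent subspace, but the bottom-up version captures the core constraint: aggregates and leaves must tell the same story.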
Causal forecasting
Most forecasting methods produce conditional forecasts: given the past, what will the future look like under business as usual? But the questions decision-makers actually want answered are counterfactual: what will the future look like if we run a particular promotion, change a price, deploy a new product? Causal forecasting — which connects directly to Chapter 03 on causal inference later in this Part — is the discipline of producing forecasts that are valid under interventions, not just under continuation of historical patterns. The intersection of causal inference and forecasting is one of the most consequential research areas in 2026, and the answers will reshape how decisions get made on top of forecasts. The chapters on causal inference and causal ML in this Part develop the theoretical framework; this chapter has set up the forecasting infrastructure on which causal forecasting builds.
What this chapter does not cover
Anomaly detection on time-series data is the subject of Chapter 02 (Anomaly Detection). The full causal framework — potential outcomes, DAGs, do-calculus, and the methods for estimating causal effects from observational data — is the subject of Chapter 03 (Causal Inference) and Chapter 04 (Causal Machine Learning). Survival analysis, which models time-to-event data with censoring (when did this customer churn? when will this machine fail?), is the subject of Chapter 06 (Survival Analysis & Event Modeling). And the broader question of how foundation models for time series will integrate with the foundation-model trends in language and vision — multimodal time series, language-conditioned forecasting, agents that reason over time-series streams — is the most active research frontier in the field as of 2026.
Time series are the data type behind every forecasting decision in modern operations. The classical methods of this chapter remain the foundation; the deep-learning era refined the architectures; the foundation-model era has fundamentally changed the workflow. A practitioner who understands all three layers — and who knows when to reach for which — is the one who can build forecasting systems that decision-makers actually trust.
Further Reading
- Forecasting: Principles and Practice (3rd Edition). The standard textbook for classical and contemporary time-series forecasting, written by the maintainers of the R forecast package and freely available online. Covers decomposition, ARIMA, ETS, hierarchical forecasting, and a serious treatment of probabilistic forecasting and forecast evaluation. Reading the first few chapters is the right preparation for any forecasting project. The single best free resource for classical forecasting.
- Time Series Analysis: Forecasting and Control (5th Edition). The original Box-Jenkins methodology, in the textbook that named it. Dense, mathematical, and authoritative — the place to go when you need to understand why ARIMA works rather than just how to use it. The fifth edition adds modern developments while preserving the classical material. The canonical reference for ARIMA.
- The M4 Competition: 100,000 Time Series and 61 Forecasting Methods. The full report of the M4 forecasting competition. Establishes empirically that simple methods (exponential smoothing, ARIMA, even seasonal naive) remain competitive with sophisticated machine-learning methods on heterogeneous business time series, and that ensembling across methods consistently helps. The single most useful empirical paper for calibrating expectations about forecasting accuracy. The reference for what actually works on real forecasting problems.
- A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. The PatchTST paper. Establishes patch-based tokenisation as the architectural pattern for time-series transformers and presents the empirical results that rehabilitated transformers for forecasting. Reading it grounds the architectural choices in nearly every modern foundation model. The branch point where transformers became viable for time series.
- Chronos: Learning the Language of Time Series. The Chronos paper. Demonstrates that vanilla T5-based language-model architectures, applied to quantised time-series tokens, produce competitive zero-shot forecasters with no time-series-specific architectural changes. The accompanying code and weights make Chronos the standard open-weight baseline for foundation-model time-series research. The most accessible foundation forecaster and the cleanest entry point.
- A Decoder-Only Foundation Model for Time-Series Forecasting (TimesFM). The TimesFM paper. Presents Google's time-series foundation model, trained on 100 billion time points including Google Trends and Wikipedia pageviews, with competitive zero-shot performance across the standard forecasting benchmarks. Comparing the architectural choices in TimesFM, Chronos, Moirai, and Lag-Llama is the natural next step after reading this chapter. A representative entry from the 2024 foundation-model wave.
- WaveNet: A Generative Model for Raw Audio. The WaveNet paper that introduced dilated causal convolutions and showed they could model long-range dependencies in sequences without RNN-style gradient pathologies. Essential reading for understanding why TCN-style architectures matter and why they preceded the transformer wave. The architectural breakthrough that preceded the transformer era for sequences.
- Forecasting at Scale (Prophet). The Prophet paper. Describes the decomposable model, the engineering choices that made it robust enough to ship as a default tool, and the philosophy of building forecasting tools that analysts can actually use. Even if you do not use Prophet, the paper's discussion of forecasting workflows in production is unusually clear. The clearest single account of how to build a forecasting tool people will actually use.