Part IV · Classical Machine Learning · Chapter 06

Probabilistic graphical models, where we draw the joint distribution as a picture and let the picture do the reasoning.

Almost every interesting learning problem reduces to reasoning about a joint probability distribution over a large set of variables — pixels and object labels, words and topics, symptoms and diagnoses, genes and phenotypes, state variables and observations across time. Written out as a single table, that joint is astronomical: a distribution over twenty binary variables is already a million entries, and the worlds we actually care about contain millions of variables. Probabilistic graphical models rescue the enterprise by factoring the joint into small local pieces whose dependencies are laid out in a graph: each node is a random variable, each edge encodes a direct statistical relation, and the graph itself becomes a compact, readable, and computable representation of an otherwise intractable distribution. Two dialects dominate the field: Bayesian networks — directed acyclic graphs that factor the joint into a product of conditionals, the natural way to express causal or generative stories — and Markov random fields — undirected graphs that factor the joint into a product of local potentials, the natural way to express symmetric constraints and pairwise compatibility. A third, factor graphs, unifies the two for inference. On top of the representation sits a toolkit for answering queries about the distribution — exact inference (variable elimination, belief propagation, junction trees) when the graph is tractable and approximate inference (variational methods, MCMC) when it is not — and a toolkit for learning the parameters and sometimes the structure from data. The chapter ends with the dynamical-systems specialisations that dominated pre-deep-learning sequence modelling — hidden Markov models, Kalman filters, conditional random fields — and with topic models, the PGM idea that anchors most of modern unsupervised text mining. PGMs gave the field its vocabulary for latent variables, conditional independence, causal structure, and principled uncertainty; deep learning absorbed the vocabulary but could not replace it, and today the two live in productive dialogue in variational autoencoders, normalising flows, diffusion, and probabilistic programming.

How to read this chapter

Section one motivates the enterprise by walking through the four things a graphical representation of a joint distribution gives you — compactness, readability, a tractable inference story, and a principled way to combine data with prior knowledge — and where each of those properties becomes indispensable. Section two makes the key move concrete: a graph factorises a joint distribution into a product of smaller local distributions, and the factorisation is exactly what turns an intractable problem into a tractable one. Sections three through five introduce the two central representations — Bayesian networks (directed, one conditional per node, natural for causal stories) and Markov random fields (undirected, one potential per clique, natural for symmetric constraints) — and section four works through the conditional-independence structure that a directed graph encodes, captured by the d-separation criterion. Section six introduces the factor graph, a bipartite representation that unifies directed and undirected models and is the natural data structure for message-passing inference. Together these six sections give the representation side of the field: how to draw a distribution as a graph, and how to read factorisation and independence back off the graph.

Sections seven through ten are the inference toolkit. Section seven develops exact inference — variable elimination and its message-passing incarnation, the sum-product algorithm — and shows why it runs in time exponential in the graph's treewidth, tractable for trees and small-treewidth graphs and intractable for most of the rest. Section eight is the junction-tree algorithm, the general-purpose scheme for turning a graph into a tree of cliques on which exact inference can still be run. Sections nine and ten cover the two families of approximate inference that dominate when exact methods will not finish — variational inference (treat the posterior as an optimisation problem, minimise KL divergence against a tractable family, the route taken by mean-field VI, variational Bayes, and every modern VAE) and Markov-chain Monte Carlo (treat the posterior as a target to be sampled, use Gibbs sampling, Metropolis–Hastings, or Hamiltonian Monte Carlo, the route taken by Stan and PyMC). Sections eleven and twelve are the learning toolkit: parameter learning (maximum likelihood, maximum a posteriori, Bayesian updates, EM for latent variables) and structure learning (which edges should be in the graph at all — score-based search, constraint-based tests, Bayesian model averaging, and the stubborn difficulty of causal discovery from observational data).

Sections thirteen through sixteen turn to four landmark specialisations whose influence on the field is hard to overstate. Section thirteen is hidden Markov models, the PGM for discrete-state sequences, the engine of speech recognition from the 1970s until neural approaches caught up in the 2010s, and still the first tool to reach for when modelling latent-state dynamics. Section fourteen is the Kalman filter and the broader family of linear dynamical systems — the continuous-state analogue of HMMs, the algorithm that brought humans to the moon, and the underpinning of modern state-space models and the "S4/Mamba" lineage. Section fifteen is conditional random fields, the discriminative cousin of HMMs that dominated structured-prediction benchmarks in NLP and computer vision through the 2000s. Section sixteen is latent Dirichlet allocation and the broader family of topic models — the canonical application of Bayesian networks to unsupervised text understanding. Section seventeen surveys the practical toolkit — PyMC, Stan, Pyro, bnlearn, pgmpy — and the operational decisions that separate a working PGM from a non-starter. Section eighteen closes with where graphical-model thinking sits in modern ML: deep PGMs, variational autoencoders and normalising flows, neural structured prediction, the reemergence of state-space models in sequence modelling, probabilistic programming, and the causal-inference frontier.

Why probabilistic graphical modelsCompactness, readability, inference, priors meet data
Graphs factorise joint distributionsThe core move: big joint → product of small locals
Bayesian networksDirected acyclic graphs and local conditionals
Conditional independence and d-separationReading independence off a DAG
Markov random fieldsUndirected models, cliques, potentials
Factor graphsBipartite representation that unifies the two
Exact inferenceVariable elimination, sum-product, belief propagation
The junction-tree algorithmExact inference on general graphs
Variational inferencePosteriors as optimisation: mean field, VI, ELBO
Markov-chain Monte CarloGibbs, Metropolis–Hastings, HMC, sampling posteriors
Parameter learningMLE, MAP, Bayesian updates, EM for latents
Structure learningScore-based, constraint-based, causal discovery
Hidden Markov modelsDiscrete latent-state sequences
Kalman filters and state-space modelsContinuous-state linear dynamical systems
Conditional random fieldsDiscriminative structured prediction
Topic modelsLatent Dirichlet allocation and descendants
PGMs in practicePyMC, Stan, Pyro, pgmpy, operational recipes
Where it compounds in MLDeep PGMs, VAEs, SSMs, probabilistic programming, causality

Why probabilistic graphical models: making big joints tractable

Most of machine learning reduces to reasoning about a joint distribution over many variables. Written naively, those joints are astronomical — the distribution over twenty binary variables already has over a million entries, and real problems involve thousands or millions of variables. Probabilistic graphical models are the representation that rescues the enterprise: a graph whose nodes are variables and whose edges encode direct statistical dependencies, and a collection of small local distributions attached to that graph that, multiplied together, reproduce the full joint. The graph is the contract — it says which variables can ignore which others, and therefore which computations are cheap.

Four things a graph gives you

The first is compactness. A graph with maximum in-degree k over n binary variables is specified by at most n·2^k parameters rather than 2^n. For n=1000 and k=5, the difference is 32 000 parameters versus a number that does not fit in any computer. Every non-trivial probabilistic model fits in memory only because some such factorisation holds.

The second is readability. A PGM is literally a diagram of what depends on what. Domain experts who cannot read probability theory can still read the diagram, point at an edge, and say "that dependence is wrong" or "this variable should also feed into that one." No other representation of a probability distribution supports that conversation.

The third is inference. The graph structure determines which queries are cheap. For tree-structured graphs, computing any marginal or conditional is linear in the size of the graph. For graphs of small treewidth (the key complexity measure), exact inference still runs in polynomial time. For dense graphs, we fall back on approximation — variational methods and MCMC — but the fallback is itself guided by the graph.

The fourth is principled combination of prior and data. A graphical model specifies a prior, a likelihood, and a mechanism for updating beliefs given evidence; the posterior is what the graph becomes after inference, and it inherits all the compositional structure of the prior. When new data arrives, or new variables are observed, the update is well-defined and does not require relearning the model from scratch.

The core idea. A graphical model says: the big joint P(X₁,…,Xₙ) factorises as a product of small local distributions, one per node (Bayesian network) or per clique (Markov random field). The graph encodes which variables appear together in a local piece. Everything else — inference, learning, semantics — follows from that factorisation.

When PGMs are the right tool

Graphical models shine when the structure of the problem is natural or known: causal chains, hidden states in time, known conditional independences, expert-elicited medical or diagnostic networks, pedigrees, spatial-dependence problems in vision (pixels depend on neighbours), or when uncertainty needs to be carried end-to-end through a pipeline. They also shine when data is scarce relative to variables, because the factorisation acts as a strong inductive bias. They are a weaker fit when the joint is highly non-linear and we have enormous data — the regime where deep learning has won — though the two families meet in variational autoencoders, normalising flows, and neural structured-prediction models, which are graphical models whose local pieces are neural networks.

A map of the rest of the chapter

The next five sections build the representation: how to turn a factorisation into a graph, how directed (Bayesian) and undirected (Markov) graphs differ, how to read conditional independence off a graph, and how factor graphs unify the two. Sections 7–10 develop the two inference families: exact (variable elimination, junction trees) and approximate (variational inference, MCMC). Sections 11–12 are learning: parameters from data and, harder, structure. Sections 13–16 are the four classic specialisations that dominated pre-deep-learning sequential and latent-variable modelling — HMMs, Kalman filters, CRFs, topic models. Sections 17–18 close with practice and the place of PGMs in the modern ML landscape.

Graphs factorise joint distributions: the core move

Before committing to directed or undirected, pause on the single idea that makes the whole field work. A graph on variables X₁,…,Xₙ partitions the joint distribution into small pieces, and the full joint is the product of those pieces. The partition is what decouples variables that would otherwise be entangled; the product is what reassembles a coherent distribution from the pieces. Everything algorithmic in probabilistic graphical models — inference, learning, sampling — is a manipulation of that product, and the graph is the shape that tells us which manipulations are cheap.

The chain rule is the starting point

For any ordering of variables, the joint can be written as a product of conditionals via the chain rule of probability:

P(X₁,…,Xₙ) = P(X₁) · P(X₂ | X₁) · P(X₃ | X₁,X₂) · … · P(Xₙ | X₁,…,Xₙ₋₁).

This is a tautology — it holds for every joint — and by itself it saves nothing: the last conditional still depends on all prior variables. The PGM insight is that, in almost every real problem, most of those conditionals do not truly depend on all the earlier variables but only on a small subset of them. If we can write

P(Xₖ | X₁,…,Xₖ₋₁) = P(Xₖ | Pa(Xₖ))

where Pa(Xₖ) ⊆ {X₁,…,Xₖ₋₁} is a small set of parents, we have a Bayesian network. The graph draws an edge from each parent to the child, which gives a directed acyclic graph, and the joint factorises as a product of local conditionals:

P(X) = ∏ₖ P(Xₖ | Pa(Xₖ)).

The same idea without direction

Directed factorisations are natural when one variable generates another — a disease causes a symptom, a topic generates a word, a hidden state emits an observation. For symmetric relationships — two pixels with similar colours are likely neighbours, two papers that share many keywords are probably about the same topic — a direction is awkward. The undirected counterpart is a Markov random field: it factorises the joint into a product of potential functions, one per clique in the graph, normalised by a partition function Z:

P(X) = (1/Z) · ∏_C ψ_C(X_C), where Z = Σ_X ∏_C ψ_C(X_C).

The potentials are not probabilities — they are arbitrary non-negative functions — and Z is the awkward price of using them. Computing Z is in general as hard as inference itself, which is why undirected models are often harder to work with than directed ones.

The representation determines the complexity

The graph is not merely a drawing of a factorisation — it is the factorisation, and the algorithmic cost of every query against the model is determined by its shape. A graph that is a tree admits exact inference in linear time. A graph whose treewidth is small admits exact inference in time exponential in the treewidth but polynomial in the number of nodes. A graph with high treewidth forces us to approximate. When designing a model, the shape of the graph is as much an engineering choice as the choice of parameters.

Two-minute exercise. Write down the joint P(A,B,C,D,E) for a Markov chain A → B → C → D → E. How many parameters does the full joint have if every variable is binary? How many after factorisation? (Answer: the joint has 31 free parameters, the factorisation has 1+2+2+2+2 = 9. For a chain of n binary variables, the saving is 2ⁿ−1 → 2n−1.)

Bayesian networks: directed acyclic graphs over variables

A Bayesian network — also called a belief network, causal network, or directed graphical model — is the PGM you reach for when the problem has a natural generative story: causes precede effects, latent states precede observations, parents produce children. Formally it is a directed acyclic graph over the variables together with a conditional probability distribution (CPD) for each node, given its parents. The joint is the product of those local CPDs, and the graph makes every conditional-independence assumption in the model visible at a glance.

Definition and the factorisation

A Bayesian network over variables X₁,…,Xₙ consists of (i) a directed acyclic graph G with a node per variable, and (ii) for each node Xᵢ a conditional probability distribution P(Xᵢ | Pa(Xᵢ)) where Pa(Xᵢ) is the set of parents of Xᵢ in G. The joint is then

P(X₁,…,Xₙ) = ∏ᵢ P(Xᵢ | Pa(Xᵢ)).

For discrete variables the CPDs are tables; for continuous variables they are parametric densities (Gaussians are canonical); for mixed models they can be any well-defined conditional distribution. A root node (no parents) has an unconditional prior P(Xᵢ).

The canonical example

Judea Pearl's textbook example: an alarm network. The variables are Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. The story is: a burglary or an earthquake can set off the alarm; the alarm can make John or Mary call. The DAG has Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls. The joint

P(B,E,A,J,M) = P(B) · P(E) · P(A | B,E) · P(J | A) · P(M | A)

replaces 31 free parameters (the full binary joint minus the normalisation) with 1 + 1 + 4 + 2 + 2 = 10, and the graph says — correctly — that John and Mary are conditionally independent given the alarm, that the two causes are marginally independent, and that knowing a call increases our belief in an earthquake only indirectly via the alarm.

Local and global Markov properties

Two semantically equivalent ways of stating what a Bayesian network asserts. The local Markov property says each variable is conditionally independent of its non-descendants given its parents — intuitively, the parents screen off the ancestry. The global Markov property says conditional independence holds exactly when a graphical criterion called d-separation (next section) holds. The two are equivalent: every independence claimed by d-separation follows from the local property, and vice versa. The graph is a map of all the conditional independences implied by the factorisation.

Causal interpretation

A Bayesian network is a statistical object — a factorisation of a joint — and nothing in the definition commits to a causal reading. However, when the directed edges reflect causal mechanisms (B → A means "burglary causes alarm"), the network supports reasoning about interventions: what happens to JohnCalls if we manually set Alarm = true? Pearl's do-calculus formalises this, distinguishing observational probabilities P(X | Y) from interventional ones P(X | do(Y)). Causal structure learning from observational data — recovering the DAG when only joint samples are observed — is one of the hardest problems in the field and is covered in section 12.

Directed ≠ causal. A DAG can be drawn in many orderings consistent with the same joint. Two DAGs are Markov-equivalent if they imply the same conditional-independence structure — in which case no amount of observational data can distinguish them. Only with interventions, or with strong assumptions (faithfulness, acyclicity, causal sufficiency), does a DAG become identifiable as the causal graph.

Inference queries

Given a Bayesian network and some observed evidence E = e, the three queries that matter are: (i) the marginal P(Xᵢ | e) for a single variable; (ii) the most probable explanation or MAP — the assignment to all unobserved variables that maximises P(X_unobs | e); (iii) the conditional expectation E[f(X) | e] for some function f. All three are NP-hard in general but tractable on trees and small-treewidth graphs, which is what the next several sections develop.

Conditional independence and d-separation

The great conceptual gift of Bayesian networks is that conditional independence — the statistical property that would otherwise require painful algebra to check — can be read off the graph by a simple graphical criterion. The criterion is called d-separation (for "directed separation"), and it classifies every path between two sets of variables as either blocked (independence preserved) or active (dependence transmitted), conditioned on a third set of observed variables. Mastering d-separation is how you read a Bayesian network.

Three canonical three-variable patterns

Every path through a DAG is composed of three types of triples, and each has a different behaviour under conditioning:

A chain is A → B → C. A and C are marginally dependent; conditioning on the middle variable B renders them independent (B screens off A from C).

A fork is A ← B → C. A and C share a common cause B; they are marginally dependent; conditioning on B renders them independent.

A collider (or v-structure) is A → B ← C. A and C are marginally independent — they have no shared cause — but conditioning on the collider B (or any descendant of B) makes them dependent. This is the surprising case and the source of most mistakes in graphical reasoning.

Explain-away. The collider phenomenon is sometimes called explaining away: if the alarm is on and we learn there was an earthquake, our belief in a burglary drops, even though burglaries and earthquakes are a priori independent. Conditioning on a common effect induces dependence between its causes. Missing this is how non-experts get probabilistic reasoning wrong.

The d-separation criterion

Two sets of variables X and Y are d-separated by a third set Z if every undirected path between a node in X and a node in Y is blocked by Z. A path is blocked if it contains either (i) a chain or fork whose middle node is in Z, or (ii) a collider whose middle node — and all its descendants — are not in Z. If every path is blocked, the two sets are conditionally independent given Z in every distribution that factors according to the graph.

Markov blanket

The Markov blanket of a node X is the smallest set of variables that renders X independent of the rest of the graph. For a Bayesian network, it consists of the parents of X, the children of X, and the other parents of those children (the "co-parents"). For a Markov random field, it is simply the set of neighbours. The Markov blanket is the right notion of "local neighbourhood" for Gibbs sampling and for variational inference, because conditioning on it cuts a node free from everything else.

I-maps and faithfulness

A graph is an I-map of a distribution if every independence the graph asserts is true in the distribution. A distribution is faithful to the graph if every independence in the distribution is captured by the graph. A graph is a perfect map if both hold. The CI structure of most real distributions is not perfectly representable by any DAG — there exist independences that cannot be drawn as a DAG — which is one of the motivations for undirected and factor-graph representations.

Markov random fields: undirected models

When relationships are inherently symmetric — two adjacent pixels prefer to take similar values, two co-authored papers share topic, two neighbours in a social network influence each other — the directed formalism becomes awkward. A Markov random field (also called a Markov network or undirected graphical model) solves this by factorising the joint over the cliques of an undirected graph, with a non-negative potential function on each clique and a global normalisation constant. The representation is natural for problems where "which way does the arrow go?" has no good answer.

Definition

A Markov random field over variables X₁,…,Xₙ consists of (i) an undirected graph G with a node per variable, and (ii) a non-negative potential function ψ_C(X_C) for each maximal clique C of G. The joint is

P(X) = (1/Z) · ∏_C ψ_C(X_C), Z = Σ_X ∏_C ψ_C(X_C).

The potential is an arbitrary non-negative function — unnormalised and with no probabilistic interpretation on its own. The partition function Z is the global normaliser; computing it is in general #P-hard, which is what makes undirected models harder than directed ones.

The Hammersley–Clifford theorem

The cornerstone result, due to Hammersley and Clifford in 1971: a strictly positive distribution P(X) is a Markov random field with respect to graph G — meaning it satisfies the local Markov property that every variable is independent of the rest of the graph given its neighbours — if and only if it factorises over the cliques of G. The theorem is what lets the graph be both a representation of independence structure and a representation of a factorisation at the same time.

Log-linear form

Potentials are often written in log-linear form. Define features f_k(X_Ck) for each clique and its potential, with weights θ_k:

P(X) = (1/Z) · exp( Σ_k θ_k · f_k(X_Ck) ).

This is the Gibbs distribution, the exponential family, and the log-linear model all at once. Conditional random fields (section 15), Ising models, and every "exponential family" density sit in this form. The log-likelihood is concave in θ, which means parameter learning is a convex optimisation — one of the pleasant properties of MRFs.

Canonical examples

The Ising model is an MRF on a grid of binary spins: pairwise potentials prefer adjacent spins to align. It is the prototype for everything from ferromagnets to image denoising. The Potts model generalises it to multi-valued variables. In vision, Markov random fields with pairwise potentials are the classical model for image denoising, stereo matching, and segmentation — the energy E(X) = Σ unary(Xᵢ) + Σ pairwise(Xᵢ,Xⱼ) is minimised by techniques like graph cuts and iterated conditional modes. In NLP, conditional random fields (section 15) are MRFs conditioned on an input sequence.

Directed versus undirected

Directed models are natural for causal or generative stories; undirected models are natural for symmetric compatibilities. Directed models always have a tractable Z = 1; undirected models pay a price computing Z. Directed models have structured, interpretable conditionals; undirected models have features and weights. Neither subsumes the other: the independence structures representable by a DAG and those representable by an undirected graph are different, and each has examples the other cannot represent. In practice, Bayesian networks dominate in medical diagnostics, expert systems, and causal inference; MRFs dominate in vision, spatial statistics, and structured prediction.

Moralisation. A Bayesian network can be converted to an MRF by "moralising" it: connect every pair of parents that share a child (marrying them) and then drop all edge directions. The resulting undirected graph is an I-map of the original Bayesian network, though possibly not the tightest one — some conditional independences of the DAG may be lost.

Factor graphs: the unifying bipartite representation

Directed and undirected graphs are two different pictures of a distribution. For inference, a third picture — the factor graph — turns out to be the cleanest. A factor graph is a bipartite graph with two kinds of nodes: variable nodes (circles) and factor nodes (squares), with an edge connecting a factor to each variable in its scope. Any factorisation — directed, undirected, or a mix — can be drawn this way, and the resulting graph is what modern message-passing inference algorithms actually operate on.

Definition

A factor graph for a joint distribution P(X) = (1/Z) · ∏_j f_j(X_j) has a variable node for each Xᵢ, a factor node for each f_j, and an edge between factor f_j and each variable in its scope X_j. Bayesian-network conditionals P(Xᵢ | Pa(Xᵢ)) become factors with scope {Xᵢ} ∪ Pa(Xᵢ); MRF clique potentials ψ_C(X_C) become factors with scope C. The bipartite form makes the factorisation explicit and distinguishes two cliques that happen to share the same variable set but come from different factors.

Why bipartite matters for inference

An MRF drawn as an undirected graph loses information when two different factors have overlapping scopes: the union appears as a single clique. A factor graph keeps them separate. This matters because message-passing algorithms — sum-product, belief propagation, variational message passing — run on the factor graph and send distinct messages along each edge. On a tree-structured factor graph, the sum-product algorithm computes all marginals exactly in two passes (leaves-to-root, then root-to-leaves) in time linear in the number of edges.

The sum-product algorithm

Messages flow in two directions. Variable-to-factor: a variable X sends to a neighbouring factor f the product of messages it receives from all its other factor neighbours. Factor-to-variable: a factor f sends to a neighbouring variable X the sum over all its other variables of the product of f with the incoming variable messages. Iterate until the messages stop changing (one pass suffices on a tree). The marginal of each variable is then the product of all incoming messages at that variable node. This is the sum-product algorithm, also known as belief propagation.

Max-product and MAP inference

Replacing the sums with maxes turns the algorithm into max-product, which computes the most probable assignment (the MAP configuration) rather than marginals. This is the message-passing generalisation of the Viterbi algorithm used in HMMs, and of the many shortest-path algorithms used for decoding in convolutional codes and NLP taggers.

Loopy belief propagation

On a graph with cycles, running the sum-product algorithm as if the graph were a tree is no longer exact — the messages can oscillate, the marginals are wrong — but it often works remarkably well in practice, and there are settings (Gaussian graphical models, certain log-linear models) where it can be shown to converge to sensible approximations. Loopy belief propagation is one of the cheap approximate-inference workhorses, and its relationship to variational inference (section 9) is one of the surprising connections in the field.

Why factor graphs won the inference war. The representation is algorithmically neutral: directed and undirected models map to it losslessly, message-passing updates are defined in a single uniform way, and the same code handles any factorisation. Modern probabilistic-programming toolkits (Pyro, NumPyro, Infer.NET) use factor-graph semantics internally, even when users write directed-style generative code.

Exact inference: variable elimination and belief propagation

Given a factorisation and some evidence, the fundamental query is "what is the marginal (or MAP) of a subset of variables?" Exact inference is what we have when the graph is small enough, or tree-like enough, that the answer can be computed without approximation. Two algorithms do most of the work: variable elimination, a non-recursive algebraic procedure that sums out variables one at a time; and belief propagation (sum-product on a factor graph), a distributed message-passing procedure. They compute the same thing on a tree in the same asymptotic time; they differ in what they make reusable across queries.

Variable elimination

To compute the marginal P(Q | E), multiply together all factors, fix the evidence variables, and sum out the non-query non-evidence variables one at a time. Efficiency depends on the elimination order: a good order produces intermediate factors that stay small; a bad order produces intermediate factors the size of the joint. Finding the optimal order is NP-hard in general, but good heuristics exist (min-fill, min-weight). The runtime is exponential in the induced treewidth of the graph — the size of the largest intermediate factor generated during elimination — which is why treewidth is the complexity measure of graphical-model inference.

Belief propagation on trees

On a tree, the sum-product algorithm is exact. Two message passes — one from leaves to root, one from root to leaves — produce every marginal simultaneously. The runtime is linear in the number of variables and the complexity of the factors. For discrete variables with k states and pairwise potentials, the cost is O(nk²). On an HMM drawn as a chain, this is exactly the forward–backward algorithm (section 13). The fact that two quite different-looking algorithms — forward–backward, the Viterbi algorithm, Kalman smoothing, sum-product on a tree — are all the same message-passing algorithm on different factor graphs was one of the unifying discoveries of the 1990s.

Complexity

Exact inference is NP-hard in general graphical models (Cooper 1990). Even approximating the marginals to within a fixed factor is NP-hard in the worst case (Dagum and Luby 1993). Tractable cases: trees and polytrees (linear), small treewidth (exponential in treewidth only), singly-connected graphs, chains, grids of bounded width. Everything else requires approximation. The practical upshot: exact inference is the right tool for small expert systems, HMMs, small Bayesian networks, and Gaussian graphical models whose structure is tree-like; for medium and large networks, use variational inference or MCMC.

Gaussian graphical models are special

When all variables are jointly Gaussian, the joint is completely specified by its mean vector and covariance (or precision) matrix, and inference reduces to linear algebra. Conditional independence in a Gaussian corresponds exactly to zero entries in the precision matrix. Conditioning, marginalising, and computing maximum-likelihood parameters are all matrix operations. The Kalman filter (section 14) is the prototypical Gaussian graphical model, and the broader machinery underpins graphical Lasso, Gaussian Markov random fields in spatial statistics, and much of modern structured linear algebra.

The junction-tree algorithm: exact inference on general graphs

Variable elimination is efficient on trees and painful on anything else. The junction tree algorithm is the systematic way to extend exact inference to graphs with cycles: transform the original graph into a tree whose nodes are clusters (maximal cliques) of the original variables, then run belief propagation on that tree. The result is exact inference in time exponential in the treewidth of the graph rather than exponential in the number of variables — tractable whenever the treewidth is small.

Three-step construction

First, moralise the graph: for a directed model, marry co-parents (connect all parents that share a child) and drop edge directions. Second, triangulate it: add chords until no cycle of length four or more is chordless. Triangulation ensures that the maximal cliques form a tree structure; the minimum-treewidth triangulation is NP-hard but heuristics work well. Third, build the junction tree: nodes are maximal cliques of the triangulated graph, edges connect cliques that share variables, and the running-intersection property holds (every variable appearing in two cliques appears in every clique on the path between them).

Belief propagation on the junction tree

Each clique holds a potential — the product of all original factors whose scope is contained in the clique. Neighbouring cliques exchange messages over their separator (the intersection of their variables). Two passes — collect-to-root, then distribute-from-root — make every clique consistent with its neighbours. Marginals of any variable, or of any set of variables that lie in a single clique, are read off directly from the clique potentials. The runtime is exponential in the size of the largest clique (i.e., treewidth plus one) and linear in the number of cliques.

When junction trees work — and when they do not

Treewidth is a hard property of the graph. Tree-structured graphs have treewidth 1; polytrees have treewidth 1; grids of width k have treewidth k; random graphs have treewidth that grows with the number of nodes. An n×n grid has treewidth n, so exact inference on a 100×100 image MRF is forever out of reach — the largest clique would have 100 variables and 2¹⁰⁰ entries. This is where approximate methods (the next two sections) take over.

Why it still matters

Many practical problems have small natural treewidth: medical Bayesian networks, phylogenetic trees, pedigree analyses, small expert systems, many Gaussian networks. When the treewidth is in the range of 10–25, the junction tree is fast and exact. It is also the algorithm built into the classical inference engines of tools like Netica, Hugin, and GeNIe. For modern large-scale work the baton passes to variational inference and sampling, but the junction tree is the yardstick: an approximate method is "good" in part because it agrees with the junction tree on problems where both can run.

Treewidth is the complexity measure. "Can exact inference run in reasonable time?" reduces to "is the treewidth of this graph small?" — a purely graph-theoretic question, independent of the specific parameters of the model. Knowing the treewidth of the graph you are modelling is the single most predictive piece of information about whether your inference strategy needs to approximate.

Variational inference: posteriors as optimisation

When exact inference is intractable, one option is to pose approximate inference as an optimisation problem: pick a simple, tractable family of distributions q(Z) and find the member of that family that is closest to the true posterior p(Z | X). The closeness is measured in KL divergence, and the minimisation is equivalent to maximising a lower bound on the log marginal likelihood — the ELBO. Variational inference trades statistical consistency for deterministic runtime: it always returns an answer in finite time, though not necessarily the right one. It is the backbone of modern scalable probabilistic inference and the engine inside every variational autoencoder.

The ELBO

Start from the identity log p(X) = L(q) + KL(q(Z) ∥ p(Z | X)), where

L(q) = E_q[log p(X, Z)] − E_q[log q(Z)]

is the evidence lower bound. Since KL is non-negative, L(q) ≤ log p(X). Maximising L(q) over a tractable family of q's simultaneously tightens the bound on the marginal likelihood and pushes q towards the true posterior. All of variational inference is the design of the family and the optimisation of the ELBO.

Mean-field variational inference

The simplest choice of family is mean-field: assume the posterior factorises across the latent variables, q(Z) = ∏ᵢ qᵢ(Zᵢ). Under this assumption, the optimal qᵢ has a closed-form update: each factor is the exponential of the expected log joint under the other factors. Iterating the updates is the coordinate-ascent variational inference (CAVI) algorithm, and it always converges to a local optimum. Mean-field VI is the classic scheme for latent Dirichlet allocation, Bayesian mixture models, and Gaussian mixtures with conjugate priors.

Stochastic and amortised variants

Two extensions made variational inference scale to modern problems. Stochastic variational inference (Hoffman et al., 2013) replaces the expensive full-data expectations with noisy mini-batch estimates, enabling VI on datasets of any size. Amortised variational inference — the key idea in variational autoencoders (Kingma & Welling, 2013) — replaces the per-datapoint optimisation of q with a single neural network q_φ(Z | X) that predicts the approximate posterior as a function of the input. Amortisation is cheaper at inference time (one forward pass instead of an optimisation) and is the reason VAEs scale to millions of images.

ELBO gradients and the reparameterisation trick

Optimising the ELBO with gradient descent requires gradients through an expectation under q. The reparameterisation trick expresses a sample from a Gaussian q as a deterministic function of its parameters and auxiliary noise — z = μ + σ · ε, ε ∼ N(0, I) — which lets backpropagation flow through the sampling step. This single idea is what made VAEs, normalising flows, and much of modern deep generative modelling tractable.

Strengths and limitations

Variational inference is fast, deterministic, composable with gradient-based optimisers, and scalable to large datasets. Its weakness is that the mean-field factorisation systematically underestimates posterior variance — it has a well-documented tendency to pick one mode and ignore the others, because KL(q ∥ p) penalises q mass where p has none but not the reverse. Structured VI, normalising flows, and more expressive families (Gaussian processes, neural flow-based q's) partially fix this. For well-calibrated posteriors, MCMC (next section) is still the gold standard.

Markov-chain Monte Carlo: sampling the posterior

The alternative to optimising an approximation is sampling the real thing. MCMC constructs a Markov chain whose stationary distribution is exactly the posterior we cannot compute in closed form, runs the chain until it has "burned in," and then uses the samples as a Monte Carlo estimate of any expectation we want. It is slower and less predictable than variational inference, but with enough samples it converges to the truth, and it does not require a tractable approximating family. MCMC is the inference engine behind Stan and PyMC, and the gold-standard baseline against which every variational method is judged.

Metropolis–Hastings

The foundational algorithm, due to Metropolis et al. 1953 and generalised by Hastings 1970. Propose a new state z' from some proposal distribution q(z' | z). Accept the move with probability

α = min(1, [p(z') q(z | z')] / [p(z) q(z' | z)]).

Crucially, p need only be known up to a constant — the normalisation cancels. This is what lets MCMC sample an unnormalised Bayesian posterior without computing the evidence p(X). The proposal can be anything; the price of a poor proposal is slow mixing. Tuning proposals is most of the practical work of writing a Metropolis sampler.

Gibbs sampling

When the conditional p(Zᵢ | Z₋ᵢ, X) of each latent variable given the others is tractable — a common situation in Bayesian networks with conjugate priors — Gibbs sampling is the natural specialisation of Metropolis–Hastings. Cycle through variables, sampling each from its full conditional. Every step is accepted (the acceptance ratio is always 1). Gibbs is the algorithm behind topic-model inference in the classic LDA papers, and behind most MCMC for conditionally conjugate models.

Hamiltonian Monte Carlo

For continuous latents, ordinary random-walk proposals mix slowly: they take many small steps to explore the typical set. Hamiltonian Monte Carlo (Neal 2011) fixes this by treating the log-posterior as a potential energy, adding an auxiliary momentum variable, and simulating Hamiltonian dynamics with a symplectic integrator — so proposals can traverse large distances while respecting the geometry of the posterior. Stan's NUTS (no-U-turn sampler, Hoffman & Gelman 2014) automates the step-size and trajectory-length tuning that used to require expert hand-holding. HMC is the default algorithm for continuous Bayesian inference today.

Diagnostics

MCMC is correct only asymptotically, so diagnostics matter. Trace plots show whether chains have mixed. The Gelman–Rubin statistic R̂ compares variance within chains to variance between chains; values near 1 indicate convergence, values above 1.01 are warning signs. The effective sample size measures how many "independent" samples the chain has produced after autocorrelation. Good practice: run at least four chains from different starts; look at R̂, ESS, and divergences; treat suspicious posteriors with suspicion.

When to use MCMC

MCMC is the right choice when (i) you need calibrated uncertainty estimates, not just point estimates; (ii) the posterior is multi-modal and variational methods are getting stuck in one mode; (iii) the model is complex enough that no reasonable q-family approximates the posterior well; (iv) the dataset is small enough that sampling is affordable. For large-scale, fast-iteration regimes — training neural-network latent-variable models on millions of images — variational inference still wins. The modern ecosystem has both well integrated: Stan, PyMC, NumPyro for MCMC-first workflows; Pyro and TensorFlow Probability for VI-first workflows with MCMC as a fallback.

Variational vs MCMC, in one sentence. Variational inference is fast and wrong in a knowable way; MCMC is slow and eventually right, if you wait long enough and watch the diagnostics.

Parameter learning: MLE, MAP, Bayesian updates, EM

Given a graph and a dataset, how do we estimate the parameters of the local conditionals or potentials? Three routes dominate: maximum likelihood (pick the parameters that best explain the data), maximum a posteriori (MLE with a prior), and full Bayesian inference (treat parameters as random variables with their own posterior). When latent variables are present, the classical tool is the expectation-maximisation algorithm, which alternates between filling in the latents and re-estimating the parameters. Parameter learning in PGMs is what separates a workable model from a toy.

Maximum likelihood in Bayesian networks

For a fully observed Bayesian network, the likelihood factorises across nodes and the MLE decomposes into separate estimates for each conditional: P(Xᵢ | Pa(Xᵢ)) is estimated by counting (for discrete variables) or by fitting the local parametric form (Gaussian linear regression for continuous children of continuous parents, etc.). The decomposition is what makes directed models pleasant to learn: each local CPD can be trained independently.

Maximum likelihood in Markov random fields

For an MRF, the log-likelihood is

ℓ(θ) = Σᵢ log p(X⁽ⁱ⁾; θ) = Σᵢ ( Σ_k θ_k f_k(X⁽ⁱ⁾) − log Z(θ) ).

It is concave in θ, so in principle a simple gradient ascent works. In practice, the gradient contains an expectation under the model distribution — ∂/∂θ_k log Z(θ) = E_{p(X;θ)}[f_k(X)] — which requires inference at every step. For small-treewidth graphs this is feasible; for larger graphs, one resorts to pseudo-likelihood (replace the full likelihood with a product of conditionals), contrastive divergence (Hinton 2002, the algorithm behind Boltzmann machine training), or stochastic MCMC-based gradient estimates.

MAP and Bayesian estimation

Maximum a posteriori adds a prior: θ̂_MAP = arg max_θ p(θ | D) = arg max_θ p(D | θ) p(θ). For discrete Bayesian networks with Dirichlet priors, this is Laplace smoothing — a pseudo-count added to every cell — and makes the estimates well-behaved when some parent configurations are rare. Full Bayesian parameter inference treats θ as a latent variable and computes the posterior over parameters, yielding predictive distributions that integrate over parameter uncertainty. The machinery is MCMC or variational (sections 9–10).

Expectation-maximisation

When some variables are latent or some data are missing, direct MLE is intractable: the log-likelihood contains a sum (or integral) over the latents. The EM algorithm (Dempster, Laird & Rubin 1977) handles this iteratively. The E-step computes the posterior over latents given current parameters: q⁽ᵗ⁾(Z) = p(Z | X, θ⁽ᵗ⁾). The M-step updates parameters to maximise the expected complete-data log-likelihood under q⁽ᵗ⁾: θ⁽ᵗ⁺¹⁾ = arg max_θ E_{q⁽ᵗ⁾}[ log p(X, Z; θ) ]. Each iteration monotonically improves the observed-data log-likelihood. Convergence is to a local optimum; restarts mitigate.

EM generalises an enormous set of classical algorithms: Baum–Welch (EM on HMMs, section 13), the Gaussian-mixture EM that fits GMMs (Chapter 04), LDA's variational EM (section 16), and the factor-analysis iterations of Chapter 05. Every time a model has latents and closed-form conditional updates, EM is the natural algorithm.

Small data, conjugate priors, full posteriors

In practice, small PGMs — medical decision networks with a few dozen nodes, pedigree analyses, industrial reliability networks — are routinely learned end-to-end with Bayesian machinery: conjugate Dirichlet-multinomial updates for discrete CPDs, normal-inverse-Wishart for Gaussian CPDs, MCMC or VI for anything more exotic. The resulting model carries its parameter uncertainty through to predictions, which is particularly valuable in domains where calibrated uncertainty matters more than raw accuracy.

Structure learning: which edges belong in the graph

Parameter learning assumes the graph is given. Structure learning asks the harder question: which edges should the graph have? Two families of approaches — score-based (assign a score to every graph, search for the highest-scoring) and constraint-based (test conditional independences in the data, assemble a graph consistent with them) — dominate. Causal structure discovery, the deeply hard problem of recovering the causal DAG from observational data, is one of the most active frontiers in machine learning. Structure learning is where PGMs touch causal inference, symbolic reasoning, and scientific modelling.

The search space is enormous

The number of DAGs on n nodes grows super-exponentially — more than n!, more than 2^(n choose 2) — and enumerating them is hopeless past about ten nodes. Structure-learning algorithms work around this by searching over equivalence classes (CPDAGs), restricting attention to graphs with bounded in-degree, using greedy heuristics, or imposing a topological order.

Score-based search

Assign a score to each candidate graph that balances fit and complexity: the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), or the Bayesian Dirichlet equivalent (BDe) score for discrete networks. Search the space with hill-climbing, tabu search, genetic algorithms, or dynamic programming on topological orders. The canonical efficient algorithm is GES (Greedy Equivalence Search, Chickering 2002), which is provably consistent in the infinite-data limit under mild assumptions. Bayesian model averaging — averaging predictions over many graphs weighted by their posterior probability — sidesteps the "which graph is right?" question but requires MCMC over the graph space.

Constraint-based learning

Test conditional independences in the data using the chi-squared test (discrete) or partial correlation (Gaussian) and assemble a graph whose independence structure matches the tests. The canonical algorithm is the PC algorithm (Spirtes, Glymour & Scheines 1991): start from a complete undirected graph, remove edges whenever a conditioning set is found that renders the endpoints independent, then orient the remaining edges using collider patterns. The FCI algorithm (Fast Causal Inference) extends PC to handle latent confounders.

Causal discovery versus Markov-equivalent DAGs

Without interventions, the best any algorithm can do from observational data is recover the Markov equivalence class of the true graph — the set of DAGs that imply the same conditional independences. All DAGs in the class are statistically indistinguishable; only interventional or temporal information can single out the causal one. This is a fundamental limit, not a shortcoming of algorithms. Modern causal-discovery work extends the space with assumptions that break the symmetry: linear non-Gaussian models (LiNGAM, Shimizu 2006), additive-noise models, and causal-representation-learning techniques that introduce auxiliary variables (interventions, instruments, temporal structure).

Structure learning for MRFs

For Markov random fields, the canonical tool is the graphical Lasso (Friedman, Hastie & Tibshirani 2008): fit a sparse Gaussian inverse covariance by ℓ₁-penalised maximum likelihood. Zero entries in the inverse covariance correspond to missing edges in the Gaussian MRF. The same idea extends to discrete pairwise MRFs with group-Lasso penalties. Graphical Lasso is one of the few structure-learning methods that scale to hundreds or thousands of variables and is widely used in neuroscience and genomics.

When structure learning works — and when not. Structure learning is reliable when the sample size is large relative to the number of variables, the graph is sparse, the conditional-independence assumptions hold, and there are no unmeasured confounders. It is fragile when any of these fail — which is much of the time in observational data. In practice, expert-elicited structure, followed by parameter learning, often beats end-to-end structure learning.

Hidden Markov models: discrete latent-state sequences

A hidden Markov model is the PGM for sequences of observations generated by an unseen discrete state. The hidden state evolves as a Markov chain; each state emits an observation from a state-dependent distribution. HMMs dominated speech recognition, bioinformatics, and part-of-speech tagging from the 1970s through the 2010s, and remain the canonical first-pass model for any sequence where latent discrete dynamics are plausible. Every ML engineer should be able to draw an HMM, recognise the forward–backward algorithm, and explain Baum–Welch without hesitation.

Definition

An HMM is specified by: a set of hidden states S = {1, …, K}; an initial state distribution π = P(Z₁); a transition matrix A with A[i,j] = P(Zₜ = j | Zₜ₋₁ = i); an emission distribution P(Xₜ | Zₜ = i). The graph is a chain: Z₁ → Z₂ → … → Z_T, with each Zₜ → Xₜ. The joint factorises as

P(Z, X) = P(Z₁) · ∏ₜ P(Zₜ | Zₜ₋₁) · ∏ₜ P(Xₜ | Zₜ).

Three canonical problems

Evaluation: given observations X, compute P(X | θ) — useful for likelihood ratios and classifier construction. Solved by the forward algorithm, a dynamic program with runtime O(TK²).

Decoding: given observations X, find the most probable state sequence Z* = arg max_Z P(Z | X). Solved by the Viterbi algorithm (max-product on the chain), also O(TK²).

Learning: given observations and no state labels, estimate the parameters (π, A, B). Solved by the Baum–Welch algorithm — EM on the HMM, with the E-step being forward–backward and the M-step being weighted re-estimation of π, A, and the emission parameters.

Forward–backward

The forward variable αₜ(i) = P(X₁,…,Xₜ, Zₜ = i) and the backward variable βₜ(i) = P(Xₜ₊₁,…,X_T | Zₜ = i) satisfy recursions that compute all state marginals P(Zₜ = i | X) in a single pass forward and a single pass backward. This is exactly the sum-product algorithm (section 6) specialised to a chain-shaped factor graph, and it is identical in structure to Kalman smoothing for continuous-state models.

Where HMMs have shone

Speech recognition: phones as hidden states, acoustic features as observations — the entire pre-neural ASR pipeline. Part-of-speech tagging and named-entity recognition: tags as hidden states, words as observations. Bioinformatics: the CpG island problem, profile HMMs for protein family detection, and gene-finding (GENSCAN). Change-point detection and regime-switching models in finance. Activity recognition from wearable sensors. Molecular motor modelling from single-molecule experiments.

When HMMs are the wrong tool

HMMs impose two assumptions that are often wrong. First, the Markov assumption: the state at time t depends only on the state at t−1. Real processes have long-range dependencies. Second, the discrete-state assumption: the state is one of K categories. Continuous-state dynamics require Kalman filters (next section); long-range dependencies require higher-order HMMs, semi-Markov models, or neural sequence models (Chapter V). The decline of HMMs in speech and NLP is directly traceable to neural models' ability to handle both.

Kalman filters and state-space models

The continuous-state analogue of the HMM is the linear dynamical system, or linear-Gaussian state-space model: a continuous hidden state evolves linearly with Gaussian noise, and a continuous observation is a linear function of the state with Gaussian noise. The inference algorithm — the Kalman filter — is so efficient and so numerically stable that it is the default tool for tracking, navigation, and sensor fusion in every field that uses them. It put humans on the moon in 1969 and is the algorithmic backbone of every GPS unit, every inertial navigation system, and a great deal of modern state-space-based sequence modelling.

The linear-Gaussian model

The state-space equations:

Zₜ = A Zₜ₋₁ + wₜ, wₜ ~ N(0, Q)

Xₜ = H Zₜ + vₜ, vₜ ~ N(0, R)

with A the transition matrix, Q the process-noise covariance, H the observation matrix, R the observation-noise covariance. All hidden and observed variables are jointly Gaussian, which is what makes inference tractable — the entire filter is linear algebra on mean vectors and covariance matrices.

The Kalman filter in two sentences

At each step, the posterior over the current hidden state is a Gaussian with some mean and covariance. The predict step pushes this Gaussian forward through the transition: ẑₜ₊₁⁻ = A ẑₜ, P̂ₜ₊₁⁻ = A Pₜ Aᵀ + Q. The update step combines the predicted Gaussian with the new observation using Bayes' rule for Gaussians: compute the Kalman gain K = P̂⁻ Hᵀ (H P̂⁻ Hᵀ + R)⁻¹, and update ẑ = ẑ⁻ + K(x − H ẑ⁻), P = (I − KH) P̂⁻. Iterate. The filter is optimal (minimum-variance unbiased) for any linear-Gaussian system.

Smoothing

The Kalman filter is an online procedure — it computes the posterior over each hidden state given observations up to and including the current one. Kalman smoothing (the Rauch–Tung–Striebel algorithm) is the offline counterpart: given all observations, compute the posterior over every hidden state given the full sequence. It is the direct analogue of forward–backward for HMMs, and it is what you want when you can process the whole sequence at once.

Extensions for non-linear systems

Real dynamics are rarely exactly linear. The Extended Kalman Filter (EKF) linearises the transition and observation functions around the current estimate at each step. The Unscented Kalman Filter (UKF, Julier & Uhlmann 1997) uses a deterministic sample of sigma points propagated through the non-linear dynamics — more accurate than linearisation at comparable cost. For strongly non-linear or non-Gaussian systems, particle filters (sequential Monte Carlo) replace the Gaussian approximation with a weighted set of samples and are the go-to tool for robot localisation in non-parametric settings.

Kalman filters and modern sequence models

The mathematical lineage from the Kalman filter to modern state-space sequence models — Linear State-Space Layers, S4 (Gu, Goel & Ré 2022), Mamba (Gu & Dao 2023) — is direct: each is a linear dynamical system whose state dimension and transition structure have been adapted for the efficient processing of long sequences, and for the parameters to be learned end-to-end from gradients rather than estimated from physical first principles. The state-space revival in deep sequence modelling is one of the clearest examples of how classical PGM machinery keeps returning under new names.

Why Kalman endures. In engineering, most problems are "nearly linear, nearly Gaussian" if you pick the right coordinates. When that holds, the Kalman filter is not just a reasonable approximation — it is optimal. No other inference algorithm has a comparable claim on a comparable class of real-world problems.

Conditional random fields

A hidden Markov model is generative — it models the joint P(X, Z), which includes modelling how the observations were produced. For structured prediction — labelling every token in a sentence, every pixel in an image, every amino acid in a protein — we typically do not care about P(X); we care about P(Z | X), the distribution over labels given the input. Conditional random fields (Lafferty, McCallum & Pereira 2001) are the discriminative counterpart: undirected graphical models of the conditional distribution, with rich input-dependent features. CRFs dominated structured-prediction benchmarks in NLP and computer vision through the 2000s and remain the classical baseline.

The model

A CRF on observations X and labels Z defines

P(Z | X) = (1/Z(X)) · exp( Σₜ Σ_k θ_k f_k(Zₜ, Zₜ₋₁, X, t) ),

where the features f_k can depend on the current label, the previous label, the entire observation sequence X, and the position t. For a linear-chain CRF, the graph over labels is a chain — identical in structure to an HMM, but with one-sided conditioning on X. The partition function Z(X) depends on the input, which is the key advantage: rich, non-local input features can be used freely without worrying about modelling their joint distribution.

Why CRFs beat HMMs for labelling

Two reasons. First, HMMs require the observations to be conditionally independent given the states — a strong assumption that is usually wrong for rich inputs like words, where prefixes, suffixes, gazetteers, and capitalisation are all informative. CRFs sidestep this because they do not model P(X) at all. Second, HMMs are generative, so errors in modelling P(X) degrade label predictions; CRFs optimise the quantity we actually care about. Empirically, for tasks like named-entity recognition, part-of-speech tagging, and shallow parsing, CRFs were consistently 1–3 points more accurate than the analogous HMMs in the 2000s.

Inference and training

Inference on a linear-chain CRF is identical to forward–backward on an HMM — compute marginals for each label and each label pair. Decoding is Viterbi. Training maximises the conditional log-likelihood Σᵢ log P(Zⁱ | Xⁱ; θ) with L-BFGS or stochastic gradient descent. The gradient is the difference between the empirical feature expectation and the model's feature expectation — identical in structure to MRF training (section 11), but tractable because inference on each individual chain is cheap.

Beyond linear chains

Skip-chain CRFs add long-range edges between tokens likely to share labels (e.g., the same word appearing twice). Semi-Markov CRFs label segments rather than tokens and are the right tool when label boundaries need to be predicted explicitly. 2D CRFs on grid graphs are the structured-prediction backbone for semantic segmentation — conditional random fields over pixel labels with pairwise potentials encoding "adjacent pixels prefer similar labels". In the 2010s, neural architectures (bi-LSTMs, then transformers) replaced the hand-crafted features, and the CRF layer moved from the whole model to a final structured-output head — the "BiLSTM-CRF" architecture was state-of-the-art for sequence labelling from 2015 to about 2018 and is still a strong, cheap baseline.

Topic models: latent Dirichlet allocation and descendants

When PGM ideas met text mining, the result was latent Dirichlet allocation (Blei, Ng & Jordan 2003), the canonical unsupervised model for large document collections. Each document is modelled as a mixture over a small number of latent topics; each topic is a distribution over words; inference recovers both the topic-word distributions and the per-document topic proportions. LDA became one of the most widely used probabilistic models of the 2000s and introduced a generation of researchers to variational and Gibbs inference.

The generative story

For each document d, draw a topic proportion vector θ_d ~ Dirichlet(α). For each topic k, draw a word distribution φ_k ~ Dirichlet(β). For each word position n in document d: draw a topic z_{d,n} ~ Categorical(θ_d); draw a word w_{d,n} ~ Categorical(φ_{z_{d,n}}). The graph is a directed plate model with three plates (topics, documents, words). The joint factorises as a product of local Dirichlet and categorical distributions, and inference — recovering the per-document topic proportions and per-topic word distributions from observed word counts — is the algorithmic challenge.

Two canonical inference routes

Collapsed Gibbs sampling (Griffiths & Steyvers 2004): integrate out the Dirichlet parameters analytically, then Gibbs-sample the topic assignment of each word given all other assignments. Each iteration is cheap (a single categorical resample per token) and the algorithm mixes well in practice. This is the classic LDA training algorithm in Mallet and GibbsLDA++.

Variational inference (Blei, Ng & Jordan 2003): mean-field VI in which θ, z, and φ each have their own variational factor. The updates are closed-form given conjugacy. Scales better to very large corpora than Gibbs because it parallelises cleanly and is the approach used in online LDA (Hoffman, Blei & Bach 2010) and the scikit-learn implementation.

Extensions

A zoo of variations. Dynamic topic models (Blei & Lafferty 2006) let topics evolve over time, which is useful for scientific-literature analysis. Correlated topic models replace the Dirichlet prior with a logistic normal to allow topic correlations. Hierarchical Dirichlet processes (Teh et al. 2006) make the number of topics itself an inferred quantity rather than a hyperparameter. Supervised LDA adds a label predicted from the topic proportions. Author-topic models combine document authorship information with topic structure. Neural topic models — ProdLDA, ETM (Embedded Topic Models), BERTopic — replace or augment the categorical-Dirichlet machinery with neural networks and word embeddings, and are the modern default for topic discovery.

LDA today

LDA is no longer the default for topic discovery — embedding-based methods (BERTopic, UMAP-over-document-embeddings-plus-clustering) have taken over the middle of the market — but the core idea of "document is a mixture of topics, topic is a distribution over words" remains influential. The Bayesian hierarchy, the plate notation, and the variational and Gibbs inference recipes introduced by LDA are now standard vocabulary across the field, and many modern neural generative-retrieval and neural-topic systems are LDA's great-grandchildren.

Why LDA mattered beyond topic modelling. LDA taught a generation of machine learners how to read a plate diagram, how to write down a generative story, and how to derive a mean-field variational update or a collapsed Gibbs sampler. Many researchers whose names are now associated with deep generative modelling cut their teeth on LDA and its extensions, and the vocabulary carries over: ELBO, amortised inference, reparameterisation, and variational lower bound all trace back to the mid-2000s PGM community.

PGMs in practice: toolkits, workflow, and when to use them

A PGM on paper and a PGM in production are different artefacts. The practical workflow is roughly: (i) decide whether a PGM is the right tool at all, (ii) choose a directed or undirected representation, (iii) elicit or learn the structure, (iv) fit the parameters, (v) choose an inference algorithm compatible with the scale and the kind of query you need, (vi) validate with held-out data and posterior predictive checks. Each step has well-understood tools and well-known traps.

When a PGM beats a neural network

PGMs win when the problem is small-data, the structure is known or elicitable, interpretability is non-negotiable, calibrated uncertainty is required, or when prior knowledge needs to be combined with data. They also win for causal reasoning: no deep architecture, however large, answers interventional queries without some graphical-model scaffolding. They lose on perceptual tasks, very large unstructured datasets, and anywhere the inductive biases of convolution or attention fit the problem better than anything a graph would supply.

The practical toolkit

For Bayesian inference at scale: Stan (the reference MCMC engine, first-rate NUTS implementation, strong diagnostics), PyMC (the Pythonic alternative, deeply integrated with PyData), NumPyro (JAX-backed, GPU-fast NUTS), Pyro (deep probabilistic programming on PyTorch, variational-first), TensorFlow Probability (VI and MCMC on TF). For discrete Bayesian networks: pgmpy (Pythonic, supports exact and approximate inference, structure learning, causal tooling), bnlearn (R, strong structure-learning methods), Hugin / Netica / GeNIe (commercial, production-grade inference engines). For structured-prediction CRFs: sklearn-crfsuite, PyTorch's CRF layer inside BiLSTM/transformer pipelines. For topic models: gensim, Mallet, scikit-learn.

Validation and diagnostics

Posterior predictive checks — simulate data from the fitted posterior and compare the simulated data to the observed — are the first-line sanity check. Calibration curves check whether the model's 80% credible intervals contain the truth 80% of the time. For MCMC, R̂ and effective sample size are mandatory. For structure learning, hold-out log-likelihood and stability across bootstraps. For causal claims, sensitivity analysis to unobserved confounders.

Common traps

The most frequent: omitted variables confound every dependency in the learned graph (no algorithm recovers what cannot be seen); a plausible-looking topology hides wrong priors (e.g., overly diffuse Dirichlets in LDA producing uninterpretable topics); mean-field VI gives overconfident posteriors and collapsing modes; MCMC reports R̂ near 1 while hiding a missed mode (run many chains from diverse starts); causal interpretation of a DAG learned from observational data without the caveats that apply; treating a graphical model as a generative simulator when the fit is only for predictive accuracy on a small set of marginals.

A small operational recipe

A good default workflow for a new problem: (1) sketch the generative story on a whiteboard with an expert; (2) implement it in PyMC or NumPyro with weakly informative priors; (3) run NUTS with four chains and look at diagnostics; (4) check posterior predictives against held-out data; (5) only then decide whether the model is good enough — in which case ship it — or whether the structure needs revision, in which case go back to step 1. Most of the time, the iteration is about the structure and the priors, not the inference algorithm.

Where it compounds in modern machine learning

Probabilistic graphical models are sometimes written off as an artefact of the pre-deep-learning era. This is wrong. The vocabulary of PGMs — latent variables, conditional independence, message passing, priors, marginal likelihood, the ELBO — is the vocabulary of modern probabilistic deep learning; several of the most important modern architectures are directly descended from PGM ideas; and the renaissance of causal and structured reasoning makes graphical thinking more central, not less.

Variational autoencoders and deep latent-variable models

A variational autoencoder (Kingma & Welling 2013) is a Bayesian network with two nodes (Z → X) whose conditional P(X | Z) is a deep neural network, and whose approximate posterior q(Z | X) is another neural network trained by amortised variational inference. The VAE is a PGM whose parameters are neural and whose inference is variational. The same template animates normalising flows (invertible neural networks that make P(X) tractable), diffusion models (hierarchies of denoising latents), and deep mixture-of-experts models.

State-space models in sequence modelling

The state-space revival — Structured State-Space Sequence models (S4, Gu et al. 2022) and Mamba (Gu & Dao 2023) — recasts sequence modelling as linear dynamical systems with clever parameterisations. This is the direct descendant of Kalman filtering, reborn with learned dynamics, GPU-efficient implementations, and subquadratic scaling in context length. It is currently the most promising alternative to transformer attention for very long contexts.

Neural structured prediction

CRF layers on top of neural encoders — the BiLSTM-CRF, the transformer-CRF — remain strong for sequence labelling, span extraction, and constrained decoding. The combination is the clearest hybrid in the field: let the neural network learn features from raw data, let the graphical model enforce structural constraints (valid label sequences, tree structure, parse validity). In computer vision, CRFs-as-RNN layers over segmentation logits (Zheng et al. 2015) were state-of-the-art for semantic segmentation for several years.

Probabilistic programming

A probabilistic programming language — Stan, PyMC, Pyro, Gen, Turing.jl, NumPyro — is a general-purpose language in which one writes a generative model as ordinary code, with built-in inference primitives. PPLs are the practical face of modern PGMs: you describe the structure by writing the sample statements, the compiler builds the factor graph, and automatic differentiation plus MCMC or VI do the rest. Probabilistic programming is where most working PGM research actually happens today.

Causal inference and scientific ML

Causal inference — the discipline of reasoning about interventions and counterfactuals — is built on graphical models. Pearl's do-calculus, structural causal models, front-door and back-door adjustments, instrumental variables, and modern causal-representation learning all use the vocabulary of DAGs. As ML moves from prediction to decision support — "what would happen if we deploy this policy?" — causal DAGs become unavoidable. In scientific ML, PGMs are the natural way to combine a known physical mechanism with a learned residual, and they appear as the graphical backbone of modern systems in drug discovery, protein design, and climate modelling.

What to remember

Three ideas carry over from this chapter into everything that follows. Factorisation: every tractable probabilistic model factors a big joint into smaller local pieces, and the structure of that factorisation is what makes inference feasible. Conditional independence: knowing which variables ignore which, given what, is what lets you design models that scale. Inference-as-optimisation versus inference-as-sampling: variational and Monte-Carlo methods are the two canonical strategies for reasoning under uncertainty when exact inference is out of reach. These ideas structure the rest of classical machine learning — kernel methods, feature engineering, model evaluation — and they structure much of modern deep probabilistic learning too.

Bridge to the next chapter. Probabilistic graphical models give us a structured way to reason about distributions with dozens to thousands of variables; kernel methods — the subject of Chapter 07 — give us a geometric way to reason about distributions in infinite feature spaces without ever writing them down. The two are cousins: kernel density estimation is a PGM with one kernel per data point, and Gaussian processes are graphical models over continuous fields. The shift of emphasis is from what variables depend on what to what functions live in what space, and the resulting family — SVMs, kernel ridge regression, Gaussian processes — occupies much of the remaining ground in classical ML.

Where to go next

Probabilistic graphical models sit at the intersection of probability theory, statistics, computer science, and (increasingly) deep learning, and the literature spans sixty years and three distinct communities. The references below split into anchor textbooks that define the field, foundational papers that introduced each major algorithm, modern extensions that carry the ideas into 2026, and the software documentation where most practical work happens. Read the Koller–Friedman textbook if you only read one thing.

The anchor textbooks

Probabilistic Graphical Models: Principles and Techniques

Daphne Koller & Nir Friedman · MIT Press · 2009

The definitive textbook on PGMs — 1270 pages of Bayesian networks, Markov random fields, exact and approximate inference, parameter and structure learning, and applications. Part I (Representation) develops the graphical formalism; Part II (Inference) is the encyclopedic reference for variable elimination, junction trees, loopy BP, MCMC, and variational methods; Part III (Learning) covers parameter and structure estimation from data; Part IV (Actions and Decisions) covers influence diagrams and decision networks. Dense but comprehensive; the only book that covers everything in this chapter at full depth. Read Chapters 2–4 first (representation and independence), then whichever inference and learning chapters apply to your problem.

MIT Press
Pattern Recognition and Machine Learning

Christopher Bishop · Springer · 2006 · free

The classical ML textbook with unusually strong PGM coverage. Chapter 8 (Graphical Models) introduces directed and undirected models, factor graphs, and the sum-product algorithm. Chapter 9 (Mixture Models and EM) develops EM from first principles using the EM–variational connection. Chapter 10 (Approximate Inference) is the best textbook treatment of variational methods at this length, covering mean-field, expectation propagation, and variational linear regression. Chapter 13 (Sequential Data) covers HMMs and linear dynamical systems including Kalman filtering. Bishop's notation and pedagogical clarity are exceptional.

Microsoft
Probabilistic Machine Learning: Advanced Topics

Kevin Murphy · MIT Press · 2023 · free HTML

Murphy's second volume is the modern PGM reference — updated for the deep-learning era. Part IV (Inference) covers variational inference, amortised inference, normalising flows, MCMC, HMC, and modern ELBO-based methods. Part V (Predictive Models) covers Gaussian processes, neural latent-variable models, and probabilistic programming. Part VI (Generation) links PGMs to VAEs, GANs, diffusion, and autoregressive models. Where Koller–Friedman is the classical reference, Murphy's volume is the bridge between classical PGMs and contemporary probabilistic deep learning. Free HTML version at probml.github.io.

probml.github.io MIT Press
Causality: Models, Reasoning, and Inference

Judea Pearl · Cambridge · 2nd ed. 2009

The book that founded modern causal inference as a formal field. Pearl's do-calculus, structural causal models, the front-door and back-door criteria, counterfactuals — all are developed here from the graphical-model perspective. Essential reading for anyone working on causal discovery, observational-data interventions, or the intersection of PGMs and causal reasoning. The companion book The Book of Why (Pearl & Mackenzie 2018) is the popular-audience version, and Probabilistic Reasoning in Intelligent Systems (Pearl 1988) is the book that invented Bayesian networks as we now know them.

Cambridge
Bayesian Data Analysis

Gelman, Carlin, Stern, Dunson, Vehtari, Rubin · 3rd ed. · 2013 · free PDF

The standard reference for applied Bayesian inference, with strong PGM content. Chapter 11 (Basics of Markov Chain Simulation) and 12 (Computationally Efficient MCMC) are the canonical treatments of MCMC for practitioners. Chapter 13 (Modal and Distributional Approximations) covers variational and Laplace methods. Chapter 20+ covers hierarchical models, which are PGMs under a different name. Bayesian Data Analysis pairs exceptionally well with the Stan ecosystem — most of the examples are implementable directly in Stan.

Free PDF

Foundational papers

Fusion, Propagation, and Structuring in Belief Networks

Judea Pearl · Artificial Intelligence · 1986

Pearl's original paper on belief propagation — message passing on a polytree for probabilistic reasoning. Introduces the core ideas that became the sum-product algorithm, the lambda–pi messages, and the connection to Bayesian updating. The paper is a touchstone for the whole PGM field. Read alongside Pearl's 1988 book Probabilistic Reasoning in Intelligent Systems, which is the full book-length development of Bayesian networks.

UCLA PDF
Local Computations with Probabilities on Graphical Structures and their Application to Expert Systems

Lauritzen & Spiegelhalter · JRSS-B · 1988

The paper that introduced the junction-tree algorithm for exact inference on general Bayesian networks. Develops moralisation, triangulation, and clique trees, together with the two-pass message-passing scheme that computes all marginals exactly. Heavy reading but the canonical reference; the algorithm it describes is the one implemented in Hugin, Netica, and every classical Bayesian-network inference engine.

JSTOR
Maximum Likelihood from Incomplete Data via the EM Algorithm

Dempster, Laird, Rubin · JRSS-B · 1977 · free

The EM paper. Unifies a huge collection of prior algorithms (Baum–Welch, factor-analysis iteration, mixture-model fitting) under a single framework and proves the monotone-improvement property. One of the most-cited papers in statistics. The presentation is terse by modern standards but still readable. Pair with Neal & Hinton's 1998 paper A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants for the modern variational perspective.

PDF
A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition

Lawrence Rabiner · Proceedings of the IEEE · 1989 · free

The most cited single reference on HMMs. Walks through the three canonical problems (evaluation, decoding, learning), the forward–backward and Viterbi algorithms, and Baum–Welch estimation, with worked numerical examples and a clear pedagogical style. The tutorial that taught a generation of speech-recognition engineers how to think about hidden states. Despite its age, the treatment is still the most accessible single entry into HMMs.

PDF
A New Approach to Linear Filtering and Prediction Problems

Rudolf Kalman · Journal of Basic Engineering · 1960 · free

The Kalman filter paper. Written in engineering notation rather than statistical notation, and the connection to Bayesian inference took years to be fully recognised, but the algorithm itself is the one still in use. The filter was used on the Apollo program for lunar navigation and is the direct ancestor of modern state-space sequence models. Short, concrete, foundational.

PDF
Equation of State Calculations by Fast Computing Machines & Monte Carlo Sampling Methods Using Markov Chains

Metropolis et al. · J. Chem. Phys. · 1953 / Hastings · Biometrika · 1970

The two papers that founded MCMC. Metropolis et al. introduced the acceptance-rejection idea for sampling Boltzmann distributions in statistical physics; Hastings generalised it to arbitrary target distributions with arbitrary proposals, giving the Metropolis–Hastings algorithm that remains the most general MCMC scheme. Read Hastings first — it is the version of the algorithm you will actually implement. Pair with Geman & Geman's 1984 Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images, which introduced Gibbs sampling.

Metropolis PDF Hastings JSTOR
Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Lafferty, McCallum, Pereira · ICML · 2001 · free

The CRF paper. Introduces the discriminative counterpart of HMMs for structured prediction, solves the label-bias problem that plagued earlier discriminative sequence models, and opens the decade of CRF-dominated NLP. The linear-chain CRF formulation and the connection to generalised iterative scaling are both here. Essential reading for anyone doing sequence labelling, named-entity recognition, or structured prediction. Pair with Sutton & McCallum's An Introduction to Conditional Random Fields (2011) for the tutorial-length treatment.

PDF Tutorial
Latent Dirichlet Allocation

Blei, Ng, Jordan · JMLR · 2003 · free

The LDA paper. Introduces the three-plate generative story for documents, derives the variational-Bayes inference procedure, and presents experiments on news corpora. One of the most influential probabilistic models of the 2000s and the gateway drug to modern variational inference. Pair with Griffiths & Steyvers's 2004 Finding Scientific Topics for the collapsed-Gibbs sampling alternative, and with Hoffman, Blei, and Bach's 2010 Online Learning for Latent Dirichlet Allocation for the scalable online version.

JMLR
Auto-Encoding Variational Bayes

Kingma & Welling · ICLR · 2014 · free

The VAE paper. Introduces amortised variational inference via a neural encoder network, the reparameterisation trick for gradients through expectations, and the ELBO objective that every modern variational deep generative model uses. The single most important PGM–deep-learning bridge of the last fifteen years. Short and dense; worth reading multiple times. Pair with Rezende, Mohamed & Wierstra's Stochastic Backpropagation and Approximate Inference in Deep Generative Models (2014), the concurrent independent derivation.

arXiv
Stochastic Variational Inference

Hoffman, Blei, Wang, Paisley · JMLR · 2013 · free

The paper that scaled variational inference to massive datasets. Replaces full-data gradient computations with mini-batch stochastic estimates, making VI feasible for corpora with millions of documents and models with millions of parameters. The connection between SVI, natural gradients, and the conjugate-exponential-family structure is developed in full. Pair with Hoffman & Blei's 2015 Structured Stochastic Variational Inference for the modern extension that relaxes mean-field.

JMLR
The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo

Matthew Hoffman & Andrew Gelman · JMLR · 2014 · free

The NUTS paper — the algorithm that made Hamiltonian Monte Carlo actually usable by automating the two awkward hyperparameters (step size, trajectory length). NUTS is the inference engine inside Stan and PyMC, and the reason Bayesian modelling with continuous latents is a commodity rather than a specialist sport. Pair with Neal's 2011 chapter MCMC Using Hamiltonian Dynamics (in the Handbook of Markov Chain Monte Carlo) for the pedagogical introduction to HMC itself.

JMLR Neal HMC
An Algorithm for Fast Recovery of Sparse Causal Graphs & Causation, Prediction, and Search

Spirtes, Glymour, Scheines · Social Science Computer Review 1991 / MIT 2nd ed. 2000

The PC algorithm paper (and the book-length treatment that extended it). Constraint-based causal discovery from observational data: start with a complete graph, remove edges via conditional-independence tests, orient via collider patterns. The foundation of constraint-based structure learning, and the ancestor of FCI, GES, and the modern causal-discovery toolkit. Pair with Chickering's 2002 Optimal Structure Identification with Greedy Search for the GES score-based alternative.

MIT

Modern extensions

Efficiently Modeling Long Sequences with Structured State Spaces (S4)

Gu, Goel, Ré · ICLR · 2022 · free

The paper that brought state-space models back to the centre of sequence modelling. S4 is a parametric linear dynamical system whose transition matrix is chosen to capture long-range dependencies efficiently; it handles sequences of length 16 000 and more, beats transformers on the Long Range Arena benchmark, and established the lineage that leads to Mamba. The Kalman-filter ancestry is explicit. Pair with Gu & Dao's Mamba: Linear-Time Sequence Modeling with Selective State Spaces (2023) for the input-dependent selective variant that has become the most serious non-transformer contender.

S4 arXiv Mamba arXiv
Variational Inference with Normalizing Flows

Rezende & Mohamed · ICML · 2015 · free

The normalising-flows paper. Replaces the mean-field variational family with an expressive, invertible transformation of a base distribution, yielding variational posteriors with arbitrary flexibility. One of the two key ideas (with the reparameterisation trick) that made deep variational inference powerful. Pair with Papamakarios et al.'s 2021 Normalizing Flows for Probabilistic Modeling and Inference for the comprehensive modern survey.

arXiv Survey
Conditional Random Fields as Recurrent Neural Networks

Zheng et al. · ICCV · 2015 · free

The paper that fused dense CRFs with deep convolutional networks for semantic segmentation, setting state of the art for several years. Reformulates mean-field CRF inference as a recurrent network layer that can be trained end-to-end with the rest of the CNN. A canonical example of the "neural features + graphical-model structured output" template. Pair with Krähenbühl & Koltun's 2011 Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials for the fast dense-CRF inference machinery.

arXiv
Pyro: Deep Universal Probabilistic Programming

Bingham et al. · JMLR · 2019 · free

The Pyro paper. Describes the design of a PyTorch-based probabilistic programming language that supports both deep generative models and classical Bayesian inference. The focus on stochastic variational inference with neural q-families, the support for higher-order programs (models that sample other models), and the integration with automatic differentiation make Pyro the natural successor to Stan for deep-latent-variable work. Pair with the NumPyro paper (Phan, Pradhan, Jankowiak 2019) for the JAX-accelerated sibling.

JMLR NumPyro
Causal Effect Inference with Deep Latent-Variable Models

Louizos, Shalit, Mooij, Sontag, Zemel, Welling · NeurIPS · 2017 · free

A representative of the modern "causal inference meets deep latent-variable models" literature. Uses a VAE to learn a latent confounder representation for causal-effect estimation from observational data. Illustrates the continuing relevance of Pearl's causal-inference framework — now with deep-learning machinery for the non-parametric pieces. Pair with Schölkopf et al.'s 2021 Toward Causal Representation Learning for the field-shaping position paper.

arXiv CRL survey
Denoising Diffusion Probabilistic Models

Ho, Jain, Abbeel · NeurIPS · 2020 · free

The paper that made diffusion models central to modern generative modelling. A diffusion model is a deep hierarchical Bayesian network with a chain of latent variables, each a noised version of the data, and a variational lower bound on the data likelihood. The training objective reduces to a denoising regression. The PGM lineage (hierarchical latents, variational training) is explicit. Pair with Song & Ermon's 2019 Generative Modeling by Estimating Gradients of the Data Distribution (score-based view) and Kingma et al.'s 2021 Variational Diffusion Models (VDM) for the variational reformulation.

arXiv
The Book of Why

Judea Pearl & Dana Mackenzie · Basic Books · 2018

Pearl's popular-audience exposition of causal inference and the "three rungs" of the ladder of causation (observation, intervention, counterfactual). Not a technical reference, but the single best introduction to why graphical thinking matters for causal reasoning. Read it before Causality if you need the motivation. Pair with Hernán & Robins's Causal Inference: What If (2023, free online) for the epidemiology-driven applied treatment.

Basic Books What If

Software and documentation

Stan

Stan Development Team · open source · C++ core / interfaces in R, Python, Julia

The reference implementation of Bayesian inference at scale. Stan is a probabilistic programming language with a first-class NUTS implementation, strong diagnostics, and a large ecosystem of extensions (brms, rstanarm, cmdstanpy, PyStan). The Stan User's Guide and Stan Functions Reference are the definitive working documentation; the Stan Reference Manual covers the language formally. For anyone serious about Bayesian modelling, Stan is the tool.

mc-stan.org User's Guide
PyMC

PyMC Developers · open source · Python · PyTensor backend

The Pythonic Bayesian-inference stack. PyMC provides a declarative API for writing generative models, NUTS and variational backends, and deep integration with the PyData ecosystem (NumPy, pandas, xarray, ArviZ). The PyMC Examples gallery is an unusually rich library of working models — hierarchical regressions, mixture models, topic models, time-series models, causal inference. For Python-first workflows this is the default.

pymc.io Examples
Pyro and NumPyro

Uber AI / JAX community · open source · Python · PyTorch / JAX

Deep probabilistic programming. Pyro is built on PyTorch and specialises in variational inference with neural q-networks; NumPyro ports the same API to JAX and is typically 5–10× faster for NUTS workflows. Both excel at deep latent-variable models, neural structured prediction, and the hybrid PGM-deep-learning regimes where Stan starts to feel constrained. The Pyro Deep Universal Probabilistic Programming tutorials are the gateway.

pyro.ai NumPyro
pgmpy

Ankur Ankan et al. · open source · Python

The Python library for discrete Bayesian networks and Markov networks. Supports exact (variable elimination, junction tree) and approximate (likelihood weighting, Gibbs) inference, parameter learning (MLE, Bayesian estimation), structure learning (PC, GES, hill-climbing), and causal inference tooling. The natural starting point when the problem is a small-to-medium discrete Bayesian network rather than a continuous Bayesian model. Pair with bnlearn (Scutari's R package) for the most comprehensive structure-learning toolkit, or the causal-inference-focused DoWhy library.

pgmpy.org bnlearn DoWhy
hmmlearn, sklearn-crfsuite, and gensim

Python ecosystem · open source

The working-programmer toolkit for the specialised PGMs of sections 13, 15, and 16. hmmlearn implements Gaussian and discrete HMMs with a scikit-learn–style API. sklearn-crfsuite wraps the CRFsuite implementation of linear-chain CRFs for sequence labelling with the scikit-learn API. gensim is the NLP Swiss army knife that includes efficient LDA, HDP, and related topic-model implementations, plus the fastest Python implementations of word2vec-family embeddings. For ad-hoc HMM, CRF, and topic-model work that does not need a full Bayesian-inference stack, these are the fast path.

hmmlearn sklearn-crfsuite gensim
ArviZ and the Bayesian-workflow handbook

Kumar, Carroll, Hartikainen, Martin · open source · Python / multi-lingual

ArviZ is the universal diagnostic library for Bayesian inference — R-hat, ESS, energy diagnostics, posterior predictive checks, trace and rank plots — across Stan, PyMC, NumPyro, Pyro, and Turing. The companion Bayesian Workflow paper (Gelman, Vehtari, Simpson et al. 2020) codifies the full modelling-iterating-validating cycle that modern practitioners actually use. These are the tools and the recipe for evaluating every PGM you fit.

ArviZ Bayesian Workflow

This page is Chapter 06 of Part IV: Classical Machine Learning. Chapter 07 turns from graph-structured reasoning to kernel methods and support vector machines — a different way of handling non-linearity, rooted in Hilbert-space geometry rather than graphical factorisation. Where PGMs ask "what variables depend on what, and how do we compute the consequence?", kernels ask "what functions live in what space, and what margin separates my classes?" The two traditions together account for most of classical ML's remaining ground; when they are joined with the feature-engineering and model-evaluation toolkit of Chapters 08–09, you will have the full classical-ML picture, against which the deep-learning revolution of Part V can be properly set.

How to read this chapter

Contents

Why probabilistic graphical models: making big joints tractable

Four things a graph gives you

When PGMs are the right tool

A map of the rest of the chapter

Graphs factorise joint distributions: the core move

The chain rule is the starting point

The same idea without direction

The representation determines the complexity

Bayesian networks: directed acyclic graphs over variables

Definition and the factorisation

The canonical example

Local and global Markov properties

Causal interpretation

Inference queries

Conditional independence and d-separation

Three canonical three-variable patterns

The d-separation criterion

Markov blanket

I-maps and faithfulness

Markov random fields: undirected models

Definition

The Hammersley–Clifford theorem

Log-linear form

Canonical examples

Directed versus undirected

Factor graphs: the unifying bipartite representation

Definition

Why bipartite matters for inference

The sum-product algorithm

Max-product and MAP inference

Loopy belief propagation

Exact inference: variable elimination and belief propagation

Variable elimination

Belief propagation on trees

Complexity

Gaussian graphical models are special

The junction-tree algorithm: exact inference on general graphs

Three-step construction

Belief propagation on the junction tree

When junction trees work — and when they do not

Why it still matters

Variational inference: posteriors as optimisation

The ELBO

Mean-field variational inference

Stochastic and amortised variants

ELBO gradients and the reparameterisation trick

Strengths and limitations

Markov-chain Monte Carlo: sampling the posterior

Metropolis–Hastings

Gibbs sampling

Hamiltonian Monte Carlo

Diagnostics

When to use MCMC

Parameter learning: MLE, MAP, Bayesian updates, EM

Maximum likelihood in Bayesian networks

Maximum likelihood in Markov random fields

MAP and Bayesian estimation

Expectation-maximisation

Small data, conjugate priors, full posteriors

Structure learning: which edges belong in the graph

The search space is enormous

Score-based search

Constraint-based learning

Causal discovery versus Markov-equivalent DAGs

Structure learning for MRFs

Hidden Markov models: discrete latent-state sequences

Definition

Three canonical problems

Forward–backward

Where HMMs have shone

When HMMs are the wrong tool

Kalman filters and state-space models

The linear-Gaussian model

The Kalman filter in two sentences

Smoothing

Extensions for non-linear systems

Kalman filters and modern sequence models

Conditional random fields