Almost every interesting learning problem reduces to reasoning about a joint probability distribution over a large set of variables — pixels and object labels, words and topics, symptoms and diagnoses, genes and phenotypes, state variables and observations across time. Written out as a single table, that joint is astronomical: a distribution over twenty binary variables is already a million entries, and the worlds we actually care about contain millions of variables. Probabilistic graphical models rescue the enterprise by factoring the joint into small local pieces whose dependencies are laid out in a graph: each node is a random variable, each edge encodes a direct statistical relation, and the graph itself becomes a compact, readable, and computable representation of an otherwise intractable distribution. Two dialects dominate the field: Bayesian networks — directed acyclic graphs that factor the joint into a product of conditionals, the natural way to express causal or generative stories — and Markov random fields — undirected graphs that factor the joint into a product of local potentials, the natural way to express symmetric constraints and pairwise compatibility. A third, factor graphs, unifies the two for inference. On top of the representation sits a toolkit for answering queries about the distribution — exact inference (variable elimination, belief propagation, junction trees) when the graph is tractable and approximate inference (variational methods, MCMC) when it is not — and a toolkit for learning the parameters and sometimes the structure from data. The chapter ends with the dynamical-systems specialisations that dominated pre-deep-learning sequence modelling — hidden Markov models, Kalman filters, conditional random fields — and with topic models, the PGM idea that anchors most of modern unsupervised text mining. PGMs gave the field its vocabulary for latent variables, conditional independence, causal structure, and principled uncertainty; deep learning absorbed the vocabulary but could not replace it, and today the two live in productive dialogue in variational autoencoders, normalising flows, diffusion, and probabilistic programming.
Section one motivates the enterprise by walking through the four things a graphical representation of a joint distribution gives you — compactness, readability, a tractable inference story, and a principled way to combine data with prior knowledge — and where each of those properties becomes indispensable. Section two makes the key move concrete: a graph factorises a joint distribution into a product of smaller local distributions, and the factorisation is exactly what turns an intractable problem into a tractable one. Sections three through five introduce the two central representations — Bayesian networks (directed, one conditional per node, natural for causal stories) and Markov random fields (undirected, one potential per clique, natural for symmetric constraints) — and section four works through the conditional-independence structure that a directed graph encodes, captured by the d-separation criterion. Section six introduces the factor graph, a bipartite representation that unifies directed and undirected models and is the natural data structure for message-passing inference. Together these six sections give the representation side of the field: how to draw a distribution as a graph, and how to read factorisation and independence back off the graph.
Sections seven through ten are the inference toolkit. Section seven develops exact inference — variable elimination and its message-passing incarnation, the sum-product algorithm — and shows why it runs in time exponential in the graph's treewidth, tractable for trees and small-treewidth graphs and intractable for most of the rest. Section eight is the junction-tree algorithm, the general-purpose scheme for turning a graph into a tree of cliques on which exact inference can still be run. Sections nine and ten cover the two families of approximate inference that dominate when exact methods will not finish — variational inference (treat the posterior as an optimisation problem, minimise KL divergence against a tractable family, the route taken by mean-field VI, variational Bayes, and every modern VAE) and Markov-chain Monte Carlo (treat the posterior as a target to be sampled, use Gibbs sampling, Metropolis–Hastings, or Hamiltonian Monte Carlo, the route taken by Stan and PyMC). Sections eleven and twelve are the learning toolkit: parameter learning (maximum likelihood, maximum a posteriori, Bayesian updates, EM for latent variables) and structure learning (which edges should be in the graph at all — score-based search, constraint-based tests, Bayesian model averaging, and the stubborn difficulty of causal discovery from observational data).
Sections thirteen through sixteen turn to four landmark specialisations whose influence on the field is hard to overstate. Section thirteen is hidden Markov models, the PGM for discrete-state sequences, the engine of speech recognition from the 1970s until neural approaches caught up in the 2010s, and still the first tool to reach for when modelling latent-state dynamics. Section fourteen is the Kalman filter and the broader family of linear dynamical systems — the continuous-state analogue of HMMs, the algorithm that brought humans to the moon, and the underpinning of modern state-space models and the "S4/Mamba" lineage. Section fifteen is conditional random fields, the discriminative cousin of HMMs that dominated structured-prediction benchmarks in NLP and computer vision through the 2000s. Section sixteen is latent Dirichlet allocation and the broader family of topic models — the canonical application of Bayesian networks to unsupervised text understanding. Section seventeen surveys the practical toolkit — PyMC, Stan, Pyro, bnlearn, pgmpy — and the operational decisions that separate a working PGM from a non-starter. Section eighteen closes with where graphical-model thinking sits in modern ML: deep PGMs, variational autoencoders and normalising flows, neural structured prediction, the reemergence of state-space models in sequence modelling, probabilistic programming, and the causal-inference frontier.
Most of machine learning reduces to reasoning about a joint distribution over many variables. Written naively, those joints are astronomical — the distribution over twenty binary variables already has over a million entries, and real problems involve thousands or millions of variables. Probabilistic graphical models are the representation that rescues the enterprise: a graph whose nodes are variables and whose edges encode direct statistical dependencies, and a collection of small local distributions attached to that graph that, multiplied together, reproduce the full joint. The graph is the contract — it says which variables can ignore which others, and therefore which computations are cheap.
The first is compactness. A graph with maximum in-degree k over n binary variables is specified by at most n·2^k parameters rather than 2^n. For n=1000 and k=5, the difference is 32 000 parameters versus a number that does not fit in any computer. Every non-trivial probabilistic model fits in memory only because some such factorisation holds.
The second is readability. A PGM is literally a diagram of what depends on what. Domain experts who cannot read probability theory can still read the diagram, point at an edge, and say "that dependence is wrong" or "this variable should also feed into that one." No other representation of a probability distribution supports that conversation.
The third is inference. The graph structure determines which queries are cheap. For tree-structured graphs, computing any marginal or conditional is linear in the size of the graph. For graphs of small treewidth (the key complexity measure), exact inference still runs in polynomial time. For dense graphs, we fall back on approximation — variational methods and MCMC — but the fallback is itself guided by the graph.
The fourth is principled combination of prior and data. A graphical model specifies a prior, a likelihood, and a mechanism for updating beliefs given evidence; the posterior is what the graph becomes after inference, and it inherits all the compositional structure of the prior. When new data arrives, or new variables are observed, the update is well-defined and does not require relearning the model from scratch.
P(X₁,…,Xₙ) factorises as a product of small local distributions, one per node (Bayesian network) or per clique (Markov random field). The graph encodes which variables appear together in a local piece. Everything else — inference, learning, semantics — follows from that factorisation.
Graphical models shine when the structure of the problem is natural or known: causal chains, hidden states in time, known conditional independences, expert-elicited medical or diagnostic networks, pedigrees, spatial-dependence problems in vision (pixels depend on neighbours), or when uncertainty needs to be carried end-to-end through a pipeline. They also shine when data is scarce relative to variables, because the factorisation acts as a strong inductive bias. They are a weaker fit when the joint is highly non-linear and we have enormous data — the regime where deep learning has won — though the two families meet in variational autoencoders, normalising flows, and neural structured-prediction models, which are graphical models whose local pieces are neural networks.
The next five sections build the representation: how to turn a factorisation into a graph, how directed (Bayesian) and undirected (Markov) graphs differ, how to read conditional independence off a graph, and how factor graphs unify the two. Sections 7–10 develop the two inference families: exact (variable elimination, junction trees) and approximate (variational inference, MCMC). Sections 11–12 are learning: parameters from data and, harder, structure. Sections 13–16 are the four classic specialisations that dominated pre-deep-learning sequential and latent-variable modelling — HMMs, Kalman filters, CRFs, topic models. Sections 17–18 close with practice and the place of PGMs in the modern ML landscape.
Before committing to directed or undirected, pause on the single idea that makes the whole field work. A graph on variables X₁,…,Xₙ partitions the joint distribution into small pieces, and the full joint is the product of those pieces. The partition is what decouples variables that would otherwise be entangled; the product is what reassembles a coherent distribution from the pieces. Everything algorithmic in probabilistic graphical models — inference, learning, sampling — is a manipulation of that product, and the graph is the shape that tells us which manipulations are cheap.
For any ordering of variables, the joint can be written as a product of conditionals via the chain rule of probability:
P(X₁,…,Xₙ) = P(X₁) · P(X₂ | X₁) · P(X₃ | X₁,X₂) · … · P(Xₙ | X₁,…,Xₙ₋₁).
This is a tautology — it holds for every joint — and by itself it saves nothing: the last conditional still depends on all prior variables. The PGM insight is that, in almost every real problem, most of those conditionals do not truly depend on all the earlier variables but only on a small subset of them. If we can write
P(Xₖ | X₁,…,Xₖ₋₁) = P(Xₖ | Pa(Xₖ))
where Pa(Xₖ) ⊆ {X₁,…,Xₖ₋₁} is a small set of parents, we have a Bayesian network. The graph draws an edge from each parent to the child, which gives a directed acyclic graph, and the joint factorises as a product of local conditionals:
P(X) = ∏ₖ P(Xₖ | Pa(Xₖ)).
Directed factorisations are natural when one variable generates another — a disease causes a symptom, a topic generates a word, a hidden state emits an observation. For symmetric relationships — two pixels with similar colours are likely neighbours, two papers that share many keywords are probably about the same topic — a direction is awkward. The undirected counterpart is a Markov random field: it factorises the joint into a product of potential functions, one per clique in the graph, normalised by a partition function Z:
P(X) = (1/Z) · ∏_C ψ_C(X_C), where Z = Σ_X ∏_C ψ_C(X_C).
The potentials are not probabilities — they are arbitrary non-negative functions — and Z is the awkward price of using them. Computing Z is in general as hard as inference itself, which is why undirected models are often harder to work with than directed ones.
The graph is not merely a drawing of a factorisation — it is the factorisation, and the algorithmic cost of every query against the model is determined by its shape. A graph that is a tree admits exact inference in linear time. A graph whose treewidth is small admits exact inference in time exponential in the treewidth but polynomial in the number of nodes. A graph with high treewidth forces us to approximate. When designing a model, the shape of the graph is as much an engineering choice as the choice of parameters.
P(A,B,C,D,E) for a Markov chain A → B → C → D → E. How many parameters does the full joint have if every variable is binary? How many after factorisation? (Answer: the joint has 31 free parameters, the factorisation has 1+2+2+2+2 = 9. For a chain of n binary variables, the saving is 2ⁿ−1 → 2n−1.)
A Bayesian network — also called a belief network, causal network, or directed graphical model — is the PGM you reach for when the problem has a natural generative story: causes precede effects, latent states precede observations, parents produce children. Formally it is a directed acyclic graph over the variables together with a conditional probability distribution (CPD) for each node, given its parents. The joint is the product of those local CPDs, and the graph makes every conditional-independence assumption in the model visible at a glance.
A Bayesian network over variables X₁,…,Xₙ consists of (i) a directed acyclic graph G with a node per variable, and (ii) for each node Xᵢ a conditional probability distribution P(Xᵢ | Pa(Xᵢ)) where Pa(Xᵢ) is the set of parents of Xᵢ in G. The joint is then
P(X₁,…,Xₙ) = ∏ᵢ P(Xᵢ | Pa(Xᵢ)).
For discrete variables the CPDs are tables; for continuous variables they are parametric densities (Gaussians are canonical); for mixed models they can be any well-defined conditional distribution. A root node (no parents) has an unconditional prior P(Xᵢ).
Judea Pearl's textbook example: an alarm network. The variables are Burglary, Earthquake, Alarm, JohnCalls, MaryCalls. The story is: a burglary or an earthquake can set off the alarm; the alarm can make John or Mary call. The DAG has Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls. The joint
P(B,E,A,J,M) = P(B) · P(E) · P(A | B,E) · P(J | A) · P(M | A)
replaces 31 free parameters (the full binary joint minus the normalisation) with 1 + 1 + 4 + 2 + 2 = 10, and the graph says — correctly — that John and Mary are conditionally independent given the alarm, that the two causes are marginally independent, and that knowing a call increases our belief in an earthquake only indirectly via the alarm.
Two semantically equivalent ways of stating what a Bayesian network asserts. The local Markov property says each variable is conditionally independent of its non-descendants given its parents — intuitively, the parents screen off the ancestry. The global Markov property says conditional independence holds exactly when a graphical criterion called d-separation (next section) holds. The two are equivalent: every independence claimed by d-separation follows from the local property, and vice versa. The graph is a map of all the conditional independences implied by the factorisation.
A Bayesian network is a statistical object — a factorisation of a joint — and nothing in the definition commits to a causal reading. However, when the directed edges reflect causal mechanisms (B → A means "burglary causes alarm"), the network supports reasoning about interventions: what happens to JohnCalls if we manually set Alarm = true? Pearl's do-calculus formalises this, distinguishing observational probabilities P(X | Y) from interventional ones P(X | do(Y)). Causal structure learning from observational data — recovering the DAG when only joint samples are observed — is one of the hardest problems in the field and is covered in section 12.
Given a Bayesian network and some observed evidence E = e, the three queries that matter are: (i) the marginal P(Xᵢ | e) for a single variable; (ii) the most probable explanation or MAP — the assignment to all unobserved variables that maximises P(X_unobs | e); (iii) the conditional expectation E[f(X) | e] for some function f. All three are NP-hard in general but tractable on trees and small-treewidth graphs, which is what the next several sections develop.
The great conceptual gift of Bayesian networks is that conditional independence — the statistical property that would otherwise require painful algebra to check — can be read off the graph by a simple graphical criterion. The criterion is called d-separation (for "directed separation"), and it classifies every path between two sets of variables as either blocked (independence preserved) or active (dependence transmitted), conditioned on a third set of observed variables. Mastering d-separation is how you read a Bayesian network.
Every path through a DAG is composed of three types of triples, and each has a different behaviour under conditioning:
A chain is A → B → C. A and C are marginally dependent; conditioning on the middle variable B renders them independent (B screens off A from C).
A fork is A ← B → C. A and C share a common cause B; they are marginally dependent; conditioning on B renders them independent.
A collider (or v-structure) is A → B ← C. A and C are marginally independent — they have no shared cause — but conditioning on the collider B (or any descendant of B) makes them dependent. This is the surprising case and the source of most mistakes in graphical reasoning.
Two sets of variables X and Y are d-separated by a third set Z if every undirected path between a node in X and a node in Y is blocked by Z. A path is blocked if it contains either (i) a chain or fork whose middle node is in Z, or (ii) a collider whose middle node — and all its descendants — are not in Z. If every path is blocked, the two sets are conditionally independent given Z in every distribution that factors according to the graph.
The Markov blanket of a node X is the smallest set of variables that renders X independent of the rest of the graph. For a Bayesian network, it consists of the parents of X, the children of X, and the other parents of those children (the "co-parents"). For a Markov random field, it is simply the set of neighbours. The Markov blanket is the right notion of "local neighbourhood" for Gibbs sampling and for variational inference, because conditioning on it cuts a node free from everything else.
A graph is an I-map of a distribution if every independence the graph asserts is true in the distribution. A distribution is faithful to the graph if every independence in the distribution is captured by the graph. A graph is a perfect map if both hold. The CI structure of most real distributions is not perfectly representable by any DAG — there exist independences that cannot be drawn as a DAG — which is one of the motivations for undirected and factor-graph representations.
When relationships are inherently symmetric — two adjacent pixels prefer to take similar values, two co-authored papers share topic, two neighbours in a social network influence each other — the directed formalism becomes awkward. A Markov random field (also called a Markov network or undirected graphical model) solves this by factorising the joint over the cliques of an undirected graph, with a non-negative potential function on each clique and a global normalisation constant. The representation is natural for problems where "which way does the arrow go?" has no good answer.
A Markov random field over variables X₁,…,Xₙ consists of (i) an undirected graph G with a node per variable, and (ii) a non-negative potential function ψ_C(X_C) for each maximal clique C of G. The joint is
P(X) = (1/Z) · ∏_C ψ_C(X_C), Z = Σ_X ∏_C ψ_C(X_C).
The potential is an arbitrary non-negative function — unnormalised and with no probabilistic interpretation on its own. The partition function Z is the global normaliser; computing it is in general #P-hard, which is what makes undirected models harder than directed ones.
The cornerstone result, due to Hammersley and Clifford in 1971: a strictly positive distribution P(X) is a Markov random field with respect to graph G — meaning it satisfies the local Markov property that every variable is independent of the rest of the graph given its neighbours — if and only if it factorises over the cliques of G. The theorem is what lets the graph be both a representation of independence structure and a representation of a factorisation at the same time.
Potentials are often written in log-linear form. Define features f_k(X_Ck) for each clique and its potential, with weights θ_k:
P(X) = (1/Z) · exp( Σ_k θ_k · f_k(X_Ck) ).
This is the Gibbs distribution, the exponential family, and the log-linear model all at once. Conditional random fields (section 15), Ising models, and every "exponential family" density sit in this form. The log-likelihood is concave in θ, which means parameter learning is a convex optimisation — one of the pleasant properties of MRFs.
The Ising model is an MRF on a grid of binary spins: pairwise potentials prefer adjacent spins to align. It is the prototype for everything from ferromagnets to image denoising. The Potts model generalises it to multi-valued variables. In vision, Markov random fields with pairwise potentials are the classical model for image denoising, stereo matching, and segmentation — the energy E(X) = Σ unary(Xᵢ) + Σ pairwise(Xᵢ,Xⱼ) is minimised by techniques like graph cuts and iterated conditional modes. In NLP, conditional random fields (section 15) are MRFs conditioned on an input sequence.
Directed models are natural for causal or generative stories; undirected models are natural for symmetric compatibilities. Directed models always have a tractable Z = 1; undirected models pay a price computing Z. Directed models have structured, interpretable conditionals; undirected models have features and weights. Neither subsumes the other: the independence structures representable by a DAG and those representable by an undirected graph are different, and each has examples the other cannot represent. In practice, Bayesian networks dominate in medical diagnostics, expert systems, and causal inference; MRFs dominate in vision, spatial statistics, and structured prediction.
Directed and undirected graphs are two different pictures of a distribution. For inference, a third picture — the factor graph — turns out to be the cleanest. A factor graph is a bipartite graph with two kinds of nodes: variable nodes (circles) and factor nodes (squares), with an edge connecting a factor to each variable in its scope. Any factorisation — directed, undirected, or a mix — can be drawn this way, and the resulting graph is what modern message-passing inference algorithms actually operate on.
A factor graph for a joint distribution P(X) = (1/Z) · ∏_j f_j(X_j) has a variable node for each Xᵢ, a factor node for each f_j, and an edge between factor f_j and each variable in its scope X_j. Bayesian-network conditionals P(Xᵢ | Pa(Xᵢ)) become factors with scope {Xᵢ} ∪ Pa(Xᵢ); MRF clique potentials ψ_C(X_C) become factors with scope C. The bipartite form makes the factorisation explicit and distinguishes two cliques that happen to share the same variable set but come from different factors.
An MRF drawn as an undirected graph loses information when two different factors have overlapping scopes: the union appears as a single clique. A factor graph keeps them separate. This matters because message-passing algorithms — sum-product, belief propagation, variational message passing — run on the factor graph and send distinct messages along each edge. On a tree-structured factor graph, the sum-product algorithm computes all marginals exactly in two passes (leaves-to-root, then root-to-leaves) in time linear in the number of edges.
Messages flow in two directions. Variable-to-factor: a variable X sends to a neighbouring factor f the product of messages it receives from all its other factor neighbours. Factor-to-variable: a factor f sends to a neighbouring variable X the sum over all its other variables of the product of f with the incoming variable messages. Iterate until the messages stop changing (one pass suffices on a tree). The marginal of each variable is then the product of all incoming messages at that variable node. This is the sum-product algorithm, also known as belief propagation.
Replacing the sums with maxes turns the algorithm into max-product, which computes the most probable assignment (the MAP configuration) rather than marginals. This is the message-passing generalisation of the Viterbi algorithm used in HMMs, and of the many shortest-path algorithms used for decoding in convolutional codes and NLP taggers.
On a graph with cycles, running the sum-product algorithm as if the graph were a tree is no longer exact — the messages can oscillate, the marginals are wrong — but it often works remarkably well in practice, and there are settings (Gaussian graphical models, certain log-linear models) where it can be shown to converge to sensible approximations. Loopy belief propagation is one of the cheap approximate-inference workhorses, and its relationship to variational inference (section 9) is one of the surprising connections in the field.
Given a factorisation and some evidence, the fundamental query is "what is the marginal (or MAP) of a subset of variables?" Exact inference is what we have when the graph is small enough, or tree-like enough, that the answer can be computed without approximation. Two algorithms do most of the work: variable elimination, a non-recursive algebraic procedure that sums out variables one at a time; and belief propagation (sum-product on a factor graph), a distributed message-passing procedure. They compute the same thing on a tree in the same asymptotic time; they differ in what they make reusable across queries.
To compute the marginal P(Q | E), multiply together all factors, fix the evidence variables, and sum out the non-query non-evidence variables one at a time. Efficiency depends on the elimination order: a good order produces intermediate factors that stay small; a bad order produces intermediate factors the size of the joint. Finding the optimal order is NP-hard in general, but good heuristics exist (min-fill, min-weight). The runtime is exponential in the induced treewidth of the graph — the size of the largest intermediate factor generated during elimination — which is why treewidth is the complexity measure of graphical-model inference.
On a tree, the sum-product algorithm is exact. Two message passes — one from leaves to root, one from root to leaves — produce every marginal simultaneously. The runtime is linear in the number of variables and the complexity of the factors. For discrete variables with k states and pairwise potentials, the cost is O(nk²). On an HMM drawn as a chain, this is exactly the forward–backward algorithm (section 13). The fact that two quite different-looking algorithms — forward–backward, the Viterbi algorithm, Kalman smoothing, sum-product on a tree — are all the same message-passing algorithm on different factor graphs was one of the unifying discoveries of the 1990s.
Exact inference is NP-hard in general graphical models (Cooper 1990). Even approximating the marginals to within a fixed factor is NP-hard in the worst case (Dagum and Luby 1993). Tractable cases: trees and polytrees (linear), small treewidth (exponential in treewidth only), singly-connected graphs, chains, grids of bounded width. Everything else requires approximation. The practical upshot: exact inference is the right tool for small expert systems, HMMs, small Bayesian networks, and Gaussian graphical models whose structure is tree-like; for medium and large networks, use variational inference or MCMC.
When all variables are jointly Gaussian, the joint is completely specified by its mean vector and covariance (or precision) matrix, and inference reduces to linear algebra. Conditional independence in a Gaussian corresponds exactly to zero entries in the precision matrix. Conditioning, marginalising, and computing maximum-likelihood parameters are all matrix operations. The Kalman filter (section 14) is the prototypical Gaussian graphical model, and the broader machinery underpins graphical Lasso, Gaussian Markov random fields in spatial statistics, and much of modern structured linear algebra.
Variable elimination is efficient on trees and painful on anything else. The junction tree algorithm is the systematic way to extend exact inference to graphs with cycles: transform the original graph into a tree whose nodes are clusters (maximal cliques) of the original variables, then run belief propagation on that tree. The result is exact inference in time exponential in the treewidth of the graph rather than exponential in the number of variables — tractable whenever the treewidth is small.
First, moralise the graph: for a directed model, marry co-parents (connect all parents that share a child) and drop edge directions. Second, triangulate it: add chords until no cycle of length four or more is chordless. Triangulation ensures that the maximal cliques form a tree structure; the minimum-treewidth triangulation is NP-hard but heuristics work well. Third, build the junction tree: nodes are maximal cliques of the triangulated graph, edges connect cliques that share variables, and the running-intersection property holds (every variable appearing in two cliques appears in every clique on the path between them).
Each clique holds a potential — the product of all original factors whose scope is contained in the clique. Neighbouring cliques exchange messages over their separator (the intersection of their variables). Two passes — collect-to-root, then distribute-from-root — make every clique consistent with its neighbours. Marginals of any variable, or of any set of variables that lie in a single clique, are read off directly from the clique potentials. The runtime is exponential in the size of the largest clique (i.e., treewidth plus one) and linear in the number of cliques.
Treewidth is a hard property of the graph. Tree-structured graphs have treewidth 1; polytrees have treewidth 1; grids of width k have treewidth k; random graphs have treewidth that grows with the number of nodes. An n×n grid has treewidth n, so exact inference on a 100×100 image MRF is forever out of reach — the largest clique would have 100 variables and 2¹⁰⁰ entries. This is where approximate methods (the next two sections) take over.
Many practical problems have small natural treewidth: medical Bayesian networks, phylogenetic trees, pedigree analyses, small expert systems, many Gaussian networks. When the treewidth is in the range of 10–25, the junction tree is fast and exact. It is also the algorithm built into the classical inference engines of tools like Netica, Hugin, and GeNIe. For modern large-scale work the baton passes to variational inference and sampling, but the junction tree is the yardstick: an approximate method is "good" in part because it agrees with the junction tree on problems where both can run.
When exact inference is intractable, one option is to pose approximate inference as an optimisation problem: pick a simple, tractable family of distributions q(Z) and find the member of that family that is closest to the true posterior p(Z | X). The closeness is measured in KL divergence, and the minimisation is equivalent to maximising a lower bound on the log marginal likelihood — the ELBO. Variational inference trades statistical consistency for deterministic runtime: it always returns an answer in finite time, though not necessarily the right one. It is the backbone of modern scalable probabilistic inference and the engine inside every variational autoencoder.
Start from the identity log p(X) = L(q) + KL(q(Z) ∥ p(Z | X)), where
L(q) = E_q[log p(X, Z)] − E_q[log q(Z)]
is the evidence lower bound. Since KL is non-negative, L(q) ≤ log p(X). Maximising L(q) over a tractable family of q's simultaneously tightens the bound on the marginal likelihood and pushes q towards the true posterior. All of variational inference is the design of the family and the optimisation of the ELBO.
The simplest choice of family is mean-field: assume the posterior factorises across the latent variables, q(Z) = ∏ᵢ qᵢ(Zᵢ). Under this assumption, the optimal qᵢ has a closed-form update: each factor is the exponential of the expected log joint under the other factors. Iterating the updates is the coordinate-ascent variational inference (CAVI) algorithm, and it always converges to a local optimum. Mean-field VI is the classic scheme for latent Dirichlet allocation, Bayesian mixture models, and Gaussian mixtures with conjugate priors.
Two extensions made variational inference scale to modern problems. Stochastic variational inference (Hoffman et al., 2013) replaces the expensive full-data expectations with noisy mini-batch estimates, enabling VI on datasets of any size. Amortised variational inference — the key idea in variational autoencoders (Kingma & Welling, 2013) — replaces the per-datapoint optimisation of q with a single neural network q_φ(Z | X) that predicts the approximate posterior as a function of the input. Amortisation is cheaper at inference time (one forward pass instead of an optimisation) and is the reason VAEs scale to millions of images.
Optimising the ELBO with gradient descent requires gradients through an expectation under q. The reparameterisation trick expresses a sample from a Gaussian q as a deterministic function of its parameters and auxiliary noise — z = μ + σ · ε, ε ∼ N(0, I) — which lets backpropagation flow through the sampling step. This single idea is what made VAEs, normalising flows, and much of modern deep generative modelling tractable.
Variational inference is fast, deterministic, composable with gradient-based optimisers, and scalable to large datasets. Its weakness is that the mean-field factorisation systematically underestimates posterior variance — it has a well-documented tendency to pick one mode and ignore the others, because KL(q ∥ p) penalises q mass where p has none but not the reverse. Structured VI, normalising flows, and more expressive families (Gaussian processes, neural flow-based q's) partially fix this. For well-calibrated posteriors, MCMC (next section) is still the gold standard.
The alternative to optimising an approximation is sampling the real thing. MCMC constructs a Markov chain whose stationary distribution is exactly the posterior we cannot compute in closed form, runs the chain until it has "burned in," and then uses the samples as a Monte Carlo estimate of any expectation we want. It is slower and less predictable than variational inference, but with enough samples it converges to the truth, and it does not require a tractable approximating family. MCMC is the inference engine behind Stan and PyMC, and the gold-standard baseline against which every variational method is judged.
The foundational algorithm, due to Metropolis et al. 1953 and generalised by Hastings 1970. Propose a new state z' from some proposal distribution q(z' | z). Accept the move with probability
α = min(1, [p(z') q(z | z')] / [p(z) q(z' | z)]).
Crucially, p need only be known up to a constant — the normalisation cancels. This is what lets MCMC sample an unnormalised Bayesian posterior without computing the evidence p(X). The proposal can be anything; the price of a poor proposal is slow mixing. Tuning proposals is most of the practical work of writing a Metropolis sampler.
When the conditional p(Zᵢ | Z₋ᵢ, X) of each latent variable given the others is tractable — a common situation in Bayesian networks with conjugate priors — Gibbs sampling is the natural specialisation of Metropolis–Hastings. Cycle through variables, sampling each from its full conditional. Every step is accepted (the acceptance ratio is always 1). Gibbs is the algorithm behind topic-model inference in the classic LDA papers, and behind most MCMC for conditionally conjugate models.
For continuous latents, ordinary random-walk proposals mix slowly: they take many small steps to explore the typical set. Hamiltonian Monte Carlo (Neal 2011) fixes this by treating the log-posterior as a potential energy, adding an auxiliary momentum variable, and simulating Hamiltonian dynamics with a symplectic integrator — so proposals can traverse large distances while respecting the geometry of the posterior. Stan's NUTS (no-U-turn sampler, Hoffman & Gelman 2014) automates the step-size and trajectory-length tuning that used to require expert hand-holding. HMC is the default algorithm for continuous Bayesian inference today.
MCMC is correct only asymptotically, so diagnostics matter. Trace plots show whether chains have mixed. The Gelman–Rubin statistic R̂ compares variance within chains to variance between chains; values near 1 indicate convergence, values above 1.01 are warning signs. The effective sample size measures how many "independent" samples the chain has produced after autocorrelation. Good practice: run at least four chains from different starts; look at R̂, ESS, and divergences; treat suspicious posteriors with suspicion.
MCMC is the right choice when (i) you need calibrated uncertainty estimates, not just point estimates; (ii) the posterior is multi-modal and variational methods are getting stuck in one mode; (iii) the model is complex enough that no reasonable q-family approximates the posterior well; (iv) the dataset is small enough that sampling is affordable. For large-scale, fast-iteration regimes — training neural-network latent-variable models on millions of images — variational inference still wins. The modern ecosystem has both well integrated: Stan, PyMC, NumPyro for MCMC-first workflows; Pyro and TensorFlow Probability for VI-first workflows with MCMC as a fallback.
Given a graph and a dataset, how do we estimate the parameters of the local conditionals or potentials? Three routes dominate: maximum likelihood (pick the parameters that best explain the data), maximum a posteriori (MLE with a prior), and full Bayesian inference (treat parameters as random variables with their own posterior). When latent variables are present, the classical tool is the expectation-maximisation algorithm, which alternates between filling in the latents and re-estimating the parameters. Parameter learning in PGMs is what separates a workable model from a toy.
For a fully observed Bayesian network, the likelihood factorises across nodes and the MLE decomposes into separate estimates for each conditional: P(Xᵢ | Pa(Xᵢ)) is estimated by counting (for discrete variables) or by fitting the local parametric form (Gaussian linear regression for continuous children of continuous parents, etc.). The decomposition is what makes directed models pleasant to learn: each local CPD can be trained independently.
For an MRF, the log-likelihood is
ℓ(θ) = Σᵢ log p(X⁽ⁱ⁾; θ) = Σᵢ ( Σ_k θ_k f_k(X⁽ⁱ⁾) − log Z(θ) ).
It is concave in θ, so in principle a simple gradient ascent works. In practice, the gradient contains an expectation under the model distribution — ∂/∂θ_k log Z(θ) = E_{p(X;θ)}[f_k(X)] — which requires inference at every step. For small-treewidth graphs this is feasible; for larger graphs, one resorts to pseudo-likelihood (replace the full likelihood with a product of conditionals), contrastive divergence (Hinton 2002, the algorithm behind Boltzmann machine training), or stochastic MCMC-based gradient estimates.
Maximum a posteriori adds a prior: θ̂_MAP = arg max_θ p(θ | D) = arg max_θ p(D | θ) p(θ). For discrete Bayesian networks with Dirichlet priors, this is Laplace smoothing — a pseudo-count added to every cell — and makes the estimates well-behaved when some parent configurations are rare. Full Bayesian parameter inference treats θ as a latent variable and computes the posterior over parameters, yielding predictive distributions that integrate over parameter uncertainty. The machinery is MCMC or variational (sections 9–10).
When some variables are latent or some data are missing, direct MLE is intractable: the log-likelihood contains a sum (or integral) over the latents. The EM algorithm (Dempster, Laird & Rubin 1977) handles this iteratively. The E-step computes the posterior over latents given current parameters: q⁽ᵗ⁾(Z) = p(Z | X, θ⁽ᵗ⁾). The M-step updates parameters to maximise the expected complete-data log-likelihood under q⁽ᵗ⁾: θ⁽ᵗ⁺¹⁾ = arg max_θ E_{q⁽ᵗ⁾}[ log p(X, Z; θ) ]. Each iteration monotonically improves the observed-data log-likelihood. Convergence is to a local optimum; restarts mitigate.
EM generalises an enormous set of classical algorithms: Baum–Welch (EM on HMMs, section 13), the Gaussian-mixture EM that fits GMMs (Chapter 04), LDA's variational EM (section 16), and the factor-analysis iterations of Chapter 05. Every time a model has latents and closed-form conditional updates, EM is the natural algorithm.
In practice, small PGMs — medical decision networks with a few dozen nodes, pedigree analyses, industrial reliability networks — are routinely learned end-to-end with Bayesian machinery: conjugate Dirichlet-multinomial updates for discrete CPDs, normal-inverse-Wishart for Gaussian CPDs, MCMC or VI for anything more exotic. The resulting model carries its parameter uncertainty through to predictions, which is particularly valuable in domains where calibrated uncertainty matters more than raw accuracy.
Parameter learning assumes the graph is given. Structure learning asks the harder question: which edges should the graph have? Two families of approaches — score-based (assign a score to every graph, search for the highest-scoring) and constraint-based (test conditional independences in the data, assemble a graph consistent with them) — dominate. Causal structure discovery, the deeply hard problem of recovering the causal DAG from observational data, is one of the most active frontiers in machine learning. Structure learning is where PGMs touch causal inference, symbolic reasoning, and scientific modelling.
The number of DAGs on n nodes grows super-exponentially — more than n!, more than 2^(n choose 2) — and enumerating them is hopeless past about ten nodes. Structure-learning algorithms work around this by searching over equivalence classes (CPDAGs), restricting attention to graphs with bounded in-degree, using greedy heuristics, or imposing a topological order.
Assign a score to each candidate graph that balances fit and complexity: the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), or the Bayesian Dirichlet equivalent (BDe) score for discrete networks. Search the space with hill-climbing, tabu search, genetic algorithms, or dynamic programming on topological orders. The canonical efficient algorithm is GES (Greedy Equivalence Search, Chickering 2002), which is provably consistent in the infinite-data limit under mild assumptions. Bayesian model averaging — averaging predictions over many graphs weighted by their posterior probability — sidesteps the "which graph is right?" question but requires MCMC over the graph space.
Test conditional independences in the data using the chi-squared test (discrete) or partial correlation (Gaussian) and assemble a graph whose independence structure matches the tests. The canonical algorithm is the PC algorithm (Spirtes, Glymour & Scheines 1991): start from a complete undirected graph, remove edges whenever a conditioning set is found that renders the endpoints independent, then orient the remaining edges using collider patterns. The FCI algorithm (Fast Causal Inference) extends PC to handle latent confounders.
Without interventions, the best any algorithm can do from observational data is recover the Markov equivalence class of the true graph — the set of DAGs that imply the same conditional independences. All DAGs in the class are statistically indistinguishable; only interventional or temporal information can single out the causal one. This is a fundamental limit, not a shortcoming of algorithms. Modern causal-discovery work extends the space with assumptions that break the symmetry: linear non-Gaussian models (LiNGAM, Shimizu 2006), additive-noise models, and causal-representation-learning techniques that introduce auxiliary variables (interventions, instruments, temporal structure).
For Markov random fields, the canonical tool is the graphical Lasso (Friedman, Hastie & Tibshirani 2008): fit a sparse Gaussian inverse covariance by ℓ₁-penalised maximum likelihood. Zero entries in the inverse covariance correspond to missing edges in the Gaussian MRF. The same idea extends to discrete pairwise MRFs with group-Lasso penalties. Graphical Lasso is one of the few structure-learning methods that scale to hundreds or thousands of variables and is widely used in neuroscience and genomics.
A hidden Markov model is the PGM for sequences of observations generated by an unseen discrete state. The hidden state evolves as a Markov chain; each state emits an observation from a state-dependent distribution. HMMs dominated speech recognition, bioinformatics, and part-of-speech tagging from the 1970s through the 2010s, and remain the canonical first-pass model for any sequence where latent discrete dynamics are plausible. Every ML engineer should be able to draw an HMM, recognise the forward–backward algorithm, and explain Baum–Welch without hesitation.
An HMM is specified by: a set of hidden states S = {1, …, K}; an initial state distribution π = P(Z₁); a transition matrix A with A[i,j] = P(Zₜ = j | Zₜ₋₁ = i); an emission distribution P(Xₜ | Zₜ = i). The graph is a chain: Z₁ → Z₂ → … → Z_T, with each Zₜ → Xₜ. The joint factorises as
P(Z, X) = P(Z₁) · ∏ₜ P(Zₜ | Zₜ₋₁) · ∏ₜ P(Xₜ | Zₜ).
Evaluation: given observations X, compute P(X | θ) — useful for likelihood ratios and classifier construction. Solved by the forward algorithm, a dynamic program with runtime O(TK²).
Decoding: given observations X, find the most probable state sequence Z* = arg max_Z P(Z | X). Solved by the Viterbi algorithm (max-product on the chain), also O(TK²).
Learning: given observations and no state labels, estimate the parameters (π, A, B). Solved by the Baum–Welch algorithm — EM on the HMM, with the E-step being forward–backward and the M-step being weighted re-estimation of π, A, and the emission parameters.
The forward variable αₜ(i) = P(X₁,…,Xₜ, Zₜ = i) and the backward variable βₜ(i) = P(Xₜ₊₁,…,X_T | Zₜ = i) satisfy recursions that compute all state marginals P(Zₜ = i | X) in a single pass forward and a single pass backward. This is exactly the sum-product algorithm (section 6) specialised to a chain-shaped factor graph, and it is identical in structure to Kalman smoothing for continuous-state models.
Speech recognition: phones as hidden states, acoustic features as observations — the entire pre-neural ASR pipeline. Part-of-speech tagging and named-entity recognition: tags as hidden states, words as observations. Bioinformatics: the CpG island problem, profile HMMs for protein family detection, and gene-finding (GENSCAN). Change-point detection and regime-switching models in finance. Activity recognition from wearable sensors. Molecular motor modelling from single-molecule experiments.
HMMs impose two assumptions that are often wrong. First, the Markov assumption: the state at time t depends only on the state at t−1. Real processes have long-range dependencies. Second, the discrete-state assumption: the state is one of K categories. Continuous-state dynamics require Kalman filters (next section); long-range dependencies require higher-order HMMs, semi-Markov models, or neural sequence models (Chapter V). The decline of HMMs in speech and NLP is directly traceable to neural models' ability to handle both.
The continuous-state analogue of the HMM is the linear dynamical system, or linear-Gaussian state-space model: a continuous hidden state evolves linearly with Gaussian noise, and a continuous observation is a linear function of the state with Gaussian noise. The inference algorithm — the Kalman filter — is so efficient and so numerically stable that it is the default tool for tracking, navigation, and sensor fusion in every field that uses them. It put humans on the moon in 1969 and is the algorithmic backbone of every GPS unit, every inertial navigation system, and a great deal of modern state-space-based sequence modelling.
The state-space equations:
Zₜ = A Zₜ₋₁ + wₜ, wₜ ~ N(0, Q)
Xₜ = H Zₜ + vₜ, vₜ ~ N(0, R)
with A the transition matrix, Q the process-noise covariance, H the observation matrix, R the observation-noise covariance. All hidden and observed variables are jointly Gaussian, which is what makes inference tractable — the entire filter is linear algebra on mean vectors and covariance matrices.
At each step, the posterior over the current hidden state is a Gaussian with some mean and covariance. The predict step pushes this Gaussian forward through the transition: ẑₜ₊₁⁻ = A ẑₜ, P̂ₜ₊₁⁻ = A Pₜ Aᵀ + Q. The update step combines the predicted Gaussian with the new observation using Bayes' rule for Gaussians: compute the Kalman gain K = P̂⁻ Hᵀ (H P̂⁻ Hᵀ + R)⁻¹, and update ẑ = ẑ⁻ + K(x − H ẑ⁻), P = (I − KH) P̂⁻. Iterate. The filter is optimal (minimum-variance unbiased) for any linear-Gaussian system.
The Kalman filter is an online procedure — it computes the posterior over each hidden state given observations up to and including the current one. Kalman smoothing (the Rauch–Tung–Striebel algorithm) is the offline counterpart: given all observations, compute the posterior over every hidden state given the full sequence. It is the direct analogue of forward–backward for HMMs, and it is what you want when you can process the whole sequence at once.
Real dynamics are rarely exactly linear. The Extended Kalman Filter (EKF) linearises the transition and observation functions around the current estimate at each step. The Unscented Kalman Filter (UKF, Julier & Uhlmann 1997) uses a deterministic sample of sigma points propagated through the non-linear dynamics — more accurate than linearisation at comparable cost. For strongly non-linear or non-Gaussian systems, particle filters (sequential Monte Carlo) replace the Gaussian approximation with a weighted set of samples and are the go-to tool for robot localisation in non-parametric settings.
The mathematical lineage from the Kalman filter to modern state-space sequence models — Linear State-Space Layers, S4 (Gu, Goel & Ré 2022), Mamba (Gu & Dao 2023) — is direct: each is a linear dynamical system whose state dimension and transition structure have been adapted for the efficient processing of long sequences, and for the parameters to be learned end-to-end from gradients rather than estimated from physical first principles. The state-space revival in deep sequence modelling is one of the clearest examples of how classical PGM machinery keeps returning under new names.
A hidden Markov model is generative — it models the joint P(X, Z), which includes modelling how the observations were produced. For structured prediction — labelling every token in a sentence, every pixel in an image, every amino acid in a protein — we typically do not care about P(X); we care about P(Z | X), the distribution over labels given the input. Conditional random fields (Lafferty, McCallum & Pereira 2001) are the discriminative counterpart: undirected graphical models of the conditional distribution, with rich input-dependent features. CRFs dominated structured-prediction benchmarks in NLP and computer vision through the 2000s and remain the classical baseline.
A CRF on observations X and labels Z defines
P(Z | X) = (1/Z(X)) · exp( Σₜ Σ_k θ_k f_k(Zₜ, Zₜ₋₁, X, t) ),
where the features f_k can depend on the current label, the previous label, the entire observation sequence X, and the position t. For a linear-chain CRF, the graph over labels is a chain — identical in structure to an HMM, but with one-sided conditioning on X. The partition function Z(X) depends on the input, which is the key advantage: rich, non-local input features can be used freely without worrying about modelling their joint distribution.
Two reasons. First, HMMs require the observations to be conditionally independent given the states — a strong assumption that is usually wrong for rich inputs like words, where prefixes, suffixes, gazetteers, and capitalisation are all informative. CRFs sidestep this because they do not model P(X) at all. Second, HMMs are generative, so errors in modelling P(X) degrade label predictions; CRFs optimise the quantity we actually care about. Empirically, for tasks like named-entity recognition, part-of-speech tagging, and shallow parsing, CRFs were consistently 1–3 points more accurate than the analogous HMMs in the 2000s.
Inference on a linear-chain CRF is identical to forward–backward on an HMM — compute marginals for each label and each label pair. Decoding is Viterbi. Training maximises the conditional log-likelihood Σᵢ log P(Zⁱ | Xⁱ; θ) with L-BFGS or stochastic gradient descent. The gradient is the difference between the empirical feature expectation and the model's feature expectation — identical in structure to MRF training (section 11), but tractable because inference on each individual chain is cheap.
Skip-chain CRFs add long-range edges between tokens likely to share labels (e.g., the same word appearing twice). Semi-Markov CRFs label segments rather than tokens and are the right tool when label boundaries need to be predicted explicitly. 2D CRFs on grid graphs are the structured-prediction backbone for semantic segmentation — conditional random fields over pixel labels with pairwise potentials encoding "adjacent pixels prefer similar labels". In the 2010s, neural architectures (bi-LSTMs, then transformers) replaced the hand-crafted features, and the CRF layer moved from the whole model to a final structured-output head — the "BiLSTM-CRF" architecture was state-of-the-art for sequence labelling from 2015 to about 2018 and is still a strong, cheap baseline.
When PGM ideas met text mining, the result was latent Dirichlet allocation (Blei, Ng & Jordan 2003), the canonical unsupervised model for large document collections. Each document is modelled as a mixture over a small number of latent topics; each topic is a distribution over words; inference recovers both the topic-word distributions and the per-document topic proportions. LDA became one of the most widely used probabilistic models of the 2000s and introduced a generation of researchers to variational and Gibbs inference.
For each document d, draw a topic proportion vector θ_d ~ Dirichlet(α). For each topic k, draw a word distribution φ_k ~ Dirichlet(β). For each word position n in document d: draw a topic z_{d,n} ~ Categorical(θ_d); draw a word w_{d,n} ~ Categorical(φ_{z_{d,n}}). The graph is a directed plate model with three plates (topics, documents, words). The joint factorises as a product of local Dirichlet and categorical distributions, and inference — recovering the per-document topic proportions and per-topic word distributions from observed word counts — is the algorithmic challenge.
Collapsed Gibbs sampling (Griffiths & Steyvers 2004): integrate out the Dirichlet parameters analytically, then Gibbs-sample the topic assignment of each word given all other assignments. Each iteration is cheap (a single categorical resample per token) and the algorithm mixes well in practice. This is the classic LDA training algorithm in Mallet and GibbsLDA++.
Variational inference (Blei, Ng & Jordan 2003): mean-field VI in which θ, z, and φ each have their own variational factor. The updates are closed-form given conjugacy. Scales better to very large corpora than Gibbs because it parallelises cleanly and is the approach used in online LDA (Hoffman, Blei & Bach 2010) and the scikit-learn implementation.
A zoo of variations. Dynamic topic models (Blei & Lafferty 2006) let topics evolve over time, which is useful for scientific-literature analysis. Correlated topic models replace the Dirichlet prior with a logistic normal to allow topic correlations. Hierarchical Dirichlet processes (Teh et al. 2006) make the number of topics itself an inferred quantity rather than a hyperparameter. Supervised LDA adds a label predicted from the topic proportions. Author-topic models combine document authorship information with topic structure. Neural topic models — ProdLDA, ETM (Embedded Topic Models), BERTopic — replace or augment the categorical-Dirichlet machinery with neural networks and word embeddings, and are the modern default for topic discovery.
LDA is no longer the default for topic discovery — embedding-based methods (BERTopic, UMAP-over-document-embeddings-plus-clustering) have taken over the middle of the market — but the core idea of "document is a mixture of topics, topic is a distribution over words" remains influential. The Bayesian hierarchy, the plate notation, and the variational and Gibbs inference recipes introduced by LDA are now standard vocabulary across the field, and many modern neural generative-retrieval and neural-topic systems are LDA's great-grandchildren.
A PGM on paper and a PGM in production are different artefacts. The practical workflow is roughly: (i) decide whether a PGM is the right tool at all, (ii) choose a directed or undirected representation, (iii) elicit or learn the structure, (iv) fit the parameters, (v) choose an inference algorithm compatible with the scale and the kind of query you need, (vi) validate with held-out data and posterior predictive checks. Each step has well-understood tools and well-known traps.
PGMs win when the problem is small-data, the structure is known or elicitable, interpretability is non-negotiable, calibrated uncertainty is required, or when prior knowledge needs to be combined with data. They also win for causal reasoning: no deep architecture, however large, answers interventional queries without some graphical-model scaffolding. They lose on perceptual tasks, very large unstructured datasets, and anywhere the inductive biases of convolution or attention fit the problem better than anything a graph would supply.
For Bayesian inference at scale: Stan (the reference MCMC engine, first-rate NUTS implementation, strong diagnostics), PyMC (the Pythonic alternative, deeply integrated with PyData), NumPyro (JAX-backed, GPU-fast NUTS), Pyro (deep probabilistic programming on PyTorch, variational-first), TensorFlow Probability (VI and MCMC on TF). For discrete Bayesian networks: pgmpy (Pythonic, supports exact and approximate inference, structure learning, causal tooling), bnlearn (R, strong structure-learning methods), Hugin / Netica / GeNIe (commercial, production-grade inference engines). For structured-prediction CRFs: sklearn-crfsuite, PyTorch's CRF layer inside BiLSTM/transformer pipelines. For topic models: gensim, Mallet, scikit-learn.
Posterior predictive checks — simulate data from the fitted posterior and compare the simulated data to the observed — are the first-line sanity check. Calibration curves check whether the model's 80% credible intervals contain the truth 80% of the time. For MCMC, R̂ and effective sample size are mandatory. For structure learning, hold-out log-likelihood and stability across bootstraps. For causal claims, sensitivity analysis to unobserved confounders.
The most frequent: omitted variables confound every dependency in the learned graph (no algorithm recovers what cannot be seen); a plausible-looking topology hides wrong priors (e.g., overly diffuse Dirichlets in LDA producing uninterpretable topics); mean-field VI gives overconfident posteriors and collapsing modes; MCMC reports R̂ near 1 while hiding a missed mode (run many chains from diverse starts); causal interpretation of a DAG learned from observational data without the caveats that apply; treating a graphical model as a generative simulator when the fit is only for predictive accuracy on a small set of marginals.
A good default workflow for a new problem: (1) sketch the generative story on a whiteboard with an expert; (2) implement it in PyMC or NumPyro with weakly informative priors; (3) run NUTS with four chains and look at diagnostics; (4) check posterior predictives against held-out data; (5) only then decide whether the model is good enough — in which case ship it — or whether the structure needs revision, in which case go back to step 1. Most of the time, the iteration is about the structure and the priors, not the inference algorithm.
Probabilistic graphical models are sometimes written off as an artefact of the pre-deep-learning era. This is wrong. The vocabulary of PGMs — latent variables, conditional independence, message passing, priors, marginal likelihood, the ELBO — is the vocabulary of modern probabilistic deep learning; several of the most important modern architectures are directly descended from PGM ideas; and the renaissance of causal and structured reasoning makes graphical thinking more central, not less.
A variational autoencoder (Kingma & Welling 2013) is a Bayesian network with two nodes (Z → X) whose conditional P(X | Z) is a deep neural network, and whose approximate posterior q(Z | X) is another neural network trained by amortised variational inference. The VAE is a PGM whose parameters are neural and whose inference is variational. The same template animates normalising flows (invertible neural networks that make P(X) tractable), diffusion models (hierarchies of denoising latents), and deep mixture-of-experts models.
The state-space revival — Structured State-Space Sequence models (S4, Gu et al. 2022) and Mamba (Gu & Dao 2023) — recasts sequence modelling as linear dynamical systems with clever parameterisations. This is the direct descendant of Kalman filtering, reborn with learned dynamics, GPU-efficient implementations, and subquadratic scaling in context length. It is currently the most promising alternative to transformer attention for very long contexts.
CRF layers on top of neural encoders — the BiLSTM-CRF, the transformer-CRF — remain strong for sequence labelling, span extraction, and constrained decoding. The combination is the clearest hybrid in the field: let the neural network learn features from raw data, let the graphical model enforce structural constraints (valid label sequences, tree structure, parse validity). In computer vision, CRFs-as-RNN layers over segmentation logits (Zheng et al. 2015) were state-of-the-art for semantic segmentation for several years.
A probabilistic programming language — Stan, PyMC, Pyro, Gen, Turing.jl, NumPyro — is a general-purpose language in which one writes a generative model as ordinary code, with built-in inference primitives. PPLs are the practical face of modern PGMs: you describe the structure by writing the sample statements, the compiler builds the factor graph, and automatic differentiation plus MCMC or VI do the rest. Probabilistic programming is where most working PGM research actually happens today.
Causal inference — the discipline of reasoning about interventions and counterfactuals — is built on graphical models. Pearl's do-calculus, structural causal models, front-door and back-door adjustments, instrumental variables, and modern causal-representation learning all use the vocabulary of DAGs. As ML moves from prediction to decision support — "what would happen if we deploy this policy?" — causal DAGs become unavoidable. In scientific ML, PGMs are the natural way to combine a known physical mechanism with a learned residual, and they appear as the graphical backbone of modern systems in drug discovery, protein design, and climate modelling.
Three ideas carry over from this chapter into everything that follows. Factorisation: every tractable probabilistic model factors a big joint into smaller local pieces, and the structure of that factorisation is what makes inference feasible. Conditional independence: knowing which variables ignore which, given what, is what lets you design models that scale. Inference-as-optimisation versus inference-as-sampling: variational and Monte-Carlo methods are the two canonical strategies for reasoning under uncertainty when exact inference is out of reach. These ideas structure the rest of classical machine learning — kernel methods, feature engineering, model evaluation — and they structure much of modern deep probabilistic learning too.
Probabilistic graphical models sit at the intersection of probability theory, statistics, computer science, and (increasingly) deep learning, and the literature spans sixty years and three distinct communities. The references below split into anchor textbooks that define the field, foundational papers that introduced each major algorithm, modern extensions that carry the ideas into 2026, and the software documentation where most practical work happens. Read the Koller–Friedman textbook if you only read one thing.
This page is Chapter 06 of Part IV: Classical Machine Learning. Chapter 07 turns from graph-structured reasoning to kernel methods and support vector machines — a different way of handling non-linearity, rooted in Hilbert-space geometry rather than graphical factorisation. Where PGMs ask "what variables depend on what, and how do we compute the consequence?", kernels ask "what functions live in what space, and what margin separates my classes?" The two traditions together account for most of classical ML's remaining ground; when they are joined with the feature-engineering and model-evaluation toolkit of Chapters 08–09, you will have the full classical-ML picture, against which the deep-learning revolution of Part V can be properly set.