In 1948 Claude Shannon proposed a single quantity — entropy — that turned out to measure, all at once, how much a data source can be compressed, how much can be transmitted over a noisy channel, and how far one probability distribution is from another. Modern machine learning runs on those same three ideas under new names: cross-entropy loss, KL divergence, mutual information. This chapter develops them from first principles, then traces their fingerprints across every major corner of contemporary AI.
Sections build on one another, so read in order the first time through. The prose carries the argument; the equations make it precise; the diagrams keep it from going abstract. Everywhere we speak of "probability," assume the relevant distribution is well-defined — this is not a measure-theoretic chapter, but every result here extends to that setting.
Notation: $p$, $q$ denote probability mass (or density) functions; $X$, $Y$ are random variables; $H(X)$ is the entropy of $X$; $H(X,Y)$ the joint entropy; $H(X \mid Y)$ conditional; $I(X;Y)$ mutual information; $D_{KL}(p \Vert q)$ the Kullback–Leibler divergence; $h(X)$ differential entropy of a continuous variable; $F(\theta)$ the Fisher information matrix. Logarithms are base-$2$ when we count bits and natural when we count nats; both conventions appear here, and the identity $\log_2 x = (\ln x)/\ln 2$ is all that converts between them.
Claude Shannon's 1948 paper turned communication into mathematics. The trick: measure information by how surprising it is, and the entire theory of compression, transmission, and learning falls out of three equations.
Before 1948, "information" was a word engineers used informally to describe what flowed through a telegraph wire or a radio link. Shannon's A Mathematical Theory of Communication made it a quantity. The single idea is this: the information content of a message is not its meaning, its length, or its visual complexity — it is the amount of surprise the message resolves. A message you already expected carries little information; a message you considered impossible carries a lot.
Make that quantitative and three things happen at once. First, you get a precise measure of how compressible any data source is — the entropy of the source. Second, you get a precise measure of how much information can be transmitted through a noisy channel — the capacity of the channel. Third, you get a unifying notion of "distance between distributions" — the KL divergence — that turns out to underwrite half of modern machine learning. Compression, transmission, and learning are the same theory, told from three angles.
The chapter follows the canonical arc. Sections 02–06 build the basic measures: entropy, joint and conditional entropy, mutual information, KL divergence, cross-entropy. Sections 07–09 cover Shannon's source coding theorem and the practical compression algorithms it inspired. Sections 10–11 cover channel capacity and the famous Shannon–Hartley limit. Section 12 handles the continuous case (differential entropy). Section 13 makes the connection back to statistics through Fisher information and information geometry. Section 14 closes by mapping every concept onto its modern incarnation in machine learning.
Notation is light. Logs default to base 2 unless otherwise stated; switching to natural log changes "bits" to "nats" and rescales every quantity by $1/\ln 2$, but never changes a sign or a relationship. We use $H$ for entropy, $D$ for KL divergence, $I$ for mutual information, and lowercase $p, q$ for probability mass functions or densities. Random variables are uppercase ($X, Y$); their realisations are lowercase ($x, y$).
Three reasonable axioms for "how surprising is an outcome of probability $p$" force a unique answer: $-\log p$. Averaging that surprise gives entropy, and entropy turns out to be the right unit of information.
Define the self-information (or "surprise") of an event of probability $p$ as
$$ h(p) \;=\; -\log p \;=\; \log \frac{1}{p}. $$Three properties pin this choice down. Surprise should be non-negative ($p \leq 1$). Surprise should decrease in probability — likely events are less surprising. And the surprise of two independent events should add: if $A$ and $B$ are independent then $\mathbb{P}(A \cap B) = p_A p_B$, and we want $h(p_A p_B) = h(p_A) + h(p_B)$. The only continuous, non-negative, decreasing function satisfying $h(pq) = h(p) + h(q)$ is $h(p) = -c \log p$ for some constant $c > 0$. Choosing $c = 1$ with $\log = \log_2$ measures information in bits; $c = 1$ with $\log = \ln$ measures it in nats ($1\text{ nat} = 1/\ln 2 \approx 1.443$ bits).
The entropy of a discrete random variable $X$ taking values in $\mathcal{X}$ with PMF $p(x)$ is the expected surprise:
$$ H(X) \;=\; \mathbb{E}_p[h(p(X))] \;=\; -\sum_{x \in \mathcal{X}} p(x) \log p(x), $$with the convention $0 \log 0 = 0$ (extended by continuity, since $p \log p \to 0$ as $p \to 0$). Three concrete examples lock this in. A fair coin: $H = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = 1$ bit. A biased coin with $p = 0.9$: $H \approx 0.469$ bits — less surprising on average. A uniform die with six faces: $H = \log_2 6 \approx 2.585$ bits.
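The three examples can be checked in a few lines of Python (a minimal sketch; the helper name `entropy` is ours):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum p log p, with the 0 log 0 = 0 convention."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
print(entropy([1/6] * 6))    # fair die: ~2.585 bits
```

Passing `base=math.e` gives the same quantities in nats.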
More generally, for a uniform distribution on $|\mathcal{X}| = n$ outcomes,
$$ H(X) \;=\; \log_2 n. $$This is the maximum entropy any distribution on $n$ outcomes can have. The intuition: uniform is the "most uncertain" distribution, and entropy quantifies uncertainty.
The binary entropy function $H(p) = -p\log p - (1-p)\log(1-p)$ for $p \in [0, 1]$ deserves to be memorised. It is zero at $p = 0$ and $p = 1$, peaks at $H(\tfrac{1}{2}) = 1$, and traces the symmetric concave arch that makes it the natural "uncertainty" curve for any binary classifier output. Every binary cross-entropy loss in machine learning is computing exactly this function, averaged over a training set.
Two random variables have one joint entropy and two conditional entropies, related by an algebra so clean it transfers verbatim from probability into information theory.
The joint entropy of two random variables $X$ and $Y$ with joint PMF $p(x, y)$ is
$$ H(X, Y) \;=\; -\sum_{x, y} p(x, y) \log p(x, y). $$The conditional entropy of $Y$ given $X$ is the expected entropy of $Y$ once $X$ has been observed:
$$ H(Y \mid X) \;=\; \sum_x p(x)\, H(Y \mid X = x) \;=\; -\sum_{x, y} p(x, y) \log p(y \mid x). $$Read it as: average over the value of $X$ the residual uncertainty in $Y$. If $X$ and $Y$ are independent, observing $X$ tells you nothing and $H(Y \mid X) = H(Y)$. If $Y$ is a deterministic function of $X$, observing $X$ tells you everything and $H(Y \mid X) = 0$.
The chain rule for entropy is the workhorse identity:
$$ H(X, Y) \;=\; H(X) + H(Y \mid X) \;=\; H(Y) + H(X \mid Y). $$Reading left-to-right: the joint uncertainty in $(X, Y)$ equals the uncertainty in $X$ plus the residual uncertainty in $Y$ once $X$ is known. The proof is two lines of algebra from $\log p(x, y) = \log p(x) + \log p(y \mid x)$. The general $n$-variable version,
$$ H(X_1, \ldots, X_n) \;=\; \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}), $$underwrites every autoregressive model in modern ML — a language model is decomposing the joint entropy of a sequence into a sum of conditional entropies, one per token.
Two more useful inequalities:
$$ H(X, Y) \;\leq\; H(X) + H(Y), \qquad H(X, Y) \;\geq\; \max\{H(X),\, H(Y)\}. $$The first is "subadditivity" — joint uncertainty is at most the sum of individual uncertainties, with equality iff independent. The second is "the joint is at least as uncertain as either marginal" — adding a variable cannot reduce uncertainty. Together they bracket $H(X, Y)$ in a way that the next section will visualise as a Venn diagram.
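The chain rule and both inequalities are easy to verify numerically on a small joint PMF (the table below is an arbitrary example of ours):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint PMF p(x, y) for illustration.
joint = {(0, 0): 0.2, (0, 1): 0.2, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}

px = {0: 0.5, 1: 0.5}             # marginal of X
py = {0: 0.3, 1: 0.3, 2: 0.4}     # marginal of Y
Hxy, Hx, Hy = H(joint.values()), H(px.values()), H(py.values())

# H(Y | X) = -sum_{x,y} p(x, y) log p(y | x)
Hy_given_x = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())

assert abs(Hxy - (Hx + Hy_given_x)) < 1e-12   # chain rule
assert Hxy <= Hx + Hy + 1e-12                 # subadditivity
assert Hxy >= max(Hx, Hy) - 1e-12             # joint dominates marginals
```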
How much does knowing $X$ tell you about $Y$? The answer — a single non-negative number with three equivalent definitions — is the central object of information theory.
The mutual information between $X$ and $Y$ is the reduction in uncertainty about one variable from observing the other:
$$ I(X; Y) \;=\; H(Y) - H(Y \mid X) \;=\; H(X) - H(X \mid Y). $$The two expressions are equal — mutual information is symmetric, which is not obvious from either form alone but follows immediately from the chain rule:
$$ I(X; Y) \;=\; H(X) + H(Y) - H(X, Y). $$A third equivalent form, which we revisit once KL divergence is on the table:
$$ I(X; Y) \;=\; \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \;=\; D_{\text{KL}}\!\left(p(x, y) \,\|\, p(x)\,p(y)\right). $$Mutual information is the KL divergence between the joint distribution and the product of the marginals. It is zero iff $X$ and $Y$ are independent, and otherwise strictly positive. This single fact is why mutual information is the right "association measure" between random variables — it captures every form of dependence, linear or nonlinear, and reduces to zero exactly when there is none. Pearson correlation has neither property.
The Venn diagram makes every identity in this chapter visible. The two circles are $H(X)$ and $H(Y)$. Their union is $H(X, Y)$. Their intersection is $I(X; Y)$. The crescents are the conditional entropies $H(X \mid Y)$ and $H(Y \mid X)$. Reading off relations:
$$ I(X; Y) \;=\; H(X) + H(Y) - H(X, Y), $$ $$ H(X \mid Y) \;=\; H(X) - I(X; Y), $$ $$ H(X, Y) \;=\; H(X \mid Y) + I(X; Y) + H(Y \mid X). $$Computing mutual information from data is harder than computing it from a known joint distribution. Estimators based on binning are biased and high-variance; modern practice uses neural-network-based variational estimators (MINE, InfoNCE) that exploit lower-bound representations of MI as expectations.
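When the joint distribution is known, all three definitions of mutual information agree, which a few lines confirm on a toy example (the numbers are ours):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated pair of bits: equal marginals, strong diagonal.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = py = {0: 0.5, 1: 0.5}

mi_venn = H(px.values()) + H(py.values()) - H(joint.values())
mi_cond = H(py.values()) - (-sum(p * math.log2(p / px[x])
                                 for (x, y), p in joint.items()))
mi_kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

assert abs(mi_venn - mi_kl) < 1e-12
assert abs(mi_venn - mi_cond) < 1e-12
print(round(mi_venn, 4))  # about 0.278 bits of shared information
```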
Two probability distributions over the same space have a "distance" — not symmetric, not a metric, but more useful than either. Almost every loss function in machine learning is a KL divergence wearing a costume.
The relative entropy or Kullback–Leibler divergence from $q$ to $p$ is
$$ D_{\text{KL}}(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)} \;=\; \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]. $$Read it as: the expected number of extra bits required to encode a sample from $p$ if you used a code optimised for $q$ instead. Or, equivalently, the expected log-likelihood ratio between the true model $p$ and a candidate model $q$, when data are sampled from $p$. Both readings will show up in this chapter.
Two non-obvious properties make KL the universal "distance" in statistical settings. First, Gibbs' inequality: $D_{\text{KL}}(p \,\|\, q) \geq 0$, with equality iff $p = q$ (one application of Jensen's inequality to the concave logarithm). Second, asymmetry: in general $D_{\text{KL}}(p \,\|\, q) \neq D_{\text{KL}}(q \,\|\, p)$, and the two directions penalise different errors. Minimising the "reverse" direction $D_{\text{KL}}(q \,\|\, p)$ over $q$ is mode-seeking: $q$ is punished for putting mass where $p$ has little, so it locks onto a single mode. Minimising the "forward" direction $D_{\text{KL}}(p \,\|\, q)$ is mean-seeking: $q$ is punished for missing mass that $p$ has, so it spreads to cover everything.

The asymmetry is not a bug. Variational inference picks the mode-seeking KL on purpose (it gives compact, peaky approximations); maximum likelihood implicitly uses the mean-seeking one (it spreads mass to cover the data). Understanding which KL you are minimising is half of understanding any modern probabilistic model.
The link back to mutual information is now visible: $I(X; Y) = D_{\text{KL}}(p(x, y) \,\|\, p(x) p(y))$ is the KL between the joint and the product-of-marginals. So $I(X; Y) = 0$ iff $X$ and $Y$ are independent, exactly because Gibbs' inequality says KL is zero only at equality of distributions. The entire algebra of entropy, conditional entropy, and mutual information is "KL divergence applied to specific pairs of distributions" — and that is the cleanest way to remember it.
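A direct computation shows both Gibbs' inequality and the asymmetry (the two distributions are chosen arbitrarily for illustration):

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q))  # ~0.737 bits
print(kl(q, p))  # ~0.531 bits: a different number, KL is not symmetric

assert kl(p, p) == 0                    # zero iff the distributions match
assert kl(p, q) > 0 and kl(q, p) > 0    # Gibbs' inequality
```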
Cross-entropy is the loss function for almost every classifier ever trained. It is also the cleanest way to see the relationship between entropy, KL divergence, and maximum-likelihood estimation.
The cross-entropy from $p$ to $q$ is
$$ H(p, q) \;=\; -\sum_x p(x) \log q(x) \;=\; \mathbb{E}_p[-\log q(X)]. $$Operationally: the expected number of bits needed to encode samples from $p$ using a code designed for $q$. It is always at least $H(p)$, with equality iff $q = p$. The gap is exactly the KL divergence:
$$ H(p, q) \;=\; H(p) + D_{\text{KL}}(p \,\|\, q). $$This decomposition is the most useful identity in the chapter. Three readings. As coding: encoding samples from $p$ with a code built for $q$ costs the unavoidable $H(p)$ bits plus exactly $D_{\text{KL}}(p \,\|\, q)$ extra bits for using the wrong model. As optimisation: $H(p)$ does not depend on $q$, so minimising cross-entropy over $q$ is the same as minimising $D_{\text{KL}}(p \,\|\, q)$. As statistics: with $p$ the empirical distribution of the data, $H(p, q)$ is the negative average log-likelihood of the model $q$, so minimising cross-entropy is maximum-likelihood estimation.

The third reading is why cross-entropy is the loss in classifiers. With $p$ the empirical distribution of class labels in a training batch and $q$ the model's predicted class probabilities, minimising cross-entropy maximises log-likelihood, exactly the MLE discussed in Section 05 of the previous chapter.
Two practical notes that save hours of debugging. First, cross-entropy is implemented in libraries to operate on logits (pre-softmax scores) rather than probabilities, because the combined "log-softmax + NLL" computation is numerically much more stable than computing the softmax and then taking its log. Second, the "label smoothing" trick mixes the one-hot $p$ with a uniform distribution; this is exactly equivalent to adding an entropy regulariser $-\beta H(q)$ to the loss, which prevents the model from becoming overconfident and tends to improve calibration.
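Both the decomposition and the numerical-stability point fit in a short sketch (our own toy numbers; the decomposition check uses natural log, so the quantities are in nats):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Cross-entropy of a one-hot label against logits, computed without ever
# materialising the softmax probabilities.
logits = [2.0, -1.0, 0.5]
label = 0
nll = -log_softmax(logits)[label]

# Check H(p, q) = H(p) + D_KL(p || q) on a soft target p.
p = [0.7, 0.2, 0.1]
q = [math.exp(l) for l in log_softmax(logits)]
Hp = -sum(pi * math.log(pi) for pi in p)
Hpq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
Dkl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert abs(Hpq - (Hp + Dkl)) < 1e-12
```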
Shannon's first theorem says every lossless code for a source uses at least $H$ bits per symbol on average, and a code coming arbitrarily close exists. Compression has a hard floor, and entropy is its name.
A code for a source $X$ over alphabet $\mathcal{X}$ is an injective map $C : \mathcal{X} \to \{0, 1\}^*$ assigning a binary string $C(x)$ to each symbol. The expected length of the code is
$$ L(C) \;=\; \sum_x p(x) \, \ell(x), $$where $\ell(x) = |C(x)|$ is the length of the codeword. We want $L(C)$ as small as possible while still being able to decode unambiguously.
Decodability requires more than injectivity. A prefix code (or "instantaneous code") has the property that no codeword is a prefix of another. Prefix codes are decodable on the fly: as soon as you see a complete codeword in the bitstream, you know it's complete. The set of prefix codes is exactly the set of codes with codewords corresponding to leaves of a binary tree, which gives Kraft's inequality:
$$ \sum_x 2^{-\ell(x)} \;\leq\; 1. $$Conversely, any set of lengths satisfying Kraft's inequality can be realised by a prefix code. Two interpretations: codes "consume" probability mass on a binary tree, and the total can't exceed 1; or, a Kraft-compliant length assignment is exactly $\ell(x) = -\log q(x)$ for some distribution $q$ — coding lengths and probability distributions are the same object.
The theorem states that the optimal prefix code $C^*$ satisfies $H(X) \leq L(C^*) < H(X) + 1$. The proof sketch is two ideas. The lower bound is one application of Jensen's inequality to the Kraft sum. The upper bound is constructive: choose $\ell(x) = \lceil -\log p(x) \rceil$ (the "Shannon code"), which always satisfies Kraft and has expected length within one bit of entropy. Block coding handles the sub-bit overhead.
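The constructive half is easy to check numerically: Shannon-code lengths $\lceil -\log p(x) \rceil$ always satisfy Kraft's inequality and land within one bit of entropy (PMF chosen arbitrarily):

```python
import math

p = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # Shannon code lengths

kraft = sum(2 ** -l for l in lengths)
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * l for pi, l in zip(p, lengths))

assert kraft <= 1        # Kraft's inequality: a prefix code with these lengths exists
assert H <= L < H + 1    # expected length within one bit of entropy
```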
The theorem is operational, not metaphysical. It does not just say compression has a limit; it says this is the limit, and explicit algorithms (next section) get arbitrarily close.
Three families of compression algorithms have stood the test of time. Each uses the entropy bound differently — and one of them is what zip, gzip, and DEFLATE actually do under the hood.
Huffman coding (1952). Given a known distribution $p$, build the optimal prefix code by a greedy bottom-up procedure: take the two least-probable symbols, merge them into a virtual symbol with their summed probability, and repeat. The resulting binary tree's leaves are the codewords. Huffman coding is optimal among prefix codes that encode one source symbol at a time, with expected length satisfying $H(X) \leq L_{\text{Huff}} < H(X) + 1$.
The "$+1$" is the catch: if entropy is 0.1 bits per symbol (e.g., a heavily biased Bernoulli), Huffman pays a full bit per symbol — a tenfold overhead. Block coding reduces the relative overhead but quickly becomes impractical, because block alphabet size grows exponentially.
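The greedy merge fits in a dozen lines using a heap (a sketch of ours that tracks only codeword lengths, not the codewords themselves):

```python
import heapq
import math
from itertools import count

def huffman_lengths(pmf):
    """Codeword lengths of a Huffman code: repeatedly merge the two least
    probable subtrees; every symbol inside a merge gets one bit deeper."""
    tiebreak = count()  # breaks probability ties so tuples always compare
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths

pmf = [0.4, 0.3, 0.2, 0.1]
L = sum(p * l for p, l in zip(pmf, huffman_lengths(pmf)))
H = -sum(p * math.log2(p) for p in pmf)
assert H <= L < H + 1   # here L = 1.9 bits against H ~ 1.846 bits
```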
Arithmetic coding (1976). Resolves the integer-length problem. Encode the entire message as a single number in $[0, 1)$, recursively narrowing an interval whose length is exactly $\prod_i p(x_i)$. Reading off the binary expansion of any number in the final interval recovers the message. The expected coding length per symbol approaches $H(X)$ to within a vanishingly small overhead — no "$+1$ per symbol" penalty. Arithmetic coding and its range-coding variants underwrite modern media compression (CABAC in H.264/H.265 video, the optional arithmetic mode of JPEG), and arithmetic coding is the entropy coder behind state-of-the-art neural compression.
Lempel–Ziv (1977/78). A different philosophy: don't assume the distribution is known, and let the algorithm learn it from the data. The LZ family parses the input into phrases not seen before, building a dictionary on the fly, and outputs a sequence of pointers to dictionary entries. Critically, both encoder and decoder build the same dictionary in the same way — no probability model needs to be transmitted. LZ77 powers DEFLATE (zip, gzip, PNG); LZ78 and its variant LZW powered GIF and the original UNIX compress. For long stationary sources, LZ algorithms are universal: their per-symbol coding rate converges to the source entropy without knowing the source.
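The dictionary-building idea fits in a dozen lines. This LZ78-style parser (a sketch, not a production codec) emits (index, symbol) pairs, each naming a previously seen phrase plus one new character:

```python
def lz78_parse(s):
    """Parse a string into LZ78 phrases: each phrase is the longest previously
    seen phrase plus one new symbol, emitted as a (dictionary index, symbol) pair."""
    dictionary = {"": 0}
    phrases, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch          # keep extending the match
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                    # flush a trailing phrase already in the dictionary
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases

print(lz78_parse("aababcbabc"))  # [(0, 'a'), (1, 'b'), (2, 'c'), (0, 'b'), (2, 'c')]
```

The decoder rebuilds the identical dictionary from the pairs alone, which is why no probability model needs to be transmitted.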
The minimum description length (MDL) principle takes this further: the best model for a dataset is the one that, together with the data encoded under it, gives the shortest total description. MDL is a frequentist alternative to Bayesian model selection that turns out to give nearly identical answers in regular cases. It also gives a clean information-theoretic justification for Occam's razor: simpler models compress better, all else equal.
Long random sequences live almost entirely in a vanishingly small region of sequence space, and inside that region they are nearly uniform. This single observation makes Shannon's coding theorems work.
Let $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} p$. The asymptotic equipartition property (AEP) is a consequence of the law of large numbers applied to $-\log p(X_i)$:
$$ -\frac{1}{n} \log p(X_1, \ldots, X_n) \;=\; -\frac{1}{n}\sum_{i=1}^n \log p(X_i) \;\xrightarrow{p}\; H(X). $$In words: the empirical average of self-information converges to entropy. So with high probability,
$$ p(X_1, \ldots, X_n) \;\approx\; 2^{-n H(X)}. $$The typical set $A_\varepsilon^{(n)}$ is the set of sequences whose probability is within $\varepsilon$ of $2^{-nH}$:
$$ A_\varepsilon^{(n)} \;=\; \left\{ (x_1, \ldots, x_n) : \left| -\frac{1}{n}\log p(x_1, \ldots, x_n) - H(X) \right| < \varepsilon \right\}. $$Three properties of this set together prove every coding theorem in this chapter. First, it is nearly certain: $\mathbb{P}(A_\varepsilon^{(n)}) \to 1$ as $n \to \infty$. Second, it is small: $|A_\varepsilon^{(n)}| \leq 2^{n(H + \varepsilon)}$, a vanishing fraction of the $|\mathcal{X}|^n$ possible sequences whenever $H < \log|\mathcal{X}|$. Third, it is nearly uniform: every sequence in it has probability close to $2^{-nH}$, and for large $n$ it contains at least $(1 - \varepsilon)\, 2^{n(H - \varepsilon)}$ elements.
The compression algorithm is now obvious: list the elements of $A_\varepsilon^{(n)}$ in some canonical order, and use the index as the codeword. Indices fit in $n(H + \varepsilon) + 1$ bits (the "+1" is for a flag distinguishing "typical, here is the index" from "non-typical, here is the raw sequence"). The total expected codelength is $n H + O(1)$, achieving Shannon's source coding theorem.
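The convergence behind the AEP is easy to watch happen. Sampling a Bernoulli(0.9) source (entropy about 0.469 bits) and computing the per-symbol surprisal of each sampled sequence:

```python
import math
import random

random.seed(0)
p = 0.9
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # ~0.469 bits

for n in (100, 10_000, 1_000_000):
    xs = (random.random() < p for _ in range(n))
    # -(1/n) log2 p(X_1, ..., X_n): the empirical average of self-information
    rate = -sum(math.log2(p) if x else math.log2(1 - p) for x in xs) / n
    print(n, round(rate, 4))  # marches toward H as n grows
```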
The AEP is also the conceptual prerequisite for the channel coding theorem of the next section — the joint typical set of "transmitted, received" pairs is the object whose size determines how many distinct messages can be reliably decoded.
Reliable communication over a noisy channel is possible at any rate below the channel's capacity, and impossible above it. The capacity has a closed form: the maximum mutual information between input and output.
A discrete memoryless channel is a conditional distribution $p(y \mid x)$ that takes an input symbol $x \in \mathcal{X}$ and produces an output symbol $y \in \mathcal{Y}$ independently each use. The simplest example is the binary symmetric channel (BSC) with crossover probability $f$: the input is a bit, and the channel flips it with probability $f$.
A code for the channel is a mapping from messages $w \in \{1, \ldots, 2^{nR}\}$ to length-$n$ codewords, plus a decoding rule from received sequences to messages. The rate $R$ is bits-per-channel-use; the error probability is the maximum over messages of the probability of incorrect decoding.
The capacity of the channel is
$$ C \;=\; \max_{p(x)} I(X; Y), $$the maximum mutual information between input and output, taken over all input distributions. For the BSC,
$$ C_{\text{BSC}} \;=\; 1 - H(f), $$where $H(f) = -f \log f - (1 - f) \log(1 - f)$ is the binary entropy function. A noiseless BSC ($f = 0$) has capacity 1 bit per use; a maximally noisy one ($f = 0.5$) has capacity 0.
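The capacity formula in code (helper names are ours):

```python
import math

def binary_entropy(f):
    """H(f) in bits, with H(0) = H(1) = 0 by continuity."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)

def bsc_capacity(f):
    """Capacity of a binary symmetric channel with crossover probability f."""
    return 1 - binary_entropy(f)

print(bsc_capacity(0.0))             # 1.0: noiseless, one full bit per use
print(bsc_capacity(0.5))             # 0.0: output independent of input
print(round(bsc_capacity(0.11), 3))  # ~0.5: at 11% flips, half a bit per use
```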
The proof is the high point of the whole subject. The achievability direction uses a non-constructive argument — averaged over random codes drawn i.i.d. from the capacity-achieving input distribution, the expected error rate goes to zero, so a deterministic code achieving the same rate must exist. The geometric idea: $2^{nR}$ codewords scattered randomly in input space have, with high probability, non-overlapping joint-typical "spheres" around them in output space, so a maximum-likelihood decoder works.
The converse uses Fano's inequality (which lower-bounds the error probability in terms of the conditional entropy of the message given the channel output) to show that any code with rate above capacity has error probability bounded away from zero, no matter how clever the decoder. The two directions together pin capacity to its exact value.
Modern error-correcting codes — Hamming, Reed–Solomon, BCH, convolutional codes, turbo codes, LDPC codes, polar codes — are practical realisations of Shannon's theorem. LDPC and polar codes both come within a fraction of a dB of capacity in modern wireless standards (5G, Wi-Fi 7); for decades after 1948 we knew capacity was achievable in principle but had no idea how to do so efficiently. The story of practical coding is the story of closing that gap.
For continuous channels with additive Gaussian noise, capacity has the most-cited equation in communications: $C = \tfrac{1}{2}\log(1 + \mathrm{SNR})$. It is the reason every modem in history has a top speed.
The additive white Gaussian noise (AWGN) channel takes a real-valued input $X$ with average power constraint $\mathbb{E}[X^2] \leq P$ and outputs $Y = X + Z$ where $Z \sim \mathcal{N}(0, N)$ is independent of $X$.
Under the power constraint, the capacity of the AWGN channel is
$$ C \;=\; \frac{1}{2} \log_2\!\left(1 + \frac{P}{N}\right) \quad \text{bits per channel use}, $$achieved when the input is Gaussian: $X \sim \mathcal{N}(0, P)$. The proof uses two facts from differential entropy (next section): the Gaussian maximises entropy among distributions with a given variance, and the conditional entropy $h(Y \mid X) = h(Z)$ depends only on the noise.
For a band-limited channel of bandwidth $W$ Hz with two samples per Hz (Nyquist), the capacity per second is
$$ C \;=\; W \log_2\!\left(1 + \frac{P}{N_0 W}\right) \quad \text{bits per second}, $$the famous Shannon–Hartley formula. This is the absolute upper bound on the data rate of any digital communication system using a band-limited Gaussian channel — modems, Wi-Fi, cellular, fibre optics, every Earth-to-spacecraft link. Doubling the bandwidth doubles capacity. Doubling the SNR adds only one bit per use.
The Gaussian channel also clarifies the role of coding rate in modern systems. A "1/2 rate" code transmits one bit of information for every two channel uses; this lets you operate well below capacity, paying redundancy in exchange for very low error rates. A "5/6 rate" code transmits closer to capacity but is more fragile. Adaptive modulation and coding schemes — every modern wireless standard has them — pick the highest rate that the current SNR can support, sliding along the Shannon limit as channel conditions change.
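The Shannon–Hartley formula and both scaling behaviours can be sketched directly (a 20 MHz channel at 30 dB SNR is an illustrative choice of ours, not a claim about any particular standard):

```python
import math

def shannon_hartley(bandwidth_hz, snr_linear):
    """Capacity in bits/s of a band-limited AWGN channel: C = W log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

snr = 10 ** (30 / 10)           # 30 dB converts to a linear SNR of 1000
c = shannon_hartley(20e6, snr)
print(f"{c / 1e6:.1f} Mbit/s")  # hard upper bound, ~199.3 Mbit/s

# Doubling bandwidth doubles capacity; doubling SNR adds only ~W bits/s.
assert abs(shannon_hartley(40e6, snr) - 2 * c) < 1e-6
extra = shannon_hartley(20e6, 2 * snr) - c
assert abs(extra - 20e6) / 20e6 < 0.01
```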
Replace sums with integrals and you get differential entropy — almost the same theory, with a few subtleties that catch beginners. KL divergence, mutual information, and cross-entropy all transfer cleanly; entropy itself does not.
For a continuous random variable $X$ with density $f(x)$, the differential entropy is
$$ h(X) \;=\; -\int f(x) \log f(x)\, dx. $$Two warnings up front. First, differential entropy can be negative. A uniform distribution on $[0, 1/2]$ has $h = -1$ bit. Discrete entropy is the average codeword length and so is non-negative; differential entropy is a "density of bits" and has no such guarantee. Second, $h(X)$ is not invariant under change of variables. If $Y = aX$, then $h(Y) = h(X) + \log|a|$ — scaling a variable changes its differential entropy by a Jacobian factor. The discrete entropy of a relabelling is unchanged.
Two key entropies worth memorising:
$$ h(\mathcal{N}(\mu, \sigma^2)) \;=\; \tfrac{1}{2}\log(2\pi e \sigma^2), \qquad h(\text{Uniform}[0, a]) \;=\; \log a. $$The Gaussian formula has a deep generalisation: among all densities with a given variance, the Gaussian has the largest differential entropy. More broadly, the maximum-entropy distribution under moment constraints is always an exponential family: uniform under a support constraint, exponential under a mean constraint on $[0, \infty)$, Gaussian under a variance constraint. This maximum-entropy property is what makes Gaussian inputs capacity-achieving for the AWGN channel of the previous section.
KL divergence transfers from sums to integrals without issue:
$$ D_{\text{KL}}(p \,\|\, q) \;=\; \int p(x) \log \frac{p(x)}{q(x)}\, dx, $$which is well-defined whenever $p$ is absolutely continuous with respect to $q$ (no mass where $q$ has none). It is non-negative, asymmetric, and zero iff $p = q$ — exactly as in the discrete case. Cross-entropy and mutual information transfer the same way:
$$ I(X; Y) \;=\; \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy, $$which is non-negative and zero iff $X$ and $Y$ are independent. Differential entropy alone is the awkward member of the family; everything that combines two distributions behaves cleanly.
A useful closed form: the KL divergence between two multivariate Gaussians $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$ in $d$ dimensions is
$$ D_{\text{KL}} \;=\; \tfrac{1}{2}\!\left[\operatorname{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1}(\mu_1 - \mu_0) - d + \log \frac{\det \Sigma_1}{\det \Sigma_0}\right]. $$This is the formula every variational autoencoder uses to compute its KL regulariser, and the formula every diffusion model uses to compute training losses. Memorise it; it pays for itself within a week of reading any modern deep-learning paper.
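The formula transcribes directly into NumPy (a sketch; `kl_gauss` is our name, and inputs are assumed to be valid mean vectors and positive-definite covariances):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) in nats, via the closed form."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu, I2 = np.zeros(2), np.eye(2)
assert abs(kl_gauss(mu, I2, mu, I2)) < 1e-12                 # KL(p || p) = 0
assert abs(kl_gauss(mu, I2, np.ones(2), I2) - 1.0) < 1e-9    # pure mean shift: ||diff||^2 / 2
assert kl_gauss(mu, I2, mu, 2 * I2) != kl_gauss(mu, 2 * I2, mu, I2)  # asymmetric
```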
The KL divergence between two infinitesimally separated members of a parametric family is the Fisher information matrix. This single fact links information theory to statistics, and to a Riemannian geometry on the space of distributions.
Consider a parametric family of densities $\{p_\theta(x) : \theta \in \mathbb{R}^d\}$ — the kind of model you fit with maximum likelihood. The Fisher information matrix at $\theta$ is
$$ F(\theta) \;=\; \mathbb{E}_\theta\!\left[\nabla_\theta \log p_\theta(X)\, \nabla_\theta \log p_\theta(X)^\top\right] \;=\; -\mathbb{E}_\theta\!\left[\nabla_\theta^2 \log p_\theta(X)\right]. $$The two expressions are equal under regularity (the score's variance equals minus the expected Hessian — a special case of the "information identity"). $F(\theta)$ is symmetric, positive semi-definite, and measures how sharply the log-likelihood is curved around $\theta$. Sharp curvature means data are highly informative about $\theta$; flat curvature means the parameter is poorly identified.
The deep connection: the KL divergence between $p_\theta$ and $p_{\theta + \delta}$ for small $\delta$ is
$$ D_{\text{KL}}(p_\theta \,\|\, p_{\theta + \delta}) \;=\; \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta + O(\|\delta\|^3). $$KL divergence is locally a quadratic form, with the Fisher information matrix as its metric. This is the starting point of information geometry: the parameter space of a statistical model has a natural Riemannian structure, with $F(\theta)$ as the metric tensor. Geodesics in this geometry are the natural notion of "shortest path between two distributions"; gradients with respect to the Fisher metric — the natural gradient — give parameter updates that don't depend on the parameterisation chosen.
Two small payoffs of the geometric viewpoint. First, the Cramér–Rao lower bound from Chapter 05 reads cleanly: the inverse Fisher matrix is the local covariance of any efficient estimator. Second, the maximum-entropy distributions of Section 12 are exactly the densities sitting at the "centroids" of certain natural sub-manifolds in this geometry — a perspective due to Amari that sits underneath much of modern variational inference.
Almost every loss, regulariser, and architectural choice in modern ML has an information-theoretic ancestor in this chapter. Naming the connections is what turns a list of techniques into a single subject.
A list worth memorising. Cross-entropy loss is maximum-likelihood estimation, which is minimisation of $D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, q_\theta)$. A language model's per-token loss is one term in the chain-rule decomposition of a sequence's joint entropy, and its perplexity is two to the power of its cross-entropy in bits. The VAE's regulariser is the closed-form Gaussian KL of the differential-entropy section, and diffusion training losses are sums of such terms. Contrastive representation learning (InfoNCE) maximises a variational lower bound on mutual information. Label smoothing is an entropy regulariser on the predictions. Natural-gradient and trust-region methods measure step size in KL, that is, in the Fisher metric. And MDL-style regularisation is the source coding theorem applied to model selection.
The pattern is the same one we saw in earlier chapters. A handful of equations from a 1948 paper, applied at the scale and function-class flexibility of modern computation, generates an entire technical vocabulary that practitioners use without always knowing it. Knowing the older language doesn't slow you down; it lets you stop pattern-matching to specific papers and start thinking in the underlying ideas.
Up next in the compendium: Bayesian reasoning, which turns the KL divergences and likelihood functions of this chapter into a complete inferential framework; and signal processing, which gives the harmonic-analytic toolkit that audio, image, and time-series models rest on. Together these eight chapters are the mathematical engine room of modern machine learning.
Information theory is a small, deep subject — a handful of quantities and a few theorems that connect them — and almost anything worth knowing about it is covered somewhere in the sources below. Read Shannon's 1948 paper once; it is still clearer than most textbooks.
This is Chapter 06 of an eight-chapter tour of the mathematical foundations of AI. Up next: Bayesian reasoning and signal processing — two further lenses on the same machinery.