In 1948 Claude Shannon proposed a single quantity — entropy — that turned out to measure, all at once, how much a data source can be compressed, how much can be transmitted over a noisy channel, and how far one probability distribution is from another. Modern machine learning runs on those same three ideas under new names: cross-entropy loss, KL divergence, mutual information. This chapter develops them from first principles, then traces their fingerprints across every major corner of contemporary AI.
Sections build on one another, so read in order the first time through. The prose carries the argument; the equations make it precise; the diagrams keep it from going abstract. Everywhere we speak of "probability," assume the relevant distribution is well-defined — this is not a measure-theoretic chapter, but every result here extends to that setting.
Notation: $p$, $q$ denote probability mass (or density) functions; $X$, $Y$ are random variables; $H(X)$ is the entropy of $X$; $H(X,Y)$ the joint entropy; $H(X \mid Y)$ conditional; $I(X;Y)$ mutual information; $D_{KL}(p \Vert q)$ the Kullback–Leibler divergence; $h(X)$ differential entropy of a continuous variable; $F(\theta)$ the Fisher information matrix. Logarithms are base-$2$ when we count bits and natural when we count nats; both conventions appear here, and the identity $\log_2 x = (\ln x)/\ln 2$ is all that converts between them.
Claude Shannon's 1948 paper turned communication into mathematics. The trick: measure information by how surprising it is, and the entire theory of compression, transmission, and learning falls out of three equations.
Before 1948, "information" was a word engineers used informally to describe what flowed through a telegraph wire or a radio link. Shannon's A Mathematical Theory of Communication made it a quantity. The single idea is this: the information content of a message is not its meaning, its length, or its visual complexity — it is the amount of surprise the message resolves. A message you already expected carries little information; a message you considered impossible carries a lot.
Make that quantitative and three things happen at once. First, you get a precise measure of how compressible any data source is — the entropy of the source. Second, you get a precise measure of how much information can be transmitted through a noisy channel — the capacity of the channel. Third, you get a unifying notion of "distance between distributions" — the KL divergence — that turns out to underwrite half of modern machine learning. Compression, transmission, and learning are the same theory, told from three angles.
The chapter follows the canonical arc. Sections 02–06 build the basic measures: entropy, joint and conditional entropy, mutual information, KL divergence, cross-entropy. Sections 07–09 cover Shannon's source coding theorem and the practical compression algorithms it inspired. Sections 10–11 cover channel capacity and the famous Shannon–Hartley limit. Section 12 handles the continuous case (differential entropy). Section 13 makes the connection back to statistics through Fisher information and information geometry. Section 14 closes by mapping every concept onto its modern incarnation in machine learning.
Notation is light. Logs default to base 2 unless otherwise stated; switching to natural log changes "bits" to "nats" and rescales every quantity by $1/\ln 2$, but never changes a sign or a relationship. We use $H$ for entropy, $D$ for KL divergence, $I$ for mutual information, and lowercase $p, q$ for probability mass functions or densities. Random variables are uppercase ($X, Y$); their realisations are lowercase ($x, y$).
Three reasonable axioms for "how surprising is an outcome of probability $p$" force a unique answer: $-\log p$. Averaging that surprise gives entropy, and entropy turns out to be the right unit of information.
Define the self-information (or "surprise") of an event of probability $p$ as
$$ h(p) \;=\; -\log p \;=\; \log \frac{1}{p}. $$Three properties pin this choice down. Surprise should be non-negative ($p \leq 1$). Surprise should decrease in probability — likely events are less surprising. And the surprise of two independent events should add: if $A$ and $B$ are independent then $\mathbb{P}(A \cap B) = p_A p_B$, and we want $h(p_A p_B) = h(p_A) + h(p_B)$. The only continuous, non-negative, decreasing function satisfying $h(pq) = h(p) + h(q)$ is $h(p) = -c \log p$ for some constant $c > 0$. Choosing $c = 1$ with $\log = \log_2$ measures information in bits; $c = 1$ with $\log = \ln$ measures it in nats ($1\text{ nat} = 1/\ln 2 \approx 1.443$ bits).
The entropy of a discrete random variable $X$ taking values in $\mathcal{X}$ with PMF $p(x)$ is the expected surprise:
$$ H(X) \;=\; \mathbb{E}_p[h(p(X))] \;=\; -\sum_{x \in \mathcal{X}} p(x) \log p(x), $$with the convention $0 \log 0 = 0$ (extended by continuity, since $p \log p \to 0$ as $p \to 0$). Three concrete examples lock this in. A fair coin: $H = -\tfrac{1}{2}\log\tfrac{1}{2} - \tfrac{1}{2}\log\tfrac{1}{2} = 1$ bit. A biased coin with $p = 0.9$: $H \approx 0.469$ bits — less surprising on average. A uniform die with six faces: $H = \log_2 6 \approx 2.585$ bits.
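The three examples can be checked in a few lines of Python (a minimal sketch; the helper name `entropy` is ours):

```python
import math

def entropy(pmf, base=2):
    """H(X) = -sum p log p, with the 0 log 0 = 0 convention."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([0.9, 0.1]))   # biased coin: ~0.469 bits
print(entropy([1/6] * 6))    # fair die: ~2.585 bits
```

Passing `base=math.e` gives the same quantities in nats.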
More generally, for a uniform distribution on $|\mathcal{X}| = n$ outcomes,
$$ H(X) \;=\; \log_2 n. $$This is the maximum entropy any distribution on $n$ outcomes can have. The intuition: uniform is the "most uncertain" distribution, and entropy quantifies uncertainty.
The binary entropy function $H(p) = -p\log p - (1-p)\log(1-p)$ for $p \in [0, 1]$ deserves to be memorised. It is zero at $p = 0$ and $p = 1$, peaks at $H(\tfrac{1}{2}) = 1$, and traces the symmetric concave arch that makes it the natural "uncertainty" curve for any binary classifier output. Every binary cross-entropy loss in machine learning is computing exactly this function, averaged over a training set.
Two random variables have one joint entropy and two conditional entropies, related by an algebra so clean it transfers verbatim from probability into information theory.
The joint entropy of two random variables $X$ and $Y$ with joint PMF $p(x, y)$ is
$$ H(X, Y) \;=\; -\sum_{x, y} p(x, y) \log p(x, y). $$The conditional entropy of $Y$ given $X$ is the expected entropy of $Y$ once $X$ has been observed:
$$ H(Y \mid X) \;=\; \sum_x p(x)\, H(Y \mid X = x) \;=\; -\sum_{x, y} p(x, y) \log p(y \mid x). $$Read it as: average over the value of $X$ the residual uncertainty in $Y$. If $X$ and $Y$ are independent, observing $X$ tells you nothing and $H(Y \mid X) = H(Y)$. If $Y$ is a deterministic function of $X$, observing $X$ tells you everything and $H(Y \mid X) = 0$.
The chain rule for entropy is the workhorse identity:
$$ H(X, Y) \;=\; H(X) + H(Y \mid X) \;=\; H(Y) + H(X \mid Y). $$Reading left-to-right: the joint uncertainty in $(X, Y)$ equals the uncertainty in $X$ plus the residual uncertainty in $Y$ once $X$ is known. The proof is two lines of algebra from $\log p(x, y) = \log p(x) + \log p(y \mid x)$. The general $n$-variable version,
$$ H(X_1, \ldots, X_n) \;=\; \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1}), $$underwrites every autoregressive model in modern ML — a language model is decomposing the joint entropy of a sequence into a sum of conditional entropies, one per token.
Two more useful inequalities:
$$ H(X, Y) \;\leq\; H(X) + H(Y), \qquad H(X, Y) \;\geq\; \max\{H(X),\, H(Y)\}. $$The first is "subadditivity" — joint uncertainty is at most the sum of individual uncertainties, with equality iff independent. The second is "the joint is at least as uncertain as either marginal" — adding a variable cannot reduce uncertainty. Together they bracket $H(X, Y)$ in a way that the next section will visualise as a Venn diagram.
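The chain rule and both inequalities are easy to verify numerically on a small joint PMF (the table below is an arbitrary example of ours):

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# An arbitrary joint PMF p(x, y) for illustration.
joint = {(0, 0): 0.2, (0, 1): 0.2, (0, 2): 0.1,
         (1, 0): 0.1, (1, 1): 0.1, (1, 2): 0.3}

px = {0: 0.5, 1: 0.5}             # marginal of X
py = {0: 0.3, 1: 0.3, 2: 0.4}     # marginal of Y
Hxy, Hx, Hy = H(joint.values()), H(px.values()), H(py.values())

# H(Y | X) = -sum_{x,y} p(x, y) log p(y | x)
Hy_given_x = -sum(p * math.log2(p / px[x]) for (x, y), p in joint.items())

assert abs(Hxy - (Hx + Hy_given_x)) < 1e-12   # chain rule
assert Hxy <= Hx + Hy + 1e-12                 # subadditivity
assert Hxy >= max(Hx, Hy) - 1e-12             # joint dominates marginals
```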
How much does knowing $X$ tell you about $Y$? The answer — a single non-negative number with three equivalent definitions — is the central object of information theory.
The mutual information between $X$ and $Y$ is the reduction in uncertainty about one variable from observing the other:
$$ I(X; Y) \;=\; H(Y) - H(Y \mid X) \;=\; H(X) - H(X \mid Y). $$The two expressions are equal — mutual information is symmetric, which is not obvious from either form alone but follows immediately from the chain rule:
$$ I(X; Y) \;=\; H(X) + H(Y) - H(X, Y). $$A third equivalent form, which we revisit once KL divergence is on the table:
$$ I(X; Y) \;=\; \sum_{x, y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} \;=\; D_{\text{KL}}\!\left(p(x, y) \,\|\, p(x)\,p(y)\right). $$Mutual information is the KL divergence between the joint distribution and the product of the marginals. It is zero iff $X$ and $Y$ are independent, and otherwise strictly positive. This single fact is why mutual information is the right "association measure" between random variables — it captures every form of dependence, linear or nonlinear, and reduces to zero exactly when there is none. Pearson correlation has neither property.
The Venn diagram makes every identity in this chapter visible. The two circles are $H(X)$ and $H(Y)$. Their union is $H(X, Y)$. Their intersection is $I(X; Y)$. The crescents are the conditional entropies $H(X \mid Y)$ and $H(Y \mid X)$. Reading off relations:
$$ I(X; Y) \;=\; H(X) + H(Y) - H(X, Y), $$ $$ H(X \mid Y) \;=\; H(X) - I(X; Y), $$ $$ H(X, Y) \;=\; H(X \mid Y) + I(X; Y) + H(Y \mid X). $$Computing mutual information from data is harder than computing it from a known joint distribution. Estimators based on binning are biased and high-variance; modern practice uses neural-network-based variational estimators (MINE, InfoNCE) that exploit lower-bound representations of MI as expectations.
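When the joint distribution is known, all three definitions of mutual information agree, which a few lines confirm on a toy example (the numbers are ours):

```python
import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A correlated pair of bits: equal marginals, strong diagonal.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = py = {0: 0.5, 1: 0.5}

mi_venn = H(px.values()) + H(py.values()) - H(joint.values())
mi_cond = H(py.values()) - (-sum(p * math.log2(p / px[x])
                                 for (x, y), p in joint.items()))
mi_kl = sum(p * math.log2(p / (px[x] * py[y])) for (x, y), p in joint.items())

assert abs(mi_venn - mi_kl) < 1e-12
assert abs(mi_venn - mi_cond) < 1e-12
print(round(mi_venn, 4))  # about 0.278 bits of shared information
```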
Two probability distributions over the same space have a "distance" — not symmetric, not a metric, but more useful than either. Almost every loss function in machine learning is a KL divergence wearing a costume.
The relative entropy or Kullback–Leibler divergence from $q$ to $p$ is
$$ D_{\text{KL}}(p \,\|\, q) \;=\; \sum_x p(x) \log \frac{p(x)}{q(x)} \;=\; \mathbb{E}_p\!\left[\log \frac{p(X)}{q(X)}\right]. $$Read it as: the expected number of extra bits required to encode a sample from $p$ if you used a code optimised for $q$ instead. Or, equivalently, the expected log-likelihood ratio between the true model $p$ and a candidate model $q$, when data are sampled from $p$. Both readings will show up in this chapter.
Two non-obvious properties make KL the universal "distance" in statistical settings. First, Gibbs' inequality: $D_{\text{KL}}(p \,\|\, q) \geq 0$, with equality iff $p = q$ (one application of Jensen's inequality to the concave logarithm). Second, asymmetry: in general $D_{\text{KL}}(p \,\|\, q) \neq D_{\text{KL}}(q \,\|\, p)$, and the two directions penalise different errors. Minimising the "reverse" direction $D_{\text{KL}}(q \,\|\, p)$ over $q$ is mode-seeking: $q$ is punished for putting mass where $p$ has little, so it locks onto a single mode. Minimising the "forward" direction $D_{\text{KL}}(p \,\|\, q)$ is mean-seeking: $q$ is punished for missing mass that $p$ has, so it spreads to cover everything.

The asymmetry is not a bug. Variational inference picks the mode-seeking KL on purpose (it gives compact, peaky approximations); maximum likelihood implicitly uses the mean-seeking one (it spreads mass to cover the data). Understanding which KL you are minimising is half of understanding any modern probabilistic model.
The link back to mutual information is now visible: $I(X; Y) = D_{\text{KL}}(p(x, y) \,\|\, p(x) p(y))$ is the KL between the joint and the product-of-marginals. So $I(X; Y) = 0$ iff $X$ and $Y$ are independent, exactly because Gibbs' inequality says KL is zero only at equality of distributions. The entire algebra of entropy, conditional entropy, and mutual information is "KL divergence applied to specific pairs of distributions" — and that is the cleanest way to remember it.
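A direct computation shows both Gibbs' inequality and the asymmetry (the two distributions are chosen arbitrarily for illustration):

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; assumes q(x) > 0 wherever p(x) > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p, q = [0.5, 0.5], [0.9, 0.1]
print(kl(p, q))  # ~0.737 bits
print(kl(q, p))  # ~0.531 bits: a different number, KL is not symmetric

assert kl(p, p) == 0                    # zero iff the distributions match
assert kl(p, q) > 0 and kl(q, p) > 0    # Gibbs' inequality
```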
Cross-entropy is the loss function for almost every classifier ever trained. It is also the cleanest way to see the relationship between entropy, KL divergence, and maximum-likelihood estimation.
The cross-entropy from $p$ to $q$ is
$$ H(p, q) \;=\; -\sum_x p(x) \log q(x) \;=\; \mathbb{E}_p[-\log q(X)]. $$Operationally: the expected number of bits needed to encode samples from $p$ using a code designed for $q$. It is always at least $H(p)$, with equality iff $q = p$. The gap is exactly the KL divergence:
$$ H(p, q) \;=\; H(p) + D_{\text{KL}}(p \,\|\, q). $$This decomposition is the most useful identity in the chapter. Three readings. As coding: encoding samples from $p$ with a code built for $q$ costs the unavoidable $H(p)$ bits plus exactly $D_{\text{KL}}(p \,\|\, q)$ extra bits for using the wrong model. As optimisation: $H(p)$ does not depend on $q$, so minimising cross-entropy over $q$ is the same as minimising $D_{\text{KL}}(p \,\|\, q)$. As statistics: with $p$ the empirical distribution of the data, $H(p, q)$ is the negative average log-likelihood of the model $q$, so minimising cross-entropy is maximum-likelihood estimation.

The third reading is why cross-entropy is the loss in classifiers. With $p$ the empirical distribution of class labels in a training batch and $q$ the model's predicted class probabilities, minimising cross-entropy maximises log-likelihood, exactly the MLE discussed in Section 05 of the previous chapter.
Two practical notes that save hours of debugging. First, cross-entropy is implemented in libraries to operate on logits (pre-softmax scores) rather than probabilities, because the combined "log-softmax + NLL" computation is numerically much more stable than computing the softmax and then taking its log. Second, the "label smoothing" trick mixes the one-hot $p$ with a uniform distribution; this is exactly equivalent to adding an entropy regulariser $-\beta H(q)$ to the loss, which prevents the model from becoming overconfident and tends to improve calibration.
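Both the decomposition and the numerical-stability point fit in a short sketch (our own toy numbers; the decomposition check uses natural log, so the quantities are in nats):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax: subtract the max before exponentiating."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

# Cross-entropy of a one-hot label against logits, computed without ever
# materialising the softmax probabilities.
logits = [2.0, -1.0, 0.5]
label = 0
nll = -log_softmax(logits)[label]

# Check H(p, q) = H(p) + D_KL(p || q) on a soft target p.
p = [0.7, 0.2, 0.1]
q = [math.exp(l) for l in log_softmax(logits)]
Hp = -sum(pi * math.log(pi) for pi in p)
Hpq = -sum(pi * math.log(qi) for pi, qi in zip(p, q))
Dkl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
assert abs(Hpq - (Hp + Dkl)) < 1e-12
```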
Shannon's first theorem says every lossless code for a source uses at least $H$ bits per symbol on average, and a code coming arbitrarily close exists. Compression has a hard floor, and entropy is its name.
A code for a source $X$ over alphabet $\mathcal{X}$ is an injective map $C : \mathcal{X} \to \{0, 1\}^*$ assigning a binary string $C(x)$ to each symbol. The expected length of the code is
$$ L(C) \;=\; \sum_x p(x) \, \ell(x), $$where $\ell(x) = |C(x)|$ is the length of the codeword. We want $L(C)$ as small as possible while still being able to decode unambiguously.
Decodability requires more than injectivity. A prefix code (or "instantaneous code") has the property that no codeword is a prefix of another. Prefix codes are decodable on the fly: as soon as you see a complete codeword in the bitstream, you know it's complete. The set of prefix codes is exactly the set of codes with codewords corresponding to leaves of a binary tree, which gives Kraft's inequality:
$$ \sum_x 2^{-\ell(x)} \;\leq\; 1. $$Conversely, any set of lengths satisfying Kraft's inequality can be realised by a prefix code. Two interpretations: codes "consume" probability mass on a binary tree, and the total can't exceed 1; or, a Kraft-compliant length assignment is exactly $\ell(x) = -\log q(x)$ for some distribution $q$ — coding lengths and probability distributions are the same object.
The theorem states that the optimal prefix code $C^*$ satisfies $H(X) \leq L(C^*) < H(X) + 1$. The proof sketch is two ideas. The lower bound is one application of Jensen's inequality to the Kraft sum. The upper bound is constructive: choose $\ell(x) = \lceil -\log p(x) \rceil$ (the "Shannon code"), which always satisfies Kraft and has expected length within one bit of entropy. Block coding handles the sub-bit overhead.
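The constructive half is easy to check numerically: Shannon-code lengths $\lceil -\log p(x) \rceil$ always satisfy Kraft's inequality and land within one bit of entropy (PMF chosen arbitrarily):

```python
import math

p = [0.4, 0.3, 0.2, 0.1]
lengths = [math.ceil(-math.log2(pi)) for pi in p]  # Shannon code lengths

kraft = sum(2 ** -l for l in lengths)
H = -sum(pi * math.log2(pi) for pi in p)
L = sum(pi * l for pi, l in zip(p, lengths))

assert kraft <= 1        # Kraft's inequality: a prefix code with these lengths exists
assert H <= L < H + 1    # expected length within one bit of entropy
```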
The theorem is operational, not metaphysical. It does not just say compression has a limit; it says this is the limit, and explicit algorithms (next section) get arbitrarily close.
Three families of compression algorithms have stood the test of time. Each uses the entropy bound differently — and one of them is what zip, gzip, and DEFLATE actually do under the hood.
Huffman coding (1952). Given a known distribution $p$, build the optimal prefix code by a greedy bottom-up procedure: take the two least-probable symbols, merge them into a virtual symbol with their summed probability, and repeat. The resulting binary tree's leaves are the codewords. Huffman coding is optimal among prefix codes that encode one source symbol at a time, with expected length satisfying $H(X) \leq L_{\text{Huff}} < H(X) + 1$.
The "$+1$" is the catch: if entropy is 0.1 bits per symbol (e.g., a heavily biased Bernoulli), Huffman pays a full bit per symbol — a tenfold overhead. Block coding reduces the relative overhead but quickly becomes impractical, because block alphabet size grows exponentially.
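The greedy merge fits in a dozen lines using a heap (a sketch of ours that tracks only codeword lengths, not the codewords themselves):

```python
import heapq
import math
from itertools import count

def huffman_lengths(pmf):
    """Codeword lengths of a Huffman code: repeatedly merge the two least
    probable subtrees; every symbol inside a merge gets one bit deeper."""
    tiebreak = count()  # breaks probability ties so tuples always compare
    heap = [(p, next(tiebreak), [i]) for i, p in enumerate(pmf)]
    heapq.heapify(heap)
    lengths = [0] * len(pmf)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, next(tiebreak), s1 + s2))
    return lengths

pmf = [0.4, 0.3, 0.2, 0.1]
L = sum(p * l for p, l in zip(pmf, huffman_lengths(pmf)))
H = -sum(p * math.log2(p) for p in pmf)
assert H <= L < H + 1   # here L = 1.9 bits against H ~ 1.846 bits
```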
Arithmetic coding (1976). Resolves the integer-length problem. Encode the entire message as a single number in $[0, 1)$, recursively narrowing an interval whose length is exactly $\prod_i p(x_i)$. Reading off the binary expansion of any number in the final interval recovers the message. The expected coding length per symbol approaches $H(X)$ to within a vanishingly small overhead — no "$+1$ per symbol" penalty. Arithmetic coding and its range-coding variants underwrite modern media compression (CABAC in H.264/H.265 video, the optional arithmetic mode of JPEG), and arithmetic coding is the entropy coder behind state-of-the-art neural compression.
Lempel–Ziv (1977/78). A different philosophy: don't assume the distribution is known, and let the algorithm learn it from the data. The LZ family parses the input into phrases not seen before, building a dictionary on the fly, and outputs a sequence of pointers to dictionary entries. Critically, both encoder and decoder build the same dictionary in the same way — no probability model needs to be transmitted. LZ77 powers DEFLATE (zip, gzip, PNG); LZ78 and its variant LZW powered GIF and the original UNIX compress. For long stationary sources, LZ algorithms are universal: their per-symbol coding rate converges to the source entropy without knowing the source.
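The dictionary-building idea fits in a dozen lines. This LZ78-style parser (a sketch, not a production codec) emits (index, symbol) pairs, each naming a previously seen phrase plus one new character:

```python
def lz78_parse(s):
    """Parse a string into LZ78 phrases: each phrase is the longest previously
    seen phrase plus one new symbol, emitted as a (dictionary index, symbol) pair."""
    dictionary = {"": 0}
    phrases, current = [], ""
    for ch in s:
        if current + ch in dictionary:
            current += ch          # keep extending the match
        else:
            phrases.append((dictionary[current], ch))
            dictionary[current + ch] = len(dictionary)
            current = ""
    if current:                    # flush a trailing phrase already in the dictionary
        phrases.append((dictionary[current[:-1]], current[-1]))
    return phrases

print(lz78_parse("aababcbabc"))  # [(0, 'a'), (1, 'b'), (2, 'c'), (0, 'b'), (2, 'c')]
```

The decoder rebuilds the identical dictionary from the pairs alone, which is why no probability model needs to be transmitted.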
The minimum description length (MDL) principle takes this further: the best model for a dataset is the one that, together with the data encoded under it, gives the shortest total description. MDL is a frequentist alternative to Bayesian model selection that turns out to give nearly identical answers in regular cases. It also gives a clean information-theoretic justification for Occam's razor: simpler models compress better, all else equal.
Long random sequences live almost entirely in a vanishingly small region of sequence space, and inside that region they are nearly uniform. This single observation makes Shannon's coding theorems work.
Let $X_1, X_2, \ldots, X_n \stackrel{\text{i.i.d.}}{\sim} p$. The asymptotic equipartition property (AEP) is a consequence of the law of large numbers applied to $-\log p(X_i)$:
$$ -\frac{1}{n} \log p(X_1, \ldots, X_n) \;=\; -\frac{1}{n}\sum_{i=1}^n \log p(X_i) \;\xrightarrow{p}\; H(X). $$In words: the empirical average of self-information converges to entropy. So with high probability,
$$ p(X_1, \ldots, X_n) \;\approx\; 2^{-n H(X)}. $$The typical set $A_\varepsilon^{(n)}$ is the set of sequences whose probability is within $\varepsilon$ of $2^{-nH}$:
$$ A_\varepsilon^{(n)} \;=\; \left\{ (x_1, \ldots, x_n) : \left| -\frac{1}{n}\log p(x_1, \ldots, x_n) - H(X) \right| < \varepsilon \right\}. $$Three properties of this set together prove every coding theorem in this chapter. First, it is nearly certain: $\mathbb{P}(A_\varepsilon^{(n)}) \to 1$ as $n \to \infty$. Second, it is small: $|A_\varepsilon^{(n)}| \leq 2^{n(H + \varepsilon)}$, a vanishing fraction of the $|\mathcal{X}|^n$ possible sequences whenever $H < \log|\mathcal{X}|$. Third, it is nearly uniform: every sequence in it has probability close to $2^{-nH}$, and for large $n$ it contains at least $(1 - \varepsilon)\, 2^{n(H - \varepsilon)}$ elements.
The compression algorithm is now obvious: list the elements of $A_\varepsilon^{(n)}$ in some canonical order, and use the index as the codeword. Indices fit in $n(H + \varepsilon) + 1$ bits (the "+1" is for a flag distinguishing "typical, here is the index" from "non-typical, here is the raw sequence"). The total expected codelength is $n H + O(1)$, achieving Shannon's source coding theorem.
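The convergence behind the AEP is easy to watch happen. Sampling a Bernoulli(0.9) source (entropy about 0.469 bits) and computing the per-symbol surprisal of each sampled sequence:

```python
import math
import random

random.seed(0)
p = 0.9
H = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))  # ~0.469 bits

for n in (100, 10_000, 1_000_000):
    xs = (random.random() < p for _ in range(n))
    # -(1/n) log2 p(X_1, ..., X_n): the empirical average of self-information
    rate = -sum(math.log2(p) if x else math.log2(1 - p) for x in xs) / n
    print(n, round(rate, 4))  # marches toward H as n grows
```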
The AEP is also the conceptual prerequisite for the channel coding theorem of the next section — the joint typical set of "transmitted, received" pairs is the object whose size determines how many distinct messages can be reliably decoded.
Reliable communication over a noisy channel is possible at any rate below the channel's capacity, and impossible above it. The capacity has a closed form: the maximum mutual information between input and output.
A discrete memoryless channel is a conditional distribution $p(y \mid x)$ that takes an input symbol $x \in \mathcal{X}$ and produces an output symbol $y \in \mathcal{Y}$ independently each use. The simplest example is the binary symmetric channel (BSC) with crossover probability $f$: the input is a bit, and the channel flips it with probability $f$.
A code for the channel is a mapping from messages $w \in \{1, \ldots, 2^{nR}\}$ to length-$n$ codewords, plus a decoding rule from received sequences to messages. The rate $R$ is bits-per-channel-use; the error probability is the maximum over messages of the probability of incorrect decoding.
The capacity of the channel is
$$ C \;=\; \max_{p(x)} I(X; Y), $$the maximum mutual information between input and output, taken over all input distributions. For the BSC,
$$ C_{\text{BSC}} \;=\; 1 - H(f), $$where $H(f) = -f \log f - (1 - f) \log(1 - f)$ is the binary entropy function. A noiseless BSC ($f = 0$) has capacity 1 bit per use; a maximally noisy one ($f = 0.5$) has capacity 0.
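The capacity formula in code (helper names are ours):

```python
import math

def binary_entropy(f):
    """H(f) in bits, with H(0) = H(1) = 0 by continuity."""
    if f in (0.0, 1.0):
        return 0.0
    return -f * math.log2(f) - (1 - f) * math.log2(1 - f)

def bsc_capacity(f):
    """Capacity of a binary symmetric channel with crossover probability f."""
    return 1 - binary_entropy(f)

print(bsc_capacity(0.0))             # 1.0: noiseless, one full bit per use
print(bsc_capacity(0.5))             # 0.0: output independent of input
print(round(bsc_capacity(0.11), 3))  # ~0.5: at 11% flips, half a bit per use
```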
The proof is the high point of the whole subject. The achievability direction uses a non-constructive argument — averaged over random codes drawn i.i.d. from the capacity-achieving input distribution, the expected error rate goes to zero, so a deterministic code achieving the same rate must exist. The geometric idea: $2^{nR}$ codewords scattered randomly in input space have, with high probability, non-overlapping joint-typical "spheres" around them in output space, so a maximum-likelihood decoder works.
The converse uses Fano's inequality (which lower-bounds the error probability in terms of the conditional entropy of the message given the channel output) to show that any code with rate above capacity has error probability bounded away from zero, no matter how clever the decoder. The two directions together pin capacity to its exact value.
Modern error-correcting codes — Hamming, Reed–Solomon, BCH, convolutional codes, turbo codes, LDPC codes, polar codes — are practical realisations of Shannon's theorem. LDPC and polar codes both come within a fraction of a dB of capacity in modern wireless standards (5G, Wi-Fi 7); for decades after 1948 we knew capacity was achievable in principle but had no idea how to do so efficiently. The story of practical coding is the story of closing that gap.
For continuous channels with additive Gaussian noise, capacity has the most-cited equation in communications: $C = \tfrac{1}{2}\log(1 + \mathrm{SNR})$. It is the reason every modem in history has a top speed.
The additive white Gaussian noise (AWGN) channel takes a real-valued input $X$ with average power constraint $\mathbb{E}[X^2] \leq P$ and outputs $Y = X + Z$ where $Z \sim \mathcal{N}(0, N)$ is independent of $X$.
Under the power constraint, the capacity of the AWGN channel is
$$ C \;=\; \frac{1}{2} \log_2\!\left(1 + \frac{P}{N}\right) \quad \text{bits per channel use}, $$achieved when the input is Gaussian: $X \sim \mathcal{N}(0, P)$. The proof uses two facts from differential entropy (next section): the Gaussian maximises entropy among distributions with a given variance, and the conditional entropy $h(Y \mid X) = h(Z)$ depends only on the noise.
For a band-limited channel of bandwidth $W$ Hz with two samples per Hz (Nyquist), the capacity per second is
$$ C \;=\; W \log_2\!\left(1 + \frac{P}{N_0 W}\right) \quad \text{bits per second}, $$the famous Shannon–Hartley formula. This is the absolute upper bound on the data rate of any digital communication system using a band-limited Gaussian channel — modems, Wi-Fi, cellular, fibre optics, every Earth-to-spacecraft link. Doubling the bandwidth doubles capacity. Doubling the SNR adds only one bit per use.
The Gaussian channel also clarifies the role of coding rate in modern systems. A "1/2 rate" code transmits one bit of information for every two channel uses; this lets you operate well below capacity, paying redundancy in exchange for very low error rates. A "5/6 rate" code transmits closer to capacity but is more fragile. Adaptive modulation and coding schemes — every modern wireless standard has them — pick the highest rate that the current SNR can support, sliding along the Shannon limit as channel conditions change.
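The Shannon–Hartley formula and both scaling behaviours can be sketched directly (a 20 MHz channel at 30 dB SNR is an illustrative choice of ours, not a claim about any particular standard):

```python
import math

def shannon_hartley(bandwidth_hz, snr_linear):
    """Capacity in bits/s of a band-limited AWGN channel: C = W log2(1 + SNR)."""
    return bandwidth_hz * math.log2(1 + snr_linear)

snr = 10 ** (30 / 10)           # 30 dB converts to a linear SNR of 1000
c = shannon_hartley(20e6, snr)
print(f"{c / 1e6:.1f} Mbit/s")  # hard upper bound, ~199.3 Mbit/s

# Doubling bandwidth doubles capacity; doubling SNR adds only ~W bits/s.
assert abs(shannon_hartley(40e6, snr) - 2 * c) < 1e-6
extra = shannon_hartley(20e6, 2 * snr) - c
assert abs(extra - 20e6) / 20e6 < 0.01
```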
Replace sums with integrals and you get differential entropy — almost the same theory, with a few subtleties that catch beginners. KL divergence, mutual information, and cross-entropy all transfer cleanly; entropy itself does not.
For a continuous random variable $X$ with density $f(x)$, the differential entropy is
$$ h(X) \;=\; -\int f(x) \log f(x)\, dx. $$Two warnings up front. First, differential entropy can be negative. A uniform distribution on $[0, 1/2]$ has $h = -1$ bit. Discrete entropy is the average codeword length and so is non-negative; differential entropy is a "density of bits" and has no such guarantee. Second, $h(X)$ is not invariant under change of variables. If $Y = aX$, then $h(Y) = h(X) + \log|a|$ — scaling a variable changes its differential entropy by a Jacobian factor. The discrete entropy of a relabelling is unchanged.
Two key entropies worth memorising:
$$ h(\mathcal{N}(\mu, \sigma^2)) \;=\; \tfrac{1}{2}\log(2\pi e \sigma^2), \qquad h(\text{Uniform}[0, a]) \;=\; \log a. $$The Gaussian formula has a deep generalisation: among all densities with a given variance, the Gaussian has the largest differential entropy. More broadly, the maximum-entropy distribution under moment constraints is always an exponential family: uniform under a support constraint, exponential under a mean constraint on $[0, \infty)$, Gaussian under a variance constraint. This maximum-entropy property is what makes Gaussian inputs capacity-achieving for the AWGN channel of the previous section.
KL divergence transfers from sums to integrals without issue:
$$ D_{\text{KL}}(p \,\|\, q) \;=\; \int p(x) \log \frac{p(x)}{q(x)}\, dx, $$which is well-defined whenever $p$ is absolutely continuous with respect to $q$ (no mass where $q$ has none). It is non-negative, asymmetric, and zero iff $p = q$ — exactly as in the discrete case. Cross-entropy and mutual information transfer the same way:
$$ I(X; Y) \;=\; \int p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)}\, dx\, dy, $$which is non-negative and zero iff $X$ and $Y$ are independent. Differential entropy alone is the awkward member of the family; everything that combines two distributions behaves cleanly.
A useful closed form: the KL divergence between two multivariate Gaussians $\mathcal{N}(\mu_0, \Sigma_0)$ and $\mathcal{N}(\mu_1, \Sigma_1)$ in $d$ dimensions is
$$ D_{\text{KL}} \;=\; \tfrac{1}{2}\!\left[\operatorname{tr}(\Sigma_1^{-1}\Sigma_0) + (\mu_1 - \mu_0)^\top \Sigma_1^{-1}(\mu_1 - \mu_0) - d + \log \frac{\det \Sigma_1}{\det \Sigma_0}\right]. $$This is the formula every variational autoencoder uses to compute its KL regulariser, and the formula every diffusion model uses to compute training losses. Memorise it; it pays for itself within a week of reading any modern deep-learning paper.
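The formula transcribes directly into NumPy (a sketch; `kl_gauss` is our name, and inputs are assumed to be valid mean vectors and positive-definite covariances):

```python
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    """KL( N(mu0, S0) || N(mu1, S1) ) in nats, via the closed form."""
    d = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1_inv @ S0)
                  + diff @ S1_inv @ diff
                  - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

mu, I2 = np.zeros(2), np.eye(2)
assert abs(kl_gauss(mu, I2, mu, I2)) < 1e-12                 # KL(p || p) = 0
assert abs(kl_gauss(mu, I2, np.ones(2), I2) - 1.0) < 1e-9    # pure mean shift: ||diff||^2 / 2
assert kl_gauss(mu, I2, mu, 2 * I2) != kl_gauss(mu, 2 * I2, mu, I2)  # asymmetric
```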
The KL divergence between two infinitesimally separated members of a parametric family is the Fisher information matrix. This single fact links information theory to statistics, and to a Riemannian geometry on the space of distributions.
Consider a parametric family of densities $\{p_\theta(x) : \theta \in \mathbb{R}^d\}$ — the kind of model you fit with maximum likelihood. The Fisher information matrix at $\theta$ is
$$ F(\theta) \;=\; \mathbb{E}_\theta\!\left[\nabla_\theta \log p_\theta(X)\, \nabla_\theta \log p_\theta(X)^\top\right] \;=\; -\mathbb{E}_\theta\!\left[\nabla_\theta^2 \log p_\theta(X)\right]. $$The two expressions are equal under regularity (the score's variance equals minus the expected Hessian — a special case of the "information identity"). $F(\theta)$ is symmetric, positive semi-definite, and measures how sharply the log-likelihood is curved around $\theta$. Sharp curvature means data are highly informative about $\theta$; flat curvature means the parameter is poorly identified.
The deep connection: the KL divergence between $p_\theta$ and $p_{\theta + \delta}$ for small $\delta$ is
$$ D_{\text{KL}}(p_\theta \,\|\, p_{\theta + \delta}) \;=\; \tfrac{1}{2}\, \delta^\top F(\theta)\, \delta + O(\|\delta\|^3). $$KL divergence is locally a quadratic form, with the Fisher information matrix as its metric. This is the starting point of information geometry: the parameter space of a statistical model has a natural Riemannian structure, with $F(\theta)$ as the metric tensor. Geodesics in this geometry are the natural notion of "shortest path between two distributions"; gradients with respect to the Fisher metric — the natural gradient — give parameter updates that don't depend on the parameterisation chosen.
Two small payoffs of the geometric viewpoint. First, the Cramér–Rao lower bound from Chapter 05 reads cleanly: the inverse Fisher matrix is the local covariance of any efficient estimator. Second, the maximum-entropy distributions of Section 12 are exactly the densities sitting at the "centroids" of certain natural sub-manifolds in this geometry — a perspective due to Amari that sits underneath much of modern variational inference.
Almost every loss, regulariser, and architectural choice in modern ML has an information-theoretic ancestor in this chapter. Naming the connections is what turns a list of techniques into a single subject.
A list worth memorising. Cross-entropy loss is maximum-likelihood estimation, which is minimisation of $D_{\text{KL}}(\hat{p}_{\text{data}} \,\|\, q_\theta)$. A language model's per-token loss is one term in the chain-rule decomposition of a sequence's joint entropy, and its perplexity is two to the power of its cross-entropy in bits. The VAE's regulariser is the closed-form Gaussian KL of the differential-entropy section, and diffusion training losses are sums of such terms. Contrastive representation learning (InfoNCE) maximises a variational lower bound on mutual information. Label smoothing is an entropy regulariser on the predictions. Natural-gradient and trust-region methods measure step size in KL, that is, in the Fisher metric. And MDL-style regularisation is the source coding theorem applied to model selection.
The pattern is the same one we saw in earlier chapters. A handful of equations from a 1948 paper, applied at the scale and function-class flexibility of modern computation, generates an entire technical vocabulary that practitioners use without always knowing it. Knowing the older language doesn't slow you down; it lets you stop pattern-matching to specific papers and start thinking in the underlying ideas.
Up next in the compendium: Bayesian reasoning, which turns the KL divergences and likelihood functions of this chapter into a complete inferential framework; and signal processing, which gives the harmonic-analytic toolkit that audio, image, and time-series models rest on. Together these eight chapters are the mathematical engine room of modern machine learning.
Information theory is a small, deep subject — a handful of quantities and a few theorems that connect them — and almost anything worth knowing about it is covered somewhere in the sources below. Read Shannon's 1948 paper once; it is still clearer than most textbooks.
This is Chapter 06 of an eight-chapter tour of the mathematical foundations of AI. Up next: Bayesian reasoning and signal processing — two further lenses on the same machinery.