Part I · Mathematical Foundations · Chapter 01

Linear Algebra, the quiet engine of modern AI.

Nearly every operation in a neural network is a matrix multiplication. Embeddings are vectors; attention is built from dot products; PCA is a singular value decomposition. Linear algebra is not glamorous, but it is the grammar. This chapter teaches it from the ground up — vectors, matrices, transformations, eigenvectors, SVD — with just enough geometric intuition that the ML references in later chapters feel inevitable rather than mysterious.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make the prose precise. If a formula looks scary, look at the picture next to it first.

Notation: scalars are lowercase italic ($x$, $\lambda$); vectors are bold lowercase ($\mathbf{v}$, $\mathbf{x}$); matrices are capitals ($A$, $\Sigma$). $\mathbb{R}^n$ means "the set of lists of $n$ real numbers." Code-like names appear in monospace. Key terms when first defined appear in a slightly different color.

Contents

  1. What is linear algebra about? · Motivation
  2. Vectors · Objects
  3. Vector operations · Addition, scaling, dot product
  4. Norms and distances · Lengths
  5. Linear combinations, span, and basis · Structure
  6. Matrices · Data and transformations
  7. Matrix–vector multiplication · Two views
  8. Matrix–matrix multiplication · Composition
  9. Linear transformations · Geometry
  10. Special matrices · Identity, orthogonal, symmetric
  11. Determinants · Signed volume
  12. Rank, column space, null space · Dimension counting
  13. Eigenvalues and eigenvectors · Invariant directions
  14. Diagonalization · Change of basis
  15. Matrix factorizations · LU, QR, Cholesky
  16. The Singular Value Decomposition · The king
  17. Projections and least squares · Linear regression
  18. Where it shows up in ML · Payoff

Section 01

What is linear algebra about?

The short answer: linear algebra is the mathematics of vectors and matrices, and of the transformations that move them around. The longer answer explains why that is the most useful abstraction in applied mathematics.

A system behaves linearly if scaling an input scales the output by the same factor, and if adding two inputs produces the sum of their outputs. Formally, a function $f$ is linear when

$$ f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha\, f(\mathbf{x}) + \beta\, f(\mathbf{y}) $$

for any inputs $\mathbf{x}, \mathbf{y}$ and scalars $\alpha, \beta$. That single property — preserving addition and scaling — turns out to be enough structure to build decomposition theorems, efficient algorithms, and an entire geometry that computers can reason about at trillions of operations per second.
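In code, the defining property is a one-line check; a minimal NumPy sketch with a random matrix standing in for $f$:

```python
import numpy as np

# Any matrix A defines a function f(x) = A @ x. Check the linearity
# property f(a*x + b*y) == a*f(x) + b*f(y) on random inputs.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x, y = rng.standard_normal(4), rng.standard_normal(4)
alpha, beta = 2.5, -1.3

lhs = A @ (alpha * x + beta * y)
rhs = alpha * (A @ x) + beta * (A @ y)
assert np.allclose(lhs, rhs)
```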

Most of modern machine learning sits on top of this linearity. A dense layer of a neural network is a matrix multiplication. Word embeddings are vectors in $\mathbb{R}^d$. Attention is a dot product. Principal component analysis is a singular value decomposition. Even the non-linear pieces — activations, softmaxes, attention masks — live in between linear operations, which do the actual heavy lifting of mixing information across features and examples.

Key idea

Linear algebra gives us two things: a language for describing high-dimensional data (vectors, matrices, subspaces) and an algebra for transforming it efficiently. Both matter. The language lets you think; the algebra lets your laptop finish in an afternoon.

The plan for the rest of this chapter: we build up from the smallest objects (vectors) to the operations on them (addition, dot products, norms), introduce matrices as both data and as functions, study how matrices stretch and rotate space, and finish with the two results you cannot do serious ML without — the eigendecomposition and the singular value decomposition.

Section 02

Vectors

A vector is the simplest object in linear algebra — just a list of numbers. But the power of the concept comes from holding three different pictures in mind at once.

Three views of the same thing

A vector in $\mathbb{R}^n$ is formally an ordered tuple of $n$ real numbers,

$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n. $$

You will see this same object wearing three different hats. The list view treats it as a column of numbers — how it lives inside a computer. The geometric view treats it as an arrow from the origin, which is how you get intuition about length, angle, and direction. The point view treats it as a location in space — which is how you reason about data, where each training example is a point in $\mathbb{R}^n$.

All three views are always correct; the trick is to pick the right one for the question you are asking.

A vector in $\mathbb{R}^2$ as an arrow from the origin, with components along each coordinate axis.

The data-science interpretation

When a data scientist says "a training example is a vector," they mean: every observation is represented as a fixed-length list of numerical features, and every such list is a point in a common space $\mathbb{R}^d$. A house might be represented by its square footage, its number of bedrooms, and its year built; that makes it a point in $\mathbb{R}^3$. A $256 \times 256$ grayscale image flattens to a point in $\mathbb{R}^{65{,}536}$. A word embedding typically lives in $\mathbb{R}^{768}$ or $\mathbb{R}^{4096}$. The dimension of the vector is the number of features; the space it lives in is called an ambient space or feature space.
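A small NumPy sketch of the flattening step (the image is all zeros and the house numbers are made up, purely for illustration):

```python
import numpy as np

# A 256x256 grayscale image is just a grid of pixel intensities;
# flattening it row by row gives one point in R^65536.
image = np.zeros((256, 256))
v = image.reshape(-1)
assert v.shape == (65536,)

# A toy "house" feature vector: [square footage, bedrooms, year built].
house = np.array([1450.0, 3.0, 1987.0])   # a point in R^3
assert house.shape == (3,)
```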

Intuition

Anything you want a model to reason about — a user, a document, a protein, a trajectory — first gets turned into a vector. Almost every algorithm in this compendium is a story about how to cook up that vector, how to compare it to others, and how to transform it into a prediction.

Section 03

Vector operations

There are only three operations you absolutely need: add two vectors, scale one by a scalar, and take the inner (dot) product of two. Everything else — angles, similarities, projections, norms — comes from these.

Addition and scalar multiplication

Addition is coordinate-wise,

$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}, $$

and geometrically it puts the arrows head to tail: you walk along $\mathbf{u}$, then walk along $\mathbf{v}$, and the sum points from the origin to where you end up. Equivalently, the sum is the diagonal of the parallelogram formed by the two vectors.

Scalar multiplication stretches a vector (and reverses it if the scalar is negative),

$$ \alpha \mathbf{v} = \begin{bmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{bmatrix}. $$

Vector addition by the parallelogram rule: walk along $\mathbf{u}$ then $\mathbf{v}$, or along $\mathbf{v}$ then $\mathbf{u}$; you end up at the same point, $\mathbf{u}+\mathbf{v}$.

The dot product

The dot product (also called the inner product) combines two vectors of the same size into a single number:

$$ \mathbf{u} \cdot \mathbf{v} \;=\; \mathbf{u}^\top \mathbf{v} \;=\; \sum_{i=1}^{n} u_i v_i. $$

The geometric identity that makes this operation genuinely useful is

$$ \mathbf{u} \cdot \mathbf{v} \;=\; \|\mathbf{u}\|\, \|\mathbf{v}\|\, \cos\theta, $$

where $\theta$ is the angle between the two vectors. That one line unlocks almost every similarity-based technique in ML. When $\mathbf{u}$ and $\mathbf{v}$ point in the same direction, the dot product is large and positive. When they are perpendicular, it is zero. When they point in opposite directions, it is negative. Normalizing out the lengths gives you the cosine similarity, which is exactly how search engines rank documents and how vector databases match embeddings.

The dot product $\mathbf{u}\cdot\mathbf{v} = \|\mathbf{u}\|\|\mathbf{v}\|\cos\theta$ equals the length of $\mathbf{u}$ times the signed length of $\mathbf{v}$'s projection onto $\mathbf{u}$.

Why it matters

Whenever you see an "alignment" or "similarity" score in a paper — whether it is query-key attention in a transformer, or a contrastive-learning score between an image and its caption — a dot product is doing the work. The softmax that usually follows it just turns the raw scores into probabilities.
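The definitions above fit in a few lines of plain Python — a from-scratch sketch of the dot product and the cosine similarity built on top of it:

```python
import math

# Dot product and cosine similarity from scratch, no libraries needed.
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def cosine_similarity(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u, so similarity is 1
w = [-2.0, 1.0, 0.0]   # perpendicular to u, so dot product is 0

assert abs(cosine_similarity(u, v) - 1.0) < 1e-12
assert dot(u, w) == 0.0
```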

Section 04

Norms and distances

A norm measures the size of a vector. Different norms measure size differently, which matters for regularization, optimization, and the geometry of nearest-neighbor search.

The Euclidean norm, or $L^2$ norm, is the familiar straight-line length,

$$ \|\mathbf{v}\|_2 \;=\; \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} \;=\; \sqrt{\mathbf{v}^\top \mathbf{v}}. $$

The $L^1$ norm (taxicab norm) sums absolute values,

$$ \|\mathbf{v}\|_1 \;=\; |v_1| + |v_2| + \cdots + |v_n|. $$

The $L^\infty$ norm picks the largest coordinate in absolute value,

$$ \|\mathbf{v}\|_\infty \;=\; \max_i |v_i|. $$

They are all special cases of the $L^p$ norm $\|\mathbf{v}\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$. The unit ball — the set of vectors with $\|\mathbf{v}\| \le 1$ — tells you what the norm "thinks" a unit is.

Unit balls in $\mathbb{R}^2$ for the $L^1$, $L^2$, and $L^\infty$ norms. The $L^2$ ball is round because the Euclidean norm is the only one that is rotation-invariant.

Why the sharp corners of the L¹ ball matter

Regularization with an $L^2$ penalty (ridge regression, weight decay) shrinks all coefficients toward zero smoothly. Regularization with an $L^1$ penalty (lasso) tends to set coefficients exactly to zero — the optimum often lands on a corner of the $L^1$ ball, where several coordinates are already zero. That is how lasso produces sparse models, and it's entirely a consequence of the shape of the unit ball.

The distance between two vectors is the norm of their difference: $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$. Choosing a norm chooses a notion of distance, and the choice quietly shapes everything from clustering to nearest-neighbor classifiers.
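A quick NumPy check of the three norms on one vector, plus a distance as the norm of a difference:

```python
import numpy as np

v = np.array([3.0, -4.0])
l1   = np.linalg.norm(v, 1)        # |3| + |-4| = 7
l2   = np.linalg.norm(v)           # sqrt(3^2 + 4^2) = 5 (default is L2)
linf = np.linalg.norm(v, np.inf)   # max(|3|, |-4|) = 4
assert (l1, l2, linf) == (7.0, 5.0, 4.0)

# A distance is just the norm of a difference.
u = np.array([0.0, 0.0])
assert np.linalg.norm(u - v) == 5.0
```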

Section 05

Linear combinations, span, and basis

Once you can add and scale, you can describe whole regions of space with a handful of vectors. That describability is what lets us compress a billion-dimensional data space down to something our laptops — and our intuitions — can handle.

Linear combinations and span

A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is any vector of the form

$$ \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k $$

with real coefficients $\alpha_i$. The span of a set of vectors is the collection of every linear combination you can build from them. Two non-parallel vectors in $\mathbb{R}^2$ span the whole plane. Three vectors lying on a common plane in $\mathbb{R}^3$ span only that plane — no matter how many of them you take.

Linear independence

A set of vectors is linearly independent if none of them can be written as a linear combination of the others. Equivalently, the only way to get $\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0}$ is with all $\alpha_i = 0$. If a vector is dependent on the others, throwing it out doesn't shrink the span — it's redundant.

Basis and dimension

A basis for a space is a set of linearly independent vectors that span it — a minimal spanning set. The most common basis for $\mathbb{R}^n$ is the standard basis $\mathbf{e}_1, \ldots, \mathbf{e}_n$, where $\mathbf{e}_i$ has a 1 in the $i$-th slot and zeros elsewhere. The dimension of a space is the number of vectors in any basis (this number is the same for any basis — a nontrivial theorem).

Any vector in $\mathbb{R}^2$ is a unique linear combination of the standard basis. Here $\mathbf{w} = 3\mathbf{e}_1 + 2\mathbf{e}_2$.

Key idea

A vector is the same object regardless of basis; its coordinates depend on which basis you're using. Changing basis is just re-expressing the same vector in a different coordinate system. A huge portion of linear algebra — eigendecompositions, SVD, PCA — amounts to finding a smarter basis in which the problem becomes easy.
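Independence can be tested mechanically: stack the vectors as columns and ask for the rank. A NumPy sketch, with the vectors chosen so one is redundant:

```python
import numpy as np

# v3 = v1 + v2, so the three vectors span only a 2-dimensional plane.
v1 = np.array([1.0, 0.0, 1.0])
v2 = np.array([0.0, 1.0, 1.0])
v3 = np.array([1.0, 1.0, 2.0])

M = np.column_stack([v1, v2, v3])
assert np.linalg.matrix_rank(M) == 2   # only 2 independent directions
```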

Section 06

Matrices

A matrix is a rectangular grid of numbers. That is the bookkeeping view. The more useful view — the one you should keep returning to — is that a matrix is a function from one vector space to another.

An $m \times n$ matrix $A$ has $m$ rows and $n$ columns,

$$ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. $$

You will see two mental pictures used more or less interchangeably. The data view treats a matrix as a way to stack information — rows might be examples and columns features, or the other way around. The function view treats a matrix $A \in \mathbb{R}^{m \times n}$ as a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$: it eats an $n$-dimensional vector and spits out an $m$-dimensional vector. Both pictures are always valid; experienced practitioners switch between them sentence by sentence.

The transpose

The transpose $A^\top$ swaps rows and columns. If $A$ is $m \times n$, then $A^\top$ is $n \times m$, with entries $(A^\top)_{ij} = a_{ji}$. A matrix equal to its own transpose, $A = A^\top$, is called symmetric — these appear everywhere (covariance matrices, kernels, Gram matrices) and have especially clean properties.

Matrices as a stack of rows or a set of columns

Sometimes it is useful to write $A$ in row-stacked form or column-stacked form:

$$ A = \begin{bmatrix} \mathbf{r}_1^\top \\ \mathbf{r}_2^\top \\ \vdots \\ \mathbf{r}_m^\top \end{bmatrix} \qquad\text{or}\qquad A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \end{bmatrix}. $$

The row view makes matrix–vector multiplication look like a stack of dot products; the column view makes it look like a weighted sum of columns. Those are the two views of Section 07, and either one can be the right one depending on what you're trying to see.

Section 07

Matrix–vector multiplication

The single most important operation in numerical linear algebra — and, by extension, in deep learning. It deserves two complementary interpretations.

For $A \in \mathbb{R}^{m\times n}$ and $\mathbf{x} \in \mathbb{R}^n$, the product $A\mathbf{x}$ is a vector in $\mathbb{R}^m$ defined by

$$ (A\mathbf{x})_i \;=\; \sum_{j=1}^{n} a_{ij}\, x_j, \qquad i = 1, \ldots, m. $$

View 1: a stack of dot products (the row view)

Read $A$ as rows. The $i$-th entry of $A\mathbf{x}$ is the dot product of row $\mathbf{r}_i$ with $\mathbf{x}$. This view makes $A\mathbf{x}$ feel like evaluating $m$ different linear measurements of the same input vector.

$$ A\mathbf{x} = \begin{bmatrix} \mathbf{r}_1^\top \mathbf{x} \\ \mathbf{r}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{r}_m^\top \mathbf{x} \end{bmatrix}. $$

View 2: a weighted sum of columns (the column view)

Read $A$ as columns. The product $A\mathbf{x}$ is a linear combination of the columns of $A$, using the entries of $\mathbf{x}$ as weights,

$$ A\mathbf{x} = x_1 \mathbf{c}_1 + x_2 \mathbf{c}_2 + \cdots + x_n \mathbf{c}_n. $$

This view is the geometric one. It says the image of $A$ — everything $A$ can produce — lives in the column space, the span of the columns. If no weighting of the columns adds up to the target $\mathbf{b}$, the equation $A\mathbf{x} = \mathbf{b}$ simply has no solution.

Key idea

The row view asks "what measurements does this matrix make on my input?" The column view asks "what outputs can this matrix possibly produce?" Most confusions in a linear algebra argument come from sticking to the wrong view — switching sometimes makes the same calculation look obvious.
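Both views compute the same thing, which is easy to verify in NumPy:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
x = np.array([10.0, 1.0])

# Row view: each output entry is a dot product of a row with x.
row_view = np.array([A[i] @ x for i in range(A.shape[0])])

# Column view: the output is a weighted sum of the columns of A.
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]

assert np.allclose(A @ x, row_view)
assert np.allclose(A @ x, col_view)
```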

Section 08

Matrix–matrix multiplication

Multiplying two matrices composes two linear functions. Get that one idea and the mechanics (which row hits which column) become bookkeeping.

If $A \in \mathbb{R}^{m\times k}$ and $B \in \mathbb{R}^{k\times n}$, the product $C = AB \in \mathbb{R}^{m\times n}$ is defined entrywise as

$$ c_{ij} \;=\; \sum_{\ell=1}^{k} a_{i\ell}\, b_{\ell j}. $$

The shapes have to match: the inner dimensions ($k$ and $k$) must agree; the outer dimensions ($m$ and $n$) become the shape of the product. The clean way to remember it: (m, k) × (k, n) → (m, n).

Composition of transformations

If $B$ is the linear map $\mathbb{R}^n \to \mathbb{R}^k$ and $A$ is the linear map $\mathbb{R}^k \to \mathbb{R}^m$, then the matrix $AB$ is the map $\mathbb{R}^n \to \mathbb{R}^m$ you get by applying $B$ first and $A$ second. Formally, $(AB)\mathbf{x} = A(B\mathbf{x})$. This is why matrix multiplication shows up in every neural network: stacking two linear layers is, literally, multiplying their weight matrices.
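The composition reading is directly checkable — a NumPy sketch with random matrices of compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 5))   # a map R^5 -> R^4
A = rng.standard_normal((3, 4))   # a map R^4 -> R^3
x = rng.standard_normal(5)

# (AB)x and A(Bx) are the same map applied two ways.
assert np.allclose((A @ B) @ x, A @ (B @ x))
assert (A @ B).shape == (3, 5)    # (m, k) x (k, n) -> (m, n)
```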

Properties (and one non-property)

Matrix multiplication is associative, $(AB)C = A(BC)$, and distributes over addition, $A(B + C) = AB + AC$. The identity matrix leaves products alone: $AI = IA = A$. The one non-property: multiplication is not commutative — in general $AB \neq BA$, and one of the two products may not even be defined.

Heads up

Matrix multiplication is the one operation most beginners memorize mechanically without reaching for the composition view. Every time you see $ABC\mathbf{x}$, re-read it as "transform $\mathbf{x}$ by $C$, then by $B$, then by $A$" — right to left. That reading makes the identities above obvious instead of tedious.

Section 09

Linear transformations

A matrix is a linear transformation. Seeing what transformations look like — rotations, scalings, shears, projections — turns matrix algebra from symbol-shuffling into geometry.

The single most useful trick in this entire chapter: a linear transformation is completely determined by where it sends the basis vectors. If $A\mathbf{e}_1 = \mathbf{c}_1$ and $A\mathbf{e}_2 = \mathbf{c}_2$, then $A$'s matrix representation is just those images stacked as columns,

$$ A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 \end{bmatrix}. $$

Read the other way: the columns of $A$ are the images of the basis vectors. So to know what $A$ does to the whole plane, it suffices to know what it does to $\mathbf{e}_1$ and $\mathbf{e}_2$, and everything else follows by linearity.

A linear transformation sends the unit square to the parallelogram spanned by the columns of $A$. The basis vectors $\mathbf{e}_1$ and $\mathbf{e}_2$ are mapped to $\mathbf{c}_1$ and $\mathbf{c}_2$ — the columns of $A$.

A gallery of common transformations

Every one of these is a matrix, and every one preserves the origin (that's the "linear" bit):

$$ \text{scaling: } \begin{bmatrix} s_1 & 0 \\ 0 & s_2 \end{bmatrix}, \qquad \text{rotation: } \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}, \qquad \text{shear: } \begin{bmatrix} 1 & k \\ 0 & 1 \end{bmatrix}, \qquad \text{projection: } \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix}. $$

Every more exotic transformation (change of basis, orthogonal transformations, low-rank approximations) is built out of these pieces. When you run a neural network forward pass, each weight matrix is silently performing some mix of scaling, rotation, and shear — usually in very high dimensions.
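A sketch of two such transformations in NumPy, confirming that the columns of each matrix are exactly the images of the basis vectors:

```python
import numpy as np

theta = np.pi / 2
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # rotation by 90 degrees
S = np.array([[1.0, 1.0],
              [0.0, 1.0]])                        # shear along x

e1, e2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])

# The columns of a matrix are the images of the basis vectors.
assert np.allclose(R @ e1, R[:, 0])
assert np.allclose(R @ e1, [0.0, 1.0])   # e1 rotated onto e2
assert np.allclose(S @ e2, [1.0, 1.0])   # the shear tilts e2 sideways
```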

Section 10

Special matrices

A handful of matrix types show up again and again. Each has structure that makes computation easier and meaning clearer.

The identity

The identity matrix $I_n$ is the matrix that does nothing: $I\mathbf{v} = \mathbf{v}$ for every $\mathbf{v}$. It has ones on the diagonal and zeros everywhere else. It plays the role of $1$ in the world of matrices.

$$ I_3 = \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix} $$

Diagonal and triangular matrices

A diagonal matrix has zeros off the diagonal. It scales each coordinate independently — the cleanest transformation there is. An upper-triangular matrix has zeros below the diagonal; a lower-triangular matrix has zeros above. Triangular systems are trivial to solve by back-substitution, which is why most decomposition algorithms try to produce them.

Symmetric matrices

A square matrix $A$ is symmetric if $A^\top = A$ — the entry at $(i, j)$ equals the entry at $(j, i)$. Covariance matrices, Gram matrices ($A^\top A$), and Hessians of smooth functions are all symmetric. Symmetric matrices have real eigenvalues and orthogonal eigenvectors — we'll see why in Sections 13 and 14.

Orthogonal matrices

A real square matrix $Q$ is orthogonal if its columns (equivalently, its rows) form an orthonormal set: each has unit length and any two distinct columns are perpendicular. Equivalently,

$$ Q^\top Q = Q Q^\top = I, \qquad \text{so} \qquad Q^{-1} = Q^\top. $$

Geometrically, orthogonal matrices are the rigid motions of space — rotations and reflections. They preserve lengths, angles, and volumes. They're numerically wonderful because they never amplify errors.
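A numerical sanity check on one concrete orthogonal matrix, a rotation:

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

assert np.allclose(Q.T @ Q, np.eye(2))          # Q^T Q = I
assert np.allclose(np.linalg.inv(Q), Q.T)       # inverse is the transpose

v = np.array([3.0, 4.0])
assert np.isclose(np.linalg.norm(Q @ v), 5.0)   # lengths are preserved
```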

Positive-definite matrices

A symmetric matrix $A$ is positive definite if $x^\top A x > 0$ for every nonzero $x$. Positive-definite matrices describe quadratic forms that open upward in every direction — think bowl-shaped loss surfaces. Covariance matrices are positive semi-definite; Hessians at a strict local minimum are positive definite.

Why this matters. The adjectives in "symmetric positive-definite covariance matrix" each unlock an algorithm: symmetric $\Rightarrow$ Cholesky factorization is fastest; positive-definite $\Rightarrow$ every eigenvalue is positive, so the matrix has a unique square root.

Section 11

Determinants

The determinant is a single number that captures how a linear transformation scales — and flips — space.

The geometric definition

For a $2 \times 2$ matrix $A$, $\det(A)$ is the signed area of the parallelogram spanned by the columns of $A$. For a $3 \times 3$ matrix, it's the signed volume of the parallelepiped spanned by the columns. In general, $|\det(A)|$ is the factor by which $A$ rescales $n$-dimensional volume.

For $A = \begin{bsmallmatrix}3 & 1\\ 1 & 2\end{bsmallmatrix}$, the columns span a parallelogram whose (signed) area is $\det(A) = 3\cdot 2 - 1\cdot 1 = 5$.

The $2\times 2$ formula and general case

$$ \det\!\begin{bmatrix}a & b\\ c & d\end{bmatrix} = ad - bc. $$

In general, the determinant can be computed by cofactor expansion (expensive, $O(n!)$) or — much faster — by reducing $A$ to triangular form and multiplying the diagonal ($O(n^3)$).
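The figure's example, checked numerically, along with the volume-scaling interpretation:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
assert np.isclose(np.linalg.det(A), 3 * 2 - 1 * 1)   # ad - bc = 5

# |det| is the volume-scaling factor: doubling both axes scales area by 4.
assert np.isclose(np.linalg.det(2 * np.eye(2)), 4.0)
```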

What it tells you

$\det(A) \neq 0$ means $A$ is invertible; $\det(A) = 0$ means $A$ collapses some direction to zero and is singular. A negative determinant means the transformation flips orientation (a reflection is involved). The determinant is multiplicative, $\det(AB) = \det(A)\det(B)$, and equals the product of the eigenvalues.

Numerical caution. Although $\det(A) = 0$ means singular in exact arithmetic, in floating point the determinant can be astronomically small or large even for well-behaved matrices. To check whether a matrix is close to singular, use the condition number or the smallest singular value — not the determinant.

Section 12

Rank, column space, and null space

A matrix has an input side and an output side. Each has a natural subspace, and their dimensions are linked by one of the most elegant theorems in mathematics.

Column space and rank

The column space of an $m \times n$ matrix $A$, written $\mathrm{Col}(A)$, is the span of its columns — it lives in $\mathbb{R}^m$ and is exactly the set of outputs $Av$ as $v$ ranges over $\mathbb{R}^n$. In other words, $\mathrm{Col}(A)$ is the image or range of the linear map.

The rank of $A$ is the dimension of $\mathrm{Col}(A)$ — the number of linearly independent columns, or equivalently the number of linearly independent rows.

$$ \mathrm{rank}(A) \;\le\; \min(m, n). $$

A matrix is full rank if it achieves this bound. "Low-rank" matrices — ones whose rank is much smaller than their dimensions — are compressible: they can be written as a product $A = UV^\top$ with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$. This is the whole idea behind LoRA and model compression.

Null space

The null space (or kernel) of $A$ is

$$ \mathrm{Null}(A) = \{\, v \in \mathbb{R}^n \;:\; Av = 0 \,\}. $$

It's the set of inputs that $A$ crushes to zero. The null space captures the ambiguity of $A$ as a function: if $Av_1 = Av_2$, then $v_1 - v_2 \in \mathrm{Null}(A)$.

The Rank–Nullity Theorem

Rank–Nullity. For any $m \times n$ matrix $A$, $$ \mathrm{rank}(A) \;+\; \dim \mathrm{Null}(A) \;=\; n. $$ Every input dimension is either used (part of the rank) or collapsed (part of the null space). Conservation of information, made precise.
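Rank–Nullity, checked on a small matrix chosen so the null space is one-dimensional:

```python
import numpy as np

A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])   # 2x3, rank 2

rank = np.linalg.matrix_rank(A)
nullity = A.shape[1] - rank        # rank + nullity = n
assert rank == 2 and nullity == 1

# The vector (1, 1, -1) spans the null space: A crushes it to zero.
assert np.allclose(A @ np.array([1.0, 1.0, -1.0]), 0)
```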

The four fundamental subspaces

Every matrix $A \in \mathbb{R}^{m\times n}$ has four associated subspaces: the column space $\mathrm{Col}(A) \subseteq \mathbb{R}^m$, the null space $\mathrm{Null}(A) \subseteq \mathbb{R}^n$, the row space $\mathrm{Col}(A^\top) \subseteq \mathbb{R}^n$, and the left null space $\mathrm{Null}(A^\top) \subseteq \mathbb{R}^m$.

The row space and null space are orthogonal complements in $\mathbb{R}^n$, and so are the column space and left null space in $\mathbb{R}^m$. Gilbert Strang draws these as two pairs of perpendicular planes in his famous "Big Picture" diagram — it's worth looking up.

Section 13

Eigenvalues and eigenvectors

Most vectors rotate when you apply a matrix. A precious few only stretch. Those are the eigenvectors — and they reveal the matrix's hidden skeleton.

Definition

An eigenvector of a square matrix $A$ is a nonzero vector $v$ that $A$ sends to a scalar multiple of itself. The scalar is the eigenvalue $\lambda$:

$$ A v = \lambda v, \qquad v \neq 0. $$

Along an eigenvector, $A$ acts as pure scaling by $\lambda$. All the twisting and shearing that $A$ does to other vectors simply doesn't happen along these special directions.

The eigenvector $v$ lies along a direction that $A$ only rescales (here by $\lambda = 2$). An ordinary vector $u$ gets both rotated and rescaled.

Finding them

Rearranging $Av = \lambda v$ gives $(A - \lambda I)v = 0$. For a nonzero $v$ to satisfy this, $A - \lambda I$ must be singular — its determinant must vanish. The characteristic polynomial

$$ p(\lambda) = \det(A - \lambda I) $$

has the eigenvalues as its roots. For each eigenvalue, the corresponding eigenvectors are the nonzero elements of $\mathrm{Null}(A - \lambda I)$.
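A NumPy sketch that finds eigenpairs and verifies $Av = \lambda v$ for each:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs is an eigenvector: A v = lambda v.
for lam, v in zip(eigvals, eigvecs.T):
    assert np.allclose(A @ v, lam * v)

# For this symmetric A the eigenvalues are 1 and 3.
assert np.allclose(sorted(eigvals), [1.0, 3.0])
```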

The big theorems

Two facts are worth memorizing. First, an $n \times n$ matrix has $n$ eigenvalues when counted with multiplicity over the complex numbers; their sum is the trace of $A$ and their product is $\det(A)$. Second, the spectral theorem: a real symmetric matrix has real eigenvalues and an orthonormal basis of eigenvectors — the cleanest possible eigenstructure, and the reason covariance matrices are so pleasant to work with.

Why AI people care

Eigenvalues tell you about stability, dominant directions, and long-term behavior. The largest eigenvalue of a transition matrix governs how fast a Markov chain mixes. The leading eigenvector of a graph's adjacency matrix (or its normalized variant) ranks the most "central" nodes — Google's PageRank is an eigenvector computation. The eigenvalues of the Hessian near a minimum tell you the curvature of the loss surface in each direction, which is why condition number (ratio of largest to smallest eigenvalue) shows up whenever people talk about why optimization is hard.

Section 14

Diagonalization

If you can find enough eigenvectors, you can rewrite a matrix in the simplest possible form: a pure scaling along a tilted set of axes.

The factorization

Suppose $A$ is $n \times n$ and has $n$ linearly independent eigenvectors $v_1, \ldots, v_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$. Stack the eigenvectors as columns of a matrix $V$ and the eigenvalues on the diagonal of $\Lambda$:

$$ A = V \Lambda V^{-1}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n). $$

Read the identity right to left: $V^{-1}$ expresses any input in the eigenbasis, $\Lambda$ scales each component by its eigenvalue, and $V$ translates back to the original coordinates. In eigen-coordinates, $A$ is just independent scalings on each axis.

Powers and functions of matrices

Diagonalization makes $A^k$ almost free:

$$ A^k = V \Lambda^k V^{-1}, \qquad \Lambda^k = \mathrm{diag}(\lambda_1^k, \ldots, \lambda_n^k). $$

The same trick extends to any analytic function of a matrix: $\exp(A) = V \exp(\Lambda) V^{-1}$, and so on. This is how you solve linear differential equations by hand and why "iterating a linear system" boils down to studying the eigenvalues.
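A sketch of the matrix-power trick, cross-checked against direct repeated multiplication:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])
eigvals, V = np.linalg.eig(A)

# A^5 computed in the eigenbasis: V Lambda^5 V^{-1}.
A5 = V @ np.diag(eigvals ** 5) @ np.linalg.inv(V)
assert np.allclose(A5, np.linalg.matrix_power(A, 5))
```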

When diagonalization works

Not every square matrix is diagonalizable: you need $n$ linearly independent eigenvectors, and "defective" matrices (a shear is the classic example) don't have them. Symmetric matrices always do — by the spectral theorem they diagonalize with an orthogonal eigenvector matrix, $A = Q \Lambda Q^\top$.

Change of basis, the true meaning. Every diagonalization is a statement that the matrix is secretly simple — boring scaling — once you pick the right coordinates. A huge chunk of applied mathematics is about finding the right coordinates.
Section 15

Matrix factorizations: LU, QR, Cholesky

Numerical linear algebra is, to a first approximation, the art of factoring a matrix into pieces you can compute with.

LU: Gaussian elimination, cleaned up

Any (sufficiently nice) square matrix $A$ can be written as

$$ A = LU $$

with $L$ lower-triangular (ones on the diagonal) and $U$ upper-triangular. This is exactly what Gaussian elimination produces. Once you have $L$ and $U$, solving $Ax = b$ takes two triangular solves — forward substitution for $Ly = b$, then back-substitution for $Ux = y$. In practice algorithms use a permutation for stability: $PA = LU$.

QR: orthogonalizing the columns

Any $m \times n$ matrix $A$ with independent columns ($m \ge n$) can be written as

$$ A = QR $$

with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper-triangular. Geometrically, $Q$ carries the Gram–Schmidt-orthogonalized columns; $R$ records the coefficients that rebuild the originals from them. QR is the workhorse of least squares (we'll use it in Section 17) and the starting point for the most common eigenvalue algorithm, QR iteration.

Cholesky: the symmetric positive-definite speedup

If $A$ is symmetric positive-definite, it factors as

$$ A = L L^\top $$

with $L$ lower-triangular (positive diagonal). Cholesky is about twice as fast as LU and more numerically stable. It's the reason covariance-heavy workloads (Gaussian processes, Kalman filters, portfolio optimization) scale as well as they do.
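A minimal Cholesky solve, sketched in NumPy (np.linalg.solve stands in for the dedicated triangular solvers a production code would use):

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])      # symmetric positive-definite
b = np.array([1.0, 2.0])

L = np.linalg.cholesky(A)       # A = L L^T, with L lower-triangular
y = np.linalg.solve(L, b)       # forward substitution: L y = b
x = np.linalg.solve(L.T, y)     # back substitution:    L^T x = y

assert np.allclose(L @ L.T, A)  # the factorization reproduces A
assert np.allclose(A @ x, b)    # and the two triangular solves solve Ax = b
```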

Rule of thumb. Square-and-generic? LU. Tall and thin, or need orthogonality? QR. Symmetric positive-definite? Cholesky. Everything else? SVD, next.

Section 16

The Singular Value Decomposition

If diagonalization is the crown jewel of linear algebra, the SVD is the crown. It works for every matrix, rectangular or square, rank-deficient or not, and says: every linear map is a rotate, scale, rotate.

The theorem

Every real matrix $A \in \mathbb{R}^{m \times n}$ can be factored as

$$ A = U \Sigma V^\top $$

where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ called the singular values of $A$.

The geometric picture

Every matrix sends the unit ball to an ellipsoid. The SVD decomposes that action into three simple steps: rotate ($V^\top$), scale along the axes ($\Sigma$), rotate again ($U$). The singular values are the lengths of the ellipsoid's semi-axes.

What the pieces mean

The columns of $U$ (the left singular vectors) are an orthonormal basis for the output side; the columns of $V$ (the right singular vectors) are an orthonormal basis for the input side; the singular values $\sigma_i$ measure how much $A$ stretches each input direction onto the corresponding output direction. The number of nonzero singular values is the rank of $A$.

Low-rank approximation (the Eckart–Young theorem)

Truncating the SVD — keeping only the $k$ largest singular values — gives the best rank-$k$ approximation of $A$ in both the Frobenius and spectral norms:

$$ A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top. $$

One decomposition, many applications. Principal Component Analysis is the SVD of a centered data matrix. Latent Semantic Analysis is the SVD of a term-document matrix. Netflix-style collaborative filtering is a truncated SVD. LoRA fine-tuning exploits the same low-rank idea without ever computing the full SVD. Pseudo-inverses — the Moore–Penrose inverse $A^+ = V \Sigma^+ U^\top$ — are built straight from the SVD and underlie every least-squares solver.
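Eckart–Young is easy to watch in action — a NumPy sketch on a random matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 5))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-2 approximation

# The spectral-norm error equals the first dropped singular value.
err = np.linalg.norm(A - A_k, 2)
assert np.isclose(err, s[k])
assert np.linalg.matrix_rank(A_k) == k
```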

Section 17

Projections and least squares

Most real-world linear systems have no exact solution — there are more equations than unknowns, and the data is noisy. Least squares is the answer to "what's the closest thing to a solution?"

Projections

Given a subspace $W \subseteq \mathbb{R}^n$ and a point $b \in \mathbb{R}^n$, the orthogonal projection of $b$ onto $W$ is the point in $W$ closest to $b$ — the foot of the perpendicular dropped from $b$ to $W$. The residual $b - \mathrm{proj}_W b$ is orthogonal to every vector in $W$.

If the columns of $A$ are a basis for $W$, the projection matrix onto $W$ is

$$ P = A(A^\top A)^{-1} A^\top. $$

$P$ is symmetric, $P^2 = P$, and $P b \in W$ is the closest point in $W$ to $b$.

The least-squares problem

We're given a tall matrix $A \in \mathbb{R}^{m \times n}$ (more rows than columns — more equations than unknowns) and a target $b \in \mathbb{R}^m$. There's usually no $x$ with $Ax = b$ exactly. Least squares asks for the $x$ that minimizes the residual:

$$ \hat{x} = \arg\min_{x} \; \| A x - b \|_2^2. $$
Least squares finds the point $A\hat{x}$ in the column space of $A$ that is closest to $b$. The residual $b - A\hat{x}$ is orthogonal to $\mathrm{Col}(A)$.

The normal equations

The minimizer $\hat{x}$ satisfies $A\hat{x} = Pb$ — i.e. $A\hat{x}$ is the projection of $b$ onto the column space. Writing this using orthogonality of the residual gives the normal equations:

$$ A^\top A\, \hat{x} = A^\top b. $$

When $A$ has independent columns, $A^\top A$ is invertible and

$$ \hat{x} = (A^\top A)^{-1} A^\top b. $$

Solving it in practice

Don't actually invert $A^\top A$ — it squares the condition number, and the normal equations can be numerically awful. Use a QR factorization instead: with $A = QR$, the least-squares solution is the triangular solve $R\hat{x} = Q^\top b$. For rank-deficient or badly conditioned problems, fall back to the SVD, which is what library routines like numpy.linalg.lstsq do.

Linear regression in one line. Ordinary least squares is this problem. Ridge regression adds a penalty $\lambda \|x\|_2^2$ and solves $(A^\top A + \lambda I) \hat{x} = A^\top b$ — a linear system that's always invertible. LASSO swaps the $\ell_2$ penalty for $\ell_1$ and stops being linear algebra alone, but you can still spot the QR of $A$ lurking inside its solvers.
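A least-squares sketch on made-up data, solving via QR and cross-checking against numpy.linalg.lstsq:

```python
import numpy as np

# Fit a line y = c0 + c1 * t; the points here lie exactly on y = 1 + 2t.
t = np.array([0.0, 1.0, 2.0, 3.0])
b = np.array([1.0, 3.0, 5.0, 7.0])
A = np.column_stack([np.ones_like(t), t])   # tall 4x2 design matrix

# QR route: A = QR, then solve the triangular system R x = Q^T b.
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Library route (SVD-based under the hood).
x_np, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(x_qr, [1.0, 2.0])   # intercept 1, slope 2
assert np.allclose(x_np, x_qr)
```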

Section 18

Where it shows up in ML

Every neural network is a pile of matrices with nonlinearities between them. Every optimizer is a game played on vectors. Here's the map from the concepts above to the things you've heard of.

Keep these connections in mind as you read the rest of the guide. The abstractions in modern ML papers — attention, low-rank adapters, state-space models, neural tangent kernels — are rarely more than a creative arrangement of the pieces above.

Further reading

Where to go next

Linear algebra rewards repetition. Read two or three of these in parallel and the subject stops looking like lists of identities and starts looking like geometry.

Textbooks

Free video courses

References and cheat sheets

This page is the first chapter of a longer compendium under construction. Up next: multivariable calculus, probability, and optimization — the rest of the mathematical engine room.