Nearly every operation in a neural network is a matrix multiplication. Embeddings are vectors; attention is a dot product; PCA is a singular value decomposition. Linear algebra is not glamorous, but it is the grammar. This chapter teaches it from the ground up — vectors, matrices, transformations, eigenvectors, SVD — with just enough geometric intuition that the ML references in later chapters feel inevitable rather than mysterious.
Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make the prose precise. If a formula looks scary, look at the picture next to it first.
Notation: scalars are lowercase italic ($x$, $\lambda$); vectors are bold lowercase ($\mathbf{v}$, $\mathbf{x}$); matrices are capitals ($A$, $\Sigma$). $\mathbb{R}^n$ means "the set of lists of $n$ real numbers." Code-like names appear in monospace. Key terms when first defined appear in a slightly different color.
The short answer: linear algebra is the mathematics of vectors and matrices, and of the transformations that move them around. The longer answer explains why that is the most useful abstraction in applied mathematics.
A system behaves linearly if scaling an input scales the output by the same factor, and if adding two inputs produces the sum of their outputs. Formally, a function $f$ is linear when
$$ f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha\, f(\mathbf{x}) + \beta\, f(\mathbf{y}) $$for any inputs $\mathbf{x}, \mathbf{y}$ and scalars $\alpha, \beta$. That single property — preserving addition and scaling — turns out to be enough structure to build decomposition theorems, efficient algorithms, and an entire geometry that computers can reason about at trillions of operations per second.
Most of modern machine learning sits on top of this linearity. A dense layer of a neural network is a matrix multiplication. Word embeddings are vectors in $\mathbb{R}^d$. Attention is a dot product. Principal component analysis is a singular value decomposition. Even the non-linear pieces — activations, softmaxes, attention masks — live in between linear operations, which do the actual heavy lifting of mixing information across features and examples.
Linear algebra gives us two things: a language for describing high-dimensional data (vectors, matrices, subspaces) and an algebra for transforming it efficiently. Both matter. The language lets you think; the algebra lets your laptop finish in an afternoon.
The plan for the rest of this chapter: we build up from the smallest objects (vectors) to the operations on them (addition, dot products, norms), introduce matrices as both data and as functions, study how matrices stretch and rotate space, and finish with the two results you cannot do serious ML without — the eigendecomposition and the singular value decomposition.
A vector is the simplest object in linear algebra — just a list of numbers. But the power of the concept comes from holding three different pictures in mind at once.
A vector in $\mathbb{R}^n$ is formally an ordered tuple of $n$ real numbers,
$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n. $$You will see this same object wearing three different hats. The list view treats it as a column of numbers — how it lives inside a computer. The geometric view treats it as an arrow from the origin, which is how you get intuition about length, angle, and direction. The point view treats it as a location in space — which is how you reason about data, where each training example is a point in $\mathbb{R}^n$.
All three views are always correct; the trick is to pick the right one for the question you are asking.
When a data scientist says "a training example is a vector," they mean: every observation is represented as a fixed-length list of numerical features, and every such list is a point in a common space $\mathbb{R}^d$. A house might be represented by its square footage, its number of bedrooms, and its year built; that makes it a point in $\mathbb{R}^3$. A $256 \times 256$ grayscale image flattens to a point in $\mathbb{R}^{65{,}536}$. A word embedding typically lives in $\mathbb{R}^{768}$ or $\mathbb{R}^{4096}$. The dimension of the vector is the number of features; the space it lives in is called an ambient space or feature space.
Anything you want a model to reason about — a user, a document, a protein, a trajectory — first gets turned into a vector. Almost every algorithm in this compendium is a story about how to cook up that vector, how to compare it to others, and how to transform it into a prediction.
There are only three operations you absolutely need: add two vectors, scale one by a scalar, and take the inner (dot) product of two. Everything else — angles, similarities, projections, norms — comes from these.
Addition is coordinate-wise,
$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}, $$and geometrically it puts the arrows head to tail: you walk along $\mathbf{u}$, then walk along $\mathbf{v}$, and the sum points from the origin to where you end up. Equivalently, the sum is the diagonal of the parallelogram formed by the two vectors.
Scalar multiplication stretches a vector (and reverses it if the scalar is negative),
$$ \alpha \mathbf{v} = \begin{bmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{bmatrix}. $$The dot product (also called the inner product) combines two vectors of the same size into a single number:
$$ \mathbf{u} \cdot \mathbf{v} \;=\; \mathbf{u}^\top \mathbf{v} \;=\; \sum_{i=1}^{n} u_i v_i. $$The geometric identity that makes this operation genuinely useful is
$$ \mathbf{u} \cdot \mathbf{v} \;=\; \|\mathbf{u}\|\, \|\mathbf{v}\|\, \cos\theta, $$where $\theta$ is the angle between the two vectors. That one line unlocks almost every similarity-based technique in ML. When $\mathbf{u}$ and $\mathbf{v}$ point in the same direction, the dot product is large and positive. When they are perpendicular, it is zero. When they point in opposite directions, it is negative. Normalizing out the lengths gives you the cosine similarity, which is exactly how search engines rank documents and how vector databases match embeddings.
Whenever you see an "alignment" or "similarity" score in a paper — whether it is query-key attention in a transformer, or a contrastive-learning score between an image and its caption — a dot product is doing the work. The softmax that usually follows it just turns the raw scores into probabilities.
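As a concrete check of the geometric identity above, here is a minimal NumPy sketch of cosine similarity (the helper name `cosine_similarity` is ours, not a library function):

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product with the lengths normalized out: u.v / (|u| |v|)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([3.0, 0.0])

same_direction = cosine_similarity(a, c)   # 1.0: parallel, regardless of length
perpendicular  = cosine_similarity(a, b)   # 0.0: orthogonal vectors
opposite       = cosine_similarity(a, -c)  # -1.0: anti-parallel
```

Note that scaling `c` by any positive factor leaves the score unchanged; only direction matters.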
A norm measures the size of a vector. Different norms measure size differently, which matters for regularization, optimization, and the geometry of nearest-neighbor search.
The Euclidean norm, or $L^2$ norm, is the familiar straight-line length,
$$ \|\mathbf{v}\|_2 \;=\; \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} \;=\; \sqrt{\mathbf{v}^\top \mathbf{v}}. $$The $L^1$ norm (taxicab norm) sums absolute values,
$$ \|\mathbf{v}\|_1 \;=\; |v_1| + |v_2| + \cdots + |v_n|. $$The $L^\infty$ norm picks the largest coordinate in absolute value,
$$ \|\mathbf{v}\|_\infty \;=\; \max_i |v_i|. $$They are all special cases of the $L^p$ norm $\|\mathbf{v}\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$. The unit ball — the set of vectors with $\|\mathbf{v}\| \le 1$ — tells you what the norm "thinks" a unit is.
Regularization with an $L^2$ penalty (ridge regression, weight decay) shrinks all coefficients toward zero smoothly. Regularization with an $L^1$ penalty (lasso) tends to set coefficients exactly to zero — the optimum often lands on a corner of the $L^1$ ball, where several coordinates are already zero. That is how lasso produces sparse models, and it's entirely a consequence of the shape of the unit ball.
The distance between two vectors is the norm of their difference: $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$. Choosing a norm chooses a notion of distance, and the choice quietly shapes everything from clustering to nearest-neighbor classifiers.
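The three norms side by side, as a NumPy sketch (assuming NumPy is available):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l2   = np.linalg.norm(v)            # sqrt(9 + 16) = 5.0, the default (L2)
l1   = np.linalg.norm(v, 1)         # |3| + |-4| + |0| = 7.0
linf = np.linalg.norm(v, np.inf)    # max(|3|, |4|, |0|) = 4.0

# Distance between two points is the norm of their difference.
u = np.array([0.0, 0.0, 0.0])
dist = np.linalg.norm(v - u)        # 5.0 under the L2 norm
```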
Once you can add and scale, you can describe whole regions of space with a handful of vectors. That describability is what lets us compress a billion-dimensional data space down to something our laptops — and our intuitions — can handle.
A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is any vector of the form
$$ \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k $$with real coefficients $\alpha_i$. The span of a set of vectors is the collection of every linear combination you can build from them. Two non-parallel vectors in $\mathbb{R}^2$ span the whole plane. Vectors lying on a common plane in $\mathbb{R}^3$ span at most that plane — no matter how many of them you take.
A set of vectors is linearly independent if none of them can be written as a linear combination of the others. Equivalently, the only way to get $\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0}$ is with all $\alpha_i = 0$. If a vector is dependent on the others, throwing it out doesn't shrink the span — it's redundant.
A basis for a space is a set of linearly independent vectors that span it — a minimal spanning set. The most common basis for $\mathbb{R}^n$ is the standard basis $\mathbf{e}_1, \ldots, \mathbf{e}_n$, where $\mathbf{e}_i$ has a 1 in the $i$-th slot and zeros elsewhere. The dimension of a space is the number of vectors in any basis (this number is the same for any basis — a nontrivial theorem).
A vector is the same object regardless of basis; its coordinates depend on which basis you're using. Changing basis is just re-expressing the same vector in a different coordinate system. A huge portion of linear algebra — eigendecompositions, SVD, PCA — amounts to finding a smarter basis in which the problem becomes easy.
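In practice, independence is checked numerically: stack the candidate vectors as columns of a matrix and compare its rank to the number of vectors. A NumPy sketch (tolerance handling is left to `matrix_rank`'s defaults):

```python
import numpy as np

# Stack candidate vectors as columns; independence <=> rank equals the count.
independent = np.column_stack([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0],
                               [1.0, 1.0, 1.0]])
dependent = np.column_stack([[1.0, 2.0, 3.0],
                             [2.0, 4.0, 6.0],    # 2x the first column
                             [0.0, 1.0, 0.0]])

rank_indep = np.linalg.matrix_rank(independent)  # 3: these form a basis of R^3
rank_dep   = np.linalg.matrix_rank(dependent)    # 2: one column is redundant
```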
A matrix is a rectangular grid of numbers. That is the bookkeeping view. The more useful view — the one you should keep returning to — is that a matrix is a function from one vector space to another.
An $m \times n$ matrix $A$ has $m$ rows and $n$ columns,
$$ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. $$You will see two mental pictures used more or less interchangeably. The data view treats a matrix as a way to stack information — rows might be examples and columns features, or the other way around. The function view treats a matrix $A \in \mathbb{R}^{m \times n}$ as a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$: it eats an $n$-dimensional vector and spits out an $m$-dimensional vector. Both pictures are always valid; experienced practitioners switch between them sentence by sentence.
The transpose $A^\top$ swaps rows and columns. If $A$ is $m \times n$, then $A^\top$ is $n \times m$, with entries $(A^\top)_{ij} = a_{ji}$. A matrix equal to its own transpose, $A = A^\top$, is called symmetric — these appear everywhere (covariance matrices, kernels, Gram matrices) and have especially clean properties.
Sometimes it is useful to write $A$ in row-stacked form or column-stacked form:
$$ A = \begin{bmatrix} \mathbf{r}_1^\top \\ \mathbf{r}_2^\top \\ \vdots \\ \mathbf{r}_m^\top \end{bmatrix} \qquad\text{or}\qquad A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \end{bmatrix}. $$The row view makes matrix–vector multiplication look like a stack of dot products; the column view makes it look like a weighted sum of columns. Those are the two views of Section 07, and either one can be the right one depending on what you're trying to see.
Matrix–vector multiplication is the single most important operation in numerical linear algebra — and, by extension, in deep learning. It deserves two complementary interpretations.
For $A \in \mathbb{R}^{m\times n}$ and $\mathbf{x} \in \mathbb{R}^n$, the product $A\mathbf{x}$ is a vector in $\mathbb{R}^m$ defined by
$$ (A\mathbf{x})_i \;=\; \sum_{j=1}^{n} a_{ij}\, x_j, \qquad i = 1, \ldots, m. $$Read $A$ as rows. The $i$-th entry of $A\mathbf{x}$ is the dot product of row $\mathbf{r}_i$ with $\mathbf{x}$. This view makes $A\mathbf{x}$ feel like evaluating $m$ different linear measurements of the same input vector.
$$ A\mathbf{x} = \begin{bmatrix} \mathbf{r}_1^\top \mathbf{x} \\ \mathbf{r}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{r}_m^\top \mathbf{x} \end{bmatrix}. $$Read $A$ as columns. The product $A\mathbf{x}$ is a linear combination of the columns of $A$, using the entries of $\mathbf{x}$ as weights,
$$ A\mathbf{x} = x_1 \mathbf{c}_1 + x_2 \mathbf{c}_2 + \cdots + x_n \mathbf{c}_n. $$This view is the geometric one. It says the image of $A$ — everything $A$ can produce — lives in the column space, the span of the columns. If your $\mathbf{x}$ doesn't have a clever combination that adds up to the target $\mathbf{b}$, the equation $A\mathbf{x} = \mathbf{b}$ simply has no solution.
The row view asks "what measurements does this matrix make on my input?" The column view asks "what outputs can this matrix possibly produce?" Most confusions in a linear algebra argument come from sticking to the wrong view — switching sometimes makes the same calculation look obvious.
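Both readings compute the same numbers, which a small NumPy sketch makes concrete:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # a linear map from R^2 to R^3
x = np.array([10.0, 1.0])

product = A @ x                 # [12, 34, 56]

# Row view: each output entry is a dot product of one row with x.
row_view = np.array([A[i] @ x for i in range(A.shape[0])])

# Column view: the output is a weighted sum of the columns of A.
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]
```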
Multiplying two matrices composes two linear functions. Get that one idea and the mechanics (which row hits which column) become bookkeeping.
If $A \in \mathbb{R}^{m\times k}$ and $B \in \mathbb{R}^{k\times n}$, the product $C = AB \in \mathbb{R}^{m\times n}$ is defined entrywise as
$$ c_{ij} \;=\; \sum_{\ell=1}^{k} a_{i\ell}\, b_{\ell j}. $$
The shapes have to match: the inner dimensions ($k$ and $k$) must agree; the outer dimensions ($m$ and $n$) become the shape of the product. The clean way to remember it: (m, k) × (k, n) → (m, n).
If $B$ is the linear map $\mathbb{R}^n \to \mathbb{R}^k$ and $A$ is the linear map $\mathbb{R}^k \to \mathbb{R}^m$, then the matrix $AB$ is the map $\mathbb{R}^n \to \mathbb{R}^m$ you get by applying $B$ first and $A$ second. Formally, $(AB)\mathbf{x} = A(B\mathbf{x})$. This is why matrix multiplication shows up in every neural network: stacking two linear layers is, literally, multiplying their weight matrices.
Matrix multiplication is the one operation most beginners memorize mechanically without reaching for the composition view. Every time you see $ABC\mathbf{x}$, re-read it as "transform $\mathbf{x}$ by $C$, then by $B$, then by $A$" — right to left. That reading makes the identities above obvious instead of tedious.
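The composition identity $(AB)\mathbf{x} = A(B\mathbf{x})$ can be verified directly; a NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # linear map R^3 -> R^4
B = rng.standard_normal((3, 5))   # linear map R^5 -> R^3
x = rng.standard_normal(5)

# Multiplying the matrices first, or applying B then A, gives the same result.
composed = (A @ B) @ x            # (4,5) matrix applied to x
stepwise = A @ (B @ x)            # B first, then A
```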
A matrix is a linear transformation. Seeing what transformations look like — rotations, scalings, shears, projections — turns matrix algebra from symbol-shuffling into geometry.
The single most useful trick in this entire chapter: a linear transformation is completely determined by where it sends the basis vectors. If $A\mathbf{e}_1 = \mathbf{c}_1$ and $A\mathbf{e}_2 = \mathbf{c}_2$, then $A$'s matrix representation is just those images stacked as columns,
$$ A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 \end{bmatrix}. $$Read the other way: the columns of $A$ are the images of the basis vectors. So to know what $A$ does to the whole plane, it suffices to know what it does to $\mathbf{e}_1$ and $\mathbf{e}_2$, and everything else follows by linearity.
Rotations, scalings, shears, projections: every one of these is a matrix, and every one preserves the origin (that's the "linear" bit).
Every more exotic transformation (change of basis, orthogonal transformations, low-rank approximations) is built out of these pieces. When you run a neural network forward pass, each weight matrix is silently performing some mix of scaling, rotation, and shear — usually in very high dimensions.
A handful of matrix types show up again and again. Each has structure that makes computation easier and meaning clearer.
The identity matrix $I_n$ is the matrix that does nothing: $Iv = v$ for every $v$. It has ones on the diagonal and zeros everywhere else. It plays the role of $1$ in the world of matrices.
$$ I_3 = \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix} $$A diagonal matrix has zeros off the diagonal. It scales each coordinate independently — the cleanest transformation there is. An upper-triangular matrix has zeros below the diagonal; a lower-triangular matrix has zeros above. Triangular systems are trivial to solve by back-substitution, which is why most decomposition algorithms try to produce them.
A square matrix $A$ is symmetric if $A^\top = A$ — the entry at $(i, j)$ equals the entry at $(j, i)$. Covariance matrices, Gram matrices ($A^\top A$), and Hessians of smooth functions are all symmetric. Symmetric matrices have real eigenvalues and orthogonal eigenvectors — we'll see why in Sections 13 and 14.
A real square matrix $Q$ is orthogonal if its columns (equivalently, its rows) form an orthonormal set: each has unit length and any two distinct columns are perpendicular. Equivalently,
$$ Q^\top Q = Q Q^\top = I, \qquad \text{so} \qquad Q^{-1} = Q^\top. $$Geometrically, orthogonal matrices are the rigid motions of space — rotations and reflections. They preserve lengths, angles, and volumes. They're numerically wonderful because they never amplify errors.
A symmetric matrix $A$ is positive definite if $x^\top A x > 0$ for every nonzero $x$. Positive-definite matrices describe quadratic forms that open upward in every direction — think bowl-shaped loss surfaces. Covariance matrices are positive semi-definite; Hessians at a strict local minimum are positive definite.
The determinant is a single number that captures how a linear transformation scales — and flips — space.
For a $2 \times 2$ matrix $A$, $\det(A)$ is the signed area of the parallelogram spanned by the columns of $A$. For a $3 \times 3$ matrix, it's the signed volume of the parallelepiped spanned by the columns. In general, $|\det(A)|$ is the factor by which $A$ rescales $n$-dimensional volume.
In general, the determinant can be computed by cofactor expansion (expensive, $O(n!)$) or — much faster — by reducing $A$ to triangular form and multiplying the diagonal ($O(n^3)$).
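A NumPy sketch of the area-scaling picture, using the fact (not proved here) that determinants are multiplicative, $\det(AB) = \det(A)\det(B)$:

```python
import numpy as np

scale = np.array([[2.0, 0.0],
                  [0.0, 3.0]])   # stretches areas by 2 * 3 = 6
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # slides space sideways; area unchanged
flip  = np.array([[0.0, 1.0],
                  [1.0, 0.0]])   # reflection: flips orientation

det_scale    = np.linalg.det(scale)          # 6.0
det_shear    = np.linalg.det(shear)          # 1.0
det_composed = np.linalg.det(scale @ shear)  # 6.0: determinants multiply
det_flip     = np.linalg.det(flip)           # -1.0: the sign records the flip
```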
A matrix has an input side and an output side. Each has a natural subspace, and their dimensions are linked by one of the most elegant theorems in mathematics.
The column space of an $m \times n$ matrix $A$, written $\mathrm{Col}(A)$, is the span of its columns — it lives in $\mathbb{R}^m$ and is exactly the set of outputs $Av$ as $v$ ranges over $\mathbb{R}^n$. In other words, $\mathrm{Col}(A)$ is the image or range of the linear map.
The rank of $A$ is the dimension of $\mathrm{Col}(A)$ — the number of linearly independent columns, or equivalently the number of linearly independent rows.
$$ \mathrm{rank}(A) \;\le\; \min(m, n). $$A matrix is full rank if it achieves this bound. "Low-rank" matrices — ones whose rank is much smaller than their dimensions — are compressible: they can be written as a product $A = UV^\top$ with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$. This is the whole idea behind LoRA and model compression.
The null space (or kernel) of $A$ is
$$ \mathrm{Null}(A) = \{\, v \in \mathbb{R}^n \;:\; Av = 0 \,\}. $$It's the set of inputs that $A$ crushes to zero. The null space captures the ambiguity of $A$ as a function: if $Av_1 = Av_2$, then $v_1 - v_2 \in \mathrm{Null}(A)$.
Every matrix $A \in \mathbb{R}^{m\times n}$ has four associated subspaces: the column space and the left null space (both living in $\mathbb{R}^m$), and the row space and the null space (both living in $\mathbb{R}^n$).
The row space and null space are orthogonal complements in $\mathbb{R}^n$, and so are the column space and left null space in $\mathbb{R}^m$. Gilbert Strang draws these as two pairs of perpendicular planes in his famous "Big Picture" diagram — it's worth looking up.
Most vectors rotate when you apply a matrix. A precious few only stretch. Those are the eigenvectors — and they reveal the matrix's hidden skeleton.
An eigenvector of a square matrix $A$ is a nonzero vector $v$ that $A$ sends to a scalar multiple of itself. The scalar is the eigenvalue $\lambda$:
$$ A v = \lambda v, \qquad v \neq 0. $$Along an eigenvector, $A$ acts as pure scaling by $\lambda$. All the twisting and shearing that $A$ does to other vectors simply doesn't happen along these special directions.
Rearranging $Av = \lambda v$ gives $(A - \lambda I)v = 0$. For a nonzero $v$ to satisfy this, $A - \lambda I$ must be singular — its determinant must vanish. The characteristic polynomial
$$ p(\lambda) = \det(A - \lambda I) $$has the eigenvalues as its roots. For each eigenvalue, the corresponding eigenvectors are the nonzero elements of $\mathrm{Null}(A - \lambda I)$.
Eigenvalues tell you about stability, dominant directions, and long-term behavior. The largest eigenvalue of a transition matrix governs how fast a Markov chain mixes. The leading eigenvector of a graph's adjacency matrix (or its normalized variant) ranks the most "central" nodes — Google's PageRank is an eigenvector computation. The eigenvalues of the Hessian near a minimum tell you the curvature of the loss surface in each direction, which is why condition number (ratio of largest to smallest eigenvalue) shows up whenever people talk about why optimization is hard.
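A NumPy sketch verifying $Av = \lambda v$ on a small symmetric matrix (whose eigenvalues happen to be $3$ and $1$):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # symmetric, so real eigenvalues

eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs is an eigenvector paired with the matching eigenvalue.
lam = eigvals[0]
v = eigvecs[:, 0]
residual = np.linalg.norm(A @ v - lam * v)   # ~0: A only scales v
```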
If you can find enough eigenvectors, you can rewrite a matrix in the simplest possible form: a pure scaling along a tilted set of axes.
Suppose $A$ is $n \times n$ and has $n$ linearly independent eigenvectors $v_1, \ldots, v_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$. Stack the eigenvectors as columns of a matrix $V$ and the eigenvalues on the diagonal of $\Lambda$:
$$ A = V \Lambda V^{-1}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n). $$Read the identity right to left: $V^{-1}$ expresses any input in the eigenbasis, $\Lambda$ scales each component by its eigenvalue, and $V$ translates back to the original coordinates. In eigen-coordinates, $A$ is just independent scalings on each axis.
Diagonalization makes $A^k$ almost free:
$$ A^k = V \Lambda^k V^{-1}, \qquad \Lambda^k = \mathrm{diag}(\lambda_1^k, \ldots, \lambda_n^k). $$The same trick extends to any analytic function of a matrix: $\exp(A) = V \exp(\Lambda) V^{-1}$, and so on. This is how you solve linear differential equations by hand and why "iterating a linear system" boils down to studying the eigenvalues.
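Checking the power identity numerically, assuming NumPy; the same symmetric matrix from before has eigenvalues $3$ and $1$, so $A^5$ has entries built from $3^5 = 243$ and $1^5 = 1$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, V = np.linalg.eig(A)

# A^5 two ways: repeated multiplication vs scaling in the eigenbasis.
direct    = np.linalg.matrix_power(A, 5)
via_eigen = V @ np.diag(eigvals**5) @ np.linalg.inv(V)
```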
Numerical linear algebra is, to a first approximation, the art of factoring a matrix into pieces you can compute with.
Any (sufficiently nice) square matrix $A$ can be written as
$$ A = LU $$with $L$ lower-triangular (ones on the diagonal) and $U$ upper-triangular. This is exactly what Gaussian elimination produces. Once you have $L$ and $U$, solving $Ax = b$ takes two triangular solves — forward substitution for $Ly = b$, then back-substitution for $Ux = y$. In practice algorithms use a permutation for stability: $PA = LU$.
Any $m \times n$ matrix $A$ with independent columns ($m \ge n$) can be written as
$$ A = QR $$with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper-triangular. Geometrically, $Q$ carries the Gram–Schmidt-orthogonalized columns; $R$ records the coefficients that rebuild the originals from them. QR is the workhorse of least-squares (we'll use it in Section 17) and the starting point for the most common eigenvalue algorithm, QR iteration.
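In NumPy, `np.linalg.qr` returns this "reduced" factorization directly; a quick sketch checking all three properties:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # tall matrix, independent columns (w.h.p.)

Q, R = np.linalg.qr(A)            # reduced QR: Q is 5x3, R is 3x3

orthonormal = np.allclose(Q.T @ Q, np.eye(3))   # Q's columns are orthonormal
triangular  = np.allclose(R, np.triu(R))        # R is upper-triangular
rebuilt     = np.allclose(Q @ R, A)             # and Q R rebuilds A
```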
If $A$ is symmetric positive-definite, it factors as
$$ A = L L^\top $$with $L$ lower-triangular (positive diagonal). Cholesky is about twice as fast as LU and more numerically stable. It's the reason covariance-heavy workloads (Gaussian processes, Kalman filters, portfolio optimization) scale as well as they do.
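A NumPy sketch; note that $BB^\top$ plus a multiple of the identity is a convenient way to manufacture a symmetric positive-definite test matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)       # symmetric positive-definite by construction

L = np.linalg.cholesky(A)         # lower-triangular factor

lower    = np.allclose(L, np.tril(L))
rebuilt  = np.allclose(L @ L.T, A)
pos_diag = bool(np.all(np.diag(L) > 0))
```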
If diagonalization is the crown jewel of linear algebra, the SVD is the crown. It works for every matrix, rectangular or square, rank-deficient or not, and says: every linear map is a rotate, scale, rotate.
Every real matrix $A \in \mathbb{R}^{m \times n}$ can be factored as
$$ A = U \Sigma V^\top $$where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ called the singular values of $A$.
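A NumPy sketch of the factorization; note that `np.linalg.svd` returns $V$ already transposed, and the singular values as a vector rather than a matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 5x5, s: (3,), Vt: 3x3

descending = bool(np.all(np.diff(s) <= 0))  # sigma_1 >= sigma_2 >= ... >= 0

# Rebuild the rectangular diagonal Sigma and check A = U Sigma V^T.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
rebuilt = np.allclose(U @ Sigma @ Vt, A)
```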
Truncating the SVD — keeping only the $k$ largest singular values — gives the best rank-$k$ approximation of $A$ in both the Frobenius and spectral norms:
$$ A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top. $$

Most real-world linear systems have no exact solution — there are more equations than unknowns, and the data is noisy. Least squares is the answer to "what's the closest thing to a solution?"
Given a subspace $W \subseteq \mathbb{R}^n$ and a point $b \in \mathbb{R}^n$, the orthogonal projection of $b$ onto $W$ is the point in $W$ closest to $b$ — the foot of the perpendicular dropped from $b$ to $W$. The residual $b - \mathrm{proj}_W b$ is orthogonal to every vector in $W$.
If the columns of $A$ are a basis for $W$, the projection matrix onto $W$ is
$$ P = A(A^\top A)^{-1} A^\top. $$$P$ is symmetric, $P^2 = P$, and $P b \in W$ is the closest point in $W$ to $b$.
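A sketch of $P$ for a concrete subspace, the $xy$-plane inside $\mathbb{R}^3$, assuming NumPy (inverting $A^\top A$ is fine at this toy scale):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # columns span the xy-plane inside R^3
b = np.array([3.0, 4.0, 5.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T

proj = P @ b                      # closest point in the plane: (3, 4, 0)
residual = b - proj               # the perpendicular part: (0, 0, 5)

idempotent          = np.allclose(P @ P, P)       # projecting twice = once
symmetric           = np.allclose(P, P.T)
orthogonal_residual = np.allclose(A.T @ residual, 0)
```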
We're given a tall matrix $A \in \mathbb{R}^{m \times n}$ (more rows than columns — more equations than unknowns) and a target $b \in \mathbb{R}^m$. There's usually no $x$ with $Ax = b$ exactly. Least squares asks for the $x$ that minimizes the residual:
$$ \hat{x} = \arg\min_{x} \; \| A x - b \|_2^2. $$The minimizer $\hat{x}$ satisfies $A\hat{x} = Pb$ — i.e. $Ax$ is the projection of $b$ onto the column space. Writing this using orthogonality of the residual gives the normal equations:
$$ A^\top A\, \hat{x} = A^\top b. $$When $A$ has independent columns, $A^\top A$ is invertible and
$$ \hat{x} = (A^\top A)^{-1} A^\top b. $$Don't actually invert $A^\top A$ — it squares the condition number, and the normal equations can be numerically awful. Use a QR- or SVD-based least-squares solver instead.
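In NumPy that solver is `np.linalg.lstsq` (SVD-based under the hood). On a well-conditioned random problem it agrees with the normal equations, and the residual comes out orthogonal to the column space, exactly as the theory says:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3))              # 20 noisy equations, 3 unknowns
b = rng.standard_normal(20)

# Stable route: an SVD-based least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Textbook route (fine here, risky when A is ill-conditioned).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Optimality check: the residual is orthogonal to Col(A).
residual = b - A @ x_lstsq
```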
Every neural network is a pile of matrices with nonlinearities between them. Every optimizer is a game played on vectors. The concepts above map directly onto the machinery you've heard of.
Keep these connections in mind as you read the rest of the guide. The abstractions in modern ML papers — attention, low-rank adapters, state-space models, neural tangent kernels — are rarely more than a creative arrangement of the pieces above.
Linear algebra rewards repetition. Read two or three standard treatments of the subject in parallel and it stops looking like lists of identities and starts looking like geometry.
This page is the first chapter of a longer compendium under construction. Up next: multivariable calculus, probability, and optimization — the rest of the mathematical engine room.