Nearly every operation in a neural network is a matrix multiplication. Embeddings are vectors; attention is a dot product; PCA is a singular value decomposition. Linear algebra is not glamorous, but it is the grammar. This chapter teaches it from the ground up — vectors, matrices, transformations, eigenvectors, SVD — with just enough geometric intuition that the ML references in later chapters feel inevitable rather than mysterious.
Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make the prose precise. If a formula looks scary, look at the picture next to it first.
Notation: scalars are lowercase italic ($x$, $\lambda$); vectors are bold lowercase ($\mathbf{v}$, $\mathbf{x}$); matrices are capitals ($A$, $\Sigma$). $\mathbb{R}^n$ means "the set of lists of $n$ real numbers." Code-like names appear in monospace. Key terms when first defined appear in a slightly different color.
The short answer: linear algebra is the mathematics of vectors and matrices, and of the transformations that move them around. The longer answer explains why that is the most useful abstraction in applied mathematics.
A system behaves linearly if scaling an input scales the output by the same factor, and if adding two inputs produces the sum of their outputs. Formally, a function $f$ is linear when
$$ f(\alpha \mathbf{x} + \beta \mathbf{y}) = \alpha\, f(\mathbf{x}) + \beta\, f(\mathbf{y}) $$for any inputs $\mathbf{x}, \mathbf{y}$ and scalars $\alpha, \beta$. That single property — preserving addition and scaling — turns out to be enough structure to build decomposition theorems, efficient algorithms, and an entire geometry that computers can reason about at trillions of operations per second.
Most of modern machine learning sits on top of this linearity. A dense layer of a neural network is a matrix multiplication. Word embeddings are vectors in $\mathbb{R}^d$. Attention is a dot product. Principal component analysis is a singular value decomposition. Even the non-linear pieces — activations, softmaxes, attention masks — live in between linear operations, which do the actual heavy lifting of mixing information across features and examples.
Linear algebra gives us two things: a language for describing high-dimensional data (vectors, matrices, subspaces) and an algebra for transforming it efficiently. Both matter. The language lets you think; the algebra lets your laptop finish in an afternoon.
The plan for the rest of this chapter: we build up from the smallest objects (vectors) to the operations on them (addition, dot products, norms), introduce matrices as both data and as functions, study how matrices stretch and rotate space, and finish with the two results you cannot do serious ML without — the eigendecomposition and the singular value decomposition.
A vector is the simplest object in linear algebra — just a list of numbers. But the power of the concept comes from holding three different pictures in mind at once.
A vector in $\mathbb{R}^n$ is formally an ordered tuple of $n$ real numbers,
$$ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \in \mathbb{R}^n. $$You will see this same object wearing three different hats. The list view treats it as a column of numbers — how it lives inside a computer. The geometric view treats it as an arrow from the origin, which is how you get intuition about length, angle, and direction. The point view treats it as a location in space — which is how you reason about data, where each training example is a point in $\mathbb{R}^n$.
All three views are always correct; the trick is to pick the right one for the question you are asking.
When a data scientist says "a training example is a vector," they mean: every observation is represented as a fixed-length list of numerical features, and every such list is a point in a common space $\mathbb{R}^d$. A house might be represented by its square footage, its number of bedrooms, and its year built; that makes it a point in $\mathbb{R}^3$. A $256 \times 256$ grayscale image flattens to a point in $\mathbb{R}^{65{,}536}$. A word embedding typically lives in $\mathbb{R}^{768}$ or $\mathbb{R}^{4096}$. The dimension of the vector is the number of features; the space it lives in is called an ambient space or feature space.
Anything you want a model to reason about — a user, a document, a protein, a trajectory — first gets turned into a vector. Almost every algorithm in this compendium is a story about how to cook up that vector, how to compare it to others, and how to transform it into a prediction.
There are only three operations you absolutely need: add two vectors, scale one by a scalar, and take the inner (dot) product of two. Everything else — angles, similarities, projections, norms — comes from these.
Addition is coordinate-wise,
$$ \mathbf{u} + \mathbf{v} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \vdots \\ u_n + v_n \end{bmatrix}, $$and geometrically it puts the arrows head to tail: you walk along $\mathbf{u}$, then walk along $\mathbf{v}$, and the sum points from the origin to where you end up. Equivalently, the sum is the diagonal of the parallelogram formed by the two vectors.
Scalar multiplication stretches a vector (and reverses it if the scalar is negative),
$$ \alpha \mathbf{v} = \begin{bmatrix} \alpha v_1 \\ \alpha v_2 \\ \vdots \\ \alpha v_n \end{bmatrix}. $$The dot product (also called the inner product) combines two vectors of the same size into a single number:
$$ \mathbf{u} \cdot \mathbf{v} \;=\; \mathbf{u}^\top \mathbf{v} \;=\; \sum_{i=1}^{n} u_i v_i. $$The geometric identity that makes this operation genuinely useful is
$$ \mathbf{u} \cdot \mathbf{v} \;=\; \|\mathbf{u}\|\, \|\mathbf{v}\|\, \cos\theta, $$where $\theta$ is the angle between the two vectors. That one line unlocks almost every similarity-based technique in ML. When $\mathbf{u}$ and $\mathbf{v}$ point in the same direction, the dot product is large and positive. When they are perpendicular, it is zero. When they point in opposite directions, it is negative. Normalizing out the lengths gives you the cosine similarity, which is exactly how search engines rank documents and how vector databases match embeddings.
Whenever you see an "alignment" or "similarity" score in a paper — whether it is query-key attention in a transformer, or a contrastive-learning score between an image and its caption — a dot product is doing the work. The softmax that usually follows it just turns the raw scores into probabilities.
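As a concrete check of the geometric identity above, here is a minimal NumPy sketch of cosine similarity (the helper name `cosine_similarity` is ours, not a library function):

```python
import numpy as np

def cosine_similarity(u, v):
    """Dot product with the lengths normalized out: u.v / (|u| |v|)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([3.0, 0.0])

same_direction = cosine_similarity(a, c)   # 1.0: parallel, regardless of length
perpendicular  = cosine_similarity(a, b)   # 0.0: orthogonal vectors
opposite       = cosine_similarity(a, -c)  # -1.0: anti-parallel
```

Note that scaling `c` by any positive factor leaves the score unchanged; only direction matters.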
A norm measures the size of a vector. Different norms measure size differently, which matters for regularization, optimization, and the geometry of nearest-neighbor search.
The Euclidean norm, or $L^2$ norm, is the familiar straight-line length,
$$ \|\mathbf{v}\|_2 \;=\; \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2} \;=\; \sqrt{\mathbf{v}^\top \mathbf{v}}. $$The $L^1$ norm (taxicab norm) sums absolute values,
$$ \|\mathbf{v}\|_1 \;=\; |v_1| + |v_2| + \cdots + |v_n|. $$The $L^\infty$ norm picks the largest coordinate in absolute value,
$$ \|\mathbf{v}\|_\infty \;=\; \max_i |v_i|. $$They are all special cases of the $L^p$ norm $\|\mathbf{v}\|_p = \left(\sum_i |v_i|^p\right)^{1/p}$. The unit ball — the set of vectors with $\|\mathbf{v}\| \le 1$ — tells you what the norm "thinks" a unit is.
Regularization with an $L^2$ penalty (ridge regression, weight decay) shrinks all coefficients toward zero smoothly. Regularization with an $L^1$ penalty (lasso) tends to set coefficients exactly to zero — the optimum often lands on a corner of the $L^1$ ball, where several coordinates are already zero. That is how lasso produces sparse models, and it's entirely a consequence of the shape of the unit ball.
The distance between two vectors is the norm of their difference: $d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|$. Choosing a norm chooses a notion of distance, and the choice quietly shapes everything from clustering to nearest-neighbor classifiers.
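The three norms side by side, as a NumPy sketch (assuming NumPy is available):

```python
import numpy as np

v = np.array([3.0, -4.0, 0.0])

l2   = np.linalg.norm(v)            # sqrt(9 + 16) = 5.0, the default (L2)
l1   = np.linalg.norm(v, 1)         # |3| + |-4| + |0| = 7.0
linf = np.linalg.norm(v, np.inf)    # max(|3|, |4|, |0|) = 4.0

# Distance between two points is the norm of their difference.
u = np.array([0.0, 0.0, 0.0])
dist = np.linalg.norm(v - u)        # 5.0 under the L2 norm
```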
Once you can add and scale, you can describe whole regions of space with a handful of vectors. That describability is what lets us compress a billion-dimensional data space down to something our laptops — and our intuitions — can handle.
A linear combination of vectors $\mathbf{v}_1, \ldots, \mathbf{v}_k$ is any vector of the form
$$ \alpha_1 \mathbf{v}_1 + \alpha_2 \mathbf{v}_2 + \cdots + \alpha_k \mathbf{v}_k $$with real coefficients $\alpha_i$. The span of a set of vectors is the collection of every linear combination you can build from them. Two non-parallel vectors in $\mathbb{R}^2$ span the whole plane. Vectors lying on a common plane in $\mathbb{R}^3$ span at most that plane — no matter how many of them you take.
A set of vectors is linearly independent if none of them can be written as a linear combination of the others. Equivalently, the only way to get $\alpha_1 \mathbf{v}_1 + \cdots + \alpha_k \mathbf{v}_k = \mathbf{0}$ is with all $\alpha_i = 0$. If a vector is dependent on the others, throwing it out doesn't shrink the span — it's redundant.
A basis for a space is a set of linearly independent vectors that span it — a minimal spanning set. The most common basis for $\mathbb{R}^n$ is the standard basis $\mathbf{e}_1, \ldots, \mathbf{e}_n$, where $\mathbf{e}_i$ has a 1 in the $i$-th slot and zeros elsewhere. The dimension of a space is the number of vectors in any basis (this number is the same for any basis — a nontrivial theorem).
A vector is the same object regardless of basis; its coordinates depend on which basis you're using. Changing basis is just re-expressing the same vector in a different coordinate system. A huge portion of linear algebra — eigendecompositions, SVD, PCA — amounts to finding a smarter basis in which the problem becomes easy.
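In practice, independence is checked numerically: stack the candidate vectors as columns of a matrix and compare its rank to the number of vectors. A NumPy sketch (tolerance handling is left to `matrix_rank`'s defaults):

```python
import numpy as np

# Stack candidate vectors as columns; independence <=> rank equals the count.
independent = np.column_stack([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0],
                               [1.0, 1.0, 1.0]])
dependent = np.column_stack([[1.0, 2.0, 3.0],
                             [2.0, 4.0, 6.0],    # 2x the first column
                             [0.0, 1.0, 0.0]])

rank_indep = np.linalg.matrix_rank(independent)  # 3: these form a basis of R^3
rank_dep   = np.linalg.matrix_rank(dependent)    # 2: one column is redundant
```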
A matrix is a rectangular grid of numbers. That is the bookkeeping view. The more useful view — the one you should keep returning to — is that a matrix is a function from one vector space to another.
An $m \times n$ matrix $A$ has $m$ rows and $n$ columns,
$$ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} \in \mathbb{R}^{m \times n}. $$You will see two mental pictures used more or less interchangeably. The data view treats a matrix as a way to stack information — rows might be examples and columns features, or the other way around. The function view treats a matrix $A \in \mathbb{R}^{m \times n}$ as a linear map from $\mathbb{R}^n$ to $\mathbb{R}^m$: it eats an $n$-dimensional vector and spits out an $m$-dimensional vector. Both pictures are always valid; experienced practitioners switch between them sentence by sentence.
The transpose $A^\top$ swaps rows and columns. If $A$ is $m \times n$, then $A^\top$ is $n \times m$, with entries $(A^\top)_{ij} = a_{ji}$. A matrix equal to its own transpose, $A = A^\top$, is called symmetric — these appear everywhere (covariance matrices, kernels, Gram matrices) and have especially clean properties.
Sometimes it is useful to write $A$ in row-stacked form or column-stacked form:
$$ A = \begin{bmatrix} \mathbf{r}_1^\top \\ \mathbf{r}_2^\top \\ \vdots \\ \mathbf{r}_m^\top \end{bmatrix} \qquad\text{or}\qquad A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_n \end{bmatrix}. $$The row view makes matrix–vector multiplication look like a stack of dot products; the column view makes it look like a weighted sum of columns. Those are the two views of Section 07, and either one can be the right one depending on what you're trying to see.
Matrix–vector multiplication is the single most important operation in numerical linear algebra — and, by extension, in deep learning. It deserves two complementary interpretations.
For $A \in \mathbb{R}^{m\times n}$ and $\mathbf{x} \in \mathbb{R}^n$, the product $A\mathbf{x}$ is a vector in $\mathbb{R}^m$ defined by
$$ (A\mathbf{x})_i \;=\; \sum_{j=1}^{n} a_{ij}\, x_j, \qquad i = 1, \ldots, m. $$Read $A$ as rows. The $i$-th entry of $A\mathbf{x}$ is the dot product of row $\mathbf{r}_i$ with $\mathbf{x}$. This view makes $A\mathbf{x}$ feel like evaluating $m$ different linear measurements of the same input vector.
$$ A\mathbf{x} = \begin{bmatrix} \mathbf{r}_1^\top \mathbf{x} \\ \mathbf{r}_2^\top \mathbf{x} \\ \vdots \\ \mathbf{r}_m^\top \mathbf{x} \end{bmatrix}. $$Read $A$ as columns. The product $A\mathbf{x}$ is a linear combination of the columns of $A$, using the entries of $\mathbf{x}$ as weights,
$$ A\mathbf{x} = x_1 \mathbf{c}_1 + x_2 \mathbf{c}_2 + \cdots + x_n \mathbf{c}_n. $$This view is the geometric one. It says the image of $A$ — everything $A$ can produce — lives in the column space, the span of the columns. If your $\mathbf{x}$ doesn't have a clever combination that adds up to the target $\mathbf{b}$, the equation $A\mathbf{x} = \mathbf{b}$ simply has no solution.
The row view asks "what measurements does this matrix make on my input?" The column view asks "what outputs can this matrix possibly produce?" Most confusions in a linear algebra argument come from sticking to the wrong view — switching sometimes makes the same calculation look obvious.
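Both readings compute the same numbers, which a small NumPy sketch makes concrete:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])      # a linear map from R^2 to R^3
x = np.array([10.0, 1.0])

product = A @ x                 # [12, 34, 56]

# Row view: each output entry is a dot product of one row with x.
row_view = np.array([A[i] @ x for i in range(A.shape[0])])

# Column view: the output is a weighted sum of the columns of A.
col_view = x[0] * A[:, 0] + x[1] * A[:, 1]
```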
Multiplying two matrices composes two linear functions. Get that one idea and the mechanics (which row hits which column) become bookkeeping.
If $A \in \mathbb{R}^{m\times k}$ and $B \in \mathbb{R}^{k\times n}$, the product $C = AB \in \mathbb{R}^{m\times n}$ is defined entrywise as
$$ c_{ij} \;=\; \sum_{\ell=1}^{k} a_{i\ell}\, b_{\ell j}. $$
The shapes have to match: the inner dimensions ($k$ and $k$) must agree; the outer dimensions ($m$ and $n$) become the shape of the product. The clean way to remember it: (m, k) × (k, n) → (m, n).
If $B$ is the linear map $\mathbb{R}^n \to \mathbb{R}^k$ and $A$ is the linear map $\mathbb{R}^k \to \mathbb{R}^m$, then the matrix $AB$ is the map $\mathbb{R}^n \to \mathbb{R}^m$ you get by applying $B$ first and $A$ second. Formally, $(AB)\mathbf{x} = A(B\mathbf{x})$. This is why matrix multiplication shows up in every neural network: stacking two linear layers is, literally, multiplying their weight matrices.
Matrix multiplication is the one operation most beginners memorize mechanically without reaching for the composition view. Every time you see $ABC\mathbf{x}$, re-read it as "transform $\mathbf{x}$ by $C$, then by $B$, then by $A$" — right to left. That reading makes the identities above obvious instead of tedious.
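The composition identity $(AB)\mathbf{x} = A(B\mathbf{x})$ can be verified directly; a NumPy sketch with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))   # linear map R^3 -> R^4
B = rng.standard_normal((3, 5))   # linear map R^5 -> R^3
x = rng.standard_normal(5)

# Multiplying the matrices first, or applying B then A, gives the same result.
composed = (A @ B) @ x            # (4,5) matrix applied to x
stepwise = A @ (B @ x)            # B first, then A
```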
A matrix is a linear transformation. Seeing what transformations look like — rotations, scalings, shears, projections — turns matrix algebra from symbol-shuffling into geometry.
The single most useful trick in this entire chapter: a linear transformation is completely determined by where it sends the basis vectors. If $A\mathbf{e}_1 = \mathbf{c}_1$ and $A\mathbf{e}_2 = \mathbf{c}_2$, then $A$'s matrix representation is just those images stacked as columns,
$$ A = \begin{bmatrix} \mathbf{c}_1 & \mathbf{c}_2 \end{bmatrix}. $$Read the other way: the columns of $A$ are the images of the basis vectors. So to know what $A$ does to the whole plane, it suffices to know what it does to $\mathbf{e}_1$ and $\mathbf{e}_2$, and everything else follows by linearity.
Rotations, scalings, shears, projections: every one of these is a matrix, and every one preserves the origin (that's the "linear" bit).
Every more exotic transformation (change of basis, orthogonal transformations, low-rank approximations) is built out of these pieces. When you run a neural network forward pass, each weight matrix is silently performing some mix of scaling, rotation, and shear — usually in very high dimensions.
A handful of matrix types show up again and again. Each has structure that makes computation easier and meaning clearer.
The identity matrix $I_n$ is the matrix that does nothing: $Iv = v$ for every $v$. It has ones on the diagonal and zeros everywhere else. It plays the role of $1$ in the world of matrices.
$$ I_3 = \begin{bmatrix}1 & 0 & 0\\ 0 & 1 & 0\\ 0 & 0 & 1\end{bmatrix} $$A diagonal matrix has zeros off the diagonal. It scales each coordinate independently — the cleanest transformation there is. An upper-triangular matrix has zeros below the diagonal; a lower-triangular matrix has zeros above. Triangular systems are trivial to solve by back-substitution, which is why most decomposition algorithms try to produce them.
A square matrix $A$ is symmetric if $A^\top = A$ — the entry at $(i, j)$ equals the entry at $(j, i)$. Covariance matrices, Gram matrices ($A^\top A$), and Hessians of smooth functions are all symmetric. Symmetric matrices have real eigenvalues and orthogonal eigenvectors — we'll see why in Sections 13 and 14.
A real square matrix $Q$ is orthogonal if its columns (equivalently, its rows) form an orthonormal set: each has unit length and any two distinct columns are perpendicular. Equivalently,
$$ Q^\top Q = Q Q^\top = I, \qquad \text{so} \qquad Q^{-1} = Q^\top. $$Geometrically, orthogonal matrices are the rigid motions of space — rotations and reflections. They preserve lengths, angles, and volumes. They're numerically wonderful because they never amplify errors.
A symmetric matrix $A$ is positive definite if $x^\top A x > 0$ for every nonzero $x$. Positive-definite matrices describe quadratic forms that open upward in every direction — think bowl-shaped loss surfaces. Covariance matrices are positive semi-definite; Hessians at a strict local minimum are positive definite.
The determinant is a single number that captures how a linear transformation scales — and flips — space.
For a $2 \times 2$ matrix $A$, $\det(A)$ is the signed area of the parallelogram spanned by the columns of $A$. For a $3 \times 3$ matrix, it's the signed volume of the parallelepiped spanned by the columns. In general, $|\det(A)|$ is the factor by which $A$ rescales $n$-dimensional volume.
In general, the determinant can be computed by cofactor expansion (expensive, $O(n!)$) or — much faster — by reducing $A$ to triangular form and multiplying the diagonal ($O(n^3)$).
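A NumPy sketch of the area-scaling picture, using the fact (not proved here) that determinants are multiplicative, $\det(AB) = \det(A)\det(B)$:

```python
import numpy as np

scale = np.array([[2.0, 0.0],
                  [0.0, 3.0]])   # stretches areas by 2 * 3 = 6
shear = np.array([[1.0, 1.0],
                  [0.0, 1.0]])   # slides space sideways; area unchanged
flip  = np.array([[0.0, 1.0],
                  [1.0, 0.0]])   # reflection: flips orientation

det_scale    = np.linalg.det(scale)          # 6.0
det_shear    = np.linalg.det(shear)          # 1.0
det_composed = np.linalg.det(scale @ shear)  # 6.0: determinants multiply
det_flip     = np.linalg.det(flip)           # -1.0: the sign records the flip
```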
A matrix has an input side and an output side. Each has a natural subspace, and their dimensions are linked by one of the most elegant theorems in mathematics.
The column space of an $m \times n$ matrix $A$, written $\mathrm{Col}(A)$, is the span of its columns — it lives in $\mathbb{R}^m$ and is exactly the set of outputs $Av$ as $v$ ranges over $\mathbb{R}^n$. In other words, $\mathrm{Col}(A)$ is the image or range of the linear map.
The rank of $A$ is the dimension of $\mathrm{Col}(A)$ — the number of linearly independent columns, or equivalently the number of linearly independent rows.
$$ \mathrm{rank}(A) \;\le\; \min(m, n). $$A matrix is full rank if it achieves this bound. "Low-rank" matrices — ones whose rank is much smaller than their dimensions — are compressible: they can be written as a product $A = UV^\top$ with $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$. This is the whole idea behind LoRA and model compression.
The null space (or kernel) of $A$ is
$$ \mathrm{Null}(A) = \{\, v \in \mathbb{R}^n \;:\; Av = 0 \,\}. $$It's the set of inputs that $A$ crushes to zero. The null space captures the ambiguity of $A$ as a function: if $Av_1 = Av_2$, then $v_1 - v_2 \in \mathrm{Null}(A)$.
Every matrix $A \in \mathbb{R}^{m\times n}$ has four associated subspaces: the column space and the left null space (both living in $\mathbb{R}^m$), and the row space and the null space (both living in $\mathbb{R}^n$).
The row space and null space are orthogonal complements in $\mathbb{R}^n$, and so are the column space and left null space in $\mathbb{R}^m$. Gilbert Strang draws these as two pairs of perpendicular planes in his famous "Big Picture" diagram — it's worth looking up.
Most vectors rotate when you apply a matrix. A precious few only stretch. Those are the eigenvectors — and they reveal the matrix's hidden skeleton.
An eigenvector of a square matrix $A$ is a nonzero vector $v$ that $A$ sends to a scalar multiple of itself. The scalar is the eigenvalue $\lambda$:
$$ A v = \lambda v, \qquad v \neq 0. $$Along an eigenvector, $A$ acts as pure scaling by $\lambda$. All the twisting and shearing that $A$ does to other vectors simply doesn't happen along these special directions.
Rearranging $Av = \lambda v$ gives $(A - \lambda I)v = 0$. For a nonzero $v$ to satisfy this, $A - \lambda I$ must be singular — its determinant must vanish. The characteristic polynomial
$$ p(\lambda) = \det(A - \lambda I) $$has the eigenvalues as its roots. For each eigenvalue, the corresponding eigenvectors are the nonzero elements of $\mathrm{Null}(A - \lambda I)$.
Eigenvalues tell you about stability, dominant directions, and long-term behavior. The largest eigenvalue of a transition matrix governs how fast a Markov chain mixes. The leading eigenvector of a graph's adjacency matrix (or its normalized variant) ranks the most "central" nodes — Google's PageRank is an eigenvector computation. The eigenvalues of the Hessian near a minimum tell you the curvature of the loss surface in each direction, which is why condition number (ratio of largest to smallest eigenvalue) shows up whenever people talk about why optimization is hard.
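A NumPy sketch verifying $Av = \lambda v$ on a small symmetric matrix (whose eigenvalues happen to be $3$ and $1$):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # symmetric, so real eigenvalues

eigvals, eigvecs = np.linalg.eig(A)

# Each column of eigvecs is an eigenvector paired with the matching eigenvalue.
lam = eigvals[0]
v = eigvecs[:, 0]
residual = np.linalg.norm(A @ v - lam * v)   # ~0: A only scales v
```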
If you can find enough eigenvectors, you can rewrite a matrix in the simplest possible form: a pure scaling along a tilted set of axes.
Suppose $A$ is $n \times n$ and has $n$ linearly independent eigenvectors $v_1, \ldots, v_n$ with eigenvalues $\lambda_1, \ldots, \lambda_n$. Stack the eigenvectors as columns of a matrix $V$ and the eigenvalues on the diagonal of $\Lambda$:
$$ A = V \Lambda V^{-1}, \qquad \Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n). $$Read the identity right to left: $V^{-1}$ expresses any input in the eigenbasis, $\Lambda$ scales each component by its eigenvalue, and $V$ translates back to the original coordinates. In eigen-coordinates, $A$ is just independent scalings on each axis.
Diagonalization makes $A^k$ almost free:
$$ A^k = V \Lambda^k V^{-1}, \qquad \Lambda^k = \mathrm{diag}(\lambda_1^k, \ldots, \lambda_n^k). $$The same trick extends to any analytic function of a matrix: $\exp(A) = V \exp(\Lambda) V^{-1}$, and so on. This is how you solve linear differential equations by hand and why "iterating a linear system" boils down to studying the eigenvalues.
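Checking the power identity numerically, assuming NumPy; the same symmetric matrix from before has eigenvalues $3$ and $1$, so $A^5$ has entries built from $3^5 = 243$ and $1^5 = 1$:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, V = np.linalg.eig(A)

# A^5 two ways: repeated multiplication vs scaling in the eigenbasis.
direct    = np.linalg.matrix_power(A, 5)
via_eigen = V @ np.diag(eigvals**5) @ np.linalg.inv(V)
```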
Numerical linear algebra is, to a first approximation, the art of factoring a matrix into pieces you can compute with.
Any (sufficiently nice) square matrix $A$ can be written as
$$ A = LU $$with $L$ lower-triangular (ones on the diagonal) and $U$ upper-triangular. This is exactly what Gaussian elimination produces. Once you have $L$ and $U$, solving $Ax = b$ takes two triangular solves — forward substitution for $Ly = b$, then back-substitution for $Ux = y$. In practice algorithms use a permutation for stability: $PA = LU$.
Any $m \times n$ matrix $A$ with independent columns ($m \ge n$) can be written as
$$ A = QR $$with $Q \in \mathbb{R}^{m \times n}$ having orthonormal columns and $R \in \mathbb{R}^{n \times n}$ upper-triangular. Geometrically, $Q$ carries the Gram–Schmidt-orthogonalized columns; $R$ records the coefficients that rebuild the originals from them. QR is the workhorse of least-squares (we'll use it in Section 17) and the starting point for the most common eigenvalue algorithm, QR iteration.
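In NumPy, `np.linalg.qr` returns this "reduced" factorization directly; a quick sketch checking all three properties:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))   # tall matrix, independent columns (w.h.p.)

Q, R = np.linalg.qr(A)            # reduced QR: Q is 5x3, R is 3x3

orthonormal = np.allclose(Q.T @ Q, np.eye(3))   # Q's columns are orthonormal
triangular  = np.allclose(R, np.triu(R))        # R is upper-triangular
rebuilt     = np.allclose(Q @ R, A)             # and Q R rebuilds A
```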
If $A$ is symmetric positive-definite, it factors as
$$ A = L L^\top $$with $L$ lower-triangular (positive diagonal). Cholesky is about twice as fast as LU and more numerically stable. It's the reason covariance-heavy workloads (Gaussian processes, Kalman filters, portfolio optimization) scale as well as they do.
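A NumPy sketch; note that $BB^\top$ plus a multiple of the identity is a convenient way to manufacture a symmetric positive-definite test matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
A = B @ B.T + 4 * np.eye(4)       # symmetric positive-definite by construction

L = np.linalg.cholesky(A)         # lower-triangular factor

lower    = np.allclose(L, np.tril(L))
rebuilt  = np.allclose(L @ L.T, A)
pos_diag = bool(np.all(np.diag(L) > 0))
```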
If diagonalization is the crown jewel of linear algebra, the SVD is the crown. It works for every matrix, rectangular or square, rank-deficient or not, and says: every linear map is a rotate, scale, rotate.
Every real matrix $A \in \mathbb{R}^{m \times n}$ can be factored as
$$ A = U \Sigma V^\top $$where $U \in \mathbb{R}^{m \times m}$ and $V \in \mathbb{R}^{n \times n}$ are orthogonal matrices, and $\Sigma \in \mathbb{R}^{m \times n}$ is diagonal with non-negative entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ called the singular values of $A$.
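A NumPy sketch of the factorization; note that `np.linalg.svd` returns $V$ already transposed, and the singular values as a vector rather than a matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 5x5, s: (3,), Vt: 3x3

descending = bool(np.all(np.diff(s) <= 0))  # sigma_1 >= sigma_2 >= ... >= 0

# Rebuild the rectangular diagonal Sigma and check A = U Sigma V^T.
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(s)
rebuilt = np.allclose(U @ Sigma @ Vt, A)
```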
Truncating the SVD — keeping only the $k$ largest singular values — gives the best rank-$k$ approximation of $A$ in both the Frobenius and spectral norms:
$$ A_k = \sum_{i=1}^{k} \sigma_i\, u_i v_i^\top. $$

Most real-world linear systems have no exact solution — there are more equations than unknowns, and the data is noisy. Least squares is the answer to "what's the closest thing to a solution?"
Given a subspace $W \subseteq \mathbb{R}^n$ and a point $b \in \mathbb{R}^n$, the orthogonal projection of $b$ onto $W$ is the point in $W$ closest to $b$ — the foot of the perpendicular dropped from $b$ to $W$. The residual $b - \mathrm{proj}_W b$ is orthogonal to every vector in $W$.
If the columns of $A$ are a basis for $W$, the projection matrix onto $W$ is
$$ P = A(A^\top A)^{-1} A^\top. $$$P$ is symmetric, $P^2 = P$, and $P b \in W$ is the closest point in $W$ to $b$.
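A sketch of $P$ for a concrete subspace, the $xy$-plane inside $\mathbb{R}^3$, assuming NumPy (inverting $A^\top A$ is fine at this toy scale):

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])       # columns span the xy-plane inside R^3
b = np.array([3.0, 4.0, 5.0])

P = A @ np.linalg.inv(A.T @ A) @ A.T

proj = P @ b                      # closest point in the plane: (3, 4, 0)
residual = b - proj               # the perpendicular part: (0, 0, 5)

idempotent          = np.allclose(P @ P, P)       # projecting twice = once
symmetric           = np.allclose(P, P.T)
orthogonal_residual = np.allclose(A.T @ residual, 0)
```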
We're given a tall matrix $A \in \mathbb{R}^{m \times n}$ (more rows than columns — more equations than unknowns) and a target $b \in \mathbb{R}^m$. There's usually no $x$ with $Ax = b$ exactly. Least squares asks for the $x$ that minimizes the residual:
$$ \hat{x} = \arg\min_{x} \; \| A x - b \|_2^2. $$The minimizer $\hat{x}$ satisfies $A\hat{x} = Pb$ — i.e. $Ax$ is the projection of $b$ onto the column space. Writing this using orthogonality of the residual gives the normal equations:
$$ A^\top A\, \hat{x} = A^\top b. $$When $A$ has independent columns, $A^\top A$ is invertible and
$$ \hat{x} = (A^\top A)^{-1} A^\top b. $$Don't actually invert $A^\top A$ — it squares the condition number, and the normal equations can be numerically awful. Use a QR- or SVD-based least-squares solver instead.
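In NumPy that solver is `np.linalg.lstsq` (SVD-based under the hood). On a well-conditioned random problem it agrees with the normal equations, and the residual comes out orthogonal to the column space, exactly as the theory says:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((20, 3))              # 20 noisy equations, 3 unknowns
b = rng.standard_normal(20)

# Stable route: an SVD-based least-squares solver.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

# Textbook route (fine here, risky when A is ill-conditioned).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Optimality check: the residual is orthogonal to Col(A).
residual = b - A @ x_lstsq
```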
Every neural network is a pile of matrices with nonlinearities between them. Every optimizer is a game played on vectors. The concepts above map directly onto the machinery you've heard of.
Keep these connections in mind as you read the rest of the guide. The abstractions in modern ML papers — attention, low-rank adapters, state-space models, neural tangent kernels — are rarely more than a creative arrangement of the pieces above.
Linear algebra rewards repetition. Read two or three standard treatments of the subject in parallel and it stops looking like lists of identities and starts looking like geometry.
This page is the first chapter of a longer compendium under construction. Up next: multivariable calculus, probability, and optimization — the rest of the mathematical engine room.