Part I · Mathematical Foundations · Chapter 03

Optimization, the art of finding the best.

A trained model is the solution of an optimisation problem. The loss function defines the objective; the parameters are the variables; the constraints (when there are any) carve out the feasible set. This chapter teaches the mathematics of that search — convex sets and convex functions, gradient descent and its descendants, stochastic and adaptive methods, Newton and quasi-Newton, Lagrangians and KKT, duality, proximal methods, and the peculiar non-convex landscape of deep-learning loss surfaces. The goal throughout is to explain why the algorithms that train today's models have the shape they do.

How to read this chapter

Each section builds on the previous one, so read in order the first time through. The prose is the primary explanation; the diagrams exist to make the prose less abstract; the equations exist to make it precise. This chapter assumes the material of chapters 1 (linear algebra) and 2 (calculus): gradients, Hessians, the chain rule, eigenvalues of symmetric matrices, and Taylor's theorem are used without ceremony.

Notation: $f : \mathbb{R}^n \to \mathbb{R}$ is the objective (usually a loss); $\mathbf{x}$ or $\theta$ its variable; $\nabla f$ the gradient, $\nabla^2 f$ the Hessian. Constraint functions are $g_i(\mathbf{x}) \leq 0$ and $h_j(\mathbf{x}) = 0$. Lagrange multipliers are $\lambda$ and $\mu$. The step size is $\eta$ or $\alpha$, and $\|\cdot\|$ without subscript means the Euclidean norm.

Contents

  1. What optimization is, and why ML is made of it · Motivation
  2. The optimization problem, formally · Setup
  3. Convex sets and convex functions · The good case
  4. Gradient descent · The workhorse
  5. Step sizes and line search · Getting there
  6. Momentum and Nesterov acceleration · Second-order for free
  7. Stochastic gradient descent · Batches of data
  8. Adaptive methods: AdaGrad, RMSProp, Adam · Per-coordinate step sizes
  9. Newton and quasi-Newton · Second-order
  10. Constrained optimization · Feasibility
  11. Lagrange multipliers and duality · The dual view
  12. The KKT conditions · Optimality
  13. Linear programming · Corners of polytopes
  14. Quadratic programming and SVMs · A canonical QP
  15. Proximal methods and splitting · Non-smooth regularisation
  16. Non-convex optimization · The deep-learning reality
  17. Convergence rates · How fast is fast?
  18. Where it shows up in ML · Payoff

Section 01

What optimization is, and why ML is made of it

Machine learning is an optimization problem dressed in statistical clothes. A loss function scores how wrong a model is; an optimiser moves the parameters to make the loss smaller. Every technique you will meet in the rest of this compendium is either a choice of loss, a choice of model, or a choice of optimiser.

The abstract problem is simple. Given a function $f : \mathcal{X} \to \mathbb{R}$ defined over some set $\mathcal{X}$, find

$$ \mathbf{x}^* \;=\; \operatorname*{arg\,min}_{\mathbf{x} \in \mathcal{X}}\; f(\mathbf{x}). $$

In ML, $\mathbf{x}$ is the list of model parameters, $f$ is the training loss — cross-entropy, mean-squared error, a reinforcement-learning return, a variational lower bound — and $\mathcal{X}$ is either all of $\mathbb{R}^n$ (unconstrained) or a feasible set defined by equalities and inequalities (constrained). Learning to recognise objects in images, writing like a novelist, playing Go at superhuman level, folding proteins — all of it reduces, at the bottom, to picking a loss and minimising it.

Two questions dominate the theory. When is the problem well-posed? A minimum exists, the optimal value is finite, and an algorithm can reach it. When can we reach it efficiently? Guarantees, convergence rates, step-size rules, and escape from saddle points all live here. The answers depend almost entirely on one property of $f$ — convexity — which gets its own section.

Key idea

There are two worlds of optimisation. In the convex world, local minima are global, gradient descent converges with guarantees, duality is tight, and a 60-year-old body of theory applies. In the non-convex world — deep learning — none of that is guaranteed, and yet in practice it still works. Understanding why requires knowing the convex theory first.

The plan. Sections 02–03 set up the problem and introduce convexity. Sections 04–09 are the algorithmic menagerie: gradient descent, momentum, stochastic and adaptive variants, Newton and quasi-Newton. Sections 10–15 are constrained optimisation — Lagrangians, KKT, duality, LP, QP, proximal methods. Sections 16–17 deal with non-convex landscapes and convergence rates. Section 18 threads the ML connections.

Section 02

The optimization problem, formally

A handful of objects appear in every optimisation problem. Naming them sharpens the intuition and makes the algorithms easier to tell apart.

The standard form of a constrained optimisation problem is

$$ \begin{aligned} \min_{\mathbf{x} \in \mathbb{R}^n} \;\; & f(\mathbf{x}) \\ \text{subject to} \;\; & g_i(\mathbf{x}) \leq 0, \quad i = 1, \ldots, m, \\ & h_j(\mathbf{x}) = 0, \quad j = 1, \ldots, p. \end{aligned} $$

The objective is $f$. The decision variable is $\mathbf{x}$. The feasible set is the collection of $\mathbf{x}$ that satisfy every constraint. A feasible point is any $\mathbf{x}$ in the feasible set; the optimal point $\mathbf{x}^*$ is one that minimises $f$ there; the optimal value is $f^* = f(\mathbf{x}^*)$.

Local vs global

A local minimum is a feasible point with no better feasible neighbour. A global minimum is a feasible point that beats every other feasible point. Most algorithms only find local minima; the reason convexity is so loved is that, for convex problems, every local minimum is global.

Smooth, non-smooth, black-box

Algorithms differ in what they assume about $f$. Smooth methods assume gradients (and sometimes Hessians) exist and are well-behaved; non-smooth methods make do with subgradients or proximal operators (section 15); black-box (derivative-free) methods see only function values — random search, Bayesian optimisation, evolutionary strategies.

For the rest of this chapter, "optimisation" means gradient-based optimisation unless stated otherwise. The ML engines of 2026 are gradient machines, and the derivative-free methods are auxiliaries.

Section 03

Convex sets and convex functions

Convexity is the single most important structural property in optimisation. One definition, a handful of equivalent characterisations, and every hard question about a convex problem has a clean answer.

A set $C \subseteq \mathbb{R}^n$ is convex if, for every pair of points $\mathbf{x}, \mathbf{y} \in C$ and every $t \in [0,1]$,

$$ t \mathbf{x} + (1-t)\mathbf{y} \;\in\; C. $$

Straight lines between points stay in the set. A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its domain is convex and

$$ f\bigl(t \mathbf{x} + (1-t) \mathbf{y}\bigr) \;\leq\; t\, f(\mathbf{x}) \;+\; (1-t)\, f(\mathbf{y}) $$

for all $\mathbf{x}, \mathbf{y}$ and $t \in [0,1]$. Geometrically: the chord lies above the graph. If the inequality is strict whenever $\mathbf{x} \neq \mathbf{y}$ and $0 < t < 1$, $f$ is strictly convex. If $f(\mathbf{x}) - \tfrac{\mu}{2}\|\mathbf{x}\|^2$ is convex for some $\mu > 0$, $f$ is $\mu$-strongly convex.

Left: a convex function $f(x) = (x-2)^2$; any chord lies on or above the graph. Right: a non-convex function; you can find two points whose connecting chord dips below the curve. Convexity is exactly the geometric statement that this cannot happen.

Three equivalent characterisations

For a differentiable $f$, the following are equivalent:

  1. $f$ is convex (chord above graph).
  2. $f(\mathbf{y}) \geq f(\mathbf{x}) + \nabla f(\mathbf{x})^\top (\mathbf{y} - \mathbf{x})$ for all $\mathbf{x}, \mathbf{y}$: the graph lies above every tangent plane.
  3. If $f$ is twice differentiable: $\nabla^2 f(\mathbf{x}) \succeq 0$ (positive semidefinite) for all $\mathbf{x}$.

The second characterisation is the critical one. It says a convex function is always underestimated by its linearisation — which is exactly why first-order methods (gradient descent) that trust the linearisation locally are guaranteed to make progress in the convex case.
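A quick numerical sanity check makes both characterisations concrete. The sketch below (our own illustration, not from the text) verifies the chord inequality and the tangent-plane inequality for log-sum-exp, a standard smooth convex function whose gradient happens to be the softmax:

```python
import numpy as np

def logsumexp(x):
    """log(sum(exp(x))), a standard smooth convex function."""
    m = x.max()                        # subtract the max for numerical stability
    return m + np.log(np.exp(x - m).sum())

def grad_logsumexp(x):
    e = np.exp(x - x.max())
    return e / e.sum()                 # softmax is the gradient of log-sum-exp

rng = np.random.default_rng(0)
x, y = rng.normal(size=5), rng.normal(size=5)

# Chord above graph: f(t x + (1-t) y) <= t f(x) + (1-t) f(y)
for t in np.linspace(0.0, 1.0, 11):
    lhs = logsumexp(t * x + (1 - t) * y)
    rhs = t * logsumexp(x) + (1 - t) * logsumexp(y)
    assert lhs <= rhs + 1e-12

# Graph above tangent: f(y) >= f(x) + grad f(x)^T (y - x)
assert logsumexp(y) >= logsumexp(x) + grad_logsumexp(x) @ (y - x) - 1e-12
```

Swapping in a non-convex function (say $\sin$) makes the chord check fail for some pair of points, which is exactly the geometric distinction in the figure above.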

Why convexity is magic

For a convex problem: (a) every local minimum is a global minimum. (b) The set of global minima is convex. (c) A critical point is automatically optimal. (d) Strong duality typically holds (section 11). (e) There are algorithms with global convergence guarantees and polynomial-time rates. Losing convexity loses all of this — which is the price deep learning pays for expressive models.

Section 04

Gradient descent

Take the current point, compute the gradient, step a little against it, repeat. Gradient descent is the simplest nontrivial optimisation algorithm, and — in one form or another — it is what trains every modern neural network.

The update rule is a single line:

$$ \mathbf{x}_{t+1} \;=\; \mathbf{x}_t \;-\; \eta\, \nabla f(\mathbf{x}_t). $$

The step size (also called the learning rate) $\eta > 0$ controls how far we move each iteration. The direction $-\nabla f(\mathbf{x}_t)$ is the direction of steepest descent — the one in which a tiny step decreases $f$ the fastest, from section 04 of the calculus chapter.

When does it converge?

For a convex, $L$-smooth function (gradient is $L$-Lipschitz, equivalently $\|\nabla^2 f\| \leq L$), picking any constant step size $\eta \leq 1/L$ gives

$$ f(\mathbf{x}_t) - f^* \;\leq\; \frac{L \, \|\mathbf{x}_0 - \mathbf{x}^*\|^2}{2\, t}. $$

Error decays as $O(1/t)$. Adding strong convexity gives exponential convergence: error shrinks by a factor depending on the condition number $\kappa = L/\mu$ each step, so $O\bigl((1 - 1/\kappa)^t\bigr)$. Well-conditioned problems converge in a handful of steps; ill-conditioned ones crawl.
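The rate can be watched directly. A minimal sketch, assuming a two-dimensional quadratic with eigenvalues $1$ and $100$ (so $L = 100$) and the safe constant step size $\eta = 1/L$:

```python
import numpy as np

# Ill-conditioned quadratic f(x) = 0.5 * x^T A x, minimum at the origin.
A = np.diag([1.0, 100.0])
L = 100.0                          # largest eigenvalue = smoothness constant
f = lambda x: 0.5 * x @ A @ x

x = np.array([1.0, 1.0])
x0_dist_sq = x @ x                 # ||x_0 - x*||^2, with x* = 0
losses = [f(x)]
for _ in range(500):
    x = x - (1.0 / L) * (A @ x)    # gradient of f is A x
    losses.append(f(x))

# Monotone decrease, and well inside the O(1/t) bound from the text.
assert all(b <= a for a, b in zip(losses, losses[1:]))
assert losses[-1] <= L * x0_dist_sq / (2 * 500)
assert losses[-1] < 1e-3
```

The direction with eigenvalue $100$ is solved in one step at $\eta = 1/L$; the direction with eigenvalue $1$ contracts by only $0.99$ per step, which is the ill-conditioning story of the next paragraph.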

Why ill-conditioning hurts

On a quadratic $f(\mathbf{x}) = \tfrac{1}{2} \mathbf{x}^\top A \mathbf{x}$ with $A$ positive definite, the level curves of $f$ are ellipses whose axes are the eigenvectors of $A$ and whose aspect ratio is set by the condition number. On a highly elongated ellipse, the gradient points mostly toward the nearest wall of the valley rather than toward the minimum, and the trajectory zig-zags. The largest eigenvalue caps the stable step size; the smallest governs how slowly you creep along the valley floor. This is the picture every optimisation engineer has seared into their retinas.

Gradient descent on a poorly conditioned quadratic. The level sets are highly elongated ellipses; the trajectory zig-zags across the narrow valley instead of taking a direct path to the minimum at the centre. Methods that fight this pattern — momentum, Newton, preconditioning — are what the rest of the chapter is about.
When you are doing this in practice

Picking $\eta$ too large diverges; picking it too small crawls. A rule of thumb: the largest stable $\eta$ on a quadratic is $2/L$ where $L$ is the largest eigenvalue of the Hessian. Adaptive methods (section 08) try to estimate $L$ per coordinate from the gradient itself, so practitioners rarely set $\eta$ by hand any more.

Section 05

Step sizes and line search

The descent direction is easy — it is the negative gradient. How far to move along it is harder. Three strategies — fixed, scheduled, and adaptive — cover practically every case.

Fixed step size

Pick a constant $\eta$ and use it forever. For $L$-smooth convex $f$ with $\eta \leq 1/L$ this converges; for larger $\eta$ it may diverge. Works fine in practice when you know $L$ or can guess it from a few test iterations.

Scheduled / decaying

For stochastic problems, the step size usually decays — $\eta_t = \eta_0 / \sqrt{t}$, $\eta_0 / t$, or a step-function schedule (drop by a factor of 10 at epochs 30, 60, 90). Cosine schedules $\eta_t = \tfrac{\eta_0}{2}(1 + \cos(\pi t/T))$ are the current default for training neural networks: the rate decays slowly at first, cruises through the middle of training, and cools down gently at the end. A cycle of linear warm-up followed by cosine decay has become standard in LLM training.

Line search

Pick the step size adaptively so that the step makes "enough" progress. The Armijo condition (sufficient decrease) insists that

$$ f(\mathbf{x}_t + \eta \mathbf{d}_t) \;\leq\; f(\mathbf{x}_t) \;+\; c_1 \eta\, \nabla f(\mathbf{x}_t)^\top \mathbf{d}_t, $$

for some $c_1 \in (0, 1)$, typically $c_1 = 10^{-4}$. Start with a large $\eta$ and halve it until Armijo is satisfied — the classic backtracking line search. The full Wolfe conditions add a second inequality to prevent step sizes that are too small.
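Backtracking is only a few lines. A sketch, assuming steepest descent on a small quadratic; the function names are ours:

```python
import numpy as np

def backtracking(f, g, x, c1=1e-4, eta0=1.0, shrink=0.5):
    """Halve eta until the Armijo sufficient-decrease condition holds."""
    d = -g(x)                      # steepest-descent direction
    slope = g(x) @ d               # = -||grad f||^2 < 0
    eta = eta0
    while f(x + eta * d) > f(x) + c1 * eta * slope:
        eta *= shrink
    return eta

# Stiff quadratic: f(x) = 0.5 * x^T A x with eigenvalues 1 and 10.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
g = lambda x: A @ x

x = np.array([1.0, 1.0])
history = [f(x)]
for _ in range(100):
    eta = backtracking(f, g, x)
    x = x - eta * g(x)
    history.append(f(x))

assert all(b < a for a, b in zip(history, history[1:]))  # strict decrease every step
assert f(x) < 1e-3
```

Armijo guarantees each accepted step strictly decreases $f$, which is what the monotone `history` assertion checks.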

Line search is the default in deterministic optimisation (SciPy's minimize, L-BFGS) but rare in large-scale ML: evaluating $f$ more than once per step costs extra forward passes, which gets expensive at neural-network scale. The stochastic equivalent — probabilistic line search, or Polyak step size $\eta_t = (f(\mathbf{x}_t) - f^*)/\|\nabla f\|^2$ when $f^*$ is known — is an active research area.

Warmup and why it exists

Transformer training is notoriously sensitive to the first few hundred steps. A small warmup — linear ramp of $\eta$ from $0$ to its nominal value — lets the optimiser build up reliable second-moment estimates before stepping aggressively. Without warmup, Adam's early steps are effectively on a poorly conditioned problem, and training often diverges.
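The warm-up-then-cosine recipe can be sketched in a few lines; `lr_at` and its parameter names are illustrative, not a library API:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to (near) zero."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps      # linear ramp
    # fraction of the decay phase completed, in [0, 1]
    p = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1 + math.cos(math.pi * p))  # cosine cool-down

schedule = [lr_at(s, total_steps=1000, peak_lr=3e-4, warmup_steps=100)
            for s in range(1000)]

assert abs(schedule[99] - 3e-4) < 1e-12           # end of warmup hits the peak
assert schedule[0] < schedule[50] < schedule[99]  # the ramp is increasing
assert schedule[-1] < 1e-5                        # decays to (near) zero
```

Deep-learning frameworks ship equivalents (e.g. cosine-annealing schedulers), but the arithmetic is simple enough that many training codebases just inline it like this.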

Section 06

Momentum and Nesterov acceleration

Adding a memory of past steps turns gradient descent into a method that can cross poorly conditioned valleys in $O(\sqrt{\kappa})$ iterations rather than $O(\kappa)$. This is not a minor tweak — it is the optimal rate for first-order methods on convex functions.

Heavy-ball (Polyak) momentum

Maintain a velocity vector $\mathbf{v}_t$ and add it to the step:

$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta\, \mathbf{v}_t \;-\; \eta\, \nabla f(\mathbf{x}_t), \\ \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{v}_{t+1}. \end{aligned} $$

The coefficient $\beta \in [0, 1)$, typically $0.9$, is the momentum coefficient. The physical picture is a ball rolling down the loss landscape with mass. In narrow valleys, the oscillating components of consecutive gradients cancel in the velocity, while the consistent directions accumulate — so the trajectory ends up travelling along the valley rather than bouncing across it.

Nesterov's accelerated gradient

A small but crucial tweak: evaluate the gradient at a look-ahead point $\mathbf{x}_t + \beta \mathbf{v}_t$ rather than at $\mathbf{x}_t$ itself:

$$ \begin{aligned} \mathbf{v}_{t+1} &= \beta\, \mathbf{v}_t \;-\; \eta\, \nabla f(\mathbf{x}_t + \beta \mathbf{v}_t), \\ \mathbf{x}_{t+1} &= \mathbf{x}_t + \mathbf{v}_{t+1}. \end{aligned} $$

The modification looks small but Nesterov proved in 1983 that it achieves the theoretically optimal $O(1/t^2)$ convergence rate on smooth convex functions — twice the exponent of plain gradient descent. In the strongly-convex case, the iteration count drops from $O(\kappa)$ to $O(\sqrt{\kappa})$. On a condition number of $10^4$, that is 100× fewer iterations.
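The valley-crossing claim is easy to observe. A sketch, assuming the heavy-ball update from above on a quadratic with condition number $100$, comparing $\beta = 0$ (plain gradient descent) against the typical $\beta = 0.9$:

```python
import numpy as np

A = np.diag([1.0, 100.0])             # condition number kappa = 100
f = lambda x: 0.5 * x @ A @ x

def run(beta, steps=200, eta=0.01):   # eta = 1/L with L = 100
    x = np.array([1.0, 1.0])
    v = np.zeros(2)
    for _ in range(steps):
        v = beta * v - eta * (A @ x)  # consistent directions accumulate in v
        x = x + v
    return f(x)

plain = run(beta=0.0)                 # plain gradient descent
heavy = run(beta=0.9)                 # heavy-ball momentum

assert heavy < plain                  # momentum wins on the ill-conditioned valley
assert heavy < 1e-4
```

After 200 steps, plain gradient descent is still creeping along the shallow eigendirection while the momentum run is orders of magnitude closer to the minimum.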

Momentum as a second-order ODE

In the continuous limit $\eta \to 0$, momentum gradient descent becomes the second-order ODE

$$ \ddot{\mathbf{x}}(t) \;+\; \gamma(t)\, \dot{\mathbf{x}}(t) \;+\; \nabla f(\mathbf{x}(t)) \;=\; 0 $$

— a damped oscillator rolling in the potential $f$. Choice of damping $\gamma(t)$ is the difference between heavy-ball, Nesterov, and Su–Boyd–Candès acceleration. The modern understanding of "why momentum works" is that this ODE has better dynamics than the gradient flow $\dot{\mathbf{x}} = -\nabla f$, and the discretisation inherits them.

What SGD with momentum looks like in PyTorch

torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True). The classic setting that trained nearly every ConvNet published between 2015 and 2020. Even in the Adam era, many vision results are still best with Nesterov momentum and a cosine schedule.

Section 07

Stochastic gradient descent

When the loss is an average over millions of examples, computing the full gradient is prohibitive. Replace it with a cheap unbiased estimate — a mini-batch gradient — and you have stochastic gradient descent, the algorithm that actually trains nearly everything.

The training loss in ML is typically an empirical average,

$$ f(\theta) \;=\; \frac{1}{N}\sum_{i=1}^{N} \ell(\theta; \mathbf{z}_i), $$

where $\mathbf{z}_i$ is the $i$-th training example. Each step of full-batch gradient descent requires summing $N$ gradients — infeasible when $N$ is in the millions or billions. Stochastic gradient descent (SGD) replaces $\nabla f(\theta)$ by an unbiased estimate based on a random subset $\mathcal{B}$ of size $B \ll N$:

$$ \theta_{t+1} \;=\; \theta_t \;-\; \eta\, \underbrace{\frac{1}{B}\sum_{i \in \mathcal{B}_t} \nabla \ell(\theta_t; \mathbf{z}_i)}_{\hat{\mathbf{g}}_t}. $$

$\mathbb{E}[\hat{\mathbf{g}}_t] = \nabla f(\theta_t)$, but $\hat{\mathbf{g}}_t$ itself is noisy. Convergence looks different: even on convex $f$, SGD with a constant step size does not converge — it orbits the minimum in a ball whose radius depends on the gradient noise. To actually converge, the step size must decay: $\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$ — the Robbins–Monro conditions.
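The whole loop fits in a dozen lines. A sketch on synthetic least squares, with a Robbins–Monro-style $\eta_t \propto 1/\sqrt{t}$ decay (the data and constants are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
N, n = 1000, 5
X = rng.normal(size=(N, n))
theta_true = rng.normal(size=n)
y = X @ theta_true + 0.1 * rng.normal(size=N)     # noisy linear data

def loss(theta):
    return 0.5 * np.mean((X @ theta - y) ** 2)

theta = np.zeros(n)
B = 32                                            # mini-batch size
for t in range(2000):
    idx = rng.integers(0, N, size=B)              # sample a mini-batch
    g = X[idx].T @ (X[idx] @ theta - y[idx]) / B  # unbiased gradient estimate
    eta = 0.1 / np.sqrt(1 + t)                    # decaying step size
    theta = theta - eta * g

# Ends near the noise floor, close to the generating parameters.
assert loss(theta) < 2 * loss(theta_true)
assert np.linalg.norm(theta - theta_true) < 0.3
```

With a constant step size the same loop would orbit the minimum instead of settling, which is the behaviour the Robbins–Monro conditions rule out.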

Batch size, step size, and noise

The variance of $\hat{\mathbf{g}}_t$ scales like $1/B$. Larger batches give cleaner gradients and enable larger step sizes; smaller batches give noisier gradients and act as an implicit regulariser. The much-repeated practical rule — "linearly scale the learning rate with the batch size" — formalises this: doubling $B$ halves the noise variance, allowing $\eta$ to double while keeping the signal-to-noise ratio roughly constant.

Noise as a regulariser

SGD noise does something deterministic descent cannot reliably do: it helps you escape saddle points. A saddle has negative curvature in some direction; gradient noise pushes you off it almost surely. This is part of why small-batch SGD trains deep networks successfully where deterministic methods stall, and why large-batch training often hurts generalisation even when it reaches a lower training loss.

Key idea

SGD is not "gradient descent with noise". It is a genuinely different algorithm whose noise is a feature, not a bug. The recent theoretical picture (Mandt–Hoffman–Blei, Chaudhari–Soatto, Smith) is that SGD implicitly solves a stochastic differential equation whose stationary distribution concentrates near flat minima — which empirically generalise better than sharp ones.

Section 08

Adaptive methods: AdaGrad, RMSProp, Adam

Different parameters have different curvature, and a single learning rate fits poorly. Adaptive methods estimate per-coordinate step sizes from the gradient history — cheap, diagonal approximations to the Hessian that dramatically simplify tuning.

AdaGrad

Accumulate squared gradients per coordinate and scale the step by their reciprocal square root:

$$ G_t = G_{t-1} + \mathbf{g}_t \odot \mathbf{g}_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \varepsilon} \odot \mathbf{g}_t. $$

Coordinates with historically large gradients get small steps, and vice versa. Excellent for sparse problems (NLP bag-of-words, recommendation) where some features are updated rarely but matter when they are. Its flaw is that $G_t$ only grows, so effective step sizes decay monotonically and the optimiser eventually stalls.

RMSProp

Tieleman's unpublished fix: use an exponential moving average instead of a cumulative sum,

$$ v_t = \beta_2\, v_{t-1} + (1 - \beta_2)\, \mathbf{g}_t \odot \mathbf{g}_t. $$

The moving average "forgets" old gradients, so effective step sizes track the recent curvature rather than the all-time-maximum. Hinton's Coursera lecture slides are the original source; it became the default for recurrent neural networks around 2014.

Adam

Combine RMSProp with momentum and bias-correction,

$$ \begin{aligned} \mathbf{m}_t &= \beta_1 \mathbf{m}_{t-1} + (1 - \beta_1) \mathbf{g}_t, & \hat{\mathbf{m}}_t &= \mathbf{m}_t / (1 - \beta_1^t), \\ \mathbf{v}_t &= \beta_2 \mathbf{v}_{t-1} + (1 - \beta_2) \mathbf{g}_t \odot \mathbf{g}_t, & \hat{\mathbf{v}}_t &= \mathbf{v}_t / (1 - \beta_2^t), \\ \theta_{t+1} &= \theta_t - \eta \, \hat{\mathbf{m}}_t / (\sqrt{\hat{\mathbf{v}}_t} + \varepsilon). & \end{aligned} $$

The defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\varepsilon = 10^{-8}$ are used almost universally. AdamW (Loshchilov & Hutter) fixes a subtle bug in weight decay and is now the default for Transformer training; more recent variants (Lion, Sophia, Muon) iterate on the same template — an element-wise rescaling of a momentum vector — with different estimators or second-order signals.
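The update equations above translate line for line into code. A from-scratch sketch (not the PyTorch implementation), run on a badly scaled quadratic where per-coordinate rescaling is exactly what is needed:

```python
import numpy as np

def adam_step(theta, g, m, v, t, eta=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update (t is 1-based); returns new (theta, m, v)."""
    m = b1 * m + (1 - b1) * g                 # first-moment EMA
    v = b2 * v + (1 - b2) * g * g             # second-moment EMA
    m_hat = m / (1 - b1 ** t)                 # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# f(theta) = 0.5 * sum(a_i * theta_i^2) with a 1000x curvature gap.
a = np.array([1.0, 1000.0])
theta = np.array([1.0, 1.0])
m, v = np.zeros(2), np.zeros(2)
for t in range(1, 3001):
    g = a * theta                             # gradient of the quadratic
    theta, m, v = adam_step(theta, g, m, v, t, eta=0.01)

# Both coordinates end up near zero despite the 1000x curvature gap.
assert np.all(np.abs(theta) < 0.1)
```

Plain gradient descent on this problem would need $\eta \leq 2/1000$ for stability and would then crawl along the flat coordinate; Adam's per-coordinate normalisation moves both at roughly the same pace.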

When to reach for what

AdamW is the default for Transformers and most LLM training. SGD with Nesterov momentum is still the default for many convolutional image models and often gives slightly better generalisation. LAMB and similar layer-wise adaptive methods are required for very large batch sizes ($>$ 16k). Adafactor is the memory-efficient alternative when model state dominates GPU memory.

Section 09

Newton and quasi-Newton methods

First-order methods use the gradient; second-order methods use the Hessian. The ideal of Newton's method — one step to the minimum of a quadratic — is unattainable at the scale of modern neural networks, but the idea survives in quasi-Newton approximations that dominate smaller problems.

Newton's method

Minimise the second-order Taylor approximation at every step:

$$ \theta_{t+1} \;=\; \theta_t \;-\; [\nabla^2 f(\theta_t)]^{-1}\, \nabla f(\theta_t). $$

On a strictly convex quadratic, this lands on the minimum in a single step. Near a minimum of a general smooth $f$, convergence is quadratic: the number of correct digits doubles each iteration. Outstanding — but the price is materialising and inverting the $n \times n$ Hessian every iteration at $O(n^3)$ cost, which is ruinous for $n \gtrsim 10^4$. For modern neural networks with $n \sim 10^{10}$, it is physically impossible.

Gauss–Newton and Levenberg–Marquardt

For least-squares losses $f(\theta) = \tfrac{1}{2}\|\mathbf{r}(\theta)\|^2$, the Gauss–Newton approximation drops the second-derivative term in the Hessian and keeps only $J^\top J$ where $J = \partial \mathbf{r}/\partial \theta$. The result is always positive semi-definite, which fixes Newton's instability on non-convex regions. Levenberg–Marquardt adds a damping term $\lambda I$ that interpolates between Newton (small $\lambda$) and gradient descent (large $\lambda$). This is the standard optimiser for non-linear least squares — SLAM, bundle adjustment, physics regressions.

Quasi-Newton: BFGS and L-BFGS

BFGS maintains an approximation to the inverse Hessian, updated from gradient differences at each step — a rank-two update that never explicitly forms $\nabla^2 f$. Convergence is superlinear without the $O(n^3)$ inversion. L-BFGS — "limited memory" — stores only the last $m$ (typically $5$–$20$) pairs of gradient and parameter differences, giving a low-memory algorithm that is the workhorse of mid-scale optimisation: SciPy's minimize, classical NLP logistic-regression code, fine-tuning small neural networks.
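SciPy's `minimize` exposes L-BFGS directly. A sketch on the Rosenbrock function, a classic curved-valley test problem (the starting point is the conventional one):

```python
import numpy as np
from scipy.optimize import minimize

def rosen(x):
    """Rosenbrock function; global minimum 0 at (1, 1)."""
    return 100.0 * (x[1] - x[0] ** 2) ** 2 + (1 - x[0]) ** 2

def rosen_grad(x):
    return np.array([
        -400.0 * x[0] * (x[1] - x[0] ** 2) - 2.0 * (1 - x[0]),
        200.0 * (x[1] - x[0] ** 2),
    ])

res = minimize(rosen, x0=np.array([-1.2, 1.0]),
               jac=rosen_grad, method="L-BFGS-B")

assert res.success
assert np.allclose(res.x, [1.0, 1.0], atol=1e-4)
```

Gradient descent takes thousands of iterations to thread Rosenbrock's banana-shaped valley; L-BFGS's curvature estimate gets there in a few dozen.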

Second-order in deep learning

Full Newton is impossible; L-BFGS scales badly with stochastic noise. The modern attempts — K-FAC (Martens & Grosse), Shampoo (Gupta et al.), natural-gradient methods, and Sophia (Liu et al.) — all use some structured approximation of the Hessian or its spectral cousin the Fisher information matrix. The structure is usually block-diagonal by layer, or Kronecker-factored within a layer. These methods can beat AdamW on training speed for large language models; whether they generalise as well is an open empirical question.

Section 10

Constrained optimization

When the feasible set is a proper subset of $\mathbb{R}^n$, the geometry changes. At an optimum, the gradient no longer vanishes — it points outward, balanced by the constraints pushing inward.

A convex constrained optimisation problem has the standard form

$$ \min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{s.t.} \quad g_i(\mathbf{x}) \leq 0, \;\; h_j(\mathbf{x}) = 0, $$

with $f$ and every $g_i$ convex and every $h_j$ affine. The feasible set $\mathcal{F}$ is then convex. Even within this friendly setting, three kinds of optimum can occur: interior points (no constraint active), boundary points (some $g_i(\mathbf{x}^*) = 0$ or $h_j(\mathbf{x}^*) = 0$), and vertices of polyhedra (multiple constraints active). The active set is the collection of constraints satisfied with equality at $\mathbf{x}^*$.

Projection

When a constraint is simple — "stay in a ball", "stay non-negative", "stay on the simplex" — the cleanest algorithm is projected gradient descent:

$$ \mathbf{x}_{t+1} \;=\; \Pi_{\mathcal{F}}\!\bigl( \mathbf{x}_t - \eta\, \nabla f(\mathbf{x}_t) \bigr), $$

where $\Pi_{\mathcal{F}}$ is orthogonal projection onto $\mathcal{F}$. Each step takes a gradient step, then pushes back into the feasible set. Works cleanly for Euclidean balls, boxes, simplices, and half-spaces; less cleanly (often intractably) for general convex sets.
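For the unit ball the projection is just a rescaling, and the whole algorithm is a sketch away. Here we minimise $\|\mathbf{x} - \mathbf{c}\|^2$ with $\mathbf{c}$ outside the ball; the constrained optimum is $\mathbf{c}/\|\mathbf{c}\|$ on the boundary (the example is ours):

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection onto {x : ||x|| <= radius}."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else x * (radius / nrm)

c = np.array([3.0, 4.0])              # target outside the unit ball (||c|| = 5)
grad = lambda x: 2 * (x - c)          # gradient of ||x - c||^2

x = np.zeros(2)
for _ in range(200):
    x = project_ball(x - 0.1 * grad(x))   # gradient step, then project back

assert abs(np.linalg.norm(x) - 1.0) < 1e-8            # lands on the boundary
assert np.allclose(x, c / np.linalg.norm(c), atol=1e-6)  # = (0.6, 0.8)
```

Swapping `project_ball` for clipping (a box) or a simplex projection gives the other workhorse cases; for sets without a cheap projection, the barrier methods below take over.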

Barrier methods

Replace the hard constraint $g_i(\mathbf{x}) \leq 0$ with a soft penalty that explodes as $g_i \to 0$: $-\tau \log(-g_i(\mathbf{x}))$. As the barrier parameter $\tau \to 0$, the minimum of $f + \sum_i -\tau \log(-g_i)$ approaches the constrained minimum of $f$. This is the idea behind interior-point methods — polynomial-time algorithms for linear and quadratic programming that revolutionised optimisation in the 1990s and are still the standard for LP/QP solvers today (Gurobi, Mosek).

Why constraints are rare in deep learning

Neural network parameters are typically unconstrained — weights can be anything. When constraints do appear (orthogonality in normalising flows, probability simplex in attention, unit norm in embeddings), they are often enforced implicitly through parameterisation: express the constrained object as the image of an unconstrained one under a differentiable map. Stiefel manifolds, Cayley transforms, softmax, and Householder reflections all do this.

Section 11

Lagrange multipliers and duality

Constrained optimisation has a twin problem — its dual — obtained by moving the constraints into the objective with penalty weights. For convex problems the twins have the same optimum, and the dual is often far easier to solve than the primal.

For the problem of section 10, define the Lagrangian

$$ \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}) \;=\; f(\mathbf{x}) \;+\; \sum_{i} \lambda_i\, g_i(\mathbf{x}) \;+\; \sum_{j} \mu_j\, h_j(\mathbf{x}), $$

with $\lambda_i \geq 0$ (multipliers for inequality constraints) and $\mu_j \in \mathbb{R}$ (equality). The dual function is the infimum over $\mathbf{x}$,

$$ d(\boldsymbol{\lambda}, \boldsymbol{\mu}) \;=\; \inf_{\mathbf{x}} \mathcal{L}(\mathbf{x}, \boldsymbol{\lambda}, \boldsymbol{\mu}). $$

Always concave (an infimum of affine functions). The dual problem is $\max_{\boldsymbol{\lambda} \geq 0, \boldsymbol{\mu}} d(\boldsymbol{\lambda}, \boldsymbol{\mu})$.

Weak and strong duality

Weak duality always holds: $d^* \leq f^*$. The gap $f^* - d^*$ is called the duality gap. For convex problems satisfying Slater's condition (a strictly feasible interior point exists), strong duality holds: $d^* = f^*$. In that case solving the dual solves the primal, and the multipliers at the optimum tell you exactly which constraints are binding and by how much.
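A tiny worked dual makes this concrete. Take $\min x^2$ subject to $x \geq 1$ (so $g(x) = 1 - x \leq 0$, primal optimum $f^* = 1$ at $x = 1$). The dual function evaluates in closed form, and a grid search over $\lambda$ exhibits weak duality and a zero gap; the example is ours:

```python
import numpy as np

# L(x, lam) = x^2 + lam * (1 - x); inf over x is at x = lam / 2, giving
# d(lam) = lam - lam^2 / 4 -- concave, as the infimum of affine functions.
d = lambda lam: lam - lam ** 2 / 4

lams = np.linspace(0, 4, 4001)
d_star = d(lams).max()
lam_star = lams[d(lams).argmax()]
f_star = 1.0

assert np.all(d(lams) <= f_star + 1e-12)  # weak duality on the whole grid
assert abs(d_star - f_star) < 1e-6        # zero gap: Slater holds (x = 2 is strictly feasible)
assert abs(lam_star - 2.0) < 1e-3         # optimal multiplier
```

The optimal multiplier $\lambda^* = 2$ is also the shadow price of the next paragraph: relaxing the constraint to $x \geq 1 - \epsilon$ lowers $f^*$ by about $2\epsilon$.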

At a constrained optimum $\mathbf{x}^*$ on the boundary $g(\mathbf{x}) = 0$, the gradient of the objective $\nabla f$ and the gradient of the constraint $\nabla g$ are anti-parallel along the normal to the boundary: $\nabla f(\mathbf{x}^*) = -\lambda\, \nabla g(\mathbf{x}^*)$ for some $\lambda \geq 0$. This proportionality is the Lagrange multiplier condition.

The economic interpretation

$\lambda_i$ has a physical meaning at the optimum: it is the shadow price of constraint $i$ — how much the optimal value $f^*$ would change if that constraint were relaxed by a unit. A zero multiplier means the constraint is inactive (and relaxing it costs nothing); a large multiplier means the constraint is binding tightly. Linear programming duality (section 13) turns this into an economic theorem: primal solutions allocate resources; dual solutions price them.

Section 12

The KKT conditions

The Karush–Kuhn–Tucker conditions are the multivariable, multi-constraint, multi-multiplier generalisation of "set the derivative to zero." They are necessary conditions for optimality in constrained problems and sufficient for convex ones.

Let $\mathbf{x}^*$ be a local optimum of the constrained problem of section 10, satisfying a mild regularity condition (linear independence of the active constraint gradients). Then there exist multipliers $\boldsymbol{\lambda}^* \geq 0$ and $\boldsymbol{\mu}^*$ with:

  1. Stationarity. $\displaystyle \nabla f(\mathbf{x}^*) + \sum_i \lambda_i^* \nabla g_i(\mathbf{x}^*) + \sum_j \mu_j^* \nabla h_j(\mathbf{x}^*) = \mathbf{0}$.
  2. Primal feasibility. $g_i(\mathbf{x}^*) \leq 0$ and $h_j(\mathbf{x}^*) = 0$.
  3. Dual feasibility. $\lambda_i^* \geq 0$.
  4. Complementary slackness. $\lambda_i^*\, g_i(\mathbf{x}^*) = 0$ for every $i$.

Reading the conditions

The stationarity condition generalises $\nabla f = \mathbf{0}$: the gradient of the objective is balanced by a non-negative combination of the active constraints' gradients, pushing you against the walls of the feasible region. Complementary slackness is the subtle one: it says that for every constraint, either the constraint is tight ($g_i = 0$) or its multiplier is zero ($\lambda_i = 0$). In short — inactive constraints have zero multipliers. This is exactly the combinatorial structure of the support vectors in an SVM (section 14).

When KKT is sufficient

For a convex problem satisfying Slater's condition, any $(\mathbf{x}^*, \boldsymbol{\lambda}^*, \boldsymbol{\mu}^*)$ triple that satisfies the KKT conditions is optimal. The conditions become an equation system — nonlinear in general, but with convex structure — and solving them is solving the problem. Interior-point methods for LP and QP, active-set methods, and SMO for SVMs all reduce to solving a sequence of KKT systems.

A small example

Minimise $\tfrac{1}{2}(x^2 + y^2)$ subject to $x + y \geq 1$. Stationarity: $(x, y) = \mu (1,1)$ for some $\mu \geq 0$. Primal feasibility: $x + y \geq 1$. Complementary slackness: $\mu(1 - x - y) = 0$. Setting $x + y = 1$ (active) gives $x = y = 1/2$ and $\mu = 1/2$. The answer is the point on the line closest to the origin — exactly what you would draw with a picture, formalised by KKT.
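The same example can be handed to a numerical solver as a sanity check. A sketch using SciPy's SLSQP, whose `'ineq'` convention is $\text{fun}(\mathbf{x}) \geq 0$, so the constraint $x + y \geq 1$ is written as $x + y - 1 \geq 0$:

```python
import numpy as np
from scipy.optimize import minimize

f = lambda z: 0.5 * (z[0] ** 2 + z[1] ** 2)
con = {"type": "ineq", "fun": lambda z: z[0] + z[1] - 1}  # x + y - 1 >= 0

res = minimize(f, x0=np.array([2.0, 0.0]),
               constraints=[con], method="SLSQP")

assert res.success
assert np.allclose(res.x, [0.5, 0.5], atol=1e-4)      # matches the KKT solution
assert abs(res.x[0] + res.x[1] - 1.0) < 1e-6          # constraint active at the optimum
```

The active constraint at the solution is complementary slackness in action: the inequality is tight, so its multiplier is allowed to be nonzero (and the hand calculation gives $\mu = 1/2$).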

Section 13

Linear programming

When both the objective and the constraints are linear, you get a linear program. The feasible set is a polytope; the optimum, if finite, lives at a vertex. LP is the cradle of modern optimisation, and it powers more industrial decision-making than any other algorithm.

The standard form of a linear program is

$$ \min_{\mathbf{x}} \; \mathbf{c}^\top \mathbf{x} \quad \text{s.t.} \quad A\mathbf{x} = \mathbf{b}, \;\; \mathbf{x} \geq \mathbf{0}. $$

The feasible set — a set of linear equalities and non-negativities — is a convex polytope. A classical theorem says: if the LP has a finite optimal value, there is an optimal solution at a vertex of the polytope. This is what makes LPs combinatorial: there are finitely many vertices, and you need only search through them.

The simplex method

Dantzig's 1947 algorithm. Start at any feasible vertex; move along an edge to an adjacent vertex that improves the objective; repeat until no improving move exists. On practical problems it usually terminates in a number of iterations close to linear in the size of the LP, but the worst-case complexity is exponential. In practice it dominates for LPs with a modest number of constraints.

Interior-point methods

Karmarkar's 1984 algorithm and its successors follow a central path through the interior of the feasible polytope rather than its vertices. Each iteration is more expensive, but the number of iterations is bounded polynomially — and typically only $20$–$50$ for problems simplex might take millions of iterations to solve. Interior-point methods are the standard in modern LP solvers for very large problems.

LP duality

The dual of the primal LP above is

$$ \max_{\mathbf{y}} \; \mathbf{b}^\top \mathbf{y} \quad \text{s.t.} \quad A^\top \mathbf{y} \leq \mathbf{c}. $$

Strong duality is automatic for LPs — no Slater-type condition is needed; whenever either problem has a finite optimal value, the primal and dual optimal values coincide. The dual variables are shadow prices on the constraints: raising a capacity $b_i$ by one unit changes the optimal cost by $y_i^*$. This is the economic interpretation that von Neumann sketched to Dantzig in 1947.
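Strong duality is easy to witness numerically. A sketch using SciPy's `linprog` (assuming SciPy is available); the instance is a hypothetical two-constraint LP chosen so both problems have obvious solutions:

```python
import numpy as np
from scipy.optimize import linprog

# A small LP chosen for illustration.
A = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
b = np.array([2.0, 1.0])
c = np.array([1.0, 2.0, 3.0])

# Primal:  min c.x  s.t.  Ax = b, x >= 0.
primal = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)

# Dual:  max b.y  s.t.  A^T y <= c  (linprog minimises, so negate b).
dual = linprog(-b, A_ub=A.T, b_ub=c, bounds=[(None, None)] * 2)

print(primal.fun, -dual.fun)  # both 3.0: primal and dual optima coincide
```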

LP in ML

Direct LP training is rare in deep learning, but LPs appear everywhere nearby: optimal transport (the Earth Mover's Distance), assignment problems in multi-target tracking and Hungarian matching, certified defences against adversarial examples, and the LP relaxation of integer programs in combinatorial optimisation.

Section 14

Quadratic programming and SVMs

A quadratic objective with linear constraints — a convex QP — has exactly the structure of a support vector machine. The SVM is the archetypal applied QP, and walking through it makes Lagrangian duality feel concrete.

A standard convex quadratic program is

$$ \min_{\mathbf{x}} \; \tfrac{1}{2} \mathbf{x}^\top P \mathbf{x} + \mathbf{q}^\top \mathbf{x} \quad \text{s.t.} \quad A \mathbf{x} \leq \mathbf{b}, \;\; C \mathbf{x} = \mathbf{d}, $$

with $P \succeq 0$. The KKT conditions give a linear system plus complementarity — harder than LP, but still polynomial-time solvable. Modern QP solvers (OSQP, ECOS, Mosek) routinely handle problems with millions of variables.
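With equality constraints only, the KKT conditions really are just one linear system: $P\mathbf{x} + \mathbf{q} + C^\top \boldsymbol{\nu} = \mathbf{0}$ and $C\mathbf{x} = \mathbf{d}$. A NumPy sketch, on the projection problem from the KKT example earlier in the chapter:

```python
import numpy as np

# Equality-constrained QP:  min 0.5 x^T P x + q^T x  s.t.  C x = d.
P = np.eye(2)                    # objective 0.5 * ||x||^2
q = np.zeros(2)
C = np.array([[1.0, 1.0]])       # constraint x1 + x2 = 1
d = np.array([1.0])

# Stack the KKT conditions into one symmetric linear system and solve it.
n, m = P.shape[0], C.shape[0]
KKT = np.block([[P, C.T],
                [C, np.zeros((m, m))]])
rhs = np.concatenate([-q, d])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]
print(x)  # [0.5 0.5]: the closest point on the line to the origin
```

Inequality constraints break this one-shot structure — the complementarity conditions decide which constraints are active — which is exactly the combinatorial part that QP solvers manage.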

The linear SVM

Given labelled data $\{(\mathbf{x}_i, y_i)\}$ with $y_i \in \{-1, +1\}$, the hard-margin SVM primal is

$$ \min_{\mathbf{w}, b} \; \tfrac{1}{2} \|\mathbf{w}\|^2 \quad \text{s.t.} \quad y_i (\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \text{ for all } i. $$

Minimise the squared norm of the weights subject to each training point lying on the correct side of a margin. This is a convex QP — $P = I$ on the $\mathbf{w}$ block (the bias $b$ is unpenalised) with linear inequality constraints. Its dual, derived by the Lagrangian machinery of section 11, is

$$ \max_{\boldsymbol{\alpha} \geq 0} \; \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j \mathbf{x}_i^\top \mathbf{x}_j \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0. $$

Three things happen here that deserve a careful look. First, only inner products $\mathbf{x}_i^\top \mathbf{x}_j$ appear in the dual — replace them with a kernel $K(\mathbf{x}_i, \mathbf{x}_j)$ and you have the kernel trick. Second, complementary slackness says $\alpha_i > 0$ only for points on the margin — the support vectors. Most $\alpha_i$ are zero; the classifier depends on a sparse subset of the training data. Third, the dual is of dimension $n$ (number of examples), not $d$ (number of features), which is what made SVMs practical for very high-dimensional features before deep learning.
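All three effects are visible on a three-point toy set. A sketch that hands the dual to a general-purpose solver (assuming SciPy; production SVM libraries use specialised algorithms like SMO instead): two points sit on the margin, the third is an easy point that should receive $\alpha = 0$.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: two support vectors on the margin, one easy point off it.
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0]])
y = np.array([1.0, 1.0, -1.0])

Q = np.outer(y, y) * (X @ X.T)           # Q_ij = y_i y_j <x_i, x_j>

def neg_dual(a):                          # minimise the negated dual objective
    return 0.5 * a @ Q @ a - a.sum()

res = minimize(neg_dual, x0=np.full(3, 0.1),
               bounds=[(0, None)] * 3,
               constraints={"type": "eq", "fun": lambda a: a @ y})
alpha = res.x
w = (alpha * y) @ X                       # recover the primal weights

print(np.round(alpha, 3))   # the off-margin point gets alpha ~ 0 (sparsity)
print(np.round(w, 3))       # ~ [0.5 0.5]
```

Replacing `X @ X.T` with any kernel Gram matrix changes nothing else in the code — that is the kernel trick in one line.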

Soft-margin and SMO

Real data are not linearly separable. The soft-margin SVM adds slack variables and a penalty, giving the QP that scikit-learn's SVC actually solves. The SMO (Sequential Minimal Optimisation) algorithm of John Platt decomposes it into a sequence of tiny 2-variable QP subproblems, each solvable in closed form — the algorithmic breakthrough that made SVM training scale to millions of examples in the 1990s.

Section 15

Proximal methods and splitting

Many ML losses split naturally into a smooth part and a non-smooth regulariser: $\ell_2$ data fit plus $\ell_1$ regularisation, log-likelihood plus total-variation denoising. Proximal methods take a gradient step on the smooth part and a shrinkage step on the non-smooth part — a clean, modular decomposition that matches the structure of the problem.

Consider the composite problem

$$ \min_{\mathbf{x}} \; f(\mathbf{x}) + g(\mathbf{x}), $$

where $f$ is smooth and convex and $g$ is convex but possibly non-smooth — an $\ell_1$ norm, an indicator of a convex set, a total-variation seminorm. The proximal operator of $g$ with step size $\eta$ is

$$ \operatorname{prox}_{\eta g}(\mathbf{v}) \;=\; \arg\min_{\mathbf{x}} \; g(\mathbf{x}) + \frac{1}{2\eta}\| \mathbf{x} - \mathbf{v} \|^2. $$

The prox is "the closest point to $\mathbf{v}$ that also keeps $g$ small." For the indicator of a set, it is orthogonal projection. For $g = \lambda\|\mathbf{x}\|_1$, it is element-wise soft thresholding: $\operatorname{prox}_{\eta \lambda \|\cdot\|_1}(v) = \operatorname{sign}(v) \max(|v| - \eta \lambda, 0)$. For a quadratic, it has a closed-form linear solution.

Proximal gradient descent

Alternate a gradient step on $f$ with a prox step on $g$:

$$ \mathbf{x}_{t+1} \;=\; \operatorname{prox}_{\eta g}\!\bigl( \mathbf{x}_t - \eta\, \nabla f(\mathbf{x}_t) \bigr). $$

When $g = \lambda\|\cdot\|_1$, this is ISTA (Iterative Soft-Thresholding Algorithm); accelerated with Nesterov momentum, it becomes FISTA, which has $O(1/t^2)$ convergence. They are the default solvers for the Lasso, sparse logistic regression, and a huge family of structured-regularisation problems.
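ISTA is short enough to write out in full. A NumPy sketch on a toy Lasso with $A = I$, so the answer is known in closed form — soft-thresholding of $\mathbf{b}$:

```python
import numpy as np

def soft_threshold(v, tau):
    """Prox of tau * ||.||_1: element-wise shrinkage toward zero."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, b, lam, eta, iters=100):
    """Proximal gradient descent on 0.5*||Ax - b||^2 + lam*||x||_1."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                       # gradient of the smooth part
        x = soft_threshold(x - eta * grad, eta * lam)  # prox of the l1 part
    return x

A = np.eye(3)
b = np.array([3.0, 0.2, -2.0])
x = ista(A, b, lam=0.5, eta=1.0)
print(x)  # [ 2.5  0.  -1.5]: soft_threshold(b, 0.5), since A = I
```

Note the exact zero in the middle coordinate — the prox step produces genuinely sparse iterates, not merely small ones.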

ADMM

The Alternating Direction Method of Multipliers handles two-block problems $\min f(\mathbf{x}) + g(\mathbf{z})$ subject to $A\mathbf{x} + B\mathbf{z} = \mathbf{c}$. Each iteration alternates a prox of $f$, a prox of $g$, and a dual update. It is the Swiss Army knife of convex splitting: distributed optimisation, large-scale LASSO, total-variation image denoising, consensus problems across servers. Boyd's 2011 survey is a much-cited reference.
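For the Lasso split as $f(\mathbf{x}) = \tfrac{1}{2}\|A\mathbf{x} - \mathbf{b}\|^2$, $g(\mathbf{z}) = \lambda\|\mathbf{z}\|_1$ with constraint $\mathbf{x} = \mathbf{z}$, each ADMM step is a ridge-style linear solve, a soft threshold, and a dual update. A NumPy sketch (the data are made up for illustration):

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def admm_lasso(A, b, lam, rho=1.0, iters=200):
    """ADMM for  min 0.5*||Ax - b||^2 + lam*||z||_1  s.t.  x - z = 0."""
    n = A.shape[1]
    z, u = np.zeros(n), np.zeros(n)
    M = np.linalg.inv(A.T @ A + rho * np.eye(n))  # factor the x-update once
    Atb = A.T @ b
    for _ in range(iters):
        x = M @ (Atb + rho * (z - u))             # ridge-like x-update
        z = soft_threshold(x + u, lam / rho)      # prox of the l1 part
        u = u + x - z                             # scaled dual update
    return z

A = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
b = np.array([2.0, -0.1, 1.9])
z = admm_lasso(A, b, lam=1.0)
print(z)  # sparse: the second coordinate is driven exactly to zero
```

The same three-step skeleton, with the $x$-update distributed across machines and the $z$-update acting as consensus, is the distributed-optimisation pattern from Boyd's survey.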

Proximal in deep learning

Pure proximal methods are uncommon in neural-network training because the regularisers are usually smooth (weight decay $= \ell_2$) and folded into the gradient. They re-emerge in policy optimisation — TRPO, PPO, and DPO all have KL-divergence proximal structure — and in diffusion-model sampling, where proximal operators arise as denoisers in plug-and-play and RED-style algorithms.

Section 16

Non-convex optimization

Deep networks define loss surfaces that are almost never convex. Most of the theory of sections 03–11 gives no guarantees. And yet SGD, Adam, and their descendants routinely find good solutions. Understanding why — even partially — is the frontier.

The shape of the landscape

A few empirical and theoretical facts have survived a decade of careful investigation; the subsections below collect the main ones.

Escape from saddles

Without perturbations, gradient descent can take exponentially long to escape a saddle point (Du et al. 2017). With noise or random perturbations, results beginning with Ge, Huang, Jin & Yuan (2015) show escape in polynomial time. This is one of the cleanest theoretical justifications for injecting noise into optimisation, and among the first formal explanations of why stochastic methods can outperform deterministic ones on non-convex problems even when both have access to the same gradient oracle.
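The effect is easy to reproduce on the simplest saddle, $f(x, y) = \tfrac{1}{2}x^2 - \tfrac{1}{2}y^2$, whose only stationary point is the saddle at the origin. A plain-Python sketch: started exactly on the stable manifold $y = 0$, gradient descent walks straight into the saddle, while a tiny amount of gradient noise is amplified along the unstable direction and escapes.

```python
import random

random.seed(0)
eta, steps, sigma = 0.1, 100, 1e-3

def grad(x, y):
    # f(x, y) = 0.5*x^2 - 0.5*y^2: the origin is a saddle point.
    return x, -y

# Plain gradient descent, started exactly on the stable manifold (y = 0).
x, y = 1.0, 0.0
for _ in range(steps):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy
y_plain = abs(y)

# The same iteration with tiny Gaussian noise on each gradient.
x, y = 1.0, 0.0
for _ in range(steps):
    gx, gy = grad(x, y)
    x = x - eta * (gx + sigma * random.gauss(0, 1))
    y = y - eta * (gy + sigma * random.gauss(0, 1))
y_noisy = abs(y)

# The plain run stays at y = 0 exactly; the noisy y-coordinate is multiplied
# by (1 + eta) every step, so even sigma = 1e-3 noise escapes the saddle.
print(y_plain, y_noisy)
```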

The role of overparameterisation

A key mechanism: in overparameterised regimes (many more parameters than training examples), the loss landscape near a typical initialisation is, to a good approximation, that of a linear model in the parameters — effectively convex. SGD does not need to explore; it just follows a nearly-linearised path to an interpolating solution. This is the Neural Tangent Kernel story (Jacot, Gabriel & Hongler 2018), and it is one of the few theoretical frameworks that captures real training dynamics.

Key idea

Non-convex optimisation in deep learning succeeds not despite the landscape's complexity but partly because of it. Overparameterisation makes the effective landscape nearly convex near initialisation; noise helps escape the saddles that remain; flat minima are the ones that get picked. The theory is incomplete, but each of these three ingredients is now rigorous somewhere.

Section 17

Convergence rates

An algorithm's convergence rate tells you how many iterations it takes to shave a factor off the error. A surprisingly small menu covers most of the guarantees you will meet in optimisation papers.

For minimising $f$ with optimum $f^*$, write $\varepsilon_t = f(\mathbf{x}_t) - f^*$ and ask how it shrinks.

Sublinear

$\varepsilon_t \leq C/t$ or $C/t^2$ — error shrinks like a polynomial in $t$. This is what plain gradient descent gives on smooth convex $f$ ($1/t$), accelerated methods on the same class ($1/t^2$), and SGD on strongly convex $f$ with decreasing step size ($1/t$). To halve the error, you need to do roughly as much work as you already did — so the total iteration count to reach accuracy $\varepsilon$ is $O(1/\varepsilon)$ or $O(1/\sqrt{\varepsilon})$.

Linear

$\varepsilon_t \leq C \, \rho^t$ with $\rho \in (0, 1)$. Error shrinks by a constant factor every iteration — this is the rate you get on strongly convex $f$ from gradient descent ($\rho = 1 - 1/\kappa$), heavy-ball ($\rho = (1 - 1/\sqrt{\kappa})^2$), or gradient descent with line search. Reaching accuracy $\varepsilon$ takes $O(\log(1/\varepsilon))$ iterations — exponentially faster than sublinear.
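Linear convergence is easy to see numerically. A plain-Python sketch on a quadratic with condition number $\kappa = 10$: gradient descent with $\eta = 1/L$ loses a constant factor of function-value error per step.

```python
# Gradient descent on f(x, y) = 0.5*(x^2 + 10*y^2): L = 10, mu = 1, kappa = 10.
eta = 0.1                       # eta = 1/L
x, y = 1.0, 1.0
errs = []
for _ in range(60):
    errs.append(0.5 * (x * x + 10 * y * y))   # f(x_t) - f*, with f* = 0
    x, y = x - eta * x, y - eta * 10 * y      # the stiff y-direction dies in one step

# After the first step the error lives in the slow (mu = 1) direction, which
# contracts by (1 - eta*mu) = 0.9 per step: 0.81 per step in function value.
ratios = [errs[t + 1] / errs[t] for t in range(20, 40)]
print(min(ratios), max(ratios))  # both ~ 0.81 = (1 - 1/kappa)^2
```

Plotting $\log \varepsilon_t$ against $t$ gives a straight line — the visual signature of a linear rate, versus the bending curve of a sublinear one.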

Quadratic / superlinear

Near a minimum, Newton's method has quadratic convergence: $\varepsilon_{t+1} \leq C \varepsilon_t^2$. Number of correct digits doubles per iteration. Quasi-Newton methods are typically superlinear: the ratio $\varepsilon_{t+1}/\varepsilon_t \to 0$, faster than any linear rate. Both require being close enough to a minimum already.
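Digit-doubling is striking even on a one-dimensional toy. A sketch: Newton's method on $f(x) = e^x - 2x$, whose minimiser is $x^* = \ln 2$, using the pure Newton step $x \leftarrow x - f'(x)/f''(x)$.

```python
import math

# Newton's method on f(x) = exp(x) - 2x, minimised at x* = ln 2.
x, target = 1.5, math.log(2.0)
errors = []
for _ in range(6):
    errors.append(abs(x - target))
    x = x - (math.exp(x) - 2.0) / math.exp(x)   # x - f'(x)/f''(x)
print(errors)  # the number of correct digits roughly doubles per iteration
```

Running it shows errors falling roughly as $\varepsilon, \varepsilon^2, \varepsilon^4, \dots$ — machine precision in a handful of iterations, once the iterate is near the minimum.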

The right way to read a rate

Rates depend on the problem class and the algorithm. Gradient descent on smooth convex $f$ is $O(1/t)$; on smooth strongly convex $f$, it is linear; on smooth non-convex $f$ with stochastic gradients, $\mathbb{E}[\|\nabla f\|^2] = O(1/\sqrt{t})$ — which says we find a critical point, not a minimum. Always note which class the theorem lives in. A paper that "beats" Adam by an order of magnitude in one regime typically inhabits a different theoretical class from the one you are in.

Lower bounds and optimality

Nemirovski–Yudin showed that no first-order method can beat $\Omega(1/t^2)$ on the class of smooth convex functions — so Nesterov's method is optimal up to constants. Similar lower bounds exist for strongly convex, non-convex, and stochastic settings. "Optimality" in optimisation theory always means "no algorithm using the same information can do asymptotically better on this problem class."

Section 18

Where it shows up in ML

Every modern machine-learning system is a stack of optimisation problems. Here is the explicit map from the theory above to the frameworks and papers you will meet in the rest of the compendium.

Read the next few chapters of this compendium with optimisation goggles on and you will see the same shapes everywhere. Probability (chapter 04) is almost always "minimise a negative log-likelihood"; statistics (chapter 05) is "choose an estimator to minimise expected loss"; information theory (chapter 06) is "minimise a KL divergence"; Bayesian reasoning (chapter 07) eventually becomes "maximise an ELBO." The language in which all of it is cashed out is the language of this chapter.

Further reading

Where to go next

Optimization is a field in its own right; this chapter only scratches the surface. Pair one convex-optimization textbook with one numerical-methods book and one set of recorded lectures, and the subject stops feeling like a bag of training tricks and starts looking like a coherent theory.


This is the third chapter of Part I. Up next: probability and statistics, the language in which data and uncertainty are written down — and, not incidentally, the language in which most of the loss functions on this page are derived.