Every modern ML system is, at heart, a collection of classical algorithms stitched together. A tokenizer is a trie; a beam search is a heap; a feature store is a hash table; a vector index is a metric tree; a cache is a doubly linked list with a dictionary bolted on the side. This chapter is a selective tour of the data structures and algorithms an ML practitioner actually reaches for — the $O(1)$, $O(\log n)$, and $O(n \log n)$ primitives that decide whether a pipeline finishes in minutes or in hours.
Conventions: $n$ is input size; $O$, $\Omega$, $\Theta$ denote upper, lower, and tight asymptotic bounds. Python examples assume 3.11+; standard-library modules appear as module.name. We prefer clarity over cleverness — the point of this chapter is not virtuoso coding but a working map of which data structure to reach for, and why.
A classical algorithms course devotes months to self-balancing search trees, shortest-path reductions, and the amortized analysis of splay operations. Most working ML practitioners use almost none of that directly. What they use every day is narrower, deeper, and worth being deliberate about.
The algorithms you meet in an ML pipeline fall into a handful of well-worn families. You will hash feature strings to bucket identifiers. You will sort candidate scores to find the top-$k$. You will walk a graph of task dependencies — of files, of gradients, of compute nodes. You will reach for dynamic programming whenever a score must be computed over an exponentially large set of alignments. You will compare two embeddings in a vector index that is really a metric tree. You will tokenize text with a trie, look up a cache with LRU, deduplicate a stream with a Bloom filter, and forever wonder why your dict is faster than anyone has any right to expect.
This chapter is organised around those families. The first three sections are foundation — complexity analysis, Python's built-in containers, and hashing — because every later section calls back to them. Then we walk through the data structures and algorithms an ML practitioner actually reaches for: sorting and order statistics, heaps, trees and graphs, dynamic programming, recursion, spatial indexes, string algorithms, probabilistic sketches, caching, and streaming. The last section maps each idea to the ML systems in which it lives.
The goal of this chapter is not to prepare you for a whiteboard interview. It is to give you the working vocabulary you need to read a research codebase and see the data structures beneath the model: the trie inside the tokenizer, the heap inside beam search, the hash table inside the feature store, the kd-tree inside the retriever.
One honest caveat. The rest of this compendium assumes Python, and Python hides most of these mechanisms inside built-in types and standard-library modules. You will rarely write a hash table; you will call dict. You will rarely write a heap; you will call heapq. The point of this chapter is not that you should reimplement them — it is that you should know what you are calling, because the difference between an ML system that finishes in five minutes and one that grinds for five hours is usually a data-structure choice a thoughtful author made somewhere upstream.
Algorithms are ranked above all by how their cost grows with input size. A program that is ten percent slower than another hardly matters; a program that is a hundred times slower at ten million rows does not finish at all.
For an input of size $n$, an algorithm's running time — the total number of primitive operations — is some function $T(n)$. Big-O notation throws away constant factors and lower-order terms and keeps the dominant growth:
$$T(n) = O(f(n)) \;\iff\; \exists\, c, n_0 > 0 \text{ such that } T(n) \leq c\, f(n) \text{ for all } n \geq n_0.$$

The typical growth classes, from best to worst on large $n$:

- $O(1)$ — constant: a hash lookup, an append to a list or deque.
- $O(\log n)$ — logarithmic: binary search, heap push and pop.
- $O(n)$ — linear: a for loop with constant work per element.
- $O(n \log n)$ — linearithmic: comparison sorting (sorted, np.sort), FFT, divide-and-conquer merges.
- $O(n^2)$ — quadratic: nested loops over pairs; tolerable only for small inputs.
- $O(2^n)$ — exponential: brute-force enumeration of subsets; intractable beyond toy sizes.
A single operation can be slow, yet a sequence of operations can be cheap on average. Appending to a Python list is $O(1)$ amortized: most appends are a pointer write, but occasionally the underlying array doubles in capacity at cost $O(n)$. Averaged over $n$ appends, the total work is $O(n)$ — so each append is $O(1)$ on average. This is why list.append is fine in a hot loop even though it looks like it might be $O(n)$.
Big-O throws away constants, but in practice they decide how big $n$ has to be before the asymptotics kick in. Quicksort and mergesort are both $O(n \log n)$, but quicksort's constant is smaller and its memory access is friendlier, which is why Python's sorted uses an adaptive variant (Timsort) that is essentially mergesort with optimizations for partly sorted data. NumPy's array addition runs at roughly the speed of memory — billions of operations per second — while a Python loop over the same array runs at roughly ten million. The constants differ by a factor of a hundred; the asymptotic complexity is the same.
Write code for correctness, then profile. If a hot loop turns out to be $O(n^2)$ on an input of size $10^5$, the fix is almost always a data structure that converts it to $O(n \log n)$ — a sort, a heap, or a hash set. Cache, batching, and vectorization help with the constant; only algorithmic change helps with the exponent.
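As a concrete illustration, here is a hypothetical deduplication helper in both forms; the only change is converting a list membership test into a set, which turns the loop from quadratic to linear:

```python
def dedupe_slow(events, seen_ids):
    # O(n * m): each `in` test scans the whole seen_ids list
    return [e for e in events if e not in seen_ids]

def dedupe_fast(events, seen_ids):
    # O(n + m): build a hash set once, then O(1) membership per event
    seen = set(seen_ids)
    return [e for e in events if e not in seen]

events = list(range(10_000))
seen = list(range(0, 10_000, 2))            # every even ID already seen
fresh = dedupe_fast(events, seen)
assert fresh == list(range(1, 10_000, 2))   # only the odd IDs survive
```

On ten thousand events the slow version already does tens of millions of comparisons; the fast version does one pass.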
Python gives you a handful of built-in data structures, each with distinct performance characteristics. Ninety percent of the algorithmic wins in day-to-day ML code come from picking the right container — and from not reaching for a sophisticated data structure where a humble set would do.
list, tuple, set, and dict cover the majority of data-manipulation tasks. Their operating costs are worth memorizing:
| Container | Access / lookup | Insert / delete | Ordered? | Hashable? |
|---|---|---|---|---|
list | $O(1)$ by index | $O(1)$ append / $O(n)$ insert | yes | no |
tuple | $O(1)$ by index | immutable | yes | yes (if elements are) |
set | $O(1)$ membership | $O(1)$ add / discard | no | no (but elements must be) |
dict | $O(1)$ by key | $O(1)$ | insertion-ordered | no (but keys must be) |
The "$O(1)$" entries for set and dict mean average constant time — they rely on hashing, which almost always behaves that way. We will cover when it doesn't in Section 04.
Three more from the standard library carry their weight:
- collections.deque — a double-ended queue: $O(1)$ append and pop at both ends, unlike list, which is $O(n)$ at the left. Use it for BFS queues, sliding windows, and any FIFO.
- collections.Counter — a dict subclass for counting hashable items. Counter(tokens).most_common(10) is the cleanest top-10 in the language.
- collections.defaultdict — a dict that builds missing values on access. Removes the if key not in d: d[key] = [] pattern entirely.
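A minimal sketch of all three in action (the token data is made up):

```python
from collections import deque, Counter, defaultdict

# deque: O(1) at both ends; here, a sliding window over a token stream
window = deque(maxlen=3)
for tok in ["a", "b", "c", "d"]:
    window.append(tok)                  # oldest token falls off the left
assert list(window) == ["b", "c", "d"]

# Counter: the cleanest top-k counting in the language
tokens = ["cat", "dog", "cat", "bird", "cat", "dog"]
assert Counter(tokens).most_common(2) == [("cat", 3), ("dog", 2)]

# defaultdict: group without the `if key not in d` dance
by_first_letter = defaultdict(list)
for tok in tokens:
    by_first_letter[tok[0]].append(tok)
assert by_first_letter["c"] == ["cat", "cat", "cat"]
```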
To live inside a set or as a dict key, an object must be hashable: immutable and producing a stable hash() value across its lifetime. tuple, str, bytes, frozenset, and int are hashable; list, set, and dict are not. This is why grouping by a composite key of (user_id, day) is natural and by [user_id, day] is an error.
Once numerical homogeneity enters the picture, Python's built-in list becomes expensive. A list of one million floats costs roughly 32 MB (an 8-byte pointer plus a 24-byte boxed float object per element); the equivalent numpy.ndarray costs 8 MB (one C double per element) and supports vectorized operations. Pandas' Series and DataFrame add column labels and index alignment on top. For numerical workloads, the rule is "leave list at the edges and move to NumPy as soon as you can."
Need membership tests? set. Need counts? Counter. Need lookups by key? dict. Need ordered numerical data? numpy.ndarray. Need ordered labelled tabular data? pandas.DataFrame. Need both ends fast? deque. Everything else is a list.
Every dict lookup, every set membership test, every string interning, every Parquet partition key resolves to the same primitive — a hash function that maps a variable-length object to a fixed-width integer. Hashing is the difference between $O(1)$ and $O(n)$ and between a system that scales and one that does not.
A hash function $h: \mathcal{X} \to \{0, 1\}^{64}$ should (a) be cheap to compute, (b) spread its outputs uniformly across the range, (c) behave like a random function on typical inputs. It need not be cryptographic — resistance to adversarial preimage attacks matters only when the input is attacker-controlled. Python's built-in hash() is a fast non-cryptographic hash (SipHash, chosen precisely to resist basic dict-flooding attacks).
A hash table is an array of $m$ buckets plus a hash function. To insert a key-value pair, compute $h(k) \bmod m$, place the entry in that bucket. To look up a key, do the same — if the bucket is occupied by a different key, scan forward (open addressing) or walk a linked list (separate chaining). The operation is $O(1)$ expected as long as the load factor $\alpha = n/m$ stays bounded; when it crosses a threshold (0.66 in CPython), the table resizes, rehashing all entries. That resize is a rare $O(n)$ event amortized across many $O(1)$ operations.
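A toy hash table makes the mechanics concrete. This sketch uses separate chaining rather than CPython's open addressing, but the load-factor resize logic is the same idea:

```python
class ToyHashTable:
    """Separate-chaining hash table (illustrative, not CPython's design)."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > 0.66:   # load-factor threshold
            self._resize()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        # The rare O(n) event: double capacity and rehash everything
        old = [kv for bucket in self.buckets for kv in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        self.size = 0
        for k, v in old:
            self.put(k, v)

t = ToyHashTable()
for i in range(100):
    t.put(f"key{i}", i)
assert t.get("key42") == 42
```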
When two keys hash to the same bucket, the table must resolve the collision. CPython uses open addressing with a clever probe sequence. For well-distributed hashes, expected collisions per lookup are $O(1)$. For adversarial inputs, an attacker can craft keys that all collide, turning every insert into $O(n)$ and the whole table into $O(n^2)$ — this is the hash-DoS vulnerability, defended against by hash randomization (PYTHONHASHSEED).
The hashing trick turns a high-cardinality categorical feature into a fixed-size index without ever building a vocabulary. Given a string feature value $s$, store the entry at position $h(s) \bmod d$ of a sparse vector. The dimension $d$ controls the collision rate; for $d = 2^{20}$ and a million unique features, collisions are common, but in practice they behave like mild regularization and rarely hurt accuracy. This is how sklearn.feature_extraction.FeatureHasher turns arbitrary strings into a fixed-dimensional representation, and how production ad-ranking systems handle billions of feature values without a vocabulary lookup table.
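A minimal sketch of the hashing trick, using a stable digest from hashlib rather than the salted built-in hash() (the feature strings are hypothetical; sklearn's FeatureHasher does the same with a signed variant):

```python
import hashlib

def hash_features(tokens, d=2**20):
    """Map raw string features into a fixed-size sparse count vector."""
    vec = {}  # sparse: index -> count
    for tok in tokens:
        # stable across processes, unlike hash(), which is salted
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "little")
        idx = h % d
        vec[idx] = vec.get(idx, 0) + 1
    return vec

v = hash_features(["user_id=7", "country=DE", "user_id=7"])
assert sum(v.values()) == 3   # three raw feature occurrences land in the vector
```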
A signed version of the hashing trick gives unbiased estimates of inner products — a trick central to count sketches and random projection, both covered in Section 13.
Hash tables are the reason every dict-based operation in Python is as fast as it is. They are also the reason that a feature name typo ("user_id" vs "userId") silently produces two features that never align across systems. When you think about performance, think about which lookups in your pipeline are hash-backed and which are not.
Sorting is the most common algorithmic workhorse in a data pipeline, and one of the few where the textbook answer actually gets used. But there are at least three jobs it does, and only one of them is "put everything in order."
Any sort based on pairwise comparisons needs $\Omega(n \log n)$ comparisons in the worst case — that is an information-theoretic lower bound, not a design choice. The practical algorithms that hit this bound are mergesort (stable, guaranteed worst case), heapsort (in-place, guaranteed worst case, not stable), and quicksort (unstable, average-case $O(n \log n)$ with excellent constants). Python's sorted and list.sort use Timsort, a hybrid that detects and exploits already-sorted runs — it is $O(n)$ on nearly-sorted input and $O(n \log n)$ otherwise, and it is stable (equal elements keep their original order, which matters for deterministic tie-breaking).
If your data is integers in a bounded range, you can do better than $O(n \log n)$. Counting sort is $O(n + k)$ for $n$ elements in the range $[0, k)$; radix sort makes $U / \log_2 k$ counting-sort passes over $U$-bit integers in base-$k$ digits. NumPy's np.argsort(kind="stable") switches to radix under the hood for integer arrays. In ML, non-comparison sorts show up almost never — but when they do (say, sorting 100 million uint8 labels), the speedup is dramatic.
A common mistake is sorting a million-element array to pull out the largest ten. Finding the $k$ smallest or largest is an expected-$O(n)$ problem via quickselect: pick a pivot, partition, recurse into the side that contains the element you want. The full sort would be $O(n \log n)$, so for top-$k$ queries with $k \ll n$, quickselect wins by a factor of $\log n$.
Python's heapq.nlargest(k, iterable) and NumPy's np.partition(a, -k)[-k:] both implement order statistics directly. Use them. In ML, every time you ask "what are the top-10 predicted classes for this example?" you are calling an order-statistics routine — and if you sort the whole vector of 100,000 logits to pull out ten, you have made that step roughly $\log_2 100{,}000 \approx 17$ times slower than it needs to be.
Need a full sort? sorted or np.sort. Need the top $k$? heapq.nlargest or np.argpartition. Need the median? np.median, which is quickselect with $k = n/2$. Never sort when a partition suffices.
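A quick comparison of the routes to a top-10, on synthetic logits (the array is made up):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(100_000)

# Full sort: O(n log n); wasteful when k << n
top10_sorted = np.sort(logits)[-10:]

# Partition: O(n); the last 10 are the largest, in no particular order
top10_part = np.partition(logits, -10)[-10:]
assert sorted(top10_part.tolist()) == top10_sorted.tolist()

# Pure-Python equivalent for arbitrary iterables (returns descending order)
top10_heap = heapq.nlargest(10, logits.tolist())
assert top10_heap[::-1] == top10_sorted.tolist()
```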
A heap is a tree-shaped array that lets you find the minimum (or maximum) element and remove it in logarithmic time. It is the right data structure whenever you are incrementally pulling "the best thing so far" from a growing set.
A min-heap is an array where each element is $\leq$ its two children at positions $2i+1$ and $2i+2$. The operations:

- push — append at the end, then sift up toward the root: $O(\log n)$.
- pop-min — swap the root with the last element, remove it, sift down: $O(\log n)$.
- peek-min — read index 0: $O(1)$.
- heapify — build a heap from an arbitrary array in $O(n)$.
Python's heapq operates directly on a list. No wrapper object, no constructor. heapq.heappush(h, x), heapq.heappop(h), and heapq.heapify(xs) are the whole API.
Whenever you hear "explore the most promising X first," a heap is underneath. Concrete cases:

- Beam search — keep the $k$ best partial hypotheses; pop the best, extend it, push the children.
- Dijkstra and A* — pop the frontier node with the smallest tentative distance.
- Top-$k$ of a stream — maintain a min-heap of size $k$; replace the root whenever a larger item arrives.
- Schedulers — pop the task with the earliest deadline or highest priority.
Need the smallest-so-far of a changing collection? Min-heap. Need the largest? Push the negated priority. Need both ends efficient? Indexed priority queue with decrease-key — Python does not have one built in, but heapdict on PyPI does, and networkx's shortest-path routines implement one internally.
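One pattern worth having in muscle memory is streaming top-$k$ with a bounded min-heap, sketched here:

```python
import heapq

def top_k_stream(stream, k):
    """Keep the k largest items of a stream: O(n log k) time, O(k) memory.
    The heap root is always the smallest survivor, so one comparison
    decides whether a new item can possibly belong in the top k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop-min + push in one O(log k) step
    return sorted(heap, reverse=True)

assert top_k_stream([5, 1, 9, 3, 7, 8, 2], 3) == [9, 8, 7]
```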
In a textbook, "tree" usually means "balanced binary search tree." In ML, the trees you actually use are almost never search trees at all — they are decision trees, tries, heaps, B-trees inside databases, abstract syntax trees inside compilers, and metric trees inside retrieval systems. Knowing the right tree for the job is more useful than knowing the rotation rules of red-black trees.
A BST is a rooted tree in which every node's key is larger than every key in its left subtree and smaller than every key in its right subtree. Lookup, insert, and delete are $O(\log n)$ if balanced, $O(n)$ if not. Self-balancing variants — red-black, AVL — maintain balance with rotations. Python does not include one in its standard library; reach for sortedcontainers.SortedDict — an ordered mapping built on sorted lists of sublists rather than a balanced tree — when you need ordered iteration plus fast lookup.
The trees you will actually train. A decision tree splits the input space along one axis at a time, and at each leaf predicts a class or a regression value. Each split is greedy — chosen to maximize information gain (classification) or reduce variance (regression). Lookup is $O(\text{depth})$, not $O(\log n)$: a 1000-row training set produces a tree of depth $\leq 10$ typically. Random forests, gradient-boosted trees (XGBoost, LightGBM, CatBoost), and their variants are ensembles of these. Part IV covers them in depth.
A trie indexes a collection of strings by their prefixes. Each edge is a character; a leaf holds the string (or its value). Insert and lookup are $O(|\text{key}|)$ rather than $O(\log n)$, and common prefixes are shared. Tries power autocomplete, regex matching, IP-routing tables, and — central to ML — every modern tokenizer. The BPE (byte-pair encoding) algorithm in GPT-style tokenizers is essentially a trie-building procedure that merges the most frequent adjacent pairs iteratively.
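A minimal trie with the greedy longest-match lookup that tokenizers rely on (the four-entry vocabulary is made up, and "$" as an end-of-token marker is a sketch convention):

```python
class Trie:
    """Minimal character trie for longest-prefix matching."""

    def __init__(self, vocab):
        self.root = {}
        for word in vocab:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = word               # end-of-token marker

    def longest_match(self, text, start=0):
        """Longest vocab entry matching text[start:], in O(match length)."""
        node, best = self.root, None
        for i in range(start, len(text)):
            ch = text[i]
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                best = node["$"]           # remember the last full token seen
        return best

vocab = Trie(["un", "unhappy", "happy", "ness"])
assert vocab.longest_match("unhappyness") == "unhappy"
assert vocab.longest_match("unhappyness", start=7) == "ness"
```

A greedy tokenizer is just this lookup in a loop: match, advance past the match, repeat.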
Heaps (Section 06) are also trees, stored implicitly in an array. The tree structure is what makes $O(\log n)$ operations possible; the array layout is what makes them fast.
kd-trees, ball trees, cover trees, and vantage-point trees index points in a metric space for fast nearest-neighbor queries. These are the engines beneath sklearn.neighbors, and they will get full treatment in Section 11.
A graph is a set of nodes and a set of edges between them. Viewed that way, almost every ML system has graphs inside it — computational graphs for autodiff, dependency graphs for data pipelines, knowledge graphs for retrieval, social graphs for recommenders, computation graphs for model parallelism.
Two canonical ways to store a graph:
- Adjacency list — a dict mapping each node to a list of its neighbors. Memory is $O(V + E)$ — proportional to the actual edges. Iteration over a node's neighbors is immediate. The default for sparse graphs.
- Adjacency matrix — a $V \times V$ array with a nonzero entry per edge. $O(V^2)$ memory but constant-time edge tests and vectorized operations; in scipy.sparse format — best of both worlds.

The two canonical graph traversals:
- BFS (breadth-first search) — explore level by level with a FIFO queue (collections.deque). Finds shortest paths in unweighted graphs. $O(V + E)$.
- DFS (depth-first search) — follow one branch to its end before backtracking, via recursion or an explicit stack. The backbone of cycle detection and topological sort. $O(V + E)$.

Topological sort produces an ordering of the nodes of a directed acyclic graph (DAG) such that every edge goes from earlier to later. It is the data-pipeline scheduler's core routine — Airflow, Dagster, and Prefect all topologically sort their task graphs at every run. It is also how PyTorch's autograd decides the order in which to apply backward ops.
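Topological sort is short enough to write from memory. Here is a sketch of Kahn's algorithm over a hypothetical four-task pipeline:

```python
from collections import deque

def topo_sort(graph):
    """Kahn's algorithm: repeatedly emit a node with no remaining in-edges.
    `graph` is an adjacency list {node: [successors]}. O(V + E)."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1
    ready = deque(u for u, d in indegree.items() if d == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in graph.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) < len(indegree):
        raise ValueError("graph has a cycle")
    return order

# Hypothetical pipeline: ingest -> clean -> {featurize, validate}
dag = {"ingest": ["clean"], "clean": ["featurize", "validate"],
       "featurize": [], "validate": []}
order = topo_sort(dag)
assert order.index("ingest") < order.index("clean") < order.index("featurize")
```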
Dijkstra's algorithm computes single-source shortest paths in weighted graphs with non-negative edge weights in $O((V + E) \log V)$ using a min-heap. Bellman–Ford handles negative weights at $O(VE)$. A* extends Dijkstra with a heuristic — if the heuristic is admissible, A* still finds the optimum while visiting fewer nodes. Path-finding in robotics, routing, and game AI is almost always one of these three.
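Dijkstra's algorithm is a dozen lines around heapq. This sketch uses the lazy-deletion variant, where stale heap entries are skipped rather than decreased in place:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths with a min-heap of (distance, node).
    `graph` maps node -> list of (neighbor, weight); weights must be >= 0."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale entry: a shorter path won already
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
assert dijkstra(g, "a") == {"a": 0, "b": 1, "c": 3}
```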
Modern ML treats graphs as objects of deep learning. A graph neural network applies a learned function to each node that depends on its neighbors — a message-passing update
$$h_v^{(k+1)} = \text{UPDATE}\!\left(h_v^{(k)},\; \text{AGGREGATE}\!\left(\{\,h_u^{(k)} : u \in \mathcal{N}(v)\,\}\right)\right).$$
Inside PyTorch Geometric and DGL, this computes as a sparse matrix multiplication — the graph's adjacency list is the sparse matrix. Section 05 of the Scientific Computing chapter is load-bearing for GNN performance.
Small graph, quick exploration: networkx (Python-pure, pedagogically clear, slow). Medium-size (millions of edges): igraph or graph-tool (C backends). Massive, or you need autodiff: PyTorch Geometric / DGL / JAX-based GNN libraries.
Dynamic programming is the technique of solving a problem by solving smaller subproblems first, caching their answers, and composing them into the full solution. It is not the elegant recursion of computer-science textbooks so much as the bread-and-butter of sequence models, alignment algorithms, and any ML problem with a Markov structure.
Two ingredients make a problem DP-friendly:

- Optimal substructure — an optimal solution is composed of optimal solutions to its subproblems.
- Overlapping subproblems — the same subproblems recur many times, so caching their answers pays off.
The implementation is either top-down — a recursive function with @functools.lru_cache — or bottom-up, an iteration over an explicit table. Bottom-up is usually faster and uses less stack; top-down is often shorter to write.
Three DP algorithms show up in ML every week:

- The Viterbi algorithm — the most probable hidden-state path of an HMM or CRF, a max-product DP over (position, state).
- The forward algorithm — the marginal likelihood of a sequence, the same table with max replaced by sum.
- CTC loss — the probability of a label sequence summed over all monotonic alignments, a forward-style DP over (time step, label position).
Two more that work the same way: edit distance (Levenshtein) is a DP over (source-prefix, target-prefix) pairs; Smith–Waterman and Needleman–Wunsch are DPs over (sequence-A-position, sequence-B-position) for biological sequence alignment.
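Edit distance shows the DP shape in its purest form. A sketch with the usual two-row memory optimization:

```python
def edit_distance(a, b):
    """Levenshtein distance: bottom-up DP over (prefix of a, prefix of b).
    Keeps only the previous row, so memory is O(len(b)); time O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))            # cost of aligning "" with b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                            # cost of aligning a[:i] with ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
assert edit_distance("", "abc") == 3
```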
Any model whose likelihood factorizes along a chain — HMMs, CRFs, autoregressive language models, RNNs — admits an exact inference algorithm that is a DP. The task determines which DP to run: the forward algorithm for likelihood, the Viterbi algorithm for decoding, the Baum–Welch algorithm for learning. Transformers broke this pattern by replacing the chain with full attention, but decoding at inference still leans on DP-flavored search — beam search over the output space.
Whenever a sequence model gives you an exact answer about a probability over an exponentially large set (most-probable path, marginal likelihood, best alignment), the engine underneath is almost always a dynamic program over a two-dimensional table. Recognizing that shape is how you read an HMM paper or a speech-recognition loss in ten minutes.
Dynamic programming's older sibling. Break a problem into independent subproblems, solve each recursively, combine their answers. When the subproblems are disjoint, divide-and-conquer wins; when they overlap, DP does.
A divide-and-conquer recurrence of the form
$$T(n) = a\, T(n/b) + O(n^d)$$

— $a$ subproblems of size $n/b$ plus $O(n^d)$ work to combine — solves to

- $O(n^d)$ if $a < b^d$ (the combine step dominates),
- $O(n^d \log n)$ if $a = b^d$ (work balanced across levels),
- $O(n^{\log_b a})$ if $a > b^d$ (the recursion dominates).
Mergesort is $a = 2, b = 2, d = 1$ — the middle case, $O(n \log n)$. Binary search is $a = 1, b = 2, d = 0$ — middle case, $O(\log n)$. Strassen's matrix multiplication is $a = 7, b = 2, d = 2$ — third case, $O(n^{\log_2 7}) \approx O(n^{2.81})$, beating the naïve $O(n^3)$.
Divide-and-conquer in the wild: binary search over sorted arrays, the FFT, Strassen-style matrix multiplication, and the merge step inside every sorted() call.
Python has a default recursion limit of 1000, raisable via sys.setrecursionlimit. For deep DPs — say, a 10,000-length sequence alignment — prefer bottom-up iteration or convert to an explicit stack. Tail-call optimization is not a thing in Python, so deep recursion is a trap that should be unrolled before it bites.
When you see a function that calls itself twice with roughly half the input, it is divide-and-conquer and its complexity follows the master theorem. When you see a function that calls itself with overlapping inputs, cache it — you are doing dynamic programming and you should say so.
"Find the closest point to this one" is the central operation of retrieval, clustering, and a large share of modern ML. A linear scan is $O(n)$ per query, fine for thousands of points, untenable for billions. Spatial data structures and modern approximate-nearest-neighbor indexes reduce it to near-logarithmic — at a cost in exactness.
A kd-tree recursively partitions a $d$-dimensional space along axis-aligned hyperplanes: split on $x_1$ at its median, then each half on $x_2$, and so on. Lookup walks the tree and prunes branches whose bounding box is farther from the query than the best candidate so far. For low-dimensional problems ($d \lesssim 20$) with $n$ points, a kd-tree answers nearest-neighbor queries in $O(\log n)$ average.
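A kd-tree fits in a few dozen lines. This illustrative sketch (dict-based nodes, median splits, plane-crossing pruning) is far from production FAISS or sklearn, but the algorithm is the same:

```python
import math

def build_kdtree(points, depth=0):
    """Recursive kd-tree: split on the median along axis depth % d."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Descend toward the query, then prune: only cross the splitting
    plane if it is closer to the query than the best candidate so far."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < math.dist(best, query):   # plane may hide a closer point
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
assert nearest(tree, (9, 2)) == (8, 1)
```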
A ball tree partitions points into nested balls (hyperspheres), handles non-axis-aligned data better, and is the default in sklearn.neighbors for general metrics. Both structures degrade in high dimensions — the curse of dimensionality — because in $\mathbb{R}^{100}$, the "bounding box" of any interesting set of points is almost all of the space.
In high dimensions the only practical path is approximation. The modern winners all have the same shape: build an index structure that routes a query to a small number of candidate points, check those candidates exactly, return the best. Sacrifice a few percent of recall for orders of magnitude in speed.
- HNSW — a layered graph of "small-world" shortcuts; a query walks greedily toward its neighborhood. High recall with modest memory.
- IVF-PQ — cluster the vectors into coarse cells, then product-quantize the residuals (the workhorse inside FAISS). The best compression at the largest scales.
- Vector databases — an ANN index wrapped in a storage and query layer: pgvector, Weaviate, Qdrant, Milvus.

A vector database is essentially an ANN index plus a storage layer. When you query a RAG system with a question, the system embeds the question into a vector, the vector index returns the $k$ nearest chunks, and the language model consumes them as context. Every part of that pipeline is covered somewhere in this compendium; the index part is here.
$d \leq 20$ and exact: kd-tree or ball tree. $d \gg 20$ and approximate is fine: HNSW (good recall, modest memory) or IVFPQ (best compression, largest scale). Up to a million vectors: FAISS on a single machine. Billions: FAISS with GPU backends, or dedicated vector DBs.
NLP runs on sequences of strings long before it runs on tensors. The algorithms that turn raw bytes into model-ready tokens are a small, sharp-edged toolkit with direct lineage to classical string processing.
A trie (Section 07) matches a prefix against a vocabulary in $O(|\text{prefix}|)$. Regex engines and early text processors lived on tries. In modern NLP, the tokenizer is a trie-like dictionary that greedily matches the longest known token at each position in the input.
BPE starts with a vocabulary of bytes and iteratively merges the most frequent adjacent pair into a new token, updating frequencies. After $K$ merges, the vocabulary has size $256 + K$. Popular because it handles out-of-vocabulary words gracefully (rare words decompose into subwords) and because the tokenizer state is just the ordered list of merges. GPT-2, GPT-3, GPT-4, LLaMA, and Claude all use variants. tiktoken and sentencepiece are the canonical implementations.
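A toy version of the BPE training loop. Real tokenizers (tiktoken, sentencepiece) work per word with byte-level fallback and careful tie-breaking; this sketch treats the text as one symbol sequence to show the merge logic:

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    seq = list(text)                       # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)               # record the learned merge rule
        out, i = [], 0
        while i < len(seq):                # apply the merge left to right
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, seq = bpe_merges("abababcd", 2)
assert merges == ["ab", "abab"]
assert seq == ["abab", "ab", "c", "d"]
```

The tokenizer state really is just the ordered list of merges; encoding replays them.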
WordPiece (BERT) is a close cousin of BPE: merges are chosen to maximize likelihood rather than frequency. Unigram LM (T5, mT5) starts with a large vocabulary and prunes tokens that contribute least to the corpus likelihood — effectively an EM algorithm over tokenization. All three produce similar-feeling vocabularies and all three are, algorithmically, agglomerative or divisive clustering over byte sequences.
KMP (Knuth–Morris–Pratt), Boyer–Moore, and Rabin–Karp are the classical linear-time substring-matching algorithms. They remain the right answer for log analysis, grep-like tools, and security work, but they show up rarely in modern ML pipelines. Suffix arrays and FM-indexes still anchor bioinformatics — variant calling, read mapping, genome assembly are all built on them.
Training a tokenizer: sentencepiece or tokenizers (the Hugging Face Rust implementation). Encoding text fast: tiktoken. Fuzzy string matching: rapidfuzz. Genome-scale string indexes: bwa or bowtie2 under the hood.
When a data set is too large to store exactly, trade a small, controllable error for dramatic savings in memory. A family of data structures — all built on hashing and the law of large numbers — gives you approximate membership, approximate counts, and approximate cardinality at a tiny fraction of the memory of the exact equivalent.
A Bloom filter answers "have I seen this item?" with no false negatives and a controllable false-positive rate. Store $m$ bits; use $k$ independent hash functions. To insert an item, hash it $k$ times and set those $k$ bits. To query, check all $k$ bits; if any is zero, the item is definitely absent; if all are one, it is probably present. For $n$ items, false-positive rate is
$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k,$$

minimized at $k = (m/n) \ln 2$. Bloom filters live in BigQuery's partition pruning, Cassandra's read path, caching-proxy negative caches, and the deduplication step of many LLM pretraining pipelines.
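A minimal Bloom filter, using double hashing to derive the $k$ probe positions from one SHA-256 digest (an illustrative sketch, not a drop-in library):

```python
import hashlib
import math

class BloomFilter:
    """m-bit Bloom filter sized for a target false-positive rate."""

    def __init__(self, n_items, fpr=0.01):
        self.m = math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # double hashing: k positions from two 64-bit halves of one digest
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "little")
        h2 = int.from_bytes(h[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter(n_items=1000, fpr=0.01)
for i in range(1000):
    bf.add(f"doc-{i}")
assert all(f"doc-{i}" in bf for i in range(1000))   # no false negatives, ever
```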
A Count-Min Sketch estimates the frequency of each item in a stream using a fixed-size 2D array of counters and $d$ hash functions — one row per hash. To increment, hash the item by each function and increment the counter at each row/column. To query, return the minimum across rows. Overestimates by construction; the error is bounded by $\epsilon \cdot N$ with probability $1 - \delta$ when the sketch has dimensions $\lceil e/\epsilon \rceil \times \lceil \ln(1/\delta) \rceil$. Used for top-$k$ streaming queries, heavy-hitter detection, and any "which item appeared most?" query over a firehose of data.
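A sketch of a Count-Min Sketch, with each row's hash derived from SHA-256 (the width and depth here are illustrative choices, not the formula-derived values):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; query = min over rows, so it only overestimates."""

    def __init__(self, width=2048, depth=4):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def query(self, item):
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for tok in ["a"] * 100 + ["b"] * 10 + ["c"]:
    cms.add(tok)
assert cms.query("a") >= 100   # true count is a lower bound on the estimate
assert cms.query("b") >= 10
```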
HyperLogLog estimates the number of distinct elements in a stream. Hash each item; look at the leading zeros of the hash; keep a small register per hash bucket. Cardinality estimates come from the distribution of maximum leading zeros. A few kilobytes of state can estimate cardinalities of $10^9$ with relative error around 2%. Redis, BigQuery, Presto, and almost every analytic database implement HLL as a built-in.
A final cousin: given a stream of unknown length, sample $k$ items uniformly at random with one pass and $O(k)$ memory. Keep the first $k$ in a reservoir; for each subsequent item $i$, replace a random existing reservoir entry with probability $k/i$. The resulting reservoir is exactly a uniform sample. Used in subsampling large logs, online evaluation, and anytime "give me $k$ random rows from this huge table" arises.
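The reservoir update above translates directly into code:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """One-pass uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i)         # keep item with probability k / i
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), k=10)
assert len(sample) == 10
assert all(0 <= x < 100_000 for x in sample)
```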
Exact data structures cost memory proportional to data size. Probabilistic data structures cost memory proportional to error tolerance. For stream-scale data, that is the difference between "we can do this in a few KB" and "we can't do this at all."
The two cheapest wins in a real system are "do it once and remember" and "do it once and forget." Caching covers the first; streaming covers the second. Both are what separate a pipeline that fits on one laptop from one that needs a cluster.
Memoization stores the return value of a function keyed by its arguments so repeated calls are free. functools.lru_cache and the newer functools.cache are the Python decorators. LRU ("least-recently-used") bounds the cache size by evicting the entry unused for the longest. The data structure underneath is a doubly-linked list plus a hash map — $O(1)$ hit, miss, and eviction. It is the default caching strategy for everything from CPU TLBs to GPT-style KV caches to HTTP proxies.
Modern ML systems are stacks of caches. In LLM inference, the KV cache stores attention keys and values for already-processed tokens so each new token costs $O(\text{seq len})$ rather than $O(\text{seq len}^2)$. In retrieval, a vector index is a cache over embedding computations. In training pipelines, a feature store is a cache over feature engineering. In hyperparameter sweeps, memoizing a model's training output lets overnight experiments be replayed in seconds.
When data arrives too fast to store, the only option is a single pass with bounded memory. The probabilistic structures of Section 13 are streaming algorithms. So are Welford's online mean and variance, exponentially weighted moving averages, and the reservoir sampler. Welford's algorithm maintains $\mu_n$ and $M_{2,n} = \sum_i (x_i - \mu_n)^2$ by
$$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}, \qquad M_{2,n} = M_{2,n-1} + (x_n - \mu_{n-1})(x_n - \mu_n).$$

Numerically stable, single-pass, bounded memory — the template for any streaming-statistics routine.
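Welford's update translates directly into code:

```python
class RunningStats:
    """Welford's one-pass mean/variance: numerically stable, O(1) memory."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses both the old and new mean

    @property
    def variance(self):
        # sample variance; 0.0 until there are at least two observations
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

rs = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rs.update(x)
assert abs(rs.mean - 5.0) < 1e-12
assert abs(rs.variance - 32 / 7) < 1e-9
```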
Same idea in slower motion. When data is too large for RAM but fits on disk, structure the algorithm so each pass reads sequentially — external mergesort, chunked aggregations in pandas / polars / dask, and disk-backed key-value stores like RocksDB. The bottleneck is bandwidth, and algorithms that randomly access disk are orders of magnitude slower than those that scan.
Repeated identical work: @functools.lru_cache. Data larger than memory but smaller than disk: stream with polars / dask / pyarrow chunks. Data larger than disk: distributed (ray, spark) plus probabilistic approximations where exact answers are not needed. Every time you catch yourself computing the same thing twice, ask whether caching or streaming changes the cost entirely.
Every non-trivial ML system is an orchestration of the data structures and algorithms in this chapter. Here is the map — one line per system, pointing at the primitives doing the work underneath.
- Tokenization — when you call tiktoken.encode, a trie of a hundred thousand merges is being traversed.
- Beam search — a heap of partial hypotheses, always popping the most promising.
- Feature stores — hash tables from entity keys to feature vectors.
- Retrieval — a metric tree or ANN index under every nearest-neighbor query.
- Pretraining deduplication — Bloom filters over a firehose of documents.
- LLM inference — an LRU-flavored KV cache that keeps decoding linear per token.
- Pipeline orchestration — topological sort of a task DAG, the same routine as make.
Recognizing these primitives in a research codebase is what separates reading a paper from understanding it. When you open transformers or vllm or faiss or fairseq, what you are reading is not the mathematics of attention or the geometry of embedding spaces — it is a careful composition of the data structures in this chapter, arranged to push each expensive operation onto a cheaper one.
The algorithms literature is one of the oldest and most settled in computer science, but only a small fraction of it matters day-to-day for an ML practitioner. The list below is opinionated: one or two foundational textbooks, a few focused treatments of the pieces this chapter could only sketch, and the community's preferred entry points for the probabilistic and approximate-nearest-neighbour material that is genuinely live.
- The Python standard library reference: heapq for priority queues and top-$k$; bisect for sorted-list insertion; collections for deque, Counter, defaultdict, OrderedDict; functools for lru_cache. Under two thousand words of documentation buys years of productivity.
- Source-first reading: tokenizers is the reference BPE/WordPiece/Unigram implementation; faiss is the reference ANN index. Both have readable source worth browsing.

This page is the third chapter of Part II: Programming & Software Engineering. Up next: software-engineering principles, databases and SQL, and version control. Then Part III — data engineering — and Part IV, where every classical model you build will be a composition of the data structures in this chapter.