Every modern ML system is, at heart, a collection of classical algorithms stitched together. A tokenizer is a trie; a beam search is a heap; a feature store is a hash table; a vector index is a metric tree; a cache is a doubly linked list with a dictionary bolted on the side. This chapter is a selective tour of the data structures and algorithms an ML practitioner actually reaches for — the $O(1)$, $O(\log n)$, and $O(n \log n)$ primitives that decide whether a pipeline finishes in minutes or in hours.
Conventions: $n$ is input size; $O$, $\Omega$, $\Theta$ denote upper, lower, and tight asymptotic bounds. Python examples assume 3.11+; standard-library modules appear as module.name. We prefer clarity over cleverness — the point of this chapter is not virtuoso coding but a working map of which data structure to reach for, and why.
A classical algorithms course devotes months to self-balancing search trees, shortest-path reductions, and the amortized analysis of splay operations. Most working ML practitioners use almost none of that directly. What they use every day is narrower, deeper, and worth being deliberate about.
The algorithms you meet in an ML pipeline fall into a handful of well-worn families. You will hash feature strings to bucket identifiers. You will sort candidate scores to find the top-$k$. You will walk a graph of task dependencies — of files, of gradients, of compute nodes. You will reach for dynamic programming whenever a score must be computed over an exponentially large set of alignments. You will compare two embeddings in a vector index that is really a metric tree. You will tokenize text with a trie, look up a cache with LRU, deduplicate a stream with a Bloom filter, and forever wonder why your dict is faster than anyone has any right to expect.
This chapter is organised around those families. The first three sections are foundation — complexity analysis, Python's built-in containers, and hashing — because every later section calls back to them. Then we walk through the data structures and algorithms an ML practitioner actually reaches for: sorting and order statistics, heaps, trees and graphs, dynamic programming, recursion, spatial indexes, string algorithms, probabilistic sketches, caching, and streaming. The last section maps each idea to the ML systems in which it lives.
The goal of this chapter is not to prepare you for a whiteboard interview. It is to give you the working vocabulary you need to read a research codebase and see the data structures beneath the model: the trie inside the tokenizer, the heap inside beam search, the hash table inside the feature store, the kd-tree inside the retriever.
One honest caveat. The rest of this compendium assumes Python, and Python hides most of these mechanisms inside built-in types and standard-library modules. You will rarely write a hash table; you will call dict. You will rarely write a heap; you will call heapq. The point of this chapter is not that you should reimplement them — it is that you should know what you are calling, because the difference between an ML system that finishes in five minutes and one that grinds for five hours is usually a data-structure choice a thoughtful author made somewhere upstream.
Algorithms are ranked above all by how their cost grows with input size. A program that is ten percent slower than another hardly matters; a program that is a hundred times slower at ten million rows does not finish at all.
For an input of size $n$, an algorithm's running time — the total number of primitive operations — is some function $T(n)$. Big-O notation throws away constant factors and lower-order terms and keeps the dominant growth:
$$T(n) = O(f(n)) \;\iff\; \exists\, c, n_0 > 0 \text{ such that } T(n) \leq c\, f(n) \text{ for all } n \geq n_0.$$

The typical growth classes, from best to worst on large $n$:

- $O(1)$ — constant: a hash lookup, an append to a list or deque.
- $O(\log n)$ — logarithmic: binary search, heap push and pop.
- $O(n)$ — linear: a for loop with constant work per element.
- $O(n \log n)$ — linearithmic: comparison sorting (sorted, np.sort), FFT, divide-and-conquer merges.
- $O(n^2)$ — quadratic: nested loops over pairs; tolerable only for small inputs.
- $O(2^n)$ — exponential: brute-force enumeration of subsets; intractable beyond toy sizes.
A single operation can be slow, yet a sequence of operations can be cheap on average. Appending to a Python list is $O(1)$ amortized: most appends are a pointer write, but occasionally the underlying array doubles in capacity at cost $O(n)$. Averaged over $n$ appends, the total work is $O(n)$ — so each append is $O(1)$ on average. This is why list.append is fine in a hot loop even though it looks like it might be $O(n)$.
Big-O throws away constants, but in practice they decide how big $n$ has to be before the asymptotics kick in. Quicksort and mergesort are both $O(n \log n)$, but quicksort's constant is smaller and its memory access is friendlier, which is why Python's sorted uses an adaptive variant (Timsort) that is essentially mergesort with optimizations for partly sorted data. NumPy's array addition runs at roughly the speed of memory — billions of operations per second — while a Python loop over the same array runs at roughly ten million. The constants differ by a factor of a hundred; the asymptotic complexity is the same.
Write code for correctness, then profile. If a hot loop turns out to be $O(n^2)$ on an input of size $10^5$, the fix is almost always a data structure that converts it to $O(n \log n)$ — a sort, a heap, or a hash set. Cache, batching, and vectorization help with the constant; only algorithmic change helps with the exponent.
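As a concrete illustration, here is a hypothetical deduplication helper in both forms; the only change is converting a list membership test into a set, which turns the loop from quadratic to linear:

```python
def dedupe_slow(events, seen_ids):
    # O(n * m): each `in` test scans the whole seen_ids list
    return [e for e in events if e not in seen_ids]

def dedupe_fast(events, seen_ids):
    # O(n + m): build a hash set once, then O(1) membership per event
    seen = set(seen_ids)
    return [e for e in events if e not in seen]

events = list(range(10_000))
seen = list(range(0, 10_000, 2))            # every even ID already seen
fresh = dedupe_fast(events, seen)
assert fresh == list(range(1, 10_000, 2))   # only the odd IDs survive
```

On ten thousand events the slow version already does tens of millions of comparisons; the fast version does one pass.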
Python gives you a handful of built-in data structures, each with distinct performance characteristics. Ninety percent of the algorithmic wins in day-to-day ML code come from picking the right container — and from not reaching for a sophisticated data structure where a humble set would do.
list, tuple, set, and dict cover the majority of data-manipulation tasks. Their operating costs are worth memorizing:
| Container | Access / lookup | Insert / delete | Ordered? | Hashable? |
|---|---|---|---|---|
list | $O(1)$ by index | $O(1)$ append / $O(n)$ insert | yes | no |
tuple | $O(1)$ by index | immutable | yes | yes (if elements are) |
set | $O(1)$ membership | $O(1)$ add / discard | no | no (but elements must be) |
dict | $O(1)$ by key | $O(1)$ | insertion-ordered | no (but keys must be) |
The "$O(1)$" entries for set and dict mean average constant time — they rely on hashing, which almost always behaves that way. We will cover when it doesn't in Section 04.
Three more from the standard library carry their weight:
- collections.deque — a double-ended queue: $O(1)$ append and pop at both ends, unlike list, which is $O(n)$ at the left. Use it for BFS queues, sliding windows, and any FIFO.
- collections.Counter — a dict subclass for counting hashable items. Counter(tokens).most_common(10) is the cleanest top-10 in the language.
- collections.defaultdict — a dict that builds missing values on access. Removes the if key not in d: d[key] = [] pattern entirely.
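A minimal sketch of all three in action (the token data is made up):

```python
from collections import deque, Counter, defaultdict

# deque: O(1) at both ends; here, a sliding window over a token stream
window = deque(maxlen=3)
for tok in ["a", "b", "c", "d"]:
    window.append(tok)                  # oldest token falls off the left
assert list(window) == ["b", "c", "d"]

# Counter: the cleanest top-k counting in the language
tokens = ["cat", "dog", "cat", "bird", "cat", "dog"]
assert Counter(tokens).most_common(2) == [("cat", 3), ("dog", 2)]

# defaultdict: group without the `if key not in d` dance
by_first_letter = defaultdict(list)
for tok in tokens:
    by_first_letter[tok[0]].append(tok)
assert by_first_letter["c"] == ["cat", "cat", "cat"]
```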
To live inside a set or as a dict key, an object must be hashable: immutable and producing a stable hash() value across its lifetime. tuple, str, bytes, frozenset, and int are hashable; list, set, and dict are not. This is why grouping by a composite key of (user_id, day) is natural and by [user_id, day] is an error.
Once numerical homogeneity enters the picture, Python's built-in list becomes expensive. A list of one million floats costs roughly 32 MB (an 8-byte pointer plus a 24-byte boxed float object per element); the equivalent numpy.ndarray costs 8 MB (one C double per element) and supports vectorized operations. Pandas' Series and DataFrame add column labels and index alignment on top. For numerical workloads, the rule is "leave list at the edges and move to NumPy as soon as you can."
Need membership tests? set. Need counts? Counter. Need lookups by key? dict. Need ordered numerical data? numpy.ndarray. Need ordered labelled tabular data? pandas.DataFrame. Need both ends fast? deque. Everything else is a list.
Every dict lookup, every set membership test, every string interning, every Parquet partition key resolves to the same primitive — a hash function that maps a variable-length object to a fixed-width integer. Hashing is the difference between $O(1)$ and $O(n)$ and between a system that scales and one that does not.
A hash function $h: \mathcal{X} \to \{0, 1\}^{64}$ should (a) be cheap to compute, (b) spread its outputs uniformly across the range, (c) behave like a random function on typical inputs. It need not be cryptographic — resistance to adversarial preimage attacks matters only when the input is attacker-controlled. Python's built-in hash() is a fast non-cryptographic hash (SipHash, chosen precisely to resist basic dict-flooding attacks).
A hash table is an array of $m$ buckets plus a hash function. To insert a key-value pair, compute $h(k) \bmod m$, place the entry in that bucket. To look up a key, do the same — if the bucket is occupied by a different key, scan forward (open addressing) or walk a linked list (separate chaining). The operation is $O(1)$ expected as long as the load factor $\alpha = n/m$ stays bounded; when it crosses a threshold (0.66 in CPython), the table resizes, rehashing all entries. That resize is a rare $O(n)$ event amortized across many $O(1)$ operations.
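A toy hash table makes the mechanics concrete. This sketch uses separate chaining rather than CPython's open addressing, but the load-factor resize logic is the same idea:

```python
class ToyHashTable:
    """Separate-chaining hash table (illustrative, not CPython's design)."""

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.size = 0

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def put(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:                      # key present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.size += 1
        if self.size / len(self.buckets) > 0.66:   # load-factor threshold
            self._resize()

    def get(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)

    def _resize(self):
        # The rare O(n) event: double capacity and rehash everything
        old = [kv for bucket in self.buckets for kv in bucket]
        self.buckets = [[] for _ in range(2 * len(self.buckets))]
        self.size = 0
        for k, v in old:
            self.put(k, v)

t = ToyHashTable()
for i in range(100):
    t.put(f"key{i}", i)
assert t.get("key42") == 42
```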
When two keys hash to the same bucket, the table must resolve the collision. CPython uses open addressing with a clever probe sequence. For well-distributed hashes, expected collisions per lookup are $O(1)$. For adversarial inputs, an attacker can craft keys that all collide, turning every insert into $O(n)$ and the whole table into $O(n^2)$ — this is the hash-DoS vulnerability, defended against by hash randomization (PYTHONHASHSEED).
The hashing trick turns a high-cardinality categorical feature into a fixed-size index without ever building a vocabulary. Given a string feature value $s$, store the entry at position $h(s) \bmod d$ of a sparse vector. The dimension $d$ controls the collision rate; for $d = 2^{20}$ and a million unique features, collisions are common, but in practice they behave like mild regularization and rarely hurt accuracy. This is how sklearn.feature_extraction.FeatureHasher turns arbitrary strings into a fixed-dimensional representation, and how production ad-ranking systems handle billions of feature values without a vocabulary lookup table.
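A minimal sketch of the hashing trick, using a stable digest from hashlib rather than the salted built-in hash() (the feature strings are hypothetical; sklearn's FeatureHasher does the same with a signed variant):

```python
import hashlib

def hash_features(tokens, d=2**20):
    """Map raw string features into a fixed-size sparse count vector."""
    vec = {}  # sparse: index -> count
    for tok in tokens:
        # stable across processes, unlike hash(), which is salted
        h = int.from_bytes(hashlib.md5(tok.encode()).digest()[:8], "little")
        idx = h % d
        vec[idx] = vec.get(idx, 0) + 1
    return vec

v = hash_features(["user_id=7", "country=DE", "user_id=7"])
assert sum(v.values()) == 3   # three raw feature occurrences land in the vector
```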
A signed version of the hashing trick gives unbiased estimates of inner products — a trick central to count sketches and random projection, both covered in Section 13.
Hash tables are the reason every dict-based operation in Python is as fast as it is. They are also the reason that a feature name typo ("user_id" vs "userId") silently produces two features that never align across systems. When you think about performance, think about which lookups in your pipeline are hash-backed and which are not.
Sorting is the most common algorithmic workhorse in a data pipeline, and one of the few where the textbook answer actually gets used. But there are at least three jobs it does, and only one of them is "put everything in order."
Any sort based on pairwise comparisons needs $\Omega(n \log n)$ comparisons in the worst case — that is an information-theoretic lower bound, not a design choice. The practical algorithms that hit this bound are mergesort (stable, guaranteed worst case), heapsort (in-place, guaranteed worst case, not stable), and quicksort (unstable, average-case $O(n \log n)$ with excellent constants). Python's sorted and list.sort use Timsort, a hybrid that detects and exploits already-sorted runs — it is $O(n)$ on nearly-sorted input and $O(n \log n)$ otherwise, and it is stable (equal elements keep their original order, which matters for deterministic tie-breaking).
If your data is integers in a bounded range, you can do better than $O(n \log n)$. Counting sort is $O(n + k)$ for $n$ elements in the range $[0, k)$; radix sort makes $U / \log_2 k$ counting-sort passes over $U$-bit integers in base-$k$ digits. NumPy's np.argsort(kind="stable") switches to radix under the hood for integer arrays. In ML, non-comparison sorts show up almost never — but when they do (say, sorting 100 million uint8 labels), the speedup is dramatic.
A common mistake is sorting a million-element array to pull out the largest ten. Finding the $k$ smallest or largest is an expected-$O(n)$ problem via quickselect: pick a pivot, partition, recurse into the side that contains the element you want. The full sort would be $O(n \log n)$, so for top-$k$ queries with $k \ll n$, quickselect wins by a factor of $\log n$.
Python's heapq.nlargest(k, iterable) and NumPy's np.partition(a, -k)[-k:] both implement order statistics directly. Use them. In ML, every time you ask "what are the top-10 predicted classes for this example?" you are calling an order-statistics routine — and if you sort the whole vector of 100,000 logits to pull out ten, you have made that step roughly $\log_2 100{,}000 \approx 17$ times slower than it needs to be.
Need a full sort? sorted or np.sort. Need the top $k$? heapq.nlargest or np.argpartition. Need the median? np.median, which is quickselect with $k = n/2$. Never sort when a partition suffices.
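A quick comparison of the routes to a top-10, on synthetic logits (the array is made up):

```python
import heapq
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(100_000)

# Full sort: O(n log n); wasteful when k << n
top10_sorted = np.sort(logits)[-10:]

# Partition: O(n); the last 10 are the largest, in no particular order
top10_part = np.partition(logits, -10)[-10:]
assert sorted(top10_part.tolist()) == top10_sorted.tolist()

# Pure-Python equivalent for arbitrary iterables (returns descending order)
top10_heap = heapq.nlargest(10, logits.tolist())
assert top10_heap[::-1] == top10_sorted.tolist()
```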
A heap is a tree-shaped array that lets you find the minimum (or maximum) element and remove it in logarithmic time. It is the right data structure whenever you are incrementally pulling "the best thing so far" from a growing set.
A min-heap is an array where each element is $\leq$ its two children at positions $2i+1$ and $2i+2$. The operations:

- push — append at the end, then sift up toward the root: $O(\log n)$.
- pop-min — swap the root with the last element, remove it, sift down: $O(\log n)$.
- peek-min — read index 0: $O(1)$.
- heapify — build a heap from an arbitrary array in $O(n)$.
Python's heapq operates directly on a list. No wrapper object, no constructor. heapq.heappush(h, x), heapq.heappop(h), and heapq.heapify(xs) are the whole API.
Whenever you hear "explore the most promising X first," a heap is underneath. Concrete cases:

- Beam search — keep the $k$ best partial hypotheses; pop the best, extend it, push the children.
- Dijkstra and A* — pop the frontier node with the smallest tentative distance.
- Top-$k$ of a stream — maintain a min-heap of size $k$; replace the root whenever a larger item arrives.
- Schedulers — pop the task with the earliest deadline or highest priority.
Need the smallest-so-far of a changing collection? Min-heap. Need the largest? Push the negated priority. Need both ends efficient? Indexed priority queue with decrease-key — Python does not have one built in, but heapdict on PyPI does, and networkx's shortest-path routines implement one internally.
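One pattern worth having in muscle memory is streaming top-$k$ with a bounded min-heap, sketched here:

```python
import heapq

def top_k_stream(stream, k):
    """Keep the k largest items of a stream: O(n log k) time, O(k) memory.
    The heap root is always the smallest survivor, so one comparison
    decides whether a new item can possibly belong in the top k."""
    heap = []
    for x in stream:
        if len(heap) < k:
            heapq.heappush(heap, x)
        elif x > heap[0]:
            heapq.heapreplace(heap, x)  # pop-min + push in one O(log k) step
    return sorted(heap, reverse=True)

assert top_k_stream([5, 1, 9, 3, 7, 8, 2], 3) == [9, 8, 7]
```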
In a textbook, "tree" usually means "balanced binary search tree." In ML, the trees you actually use are almost never search trees at all — they are decision trees, tries, heaps, B-trees inside databases, abstract syntax trees inside compilers, and metric trees inside retrieval systems. Knowing the right tree for the job is more useful than knowing the rotation rules of red-black trees.
A BST is a rooted tree in which every node's key is larger than every key in its left subtree and smaller than every key in its right subtree. Lookup, insert, and delete are $O(\log n)$ if balanced, $O(n)$ if not. Self-balancing variants — red-black, AVL — maintain balance with rotations. Python does not include one in its standard library; reach for sortedcontainers.SortedDict — an ordered mapping built on sorted lists of sublists rather than a balanced tree — when you need ordered iteration plus fast lookup.
The trees you will actually train. A decision tree splits the input space along one axis at a time, and at each leaf predicts a class or a regression value. Each split is greedy — chosen to maximize information gain (classification) or reduce variance (regression). Lookup is $O(\text{depth})$, not $O(\log n)$: a 1000-row training set produces a tree of depth $\leq 10$ typically. Random forests, gradient-boosted trees (XGBoost, LightGBM, CatBoost), and their variants are ensembles of these. Part IV covers them in depth.
A trie indexes a collection of strings by their prefixes. Each edge is a character; a leaf holds the string (or its value). Insert and lookup are $O(|\text{key}|)$ rather than $O(\log n)$, and common prefixes are shared. Tries power autocomplete, regex matching, IP-routing tables, and — central to ML — every modern tokenizer. The BPE (byte-pair encoding) algorithm in GPT-style tokenizers is essentially a trie-building procedure that merges the most frequent adjacent pairs iteratively.
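A minimal trie with the greedy longest-match lookup that tokenizers rely on (the four-entry vocabulary is made up, and "$" as an end-of-token marker is a sketch convention):

```python
class Trie:
    """Minimal character trie for longest-prefix matching."""

    def __init__(self, vocab):
        self.root = {}
        for word in vocab:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$"] = word               # end-of-token marker

    def longest_match(self, text, start=0):
        """Longest vocab entry matching text[start:], in O(match length)."""
        node, best = self.root, None
        for i in range(start, len(text)):
            ch = text[i]
            if ch not in node:
                break
            node = node[ch]
            if "$" in node:
                best = node["$"]           # remember the last full token seen
        return best

vocab = Trie(["un", "unhappy", "happy", "ness"])
assert vocab.longest_match("unhappyness") == "unhappy"
assert vocab.longest_match("unhappyness", start=7) == "ness"
```

A greedy tokenizer is just this lookup in a loop: match, advance past the match, repeat.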
Heaps (Section 06) are also trees, stored implicitly in an array. The tree structure is what makes $O(\log n)$ operations possible; the array layout is what makes them fast.
kd-trees, ball trees, cover trees, and vantage-point trees index points in a metric space for fast nearest-neighbor queries. These are the engines beneath sklearn.neighbors, and they will get full treatment in Section 11.
A graph is a set of nodes and a set of edges between them. Viewed that way, almost every ML system has graphs inside it — computational graphs for autodiff, dependency graphs for data pipelines, knowledge graphs for retrieval, social graphs for recommenders, computation graphs for model parallelism.
Two canonical ways to store a graph:
- Adjacency list — a dict mapping each node to a list of its neighbors. Memory is $O(V + E)$ — proportional to the actual edges. Iteration over a node's neighbors is immediate. The default for sparse graphs.
- Adjacency matrix — a $V \times V$ array with a nonzero entry per edge. $O(V^2)$ memory but constant-time edge tests and vectorized operations; in scipy.sparse format — best of both worlds.

The two canonical graph traversals:
- BFS (breadth-first search) — explore level by level with a FIFO queue (collections.deque). Finds shortest paths in unweighted graphs. $O(V + E)$.
- DFS (depth-first search) — follow one branch to its end before backtracking, via recursion or an explicit stack. The backbone of cycle detection and topological sort. $O(V + E)$.

Topological sort produces an ordering of the nodes of a directed acyclic graph (DAG) such that every edge goes from earlier to later. It is the data-pipeline scheduler's core routine — Airflow, Dagster, and Prefect all topologically sort their task graphs at every run. It is also how PyTorch's autograd decides the order in which to apply backward ops.
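Topological sort is short enough to write from memory. Here is a sketch of Kahn's algorithm over a hypothetical four-task pipeline:

```python
from collections import deque

def topo_sort(graph):
    """Kahn's algorithm: repeatedly emit a node with no remaining in-edges.
    `graph` is an adjacency list {node: [successors]}. O(V + E)."""
    indegree = {u: 0 for u in graph}
    for u in graph:
        for v in graph[u]:
            indegree[v] = indegree.get(v, 0) + 1
    ready = deque(u for u, d in indegree.items() if d == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in graph.get(u, []):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) < len(indegree):
        raise ValueError("graph has a cycle")
    return order

# Hypothetical pipeline: ingest -> clean -> {featurize, validate}
dag = {"ingest": ["clean"], "clean": ["featurize", "validate"],
       "featurize": [], "validate": []}
order = topo_sort(dag)
assert order.index("ingest") < order.index("clean") < order.index("featurize")
```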
Dijkstra's algorithm computes single-source shortest paths in weighted graphs with non-negative edge weights in $O((V + E) \log V)$ using a min-heap. Bellman–Ford handles negative weights at $O(VE)$. A* extends Dijkstra with a heuristic — if the heuristic is admissible, A* still finds the optimum while visiting fewer nodes. Path-finding in robotics, routing, and game AI is almost always one of these three.
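Dijkstra's algorithm is a dozen lines around heapq. This sketch uses the lazy-deletion variant, where stale heap entries are skipped rather than decreased in place:

```python
import heapq

def dijkstra(graph, source):
    """Single-source shortest paths with a min-heap of (distance, node).
    `graph` maps node -> list of (neighbor, weight); weights must be >= 0."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # stale entry: a shorter path won already
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

g = {"a": [("b", 1), ("c", 4)], "b": [("c", 2)], "c": []}
assert dijkstra(g, "a") == {"a": 0, "b": 1, "c": 3}
```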
Modern ML treats graphs as objects of deep learning. A graph neural network applies a learned function to each node that depends on its neighbors — a message-passing update
$$h_v^{(k+1)} = \text{UPDATE}\!\left(h_v^{(k)},\; \text{AGGREGATE}\!\left(\{\,h_u^{(k)} : u \in \mathcal{N}(v)\,\}\right)\right).$$
Inside PyTorch Geometric and DGL, this computes as a sparse matrix multiplication — the graph's adjacency list is the sparse matrix. Section 05 of the Scientific Computing chapter is load-bearing for GNN performance.
Small graph, quick exploration: networkx (Python-pure, pedagogically clear, slow). Medium-size (millions of edges): igraph or graph-tool (C backends). Massive, or you need autodiff: PyTorch Geometric / DGL / JAX-based GNN libraries.
Dynamic programming is the technique of solving a problem by solving smaller subproblems first, caching their answers, and composing them into the full solution. It is not the elegant recursion of computer-science textbooks so much as the bread-and-butter of sequence models, alignment algorithms, and any ML problem with a Markov structure.
Two ingredients make a problem DP-friendly:

- Optimal substructure — an optimal solution is composed of optimal solutions to its subproblems.
- Overlapping subproblems — the same subproblems recur many times, so caching their answers pays off.
The implementation is either top-down — a recursive function with @functools.lru_cache — or bottom-up, an iteration over an explicit table. Bottom-up is usually faster and uses less stack; top-down is often shorter to write.
Three DP algorithms show up in ML every week:

- The Viterbi algorithm — the most probable hidden-state path of an HMM or CRF, a max-product DP over (position, state).
- The forward algorithm — the marginal likelihood of a sequence, the same table with max replaced by sum.
- CTC loss — the probability of a label sequence summed over all monotonic alignments, a forward-style DP over (time step, label position).
Two more that work the same way: edit distance (Levenshtein) is a DP over (source-prefix, target-prefix) pairs; Smith–Waterman and Needleman–Wunsch are DPs over (sequence-A-position, sequence-B-position) for biological sequence alignment.
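Edit distance shows the DP shape in its purest form. A sketch with the usual two-row memory optimization:

```python
def edit_distance(a, b):
    """Levenshtein distance: bottom-up DP over (prefix of a, prefix of b).
    Keeps only the previous row, so memory is O(len(b)); time O(len(a)*len(b))."""
    prev = list(range(len(b) + 1))            # cost of aligning "" with b[:j]
    for i, ca in enumerate(a, 1):
        curr = [i]                            # cost of aligning a[:i] with ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,                  # delete ca
                curr[j - 1] + 1,              # insert cb
                prev[j - 1] + (ca != cb),     # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]

assert edit_distance("kitten", "sitting") == 3
assert edit_distance("", "abc") == 3
```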
Any model whose likelihood factorizes along a chain — HMMs, CRFs, autoregressive language models, RNNs — admits an exact inference algorithm that is a DP. The task determines which DP to run: the forward algorithm for likelihood, the Viterbi algorithm for decoding, the Baum–Welch algorithm for learning. Transformers broke this pattern by replacing the chain with full attention, but decoding at inference still leans on DP-flavored search — beam search over the output space.
Whenever a sequence model gives you an exact answer about a probability over an exponentially large set (most-probable path, marginal likelihood, best alignment), the engine underneath is almost always a dynamic program over a two-dimensional table. Recognizing that shape is how you read an HMM paper or a speech-recognition loss in ten minutes.
Dynamic programming's older sibling. Break a problem into independent subproblems, solve each recursively, combine their answers. When the subproblems are disjoint, divide-and-conquer wins; when they overlap, DP does.
A divide-and-conquer recurrence of the form
$$T(n) = a\, T(n/b) + O(n^d)$$

— $a$ subproblems of size $n/b$ plus $O(n^d)$ work to combine — solves to

- $O(n^d)$ if $a < b^d$ (the combine step dominates),
- $O(n^d \log n)$ if $a = b^d$ (work balanced across levels),
- $O(n^{\log_b a})$ if $a > b^d$ (the recursion dominates).
Mergesort is $a = 2, b = 2, d = 1$ — the middle case, $O(n \log n)$. Binary search is $a = 1, b = 2, d = 0$ — middle case, $O(\log n)$. Strassen's matrix multiplication is $a = 7, b = 2, d = 2$ — third case, $O(n^{\log_2 7}) \approx O(n^{2.81})$, beating the naïve $O(n^3)$.
Divide-and-conquer in the wild: binary search over sorted arrays, the FFT, Strassen-style matrix multiplication, and the merge step inside every sorted() call.
Python has a default recursion limit of 1000, raisable via sys.setrecursionlimit. For deep DPs — say, a 10,000-length sequence alignment — prefer bottom-up iteration or convert to an explicit stack. Tail-call optimization is not a thing in Python, so deep recursion is a trap that should be unrolled before it bites.
When you see a function that calls itself twice with roughly half the input, it is divide-and-conquer and its complexity follows the master theorem. When you see a function that calls itself with overlapping inputs, cache it — you are doing dynamic programming and you should say so.
"Find the closest point to this one" is the central operation of retrieval, clustering, and a large share of modern ML. A linear scan is $O(n)$ per query, fine for thousands of points, untenable for billions. Spatial data structures and modern approximate-nearest-neighbor indexes reduce it to near-logarithmic — at a cost in exactness.
A kd-tree recursively partitions a $d$-dimensional space along axis-aligned hyperplanes: split on $x_1$ at its median, then each half on $x_2$, and so on. Lookup walks the tree and prunes branches whose bounding box is farther from the query than the best candidate so far. For low-dimensional problems ($d \lesssim 20$) with $n$ points, a kd-tree answers nearest-neighbor queries in $O(\log n)$ average.
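A kd-tree fits in a few dozen lines. This illustrative sketch (dict-based nodes, median splits, plane-crossing pruning) is far from production FAISS or sklearn, but the algorithm is the same:

```python
import math

def build_kdtree(points, depth=0):
    """Recursive kd-tree: split on the median along axis depth % d."""
    if not points:
        return None
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return {"point": points[mid], "axis": axis,
            "left": build_kdtree(points[:mid], depth + 1),
            "right": build_kdtree(points[mid + 1:], depth + 1)}

def nearest(node, query, best=None):
    """Descend toward the query, then prune: only cross the splitting
    plane if it is closer to the query than the best candidate so far."""
    if node is None:
        return best
    if best is None or math.dist(node["point"], query) < math.dist(best, query):
        best = node["point"]
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = (node["left"], node["right"]) if diff < 0 else (node["right"], node["left"])
    best = nearest(near, query, best)
    if abs(diff) < math.dist(best, query):   # plane may hide a closer point
        best = nearest(far, query, best)
    return best

pts = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(pts)
assert nearest(tree, (9, 2)) == (8, 1)
```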
A ball tree partitions points into nested balls (hyperspheres), handles non-axis-aligned data better, and is the default in sklearn.neighbors for general metrics. Both structures degrade in high dimensions — the curse of dimensionality — because in $\mathbb{R}^{100}$, the "bounding box" of any interesting set of points is almost all of the space.
In high dimensions the only practical path is approximation. The modern winners all have the same shape: build an index structure that routes a query to a small number of candidate points, check those candidates exactly, return the best. Sacrifice a few percent of recall for orders of magnitude in speed.
- HNSW — a layered graph of "small-world" shortcuts; a query walks greedily toward its neighborhood. High recall with modest memory.
- IVF-PQ — cluster the vectors into coarse cells, then product-quantize the residuals (the workhorse inside FAISS). The best compression at the largest scales.
- Vector databases — an ANN index wrapped in a storage and query layer: pgvector, Weaviate, Qdrant, Milvus.

A vector database is essentially an ANN index plus a storage layer. When you query a RAG system with a question, the system embeds the question into a vector, the vector index returns the $k$ nearest chunks, and the language model consumes them as context. Every part of that pipeline is covered somewhere in this compendium; the index part is here.
$d \leq 20$ and exact: kd-tree or ball tree. $d \gg 20$ and approximate is fine: HNSW (good recall, modest memory) or IVFPQ (best compression, largest scale). Up to a million vectors: FAISS on a single machine. Billions: FAISS with GPU backends, or dedicated vector DBs.
NLP runs on sequences of strings long before it runs on tensors. The algorithms that turn raw bytes into model-ready tokens are a small, sharp-edged toolkit with direct lineage to classical string processing.
A trie (Section 07) matches a prefix against a vocabulary in $O(|\text{prefix}|)$. Regex engines and early text processors lived on tries. In modern NLP, the tokenizer is a trie-like dictionary that greedily matches the longest known token at each position in the input.
BPE starts with a vocabulary of bytes and iteratively merges the most frequent adjacent pair into a new token, updating frequencies. After $K$ merges, the vocabulary has size $256 + K$. Popular because it handles out-of-vocabulary words gracefully (rare words decompose into subwords) and because the tokenizer state is just the ordered list of merges. GPT-2, GPT-3, GPT-4, LLaMA, and Claude all use variants. tiktoken and sentencepiece are the canonical implementations.
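A toy version of the BPE training loop. Real tokenizers (tiktoken, sentencepiece) work per word with byte-level fallback and careful tie-breaking; this sketch treats the text as one symbol sequence to show the merge logic:

```python
from collections import Counter

def bpe_merges(text, num_merges):
    """Repeatedly merge the most frequent adjacent symbol pair."""
    seq = list(text)                       # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merges.append(a + b)               # record the learned merge rule
        out, i = [], 0
        while i < len(seq):                # apply the merge left to right
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return merges, seq

merges, seq = bpe_merges("abababcd", 2)
assert merges == ["ab", "abab"]
assert seq == ["abab", "ab", "c", "d"]
```

The tokenizer state really is just the ordered list of merges; encoding replays them.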
WordPiece (BERT) is a close cousin of BPE: merges are chosen to maximize likelihood rather than frequency. Unigram LM (T5, mT5) starts with a large vocabulary and prunes tokens that contribute least to the corpus likelihood — effectively an EM algorithm over tokenization. All three produce similar-feeling vocabularies and all three are, algorithmically, agglomerative or divisive clustering over byte sequences.
KMP (Knuth–Morris–Pratt), Boyer–Moore, and Rabin–Karp are the classical linear-time substring-matching algorithms. They remain the right answer for log analysis, grep-like tools, and security work, but they show up rarely in modern ML pipelines. Suffix arrays and FM-indexes still anchor bioinformatics — variant calling, read mapping, genome assembly are all built on them.
Training a tokenizer: sentencepiece or tokenizers (the Hugging Face Rust implementation). Encoding text fast: tiktoken. Fuzzy string matching: rapidfuzz. Genome-scale string indexes: bwa or bowtie2 under the hood.
When a data set is too large to store exactly, trade a small, controllable error for dramatic savings in memory. A family of data structures — all built on hashing and the law of large numbers — gives you approximate membership, approximate counts, and approximate cardinality at a tiny fraction of the memory of the exact equivalent.
A Bloom filter answers "have I seen this item?" with no false negatives and a controllable false-positive rate. Store $m$ bits; use $k$ independent hash functions. To insert an item, hash it $k$ times and set those $k$ bits. To query, check all $k$ bits; if any is zero, the item is definitely absent; if all are one, it is probably present. For $n$ items, false-positive rate is
$$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k,$$

minimized at $k = (m/n) \ln 2$. Bloom filters live in BigQuery's partition pruning, Cassandra's read path, caching-proxy negative caches, and the deduplication step of many LLM pretraining pipelines.
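A minimal Bloom filter, using double hashing to derive the $k$ probe positions from one SHA-256 digest (an illustrative sketch, not a drop-in library):

```python
import hashlib
import math

class BloomFilter:
    """m-bit Bloom filter sized for a target false-positive rate."""

    def __init__(self, n_items, fpr=0.01):
        self.m = math.ceil(-n_items * math.log(fpr) / math.log(2) ** 2)
        self.k = max(1, round(self.m / n_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # double hashing: k positions from two 64-bit halves of one digest
        h = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(h[:8], "little")
        h2 = int.from_bytes(h[8:16], "little") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] >> (p % 8) & 1 for p in self._positions(item))

bf = BloomFilter(n_items=1000, fpr=0.01)
for i in range(1000):
    bf.add(f"doc-{i}")
assert all(f"doc-{i}" in bf for i in range(1000))   # no false negatives, ever
```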
A Count-Min Sketch estimates the frequency of each item in a stream using a fixed-size 2D array of counters and $d$ hash functions — one row per hash. To increment, hash the item by each function and increment the counter at each row/column. To query, return the minimum across rows. Overestimates by construction; the error is bounded by $\epsilon \cdot N$ with probability $1 - \delta$ when the sketch has dimensions $\lceil e/\epsilon \rceil \times \lceil \ln(1/\delta) \rceil$. Used for top-$k$ streaming queries, heavy-hitter detection, and any "which item appeared most?" query over a firehose of data.
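A sketch of a Count-Min Sketch, with each row's hash derived from SHA-256 (the width and depth here are illustrative choices, not the formula-derived values):

```python
import hashlib

class CountMinSketch:
    """d rows of w counters; query = min over rows, so it only overestimates."""

    def __init__(self, width=2048, depth=4):
        self.w, self.d = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cols(self, item):
        for row in range(self.d):
            h = hashlib.sha256(f"{row}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.w

    def add(self, item, count=1):
        for row, col in enumerate(self._cols(item)):
            self.table[row][col] += count

    def query(self, item):
        return min(self.table[row][col] for row, col in enumerate(self._cols(item)))

cms = CountMinSketch()
for tok in ["a"] * 100 + ["b"] * 10 + ["c"]:
    cms.add(tok)
assert cms.query("a") >= 100   # true count is a lower bound on the estimate
assert cms.query("b") >= 10
```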
HyperLogLog estimates the number of distinct elements in a stream. Hash each item; look at the leading zeros of the hash; keep a small register per hash bucket. Cardinality estimates come from the distribution of maximum leading zeros. A few kilobytes of state can estimate cardinalities of $10^9$ with relative error around 2%. Redis, BigQuery, Presto, and almost every analytic database implement HLL as a built-in.
A final cousin: given a stream of unknown length, sample $k$ items uniformly at random with one pass and $O(k)$ memory. Keep the first $k$ in a reservoir; for each subsequent item $i$, replace a random existing reservoir entry with probability $k/i$. The resulting reservoir is exactly a uniform sample. Used in subsampling large logs, online evaluation, and anytime "give me $k$ random rows from this huge table" arises.
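The reservoir update above translates directly into code:

```python
import random

def reservoir_sample(stream, k, seed=0):
    """One-pass uniform sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= k:
            reservoir.append(item)       # fill the reservoir first
        else:
            j = rng.randrange(i)         # keep item with probability k / i
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(100_000), k=10)
assert len(sample) == 10
assert all(0 <= x < 100_000 for x in sample)
```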
Exact data structures cost memory proportional to data size. Probabilistic data structures cost memory proportional to error tolerance. For stream-scale data, that is the difference between "we can do this in a few KB" and "we can't do this at all."
The two cheapest wins in a real system are "do it once and remember" and "do it once and forget." Caching covers the first; streaming covers the second. Both are what separate a pipeline that fits on one laptop from one that needs a cluster.
Memoization stores the return value of a function keyed by its arguments so repeated calls are free. functools.lru_cache and the newer functools.cache are the Python decorators. LRU ("least-recently-used") bounds the cache size by evicting the entry unused for the longest. The data structure underneath is a doubly-linked list plus a hash map — $O(1)$ hit, miss, and eviction. It is the default caching strategy for everything from CPU TLBs to GPT-style KV caches to HTTP proxies.
Modern ML systems are stacks of caches. In LLM inference, the KV cache stores attention keys and values for already-processed tokens so each new token costs $O(\text{seq len})$ rather than $O(\text{seq len}^2)$. In retrieval, a vector index is a cache over embedding computations. In training pipelines, a feature store is a cache over feature engineering. In hyperparameter sweeps, memoizing a model's training output lets overnight experiments be replayed in seconds.
When data arrives too fast to store, the only option is a single pass with bounded memory. The probabilistic structures of Section 13 are streaming algorithms. So are Welford's online mean and variance, exponentially weighted moving averages, and the reservoir sampler. Welford's algorithm maintains $\mu_n$ and $M_{2,n} = \sum_i (x_i - \mu_n)^2$ by
$$\mu_n = \mu_{n-1} + \frac{x_n - \mu_{n-1}}{n}, \qquad M_{2,n} = M_{2,n-1} + (x_n - \mu_{n-1})(x_n - \mu_n).$$

Numerically stable, single-pass, bounded memory — the template for any streaming-statistics routine.
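Welford's update translates directly into code:

```python
class RunningStats:
    """Welford's one-pass mean/variance: numerically stable, O(1) memory."""

    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # uses both the old and new mean

    @property
    def variance(self):
        # sample variance; 0.0 until there are at least two observations
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

rs = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    rs.update(x)
assert abs(rs.mean - 5.0) < 1e-12
assert abs(rs.variance - 32 / 7) < 1e-9
```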
Same idea in slower motion. When data is too large for RAM but fits on disk, structure the algorithm so each pass reads sequentially — external mergesort, chunked aggregations in pandas / polars / dask, and disk-backed key-value stores like RocksDB. The bottleneck is bandwidth, and algorithms that randomly access disk are orders of magnitude slower than those that scan.
Repeated identical work: @functools.lru_cache. Data larger than memory but smaller than disk: stream with polars / dask / pyarrow chunks. Data larger than disk: distributed (ray, spark) plus probabilistic approximations where exact answers are not needed. Every time you catch yourself computing the same thing twice, ask whether caching or streaming changes the cost entirely.
Every non-trivial ML system is an orchestration of the data structures and algorithms in this chapter. Here is the map — one line per system, pointing at the primitives doing the work underneath.
- Tokenization — when you call tiktoken.encode, a trie of a hundred thousand merges is being traversed.
- Beam search — a heap of partial hypotheses, always popping the most promising.
- Feature stores — hash tables from entity keys to feature vectors.
- Retrieval — a metric tree or ANN index under every nearest-neighbor query.
- Pretraining deduplication — Bloom filters over a firehose of documents.
- LLM inference — an LRU-flavored KV cache that keeps decoding linear per token.
- Pipeline orchestration — topological sort of a task DAG, the same routine as make.
Recognizing these primitives in a research codebase is what separates reading a paper from understanding it. When you open transformers or vllm or faiss or fairseq, what you are reading is not the mathematics of attention or the geometry of embedding spaces — it is a careful composition of the data structures in this chapter, arranged to push each expensive operation onto a cheaper one.
The algorithms literature is one of the oldest and most settled in computer science, but only a small fraction of it matters day-to-day for an ML practitioner. The list below is opinionated: one or two foundational textbooks, a few focused treatments of the pieces this chapter could only sketch, and the community's preferred entry points for the probabilistic and approximate-nearest-neighbour material that is genuinely live.
- The Python standard library reference: heapq for priority queues and top-$k$; bisect for sorted-list insertion; collections for deque, Counter, defaultdict, OrderedDict; functools for lru_cache. Under two thousand words of documentation buys years of productivity.
- Source-first reading: tokenizers is the reference BPE/WordPiece/Unigram implementation; faiss is the reference ANN index. Both have readable source worth browsing.

This page is the third chapter of Part II: Programming & Software Engineering. Up next: software-engineering principles, databases and SQL, and version control. Then Part III — data engineering — and Part IV, where every classical model you build will be a composition of the data structures in this chapter.