Recommender Systems: predicting what you'll want next.
Recommender systems are arguably the most economically consequential application of machine learning. Netflix's recommendations drive most of what gets watched; Spotify's drive most of what gets listened to; Amazon's, TikTok's, and YouTube's recommenders shape what people buy, hear, and see at planetary scale. The methodology has evolved through three waves — collaborative filtering and matrix factorisation in the 2000s, deep-learning-based architectures in the late 2010s, sequence models and foundation-model-flavoured systems in the 2020s — and modern production systems blend all three. This chapter develops the recommendation problem, the major model families (collaborative filtering, content-based, matrix factorisation, deep models, sequence models), the engineering of two-stage retrieval-and-ranking pipelines, and the deployment realities of working on systems where every percentage point of click-through translates into hundreds of millions of dollars.
Prerequisites & orientation
This chapter assumes basic linear algebra (Part I Ch 01), supervised learning fundamentals (Part IV Ch 01–02), neural-network basics (Part V Ch 01–02), and familiarity with embeddings (Part VI Ch 03). The matrix-factorisation material in Section 4 connects directly to dimensionality-reduction methods (Part IV Ch 05). Sequence models in Section 6 build on transformer architecture (Part VI Ch 02). The reinforcement-learning framing in Section 9 is helpful but not required — the chapter develops what is needed for bandit-style and policy-gradient recommendation. No prior exposure to recommender-system literature is assumed.
Two threads run through the chapter. The first is the retrieval-and-ranking architecture: at the scale of millions of items and billions of users, no single model can score every (user, item) pair, so production recommenders use a two-stage pipeline — a fast retriever narrows the candidate set from millions to hundreds, and a heavier ranker scores those hundreds precisely. Almost every modern recommender uses this pattern, and the methodology of the chapter divides accordingly. The second thread is the feedback loop: recommender systems generate the data they are then trained on, which creates exposure bias, popularity feedback, and filter-bubble effects that are central to both the methodology and the ethics of the field. Sections 7 and 9 develop these.
01
Why Recommender Systems Are Their Own Field
Recommendation looks like ranking, which looks like classification, which looks like standard supervised learning — until you actually try to deploy a recommender system and discover that none of the standard machinery transfers cleanly. The problem has its own scale, its own data structure, its own evaluation pitfalls, and its own ethical complications. Understanding what makes recommendation different is the precondition for everything else in the chapter.
The defining problem
The recommendation problem in its most general form: given a user, a context (time of day, device, location, recent activity), and a catalogue of items, return a small ordered list of items the user is likely to engage with. The ranking is personalised — different users get different lists — and is updated continuously as user preferences and item availability change. Each engagement (or non-engagement) becomes a new training signal for the next iteration.
Phrased like this, recommendation sounds like a personalised ranking problem. The complications come from scale (item catalogues of hundreds of millions, user bases of billions, predictions per second in the millions), from the structure of the data (extremely sparse user-item interactions; for any given user, the vast majority of items are unrated), and from the feedback loop (the system trains on data it generates, which biases the data toward items the system already favours).
Sparsity and the cold-start problem
The standard recommendation dataset is a sparse matrix — users on rows, items on columns, ratings or interactions in cells, almost everything missing. The Netflix Prize dataset had 100 million ratings across 480,000 users and 17,770 movies — which sounds large until you realise that is roughly 1.2% density, with the other 98.8% of the matrix blank. Modern industrial recommendation matrices are far sparser: TikTok has over a billion users and billions of videos, with a typical user having seen perhaps a thousand of them.
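The density figure follows directly from the numbers above and is worth checking by hand:

```python
# Density of the Netflix Prize rating matrix: observed ratings divided by
# the total number of (user, movie) cells in the matrix.
ratings = 100_000_000
users, movies = 480_000, 17_770
density = ratings / (users * movies)   # ≈ 0.0117, i.e. roughly 1.2% observed
```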
Worse, every new user starts with zero history (the cold-start problem for users) and every new item starts with zero ratings (the cold-start for items). A pure collaborative-filtering recommender has nothing to say about either case until interactions accumulate. The standard fixes — incorporating side information about users and items, using popularity-based bootstrapping, asking users explicit questions — are the substance of Sections 3 and 5, but the structural problem cannot be eliminated.
The feedback loop
A recommender's data comes from its own decisions. Users see what the recommender showed them; the recommender trains on what users engaged with; the next round shows more of what got engaged with. This is the exposure bias at the heart of recommendation, and it has several consequences. Popularity bias: popular items get shown more, get rated more, and dominate the training data, which makes them get shown more. Filter bubbles: users get progressively narrower recommendations because the system never sees signal from items it didn't show. Survivorship in evaluation: a model evaluated on logged data is evaluated only on items the previous model chose, which introduces bias into the comparison. Sections 8 and 9 develop the methodology for fighting these — counterfactual evaluation, exploration, and inverse-propensity weighting.
Engagement is not satisfaction
The most-measured signal in recommendation is engagement: did the user click, watch, listen, buy. The most-wanted property is usually something else: did the user enjoy the experience, find what they were looking for, want to come back tomorrow. The two correlate but diverge — engagement-maximising recommenders learn to favour engagement bait (rage-bait, addiction-shaped content, low-quality items with strong hooks) that increases clicks but degrades satisfaction. The 2024 generation of production recommenders explicitly trains on "satisfaction" signals (post-engagement surveys, return rates, app-uninstall risk) alongside engagement, with the trade-off mediated by reinforcement-learning-style optimisation that Section 9 develops.
Why Recommendation Is High-Stakes
Modern recommender systems are responsible for a substantial fraction of human attention and a large fraction of the world's commerce. Netflix has reported that its recommendation system saves the company over a billion dollars a year by reducing churn. YouTube's recommendation system delivers ~70% of all video views. Amazon attributes ~35% of sales to recommendations. The methodology of this chapter is not a research curiosity; it is the substrate of the modern attention economy, with all of the technical and ethical complications that implies.
02
Collaborative Filtering: The Foundation
The original idea of automated recommendation, and still the conceptual foundation: predict what a user will like based on what similar users have liked. Collaborative filtering formalises this — find users similar to the target user, average their preferences, recommend the items they liked that the target user has not seen. The methodology is simple, the intuitions transfer to almost every modern method, and a properly-tuned neighbourhood-based collaborative filter is still a strong baseline that more elaborate methods need to beat.
User-user collaborative filtering
The classical setup, due to GroupLens (Resnick et al. 1994): represent each user as a vector of their ratings (with most entries missing); compute similarity between user pairs as cosine similarity or Pearson correlation over their commonly-rated items; for each candidate item the target user has not rated, predict their rating as a similarity-weighted average of similar users' ratings on that item.
User-user CF prediction
r̂u,i = r̄u + Σv∈N(u) sim(u, v) · (rv,i − r̄v) / Σv∈N(u) |sim(u, v)|
N(u) is the set of users similar to u who have also rated item i; r̄u is u's average rating (the mean-centring corrects for users who rate generously vs. harshly); sim(u, v) is the cosine or Pearson similarity computed over commonly-rated items. The formula's intuition: predict u's rating on i by combining the deviations from average that similar users showed on i.
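The prediction rule can be sketched in a few lines — the neighbour similarities and ratings below are illustrative numbers, not taken from any dataset:

```python
def predict_user_user(target_mean, neighbours):
    """Mean-centred user-user CF prediction for one (user, item) pair.

    neighbours: list of (similarity, neighbour_rating_on_item, neighbour_mean)
    tuples for the users in N(u) who rated the item.
    """
    num = sum(s * (r - m) for s, r, m in neighbours)
    den = sum(abs(s) for s, _, _ in neighbours)
    if den == 0:                     # no usable neighbours: fall back to u's mean
        return target_mean
    return target_mean + num / den

# Illustrative numbers: u averages 3.5 stars; two similar users rated the
# item one star above their own averages, so the prediction lands above 3.5.
pred = predict_user_user(3.5, [(0.9, 5.0, 4.0), (0.6, 4.0, 3.0)])
```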
User-user CF works but has obvious computational problems at scale. Computing similarity between every pair of users is O(N²), prohibitive when N is in the hundreds of millions. The standard fixes — clustering, locality-sensitive hashing for nearest-neighbour search — make the method work at moderate scale. For tens of thousands of users it remains practical and surprisingly effective.
Item-item collaborative filtering
A small change of perspective produces a method that scales much better. Item-item collaborative filtering (Sarwar et al. 2001, Linden et al. 2003 at Amazon) computes similarity between items rather than users. Two items are similar if the users who rated one tend to rate the other similarly. To recommend for a user, find items similar to those the user has already rated highly, weight by similarity, and recommend the top.
Item-item has two practical advantages. First, the number of items is typically smaller and changes more slowly than the number of users, so the item-item similarity matrix is more stable and computable offline. Second, item-item recommendations have a clean interpretation — "people who bought X also bought Y" — that user-user does not. Amazon's deployment of item-item CF in 2003 was the first widely-publicised production recommender at internet scale and remains conceptually influential.
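A minimal item-item sketch, assuming a small dense rating matrix (production systems compute the similarity matrix offline over sparse data):

```python
import numpy as np

def item_item_recommend(R, user_ratings, top_n=2):
    """Item-item CF sketch. R is a small dense user-by-item rating matrix
    (0 = unrated); user_ratings is the target user's row."""
    # Cosine similarity between item columns (precomputed offline in practice).
    norms = np.linalg.norm(R, axis=0, keepdims=True)
    norms[norms == 0] = 1.0
    A = R / norms
    S = A.T @ A                              # item-by-item similarity
    np.fill_diagonal(S, 0.0)                 # an item should not recommend itself
    scores = S @ user_ratings                # weight similarities by user's ratings
    scores[user_ratings > 0] = -np.inf       # do not re-recommend seen items
    return np.argsort(scores)[::-1][:top_n]

# Toy matrix: items 0 and 1 are co-rated similarly by the same users, so a
# user who rated item 0 highly should get item 1 recommended.
R = np.array([[5., 4., 0.],
              [4., 5., 1.],
              [0., 0., 5.]])
user = np.array([5., 0., 0.])
recs = item_item_recommend(R, user, top_n=1)
```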
Implicit versus explicit feedback
Classical CF assumes explicit feedback — users provide numerical ratings (1–5 stars). Most modern recommendation systems work with implicit feedback: users click, watch, dwell, buy, or skip. Implicit feedback is much more abundant but has different statistics — there is no "negative" rating, only the absence of a positive interaction, which could mean dislike or just "didn't see it." Adapting CF to implicit feedback is mostly about choosing the right loss function (typically a ranking loss like BPR, Section 4) and the right negative-sampling strategy.
Why CF still matters
Pure neighbourhood-based collaborative filtering is no longer state-of-the-art on academic benchmarks — matrix factorisation and deep models beat it. But it is still the right starting baseline for any new recommendation problem because it is simple to implement, easy to debug, and shockingly competitive in many real-world settings. The 2026 production trend is toward hybrid systems where CF features (item-item similarities, user neighbour graphs) are inputs to richer downstream models. Understanding CF is the precondition for understanding what later methods buy you.
03
Content-Based and Hybrid Methods
Collaborative filtering ignores the items themselves — it knows only that user A liked item X and user B also liked X, not what X actually is. This works fine when there is plenty of interaction data but fails on cold-start. Content-based methods use features of the items (and sometimes of the users) directly, and the modern production stack almost always combines content and collaborative signals into a hybrid recommender.
The content-based recipe
Represent each item as a feature vector — for movies, perhaps a TF-IDF representation of the description plus genre tags plus director and cast embeddings; for products, the title, category, attributes, and image features; for music, audio features plus genre tags plus collaborative-style co-listening signals. Represent each user as some aggregation of the items they have engaged with — a centroid of liked-item vectors, plus separate vectors for explicit preferences (genre, language). Score candidate items by similarity (cosine, dot product) to the user vector.
The classical formulation due to Pazzani and Billsus (2007) used TF-IDF text features and a per-user logistic-regression classifier trained on the user's positive and negative interactions. The modern equivalent uses pretrained embeddings (BERT, CLIP, audio encoders) for the item features and either a centroid-style user model or a learned user encoder. The recipe is the same — featurise the items, build user representations, score by similarity — only the featurisation has gotten more sophisticated.
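A centroid-style sketch of the recipe — the item feature vectors here are toy genre indicators, standing in for TF-IDF or pretrained embeddings:

```python
import numpy as np

def content_scores(item_vecs, liked_idx):
    """Content-based scoring sketch: the user profile is the centroid of
    liked-item feature vectors; candidates are scored by cosine similarity."""
    profile = item_vecs[liked_idx].mean(axis=0)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.array([cos(profile, v) for v in item_vecs])

# Toy features (e.g. genre indicators): items 0–1 are action-ish, item 2 is
# a documentary. A user who liked items 0 and 1 scores item 2 lowest —
# the "more of the same" behaviour discussed below.
items = np.array([[1.0, 0.9, 0.0],
                  [0.9, 1.0, 0.0],
                  [0.0, 0.1, 1.0]])
scores = content_scores(items, liked_idx=[0, 1])
```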
Why content-based handles cold start
The cold-start advantage is structural: a content-based recommender can score items it has never seen interactions for, as long as it has features for them. Add a new movie to the catalogue with title, description, genre, and director — the recommender immediately has a vector to match against user preferences. The same logic applies to new users: ask a new user a few onboarding questions about their preferences, build a partial user vector, recommend items similar to the stated preferences. Pure collaborative filtering can do neither.
The drawback is that content-based recommendations tend to be more of the same. If a user has watched five action movies, the content-based recommender will keep recommending action movies — it has no signal that this user might also enjoy a documentary about marine biology unless the system explicitly diversifies. Collaborative filtering produces unexpected recommendations because it borrows from neighbours' preferences across categories; content-based does not.
Hybrid recommenders
Hybrid recommenders combine collaborative and content signals. Several integration patterns dominate:
Feature combination: include both content-based features (item attributes, user demographics) and collaborative-style features (user neighbours, item similarities) as inputs to a single ranker. This is the dominant 2026 production pattern — modern deep recommenders take both types of features and let the network learn how to combine them.
Switching: use content-based when the user has little history (cold start), switch to collaborative once enough interactions accumulate. Operationally clean but loses the advantages of either method outside its regime.
Weighted ensemble: produce two ranked lists (one collaborative, one content-based) and combine via weighted average of scores or rank fusion. Useful as a fallback architecture but rarely the strongest.
Cascade: use one model to retrieve candidates, another to re-rank them. The collaborative model retrieves; the content-based model breaks ties by matching user-stated preferences. This is essentially the retrieval-and-ranking pattern of Section 7.
Knowledge-graph extensions
A modern variant uses knowledge-graph features — items connected to entities (directors, genres, actors, ingredients, materials) in a knowledge graph; recommendations propagate through graph paths. Knowledge-graph recommendation is a natural application of the GNN material from Part XIII Ch 05. The advantage is interpretability and structured cold-start handling; the disadvantage is the engineering complexity of maintaining a clean knowledge graph at scale. Production deployments at Spotify, Pinterest, and Amazon all incorporate knowledge-graph features in their recommendation pipelines, though usually as one of several signals rather than the sole approach.
04
Matrix Factorisation and Latent Factors
The Netflix Prize competition (2006–2009) made matrix factorisation the dominant recommendation paradigm for a decade, and the Prize remains the field's most influential teaching example. The idea is mathematically clean and produces strong recommendations from sparse rating data alone: factor the user-item matrix into a product of low-rank user and item matrices; the resulting factors are latent embeddings that capture user preferences and item characteristics in a shared geometric space.
The factorisation idea
Let R be the (mostly missing) user-item rating matrix of size U × I. The matrix-factorisation hypothesis: R can be approximated as the product of a user matrix P (size U × k) and an item matrix Q⊤ (size k × I), where k is the latent dimensionality:
Matrix factorisation prediction
r̂u,i = pu⊤ qi + bu + bi + μ
pu ∈ ℝᵏ is user u's latent factor vector; qi ∈ ℝᵏ is item i's; bu and bi are user and item biases (correcting for users who rate harshly and items that are universally loved or hated); μ is the global mean. The factor dimensionality k is typically 50–200. Training fits all these parameters by minimising squared error on observed ratings, plus L2 regularisation to prevent overfitting on the sparse data.
The training objective is straightforward — squared error plus regularisation, summed over observed ratings. Two algorithms dominate optimisation. Alternating Least Squares (ALS) alternates between fixing the item factors and solving for user factors in closed form, then vice versa; the closed-form per-user solution makes ALS easy to parallelise across users. Stochastic Gradient Descent on the same objective is competitive and easier to extend to non-quadratic losses.
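A minimal SGD sketch of the training loop, with the biases and μ omitted for brevity (the full model in the formula above adds them back); the ratings and hyperparameters are illustrative:

```python
import numpy as np

def train_mf(ratings, n_users, n_items, k=8, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """SGD matrix-factorisation sketch (bias terms omitted).
    ratings: list of (user, item, rating) triples — the observed cells only."""
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_users, k))   # user factors
    Q = 0.1 * rng.standard_normal((n_items, k))   # item factors
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]                 # residual on this observed cell
            pu = P[u].copy()                      # pre-update copy for Q's gradient
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * pu - reg * Q[i])
    return P, Q

# Tiny explicit-feedback example: after training, predictions on the
# observed cells should sit close to the observed ratings.
obs = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 2.0)]
P, Q = train_mf(obs, n_users=2, n_items=2)
```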
The Netflix Prize lesson
The Netflix Prize was won in 2009 by an ensemble called BellKor's Pragmatic Chaos that combined hundreds of models, but the workhorse of the winning ensemble was a sophisticated matrix factorisation called SVD++ (Koren 2008) that incorporated implicit feedback (which items a user rated, regardless of the rating value) into the user factors. The lesson the field took from the prize: matrix factorisation with carefully-chosen extensions handles the bulk of the signal in explicit-rating recommendation; ensembling on top adds marginal value at substantial complexity cost. Production deployments since have mostly favoured a single well-tuned factorisation over an ensemble.
Bayesian Personalized Ranking for implicit feedback
The squared-error matrix factorisation assumes explicit ratings and treats missing entries as missing data. For implicit feedback (clicks, views), missing entries are not "missing" — they are mostly negative. Bayesian Personalized Ranking (BPR, Rendle et al. 2009) reframes the problem: for each (user, observed-positive item, sampled-negative item) triple, train the model to score the positive higher than the negative. The training objective is a pairwise ranking loss rather than a pointwise rating regression.
Bayesian Personalized Ranking objective
ℒBPR = − Σ(u, i⁺, i⁻) log σ( r̂u,i⁺ − r̂u,i⁻ )
For each user u, sample a positive item i⁺ they interacted with and a negative item i⁻ they did not; the loss pushes r̂u,i⁺ above r̂u,i⁻ via a logistic-style ranking objective. BPR became the dominant training objective for implicit-feedback matrix factorisation and influenced essentially every subsequent ranking-based recommender.
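A sketch of one BPR update, assuming matrix-factorisation scoring (r̂u,i = pu⊤ qi); real training samples a fresh negative each step rather than reusing one triple as this toy loop does:

```python
import numpy as np

def bpr_step(P, Q, u, i_pos, i_neg, lr=0.05, reg=0.01):
    """One BPR SGD step on a (user, positive item, negative item) triple:
    raise the positive item's score above the negative item's."""
    diff = Q[i_pos] - Q[i_neg]
    x = P[u] @ diff                      # score difference r̂(u,i+) − r̂(u,i−)
    g = 1.0 / (1.0 + np.exp(x))          # σ(−x): gradient factor of −log σ(x)
    pu = P[u].copy()                     # use pre-update values in all gradients
    P[u]     += lr * (g * diff - reg * P[u])
    Q[i_pos] += lr * (g * pu - reg * Q[i_pos])
    Q[i_neg] += lr * (-g * pu - reg * Q[i_neg])

# Toy factors: one user, three items, k = 4, random initialisation.
rng = np.random.default_rng(0)
P = 0.1 * rng.standard_normal((1, 4))
Q = 0.1 * rng.standard_normal((3, 4))
for _ in range(300):
    bpr_step(P, Q, u=0, i_pos=1, i_neg=2)
```

After training, item 1 outscores item 2 for user 0, which is all BPR asks for — the absolute score values are not calibrated ratings.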
The latent-space picture
The geometric interpretation of matrix factorisation is the conceptual takeaway. Each user lives in a k-dimensional embedding space; each item lives in the same space; the predicted rating is the inner product. Users with similar tastes end up near each other; items with similar audiences end up near each other; predictions amount to "match users to items in this shared space." This picture transfers directly to deep recommenders (Section 5) which generalise the inner product to a learned scoring function but keep the embedding-space framing.
05
Deep Recommenders and Two-Tower Models
Matrix factorisation handles the basic CF signal well but cannot easily incorporate side features (user demographics, item content, context) and is limited to bilinear scoring. The deep-learning era brought architectures that handle arbitrary features and learn richer scoring functions, while preserving the embedding-space picture that makes recommendation systems engineerable at scale.
Neural Collaborative Filtering
The most direct extension of matrix factorisation to neural networks is Neural Collaborative Filtering (NCF, He et al. 2017). The user-item score is a multi-layer perceptron applied to the concatenation of user and item embeddings, replacing the bilinear inner product with a learned non-linear function. NCF demonstrates that neural networks can match or beat matrix factorisation on the same data, especially when feature interactions are non-linear.
NCF's lesson is real but its specific architecture has been mostly superseded. Subsequent work showed that a well-tuned matrix factorisation often beats NCF on standard benchmarks (Rendle et al. 2020), and that the gains attributed to "deep" models often came from training tricks that also help shallow ones. The takeaway: depth alone is not the value; what neural architectures buy you is the ability to incorporate features that matrix factorisation cannot, and to compose multiple signals end-to-end.
Two-tower models
The dominant production architecture for retrieval is the two-tower model. A user tower takes user features (ID, demographics, recent history) and produces a user embedding; an item tower takes item features (ID, content, attributes) and produces an item embedding; the score is the dot product (or a small learned function) of the two embeddings. Both towers are trained end-to-end on engagement data, typically with a sampled-softmax loss that scales to enormous item catalogues.
The architecture's power is its scalability. At inference time, the item tower can be applied offline to every item in the catalogue, producing a static embedding index. At query time, the user tower runs in real time to produce the user embedding, and an approximate-nearest-neighbour search retrieves the top-K nearest items. This makes the two-tower the standard architecture for the candidate-generation stage of production recommenders (Section 7), and Google, Pinterest, Spotify, and most other major recommendation systems use some variant.
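A sketch of the offline/online split, with fixed random projections standing in for the trained towers and brute-force dot products standing in for the ANN index:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the trained towers: in production these are neural networks;
# here each "tower" is a fixed random projection plus L2 normalisation.
W_user = rng.standard_normal((8, 4))
W_item = rng.standard_normal((8, 6))

def user_tower(u):
    z = W_user @ u
    return z / np.linalg.norm(z)

def item_tower(x):
    z = W_item @ x
    return z / np.linalg.norm(z)

# Offline: embed the whole catalogue once to build the (ANN-indexable) table.
catalogue = rng.standard_normal((1000, 6))
index = np.stack([item_tower(x) for x in catalogue])

def retrieve(u_feats, k=100):
    """Online: embed the user, take the top-k dot products.
    Brute force here; production uses FAISS / ScaNN / HNSW."""
    u = user_tower(u_feats)
    return np.argsort(index @ u)[::-1][:k]

u_feats = rng.standard_normal(4)
cands = retrieve(u_feats, k=100)
```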
Wide & Deep, DeepFM, and feature crosses
For the ranking stage where a small candidate set is scored precisely, more elaborate architectures dominate. Wide & Deep (Cheng et al. 2016) combines a "wide" linear model on cross-features (one-hot interactions of categorical features) with a "deep" MLP on dense features and embeddings; the two are jointly trained. The wide component captures memorisation (specific feature combinations that historically predict the label), the deep component captures generalisation (embedding similarities that handle novel combinations).
DeepFM (Guo et al. 2017) and the various Factorisation-Machine-style architectures generalise the wide component using factorisation-machine-style feature interactions, learnt in an embedding space rather than enumerated explicitly. DCN (Deep & Cross Network, Wang et al. 2017) provides an alternative parameterisation of feature crossing. These architectures dominate the click-through-rate-prediction component of modern advertising and recommendation pipelines.
The YouTube DNN architecture
One of the most influential public descriptions of an industrial recommender is the YouTube DNN paper (Covington et al. 2016). The architecture has two stages: a deep candidate-generation network that retrieves a few hundred videos from millions, and a deep ranking network that scores them. The ranking network takes hundreds of features (watch history, demographics, time of day, device type, video metadata, channel embeddings) and outputs an estimated watch time. This two-stage neural-network design has become the template for every major video and content recommender since.
What deep buys you
The right way to think about deep recommenders in 2026: they are not magically better than matrix factorisation on pure interaction data, but they integrate features and signals that simpler methods cannot. The features that matter are: rich user history (sequence models, Section 6), text and image content (pretrained embeddings as features), context (time, device, location, intent), and cross-feature interactions (Wide & Deep, DCN). For any production recommender with this kind of feature richness, a deep model is the right tool; for pure CF on a sparse interaction matrix, classical MF remains a strong choice.
06
Sequential and Session-Based Recommendation
User preferences are not static. The video you want to watch right after a comedy is different from the video you want to watch right after a documentary. Modern recommendation increasingly treats user history as a sequence and uses sequence models — typically transformers — to predict the next item. This sequential recommendation framing is now standard in production, especially for short-form video and music streaming where the next-item prediction problem is the recommendation problem.
The sequential recommendation problem
Reframe recommendation as sequence prediction: given a user's interaction history (i₁, i₂, …, iₜ), predict the next item iₜ₊₁. The training data is naturally autoregressive — the user's history is a sequence, and the model is trained to predict each next item given the prior context. The framing is identical to language modelling, with item IDs in place of token IDs, and the methodology has converged accordingly.
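The autoregressive construction is mechanical: every prefix of the history becomes a training example whose target is the next item, exactly as in language modelling:

```python
def next_item_examples(history):
    """Turn one user's interaction sequence into (prefix, target) training
    pairs — the same construction used for language-model training data."""
    return [(history[:t], history[t]) for t in range(1, len(history))]

# A four-item history yields three training pairs.
pairs = next_item_examples(["i1", "i2", "i3", "i4"])
```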
From RNN-based methods to transformers
The first sequential recommenders used recurrent networks. GRU4Rec (Hidasi et al. 2016) used GRUs for session-based recommendation; the natural follow-up SASRec (Kang and McAuley 2018) replaced the RNN with a self-attention transformer and substantially improved results on standard benchmarks. The transformer's parallelism and global-attention pattern matched the recommendation problem better than sequential RNN updates.
BERT4Rec (Sun et al. 2019) extended the picture with a bidirectional masked-prediction objective borrowed from BERT — randomly mask items in the sequence and train the model to predict them. The bidirectional context produces stronger embeddings than left-to-right training. Transformers4Rec (Moreira et al. 2021) provided the production-ready library that combined these methods with the standard HuggingFace ecosystem.
Time-aware and feature-rich sequence models
Vanilla sequential recommenders use only item IDs in the sequence. Production systems augment with timestamps (the gap between interactions matters), interaction types (click vs. watch vs. skip), context features (device, location), and item content (title, image, audio). The architecture is the same — a transformer over the augmented sequence — but the inputs are richer. TiSASRec incorporates time gaps explicitly; FDSA incorporates feature-level information; the various 2024 sequence-foundation-model papers extend the idea to large-scale pretraining.
Session-based recommendation
A specific sub-problem: session-based recommendation, where the model has access to a single short session (a user's last 5–20 interactions in this app open) but no longer-term user identity. This regime matters for anonymous users, privacy-respecting deployments, and fast-moving consumer applications where session-level intent dominates long-term preferences. Transformer-based session models (SASRec, BERT4Rec, the various GNN-on-session approaches) are the standard solution and consistently outperform classical "most popular" or "matrix-factorisation-with-anonymous-user" baselines.
The frontier: foundation-model recommenders
The 2024–2026 frontier extends the sequence-modelling framing to foundation-model scale. P5 (Geng et al. 2022) cast recommendation as a text-to-text problem, with item interactions formatted as natural language and a shared encoder-decoder model handling rating prediction, sequential recommendation, and explanation generation. Generative recommendation approaches treat next-item prediction as token generation — the model produces the next item ID directly, often using a hierarchical token structure that allows the catalogue to be addressed without a softmax over millions of items. The early production deployments at TikTok, Pinterest, and others suggest that foundation-model-flavoured recommenders are competitive with classical two-tower designs on the metrics that matter, with substantial room to improve as the methodology matures.
07
Retrieval and Ranking Architecture
No single neural network can score every (user, item) pair when the catalogue has hundreds of millions of items. Production recommender systems instead use a two-stage architecture: a retriever narrows the candidate set from millions to hundreds, and a ranker scores those hundreds precisely. This two-stage design is the most important architectural pattern in production recommendation, and almost every major system uses it.
The two-stage pipeline
The architecture: at request time, the user's query and context are passed to several candidate-generation modules in parallel. Each module retrieves ~100–1000 items using a fast scoring method — two-tower nearest-neighbour search, item-item collaborative filtering, content-based matching, popularity-based filters, recently-trending modules. The combined candidate set (typically 500–5000 items after deduplication) is then scored by a heavy ranking model that uses many features and complex feature interactions. The top K (typically 10–50) ranked items are returned to the user.
The fundamental trade-off the two-stage design exploits: the retriever must be fast (it has to score every item or use approximate-nearest-neighbour shortcuts) but only needs to identify a roughly-good superset; the ranker can afford to be slow per item (it scores at most a few thousand) but must be precise. This division of labour matches the reality that exact scoring is impossible at retrieval scale and unnecessary at ranking scale.
The two-stage production-recommendation pipeline. The retriever runs several lightweight modules in parallel (two-tower ANN search, item-item CF, trending / popular, content-based) to narrow the catalogue from ~10⁸ items to ~10³. The deep ranker scores those candidates precisely with hundreds of features and multi-task heads (CTR, watch time, satisfaction). A re-ranking stage applies diversity, freshness, and policy constraints to produce the final 10–50-item list. The fast-then-slow division of labour is what makes large-catalogue recommendation tractable at single-digit-millisecond latency.
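The pipeline's control flow is simple enough to sketch; the retrieval modules and ranker below are placeholder lambdas over a toy integer catalogue, not real models:

```python
def recommend(user, modules, ranker, k=10):
    """Two-stage pipeline sketch: cheap retrievers fan out, their candidates
    are deduplicated, and a heavier ranker orders the merged set."""
    candidates = set()
    for retrieve in modules:          # each module returns ~100–1000 item ids
        candidates.update(retrieve(user))
    scored = [(ranker(user, item), item) for item in candidates]
    return [item for _, item in sorted(scored, reverse=True)[:k]]

# Hypothetical modules: a "collaborative" retriever and a "trending"
# retriever with overlapping candidate sets; the toy ranker prefers small ids.
modules = [lambda u: range(0, 50), lambda u: range(40, 90)]
top = recommend(user=None, modules=modules, ranker=lambda u, i: -i, k=5)
```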
Candidate generation
Modern candidate generation uses several modules in parallel. The dominant module is the two-tower retriever, which produces dense user and item embeddings and retrieves via approximate-nearest-neighbour search (FAISS, ScaNN, HNSW) over the precomputed item embeddings. Item-based collaborative filtering retrieves items similar to those the user recently interacted with. Popularity / trending modules retrieve items that are currently in vogue. Content-based modules retrieve items similar to the user's stated preferences. The combined set covers different parts of the catalogue with different bias profiles, which is critical for diversity.
Ranking
The ranker is typically a deep neural network trained on logged user feedback. Inputs include: user features (demographics, recent history, embedding from a sequence model), item features (ID, content embeddings, popularity, freshness), context features (time, device, query intent), cross-features (engagement on similar items, dwell time on similar content), and increasingly LLM-generated features (semantic categorisation, intent extraction). The output is a multi-task prediction — click probability, watch-time prediction, conversion probability, satisfaction estimate — combined into a single ranking score via a learned aggregation.
Modern rankers at scale handle hundreds to thousands of features and produce predictions in single-digit milliseconds. The architectural patterns are diverse — DCN-V2, DeepFM, MMoE for multi-task — but the engineering constraints are similar across systems.
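One common pattern for the learned aggregation of multi-task heads is a weighted combination in log space (a weighted geometric mean); the task names and weights below are illustrative assumptions, not any real system's objective:

```python
import numpy as np

def ranking_score(preds, weights):
    """Combine multi-task head outputs into one ranking score.

    A weighted sum of log-predictions (i.e. a weighted geometric mean) is one
    common aggregation choice; in production the weights are themselves tuned
    or learned against online metrics.
    """
    return sum(w * np.log(preds[task] + 1e-9) for task, w in weights.items())

preds = {"p_click": 0.12, "expected_watch_min": 3.4, "p_satisfied": 0.71}
weights = {"p_click": 1.0, "expected_watch_min": 0.5, "p_satisfied": 2.0}
score = ranking_score(preds, weights)
```

Because the combination is monotone in each head, raising any single prediction raises the final score, which makes the trade-off between tasks explicit in the weights.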
Diversity, freshness, and re-ranking
The ranking score alone often produces lists that are too homogeneous (the same kind of content over and over) or too stale (heavy users have seen everything in their preferred niche). Production systems add a re-ranking stage on top of the ranker that diversifies and refreshes the final list. Maximal Marginal Relevance (MMR) re-ranking penalises items too similar to those already chosen. Determinantal Point Process re-ranking maximises the determinant of a similarity kernel over the selected set, which mathematically encourages diverse selections. Slot-aware re-rankers explicitly fill different content types into different positions in the list. The re-ranker's job is to enforce business and UX constraints that the pure ranker cannot.
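Greedy MMR re-ranking fits in a few lines; the λ trade-off parameter and the cosine item-similarity used here are illustrative choices:

```python
import numpy as np

def mmr_rerank(scores, item_emb, k=10, lam=0.7):
    """Greedy MMR: trade relevance against similarity to already-chosen items.

    scores:   (n,) relevance scores from the ranker
    item_emb: (n, d) unit-normalised item embeddings
    lam:      1.0 = pure relevance, 0.0 = pure diversity
    """
    sim = item_emb @ item_emb.T                # cosine similarity matrix
    chosen, remaining = [], list(range(len(scores)))
    while remaining and len(chosen) < k:
        if chosen:
            max_sim = sim[np.ix_(remaining, chosen)].max(axis=1)
        else:
            max_sim = np.zeros(len(remaining))
        mmr = lam * scores[remaining] - (1 - lam) * max_sim
        best = remaining[int(np.argmax(mmr))]
        chosen.append(best)
        remaining.remove(best)
    return chosen

# Toy usage: re-rank 50 scored items into a diversified top-10.
rng = np.random.default_rng(0)
emb = rng.standard_normal((50, 16))
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
scores = rng.random(50)
final = mmr_rerank(scores, emb, k=10)
```

Note the first pick is always the pure-relevance winner; diversity only shapes the picks after it.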
Operational realities
Production recommendation systems are as much operational systems as they are modelling problems. They must serve millions of QPS at single-digit-millisecond latency. They must be robust to model failures (any module can crash without taking down the whole pipeline). They must support continuous online learning (every user interaction becomes training data). They must integrate with content-moderation, ads, and policy-compliance systems. The methodology of this chapter supplies the substance of these systems, but engineering scale is the dominant cost — a recommender at YouTube or TikTok is at least as much an infrastructure problem as a modelling problem.
08
Evaluation: Offline, Online, and Counterfactual
Recommender evaluation has a hard structural problem: the data is biased by the system that produced it, so offline metrics measured on logged data systematically misjudge how a new model would actually perform. The standard production discipline is a layered evaluation that combines offline benchmarks (cheap, fast, biased) with online A/B testing (expensive, slow, unbiased) and counterfactual estimators that try to bridge the gap.
Offline metrics
The standard offline metrics for recommendation come in two families. Rating-prediction metrics — RMSE, MAE — measure how accurately the model predicts numerical ratings; they were dominant during the Netflix Prize era and are now mostly historical. Ranking metrics measure the quality of the produced ordered list. Hit Rate at K: did the held-out item appear in the top-K predictions? NDCG at K (Normalised Discounted Cumulative Gain): a position-discounted relevance score that rewards correct items in higher positions. Mean Reciprocal Rank: focuses on the rank of the first correct item. NDCG is the dominant offline metric as of 2026 and the right starting baseline for any new method.
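A minimal sketch of these ranking metrics under the common leave-one-out protocol, where each user has a single held-out item (with one relevant item the ideal DCG is 1, so NDCG needs no separate normalisation term):

```python
import numpy as np

def hit_rate_at_k(ranked, held_out, k):
    """Did the held-out item appear in the top-k of the ranked list?"""
    return float(held_out in ranked[:k])

def ndcg_at_k(ranked, held_out, k):
    """Binary-relevance NDCG@k: 1/log2(rank+1) if the item is in the top-k."""
    for i, item in enumerate(ranked[:k]):
        if item == held_out:
            return 1.0 / np.log2(i + 2)  # i is 0-based, so rank = i + 1
    return 0.0

def mrr(ranked, held_out):
    """Reciprocal rank of the first (here: only) correct item."""
    return 1.0 / (ranked.index(held_out) + 1) if held_out in ranked else 0.0

ranked = [7, 3, 9, 1, 4]          # model's ranked predictions
print(hit_rate_at_k(ranked, 9, 3))      # 1.0
print(round(ndcg_at_k(ranked, 9, 3), 3))  # 0.5  (item at rank 3: 1/log2(4))
print(round(mrr(ranked, 9), 3))         # 0.333
```

In practice these are averaged over all users in the held-out set.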
The offline-online gap
Offline metrics are systematically biased. The data was generated by the previous recommender; the held-out items in the evaluation set are exactly the items the previous recommender chose to show; a new model that recommends items the previous system never showed will appear to perform poorly even if it would actually be better. This is the offline-online gap, and it can be substantial — methods that win on offline NDCG often lose in online A/B tests, and vice versa.
The 2020s field-wide reckoning with the offline-online gap has produced a more chastened evaluation practice. New methods that win on offline benchmarks but have not been tested in online experiments are treated with substantial skepticism. Production recommendation teams maintain a permanent offline-online correlation analysis, tracking which offline metrics predict online wins and which do not. The general lesson: offline evaluation is necessary for model development and required for cost reasons, but the headline result has to come from online testing.
Online A/B testing
The gold standard is A/B testing: split users into a control group (sees the old model) and a treatment group (sees the new model); measure the difference in business metrics (engagement, retention, revenue) over weeks. The standard concerns of online experiments apply: power calculations to determine sample size, multiple-testing correction when running many experiments concurrently, network and time effects that violate independence assumptions. A robust experimentation infrastructure is the precondition for serious recommendation work and is a substantial engineering investment in itself.
A/B tests are expensive — they require enough users for statistical power, they take days to weeks to read out, and failed tests cost real revenue. This makes A/B testing a scarce resource, and the engineering discipline is to use offline metrics, then counterfactual estimators, then online tests in escalating order of cost and reliability.
Counterfactual evaluation
Counterfactual evaluation tries to estimate how a new model would perform on logged data, correcting for the bias from the system that generated the data. The standard tool is Inverse Propensity Scoring (IPS, also called Inverse Propensity Weighting): each interaction is reweighted by 1/p, where p is the probability the logging system assigned to showing that item. The result is an unbiased estimator of the expected reward under the new policy, computable from logged data without an online test.
IPS has high variance, especially when the new policy disagrees substantially with the logging policy. Self-normalised IPS reduces variance with a small bias trade-off; doubly-robust estimators combine IPS with direct reward modelling to further reduce variance. The 2024 standard for serious recommendation evaluation is to compute several counterfactual estimators alongside offline metrics, with the counterfactual estimates serving as a sanity check on whether offline wins are likely to transfer online.
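IPS and self-normalised IPS both fit in a few lines of numpy; the synthetic logged data below is purely illustrative:

```python
import numpy as np

def ips(rewards, logging_probs, new_probs):
    """Vanilla IPS: unbiased but high-variance estimate of the new policy's
    expected reward, computed from data logged under a different policy."""
    w = new_probs / logging_probs        # importance weights
    return float(np.mean(w * rewards))

def snips(rewards, logging_probs, new_probs):
    """Self-normalised IPS: divides by the mean weight, trading a small bias
    for a large variance reduction."""
    w = new_probs / logging_probs
    return float(np.sum(w * rewards) / np.sum(w))

# Toy logged data: binary clicks observed under a logging policy,
# evaluated for a hypothetical new policy.
rng = np.random.default_rng(0)
n = 10_000
logging_probs = rng.uniform(0.05, 0.5, n)  # p(logging policy shows the item)
new_probs = rng.uniform(0.05, 0.5, n)      # p(new policy shows the item)
rewards = rng.binomial(1, 0.1, n).astype(float)

print(ips(rewards, logging_probs, new_probs))
print(snips(rewards, logging_probs, new_probs))
```

The variance problem is visible directly: when `new_probs / logging_probs` is occasionally very large, a handful of samples dominates the vanilla IPS estimate, which is exactly what self-normalisation damps.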
Beyond engagement
The most-measured online metric is engagement (clicks, watch time, conversions). The most-wanted property is something subtler — long-term retention, satisfaction, "wellness." The 2024 generation of recommendation evaluation explicitly measures these: post-engagement surveys, return-rate analysis, app-uninstall correlations, perceived-quality ratings. The integration of these into the optimisation objective (rather than just measurement) is the substance of Section 9's reinforcement-learning framing.
09
Bandits, Reinforcement Learning, and Exploration
Standard recommender systems treat each user-item interaction as an isolated supervised-learning problem. In reality, recommendation is sequential decision-making — the items shown now influence the data available later, and optimising long-term reward requires explicit exploration to avoid the feedback-loop pathologies of pure exploitation. The bandit and reinforcement-learning framing of recommendation is now standard in production, especially for systems with strong feedback effects.
Why exploitation alone fails
A pure-exploit recommender always shows the items it currently believes the user will most engage with. This produces high short-term engagement but several pathologies. Long-tail starvation: items that the system has not yet learned about never get shown, so they never accumulate the data needed to be properly modelled. Filter bubbles: users get progressively narrower recommendations as the system reinforces its existing model of their preferences. Drift blindness: when user preferences change, the system is slow to adapt because it is committed to the existing model.
The fix is exploration: deliberately show items the system is uncertain about, accept some short-term engagement loss, gain information that improves long-term performance. The trade-off is the classical exploration-exploitation balance from the multi-armed-bandit literature, and recommendation is one of the most consequential applications of bandit methods in industry.
Contextual bandits
Contextual bandits are the natural framing. At each step, the algorithm sees a context (user features, time, query intent), chooses an action (which item to recommend), observes a reward (engagement, conversion). The goal is to minimise cumulative regret over time. Standard algorithms include: LinUCB (Li et al. 2010), which adds an upper-confidence-bound term to the predicted reward and chooses the highest UCB action; Thompson sampling, which samples from the posterior over reward functions and acts greedily under the sample; ε-greedy, which acts greedily most of the time and randomly with probability ε.
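A minimal sketch of disjoint LinUCB (one linear model per arm, per Li et al. 2010); the simulation at the bottom, with its linear reward model and noise level, is an illustrative assumption:

```python
import numpy as np

class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm, plus an
    upper-confidence bonus for under-explored context directions."""
    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha
        self.A = np.stack([np.eye(d) for _ in range(n_arms)])  # per-arm X'X + I
        self.b = np.zeros((n_arms, d))                         # per-arm X'r

    def choose(self, x):
        ucb = np.empty(len(self.A))
        for a in range(len(self.A)):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                  # ridge estimate for arm a
            ucb[a] = theta @ x + self.alpha * np.sqrt(x @ A_inv @ x)
        return int(np.argmax(ucb))                     # act on the optimistic score

    def update(self, a, x, reward):
        self.A[a] += np.outer(x, x)
        self.b[a] += reward * x

# Toy simulation: 3 arms whose expected reward is linear in a 5-d context.
rng = np.random.default_rng(0)
true_theta = rng.standard_normal((3, 5))
bandit = LinUCB(n_arms=3, d=5, alpha=1.0)
for _ in range(500):
    x = rng.standard_normal(5)
    a = bandit.choose(x)
    bandit.update(a, x, true_theta[a] @ x + 0.1 * rng.standard_normal())
```

The UCB term `sqrt(x' A⁻¹ x)` is large exactly when arm `a` has seen few contexts like `x`, which is what drives the deliberate exploration the section describes.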
For recommendation, the action space is too large for vanilla bandits — millions of items per user, with the right exploration depending on item features. The 2020s production deployments use neural-network bandits with explicit uncertainty estimation (Bayesian deep learning from Part XIII Ch 07 has direct application here) and item-feature-based exploration that generalises across the catalogue rather than treating each item as a separate arm.
Slate recommendation
Recommendation rarely produces a single item — it produces a list. Slate recommendation formalises this: the action is an ordered list of items, the reward depends on the joint set, and the exploration must balance per-item uncertainty against position-dependent effects. Slate-aware reinforcement-learning approaches (SLATEQ from Google, the various determinantal-point-process approaches) are the right tool when slate effects are large.
Reinforcement learning for long-term reward
The fully sequential framing treats each user as an MDP — the user's state is their history, actions are recommendations, and the goal is to maximise long-term satisfaction (e.g., 30-day retention). RL for recommendation has been promising but operationally difficult — long-horizon credit assignment is hard, off-policy evaluation has the IPS problems of Section 8, and online RL on real users carries risks. Production deployments use RL increasingly often (YouTube's REINFORCE-style work, the various MDP-based churn-optimised recommenders) but typically with conservative on-policy bounds and substantial guardrails.
The 2024 generation of recommender RLHF approaches uses survey-style satisfaction signals as the reward, training recommenders to optimise for satisfaction rather than pure engagement. This is the most promising direction for fighting engagement-vs-satisfaction divergence at scale.
10
Applications and Frontier
Recommendation shows up wherever a product has more items than any user can browse — which is essentially everywhere on the modern internet. The methodology of the chapter is deployed across video, music, e-commerce, social, search, ads, news, and dating, each with a different flavour of the same underlying machinery. This final section surveys the application landscape and the frontier where foundation-model recommenders are reshaping the field.
Video and short-form media
Video recommendation is the highest-stakes application of the chapter. YouTube uses a two-stage architecture (Covington 2016 and successors) with a deep two-tower retriever and a deep ranker; the recommender is responsible for ~70% of all watch time. TikTok built its dominance on a recommendation system that captured user preferences from very few interactions; the architecture is not fully public but is known to use sequence transformers, two-tower retrieval, and reinforcement-learning components. Netflix uses recommendation throughout its product (homepage rows, similar titles, search ranking) with substantial investment in offline-online evaluation infrastructure.
Music and audio
Music recommendation has different dynamics from video — sessions are long, the cost of a bad recommendation is low (the user just skips), and content-based features (audio embeddings, lyrics, genre) carry strong signal. Spotify combines collaborative filtering, content-based features (a foundation-model audio encoder), and editorial curation. The Discover Weekly playlist uses a custom architecture combining matrix factorisation with content-based features and is one of the most-discussed production recommenders in the literature.
E-commerce and product recommendation
E-commerce recommendation has its own constraints: items have inventory limits, prices are dynamic, conversion is rare and high-value, and recommendations interact with search and ads. Amazon pioneered item-item CF and continues to use a hybrid of CF and deep methods. Pinterest uses GNN-based recommendation (PinSage was the canonical paper, Part XIII Ch 05) and has been particularly transparent about its production stack. Shopify and the various B2B-rec platforms serve smaller catalogues, often with more sophisticated content-based personalisation.
News, social, and search
News recommendation has unique requirements — freshness matters dramatically (a one-day-old story is stale), bias and diversity are high-stakes, and editorial standards interact with engagement optimisation. Social-media feed ranking — Facebook, X, LinkedIn — is operationally similar but with stronger network effects. Search-result ranking is technically distinct (covered in Part XIV Ch 02) but shares much of the methodology, and the line between "search" and "recommendation" has blurred as both increasingly use the same neural retrieval and ranking stacks.
Frontier methods
Several frontiers are particularly active in 2026. LLM-based recommenders use large language models to represent items, users, and contexts uniformly in natural language; the M3-Rec / GenRec / TIGER lines of work are representative. Generative recommendation trains the model to generate item IDs autoregressively rather than score candidates, sidestepping the softmax-over-millions problem. Multimodal recommenders integrate text, image, audio, and video features into a unified embedding space for richer item representations. Recommender RLHF, mentioned in Section 9, increasingly trains on satisfaction signals rather than pure engagement. Causal recommendation applies the doubly-robust and orthogonal-ML estimators of Part XIII Ch 04 to recommendation, producing systems that target intervention-style effects rather than correlational predictions.
What this chapter does not cover
Several adjacent areas are out of scope. Computational advertising — auction theory, bid optimisation, attribution modelling — overlaps substantially with recommendation but has its own distinct methodology around auctions and budget pacing. The substantial fairness-in-recommendation literature, including the various group-fairness and individual-fairness frameworks for recommender systems, deserves its own treatment in the AI ethics chapters. Privacy-preserving recommendation, covered partially in Part XIII Ch 10, has specific deployment patterns for federated and differentially-private recommenders. The conversational-recommendation literature — chat-based product discovery, multi-turn preference elicitation — is closely related and increasingly important but has its own dialogue-system methodology. And the economic theory of recommendation markets — how recommendation systems shape supply and demand for content, the platform-economic implications — is largely outside the technical scope of this chapter.
Further reading
Foundational papers and surveys for recommender systems. The Koren matrix-factorisation paper, the YouTube DNN paper, a sequence-modelling reference, and a counterfactual-evaluation reference together form the right starting kit for practitioners.
Matrix Factorization Techniques for Recommender Systems
Yehuda Koren, Robert Bell & Chris Volinsky — IEEE Computer, 2009
The Netflix-Prize-era summary paper. Establishes matrix factorisation as the foundational approach to recommendation, including the bias terms, regularisation, implicit-feedback extensions, and the ALS/SGD optimisation alternatives. The single most-cited recommender-systems reference and the right starting point for the field's methodology. The reference for matrix factorisation in recommendation.
BPR: Bayesian Personalized Ranking from Implicit Feedback
Steffen Rendle, Christoph Freudenthaler, Zeno Gantner & Lars Schmidt-Thieme — UAI 2009 — arXiv:1205.2618
The BPR paper. Introduces the pairwise-ranking loss for implicit-feedback recommendation that became the dominant training objective for the post-Netflix era of recommenders. The natural reading after the Koren paper for understanding how the field handles click-style feedback rather than explicit ratings. The reference for ranking-based training in recommendation.
Deep Neural Networks for YouTube Recommendations
Paul Covington, Jay Adams & Emre Sargin — RecSys 2016
The YouTube DNN paper. The canonical public description of an industrial-scale deep-learning recommender, with the two-stage retrieval-and-ranking architecture, the deep candidate-generation network, and the watch-time-prediction ranker. The right reading for understanding production recommendation architecture and the influential template that later video and content recommenders followed. The reference for industrial deep recommenders.
Self-Attentive Sequential Recommendation
Wang-Cheng Kang & Julian McAuley — ICDM 2018 — arXiv:1808.09781
The SASRec paper. The first transformer-based sequential recommender to substantially outperform RNN-based predecessors, establishing self-attention as the architecture for next-item prediction. Pair with Sun et al. 2019 (BERT4Rec) for the bidirectional masked-prediction extension. The reference for transformer-based sequential recommendation.
Wide & Deep Learning for Recommender Systems
Heng-Tze Cheng et al. — DLRS Workshop 2016 — arXiv:1606.07792
The Wide & Deep paper. Introduces the joint wide-cross-feature plus deep-MLP architecture that became the production template for click-through-rate prediction in advertising and recommendation. Pair with DCN-V2 (Wang et al. 2021) for the modern variant. The reference for ranking-stage architectures.
Sampling-Bias-Corrected Neural Modeling for Large Corpus Item Recommendations
Xinyang Yi et al. — RecSys 2019
The two-tower paper from Google. Establishes the production-grade two-tower architecture with sampled-softmax training and corrects for the sampling bias that plagued earlier two-tower implementations. The right reading for understanding how Google, Pinterest, and others scale candidate generation to enormous catalogues. The reference for two-tower retrieval.
A Contextual-Bandit Approach to Personalized News Article Recommendation (LinUCB)
Lihong Li, Wei Chu, John Langford & Robert E. Schapire — WWW 2010 — arXiv:1003.0146
The LinUCB paper. The canonical application of contextual bandits to recommendation, with the linear-payoff model and the UCB-style exploration that became the foundation for bandit-based recommenders. The right reading for understanding why exploration matters in production recommendation and how the methodology integrates with classical bandit theory. The reference for contextual bandits in recommendation.
Recommendations as Treatments: Debiasing Learning and Evaluation
Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak & Thorsten Joachims — ICML 2016
The IPS-for-recommendation paper. Establishes the inverse-propensity-scoring framework for unbiased evaluation of recommender systems on logged data, providing the methodology for the offline-online gap that has shaped modern evaluation practice. The reference for counterfactual evaluation.
Recommender Systems: The Textbook
Charu C. Aggarwal — Springer, 2016
The standard textbook. Comprehensive treatment of the recommender-systems field, including the classical methods, deep approaches, evaluation, fairness, and applications. The right comprehensive reference for someone treating recommendation as a primary research area rather than a tool. The textbook reference for the field.