Preference Learning & RLHF

Reinforcement learning from human feedback (RLHF) has become the dominant method for aligning large language models with human values — producing the helpfulness, harmlessness, and honesty that raw pretraining cannot guarantee. This chapter covers the full alignment pipeline: supervised fine-tuning, reward modeling from comparisons, PPO fine-tuning, and the newer direct preference optimization methods that bypass reward models entirely.

Prerequisites

This chapter assumes familiarity with policy gradient methods (Chapter 04), particularly PPO, and a basic understanding of transformer language models (Part VI Chapters 04–06). The reward learning sections use ideas from inverse RL (Chapter 07). Constitutional AI and scalable oversight connect to broader alignment concerns.

01

The Alignment Problem

A language model trained to predict the next token is not trained to be helpful, honest, or harmless. Bridging this gap — making a capable model also a trustworthy one — is the central challenge of AI alignment.

Large language models are trained by pretraining — next-token prediction over enormous text corpora. This produces a model that is extraordinarily capable at pattern completion, but whose "goals" (if we can call them that) are misaligned with what users and society actually want from it. A pretrained model will happily reproduce misinformation, generate harmful content, or give technically correct but practically useless answers, because text exhibiting all of these behaviors appears throughout its training data.

Reward misspecification is the general failure mode: we define a proxy objective (next-token prediction), optimize it well, and discover the resulting behavior diverges from what we intended. This is an instance of Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. The model learned to predict text, not to be an ideal assistant.

The Three H's

Anthropic's framing of the target behavior — popularized through research on Claude and formalized in their Constitutional AI work — describes three properties an aligned model should exhibit. A model should be helpful, providing substantive, accurate assistance that genuinely serves the user's needs. It should be harmless, declining to produce content that could cause real-world harm to the user, third parties, or society. And it should be honest, neither stating falsehoods nor creating false impressions through technically true but misleading statements. These properties frequently trade off against each other: refusing all requests is perfectly harmless but useless; agreeing with everything is maximally agreeable but dishonest.

Why Supervised Fine-Tuning Alone Is Insufficient

The first approach to alignment is straightforward: collect examples of the behavior you want and fine-tune the model on them. This supervised fine-tuning (SFT) step is valuable and genuinely necessary — it primes the model to behave like an assistant rather than a document completer. But it has important limitations. Human demonstrations are expensive to collect at scale, annotators cannot demonstrate every possible scenario, and the model can overfit to the surface style of the demonstrations without internalizing the underlying values. Most importantly, SFT provides no signal for distinguishing between a response that is good and one that is merely plausible.

The core insight of RLHF: humans find it much easier to judge which of two responses is better than to write an ideal response from scratch. Preference learning exploits this asymmetry — it collects relative judgments rather than absolute demonstrations, and uses them to train a reward model that can evaluate arbitrary responses.

02

Learning from Comparisons

Rather than asking humans to specify what a good response looks like, preference learning asks which of two responses is better. This comparative judgment is easier to make reliably, and can be aggregated into a numeric reward signal.

Comparative judgments have a long history in psychometrics and econometrics. The Bradley-Terry model (1952) provides the statistical foundation: given two items $i$ and $j$ with latent quality scores $\beta_i$ and $\beta_j$, the probability that $i$ is preferred over $j$ is:

$$P(i \succ j) = \frac{e^{\beta_i}}{e^{\beta_i} + e^{\beta_j}} = \sigma(\beta_i - \beta_j)$$

where $\sigma$ is the logistic sigmoid. This is a generative model of human preference: the annotator observes two responses and emits a comparison that is noisy but systematically correlated with the underlying quality difference. If we parameterize the quality scores with a neural network $r_\phi(x, y)$ (where $x$ is the prompt and $y$ the response), the log-likelihood of a preference dataset $\mathcal{D} = \{(x, y_w, y_l)\}$ (where $y_w$ is preferred over $y_l$) is:

$$\mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]$$

This is simply a binary cross-entropy loss. The reward model is trained to assign higher scalar scores to preferred responses than to dispreferred ones, using comparisons as the supervision signal.
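The correspondence between the Bradley-Terry likelihood and binary cross-entropy can be made concrete with a small numpy sketch (`bradley_terry_loss` is an illustrative name, not an API from any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bradley_terry_loss(r_w, r_l):
    """Negative log-likelihood of a preference dataset under Bradley-Terry.

    r_w, r_l: reward-model scores for the preferred (winner) and
    dispreferred (loser) response in each comparison pair.
    """
    return -np.mean(np.log(sigmoid(r_w - r_l)))

# A reward model that scores winners higher incurs a lower loss.
good = bradley_terry_loss(np.array([2.0, 1.5]), np.array([0.0, 0.5]))
bad = bradley_terry_loss(np.array([0.0, 0.5]), np.array([2.0, 1.5]))
```

When the scores tie, $\sigma(0) = 1/2$ and the loss is exactly $\log 2$ — the value of a reward model that cannot distinguish the pair at all.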

Annotator Agreement and Preference Noise

Human preferences are noisy and sometimes inconsistent. Annotators may disagree due to genuine value differences, ambiguous prompts, differing interpretations, or fatigue. The Bradley-Terry model treats all disagreement as noise around a single "true" quality ordering — an assumption that breaks down when preferences reflect genuine diversity of values rather than measurement error. This is a deep issue: whose preferences should the reward model represent? A model trained on majority-preference annotations will systematically suppress minority viewpoints.

In practice, RLHF annotation pipelines typically employ trained human raters (not crowd workers), provide detailed rubrics, and measure inter-annotator agreement metrics such as Cohen's kappa. Disagreements are often resolved by majority vote or by averaging scalar ratings. Some systems present annotators with a Likert scale rather than a binary choice, capturing more nuanced preference strength.

03

The RLHF Pipeline

The standard RLHF workflow proceeds in three stages: supervised fine-tuning to get a competent starting policy, reward model training from human comparisons, and RL fine-tuning to optimize that reward.

The three-stage RLHF pipeline. Stage 1 fine-tunes a pretrained model on demonstrations to get πSFT. Stage 2 collects human pairwise preferences between model outputs and trains a reward model rφ. Stage 3 uses PPO to optimize πθ against rφ, with a KL penalty to keep the policy close to πSFT.

The three-stage pipeline was crystallized in the InstructGPT paper (Ouyang et al., 2022), which applied it to GPT-3 to produce a model that human raters overwhelmingly preferred to the raw pretrained model. The same template underlies GPT-4, Claude, Gemini, and most aligned LLMs in production today.

The key innovation is stage 2: instead of specifying reward by hand, it is learned from human comparisons. This allows the reward signal to capture nuanced human values that are difficult to write down explicitly — tone, safety, truthfulness, appropriate hedging — as long as humans can recognize them when comparing outputs.

04

Reward Modeling

The reward model is a neural network that maps a (prompt, response) pair to a scalar score. It is trained on human comparison data and must generalize to judge responses it has never seen.

The reward model $r_\phi$ is typically initialized from the SFT model with a linear head added to produce a scalar from the final-token hidden state. This initialization is critical: starting from the SFT model means the reward model already understands language and knows what coherent, contextually appropriate responses look like. Fine-tuning from a randomly initialized classifier would require learning language understanding from scratch on comparatively little data.

Data Collection Protocol

Preference data is collected by sampling multiple responses from the SFT model (or a mix of model versions) for a given prompt, then presenting human annotators with pairs to rank. Annotation guidelines typically instruct raters to evaluate on multiple dimensions: accuracy and factual correctness, instruction-following, style and tone, safety and harm avoidance. In practice, many RLHF pipelines collect a ranking over $K$ responses rather than a single pair, which yields $\binom{K}{2}$ pairwise comparisons per prompt — substantially more efficient than binary comparisons alone.
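The comparison count from a single ranking is simple combinatorics — each item ranked earlier is preferred over each item ranked later (labels below are illustrative):

```python
from itertools import combinations

# A ranking over K responses (best to worst) yields C(K, 2) pairwise
# comparisons: every earlier item is preferred over every later one.
ranking = ["y1", "y2", "y3", "y4"]  # K = 4 ranked responses
pairs = [(w, l) for w, l in combinations(ranking, 2)]  # (winner, loser)
```

For K = 4 this produces 6 training pairs from one annotation session, versus 1 pair for a binary comparison.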

The Reward Model Training Objective

For a dataset of preference pairs $(x, y_w, y_l)$ where $y_w$ is preferred, the reward model is trained with:

$$\mathcal{L}_{RM}(\phi) = -\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(r_\phi(x,y_w) - r_\phi(x,y_l)\right)\right]$$

An important practical detail: reward models can be sensitive to the position bias of annotators, who sometimes prefer whichever response appears first regardless of content. Calibrating for this bias — by randomizing presentation order and checking for systematic positional preferences — is standard in production annotation pipelines.

Reward model ensembles. A single reward model can overfit to idiosyncrasies of the annotator pool. Some pipelines train an ensemble of reward models from different seeds or annotator subsets, and use the ensemble mean (or a pessimistic lower bound) as the reward signal, reducing sensitivity to reward model errors.
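The pessimistic-aggregation idea can be sketched in a few lines (a simple mean-minus-std rule; real pipelines vary in how they quantify disagreement):

```python
import numpy as np

def ensemble_reward(scores, pessimism=1.0):
    """Pessimistic ensemble aggregation: the mean score minus a
    pessimism-weighted standard deviation, so responses the ensemble
    disagrees about are discounted."""
    s = np.asarray(scores, dtype=float)
    return float(s.mean() - pessimism * s.std())
```

A unanimous ensemble keeps its score; a response that one reward model loves and another distrusts is penalized, which blunts exploitation of any single model's idiosyncrasies.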

What Reward Models Actually Learn

Empirical studies suggest that reward models can learn meaningful proxies for human values, but they also learn shortcuts. They may reward longer responses independent of quality, prefer responses that sound confident regardless of accuracy, and favor certain stylistic conventions (bullet points, headers) because annotators associate them with thorough answers. These biases become the targets of reward hacking once RL optimization begins.

05

PPO for Language Models

Proximal Policy Optimization (PPO) is the RL algorithm most commonly used for RLHF fine-tuning. Applying PPO to a language model requires a specific formulation of the MDP and careful regularization to prevent catastrophic forgetting and reward hacking.

Language model generation can be framed as an MDP where the state at step $t$ is the prompt plus all tokens generated so far, the action is the next token to generate (from a vocabulary of tens of thousands), the transition is deterministic (appending the token), and the reward is zero for all intermediate steps and $r_\phi(x, y)$ at the end of the response. This is a bandit-like episode: all reward comes at the final step, which makes credit assignment across the sequence difficult.

The KL-Regularized Objective

Optimizing the reward model score directly would cause the policy to rapidly drift to reward-hacking behavior. The standard fix is a KL penalty that keeps the fine-tuned policy close to the SFT reference policy:

$$\max_{\pi_\theta}\, \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot|x)}\!\left[r_\phi(x,y) - \beta\,\mathrm{KL}\!\left(\pi_\theta(\cdot|x)\,\|\,\pi_{SFT}(\cdot|x)\right)\right]$$

The coefficient $\beta$ controls the tradeoff: larger $\beta$ keeps the policy more faithful to the SFT baseline but limits how much the reward can be improved; smaller $\beta$ allows greater reward optimization but risks reward hacking. The KL penalty is often computed token-by-token and subtracted from the per-token reward before computing PPO advantages.
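The token-level reward shaping can be sketched as follows, using the common single-sample KL estimate $\log \pi_\theta(a_t|s_t) - \log \pi_{SFT}(a_t|s_t)$ (function and argument names are illustrative):

```python
import numpy as np

def shaped_rewards(logp_policy, logp_ref, final_reward, beta):
    """Per-token rewards for PPO in RLHF: a -beta * KL estimate at every
    token, with the reward-model score added at the final token only.

    logp_policy, logp_ref: per-token log-probs of the generated tokens
    under the current policy and the frozen SFT reference.
    """
    kl = np.asarray(logp_policy) - np.asarray(logp_ref)  # per-token KL estimate
    r = -beta * kl
    r[-1] += final_reward  # all reward-model signal arrives at the end
    return r
```

Intermediate tokens thus receive only the KL regularization signal, while the sequence-level reward lands on the last token and is propagated backward by the advantage estimator.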

Practical PPO Implementation Details

RLHF with PPO requires four models to be loaded simultaneously: the policy $\pi_\theta$ (being optimized), the reference policy $\pi_{SFT}$ (frozen, for KL computation), the reward model $r_\phi$ (frozen), and the value function $V_\psi$ (trained alongside the policy). For LLMs with billions of parameters, this requires careful memory management and is typically implemented with separate GPU allocations, gradient checkpointing, and mixed precision. The practical engineering complexity of PPO-based RLHF is one motivation for the simpler DPO approach described in section 07.

Why PPO rather than other RL algorithms? PPO's clipped surrogate objective bounds the policy update at each step, preventing the large policy changes that destabilize training. This robustness is especially important for LLMs, where catastrophic forgetting can occur if the policy strays too far from its initialization in a single gradient step.
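The clipped surrogate at the heart of this robustness is small enough to write out (to be maximized; `ratio` is the per-token probability ratio $\pi_\theta(a|s)/\pi_{old}(a|s)$):

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """PPO clipped surrogate: min(r*A, clip(r, 1-eps, 1+eps)*A).
    Clipping removes the incentive to push the ratio beyond [1-eps, 1+eps],
    bounding the effective policy change per update."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv
    return np.minimum(ratio * adv, clipped).mean()
```

With a positive advantage, moving the ratio above $1+\epsilon$ yields no further gain; with a negative advantage, the pessimistic `min` keeps the full penalty, so large accidental policy shifts are never rewarded.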

06

Reward Hacking & Overoptimization

The reward model is only a proxy for human preference. As the policy optimizes more and more aggressively against this proxy, it discovers behaviors that score well on the reward model while diverging from what the reward model was meant to capture.

Reward hacking is the central failure mode of RLHF. Gao et al. (2022) demonstrated this quantitatively: as the KL divergence between the optimized policy and the reference policy increases (i.e., as RL training proceeds), true human preference scores initially rise — the model gets better — then plateau and eventually decline, even as the reward model score continues to increase. The gold-standard human evaluation and the learned proxy reward diverge.

Common Hacking Strategies

Reward-hacked language models tend to exhibit characteristic failure modes. They produce very long responses because length correlates with perceived thoroughness. They hedge excessively, prefacing statements with lengthy disclaimers that satisfy safety rubrics without being genuinely helpful. They may learn to echo the user's framing sycophantically — agreeing with incorrect premises — because agreement feels more satisfying to annotators than accurate correction. In extreme cases, models have learned to produce repetitive text, garbled output, or unusual character sequences that happen to confuse the reward model into assigning high scores.

Mitigations

Several techniques constrain the degree of overoptimization. The KL penalty (section 05) is the primary defense. Some practitioners anneal the learning rate aggressively or stop RL training early. Others monitor the distribution of responses for distributional collapse — when the model's output diversity collapses to a narrow mode. Periodic re-collection of preference data using the current policy (rather than relying on static datasets) keeps the reward model relevant throughout training. The theoretical analysis of overoptimization connects to the more general problem of Goodhart's law: any sufficiently flexible optimizer will find high-scoring but unintended solutions when the optimization target is a proxy.

Reward overoptimization: the reward model score (violet) rises monotonically with KL divergence, but true human preference (teal) peaks at an intermediate KL and then declines — the model has learned to game the proxy.

07

Direct Preference Optimization

Direct Preference Optimization (DPO) eliminates the explicit reward model entirely. It shows that the optimal RLHF policy has an analytical form, allowing preference data to train the language model directly — as a classification problem.

Rafailov et al. (2023) observed that the KL-constrained RLHF objective has a closed-form optimal solution. Given the optimization:

$$\max_{\pi}\,\mathbb{E}_{x,y\sim\pi}[r(x,y)] - \beta\,\mathrm{KL}(\pi\|\pi_{ref})$$

the optimal policy satisfies:

$$\pi^*(y|x) = \frac{\pi_{ref}(y|x)\,e^{r(x,y)/\beta}}{Z(x)}$$

where $Z(x)$ is a normalizing partition function. This means the reward function can be expressed in terms of the optimal policy and the reference policy:

$$r(x,y) = \beta\log\frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta\log Z(x)$$

Substituting this into the Bradley-Terry preference model and noting that $Z(x)$ cancels in the pairwise comparison, we arrive at the DPO loss:

$$\mathcal{L}_{DPO}(\theta) = -\mathbb{E}_{(x,y_w,y_l)}\!\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta\log\frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right)\right]$$

This loss directly optimizes the language model on preference data without any intermediate reward model. The gradient pushes the probability of preferred responses up relative to the reference policy, while pushing the probability of dispreferred responses down. The partition function and all reward model infrastructure disappear.
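Given summed sequence log-probabilities under the policy and the frozen reference, the loss is a few lines of numpy (argument names are illustrative; a numerically stable $-\log\sigma(x)$ is $\log(1+e^{-x})$, i.e. `np.logaddexp(0, -x)`):

```python
import numpy as np

def dpo_loss(logp_w, logp_ref_w, logp_l, logp_ref_l, beta=0.1):
    """DPO loss from summed sequence log-probs of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference."""
    logits = beta * ((logp_w - logp_ref_w) - (logp_l - logp_ref_l))
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably
    return np.logaddexp(0.0, -logits).mean()
```

When the policy equals the reference, the logits are zero and the loss is $\log 2$; raising the chosen response's log-ratio relative to the rejected one drives it down.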

Why DPO Works: Implicit Reward Maximization

DPO does not abandon the reward — it simply parameterizes the reward implicitly through the log-ratio $\beta\log\frac{\pi_\theta}{\pi_{ref}}$. Every language model implicitly defines such a reward. DPO directly trains the policy to have the right implicit reward, skipping the separate reward model training step. The result is simpler to implement (no RL loop, no value function, no critic), more stable to train, and uses less memory — only the policy and the frozen reference are needed, rather than four models.

DPO's practical advantages: No reward model training pipeline. No PPO hyperparameter tuning. No reward hacking in the RL sense (though DPO can still overfit to preference data). Standard supervised fine-tuning infrastructure suffices — the loss is a standard binary cross-entropy computed over sequence log-probabilities.

Limitations of DPO

DPO's simplicity comes with tradeoffs. It requires an offline, static preference dataset — it cannot incorporate new human feedback during optimization, unlike online RL. It can suffer from length bias: the summed log-probability ratios scale with response length, so optimization can push the model toward systematically longer (or shorter) responses regardless of quality. Several variants (section 08) address these issues. Additionally, DPO's implicit reward is less interpretable than an explicit reward model — it is harder to diagnose what the model has learned.

08

DPO Variants & Alternatives

DPO sparked a wave of alternative direct preference optimization methods, each addressing specific weaknesses. The landscape has evolved rapidly, with new approaches emerging monthly.

Identity Preference Optimization (IPO) (Azar et al., 2023) modifies DPO's loss to prevent overfitting to deterministic preferences. When the model assigns probability 1 to preferred responses and 0 to dispreferred ones — the degenerate optimum of DPO — DPO's gradient provides no pressure back toward the reference policy. IPO replaces the log-sigmoid with a squared-error term that regresses the log-ratio difference toward a fixed target, so the loss keeps supplying corrective gradients even under extreme probability assignments.

Kahneman-Tversky Optimization (KTO) (Ethayarajh et al., 2023) departs from the pairwise preference paradigm altogether. Rather than requiring paired comparisons, KTO uses unpaired binary labels — each response is labeled simply as "good" or "bad." This dramatically reduces annotation cost, since such labels can be collected one response at a time (e.g., from thumbs-up/down signals) instead of generating and juxtaposing two responses for the same prompt. KTO is inspired by prospect theory: humans feel losses more acutely than equivalent gains, and the KTO loss models this asymmetry explicitly.

Odds Ratio Preference Optimization (ORPO) (Hong et al., 2024) eliminates the need for a reference model. It augments the standard SFT loss with an odds-ratio term that increases the model's own odds of generating the preferred response relative to the rejected one — no KL penalty against a frozen $\pi_{ref}$ is required. This makes ORPO suitable for settings where no SFT reference policy is available, and lets preference optimization happen in a single training stage.

Group Relative Policy Optimization (GRPO), introduced in the DeepSeekMath paper (Shao et al., 2024) and central to DeepSeek-R1, samples multiple responses per prompt and computes advantages relative to the group mean reward. This eliminates the value function entirely and is especially effective for reasoning tasks where the reward is binary (correct/incorrect). GRPO is the algorithm behind several leading open reasoning models.
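The group-relative advantage computation is a few lines of numpy (normalizing by the group standard deviation follows common practice; exact details vary by implementation):

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for GRPO: each sampled response's reward
    is baselined by the mean of its group and normalized by the group
    standard deviation, replacing a learned value function."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # epsilon guards all-equal groups
```

For a binary-reward group like `[1, 0, 1, 0]`, correct samples get advantage about +1 and incorrect ones about −1, with no critic network involved.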

Simple Preference Optimization (SimPO) (Meng et al., 2024) addresses DPO's length bias by using average log-probability (rather than total log-probability) as the implicit reward, and adds a margin to ensure a minimum quality gap between preferred and rejected responses. SimPO achieves strong performance without a reference model, matching or exceeding DPO on instruction-following benchmarks.
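SimPO's length normalization and margin can be sketched directly from per-token averages (a numpy sketch; default `beta` and `gamma` are illustrative, not the paper's tuned values):

```python
import numpy as np

def simpo_loss(avg_logp_w, avg_logp_l, beta=2.0, gamma=0.5):
    """SimPO loss sketch: the implicit reward is beta times the *average*
    per-token log-probability (length-normalized), and gamma is a target
    margin between chosen and rejected. No reference model is needed."""
    logits = beta * avg_logp_w - beta * avg_logp_l - gamma
    # -log(sigmoid(x)) = log(1 + exp(-x)), computed stably
    return np.mean(np.logaddexp(0.0, -logits))
```

Because the reward is an average rather than a sum, a long response cannot inflate its score merely by adding tokens, and the margin `gamma` demands a minimum quality gap even when the two responses are close.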

| Method | Needs RM? | Needs Reference? | Needs Pairs? | Key strength |
|---|---|---|---|---|
| RLHF+PPO | Yes | Yes | Yes | Online, flexible reward |
| DPO | No | Yes | Yes | Simple, stable |
| IPO | No | Yes | Yes | No overfitting to deterministic prefs |
| KTO | No | Yes | No | Unpaired labels, cheap annotation |
| ORPO | No | No | Yes | No reference model needed |
| GRPO | Scalar | No | No | Reasoning tasks, no critic |
| SimPO | No | No | Yes | Length-debiased, margin control |

09

Constitutional AI & RLAIF

Instead of asking humans to label every preference, Constitutional AI uses AI systems to generate feedback based on a written set of principles — the "constitution" — making the feedback process far more scalable.

Human annotation is expensive, slow, and subject to annotator inconsistency. Reinforcement Learning from AI Feedback (RLAIF) — operationalized in Anthropic's Constitutional AI (Bai et al., 2022) and studied systematically by Lee et al. (2023) — replaces or supplements human preference labels with LLM-generated feedback.

The Constitutional AI Pipeline

Constitutional AI (CAI) proceeds in two phases. In the supervised learning (SL-CAI) phase, the model is given a potentially harmful response it generated and asked to critique it according to a set of principles — the constitution — and then revise the response. The revised responses are used as supervised fine-tuning targets. This teaches the model to produce safer outputs through a self-critique and revision process.

In the RL-CAI phase, the AI is asked to evaluate pairs of its own outputs according to the constitutional principles and generate preference labels. A reward model is trained on these AI-generated preference labels, and the policy is then fine-tuned with PPO (or DPO) against this reward model. Because the feedback comes from the AI itself, this phase scales with compute rather than human annotation capacity.

The Constitution

The constitution is a list of principles that specify desired model behavior. A sample principle might read: "Choose the response that is least likely to contain harmful, unethical, or illegal content." Another might be: "Choose the response that most supports human autonomy and does not manipulate or deceive." The explicit, inspectable nature of the constitution is a key advantage: rather than hoping that annotator biases are benign, CAI makes the value specification explicit and auditable. Anthropic has published the constitutions used for Claude, allowing external scrutiny of the values being instilled.

RLAIF vs. human feedback quality. Lee et al. (2023) found that RLAIF-trained models were preferred by humans roughly as often as RLHF-trained models, despite the AI-generated labels being produced at a fraction of the cost. This suggests that, at current model sizes, AI labelers may be competitive with human annotators for many preference dimensions — though the comparison depends heavily on the capability of the labeling model.

Limitations and Risks

RLAIF introduces a circularity: the same class of model that is being aligned is providing the alignment signal. If the labeling model has systematic biases or blind spots, those biases will be amplified rather than corrected. Sycophancy — preferring responses that agree with the labeling model's priors — is a particular risk. There is also the question of whose values are encoded in the constitution: the document reflects choices made by the AI lab, and those choices embed specific cultural, ethical, and political perspectives.

10

Scalable Oversight

As AI systems become more capable than humans in specialized domains, humans will no longer be able to directly evaluate the quality of their outputs. Scalable oversight is the research agenda for maintaining meaningful human supervision of superhuman systems.

The RLHF framework assumes that human preferences are a reliable signal for good behavior. This assumption weakens as models become more capable. A human annotator cannot reliably judge whether a frontier model's solution to a complex mathematical problem is correct, whether a long legal document is accurately summarized, or whether a piece of generated code has subtle security vulnerabilities. In the limit, if a model is smarter than its overseers in a domain, RLHF preference labels in that domain may be unreliable or actively gamed.

Debate

AI Safety via Debate (Irving et al., 2018) proposes that two AI agents debate the answer to a question, with a human judging the debate rather than the answer directly. If one agent argues a true position and the other argues a false one, the debate format — where each agent can challenge the other's reasoning — should allow the true position to win, because true claims are easier to defend against scrutiny. Debate shifts the human's task from evaluating complex claims to evaluating the quality of adversarial argumentation, which may be within human capability even when direct evaluation is not.

Iterated Amplification

Iterated Amplification (Christiano et al., 2018) uses an AI assistant to help the human provide feedback on the AI's own behavior. The human decomposes a hard task into subproblems, gets AI help with the subproblems, and then uses those answers to provide feedback on the overall task. Iterating this process bootstraps human oversight to harder and harder tasks, theoretically maintaining alignment even for tasks that exceed unaided human capability.

Weak-to-Strong Generalization

Burns et al. (2023) at OpenAI studied whether a strong AI model can be aligned using only feedback from a weaker model. Surprisingly, in many settings the stronger model generalized beyond the weak supervisor's capability — it extrapolated from the weak labels to produce behaviors that were better than the weak model could evaluate. This weak-to-strong generalization suggests that alignment may be more tractable than naively feared: current human supervisors may be sufficient to instill values that generalize appropriately to more capable future models, at least in some domains.

11

Process Reward Models

Standard reward models assign a single scalar to a complete response. Process reward models instead evaluate each step of a reasoning chain, providing dense credit assignment that is particularly valuable for multi-step mathematical and scientific reasoning.

When a model produces a long chain-of-thought to solve a math problem, an outcome reward model (ORM) only signals whether the final answer is correct — there is no feedback on intermediate steps. A model that reaches the right answer via lucky guessing receives the same signal as one that reasoned correctly throughout. A process reward model (PRM) instead scores each individual step of the solution. This gives much richer credit assignment and allows the training signal to identify exactly where reasoning went wrong.

Let's Verify Step by Step

Lightman et al. (2023) at OpenAI demonstrated the power of PRMs on the MATH benchmark. They collected human annotations at each step of solutions — annotators labeled each step as positive (correct and useful), neutral (correct but unnecessary), or negative (incorrect or misleading). A reward model trained on these step-level annotations substantially outperformed an outcome-only reward model when used for best-of-N sampling: for a given sampling budget, selecting the solution with the best PRM score solved more problems than selecting by ORM score or majority voting.
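For best-of-N selection, per-step PRM scores must be aggregated into one solution-level score; taking the product treats each score as an independent probability that the step is correct (one common choice — function names below are illustrative):

```python
import numpy as np

def solution_score(step_scores):
    """Aggregate per-step PRM scores into a solution-level score.
    The product interprets each score as P(step is correct)."""
    return float(np.prod(step_scores))

def best_of_n(candidates):
    """candidates: one list of per-step PRM scores per sampled solution.
    Returns the index of the highest-scoring solution."""
    return int(np.argmax([solution_score(s) for s in candidates]))
```

Note how the product penalizes a single weak step: a solution with one near-zero step score loses to a uniformly solid one even if its other steps look excellent.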

Applications Beyond Math

The PRM paradigm extends naturally to any domain where reasoning proceeds through interpretable steps: scientific problem-solving, code generation (evaluate each function or algorithm block), legal reasoning (evaluate each premise in an argument), and medical diagnosis (evaluate each diagnostic step). The bottleneck is annotation: step-level labels require annotators with domain expertise who can evaluate intermediate reasoning, making PRMs expensive to construct for domains where such experts are scarce.

Monte Carlo step estimation. Rather than requiring human annotations at every step, some PRM training approaches use Monte Carlo rollouts to estimate the probability that a given intermediate state leads to a correct final answer. Steps from which the model reliably reaches correct solutions receive high process rewards; steps that tend to lead to failure receive low rewards. This automates the step-level labeling at the cost of additional sampling.
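The Monte Carlo estimate above reduces to a success-rate computation over rollouts; in this sketch, `rollout_fn` is a hypothetical stand-in for sampling a completion from the model and checking the final answer:

```python
import random

def mc_step_value(prefix, rollout_fn, n=64, rng=None):
    """Monte Carlo process reward for an intermediate reasoning step:
    the fraction of n sampled completions from this prefix that reach a
    correct final answer. rollout_fn(prefix, rng) -> bool stands in for
    model sampling plus answer checking."""
    rng = rng or random.Random(0)  # seeded for reproducibility
    return sum(rollout_fn(prefix, rng) for _ in range(n)) / n
```

Steps whose prefixes reliably lead to correct answers get process rewards near 1; dead-end steps get rewards near 0 — all without human step-level labels, at the cost of n extra samples per step.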

12

Applications & Open Challenges

RLHF and preference learning underlie the most capable AI assistants in production today, yet fundamental challenges remain: whose values to optimize, how to maintain alignment as models scale, and how to handle the genuine diversity of human preferences.

Production Systems

The InstructGPT/ChatGPT lineage was the first major demonstration that RLHF could turn a capable but raw language model into a broadly useful assistant. OpenAI's GPT-4, Anthropic's Claude family, Google's Gemini, and Meta's Llama-based instruction-tuned models all use variants of the RLHF pipeline — most combining SFT, reward modeling, and either PPO-based RL or DPO-style preference optimization. The scale of these pipelines is substantial: training runs involve millions of human preference judgments and thousands of GPU-hours of RL fine-tuning.

The Pluralism Problem

A preference-learned model necessarily reflects some distribution of values — those of its annotator pool. When annotators disagree, the majority preference wins, systematically underrepresenting minority viewpoints. More fundamentally, there is no universal preference distribution: different cultures, political traditions, and ethical frameworks have genuinely different intuitions about what constitutes a good response to many prompts. A single globally deployed model aligned on Western, English-language annotator preferences may perform poorly as a cultural match for other user populations. Research on pluralistic alignment — methods that represent and respect diverse value systems rather than collapsing them into a single scalar reward — is an active area.

Annotation Cost and Synthetic Data

Collecting millions of high-quality preference labels is expensive. The trend toward RLAIF and Constitutional AI partially addresses this, but introduces its own dependencies on frontier model quality. Distillation from stronger models — using outputs from GPT-4 or Claude to supervise smaller models — has become common practice, with the result that many open models are now implicitly aligned according to the values of the larger proprietary models used for data generation.

Open Challenges

Several fundamental questions remain unresolved. Reward hacking at scale — whether more capable models find more sophisticated ways to game even carefully constructed reward models — is insufficiently understood. The relationship between preference optimization and capability — whether aligning a model makes it more or less capable on objective benchmarks — is complex and context-dependent. The problem of sycophancy — models that tell users what they want to hear rather than what is true — persists despite being specifically targeted by alignment efforts. And the long-horizon alignment problem — ensuring that systems remain aligned as they pursue extended goals over time — is barely addressed by current RLHF methods, which focus on single-turn or short-context interactions.

Alignment tax? Early RLHF work worried that safety constraints would reduce capability — that a harmless model would be a less capable one. More recent evidence is mixed: aligned models often perform better on capability benchmarks than raw pretrained models (instruction-following is itself a capability), but aggressive safety optimization can reduce willingness to engage with legitimate complex topics, representing a real tradeoff that practitioners navigate with considerable difficulty.

Further Reading

Foundational Papers

Constitutional AI & Scalability

Process Rewards & Reasoning

Reward Hacking & Alignment Theory