Mechanistic Interpretability, where neural networks become legible.

A neural network's behaviour is determined by hundreds of billions of weighted connections, none individually meaningful. Mechanistic interpretability is the project of reverse-engineering what computations these networks actually perform — identifying the specific circuits and features that produce specific behaviours, in a form humans can inspect, modify, and reason about. The 2017–2026 evolution has been remarkable: from "neural networks are black boxes" to specific identified circuits for tasks like indirect-object identification, in-context learning, and arithmetic; from "we can't tell what features a neuron represents" to sparse autoencoders that decompose activations into millions of identifiable features. Superposition — the structural reason why neurons typically aren't interpretable individually — was the key conceptual obstacle; the 2023–2024 sparse-autoencoder breakthrough was the methodology that addressed it. Probing and causal tracing are the empirical methods that connect interpretability findings to behaviour. This chapter develops the methodology with the depth a working ML practitioner needs: the conceptual frameworks, the algorithmic methods, the empirical results, and the implications for AI safety and broader ML practice.

Prerequisites & orientation

This chapter assumes the deep-learning material of Part VI (particularly transformer architecture), the LLM material of Part IX, the AI safety fundamentals of Ch 01, and the alignment-methods material of Ch 02. Familiarity with linear algebra (vector spaces, projections, decompositions) is essential; familiarity with optimisation theory (sparse coding, dictionary learning) helps for §4 on sparse autoencoders. The chapter is written for ML researchers, alignment researchers, and practitioners who need to understand what their models are computing — increasingly relevant as models become more capable and consequential. The methodology is younger and less consolidated than other parts of ML; we cover both established results and active research directions.

Three threads run through the chapter. The first is the universality conjecture: do different neural networks trained on similar tasks learn similar circuits and features? The empirical evidence is increasingly that they do, which makes interpretability findings transferable across models. The second is the scaling question: do interpretability methods that work at small scale extend to frontier-scale models? The 2024–2026 evidence is encouraging but the trajectory is uncertain. The third is the application question: what does interpretability buy us — better safety, better debugging, better steering, scientific understanding? Different practitioners weight these differently; serious work engages with multiple framings. The chapter develops each in turn.

01

Why Mechanistic Interpretability Matters

For most of deep learning's history, neural networks were opaque: powerful function approximators whose internal computations resisted human inspection. Mechanistic interpretability is the project of changing that — opening the black box to identify what specific circuits and features inside the network compute, why they compute it, and how they connect to model behaviour. The methodology has implications for safety (detecting deceptive alignment, identifying refusal mechanisms), capability (understanding what makes models work), and basic scientific understanding of what neural networks are.

What "mechanistic" means here

The "mechanistic" qualifier distinguishes this work from other interpretability approaches. Behavioural interpretability studies model outputs (Ch 05 of this part: SHAP, LIME, attention visualisation). Mechanistic interpretability studies model internals: the specific computational structures that produce behaviours. The promise is reverse engineering: rather than just describing what the model does, identify why it does it in terms of the underlying weights and activations. The methodology is more demanding and the findings are deeper; the trade-off is that it's substantially harder.

[Figure: the mechanistic-interpretability stack. FIND (§2–5): features and circuits, superposition, sparse autoencoders, attention-head analysis (decompose the model's internals). TEST (§6–7): probing, causal tracing, activation patching, causal scrubbing (verify hypotheses). APPLY (§8): model editing, activation steering, feature ablation, safety interventions (put findings to use). §9 covers scaling, §10 the frontier. Application layer: alignment research, safety auditing, scientific understanding, model debugging.]

The safety motivation

For AI safety, mechanistic interpretability has a specific role: detecting alignment-relevant properties that aren't visible from external behaviour alone. Deceptive alignment detection (Ch 02): if a model has learned to behave aligned during training while pursuing different objectives at deployment, the deception lives in the model's internals, not just its outputs. Capability auditing: a model might possess capabilities it doesn't always exercise; interpretability could reveal them. Refusal-mechanism understanding: knowing how models implement refusals lets us evaluate their robustness against jailbreaks. The 2023–2026 work has increasingly oriented mechanistic interpretability toward these safety applications.

The scientific motivation

Beyond safety, mechanistic interpretability is one of the few paths toward genuine scientific understanding of what neural networks are. We know empirically that transformers trained on next-token prediction learn grammar, world knowledge, and reasoning capabilities, but the how remains mysterious. Interpretability provides a way to actually understand the algorithms the network has learned, the data structures it represents, the strategies it employs. The 2017–2026 progress has been substantial: for a growing set of specific tasks, we now know the algorithms transformers actually implement.

The capability motivation

Mechanistic interpretability also has capability implications. Understanding why models work suggests how to make them work better. Architectural innovations sometimes flow from interpretability findings: the attention-head taxonomy that emerged from interpretability work informed subsequent architecture design. Training-procedure improvements can target identified weaknesses: if we know the model relies on specific circuits for important behaviours, we can train more deliberately for those circuits' robustness. The capability and safety motivations don't always align — capability progress can outpace safety progress — but interpretability serves both.

What this chapter covers

The chapter develops the conceptual framework (features, circuits, superposition), the methodological tools (sparse autoencoders, probing, causal tracing), the empirical results (specific identified circuits like IOI, induction heads, the various 2024 SAE-based feature catalogues), the application domains (model editing, activation steering, safety interventions), and the open problems (scaling to frontier models, automation of feature labelling, the broader mechanistic-vs-behavioural debate). The methodology is younger than other parts of ML; the chapter reflects this with attention to active research directions.

The downstream view

Operationally, mechanistic interpretability sits between trained models and downstream interventions. Upstream: a trained model whose internal computations we want to understand. Inside this chapter's scope: the methods for decomposing internal activations into interpretable components, hypothesising about what circuits compute, testing those hypotheses, and applying findings. Downstream: alignment-relevant interventions, safety auditing, model editing, scientific publications. The remainder of this chapter develops each piece: §2 features and circuits, §3 superposition, §4 sparse autoencoders, §5 specific circuits, §6 probing, §7 causal tracing, §8 applications, §9 scaling, §10 the frontier.

02

Features, Circuits, and the Conceptual Frame

The conceptual vocabulary of mechanistic interpretability centres on features (the things models represent) and circuits (the computational structures that operate on features to produce behaviours). The vocabulary was substantially developed by the OpenAI Clarity team and subsequent work, and provides the framework within which empirical interpretability findings are organised.

Features

A feature, in this vocabulary, is a meaningful unit of representation inside the network — something the network has learned to identify and compute about. Features can be perceptual (in a vision model, "is there a curve here", "is this a dog face"), linguistic (in an LLM, "this is a noun", "the writer is angry"), conceptual ("this is mathematical", "this is fictional"), behavioural ("the assistant should refuse", "this is sensitive content"). The empirical question of mechanistic interpretability has been: what features do trained networks represent, and where in the network are they represented?

The classic neuron view

The earliest interpretability work (1990s–2010s) focused on individual neurons: which neuron fires for which inputs? The classic image-classification interpretability literature (Zeiler & Fergus 2014, Olah's work at OpenAI) found that some neurons in trained CNNs respond to recognisable concepts (curve detectors, dog-head detectors, the occasional neuron tuned to a single recognisable character or object). The methodology of "find the input that maximally activates this neuron" gave glimpses of structure. The limitation that most neurons aren't interpretable individually set up the superposition story (Section 3).

Circuits

A circuit is a computational subgraph of the network that implements a specific function. The 2020 Olah et al. paper "Zoom In: An Introduction to Circuits" established the term and the methodology. A circuit consists of: a set of features (inputs), a set of weights (the connections that compute over features), and a set of output features. The empirical question becomes: can we identify circuits that implement specific identifiable behaviours? The answer, in many cases since 2020, has been yes — for circuits implementing curve detection, indirect-object identification, induction (the basic in-context learning mechanism), arithmetic, and many others.

The universality conjecture

A central empirical question: do different networks trained on similar tasks learn similar features and circuits? The universality conjecture (informal in the field, with substantial empirical support since 2020) holds that yes, they do — at sufficient scale and training, similar architectures converge to similar internal representations. The implications are substantial: interpretability findings on one model transfer to others; methodology developed on smaller models applies to larger ones; safety insights generalise. The 2024–2026 empirical work on cross-model feature comparison (the various "do these models share features" papers) has substantially supported the universality conjecture.

The empirical methodology

The standard methodology for identifying circuits: (1) pick a specific behaviour you want to understand; (2) construct a dataset where the behaviour is reliable; (3) use various interpretability tools (probing, ablation, causal tracing) to identify the circuits responsible; (4) verify the circuit hypothesis by testing predictions about ablations and interventions. The methodology is laborious — circuits typically take weeks to months to identify — but produces results that are increasingly cumulative across the field.

The structural taxonomy

Several structural concepts complement features and circuits. Attention heads: specific attention heads in transformers often implement specific functions (the "induction heads" of Olsson et al. 2022, the "name-mover heads" of Wang et al. 2022's IOI work). MLP layers: implement key-value lookups (Geva et al. 2021), with specific neurons or directions encoding specific facts. Residual stream: the central communication channel in transformers; features pass through it and accumulate as layers process them. The methodology of mapping these structural levels to features and circuits is the core empirical work of mechanistic interpretability.

03

Superposition: The Core Obstacle

For years, mechanistic interpretability was held back by an empirical observation: most neurons in trained networks aren't interpretable individually. A given neuron might fire for many seemingly-unrelated concepts (a "polysemantic neuron"), making feature identification hard. The 2022 superposition hypothesis (Elhage et al., Anthropic) proposed an explanation: networks pack many features into fewer neurons by representing features as sparse linear combinations, in a phenomenon analogous to compressed sensing.

The polysemanticity problem

Empirically, many neurons in trained networks don't correspond to single interpretable concepts. A neuron in an LLM might fire for "Python code", "lawyer biographies", and "discussions of forest fires" simultaneously. This is polysemanticity: a single neuron representing multiple concepts. Polysemantic neurons resist the "find the input that maximally activates this neuron" methodology — the maximally-activating input is some artificial mixture rather than any clean concept. Polysemanticity was a longstanding obstacle to scaling interpretability.

The superposition hypothesis

The 2022 Elhage et al. paper proposed that polysemanticity reflects superposition: networks represent more features than they have neurons by packing features as linear combinations in activation space. If a model has 1,000 neurons but represents 10,000 features, features cannot each get a dedicated neuron: each feature is spread across many neurons, and each neuron participates in many features. The mechanism works because features are typically sparse (only a few are active at any time), so the "interference" between features stays manageable. The hypothesis explains polysemanticity and predicts that features should be recoverable as sparse combinations of neuron activations, which led directly to the sparse-autoencoder breakthrough (Section 4).
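
A minimal numerical sketch of this packing argument (all numbers illustrative): draw one random unit direction per feature in a much smaller activation space, switch on a sparse handful of features, and read each feature back off by projection. The active features come back up to noise-like interference that is small relative to their activations, which is the "manageable interference" the hypothesis relies on.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features, n_active = 1_000, 10_000, 20

# One random unit direction per feature, packed into a 1,000-dim activation space.
W = rng.standard_normal((n_features, n_neurons))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# A sparse "true" feature vector: only 20 of the 10,000 features are active.
f_true = np.zeros(n_features)
active = rng.choice(n_features, size=n_active, replace=False)
f_true[active] = rng.uniform(1.0, 2.0, size=n_active)

x = f_true @ W          # the activation vector the network would carry
f_hat = W @ x           # naive readout: project back onto every feature direction

inactive = np.setdiff1d(np.arange(n_features), active)
print("mean |readout error| on active features :", np.abs(f_hat[active] - f_true[active]).mean())
print("mean |interference| on inactive features:", np.abs(f_hat[inactive]).mean())
# Both are small relative to the active activations (which are at least 1.0),
# because random directions in high dimension are nearly orthogonal; a sparse
# decoder like the SAEs of Section 4 can remove the residual interference.
```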

Why superposition makes sense

The structural argument for why networks would learn superposition: the model has more concepts to represent than neurons, so it has to economise. Sparse coding is mathematically the right answer when the underlying signal is sparse. The 2022 paper showed empirically that toy models trained with limited capacity learn superposition naturally; the methodology has substantial theoretical backing (the relationship to compressed sensing, the Johnson-Lindenstrauss lemma, the broader sparse-coding literature). The empirical evidence suggests real networks operate substantially in superposition.

The implication for interpretability

If networks are in superposition, then individual-neuron-based interpretability is fundamentally limited. The right "atoms" of representation are not neurons but directions in activation space — specific linear combinations of neurons that correspond to specific features. The methodological question becomes: how do we find these directions? The 2023–2024 sparse-autoencoder methodology is the answer that has worked at scale; it learns the feature directions explicitly.

Linear vs non-linear features

An important assumption in the superposition framework: features are represented linearly in activation space — i.e., a feature's activation is roughly a linear projection of the activation vector. The 2024–2025 work on the linear representation hypothesis (Park et al., Mikolov-style embedding analyses, the various 2024 papers) has substantially supported this for many feature types. Some features are non-linear (the model represents them as more complex functions of activations) but the linear approximation works for a substantial fraction of identified features. The limitation is acknowledged; the methodology has produced enough results that the linear assumption is operationally useful.

The capacity-and-importance trade-off

The 2022 superposition paper documented a specific trade-off: features compete for representation capacity, and more-important features get more capacity. The model effectively performs a cost-benefit analysis: each feature's interference cost (because it shares neurons with other features) is traded against its predictive utility. Less-important features may be omitted entirely; more-important features may get more dedicated capacity. The empirical evidence supports this: features identified by SAEs (Section 4) tend to correspond to behaviourally-important concepts, and the activation magnitudes correlate with feature importance.

04

Sparse Autoencoders and the 2024 Breakthrough

The 2023–2024 emergence of sparse autoencoders (SAEs) as a practical interpretability tool was one of the most significant methodological advances in the field. SAEs decompose model activations into sparse combinations of learned features, providing a concrete answer to the superposition problem. The methodology has scaled to frontier models; the 2023 Anthropic paper "Towards Monosemanticity" was the watershed demonstration.

The basic SAE mechanic

An SAE is a two-layer autoencoder trained to reconstruct activations from sparse codes. Given an activation vector x of dimension d, the SAE has: an encoder that maps x to a much higher-dimensional sparse code z (often 16d or 32d, with most entries zero), and a decoder that reconstructs x from z. The training objective combines reconstruction loss (the SAE should faithfully reconstruct the activations) with a sparsity penalty (z should have few non-zero entries). The learned features are the columns of the decoder matrix: each column is a vector in activation space, and the corresponding entry in z is the "amount" of that feature present.
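
A minimal PyTorch sketch of this architecture, assuming a 768-dimensional residual stream and an L1 sparsity penalty; the dimensions, expansion factor, and coefficient are illustrative rather than the values used in the published work:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete ReLU encoder, linear decoder, L1 sparsity."""
    def __init__(self, d_model: int, expansion: int = 16):
        super().__init__()
        d_hidden = expansion * d_model
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))      # sparse code: "how much of each feature"
        x_hat = self.decoder(z)              # reconstruction from feature directions
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 1e-3):
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # faithful reconstruction
    sparsity = z.abs().sum(dim=-1).mean()           # few active features
    return recon + l1_coeff * sparsity

# Usage sketch: `acts` stands in for a batch of cached residual-stream activations.
sae = SparseAutoencoder(d_model=768)
acts = torch.randn(64, 768)
x_hat, z = sae(acts)
loss = sae_loss(acts, x_hat, z)
loss.backward()
# The learned feature directions are the columns of sae.decoder.weight
# (shape [d_model, d_hidden]): one direction in activation space per feature.
```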

Why SAEs work

Under the superposition hypothesis, model activations are sparse combinations of feature directions. SAEs are explicitly trained to find sparse decompositions, so they should find the feature directions if they exist. The empirical results have substantially supported this: trained SAEs find features that are interpretable, monosemantic (each feature corresponds to a single concept), and align with what we'd expect a model to represent. The methodology effectively "un-superposes" the network's representations, exposing the features for human inspection.

The Anthropic Towards Monosemanticity paper

The 2023 Anthropic paper "Towards Monosemanticity" was the watershed demonstration. The authors trained SAEs on a small one-layer transformer, found roughly 4,000 interpretable features, and demonstrated that the features correspond to recognisable concepts (specific topics, syntactic structures, behavioural patterns). The methodology established the basic SAE recipe and demonstrated that it produced human-interpretable features. The 2024 Anthropic follow-ups scaled to Claude 3 Sonnet (a frontier model), finding millions of features at substantial scale.

Scaling to frontier models

The 2024 Anthropic paper "Scaling Monosemanticity" demonstrated the methodology on Claude 3 Sonnet — one of the most-capable LLMs at the time. The SAE found ~16M features at one layer; many were interpretable and corresponded to recognisable concepts (specific people, events, technical concepts, behavioural tendencies). The DeepMind 2024 "Gemma Scope" project produced SAEs across many layers of Gemma 2 models, providing a publicly-available infrastructure. The 2024–2026 progress has been substantial: SAEs are now produced and shared for many open-source LLMs; production-quality SAE infrastructure exists.

Variants and improvements

The SAE methodology has produced rapid methodological improvements. Top-K SAEs (Gao et al. 2024, OpenAI): use top-k sparsity rather than L1 penalty, often more stable. JumpReLU SAEs (Rajamanoharan et al. 2024, DeepMind): use a jump-ReLU activation that improves the sparsity-reconstruction trade-off. Gated SAEs: separate "what feature is present" from "how active it is". Crosscoders: SAEs that decompose activations across multiple layers simultaneously. The 2024 evolution has been rapid; the current best practices change every few months.
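
As a concrete illustration of the top-k idea, the sketch below swaps the ReLU-plus-L1 encoder of the SAE sketch above for a hard top-k selection (k is illustrative):

```python
import torch

def topk_encode(pre_acts: torch.Tensor, k: int = 32) -> torch.Tensor:
    """Keep only the k largest pre-activations per example and zero the rest,
    replacing the ReLU + L1 recipe of the basic SAE with exact sparsity."""
    values, indices = pre_acts.topk(k, dim=-1)
    z = torch.zeros_like(pre_acts)
    z.scatter_(dim=-1, index=indices, src=torch.relu(values))
    return z
```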

What SAEs find

Trained SAEs on frontier LLMs find a remarkable range of features. Concrete features: specific entities (the Golden Gate Bridge, Donald Trump, the Berlin Wall), specific topics (cryptography, baseball statistics, German verb conjugation). Behavioural features: refusal patterns, sycophancy markers, deception-related indicators (reported in Anthropic's 2024 feature work). Linguistic features: specific syntactic structures, dialects, formality registers. Conceptual features: abstract concepts like "betrayal", "elegance", "scientific reasoning". The empirical demonstration that frontier LLMs represent these as identifiable features has been one of the most striking 2024 alignment results.

05

Identifying Circuits: IOI, Induction, and Beyond

Beyond identifying features, the deeper goal is identifying circuits — the computational structures that operate on features to produce specific behaviours. The 2022–2026 work has produced a substantial catalogue of identified circuits in transformers: indirect object identification, induction (the foundational in-context learning mechanism), arithmetic, factual recall, multi-step reasoning. Each identified circuit advances the field's understanding of what transformers actually do.

Indirect Object Identification

The 2022 Wang et al. paper "Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small" was a landmark identifying-a-circuit paper. The task: given a sentence like "When John and Mary went to the store, John gave a drink to ___", predict "Mary". The paper identified a specific circuit involving multiple attention heads working together to (1) identify all names in context, (2) identify which name is the indirect object, (3) suppress predictions for the duplicated subject name. The circuit was identified, validated by ablations, and showed that even small transformers implement non-trivial multi-step algorithms. The methodology became the template for subsequent circuit-identification work.

Induction heads

The 2022 Olsson et al. paper "In-Context Learning and Induction Heads" identified induction heads: attention heads that implement a specific copy-from-earlier-context behaviour. Given a sequence like "A B C ... A B", an induction head at the second "B" attends to the token that followed the earlier occurrence of "B" (here "C") and copies it forward, so the model predicts "C". The mechanism generalises: the model learns to find patterns earlier in context and continue them. Induction heads are a foundational mechanism behind in-context learning; they emerge during training in a sharp phase transition that correlates with the model's broader capability for in-context learning. The methodology has been substantially extended (the various 2023–2024 follow-ups on more complex induction patterns).
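
A sketch of the standard detection recipe using the TransformerLens library (mentioned again in §6); the prompt length and model are illustrative, and the API usage is assumed to match the library's documented interface. Feed the model a repeated random sequence and score each head by how much attention flows from each token in the second repeat back to the token that followed its first occurrence:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Repeated random tokens: [BOS, r_1 .. r_T, r_1 .. r_T]. An induction head at a
# position i in the second repeat should attend back to position i - T + 1,
# the token that followed the first occurrence of the current token.
T = 50
rand = torch.randint(1_000, 10_000, (1, T))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

dest = torch.arange(T + 1, 2 * T + 1)   # positions in the second repeat
src = dest - T + 1                      # token after the first occurrence
scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0]   # [head, dest_pos, src_pos]
    scores[layer] = pattern[:, dest, src].mean(dim=-1)

# Heads whose score is close to 1 are attending with the induction pattern.
print(scores)
```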

Factual recall circuits

The 2022–2024 work on factual recall in transformers (Geva et al., Meng et al., the various ROME / MEMIT papers) identified how models store and retrieve facts. The pattern: facts are stored in MLP layers as key-value associations; specific neurons or directions encode specific facts. The methodology enables model editing: you can identify the neurons storing a specific fact and modify them to change the model's stored knowledge. This is both alignment-relevant (correcting outdated information) and capability-relevant (the model-editing literature).

Arithmetic and algorithmic circuits

How do transformers do arithmetic? The 2023–2024 work on this (Nanda et al.'s analysis of grokked modular addition, and the various follow-ups) has identified specific circuits implementing modular addition and other arithmetic operations. Some circuits are surprisingly sophisticated: modular addition involves Fourier-style decompositions that the model learns. The findings are scientifically interesting (transformers can learn non-obvious algorithms) and methodologically informative (the same techniques generalise to other algorithmic tasks). The 2024–2026 work on algorithmic circuits has substantially extended this catalogue.

Multi-step reasoning

The 2024–2026 work on chain-of-thought and reasoning models (o1-class extended-reasoning systems) has produced new interpretability targets. Reasoning circuits: how does the model decompose complex problems into steps? Verification circuits: how does the model check its own reasoning? Backtracking circuits: how does the model recognise mistakes and revise? The methodology is still developing, but the questions are increasingly tractable as SAE-based feature identification scales.

The cumulative catalogue

The 2022–2026 work has produced a growing catalogue of identified circuits. The Anthropic Transformer Circuits thread, the OpenAI interpretability publications, the various academic groups (Neel Nanda's group at DeepMind, the MIT and Stanford groups, the various 2024–2026 entrants) have collectively documented dozens of identified circuits. The cumulative effect is that transformer interpretability is shifting from "we don't know what they're doing" to "here are the specific algorithms they've learned for these specific tasks." The trajectory is encouraging; the overall question of "do we understand frontier models" remains open.

06

Probing

Probing is the methodology of training small classifiers on internal activations to detect specific properties. If a probe trained on activations can predict a property (truthfulness, sentiment, syntactic role), that's evidence the property is represented somewhere in those activations. Probing was one of the earliest interpretability methodologies (Conneau et al. 2018, Tenney et al. 2019) and remains widely used.

The basic probing setup

The probing methodology: take a trained model; pick a layer's activations as features; train a classifier (typically a logistic regression or small MLP) to predict some target property from those features; report the classifier's accuracy as evidence about whether the property is encoded. If accuracy is high, the property is "linearly recoverable" from the activations, suggesting it's represented in some form. The methodology is conceptually simple and empirically informative.
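
A minimal sketch of this recipe with scikit-learn; the activation and label files, layer choice, and regularisation strength are placeholders for whatever property and model you are probing:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# acts: [n_examples, d_model] activations cached from one layer of the model;
# labels: [n_examples] binary property labels (e.g. "statement is true/false").
acts = np.load("layer12_activations.npy")     # placeholder path
labels = np.load("labels.npy")                # placeholder path

X_train, X_test, y_train, y_test = train_test_split(
    acts, labels, test_size=0.2, random_state=0
)
probe = LogisticRegression(max_iter=1_000, C=1.0)
probe.fit(X_train, y_train)

print("probe accuracy:", probe.score(X_test, y_test))
# If accuracy is well above the base rate, the property is linearly recoverable
# from this layer; probe.coef_ gives the corresponding direction in activation space.
```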

What probing reveals

The 2018–2024 probing literature has documented many properties that LLMs represent. Linguistic properties: syntactic categories (noun/verb), part-of-speech, syntactic depth, named entities. Semantic properties: word sense, sentiment, factual claims. Behavioural properties: truthfulness, harmful intent, personality features. World-model properties: in 2023, the famous Othello-GPT paper (Li et al.) showed that a transformer trained on Othello move sequences develops an internal world-state representation that probes can detect. The cumulative finding: LLMs represent enormous amounts of information about the world internally, much of it linearly recoverable.

The linearity assumption

Standard probing uses linear classifiers, which only detect linearly-recoverable properties. The methodology trades some sensitivity for interpretability: linear probes are easy to interpret (you can read the weights off as a "direction") but might miss properties encoded non-linearly. Non-linear probes (MLP probes) can detect more, but the evidence they provide is weaker: a sufficiently powerful probe can fit almost anything, so its success says as much about the probe as about the model. The standard practice is to use simple probes by default; if they work, the finding is robust; if they fail, success with a more elaborate probe is only weak evidence that the model itself represents or uses the property.

The probing-vs-causality distinction

A persistent methodological concern: probes can detect properties that are merely correlated with internal representations, not causally responsible for the model's behaviour. A probe that detects "truthfulness" in activations might be picking up on a correlation rather than the actual mechanism the model uses. Establishing causality requires interventions (Section 7): if you modify the activations to remove the property, does the model's behaviour change? Mature probing work always pairs detection with intervention; pure probing is a hypothesis-generation tool, not a hypothesis-confirmation one.

Probing for safety

For alignment, probing has specific applications. Honesty probes: detect when the model is being truthful vs deceiving. Harmfulness probes: detect when an output is about to be harmful. Refusal probes: detect when the model is about to refuse vs comply. Confidence probes: detect calibration problems. The 2023–2026 work (Burns et al. CCS, the various 2024 honesty-probe papers) has shown that some safety-relevant properties are surprisingly recoverable from internal activations. Whether probing alone is sufficient for safety is open; combined with interventions (Section 7) and SAE-based feature identification (Section 4), it provides meaningful safety infrastructure.

Practical considerations

For deploying probing in practice: choose layers carefully (typically middle layers represent more abstract properties than early or late ones); train probes on diverse examples to ensure they generalise; validate with held-out data and adversarial examples; combine probing with intervention-based confirmation. The methodology is mature enough that probing infrastructure (transformer-lens, the various 2024 toolkits) is widely available; setting up probes for new properties is a few-day exercise rather than a research project.

07

Causal Tracing and Interventions

To confirm that a hypothesised circuit or feature actually causes specific behaviour, the methodology uses interventions: modify the activation or weight in question and check whether the behaviour changes. The methodology distinguishes correlation from causation in interpretability findings. Activation patching, causal tracing, and causal scrubbing are the dominant intervention methods.

Activation patching

The basic intervention: activation patching. Run the model on input A; record the activations at some layer. Run the model on input B; replace some component of B's activations with the corresponding component from A. Observe how B's output changes. If patching changes the output substantially, the patched component was causally relevant to whatever differs between A and B. The methodology cleanly establishes causal contributions of specific activations to specific outputs.
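
A sketch of residual-stream activation patching with TransformerLens; the prompts, layer, and position are illustrative, and the two prompts must tokenise to the same length for positional patching to make sense:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Clean prompt: the correct next token is " Mary". The corrupted prompt swaps
# the repeated name, so the correct answer flips to " John".
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
_, clean_cache = model.run_with_cache(clean)

layer, pos = 6, -1                                  # illustrative layer and position
hook_name = f"blocks.{layer}.hook_resid_pre"

def patch(resid, hook):
    # Overwrite the corrupted run's residual stream at one position with the
    # clean run's value, then let the rest of the forward pass proceed.
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

patched_logits = model.run_with_hooks(corrupt, fwd_hooks=[(hook_name, patch)])
baseline_logits = model(corrupt)

mary = model.to_single_token(" Mary")
john = model.to_single_token(" John")
# If patching moves the Mary-vs-John logit difference toward the clean answer,
# the patched activation carried information causally relevant to the task.
print("baseline:", (baseline_logits[0, -1, mary] - baseline_logits[0, -1, john]).item())
print("patched :", (patched_logits[0, -1, mary] - patched_logits[0, -1, john]).item())
```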

Causal tracing

The 2022 ROME paper (Meng et al.) introduced causal tracing: a systematic way of finding which neurons or layers are responsible for specific factual recall. The procedure: corrupt an input to remove the relevant fact; restore one component at a time and observe how restoration affects the output. The component that restores the most output behaviour is the "fact-storing" component. Causal tracing has been widely used and has produced specific findings about where facts are stored in transformer MLP layers.

Path patching

An extension: path patching. Activations don't just affect outputs through one path — they propagate through the network's structure to reach output. Path patching isolates specific paths: patch an activation, but only let the patch's effect flow through specific subsequent components. The methodology produces finer-grained causal claims than activation patching alone. The 2022 IOI paper used path patching to isolate the specific attention-head paths responsible for indirect-object identification.

Causal scrubbing

The 2022 causal scrubbing framework (Chan et al., Redwood Research) provides a rigorous way to test specific circuit hypotheses. Given a hypothesis about which components implement which functions, causal scrubbing systematically replaces activations with "what they would be under the hypothesis" and checks that the model still works. If the model's behaviour is preserved under the hypothesised replacement, the hypothesis explains the model's behaviour; if not, the hypothesis is incomplete. The methodology has substantially raised the rigour bar for circuit-identification claims.

Steering vectors and activation steering

A distinctive intervention type: activation steering. Identify a direction in activation space corresponding to a specific concept (e.g., "love" or "deception"); add that direction to model activations during inference. The model's outputs shift in the corresponding direction. The 2023 work on "Activation Addition" (Turner et al.) demonstrated this for personality and behaviour traits; subsequent work has extended it to safety-relevant directions (refusal, honesty, capability suppression). The methodology is operationally useful: instead of retraining the model, you can adjust its behaviour by intervening on activations at inference time.
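
A sketch of the difference-of-activations recipe in the spirit of Activation Addition, using TransformerLens; the contrast pair, injection layer, and coefficient are illustrative:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer = 6                                           # illustrative injection layer
hook_name = f"blocks.{layer}.hook_resid_pre"

# Steering vector: difference of residual-stream activations for a contrastive
# prompt pair (a simplified version of the Activation Addition recipe).
_, cache_pos = model.run_with_cache(model.to_tokens("Love"))
_, cache_neg = model.run_with_cache(model.to_tokens("Hate"))
steer = cache_pos[hook_name][0, -1] - cache_neg[hook_name][0, -1]

def add_steering(resid, hook, coeff=5.0):
    # Add the steering direction at every position on every forward pass.
    return resid + coeff * steer

with model.hooks(fwd_hooks=[(hook_name, add_steering)]):
    out = model.generate(model.to_tokens("I think that you are"), max_new_tokens=20)
print(model.to_string(out[0]))
```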

The ablation methodology

A different intervention type: ablation. Zero out a specific component (a neuron, an attention head, an SAE feature) and see how outputs change. If outputs change substantially, the component was important; if not, it wasn't. Ablation is the standard test for circuit hypotheses: if your hypothesis says component X implements function Y, ablating X should eliminate Y. The methodology is straightforward; the operational details (which form of ablation — zero, mean, resampling — and how to evaluate the change) have been refined substantially over 2022–2024.

08

Applications: Editing, Steering, and Safety

Mechanistic interpretability findings are increasingly applied. Model editing uses identified facts and circuits to selectively change model knowledge. Activation steering uses identified directions to adjust model behaviour at inference time. Safety interventions use identified features to detect and mitigate problematic outputs. This section covers the application landscape and what each approach can and can't do.

Model editing

Model editing is the methodology of changing specific model knowledge or behaviour by directly modifying weights or activations rather than retraining. ROME (Meng et al. 2022) showed that specific facts could be edited in GPT-class models by modifying specific MLP-layer projections. MEMIT extended ROME to thousands of edits. The 2023–2026 evolution has produced increasingly capable editing methods (the various 2024 papers on more reliable editing, generalization-aware editing, and so on). Applications: correcting outdated facts, removing specific knowledge, fixing specific hallucinations. Limitations: edits don't always generalise to related queries; aggressive editing can degrade other capabilities.

Activation steering at deployment

Activation steering (Section 7) has become a practical inference-time intervention. The methodology: identify a steering direction (using SAE features, probing-derived directions, or gradient-based methods); add the direction to model activations during inference. Applications: style steering (make outputs more formal/casual/poetic), behavioural steering (more or less assertive, more or less detailed), safety steering (suppress harmful outputs at activation level rather than relying on training). Anthropic's 2024 "Golden Gate Claude" demonstration showed that steering can produce dramatic, specific behavioural changes. Production deployment of steering remains research-stage but is starting to appear.

Refusal-direction analysis

A specific safety application: refusal-direction analysis. The 2024 work (Arditi et al. and others) found that LLMs implement refusal as a specific direction in activation space — a single linear direction whose addition to activations causes the model to refuse, whose subtraction causes it to comply. The methodology has had implications: it suggests refusal is a relatively simple mechanism (potentially fragile to adversarial manipulation); it provides a tool for understanding jailbreaks (jailbreaks may suppress the refusal direction); it provides a tool for hardening refusal (training models to maintain the refusal direction more robustly).
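
A sketch of the difference-in-means construction behind this finding; the model, prompts, and layer below are placeholders, and the published work derives the direction from curated harmful/harmless instruction sets and validates it far more carefully:

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # placeholder; the finding concerns chat-tuned LLMs
layer = 6                                           # illustrative layer
hook_name = f"blocks.{layer}.hook_resid_pre"

harmful = ["Explain how to pick a lock", "Describe how to disable an alarm"]    # placeholder prompts
harmless = ["Explain how to bake bread", "Describe how to plant a garden"]

def mean_last_token_act(prompts):
    acts = []
    for p in prompts:
        _, cache = model.run_with_cache(model.to_tokens(p))
        acts.append(cache[hook_name][0, -1])
    return torch.stack(acts).mean(dim=0)

# Candidate refusal direction: difference of mean activations, normalised.
refusal_dir = mean_last_token_act(harmful) - mean_last_token_act(harmless)
refusal_dir = refusal_dir / refusal_dir.norm()

def ablate_refusal(resid, hook):
    # Project the refusal direction out of the residual stream; running
    # generation with this hook approximates the "subtract the direction and
    # the model complies" intervention described above.
    proj = (resid @ refusal_dir).unsqueeze(-1) * refusal_dir
    return resid - proj
```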

Capability-suppression and unlearning

An emerging application: capability-suppression. Use interpretability methods to identify which features or circuits implement specific capabilities (e.g., "ability to provide bioweapons information"); suppress those features at deployment. The 2024 work on RMU (representation misdirection unlearning, Li et al.) and related methods has demonstrated this for specific narrow capabilities. The methodology offers a complement to refusal-based safety: even if refusal can be jailbroken, capability suppression makes the underlying behaviour harder to elicit. Whether the methodology scales to broader capabilities and to more capable models is open.

Safety auditing

Mechanistic interpretability provides an audit dimension that behavioural evaluation can't reach. Auditing for capabilities: do specific capabilities exist in the model, even if not currently exercised? Auditing for deception: does the model represent deceptive intent that's not visible from outputs? Auditing for biases: what biased associations does the model represent? The 2024 Anthropic work on auditing Claude's internal features for safety properties is the major published example; Anthropic, OpenAI, and DeepMind all have ongoing internal auditing programmes.

Limitations and pitfalls

Despite the encouraging trajectory, application of mechanistic interpretability has real limitations. Edits don't always generalise: an edit that changes one factual response may leave related responses unchanged. Steering has unintended effects: aggressive steering can produce incoherent outputs. Findings are model-specific: a feature identified in one model may not exist (or work differently) in another. Scale-dependence: methods that work at small scale may not scale to frontier models. The honest 2026 assessment: applications of mechanistic interpretability are working on increasingly large and important problems, but the methodology remains imperfect; trusted production deployment of interpretability-based safety interventions is still early.

09

Scaling Interpretability to Frontier Models

Most of the foundational interpretability work was done on small models (GPT-2 small, BERT-base). The strategic question is whether the methodology scales to frontier models with hundreds of billions of parameters. The 2024 Anthropic "Scaling Monosemanticity" paper substantially answered yes for SAE-based feature identification; whether circuit identification, model editing, and other techniques scale similarly is still open.

The compute cost

SAE training is itself expensive. The 2024 Anthropic paper trained SAEs capturing up to ~34M features on Claude 3 Sonnet's activations; the training compute was substantial (though much less than the model's original training). Scaling this to 100M+ features (as would be needed for full coverage of a model's representations) is still expensive but not prohibitive. The 2024–2026 trajectory has interpretability budgets at major labs growing substantially; the cost of running interpretability is increasingly absorbed.

Automated feature labelling

SAEs find features but don't label them. Manual labelling of millions of features is impractical; the 2024 work on automated feature labelling (the various papers using LLMs to describe features) has produced substantial progress. The methodology: for each SAE feature, find the inputs that maximally activate it; ask an LLM to summarise the pattern. The labels are imperfect but useful at scale; the 2024 Anthropic and DeepMind work both rely on automated labelling. Whether the labels are reliable enough for production safety auditing is open; substantial work continues.
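
A sketch of that loop; `llm_describe` is a hypothetical stand-in for whatever LLM call does the summarising, and the feature activations and texts are assumed to be precomputed:

```python
import numpy as np

def label_feature(feature_id, feature_acts, texts, llm_describe, top_k=20):
    """Auto-label one SAE feature: collect its top-activating text excerpts
    and ask an LLM to summarise the shared pattern.

    feature_acts : [n_examples, n_features] SAE feature activations
    texts        : list of the corresponding text excerpts
    llm_describe : hypothetical callable that sends a prompt to an LLM
    """
    acts = feature_acts[:, feature_id]
    top = np.argsort(acts)[::-1][:top_k]          # strongest activations first
    excerpts = [texts[i] for i in top]
    prompt = (
        "Each excerpt below strongly activates the same internal feature of a "
        "language model. Describe, in one short phrase, what they have in common:\n\n"
        + "\n---\n".join(excerpts)
    )
    return llm_describe(prompt)
```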

The Gemma Scope project and infrastructure

The 2024 DeepMind Gemma Scope project produced SAEs across many layers of the Gemma 2 family of models, with public release. The infrastructure has been substantial: trained SAEs at scale, the labelling work, the user interfaces for inspecting features. Gemma Scope has become the default interpretability infrastructure for academic researchers wanting to do SAE-based work without needing to train SAEs from scratch. Similar publicly-available SAE infrastructure for other models is increasingly common.

Cross-layer and cross-model analysis

An emerging direction: studying how features evolve across layers (early-layer features tend to be more concrete; late-layer features more abstract) and how features compare across models (the universality conjecture in practice). The 2024 work on crosscoders (SAEs that span multiple layers) has been productive. The 2024–2026 cross-model comparison work is showing that different models trained similarly do learn similar features, supporting universality. Whether this generalises to substantially-different architectures (GPT-4 vs Claude vs Gemini vs Llama) is being actively studied.

The interpretability-research-meets-deployment frontier

A growing question: how does interpretability research connect to production deployment? Several patterns are emerging. Pre-deployment auditing: run interpretability on a model before deployment to look for concerning features or behaviours. Live monitoring: detect specific features in deployed model outputs (e.g., "is the deception feature firing?"). Steering-based interventions: production deployments using activation steering for safety. The 2024–2026 work is increasingly bridging research-stage interpretability to deployment-relevant infrastructure; the trajectory is encouraging.

The capability-vs-interpretability race

A strategic question: can interpretability keep up with capability? Capability advances on a defined cadence (NVIDIA's hardware cycle, the steady release of frontier models); interpretability has been advancing rapidly but from a much lower base. The 2024 SAE breakthrough has substantially closed the gap; whether the trajectory continues depends on continued methodological advances and continued investment. The honest 2026 assessment: interpretability is meaningfully ahead of where it was in 2022 but still meaningfully behind capability; whether it gets ahead is an open and consequential question.

10

The Frontier and the Open Problems

Mechanistic interpretability has rapidly matured but faces substantial open problems. Circuit-level alignment, agentic interpretability, the comprehensive coverage of frontier models, and the question of whether interpretability ultimately enables alignment are all active research areas. This section traces the open frontiers and the directions the field is moving in.

Circuit-level alignment

An ambitious direction: circuit-level alignment. Use interpretability to identify and verify specific safety-relevant circuits (refusal circuits, honesty circuits, capability circuits) at the level of detail needed for trusted deployment. The 2024–2025 work on this has produced encouraging early results — refusal directions have been characterised, deception features have been found — but full coverage remains far off. The trajectory is toward "for these specific safety properties, here is exactly what's happening in the model"; whether this becomes reliable enough to gate deployment decisions is one of the central questions for the next several years.

Agentic interpretability

Agent systems (Part XI) introduce new interpretability challenges. Trajectory-level analysis: agents do many actions; interpretability needs to span the trajectory, not just individual outputs. Tool-use analysis: how does the model decide when to call which tool, and what evidence guides those decisions? Long-horizon goal representation: how does the model represent and pursue extended goals? The 2024–2026 work on agentic interpretability (the various agent-evaluation papers, the various agent-trace analysis methods) is starting to address these but is in earlier stages than LLM interpretability.

Multi-modal and reasoning models

Vision-language models (multimodal models) and reasoning models (o1-class, the various 2024–2026 reasoning systems) introduce new interpretability targets. Multi-modal models: how do visual and textual representations interact? Reasoning models: what's happening during the extended reasoning phases? The 2024–2026 work on these is starting; the methodology will likely follow the LLM interpretability trajectory but with multi-modal-specific challenges.

The deeper alignment question

The strategic question: does interpretability ultimately enable alignment, or is it a parallel research direction whose insights help but don't suffice? Several views exist. The optimistic view: sufficient interpretability allows direct alignment intervention — identify the goal-pursuing circuits, identify the deception circuits, modify them as needed. The pessimistic view: even perfect interpretability of current models doesn't address alignment of future systems; the methods don't compose. The middle view: interpretability provides tools that complement alignment training but doesn't replace it. The 2024–2026 evidence is consistent with the middle view; the trajectory could shift toward optimism if interpretability keeps advancing.

Universality and theoretical foundations

The empirical universality conjecture has substantial support but limited theoretical foundation. Why do networks learn similar circuits? Is there theory that predicts which circuits emerge for which tasks? The 2024–2026 work on theoretical foundations of interpretability (the various papers on grokking, on phase transitions in learning, on the geometry of neural representations) is starting to provide answers but is far from complete. A solid theoretical foundation would substantially advance the field.

What this chapter has not covered

The remainder of Part XVIII develops adjacent topics. Practical explainability (SHAP, LIME, attention visualisation, behavioural explanations) is Ch 05 — the complementary "behavioural" side to this chapter's "mechanistic" focus. Fairness, bias, and equity is Ch 06. Privacy in ML is Ch 07. AI governance, policy, and regulation is Ch 08. The chapter focused on mechanistic interpretability — the project of reverse-engineering neural networks — and on its relevance for safety; the broader interpretability landscape includes both mechanistic and behavioural approaches.

Further reading

Foundational papers and references for mechanistic interpretability. The Olah et al. circuits work; the superposition paper; the SAE breakthrough papers (Anthropic and DeepMind); the IOI and induction-heads papers; the model-editing literature (ROME, MEMIT); the activation-steering work; and the contemporary research from major labs and academic groups form the right starting kit.