Explainability for Practitioners: where models meet the people who depend on them.
Mechanistic interpretability (Ch 04) is the project of reverse-engineering neural networks at the circuit level — deep, scientifically ambitious, and demanding. Practical explainability is the complementary discipline of producing human-understandable explanations of individual predictions, fast enough and cheap enough to deploy in production. The two have different goals: mechanistic interpretability seeks to understand what the model is doing; practical explainability seeks to communicate why this prediction was made to a downstream user, regulator, or auditor. SHAP assigns each input feature a contribution to the prediction using game-theoretic principles. LIME approximates the model locally with an interpretable surrogate. Saliency maps highlight which input pixels or tokens contributed most. Counterfactual explanations describe what would have to change to get a different prediction. Each method has its place; choosing the right one depends on the model, the audience, and the stakes. This chapter develops the practical methodology with the depth a working ML practitioner needs: the algorithms, the trade-offs, the deployment realities, and the regulatory contexts that increasingly demand explanations.
Prerequisites & orientation
This chapter assumes the deep-learning material of Part VI, the AI safety material of Ch 01, and the mechanistic interpretability material of Ch 04. Familiarity with feature attribution methods (gradient-based, perturbation-based) and basic game theory (Shapley values) helps for §3–4 but we cover the essentials. The chapter is written for ML engineers, applied scientists, model risk officers, and product engineers who need to provide explanations of model decisions to stakeholders — increasingly required for regulated industries (finance, healthcare, hiring, insurance) and for general high-stakes deployments. The methodology is mature enough that production-quality libraries exist (SHAP, LIME, Captum, the various 2024–2026 entrants); the engineering work is choosing among them and integrating them appropriately.
Three threads run through the chapter. The first is the local-vs-global distinction: practical explainability is mostly about explaining individual predictions (local) rather than the whole model (global), and methodological choices follow this distinction. The second is the faithfulness question: does the explanation actually reflect what the model is doing, or is it a story that's pleasing but misleading? Most explainability methods have known faithfulness limitations and the operational discipline is being honest about them. The third is the regulatory-and-stakeholder dimension: explanations are increasingly demanded by regulators, auditors, and affected users, and the form of explanation matters for these audiences in ways that aren't obvious from the technical methodology. The chapter develops each in turn.
Why Explainability Is Different from Interpretability
Explainability and interpretability are often used interchangeably, but the practical disciplines have different goals, audiences, and methods. Mechanistic interpretability (Ch 04) seeks deep scientific understanding of what neural networks compute. Explainability seeks operational explanations that downstream stakeholders can act on — at production-quality speed and reliability. Both are valuable; conflating them produces confusion in both directions.
The audience-driven framing
Practical explanations are produced for someone — a model risk officer who needs to verify a credit decision, a doctor reviewing a model's recommendation, a customer asking why their loan was denied, a regulator auditing for bias. The audience determines what counts as a good explanation. A loan denial requires a list of factors a customer can act on; an internal model audit requires a quantitative breakdown of feature contributions; a regulatory submission requires documentation that meets specific compliance requirements. The first design question for any explainability work is: who is the audience, and what do they need?
The faithfulness problem
An explanation is faithful if it accurately reflects what the model is actually doing. An explanation can be plausible, intuitive, and well-structured while being completely wrong about why the model made its prediction. The explainability literature has documented many cases where popular methods produce explanations that look reasonable but don't actually match the model's behaviour: gradient-based saliency maps that are independent of the trained model's parameters (the "Sanity Checks for Saliency Maps" paper, Adebayo et al. 2018), SHAP attributions that diverge between methods, LIME explanations that vary substantially across runs. The discipline of practical explainability is constantly negotiating between producing intelligible explanations and producing faithful ones.
Local vs global
A fundamental distinction. Local explanations describe why a specific prediction was made (why was this loan denied?). Global explanations describe how the model behaves overall (which features matter most across all predictions?). Most production explainability is local — individual decisions need individual justifications. Global explanations are useful for model audit, debugging, and high-level documentation but don't directly help with per-decision questions. The methods covered in this chapter are mostly local; some have global aggregations (e.g., averaging SHAP values across a dataset gives feature importances), but the local-prediction case is primary.
The interpretability-vs-performance trade-off
One school of thought (Cynthia Rudin and others, "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead", 2019) argues that for high-stakes decisions, you should use inherently-interpretable models (sparse linear models, decision trees, generalised additive models) rather than explaining black-box models post-hoc. The argument: post-hoc explanations are unreliable; interpretable models are often as accurate as black-box equivalents on tabular data. The counter-argument: deep learning provides substantial accuracy gains in many domains, and explanations of black-box models — while imperfect — are better than no explanations. Both views have merit; the right choice depends on context.
The downstream view
Operationally, practical explainability sits between trained models and the stakeholders consuming model decisions. Upstream: a deployed model producing predictions. Inside this chapter's scope: the methods that turn predictions into explanations (SHAP, LIME, saliency, counterfactuals, the various LLM-specific methods), the evaluation methodology, the regulatory and audience considerations. Downstream: explanations communicated to model risk officers, regulators, doctors, customers, auditors. The remainder of this chapter develops each piece: §2 feature attribution as the core methodology, §3 SHAP, §4 LIME, §5 saliency maps, §6 counterfactuals, §7 LLM explanations, §8 evaluation, §9 regulatory contexts, §10 the frontier.
Feature Attribution: The Core Methodology
Most practical explainability methods produce feature attributions: numerical scores indicating how much each input feature contributed to the prediction. The pattern is conceptually simple — for each feature, output a number — but the details (what does "contribution" mean, what's the baseline, how do we handle interactions) substantially shape the methodology. This section establishes the conceptual frame that underlies SHAP, LIME, saliency methods, and many others.
What "contribution" means
Several reasonable definitions of feature contribution exist. Marginal contribution: how much does the prediction change when we remove this feature? Average marginal contribution: averaged across all possible orderings of removing features (the Shapley-value approach). Gradient-based contribution: how much does the prediction change as the feature changes infinitesimally? Local-linear contribution: the coefficient on this feature in a local linear approximation of the model. Each definition is reasonable; each can produce different attributions for the same prediction. The methodological choice matters and should be matched to the audience's needs.
The baseline problem
Many attribution methods compare a feature's actual value to a baseline — what the feature would be if not present, or an "average" reference value. The choice of baseline materially affects attributions. Zero baseline: simple but often meaningless (a "zero pixel" isn't natural). Mean baseline: the average value of the feature in the training set. Sample baseline: a typical example from the training distribution. Counterfactual baseline: the closest realistic example. The 2018–2024 literature has documented that different baselines can produce dramatically different attributions; production explainability must explicitly choose and document its baseline.
Additive vs interactive contributions
A subtle methodological question: do feature contributions add up to the prediction (additive attribution), or can features interact in ways that aren't captured by per-feature contributions alone? Most methods are additive by design — SHAP guarantees additivity, LIME approximates a linear model. But real models often have substantial feature interactions (the model treats feature A differently depending on feature B). Methods like interaction SHAP capture pairwise interactions; higher-order interactions are harder. The trade-off: interactive attributions are more accurate but harder for humans to interpret.
Local linearity assumption
Many methods (LIME, integrated gradients) assume that the model is approximately linear in a small neighbourhood around the input. This works well when it's true and produces misleading explanations when it's not. Highly non-linear models (deep networks, ensemble models with sharp decision boundaries) violate the assumption locally; explanations from local-linear methods may not faithfully represent what the model does even nearby. Mature practitioners check for non-linearity (e.g., by sampling neighbouring points and checking prediction consistency) before trusting local-linear explanations.
The explanation contract
For each attribution method, there's an implicit "contract" — what the attribution mathematically means. SHAP's contract: the average marginal contribution under all coalitions, satisfying specific axioms. LIME's contract: the coefficients of a linear surrogate that approximates the model locally. Integrated gradients' contract: the path integral of gradients along a straight line from baseline to input. Different contracts produce different numbers; understanding which contract you're using is essential for correctly interpreting results. The 2017 "A Unified Approach to Interpreting Model Predictions" paper (Lundberg & Lee) was particularly important for clarifying that several seemingly-different methods could be unified under the SHAP framework.
The methodological consequences
The right method depends on context. For tabular data with high-quality features, SHAP (Section 3) is the workhorse. For images, gradient-based saliency methods (Section 5) are more efficient. For text, attention visualisation and gradient methods both have their place. For situations where audiences need to understand "what would change the decision," counterfactual explanations (Section 6) are more directly useful than feature attributions. The methodology is a toolkit; mature practitioners choose tools rather than committing to a single approach.
SHAP and Shapley Values
SHAP (SHapley Additive exPlanations, Lundberg & Lee 2017) is the dominant feature-attribution method for tabular ML. It assigns each feature its Shapley value — the average marginal contribution to the prediction across all possible orderings of features. The methodology is mathematically principled (Shapley values are the unique attribution satisfying certain axioms), produces additive attributions, and has efficient implementations for tree models (TreeSHAP) that make it practical at production scale. SHAP is the workhorse explainability method for tabular ML in 2026.
Shapley values from game theory
The mathematical foundation: Shapley values come from cooperative game theory (Shapley 1953). The setup: N players cooperate; any coalition S of players produces a value v(S); how should the total value v(N) be divided fairly among the players? Shapley showed there's a unique attribution satisfying four axioms (efficiency, symmetry, dummy player, additivity), and gave a formula for it: each player's value is their marginal contribution as they join the coalition, averaged over all orderings in which players can arrive. SHAP applies this to ML: features are players; the model's prediction is the value; feature attributions are Shapley values.
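To make the averaging concrete, here is a minimal brute-force sketch (exponential in the number of features, so only usable for toy settings) that averages marginal contributions over every ordering. The toy value function is an additive model, for which each Shapley value should reduce to the weight times the feature's change from baseline; all names are illustrative.

```python
from itertools import permutations

def exact_shapley(value_fn, n_features):
    """Exact Shapley values by averaging each feature's marginal contribution
    over all orderings of feature arrival. Exponential in n_features."""
    phi = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        coalition = set()
        for i in order:
            before = value_fn(coalition)
            coalition = coalition | {i}
            after = value_fn(coalition)
            phi[i] += (after - before) / len(orderings)
    return phi

# Toy "model": a weighted sum over whichever features are present.
weights = [2.0, -1.0, 0.5]
x = [1.0, 3.0, 4.0]
baseline = [0.0, 0.0, 0.0]

def v(coalition):
    # Features in the coalition take their actual value, the rest the baseline.
    return sum(w * (x[i] if i in coalition else baseline[i])
               for i, w in enumerate(weights))

print(exact_shapley(v, 3))  # [2.0, -3.0, 2.0]: weight * (x - baseline) per feature
```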
The TreeSHAP algorithm
Computing Shapley values requires summing over all 2^N coalitions — exponential in feature count, infeasible for non-trivial models. TreeSHAP (Lundberg et al. 2018) is an exact algorithm for tree-based models (random forests, gradient-boosted trees) that runs in polynomial time by exploiting the tree structure. The algorithm computes exact Shapley values for tree models in O(TLD²) where T is the number of trees, L is the maximum number of leaves, and D is the maximum depth. TreeSHAP made SHAP practical at production scale for the dominant tabular-ML model class; it's the most-deployed explainability method in production today.
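A minimal usage sketch with the shap library, assuming a scikit-learn gradient-boosted classifier trained on synthetic data (all names illustrative); the last lines check the additivity contract, namely that per-row attributions plus the expected value reconstruct the model's margin output.

```python
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=6, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)        # exact, polynomial-time for tree ensembles
shap_values = explainer.shap_values(X[:50])  # one attribution per feature per row

# Additivity: attributions plus the expected value should match the model's
# margin (log-odds) output for each row, up to numerical tolerance.
margin = model.decision_function(X[:50])
print(np.abs(shap_values.sum(axis=1) + explainer.expected_value - margin).max())
```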
KernelSHAP for any model
For non-tree models, exact Shapley values aren't computable, but KernelSHAP approximates them by sampling. The approach: sample coalitions, fit a weighted linear regression that approximates the model's behaviour on the samples, the regression coefficients approximate Shapley values. KernelSHAP is model-agnostic (works for any model) but slower and noisier than TreeSHAP. The methodology bridges SHAP to the general black-box-explanation case at the cost of computation and precision.
DeepSHAP and gradient-based variants
For deep neural networks, several SHAP variants combine Shapley-value principles with gradient information. DeepSHAP uses DeepLIFT (Shrikumar et al. 2017) to propagate contributions through the network; GradientSHAP integrates gradients along the path from baseline to input. These methods are faster than KernelSHAP for deep networks but have looser fidelity to true Shapley values. For most production explainability of deep tabular models, the methods are adequate.
Visualisation and consumption
Beyond computing attributions, presenting them effectively matters. SHAP summary plots: feature importance plus distributional information. SHAP waterfall plots: per-prediction breakdown showing how features add up to the prediction. SHAP force plots: linearised view of feature contributions. The Python `shap` library provides production-quality visualisations; the visual conventions have become standard across the industry. Knowing how to read these plots is part of the basic ML engineering skill set.
Limitations and pitfalls
SHAP is the workhorse but has limitations. Computation cost: TreeSHAP scales but KernelSHAP can be expensive. Baseline sensitivity: results depend on the baseline choice; documenting the baseline is essential. Faithfulness for non-tree models: gradient-based SHAP variants are approximations. Causal misinterpretation: SHAP attributions are about model behaviour, not real-world causality; users frequently confuse them. Aggregation traps: averaging Shapley values across data can hide important variation. Mature SHAP deployment requires explicit attention to each of these.
LIME and Local Surrogates
LIME (Local Interpretable Model-agnostic Explanations, Ribeiro et al. 2016) was one of the earliest widely-adopted post-hoc explainability methods. The idea: approximate the black-box model locally with an interpretable surrogate (typically a sparse linear model), and use the surrogate's coefficients as the explanation. LIME predates SHAP and shares conceptual ground with KernelSHAP; the methods have somewhat different operational properties.
The LIME algorithm
The basic procedure: given a prediction to explain, sample perturbations of the input; predict the perturbations with the black-box model; weight the perturbations by their similarity to the original input; fit a sparse linear model (typically Lasso) that predicts the black-box output from the perturbations. The coefficients of the linear model are the LIME explanation. The methodology is model-agnostic (works for any black-box model that produces predictions on inputs), local (the linear model only fits well near the explained input), and produces sparse explanations (the L1 penalty in Lasso ensures only a few features get non-zero coefficients).
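A from-scratch sketch of those four steps for a tabular binary classifier with a predict_proba interface; the marginal-sampling perturbation, RBF similarity kernel, and Lasso penalty are illustrative choices rather than the reference implementation.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lime_explain(predict_proba, x, X_background, n_samples=2000,
                 kernel_width=0.75, alpha=0.01, seed=0):
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    # 1. Perturb: resample each feature independently from its marginal
    #    distribution in the background data.
    idx = rng.integers(0, len(X_background), size=(n_samples, d))
    Z = X_background[idx, np.arange(d)]
    Z[0] = x                                   # keep the original point in the sample
    # 2. Predict the perturbations with the black-box model (positive-class probability).
    y = predict_proba(Z)[:, 1]
    # 3. Weight samples by similarity to the original input.
    scale = X_background.std(axis=0) + 1e-12
    dist = np.linalg.norm((Z - x) / scale, axis=1)
    weights = np.exp(-(dist ** 2) / (kernel_width ** 2 * d))
    # 4. Fit a sparse linear surrogate; its coefficients are the explanation.
    surrogate = Lasso(alpha=alpha)
    surrogate.fit(Z, y, sample_weight=weights)
    return surrogate.coef_
```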
Perturbation strategies
The key methodological choice in LIME is how to perturb inputs. Tabular data: perturb by sampling from the marginal distribution of each feature, typically with binarisation (each feature becomes "is feature X above the median?"). Text: remove or replace words randomly; the explanation tells you which words matter. Images: divide into superpixels and randomly turn them on/off; the explanation tells you which image regions matter. Each perturbation strategy has its quirks; the method has been extended for various data types over 2016–2024.
The interpretable-surrogate idea
The conceptual insight underlying LIME (and many subsequent methods): you don't need to interpret the black-box model itself — you can interpret a surrogate that approximates it. The black-box model is too complex to interpret directly, but a sparse linear model is interpretable, and if the sparse linear model predicts well on perturbations near the input, it captures the relevant local behaviour. The surrogate's coefficients are then the explanation. The methodology generalises beyond LIME: any local approximation method (decision trees, simple rules) can play the surrogate role.
The stability problem
LIME has a known weakness: explanations can be unstable. Running LIME twice on the same prediction can produce noticeably different explanations because of the random sampling. The 2018 "Stability of Local Explanations" paper documented this; subsequent work proposed mitigations (more samples, better sampling, deterministic variants). The instability is operationally important: a customer-facing explanation that changes when you re-run it is not a trustworthy explanation. Modern LIME implementations (the LIME library, Captum's LIME) mitigate but don't eliminate the issue.
LIME vs SHAP comparison
LIME and SHAP are often presented as competitors; the comparison is informative. Faithfulness: SHAP is grounded in Shapley-value theory with axioms; LIME's coefficients are an approximation without similar guarantees. Computation: LIME is simpler; KernelSHAP (the closest SHAP variant for arbitrary models) uses similar perturbation strategies but with weighting that better matches Shapley values. Stability: SHAP is more stable than LIME because the weighting and aggregation are theoretically motivated. Interpretability: LIME's coefficients are arguably easier to read than SHAP's exact Shapley values for some users. The 2018+ trend has been toward SHAP for tabular data (where TreeSHAP is exact and fast) and LIME or alternatives for harder modalities.
When LIME is the right choice
Despite SHAP's general dominance, LIME is preferred in some contexts. Quick prototyping: LIME's simplicity makes it fast to set up. Image/text where interactions matter less: superpixel- or word-level explanations work fine. Deeply non-linear models where Shapley computation is expensive: LIME's local approximation may suffice. When sparsity matters: LIME's L1 penalty produces explicitly sparse explanations. The method has its place; mature explainability stacks include both SHAP and LIME and choose between them per use case.
Saliency Maps and Gradient-Based Methods
For image and text models, saliency maps highlight which input regions contributed most to the prediction. The simplest approach is the gradient of the prediction with respect to each input pixel (or token); more sophisticated methods (Grad-CAM, integrated gradients, SmoothGrad) refine this. The methodology has been a workhorse since 2014 but has substantial known limitations; mature deployment uses saliency maps with awareness of those limits.
Vanilla gradients
The simplest saliency method: compute the gradient of the predicted class's logit with respect to each input pixel; visualise the gradient magnitudes. The intuition is that pixels with large gradients have more impact on the prediction. The 2013 Simonyan et al. paper "Deep Inside Convolutional Networks" introduced this idea. The method is fast (single backward pass) and easy to implement, but produces noisy maps that are often hard to interpret. The 2014–2024 work has produced many extensions that improve quality.
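A minimal PyTorch sketch of vanilla gradient saliency for an image classifier; the model and input are assumed, and collapsing channels by the maximum absolute gradient is one common convention among several.

```python
import torch

def gradient_saliency(model, image, target_class):
    """image: tensor of shape (1, C, H, W); returns an (H, W) saliency map."""
    model.eval()
    image = image.clone().requires_grad_(True)
    logits = model(image)
    # Gradient of the target class's logit with respect to every input pixel.
    logits[0, target_class].backward()
    return image.grad.abs().max(dim=1).values.squeeze(0)
```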
Grad-CAM
Grad-CAM (Gradient-weighted Class Activation Mapping, Selvaraju et al. 2017) is the most-deployed saliency method for CNNs. The procedure: compute the gradient of the predicted class with respect to the activations of the last convolutional layer; weight the activation maps by these gradients; sum to produce a coarse heatmap. Grad-CAM produces lower-resolution maps than pixel-level methods but they're substantially less noisy and more interpretable. The methodology has substantial track record; it's the default saliency method for CNN-based image classification in production.
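A short sketch using Captum's Grad-CAM implementation, assuming a torchvision ResNet whose last convolutional block is model.layer4; the layer choice, input, and weights handling are illustrative, and in practice you would load pretrained weights and a real preprocessed image.

```python
import torch
from captum.attr import LayerGradCam, LayerAttribution
from torchvision.models import resnet50

model = resnet50(weights=None).eval()        # use pretrained weights in practice
gradcam = LayerGradCam(model, model.layer4)  # attribute at the last conv block

image = torch.randn(1, 3, 224, 224)          # stand-in for a preprocessed image
target = int(model(image).argmax(dim=1))

coarse = gradcam.attribute(image, target=target)            # low-resolution heatmap
heatmap = LayerAttribution.interpolate(coarse, (224, 224))  # upsample to input size
```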
Integrated gradients
Integrated gradients (Sundararajan et al. 2017) is a more principled gradient-based method. The idea: compute the path integral of gradients from a baseline to the input. Mathematically, this satisfies several useful axioms (sensitivity, implementation invariance, completeness). Integrated gradients is the gradient-based method that's most directly comparable to SHAP — it has axiomatic justifications and produces additive attributions. The method is more computationally expensive than vanilla gradients (requires multiple forward-backward passes along the path) but produces less-noisy, more-faithful attributions.
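A from-scratch Riemann-sum sketch of the path integral for a PyTorch classifier (Captum's IntegratedGradients is the production route); the step count and single-example batching are illustrative assumptions.

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """x, baseline: tensors of shape (1, ...); returns per-feature attributions."""
    # Interpolate along the straight line from the baseline to the input.
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, *([1] * (x.dim() - 1)))
    path = (baseline + alphas * (x - baseline)).requires_grad_(True)
    logits = model(path)
    logits[:, target_class].sum().backward()
    avg_grad = path.grad.mean(dim=0, keepdim=True)   # average gradient along the path
    # Completeness (approximately): attributions sum to f(x) - f(baseline).
    return (x - baseline) * avg_grad
```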
SmoothGrad
An empirical observation: gradients are noisy because of the local non-linearity of neural networks. SmoothGrad (Smilkov et al. 2017) reduces noise by averaging gradients over many noisy versions of the input. The procedure: add Gaussian noise to the input multiple times, compute gradients for each, average them. The result is a smoother, more interpretable saliency map. SmoothGrad is often combined with other methods (SmoothGrad + integrated gradients, etc.) for production-quality saliency.
The Sanity Checks problem
The 2018 "Sanity Checks for Saliency Maps" paper (Adebayo et al.) was a watershed: it documented that several popular saliency methods produce visually similar maps for trained models and randomly-initialised models — meaning the maps don't actually depend on what the model has learned. The implication: saliency maps that look reasonable may not actually reflect model behaviour. Some methods (vanilla gradients, integrated gradients) survived the sanity checks; others (guided backpropagation, deconvolution) failed. The lesson: production saliency must use methods that pass sanity checks.
Saliency for transformers
For modern transformer-based models, saliency methods adapt the gradient-based ideas. Attention visualisation: directly visualise which tokens the attention heads attend to (covered in Section 7). Token-level integrated gradients: apply integrated gradients to token embeddings rather than pixels. Attention rollout: aggregate attention across layers to produce a single attention map. Each method has its place; for LLM-based applications, the attention-visualisation methods are most often deployed despite their limitations.
Counterfactual Explanations
Counterfactual explanations answer a different question than feature attribution: instead of "which features mattered?", they ask "what would have to change to get a different prediction?" The answer is often more directly useful for affected users — a loan applicant cares more about "what would have made this approved" than about "which factors contributed how much." Counterfactual methods have been growing in production use since 2017 and are increasingly required by regulation.
The basic counterfactual
The methodology: given an input x with prediction y, find an input x' close to x with prediction y' ≠ y. The "closeness" is measured by some distance metric (typically L1 distance over feature changes); the goal is to find the smallest changes that flip the prediction. Wachter et al. (2017) introduced this formulation as the "counterfactual explanation" approach. The output is intuitive: a list of changes that would have produced a different decision (e.g., "increase your annual income by $5,000 and your credit score by 20 points to get approved").
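A minimal sketch of the Wachter-style search for a differentiable classifier: gradient descent on a loss that trades off reaching the desired class against an L1 distance penalty that keeps the counterfactual close and the feature changes sparse. The model, input, and hyperparameters are assumed, and real deployments add feasibility constraints (discussed below).

```python
import torch

def wachter_counterfactual(model, x, desired_class, lam=0.1, steps=500, lr=0.01):
    """x: tensor of shape (1, n_features); returns a nearby counterfactual input."""
    x_cf = x.clone().requires_grad_(True)
    optimiser = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        optimiser.zero_grad()
        prob = torch.softmax(model(x_cf), dim=-1)[0, desired_class]
        # Push the prediction toward the desired class while penalising
        # the L1 distance from the original input.
        loss = (1.0 - prob) ** 2 + lam * (x_cf - x).abs().sum()
        loss.backward()
        optimiser.step()
    return x_cf.detach()
```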
Diverse counterfactuals (DiCE)
A single counterfactual is often unsatisfying — the customer might not be able to change exactly those features. DiCE (Diverse Counterfactual Explanations, Mothilal et al. 2020) generates a diverse set of counterfactuals showing different paths to a different prediction. The user can then choose the path most actionable for them. The DiCE library is the dominant production tool for counterfactual generation; it targets tabular data and supports scikit-learn, TensorFlow, and PyTorch models.
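A hedged sketch of the dice_ml API applied to the loan example, assuming a pandas DataFrame df containing an "approved" outcome column and a fitted scikit-learn classifier clf; the column names and mutable-feature list are illustrative.

```python
import dice_ml

# df: training DataFrame including the outcome column; clf: fitted sklearn model.
data = dice_ml.Data(dataframe=df,
                    continuous_features=["income", "credit_utilisation", "age"],
                    outcome_name="approved")
model = dice_ml.Model(model=clf, backend="sklearn")
explainer = dice_ml.Dice(data, model, method="random")

# Several diverse paths to approval, allowing only actionable features to change.
result = explainer.generate_counterfactuals(
    query_instances=df.drop(columns="approved").iloc[[0]],
    total_CFs=4,
    desired_class="opposite",
    features_to_vary=["income", "credit_utilisation"])
result.visualize_as_dataframe(show_only_changes=True)
```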
Feasibility constraints
Naive counterfactual generation can produce changes that aren't feasible — "decrease your age by 5 years" or "change your gender" are not actionable suggestions. Mature counterfactual methods incorporate feasibility constraints: which features are mutable (income, credit utilisation), which are immutable (age, gender), which are mutable but require time (employment history). The 2020–2024 work on feasible counterfactuals has produced sophisticated handling of these constraints; production tools support feature-mutability specifications.
Causal counterfactuals
An important refinement: standard counterfactuals are model-counterfactuals (what input changes would change the model's prediction), not real-world counterfactuals (what real-world changes would change the outcome). The two can differ if the model uses correlation rather than causation. Causal counterfactuals use causal models to generate explanations grounded in real-world cause-effect relationships. The methodology is more demanding (requires a causal model of the domain) but produces explanations that are correctly actionable. The 2021–2024 work on causal counterfactuals has been advancing; production deployment is rare but growing.
The right-to-explanation context
Counterfactual explanations have been particularly important for the regulatory "right to explanation" debate. The EU GDPR's Article 22 has been interpreted as requiring "meaningful information about the logic involved" in automated decisions; counterfactuals are arguably the most meaningful form. The EU AI Act (Section 9) extends this. For affected users, counterfactual explanations are often more useful than feature attributions — they tell you what to do, not just what mattered. Production deployments in regulated industries increasingly include counterfactuals.
Computational considerations
Counterfactual generation is more computationally expensive than feature attribution. The methodology requires searching the input space for points with different predictions, often via constrained optimisation. For tabular data with a small number of features, this is fast; for high-dimensional inputs (images, text), it's expensive. The 2024–2026 work on efficient counterfactual generation has produced approximate methods that scale better; the trade-offs are between explanation quality and generation speed.
Explanations for LLMs and Generative Models
The classical explainability literature focused on classifiers; LLMs and generative models present new challenges. The output is text, not a class. The input is text, with sequence and attention structure. The "decision" being explained is multi-token rather than single-step. The methodology is less mature than for classifiers; this section covers what works, what doesn't, and what's actively developing.
Attention visualisation
For transformer-based models, the most-popular interpretability tool is attention visualisation: show which tokens each attention head attends to during processing. The methodology produces visually striking output and is widely used in research papers and production interfaces. The limitation: attention is not explanation — Jain & Wallace 2019 demonstrated that attention weights can be substantially modified without changing model outputs, suggesting attention isn't reliably indicative of which tokens "matter" for the prediction. The methodology has its place but should be used with awareness of this limit.
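A short sketch of pulling attention weights out of a Hugging Face transformer for visualisation; the model name is illustrative, and per the caveat above the resulting maps should be treated as descriptive rather than explanatory.

```python
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tokenizer("The loan was denied due to insufficient income.",
                   return_tensors="pt")
outputs = model(**inputs)

# outputs.attentions: one tensor per layer, each (batch, heads, seq_len, seq_len).
# Averaging over heads gives a simple per-layer token-to-token map.
last_layer_map = outputs.attentions[-1].mean(dim=1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
```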
Prompt attribution
An LLM-specific question: which parts of a long prompt contributed to a specific output? Methods include leave-one-out (remove each prompt segment and measure output change), gradient-based attribution on token embeddings, and influence-based methods. The 2023–2026 work on prompt attribution has been productive; production tooling (LangSmith's trace analysis, the various 2024 entrants) increasingly includes prompt-attribution capabilities for debugging LLM applications.
Chain-of-thought as explanation
Modern reasoning LLMs (o1-class models) produce extended reasoning traces before final outputs. The natural-language reasoning is a form of explanation by design: the model explains its own reasoning. The 2024–2026 work has documented both the promise (reasoning traces really do explain output behaviour, when faithful) and the limits (the reasoning trace doesn't always reflect the actual computation; Turpin et al. 2023 showed that LLMs sometimes produce post-hoc rationalisations that don't match their actual reasoning). Production deployments increasingly use reasoning traces as explanations; the faithfulness question remains active research.
Citation and grounding
For RAG and document-grounded applications, citations are a form of explanation — show which documents the answer came from. Methods include retrieval-time logging (which documents were retrieved), generation-time attention analysis (which documents the LLM attended to during generation), and explicit citation generation (training the model to produce citations). The 2024–2026 work on grounded answer generation (Perplexity-style citations, the various RAG-with-citations products) has substantially advanced; production deployment is mature for many use cases.
Influence functions for LLMs
Influence functions (Koh & Liang 2017, originally for classical ML) measure how much each training example contributed to a specific prediction. For LLMs, the methodology is computationally challenging but has been advanced (Grosse et al. 2023, Anthropic). Influence-based explanations answer "which training documents shaped this output?" — useful for debugging, copyright analysis, and detecting memorisation. The methodology is still expensive enough that production deployment is limited; research-stage tools are increasingly available.
The fundamental limit
For LLMs and generative models generally, faithful explainability is harder than for classifiers. The model's behaviour emerges from billions of parameters interacting in complex ways; reducing this to a per-token attribution loses substantial information. The 2024–2026 honest assessment: LLM explainability methods provide useful but imperfect signals; full faithful explanation of LLM outputs is more aspirational than achievable with current methodology. Mechanistic interpretability (Ch 04) is the longer-term path; practical explainability (this chapter) provides operational tooling for the meantime.
Evaluating Explanations: Faithfulness and Trustworthiness
An explanation that's plausible but wrong is worse than no explanation — it provides false confidence. Evaluating whether explanations are faithful (accurately reflect the model) and trustworthy (genuinely useful for the audience) is its own discipline. This section covers the evaluation methodology that distinguishes meaningful explainability claims from misleading ones.
Faithfulness metrics
Several methods evaluate whether an explanation reflects model behaviour. Deletion: remove the most-important features (according to the explanation) and check that the prediction changes substantially. Insertion: starting from a baseline, add features in order of importance and check that the prediction approaches the original. Perturbation: vary the most-important features and check that predictions vary correspondingly. The metrics produce quantitative faithfulness scores; methods can be ranked by these scores, providing some empirical traction on which methods are more faithful for which contexts.
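A minimal sketch of the deletion metric for a tabular model: replace features with a baseline value in order of attributed importance and track how quickly the predicted probability of the original class falls; the faster the drop, the more faithful the ranking. The function names and per-feature baseline are assumptions.

```python
import numpy as np

def deletion_curve(predict_proba, x, attributions, baseline, target_class):
    """Returns predicted probabilities as features are removed most-important-first."""
    order = np.argsort(-np.abs(attributions))
    x_cur = x.astype(float)
    scores = [predict_proba(x_cur[None])[0, target_class]]
    for j in order:
        x_cur[j] = baseline[j]                 # "remove" the feature
        scores.append(predict_proba(x_cur[None])[0, target_class])
    # The area under this curve (lower is better) summarises faithfulness.
    return np.array(scores)
```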
Stability and robustness
An explanation should be stable: similar inputs should produce similar explanations. Stability metrics: compute explanations for slight perturbations of the input and measure how much they change. Robustness metrics: compute explanations under various forms of model uncertainty (different random seeds, different baselines) and measure consistency. LIME has known instability problems; SHAP is more stable for tree models; saliency methods vary substantially. Production explainability should test for stability before deploying.
Sanity checks
Beyond faithfulness, several sanity checks verify that explanations actually depend on the model. The 2018 Adebayo et al. paper (mentioned in Section 5) introduced the model parameter randomisation test: replace the trained model with a randomly-initialised model and check that the explanation changes substantially. If it doesn't, the explanation isn't really about the trained model. The data randomisation test: train the model on randomly-permuted labels and check that the explanation changes. Methods that fail sanity checks shouldn't be trusted; the literature has documented several methods that fail.
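A minimal sketch of the parameter-randomisation test for a PyTorch model and any attribution function with the signature explain_fn(model, x, target): re-initialising the weights should change the explanation substantially, so a high rank correlation with the original explanation is a failure signal. All names are illustrative.

```python
import copy
import numpy as np
from scipy.stats import spearmanr

def parameter_randomisation_check(model, explain_fn, x, target_class):
    original = np.asarray(explain_fn(model, x, target_class)).ravel()

    randomised = copy.deepcopy(model)
    for module in randomised.modules():
        if hasattr(module, "reset_parameters"):
            module.reset_parameters()          # discard the learned weights

    shuffled = np.asarray(explain_fn(randomised, x, target_class)).ravel()
    rho, _ = spearmanr(original, shuffled)
    # High correlation: the explanation barely depends on what the model
    # learned, so the attribution method fails this sanity check.
    return rho
```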
Human studies
Beyond mathematical metrics, human evaluation studies measure whether explanations actually help humans. Forward simulation: given an explanation, can a human predict what the model would output on a new input? Counterfactual identification: can a human use the explanation to identify which features need to change for a different prediction? Trust calibration: do explanations help humans correctly calibrate their trust in the model? The 2018–2024 human-study literature has produced mixed results: explanations sometimes help, sometimes hurt, sometimes don't matter; the effects depend on the specific method, audience, and task.
Adversarial explanations
A concerning finding: explanations themselves can be adversarially manipulated. Slack et al. 2020 demonstrated that LIME and SHAP could be fooled — a model deliberately designed to behave badly while producing benign-looking explanations. Heo et al. 2019 showed that explanations can be modified through training-time interventions without changing the underlying model behaviour. The implications: explanations alone shouldn't be used as evidence of model fairness or correctness; they need to be combined with direct testing.
The pragmatic disclaimer
Practical explainability is not a fully-solved problem. Methods have known limitations; faithful explanation is hard; even popular methods have been documented as misleading in specific cases. Mature deployment includes explicit disclaimers: "this is the model's most-influential features as measured by SHAP, with the caveat that..." Honest production explainability builds trust through transparency about what the methods can and can't do; pretending the methods are more trustworthy than they are erodes long-term trust.
Regulatory and Stakeholder Contexts
Beyond the technical methodology, explainability operates in specific regulatory and stakeholder contexts. GDPR imposes "right to explanation" requirements for automated decisions affecting EU residents. EU AI Act mandates transparency for high-risk AI systems. Adverse-action notices in US lending require specific explanations to denied applicants. FDA AI/ML guidance shapes medical-AI explainability. Each context has specific requirements that production deployment must meet.
GDPR Article 22
The EU General Data Protection Regulation (GDPR, in force 2018) Article 22 limits automated decision-making and includes a "right to obtain meaningful information about the logic involved." The legal interpretation has been contested — what does "meaningful" require? — but production deployments serving EU residents must provide some form of explanation. Counterfactual explanations have been particularly favoured because they're directly meaningful to affected individuals. The 2018–2024 case law has gradually clarified the requirements; the practical answer is "provide the most-meaningful explanation you can given the model and the audience."
EU AI Act transparency requirements
The EU AI Act (full enforcement 2026) imposes additional transparency requirements on high-risk AI systems. Article 13 requires that systems be designed for transparency such that operators can interpret their outputs; Article 86 provides individuals affected by AI-driven decisions a right to explanation. The regulatory implementation is still being clarified through implementing acts and case law. Production deployments must provide both system-level documentation (model cards, technical documentation) and decision-level explanations (per-decision attributions or counterfactuals).
US adverse-action requirements
In the US, the Equal Credit Opportunity Act (ECOA) and Fair Credit Reporting Act (FCRA) require that adverse credit decisions be accompanied by specific reasons. The "specific reasons" requirement is fairly literal: regulators expect lists of factors (e.g., "high debt-to-income ratio", "limited credit history"). Modern ML-based credit models use SHAP attributions or feature-rank-style explanations to satisfy these requirements. The 2023 CFPB guidance has clarified that complex ML models can be used if explanations are provided; the methodology has matured into standard practice.
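A minimal sketch of turning one row of SHAP attributions into the ranked reason codes an adverse-action notice needs, assuming a model where higher output means approval; the feature-to-reason-text mapping is an illustrative assumption that in practice would be maintained with compliance review.

```python
# Customer-facing reason text per feature (illustrative mapping).
REASON_TEXT = {
    "debt_to_income": "High debt-to-income ratio",
    "credit_history_months": "Limited length of credit history",
    "recent_inquiries": "Number of recent credit inquiries",
}

def adverse_action_reasons(shap_row, feature_names, top_k=4):
    # For an approval-positive model, negative attributions pushed the
    # applicant toward denial; rank by how strongly each feature hurt them.
    ranked = sorted(zip(feature_names, shap_row), key=lambda fv: fv[1])
    return [REASON_TEXT.get(name, name)
            for name, value in ranked[:top_k] if value < 0]
```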
Medical AI explanations
For healthcare applications, the FDA's Software as a Medical Device (SaMD) guidance and the broader medical-AI regulatory framework increasingly require explanations. The methodology differs from financial: medical practitioners need to understand the model's basis for diagnosis, not just receive feature lists. Visual saliency (showing where the model "looked" in a radiology image) is common; combined with structured natural-language explanations, this provides what physicians need. The 2024–2026 work on medical-AI explainability has been substantial; the FDA's AI/ML pathway provides a regulatory-acceptance route.
Bias auditing and fairness
Explanations are also used for bias auditing — verifying that models don't make decisions based on protected attributes or their proxies. The methodology: aggregate explanations across decisions, look for systematic differences between demographic groups, identify proxies for protected attributes that may be driving disparate impact. The 2018–2024 work on auditing-via-explanations has produced substantial methodology; production fairness audits use these tools (combined with direct testing). Ch 06 (Fairness, Bias & Equity) develops this in detail.
Documentation as explanation
Beyond per-decision explanations, model documentation — model cards, datasheets, system cards — provides system-level explanations. These describe what the model is, what data it was trained on, what its known limitations are, and how it behaves. The EU AI Act explicitly requires this documentation for high-risk systems; major AI labs publish system cards as standard practice. The methodology has matured (Mitchell et al. model cards from Ch 01 are the dominant format); the operational discipline of producing and maintaining these documents is its own engineering practice.
The Frontier and the Open Problems
Practical explainability is mature for tabular ML and reasonably-mature for image classification; for LLMs and agent systems, the methodology is rapidly evolving but less developed. Several frontiers are active: explanation reliability under adversarial conditions, faithful explanation for chain-of-thought reasoning, agent-trace analysis, the integration of mechanistic insights with practical methods. This section traces the open frontiers and the directions the field is moving in.
Explanation reliability and certified explanations
The 2018–2024 literature has documented many cases where explanations are misleading. The 2024–2026 frontier includes certified explanations: methods that provide formal guarantees about what the explanation captures. The methodology builds on certified-robustness ideas (Ch 03) but applied to explanations rather than predictions. Whether certified explanations become operationally meaningful or remain a research direction is open; the trajectory is encouraging.
LLM and agent explainability
The methodology for LLM and agent explainability is genuinely emerging. Open questions: how do we faithfully explain extended-reasoning outputs? How do we explain agent trajectories that span many tool calls? How do we connect agent behaviour to specific underlying model features? The 2024–2026 work on these questions has been productive but the methodology is far from mature. Mechanistic interpretability (Ch 04) provides one path; classical-explainability adaptations provide another; the synthesis is active research.
Faithful chain-of-thought
A specific concerning finding: LLM chain-of-thought reasoning is not always faithful. The model's natural-language reasoning trace doesn't always match its actual computational path; the methodology of producing genuinely-faithful reasoning is active research. Process supervision (Ch 02 §3) helps by training the model to follow reasoning processes that humans can verify; consistency checking looks for when reasoning and final answer disagree; interpretability-based verification uses internal model state to check that the stated reasoning matches the actual computation. The trajectory is encouraging but the problem isn't solved.
Mechanistic-meets-practical synthesis
Mechanistic interpretability (Ch 04) and practical explainability (this chapter) have largely been separate research directions. The 2024–2026 trend is increasing synthesis: SAE features (Ch 04 §4) can produce interpretable explanations of LLM outputs; circuit analysis can explain why specific predictions were made; the methods are starting to inform each other. The 2025–2026 work on combining mechanistic insights with practical explanation pipelines has been productive; deployment of mechanistically-grounded explanations is starting.
Evaluation infrastructure
The explainability evaluation infrastructure (faithfulness metrics, sanity checks, human-study methodology) has matured but remains fragmented. The 2024–2026 work on standardising evaluation (the various benchmark datasets, the open-source explainability libraries with built-in evaluation) is starting to produce shared infrastructure. Whether the field converges on standard evaluation methodology — analogous to RobustBench for robustness — is open and would be substantial progress if it happens.
What this chapter has not covered
The remainder of Part XVIII develops adjacent topics. Fairness, bias, and equity — the discipline of ensuring models don't discriminate in problematic ways, which uses explainability tools — is Ch 06. Privacy in ML — the discipline of protecting training data and individual privacy — is Ch 07. AI governance, policy, and regulation — the broader policy framework that increasingly demands explanations — is Ch 08. The chapter focused on the practical-explainability tooling; the broader ecosystem connects to mechanistic interpretability (Ch 04), fairness (Ch 06), and governance (Ch 08).
Further reading
Foundational papers and references for practical explainability. Lundberg & Lee on SHAP; Ribeiro et al. on LIME; Sundararajan et al. on integrated gradients; Selvaraju et al. on Grad-CAM; Wachter et al. on counterfactual explanations; Adebayo et al. sanity checks; the production-quality explainability libraries (SHAP, Captum, DiCE); and the contemporary regulatory guidance form the right starting kit.
- A Unified Approach to Interpreting Model Predictions (SHAP). The foundational SHAP paper. Establishes the unified Shapley-value framework that connects many earlier explanation methods. Foundational for modern feature-attribution practice. Required reading. The reference for SHAP.
- "Why Should I Trust You?": Explaining the Predictions of Any Classifier (LIME). The foundational LIME paper. The first widely-adopted post-hoc explainability method for black-box models. Foundational for the local-surrogate methodology. Required reading. The LIME reference.
- Axiomatic Attribution for Deep Networks (Integrated Gradients). The integrated-gradients paper. Establishes the axiomatic foundation for gradient-based attribution; the most-principled gradient-based method. Required reading for gradient-based explainability. The integrated-gradients reference.
- Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization. The Grad-CAM paper. The most-deployed saliency method for CNN-based image classification. Required reading for image-explainability work. The Grad-CAM reference.
- Counterfactual Explanations Without Opening the Black Box. The foundational counterfactual-explanations paper. Establishes the methodology and connects it to the GDPR right-to-explanation context. Required reading for counterfactual work. The counterfactual-explanations reference.
- Explaining Machine Learning Classifiers through Diverse Counterfactual Explanations (DiCE). The DiCE paper. Establishes the methodology for generating diverse counterfactuals, with the open-source DiCE library that's become the dominant production tool. Required reading. The DiCE / diverse-counterfactuals reference.
- Sanity Checks for Saliency Maps. The sanity-checks paper. Documents that several popular saliency methods don't actually depend on the trained model. Foundational for the methodology of evaluating explainability methods rigorously. Required reading. The reference for explainability sanity checks.
- Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. The influential argument against post-hoc explanation in high-stakes contexts. Argues that interpretable models are often as accurate and that post-hoc explanations of black boxes are unreliable. Required reading for the broader debate; informs how to think about the choice between interpretable models and explained black boxes. The reference for the interpretable-models argument.
- Attention is Not Explanation. The attention-is-not-explanation paper. Demonstrates that attention weights can be substantially modified without changing model outputs, undermining the interpretation of attention as explanation. Required reading for honest engagement with attention-based explanations. The reference for attention-explanation limitations.
- Language Models Don't Always Say What They Think (Faithful Chain-of-Thought). The faithfulness-of-chain-of-thought paper. Documents that LLM chain-of-thought reasoning doesn't always reflect actual computation. Foundational for understanding the limits of reasoning-trace explanations. The reference for CoT-explanation limitations.
- SHAP / Captum / DiCE / LIME — Production Libraries. The dominant production explainability libraries. SHAP is the workhorse for tabular data. Captum (PyTorch) provides comprehensive deep-learning explanation methods. DiCE is the standard counterfactual library. LIME is the original LIME implementation. Required reading for production deployment. The reference for production explainability tooling.
- Interpretable Machine Learning. The most-cited online resource for interpretable and explainable ML. Comprehensive treatment of all the major methods, with code examples and practical guidance. Recommended starting reading for practitioners new to the field. The recommended overview textbook.