Federated Learning & Privacy-Preserving ML: training without seeing the data.
Standard machine learning collects data into a central place and trains a model on it; federated learning keeps the data where it is — on phones, hospitals, banks, edge devices — and trains the model by exchanging only updates. The promise is privacy by architecture: no central data store to leak. The challenge is technical and adversarial: federated updates can themselves leak information, the participating clients are heterogeneous and unreliable, and a determined attacker can reconstruct training data from gradients without ever seeing them. This chapter develops the federated-averaging framework, the privacy machinery (differential privacy, secure aggregation, homomorphic encryption) that makes the protocol genuinely private, the personalisation methods that fight client heterogeneity, and the deployment patterns where federated learning is now the only legally viable approach.
Prerequisites & orientation
This chapter assumes neural-network fundamentals (Part V Ch 01–02), basic optimisation (Part I Ch 03), and the distributed-computing material of Part III Ch 05. Cryptography background is helpful for the secure-aggregation and homomorphic-encryption sections but not assumed — the chapter develops what is needed. The continual-learning framing of Ch 09 is a useful complement when federated systems must adapt to drift, but the methods are largely orthogonal: federated learning is about where data lives during training, continual learning is about when training happens.
Two threads run through the chapter. The first is data heterogeneity: federated clients have non-IID data, different feature distributions, and different label distributions, and most of the engineering difficulty in federated learning comes from making the training stable and the model useful in the face of this heterogeneity. The second is privacy at multiple levels: data never leaving the device is a weak privacy guarantee on its own, because gradients leak information; differential privacy, secure aggregation, and trusted execution environments are the tools that turn "data stays local" into a meaningful privacy property. The chapter is organised so the foundational protocol comes first, then the privacy machinery, then the failure modes and applications.
Why Federated Learning Exists
For most of machine learning's history, training data was understood as something you collect, centralise, and run gradient descent on. Federated learning starts from the recognition that this central-collection assumption is increasingly impossible — the data lives on user devices that never share, in hospitals that legally cannot share, on phones whose owners would object to sharing. The chapter exists because a broad and growing class of valuable problems can only be solved by training models on data that never moves from where it was generated.
The data-collection assumption breaks
Three forces have made central data collection untenable for many problems. First, regulatory: GDPR, HIPAA, the EU AI Act, China's Personal Information Protection Law, and dozens of sectoral regulations restrict cross-border or cross-organisation data movement. Second, contractual: enterprise customers increasingly require data-residency and no-export guarantees as a condition of doing business. Third, practical: the largest datasets in the world live on the world's smartphones, and uploading them to a central server would be prohibitive in bandwidth, energy, and user trust.
Federated learning is the engineering response. Instead of moving data to a central trainer, move the model to where the data lives, train locally for a few steps, and aggregate the resulting updates centrally. The data never leaves the device or the silo; only model parameters do.
Cross-device versus cross-silo
The federated-learning literature distinguishes two regimes that share the framework but differ dramatically in their constraints. Cross-device federated learning involves millions of unreliable participants — phones, smart speakers, IoT devices — each with a tiny local dataset, often offline, with constrained battery and bandwidth. The Google Gboard keyboard-prediction system is the canonical example: hundreds of millions of phones each contributing tiny updates from their local typing history. Cross-silo federated learning involves dozens to hundreds of organisational participants — hospitals, banks, branches of a multinational — each with sizeable local datasets, reliable connectivity, and strong organisational identity. Hospital consortia training jointly on patient data are the canonical example.
The two regimes often appear together in the literature but have different methodological needs. Cross-device requires aggressive client sampling, robustness to dropouts, on-device compute constraints, and privacy at scale. Cross-silo requires careful per-organisation accounting, strong trust models, and often regulatory-grade privacy guarantees. Most of the chapter's machinery applies to both, but the design choices and trade-offs differ.
The privacy hierarchy
"Data stays on the device" is the weakest possible privacy guarantee. The model updates that leave the device — gradients, weight differences, statistics — themselves leak information about the underlying data. Section 7 covers gradient inversion attacks that reconstruct training images from gradients alone, demonstrating that a federated protocol without additional privacy machinery is closer to a bandwidth optimisation than to a privacy guarantee.
Modern federated systems layer additional defences: differential privacy (Section 4) adds calibrated noise to updates with a mathematical privacy guarantee; secure aggregation (Section 5) uses cryptography so the server learns only the aggregate, not individual updates; trusted execution environments (Section 6) put computation inside hardware enclaves that the operator cannot inspect. The serious deployments combine multiple layers, and Section 7 explains why each layer alone is insufficient.
Federated learning has gone from a research curiosity to a regulatory requirement in several domains. Medical-imaging consortia, multi-bank fraud-detection systems, and on-device language modelling on phones are now federated by default. The 2024 EU AI Act explicitly recognises federated learning as a means of compliance for high-risk training applications, and the US FDA has begun accepting federated training in pre-market regulatory submissions for medical-device AI. The chapter covers what the field has converged on as the standard methods, with attention to the failure modes and the open problems that remain.
Federated Averaging and the Core Protocol
The federated-learning protocol that almost every system uses is federated averaging, introduced by McMahan et al. in 2017. It is conceptually a small extension of distributed SGD — clients run SGD locally for several steps and the server averages their resulting models — but the choice to do many local steps before communicating changes the dynamics dramatically. FedAvg is the protocol every federated system starts from, and most production-deployed federated learning is some variant of it.
The FedAvg algorithm
The full FedAvg recipe runs a sequence of communication rounds. Each round, the server selects a fraction of available clients, broadcasts the current global model to them, and waits for them to compute and return updated models. The server averages the received updates (weighted by local-data size) to form the next global model:
Each client k: starting from the global model θ_t, train for E local epochs on the local dataset D_k and return θ_{k,t+1}
Server: θ_{t+1} = Σ_k (n_k / n) · θ_{k,t+1},  where n_k = |D_k| and n = Σ_k n_k
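To make the loop concrete, here is a minimal sketch of one FedAvg round on a toy least-squares problem. Everything is illustrative: the function names and the local objective are stand-ins, not the API of any framework, and a real deployment would use Flower or TensorFlow Federated rather than hand-rolled NumPy.

```python
import numpy as np

def local_train(w_global, X, y, epochs=5, lr=0.1):
    """Stand-in for client-side training: E epochs of full-batch gradient
    descent on a local least-squares objective, starting from the global model."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)   # gradient of 0.5*||Xw - y||^2 / n_k
        w -= lr * grad
    return w

def fedavg_round(w_global, clients):
    """One FedAvg round: broadcast w_global, train locally, average weighted by n_k."""
    updates = [local_train(w_global, X, y) for X, y in clients]
    sizes = np.array([len(y) for _, y in clients], dtype=float)   # n_k
    weights = sizes / sizes.sum()                                  # n_k / n
    # theta_{t+1} = sum_k (n_k / n) * theta_{k,t+1}
    return sum(p * w for p, w in zip(weights, updates))

# Toy usage: three clients with different local dataset sizes.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for n_k in (20, 50, 200):
    X = rng.normal(size=(n_k, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=n_k)))

w = np.zeros(2)
for round_t in range(30):
    w = fedavg_round(w, clients)
print(w)   # approaches true_w as rounds accumulate
```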
Communication is the bottleneck
The reason FedAvg uses many local epochs per round is that communication is the dominant cost in cross-device federated learning. Sending a model to a phone, waiting for it to train, and getting the update back can take seconds; sending hundreds of bytes per training step would be prohibitive. By doing E=5 to 20 local epochs per round, FedAvg reduces communication by 5×–20× compared to naive distributed SGD. The trade-off is that the local updates drift away from each other when client data is heterogeneous, which Section 3 covers.
Several other communication-efficiency techniques layer on top of FedAvg. Compression: clients send quantised or sparsified gradients, reducing bandwidth by 10×–100×. Selective participation: only the clients whose updates would matter most contribute each round. Pre-trained initialisation: starting from a pretrained model rather than random initialisation reduces the rounds needed to convergence by orders of magnitude. The 2024 production deployments use all three.
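As a flavour of the compression idea, a small sketch of uniform 8-bit quantisation of an update before upload (illustrative only; production systems use more elaborate schemes, often combined with error feedback so quantisation error does not accumulate):

```python
import numpy as np

def quantize_update(delta, num_bits=8):
    """Uniformly quantise a model update to num_bits per value for upload.
    Returns integer codes plus the offset/scale needed to dequantise."""
    lo, hi = float(delta.min()), float(delta.max())
    levels = 2 ** num_bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((delta - lo) / scale).astype(np.uint8)
    return codes, lo, scale

def dequantize_update(codes, lo, scale):
    return codes.astype(np.float32) * scale + lo

delta = np.random.default_rng(1).normal(size=10_000).astype(np.float32)
codes, lo, scale = quantize_update(delta)
recovered = dequantize_update(codes, lo, scale)
print(codes.nbytes / delta.nbytes)       # ~0.25: 4x smaller than float32
print(np.abs(recovered - delta).max())   # bounded quantisation error
```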
Client sampling and the cohort problem
Cross-device federated learning typically has more clients available than can be reasonably included in any single round. The standard practice is to sample a small cohort per round — 50 to 1000 clients out of millions — based on availability (the device is plugged in, idle, on Wi-Fi), capability (sufficient local compute), and protocol-specific eligibility (random sub-sampling for privacy reasons). The cohort sampling has serious implications: cohorts are not uniform random samples of the population, so the global model is implicitly trained on a biased subset of clients, and the bias direction depends on which devices are most available.
Server architecture
The server side of a production federated system is more elaborate than the simple "average gradients" image suggests. Real servers handle: client registration and health monitoring, cohort selection (typically randomised to mitigate the eligibility-bias problem), broadcast of the global model, aggregation of returned updates, dropout handling (clients that disconnect mid-round), per-round model checkpointing, and a separate evaluation pipeline. The aggregation logic itself is the simplest part; everything else is operational complexity that determines whether the system works at scale. The TensorFlow Federated and Flower frameworks provide reference architectures for this server logic.
Data Heterogeneity and Client Drift
Vanilla FedAvg works beautifully when client data is independent and identically distributed across clients. It works poorly when it is not — and in practice, federated client data is wildly non-IID. Different users type different words, different hospitals see different patient populations, different banks have different fraud patterns. The handling of this heterogeneity is the central engineering problem that distinguishes a working federated system from a paper.
The non-IID problem
Suppose client A has only photos of cats and client B has only photos of dogs. After E local epochs of FedAvg, A's local model has specialised to cat features and B's to dog features; the simple average of these models produces something that is competent at neither. The pathology is called client drift — local models drift away from each other in directions that the average cannot recover.
The mathematical issue: when clients have different local objectives (because their data is different), the gradient at the global model on client A points toward A's optimum, the gradient on client B points toward B's optimum, and these directions can be at large angles. After many local steps, the local models end up in genuinely different regions of weight space, and averaging them lands you in the middle, possibly a worse place than where you started. The empirical pattern: FedAvg's convergence slows dramatically as data heterogeneity increases, sometimes failing to converge at all.
FedProx: proximal regularisation
The first major fix was FedProx (Li et al. 2018), which adds a proximal term to each client's local loss that pulls it toward the global model:
min_θ  F_k(θ) + (μ/2) · ‖θ − θ_t‖²
where F_k is client k's local loss, θ_t is the current global model, and μ controls how strongly local training is anchored to it.
FedProx is simple to implement (one extra term in the local loss) and consistently improves over FedAvg on non-IID benchmarks. It does not eliminate client drift but bounds it. For most production deployments where data is moderately heterogeneous, FedProx with a tuned μ is the right starting baseline.
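In the toy least-squares setting of the FedAvg sketch above, the FedProx change is one extra term in the local gradient. A sketch, with μ as the tunable proximal strength:

```python
import numpy as np

def local_train_fedprox(w_global, X, y, mu=0.1, epochs=5, lr=0.1):
    """FedProx local step: local-loss gradient plus the proximal pull mu*(w - w_global)
    that keeps the client from drifting too far from the global model."""
    w = w_global.copy()
    for _ in range(epochs):
        grad_local = X.T @ (X @ w - y) / len(y)
        grad_prox = mu * (w - w_global)   # gradient of (mu/2)*||w - w_global||^2
        w -= lr * (grad_local + grad_prox)
    return w
```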
SCAFFOLD: control variates for client drift
A more principled approach is SCAFFOLD (Karimireddy et al. 2020), which uses control variates to correct for client drift directly. Each client maintains an estimate of the difference between its local gradient direction and the global gradient direction; the local update is corrected by subtracting this estimated drift. The result is a federated SGD that mathematically eliminates the bias from client heterogeneity, achieving variance-reduced rates that match centralised SGD.
SCAFFOLD outperforms FedProx on non-IID benchmarks but requires per-client state (each client must store its control variate between rounds, which complicates client churn) and double the communication per round (the control variates must be updated alongside the model). For cross-silo deployments where clients are persistent and bandwidth is plentiful, SCAFFOLD is often the right choice; for cross-device deployments where clients are ephemeral, FedProx is more practical.
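A sketch of the SCAFFOLD client update in the same toy setting, following the paper's "Option II" rule for updating the control variates; the variable names are illustrative:

```python
import numpy as np

def scaffold_client(w_global, c_global, c_client, X, y, steps=25, lr=0.1):
    """SCAFFOLD local update: each gradient step is corrected by (c - c_k),
    the estimated gap between global and local gradient directions."""
    w = w_global.copy()
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)
        w -= lr * (grad - c_client + c_global)
    # "Option II" control-variate update from the paper.
    c_client_new = c_client - c_global + (w_global - w) / (steps * lr)
    delta_w = w - w_global
    delta_c = c_client_new - c_client
    return delta_w, delta_c, c_client_new

# Server side (per round): w_global += lr_server * mean(delta_w over the cohort)
#                          c_global += (cohort_size / num_clients) * mean(delta_c)
```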
Dataset partitioning and benchmarking
Honest evaluation of non-IID methods requires a clear specification of the heterogeneity. The standard partitioning techniques in the literature: label-skewed (each client has a subset of class labels — the most adversarial case), quantity-skewed (clients have wildly different dataset sizes), feature-skewed (clients have different input distributions but similar labels), and concept-shift (the input-label relationship itself differs across clients). Most published methods are evaluated under label-skewed partitions because they produce the largest accuracy gaps between methods; production deployments more commonly face feature-skewed and concept-shift heterogeneity, where the methods' rankings often differ.
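The label-skewed case is usually constructed with a Dirichlet split over classes. A sketch, with α as the heterogeneity knob (small α gives each client only a few classes, large α approaches IID):

```python
import numpy as np

def dirichlet_label_partition(labels, num_clients, alpha=0.5, seed=0):
    """Split example indices across clients with label-skewed heterogeneity.
    Small alpha -> each client sees only a few classes; large alpha -> near-IID."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c that each client receives.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions) * len(idx)).astype(int)[:-1]
        for client, part in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(part.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: 10 clients over a 10-class toy label vector.
labels = np.repeat(np.arange(10), 100)
parts = dirichlet_label_partition(labels, num_clients=10, alpha=0.1)
print([len(p) for p in parts])   # highly uneven class/quantity mix at small alpha
```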
Differential Privacy in Federated Learning
Federated learning's "data stays local" property is a privacy-by-architecture claim, but as Section 7 will show, the model updates leaving each client can themselves leak training data. The standard response is differential privacy, a mathematical framework for quantifying and bounding the information leakage of any computation over data. DP-SGD applied within FedAvg is the canonical privacy-preserving federated protocol and is now a default in serious deployments.
What differential privacy means
Differential privacy (Dwork et al. 2006) is a mathematical guarantee about a randomised algorithm M operating on a dataset D. M is (ε, δ)-differentially private if for any two datasets D and D' differing in one record, and any output set S:
Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D') ∈ S] + δ
The intuition: if the algorithm's behaviour barely changes when one person's record is added or removed, then no observer of the output can be confident about whether that person was in the data. Differential privacy is the strongest standard privacy notion in widespread use, and a substantial mathematical literature has developed around it.
DP-SGD: gradient clipping and noise
The standard mechanism for differentially-private deep learning is DP-SGD (Abadi et al. 2016), which modifies SGD with two changes. First, per-example gradient norms are clipped to a fixed bound C. Second, calibrated Gaussian noise is added to the clipped per-example gradient sum before averaging:
θ_{t+1} = θ_t − η · (1/B) · ( Σ_i g̃_i + 𝒩(0, σ²C²I) )
where g̃_i = g_i · min(1, C / ‖g_i‖) is the clipped gradient of example i and B is the batch size.
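A minimal sketch of this step on the toy least-squares objective used earlier; illustrative only, since real deployments use a library such as Opacus or TensorFlow Privacy that also handles the privacy accounting:

```python
import numpy as np

def dp_sgd_step(w, X, y, lr=0.1, clip_C=1.0, noise_sigma=1.0, rng=None):
    """One DP-SGD step: clip each per-example gradient to norm C,
    add Gaussian noise N(0, sigma^2 C^2 I) to the sum, then average."""
    rng = rng or np.random.default_rng()
    B = len(y)
    residuals = X @ w - y
    per_example_grads = residuals[:, None] * X            # g_i for squared loss
    norms = np.linalg.norm(per_example_grads, axis=1)
    scale = np.minimum(1.0, clip_C / np.maximum(norms, 1e-12))
    clipped = per_example_grads * scale[:, None]           # g~_i
    noise = rng.normal(0.0, noise_sigma * clip_C, size=w.shape)
    noisy_mean = (clipped.sum(axis=0) + noise) / B
    return w - lr * noisy_mean
```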
DP-FedAvg
Combining DP with FedAvg gives DP-FedAvg, the standard recipe for private federated learning. Two design choices matter. First, where the noise is added: local DP has each client add noise to their update before sending; central DP has the server add noise after aggregating. Local DP gives stronger guarantees (each client trusts no one) but requires much more noise per client; central DP gives the same overall guarantee with much less total noise but requires a trusted server. In practice, central DP is the standard choice with secure aggregation (Section 5) used to get equivalent guarantees without the trust assumption.
Second, what counts as "one record." Federated DP can be at the example level (the guarantee bounds the influence of any single training example across the run) or at the user level (it bounds the influence of any single user's entire local dataset). User-level DP is the meaningful guarantee for federated systems but requires more noise. Production deployments at Google's Gboard and Apple use user-level DP with privacy budgets in the ε ≈ 1–5 range per user across the entire training run.
The utility cost
DP is not free. Adding calibrated noise hurts model accuracy, and tighter privacy (smaller ε) hurts more. The empirical pattern: ε ≈ 10 typically costs 1–2% accuracy on mid-scale problems; ε ≈ 1 costs 5–15%. For very large models trained on very large data, the cost is smaller (the noise is averaged over more examples). For small specialised models on small datasets, the cost can be prohibitive. The 2024 production trend is to combine DP with strong pretraining: pretrain on public data without DP, then fine-tune federated with DP. The pretraining absorbs most of the model capacity, so the federated DP fine-tuning has less work to do and the accuracy cost is much smaller.
Secure Aggregation and Cryptographic Protection
Differential privacy bounds what an observer can infer from the aggregate, but the federated server still sees individual client updates if no other protection is used. Secure aggregation uses cryptography to ensure the server learns only the sum of updates, not any individual update. Combined with central DP, it produces a system where the server has the privacy properties of a trusted aggregator without the clients having to trust the server.
Why aggregation alone is not enough
Recall: the server in vanilla FedAvg sees every client's individual update before averaging. If those updates leak information about the data (and they do, as Section 7 details), then a malicious or compromised server can extract that information. The naive privacy story of "data stays on the device" is broken at this point. Secure aggregation closes the gap — the server sees only the aggregate of all clients' updates, never any individual one.
The masking-based protocol
The standard secure-aggregation protocol (Bonawitz et al. 2017) uses pairwise masking. Each pair of clients (i, j) agrees on a shared random vector r_{ij} (using Diffie-Hellman key exchange). Client i adds Σ_{j<i} r_{ji} − Σ_{j>i} r_{ij} to their update. When the server sums all clients' masked updates, the masks cancel pairwise (each r_{ij} appears with positive sign in one client's update and negative sign in another's), yielding the true sum. The server never learns any individual update because every individual update has a random mask attached.
The protocol's elegance is that the masks cancel exactly when summed but make any individual update look like noise. The cost is the pairwise communication required to establish the masks (O(N²) for N clients, mitigated by neighbourhood schemes), and the dropout problem: if a client disconnects mid-round, their masks were never cancelled and the server's sum is corrupted. The Bonawitz protocol handles this with secret-sharing of the masks so any t-of-N surviving clients can reconstruct dropouts' masks; modern variants use more efficient techniques.
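A toy simulation of the cancellation property. There is no real key exchange here: a shared seed per client pair stands in for the Diffie-Hellman-agreed randomness, and dropout handling is omitted:

```python
import numpy as np

def pairwise_masked_update(client_id, update, all_ids, dim, round_salt=0):
    """Add pairwise masks: +r for partners with a smaller id, -r for larger,
    so that every mask cancels when the server sums the cohort."""
    masked = update.copy()
    for other in all_ids:
        if other == client_id:
            continue
        pair_seed = (min(client_id, other), max(client_id, other), round_salt)
        r = np.random.default_rng(abs(hash(pair_seed)) % 2**32).normal(size=dim)
        masked += r if other < client_id else -r
    return masked

# Three clients: individual masked updates look like noise, but the sum is exact.
rng = np.random.default_rng(0)
ids, dim = [0, 1, 2], 4
updates = {i: rng.normal(size=dim) for i in ids}
masked = [pairwise_masked_update(i, updates[i], ids, dim) for i in ids]
print(np.allclose(sum(masked), sum(updates.values())))   # True: masks cancel
```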
Multi-party computation framing
Secure aggregation is an instance of secure multi-party computation (MPC), the broader field of cryptographic protocols that compute functions over distributed inputs without revealing them. The general MPC literature provides protocols for arbitrary functions; the practical federated-learning version specialises to the addition function (much cheaper than general MPC) at the cost of generality. Several frameworks (Google's secure-aggregation protocol, MP-SPDZ, Meta's CrypTen) provide production implementations.
Combining with differential privacy
The standard production combination is secure aggregation plus central differential privacy. The protocol: clients clip their updates, secure aggregation produces the sum at the server, the server adds calibrated DP noise to the sum, the noisy sum becomes the model update. This gives the strong privacy guarantee of central DP (less noise than local DP) without requiring the clients to trust the server (because secure aggregation hides individual contributions). It is the gold standard for serious federated deployments and is what Google's Gboard, Apple's federated-learning systems, and most regulated-industry deployments use in 2026.
Limitations and threat model
Secure aggregation defends against an honest-but-curious server that follows the protocol but tries to extract information from the messages it sees. It does not defend against a malicious server that deviates from the protocol — for instance, the server could broadcast different models to different clients and use the per-client updates to learn each client's data. Defending against malicious servers requires more elaborate cryptography (zero-knowledge proofs, verifiable MPC) at substantial overhead. Most production deployments accept the honest-but-curious threat model on the basis that the server operator has reputational and legal incentives to follow the protocol.
Homomorphic Encryption and Trusted Execution
Secure aggregation handles the additive case efficiently but does not extend gracefully to general computation. Two complementary technologies fill that gap: homomorphic encryption performs computation on encrypted data with no decryption, and trusted execution environments use hardware enclaves where computation is hidden even from the machine running it. Both are part of the modern privacy-preserving-ML toolkit, with different cost profiles and threat models.
Homomorphic encryption
Homomorphic encryption (HE) is a class of encryption schemes that allow computation on ciphertexts: encrypt a, encrypt b, perform the encrypted-addition operation, decrypt — and you get a + b. The encryption is preserved through computation. A fully homomorphic encryption (FHE) scheme supports arbitrary computation; partially or somewhat homomorphic schemes support restricted operations (often just addition, or addition plus a bounded number of multiplications).
For federated learning, the practical scheme is typically CKKS (Cheon-Kim-Kim-Song 2017), a leveled-homomorphic scheme that supports floating-point arithmetic with controllable precision and reasonable performance. Clients encrypt their updates, the server performs the aggregation (which is just addition) on ciphertexts, and the result is decrypted by a key holder. The privacy property is dramatic: the server can never decrypt anything; it only sees ciphertexts throughout. The cost is also dramatic: HE-based aggregation is 100×–1000× slower than plaintext, and the ciphertexts are 10×–100× larger than the underlying tensors.
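A sketch of HE-based aggregation assuming the open-source TenSEAL library's CKKS bindings; the encryption parameters are illustrative, not a vetted security configuration:

```python
import tenseal as ts

# The key holder (e.g. the clients jointly, or a key authority) creates the
# context; the server only ever handles ciphertexts.
context = ts.context(ts.SCHEME_TYPE.CKKS, poly_modulus_degree=8192,
                     coeff_mod_bit_sizes=[60, 40, 40, 60])
context.global_scale = 2 ** 40

# Each client encrypts its model update (tiny toy vectors here).
client_updates = [[0.10, -0.20, 0.05], [0.30, 0.10, -0.15], [-0.05, 0.02, 0.40]]
ciphertexts = [ts.ckks_vector(context, u) for u in client_updates]

# Server-side aggregation: addition on ciphertexts only, never any plaintext.
encrypted_sum = ciphertexts[0]
for ct in ciphertexts[1:]:
    encrypted_sum = encrypted_sum + ct

# Decryption happens at the key holder, not at the server.
print(encrypted_sum.decrypt())   # approximately the elementwise sum of updates
```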
For applications where the cost is tolerable and the threat model justifies it (cross-silo medical or financial federated learning, where clients refuse to trust any server-side cryptography that depends on secret-sharing assumptions), HE is the right choice. For cross-device deployments at scale, the cost is usually prohibitive and secure aggregation is the practical choice instead.
Trusted execution environments
A trusted execution environment (TEE) is a hardware-enforced enclave on a CPU that runs code with its memory contents inaccessible to the rest of the machine — including the operating system and the machine's owner. Intel SGX, AMD SEV, ARM TrustZone, and Apple's Secure Enclave are commercial TEEs; the AWS Nitro, Google Confidential Computing, and Azure Confidential Computing services expose them as cloud primitives.
For federated learning, a TEE on the server can run the aggregation logic in a way that even the cloud operator cannot inspect. Clients verify (via remote attestation) that the server's code is the agreed-upon aggregation logic running inside a genuine enclave; they then send their updates encrypted under the enclave's public key, so the updates are visible in plaintext only inside the enclave. The aggregation runs at full speed (no cryptographic overhead), produces the aggregate, and emits only the aggregate. The privacy properties depend on the security of the TEE itself, which has had several published vulnerabilities (Spectre-style side-channels, fault-injection attacks).
Hybrid approaches
Modern federated deployments increasingly use hybrids that combine techniques. A typical 2026 stack: clients use secure aggregation to hide individual updates from the server-side cohort coordinator; the coordinator runs inside a TEE for additional defence-in-depth; the final aggregate is post-processed with central DP for a mathematical privacy guarantee on the released model. Each layer protects against different attacks. The combined stack has stronger privacy than any single technique provides.
What the layering does and doesn't do
The layered stack defends against passive-honest-but-curious adversaries (server-side or cloud-side) and against gradient-inversion attacks on individual updates. It does not defend against malicious clients (which Section 9 covers separately), against side-channel attacks on TEEs, against vulnerabilities in the cryptographic primitives themselves, or against social-engineering attacks on the operators. The privacy properties are real and strong but bounded; production deployments should be clear-eyed about the threat model that each layer addresses.
Privacy Attacks Against Federated Learning
Federated learning's privacy story is "the data never leaves the device." A line of attack research starting around 2019 demonstrated that this story, taken alone, is dangerously incomplete: a malicious or curious server can recover individual training examples from the gradients clients send. Understanding these attacks is essential for designing defences that actually defend, and for recognising why the differential-privacy and secure-aggregation machinery of Sections 4–5 is not optional.
Gradient inversion: reconstructing inputs from gradients
Gradient inversion attacks exploit the fact that a gradient is a function of the input — and for many architectures, that function is invertible enough to recover the input. Geiping et al. 2020 ("Inverting Gradients") showed that for a single example trained on a standard CNN, the gradient typically determines the input image to within a recognisable approximation. The attack: given the gradient, optimise an artificial input until its gradient matches the observed one. The recovered input is, in many cases, visually identical to the original.
For batch sizes greater than one, the attack is harder but still often successful. DLG (Deep Leakage from Gradients, Zhu et al. 2019) and its successor iDLG demonstrated batch-level reconstruction; subsequent work (Yin et al. 2021) extended to ResNet-50 on ImageNet-scale images. The empirical pattern: for small batches and feed-forward architectures, gradients leak essentially the entire input. For large batches and architectures with batch normalisation, the leakage is partial but still significant.
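A compact sketch of the attack in the DLG style, assuming a tiny fully connected model, a single-example update, and the label already inferred (which iDLG showed is usually possible):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
loss_fn = nn.CrossEntropyLoss()

# The client's private example and the gradient it would send to the server.
x_true = torch.randn(1, 16)
y_true = torch.tensor([2])
true_grads = torch.autograd.grad(loss_fn(model(x_true), y_true), model.parameters())

# The attacker optimises a dummy input so its gradient matches the observed one.
x_dummy = torch.randn(1, 16, requires_grad=True)
opt = torch.optim.Adam([x_dummy], lr=0.1)

for step in range(300):
    opt.zero_grad()
    dummy_grads = torch.autograd.grad(loss_fn(model(x_dummy), y_true),
                                      model.parameters(), create_graph=True)
    grad_diff = sum(((dg - tg) ** 2).sum() for dg, tg in zip(dummy_grads, true_grads))
    grad_diff.backward()
    opt.step()

print(torch.norm(x_dummy.detach() - x_true))   # shrinks as the gradients are matched
```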
Membership inference
A weaker but still meaningful attack class is membership inference: given access to model updates, determine whether a specific person's data was in the training set. Membership inference is conceptually easier than gradient inversion (it asks a yes/no question rather than reconstructing data) and applies in settings where gradient inversion is impractical. Shokri et al. 2017 introduced membership-inference attacks for ML in general; the federated-learning literature adapted them quickly.
For regulated applications (health, finance), membership inference is often the worst case the regulator cares about — knowing that a specific patient was in the training set can violate HIPAA even if their actual record is not recovered. Differential privacy provides provable defences against membership inference; secure aggregation does not directly defend against it (because the aggregate itself can leak membership signal).
Property inference and reconstruction at scale
Two further attack classes round out the picture. Property inference attacks try to determine aggregate properties of clients' data — for instance, "what fraction of client X's data has property P?" — without recovering individual examples. Sybil-style data extraction attacks combine many compromised clients to gradually extract information about a target client through carefully-chosen queries. Both are realistic threats for cross-device federated learning where the server has many opportunities to probe individual clients.
Why the defences work
Differential privacy directly bounds all of these attacks: by definition, an (ε, δ)-DP mechanism limits how much any one record can affect the output, so any attack trying to extract information about that record is correspondingly bounded. Secure aggregation defends against per-client gradient inversion (the server never sees individual gradients) but does not by itself defend against attacks on the aggregate. The combined DP + secure-aggregation stack provides defence in depth: secure aggregation prevents trivial gradient inversion; DP bounds the residual leakage from the aggregate.
The empirical pattern from the attack-defence literature: federated systems without DP are vulnerable to reconstruction attacks that recover recognisable training examples on most modern architectures; systems with reasonable DP budgets (ε < 5) defeat published attacks; systems with very weak DP (ε > 100, sometimes used for "compliance" without much privacy) are still vulnerable. The take-home: privacy claims for federated systems should be read carefully, and "uses differential privacy" is not the same as "provides meaningful privacy."
Personalisation and Federated Multi-Task Learning
Vanilla FedAvg trains a single global model that is supposed to work for every client. When client distributions are sufficiently different — as they almost always are in real cross-device deployments — the global model is mediocre for everyone. The personalisation literature asks: how do you give each client a model tailored to their data while still benefiting from joint training across clients? The answers connect federated learning back to the meta-learning material of Ch 08 and to multi-task learning more broadly.
Why one model isn't enough
Consider keyboard prediction. A user who texts in English has very different typing patterns than one who codes in Python or one who writes Hindi. A single global model averages over these populations and predicts mediocrely for all of them; a per-user model learns the user-specific patterns but has too little data to learn anything more general. The right answer is somewhere in between — a user-specific head on top of a shared backbone, or a shared backbone fine-tuned slightly per user.
FedPer: shared backbone, personal head
The simplest personalisation method is FedPer (Arivazhagan et al. 2019), which splits the network into a shared backbone (federated-averaged across clients) and a personalised head (kept local on each client). Each client receives the global backbone, attaches their personal head on top, and trains both parts locally; only the backbone updates are sent back for federated averaging. The result is a per-client model that has access to the jointly-learned features but specialises the final-layer predictions to the local data.
FedPer is simple and works well when the underlying feature representation is genuinely shared across clients but the output mapping varies. For keyboard prediction this fits — token features are shared, but per-user vocabulary preferences differ. For more dramatic heterogeneity (different feature distributions across clients), more elaborate methods are needed.
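A minimal sketch of the FedPer split; the class and function names are illustrative. The backbone parameters are what gets federated-averaged, the head never leaves the device:

```python
import torch.nn as nn

class FedPerModel(nn.Module):
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, num_classes)   # personalised, stays local

    def forward(self, x):
        return self.head(self.backbone(x))

def shared_state(model):
    """What the client uploads each round: backbone parameters only."""
    return {k: v for k, v in model.state_dict().items() if k.startswith("backbone")}

def load_global_backbone(model, global_backbone):
    """What the client downloads each round: the averaged backbone, head untouched."""
    model.load_state_dict(global_backbone, strict=False)
```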
pFedMe and meta-learning approaches
pFedMe (T. Dinh et al. 2020) takes a meta-learning-flavoured angle. The global model is treated as a shared reference point; each client derives a personalised model by optimising its local loss plus a proximal (Moreau-envelope) penalty that keeps the personalised model close to that reference; the server aggregates across clients to improve the reference for the next round. The structure mirrors MAML: the shared reference is what is meta-trained, the local personalised model is what is deployed. The connection to Ch 08's meta-learning material is direct.
Meta-FL methods produce stronger personalisation than FedPer-style methods at higher complexity. They are the right choice when client distributions vary substantially in their feature structure (not just their output labels) and when each client has enough data to support a meaningful local fine-tune.
Clustered federated learning
Sometimes clients fall into natural groups — distinct user segments, hospital types, language families. Clustered federated learning assigns each client to a cluster and trains a separate model per cluster, with the assignment learned alongside the models. The IFCA algorithm (Ghosh et al. 2020) is canonical: alternate between assigning each client to its best-fitting cluster's model and updating each cluster's model on its assigned clients. The result is K models for K clusters, each better-suited to its assigned clients than a single global model would be.
Clustered FL works well when there are genuinely distinct subpopulations among clients but not when client distributions form a continuum. The hyperparameter K (number of clusters) is itself difficult to choose; production deployments often use K = 2–10 with empirical tuning.
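A sketch of the IFCA assignment-then-update round in the toy least-squares setting; `local_train` is the generic client-training stand-in from the FedAvg sketch, and the names are illustrative:

```python
import numpy as np

def local_loss(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def ifca_round(cluster_models, clients, local_train):
    """Assign each client to the best-fitting cluster model, then re-average
    each cluster model over its assigned clients' local updates."""
    assignments = [int(np.argmin([local_loss(w, X, y) for w in cluster_models]))
                   for X, y in clients]
    new_models = []
    for k, w_k in enumerate(cluster_models):
        members = [clients[i] for i, a in enumerate(assignments) if a == k]
        if not members:
            new_models.append(w_k)          # empty cluster: keep the previous model
            continue
        updates = [local_train(w_k, X, y) for X, y in members]
        sizes = np.array([len(y) for _, y in members], dtype=float)
        new_models.append(sum((n / sizes.sum()) * u for n, u in zip(sizes, updates)))
    return new_models, assignments
```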
The personalisation-privacy tension
A subtle point: personalisation methods often weaken the privacy story. A per-client fine-tune means a per-client model that the client must store and use; if that model is exfiltrated (by a malicious app, a compromised device backup), it leaks more about that client's data than a shared global model would. The 2024 production trend: keep the personalisation to small modules (LoRA-style adapters, FedPer heads) so the per-client diff is small, and apply DP to the federated portion of training. This bounds the privacy leakage from both the personalisation and the federation.
Byzantine Robustness and Adversarial Clients
Privacy attacks (Section 7) try to extract information from the federated protocol; this section's attacks try to corrupt the model itself. A malicious client can submit deliberately-crafted updates that slow training, push the global model toward an attacker-chosen target, or implant backdoors. The Byzantine-robust aggregation literature provides defences, but they trade off accuracy and robustness in ways that are still being mapped out.
The attack surface
Three attack classes matter. Untargeted attacks aim to degrade the global model — Byzantine clients submit large random or adversarial updates that drag the global aggregate toward bad weights. Targeted attacks aim to make the model misclassify specific inputs while preserving overall accuracy — useful when the attacker has a particular goal (cause the spam filter to misclassify their emails). Backdoor attacks implant a trigger pattern: the global model behaves normally on most inputs but produces an attacker-chosen output when a specific trigger pattern is present.
The federated setting is unusually hospitable to such attacks because the server cannot easily verify what each client did locally. A client claims to have computed a gradient; the server must accept or reject it without seeing the underlying data. Without explicit defences, even a small fraction of Byzantine clients can dominate the aggregate.
Robust aggregation: median, trimmed mean, Krum
The basic defensive idea: replace the simple average with a robust aggregator that is resistant to outliers. Three standard choices. Coordinate-wise median: take the median rather than the mean of each parameter across clients. Trimmed mean: discard the top and bottom α% of values per coordinate before averaging. Krum (Blanchard et al. 2017): for each client, compute the sum of squared distances to its K nearest neighbours; pick the client with the smallest sum as the round's update. All three are provably robust to a bounded fraction of Byzantine clients.
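Sketches of the three aggregators over a list of client update vectors (illustrative; real systems apply them per tensor and combine them with the additional defences discussed below):

```python
import numpy as np

def coordinate_median(updates):
    """Coordinate-wise median across the stacked client updates."""
    return np.median(np.stack(updates), axis=0)

def trimmed_mean(updates, trim_frac=0.1):
    """Drop the top and bottom trim_frac of values per coordinate, then average."""
    stacked = np.sort(np.stack(updates), axis=0)
    k = int(trim_frac * len(updates))
    return stacked[k:len(updates) - k].mean(axis=0)

def krum(updates, num_byzantine=1):
    """Krum: score each update by the sum of squared distances to its closest
    neighbours and return the update with the lowest score."""
    stacked = np.stack(updates)
    n = len(updates)
    dists = np.sum((stacked[:, None, :] - stacked[None, :, :]) ** 2, axis=-1)
    closest = n - num_byzantine - 2       # number of neighbours scored per update
    scores = [np.sort(dists[i])[1:closest + 1].sum() for i in range(n)]
    return stacked[int(np.argmin(scores))]
```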
Multi-Krum and similar variants average over the top-K most-aligned clients, balancing robustness and convergence speed. The 2024 production standard for Byzantine-robust federated learning is some variant of multi-Krum or trimmed mean; pure Krum and pure median converge slowly when no attack is present.
The robustness-accuracy trade-off
Robust aggregators are conservative — they ignore or downweight outlier updates, including legitimate ones from clients with unusual but valid data. Under non-IID heterogeneity, "unusual" overlaps with "Byzantine" and the aggregator may be unable to distinguish them. The empirical pattern: pure-IID benchmarks show a small accuracy cost from robust aggregation; non-IID benchmarks show a much larger cost; severely non-IID settings can make robust aggregation underperform vanilla averaging because legitimate diversity is mistaken for adversarial behaviour.
The modern pragmatic compromise: combine robust aggregation with FedProx-style regularisation (limits how much any client can deviate, reducing the apparent "outlier-ness" of legitimate updates) and reputation tracking (clients who consistently send useful updates are trusted more). Production systems typically use these layered defences rather than any one robust aggregator alone.
Backdoor attacks and detection
Backdoor attacks are the hardest to defend against because the malicious updates are designed to look normal — they degrade the model only on specific trigger inputs that the attacker controls. The defensive literature has produced several detection methods (clustering of update directions, neural-network-based attack classifiers, certified-robust training) but none provide strong guarantees. The 2024 best-practice: combine robust aggregation with periodic centralised evaluation on a held-out test set that includes plausible backdoor triggers, plus selective deployment that compares new global models against the previous one before promoting them.
Open problems in Byzantine FL
Several aspects remain unresolved. The interaction between Byzantine robustness and differential privacy is subtle — DP noise itself looks like Byzantine noise to a robust aggregator, and combining the two requires careful tuning. The handling of strategic Byzantine clients who adapt to the defence is an open arms race. The case where the majority of clients are Byzantine (rather than the bounded-fraction case the standard literature handles) is mostly unsolved. For applications where Byzantine clients are a serious threat, the right move in 2026 is conservative: small client cohorts of vetted participants, explicit reputation tracking, multiple defensive layers, and continuous monitoring.
Applications and Frontier
Federated learning shows up wherever data cannot be centralised — privacy regulation, contractual requirements, bandwidth constraints, on-device deployment. The methods of the chapter combine differently in different domains; the specific stack at Google's Gboard differs sharply from a hospital consortium training a medical-imaging model differs from a multi-bank fraud-detection collaboration. This final section surveys the application landscape and the frontier where federated learning is reshaping how the largest models are trained.
Cross-device: Gboard, Apple, Siri
The largest production federated systems are at the major mobile platforms. Google's Gboard keyboard has trained federated next-word-prediction and emoji-suggestion models since 2017, using FedAvg with secure aggregation and central DP across hundreds of millions of phones. Apple's keyboard, autocomplete, and Siri systems use federated learning for similar tasks, with their own DP framework (the public-facing "Private Federated Statistics" project). The deployment patterns share themes: aggressive client sampling (cohorts of ~1000 from many millions of available phones), DP at user-level with ε in the single digits across the entire training run, secure aggregation as the standard cryptographic protection, and pre-trained initialisation to minimise the federated rounds needed.
Cross-silo: medical imaging and finance
Hospital consortia training jointly on patient imaging is the canonical cross-silo application. The MELLODDY consortium (10 pharma companies federating drug-discovery training) and medical-imaging federations such as the Federated Tumor Segmentation (FeTS) initiative demonstrate the pattern: dozens of well-identified institutional clients, each with thousands to millions of local examples, training models that no single institution could build alone. The trust model is different from cross-device: institutions sign data-use agreements rather than trusting a server's cryptography, and the privacy machinery focuses on regulatory requirements (HIPAA, GDPR) rather than on cryptographic minimality.
Cross-silo finance applications include fraud detection across banks, anti-money-laundering across financial institutions, and credit-risk modelling across lenders. The regulatory and competitive sensitivities are extreme — banks have strong reasons to not share customer data with each other — and federated learning is one of the few technical paths that addresses them. The 2024–2026 deployments use secure aggregation plus DP plus extensive auditability (each round's contribution is logged for regulatory review).
Federated learning of foundation models
The largest open question in 2026: can foundation models be trained federated? The classical FedAvg recipe scales poorly to billion-parameter models — the per-round communication is enormous, the client compute requirements are prohibitive, and the convergence is slow. Several lines of work address pieces of the problem. FedLoRA and similar adapter-based federated fine-tuning send only LoRA-style low-rank updates, dramatically reducing communication. Split federated learning partitions the model across clients and server (clients run early layers, server runs later layers) to reduce client compute. Federated pretraining on open data with cross-silo participation is the most ambitious direction and has produced early results (the BloomZ-style cross-organisation efforts, the MedFM medical foundation model) that suggest it is feasible at moderate scale.
Frontier methods
Several frontiers are particularly active in 2026. Personalised federated foundation models: the combination of foundation-model scale with federated personalisation, using LoRA-style adapters per client. Federated reinforcement learning: federating policy training across embodied agents, with the client-heterogeneity problem particularly acute. Vertical federated learning: a less-developed branch where different clients hold different features for the same individuals (rather than different individuals' full feature vectors), which requires entirely different aggregation primitives. Provable-privacy frameworks: combining DP, secure aggregation, and TEEs into formal end-to-end guarantees that hold under composition.
What this chapter does not cover
Several adjacent areas are out of scope. The classical distributed-machine-learning literature (data parallelism, model parallelism, parameter servers) is the technical ancestor of federated learning but assumes trusted workers and centralised data, so the methodology is mostly different. Pure cryptographic privacy-preserving ML — secure inference, encrypted databases, oblivious algorithms — solves related problems with different machinery and warrants separate treatment. The differential-privacy literature beyond DP-SGD (DP statistical estimation, DP synthetic data, local DP for statistics) is closely related but mostly outside the deep-learning context. And the legal and compliance dimensions of federated learning — what counts as "data minimisation" under GDPR, the FDA's evolving guidance for federated medical AI, the cross-border-data-flow doctrines — are crucial in deployment but are policy questions rather than ML questions.
Further reading
Foundational papers and surveys for federated learning and privacy-preserving ML. The Kairouz survey, the canonical FedAvg and DP-SGD papers, and the secure-aggregation paper together form the right starting kit for practitioners.
- Communication-Efficient Learning of Deep Networks from Decentralized Data (FedAvg). The FedAvg paper. Introduces the federated-averaging protocol that is the foundation of essentially every production federated-learning system. The natural starting point for anyone implementing or deploying federated learning. The reference for federated averaging.
- Advances and Open Problems in Federated Learning. The canonical survey. Comprehensive treatment of cross-device and cross-silo federated learning, privacy techniques, robustness, personalisation, and open problems. The right second reading after the FedAvg paper and a useful organisational framework for the literature. The survey reference for the field.
- Deep Learning with Differential Privacy (DP-SGD). The DP-SGD paper. Establishes the standard recipe for differentially-private deep learning — per-example gradient clipping, calibrated noise injection, and tight privacy accounting via the moments accountant. The right reading for anyone deploying privacy-preserving ML and the foundation for DP-FedAvg. The reference for differentially-private deep learning.
- Practical Secure Aggregation for Privacy-Preserving Machine Learning. The secure-aggregation paper. The canonical pairwise-masking protocol that became the standard cryptographic primitive for federated systems, with explicit handling of client dropouts via secret-sharing. Required reading for anyone deploying production federated learning where the server cannot be fully trusted. The reference for secure aggregation in federated learning.
- Federated Optimization in Heterogeneous Networks (FedProx). The FedProx paper. Introduces proximal regularisation as a defence against client drift under data heterogeneity, providing a simple modification of FedAvg that consistently improves results on non-IID data. The right reading after FedAvg for understanding how modern federated systems handle real heterogeneous deployments. The reference for non-IID federated learning.
- SCAFFOLD: Stochastic Controlled Averaging for Federated Learning. The SCAFFOLD paper. Introduces control variates to correct for client drift directly, achieving variance-reduced rates that match centralised SGD under heterogeneity. The natural reading after FedProx for the more theoretically-grounded handling of non-IID federated optimisation. The reference for principled non-IID federated optimisation.
- Inverting Gradients — How easy is it to break privacy in federated learning? The gradient-inversion paper. Demonstrates that individual training images can be reconstructed from gradients with high fidelity, establishing that federated learning without additional privacy machinery does not provide meaningful privacy. The right reading for understanding why DP and secure aggregation are not optional in serious deployments. Pair with Zhu et al. 2019 (DLG) for the original demonstration. The reference for federated-learning privacy attacks.
- The Algorithmic Foundations of Differential Privacy. The textbook. The canonical reference for differential privacy, including the foundational definitions, mechanisms, composition theorems, and applications. Free online and the right comprehensive reference for someone treating DP as a primary research area rather than a tool. The textbook reference for differential privacy.
- Flower: A Friendly Federated Learning Framework. The Flower framework paper. The standard open-source federated-learning library in 2026, supporting cross-device and cross-silo deployments with most of the chapter's methods (FedAvg, FedProx, SCAFFOLD, DP integration, secure aggregation). Pair with TensorFlow Federated (Google's research-oriented alternative) for the production tooling reference. The library you will actually use.