Part VI · NLP & Large Language Models · Chapter 08

Fine-tuning and parameter-efficient adaptation, the techniques by which a general-purpose pretrained model is specialised to a domain, a task, a style, or a user — without retraining it from scratch, without forgetting what it knew, and without paying the memory cost of a fresh copy of every weight for every variant you want to deploy.

A pretrained language model is a shared asset: a single frozen block of weights that represents a compressed reading of a large fraction of public text. Fine-tuning is the practice of continuing to train that model on a narrower dataset so that the resulting weights behave differently — better at a specific task, fluent in a specific style, aware of a specific corpus, or better aligned to a specific operator's standards. Classical fine-tuning updates every parameter; the resulting checkpoint is a full copy of the model, and running a thousand variants means paying for a thousand copies. Parameter-efficient fine-tuning (PEFT) is the family of methods — adapters, prefix tuning, prompt tuning, and especially low-rank adaptation (LoRA) and its descendants — that update only a small fraction of the weights, or inject small trainable modules alongside the frozen backbone, reaching most of the quality of full fine-tuning at a few percent of the parameter count, the memory, and the storage. This chapter covers both ends: when to fine-tune at all, how to do it well when you do, the PEFT methods that now dominate open-source practice, and the adjacent topics — quantization, distillation, model merging, multi-tenant serving — that together determine what a serving fleet of specialised variants actually looks like in 2026.

How to read this chapter

Sections one through three set the stakes. Section one is when to fine-tune — the decision tree that distinguishes a fine-tuning problem from a prompting problem, a retrieval problem, or a continued-pretraining problem, and the reason that distinction matters at planning time rather than implementation time. Section two covers full fine-tuning in its classical form: the straightforward approach of continuing gradient descent on a new dataset with a new objective, the memory and storage cost that follows, and when paying that cost is actually the right answer. Section three covers catastrophic forgetting — the central failure mode of naive fine-tuning, in which adapting a model to a new task destroys competence on the tasks it was pretrained for, and the techniques that mitigate it.

Sections four through seven cover the parameter-efficient methods that now dominate. Section four is a taxonomy of PEFT — the additive, selective, and reparameterised families — and a map of where each method sits. Sections five through seven go through the three canonical approaches: adapters (Houlsby 2019), the original bottleneck-module approach; prefix, prompt, and P-tuning, which treat soft embeddings as the adaptation surface; and LoRA (Hu et al. 2021), the low-rank decomposition that has become the de-facto standard in 2024–2026.

Sections eight and nine cover the extensions of LoRA that made it deployable at scale. Section eight is QLoRA — the 4-bit backbone plus LoRA trick that made fine-tuning a 65B-parameter model feasible on a single consumer GPU. Section nine is a tour of the LoRA variants: DoRA, AdaLoRA, VeRA, LoRA+, PiSSA, and the reasons the field did not stop at the original formulation. Section ten covers selective methods — BitFit, IA³, and the broader family that trains only existing parameters rather than adding new ones.

Sections eleven through thirteen cover adjacent topics that share infrastructure and ideas with PEFT. Section eleven is quantization for training and inference — PTQ, QAT, GPTQ, AWQ, and the numerical-precision story that makes efficient fine-tuning possible. Section twelve covers task vectors and task arithmetic — the observation (Ilharco et al. 2022) that differences between checkpoints compose and subtract meaningfully. Section thirteen covers model merging — TIES, DARE, SLERP, model soups — the practical and surprisingly effective technique of averaging or interpolating checkpoints to combine capabilities.

Sections fourteen through sixteen cover the failure modes and deployment-side consequences of the current practice. Section fourteen is distillation — training a smaller student on a larger teacher's outputs, a long-established technique that is increasingly the main path to deployable model sizes. Section fifteen is fine-tuning without undoing alignment — the 2023 finding (Qi et al.) that even benign fine-tuning on small datasets can erase the safety training, and the mitigation research that followed. Section sixteen is multi-tenant adapter serving — batching LoRA variants together on a shared backbone, vLLM and friends, and why LoRA is now as much a deployment technique as a training one.

The closing section places this chapter between Chapter 07 (alignment) on one side and Chapter 09 (retrieval-augmented generation) on the other, and sketches what the open problems in specialised-model training look like as of early 2026 — where full fine-tuning still wins, where PEFT has won decisively, and where the practice is still moving faster than the textbooks.

Contents

  1. When to fine-tune — The decision tree: prompt, retrieve, adapt, or retrain
  2. Full fine-tuning — Classical continued training, learning rate regimes, memory and storage cost
  3. Catastrophic forgetting — The central failure mode, replay, EWC, the rehearsal buffer
  4. A taxonomy of PEFT — Additive, selective, reparameterised — the map of methods
  5. Adapters — Houlsby bottleneck modules, the first parameter-efficient method
  6. Prefix, prompt, and P-tuning — Soft embeddings as the adaptation surface
  7. LoRA — low-rank adaptation — The dominant method, low-rank factorisation, the rank/alpha hyperparameters
  8. QLoRA — 4-bit backbones, double quantization, fine-tuning a 65B on a single GPU
  9. LoRA variants — DoRA, AdaLoRA, VeRA, LoRA+, PiSSA, ReLoRA
  10. Selective methods — BitFit, IA³, training only existing parameters
  11. Quantization for training and inference — PTQ, QAT, GPTQ, AWQ, mixed precision, numerical formats
  12. Task vectors & task arithmetic — Checkpoint differences compose — adding and subtracting capabilities
  13. Model merging — Model soups, TIES, DARE, SLERP, merging as a training step
  14. Knowledge distillation — Teacher–student training, why distillation dominates deployment
  15. Fine-tuning without undoing alignment — Qi et al. 2023, benign datasets that erase safety, preserving guardrails
  16. Multi-tenant adapter serving — vLLM, S-LoRA, batching hundreds of LoRAs on a shared backbone
  17. Where fine-tuning sits in 2026 — What works, what doesn't, the open problems

§1

When to fine-tune — the decision tree

A large fraction of the fine-tuning projects that reach production should never have been fine-tuning projects in the first place. They should have been prompting, retrieval, or — occasionally — a continued-pretraining run. Choosing the right tool at planning time is the single most important decision in the chapter, because it is the decision that is most expensive to reverse.

Fine-tuning is the modification of a model's parameters by gradient descent on new data. It is powerful but it is not free: it requires curated training data, it requires evaluation that the curation produced what you wanted, it produces a new checkpoint that has to be stored and served, and — critically — it changes the model in ways that are difficult to reason about. Before reaching for fine-tuning, it is worth asking whether the same behaviour could have been produced by a prompt, by in-context examples, by retrieval, or by continued pretraining on a substantially larger unlabelled corpus.

A reasonable decision tree, in the order one should try things:

  1. Prompting. If the base model can do the task at all — with a clearer instruction, a role, an output format, or a chain-of-thought prompt — stop. Prompting is free, instant, and leaves no artefacts on a serving fleet.
  2. Few-shot / in-context examples. If the task is of a predictable shape and a handful of worked examples in the context gets the model most of the way there, this is a better solution than a fine-tuning run with the same examples, because it is reversible and lets you swap examples per-request.
  3. Retrieval-augmented generation. If the gap is knowledge — the model does not know about your product catalogue, your company's documents, or the set of changes in last week's sprint — this is a retrieval problem, not a training problem. Chapter 09 covers it; the diagnostic is that the missing behaviour is recall of specific facts rather than a new style of response.
  4. PEFT / LoRA. If prompting and retrieval are not enough, and you have a training set of ordered pairs (input → desired output) in the low thousands, a LoRA or adapter run is usually the right tool. It costs almost nothing to train, produces a ~1% parameter overlay, and can be served on top of a frozen base.
  5. Full fine-tuning. If the target distribution is genuinely far from the base — a new language the base does not speak well, a domain register the base does not match, a code style substantially outside the training corpus — full fine-tuning becomes competitive. The cost is real, and the serving story is heavier.
  6. Continued pretraining. If you have hundreds of billions of tokens of domain text and not much labelled data, continued pretraining on the unlabelled corpus is often the right answer. This is how BloombergGPT, Med-PaLM's predecessors, and several legal-domain models were built. It is closer to Chapter 05 than to this chapter, and it is almost always followed by an alignment pass.
The bright line. Fine-tuning changes behaviour; retrieval changes knowledge. If your diagnostic question is "does the model say the right kind of thing?", fine-tuning is probably the answer. If it is "does the model know this fact?", retrieval almost always is. Teams that conflate the two waste a lot of money trying to fine-tune facts into a model that would be one retrieval call away on the other path.
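The ordering above can be condensed into a toy decision procedure. This is a sketch of the §1 decision tree, not a real tool; the function name and the numeric thresholds (which follow the rule of thumb later in this section) are illustrative assumptions:

```python
def choose_adaptation(prompting_works, gap_is_facts, n_examples, domain_tokens):
    """Toy sketch of the decision tree in §1; thresholds are illustrative."""
    if prompting_works:
        return "prompting / few-shot"       # steps 1-2: free, instant, reversible
    if gap_is_facts:
        return "retrieval (RAG)"            # step 3: knowledge, not behaviour
    if domain_tokens >= 1_000_000_000:
        return "continued pretraining"      # step 6: large unlabelled domain corpus
    if n_examples >= 100_000:
        return "full fine-tuning"           # step 5: large labelled set, far target
    return "PEFT / LoRA"                    # step 4: the usual default

print(choose_adaptation(False, False, 2_000, 0))  # → PEFT / LoRA
```

The point of the sketch is the ordering: the cheap, reversible options are tried first, and gradient descent is the last resort.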

A second useful lens is the data-to-behaviour ratio. A small number of high-quality demonstrations (hundreds to thousands) can meaningfully change a model's behaviour through PEFT. Getting the same change from prompting requires you to fit those demonstrations in the context window, which is expensive on every request. Getting the same change from full fine-tuning requires tens of times more data, because the signal is diluted across many more parameters. The rule of thumb — if you have a thousand examples, try LoRA; if you have a hundred thousand, consider full fine-tuning; if you have a billion tokens, you are back in Chapter 05's territory — is approximately right in 2026 and has been for two years.

The final observation is that fine-tuning has a reversal problem. Once a model has been fine-tuned, the behaviour of that checkpoint is entangled with the base in ways that are hard to recover. If the fine-tuning was bad, you revert to the base; you do not incrementally un-train. This makes fine-tuning an operation that deserves the same care as a schema migration: think at planning time about what happens when you want to replace the base, what happens when you want to replace the dataset, and what happens when the target behaviour drifts. PEFT methods help substantially here, because the artefact of a PEFT run is small — a few hundred megabytes of LoRA weights, not a forty-gigabyte checkpoint — and therefore cheap to keep multiple versions of.

§2

Full fine-tuning — the classical approach

Full fine-tuning is the obvious thing: take a pretrained checkpoint, load it into the same training loop that produced it, and continue running gradient descent on a new dataset with a new loss. It works, it is well understood, and for a decade it was the only option. Its drawbacks are practical rather than statistical, and understanding them is how you understand why PEFT exists.

Mechanically, full fine-tuning is an optimization continuation. The optimizer state — in practice Adam or AdamW — is either reinitialised or reloaded from the pretraining checkpoint. The learning rate is almost always much lower than the pretraining rate, typically $10^{-5}$ or $10^{-6}$ rather than the $10^{-4}$ to $10^{-3}$ range used during pretraining. The schedule is usually a short warmup followed by a linear or cosine decay over a small number of epochs. The loss is whatever is appropriate for the target task: cross-entropy on the new text, sometimes with masking that applies the loss only to the response portion (so-called instruction masking), sometimes with a composite loss that includes a regularisation term against the base model.

The mathematical object is simple. Let $\theta$ be the parameters of the base model, $\mathcal{D}_{\text{new}}$ be the fine-tuning dataset, and $\mathcal{L}$ be the task loss. Full fine-tuning solves $\min_\theta \mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{new}}}[\mathcal{L}(y, f_\theta(x))]$ starting from $\theta_0$, the pretrained weights, with a learning rate small enough that we stay in the basin of attraction around $\theta_0$. This last point is crucial: we are not looking for a global minimum of the new objective; we are looking for a nearby local minimum that balances the new objective against the implicit prior encoded in the base weights.

Hyperparameters that actually matter. Learning rate dominates. Epoch count matters (two to five is typical; more than five is almost always overfitting). Batch size, within reasonable bounds, matters less than either. Weight decay is usually set small. The response-only masking trick — computing the loss only on the assistant's tokens, not the user's — is the difference between a well-behaved instruction-tuned model and one that learns to complete arbitrary user strings.
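The response-only masking trick is simple enough to show directly. A minimal pure-Python sketch of per-token negative log-likelihood with user tokens masked out — not a real training loop, and the function name is illustrative:

```python
import math

def masked_nll(token_logprobs, is_response):
    """Mean negative log-likelihood over response tokens only.

    token_logprobs: log p(token_t | prefix) at each position, from the model.
    is_response:    True where the token belongs to the assistant's reply.
    User tokens contribute no loss, so the model is never trained to
    reproduce arbitrary user strings.
    """
    picked = [-lp for lp, m in zip(token_logprobs, is_response) if m]
    return sum(picked) / len(picked)

# Toy example: 5 tokens, the last 2 are the assistant's response.
logprobs = [math.log(0.5)] * 5
mask = [False, False, False, True, True]
loss = masked_nll(logprobs, mask)   # = -log(0.5) ≈ 0.693
```

In framework code the same effect is usually achieved by setting the user-token labels to an ignore index before the cross-entropy call.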

The costs are threefold. First, memory: training requires storing activations for backprop, gradients for every parameter, and optimizer state (two moments for Adam per parameter). In practice the persistent training state comes to roughly 16 bytes per parameter — fp32 master weights, gradients, and two Adam moments — against 2 bytes per parameter for bf16 inference; mixed precision and activation checkpointing trim the activation memory but not the optimizer state. For a 70B-parameter model the training state alone is on the order of a terabyte of GPU memory, which is why sharding (ZeRO, FSDP) is non-optional past a few billion parameters.
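The accounting is worth making concrete. The bytes-per-parameter figures below follow the common Adam mixed-precision recipe and are approximate; exact numbers vary by implementation:

```python
def training_state_gb(n_params, bytes_per_param=16):
    """Weight + gradient + optimizer state, excluding activations.

    16 bytes/param ≈ fp32 master weights (4) + fp32 gradients (4)
    + Adam first moment (4) + Adam second moment (4).
    """
    return n_params * bytes_per_param / 1e9

print(training_state_gb(70e9))   # 1120.0 GB of training state for a 70B model
print(70e9 * 2 / 1e9)            # 140.0 GB for the same model served in bf16
```

The eight-fold gap between the two printed numbers, before activations are even counted, is the memory argument for PEFT in one line.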

Second, storage and deployment: every fine-tuned checkpoint is a full-size model. A 70B model at bf16 is 140 GB. A serving fleet with a thousand variants is 140 TB of model files, and the warm-swap latency between variants is measured in minutes. This is the cost structure that made PEFT economically necessary. Third, forgetting: fine-tuning on a narrow dataset shifts the weights, and the abilities represented by those shifted weights degrade — more on this in the next section.

When is full fine-tuning still right? When the dataset is large (hundreds of thousands of examples or more), when the target distribution is far from the base (a new language, a new code style, a fundamentally different chat format), when you control a small number of variants (one specialised model, not a thousand), and when the storage overhead is acceptable. For small teams serving small numbers of specialised models, full fine-tuning is still a reasonable default. For everyone else, the rest of this chapter is the story of doing less work for almost as much quality.

§3

Catastrophic forgetting

A model fine-tuned on a new task tends to get worse at its old tasks, often dramatically. This phenomenon — observed, named, and analysed since the early nineties — is the oldest failure mode in transfer learning and the one that most shapes modern practice. Every technique in this chapter is, in part, a response to it.

The classical demonstration (McCloskey & Cohen 1989) was on small MLPs: train on task A to convergence, switch to task B, train to convergence on B, re-evaluate on A. Performance on A collapses — often all the way back to chance — because the weights that encoded A have been overwritten by gradient steps driven by the loss on B. The phenomenon scales. Fine-tune a 70B pretrained language model on medical QA and it becomes noticeably worse at code. Fine-tune a code model on Python and it becomes worse at C++. Fine-tune a chat model on a narrow corporate dataset and it forgets how to converse. The degree of forgetting depends on learning rate, number of steps, and dataset breadth, but the direction is robust.

There are three ways to think about why this happens. From a gradient geometry view, the update rule $\theta \leftarrow \theta - \eta\nabla\mathcal{L}_B$ has no knowledge of $\mathcal{L}_A$; any direction that reduces $\mathcal{L}_B$ is taken, even if it increases $\mathcal{L}_A$, because the training objective sees only $B$. From a capacity view, the parameters encode both tasks in a distributed way, and any substantial modification to encode $B$ better necessarily disrupts the subspace used for $A$. From a Bayesian view, the pretraining is a prior and the fine-tuning is a likelihood; catastrophic forgetting is the case where the likelihood dominates the prior, which happens when the fine-tuning signal is strong relative to the regularisation keeping us near $\theta_0$.

The central design question. Every fine-tuning method can be read as a particular answer to: how do we absorb the new task while minimally disrupting what was learned before? Full fine-tuning answers with a very small learning rate and a short training run. LoRA answers by not modifying the base weights at all and training a small side-path instead. Replay answers by mixing old and new data in every batch. Elastic Weight Consolidation answers by adding a regulariser that penalises movement of parameters the old tasks cared about.

The classical mitigations go back to replay and rehearsal. Kirkpatrick et al.'s Elastic Weight Consolidation (EWC, 2017) computes a diagonal Fisher information for parameters on the old task and adds a quadratic penalty $\sum_i F_i(\theta_i - \theta_{0,i})^2$ to the new-task loss, so that parameters the old task used heavily are penalised for moving. Experience replay — training on a mixture of new-task and old-task data — is often the simplest and most effective option when you have the old data. Zenke et al.'s Synaptic Intelligence and the various learning-without-forgetting approaches add logit-level regularisation that keeps the new model's outputs on old inputs close to the old model's.
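The EWC penalty is a one-liner once the diagonal Fisher is in hand. A minimal sketch with flat lists standing in for parameter tensors; the conventional $\lambda/2$ scaling in front of the sum is an added detail not shown in the formula above:

```python
def ewc_penalty(theta, theta0, fisher, lam=1.0):
    """Quadratic penalty (lam/2) * sum_i F_i * (theta_i - theta0_i)^2.

    fisher holds the diagonal Fisher information estimated on the old
    task: parameters the old task relied on (large F_i) are expensive
    to move; parameters it ignored (F_i ≈ 0) move almost freely.
    """
    return 0.5 * lam * sum(f * (t - t0) ** 2
                           for f, t, t0 in zip(fisher, theta, theta0))

# Moving a high-Fisher parameter costs far more than a low-Fisher one.
base = [1.0, 1.0]
print(ewc_penalty([2.0, 1.0], base, fisher=[10.0, 0.01]))  # 5.0
print(ewc_penalty([1.0, 2.0], base, fisher=[10.0, 0.01]))  # 0.005
```

In training, this penalty is simply added to the new-task loss before the backward pass.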

The PEFT mitigation is different and, in most settings, more effective: don't move the base weights at all. If $\theta$ is frozen and only a small set of adapter parameters $\phi$ is trained, then the base model's behaviour on any input that does not use the adapter is mathematically unchanged. This is one of the most underrated practical virtues of LoRA and its relatives: forgetting is bounded by construction. Switching off the adapter at inference recovers the base model exactly.

For alignment-specific fine-tuning there is an extra layer to this story: the alignment training from Chapter 07 is a kind of fine-tuning, and subsequent fine-tuning can forget the alignment even when the new dataset contains nothing adversarial. The Qi et al. 2023 paper on benign fine-tuning erasing safety training — covered in §15 of this chapter — is the canonical demonstration, and it is catastrophic forgetting applied to the safety distribution specifically.

§4

A taxonomy of parameter-efficient fine-tuning

PEFT is an umbrella term for methods that fine-tune a small fraction of parameters, typically under one percent, while reaching most of the quality of full fine-tuning. A good mental map of the field has three families — additive, selective, and reparameterised — and each answers a different question about where the adaptation capacity should live.

The taxonomy, introduced in more or less its modern form by the Hugging Face PEFT library and the Delta Tuning review (Ding et al. 2022), partitions methods by the structural relationship between the trainable parameters $\phi$ and the frozen base parameters $\theta_0$:

  1. Additive methods. Add new trainable parameters $\phi$ to the network, typically as small modules inserted between layers or as prefix embeddings prepended to attention keys and values. The base model is completely frozen. Training modifies only $\phi$. The adapters of Houlsby 2019, prefix tuning, prompt tuning, and P-tuning all fall here.
  2. Selective methods. Train a strict subset of the existing parameters $\theta$ — for example, only the bias vectors (BitFit, Zaken et al. 2022) or only learned rescaling vectors on attention keys, values, and FFN activations (IA³, Liu et al. 2022). No new modules are inserted; the trainable set is simply $\{\theta_i : i \in S\}$ for some sparse subset $S$.
  3. Reparameterised methods. Rewrite the update $\Delta\theta$ in a constrained form and train the reparameterisation. LoRA is the archetype: $\Delta W = BA$ for rank-$r$ matrices $B\in\mathbb{R}^{d\times r}$, $A\in\mathbb{R}^{r\times d}$, with $r \ll d$. The underlying weight $W_0$ is frozen, the update $\Delta W$ is constrained to be low-rank, and only $B$ and $A$ are trained. At inference, $W_0 + BA$ can be merged into a single matrix, making reparameterised methods especially well-suited to deployment.
The blurred boundaries. The three families are not exclusive. IA³ is selective and additive in a weak sense (learned scales are added to activations). Some LoRA variants (DoRA, AdaLoRA) blur the line between reparameterised and selective. The taxonomy is a useful mental model, not a partition.

The practical question — which family to reach for first — has a clear answer in 2026: reparameterised, specifically LoRA or one of its variants. LoRA dominates the PEFT literature because it matches or exceeds the quality of other PEFT methods on almost every benchmark, it requires no architectural changes to the backbone, it merges into the base weights for inference so there is no runtime overhead, and the ecosystem — training libraries, inference servers, adapter registries — is now built around it. Adapters (additive) are the historical ancestor and still appear in multimodal and vision work. Selective methods (BitFit, IA³) are niche but valuable when parameter count must be kept extremely small.

There is also a fourth, cross-cutting axis that the literature sometimes distinguishes: where in the network the adaptation lives. Most methods attach to the attention projections (Q, K, V, O) because the attention computation is where most of the task-specific behaviour lives. Some attach to the FFN. A few — particularly prefix and prompt tuning — attach at the input. The empirical finding is that attention-projection adaptation is the most parameter-efficient of these choices for language tasks; FFN adaptation matters more for code and highly specialised domains.

Across all families, the shared empirical result is the same, and it is remarkable: for moderate-sized training sets (thousands to a few million examples), updating one percent or fewer of the parameters reaches within a couple of percentage points of updating all of them. The rank-bottleneck hypothesis — that the intrinsic dimensionality of fine-tuning updates is small, so a low-rank update is nearly as expressive as a full-rank one — is the best current explanation, and §7 on LoRA is where it is made precise.

§5

Adapters — the Houlsby bottleneck

Adapters were the first widely-used parameter-efficient method, predating LoRA by two years. They introduced the template that everything since has followed: freeze the base, insert a small trainable module with a bottleneck, train only the module. LoRA is arguably an adapter with the bottleneck moved and constrained; understanding the original makes the rest of the story easier to follow.

The canonical adapter (Houlsby et al. 2019) inserts, after every attention block and every FFN block in the transformer, a small two-layer network: a down-projection from hidden dimension $d$ to a bottleneck dimension $b \ll d$, a nonlinearity (originally GELU), an up-projection back to $d$, and a residual connection around the whole thing. Schematically, for a hidden state $h$:

h_out = h + W_up · σ(W_down · h)

with $W_{\text{down}} \in \mathbb{R}^{b\times d}$, $W_{\text{up}} \in \mathbb{R}^{d\times b}$. The up-projection is initialised to zero so the adapter is identity at the start of training and only develops as gradients flow through it. The number of new parameters per adapter is about $2bd$; with $b$ in the tens, this is a small fraction of the $\sim 12d^2$ parameters in a single transformer block (four $d\times d$ attention projections plus an FFN of roughly $8d^2$).
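The bottleneck-with-residual shape, and the identity-at-initialisation property, fit in a few lines. A pure-Python sketch on a single hidden vector with toy dimensions (no framework; the tanh GELU approximation stands in for the original nonlinearity):

```python
import math

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1 + math.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def adapter(h, W_down, W_up):
    """h + W_up . gelu(W_down . h): bottleneck inside a residual connection."""
    z = [gelu(v) for v in matvec(W_down, h)]
    return [hi + vi for hi, vi in zip(h, matvec(W_up, z))]

d, b = 4, 2
W_down = [[0.1] * d for _ in range(b)]    # d -> b down-projection
W_up   = [[0.0] * b for _ in range(d)]    # zero-initialised up-projection
h = [1.0, 2.0, 3.0, 4.0]
assert adapter(h, W_down, W_up) == h      # identity before any training
```

The final assertion is the residual-safety property from the text: with $W_{\text{up}}$ at zero, the adapted forward pass is exactly the base forward pass.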

The design is explicit about residual safety. Because the up-projection starts at zero, the initial forward pass is identical to the base model; training perturbs this gradually. Because the adapter is placed inside a residual block that already has a skip connection, the gradient path around the adapter is clean. These choices — initialise to identity, train inside a residual block, use a bottleneck — show up in every PEFT method since.

Pfeiffer et al. 2020 — AdapterHub. The most important follow-up to Houlsby was not a new method but an infrastructure project: AdapterHub demonstrated that adapters for dozens of tasks could be catalogued, shared, and composed. The eventual move from adapters to LoRA as the dominant PEFT method is in part an infrastructure story — LoRA's merge-back-to-base property makes serving much easier — rather than a quality story.

Quality-wise, adapters match or slightly trail full fine-tuning on GLUE and SuperGLUE, and on most downstream tasks. The gap is small (one or two points) and within the noise of a careful full fine-tune. The parameter overhead is a few percent per task. The runtime overhead is non-trivial, because the adapter layers run in addition to the base computation — there is no merge-back trick — which is one of the practical reasons LoRA eventually displaced them.

Two refinements are worth knowing. Parallel adapters (He et al. 2021) move the adapter alongside the FFN rather than after it, letting the adapter and FFN be computed in parallel and eliminating some of the sequential overhead. Compacter (Mahabadi et al. 2021) further reduces parameter count by factorising $W_{\text{down}}$ and $W_{\text{up}}$ as Kronecker products. Neither is widely deployed in LLM practice today, but both informed the LoRA family that followed.

Where adapters are still used: multimodal work (vision–language adapters, cross-lingual adapters) and architectures where LoRA's attention-projection focus is too narrow. For most text-only LLM fine-tuning in 2026, LoRA has replaced them. Adapters remain the cleanest conceptual introduction to PEFT, which is why they come before LoRA in this chapter.

§6

Prefix, prompt, and P-tuning — soft embeddings as the adaptation surface

A different answer to the PEFT question: instead of adding modules to the network, add trainable embeddings to the input. The model is entirely frozen; the adaptation is a handful of vectors prepended to the token sequence — vectors that have no corresponding natural-language tokens and that the model learns to interpret. This family of methods (prefix tuning, prompt tuning, P-tuning) was briefly dominant around 2021 and remains a useful tool for specific settings.

Prompt tuning (Lester et al. 2021) is the simplest. Prepend $k$ learned vectors $P \in \mathbb{R}^{k\times d}$ to the token embeddings of the input. Freeze the model. Train only $P$ on the downstream loss. The number of new parameters is $kd$; for $k=20$ and $d=4096$ this is about 80K parameters — two orders of magnitude smaller than even a small LoRA adapter. The finding in the original paper: at sufficient base-model scale (past roughly 10B parameters), prompt tuning matches full fine-tuning on SuperGLUE. Below that scale, it underperforms noticeably.
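Mechanically this is just concatenation. A sketch with the dimensions from the text (the embedding values are placeholders; only the shapes matter):

```python
def prepend_soft_prompt(P, token_embeddings):
    """Prompt tuning: learned vectors P are prepended to the embedded
    input, so the frozen model sees a sequence k positions longer."""
    return P + token_embeddings

k, d = 20, 4096
P = [[0.0] * d for _ in range(k)]        # the only trainable parameters
tokens = [[1.0] * d for _ in range(7)]   # 7 embedded input tokens (placeholder values)
full = prepend_soft_prompt(P, tokens)

assert len(full) == k + 7
print(k * d)   # 81920 trainable parameters — the "about 80K" in the text
```

Note that gradients flow back to $P$ through the full frozen network; only the update is restricted to those $kd$ values.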

Prefix tuning (Li & Liang 2021) is a more powerful variant. Instead of prepending to the input embeddings only, it prepends learnable vectors to the key and value matrices at every attention layer. Each layer has its own set of prefix vectors. This gives much more capacity than prompt tuning (hundreds of thousands to millions of parameters instead of tens of thousands), and matches full fine-tuning reliably at smaller base scales.

P-tuning (Liu et al. 2021) and P-tuning v2 (Liu et al. 2022) merged the two ideas. v1 used an LSTM or MLP to parameterise the soft prompt so it could be more expressive; v2 converged on the prefix-tuning-style per-layer prompts. P-tuning v2 was the immediate predecessor to the dominance of LoRA.

Why this family lost ground. Prefix tuning works, and for very small adaptation budgets it still works well. But it has three drawbacks compared to LoRA. The prefix takes up context length, reducing the budget available for actual inputs. The prefix is always active, so there is no way to serve base and adapted behaviour on the same request without recomputing. And the gradient flow through softmax attention to the prefix is numerically tricky, making training less stable than LoRA training. LoRA avoided all three.

There is a more abstract observation that the prefix-tuning family made clear: the adaptation interface of a transformer need not be the weights. Any path that reaches the attention computation — adjusting the inputs, adjusting the keys and values, adjusting the biases — is a viable adaptation surface, and different surfaces have different trade-offs in capacity, stability, and serving cost. This framing made it easier, once LoRA appeared, to see it as yet another choice of adaptation surface rather than something categorically different.

Prefix and prompt tuning retain a few use cases in 2026. They are nearly free in memory and storage. They compose trivially with other methods. They are the natural fit when the adaptation is very mild — a change of style, a preferred output format — rather than a substantive capability addition. Research in soft prompt retrieval (matching learned prefixes to queries at inference time) keeps the family alive in some retrieval-augmented setups, but the dominant PEFT method is now LoRA.

§7

LoRA — low-rank adaptation

LoRA is the single most influential fine-tuning method of the post-transformer era. Published by Hu et al. in 2021, widely adopted by 2022, and dominant by 2024, it is now the default PEFT method: the one that beginners start with, the one that infrastructure is built around, and the one that the rest of the PEFT literature compares itself to. Understanding why it works so well, and where its assumptions break, is the most important single topic in this chapter.

The core idea is a constraint on the shape of the weight update. For a weight matrix $W_0 \in \mathbb{R}^{d\times d}$ (say, an attention projection matrix), full fine-tuning learns an update $\Delta W \in \mathbb{R}^{d\times d}$, with $d^2$ parameters. LoRA assumes that $\Delta W$ has low rank — that is, $\Delta W = BA$ where $B \in \mathbb{R}^{d\times r}$ and $A \in \mathbb{R}^{r\times d}$ for some $r \ll d$ (typically $r = 8$ or $r = 16$). The number of trainable parameters drops from $d^2$ to $2rd$; for $d = 4096$ and $r = 8$ this is $\sim 16 \text{M} \to \sim 65 \text{K}$, a 250× reduction.

The forward pass is:

h_out = W_0 · x + (α/r) · B · (A · x)

where $\alpha$ is a scalar hyperparameter (LoRA alpha) that controls the magnitude of the adapter's contribution relative to the base. Training updates only $A$ and $B$; $W_0$ is frozen. At initialisation, $A$ is sampled from a small Gaussian and $B$ is zero, so the adapter contributes nothing until training begins — the identity-initialisation trick reused from adapters.
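The forward pass fits in a dozen lines of pure Python. Toy dimensions, with the constant-valued $A$ standing in for the small Gaussian initialisation for brevity; a real implementation would batch this as matrix multiplies in a framework:

```python
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def lora_forward(x, W0, A, B, alpha, r):
    """W0 . x + (alpha/r) * B . (A . x); W0 frozen, only A and B trained."""
    base = matvec(W0, x)
    delta = matvec(B, matvec(A, x))
    return [b + (alpha / r) * dv for b, dv in zip(base, delta)]

d, r, alpha = 4, 2, 4                     # alpha = 2r, a common convention
W0 = [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]  # frozen base
A  = [[0.01] * d for _ in range(r)]       # r x d, small-valued init
B  = [[0.0] * r for _ in range(d)]        # d x r, zero init

x = [1.0, 2.0, 3.0, 4.0]
assert lora_forward(x, W0, A, B, alpha, r) == matvec(W0, x)  # identity at init
```

The final assertion is the same identity-at-initialisation property as in adapters: with $B = 0$, the side path contributes nothing and the base model is untouched.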

The intrinsic-dimensionality conjecture. The empirical observation behind LoRA is that fine-tuning updates, when projected to a low-rank subspace, lose very little quality. Aghajanyan et al. 2020 (Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning) showed that a random projection to as few as 200 dimensions was enough to match full fine-tuning on several GLUE tasks for RoBERTa. LoRA turns this observation into a training method. The why is still an open question in 2026 — something about the geometry of loss landscapes near well-pretrained checkpoints — but the effect itself is robust.

LoRA has several practical virtues that combined to make it dominant. First, merge-back at inference: because $W_0 + \frac{\alpha}{r}BA$ is just another $d\times d$ matrix, you can bake the LoRA into the base weights when deploying, producing a single matrix with the same runtime cost as the original. This contrasts with adapters, which add sequential ops. Second, small artefacts: a LoRA for a 7B model at $r=8$ is about 16 MB, compared to 14 GB for the full model. Thousands of LoRAs fit on a single disk. Third, training stability: LoRA trains with surprisingly little hyperparameter tuning, compared to full fine-tuning or prefix tuning. Learning rates in the $10^{-4}$ to $10^{-3}$ range (higher than full fine-tuning, because we are not worried about displacing the base) usually work.
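
The merge-back identity is easy to verify numerically — a NumPy sketch with toy shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, alpha = 64, 8, 16
W0 = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))

# fold the trained adapter into the base weight: one d x d matrix again
W_merged = W0 + (alpha / r) * (B @ A)

x = rng.standard_normal(d)
adapter_path = W0 @ x + (alpha / r) * (B @ (A @ x))
assert np.allclose(W_merged @ x, adapter_path)
```

Once merged, serving cost is identical to the base model; keeping the small adapter file around also lets you unmerge later by subtracting $\frac{\alpha}{r}BA$.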

The hyperparameter discipline is straightforward:

  1. Target modules — which matrices to apply LoRA to. The default for most codebases is the attention Q, K, V, O projections; applying to the FFN projections as well tends to help on harder tasks at the cost of more parameters.
  2. Rank $r$ — higher rank means more capacity and more parameters. Most reported results are in the $r=8$ to $r=64$ range; lower $r$ is preferred when the task is close to the base, higher when it is far.
  3. Alpha $\alpha$ — the scaling factor; a common choice is $\alpha = 2r$ (e.g. $r=16, \alpha=32$).
  4. Dropout — applied on the adapter path, typically 0.05 to 0.1.
  5. Learning rate — start at $2\times 10^{-4}$ and adjust from there.
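
In code, these choices typically land in a single config object. A sketch using the Hugging Face peft library (argument names follow its documented LoraConfig API; the module names match LLaMA-style checkpoints and vary by architecture):

```python
from peft import LoraConfig

config = LoraConfig(
    r=16,
    lora_alpha=32,                 # the alpha = 2r convention
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
)
```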

Where LoRA breaks down: if the pretrained base model is genuinely bad at the task — the intrinsic dimensionality is high — LoRA cannot compensate at reasonable rank. The canonical example is teaching a pretrained model a new natural language it genuinely doesn't speak; this is a full-fine-tune or continued-pretraining problem, not a LoRA problem. A second failure is when the fine-tuning dataset is extremely large (many millions of examples): LoRA's capacity becomes the bottleneck, and full fine-tuning regains its edge. Inside those boundaries, LoRA is as close to a free lunch as fine-tuning research has produced.

§8

QLoRA — fine-tuning a 65B model on a single GPU

QLoRA (Dettmers et al. 2023) is the paper that made fine-tuning large models accessible. By quantising the frozen backbone to 4 bits, keeping the LoRA adapters in fp16, and introducing a memory-management trick called paged optimizers, QLoRA demonstrated that a 65-billion-parameter model could be fine-tuned on a single 48 GB GPU — a task that had required a dedicated cluster a year earlier. The methods it introduced are now standard in every open-source PEFT library.

The problem QLoRA solves is memory. A 65B model at fp16 is 130 GB of weights; with activations, gradients (on the LoRA params only), and optimizer state, a LoRA run still needs more than 150 GB of GPU memory. The weights dominate. If the weights are held in the forward pass only as a reference to multiply against (we are not training them), their inference precision is the only thing that matters for correctness, and inference precision can be much lower than training precision.

The three tricks, in rough order of importance:

  1. 4-bit NormalFloat (NF4) quantization of the frozen backbone. QLoRA introduces a new 4-bit floating-point format, NF4, designed so that its representable values match the empirical distribution of weights in a pretrained transformer (approximately zero-mean Gaussian). NF4 quantization compresses the 130 GB base down to about 33 GB with minimal quality loss; the dequantization-then-multiply is done on-the-fly during the forward pass.
  2. Double quantization. The 4-bit format itself requires per-block scaling constants (one fp32 constant per block of 64 weights). QLoRA quantizes those scaling constants too — an 8-bit quantization of the 32-bit scales — saving about 0.37 bits per weight on average.
  3. Paged optimizers. Using NVIDIA's unified memory, optimizer state for the LoRA parameters is automatically paged between GPU and CPU memory during gradient update spikes, avoiding OOM errors on transient peaks.

Why 4 bits, not 8. 8-bit quantization was well understood before QLoRA (bitsandbytes 8-bit quantization). The QLoRA contribution is that 4-bit turns out to be sufficient for the frozen-backbone-plus-LoRA setup, specifically. The LoRA adapters, which are fp16, absorb most of the precision noise introduced by the 4-bit base. This would not work in a full-fine-tuning setup; it works in QLoRA because only the adapters are being trained.
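
The blockwise quantization in trick 1 can be sketched with a uniform 4-bit grid standing in for NF4 (real NF4 spaces its 16 levels to match a Gaussian; helper names are mine):

```python
import numpy as np

GRID = np.linspace(-1.0, 1.0, 16)   # uniform 4-bit grid; NF4 instead spaces
                                    # its 16 levels to match a Gaussian

def quantize_blockwise(w, block=64):
    """Absmax 4-bit quantization: one code per weight, one scale per block."""
    w = w.reshape(-1, block)
    scales = np.abs(w).max(axis=1, keepdims=True)     # fp32 scale per block
    codes = np.abs((w / scales)[..., None] - GRID).argmin(axis=-1)
    # double quantization would now quantize `scales` to 8 bits as well
    return codes.astype(np.uint8), scales

def dequantize(codes, scales):
    # in QLoRA this happens on the fly, inside the forward pass
    return (GRID[codes] * scales).ravel()

rng = np.random.default_rng(0)
w = rng.standard_normal(4096)
codes, scales = quantize_blockwise(w)
err = np.abs(w - dequantize(codes, scales))
assert err.max() < 0.3    # lossy, but the fp16 LoRA path absorbs the noise
```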

The quality result is the key finding. On a battery of benchmarks (MMLU, Vicuna-style chat evaluations, HumanEval, etc.), QLoRA matches the quality of fp16 LoRA fine-tuning, which in turn matches the quality of full fp16 fine-tuning. The information loss from 4-bit quantization shows up in the base model's raw capabilities, but not in the fine-tuning signal the LoRA captures. This is a genuinely surprising empirical result that is still not fully understood theoretically in 2026.

QLoRA made a second research contribution: the Guanaco family of open chat models, produced by QLoRA-fine-tuning LLaMA-65B on a curated chat dataset and released with the paper. Guanaco established that the democratised fine-tuning recipe could produce models competitive with much more expensively trained ones, and it kicked off the wave of community fine-tunes (Airoboros, Hermes, Dolphin, many others) that characterised open-source LLM practice in 2023–2024.

The broader lesson is that quantization and PEFT compose. Quantize aggressively for memory, adapt parametrically with high precision, and the two techniques attack disjoint parts of the cost structure. This composition pattern — frozen quantized backbone plus high-precision adapter — is now the default for open-source fine-tuning of large models, and §11 covers the quantization side of it in more depth.

§9

LoRA variants — DoRA, AdaLoRA, VeRA, LoRA+, PiSSA, ReLoRA

LoRA's simplicity invited modification. By 2026 there is a small zoo of LoRA variants, each addressing a perceived limitation of the original: fixed rank, fixed scaling, the direction/magnitude coupling, slow convergence on small learning rates, or the rigid initialisation scheme. A handful have accumulated enough evidence to be worth knowing; the rest are interesting but peripheral.

DoRA (Weight-Decomposed Low-Rank Adaptation, Liu et al. 2024) decomposes each weight matrix into a direction (unit-norm) and a magnitude (a scalar per column), and applies LoRA only to the direction while training the magnitude separately. The decomposition $W = m \cdot (V / \|V\|_c)$, where $m$ is a vector of column magnitudes and $V$ is the direction, is analytically motivated: it mirrors how pretrained weights actually look, with magnitudes and directions varying relatively independently. DoRA consistently outperforms LoRA by a point or two on downstream tasks, with negligible additional parameters, and is the LoRA variant most likely to supplant the original in the next few years.

AdaLoRA (Zhang et al. 2023) addresses the fixed-rank problem. Different weight matrices contribute differently to adaptation; some need more rank than others. AdaLoRA gives every matrix a shared rank budget and learns, via an SVD-style parameterisation with importance scores, how to allocate rank adaptively. The theory is clean; the practical gains over well-tuned uniform-rank LoRA are real but modest, and the additional complexity has limited its adoption.

VeRA (Vector-based Random-matrix Adaptation, Kopiczko et al. 2023) reduces parameter count further by sharing random matrices across layers and only training per-layer scaling vectors. The matrices $A$ and $B$ in LoRA are replaced by fixed random matrices (shared across the whole network) scaled by per-layer learned vectors. VeRA reaches within a small margin of LoRA at 10× fewer parameters, which matters for very-low-budget deployment.

LoRA+ (Hayou et al. 2024) is a hyperparameter tweak rather than an architectural change: use different learning rates for $A$ and $B$, specifically a higher rate for $B$. The theoretical argument (based on the NTK analysis of LoRA dynamics) suggests a rate ratio of $\sim d/r$, i.e. much higher for $B$. In practice this gives a measurable speedup and slight quality gain. It costs nothing to adopt and is now the default in several open-source trainers.
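
The mechanics can be illustrated on a toy least-squares objective (plain NumPy; the ratio of 16 and the deterministic init are illustrative — LoRA+ leaves the exact ratio as a tuned hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4
W0 = rng.standard_normal((d, d))      # frozen base
A = np.full((r, d), 0.1)              # deterministic toy init
B = np.zeros((d, r))

lr_A = 0.03
lr_B = 16 * lr_A                      # LoRA+: a much larger rate for B

x = np.ones(d) / np.sqrt(d)
target = rng.standard_normal(d)

for _ in range(200):
    h = W0 @ x + B @ (A @ x)
    err = h - target                  # gradient of 0.5 * ||h - target||^2
    grad_B = np.outer(err, A @ x)     # dL/dB
    grad_A = np.outer(B.T @ err, x)   # dL/dA
    B -= lr_B * grad_B
    A -= lr_A * grad_A

# the asymmetric rates fit the target on this toy problem
assert np.linalg.norm(W0 @ x + B @ (A @ x) - target) < 1e-2
```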

PiSSA (Principal Singular Values and Singular Vectors Adaptation, Meng et al. 2024) replaces LoRA's random-then-zero initialisation with an SVD of the base weights: initialise $B$ and $A$ from the top-$r$ singular components of $W_0$, and subtract those components from the frozen base. The intuition is that the most important directions of $W_0$ are the ones that most benefit from fine-tuning, so initialising the adapter to capture those directions produces better starting gradients. PiSSA converges faster than LoRA and reaches similar or slightly better final quality.
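
The initialisation itself is a few lines of linear algebra (NumPy sketch; toy sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 8
W0 = rng.standard_normal((d, d))

# SVD-based init: the adapter starts in W0's top-r singular subspace
U, S, Vt = np.linalg.svd(W0, full_matrices=False)
B = U[:, :r] * np.sqrt(S[:r])            # d x r
A = np.sqrt(S[:r])[:, None] * Vt[:r]     # r x d
W_res = W0 - B @ A                       # frozen residual base

# the decomposition is exact at init: residual plus adapter reproduces W0
assert np.allclose(W_res + B @ A, W0)
# and BA is exactly the best rank-r approximation of W0 (Eckart-Young)
assert np.allclose(B @ A, U[:, :r] @ np.diag(S[:r]) @ Vt[:r])
```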

ReLoRA — LoRA for pretraining. ReLoRA (Lialin et al. 2023) is the most interesting variant philosophically. It uses LoRA-style low-rank updates during pretraining rather than fine-tuning, periodically merging the LoRA into the base and re-initialising. The motivation is to capture some of LoRA's memory efficiency during the expensive pretraining phase. It works — one can pretrain competitive models with substantial memory savings — and it hints that the rank-bottleneck hypothesis is not purely about fine-tuning.

A reasonable default policy in 2026: start with standard LoRA + LoRA+ learning rates. If you have budget for experimentation, try DoRA. For very-low-parameter budgets, consider VeRA. AdaLoRA and PiSSA are worth knowing about but rarely the right first choice. Do not spend weeks selecting among LoRA variants; the biggest quality lever is still the dataset.

§10

Selective methods — BitFit and IA³

The selective family of PEFT methods trains only a pre-specified subset of existing parameters — no new parameters added, no decompositions imposed. The resulting adapters are tiny, sometimes two orders of magnitude smaller than LoRA, and they reach surprising quality on many tasks. They have stayed a research curiosity rather than a production default, but they are worth understanding for the cases where parameter count truly matters.

BitFit (Zaken, Ravfogel & Goldberg 2022) is the simplest method in this chapter: freeze everything except the bias vectors, and train only those. A transformer has bias parameters in every linear layer (Q, K, V, O projections, FFN up-and-down projections, layer-norm biases), totaling perhaps 0.05% of the full parameter count. Training only those and freezing the rest recovers, on GLUE and related benchmarks, around 95% of the quality of full fine-tuning.

The result is remarkable and slightly mysterious. Biases are a tiny, relatively simple part of the parameter space: shifts on intermediate activations with no interaction with inputs. The fact that they are sufficient to steer the model's behaviour substantially is evidence for a broader structural hypothesis about pretrained models: most of the capability lives in the large weight matrices, and the fine-tuning signal is largely about routing that capability — deciding what to compute — which can be done by shifting activations rather than changing transformations.

IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations, Liu et al. 2022) is a closely related method. It introduces three learned vectors per transformer block — one scaling the key vectors, one scaling the value vectors, one scaling the hidden states in the FFN — and multiplies those vectors element-wise with the activations they correspond to. The parameter count is even smaller than BitFit (on the order of 0.01%), and the quality is roughly comparable. IA³ was the method used in the T-Few few-shot-learning paper, which demonstrated that PEFT with very few parameters could match in-context learning on many benchmarks at a fraction of the inference cost.
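
A minimal sketch of the IA³ rescaling (NumPy, single attention head, toy sizes; the three vectors are the only trainable parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq = 16, 64, 4

# the three learned vectors per block, initialised to ones (a no-op)
l_k = np.ones(d_model)
l_v = np.ones(d_model)
l_ff = np.ones(d_ff)

def attend(q, k, v):
    k, v = k * l_k, v * l_v                  # element-wise rescaling
    w = q @ k.T / np.sqrt(d_model)
    w = np.exp(w - w.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

def ffn(h, W_up, W_down):
    return (np.maximum(h @ W_up, 0.0) * l_ff) @ W_down

q, k, v = (rng.standard_normal((seq, d_model)) for _ in range(3))
W_up = rng.standard_normal((d_model, d_ff))
W_down = rng.standard_normal((d_ff, d_model))

out = ffn(attend(q, k, v), W_up, W_down)
assert out.shape == (seq, d_model)
# trainable parameters per block: 2 * d_model + d_ff = 96 here
assert l_k.size + l_v.size + l_ff.size == 2 * d_model + d_ff
```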

Selective methods are under-used. For latency-sensitive serving where adapter overhead must be near-zero, for storing thousands of task-specific adapters on small devices, or for cases where the fine-tuning signal is weak enough that even a few-hundred-parameter adapter can capture it, BitFit and IA³ are the right tool. They are rarely the right first choice for a production fine-tuning project — LoRA is almost always better on the mid-size tasks — but they earn their place on the PEFT menu.

A research observation about selective methods: they are much more sensitive than LoRA to which parameters are selected. BitFit's choice of biases is well-motivated; ablations that include FFN scales but exclude biases are worse, not better. The design space of selective methods is essentially combinatorial over subsets of parameters, and the finding is that a small number of principled subsets (biases, FFN scales, attention scales) work well and arbitrary random subsets do not. This is in contrast to LoRA, where the rank constraint is the main choice and the target modules are relatively interchangeable.

Selective methods also compose well with LoRA. One can train BitFit biases and a small LoRA simultaneously, giving a two-level adaptation: the biases handle routing-like changes, the LoRA handles small capacity additions. This combined setup is sometimes called (IA³)×LoRA in the literature and has appeared as a default in a few open-source trainers. It is a reasonable choice when you have very little training data and want to maximise the chance of absorbing its signal without overfitting.

§11

Quantization for training and inference

Quantization is the sibling topic to PEFT: it attacks the memory and compute cost of LLMs from the numerical-precision side rather than the parameter-count side. The two compose — QLoRA being the canonical example — and understanding quantization is increasingly part of understanding what fine-tuning costs and what it produces.

The space has three axes: when the quantization happens (post-training vs. training-aware), what is quantized (weights, activations, both), and how the numerical representation is structured (uniform integer, floating-point with few bits, learned codebook). The nomenclature across these axes:

Post-training quantization (PTQ). Take a model trained in fp16 or bf16 and quantize the weights after the fact, usually to 8, 4, or sometimes 3 bits. Two methods dominate in 2026. GPTQ (Frantar et al. 2022) quantizes weights one group at a time, using second-order (Hessian-based) information about the layer's reconstruction error, together with per-channel scaling, to compensate as it goes — reaching 4-bit with minimal quality loss. AWQ (Lin et al. 2023, Activation-aware Weight Quantization) identifies the small fraction of weight channels that the activation magnitudes say matter most, and protects them by rescaling before quantization. Both are used at inference; neither requires retraining.

Quantization-aware training (QAT). Simulate quantization during training, so the model learns weights that round well at the target precision. QAT costs more than PTQ but recovers more quality at aggressive bit-widths. For LLMs the QAT story is less developed than for smaller vision models, largely because training at low precision is expensive for LLM-scale pretraining; most LLM quantization in practice is PTQ.

Activation quantization. Quantizing weights is the easier half; activations are harder because they have outliers. SmoothQuant (Xiao et al. 2022) migrates activation-scale variance into weights (through a mathematically equivalent reparameterisation), making activations easier to quantize. LLM.int8() (Dettmers et al. 2022) handles activation outliers by keeping a small fraction of matrix multiplications in fp16 and the rest in int8.
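
The SmoothQuant reparameterisation is easy to verify numerically. The sketch below uses a simplified smoothing factor $s_j = \max|X_j|^{1/2}$ (the paper's formula also balances against the weight magnitudes):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 16))
X[:, 3] *= 50.0                     # an outlier activation channel
W = rng.standard_normal((16, 4))

# per-channel smoothing factors migrate activation scale into the weights
s = np.abs(X).max(axis=0) ** 0.5    # simplified; paper: |X|^a / |W|^(1-a)
X_s = X / s                         # activations become easier to quantize
W_s = W * s[:, None]                # weights absorb the scale

# mathematically equivalent reparameterisation: the product is unchanged
assert np.allclose(X @ W, X_s @ W_s)
# the outlier channel's dynamic range has been tamed
assert np.abs(X_s).max() < np.abs(X).max()
```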

Mixed precision during training. The other kind of quantization, one that predates all of the above, is mixed-precision training — keeping the master weights in fp32, activations in fp16 or bf16, and rescaling as needed. The 2017 Mixed Precision Training paper (Micikevicius et al.) made this routine; it is now the default for almost all deep learning, including fine-tuning runs. bf16 has largely displaced fp16 in LLM training because its wider exponent range eliminates the loss-scaling headache.

Numerical formats matter. The landscape in 2026 includes: fp32 (legacy, rarely used for LLMs), bf16 (training default), fp16 (older GPUs, dynamic range issues), fp8 (new, with E4M3 and E5M2 variants — H100/B200 hardware), int8 (inference serving), NF4 (QLoRA — 4-bit with a distribution tailored to neural-net weights), int4 (inference, AWQ/GPTQ), int2 and int1 (extreme, research only). The precision-quality curve is roughly linear from fp16 down to int4; below that it steepens sharply.

For fine-tuning specifically, the practical recipe in 2026 is: train in bf16 if the model fits; use QLoRA (NF4 base + fp16 LoRA) if it doesn't; quantize for inference with GPTQ or AWQ when deploying; keep the LoRA adapters in fp16 even when the base is int4, because the small size of the adapters makes their precision nearly free. This stack compresses a 70B fine-tuning run from ~500 GB of GPU memory (full fp16) down to ~50 GB (QLoRA) and keeps most of the quality.

Quantization is where this chapter most clearly blurs into inference engineering. The fine-tuning artefact you produce matters less than the quantized artefact you deploy; cross-checking quality at the inference precision is a routine part of any modern PEFT pipeline. A model that evaluates well in bf16 but degrades noticeably at int4 — which happens — might still be the right choice, but the degradation should be measured and budgeted.

§12

Task vectors and task arithmetic

If fine-tuning a model produces a checkpoint, the difference between the fine-tuned checkpoint and the base is an artefact worth studying in its own right. Ilharco et al. 2022 (Editing Models with Task Arithmetic) showed that these difference vectors — task vectors — have surprisingly useful algebraic properties. They add. They subtract. They sometimes commute. The finding opened a small research subfield and is the conceptual basis for the model-merging methods in the next section.

The definition is exactly what it sounds like. Let $\theta_0$ be the base pretrained weights and $\theta_t$ be the weights after fine-tuning on task $t$. The task vector is $\tau_t = \theta_t - \theta_0$. Task arithmetic is the operation of combining task vectors algebraically and then adding the result back to $\theta_0$ to produce a new model: $\theta_{\text{new}} = \theta_0 + \sum_i \alpha_i \tau_i$ for some coefficients $\alpha_i$.
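
The bookkeeping (though not the emergent semantics) fits in a few lines — a NumPy sketch over toy state dicts:

```python
import numpy as np

def task_vector(theta_ft, theta_base):
    return {k: theta_ft[k] - theta_base[k] for k in theta_base}

def apply_vectors(theta_base, taus, alphas):
    return {k: theta_base[k] + sum(a * t[k] for a, t in zip(alphas, taus))
            for k in theta_base}

rng = np.random.default_rng(0)
base = {"W": rng.standard_normal((4, 4))}
# stand-ins for two fine-tuned checkpoints of the same base
ft_code = {"W": base["W"] + 0.1 * rng.standard_normal((4, 4))}
ft_toxic = {"W": base["W"] + 0.1 * rng.standard_normal((4, 4))}

tau_code = task_vector(ft_code, base)
tau_toxic = task_vector(ft_toxic, base)

# add a capability, subtract an unwanted one, with dampened coefficients
merged = apply_vectors(base, [tau_code, tau_toxic], alphas=[0.8, -0.8])
assert merged["W"].shape == (4, 4)
```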

The empirical findings, in order of strangeness:

  1. Adding task vectors adds capabilities. $\theta_0 + \tau_{\text{code}} + \tau_{\text{translate}}$ is reliably better at both code and translation than the base, and often close to the quality of training on a mix of both.
  2. Negating a task vector removes the capability. $\theta_0 - \tau_{\text{toxic}}$ reduces the model's propensity to generate toxic text below the base's level — the model has been edited to be less capable of this behaviour than it was originally.
  3. Analogies work. $\tau_{\text{french}} - \tau_{\text{english}} + \tau_{\text{code}}$ produces (in tightly controlled experiments) a model that does code generation in French-style contexts. The analogy arithmetic is less robust than the first two results, but it works often enough to be meaningful.

Why this is surprising. Neural network weight space is, a priori, not a vector space with any particular semantics; the manifold of trained models is thin, weird, and full of symmetries. The fact that you can naively add difference vectors and get sensible models is evidence that pretraining confines fine-tuning updates to a locally linear region of parameter space, and that the directions within that region have approximately semantic meaning. This is a strong empirical claim about the geometry of modern LLM training, and it has become the topic of active research.

The practical implications are immediate. If you have task-specific fine-tunes of the same base, you can combine them into a multi-task model without a multi-task training run. If you have a model with a behaviour you want to suppress, and you can find or produce a fine-tune that enhances the behaviour, subtracting that fine-tune produces a debiased model. If you have a fine-tune on your data that accidentally introduced a regression on some other capability, you can look at whether the regression axis is a task vector you can subtract.

The caveats are also immediate. Task vectors only compose well when the fine-tunes started from the same base and used similar hyperparameters and dataset sizes; cross-base arithmetic is meaningless. Scaling coefficients matter: $\theta_0 + \tau_{\text{code}} + \tau_{\text{translate}}$ with $\alpha_i = 1$ is often too strong and needs dampening (the merging methods in the next section formalise this). And the clean algebraic picture only holds locally — pushing the combined model too far from the base drops it off the well-behaved manifold.

The research is not closed. As of 2026 there is ongoing work on understanding when task vectors are disentangled (tasks that use different subsets of the network are more independent), how to make fine-tuning produce more composable task vectors on purpose (via orthogonality regularisation during training), and whether the same phenomenon holds at pretraining scale (early results suggest yes, with caveats). Task vectors are also the conceptual foundation for the next section, which turns the algebra into a practical toolchain.

§13

Model merging — soups, TIES, DARE, SLERP

Model merging turns the task-arithmetic observations of the previous section into a production toolchain. Given two or more fine-tuned checkpoints that share a base, can they be combined into a single model that absorbs the capabilities of all of them? The empirical answer, refined across several papers since 2022, is yes, surprisingly well, and merging is now a routine step in open-source model production.

The simplest method is model soups (Wortsman et al. 2022). Take several checkpoints produced by fine-tuning from the same initialisation with different hyperparameters or random seeds, and average their weights: $\theta_{\text{soup}} = \frac{1}{n}\sum_i \theta_i$. The result is often better than the best individual checkpoint and reliably better than the average member. The original paper was on vision transformers; the technique transfers to LLMs with little modification. It costs nothing beyond the individual fine-tuning runs and adds no inference overhead.

The next level up is task-vector merging. Compute $\tau_i = \theta_i - \theta_0$ for each fine-tune, combine them (with some scheme for resolving conflicts), and add back. The naive version — $\theta_0 + \sum_i \tau_i$ — over-shoots: the task vectors often point partly in the same direction and summing them amplifies that direction too aggressively. Two methods dominate in 2026 for handling this.

TIES (Trim, Elect Sign & Merge, Yadav et al. 2023) is a three-step procedure. Trim: for each task vector, zero out the parameters with the smallest magnitude (typically keeping the top 20%). This removes noise that doesn't represent the task's update. Elect sign: for each parameter, determine by majority vote across task vectors whether the combined update should be positive or negative, and zero out the task vectors that disagree with the majority. This resolves the conflict-direction problem. Merge: average the remaining parameters and add to the base. TIES consistently outperforms naive averaging, especially as the number of merged tasks grows.
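
A condensed sketch of the three steps (NumPy over flattened parameter vectors; sign election here is by total mass per parameter):

```python
import numpy as np

def ties_merge(base, taus, keep=0.2, lam=1.0):
    taus = np.stack(taus)                   # (n_tasks, n_params)
    # Trim: keep only each vector's top-`keep` fraction by magnitude
    k = int(taus.shape[1] * keep)
    for t in taus:
        t[np.argsort(np.abs(t))[:-k]] = 0.0
    # Elect sign: the direction with greater total magnitude per parameter
    sign = np.sign(taus.sum(axis=0))
    agree = (np.sign(taus) == sign) & (taus != 0)
    # Merge: average the surviving, sign-agreeing entries; add to the base
    counts = np.maximum(agree.sum(axis=0), 1)
    merged = (taus * agree).sum(axis=0) / counts
    return base + lam * merged

rng = np.random.default_rng(0)
base = rng.standard_normal(100)
taus = [0.1 * rng.standard_normal(100) for _ in range(3)]
merged = ties_merge(base, taus)
assert merged.shape == base.shape
# merging zero task vectors returns the base unchanged
assert np.allclose(ties_merge(base, [np.zeros(100), np.zeros(100)]), base)
```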

DARE (Drop And REscale, Yu et al. 2023) is a simpler alternative that achieves similar quality. For each task vector, randomly zero out a fraction (typically 90%) of the parameters and rescale the remaining ones by $1/(1-p)$ where $p$ is the drop rate. The justification is that the task vector contains a lot of redundancy — similar to why LoRA's low-rank works — and randomly sparsifying it preserves the signal while reducing interference between merged vectors. DARE composes with TIES (apply DARE first, then TIES-style majority-voting) and the combination is competitive with either alone.

SLERP and spherical interpolation. For merging two models, spherical linear interpolation (SLERP) — interpolating along the great circle between the two weight vectors rather than along the straight line — often produces smoother quality curves than linear averaging. SLERP is most popular in the open-source community for producing chat models that blend the characteristics of two parents. Its theoretical justification is weak; its empirical results are good enough to have made it the default in MergeKit, the most popular open-source merging library.
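
SLERP on flattened weight vectors is short enough to quote in full (NumPy sketch):

```python
import numpy as np

def slerp(theta0, theta1, t):
    """Spherical interpolation between two same-shaped weight tensors."""
    w0, w1 = theta0.ravel(), theta1.ravel()
    cos_omega = w0 @ w1 / (np.linalg.norm(w0) * np.linalg.norm(w1))
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))
    if omega < 1e-6:                  # nearly parallel: fall back to lerp
        return (1 - t) * theta0 + t * theta1
    out = (np.sin((1 - t) * omega) * w0 + np.sin(t * omega) * w1) / np.sin(omega)
    return out.reshape(theta0.shape)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8))
b = rng.standard_normal((8, 8))
mid = slerp(a, b, 0.5)
# endpoints are recovered exactly
assert np.allclose(slerp(a, b, 0.0), a)
assert np.allclose(slerp(a, b, 1.0), b)
```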

Merging has become a genuine part of the open-source LLM production pipeline. Chat models like Nous-Hermes, Mistral's mixtral-inspired community merges, and many of the top models on open leaderboards in 2024–2026 are merges of two or three parent fine-tunes. The recipe is essentially: take a capable base, produce a few fine-tunes with different strengths (math, code, chat, instruction following), merge with TIES or DARE+TIES, and evaluate. The quality ceiling is meaningfully higher than any single fine-tune alone.

Merging also has a research implication: it is another piece of evidence that fine-tunes of the same base live in a relatively flat, well-connected region of parameter space. The fact that you can average them at all and get something coherent is surprising. This has motivated a research line on mode connectivity — whether there exist low-loss paths between different trained models — that remains active in 2026.

§14

Knowledge distillation

Distillation is the oldest form of model-to-model training, older than this chapter by two decades. A teacher model produces outputs on a dataset; a smaller student is trained to match those outputs. The method is both a capability-transfer technique and, in 2026, the dominant route from a capable-but-expensive frontier model to a deployable one.

The canonical form (Hinton, Vinyals & Dean 2015) trains the student on the teacher's softmax distribution rather than on one-hot ground-truth labels. The teacher's probability over classes — the soft targets — carries information about the similarity structure of the output space that a hard label does not. A softmax temperature $T > 1$ is used on both teacher and student during training to soften the distributions, raising the relative probabilities of the non-top classes so the student can learn from them. The student's loss is cross-entropy against the teacher's soft targets, optionally combined with cross-entropy against the hard ground-truth labels.

For LLMs the picture is modified. The teacher and student both produce token distributions; the distillation loss is KL divergence between the two:

$\mathcal{L}_{\text{distill}} = \sum_t \text{KL}\big(p_{\text{teacher}}(\cdot \mid x_{<t}) \,\|\, p_{\text{student}}(\cdot \mid x_{<t})\big)$

summed or averaged over tokens in a training sequence. Two modes exist: offline distillation, where the teacher's outputs are generated once and stored, and the student is trained on the stored outputs; and online distillation, where the teacher is queried live during student training. Online is more expensive but allows the teacher to respond to the student's actual outputs (on-policy distillation).
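
A sketch of the per-token distillation loss (NumPy; the temperature is applied to both sides, as in the classical recipe):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(teacher_logits, student_logits, T=2.0):
    """Mean per-token KL(teacher || student) over a sequence of logits."""
    p = softmax(teacher_logits, T)          # soft targets
    q = softmax(student_logits, T)
    kl = (p * (np.log(p) - np.log(q))).sum(axis=-1)
    return kl.mean()

rng = np.random.default_rng(0)
teacher = rng.standard_normal((5, 100))     # (tokens, vocab)
student = rng.standard_normal((5, 100))

assert distill_loss(teacher, student) > 0.0           # KL is non-negative
assert np.isclose(distill_loss(teacher, teacher), 0)  # zero iff identical
```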

Why distillation is everywhere in 2026. Frontier labs train a very large model — 400B to several trillion parameters — and then distill it into smaller models that are actually deployed. The frontier model is too expensive to serve; the distilled student is a factor of 10–50 smaller and, for most tasks, nearly as good. Claude's Haiku and Sonnet-family relationships, OpenAI's "turbo" and "mini" series, Google's Gemini Flash, Meta's LLaMA instruct models, and Mistral's small/medium family are all distilled from larger teachers in one form or another, though the exact recipes are rarely fully disclosed.

For open-source practice, distillation takes a simpler form: response distillation, often called synthetic-data fine-tuning, where a capable commercial model is used to generate thousands to millions of (instruction, response) pairs, and an open base is fine-tuned on those pairs. This is how the Vicuna, WizardLM, Alpaca, and most subsequent open chat models were built. It is distillation in the cross-entropy-on-hard-labels sense rather than the KL-on-soft-distributions sense, but it captures most of the practical benefit.

Two caveats about distillation in the LLM era. First, distillation is capped by the teacher. A distilled student cannot exceed the teacher on tasks where the teacher's outputs are the only signal; this is one reason RL-based training methods (which can produce outputs better than any in the training data, within the limits of the reward) remain important. Second, distillation inherits the teacher's mistakes. If the teacher has a particular failure mode — sycophancy, a specific hallucination pattern, a reward-hacking artefact — the student will inherit it. Sanitising distillation data for these failures is an active research area.

Distillation also blends into the PEFT story. One can distill into a small LoRA on a base model, rather than into a full smaller model — the LoRA absorbs the teacher-specific behaviour while the base carries the general capability. This produces an artefact that is both cheap to train and cheap to swap. A good fraction of open-source "Hermes-style" chat LoRAs in 2025–2026 are distillations of frontier-model outputs into a LoRA on a capable open base.

§15

Fine-tuning without undoing alignment

Fine-tuning changes the model; sometimes it changes more than you meant to change. Qi et al. 2023 (Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To) showed that even small benign fine-tuning runs on aligned models can substantially degrade the safety training — erasing refusals, restoring a base-like propensity to produce harmful content, and doing so without any adversarial intent from the fine-tuner. This finding has reshaped the policy conversation about fine-tuning access and the engineering conversation about how to fine-tune aligned models without breaking them.

The empirical finding, in its cleanest form: take an aligned chat model (the paper tested GPT-3.5 through OpenAI's fine-tuning API and several open-weight models), fine-tune it on as few as ten adversarial examples (questions paired with harmful responses), and the safety training is substantially reduced on held-out adversarial prompts. That part is perhaps expected. The more alarming finding: fine-tune on a completely benign dataset — Alpaca-style helpful demonstrations, or pure task examples unrelated to safety — and the safety training still degrades, though less dramatically. The more fine-tuning steps, the more degradation; the effect is present from the first thousand examples.

The mechanism is catastrophic forgetting in the specific direction of the safety training. The alignment signal is a relatively small, relatively localised modification of the pretrained weights. Fine-tuning — even on benign data — moves the weights along whatever gradient directions the new data implies, and those directions rarely preserve the alignment-direction subspace. The safety behaviour degrades not because anything actively attacks it, but because nothing in the new training signal actively preserves it.

Implication for any fine-tuning pipeline. If you are fine-tuning an aligned model, you cannot assume the alignment survives. You must either re-do the alignment pass after fine-tuning (expensive, requires the original preference data you probably do not have), include alignment-preserving data in the fine-tuning mixture (the mixture approach, which helps but does not fully fix), or use PEFT methods that physically constrain updates away from the alignment subspace (early research, not yet a standard tool). In practice, most production fine-tuning pipelines in 2026 use some version of the mixture approach.

The mixture approach is the most commonly deployed mitigation. Include, in the fine-tuning dataset, a significant fraction (typically 5–30%) of alignment-representative examples — refusals for unsafe requests, hedged answers for uncertain questions, honest statements of limitations. The exact examples depend on the alignment profile the original model was trained with; in practice, a publicly available dataset like HH-RLHF or SafeRLHF can approximate it. The mixture reduces but does not eliminate the safety degradation. It is the minimum-viable-defence for a benign fine-tune.
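The mixture approach amounts to a small piece of dataset plumbing. A minimal sketch in plain Python, assuming a list-of-dicts dataset format; the 20% fraction, the field names, and the example contents are illustrative, not a recipe from any particular paper:

```python
import random

def build_mixture(task_examples, alignment_examples, alignment_fraction=0.2, seed=0):
    """Interleave task data with alignment-representative examples.

    alignment_fraction is the share of the *final* dataset drawn from the
    alignment pool (the 5-30% range discussed above). Alignment examples
    are sampled with replacement, so a small pool still works.
    """
    rng = random.Random(seed)
    n_task = len(task_examples)
    # Solve n_align / (n_task + n_align) = alignment_fraction for n_align.
    n_align = round(n_task * alignment_fraction / (1 - alignment_fraction))
    mixed = list(task_examples) + [rng.choice(alignment_examples) for _ in range(n_align)]
    rng.shuffle(mixed)  # shuffle so alignment examples are spread across steps
    return mixed

# Toy usage: 800 task examples plus ~200 refusal-style examples -> 20% mixture.
task = [{"prompt": f"task {i}", "response": "..."} for i in range(800)]
align = [{"prompt": "unsafe request", "response": "I can't help with that."}]
mixed = build_mixture(task, align, alignment_fraction=0.2)
```

In a real pipeline the alignment pool would be drawn from something like HH-RLHF or SafeRLHF, as above, and the fraction tuned against a safety evaluation set rather than fixed in advance.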

Several research approaches have pushed further. Safe-LoRA (Hsu et al. 2024) constrains LoRA updates to be orthogonal to the alignment-direction subspace, inferred by comparing the pre- and post-alignment checkpoints. Vaccine (Huang et al. 2024) adds an adversarial pre-fine-tune step that prepares the model to resist forgetting. Representation engineering methods (Zou et al. 2023, 2024) directly modify activations at inference time to maintain alignment-relevant representations. None is a complete solution; each reduces the degradation by some margin.
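The orthogonal-projection idea behind Safe-LoRA can be stated in a few lines of linear algebra: estimate an alignment subspace from the difference between the pre- and post-alignment checkpoints, then subtract the component of the fine-tuning update that falls inside it. The numpy sketch below is a toy illustration of that projection only, not the paper's actual algorithm; the shapes and the choice of four alignment directions are arbitrary:

```python
import numpy as np

def project_out(delta_w, alignment_dirs):
    """Remove from delta_w its component inside the alignment subspace.

    alignment_dirs: (k, d) matrix whose rows span the subspace, e.g. top
    singular directions of (W_aligned - W_base). Rows are orthonormalised
    via QR so the projection is well-defined even for a non-orthogonal input.
    """
    q, _ = np.linalg.qr(alignment_dirs.T)   # (d, k), orthonormal columns
    return delta_w - delta_w @ q @ q.T      # subtract the in-subspace part

rng = np.random.default_rng(0)
dirs = rng.normal(size=(4, 64))             # toy alignment directions
delta = rng.normal(size=(64, 64))           # toy fine-tuning update
safe_delta = project_out(delta, dirs)

# The projected update has (numerically) no component along those directions.
assert np.allclose(safe_delta @ np.linalg.qr(dirs.T)[0], 0.0, atol=1e-8)
```

The hard part in practice is not the projection but estimating a subspace that actually captures the alignment behaviour, which is why these methods remain early research.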

The policy implications have been substantial. OpenAI's fine-tuning API now runs outputs through a safety-classification pass before allowing a fine-tuned model to be deployed, flagging degradations. Meta's LLaMA licensing terms include obligations around post-fine-tune evaluation for safety. Open-source community practice has not fully caught up; most community LoRA releases do not document their post-fine-tune safety behaviour in any systematic way. The gap between the known risk and the deployed practice is one of the soft spots in the open-source LLM ecosystem in 2026.

The general lesson is that alignment is a property of a specific checkpoint, not a permanent property of a model family. Any training operation — including PEFT, including benign fine-tuning, including merging — has the potential to disturb it, and the disturbance is not automatically recovered. The fine-tuning practitioner owns the alignment state of the model they ship, regardless of how little they intended to change it.

§16

Multi-tenant adapter serving

LoRA is not only a training technique. Its structural property — a small, mergable modification of a shared base — is also a serving technique. The ability to load hundreds or thousands of different LoRA adapters onto a single copy of a backbone, and to route each inference request to the right adapter, is what makes per-customer and per-task specialisation economically viable at scale.

The serving problem without LoRA is prohibitive. If every customer's fine-tune is a full 70B-parameter model, and serving one 70B model needs, say, $N$ GPUs, then serving a thousand fine-tunes needs $1000N$ GPUs — or a queue with enormous warm-swap latency. The fine-tuning economy that most SaaS vendors assumed they would build in 2020 simply did not work at full-fine-tune costs.

With LoRA, the arithmetic changes. The backbone is loaded once and shared across all requests. Each request carries metadata indicating which LoRA (or which combination of LoRAs) to apply. The inference server maintains a cache of LoRA weights — they are small, typically tens of megabytes — and, for each batch of requests, multiplexes the base forward pass with the per-request LoRA matmuls. Thousands of LoRAs on a single backbone is feasible; the memory overhead is dominated by the backbone, and the per-adapter marginal cost is small.
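The "tens of megabytes" figure follows directly from the LoRA parameter count. A hedged back-of-envelope calculation for a 70B-scale backbone — hidden size 8192, 80 layers, rank 16, adapting two projections per layer, all shapes simplified to square matrices for illustration:

```python
# Back-of-envelope adapter size; every constant here is illustrative.
d_model   = 8192    # hidden size of the backbone
n_layers  = 80
rank      = 16
n_mats    = 2       # e.g. q_proj and v_proj adapted per layer
bytes_per = 2       # fp16 storage

params_per_matrix = 2 * d_model * rank        # A is (r x d), B is (d x r)
total_params = n_layers * n_mats * params_per_matrix
size_mb = total_params * bytes_per / 1e6
print(total_params, round(size_mb))           # ~42M params, ~84 MB
```

Against a backbone of roughly 140 GB in fp16, each adapter is three orders of magnitude smaller, which is what makes a cache of thousands of them feasible.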

S-LoRA and the batching problem. The technical subtlety is that LoRA adapters cannot be pre-merged into the base if different requests in a batch use different adapters. Naive implementations fall back to sequential processing, destroying throughput. S-LoRA (Sheng et al. 2023) showed how to do the LoRA matmuls as structured sparse operations that can be batched across heterogeneous adapters, with minimal overhead over a single-adapter baseline. Punica (Chen et al. 2023), LoRAX (Predibase, 2024), and vLLM's adapter support all adopted similar techniques. By 2026, batched multi-LoRA serving is a solved problem in open-source inference engines.
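The semantics of the batching trick are easy to show in numpy, even though real systems implement it with custom GPU kernels: keep the stacked A and B matrices for all adapters resident, gather per request, and add the low-rank correction to one shared base matmul for the whole batch. A sketch with toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, n_adapters, batch = 64, 8, 5, 6

W = rng.normal(size=(d, d))               # shared frozen base weight
A = rng.normal(size=(n_adapters, r, d))   # stacked LoRA A matrices
B = rng.normal(size=(n_adapters, d, r))   # stacked LoRA B matrices
x = rng.normal(size=(batch, d))           # one row per request
adapter_id = np.array([0, 3, 3, 1, 4, 0]) # heterogeneous batch

# Gathered computation: one base matmul for the whole batch, plus
# per-request low-rank corrections via batched einsums.
base = x @ W.T
low  = np.einsum('brd,bd->br', A[adapter_id], x)   # A_i @ x_i, rank-dim
corr = np.einsum('bdr,br->bd', B[adapter_id], low) # B_i @ (A_i @ x_i)
y_batched = base + corr

# Reference: the naive sequential per-request loop it replaces.
y_loop = np.stack([x[i] @ W.T + B[adapter_id[i]] @ (A[adapter_id[i]] @ x[i])
                   for i in range(batch)])
assert np.allclose(y_batched, y_loop)
```

The point of the production kernels is to make the gather-and-correct step nearly free relative to the base matmul, which is what "minimal overhead over a single-adapter baseline" means in practice.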

The practical recipe for a multi-tenant LoRA serving stack looks roughly like this. A shared base model (e.g. Llama-3 70B or Mistral's equivalent) is loaded once per GPU, typically in int8 or int4 with GPTQ/AWQ. A separate LoRA registry holds thousands of adapters, indexed by customer ID or task ID. Each incoming request includes an adapter identifier; the router adds the request to the next batch along with any others using the same or compatible adapters. The inference engine — vLLM, Lorax, TGI, TensorRT-LLM with LoRA — computes the forward pass with the relevant per-request adapter matmuls. Latency is within 10–20% of a base-only serving latency, and throughput is a couple of times lower because of the extra matmuls. For most applications this is an acceptable trade in exchange for per-tenant customisation.
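The registry-and-routing layer of such a stack can be sketched in pure Python. Everything here is hypothetical scaffolding for illustration — `AdapterRegistry`, the tenant IDs, and the adapter paths are invented names, and a real engine batches across adapter groups rather than only within them:

```python
from collections import defaultdict

class AdapterRegistry:
    """Hypothetical registry mapping tenant IDs to adapter artefacts."""
    def __init__(self):
        self._adapters = {}

    def register(self, tenant_id, adapter_path):
        self._adapters[tenant_id] = adapter_path

    def lookup(self, tenant_id):
        return self._adapters[tenant_id]   # KeyError -> unknown tenant

def group_by_adapter(requests, registry):
    """Group pending requests by resolved adapter, so each group can share
    the per-adapter matmuls in the next batched forward pass."""
    groups = defaultdict(list)
    for req in requests:
        groups[registry.lookup(req["tenant"])].append(req)
    return dict(groups)

registry = AdapterRegistry()
registry.register("acme", "adapters/acme-support-v3")
registry.register("globex", "adapters/globex-legal-v1")
pending = [{"tenant": "acme", "prompt": "..."},
           {"tenant": "globex", "prompt": "..."},
           {"tenant": "acme", "prompt": "..."}]
batches = group_by_adapter(pending, registry)
# two requests resolve to the acme adapter, one to the globex adapter
```

In a deployed system the registry is a database or object store keyed the same way, and the grouping step is folded into the engine's continuous-batching scheduler.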

The architecture has implications for how teams build specialised models. A workflow that once looked like collect data → train a full fine-tune → provision serving → manage a model zoo → swap models with minute-scale latency now looks like collect data → train a LoRA → register the adapter → serve from a shared backbone with millisecond-scale per-request routing. The rate at which new specialised models can be produced and deployed is qualitatively different. Companies that would once have had three fine-tuned models now have three hundred.

The second consequence is compositional. If a request can carry multiple LoRA identifiers, and the inference engine can apply them in sequence or in sum, then LoRA composition becomes a runtime feature rather than a training-time one. An end user's per-user personalisation LoRA can be composed with a task-specific LoRA and a style LoRA, all chosen at request time. This composability is still being explored in 2026; the early results suggest that additive composition works well for orthogonal tasks and poorly for conflicting ones (the latter being addressed by the merging research in §13).
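Additive composition at request time amounts to summing scaled low-rank deltas in the forward pass: y = Wx + Σᵢ sᵢ · BᵢAᵢx. A toy numpy sketch, with per-adapter scales chosen at request time; the "persona" and "task" adapters and their scales are illustrative:

```python
import numpy as np

def compose_lora(x, W, adapters, scales):
    """Apply several LoRA deltas additively: y = W x + sum_i s_i * B_i A_i x.

    adapters: list of (B, A) pairs; scales: per-adapter weights chosen at
    request time (e.g. persona=1.0, task=1.0, style=0.5).
    """
    y = W @ x
    for (B, A), s in zip(adapters, scales):
        y = y + s * (B @ (A @ x))
    return y

rng = np.random.default_rng(0)
d, r = 32, 4
W = rng.normal(size=(d, d))
persona = (rng.normal(size=(d, r)), rng.normal(size=(r, d)))
task    = (rng.normal(size=(d, r)), rng.normal(size=(r, d)))
x = rng.normal(size=d)

y = compose_lora(x, W, [persona, task], scales=[1.0, 1.0])
# Zeroing a scale recovers the single-adapter path exactly -- composition
# is a pure runtime choice, with no retraining or re-merging.
assert np.allclose(compose_lora(x, W, [persona, task], [1.0, 0.0]),
                   compose_lora(x, W, [persona], [1.0]))
```

Linearity is what makes this a runtime feature; it is also why interference between non-orthogonal adapters shows up as degraded output rather than an error.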

The deployment story for PEFT, in one sentence: parameter-efficiency at training time was never the primary pay-off, and the really valuable property turned out to be parameter-efficiency at serving time — the ability to deploy thousands of specialised models on the memory footprint of one.

§17

Where fine-tuning sits in 2026

Fine-tuning in 2026 is a mature subfield but not a closed one. The basic techniques — full fine-tuning, LoRA, QLoRA, merging, distillation — are well-understood, widely implemented, and present in every serious open-source training library. The harder questions — when to use which method, how to avoid undoing alignment, how to compose adapters at serving time, what the rank-bottleneck is actually about — are still active. This closing section sketches what is settled and what is not.

What is settled. LoRA is the right default for fine-tuning language models. QLoRA is the right default when memory is tight. Full fine-tuning is reserved for cases where the target distribution is far from the base, where the training set is very large, or where LoRA has been tried and failed. Quantization to int4 or int8 for inference is routine and well-tooled. Model merging with TIES or DARE+TIES is a reliable way to combine specialised fine-tunes into a generalist. Distillation from a frontier teacher into a smaller student is the dominant route from a capable-but-expensive model to a deployable one. These recipes are not going to change substantially in 2026 or 2027; they are the settled landscape.

What is unsettled. The right parameter-efficient method for continued pretraining (adapting a model to a new language or domain at the pretraining distribution level, not the task-specific level) is still open. ReLoRA is a plausible partial answer, but the research has not converged. The interaction between PEFT and alignment — whether PEFT methods preserve alignment better than full fine-tuning, and whether alignment-preserving PEFT can be made robust — is an active topic with no consensus. The composition of LoRAs at serving time, whether for multi-task routing or per-user personalisation, is being explored but the failure modes (conflicts, interference, compounding errors) are not yet well-characterised.

The cross-chapter connections. Fine-tuning sits between Chapter 07 (alignment) and Chapter 09 (retrieval-augmented generation). Alignment is a fine-tuning pass at scale, done once per base model; this chapter covers the downstream adaptation that happens on top of it. Retrieval is the alternative to fine-tuning for knowledge problems; the cleaner one understands the divide between behaviour (fine-tune) and knowledge (retrieve), the better the resulting systems. The earlier chapters on pretraining (Ch 05), scale (Ch 06), and the transformer architecture (Ch 04) supply the substrate this chapter modifies.

The practical reading list for a practitioner who has just joined a team. Read the LoRA paper (Hu et al. 2021), which is short and clearly written. Read the QLoRA paper (Dettmers et al. 2023) for the quantization-plus-PEFT combination. Read the TIES paper (Yadav et al. 2023) for the merging toolchain. Read Qi et al. 2023 for the alignment-fragility finding before fine-tuning anything safety-critical. Read the Ilharco task-arithmetic paper (2022) for the conceptual foundations of merging. The Hugging Face PEFT library documentation is the best single starting point for hands-on practice; the Axolotl and LLaMA-Factory trainers are the best starting points for production-scale fine-tuning on open-weight models.

The research frontier. The most active research in 2026 is around three clusters. First, intrinsic-dimensionality theory: why does low-rank adaptation work so well, and can we predict in advance what rank a task needs? Second, alignment-aware fine-tuning: how do we specialise a model without erasing the safety behaviour, without having to re-run the alignment pass? Third, compositional serving: how do we combine LoRAs at inference time in a principled way, and what is the right abstraction for multi-adapter routing at scale? Each of these has produced partial results and real systems; none has a clean answer that will survive the next generation of models.

The throughline of this chapter, worth repeating: fine-tuning is not a general-purpose patch for everything a pretrained model does wrong. It is a specific tool — change the behaviour of the model by moving its weights — and it has specific costs: memory, storage, serving complexity, and the risk of damaging capabilities or alignment the base had. Parameter-efficient methods reduce those costs substantially without reducing the task-specific benefits. A team that has internalised the distinction between behaviour and knowledge, between PEFT and full fine-tuning, and between training-time and serving-time parameter efficiency, is a team equipped to do serious specialisation work with frontier models. Everything else in this chapter is implementation detail on top of that core picture.

Further reading

Fine-tuning and parameter-efficient adaptation have a small canon of foundational papers and a much larger tail of incremental improvements and application-specific studies. The selections below are a 2026 snapshot, prioritising the papers that introduced still-current techniques (LoRA, QLoRA, adapters, TIES, task vectors), the survey literature that made the taxonomy stick, and the open-source tooling that most practitioners actually use. The quantization and merging literature is deliberately included alongside PEFT proper, because in practice the three topics are inseparable.

Textbooks & tutorials

Textbook
Natural Language Processing with Transformers (rev. ed.)
Lewis Tunstall, Leandro von Werra & Thomas Wolf, O'Reilly, 2023
The practical reference for Hugging Face-style fine-tuning. The revised edition adds the PEFT and TRL (DPO, PPO) workflows. Use as a hands-on companion rather than a theoretical text.
Tutorial
Hugging Face PEFT Documentation & Notebooks
Hugging Face, 2023–present
The single best starting point for working with LoRA, QLoRA, adapters, and prefix tuning. The conceptual guides are tightly coupled to the canonical papers and the library APIs.
Tutorial
A Hackers' Guide to Language Models (Parts II–III)
Jeremy Howard, fast.ai
Hands-on walkthroughs of LoRA, QLoRA, and merging with open-weight models. The "run this code and see" counterpart to the more formal references on this list.
Survey
Delta Tuning: A Comprehensive Study of Parameter-Efficient Methods for Pretrained Language Models
Ning Ding, Yujia Qin, Guang Yang, Fuchao Wei, Zonghan Yang, Yusheng Su et al., 2022
The review that established the additive/selective/reparameterised taxonomy used throughout this chapter. A good second read after the LoRA paper itself.
Survey
Parameter-Efficient Fine-Tuning for Large Language Models: A Critical Review and Assessment
Xu et al., 2024
A more recent and more critical review that incorporates DoRA, AdaLoRA, and VeRA. Fair-minded about which methods actually work in practice vs. which have shown up in papers.
Tutorial
Axolotl & LLaMA-Factory Documentation
Open-source community, 2023–present
The two most widely used production-scale fine-tuning wrappers in 2026. Use Axolotl for YAML-driven reproducible runs; LLaMA-Factory for experimentation breadth. Both hide most of the PEFT mechanics well.
Reference
The MergeKit Cookbook
Charles Goddard et al.
The de-facto guide to model merging — TIES, DARE, SLERP, and the community-developed recipes for combining open-weight models. The community practice here is substantially ahead of the academic literature.

Foundational papers

Paper
Parameter-Efficient Transfer Learning for NLP
Neil Houlsby, Andrei Giurgiu, Stanisław Jastrzębski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan & Sylvain Gelly, ICML 2019
The original adapter paper. Introduces the bottleneck module inserted after each transformer sublayer, the identity initialisation, and the core PEFT template.
Paper
LoRA: Low-Rank Adaptation of Large Language Models
Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang & Weizhu Chen, ICLR 2022
The LoRA paper. Read first. Section 4 on the rank-intrinsic-dimensionality motivation and section 5 on the merge-back property are the most important.
Paper
Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning
Armen Aghajanyan, Luke Zettlemoyer & Sonal Gupta, ACL 2021
The empirical paper that motivated LoRA. Shows that fine-tuning updates project well onto low-dimensional subspaces. The conceptual ground floor for the whole reparameterised PEFT family.
Paper
QLoRA: Efficient Finetuning of Quantized LLMs
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman & Luke Zettlemoyer, NeurIPS 2023
The paper that democratised LLM fine-tuning. NF4 quantization, double quantization, and paged optimizers. The quality-parity-with-fp16 result is the surprising claim that has held up.
Paper
Prefix-Tuning: Optimizing Continuous Prompts for Generation
Xiang Lisa Li & Percy Liang, ACL 2021
The prefix-tuning paper. Soft prompts at every attention layer as the adaptation surface. Key benchmark: matching full fine-tuning on summarisation and table-to-text.
Paper
The Power of Scale for Parameter-Efficient Prompt Tuning
Brian Lester, Rami Al-Rfou & Noah Constant, EMNLP 2021
The prompt-tuning paper (lighter than prefix tuning — embeddings only, no per-layer prefix). The scale-dependence finding is the interesting methodological result.
Paper
BitFit: Simple Parameter-Efficient Fine-Tuning for Transformer-Based Masked Language-Models
Elad Ben Zaken, Shauli Ravfogel & Yoav Goldberg, ACL 2022
Train only the biases. Remarkably effective for its parameter count, and conceptually informative about where the fine-tuning signal lives.
Paper
Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
Haokun Liu, Derek Tam, Mohammed Muqeeth, Jay Mohta, Tenghao Huang, Mohit Bansal & Colin Raffel, NeurIPS 2022
The IA³ paper. Introduces learned per-layer scaling vectors and makes the few-shot case against in-context learning with a deployment-cost argument.
Paper
Editing Models with Task Arithmetic
Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Suchin Gururangan, Ludwig Schmidt, Hannaneh Hajishirzi & Ali Farhadi, ICLR 2023
The task-vector paper. Add, subtract, and compose fine-tunes. The conceptual foundation for everything merging-related.
Paper
Model Soups: Averaging Weights of Multiple Fine-Tuned Models Improves Accuracy Without Increasing Inference Time
Mitchell Wortsman, Gabriel Ilharco, Samir Yitzhak Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S. Morcos et al., ICML 2022
The original weight-averaging finding. Started on ImageNet classifiers; generalised to language models and catalysed the merging line of work.
Paper
TIES-Merging: Resolving Interference When Merging Models
Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel & Mohit Bansal, NeurIPS 2023
The trim/elect/merge recipe that turned task arithmetic from a curiosity into a production toolchain. Read alongside the DARE paper.
Paper
Distilling the Knowledge in a Neural Network
Geoffrey Hinton, Oriol Vinyals & Jeff Dean, 2015
The classical distillation paper. Temperature-scaled softmax, soft targets, teacher–student training. The recipe has changed remarkably little in a decade.
Paper
Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To
Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal & Peter Henderson, ICLR 2024
The alignment-fragility paper. Required reading before fine-tuning any aligned model in production. The benign-dataset finding is the one that changes engineering practice.
Paper
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale
Tim Dettmers, Mike Lewis, Younes Belkada & Luke Zettlemoyer, NeurIPS 2022
The paper that made 8-bit inference for LLMs actually work (the outlier-feature handling was the missing piece). Released as bitsandbytes, the library almost every PEFT stack uses.
Paper
Overcoming Catastrophic Forgetting in Neural Networks
James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu et al., PNAS 2017
The Elastic Weight Consolidation paper. The Fisher-information-weighted regulariser against moving parameters used by the old task. Still the reference for understanding the problem if not for solving it at LLM scale.
Paper
Catastrophic Interference in Connectionist Networks: The Sequential Learning Problem
Michael McCloskey & Neal J. Cohen, 1989
The original identification of catastrophic forgetting, on small MLPs. The phenomenon named here is the oldest failure mode in this chapter; modern PEFT is in part a response to it.
Paper
Mixed Precision Training
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia et al., ICLR 2018
The fp16-with-fp32-master-weights recipe that became default practice. Less flashy than LoRA but arguably more universally adopted. The bf16 variant that displaced it keeps the same structure with a different exponent range.

Modern extensions

Paper
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-Yang Liu, Chien-Yi Wang, Hongxu Yin, Pavlo Molchanov, Yu-Chiang Frank Wang, Kwang-Ting Cheng & Min-Hung Chen, ICML 2024
The current leading LoRA variant. Decomposes weights into magnitude and direction, applies LoRA to direction only. The most likely successor-in-practice to vanilla LoRA.
Paper
AdaLoRA: Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning
Qingru Zhang, Minshuo Chen, Alexander Bukharin, Pengcheng He, Yu Cheng, Weizhu Chen & Tuo Zhao, ICLR 2023
SVD-based adaptive rank allocation across layers. Theoretically appealing, practically competitive with well-tuned uniform-rank LoRA.
Paper
VeRA: Vector-Based Random Matrix Adaptation
Dawid J. Kopiczko, Tijmen Blankevoort & Yuki M. Asano, ICLR 2024
Shared random matrices across layers + per-layer scaling vectors. Matches LoRA quality at a tenth of the parameters. The right choice when adapter size is the binding constraint.
Paper
LoRA+: Efficient Low Rank Adaptation of Large Models
Soufiane Hayou, Nikhil Ghosh & Bin Yu, ICML 2024
Use different learning rates for the A and B matrices of LoRA. The paper derives the right ratio from a neural tangent kernel analysis; in practice the recipe works and has become a default in several trainers.
Paper
PiSSA: Principal Singular Values and Singular Vectors Adaptation of Large Language Models
Fanxu Meng, Zhaohui Wang & Muhan Zhang, NeurIPS 2024
Initialise LoRA from the top-r SVD components of the base weights instead of random. Faster convergence, slightly better final quality. Cheap to adopt.
Paper
ReLoRA: High-Rank Training Through Low-Rank Updates
Vladislav Lialin, Namrata Shivagunde, Sherin Muckatira & Anna Rumshisky, ICLR 2024
LoRA as a pretraining technique. Periodic merge-and-reinit of LoRA updates during pretraining; not free but meaningfully cheaper. The most philosophically interesting LoRA variant.
Paper
Language Models are Super Mario: Absorbing Abilities from Homologous Models as a Free Lunch (DARE)
Le Yu, Bowen Yu, Haiyang Yu, Fei Huang & Yongbin Li, ICML 2024
The DARE paper. Drop and rescale a large fraction of each task vector before merging. The most-widely-used partner to TIES in modern merging pipelines.
Paper
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Elias Frantar, Saleh Ashkboos, Torsten Hoefler & Dan Alistarh, ICLR 2023
The dominant 4-bit PTQ method for LLMs. Uses second-order information to quantize with minimal loss. Pair with AWQ for a complete picture.
Paper
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, Chuang Gan & Song Han, MLSys 2024
Keeps the small fraction of weights that activation magnitudes say matter at higher precision. Competitive with GPTQ, often preferred for models where activations are heavily outlier-driven.
Paper
SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth & Song Han, ICML 2023
Migrates activation-scale variance into weights through a mathematically equivalent reparameterisation. The enabler of aggressive activation quantization, which is the harder half of the problem.
Paper
S-LoRA: Serving Thousands of Concurrent LoRA Adapters
Ying Sheng, Shiyi Cao, Dacheng Li, Coleman Hooper, Nicholas Lee, Shuo Yang et al., MLSys 2024
The paper that solved the batched multi-adapter serving problem. Read alongside vLLM's documentation on adapter support; the two together are what makes multi-tenant LoRA serving tractable.
Paper
Punica: Multi-Tenant LoRA Serving
Lequn Chen, Zihao Ye, Yongji Wu, Danyang Zhuo, Luis Ceze & Arvind Krishnamurthy, MLSys 2024
A parallel line of work to S-LoRA, addressing the same batching problem with a different kernel design. Often cited together with S-LoRA as the two systems that solved the multi-tenant problem.
Paper
Safe LoRA: The Silver Lining of Reducing Safety Risks When Fine-tuning Large Language Models
Chia-Yi Hsu, Yu-Lin Tsai, Chih-Hsun Lin, Pin-Yu Chen, Chia-Mu Yu & Chun-Ying Huang, NeurIPS 2024
Constrain LoRA updates to be orthogonal to the alignment-direction subspace. The most direct architectural response to the Qi et al. finding; early but promising.
Paper
Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes
Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alexander Ratner, Ranjay Krishna, Chen-Yu Lee & Tomas Pfister, ACL 2023
Distillation with chain-of-thought rationales as an additional training signal. An important refinement of the teacher-to-student recipe in the era of reasoning models.
Paper
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi & Hannaneh Hajishirzi, ACL 2023
Synthetic-data distillation pipeline that underlies Alpaca, Vicuna, and most subsequent community fine-tunes. Not PEFT per se, but the data-generation practice that most PEFT runs build on.
Paper
The Unreasonable Ineffectiveness of the Deeper Layers
Andrey Gromov, Kushal Tirumala, Hassan Shapourian, Paolo Glorioso & Daniel A. Roberts, 2024
Removes large fractions of a transformer's deeper layers without much quality loss, after light fine-tuning to heal the seams. An adjacent story to distillation — sometimes the cheapest route to a smaller model is a cut rather than a train.
Paper
Representation Engineering: A Top-Down Approach to AI Transparency
Andy Zou, Long Phan, Sarah Chen, James Campbell, Phillip Guo, Richard Ren et al., 2023
Activation-level steering as an alternative to weight-level fine-tuning. Early but well-cited; the "find a direction in activation space and add a multiple of it" recipe has cross-pollinated with the PEFT literature.

Software & tools

Software
Hugging Face PEFT
Sourab Mangrulkar, Younes Belkada, Sayak Paul et al. (Hugging Face)
The reference PEFT library. Supports LoRA, DoRA, QLoRA, adapters, prefix tuning, IA³, and more. The library whose APIs the ecosystem has standardised around.
Software
bitsandbytes
Tim Dettmers et al.
The 8-bit and 4-bit quantization library that enables QLoRA and most open-source low-precision inference. The NF4 format, paged optimizers, and int8 matmul all live here.
Software
Axolotl
OpenAccess AI Collective
YAML-driven fine-tuning wrapper that has become the standard for reproducible community fine-tunes. Supports full, LoRA, and QLoRA out of the box, with DeepSpeed and FSDP integration.
Software
LLaMA-Factory
hiyouga et al.
A broader-surface alternative to Axolotl. Supports more methods (DPO, KTO, ORPO, PPO, reward modelling) out of the box. Often the faster path to trying a new training objective.
Software
MergeKit
Charles Goddard et al. (Arcee.AI)
The merging toolchain. Linear, SLERP, TIES, DARE, task arithmetic — all the methods from §13 in a single CLI. What the open-source leaderboard winners use.
Software
vLLM
Sky Computing Lab, UC Berkeley
The dominant open-source inference server. First-class support for LoRA adapter serving, continuous batching, and paged attention. The deployment half of the PEFT story.
Software
Unsloth
Daniel Han & Michael Han
Hand-tuned Triton kernels for LoRA and QLoRA that deliver 2–5× training speedups over vanilla Hugging Face. The fastest practical route to training on small hardware.
Software
AutoGPTQ & AutoAWQ
Open-source community (PanQiWei et al., Ji Lin et al.)
Turn-key implementations of the GPTQ and AWQ quantization algorithms. The standard toolchain for producing a 4-bit inference-ready checkpoint from a fine-tuned fp16 model.