Speaker Recognition & Verification: the voice as identity.

Every voice is a biometric shaped by the geometry of a unique vocal tract, the mass of the vocal cords, and decades of learned articulation. Speaker recognition encodes those cues into a compact embedding and uses it to answer a precise question: does this utterance belong to a claimed or identified speaker? The field has moved from GMM-UBM models and i-vectors to deep ECAPA-TDNN and ResNet architectures — and from closed, controlled microphone conditions to wild, telephony-degraded real-world audio.

How to read this chapter

The chapter is organized along the historical and architectural arc of the field. Sections 1–2 frame the problem and its acoustic foundations. Sections 3–4 cover the classical GMM-UBM and i-vector pipeline that dominated 2006–2016. Sections 5–7 cover the neural embedding era — d-vectors, x-vectors, and the modern ECAPA-TDNN and ResNet families. Sections 8–9 treat training objectives and the scoring backend. Sections 10–11 extend from verification (1:1) to identification (1:N) and open-set detection. Section 12 covers anti-spoofing, and Section 13 surveys datasets and evaluation metrics.

Prerequisites: Chapter 01 (Audio Signal Processing) for MFCCs and filterbanks. A working knowledge of neural network training (Part V, Chapters 01–02) is assumed from Section 5 onward. The margin-based loss functions in Section 8 reference softmax classification; ArcFace/AAM-softmax is self-contained. Notation: utterance-level embeddings are $\mathbf{e} \in \mathbb{R}^d$; speaker models are distributions or prototype vectors over that same space.

Contents

  1. What speaker recognition actually solves – Verification, identification, the closed/open-set divide
  2. The acoustics of speaker identity – Vocal tract, glottal source, MFCCs, prosody
  3. GMM-UBM: the pre-neural baseline – Universal background model, MAP adaptation, LLR scoring
  4. The i-vector framework – Total variability space, factor analysis, PLDA
  5. Neural speaker embeddings: d-vectors – Frame-level networks, utterance averaging, speaker classification
  6. x-vectors and the TDNN era – Time Delay Neural Networks, statistics pooling, Kaldi recipe
  7. Modern architectures: ECAPA-TDNN and ResNets – SE blocks, Res2Net, aggregation, self-supervised pre-training
  8. Training objectives – Softmax, AAM-softmax, GE2E, prototypical loss
  9. The verification pipeline – Enrollment, cosine scoring, score normalization, thresholding
  10. PLDA and the scoring backend – Within-class covariance, length normalization, PLDA vs cosine
  11. Speaker identification and open-set detection – 1:N search, rejection threshold, large-scale identification
  12. Anti-spoofing and voice liveness – Replay, voice conversion, TTS attacks; ASVspoof; AASIST
  13. Datasets, benchmarks, and evaluation metrics – VoxCeleb, NIST SRE, EER, minDCF, FRR/FAR
Section 01

What speaker recognition actually solves

Speaker recognition is not one task but a family of related problems, distinguished by what the system knows at enrollment time and what it is asked to decide at test time.

The three canonical tasks are speaker verification, speaker identification, and the closely related speaker diarization (covered in Chapter 05). Verification is a binary 1:1 decision — given a test utterance and a claimed identity, accept or reject. Identification is a 1:N retrieval — given a test utterance and a gallery of enrolled speakers, return the closest match. Both can be closed-set (the speaker must be in the gallery) or open-set (the speaker may not be enrolled, requiring a reject option).

Speaker recognition is further divided by linguistic content. Text-dependent systems require the speaker to utter a specific passphrase; they are easier to build and harder to spoof but inflexible. Text-independent systems work on any utterance — a phone call, a brief command, a recording — and are the focus of this chapter and of most modern research.

Key Idea

Verification and identification share the same core technology — speaker embeddings — but differ in their decision rules. Verification compares one embedding against one enrolled template. Identification compares one embedding against many. Both require a threshold: below it, the system rejects or abstains; above it, it asserts identity.

The practical use cases are wide: voice authentication for banking apps, speaker-aware meeting transcription, forensic speaker comparison, voice assistant personalization, and access control in call centres. Each imposes different conditions on audio quality, utterance length, and acceptable false-accept rates — which is why the evaluation metrics discussed in Section 13 matter so much.

One crucial distinction that the broader literature sometimes collapses: speaker recognition is not speech recognition. A speaker recognition system does not transcribe words; it extracts a representation of the vocal source, ignoring linguistic content as much as possible. The two systems share front-end features (mel filterbanks, MFCCs) but diverge completely in what they suppress and what they preserve.

Section 02

The acoustics of speaker identity

Before any model can encode a speaker, we need to understand what physical properties make voices distinguishable — and which acoustic measurements best capture those properties.

The vocal tract is a tube of variable length and cross-section, shaped by the tongue, jaw, lips, and velum. Its geometry determines the formant frequencies — resonant peaks in the spectrum that are speaker-dependent because no two people share exactly the same vocal anatomy. A longer vocal tract (typical of adult males) lowers all formants; the specific shape of the tract determines their exact pattern.

Layered on the vocal tract is the glottal source — the pattern of vibration of the vocal cords. Its fundamental frequency (F0, or pitch) varies moment to moment with prosody, but its average and range are speaker-dependent. The spectral tilt of the glottal source — how rapidly energy falls off with frequency — also varies between speakers and is partly captured by cepstral features.

Why MFCCs work

Mel-frequency cepstral coefficients compress the spectral envelope into about 20–40 numbers per frame. Cepstral analysis separates the smooth spectral envelope (dominated by vocal tract shape) from the fine harmonic structure (dominated by the glottal source period). Speaker identity information concentrates in the lower cepstral coefficients, which represent broad spectral shape; speech content information concentrates in the detail. This separation is not perfect — which is why text-independent speaker recognition is harder than text-dependent — but it is enough for MFCCs to remain competitive features even in 2025.

Modern systems typically use 80-dimensional log-mel filterbank outputs rather than MFCCs, letting the neural network learn its own cepstral compression. Delta and delta-delta coefficients, which capture temporal dynamics, add further discriminative information about speaking rate and articulation style — both speaker-dependent long-term habits.

Intuition

Think of the speech spectrum as the product of two filters: the vocal tract filter (smooth, speaker-dependent) and the source excitation (harmonic, more content-dependent). Cepstral analysis is a convolution-to-addition trick: taking the log of the spectrum converts this product into a sum, and the inverse DFT (the "cepstrum" operation) then separates slow-varying envelope from fast-varying excitation in the cepstral domain.
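The convolution-to-addition trick is short enough to demonstrate in code. The sketch below uses a synthetic spectrum (the envelope shape, ripple frequency, and the `n_keep` lifter length are all invented for the example): low-quefrency liftering of the log-spectrum recovers the smooth envelope and discards the harmonic ripple.

```python
import numpy as np

def cepstral_envelope(log_spectrum, n_keep=13):
    """Lifter the cepstrum: keep only low-quefrency (slow-varying) coefficients."""
    cepstrum = np.fft.irfft(log_spectrum)       # log-spectrum -> real cepstrum
    lifter = np.zeros_like(cepstrum)
    lifter[:n_keep] = 1.0                       # low-quefrency coefficients ...
    lifter[-(n_keep - 1):] = 1.0                # ... and their mirrored tail
    return np.fft.rfft(cepstrum * lifter).real  # back to the log-spectral domain

# Synthetic spectrum: smooth "vocal tract" envelope times harmonic "glottal" ripple.
freqs = np.linspace(0.0, 1.0, 257)
envelope = np.exp(-3.0 * (freqs - 0.3) ** 2)
harmonics = 1.0 + 0.5 * np.cos(2 * np.pi * 40 * freqs)
log_spec = np.log(envelope * harmonics)

smooth = cepstral_envelope(log_spec)  # tracks log(envelope), not the ripple
```

The liftered result follows the envelope because the ripple lives at a quefrency (here, index 40) well above the lifter cutoff.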

[Figure: log-magnitude spectrum vs. frequency, with the smooth spectral envelope overlaid on the raw harmonic spectrum.]
The smooth spectral envelope (gold) encodes vocal tract shape — the primary speaker-dependent information captured by MFCCs. The fine harmonic structure (blue) encodes glottal source periodicity, which is more content-dependent.
Section 03

GMM-UBM: the pre-neural baseline

For roughly two decades, Gaussian Mixture Models trained against a Universal Background Model were the dominant approach to speaker verification. Understanding them illuminates why the neural embedding revolution happened — and what was sacrificed by making it.

A Gaussian Mixture Model (GMM) for speaker recognition models the distribution of acoustic feature vectors for a given speaker as a weighted sum of Gaussians: $p(\mathbf{x} | \lambda) = \sum_{k=1}^{K} w_k \, \mathcal{N}(\mathbf{x}; \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k)$. Each component captures a different phonetic context in which the speaker's voice characteristics are expressed.

The key insight of Reynolds et al. (2000) was the Universal Background Model. Rather than training each speaker's GMM from scratch — which requires substantial enrollment data — you train one large GMM on thousands of speakers (the UBM), then adapt it to each target speaker via Maximum a Posteriori (MAP) adaptation. MAP adaptation shifts only the mixture means toward the new speaker's data, keeping components not seen at enrollment anchored to the background. The result: a per-speaker GMM from as little as 10–30 seconds of enrollment audio.

Scoring

At test time, the verification decision is a log-likelihood ratio: $$\text{LLR}(\mathbf{X}) = \log p(\mathbf{X} | \lambda_{\text{spk}}) - \log p(\mathbf{X} | \lambda_{\text{UBM}})$$ where $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_T\}$ is the sequence of feature frames. If this exceeds a threshold $\theta$, the claimed identity is accepted. The UBM acts as a generic "not this speaker" hypothesis; the LLR measures how much better the speaker-specific GMM explains the test utterance compared to the background.

Key Idea

MAP adaptation elegantly handles data scarcity: components with many enrollment observations are pulled strongly toward the new speaker's statistics; components with no observations remain at the UBM values. The interpolation weight is $\alpha_k = n_k / (n_k + r)$ where $n_k$ is the soft count of frames assigned to component $k$ and $r$ is a relevance factor (typically 16).
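The adaptation rule can be sketched in a few lines of NumPy. This is a means-only MAP update assuming a diagonal-covariance UBM; the two-component toy UBM and the enrollment frames are invented for illustration. Only the component that actually sees enrollment data moves.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covs, ubm_weights, frames, relevance=16.0):
    # E-step: soft posterior assignment of each frame to each component
    diff = frames[:, None, :] - ubm_means[None, :, :]                # (T, K, D)
    log_gauss = -0.5 * np.sum(diff**2 / ubm_covs + np.log(2 * np.pi * ubm_covs), axis=2)
    log_post = np.log(ubm_weights) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                          # (T, K)
    # Sufficient statistics
    n_k = post.sum(axis=0)                                           # soft counts
    f_k = post.T @ frames                                            # first-order stats
    ml_means = f_k / np.maximum(n_k[:, None], 1e-10)
    # MAP interpolation: alpha_k = n_k / (n_k + r)
    alpha = (n_k / (n_k + relevance))[:, None]
    return alpha * ml_means + (1 - alpha) * ubm_means

rng = np.random.default_rng(0)
ubm_means = np.array([[0.0, 0.0], [5.0, 5.0]])
ubm_covs = np.ones((2, 2))
ubm_weights = np.array([0.5, 0.5])
# "Enrollment" frames near component 0 only: its mean moves, component 1 stays put.
frames = rng.normal([1.0, 1.0], 0.3, size=(200, 2))
adapted = map_adapt_means(ubm_means, ubm_covs, ubm_weights, frames)
```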

The GMM-UBM system remained competitive for years because its assumptions matched telephony conditions: short enrollment sessions, noisy channels, limited bandwidth. Its failure mode was the inability to share information across speakers — each adapted GMM was independent, and the model had no learned notion of what speaker identity actually looks like in a lower-dimensional space. That limitation motivated factor analysis and, later, neural embeddings.

Section 04

The i-vector framework

The i-vector, introduced by Dehak et al. in 2011, solved the fundamental limitation of GMM-UBM by projecting the speaker-session variability into a low-dimensional latent space — becoming the dominant approach until the neural era.

The starting point is the GMM supervector: concatenate all the MAP-adapted means of a speaker's GMM into a single vector $\mathbf{s} \in \mathbb{R}^{KD}$ (K mixtures, D feature dimensions). For a 512-component GMM with 60-dimensional features, this gives a 30,720-dimensional vector per utterance. The supervector is enormous, but it lies on a low-dimensional manifold — a speaker's voice and the recording conditions together determine it almost entirely.

The total variability model captures this with a factor analysis decomposition:

$$\mathbf{s} = \mathbf{m} + \mathbf{T}\mathbf{w}$$

where $\mathbf{m}$ is the UBM supervector, $\mathbf{T} \in \mathbb{R}^{KD \times d}$ is the total variability matrix (a rectangular projection), and $\mathbf{w} \in \mathbb{R}^d$ is the i-vector — typically $d = 400$. Crucially, $\mathbf{T}$ is shared across all speakers; $\mathbf{w}$ is utterance-specific. The i-vector is estimated as the MAP posterior mean of $\mathbf{w}$ given the acoustic observations.
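A deliberately simplified extraction sketch: the real estimator weights zeroth- and first-order Baum-Welch statistics per mixture component, but if we pretend the centered supervector is a single Gaussian observation with isotropic noise, the MAP estimate of $\mathbf{w}$ under its $\mathcal{N}(\mathbf{0}, \mathbf{I})$ prior reduces to ridge regression. Sizes here are toy stand-ins for the real ~30,720 × 400.

```python
import numpy as np

def ivector_posterior_mean(supervector, m, T, noise_var=1.0):
    """Simplified i-vector: MAP estimate of w for s = m + T w + isotropic noise."""
    d = T.shape[1]
    precision = np.eye(d) + T.T @ T / noise_var      # posterior precision of w
    return np.linalg.solve(precision, T.T @ (supervector - m) / noise_var)

rng = np.random.default_rng(1)
KD, d = 300, 10                                      # toy sizes
T = rng.normal(size=(KD, d))                         # total variability matrix
m = rng.normal(size=KD)                              # UBM supervector
w_true = rng.normal(size=d)
s = m + T @ w_true + 0.1 * rng.normal(size=KD)       # generate per the model
w_hat = ivector_posterior_mean(s, m, T, noise_var=0.01)
```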

Backend scoring

Two i-vectors from the same speaker should be close in the total variability space; the problem is that channel, microphone, and room variability also land in $\mathbf{w}$. The backends that followed addressed this: within-class covariance normalization (WCCN), Linear Discriminant Analysis (LDA) to project out session variability, and Probabilistic Linear Discriminant Analysis (PLDA), which explicitly models between-speaker and within-speaker covariance and scores pairs of i-vectors with a Bayesian likelihood ratio. PLDA-scored i-vectors were competitive well into the neural era and are still used as backends for neural embeddings (see Section 10).

Intuition

The i-vector is the answer to: "In the low-dimensional space that explains most of the variability in GMM supervectors across all speakers and sessions, where does this particular utterance land?" Its power comes from compressing 30,000 dimensions to 400 while preserving the directions of greatest variability, which carry speaker identity along with channel effects that the backend must then strip away.

Section 05

Neural speaker embeddings: d-vectors

The first neural approach to speaker embeddings took the simplest possible route: train a neural network to classify speakers, then extract an intermediate representation as the embedding. The result was the d-vector.

Variani et al. (2014) at Google trained a deep neural network on frame-level features to predict the speaker identity from a large training set. Given input frames $\mathbf{x}_1, \ldots, \mathbf{x}_T$, the network produces per-frame output vectors from a hidden layer; the d-vector for an utterance is the average of these frame-level representations:

$$\mathbf{e} = \frac{1}{T} \sum_{t=1}^{T} f(\mathbf{x}_t)$$

where $f(\mathbf{x}_t)$ is the output of a penultimate network layer at frame $t$. At inference time the softmax classification head is discarded, and the averaged hidden representation becomes the speaker embedding.
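The pattern is easy to state in code. Here $f$ is a random two-layer MLP standing in for a trained speaker classifier's penultimate layer (weights, layer sizes, and the feature dimensionality are invented for the sketch):

```python
import numpy as np

rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(40, 64)) * 0.1, np.zeros(64)
W2, b2 = rng.normal(size=(64, 128)) * 0.1, np.zeros(128)

def frame_embed(x):                      # f(x_t): penultimate-layer activation
    h = np.maximum(x @ W1 + b1, 0.0)     # ReLU hidden layer
    return np.maximum(h @ W2 + b2, 0.0)

def d_vector(frames):                    # e = (1/T) * sum_t f(x_t)
    return frame_embed(frames).mean(axis=0)

utterance = rng.normal(size=(300, 40))   # 300 frames of 40-dim features
e = d_vector(utterance)
e /= np.linalg.norm(e)                   # l2-normalize for cosine scoring
```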

The d-vector approach has a critical limitation: it treats all frames equally, regardless of whether they are voiced, noisy, or linguistically informative about speaker identity. Nonetheless, when the training set is large enough (thousands of speakers), the network learns to suppress much of the within-speaker variability and to encode speaker-discriminative information in the embedding. The averaging operation also provides a crude form of length normalization: longer utterances give more stable d-vectors.

Key Idea

The shift from i-vectors to d-vectors changes the locus of statistical modeling from a carefully engineered generative model (GMM-UBM + factor analysis) to a discriminatively trained classifier whose hidden representation has learned what to ignore. The embedding space is no longer interpretable, but it is optimized end-to-end for speaker discrimination — and scales with data.

D-vectors were quickly surpassed in accuracy by x-vectors (Section 6), but they remain important for two reasons. First, they established the training paradigm — discriminative classification on a large multi-speaker dataset, embedding extraction from a hidden layer — that all subsequent architectures follow. Second, d-vector-style systems with attention pooling (rather than simple averaging) are still competitive in streaming and on-device settings where model size matters.

Section 06

x-vectors and the TDNN era

Snyder et al. (2018) at Johns Hopkins replaced simple frame averaging with a statistics pooling layer inside the network, allowing the model to aggregate temporal context before forming the utterance-level embedding. The result — the x-vector — was for several years the dominant speaker embedding.

The x-vector architecture is a Time Delay Neural Network (TDNN) applied to a sequence of acoustic frames. TDNNs are 1-D convolutional networks that explicitly model temporal context: each layer has access to frames at specific offsets from the current frame, and stacking several layers gives a large effective receptive field without the vanishing-gradient problems of recurrent networks over long sequences.

Statistics pooling

The key innovation is the pooling layer that sits between the frame-level TDNN layers and the utterance-level layers. Given frame-level representations $\{\mathbf{h}_t\}_{t=1}^{T}$, the pool computes the mean and standard deviation across time: $$\mathbf{p} = \left[\frac{1}{T}\sum_t \mathbf{h}_t \;\Big\|\; \sqrt{\frac{1}{T}\sum_t \mathbf{h}_t^2 - \left(\frac{1}{T}\sum_t \mathbf{h}_t\right)^2}\right]$$ and concatenates them. The pooled vector $\mathbf{p}$ is then passed through fully-connected layers, producing the x-vector at a chosen hidden layer (typically 512-dimensional). The standard deviation term is critical: it captures how variable the speaker's voice is across the utterance, which itself is a speaker-dependent property.
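The pooling step itself is a one-liner per statistic. The sketch below follows the formula above on random frame-level features (shapes are illustrative: 200 frames, 512 channels):

```python
import numpy as np

def stats_pool(h, eps=1e-8):
    """Concatenate per-channel mean and standard deviation over time."""
    mean = h.mean(axis=0)
    var = (h ** 2).mean(axis=0) - mean ** 2          # E[h^2] - (E[h])^2
    std = np.sqrt(np.maximum(var, eps))              # clamp for numerical safety
    return np.concatenate([mean, std])               # shape (2C,)

rng = np.random.default_rng(3)
h = rng.normal(loc=1.0, scale=2.0, size=(200, 512))  # frame-level representations
p = stats_pool(h)
```

The clamp on the variance matters in practice: with short utterances or near-constant channels, the subtraction form of the variance can go slightly negative in floating point.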

Training uses a softmax cross-entropy objective over thousands of training speakers. The x-vector system was codified in the Kaldi toolkit, making it easy to reproduce and widely adopted. On the NIST SRE benchmarks, x-vectors plus PLDA backend substantially outperformed i-vector systems, particularly in short-duration conditions (under 10 seconds of test audio) where the GMM-based pipeline degrades quickly.

Intuition

Statistics pooling gives the network access to utterance-level statistics — not just what spectral shapes appear, but how much they vary. A speaker with a wide F0 range and variable speaking rate will have high-variance pooled features; a monotone speaker will have low variance. These are genuine speaker-dependent properties that averaging alone cannot capture.

Section 07

Modern architectures: ECAPA-TDNN and ResNets

The x-vector was a substantial advance but left performance on the table. Two architecture families pushed further: ECAPA-TDNN, which enhanced TDNN with squeeze-and-excitation and multi-scale aggregation, and ResNet-based encoders that brought 2-D convolutional image classification architecture to the problem.

ECAPA-TDNN

ECAPA-TDNN (Emphasized Channel Attention, Propagation and Aggregation TDNN, Desplanques et al. 2020) introduced three key changes to the x-vector TDNN. First, channel attention: Squeeze-and-Excitation (SE) blocks learn to re-weight the importance of each feature channel conditioned on the global statistics of the frame, letting the network suppress noise-dominated channels. Second, Res2Net blocks: multi-scale residual connections operating at different temporal granularities within each layer. Third, multi-layer aggregation: instead of pooling only the final frame-level layer, ECAPA-TDNN concatenates intermediate layers before pooling — analogous to DenseNet-style feature reuse — so the utterance-level representation incorporates both shallow (prosody, energy) and deep (phonetic, spectral) features simultaneously.
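The SE mechanism is the most self-contained of the three changes, so here is a minimal sketch of it: squeeze is a global mean over time, excitation is a bottleneck MLP with a sigmoid, and the resulting per-channel gates rescale every frame. The weights are random stand-ins for trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_block(h, W1, W2):
    s = h.mean(axis=0)                   # squeeze: (C,) global context over time
    z = np.maximum(s @ W1, 0.0)          # excitation bottleneck (ReLU)
    gates = sigmoid(z @ W2)              # per-channel weights in (0, 1)
    return h * gates[None, :]            # re-weight all frames channel-wise

rng = np.random.default_rng(4)
C, bottleneck = 64, 8
h = rng.normal(size=(100, C))            # 100 frames, C channels
W1 = rng.normal(size=(C, bottleneck)) * 0.3
W2 = rng.normal(size=(bottleneck, C)) * 0.3
out = se_block(h, W1, W2)
```

Because the gates are strictly between 0 and 1, an SE block can only attenuate channels, conditioned on the utterance-level context — which is exactly how it suppresses noise-dominated channels.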

ECAPA-TDNN became the new standard for the VoxCeleb benchmark, reducing EER by roughly 30–40% relative to the best x-vector systems. A 512-channel ECAPA-TDNN produces a 192-dimensional embedding and can be trained on a single GPU in under a day on VoxCeleb2.

ResNet speaker encoders

Concurrent with ECAPA-TDNN, ResNet-based systems adapted image classification backbones (ResNet-34, ResNet-50, SE-ResNet-34) to process 2-D mel-spectrogram inputs. The 2-D convolution treats time and frequency symmetrically, which can be a disadvantage (speech is not isotropic in time-frequency space) or an advantage (it leverages the same pre-training and augmentation tricks from vision). These systems are competitive with ECAPA-TDNN, particularly at larger model sizes, and the ResNet-293 architecture from the VoxSRC 2021 challenge pushed EER on VoxCeleb1-O to below 0.5%.

Self-supervised pre-training

More recently, WavLM and HuBERT representations — trained on hundreds of thousands of hours of unlabelled audio with masked prediction objectives — have been fine-tuned for speaker recognition. The self-supervised features capture richer contextual information than filterbanks alone, particularly in noisy conditions, and achieve state-of-the-art EER when combined with an ECAPA-TDNN or attention pooling head. The trade-off is model size: WavLM-Large has 316M parameters versus ~5M for ECAPA-TDNN.

[Figure: ECAPA-TDNN architecture diagram.]
ECAPA-TDNN architecture: frame-level TDNN blocks with SE attention and Res2Net multi-scale connections feed into a multi-layer aggregation statistics pooling step, producing a 192-dimensional utterance embedding via fully-connected layers.
Section 08

Training objectives

The choice of loss function during training is at least as important as the choice of architecture. Speaker recognition has moved through several objective families, each addressing a limitation of the previous.

Softmax cross-entropy

The baseline: treat training as an $N$-class classification problem over $N$ training speakers. Softmax cross-entropy is straightforward and scales well, but it optimizes classification accuracy over training speakers — not the distance geometry of the embedding space at test time, where new, unseen speakers must be compared. The resulting embeddings are discriminative but not necessarily well-separated when viewed as a metric space.

AAM-Softmax (ArcFace)

Additive Angular Margin softmax (AAM-softmax, or ArcFace adapted to speaker recognition) addresses this by operating on the angular space. The logit for class $j$ is: $$z_j = s \cdot \cos(\theta_j + m \cdot \mathbf{1}[j = y])$$ where $\theta_j$ is the angle between the $\ell_2$-normalized embedding and the $j$-th class weight vector, $m$ is an additive margin (typically 0.2), and $s$ is a scale factor (typically 30). The margin forces the network to produce embeddings that are not just classified correctly but are separated from the decision boundary by a fixed angular gap — which directly improves cosine-distance-based verification at test time. AAM-softmax is now the standard training objective for ECAPA-TDNN and ResNet speaker encoders.
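The logit computation can be written directly from the formula above. This sketch (class weights and the embedding are random toys) shows the defining property: the margin lowers only the true-class logit, so the network must push the embedding further toward its class weight to compensate.

```python
import numpy as np

def aam_logits(emb, class_weights, label, margin=0.2, scale=30.0):
    """z_j = s * cos(theta_j + m * 1[j == y]) on l2-normalized inputs."""
    e = emb / np.linalg.norm(emb)
    W = class_weights / np.linalg.norm(class_weights, axis=1, keepdims=True)
    cos = np.clip(W @ e, -1.0, 1.0)          # cosines with each class weight
    theta = np.arccos(cos)
    theta = theta.copy()
    theta[label] += margin                   # additive angular margin on true class
    return scale * np.cos(theta)

rng = np.random.default_rng(5)
n_classes, dim = 8, 16
W = rng.normal(size=(n_classes, dim))
emb = W[3] + 0.1 * rng.normal(size=dim)      # embedding near class 3's weight
z = aam_logits(emb, W, label=3)
z_nomargin = aam_logits(emb, W, label=3, margin=0.0)
```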

Metric learning objectives

The Generalized End-to-End (GE2E) loss (Wan et al. 2018, Google) takes a fundamentally different approach: rather than classifying over a fixed set of training speakers, it operates on batches of (speaker, utterance) pairs and directly optimizes the cosine similarity matrix. Each utterance's embedding is compared to the centroids of all speakers in the batch; the loss rewards high cosine similarity to the true speaker centroid and low similarity to all others. This objective trains the system for the exact task it will perform at inference — comparing embeddings — and is particularly effective for on-device speaker verification where training-test speaker overlap is a concern.

Key Idea

AAM-softmax and GE2E both enforce a gap between positive and negative pairs in angular space, but AAM-softmax does so via a fixed margin on class weights while GE2E does so via direct batch-level comparison. In practice, AAM-softmax with a well-tuned margin is simpler to implement and matches or exceeds GE2E on standard benchmarks; GE2E has advantages in continual enrollment settings.

Section 09

The verification pipeline

From raw audio to an accept/reject decision, a modern speaker verification system passes through enrollment, embedding extraction, score computation, and score normalization — each step with its own design choices and failure modes.

Enrollment is the offline phase: one or more utterances from the target speaker are processed, their embeddings extracted and — if multiple utterances are available — averaged to form a speaker template $\bar{\mathbf{e}}_{\text{spk}}$. More enrollment utterances give a more stable template; practical systems often use 5–30 seconds of audio. The enrollment embedding is stored (typically $\ell_2$-normalized) for later comparison.

Score computation at test time extracts an embedding $\mathbf{e}_{\text{test}}$ from the test utterance, then computes a similarity score. The dominant method is cosine similarity: $$s = \frac{\mathbf{e}_{\text{test}} \cdot \bar{\mathbf{e}}_{\text{spk}}}{\|\mathbf{e}_{\text{test}}\| \cdot \|\bar{\mathbf{e}}_{\text{spk}}\|}$$ which measures the angle between vectors and is invariant to magnitude. When embeddings are $\ell_2$-normalized before storage, this reduces to a simple dot product — fast and compatible with approximate nearest-neighbor indices.

Score normalization

Raw cosine scores are not well-calibrated across conditions: scores tend to be higher for shorter utterances, certain channel types, or certain accent groups. Z-normalization (Z-norm) scores a cohort of impostor utterances against the enrolled speaker's template and standardizes test scores with the resulting mean and variance; T-normalization (T-norm) instead scores the test utterance against a cohort of impostor templates. Adaptive S-normalization (AS-norm) symmetrizes the two, using only the top-scoring cohort entries on each side as the reference — which reduces sensitivity to cohort composition while preserving accuracy.

Thresholding converts the score to a binary decision. The threshold $\theta$ is set on held-out data to achieve a target operating point (e.g., 1% false acceptance rate in a banking application). In practice, the threshold is almost never fixed globally; it is calibrated per deployment condition.
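The whole pipeline fits in a short script. The sketch below uses random unit vectors as stand-ins for network embeddings, and the cohort sizes, top-K, and the decision threshold of 3.0 are all illustrative choices, not recommended values.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def s_norm(raw, enroll_cohort_scores, test_cohort_scores, top_k=50):
    """Adaptive s-norm: standardize against top-K impostor scores on each side."""
    ec = np.sort(enroll_cohort_scores)[-top_k:]   # top-K cohort scores vs enrollment
    tc = np.sort(test_cohort_scores)[-top_k:]     # top-K cohort scores vs test
    return 0.5 * ((raw - ec.mean()) / ec.std() + (raw - tc.mean()) / tc.std())

rng = np.random.default_rng(6)
dim = 192
speaker = normalize(rng.normal(size=dim))
enroll = normalize(speaker + 0.02 * rng.normal(size=(3, dim)))  # 3 enrollment utts
template = normalize(enroll.mean(axis=0))                       # averaged template
test = normalize(speaker + 0.02 * rng.normal(size=dim))         # genuine trial
cohort = normalize(rng.normal(size=(200, dim)))                 # impostor cohort

raw = float(test @ template)                                    # cosine similarity
score = s_norm(raw, cohort @ template, cohort @ test)
decision = score > 3.0      # illustrative threshold; calibrated per deployment

imp = normalize(rng.normal(size=dim))                           # impostor trial
score_imp = s_norm(float(imp @ template), cohort @ template, cohort @ imp)
```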

[Figure: speaker verification pipeline diagram.]
The speaker verification pipeline. Enrollment embeddings are pre-computed and stored; at test time a single embedding is extracted and compared via cosine similarity. Score normalization adjusts for condition-specific biases before the threshold decision.
Section 10

PLDA and the scoring backend

Cosine similarity is fast and effective, but it ignores the statistical structure of the embedding space. PLDA exploits that structure — modeling both between-speaker and within-speaker covariance — to produce better-calibrated scores, especially when speaker embeddings show residual channel or condition variability.

Probabilistic Linear Discriminant Analysis (PLDA) models each speaker's embeddings as samples from a Gaussian centered at a latent speaker factor $\mathbf{y}$, with additive within-speaker noise: $$\mathbf{e} = \mathbf{m} + \mathbf{V}\mathbf{y} + \boldsymbol{\epsilon}$$ where $\mathbf{m}$ is the global mean, $\mathbf{V}$ is the between-speaker loading matrix, $\mathbf{y} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is the latent speaker factor, and $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \boldsymbol{\Sigma})$ is within-speaker noise. The PLDA score for a verification trial is the log-likelihood ratio between the hypothesis "same speaker" and "different speakers" — a Bayesian decision that accounts for both the similarity of the two embeddings and the expected within-speaker variability.
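The verification likelihood ratio has a closed form once the covariances are fixed. The sketch below assumes a two-covariance parameterization with known between-speaker covariance $\mathbf{B} = \mathbf{V}\mathbf{V}^\top$ and within-speaker covariance $\mathbf{W} = \boldsymbol{\Sigma}$ (diagonal toys here): under the same-speaker hypothesis the stacked pair is jointly Gaussian with cross-covariance $\mathbf{B}$; under different speakers the cross-covariance is zero.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    d = x.size
    _, logdet = np.linalg.slogdet(cov)
    diff = x - mean
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(cov, diff))

def plda_llr(e1, e2, m, B, W):
    """log p(e1, e2 | same speaker) - log p(e1, e2 | different speakers)."""
    S = B + W
    joint_mean = np.concatenate([m, m])
    same = np.block([[S, B], [B, S]])      # pair shares the latent speaker factor
    diff = np.block([[S, np.zeros_like(B)], [np.zeros_like(B), S]])
    x = np.concatenate([e1, e2])
    return gaussian_logpdf(x, joint_mean, same) - gaussian_logpdf(x, joint_mean, diff)

rng = np.random.default_rng(8)
d = 20
m, B, W = np.zeros(d), np.eye(d) * 4.0, np.eye(d) * 1.0

def sample_pair(same):
    y1 = rng.normal(size=d) * 2.0                       # latent speaker factor
    y2 = y1 if same else rng.normal(size=d) * 2.0
    return m + y1 + rng.normal(size=d), m + y2 + rng.normal(size=d)

same_mean = np.mean([plda_llr(*sample_pair(True), m, B, W) for _ in range(30)])
diff_mean = np.mean([plda_llr(*sample_pair(False), m, B, W) for _ in range(30)])
```

Averaged over trials, same-speaker pairs score positive LLRs and different-speaker pairs score negative ones, which is what makes the raw PLDA score usable as a calibrated decision statistic.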

Length normalization

Before fitting or applying PLDA, embeddings are typically length-normalized: projected to the unit hypersphere via $\mathbf{e} \leftarrow \mathbf{e} / \|\mathbf{e}\|$. This maps the Gaussian assumptions of PLDA onto an approximately Gaussian distribution on the sphere, and empirically improves PLDA performance. An additional centering step (subtracting the mean over the training set) ensures the distribution is zero-mean.

When to use PLDA vs cosine

For modern deep embeddings trained with AAM-softmax, cosine similarity often matches or exceeds PLDA — the angular margin training already optimizes for cosine-distance discrimination. PLDA retains an advantage when: embeddings show strong channel or session variability not removed by the model; enrollment utterances are multiple (PLDA naturally fuses them by marginalizing over the latent speaker factor); or the scoring must be well-calibrated without held-out data for score normalization. In practice, the two backends are often ensembled.

Section 11

Speaker identification and open-set detection

When the gallery contains N enrolled speakers and the system must name the speaker rather than verify a claimed identity, the verification framework extends naturally — but the open-set case introduces a qualitatively different challenge.

Closed-set identification is straightforward: compute cosine similarity between the test embedding and all $N$ enrolled speaker templates, return $\text{argmax}_j \, s(\mathbf{e}_{\text{test}}, \bar{\mathbf{e}}_j)$. At small $N$ (a few hundred speakers), this is a brute-force nearest-neighbor search. At large $N$ (millions of speakers in a forensic database or a streaming media platform), approximate nearest-neighbor (ANN) methods such as HNSW or FAISS IVF indices make this tractable, typically trading a small accuracy loss for orders-of-magnitude speedup.

Open-set identification adds the possibility that the test speaker is not enrolled. The system must either return a match or reject. The standard approach is to threshold the top-$1$ similarity score: if $\max_j s_j < \theta$, output "unknown." Setting $\theta$ involves the same EER/minDCF trade-off as verification but now also affects the false-rejection rate for enrolled speakers.
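Brute-force open-set identification is a few lines; at large $N$ the `gallery_n @ e` matrix product is what an ANN index replaces. Gallery contents and the rejection threshold below are toy stand-ins.

```python
import numpy as np

def identify(test_emb, gallery, threshold=0.5):
    """Return (speaker index or None, best cosine score) for an open-set gallery."""
    gallery_n = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    e = test_emb / np.linalg.norm(test_emb)
    scores = gallery_n @ e                     # cosine against all N templates
    best = int(np.argmax(scores))
    return (best if scores[best] >= threshold else None), float(scores[best])

rng = np.random.default_rng(9)
gallery = rng.normal(size=(100, 192))          # 100 enrolled speaker templates
probe_known = gallery[42] + 0.05 * rng.normal(size=192)   # noisy copy of #42
probe_unknown = rng.normal(size=192)           # a speaker who is not enrolled

who, s1 = identify(probe_known, gallery)       # -> 42, high score
unk, s2 = identify(probe_unknown, gallery)     # -> None: below threshold
```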

Speaker counting

A related task is estimating how many distinct speakers appear in a recording without prior enrollment. This connects to the diarization pipeline (Chapter 05) but can also be handled by hierarchical agglomerative clustering over short-segment embeddings: the number of clusters that maximizes a clustering criterion (BIC, silhouette score, or a learned stopping criterion) estimates the speaker count. This approach is used in the DIHARD and AMI benchmarks as a preprocessing step.

Watch Out

Identification accuracy degrades gracefully with $N$ for closed-set systems, but open-set rejection quality degrades faster: as the gallery grows, the probability that some enrolled speaker has a high cosine similarity to an unknown test speaker increases, raising the false-acceptance rate at any fixed threshold. Systems must be re-calibrated when gallery size changes substantially.

Section 12

Anti-spoofing and voice liveness detection

Speaker verification systems are vulnerable to adversarial attack: a well-crafted impersonation, a replay of a recorded utterance, or a TTS-synthesized voice can fool the biometric. Anti-spoofing — sometimes called voice liveness detection — is a parallel line of defense.

The ASVspoof challenge series (2015, 2017, 2019, 2021) has structured the anti-spoofing community around three attack categories: replay attacks, in which a recording of the target speaker is played back to the microphone; voice conversion, in which another speaker's utterance is transformed to sound like the target; and text-to-speech synthesis, in which speech in the target's voice is generated from arbitrary text.

Countermeasure models

Anti-spoofing countermeasures (CMs) are binary classifiers: bona fide (real) vs. spoofed. Early systems used hand-crafted features sensitive to spoofing artefacts — phase discontinuities, codec traces, unnatural energy distributions in high-frequency bands. The Lightweight CNN (LCNN) model applied to LFCC features became the ASVspoof 2019 baseline. More powerful models followed: RawNet2 operates directly on raw waveforms and learns its own filter bank; AASIST (Audio Anti-Spoofing using Integrated Spectro-Temporal graph attention network) explicitly models relationships between spectral and temporal artefacts and won ASVspoof 2021.

Tandem and end-to-end systems

In production, the ASV system and the CM are typically combined in a tandem architecture: both scores are computed independently and fused (linearly or with a learned gate) to produce the final accept/reject decision. End-to-end jointly trained ASV+CM systems are an active research direction but face the challenge that ASVspoof training sets do not contain natural speaker diversity. The tandem approach, with separate training, remains the dominant deployment pattern.

Watch Out

Modern zero-shot TTS systems (Chapter 03) have dramatically lowered the barrier to speaker spoofing. Given as little as 3 seconds of a target speaker's audio, a voice clone can now synthesize speech in real time. Anti-spoofing CMs trained on ASVspoof 2021 data show significant performance degradation against these newer attack vectors, and the field is racing to update evaluation sets.

Section 13

Datasets, benchmarks, and evaluation metrics

Speaker recognition progress is inseparable from its benchmarks. The field has converged on a small number of evaluation metrics and a handful of canonical datasets that have driven architectural competition for a decade.

Datasets

VoxCeleb1 (Nagrani et al. 2017, Oxford) contains 153,516 utterances from 1,251 celebrities extracted from YouTube interviews — fully in-the-wild, covering diverse conditions, microphones, and backgrounds. VoxCeleb2 extends this to 1,128,246 utterances from 5,994 speakers. VoxCeleb1 is now the standard evaluation set (three official test protocols: O, E, H); VoxCeleb2 is the standard large-scale training set.

The NIST Speaker Recognition Evaluation (SRE) series, running since 1996 and annually/biennially since 2004, provides controlled telephone-quality and microphone conditions with carefully balanced trial lists. SRE evaluations emphasize generalization across channel, language, and duration conditions and remain the gold standard for forensic-grade system evaluation.

Evaluation metrics

Equal Error Rate (EER) is the operating point where the False Rejection Rate (FRR, genuine trials rejected) equals the False Acceptance Rate (FAR, impostor trials accepted). It is a threshold-independent summary of discriminative ability — lower is better — and is the primary metric on VoxCeleb.

The minimum Detection Cost Function (minDCF) is a weighted cost that reflects the operational trade-off between false acceptance and false rejection: $$C_{\text{det}} = C_{\text{miss}} \cdot P_{\text{miss}} \cdot P_{\text{tar}} + C_{\text{FA}} \cdot P_{\text{FA}} \cdot (1 - P_{\text{tar}})$$ where the cost weights ($C_{\text{miss}}$, $C_{\text{FA}}$) and the prior target probability ($P_{\text{tar}}$) are set per evaluation to reflect the deployment scenario. NIST SRE uses minDCF as its primary metric. A system optimized for EER may not minimize minDCF if the cost asymmetry or prior differs from the EER operating point.

System                 EER (Vox1-O, %)   Year
i-vector + PLDA        5.3               2018
x-vector + PLDA        3.1               2018
ECAPA-TDNN (512ch)     0.87              2020
ResNet-293 + AAM       0.46              2021
WavLM-Large + ECAPA    0.39              2022
Intuition

EER is the point on the DET (Detection Error Tradeoff) curve where the curve crosses the diagonal. minDCF is the minimum of the cost function over all possible thresholds — it tells you the best possible performance given the deployment cost structure, regardless of where you set your operating threshold.

[Figure: DET curve, false rejection rate vs. false acceptance rate, with the EER at the diagonal.]
Detection Error Tradeoff (DET) curve. The EER is where FRR = FAR (intersection with the diagonal). Moving left reduces false acceptances but increases false rejections; the optimal operating point depends on deployment cost weights.
