Speaker Diarization: who spoke when in the stream.

A recording of a meeting, a phone call, or a clinical interview arrives as a single undifferentiated waveform. Diarization is the process of answering one deceptively simple question — who spoke when? — by segmenting the audio into speaker-homogeneous regions and assigning a consistent label to each. It does not identify speakers by name, but it transforms an anonymous stream into a structured, attributable transcript.

How to read this chapter

The chapter follows the classical pipeline before branching into end-to-end and hybrid approaches. Sections 1–2 frame the problem and cover Voice Activity Detection. Sections 3–6 walk through the clustering-based pipeline: segmentation, short-segment embeddings, Agglomerative Hierarchical Clustering, and spectral clustering. Section 7 addresses the hard problem of overlapping speech. Sections 8–9 cover end-to-end neural diarization (EEND) and its variable-speaker extensions. Sections 10–11 treat streaming diarization and speaker-attributed ASR. Section 12 covers evaluation metrics and benchmarks.

Prerequisites: Chapter 01 (Audio Signal Processing) for VAD and filterbanks; Chapter 04 (Speaker Recognition & Verification) for speaker embeddings, cosine similarity, and clustering fundamentals. Familiarity with attention mechanisms (Part V, Chapter 06) helps in the EEND sections. Notation: a diarization hypothesis is a set of segments $\{(t_s, t_e, k)\}$ where $t_s, t_e$ are start/end times and $k$ is a speaker label; ground truth uses the same form.

Contents

  1. The diarization problem – Input/output, use cases, the label-permutation ambiguity
  2. Voice Activity Detection – Energy, zero-crossing, neural VAD, Silero, pyannote
  3. Segmentation and change-point detection – BIC, sliding window, neural change-point detection
  4. Short-segment speaker embeddings – Sub-second embeddings, window length trade-offs, multi-scale
  5. Agglomerative Hierarchical Clustering – Bottom-up merging, PLDA distance, stopping criteria
  6. Spectral clustering – Affinity matrix, Laplacian, eigengap, auto-estimating k
  7. Overlap detection and multi-speaker segments – Overlap prevalence, binary classifiers, multi-label output
  8. End-to-end neural diarization (EEND) – BLSTM, PIT, frame-level multi-label, simulated mixtures
  9. EEND extensions: variable speakers and EDA – EEND-EDA, EEND-VC, local agreements, scalable EEND
  10. Streaming and online diarization – Incremental clustering, UIS-RNN, streaming EEND
  11. Speaker-attributed ASR – Pipeline vs joint, word-level diarization, meeting transcription
  12. Evaluation: DER, JER, and benchmarks – Missed speech, false alarm, confusion; DIHARD, AMI, VoxConverse
Section 01

The diarization problem

Diarization sits at the intersection of audio segmentation, speaker recognition, and sequence labeling — but is reducible to none of them. Its output is a partition of time, not a transcription and not an identity claim.

Formally, given an audio recording $X$ of duration $T$, diarization produces a set of non-overlapping (or in the overlapping case, multi-labeled) segments: $\{(t_s^{(i)}, t_e^{(i)}, k^{(i)})\}$ where $k^{(i)}$ is a speaker label drawn from a set of size $N$ discovered from the data. The labels are relative — "SPEAKER_00" and "SPEAKER_01" carry no intrinsic identity — and $N$ is not known in advance. This distinguishes diarization from identification (which requires enrollment) and from speaker counting (which only produces $N$, not the time boundaries).

The label-permutation ambiguity is fundamental: there are $N!$ equivalent hypotheses corresponding to relabeling all speakers consistently. Evaluation metrics resolve this by finding the optimal label permutation before scoring (Section 12). Systems that track speaker identity across recordings must handle this by linking hypotheses across sessions — an additional problem not addressed by single-recording diarization.
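One minimal way to make the ambiguity concrete in code: relabel speakers by order of first appearance, so that permutation-equivalent hypotheses compare equal. The helper below is an illustrative sketch, not a standard API:

```python
def canonicalize(segments):
    """Relabel speakers by order of first appearance, so that
    permutation-equivalent hypotheses compare equal."""
    mapping = {}
    out = []
    for start, end, label in sorted(segments):
        if label not in mapping:
            mapping[label] = f"SPEAKER_{len(mapping):02d}"
        out.append((start, end, mapping[label]))
    return out

# Same partition of time, different arbitrary label alphabets:
a = [(0.0, 2.0, "X"), (2.0, 5.0, "Y"), (5.0, 6.0, "X")]
b = [(0.0, 2.0, "B"), (2.0, 5.0, "A"), (5.0, 6.0, "B")]
assert canonicalize(a) == canonicalize(b)
```

This is also the trick behind readable system output: whatever internal labels the clusterer produces, the user sees SPEAKER_00, SPEAKER_01, ... in order of first speech.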

Key Idea

Diarization does not tell you who the speakers are — only that they are different and when each one speaks. Combining diarization with speaker recognition (Chapter 04) to name the speakers is a separate step, requiring enrollment data for the specific speakers in the recording.

Use cases span a wide range of audio conditions: telephone call-centre analytics (2-speaker, narrowband, clean), broadcast news (multiple speakers, music, noise), meeting rooms (4–8 speakers, far-field microphones, significant overlap), clinical recordings (doctor-patient, often 2 speakers but with long pauses and topic shifts), and podcast transcription (2–6 speakers, variable production quality). Each imposes different operating constraints on latency, speaker count, and acoustic difficulty.

Section 02

Voice Activity Detection

Before any speaker labels can be assigned, the system must find the regions of the recording that contain speech at all. VAD is the binary classification problem of separating speech frames from non-speech — silence, noise, music, and background sounds.

Classical VAD uses energy and zero-crossing rate: speech frames have higher energy than silence, and voiced speech has lower zero-crossing rates than fricatives or noise. These heuristics work in clean conditions but break down quickly when background noise is at a similar energy level to speech, or when music is present.
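A sketch of the classical heuristic, with illustrative thresholds that real systems would tune on held-out data:

```python
import numpy as np

def energy_zcr_vad(signal, sr=16000, frame_ms=25, hop_ms=10,
                   energy_thresh=0.01, zcr_thresh=0.25):
    """Classical VAD: a frame is speech if its RMS energy is high and its
    zero-crossing rate is low (voiced speech crosses zero rarely).
    Thresholds are illustrative, not calibrated values."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    decisions = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame]
        rms = np.sqrt(np.mean(x ** 2))
        zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)
        decisions.append(rms > energy_thresh and zcr < zcr_thresh)
    return np.array(decisions)

sr = 16000
rng = np.random.default_rng(0)
t = np.arange(sr) / sr
voiced = 0.1 * np.sin(2 * np.pi * 120 * t)   # loud, low-ZCR "voiced" tone
silence = 0.001 * rng.standard_normal(sr)    # near-silent noise floor
vad = energy_zcr_vad(np.concatenate([voiced, silence]), sr)
```

On this toy signal the voiced half is flagged as speech and the noise floor is not; raise the noise level toward the tone's energy and the decision boundary collapses, which is exactly the failure mode described above.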

Neural VAD models treat the problem as frame-level binary classification over acoustic features. WebRTC VAD, widely used in telephony, uses a Gaussian Mixture Model on frequency-band energy. Silero VAD uses a small LSTM trained on over 6,000 hours of diverse audio and achieves robust performance with low latency (frame-level decisions at 30ms). The pyannote.audio framework provides a VAD pipeline based on segmentation models trained on a diverse benchmark and is the current standard in the research community.

VAD as a segmentation output

The output of VAD is a binary time series or, equivalently, a list of speech segments. Most diarization pipelines apply a minimum segment duration (typically 0.3s) to suppress brief noise events, and merge adjacent speech segments separated by gaps shorter than a collar (typically 0.5s) to avoid over-fragmenting continuous speech. The quality of the downstream diarization is bounded by VAD quality: false-alarm speech regions introduce spurious speaker changes; missed speech regions lose information.
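The post-processing described above (gap merging, then minimum-duration filtering, with the durations from the text) can be sketched as:

```python
def postprocess_vad(segments, min_dur=0.3, max_gap=0.5):
    """Merge speech segments separated by gaps shorter than max_gap,
    then drop segments shorter than min_dur."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            merged[-1][1] = max(merged[-1][1], end)   # close the small gap
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged if e - s >= min_dur]

raw = [(0.0, 1.2), (1.4, 3.0), (5.0, 5.1), (6.0, 8.0)]
print(postprocess_vad(raw))  # [(0.0, 3.0), (6.0, 8.0)]
```

Merging before filtering matters: a short fragment adjacent to real speech is absorbed rather than discarded, while an isolated 0.1s blip (likely a noise event) is dropped.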

Intuition

Think of VAD as the gating function of the diarization pipeline: it determines which frames are presented to the speaker embedding model. Errors here compound — a region of music labeled as speech will produce a confusable embedding that the clustering step may incorrectly merge with a real speaker, or split off as a phantom speaker.

VAD output: gold bars mark speech regions detected in the waveform. Non-speech intervals (silence, background noise) are excluded from downstream processing.
Section 03

Segmentation and change-point detection

Within speech regions, speaker changes must be located before clustering can label them. Segmentation is the task of finding those change points — ideally splitting precisely where one speaker hands off to another, without splitting within a single speaker's continuous turn.

The classical approach is Bayesian Information Criterion (BIC) segmentation. For each candidate change point within a window, the BIC tests whether the data before and after are better modeled as one Gaussian or two. A penalty term proportional to the number of model parameters prevents over-splitting. The window is slid across the recording, and the positions that maximize the BIC improvement are retained as change points.

BIC segmentation is slow (O(n²) in the number of frames per window) and performs poorly when speakers overlap or transition gradually. The practical alternative is a sliding-window approach: compute a speaker embedding for two adjacent windows of fixed length (e.g., 1.5s each with 0.5s overlap) and flag a change point when their cosine distance exceeds a threshold. This is fast and requires only the speaker embedding model — no separate segmentation model.
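Assuming window embeddings have already been extracted, the sliding-window detector reduces to thresholding the cosine distance between adjacent windows. A sketch on synthetic embeddings (the threshold value is illustrative):

```python
import numpy as np

def change_points(embeddings, times, threshold=0.5):
    """Flag a change point wherever the cosine distance between adjacent
    window embeddings exceeds the threshold. `embeddings` is (n, d);
    `times` holds the centre time of each window."""
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dist = 1.0 - np.sum(e[:-1] * e[1:], axis=1)   # cosine distance
    return [times[i + 1] for i in np.where(dist > threshold)[0]]

rng = np.random.default_rng(0)
spk_a, spk_b = rng.normal(size=64), rng.normal(size=64)
emb = np.stack([spk_a + 0.05 * rng.normal(size=64) for _ in range(4)] +
               [spk_b + 0.05 * rng.normal(size=64) for _ in range(4)])
times = np.arange(8) * 0.5
print(change_points(emb, times))  # one change point, at t = 2.0
```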

Neural change-point detection

More recently, neural models trained end-to-end on speaker change data have improved over heuristic methods. The pyannote.audio segmentation model uses a SincNet + LSTM architecture to produce frame-level speaker-change probabilities, trained on artificially created mixtures with known change points. A key design choice: it is better to over-segment (more, shorter segments) than to under-segment. Clustering can merge same-speaker segments; it cannot recover a missed change point that caused two speakers to be fused into one segment from the start.

Key Idea

Over-segmentation is the right failure mode. The segmentation step should err on the side of more segments, shorter durations, and conservative splitting. Clustering handles the merging; it cannot undo a missed split.

Section 04

Short-segment speaker embeddings

The speaker embeddings from Chapter 04 were designed for utterances of 5–30 seconds. Diarization requires embeddings for segments as short as 0.5–2 seconds. This changes the operating regime substantially.

At short durations, the statistics pooling layer in x-vector and ECAPA-TDNN systems averages over very few frames. With 10ms frame shift and 1.5s segments, the pooling window contains 150 frames — adequate for a rough spectral mean but insufficient to capture prosodic variability. EER on VoxCeleb increases dramatically below 3 seconds of audio, and the short-duration gap has motivated dedicated model fine-tuning on short-segment data.

The standard practice is to extract embeddings on fixed-length windows (1–2s) with 50% overlap across the audio, producing a dense sequence of embeddings rather than one per segment. These overlapping windows are then clustered, and segment boundaries are inferred from the cluster-label transitions. This avoids committing to change-point times before clustering, at the cost of ambiguous boundary assignment at transitions.
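Going back from per-window cluster labels to segments means placing boundaries at label transitions, conventionally at the midpoint of the overlap between adjacent windows. A sketch, assuming 1.5s windows at 50% overlap (0.75s stride):

```python
def labels_to_segments(labels, hop=0.75, win=1.5):
    """Turn per-window cluster labels (windows of length `win` at stride
    `hop`) into speaker segments, placing each boundary at the midpoint
    of the overlap region between the two windows that disagree."""
    segments = []
    start, current = 0.0, labels[0]
    for i, lab in enumerate(labels[1:], start=1):
        if lab != current:
            boundary = i * hop + (win - hop) / 2
            segments.append((start, boundary, current))
            start, current = boundary, lab
    segments.append((start, (len(labels) - 1) * hop + win, current))
    return segments

print(labels_to_segments([0, 0, 0, 1, 1, 0]))
```

The midpoint convention is the source of the "ambiguous boundary assignment" mentioned above: the true change point can lie anywhere inside the overlap region, and this sketch simply splits the difference.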

Multi-scale embeddings

Diarization systems that outperform single-scale approaches often compute embeddings at multiple window lengths (e.g., 0.5s, 1.5s, 3s) and fuse them before clustering. Short windows capture rapid transitions but have noisy embeddings; long windows have stable embeddings but miss short speaker turns. The MSDD (Multi-Scale Diarization Decoder) framework uses attention over multi-scale embeddings to produce refined per-frame speaker assignments.

Section 05

Agglomerative Hierarchical Clustering

AHC is the workhorse of classical diarization: it makes no assumption about the number of speakers and produces a full dendrogram of merge decisions that can be cut at any level.

Agglomerative Hierarchical Clustering starts with each segment as its own cluster and iteratively merges the two clusters with the smallest distance (or equivalently, highest similarity). The distance between two clusters is defined by a linkage criterion — for diarization, average linkage (the mean pairwise cosine distance between all segment pairs across the two clusters) is most common, as it is less sensitive to outlier segments than single or complete linkage.
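With SciPy, average-linkage AHC over cosine distances takes a few lines. The synthetic embeddings and the distance threshold used to cut the dendrogram are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# Three synthetic "speakers", five segment embeddings each
centroids = rng.normal(size=(3, 64))
emb = np.vstack([c + 0.05 * rng.normal(size=(5, 64)) for c in centroids])

# Average-linkage AHC on condensed pairwise cosine distances
dists = pdist(emb, metric="cosine")
dendrogram = linkage(dists, method="average")

# Cut the dendrogram at a distance threshold (tuned on dev data in practice)
labels = fcluster(dendrogram, t=0.5, criterion="distance")
print(labels)  # three clusters of five segments each
```

Changing `t` moves the cut height: lower it and same-speaker clusters split apart; raise it and distinct speakers merge.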

The merging continues until a stopping criterion is met. Two options dominate: cutting the dendrogram at a distance (or PLDA-score) threshold calibrated on development data, or stopping when the cluster count reaches a known or separately estimated number of speakers.

PLDA scoring in AHC

When PLDA is available as a backend (fitted on the speaker embedding training set), the cluster distance can be computed as the PLDA log-likelihood ratio between the "same speaker" and "different speakers" hypotheses — a principled Bayesian distance that accounts for within-cluster embedding variability. PLDA-scored AHC consistently outperforms cosine-scored AHC, particularly when recordings contain channel or microphone variability that inflates cosine distances within a single speaker's segments.

Intuition

The dendrogram produced by AHC is a complete history of the merging process. Cutting it at different heights gives different values of $k$. If you're uncertain about the speaker count, you can inspect the dendrogram: large gaps between merge heights correspond to natural cluster boundaries; small gaps suggest the algorithm is merging speakers that may actually be distinct.

Section 06

Spectral clustering

Spectral clustering approaches diarization as a graph partitioning problem, operating on the full pairwise affinity matrix of all segments simultaneously — making it more robust to the greedy local decisions that limit AHC.

Given $n$ segments with embeddings $\{\mathbf{e}_i\}$, construct an affinity matrix $A \in \mathbb{R}^{n \times n}$ where $A_{ij} = \exp(-d_{ij}^2 / \sigma^2)$ and $d_{ij}$ is the cosine distance. The graph Laplacian $L = D - A$ (where $D$ is the diagonal degree matrix) encodes the graph structure. The bottom $k$ eigenvectors of $L$ form a low-dimensional embedding of the segments in which same-speaker segments cluster tightly; k-means in this eigenvector space produces the speaker assignments.
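The recipe above, sketched with NumPy/SciPy on synthetic embeddings (the affinity bandwidth sigma is illustrative, and k-means is run via SciPy for self-containment):

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial.distance import pdist, squareform

def spectral_diarize(emb, k, sigma=0.5):
    """Spectral clustering as in the text: Gaussian affinity from cosine
    distances, unnormalized Laplacian L = D - A, then k-means in the
    space of the bottom-k eigenvectors."""
    d = squareform(pdist(emb, metric="cosine"))
    A = np.exp(-d ** 2 / sigma ** 2)
    L = np.diag(A.sum(axis=1)) - A
    eigvals, eigvecs = np.linalg.eigh(L)   # ascending eigenvalues
    _, labels = kmeans2(eigvecs[:, :k], k, seed=0, minit="++")
    return labels

rng = np.random.default_rng(2)
centroids = rng.normal(size=(2, 64))
emb = np.vstack([c + 0.05 * rng.normal(size=(6, 64)) for c in centroids])
labels = spectral_diarize(emb, k=2)
print(labels)  # first six segments in one cluster, last six in the other
```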

Auto-estimating the number of speakers

The eigengap heuristic estimates $k$ by finding the largest gap between consecutive eigenvalues of the Laplacian — a large gap at position $k$ suggests that $k$ is the natural number of clusters in the data. In practice this heuristic is fragile when the gaps are small. Refined methods include thresholding or binarizing the affinity matrix before the eigendecomposition, and normalized maximum eigengap spectral clustering (NME-SC), which tunes the affinity-pruning parameter jointly with the eigengap-based estimate of $k$.

Spectral clustering became the standard approach after Wang et al. (2018) showed it outperformed AHC on the NIST SRE and CALLHOME benchmarks. It handles non-convex speaker clusters better than centroid-based methods and is particularly effective when embeddings lie on a manifold rather than in well-separated Gaussian blobs.
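The eigengap heuristic itself is a one-liner over the sorted Laplacian eigenvalues; a sketch with hand-picked synthetic eigenvalues:

```python
import numpy as np

def eigengap_num_speakers(eigvals, max_speakers=10):
    """Estimate the number of speakers as the position of the largest gap
    between consecutive Laplacian eigenvalues (sorted ascending)."""
    gaps = np.diff(eigvals[:max_speakers])
    return int(np.argmax(gaps)) + 1

# A Laplacian with three near-disconnected components has three
# near-zero eigenvalues followed by a jump.
eigvals = np.array([0.0, 0.01, 0.02, 1.5, 1.6, 1.8, 2.0])
print(eigengap_num_speakers(eigvals))  # 3
```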

Spectral clustering: the pairwise affinity matrix (left) reveals block structure corresponding to speakers. Eigenvectors of the Laplacian project segments into a space where same-speaker segments cluster tightly, and k-means produces clean speaker assignments.
Section 07

Overlap detection and multi-speaker segments

Standard diarization assumes at most one speaker is active at any moment. In real recordings — particularly meetings — that assumption fails systematically. Overlapping speech is one of the largest single sources of DER and is the hardest problem the field faces.

In meeting recordings, overlapping speech occupies roughly 10–20% of speech time. In telephone conversations it is lower (~5%), but still substantial. The impact on DER is disproportionate: overlap regions produce embeddings that are mixtures of two speakers, confusing any single-label clustering method, and the missed second speaker also contributes to the speaker confusion and missed-speech components of DER.

Overlap detection as a binary classifier

The most common approach adds an overlap detector as a parallel module to the main diarization pipeline. A neural network (typically a TDNN or Transformer) is trained to classify each frame as single-speaker or multi-speaker, producing a binary mask over time. Overlap-flagged regions are then handled separately: the two most likely speakers are assigned jointly, typically by looking at the adjacent single-speaker segments and propagating their labels into the overlap region.

Neural overlap assignment

More sophisticated methods learn to assign overlapping regions directly. The EEND family (Section 08) handles this naturally via multi-label frame outputs. Hybrid systems combine a clustering-based pipeline for the bulk of the audio with an EEND module that processes overlap-flagged windows, re-labeling them with the full permutation-aware multi-speaker output.

Watch Out

Over-detecting overlap is often worse than under-detecting it. An overlap detector with high false-alarm rate will flag single-speaker regions as multi-speaker, causing incorrect second-speaker assignment and inflating speaker confusion in the DER. Precision matters more than recall for overlap detectors used in assignment pipelines.

Section 08

End-to-end neural diarization

EEND reformulates diarization entirely, abandoning the segment-embed-cluster pipeline in favour of a sequence model that maps raw acoustic features directly to per-frame, per-speaker activity probabilities — naturally handling overlap.

Fujita et al. (2019) introduced EEND with a BLSTM trained on simulated two-speaker mixtures created by concatenating random pairs of utterances from the LibriSpeech corpus. The model takes a sequence of log-mel filterbank features and produces, for each frame, a probability vector $\hat{\mathbf{y}}_t \in [0,1]^S$ indicating which of $S$ output channels (speaker slots) is active. Overlap is handled naturally: multiple entries of $\hat{\mathbf{y}}_t$ can be non-zero simultaneously.

Permutation-Invariant Training

Training EEND requires a loss function that is invariant to the arbitrary labeling of speaker output channels. Permutation-Invariant Training (PIT) solves this: for each training example, all $S!$ permutations of the reference speaker labels are tried, and the loss is computed using the permutation that minimizes it: $$\mathcal{L} = \min_{\phi \in \text{Perm}(S)} \sum_{t=1}^{T} \sum_{s=1}^{S} \text{BCE}(\hat{y}_{t,s},\, y_{t,\phi(s)})$$ where $\text{BCE}$ is binary cross-entropy. This prevents the model from collapsing to a single-channel solution and encourages it to partition speakers consistently within an utterance.
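The PIT loss can be written directly from the formula; a brute-force NumPy sketch (fine for small $S$, since all $S!$ permutations are enumerated):

```python
import numpy as np
from itertools import permutations

def bce(p, y, eps=1e-7):
    """Elementwise binary cross-entropy."""
    p = np.clip(p, eps, 1 - eps)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pit_loss(pred, ref):
    """Permutation-invariant BCE: try every permutation of the reference
    speaker channels and keep the lowest total loss.
    pred, ref: (T, S) arrays of activities."""
    S = ref.shape[1]
    return min(bce(pred, ref[:, list(perm)]).sum()
               for perm in permutations(range(S)))

pred = np.array([[0.9, 0.1],
                 [0.9, 0.1],
                 [0.2, 0.8]])
ref_swapped = np.array([[0, 1],
                        [0, 1],
                        [1, 0]])  # same activity pattern, channels swapped
# The identity permutation scores badly; the swap recovers a low loss.
assert pit_loss(pred, ref_swapped) < bce(pred, ref_swapped).sum()
```

In real EEND training the permutation is fixed per utterance (the min is taken before backpropagation), so gradients flow through a single, consistent channel assignment.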

A critical design choice is the source of training data. Real diarization data with precise overlap labels is scarce. EEND systems are typically pre-trained on simulated mixtures (Poisson-distributed speaker durations and silence gaps, clipped from single-speaker corpora) and then fine-tuned on small amounts of real meeting data. The gap between simulated and real distributions is a persistent challenge.

Key Idea

EEND's fundamental advantage over clustering-based diarization is its handling of overlap: the binary cross-entropy loss on each output channel is independent, so the model can activate multiple channels simultaneously without penalty. No post-hoc overlap assignment step is needed.

Section 09

EEND extensions: variable speakers and EDA

The original EEND assumed a fixed number of speakers $S$ known at inference time. Real recordings have an unknown, variable speaker count. The EDA extension and its successors solve this.

EEND-EDA (Encoder-Decoder based Attractor, Horiguchi et al. 2020) introduces an attractor-based mechanism for variable speaker count. An encoder (BLSTM or Transformer) processes the full sequence of acoustic features to produce a context vector. A decoder then iteratively generates speaker attractors $\{\mathbf{a}_s\}$ by attending over the encoder output, until a termination condition (learned stop probability) indicates no more speakers. Each attractor is used as a query to produce the per-frame activation for that speaker: $$\hat{y}_{t,s} = \sigma(\mathbf{h}_t \cdot \mathbf{a}_s)$$ where $\mathbf{h}_t$ is the encoder hidden state at frame $t$. EEND-EDA handles 1–4 speakers with a single model and no speaker-count input at inference time.

Scaling EEND to more speakers

EEND-VC (Vector Clustering) handles recordings with more speakers than the model's fixed output channels by running EEND on short local windows and then clustering the resulting per-window speaker embeddings globally — a hybrid that combines EEND's overlap handling with the scalability of clustering. EEND-OLA (Overlapped Local windows with global Agreements) extends this by enforcing consistency between overlapping window outputs via a stitching algorithm that aligns local permutations.

As of 2024, the leading systems on challenging benchmarks like DIHARD-III combine EEND-based modules for overlap handling with embedding-based clustering for global speaker tracking, supplemented by multi-scale embeddings and learned stopping criteria for speaker counting. No single architectural paradigm dominates across all conditions.

Section 10

Streaming and online diarization

Offline diarization has access to the full recording before producing output. Streaming diarization must produce speaker labels in near real-time, updating incrementally as new audio arrives — which changes the problem fundamentally.

The core challenge of streaming diarization is that clustering is inherently a batch operation: AHC and spectral clustering require the full set of embeddings to construct distance matrices or affinity matrices. Online variants must maintain an evolving speaker model and assign each new segment to an existing speaker or declare a new one.

Online AHC maintains a set of speaker clusters and, for each new segment embedding, finds the nearest cluster by cosine distance. If the distance exceeds a threshold, a new cluster is created. Periodically, clusters are merged if they fall below a separate merge threshold. This requires careful tuning of two thresholds and tends to accumulate speaker fragmentation over time.
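A sketch of the threshold-based online assignment described above, keeping a running-mean centroid per speaker (the threshold value is illustrative, and the periodic merge pass is omitted for brevity):

```python
import numpy as np

class OnlineClusterer:
    """Online clustering sketch: assign each incoming segment embedding
    to the nearest centroid by cosine distance, or open a new speaker
    when the distance exceeds the threshold."""
    def __init__(self, new_speaker_thresh=0.5):
        self.thresh = new_speaker_thresh
        self.centroids = []   # running mean embedding per speaker
        self.counts = []

    def assign(self, emb):
        emb = emb / np.linalg.norm(emb)
        if self.centroids:
            sims = [c @ emb / np.linalg.norm(c) for c in self.centroids]
            best = int(np.argmax(sims))
            if 1.0 - sims[best] < self.thresh:
                self.counts[best] += 1
                self.centroids[best] += (emb - self.centroids[best]) / self.counts[best]
                return best
        self.centroids.append(emb.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

rng = np.random.default_rng(3)
spk = rng.normal(size=(2, 64))
clusterer = OnlineClusterer()
stream = [spk[i % 2] + 0.05 * rng.normal(size=64) for i in range(6)]
labels = [clusterer.assign(e) for e in stream]
print(labels)  # [0, 1, 0, 1, 0, 1]
```

Note the fragility the text warns about: a single noisy embedding that lands past the threshold spawns a phantom speaker that the simple assigner never removes.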

UIS-RNN (Unbounded Interleaved-State Recurrent Neural Network, Zhang et al. 2019) models the sequence of speaker labels as a generative process: each new segment is assigned either to an existing speaker (with probability proportional to speaker frequency, in the manner of a Chinese restaurant process) or to a new speaker. The sequence model is an RNN trained to predict the next speaker label given a history of labels and embeddings. UIS-RNN naturally handles new speakers arriving mid-recording and was one of the first neural approaches to online diarization.

Streaming EEND variants apply EEND to overlapping windows and stitch the outputs, as in EEND-OLA. The stitching step — aligning speaker permutations across window boundaries — is the key inference challenge and is typically solved by Hungarian matching between the speaker embeddings of adjacent windows.
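The stitching step reduces to an assignment problem; a sketch using SciPy's Hungarian solver on per-channel speaker embeddings of two adjacent windows:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def stitch_permutation(prev_emb, next_emb):
    """Align the speaker channels of the next window to the previous one
    by Hungarian matching on pairwise cosine similarity between the
    per-channel speaker embeddings (S, d) of the two windows."""
    p = prev_emb / np.linalg.norm(prev_emb, axis=1, keepdims=True)
    n = next_emb / np.linalg.norm(next_emb, axis=1, keepdims=True)
    sim = p @ n.T
    row, col = linear_sum_assignment(-sim)  # maximize total similarity
    return col  # col[s] = channel in the next window matching channel s

rng = np.random.default_rng(4)
prev = rng.normal(size=(3, 32))
perm = np.array([2, 0, 1])                       # next window's channels permuted
nxt = prev[perm] + 0.05 * rng.normal(size=(3, 32))
print(stitch_permutation(prev, nxt))  # recovers the inverse permutation
```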

Section 11

Speaker-attributed ASR

Diarization's most important downstream application is meeting transcription: producing a word-level record of who said what, when. The gap between a diarization hypothesis and a readable transcript requires integrating diarization with ASR.

The standard pipeline approach runs diarization first, then transcribes each speaker segment independently with an ASR system. This is simple and modular but has well-known failure modes: a speaker change in the middle of a word causes a missed fragment; diarization boundary errors produce truncated words at segment edges; and overlap-attributed segments may confuse the ASR with mixed-speaker audio.

Word-level diarization improves on segment-level attribution by aligning the ASR transcript (with word timestamps) against the diarization hypothesis, assigning each word to the speaker active during that word. When diarization boundaries fall mid-word, the word is assigned to the speaker covering the majority of its duration. This significantly reduces the apparent boundary sensitivity at the transcript level.
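Majority-overlap word attribution is a small amount of code; a sketch over (start, end, text) word tuples from ASR timestamps and a diarization hypothesis:

```python
def attribute_words(words, diarization):
    """Assign each ASR word (start, end, text) to the speaker whose
    diarization segments cover the majority of the word's duration."""
    attributed = []
    for w_start, w_end, text in words:
        overlap = {}
        for s_start, s_end, spk in diarization:
            ov = min(w_end, s_end) - max(w_start, s_start)
            if ov > 0:
                overlap[spk] = overlap.get(spk, 0.0) + ov
        speaker = max(overlap, key=overlap.get) if overlap else None
        attributed.append((w_start, w_end, text, speaker))
    return attributed

diar = [(0.0, 2.0, "SPEAKER_00"), (2.0, 5.0, "SPEAKER_01")]
words = [(0.5, 1.0, "hello"), (1.8, 2.4, "there"), (3.0, 3.5, "hi")]
print(attribute_words(words, diar))
```

The word "there" straddles the 2.0s boundary; 0.4s of it falls under SPEAKER_01 versus 0.2s under SPEAKER_00, so the majority rule attributes it to SPEAKER_01 rather than fragmenting it.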

Joint and end-to-end SA-ASR

True joint models optimize diarization and ASR together. SA-ASR (Speaker-Attributed ASR, Shafey et al. 2019) adds speaker-tracking tokens to the ASR output vocabulary, producing a sequence of words interleaved with speaker-change markers. More recent systems use the speaker embedding extracted from each diarization segment as a conditioning signal for the ASR model — allowing the acoustic model to focus on a specific speaker's voice when decoding their segments. The NOTSOFAR and CHiME-7 challenge systems push this to multi-microphone array diarization combined with neural beamforming and end-to-end ASR.

In practice, the most widely deployed system in 2024 is a pipeline of pyannote.audio for diarization and whisper-large-v3 for transcription, with word-level alignment via forced alignment tools (WhisperX, whisper-timestamped). This combination achieves meeting-grade transcription quality without joint training, at the cost of the failure modes inherent in the pipeline approach.

Intuition

Think of SA-ASR as the output product of the diarization chapter: not a timeline of speaker regions, but a fully attributed transcript where every word is labeled with who said it. This is the deliverable that makes meeting minutes, clinical notes, and legal proceedings possible from raw recordings.

Section 12

Evaluation: DER, JER, and benchmarks

Diarization evaluation is more complex than speaker verification because the output is a time-annotated sequence of labels, not a binary decision. The standard metric — DER — decomposes into three additive error types, each diagnosing a different failure mode.

Diarization Error Rate (DER) is defined as: $$\text{DER} = \frac{\text{Missed Speech} + \text{False Alarm Speech} + \text{Speaker Confusion}}{\text{Total Reference Speech Time}}$$ Each term covers a distinct error type: missed speech is reference speech assigned to no speaker in the hypothesis; false alarm speech is hypothesis speech where the reference contains none; and speaker confusion is speech attributed to the wrong speaker after the optimal label mapping.

Before computing DER, an optimal label permutation is found via the Hungarian algorithm to resolve the label-permutation ambiguity. A collar of ±0.25 seconds around reference boundaries is typically excluded from scoring to avoid penalizing small timing disagreements — though some evaluations (DIHARD) use a zero-collar condition to maximize strictness.
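A frame-based DER sketch that makes the three components and the permutation search explicit (brute-force relabeling stands in for the Hungarian algorithm, no collar is applied, and frames are assumed single-speaker):

```python
from itertools import permutations

def der(ref, hyp):
    """Frame-based DER sketch: single-speaker frames, no scoring collar,
    and no more hypothesis speakers than reference speakers. All
    hyp-to-ref relabelings are searched; production scorers use the
    Hungarian algorithm instead of brute force."""
    speech = sum(r is not None for r in ref)
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    hyp_labels = sorted({h for h in hyp if h is not None})
    ref_labels = sorted({r for r in ref if r is not None})
    confusion = min(
        sum(r is not None and h is not None and mapping[h] != r
            for r, h in zip(ref, hyp))
        for mapping in (dict(zip(hyp_labels, perm))
                        for perm in permutations(ref_labels, len(hyp_labels)))
    )
    return (missed + false_alarm + confusion) / speech

# Reference has speakers A and B; the hypothesis misses one frame and
# confuses one frame, using its own arbitrary label alphabet.
ref = ["A", "A", "A", "B", "B", None, "B"]
hyp = ["X", "X", "Y", "Y", "Y", None, None]
print(f"DER = {der(ref, hyp):.3f}")  # (1 missed + 1 confused) / 6 ≈ 0.333
```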

JER and alternative metrics

Jaccard Error Rate (JER) was introduced in the DIHARD challenges to address a bias in DER: DER weights errors by duration, so a short high-error speaker contributes less to DER than their subjective importance warrants. JER computes a per-speaker Jaccard similarity between reference and hypothesis and averages across all speakers equally: $$\text{JER} = 1 - \frac{1}{N} \sum_{k=1}^{N} \frac{|R_k \cap H_k|}{|R_k \cup H_k|}$$ JER is more sensitive to errors on minority speakers and is now reported alongside DER in most evaluations.

Benchmarks

Dataset | Condition | Speakers/recording | Primary metric
CALLHOME | Telephone, 2-channel | 2–7 | DER
AMI Meeting | Lapel + far-field | 4 | DER
DIHARD-III | 11 diverse domains | 2–10+ | DER + JER
VoxConverse | Wild audio (TV, web) | 1–20+ | DER
CHiME-6/7 | Dinner party, array mic | 4 | DER + tcpWER

Progress on DIHARD-III has been rapid: the winning systems in the 2020 challenge achieved DER around 14% in the full evaluation condition; by 2023, EEND-based systems with multi-scale embeddings pushed below 8% on the same benchmark. CALLHOME, once a difficult benchmark, is now largely solved — the best systems achieve below 5% DER with oracle VAD.

DER decomposition: the reference has three speakers (A, B, C); the hypothesis misses a short region and confuses SPK C for SPK A. Missed speech, false alarm, and speaker confusion each contribute additively to the final DER score.
