Automatic speech recognition is the oldest active machine-learning problem still in production use and has been the proving ground for nearly every probabilistic-modelling idea that later spread across ML. It is not solved: low-resource languages, overlapping speech, and streaming low-latency inference remain genuinely hard. This chapter traces the full arc from HMM-GMM through CTC, RNN-Transducer, attention encoder-decoders, and Conformers to the Whisper paradigm — weakly supervised, massive scale — and covers evaluation, decoding, and deployment.
Sections one through three are orientation and data. Section one is why ASR matters — why a chapter that might sound niche is actually one of the most commercially deployed applications of ML in the world; why the modern voice interface (Alexa, Siri, Google Assistant, automotive, medical scribes, captioning, call-centre analytics, accessibility, translation) sits on an ASR substrate; and how the ASR pipeline differs from every other sequence-modelling problem in its extreme length ratios between input and output. Section two is the ASR landscape: the five eras (HMM-GMM, hybrid HMM-DNN, CTC, attention-based, Conformer-plus-self-supervision-plus-Whisper), who built what, which benchmarks mattered, and how the problem statements themselves shifted. Section three is speech data and corpora: LibriSpeech, Common Voice, TED-LIUM, SPGISpeech, GigaSpeech, VoxPopuli, People's Speech, Fleurs, CoVoST, Switchboard, Fisher, the forced-alignment problem, transcription conventions, and why data curation dominates modern ASR performance.
Sections four through six are the classical foundations. Section four is HMM-GMM acoustic modelling: the left-to-right HMM topology, context-dependent triphones, decision-tree state tying, GMM emission densities, the Baum-Welch EM algorithm, Viterbi decoding, MLLR and fMLLR speaker adaptation, and why every modern ASR engineer still needs to read this vocabulary. Section five is hybrid HMM-DNN: replacing the GMM emission density with a deep neural network's posterior, Mohamed-Dahl-Hinton 2012, sequence-discriminative training (MMI, sMBR, LF-MMI), Kaldi's chain models, TDNN and TDNN-F acoustic models, and the five-year gap in which hybrid systems dominated. Section six is Connectionist Temporal Classification: Graves 2006, the blank symbol, the forward-backward CTC loss, alignment-free sequence training, the conditional-independence assumption and its consequences, greedy and beam decoding, prefix-beam decoding with an external LM, and the CTC systems (DeepSpeech, DeepSpeech 2, Wav2letter, QuartzNet, Citrinet) that launched end-to-end ASR into production.
Sections seven through nine cover the other end-to-end families and the modern backbone. Section seven is the RNN-Transducer (RNN-T): Graves 2012, the joint network over encoder and prediction-network states, the transducer loss, why RNN-T is the dominant streaming architecture at Google / Amazon / Apple, alignment-restricted training, monotonic transducer variants, stateless prediction networks, and the memory pragmatics of computing the loss over the T × U joint lattice. Section eight is Listen, Attend and Spell / attention-based encoder-decoders: the LAS paper, soft attention over encoder outputs, the exposure-bias problem, label-smoothing and scheduled sampling, hybrid CTC/attention training (Watanabe), joint CTC-attention decoding, and the class of systems that culminates in Whisper. Section nine is the Conformer: Gulati 2020, the convolution-plus-attention macaron block, why Conformer is now the default encoder for CTC, RNN-T, and attention alike, E-Branchformer and Zipformer variants, and the way this architecture unified the three end-to-end paradigms.
Sections ten through twelve cover streaming, Whisper-scale supervision, and multilinguality. Section ten is streaming ASR: the latency problem, the chunk-wise and monotonic-attention approaches, RNN-T as the natural streaming architecture, look-ahead windows, Emformer and dynamic chunking, trigger attention, MoChA / MILK / MAtCha, endpointing and voice-activity detection, and the word-level latency metrics that drive the field. Section eleven is Whisper and large-scale weakly-supervised ASR: Radford 2022, 680 000 hours of web audio, the multi-task decoder (transcribe, translate, timestamps, language identification), Whisper's failure modes (hallucinated text, long-form drift, timestamp jitter), faster-whisper / whisper.cpp / Distil-Whisper, and the followers — USM, SeamlessM4T, Canary, Parakeet, OWSM — that built on the recipe. Section twelve is multilingual and low-resource ASR: MMS (Meta's Massively Multilingual Speech, 1100+ languages), XLS-R, USM's 300+ languages, cross-lingual transfer, the zero-shot problem, phoneme-based vs grapheme-based multilingual heads, code-switching, and the Fleurs and VoxLingua benchmarks.
Sections thirteen through fifteen cover self-supervised pretraining, decoding theory, and language-model integration. Section thirteen is self-supervised ASR representations: CPC, wav2vec, wav2vec 2.0 (masked-span prediction on quantised latents), HuBERT (clustered-target iterative pseudo-labelling), WavLM (denoising-style masking), data2vec (unified SSL across modalities), how these are fine-tuned for downstream ASR (CTC head on frozen / unfrozen encoder), and why SSL made 10-hour and even 1-hour fine-tuning budgets viable. Section fourteen is decoding: greedy decoding, beam search, prefix-beam search for CTC, label-synchronous beam search for attention, frame-synchronous beam search for RNN-T, WFST composition (HCLG = H ∘ C ∘ L ∘ G), the Kaldi decoding graph, and shallow fusion / deep fusion / cold fusion with external LMs. Section fifteen is language models and rescoring: n-gram LMs (SRILM, KenLM), neural LMs, first-pass shallow fusion, second-pass rescoring, internal-LM estimation and subtraction, density-ratio method, domain biasing and hotword boosting, and why an external LM is still worthwhile even with a powerful encoder-decoder.
Sections sixteen through eighteen cover evaluation, deployment, and the chapter's operational closing. Section sixteen is evaluation: the word-error-rate metric (substitutions + insertions + deletions, divided by reference length), character-error-rate for Asian languages, WER's limitations (semantic equivalence, capitalisation, punctuation, hesitation tokens), the Kincaid-TUDa test sets, long-form WER, streaming WER with latency, hallucination-rate for Whisper-class models, fairness metrics across demographic groups, and the human-parity debates. Section seventeen is deployment: real-time-factor budgets, CPU vs GPU serving, quantisation (int8, nvfp4), batching strategies, streaming protocols (WebSocket, gRPC), VAD and endpointing pipelines, speaker-diarization integration, inverse-text-normalisation, punctuation and capitalisation post-processing, and the shipping stacks (NVIDIA Riva / NeMo, Kaldi / k2 / icefall, ESPnet, SpeechBrain, Whisper.cpp, torchaudio). Section eighteen is the chapter's closing: how ASR relates to the rest of Part VIII (TTS, speaker recognition, audio classification, music generation), the ASR-as-a-component pattern (voice assistants, captioning, translation, meeting summarisation, RAG-over-audio), the modality convergence into speech-foundation-models (Seamless, Qwen-Audio, GPT-4o), and the links back to Parts V–VII (attention, transformers, sequence models, VLMs) that made the last five years of ASR progress possible.
Every speech, music, and sound model you will ever touch — HMM-GMM ASR from 1995, DeepSpeech from 2014, Whisper from 2022, AudioLM and MusicLM from 2023, GPT-4o's audio front end in 2024 — shares a hidden dependency: a carefully engineered audio signal-processing pipeline that converts continuous pressure waves into numerical arrays. This chapter is about that pipeline. It is the foundation on which every subsequent chapter of Part VIII — automatic speech recognition, text-to-speech, speaker recognition, sound classification, music generation — is built.
It is tempting, in the era of end-to-end deep learning, to assume that classical digital signal processing (DSP) is obsolete: surely a sufficiently expressive neural network can learn everything from raw waveforms? The reality is more nuanced. End-to-end waveform models exist — SincNet, LEAF, and wav2vec 2.0 operate directly on samples — but even they incorporate DSP-inspired inductive biases (band-pass filters, learnable mel-like filterbanks, residual vector quantisation). And for the overwhelming majority of deployed systems — including Whisper, SeamlessM4T, Bark, Tacotron, WaveNet vocoders, and every industrial ASR pipeline — the first stage is still a log-mel-spectrogram computed with a short-time Fourier transform, Hann windows, and a mel filterbank whose parameters descend directly from psychoacoustic research of the 1930s–1940s. DSP is not gone; it has been absorbed into the training pipeline.
Why should the choices at this layer matter so much? Because they set the effective resolution of everything downstream. A 10 ms hop at 16 kHz gives a spectrogram with 100 frames per second — enough temporal resolution for phonemes but not for fast music transients. A 16 kHz sample rate discards everything above 8 kHz, which is fine for speech but mutilates the brilliance of a violin. A 64-bin log-mel spectrogram compresses frequency by roughly a factor of four relative to the 257 bins of a 512-point STFT, which is efficient for Conformer-based ASR but insufficient for fine-grained music-tagging tasks. Every downstream model inherits these trade-offs; a mis-chosen front end is invisible in the loss curve and catastrophic in production.
There is also an engineering story. Audio is high-bandwidth: 16-bit mono at 16 kHz is already 256 kbit/s, and 16-bit stereo at 48 kHz is six times larger. A one-hour recording runs from roughly 115 megabytes (16 kHz mono) to nearly 700 megabytes (48 kHz stereo) uncompressed. This dwarfs images — a 224×224 RGB image is 150 kB — and makes the question of how to compactly represent audio central to every practical system. The 2023 neural-audio-codec revolution (SoundStream, EnCodec, DAC) is the modern answer: quantise waveforms into 75–150 discrete tokens per second and let a decoder-only transformer generate them, exactly as an LLM generates text. But even the codec's encoder is, at heart, a learnable front end — and understanding the classical front end is the prerequisite to understanding its neural successor.
A second reason to learn this material well: audio is where the modern multimodal stack is heading. GPT-4o and Gemini 2.0 accept audio natively. Voice assistants, meeting bots, hearing aids, accessibility tools, music-creation apps, audio-search engines, and the emerging genre of audio foundation models all sit on the same DSP foundation. If you plan to build or debug any of these, you will be reading spectrograms, tuning STFT parameters, comparing sample rates, and arguing about mel-bin counts — skills that are essentially classical, regardless of which deep model you deploy on top.
This chapter is structured around the pipeline: physics and perception, digital-audio fundamentals, time-domain features, Fourier analysis, the STFT, windowing, spectrograms, the mel scale, mel-spectrograms, MFCCs, the broader feature zoo, pitch and onsets, source separation, enhancement, learnable front ends, neural codecs, and finally an operational overview of how these pieces stitch into real-world ML pipelines. Later chapters of Part VIII take this foundation for granted; this is the layer you can always come back to.
Sound is a mechanical pressure wave in a medium — typically air. A vibrating source (a vocal fold, a guitar string, a loudspeaker cone) periodically compresses and rarefies the surrounding molecules, and that pattern of pressure variations propagates outward at roughly 343 m/s at room temperature. Everything else — frequency, amplitude, timbre, pitch, loudness — is a description of this one underlying phenomenon.
The two primary parameters are frequency (how many oscillations per second, measured in Hertz, Hz) and amplitude (the magnitude of the pressure variations, usually measured logarithmically as sound pressure level in decibels). A pure sinusoid at frequency f and amplitude A is fully described by these two numbers plus a phase. Real sounds are superpositions of such sinusoids across a wide frequency range — human speech occupies roughly 80 Hz (deep voice fundamental) to 8 kHz (fricatives like /s/ and /ʃ/); musical instruments span 20 Hz to 20 kHz; bat echolocation goes to 100 kHz. The human auditory system responds to roughly 20 Hz – 20 kHz at the young-adult extreme, and degrades from the top end with age and noise exposure.
The decibel (dB) is logarithmic: a 10 dB increase corresponds to a factor-of-ten increase in power (or a factor of √10 ≈ 3.16 in amplitude). Sound pressure level (SPL) is measured in dB relative to 20 μPa, the threshold of hearing at 1 kHz for young adults. A whisper is around 30 dB SPL; normal conversation is 60 dB; a rock concert is 110 dB; the pain threshold is near 130 dB. Logarithmic scaling matches perception: doubling perceived loudness typically requires roughly a ten-fold increase in power — a 10 dB step — which is why a logarithmic scale tracks loudness so naturally.
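A quick numerical sketch of these dB conversions in plain NumPy (the helper names here are ours, not from any library):

```python
import numpy as np

def power_to_db(power_ratio):
    """10 dB per factor of ten in power."""
    return 10.0 * np.log10(power_ratio)

def amplitude_to_db(amp_ratio):
    """Power goes as amplitude squared, hence the factor of 20."""
    return 20.0 * np.log10(amp_ratio)

# A factor-of-ten power increase is +10 dB ...
print(power_to_db(10.0))                        # 10.0
# ... equivalently a sqrt(10) ≈ 3.16x amplitude increase:
print(amplitude_to_db(np.sqrt(10.0)))           # ≈ 10.0

# Sound pressure level is amplitude dB relative to 20 µPa:
p_ref = 20e-6
p_conversation = p_ref * 10 ** (60 / 20)        # pressure at 60 dB SPL
print(amplitude_to_db(p_conversation / p_ref))  # ≈ 60.0
```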
Phase is the position within an oscillation cycle. Two pure tones of identical frequency and amplitude but opposite phase cancel completely — this is the principle behind active noise cancellation headphones. Phase matters for spatial hearing (the inter-aural time difference is a phase difference) and for waveform reconstruction from spectrograms (the fundamental reason Griffin-Lim is an approximation), but humans are remarkably insensitive to the absolute phase of stationary sounds — a crucial fact that justifies the magnitude-only spectrogram that dominates audio ML.
Psychoacoustics is the science of what the auditory system actually perceives. Two facts are especially important. First, the equal-loudness contours (Fletcher-Munson 1933, ISO 226): a 30 Hz tone must be about 60 dB louder than a 1 kHz tone to sound equally loud. This is why A-weighting (which de-emphasises low and very high frequencies) is the standard noise-measurement weighting. Second, auditory masking: a loud tone at one frequency raises the perceptual threshold of nearby frequencies (frequency masking), and a loud transient masks softer sounds that occur shortly before or after (temporal masking). Perceptual audio codecs (MP3, AAC, Opus) exploit masking aggressively — they allocate fewer bits to masked regions of the spectrogram — and the same principle motivates why certain log-magnitude compressions are good for ML front ends.
The mel scale (Section 09) and the critical band structure (Bark scale, ERB scale) are consequences of the cochlea's physical construction. The basilar membrane inside the cochlea is a frequency-tuned resonator: high frequencies stimulate the base, low frequencies stimulate the apex, and the tuning width grows nonlinearly with frequency. This means the ear resolves two 100 Hz-apart pure tones near 200 Hz (easy) but not near 10 kHz (impossible). Any feature that ignores this — a linear-frequency spectrogram, say — is giving a neural network information the ear does not use, and burying information the ear does.
Finally, sound propagates: it reflects off walls (producing reverberation), it attenuates with distance, it diffracts around obstacles, and it is perturbed by wind and temperature gradients. Every deployed audio system must cope with these, which is why far-field speech recognition, beamforming, and room-acoustic modelling (dereverberation, RIR estimation) are live subfields. We return to them in the enhancement and source-separation sections.
To process audio on a computer, the continuous pressure wave picked up by a microphone must be converted into a sequence of numbers. This conversion has two steps — sampling (discretising time) and quantisation (discretising amplitude) — and the choices made at each step constrain everything that can be done downstream.
Sampling replaces the continuous signal x(t) with the sequence x[n] = x(n T), where T is the sampling period and f_s = 1/T is the sample rate. The Nyquist-Shannon sampling theorem is the foundational result: any signal whose Fourier transform vanishes above frequency B can be perfectly reconstructed from samples at rate f_s > 2B. Violating this — attempting to represent a 12 kHz tone with a 16 kHz sample rate, whose Nyquist limit is 8 kHz — causes aliasing: the high-frequency content folds back into the representable band as spurious lower frequencies. Real ADCs therefore always apply an anti-aliasing filter (a steep analogue low-pass below the Nyquist limit) before sampling.
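The folding arithmetic can be verified directly. A NumPy sketch using the 12 kHz / 16 kHz example from the text:

```python
import numpy as np

fs = 16_000                 # sample rate; the Nyquist limit is 8 kHz
n = np.arange(64)

# A 12 kHz cosine sampled at 16 kHz ...
above_nyquist = np.cos(2 * np.pi * 12_000 * n / fs)
# ... folds back to fs - 12 kHz = 4 kHz: the sample sequences are identical.
alias = np.cos(2 * np.pi * 4_000 * n / fs)

print(np.max(np.abs(above_nyquist - alias)))   # ~0: indistinguishable after sampling
```

Once sampled, no downstream processing can tell the two tones apart — which is exactly why the anti-aliasing filter must act before the ADC.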
Canonical sample rates cluster around the intended bandwidth: 8 kHz (narrow-band telephony, up to 4 kHz content), 16 kHz (wide-band speech, up to 8 kHz — the default for ASR), 22.05 kHz (half of CD rate, lightweight audio), 24 kHz (modern TTS), 32 kHz (broadcasting), 44.1 kHz (CD / consumer music), and 48 kHz (professional audio and video). Whisper trains at 16 kHz, Bark at 24 kHz, MusicLM at 24 kHz, EnCodec's music mode at 48 kHz. Choosing the wrong rate — e.g. processing 48 kHz music with a 16 kHz model — either discards essential content or wastes compute.
Quantisation replaces each continuous amplitude with a finite integer. Common bit depths are 8-bit (μ-law telephony, 256 levels), 16-bit (CD quality, 65 536 levels, the overwhelming default), 24-bit (professional, 16.7 M levels), and 32-bit float (modern DAWs and intermediate processing). The signal-to-quantisation-noise ratio of a uniform B-bit quantiser of a full-scale sinusoid is approximately 6.02 B + 1.76 dB — so 16-bit audio has ≈ 98 dB of dynamic range, which essentially covers human hearing's roughly 100 dB window at typical listening levels. Dither — adding a small amount of random noise before quantisation — breaks up the correlation between quantisation error and the signal and is standard practice at low bit depths.
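The 6.02 B + 1.76 dB rule is easy to check empirically; a minimal sketch with a plain uniform rounding quantiser (an idealisation, not any particular ADC model):

```python
import numpy as np

def measured_sqnr_db(bits, n=200_000):
    """Quantise a full-scale sinusoid uniformly and measure the SNR."""
    t = np.arange(n)
    x = np.sin(2 * np.pi * 440 * t / 48_000)   # full-scale 440 Hz sine
    scale = 2 ** (bits - 1) - 1
    xq = np.round(x * scale) / scale           # uniform rounding quantiser
    noise = x - xq
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

for bits in (8, 16):
    print(bits, measured_sqnr_db(bits), 6.02 * bits + 1.76)
```

The measured values land within a fraction of a dB of the formula, because the quantisation error of a non-harmonically-related sine is very nearly uniform white noise of power Δ²/12.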
A related pitfall is sample-rate conversion: downsampling must apply a low-pass filter below the new Nyquist limit first (use a vetted resampler such as soxr or torchaudio.functional.resample); naïve decimation causes aliasing.
Channel layout matters almost as much. Mono is a single channel; stereo is left/right; 5.1 is multichannel surround. Most speech processing downmixes to mono; music work often preserves stereo; multi-microphone arrays (smart speakers, meeting-room capture) operate on raw channel counts up to 8 or 16 and do beamforming before any single-channel stage.
File formats fall into three groups. Uncompressed PCM (WAV, AIFF) stores raw samples — the canonical lossless representation and the default in almost every ML pipeline. Lossless compression (FLAC, ALAC) uses entropy coding to shrink PCM by 40–60% with bit-exact recovery — preferred for archival and dataset distribution. Lossy compression (MP3, AAC, Opus, Vorbis) discards perceptually redundant content, typically reducing size by 10× or more; Opus at 24 kb/s is broadcast-quality for speech. Loading any of these in Python is a one-liner with soundfile, librosa.load, or torchaudio.load, but beware: lossy formats introduce subtle spectral artefacts that can hurt model performance if training data and deployment data use different codecs.
A practical gotcha: Python audio libraries differ in their default output format. torchaudio.load returns a float tensor in [-1, 1]; librosa.load also returns float; scipy.io.wavfile.read returns int16 and requires dividing by 32768 to get floats. Half of all audio bugs come from this mismatch.
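A defensive normalisation helper along these lines (our own illustrative function, not a library API) sidesteps the mismatch:

```python
import numpy as np

def to_float32(audio):
    """Normalise integer PCM to float32 in [-1, 1]; pass floats through."""
    if audio.dtype == np.int16:
        return audio.astype(np.float32) / 32768.0
    if audio.dtype == np.int32:
        return audio.astype(np.float32) / 2147483648.0
    if audio.dtype == np.uint8:                      # 8-bit WAV is unsigned
        return (audio.astype(np.float32) - 128.0) / 128.0
    return audio.astype(np.float32)

# Simulated scipy.io.wavfile.read output (int16):
pcm = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
print(to_float32(pcm))    # [-1.0, 0.0, 0.5, ~0.99997]
```

Calling this on every loaded array, whatever its source, makes the rest of the pipeline dtype-agnostic.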
Before any frequency analysis, there is a rich toolkit of features and operations that work directly on the waveform. Many of them look elementary; all of them are still in daily use. Whether you are writing a voice-activity detector for a wake-word system or a beat-tracker for a DAW, the time domain is where you start.
Framing is the first operation in almost every pipeline. The waveform is sliced into overlapping frames of length N samples (typically 20–30 ms, so 320–480 samples at 16 kHz) with a hop of H samples (10 ms = 160 samples is standard). This converts a 1D sequence into a 2D array of shape (frames, samples-per-frame). Every subsequent time-frequency transform operates on these frames. The hop sets the effective frame rate — 10 ms hop gives 100 frames per second, which matches phoneme duration and is the de-facto standard for speech.
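Framing can be sketched in a few lines of NumPy (the `frame_signal` helper is illustrative; librosa.util.frame and torch.Tensor.unfold do the same job):

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    """Slice a 1-D waveform into overlapping frames, shape (n_frames, frame_length)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return x[idx]

fs = 16_000
x = np.random.default_rng(0).standard_normal(fs)            # one second of audio
frames = frame_signal(x, frame_length=400, hop_length=160)  # 25 ms / 10 ms
print(frames.shape)   # (98, 400): ~100 frames per second
```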
Root-mean-square energy (RMS) and its close cousin short-time energy are per-frame scalars summarising amplitude: RMS = sqrt(mean(x²)). Plotted over time, this is the signal's envelope. RMS drives voice-activity detection, auto-gain-control, compressors, and most silence-trimming heuristics. It is trivially computable and surprisingly powerful.
Zero-crossing rate (ZCR) counts how often the signal changes sign per frame. High ZCR correlates with high-frequency content: voiced speech has low ZCR (the vocal fold period dominates), unvoiced fricatives (/s/, /ʃ/, /f/) have high ZCR, and percussion transients have very high ZCR. ZCR is a cheap and historically important feature for voiced/unvoiced detection and music/speech discrimination.
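Both per-frame features take only a few lines; a sketch with synthetic "voiced" and "fricative" frames:

```python
import numpy as np

def rms(frame):
    """Root-mean-square energy of one frame."""
    return np.sqrt(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

fs = 16_000
t = np.arange(160) / fs                       # one 10 ms frame
voiced = np.sin(2 * np.pi * 120 * t)          # low-pitched, voiced-like tone
rng = np.random.default_rng(0)
fricative = rng.standard_normal(160) * 0.1    # noise-like, fricative-like

print(rms(voiced), zero_crossing_rate(voiced))        # high energy, low ZCR
print(rms(fricative), zero_crossing_rate(fricative))  # low energy, ZCR near 0.5
```

The two numbers together already separate voiced speech from fricatives and silence — the core of a classical energy/ZCR voice-activity detector.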
Autocorrelation — the correlation of the signal with a shifted copy of itself — reveals periodicities. The first non-trivial autocorrelation peak for a voiced signal lies at the pitch period, which is the basis of pitch-estimation algorithms (YIN, pYIN in Section 13). Autocorrelation is also the foundation of linear predictive coding (LPC), which models the vocal tract as an all-pole filter and was the workhorse of low-bit-rate speech coding for three decades.
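A minimal autocorrelation pitch estimator illustrates the idea (a toy sketch, far less robust than YIN or pYIN):

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=50.0, fmax=500.0):
    """f0 = fs / (lag of the strongest autocorrelation peak in the pitch range)."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # keep non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 16_000
t = np.arange(4_000) / fs                               # 0.25 s of signal
# A 200 Hz "voiced" tone with a few decaying harmonics:
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
print(pitch_autocorr(x, fs))    # 200.0: the peak sits at a lag of 80 samples
```

Restricting the lag search to the plausible pitch range is essential; without it the trivial peak at lag 0 always wins.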
Voice-activity detection (VAD) decides which frames contain speech. Classical VADs (ITU-T G.729 Annex B, WebRTC VAD) combine energy, ZCR, and periodicity with hand-tuned thresholds; modern VADs (Silero-VAD, NVIDIA Marblenet) are small neural networks operating on mel-spectrogram features. VAD matters more than it looks: in a long conversation, 50–70% of frames contain no speech, and running an ASR model on all of them wastes compute and frequently hallucinates. VAD-gated inference is standard in every production speech pipeline.
Silence trimming and loudness normalisation are close cousins. Trimming removes leading/trailing silence (librosa.effects.trim, torchaudio.functional.vad). Loudness normalisation rescales a clip so its measured loudness (LUFS, the broadcast standard) matches a target — essential for training data where clip gains vary wildly.
Two more operations deserve mention. Resampling converts between sample rates using a polyphase-FIR or sinc-interpolation filter; it is fundamental because trained models assume a single rate. Time stretching and pitch shifting — changing duration without changing pitch, or vice versa — are the foundation of data augmentation (SpecAugment time stretching, speed-perturbation) and musical effects (phase vocoder, WSOLA, Rubberband). These operations are not free; poor implementations introduce phase artefacts that a neural network can easily detect. Use vetted libraries (soxr, torchaudio.transforms.Resample, pedalboard, rubberband) rather than rolling your own.
The Fourier transform is the single most important mathematical tool in audio. It decomposes a signal into a sum of sinusoids at different frequencies — a frequency-domain representation — and virtually every feature, filter, model, and perceptual fact in this chapter is grounded in it. A careful reader should be comfortable with three flavours: the continuous Fourier transform (for theory), the discrete Fourier transform (DFT) for finite sampled signals, and the fast Fourier transform (FFT) for computation.
The continuous Fourier transform of a signal x(t) is X(f) = ∫ x(t) exp(−2πi f t) dt. It maps a function of time to a function of frequency; the inverse transform maps back. The transform is linear, preserves energy (Parseval's theorem), turns convolution into multiplication, and turns differentiation into multiplication by 2πi f. These properties are the reason every linear time-invariant system — every filter — is fully characterised by its frequency response, the Fourier transform of its impulse response.
The discrete Fourier transform (DFT) operates on N samples: X[k] = Σ_{n=0}^{N−1} x[n] exp(−2πi k n / N). The output is N complex numbers, one per frequency bin. Bin k corresponds to frequency k · f_s / N Hz. The first bin is DC; the last unique bin is the Nyquist frequency; the upper half mirrors the lower half for real-valued inputs, which is why audio libraries store only the first N/2 + 1 bins (rfft rather than fft).
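The bin-to-frequency mapping, concretely, in NumPy:

```python
import numpy as np

fs, N = 16_000, 512
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz tone

spectrum = np.fft.rfft(x)
print(len(spectrum))                         # 257 = N/2 + 1 bins for real input

freqs = np.fft.rfftfreq(N, d=1 / fs)         # bin k -> k * fs / N Hz
peak_bin = int(np.argmax(np.abs(spectrum)))
print(peak_bin, freqs[peak_bin])             # bin 32 -> 1000.0 Hz
```

1 kHz lands exactly on a bin here because 1000 is an integer multiple of the 31.25 Hz bin spacing; Section 07's windows exist precisely because most real frequencies do not.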
The fast Fourier transform (FFT) is not a different transform but an algorithm for computing the DFT in O(N log N) time instead of O(N²). Cooley and Tukey rediscovered it in 1965 (Gauss had sketched it in 1805); it is arguably the most important algorithmic contribution of the twentieth century. Every audio library uses it: NumPy's np.fft, PyTorch's torch.fft, CUDA's cuFFT, Intel MKL's FFT. FFT sizes are typically powers of two (512, 1024, 2048) because power-of-two FFTs are the fastest, though modern implementations handle arbitrary sizes efficiently.
The magnitude spectrum |X[k]| is the basis of all spectrogram-based features. For a periodic signal, the magnitude spectrum has sharp peaks at the fundamental frequency f0 and its integer multiples (the harmonics) — this is why voiced speech shows distinct horizontal striations in a spectrogram, and why pitch estimation from the magnitude spectrum is feasible. The power spectrum is |X[k]|² and is more convenient when we want energy-based quantities (e.g. for the mel filterbank).
The phase spectrum ∠X[k] is harder to interpret: it varies rapidly with the time origin, wraps modulo 2π, and is essentially noise-like for typical audio. But it is not irrelevant: the phase carries the transient timing information that distinguishes a snare drum from a snare drum played backwards, and it is the reason inverting a spectrogram to a waveform requires either an iterative phase-recovery algorithm (Griffin-Lim) or a learned neural vocoder (WaveNet, HiFi-GAN, BigVGAN — Section 17).
A subtle but practical point: the DFT assumes the signal is periodic with period N. When we take the DFT of a finite-length segment of a non-periodic signal, we are implicitly treating that segment as one period of an infinite tiling. Unless the segment happens to start and end at the same value with the same slope, this creates a discontinuity — a step function repeated every N samples — whose Fourier transform has energy at every bin. The result is spectral leakage: power from one actual sinusoid spreads into neighbouring bins. Windowing (Section 07) is the fix: taper the segment to zero at both ends so the implicit periodic tiling is smooth.
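Leakage is easy to demonstrate: an off-bin sinusoid analysed with an (implicit) rectangular window spreads energy widely, and a Hann taper — anticipating Section 07 — suppresses most of it. A NumPy sketch:

```python
import numpy as np

fs, N = 16_000, 512
t = np.arange(N) / fs
# Bin spacing is 16000/512 = 31.25 Hz, so a 1015 Hz tone falls
# between bins and leaks into its neighbours.
x = np.sin(2 * np.pi * 1015 * t)

rect = np.abs(np.fft.rfft(x))                  # no taper = rectangular window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-tapered frame

def db_below_peak(mag, offset):
    """Magnitude `offset` bins beyond the peak, relative to the peak, in dB."""
    peak = np.argmax(mag)
    return 20 * np.log10(mag[peak + offset] / mag[peak])

# Ten bins from the tone, the rectangular analysis has leaked far
# more energy than the Hann analysis:
print(db_below_peak(rect, 10))   # tens of dB below the peak
print(db_below_peak(hann, 10))   # far lower still
```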
Audio is non-stationary: the spectrum of a sentence changes from phoneme to phoneme every few tens of milliseconds, and the spectrum of a song changes from note to note every fraction of a second. A single DFT of an entire clip is almost useless because it throws away time. The short-time Fourier transform (STFT) is the standard response: compute a DFT on each short frame so we get a local spectrum that evolves over time.
Formally, the STFT of x[n] with window w[n] of length N and hop H is X[m, k] = Σ_{n=0}^{N−1} x[m H + n] · w[n] · exp(−2πi k n / N). The output X[m, k] is a 2D complex array indexed by frame m (time) and bin k (frequency). The time resolution is H / f_s seconds; the frequency resolution is f_s / N Hz. Typical speech settings at 16 kHz: window = 400 samples (25 ms), hop = 160 samples (10 ms), FFT size = 512, giving 257 frequency bins and 100 frames per second.
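A from-scratch STFT matching the formula and the quoted speech settings (illustrative only; in practice use torch.stft or librosa.stft, which also handle centring and padding):

```python
import numpy as np

def stft(x, n_fft=512, win_length=400, hop_length=160):
    """Frame, apply a Hann window, zero-pad to n_fft, rfft — per the formula."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop_length
    out = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * hop_length : m * hop_length + win_length]
        out[m] = np.fft.rfft(frame * window, n=n_fft)
    return out

fs = 16_000
x = np.random.default_rng(0).standard_normal(fs)   # one second of audio
X = stft(x)
print(X.shape)     # (98, 257): ~100 frames per second, 257 frequency bins
```

Zero-padding the 400-sample window to a 512-point FFT is the standard trick for getting a fast power-of-two transform without changing the analysis window length.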
The inverse STFT reconstructs the waveform from the complex STFT. Given perfect magnitude and phase, and a window that satisfies the constant-overlap-add (COLA) condition — the summed, shifted windows equal a constant — the reconstruction is exact: x̂[n] = (1/C) Σ_m (IDFT(X[m, :]))[n − m H] · w[n − m H]. The practical method is torch.istft / librosa.istft. A 50% overlapping Hann window satisfies COLA; so does a 75% overlapping Hann; some window/hop combinations do not, and produce amplitude modulation at the frame rate.
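Round-trip exactness can be checked with a hand-rolled overlap-add ISTFT. This sketch windows only at analysis and normalises by the summed shifted windows; library ISTFTs such as torch.istft, which window at synthesis too, normalise by the squared sum instead:

```python
import numpy as np

def stft_frames(x, win, hop):
    """Windowed frames -> rfft, as in the analysis formula."""
    n_frames = 1 + (len(x) - len(win)) // hop
    return np.stack([np.fft.rfft(x[m * hop : m * hop + len(win)] * win)
                     for m in range(n_frames)])

def istft_frames(X, win, hop):
    """Inverse-DFT each frame, overlap-add, divide by the summed windows."""
    n = hop * (len(X) - 1) + len(win)
    out, norm = np.zeros(n), np.zeros(n)
    for m, spec in enumerate(X):
        out[m * hop : m * hop + len(win)] += np.fft.irfft(spec, n=len(win))
        norm[m * hop : m * hop + len(win)] += win
    return out / np.maximum(norm, 1e-12)

win, hop = np.hanning(512), 128              # Hann at 75% overlap
x = np.random.default_rng(0).standard_normal(16_000)
x_hat = istft_frames(stft_frames(x, win, hop), win, hop)

# Away from the edges (where fewer windows overlap), reconstruction is exact
# up to floating-point rounding:
print(np.max(np.abs(x[512:-512] - x_hat[512:-512])))
```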
When we need only an amplitude image — as almost every modern audio model does — we take |X[m, k]| and discard the phase. This is the magnitude STFT, and when we additionally remap frequency with a mel filterbank and apply log compression, we get the log-mel spectrogram of Section 10. Note that discarding phase is lossy: the same magnitude can correspond to many different waveforms. Reconstructing a waveform from only the magnitude is the business of Griffin-Lim (iteratively re-estimating phase from a magnitude spectrogram under the constraint that the result's STFT has the target magnitude) and of neural vocoders (learning the phase from data).
Griffin-Lim (Griffin and Lim 1984) is the classical phase-recovery algorithm: start with random phase, take inverse STFT, retake STFT, replace the magnitude with the target, take inverse STFT again, and iterate. It converges to a waveform whose STFT magnitude approximates the target; perceptually it sounds robotic for speech but is acceptable for coarse sketches and is still the default when a quick low-quality waveform is needed. For production TTS, Griffin-Lim has been entirely replaced by neural vocoders.
An important cousin of the STFT is the constant-Q transform (CQT). The STFT has linearly spaced frequency bins, which is awkward for music: an octave at 100 Hz spans 100 bins if Δf is 1 Hz, but an octave at 10 kHz spans 10 000. The CQT uses logarithmically spaced bins — 12 bins per octave matches Western musical pitches — and adapts window length per bin so the quality factor Q = f / Δf is constant. Libraries: librosa.cqt, nnAudio. CQT is preferred over STFT for chord recognition, key estimation, and most music-information-retrieval tasks.
Finally, the STFT is differentiable: torch.stft and torchaudio.transforms.Spectrogram compute it on-GPU, inside the training graph, so modern models treat the spectrogram not as a fixed preprocessing step but as a learnable module whose parameters (window length, hop, FFT size, sometimes the window shape itself) can be tuned or learned. This is the point where classical DSP meets the training loop — a theme the learnable-front-ends section (Section 16) will take all the way.
Taking a DFT of a finite segment of a longer signal is equivalent to multiplying an infinite signal by a rectangular function that is one inside the segment and zero outside. This multiplication in the time domain corresponds to convolution in the frequency domain, and the rectangular window's Fourier transform — the sinc function — has a narrow main lobe but enormous side lobes (the first side lobe is only 13 dB below the main lobe). The practical effect is spectral leakage: a single pure sinusoid in the signal produces energy across dozens of DFT bins, obscuring nearby weaker tones.
The standard response is to multiply each frame by a window function — a smooth taper that goes to zero at both ends of the frame — before taking the DFT. The window's own Fourier transform determines how leakage is distributed: a wider main lobe worsens frequency resolution, but lower side lobes reduce leakage of strong tones into weak neighbouring bins. Every window is a trade-off between these two.
The three workhorse windows in audio are Hann, Hamming, and Blackman-Harris. The Hann window w[n] = 0.5 (1 − cos(2π n / (N−1))) has a first side lobe at −31 dB — far better than the rectangular window's −13 dB — and is the default in librosa, torchaudio, and essentially every modern STFT. The Hamming window w[n] = 0.54 − 0.46 cos(2π n / (N−1)) shifts the weighting to suppress the first side lobe to −43 dB at the cost of slower side-lobe roll-off; it was standard in HTK-era ASR and remains common in Kaldi. Blackman-Harris suppresses side lobes to −92 dB and is preferred for precision measurement (instrument tuning, feedback detection) but has a wider main lobe than most ML applications want.
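The quoted side-lobe levels can be measured directly from a zero-padded FFT of the window itself (a rough measurement sketch; the scan for the end of the main lobe is deliberately naïve):

```python
import numpy as np

def peak_sidelobe_db(window, oversample=64):
    """Highest side lobe relative to the main lobe, from a zero-padded FFT."""
    n = len(window)
    mag = np.abs(np.fft.rfft(window, n * oversample))
    mag /= mag[0]                    # the main lobe peaks at DC
    i = 1
    while mag[i + 1] < mag[i]:       # walk down the main lobe to its first null
        i += 1
    return 20 * np.log10(mag[i:].max())

N = 512
print(peak_sidelobe_db(np.ones(N)))      # rectangular: ≈ -13 dB
print(peak_sidelobe_db(np.hanning(N)))   # Hann: ≈ -31 dB
```

The same function applied to np.hamming or scipy.signal.windows.blackmanharris reproduces the other figures in this section.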
The Kaiser window w[n] = I0(β sqrt(1 − ((n − (N−1)/2) / ((N−1)/2))²)) / I0(β) has a shape parameter β that trades main-lobe width for side-lobe suppression: β = 0 is rectangular, β ≈ 5 is similar to Hamming, β ≈ 8.6 approximates Blackman. The Kaiser is essentially a tunable knob spanning the whole window pantheon, and it is the default in many filter-design tools (scipy.signal.firwin).
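These trade-offs can be checked directly: zero-pad a window's FFT and read off the peak side-lobe level. A minimal numpy sketch, using the window formulas quoted above (the helper name is ours):

```python
import numpy as np

def peak_sidelobe_db(window, zero_pad_factor=64):
    """Peak side-lobe level relative to the main lobe, in dB."""
    spec = np.abs(np.fft.rfft(window, n=zero_pad_factor * len(window)))
    spec /= spec.max()
    # walk down the main lobe to its first null, then take the max beyond it
    i = 0
    while i < len(spec) - 1 and spec[i + 1] < spec[i]:
        i += 1
    return 20 * np.log10(spec[i:].max())

N = 1024
n = np.arange(N)
rect = np.ones(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
# rectangular ≈ -13 dB, Hann ≈ -31 dB, Hamming ≈ -43 dB
```

Running this reproduces the side-lobe figures cited above to within a fraction of a dB.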
Two subtler windows matter for measurement. The flat-top window maximises amplitude accuracy: a sinusoid that does not fall exactly on a bin centre will still have its true amplitude read at the peak, at the cost of a very wide main lobe. This is why audio analysers use it for SPL calibration and THD measurement, but it is rarely used in ML. The Gaussian window has the minimum time-frequency uncertainty product (it achieves equality in the Heisenberg-Gabor bound) and is the window implicit in the Gabor transform.
A practical question is: does window choice matter for neural networks? Empirically, most ASR and audio-tagging benchmarks are insensitive to Hann vs Hamming — the learned representation dominates any reasonable window. But when a pipeline is extraordinarily sensitive to fine spectral structure — very-low-SNR keyword spotting, high-fidelity music tagging, precise pitch tracking — the choice moves the needle by a percent or two. Always report the window in papers; always reproduce it at inference.
Finally, windows interact with overlap-add reconstruction. A window w[n] with hop H satisfies the constant-overlap-add (COLA) condition when Σm w[n − mH] is constant; when the same window is applied at both analysis and synthesis, the relevant condition is on the squared window, Σm w[n − mH]². The periodic Hann window satisfies Σ w = constant = 1 at 50% overlap and Σ w² = constant = 1.5 at 75% overlap. Violating these conditions produces amplitude modulation at the frame rate, audible as a buzz. Modern ISTFT implementations check this and normalise by the window-envelope sum, but if you write your own overlap-add you must get it right.
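The overlap-add conditions are easy to verify numerically: sum shifted copies of the window (or its square) and inspect the steady-state region. A small numpy sketch, using the periodic ("DFT-even") Hann window:

```python
import numpy as np

def overlap_add(window, hop, reps=8):
    """Sum shifted copies of `window`; return the steady-state middle part."""
    N = len(window)
    out = np.zeros(N + hop * (reps - 1))
    for m in range(reps):
        out[m * hop : m * hop + N] += window
    return out[N : len(out) - N]  # discard partially covered edges

N = 512
n = np.arange(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))  # periodic ("DFT-even") Hann

amp_cola = overlap_add(hann, N // 2)        # Σ w  at 50% overlap -> 1.0
pow_cola = overlap_add(hann ** 2, N // 4)   # Σ w² at 75% overlap -> 1.5
```

Note the periodic window (denominator N, not N−1): the symmetric variant only satisfies these identities approximately.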
The spectrogram is the visual and computational representation that dominates modern audio. It is the matrix |X[m, k]| (or its square, or its log) where m indexes time frames and k indexes frequency bins. Plotted with time on the x-axis and frequency on the y-axis, a spectrogram lets a human read speech phonemes, identify musical notes, detect dog barks, and spot engine-knock signatures. Fed into a CNN or a transformer, a spectrogram is the input to essentially every non-end-to-end audio model.
The magnitude spectrogram |X[m, k]| and the power spectrogram |X[m, k]|² are related by squaring; the power spectrogram is more convenient for the mel filterbank because energy is additive, whereas magnitudes are not. Both are strictly non-negative, and their dynamic range is enormous — a loud vowel can be 60 dB louder than a quiet background, so the raw magnitude varies by a factor of a thousand. Without compression, most pixels in a spectrogram plot would be black.
The universal fix is to log-compress the magnitudes: S[m, k] = log(|X[m, k]|² + ε), or in dB: S[m, k] = 10 log10(|X[m, k]|² + ε). The small constant ε (often 1e-10) prevents log(0). Log scaling mirrors human perception (dB is the perceptual unit of loudness), compresses dynamic range so both loud and quiet content are visible / numerically well-behaved, and is the standard input normalisation for speech models. A log-power-spectrogram is the default feature fed to Conformer-based ASR systems; a log-mel-spectrogram (Section 10) is even more common.
Two variants matter. The wide-band spectrogram uses a short window (≈ 5 ms): it resolves rapid time events (stops, click tracks) but has poor frequency resolution, so individual harmonics blur together. The narrow-band spectrogram uses a long window (≈ 25 ms): it resolves harmonics clearly but smears time. Phoneticians read both. For ASR, 25 ms windows at a 10 ms hop is the near-universal compromise; for music transient analysis, 10 ms windows at 5 ms hop work better.
A linear-frequency spectrogram (raw STFT) gives equal Hz-width bins. This is rarely what we want perceptually: most useful speech content lies below 4 kHz, so at a 16 kHz sample rate half the bins cover a region carrying comparatively little information. Mapping to a log-frequency or mel-frequency axis (Sections 9–10) is almost always an improvement. Librosa's specshow displays spectrograms; the y_axis argument controls whether the display is linear, log, or mel.
Normalisation is subtle. Per-utterance mean/variance normalisation (subtract the mean log-spectrogram, divide by the standard deviation) reduces channel and speaker variation and is standard in speech. Global normalisation (using dataset-wide statistics) is preferred when long-term averages matter, for instance in audio-event detection. The choice affects how spectrograms look and how models generalise; verify it matches the pre-trained model you are using at inference.
Finally, do not overlook SpecAugment (Park 2019), the augmentation strategy that treats the spectrogram as an image and randomly masks out contiguous vertical stripes (frequency masking), horizontal stripes (time masking), and slight time warps. SpecAugment was the single largest breakthrough in end-to-end ASR training between 2017 and 2020 and remains the default augmentation for any spectrogram-based pipeline. Its success is the most direct evidence that spectrograms are best treated as images.
The mel scale is a perceptual frequency scale whose step size matches how humans perceive pitch. Doubling a frequency from 100 Hz to 200 Hz sounds like a larger step than doubling from 10 000 Hz to 20 000 Hz — even though both are an octave. The mel scale, proposed by Stevens, Volkmann, and Newman in 1937, assigns "mels" such that equal mel differences sound like equal pitch differences, anchored by the convention that 1 000 Hz = 1 000 mel.
The practical mel-to-Hz mapping most commonly used is: m(f) = 2595 · log10(1 + f / 700). Its inverse is f(m) = 700 · (10^(m/2595) − 1). This formula (the O'Shaughnessy 1987 form, adopted by HTK) gives a near-linear mapping below about 1 kHz and a near-logarithmic mapping above. Other formulations exist — Stevens's original measurements, Fant's 1968 variant, the Slaney Auditory-Toolbox formula — and librosa exposes both "htk" and "slaney" implementations; the differences are small but matter if you need bit-exact reproduction of a paper's results.
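In code, the forward and inverse mappings are one-liners; a numpy sketch of the HTK-style formulas above:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style (O'Shaughnessy) mel mapping."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

anchor = hz_to_mel(1000.0)   # the 1 kHz = 1000 mel anchor, to within rounding
```

The 1 kHz anchor comes out at essentially 1000 mel, which is how the 2595 and 700 constants were chosen.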
Why does the mel scale look roughly log above 1 kHz? Because the physics of the cochlea make it so. The basilar membrane is a tapered strip that vibrates maximally at a frequency that varies with position; the mapping is nearly exponential above roughly 500 Hz. Perceptual studies of critical bands — narrow ranges of frequency within which tones mask each other — give a closely related Bark scale, and another related scale is the equivalent rectangular bandwidth (ERB). All three (mel, Bark, ERB) encode the same basic fact: the ear's frequency resolution is coarser at high frequencies.
The mel scale is not the last word on psychoacoustics — modern auditory-physiology models (Zilany-Bruce, CARFAC) are considerably more faithful, as are the gammatone and gammachirp filterbanks that match cochlear impulse responses — but it is the right compromise for machine learning. It is cheap to compute (it is just a triangular-filter re-weighting of a linear STFT), it preserves enough perceptual structure, and it has forty years of empirical validation in speech and audio models.
Choosing the number of mel bins is a hyperparameter. Common values: 40 (classical ASR, MFCCs), 64 (YAMNet, VGGish, many audio-tagging models), 80 (Whisper, Conformer, Tacotron 2, modern ASR/TTS), 128 (high-fidelity TTS, music tagging). More bins preserve more spectral detail but inflate compute, and if the STFT FFT size is too small relative to the bin count, the narrow low-frequency filters can degenerate to a single STFT bin or come out empty.
The lower and upper frequency limits of the mel filterbank also matter. For speech, typical choices are 0 – 4000 Hz (telephone audio at 8 kHz sampling) or roughly 20 – 8000 Hz (wide-band audio at 16 kHz, with the bottom stripped to avoid DC noise). For music, 20 – 11 025 Hz or higher. Whisper uses 0 – 8000 Hz at 16 kHz; most TTS systems use 0 – 8000 Hz or 0 – 12 000 Hz. Mismatched limits between training and inference are a frequent source of silent degradation.
A last note: the mel scale is not sacred. There is active research on learned filterbanks (LEAF, SincNet), and the self-supervised waveform models (wav2vec 2.0, HuBERT) bypass mel entirely. The mel's staying power is a statement about the efficiency of a well-tuned prior — not a claim that nothing can improve on it.
The mel-spectrogram is the single most widely used feature in audio ML. It is the power spectrogram re-weighted by a bank of triangular filters that are uniformly spaced on the mel scale. Librosa's melspectrogram, torchaudio's MelSpectrogram, TensorFlow's tf.signal.linear_to_mel_weight_matrix, and Whisper's preprocessing all compute essentially the same thing, differing only in normalisation and htk-vs-slaney conventions.
The pipeline is: compute the power STFT; multiply by a mel filterbank matrix M of shape (n_mels, n_fft/2 + 1) whose rows are triangular filters centred on mel-spaced frequencies; take the log. The triangular filters overlap — each one covers roughly two critical bands — and integrate power over a range of linear-frequency bins into a single mel bin. The output mel-spectrogram has shape (n_mels, n_frames), typically (80, T) for Whisper and Conformer models.
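The filterbank matrix itself is straightforward to build. The sketch below follows the HTK-style mel formula and the (n_mels, n_fft/2 + 1) shape convention; real libraries differ in details such as Slaney area normalisation, so treat it as illustrative rather than bit-exact:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=80, fmin=0.0, fmax=None):
    """Triangular filters uniformly spaced on the (HTK-style) mel scale.
    Returns a matrix of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 edge frequencies, uniform in mel, mapped back to Hz
    edges_hz = imel(np.linspace(mel(fmin), mel(fmax), n_mels + 2))
    fft_hz = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = edges_hz[i : i + 3]
        rising = (fft_hz - lo) / (c - lo)
        falling = (hi - fft_hz) / (hi - c)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank()   # (80, 201) for the default 400-point FFT at 16 kHz
```

Applied as log(fb @ power_spectrogram + eps), this reproduces the mel-spectrogram pipeline described above.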
The log compression comes next: S = log(M · |X|² + ε). Some pipelines use natural log; some use log10 with a floor (clamp to a minimum value before logging). Whisper takes log10 with a 1e-10 floor, clamps to within 8 of the per-example maximum, and rescales to roughly [−1, 1]; Tacotron 2 uses natural log with a small epsilon. The exact formula matters for bit-exact reproduction but not for empirical quality.
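As an illustration of the clamp-then-rescale pattern, here is a numpy paraphrase of Whisper-style log compression (simplified, not a bit-exact copy of the released code):

```python
import numpy as np

def whisper_style_log_mel(mel_power):
    """Log-compress in the style of Whisper's preprocessing: log10 with a
    floor, clamp to within 8 of the per-example maximum, rescale to ~[-1, 1].
    A numpy paraphrase, not a bit-exact copy."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0

rng = np.random.default_rng(0)
S = whisper_style_log_mel(rng.random((80, 100)))   # toy (n_mels, frames) input
```

The clamp bounds the dynamic range of every example to 8 log10-units (80 dB), so the rescaled feature spans at most 2 units regardless of input level.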
Why a mel-spectrogram instead of a raw STFT spectrogram? Four reasons. First, compression: 80 mel bins vs 257 STFT bins is 3× smaller, directly reducing compute in downstream models. Second, perceptual alignment: mel bins are wider at high frequencies where humans hear less detail, so the features concentrate capacity where it matters. Third, noise robustness: the triangular integration averages out noise in high-frequency bins. Fourth, standardisation: half of all audio models expect an 80-mel log-mel-spectrogram as input, which makes transfer learning trivial.
Mel-spectrograms have their own augmentation recipe. SpecAugment applies random time masks (zero out a contiguous block of frames) and frequency masks (zero out a contiguous block of mel bins); this alone lifts ASR accuracy by several percent and prevents overfitting. SpecSwap and Mixup-on-spectrograms extend the idea. Pitch-shifting the raw waveform and regenerating the mel-spectrogram is a strong augmentation for music tagging.
Inverting a mel-spectrogram to a waveform is ill-posed twice over: the mel filterbank is a non-invertible many-to-few projection (you must estimate the linear STFT from the mel), and the magnitude-only spectrogram discards phase. Classical inversion (mel → linear magnitude via pseudo-inverse, then Griffin-Lim for phase) produces robotic audio suitable only for debugging. Modern solutions train a neural vocoder — WaveNet, WaveRNN, Parallel WaveGAN, HiFi-GAN, BigVGAN, iSTFTNet — that takes a mel-spectrogram as input and outputs a waveform. Every modern TTS system uses one; Tacotron 2 + WaveNet was the first end-to-end pipeline to do so in 2017, and HiFi-GAN (2020) made the approach fast enough for production.
A practical note: do not conflate the log-mel filterbank feature with the MFCC feature. MFCCs are the DCT of the log-mel energies (Section 11); the log-mel-spectrogram is the log-mel energies themselves. Modern deep learning uses log-mel; classical HMM-GMM ASR used MFCCs. If someone says "filterbank features," they almost always mean log-mel.
Mel-frequency cepstral coefficients (MFCCs) were the dominant audio feature in speech recognition from roughly 1980 until 2015. They are the coefficients of the discrete cosine transform (DCT) of the log-mel-energy vector — a decorrelated, compact summary of the spectral envelope. Although deep learning has largely displaced MFCCs with log-mel spectrograms, they remain the default in speaker recognition, keyword spotting on microcontrollers, and a long tail of legacy systems.
The pipeline: compute log-mel energies (Section 10) → apply DCT along the mel axis → keep the first 13 (sometimes 20, or 40) coefficients. The DCT is a real-valued orthogonal transform closely related to the DFT; it decorrelates the log-mel vector under reasonable stationarity assumptions, concentrating signal energy in the first few coefficients. MFCC0 is the log-energy (DC); MFCC1 captures the overall spectral tilt; later coefficients capture progressively finer spectral shape.
The cepstrum more generally is the inverse Fourier transform of the log magnitude spectrum. It is a mathematically natural way to separate a source (e.g. the vocal folds, whose pitch period appears as a peak at a specific quefrency) from a filter (e.g. the vocal tract, whose smooth shape appears at low quefrency). The DCT-based "MFCC" approximates the low-quefrency part of the cepstrum and is remarkably stable across speakers and channels.
MFCCs are typically augmented with delta and delta-delta features — first and second time derivatives approximated by finite differences — to capture dynamic information. The standard feature vector in Kaldi and HTK is 13 + 13 + 13 = 39 coefficients. Modern neural models often do not need deltas because CNNs / transformers can implicitly compute them, but Kaldi-style x-vector pipelines still use them.
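A minimal numpy version of the DCT and delta computations (an illustrative sketch; Kaldi's exact regression windowing differs slightly):

```python
import numpy as np

def dct_ii(x, n_out=13):
    """Orthonormal DCT-II along axis 0 (the mel axis), keeping n_out rows."""
    N = x.shape[0]
    n = np.arange(N)
    basis = np.cos(np.pi * np.arange(n_out)[:, None] * (2 * n + 1) / (2 * N))
    basis *= np.sqrt(2.0 / N)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def deltas(feat, width=2):
    """Regression-style delta features over the time axis, edge-padded."""
    T = feat.shape[1]
    clip = lambda t: np.clip(t, 0, T - 1)
    num = sum(k * (feat[:, clip(np.arange(T) + k)] - feat[:, clip(np.arange(T) - k)])
              for k in range(1, width + 1))
    return num / (2.0 * sum(k * k for k in range(1, width + 1)))

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))            # stand-in log-mel energies
mfcc = dct_ii(log_mel, 13)                          # (13, frames)
feat39 = np.vstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])  # 13+13+13
```

A useful sanity check: the DCT of a constant vector puts all its energy in coefficient 0, which is what makes MFCC0 the log-energy term.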
Normalisation is essential. Cepstral mean normalisation (CMN) subtracts the per-utterance mean MFCC, which cancels any time-invariant channel effect (a microphone's transfer function appears as an additive term in the log spectrum and therefore an additive term in MFCCs). Cepstral mean and variance normalisation (CMVN) additionally divides by the per-utterance standard deviation. Both are standard in speech and critical in real-world deployment where channels vary.
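CMVN itself is two lines of numpy, with per-utterance statistics taken over the time axis:

```python
import numpy as np

def cmvn(feat, eps=1e-8):
    """Per-utterance cepstral mean and variance normalisation.
    feat: (n_coeffs, n_frames); statistics are over the time axis, so any
    time-invariant additive channel term in the cepstrum cancels."""
    mu = feat.mean(axis=1, keepdims=True)
    sd = feat.std(axis=1, keepdims=True)
    return (feat - mu) / (sd + eps)

rng = np.random.default_rng(0)
norm = cmvn(3.0 * rng.standard_normal((13, 200)) + 5.0)  # offset+scale removed
```

Dropping the variance division recovers plain CMN.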
MFCCs are not dead. The i-vector representation (Dehak 2011) — a low-dimensional factor-analysis embedding derived from GMM statistics over MFCCs — was the dominant speaker-recognition feature for a decade and is still used as a baseline. Its successor, the x-vector (Snyder 2018), is a TDNN-learned embedding typically trained on MFCCs or filterbank features. For keyword spotting on microcontrollers (Arduino, ESP32) with kilobytes of memory, MFCCs remain attractive because they are tiny — 13 coefficients per frame vs 80 mel bins — and because they survive fixed-point arithmetic gracefully.
A theoretical caveat: the DCT in MFCC computation is a linear projection, so a sufficiently deep neural network operating on the log-mel-spectrogram can always re-derive the MFCCs internally. That is why modern ASR systems skip the DCT step and feed the log-mel directly: the network learns its own optimal projection. The MFCC's decorrelation was a boon to GMMs, which assumed diagonal covariance; it is neutral (or a weak information loss) for neural models, which do not.
Mel-spectrograms and MFCCs are the default, but they are far from the only audio features that matter. Music-information retrieval, speaker identification, environmental-sound classification, and audio quality assessment each draw on a distinct set of hand-engineered descriptors — many of which retain value as auxiliary inputs, as interpretable probes, or simply as items you will encounter in older literature.
Chroma features collapse the STFT along the octave dimension into 12 pitch classes (C, C#, D, …, B), summing all the energy at frequencies corresponding to the same pitch class regardless of octave. Chromas are the natural input for chord recognition, key estimation, and cover-song identification. Librosa's chroma_stft and chroma_cqt compute them. They are compact (12 dimensions per frame) and interpretable; modern chord-recognition models still use them as an input or a target.
Spectral shape features summarise the STFT with a handful of scalars per frame. Spectral centroid: the weighted mean frequency — correlates with perceived brightness. Spectral bandwidth: weighted standard deviation — correlates with tonality vs noisiness. Spectral roll-off: the frequency below which 85% of energy lies — separates speech from many noises. Spectral flatness (Wiener entropy): the ratio of the geometric to arithmetic mean power — near 1 for white noise, near 0 for a pure tone. Spectral flux: the frame-to-frame change in magnitude spectrum — the foundation of onset detection (Section 13). All of these are cheap to compute and together constitute the "MIR toolbox" of audio classification.
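Four of these descriptors fall out of a few reductions over the magnitude spectrogram; a numpy sketch of the standard formulas:

```python
import numpy as np

def spectral_stats(mag, sr=16000):
    """Per-frame spectral shape descriptors from a magnitude spectrogram
    `mag` of shape (n_bins, n_frames): centroid, bandwidth, 85% roll-off,
    and flatness (geometric / arithmetic mean of power)."""
    n_bins = mag.shape[0]
    freqs = np.linspace(0.0, sr / 2, n_bins)[:, None]
    power = mag ** 2
    total = power.sum(axis=0) + 1e-12
    centroid = (freqs * power).sum(axis=0) / total
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum(axis=0) / total)
    cum = np.cumsum(power, axis=0) / total
    rolloff = freqs[np.argmax(cum >= 0.85, axis=0), 0]
    flatness = np.exp(np.log(power + 1e-12).mean(axis=0)) / (power.mean(axis=0) + 1e-12)
    return centroid, bandwidth, rolloff, flatness

tone = np.zeros((257, 4))
tone[64] = 1.0                  # a pure tone at 2 kHz (bin 64 of 257 at 16 kHz)
noise = np.ones((257, 4))       # a flat "white" spectrum
```

On the two toy inputs, the tone's centroid and roll-off sit at 2 kHz with flatness near 0, while the flat spectrum has flatness near 1 and a centroid at 4 kHz.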
Gammatone filterbanks approximate the impulse response of the basilar membrane more faithfully than mel filters: the gammatone kernel is t^(n−1) exp(−2π b t) cos(2π f_c t). Gammatone features (GTFs) and their derivatives (GFCCs = the MFCC-style DCT of log-gammatone energies) outperform MFCCs slightly in noisy speech recognition, and they are a common front end in biologically motivated auditory models.
Perceptual linear prediction (PLP, Hermansky 1990) is a close cousin of MFCCs: it uses a Bark-scale auditory warping, applies equal-loudness pre-emphasis, does an LP analysis on the warped spectrum, and takes cepstral coefficients. PLP features were considered marginally better than MFCCs for a decade and are still included in Kaldi. RASTA filtering (Hermansky and Morgan 1994) applies a bandpass filter to each cepstral trajectory to suppress slowly varying channel effects — a training-free alternative to CMN.
For speaker tasks, the feature lineage runs: MFCCs → GMM supervector → i-vector (factor analysis) → x-vector (TDNN embedding) → d-vector / ECAPA-TDNN (modern neural speaker embeddings). Early stages used MFCCs and their deltas; modern stages use log-mel filterbanks directly. Speaker-recognition benchmarks (VoxCeleb 1/2, SITW) are reported with equal error rate and minimum detection cost and still organise around these embeddings; Chapter 04 of Part VIII covers this in depth.
For music transcription and pitch analysis, useful features include harmonic summation, autocorrelation-derived features, subharmonic summation (SHS), pitch salience, and the pitch-class histogram. These feed polyphonic-pitch-estimation systems (MPE-Net, Onsets and Frames, BasicPitch).
For audio quality assessment, the reference-based metrics PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), ViSQOL, and the modern learned DNSMOS and SQuId are essentially feature-based classifiers that predict subjective listener scores from spectral comparisons. They are covered more fully in Section 15.
The general rule: hand-engineered features are rarely the sole input to a modern model, but they are often useful as auxiliary channels (concatenate chroma to a mel-spectrogram for music tagging), as inexpensive baselines, as interpretable probes (does my model's representation encode the spectral centroid?), or as frozen front ends for edge deployment. Know what exists; pick carefully.
Two of the oldest problems in audio analysis are what pitch is sounding right now (pitch / fundamental-frequency estimation) and when did a new note begin (onset detection). Both are deceptively simple for a single clean sinusoid and surprisingly difficult for real speech or polyphonic music. They underlie speech prosody analysis, music transcription, beat tracking, singing-voice synthesis, and query-by-humming — and they expose the gap between what classical DSP can do and where learned models take over.
The fundamental frequency (f0) of a periodic signal is the inverse of its period. For voiced speech it is the vibration rate of the vocal folds — typically 80 – 180 Hz for men, 150 – 250 Hz for women, 250 – 450 Hz for children. For a musical instrument it is the lowest spectral peak whose harmonics form the perceived pitch. For noisy or aperiodic signals, f0 is undefined; a good f0 estimator reports both an estimate and a confidence.
The classical monophonic f0 algorithms fall into two camps. Time-domain methods find the lag at which the autocorrelation is maximised (or the AMDF — average magnitude difference function — is minimised): YIN (de Cheveigné and Kawahara 2002) is the dominant member of this family, followed by pYIN (probabilistic YIN) which adds an HMM over the candidate lags. Frequency-domain methods find the fundamental from the harmonic comb in the spectrum: harmonic product spectrum (HPS), harmonic summation, SWIPE (Sawtooth Waveform Inspired Pitch Estimator, Camacho 2007). YIN / pYIN are widely used in speech research; SWIPE is competitive; all of them struggle with octave errors and with breathy or creaky voice.
The modern standard is CREPE (Kim 2018): a 6-layer CNN trained to output a 360-dimensional probability distribution over pitches spaced at 20-cent increments. CREPE is near-ground-truth on isolated voice and instrument recordings and is widely used by downstream MIR systems. Librosa provides pyin as a classical baseline; CREPE itself ships in dedicated libraries (crepe, torchcrepe), and basic-pitch builds on related CNN ideas. A newer sibling, SPICE, uses self-supervised pitch estimation; PESTO is a recent lightweight self-supervised alternative.
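Before reaching for a learned estimator, it is worth seeing how far plain autocorrelation gets. This toy is not YIN (no cumulative-mean normalisation, no voicing decision, no octave-error handling), but it recovers the pitch of a clean sinusoid:

```python
import numpy as np

def f0_autocorr(x, sr, fmin=50.0, fmax=500.0):
    """Toy autocorrelation pitch estimate: the lag of the autocorrelation
    peak within the plausible period range [sr/fmax, sr/fmin]."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1 :]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(4000) / sr
est = f0_autocorr(np.sin(2 * np.pi * 220.0 * t), sr)   # close to 220 Hz
```

On breathy voice, polyphony, or low SNR, exactly this estimator commits the octave errors that motivated YIN's normalisation and CREPE's learned approach.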
Onset detection finds the moments when new musical events start. The core observation is that new notes produce sudden increases in energy, changes in spectral content, or both. The canonical detection function is spectral flux: the half-wave-rectified L1 or L2 norm of the frame-to-frame magnitude-spectrum difference. Librosa's onset_strength and onset_detect implement this, with optional mel-weighting and adaptive peak picking. Complex-domain flux, energy flux, and high-frequency content (HFC) are variants that suit different instruments.
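Spectral flux itself is nearly a one-liner; the toy below builds a spectrogram with an abrupt note onset and shows the flux peaking at the transition:

```python
import numpy as np

def spectral_flux(mag):
    """Half-wave-rectified L1 spectral flux: per-transition onset strength
    from a magnitude spectrogram `mag` of shape (n_bins, n_frames)."""
    d = np.diff(mag, axis=1)
    return np.maximum(d, 0.0).sum(axis=0)

# a toy spectrogram in which a "note" (8 active bins) starts at frame 10
mag = np.zeros((64, 20))
mag[:8, 10:] = 1.0
flux = spectral_flux(mag)   # peaks at the 9->10 frame transition
```

The half-wave rectification is what makes this an onset detector rather than a change detector: energy decreases (note offsets) contribute nothing.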
Modern onset detection uses learned models — convolutional networks with binary onset/no-onset targets, trained on annotated datasets (MAESTRO for piano, MedleyDB for multi-track audio). The madmom library packages state-of-the-art pre-trained onset, beat, and downbeat detectors; it is the default choice for rhythm analysis in research.
Beat and tempo tracking extends onset detection to periodicity analysis. The standard pipeline: compute an onset-strength envelope, apply autocorrelation to estimate tempo, then use dynamic programming or an HMM to place beats at spacing matching the tempo estimate. BeatNet, RNN-Beat-Tracker, and BeatTransformer are the learned successors. Beat tracking feeds every music-information-retrieval task with a temporal grid, and its evaluation (the MIREX beat-tracking contest) has driven steady progress for two decades.
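The tempo-estimation step of that pipeline can be sketched with an autocorrelation over a synthetic onset-strength envelope (a toy; real trackers add smoothing and a prior over tempi):

```python
import numpy as np

def tempo_from_onset_env(onset_env, frame_rate, bpm_lo=60.0, bpm_hi=180.0):
    """Estimate tempo as the autocorrelation peak of the onset-strength
    envelope, searched within the beat-period range implied by the BPM bounds."""
    ac = np.correlate(onset_env, onset_env, mode="full")[len(onset_env) - 1 :]
    lo = int(frame_rate * 60.0 / bpm_hi)
    hi = int(frame_rate * 60.0 / bpm_lo)
    lag = lo + np.argmax(ac[lo:hi])
    return 60.0 * frame_rate / lag

# clicks every 50 frames at a 100 Hz frame rate -> a 120 BPM pulse
env = np.zeros(1000)
env[::50] = 1.0
bpm = tempo_from_onset_env(env, frame_rate=100.0)
```

The dynamic-programming beat-placement step then snaps beat times to this period; the BPM search range plays the role of the tempo prior.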
Source separation — given a mixture, recover the individual sources — is the canonical hard problem of audio. When the mixture is music (vocals + drums + bass + other), the task is music source separation. When it is speech (multiple simultaneous speakers), it is the cocktail-party problem. When it is speech in noise (one speaker plus environmental noise), it becomes speech enhancement (Section 15). All three are under-determined in the single-channel case and have driven some of the most creative work in audio over the last decade.
The classical approach was non-negative matrix factorisation (NMF): given a magnitude spectrogram V of shape (F, T), factor it as V ≈ W H with W (F, K) a non-negative "dictionary" of spectral templates and H (K, T) non-negative activations. Each source occupies a subset of dictionary atoms; grouping atoms by source yields the reconstructed source spectrograms. NMF was the dominant method from 2005 to 2015, supported by extensions (convolutive NMF, semi-supervised NMF, Itakura-Saito divergence), but it was eclipsed by neural models by 2017.
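The multiplicative-update rules of Lee and Seung make NMF a ten-line algorithm. A numpy sketch for the Euclidean loss (the KL and Itakura-Saito variants change only the update ratios):

```python
import numpy as np

def nmf(V, K, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimising ||V - WH||².
    V: non-negative (F, T) matrix, e.g. a magnitude spectrogram.
    W: (F, K) spectral templates; H: (K, T) activations."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # ratios stay non-negative,
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # so W, H stay non-negative
    return W, H

rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))   # an exactly rank-2 "spectrogram"
W, H = nmf(V, K=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form is the trick: each update is a gradient step with a step size chosen so the parameters are rescaled rather than shifted, preserving non-negativity for free.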
The transition was via masking-based separation: train a network to predict a time-frequency mask M that, applied to the mixture spectrogram, recovers a source. The ideal binary mask (IBM) assigns each bin to whichever source dominates; the ideal ratio mask (IRM) assigns a soft fraction in [0, 1]; the phase-sensitive mask (PSM) incorporates phase. Training a CNN or BLSTM to predict the IRM, and applying it to the mixture's STFT, gave the first strong music-separation results (DeepKaraoke, MSS-Unmix).
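The oracle masks are trivial to compute when the isolated sources are available, which is exactly how training targets are made; a numpy sketch:

```python
import numpy as np

def ideal_masks(S1, S2):
    """Ideal binary and ratio masks for source 1, given both sources'
    magnitude spectrograms (oracle quantities, used as training targets)."""
    ibm = (S1 > S2).astype(float)
    irm = S1 / (S1 + S2 + 1e-12)
    return ibm, irm

rng = np.random.default_rng(0)
S1, S2 = rng.random((257, 50)), rng.random((257, 50))
ibm, irm = ideal_masks(S1, S2)
est1 = irm * (S1 + S2)   # mask applied to an idealised mixture magnitude
```

The example treats the mixture magnitude as the sum of source magnitudes, which holds only approximately for real STFTs (phases interfere); that gap is one reason phase-sensitive and complex masks followed.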
Open-Unmix (Stöter 2019) is the canonical open-source music separator: a BLSTM per source predicting magnitude masks on STFT. Demucs (Défossez 2019) runs in the time domain with a U-Net over raw waveforms, setting state-of-the-art on MUSDB18 and becoming Facebook / Meta's default separator. Spleeter (Deezer 2019) was the consumer-friendly release — four-stem vocal / drums / bass / other separation, packaged as a Docker image, widely adopted by hobbyists. Hybrid Demucs combines time- and frequency-domain processing; Band-Split RNN (BSRNN) and MDX23-winning MDX-Net variants extend the approach.
Speech separation has its own lineage. Deep clustering (Hershey 2016) embeds each time-frequency bin in a high-dimensional space where bins belonging to the same speaker cluster together — sidestepping the speaker-permutation problem by using clustering rather than direct source assignment. Permutation-invariant training (PIT, Yu 2017) trains the model with a loss that is invariant to the permutation of output source channels.
TasNet (Luo and Mesgarani 2018) was the first time-domain speech separator competitive with STFT-based methods; ConvTasNet (2019) replaced the LSTM with a dilated-convolution TCN and pushed SI-SDRi on WSJ0-2mix dramatically; DPRNN (Dual-Path RNN, 2019) and SepFormer (Subakan 2021) progressed further. The latest transformer-based separators approach 22–23 dB SI-SDRi on WSJ0-2mix — a staggering jump from the ~10 dB of IBM-based baselines.
A recent twist: diffusion-based separation (MSDM, DiffSep) and language-conditioned separation (AudioSep, LASS, "separate the violin") open new directions. They also blur the line between separation and generation — if you can generate a convincing isolated vocal, you can "separate" without ever seeing the mixture. The field is moving fast; MDX Challenges (annual music-separation contest) are the empirical barometer.
Speech enhancement is source separation with a specific goal: recover clean speech from noisy speech. Every video-conferencing app, every voice-assistant wake-word detector, every hearing aid, and every industrial-headset system runs an enhancement stage. The field is old — Ephraim and Malah's MMSE estimator is from 1984 — and the modern neural methods have pushed both perceptual quality and real-time compute dramatically.
The classical methods work in the STFT domain. Spectral subtraction estimates the noise magnitude spectrum (e.g. during silent frames detected by a VAD) and subtracts it from each noisy frame; it is brutally simple and introduces the "musical noise" artefact familiar from early noise-reduction. Wiener filtering derives the optimal linear estimator of the clean spectrum under assumed spectral models of speech and noise; it is less aggressive than spectral subtraction and produces cleaner output. MMSE-STSA and log-MMSE (Ephraim and Malah 1984, 1985) give the Bayesian optimal magnitude estimators under Gaussian assumptions and remained the de-facto standard for two decades.
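Both classical estimators reduce to a per-bin gain on the noisy spectrum; a numpy sketch of power spectral subtraction (with over-subtraction and a floor) and a subtraction-based Wiener gain:

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, floor=0.05):
    """Power spectral subtraction with over-subtraction factor `alpha` and a
    spectral floor — the floor is the classic remedy for musical noise."""
    power = noisy_mag ** 2 - alpha * noise_mag ** 2
    return np.sqrt(np.maximum(power, (floor * noisy_mag) ** 2))

def wiener_gain(noisy_power, noise_power, eps=1e-12):
    """Per-bin Wiener gain S/(S+N), with the speech power S estimated by
    power subtraction from the noisy observation."""
    speech = np.maximum(noisy_power - noise_power, 0.0)
    return speech / (noisy_power + eps)
```

Applied bin-by-bin to the noisy STFT magnitude (with the noisy phase reused for resynthesis), these two functions are a complete, if crude, enhancement system.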
Neural enhancement begins around 2014. RNN-based enhancement (Xu 2014, Weninger 2015) trains an LSTM to predict clean log-magnitude spectra from noisy; SEGAN (Pascual 2017) was the first GAN-based enhancement; DeepXi estimates a priori SNR with a CNN. The modern workhorses are CRN (Convolutional Recurrent Network), DCCRN (Deep Complex CRN — operates on complex STFT so it models phase explicitly), DEMUCS-Denoiser (Défossez 2020, time-domain), and the industrial favourites RNNoise (Valin, Xiph.Org) and Microsoft's NSNet.
Evaluation of enhancement is notoriously tricky. The classical reference metrics are PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) and STOI (Short-Time Objective Intelligibility, Taal 2011); the more modern SI-SDR (scale-invariant signal-to-distortion ratio) is preferred for separation / enhancement literature because it is insensitive to output gain. All three are reference-based — they compare enhanced to clean. The frontier is non-reference learned metrics: DNSMOS (Microsoft, trained on crowd listener ratings), ViSQOL, NORESQA, SQuId. These predict subjective quality from the enhanced audio alone and drive the DNS Challenge leaderboards.
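SI-SDR is short enough to define inline; the sketch below shows the projection step and the gain invariance that motivates the metric:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant SDR in dB. Project the estimate onto the reference to
    get the target component; everything else counts as distortion. Any gain
    applied to the estimate cancels out, which is the point."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10.0 * np.log10((target @ target + eps) / (noise @ noise + eps))

t = np.arange(8000) / 8000.0
x = np.sin(2 * np.pi * 100.0 * t)
scaled = si_sdr(0.3 * x, x)     # extremely high: rescaling is not a distortion
rng = np.random.default_rng(0)
noisy = si_sdr(x + 0.1 * rng.standard_normal(x.size), x)   # roughly 17 dB
```

Plain SDR would penalise the rescaled estimate; SI-SDR's projection makes the two gain-equivalent, which is why separation papers report SI-SDR improvement (SI-SDRi) over the unprocessed mixture.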
Beamforming uses multiple microphones to improve SNR before any single-channel stage. Classical beamformers (delay-and-sum, MVDR) exploit time-of-arrival differences between microphones to null out noise from specific directions; neural beamformers (Mask-MVDR, ADL-MVDR) estimate the spatial covariance from deep-learning-predicted masks. Meeting-room transcription (Microsoft ASR, Zoom's meeting summariser) and far-field smart-speaker wake-word detection (Amazon Echo's 7-mic array) rely heavily on beamforming.
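Delay-and-sum is the simplest beamformer to write down. A toy with integer-sample steering delays (real arrays need fractional delays and per-channel calibration):

```python
import numpy as np

def delay_and_sum(mics, steering_delays):
    """Delay-and-sum beamformer: advance each channel by its steering delay
    so the target direction aligns across microphones, then average.
    mics: (n_channels, T); delays in whole samples (a toy simplification)."""
    out = np.zeros(mics.shape[1])
    for ch, d in zip(mics, steering_delays):
        out += np.roll(ch, -d)
    return out / len(mics)

# a target arriving 3 samples later at the second mic is re-aligned exactly
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
mics = np.stack([x, np.roll(x, 3)])
y = delay_and_sum(mics, [0, 3])
```

Aligned signals add coherently (amplitude scales with the channel count) while uncorrelated noise adds incoherently, which is where the SNR gain comes from; MVDR generalises this by also placing nulls on interferers.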
Dereverberation is the specific case where the distortion is room-impulse-response convolution rather than additive noise. WPE (Weighted Prediction Error, Nakatani 2010) is the classical method; neural dereverberation (Yoshioka 2015, Han 2020) is the modern answer; and a line of work treats the joint problem (dereverberation + denoising + separation) as a single network training problem. The DNS Challenge (Deep Noise Suppression, Microsoft, annual since 2020) and the REVERB challenge (2014, reactivated periodically) have driven the field.
A subtle but important deployment issue: evaluate your enhancement on the same acoustic conditions your product will encounter. A model tuned for high-SNR office conditions will underperform on low-SNR conference-room scenarios and may actively hurt in very high-SNR (clean-studio) recordings. The DNS Challenge's emphasis on diverse noise sets and Microsoft's rule of "never degrade clean" are both direct responses to this.
The argument for hand-engineered front ends (mel-spectrograms, MFCCs) is that they encode perceptual structure for free. The counter-argument is that any fixed feature is a bet — and data-driven bets are usually better. The last five years have seen a steady migration toward learnable audio front ends, from parameterised filterbanks operating on raw waveforms, to fully self-supervised transformer models that treat the waveform as just another token sequence.
SincNet (Ravanelli and Bengio 2018) was the first influential learnable front end. Instead of a fixed mel filterbank, SincNet parameterises a bank of band-pass filters by only their cut-off frequencies — each filter is a sinc function in time — so the network learns which frequency bands to attend to while preserving the interpretability of a filterbank. SincNet consistently matches or exceeds mel front ends on speaker-recognition and keyword-spotting tasks with dramatically fewer parameters.
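The core parameterisation is easy to reproduce: a band-pass kernel built as the difference of two windowed sinc low-passes, determined entirely by its two cut-offs. A fixed numpy sketch (SincNet makes the cut-offs learnable torch parameters):

```python
import numpy as np

def sinc_bandpass(f_lo, f_hi, n_taps=101, sr=16000):
    """Band-pass FIR kernel as the difference of two windowed sinc low-pass
    kernels — the SincNet parameterisation, with only two free numbers."""
    t = np.arange(n_taps) - (n_taps - 1) / 2
    lowpass = lambda fc: 2.0 * fc / sr * np.sinc(2.0 * fc / sr * t)
    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(n_taps)

h = sinc_bandpass(1000.0, 2000.0)                # pass 1-2 kHz at 16 kHz
H = np.abs(np.fft.rfft(h, 1024))                 # frequency response
freqs = np.fft.rfftfreq(1024, d=1.0 / 16000)
```

Two parameters per filter versus n_taps free weights for an ordinary conv kernel is the source of SincNet's parameter savings, and the learned filters stay interpretable as frequency bands.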
LEAF (Zeghidour 2021) generalises the idea: a Gabor-like learnable filterbank followed by compression and per-channel-gain, trained end-to-end with the downstream task. LEAF was pitched as a drop-in replacement for mel-spectrograms and matches mel on keyword spotting, audio tagging, and speech recognition.
A parallel line runs through fully convolutional acoustic models: Jasper (NVIDIA), QuartzNet, and Wav2Letter++ stack large 1D convolutions over simple spectral features (and, in Wav2Letter's early variants, raw waveforms), absorbing most front-end design into the network. They reach competitive ASR WER at roughly the cost of a bigger network. Conformer, the current ASR architecture king, still uses mel — but it is now a choice rather than a default.
Contrastive Predictive Coding (CPC, van den Oord 2018) was the ancestor: predict future audio embeddings from past context with InfoNCE. wav2vec (Schneider 2019) and wav2vec 2.0 (Baevski 2020) refined the approach with quantised latent codes and transformer encoders. Wav2vec 2.0 trained on 53 000 hours of Libri-Light audio set a new ASR state-of-the-art at the time and made semi-supervised speech recognition practical — fine-tune on a few hours of labelled data, get close to fully supervised performance.
HuBERT (Hsu 2021) replaced wav2vec's contrastive objective with masked prediction of cluster labels (a BERT-like MLM over pseudo-tokens obtained by k-means clustering of MFCCs, then progressively of HuBERT's own hidden states). HuBERT is slightly simpler to train and matches or exceeds wav2vec 2.0 on most benchmarks. WavLM (Chen 2022) extended HuBERT with denoising and adversarial auxiliaries, improving robustness; data2vec (Baevski 2022) unified the self-supervised recipe across speech, text, and vision.
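HuBERT's pseudo-token targets are nothing more exotic than k-means cluster ids over feature frames. A toy sketch with a hand-rolled k-means (real pipelines use a proper implementation over MFCCs, then over HuBERT's own hidden states; the data here is synthetic):

```python
import numpy as np

def kmeans_pseudo_labels(frames, k=8, iters=20, seed=0):
    """Assign each feature frame a discrete pseudo-token via plain k-means.

    frames: (T, d) acoustic features (MFCCs in HuBERT's first iteration).
    Returns (T,) integer cluster ids — the targets for masked prediction.
    """
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid (squared Euclidean).
        d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Recompute centroids; keep the old one if a cluster empties.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = frames[labels == j].mean(0)
    return labels

rng = np.random.default_rng(1)
# Toy "MFCC" stream: two well-separated frame populations.
frames = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(8, 1, (50, 13))])
tokens = kmeans_pseudo_labels(frames, k=2)
```

The transformer then does BERT-style masked prediction of these ids — the clustering quality bounds what the model can learn, which is exactly why HuBERT re-clusters on its own hidden states in later iterations.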
The SUPERB benchmark (Yang 2021) evaluates self-supervised speech encoders across a dozen downstream tasks (phoneme recognition, ASR, speaker identification, emotion, intent, slot filling, query-by-example, voice conversion) by freezing the pre-trained model and training only a small task head. SUPERB rankings are the standard way to compare encoders and have driven the field's empirical progress. A follow-up SUPERB-SG extends to generation tasks.
For multilingual and cross-lingual work, XLS-R (Babu 2021, 128-language wav2vec 2.0) and Meta's MMS (Massively Multilingual Speech, 2023, 1 400 languages) took the self-supervised recipe to enormous language coverage. These now underlie much of the low-resource ASR literature; Whisper (OpenAI 2022) took a different path — fully supervised weak-label training on 680 000 hours of web audio — and is itself a widely used learned front end now (its encoder feeds many downstream audio-tagging and translation pipelines).
The practical upshot: unless you are training at genuine frontier scale, a pre-trained self-supervised encoder (WavLM-large, HuBERT-X-large, Whisper encoder, MMS-1B) is now the preferred front end for speech tasks. For music and general audio, BEATs, CLAP (audio-text contrastive), AudioMAE, M2D, and EfficientAT serve the same role. Mel-spectrograms survive for TTS (where generative models still condition on them directly) and for edge deployment (where a small pre-trained encoder is still too heavy).
The final layer of the modern audio stack is waveform modelling: turning a spectrogram, a set of discrete tokens, or a text prompt into an audible waveform. This layer has undergone three revolutions in a decade — from concatenative / Griffin-Lim synthesis, to autoregressive neural vocoders (WaveNet), to GAN-based parallel vocoders (HiFi-GAN), to neural audio codecs (SoundStream, EnCodec, DAC) that compress waveforms to discrete tokens and enable the audio-as-tokens paradigm that powers AudioLM, MusicLM, VALL-E, and GPT-4o's voice.
WaveNet (van den Oord 2016) was the breakthrough: a dilated 1D CNN, autoregressive over waveform samples μ-law-companded to 8 bits (256 levels). WaveNet sounded astonishingly natural but ran ~100× slower than real time, so the first production use (Google Assistant, 2017) required the Parallel WaveNet distillation (van den Oord 2018). WaveRNN (Kalchbrenner 2018), LPCNet (Valin 2019), and Parallel WaveGAN (Yamamoto 2019) progressively closed the speed gap.
The breakthrough for deployment was HiFi-GAN (Kong 2020): a GAN-based vocoder that maps a mel-spectrogram to a waveform in a single forward pass with high perceptual quality. HiFi-GAN runs several hundred times faster than real time on a single GPU and is the de-facto vocoder for TTS research in 2022 – 2025. Extensions: iSTFTNet (uses an iSTFT output head), UnivNet (multi-resolution discriminators), BigVGAN (NVIDIA 2022, universality across speakers and music).
SoundStream (Google 2021) introduced residual vector quantisation (RVQ) to neural codecs: the encoder's latent is quantised iteratively by a cascade of codebooks, each fitting the residual left by the previous stage. Eight-codebook RVQ at 75 Hz with 1 024 entries per codebook gives 8 × 10 bits × 75 frames/s = 6 kbps. EnCodec (Meta 2022) followed the same recipe with refined multi-scale discriminators and an optional small transformer LM for entropy coding; its 24 kHz model is the de-facto default for speech tokenisation and feeds Bark, VALL-E, and MusicGen. DAC (Descript 2023) improved on both with better loss balancing and snake activations, achieving near-transparent 44.1 kHz audio at 8 kbps.
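The RVQ cascade and the bitrate arithmetic both fit in a few lines. A NumPy sketch with toy random codebooks — real codecs learn the codebooks jointly with the encoder, so the reconstruction here is only illustrative:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual VQ: each stage quantises whatever residual the previous
    stages left unexplained.  z: (d,); codebooks: list of (K, d) arrays."""
    recon = np.zeros_like(z)
    codes = []
    for cb in codebooks:
        residual = z - recon
        idx = int(((cb - residual) ** 2).sum(axis=1).argmin())  # nearest entry
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

def bitrate_bps(n_codebooks, codebook_size, frame_rate_hz):
    """Bits per second = stages x bits-per-stage x frames-per-second."""
    return n_codebooks * np.log2(codebook_size) * frame_rate_hz

# Toy random codebooks (real codecs learn them end-to-end).
rng = np.random.default_rng(0)
books = [0.1 * rng.normal(size=(1024, 128)) for _ in range(8)]
z = rng.normal(size=128)
codes, recon = rvq_encode(z, books)
```

The cascade structure is what makes the bitrate adjustable at inference: dropping the last codebooks degrades quality gracefully instead of breaking the stream.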
Once audio is tokens, the entire LLM toolkit applies. AudioLM (Borsos 2022) generates audio with a hierarchical transformer over two token streams: semantic tokens from a w2v-BERT encoder and acoustic tokens from SoundStream. MusicLM (2023) extends AudioLM to text-conditioned music. VALL-E (Microsoft 2023) is a zero-shot TTS system: given a 3-second prompt and a text, it generates an EnCodec-token continuation that clones the prompt's voice. Bark (Suno 2023) does similar things with open weights; SpeechT5 is an open unified speech model, though it operates on spectrograms rather than codec tokens.
The codec itself is a careful DSP-meets-deep-learning design. The encoder is a stack of strided 1D convolutions that downsample the waveform to an embedding sequence at roughly 75 – 150 Hz. RVQ quantises each frame's embedding into a short stack of code indices. The decoder mirrors the encoder with transposed convolutions. The training loss combines a time-domain reconstruction loss, a multi-scale STFT loss (spectral fidelity), adversarial losses (perceptual naturalness), and a VQ commitment loss. Each component corresponds to a DSP desideratum that a single L2 loss in the time domain would fail to capture.
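The multi-scale STFT term is worth seeing concretely. A NumPy sketch in the spirit of Parallel WaveGAN and EnCodec — the spectral-convergence plus log-magnitude-L1 combination is standard, but the exact resolutions, windows, and weightings here are illustrative choices, not any specific codec's:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window (no padding, for brevity)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def multiscale_stft_loss(x, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral convergence + log-magnitude L1, summed over several STFT
    resolutions so no single window-length bias dominates."""
    total = 0.0
    for n_fft, hop in resolutions:
        X, Y = stft_mag(x, n_fft, hop), stft_mag(y, n_fft, hop)
        sc = np.linalg.norm(X - Y) / (np.linalg.norm(X) + 1e-8)      # convergence
        log_l1 = np.abs(np.log(X + 1e-5) - np.log(Y + 1e-5)).mean()  # log-mag L1
        total += sc + log_l1
    return total

t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=16000)
```

Using several resolutions is the DSP time-frequency trade-off made explicit: the short windows catch transients the long windows smear, and vice versa.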
Codecs also replace the mel-spectrogram in modern TTS: VALL-E and NaturalSpeech 3 condition directly on codec tokens, and Tortoise-TTS autoregresses over its own learned discrete acoustic tokens, rather than on mel. This is the cleanest demonstration of the trend — the mel-spectrogram, so central to Tacotron and FastSpeech, is being replaced by learned discrete tokens. The hand-engineered front end is finally dissolving; but note that EnCodec's encoder and decoder are still, at heart, DSP-inspired 1D-convolutional filterbanks. The ideas of this chapter are not retired — they are absorbed.
The classical and modern audio pipeline described in this chapter is the foundation the rest of Part VIII builds on. This closing section surveys how the pieces fit together in real ML workflows — the libraries, the GPU-accelerated toolkits, the typical preprocessing pipeline, the streaming and real-time constraints, and the handoff between this chapter and its successors.
The Python audio ecosystem revolves around a handful of libraries. librosa is the feature-rich research library: load audio, resample, compute STFT / mel / MFCC / chroma / pitch, apply pitch-shift and time-stretch, detect onsets and beats — all on CPU with NumPy. It is slow but encyclopedic and has been the teaching tool for a decade. torchaudio mirrors librosa's feature set with PyTorch tensors and GPU support — torchaudio.transforms.MelSpectrogram is the canonical mel-spectrogram in any modern training pipeline. soundfile is the go-to for reading / writing WAV / FLAC. pydub wraps ffmpeg for any format ffmpeg supports. SoX and ffmpeg themselves are the command-line workhorses for dataset preparation at scale.
Industrial toolkits sit on top of these libraries. NVIDIA NeMo is the production-grade ASR / TTS / speaker framework — Conformer-CTC, Conformer-Transducer, FastPitch, HiFi-GAN, Sortformer diarisation. SpeechBrain is the friendly academic framework — modular recipes for ASR, speaker recognition, enhancement, separation. ESPnet is the Kaldi successor for speech recognition and translation. Hugging Face transformers hosts every self-supervised speech encoder (wav2vec 2.0, HuBERT, WavLM, Whisper) with a uniform interface. Kaldi itself is the C++ classical ASR toolkit, still used for speaker diarisation and low-resource languages.
A canonical preprocessing pipeline for supervised audio training looks something like: (1) load waveforms and convert to float32 in [-1, 1]; (2) resample to the target rate (usually 16 kHz); (3) downmix to mono; (4) optionally apply VAD and trim silence; (5) optionally normalise loudness to -23 LUFS; (6) optionally augment (SpecAugment, noise addition from the MUSAN or DEMAND datasets, room-impulse-response convolution from BUT ReverbDB, speed perturbation); (7) compute log-mel-spectrogram; (8) apply cepstral / per-speaker normalisation; (9) feed to model. Each step deserves its own config file because small variations (16 vs 22.05 kHz, 80 vs 128 mel bins, SpecAugment on vs off) move accuracy by several percent.
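The early steps of that pipeline can be sketched in pure NumPy. The sketch below covers steps 1 – 5 with deliberately crude stand-ins — linear-interpolation resampling instead of a polyphase filter, an energy gate instead of a real VAD, RMS dBFS instead of true LUFS — so it shows the shape of the pipeline, not production DSP:

```python
import numpy as np

def preprocess(wave, sr, target_sr=16000, target_dbfs=-23.0):
    """Steps 1-5 of the canonical pipeline, with crude stand-ins (see above)."""
    x = np.asarray(wave, dtype=np.float64)
    if x.ndim == 2:                                    # (channels, samples) -> mono
        x = x.mean(axis=0)
    if sr != target_sr:                                # naive linear resample
        n_out = int(len(x) * target_sr / sr)
        x = np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)
    frame = 400                                        # 25 ms at 16 kHz
    n_frames = len(x) // frame
    energy = (x[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
    active = np.flatnonzero(energy > energy.max() * 1e-4)   # -40 dB energy gate
    x = x[active[0] * frame : (active[-1] + 1) * frame]
    rms = np.sqrt((x ** 2).mean())                     # RMS loudness normalisation
    gain = 10 ** (target_dbfs / 20) / (rms + 1e-9)
    return (x * gain).astype(np.float32)

sr = 48000
t = np.arange(sr) / sr                                 # 1 s of 220 Hz tone
wave = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 220 * t), np.zeros(sr // 2)])
out = preprocess(wave, sr)
```

In a real pipeline each of these steps would be a library call (torchaudio, librosa, pyloudnorm, a proper VAD) behind a config flag — the point of the sketch is that every step changes the signal the model sees, which is why each one deserves its own config entry.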
Streaming and real-time services add a further constraint: features must be computed incrementally, frame by frame, with state (window overlap, normalisation statistics) carried across chunks, and several mature implementations exist for exactly this (streaming_features, NVIDIA Riva's frame store, Google's Lyra). If you are writing a streaming service, do not reinvent this — use one of the existing implementations.
GPU-accelerated preprocessing matters at scale. Computing a log-mel-spectrogram on CPU for a few hundred hours of training data is trivial; computing it inside every training step for several thousand hours of wav2vec 2.0 pretraining demands either pre-cached features or GPU spectrograms. torchaudio.transforms implements GPU STFT / mel; nnAudio does the same for CQT and gammatone; NVIDIA's DALI offers a C++-accelerated audio-loading + feature-extraction path for enormous datasets. Pre-caching log-mel-spectrograms to disk is a common optimisation when features don't depend on training-time randomness.
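The pre-caching pattern is simple enough to sketch: compute each utterance's features once, write them as .npy files, and memory-map them at training time. The helper names are ours, and the framewise log-energy featurizer is a stand-in for a real log-mel transform:

```python
import numpy as np
import tempfile
from pathlib import Path

def cache_features(utterances, cache_dir, featurizer):
    """Compute each utterance's features once and store them as .npy files;
    later epochs load (or memory-map) instead of recomputing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, wave in utterances.items():
        path = cache_dir / f"{utt_id}.npy"
        if not path.exists():                          # idempotent across runs
            np.save(path, featurizer(wave))
    return {u: cache_dir / f"{u}.npy" for u in utterances}

def load_cached(path):
    return np.load(path, mmap_mode="r")                # lazy read at train time

def log_energy(wave, frame=400, hop=160):
    """Stand-in featurizer: framewise log-energy (swap in a real log-mel)."""
    n = 1 + (len(wave) - frame) // hop
    frames = np.stack([wave[i * hop : i * hop + frame] for i in range(n)])
    return np.log((frames ** 2).mean(1) + 1e-10).astype(np.float32)

utts = {"utt1": np.random.default_rng(0).normal(size=16000).astype(np.float32)}
with tempfile.TemporaryDirectory() as d:
    paths = cache_features(utts, d, log_energy)
    feats = np.array(load_cached(paths["utt1"]))       # copy out of the mmap
```

The idempotence check is what makes this safe to run from every worker of a distributed job; the trade-off against on-the-fly computation is that cached features cannot depend on training-time randomness such as SpecAugment.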
The rest of Part VIII builds directly on this foundation. Chapter 02 (Automatic Speech Recognition) takes log-mel spectrograms or wav2vec 2.0 embeddings and runs CTC / RNN-T / attention-based decoders to produce text — Whisper's encoder consumes exactly the log-mel spectrograms of Section 10. Chapter 03 (Text-to-Speech) learns the inverse path — text → log-mel spectrogram → waveform via HiFi-GAN / BigVGAN (Section 17). Chapter 04 (Speaker Recognition) trains x-vectors / ECAPA-TDNN on log-mel features (Sections 10–11) and uses PLDA / cosine scoring for verification. Chapter 05 (Audio Classification) feeds log-mel spectrograms into CNNs / ASTs / transformers for environmental-sound and music-tag prediction. Chapter 06 (Music Generation) either predicts mel spectrograms with diffusion (AudioLDM, Stable Audio) or predicts EnCodec / SoundStream tokens with autoregressive transformers (MusicLM, MusicGen).
Audio signal processing also feeds the broader compendium. Part VI's Transformer Architecture and Pretraining Paradigms supply the backbone for every speech transformer. Part V's Convolutional Neural Networks and Sequence Models provide the architecture toolkit for 1D-conv front ends and RNN-based streaming ASR. Part VII's Vision-Language Models preview the multimodal extension — VLMs plus audio encoders give you audio-language models (CLAP, AudioPaLM, Qwen-Audio, GPT-4o audio), which are the natural endpoint of this part of the compendium. Part I's Signal Processing chapter has the mathematical foundations — Fourier transforms, filter design, windowing — that we have used throughout. This is the most cross-linked chapter in the compendium because audio is the most cross-modal of tasks.
The practical claim is that a practitioner who understands the chapter they just read — sample rates, the STFT, mel-spectrograms, MFCCs, CREPE, separation / enhancement, wav2vec, EnCodec — has the full vocabulary to read any paper or debug any pipeline in Part VIII. The theoretical claim is stronger: audio is the closest that machine learning gets to working with physical reality in real time. Every decision in this chapter — window length, hop, bit depth, sample rate — is a statement about what matters in the physical world. That is why the chapter is longer than it first appears, and why it rewards a careful second reading once the later chapters have given it context.
Audio signal processing draws on eight decades of DSP plus the last ten years of deep-learning audio work. The canon below organises the key references as the chapter does — physics and perception, sampling and quantisation, the Fourier workhorse, spectrograms and mel features, MFCCs, pitch and onsets, source separation, enhancement, learnable front ends, neural codecs, datasets, and the modern software ecosystem.
Discrete-Time Signal Processing — Oppenheim & Schafer (3rd ed., 2009)
The graduate-textbook reference for DSP. Chapters on sampling, DFT/FFT, the STFT, and filter design are the canonical treatments; every audio ML engineer should have read at least the sampling, Fourier, and STFT chapters.
Speech and Language Processing — Jurafsky & Martin (3rd ed. draft)
Chapters 16 (Feature Extraction for Speech) and 17 (Speech Recognition) are the best single-source treatment of MFCCs, filterbanks, and modern ASR preprocessing. Chapter 16 in particular is a compressed version of this chapter.
Fundamentals of Music Processing — Meinard Müller (2nd ed., 2021)
The standard music-information-retrieval textbook. Covers chroma, CQT, onset / beat tracking, chord / key estimation, and music segmentation in beautifully clear prose. Much of Section 12 and Section 13 of this chapter follow Müller's presentation.
Think DSP — Allen B. Downey (2016)
An introductory DSP textbook in Python — every concept is illustrated with NumPy code. The gentlest path for an ML engineer who needs to learn what sampling / Fourier / STFT actually do.
A Survey of Deep Learning for Audio Signal Processing — Purwins et al. (2019)
Pre-wav2vec 2.0 but still the best cross-domain (speech, music, environmental-sound) survey of deep-learning audio.
Communication in the Presence of Noise — Shannon (1949)
The original sampling theorem paper. Worth reading for the clarity of Shannon's exposition.
Loudness, Its Definition, Measurement and Calculation — Fletcher & Munson (1933)
The equal-loudness contours that anchor the dB-weighting standards and shape every psychoacoustically motivated feature.
A Scale for the Measurement of the Psychological Magnitude Pitch — Stevens, Volkmann & Newman (1937)
The original mel scale. The scale everyone cites but few read; it is short and illuminating.
Acoustic Theory of Speech Production — Gunnar Fant (1960)
The source-filter model of speech. Everything about why MFCCs decorrelate so well traces back to the vocal-tract all-pole interpretation here.
An Algorithm for the Machine Calculation of Complex Fourier Series — Cooley & Tukey (1965)
The (re)discovery of the FFT — arguably the most consequential algorithm of the twentieth century.
On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform — Harris (1978)
The encyclopedia of window functions. The reference for anyone tuning an STFT for a measurement application; the Blackman-Harris windows come from this paper.
Signal Estimation from Modified Short-Time Fourier Transform — Griffin & Lim (1984)
The Griffin-Lim algorithm for phase reconstruction. Still used as a baseline and for quick-and-dirty mel-to-wave inversion.
Calculation of a Constant Q Spectral Transform — Brown (1991)
The CQT for music analysis — log-spaced bins aligned with musical pitches.
Comparison of Parametric Representations for Monosyllabic Word Recognition — Davis & Mermelstein (1980)
The MFCC paper. MFCCs dominated speech recognition for over thirty years after this.
Perceptual Linear Predictive (PLP) Analysis of Speech — Hermansky (1990)
PLP features — Bark-scale warping + equal-loudness pre-emphasis + LP analysis. The main rival to MFCCs through the 1990s.
RASTA Processing of Speech — Hermansky & Morgan (1994)
Band-pass filtering of cepstral trajectories to remove slow channel variation — a free alternative to cepstral mean normalisation.
Front-End Factor Analysis for Speaker Verification — Dehak et al. (2011)
The i-vector paper. Dominated speaker recognition for a decade; still a competitive baseline.
X-Vectors: Robust DNN Embeddings for Speaker Recognition — Snyder et al. (2018)
The neural successor to i-vectors. Establishes the pattern of embedding-based speaker modelling that ECAPA-TDNN and later models continue.
YIN, a Fundamental Frequency Estimator for Speech and Music — de Cheveigné & Kawahara (2002)
The autocorrelation-based pitch estimator that was the research default for fifteen years. Still competitive and cheap.
CREPE: A Convolutional Representation for Pitch Estimation — Kim et al. (2018)
The modern neural pitch tracker. Near-ground-truth on monophonic signals; the default for singing-voice synthesis and MIR.
Onsets and Frames: Dual-Objective Piano Transcription — Hawthorne et al. (2018)
The piano-transcription paper. Establishes the dual-target (onset + sustained frame) approach that polyphonic transcription still uses.
Basic Pitch: A Lightweight Pitch Detection Model — Bittner et al. (2022)
Fast, open-source polyphonic pitch estimation for vocals, instruments, and synthesis. Currently the default MIR pitch tool.
Algorithms for Non-Negative Matrix Factorization — Lee & Seung (2001)
The NMF paper that spawned a decade of non-negative audio separation work before deep learning.
Deep Clustering: Discriminative Embeddings for Segmentation and Separation — Hershey et al. (2016)
The embedding-and-cluster approach that sidestepped the speaker-permutation problem and opened the neural-separation era.
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking — Luo & Mesgarani (2019)
The time-domain TCN separator that dramatically beat STFT-based baselines on WSJ0-2mix. Established that STFT is not required.
Demucs: Music Source Separation in the Waveform Domain — Défossez et al. (2019)
The music-separation workhorse: a Wave-U-Net in the time domain. State-of-the-art on MUSDB18 for years and still widely used.
Attention Is All You Need in Speech Separation (SepFormer) — Subakan et al. (2021)
The transformer-based separator that pushed SI-SDRi on WSJ0-2mix beyond 22 dB.
Open-Unmix — Stöter et al. (2019)
A widely used open-source BLSTM music separator. Good baseline and tutorial reference.
Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator — Ephraim & Malah (1984)
The MMSE-STSA estimator. The de-facto classical denoiser for two decades; still the baseline in many enhancement papers.
RNNoise: Learning Noise Suppression — Valin (2018)
A tiny hybrid DSP + RNN denoiser that runs on phones. Widely deployed in open-source apps.
DCCRN: Deep Complex Convolution Recurrent Network — Hu et al. (2020)
The complex-domain denoiser that won the 2020 DNS Challenge. Establishes that modelling phase explicitly improves enhancement.
DNSMOS: A Non-Intrusive Perceptual Speech Quality Metric — Reddy et al. (2021)
The learned non-reference quality metric behind every DNS Challenge since 2021. Replaces PESQ / STOI as the evaluation criterion of choice.
Speaker Recognition from Raw Waveform with SincNet — Ravanelli & Bengio (2018)
The learnable band-pass front end that replaces the mel filterbank. Opens the parameterised-filterbank line.
LEAF: A Learnable Frontend for Audio Classification — Zeghidour et al. (2021)
Gabor-like learnable filters with learned per-channel gain and compression. A clean modern learnable-mel alternative.
Representation Learning with Contrastive Predictive Coding — van den Oord et al. (2018)
CPC — the InfoNCE objective that spawned wav2vec and essentially every self-supervised speech model.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations — Baevski et al. (2020)
The self-supervised speech encoder that made semi-supervised ASR practical and transformed the field. Start here.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction — Hsu et al. (2021)
The BERT-for-speech recipe: predict cluster labels of masked frames. Simpler to train than wav2vec 2.0 and the current backbone of many pipelines.
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing — Chen et al. (2022)
HuBERT + denoising + overlap-mixing. The top-performing self-supervised encoder on SUPERB for a long stretch; robust to noise.
SUPERB: Speech Processing Universal PERformance Benchmark — Yang et al. (2021)
The multi-task benchmark that organises all self-supervised speech encoder comparisons. The gold standard for front-end evaluation.
Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) — Radford et al. (2022)
Not self-supervised (it is weakly supervised on 680 k hours of labelled web audio), but Whisper's encoder has become a widely used front end in its own right.
WaveNet: A Generative Model for Raw Audio — van den Oord et al. (2016)
The dilated autoregressive waveform model. Launched the neural-vocoder era and set the quality bar that every successor has aimed at.
HiFi-GAN: Generative Adversarial Networks for Efficient and High-Fidelity Speech Synthesis — Kong et al. (2020)
The GAN vocoder that made real-time neural TTS standard. Default vocoder in most TTS research through 2024.
BigVGAN: A Universal Neural Vocoder with Large-Scale Training — Lee et al. (2023)
A universal vocoder that handles unseen speakers and music. Uses snake activations and anti-aliased upsampling.
SoundStream: An End-to-End Neural Audio Codec — Zeghidour et al. (2021)
The first neural audio codec with residual vector quantisation. The ancestor of EnCodec and DAC.
High Fidelity Neural Audio Compression (EnCodec) — Défossez et al. (2022)
The de-facto audio codec of 2023 – 2024. Feeds Bark, MusicGen, and many other audio-token models.
High-Fidelity Audio Compression with Improved RVQGAN (DAC) — Kumar et al. (2023)
The current state-of-the-art codec for 44.1 / 48 kHz audio. Improves on EnCodec's reconstruction quality at similar bitrates.
AudioLM: A Language Modeling Approach to Audio Generation — Borsos et al. (2022)
The audio-as-tokens paradigm — generate SoundStream tokens with a hierarchical transformer. The blueprint for MusicLM, VALL-E, and every modern audio generator.
LibriSpeech: An ASR Corpus Based on Public-Domain Audio Books — Panayotov et al. (2015)
The default English ASR benchmark: 960 hours of clean read speech. Still the first dataset most ASR projects touch.
Common Voice: A Massively-Multilingual Speech Corpus — Ardila et al. (2020)
Crowdsourced multilingual speech. The foundation of open multilingual ASR and the most-used counterpart to LibriSpeech.
AudioSet: An Ontology and Human-Labeled Dataset for Audio Events — Gemmeke et al. (2017)
2 M clips, 527 classes. The foundational audio-tagging benchmark; training data for PANN, VGGish, YAMNet, AST, and BEATs.
MUSDB18: A Corpus for Music Separation — Rafii et al. (2017)
The standard music-source-separation benchmark. Vocals / drums / bass / other stems for 150 songs.
MUSAN: A Music, Speech, and Noise Corpus — Snyder et al. (2015)
The go-to noise / music / speech corpus for data augmentation in ASR and speaker recognition.
librosa: Audio and Music Signal Analysis in Python — McFee et al. (2015)
The research-default Python audio library. Ubiquitous; every audio ML paper uses it somewhere.
torchaudio: Building Blocks for Audio and Speech Processing — Yang et al. (2022)
PyTorch's audio library: GPU-accelerated STFT / mel / MFCC, resampling, augmentation, pre-trained models. The default for training pipelines.
SpeechBrain: A General-Purpose Speech Toolkit — Ravanelli et al. (2021)
A friendly, research-oriented PyTorch toolkit for ASR, speaker recognition, enhancement, separation. Good for reproducing and extending published recipes.
NVIDIA NeMo: Conversational AI Toolkit — NVIDIA (ongoing)
The production-grade speech toolkit. Conformer-CTC, Parakeet, Canary, FastPitch, Sortformer. The reference if you want state-of-the-art speech models off the shelf.
ESPnet: End-to-End Speech Processing Toolkit — Watanabe et al. (2018)
The academic PyTorch successor to Kaldi — E2E ASR, TTS, and speech translation. Widely used for low-resource and multilingual work.
Kaldi Speech Recognition Toolkit — Povey et al. (2011)
The classical C++ HMM-GMM / DNN-HMM ASR toolkit. Still widely used for speaker diarisation and as a baseline; the vocabulary of Kaldi-trained features (13-d MFCC + deltas, i-vectors) pervades the literature.
madmom: A New Python Audio and Music Signal Processing Library — Böck et al. (2016)
The reference MIR library for onset / beat / downbeat / chord detection with pre-trained models. Used in most rhythm-analysis research.