Automatic speech recognition is the oldest active machine-learning problem still in production use and has been the proving ground for nearly every probabilistic-modelling idea that later spread across ML. It is not solved: low-resource languages, overlapping speech, and streaming low-latency inference remain genuinely hard. This chapter traces the full arc from HMM-GMM through CTC, RNN-Transducer, attention encoder-decoders, and Conformers to the Whisper paradigm — weakly supervised, massive scale — and covers evaluation, decoding, and deployment.
Sections one through three are orientation and data. Section one is why ASR matters — why a chapter that might sound niche is actually one of the most commercially deployed applications of ML in the world; why the modern voice interface (Alexa, Siri, Google Assistant, automotive, medical scribes, captioning, call-centre analytics, accessibility, translation) sits on an ASR substrate; and how the ASR pipeline differs from every other sequence-modelling problem in its extreme length ratios between input and output. Section two is the ASR landscape: the five eras (HMM-GMM, hybrid HMM-DNN, CTC, attention-based, Conformer-plus-self-supervision-plus-Whisper), who built what, which benchmarks mattered, and how the problem statements themselves shifted. Section three is speech data and corpora: LibriSpeech, Common Voice, TED-LIUM, SPGISpeech, GigaSpeech, VoxPopuli, People's Speech, Fleurs, CoVoST, Switchboard, Fisher, the forced-alignment problem, transcription conventions, and why data curation dominates modern ASR performance.
Sections four through six are the classical foundations. Section four is HMM-GMM acoustic modelling: the left-to-right HMM topology, context-dependent triphones, decision-tree state tying, GMM emission densities, the Baum-Welch EM algorithm, Viterbi decoding, MLLR and fMLLR speaker adaptation, and why every modern ASR engineer still needs to read this vocabulary. Section five is hybrid HMM-DNN: replacing the GMM emission density with a deep neural network's posterior, Mohamed-Dahl-Hinton 2012, sequence-discriminative training (MMI, sMBR, LF-MMI), Kaldi's chain models, TDNN and TDNN-F acoustic models, and the five-year gap in which hybrid systems dominated. Section six is Connectionist Temporal Classification: Graves 2006, the blank symbol, the forward-backward CTC loss, alignment-free sequence training, the conditional-independence assumption and its consequences, greedy and beam decoding, prefix-beam decoding with an external LM, and the CTC systems (DeepSpeech, DeepSpeech 2, Wav2letter, QuartzNet, Citrinet) that launched end-to-end ASR into production.
Sections seven through nine cover the other end-to-end families and the modern backbone. Section seven is the RNN-Transducer (RNN-T): Graves 2012, the joint network over encoder and prediction-network states, the transducer loss, why RNN-T is the dominant streaming architecture at Google / Amazon / Apple, alignment-restricted training, monotonic transducer variants, stateless prediction networks, and the memory pragmatics of computing the loss over the T × U joint lattice. Section eight is Listen, Attend and Spell / attention-based encoder-decoders: the LAS paper, soft attention over encoder outputs, the exposure-bias problem, label-smoothing and scheduled sampling, hybrid CTC/attention training (Watanabe), joint CTC-attention decoding, and the class of systems that culminates in Whisper. Section nine is the Conformer: Gulati 2020, the convolution-plus-attention macaron block, why Conformer is now the default encoder for CTC, RNN-T, and attention alike, E-Branchformer and Zipformer variants, and the way this architecture unified the three end-to-end paradigms.
Sections ten through twelve cover streaming, Whisper-scale supervision, and multilinguality. Section ten is streaming ASR: the latency problem, the chunk-wise and monotonic-attention approaches, RNN-T as the natural streaming architecture, look-ahead windows, Emformer and dynamic chunking, trigger attention, MoChA / MILK / MAtCha, endpointing and voice-activity detection, and the word-level latency metrics that drive the field. Section eleven is Whisper and large-scale weakly-supervised ASR: Radford 2022, 680 000 hours of web audio, the multi-task decoder (transcribe, translate, timestamps, language identification), Whisper's failure modes (hallucinated text, long-form drift, timestamp jitter), faster-whisper / whisper.cpp / Distil-Whisper, and the followers — USM, SeamlessM4T, Canary, Parakeet, OWSM — that built on the recipe. Section twelve is multilingual and low-resource ASR: MMS (Meta's Massively Multilingual Speech, 1100+ languages), XLS-R, USM's 300+ languages, cross-lingual transfer, the zero-shot problem, phoneme-based vs grapheme-based multilingual heads, code-switching, and the Fleurs and VoxLingua benchmarks.
Sections thirteen through fifteen cover self-supervised pretraining, decoding theory, and language-model integration. Section thirteen is self-supervised ASR representations: CPC, wav2vec, wav2vec 2.0 (masked-span prediction on quantised latents), HuBERT (clustered-target iterative pseudo-labelling), WavLM (denoising-style masking), data2vec (unified SSL across modalities), how these are fine-tuned for downstream ASR (CTC head on frozen / unfrozen encoder), and why SSL made 10-hour and even 1-hour fine-tuning budgets viable. Section fourteen is decoding: greedy decoding, beam search, prefix-beam search for CTC, label-synchronous beam search for attention, frame-synchronous beam search for RNN-T, WFST composition (HCLG = H ∘ C ∘ L ∘ G), the Kaldi decoding graph, and shallow fusion / deep fusion / cold fusion with external LMs. Section fifteen is language models and rescoring: n-gram LMs (SRILM, KenLM), neural LMs, first-pass shallow fusion, second-pass rescoring, internal-LM estimation and subtraction, density-ratio method, domain biasing and hotword boosting, and why an external LM is still worthwhile even with a powerful encoder-decoder.
Sections sixteen through eighteen cover evaluation, deployment, and the chapter's operational closing. Section sixteen is evaluation: the word-error-rate metric (substitutions + insertions + deletions, divided by reference length), character-error-rate for Asian languages, WER's limitations (semantic equivalence, capitalisation, punctuation, hesitation tokens), the Kincaid-TUDa test sets, long-form WER, streaming WER with latency, hallucination-rate for Whisper-class models, fairness metrics across demographic groups, and the human-parity debates. Section seventeen is deployment: real-time-factor budgets, CPU vs GPU serving, quantisation (int8, nvfp4), batching strategies, streaming protocols (WebSocket, gRPC), VAD and endpointing pipelines, speaker-diarization integration, inverse-text-normalisation, punctuation and capitalisation post-processing, and the shipping stacks (NVIDIA Riva / NeMo, Kaldi / k2 / icefall, ESPnet, SpeechBrain, Whisper.cpp, torchaudio). Section eighteen is the chapter's closing: how ASR relates to the rest of Part VIII (TTS, speaker recognition, audio classification, music generation), the ASR-as-a-component pattern (voice assistants, captioning, translation, meeting summarisation, RAG-over-audio), the modality convergence into speech-foundation-models (Seamless, Qwen-Audio, GPT-4o), and the links back to Parts V–VII (attention, transformers, sequence models, VLMs) that made the last five years of ASR progress possible.
Every speech, music, and sound model you will ever touch — HMM-GMM ASR from 1995, DeepSpeech from 2014, Whisper from 2022, AudioLM and MusicLM from 2023, GPT-4o's audio front end in 2024 — shares a hidden dependency: a carefully engineered audio signal-processing pipeline that converts continuous pressure waves into numerical arrays. This chapter is about that pipeline. It is the foundation on which every subsequent chapter of Part VIII — automatic speech recognition, text-to-speech, speaker recognition, sound classification, music generation — is built.
It is tempting, in the era of end-to-end deep learning, to assume that classical digital signal processing (DSP) is obsolete: surely a sufficiently expressive neural network can learn everything from raw waveforms? The reality is more nuanced. End-to-end waveform models exist — SincNet, LEAF, and wav2vec 2.0 operate directly on samples — but even they incorporate DSP-inspired inductive biases (band-pass filters, learnable mel-like filterbanks, residual vector quantisation). And for the overwhelming majority of deployed systems — including Whisper, SeamlessM4T, Bark, Tacotron, WaveNet vocoders, and every industrial ASR pipeline — the first stage is still a log-mel-spectrogram computed with a short-time Fourier transform, Hann windows, and a mel filterbank whose parameters descend directly from psychoacoustic research of the 1930s–1940s. DSP is not gone; it has been absorbed into the training pipeline.
Why should the choices at this layer matter so much? Because they set the effective resolution of everything downstream. A 10 ms hop at 16 kHz gives a spectrogram with 100 frames per second — enough temporal resolution for phonemes but not for fast music transients. A 16 kHz sample rate discards everything above 8 kHz, which is fine for speech but mutilates the brilliance of a violin. A 64-bin log-mel spectrogram compresses frequency by roughly a factor of four relative to the 257 bins of a 512-point STFT, which is efficient for Conformer-based ASR but insufficient for fine-grained music-tagging tasks. Every downstream model inherits these trade-offs; a mis-chosen front end is invisible in the loss curve and catastrophic in production.
There is also an engineering story. Audio is high-bandwidth: 16-bit mono at 16 kHz is already 256 kbit/s, and 16-bit stereo at 48 kHz is six times larger. A one-hour recording runs from roughly 115 megabytes (16 kHz mono) to nearly 700 megabytes (48 kHz stereo) uncompressed. This dwarfs images — a 224×224 RGB image is 150 kB — and makes the question of how to compactly represent audio central to every practical system. The 2023 neural-audio-codec revolution (SoundStream, EnCodec, DAC) is the modern answer: quantise waveforms into 75–150 discrete tokens per second and let a decoder-only transformer generate them, exactly as an LLM generates text. But even the codec's encoder is, at heart, a learnable front end — and understanding the classical front end is the prerequisite to understanding its neural successor.
A second reason to learn this material well: audio is where the modern multimodal stack is heading. GPT-4o and Gemini 2.0 accept audio natively. Voice assistants, meeting bots, hearing aids, accessibility tools, music-creation apps, audio-search engines, and the emerging genre of audio foundation models all sit on the same DSP foundation. If you plan to build or debug any of these, you will be reading spectrograms, tuning STFT parameters, comparing sample rates, and arguing about mel-bin counts — skills that are essentially classical, regardless of which deep model you deploy on top.
This chapter is structured around the pipeline: physics and perception, digital-audio fundamentals, time-domain features, Fourier analysis, the STFT, windowing, spectrograms, the mel scale, mel-spectrograms, MFCCs, the broader feature zoo, pitch and onsets, source separation, enhancement, learnable front ends, neural codecs, and finally an operational overview of how these pieces stitch into real-world ML pipelines. Later chapters of Part VIII take this foundation for granted; this is the layer you can always come back to.
Sound is a mechanical pressure wave in a medium — typically air. A vibrating source (a vocal fold, a guitar string, a loudspeaker cone) periodically compresses and rarefies the surrounding molecules, and that pattern of pressure variations propagates outward at roughly 343 m/s at room temperature. Everything else — frequency, amplitude, timbre, pitch, loudness — is a description of this one underlying phenomenon.
The two primary parameters are frequency (how many oscillations per second, measured in Hertz, Hz) and amplitude (the magnitude of the pressure variations, usually measured logarithmically as sound pressure level in decibels). A pure sinusoid at frequency f and amplitude A is fully described by these two numbers plus a phase. Real sounds are superpositions of such sinusoids across a wide frequency range — human speech occupies roughly 80 Hz (deep voice fundamental) to 8 kHz (fricatives like /s/ and /ʃ/); musical instruments span 20 Hz to 20 kHz; bat echolocation goes to 100 kHz. The human auditory system responds to roughly 20 Hz – 20 kHz at the young-adult extreme, and degrades from the top end with age and noise exposure.
The decibel (dB) is logarithmic: a 10 dB increase corresponds to a factor-of-ten increase in power (or a factor of √10 ≈ 3.16 in amplitude). Sound pressure level (SPL) is measured in dB relative to 20 μPa, the threshold of hearing at 1 kHz for young adults. A whisper is around 30 dB SPL; normal conversation is 60 dB; a rock concert is 110 dB; the pain threshold is near 130 dB. Logarithmic scaling matches perception: doubling perceived loudness typically requires roughly a ten-fold increase in power — a 10 dB step — which is why a logarithmic scale tracks loudness so naturally.
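A quick numerical sketch of these dB conversions in plain NumPy (the helper names here are ours, not from any library):

```python
import numpy as np

def power_to_db(power_ratio):
    """10 dB per factor of ten in power."""
    return 10.0 * np.log10(power_ratio)

def amplitude_to_db(amp_ratio):
    """Power goes as amplitude squared, hence the factor of 20."""
    return 20.0 * np.log10(amp_ratio)

# A factor-of-ten power increase is +10 dB ...
print(power_to_db(10.0))                        # 10.0
# ... equivalently a sqrt(10) ≈ 3.16x amplitude increase:
print(amplitude_to_db(np.sqrt(10.0)))           # ≈ 10.0

# Sound pressure level is amplitude dB relative to 20 µPa:
p_ref = 20e-6
p_conversation = p_ref * 10 ** (60 / 20)        # pressure at 60 dB SPL
print(amplitude_to_db(p_conversation / p_ref))  # ≈ 60.0
```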
Phase is the position within an oscillation cycle. Two pure tones of identical frequency and amplitude but opposite phase cancel completely — this is the principle behind active noise cancellation headphones. Phase matters for spatial hearing (the inter-aural time difference is a phase difference) and for waveform reconstruction from spectrograms (the fundamental reason Griffin-Lim is an approximation), but humans are remarkably insensitive to the absolute phase of stationary sounds — a crucial fact that justifies the magnitude-only spectrogram that dominates audio ML.
Psychoacoustics is the science of what the auditory system actually perceives. Two facts are especially important. First, the equal-loudness contours (Fletcher-Munson 1933, ISO 226): a 30 Hz tone must be about 60 dB louder than a 1 kHz tone to sound equally loud. This is why A-weighting (which de-emphasises low and very high frequencies) is the standard noise-measurement weighting. Second, auditory masking: a loud tone at one frequency raises the perceptual threshold of nearby frequencies (frequency masking), and a loud transient masks softer sounds that occur shortly before or after (temporal masking). Perceptual audio codecs (MP3, AAC, Opus) exploit masking aggressively — they allocate fewer bits to masked regions of the spectrogram — and the same principle motivates why certain log-magnitude compressions are good for ML front ends.
The mel scale (Section 09) and the critical band structure (Bark scale, ERB scale) are consequences of the cochlea's physical construction. The basilar membrane inside the cochlea is a frequency-tuned resonator: high frequencies stimulate the base, low frequencies stimulate the apex, and the tuning width grows nonlinearly with frequency. This means the ear resolves two 100 Hz-apart pure tones near 200 Hz (easy) but not near 10 kHz (impossible). Any feature that ignores this — a linear-frequency spectrogram, say — is giving a neural network information the ear does not use, and burying information the ear does.
Finally, sound propagates: it reflects off walls (producing reverberation), it attenuates with distance, it diffracts around obstacles, and it is perturbed by wind and temperature gradients. Every deployed audio system must cope with these, which is why far-field speech recognition, beamforming, and room-acoustic modelling (dereverberation, RIR estimation) are live subfields. We return to them in the enhancement and source-separation sections.
To process audio on a computer, the continuous pressure wave picked up by a microphone must be converted into a sequence of numbers. This conversion has two steps — sampling (discretising time) and quantisation (discretising amplitude) — and the choices made at each step constrain everything that can be done downstream.
Sampling replaces the continuous signal x(t) with the sequence x[n] = x(n T), where T is the sampling period and f_s = 1/T is the sample rate. The Nyquist-Shannon sampling theorem is the foundational result: any signal whose Fourier transform vanishes above frequency B can be perfectly reconstructed from samples at rate f_s > 2B. Violating this — attempting to represent a 12 kHz tone with a 16 kHz sample rate, whose Nyquist limit is 8 kHz — causes aliasing: the high-frequency content folds back into the representable band as spurious lower frequencies. Real ADCs therefore always apply an anti-aliasing filter (a steep analogue low-pass below the Nyquist limit) before sampling.
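The folding arithmetic can be verified directly. A NumPy sketch using the 12 kHz / 16 kHz example from the text:

```python
import numpy as np

fs = 16_000                 # sample rate; the Nyquist limit is 8 kHz
n = np.arange(64)

# A 12 kHz cosine sampled at 16 kHz ...
above_nyquist = np.cos(2 * np.pi * 12_000 * n / fs)
# ... folds back to fs - 12 kHz = 4 kHz: the sample sequences are identical.
alias = np.cos(2 * np.pi * 4_000 * n / fs)

print(np.max(np.abs(above_nyquist - alias)))   # ~0: indistinguishable after sampling
```

Once sampled, no downstream processing can tell the two tones apart — which is exactly why the anti-aliasing filter must act before the ADC.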
Canonical sample rates cluster around the intended bandwidth: 8 kHz (narrow-band telephony, up to 4 kHz content), 16 kHz (wide-band speech, up to 8 kHz — the default for ASR), 22.05 kHz (half of CD rate, lightweight audio), 24 kHz (modern TTS), 32 kHz (broadcasting), 44.1 kHz (CD / consumer music), and 48 kHz (professional audio and video). Whisper trains at 16 kHz, Bark at 24 kHz, MusicLM at 24 kHz, EnCodec's music mode at 48 kHz. Choosing the wrong rate — e.g. processing 48 kHz music with a 16 kHz model — either discards essential content or wastes compute.
Quantisation replaces each continuous amplitude with a finite integer. Common bit depths are 8-bit (μ-law telephony, 256 levels), 16-bit (CD quality, 65 536 levels, the overwhelming default), 24-bit (professional, 16.7 M levels), and 32-bit float (modern DAWs and intermediate processing). The signal-to-quantisation-noise ratio of a uniform B-bit quantiser of a full-scale sinusoid is approximately 6.02 B + 1.76 dB — so 16-bit audio has ≈ 98 dB of dynamic range, which essentially covers human hearing's roughly 100 dB window at typical listening levels. Dither — adding a small amount of random noise before quantisation — breaks up the correlation between quantisation error and the signal and is standard practice at low bit depths.
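The 6.02 B + 1.76 dB rule is easy to check empirically; a minimal sketch with a plain uniform rounding quantiser (an idealisation, not any particular ADC model):

```python
import numpy as np

def measured_sqnr_db(bits, n=200_000):
    """Quantise a full-scale sinusoid uniformly and measure the SNR."""
    t = np.arange(n)
    x = np.sin(2 * np.pi * 440 * t / 48_000)   # full-scale 440 Hz sine
    scale = 2 ** (bits - 1) - 1
    xq = np.round(x * scale) / scale           # uniform rounding quantiser
    noise = x - xq
    return 10 * np.log10(np.mean(x ** 2) / np.mean(noise ** 2))

for bits in (8, 16):
    print(bits, measured_sqnr_db(bits), 6.02 * bits + 1.76)
```

The measured values land within a fraction of a dB of the formula, because the quantisation error of a non-harmonically-related sine is very nearly uniform white noise of power Δ²/12.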
A related pitfall is sample-rate conversion: downsampling must apply a low-pass filter below the new Nyquist limit first (use a vetted resampler such as soxr or torchaudio.functional.resample); naïve decimation causes aliasing.
Channel layout matters almost as much. Mono is a single channel; stereo is left/right; 5.1 is multichannel surround. Most speech processing downmixes to mono; music work often preserves stereo; multi-microphone arrays (smart speakers, meeting-room capture) operate on raw channel counts up to 8 or 16 and do beamforming before any single-channel stage.
File formats fall into three groups. Uncompressed PCM (WAV, AIFF) stores raw samples — the canonical lossless representation and the default in almost every ML pipeline. Lossless compression (FLAC, ALAC) uses entropy coding to shrink PCM by 40–60% with bit-exact recovery — preferred for archival and dataset distribution. Lossy compression (MP3, AAC, Opus, Vorbis) discards perceptually redundant content, typically reducing size by 10× or more; Opus at 24 kb/s is broadcast-quality for speech. Loading any of these in Python is a one-liner with soundfile, librosa.load, or torchaudio.load, but beware: lossy formats introduce subtle spectral artefacts that can hurt model performance if training data and deployment data use different codecs.
A practical gotcha: Python audio libraries differ in their default output format. torchaudio.load returns a float tensor in [-1, 1]; librosa.load also returns float; scipy.io.wavfile.read returns int16 and requires dividing by 32768 to get floats. Half of all audio bugs come from this mismatch.
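A defensive normalisation helper along these lines (our own illustrative function, not a library API) sidesteps the mismatch:

```python
import numpy as np

def to_float32(audio):
    """Normalise integer PCM to float32 in [-1, 1]; pass floats through."""
    if audio.dtype == np.int16:
        return audio.astype(np.float32) / 32768.0
    if audio.dtype == np.int32:
        return audio.astype(np.float32) / 2147483648.0
    if audio.dtype == np.uint8:                      # 8-bit WAV is unsigned
        return (audio.astype(np.float32) - 128.0) / 128.0
    return audio.astype(np.float32)

# Simulated scipy.io.wavfile.read output (int16):
pcm = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
print(to_float32(pcm))    # [-1.0, 0.0, 0.5, ~0.99997]
```

Calling this on every loaded array, whatever its source, makes the rest of the pipeline dtype-agnostic.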
Before any frequency analysis, there is a rich toolkit of features and operations that work directly on the waveform. Many of them look elementary; all of them are still in daily use. Whether you are writing a voice-activity detector for a wake-word system or a beat-tracker for a DAW, the time domain is where you start.
Framing is the first operation in almost every pipeline. The waveform is sliced into overlapping frames of length N samples (typically 20–30 ms, so 320–480 samples at 16 kHz) with a hop of H samples (10 ms = 160 samples is standard). This converts a 1D sequence into a 2D array of shape (frames, samples-per-frame). Every subsequent time-frequency transform operates on these frames. The hop sets the effective frame rate — 10 ms hop gives 100 frames per second, which matches phoneme duration and is the de-facto standard for speech.
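Framing can be sketched in a few lines of NumPy (the `frame_signal` helper is illustrative; librosa.util.frame and torch.Tensor.unfold do the same job):

```python
import numpy as np

def frame_signal(x, frame_length, hop_length):
    """Slice a 1-D waveform into overlapping frames, shape (n_frames, frame_length)."""
    n_frames = 1 + (len(x) - frame_length) // hop_length
    idx = (np.arange(frame_length)[None, :]
           + hop_length * np.arange(n_frames)[:, None])
    return x[idx]

fs = 16_000
x = np.random.default_rng(0).standard_normal(fs)            # one second of audio
frames = frame_signal(x, frame_length=400, hop_length=160)  # 25 ms / 10 ms
print(frames.shape)   # (98, 400): ~100 frames per second
```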
Root-mean-square energy (RMS) and its close cousin short-time energy are per-frame scalars summarising amplitude: RMS = sqrt(mean(x²)). Plotted over time, this is the signal's envelope. RMS drives voice-activity detection, auto-gain-control, compressors, and most silence-trimming heuristics. It is trivially computable and surprisingly powerful.
Zero-crossing rate (ZCR) counts how often the signal changes sign per frame. High ZCR correlates with high-frequency content: voiced speech has low ZCR (the vocal fold period dominates), unvoiced fricatives (/s/, /ʃ/, /f/) have high ZCR, and percussion transients have very high ZCR. ZCR is a cheap and historically important feature for voiced/unvoiced detection and music/speech discrimination.
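Both per-frame features take only a few lines; a sketch with synthetic "voiced" and "fricative" frames:

```python
import numpy as np

def rms(frame):
    """Root-mean-square energy of one frame."""
    return np.sqrt(np.mean(frame ** 2))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

fs = 16_000
t = np.arange(160) / fs                       # one 10 ms frame
voiced = np.sin(2 * np.pi * 120 * t)          # low-pitched, voiced-like tone
rng = np.random.default_rng(0)
fricative = rng.standard_normal(160) * 0.1    # noise-like, fricative-like

print(rms(voiced), zero_crossing_rate(voiced))        # high energy, low ZCR
print(rms(fricative), zero_crossing_rate(fricative))  # low energy, ZCR near 0.5
```

The two numbers together already separate voiced speech from fricatives and silence — the core of a classical energy/ZCR voice-activity detector.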
Autocorrelation — the correlation of the signal with a shifted copy of itself — reveals periodicities. The first non-trivial autocorrelation peak for a voiced signal lies at the pitch period, which is the basis of pitch-estimation algorithms (YIN, pYIN in Section 13). Autocorrelation is also the foundation of linear predictive coding (LPC), which models the vocal tract as an all-pole filter and was the workhorse of low-bit-rate speech coding for three decades.
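A minimal autocorrelation pitch estimator illustrates the idea (a toy sketch, far less robust than YIN or pYIN):

```python
import numpy as np

def pitch_autocorr(x, fs, fmin=50.0, fmax=500.0):
    """f0 = fs / (lag of the strongest autocorrelation peak in the pitch range)."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # keep non-negative lags
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return fs / lag

fs = 16_000
t = np.arange(4_000) / fs                               # 0.25 s of signal
# A 200 Hz "voiced" tone with a few decaying harmonics:
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))
print(pitch_autocorr(x, fs))    # 200.0: the peak sits at a lag of 80 samples
```

Restricting the lag search to the plausible pitch range is essential; without it the trivial peak at lag 0 always wins.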
Voice-activity detection (VAD) decides which frames contain speech. Classical VADs (ITU-T G.729 Annex B, WebRTC VAD) combine energy, ZCR, and periodicity with hand-tuned thresholds; modern VADs (Silero-VAD, NVIDIA Marblenet) are small neural networks operating on mel-spectrogram features. VAD matters more than it looks: in a long conversation, 50–70% of frames contain no speech, and running an ASR model on all of them wastes compute and frequently hallucinates. VAD-gated inference is standard in every production speech pipeline.
Silence trimming and loudness normalisation are close cousins. Trimming removes leading/trailing silence (librosa.effects.trim, torchaudio.functional.vad). Loudness normalisation rescales a clip so its measured loudness (LUFS, the broadcast standard) matches a target — essential for training data where clip gains vary wildly.
Two more operations deserve mention. Resampling converts between sample rates using a polyphase-FIR or sinc-interpolation filter; it is fundamental because trained models assume a single rate. Time stretching and pitch shifting — changing duration without changing pitch, or vice versa — are the foundation of data augmentation (SpecAugment time stretching, speed-perturbation) and musical effects (phase vocoder, WSOLA, Rubberband). These operations are not free; poor implementations introduce phase artefacts that a neural network can easily detect. Use vetted libraries (soxr, torchaudio.transforms.Resample, pedalboard, rubberband) rather than rolling your own.
The Fourier transform is the single most important mathematical tool in audio. It decomposes a signal into a sum of sinusoids at different frequencies — a frequency-domain representation — and virtually every feature, filter, model, and perceptual fact in this chapter is grounded in it. A careful reader should be comfortable with three flavours: the continuous Fourier transform (for theory), the discrete Fourier transform (DFT) for finite sampled signals, and the fast Fourier transform (FFT) for computation.
The continuous Fourier transform of a signal x(t) is X(f) = ∫ x(t) exp(−2πi f t) dt. It maps a function of time to a function of frequency; the inverse transform maps back. The transform is linear, preserves energy (Parseval's theorem), turns convolution into multiplication, and turns differentiation into multiplication by 2πi f. These properties are the reason every linear time-invariant system — every filter — is fully characterised by its frequency response, the Fourier transform of its impulse response.
The discrete Fourier transform (DFT) operates on N samples: X[k] = Σ_{n=0}^{N−1} x[n] exp(−2πi k n / N). The output is N complex numbers, one per frequency bin. Bin k corresponds to frequency k · f_s / N Hz. The first bin is DC; the last unique bin is the Nyquist frequency; the upper half mirrors the lower half for real-valued inputs, which is why audio libraries store only the first N/2 + 1 bins (rfft rather than fft).
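The bin-to-frequency mapping, concretely, in NumPy:

```python
import numpy as np

fs, N = 16_000, 512
t = np.arange(N) / fs
x = np.sin(2 * np.pi * 1000 * t)            # 1 kHz tone

spectrum = np.fft.rfft(x)
print(len(spectrum))                         # 257 = N/2 + 1 bins for real input

freqs = np.fft.rfftfreq(N, d=1 / fs)         # bin k -> k * fs / N Hz
peak_bin = int(np.argmax(np.abs(spectrum)))
print(peak_bin, freqs[peak_bin])             # bin 32 -> 1000.0 Hz
```

1 kHz lands exactly on a bin here because 1000 is an integer multiple of the 31.25 Hz bin spacing; Section 07's windows exist precisely because most real frequencies do not.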
The fast Fourier transform (FFT) is not a different transform but an algorithm for computing the DFT in O(N log N) time instead of O(N²). Cooley and Tukey rediscovered it in 1965 (Gauss had sketched it in 1805); it is arguably the most important algorithmic contribution of the twentieth century. Every audio library uses it: NumPy's np.fft, PyTorch's torch.fft, CUDA's cuFFT, Intel MKL's FFT. FFT sizes are typically powers of two (512, 1024, 2048) because power-of-two FFTs are the fastest, though modern implementations handle arbitrary sizes efficiently.
The magnitude spectrum |X[k]| is the basis of all spectrogram-based features. For a periodic signal, the magnitude spectrum has sharp peaks at the fundamental frequency f0 and its integer multiples (the harmonics) — this is why voiced speech shows distinct horizontal striations in a spectrogram, and why pitch estimation from the magnitude spectrum is feasible. The power spectrum is |X[k]|² and is more convenient when we want energy-based quantities (e.g. for the mel filterbank).
The phase spectrum ∠X[k] is harder to interpret: it varies rapidly with the time origin, wraps modulo 2π, and is essentially noise-like for typical audio. But it is not irrelevant: the phase carries the transient timing information that distinguishes a snare drum from a snare drum played backwards, and it is the reason inverting a spectrogram to a waveform requires either an iterative phase-recovery algorithm (Griffin-Lim) or a learned neural vocoder (WaveNet, HiFi-GAN, BigVGAN — Section 17).
A subtle but practical point: the DFT assumes the signal is periodic with period N. When we take the DFT of a finite-length segment of a non-periodic signal, we are implicitly treating that segment as one period of an infinite tiling. Unless the segment happens to start and end at the same value with the same slope, this creates a discontinuity — a step function repeated every N samples — whose Fourier transform has energy at every bin. The result is spectral leakage: power from one actual sinusoid spreads into neighbouring bins. Windowing (Section 07) is the fix: taper the segment to zero at both ends so the implicit periodic tiling is smooth.
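Leakage is easy to demonstrate: an off-bin sinusoid analysed with an (implicit) rectangular window spreads energy widely, and a Hann taper — anticipating Section 07 — suppresses most of it. A NumPy sketch:

```python
import numpy as np

fs, N = 16_000, 512
t = np.arange(N) / fs
# Bin spacing is 16000/512 = 31.25 Hz, so a 1015 Hz tone falls
# between bins and leaks into its neighbours.
x = np.sin(2 * np.pi * 1015 * t)

rect = np.abs(np.fft.rfft(x))                  # no taper = rectangular window
hann = np.abs(np.fft.rfft(x * np.hanning(N)))  # Hann-tapered frame

def db_below_peak(mag, offset):
    """Magnitude `offset` bins beyond the peak, relative to the peak, in dB."""
    peak = np.argmax(mag)
    return 20 * np.log10(mag[peak + offset] / mag[peak])

# Ten bins from the tone, the rectangular analysis has leaked far
# more energy than the Hann analysis:
print(db_below_peak(rect, 10))   # tens of dB below the peak
print(db_below_peak(hann, 10))   # far lower still
```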
Audio is non-stationary: the spectrum of a sentence changes from phoneme to phoneme every few tens of milliseconds, and the spectrum of a song changes from note to note every fraction of a second. A single DFT of an entire clip is almost useless because it throws away time. The short-time Fourier transform (STFT) is the standard response: compute a DFT on each short frame so we get a local spectrum that evolves over time.
Formally, the STFT of x[n] with window w[n] of length N and hop H is X[m, k] = Σ_{n=0}^{N−1} x[m H + n] · w[n] · exp(−2πi k n / N). The output X[m, k] is a 2D complex array indexed by frame m (time) and bin k (frequency). The time resolution is H / f_s seconds; the frequency resolution is f_s / N Hz. Typical speech settings at 16 kHz: window = 400 samples (25 ms), hop = 160 samples (10 ms), FFT size = 512, giving 257 frequency bins and 100 frames per second.
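A from-scratch STFT matching the formula and the quoted speech settings (illustrative only; in practice use torch.stft or librosa.stft, which also handle centring and padding):

```python
import numpy as np

def stft(x, n_fft=512, win_length=400, hop_length=160):
    """Frame, apply a Hann window, zero-pad to n_fft, rfft — per the formula."""
    window = np.hanning(win_length)
    n_frames = 1 + (len(x) - win_length) // hop_length
    out = np.empty((n_frames, n_fft // 2 + 1), dtype=complex)
    for m in range(n_frames):
        frame = x[m * hop_length : m * hop_length + win_length]
        out[m] = np.fft.rfft(frame * window, n=n_fft)
    return out

fs = 16_000
x = np.random.default_rng(0).standard_normal(fs)   # one second of audio
X = stft(x)
print(X.shape)     # (98, 257): ~100 frames per second, 257 frequency bins
```

Zero-padding the 400-sample window to a 512-point FFT is the standard trick for getting a fast power-of-two transform without changing the analysis window length.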
The inverse STFT reconstructs the waveform from the complex STFT. Given perfect magnitude and phase, and a window that satisfies the constant-overlap-add (COLA) condition — the summed, shifted windows equal a constant — the reconstruction is exact: x̂[n] = (1/C) Σ_m (IDFT(X[m, :]))[n − m H] · w[n − m H]. The practical method is torch.istft / librosa.istft. A 50% overlapping Hann window satisfies COLA; so does a 75% overlapping Hann; some window/hop combinations do not, and produce amplitude modulation at the frame rate.
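Round-trip exactness can be checked with a hand-rolled overlap-add ISTFT. This sketch windows only at analysis and normalises by the summed shifted windows; library ISTFTs such as torch.istft, which window at synthesis too, normalise by the squared sum instead:

```python
import numpy as np

def stft_frames(x, win, hop):
    """Windowed frames -> rfft, as in the analysis formula."""
    n_frames = 1 + (len(x) - len(win)) // hop
    return np.stack([np.fft.rfft(x[m * hop : m * hop + len(win)] * win)
                     for m in range(n_frames)])

def istft_frames(X, win, hop):
    """Inverse-DFT each frame, overlap-add, divide by the summed windows."""
    n = hop * (len(X) - 1) + len(win)
    out, norm = np.zeros(n), np.zeros(n)
    for m, spec in enumerate(X):
        out[m * hop : m * hop + len(win)] += np.fft.irfft(spec, n=len(win))
        norm[m * hop : m * hop + len(win)] += win
    return out / np.maximum(norm, 1e-12)

win, hop = np.hanning(512), 128              # Hann at 75% overlap
x = np.random.default_rng(0).standard_normal(16_000)
x_hat = istft_frames(stft_frames(x, win, hop), win, hop)

# Away from the edges (where fewer windows overlap), reconstruction is exact
# up to floating-point rounding:
print(np.max(np.abs(x[512:-512] - x_hat[512:-512])))
```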
When we need only an amplitude image — as almost every modern audio model does — we take |X[m, k]| and discard the phase. This is the magnitude STFT, and when we additionally remap frequency with a mel filterbank and apply log compression, we get the log-mel spectrogram of Section 10. Note that discarding phase is lossy: the same magnitude can correspond to many different waveforms. Reconstructing a waveform from only the magnitude is the business of Griffin-Lim (iteratively re-estimating phase from a magnitude spectrogram under the constraint that the result's STFT has the target magnitude) and of neural vocoders (learning the phase from data).
Griffin-Lim (Griffin and Lim 1984) is the classical phase-recovery algorithm: start with random phase, take inverse STFT, retake STFT, replace the magnitude with the target, take inverse STFT again, and iterate. It converges to a waveform whose STFT magnitude approximates the target; perceptually it sounds robotic for speech but is acceptable for coarse sketches and is still the default when a quick low-quality waveform is needed. For production TTS, Griffin-Lim has been entirely replaced by neural vocoders.
An important cousin of the STFT is the constant-Q transform (CQT). The STFT has linearly spaced frequency bins, which is awkward for music: an octave at 100 Hz spans 100 bins if Δf is 1 Hz, but an octave at 10 kHz spans 10 000. The CQT uses logarithmically spaced bins — 12 bins per octave matches Western musical pitches — and adapts window length per bin so the quality factor Q = f / Δf is constant. Libraries: librosa.cqt, nnAudio. CQT is preferred over STFT for chord recognition, key estimation, and most music-information-retrieval tasks.
Finally, the STFT is differentiable: torch.stft and torchaudio.transforms.Spectrogram compute it on-GPU, inside the training graph, so modern models treat the spectrogram not as a fixed preprocessing step but as a learnable module whose parameters (window length, hop, FFT size, sometimes the window shape itself) can be tuned or learned. This is the point where classical DSP meets the training loop — a theme the learnable-front-ends section (Section 16) will take all the way.
Taking a DFT of a finite segment of a longer signal is equivalent to multiplying an infinite signal by a rectangular function that is one inside the segment and zero outside. This multiplication in the time domain corresponds to convolution in the frequency domain, and the rectangular window's Fourier transform — the sinc function — has a narrow main lobe but enormous side lobes (the first side lobe is only 13 dB below the main lobe). The practical effect is spectral leakage: a single pure sinusoid in the signal produces energy across dozens of DFT bins, obscuring nearby weaker tones.
The standard response is to multiply each frame by a window function — a smooth taper that goes to zero at both ends of the frame — before taking the DFT. The window's own Fourier transform determines how leakage is distributed: a wider main lobe worsens frequency resolution, but lower side lobes reduce leakage of strong tones into weak neighbouring bins. Every window is a trade-off between these two.
The three workhorse windows in audio are Hann, Hamming, and Blackman-Harris. The Hann window w[n] = 0.5 (1 − cos(2π n / (N−1))) has a first side lobe at −31 dB — far better than the rectangular window's −13 dB — and is the default in librosa, torchaudio, and essentially every modern STFT. The Hamming window w[n] = 0.54 − 0.46 cos(2π n / (N−1)) shifts the weighting to suppress the first side lobe to −43 dB at the cost of slower side-lobe roll-off; it was standard in HTK-era ASR and remains common in Kaldi. Blackman-Harris suppresses side lobes to −92 dB and is preferred for precision measurement (instrument tuning, feedback detection) but has a wider main lobe than most ML applications want.
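The quoted side-lobe levels can be measured directly from a zero-padded FFT of the window itself (a rough measurement sketch; the scan for the end of the main lobe is deliberately naïve):

```python
import numpy as np

def peak_sidelobe_db(window, oversample=64):
    """Highest side lobe relative to the main lobe, from a zero-padded FFT."""
    n = len(window)
    mag = np.abs(np.fft.rfft(window, n * oversample))
    mag /= mag[0]                    # the main lobe peaks at DC
    i = 1
    while mag[i + 1] < mag[i]:       # walk down the main lobe to its first null
        i += 1
    return 20 * np.log10(mag[i:].max())

N = 512
print(peak_sidelobe_db(np.ones(N)))      # rectangular: ≈ -13 dB
print(peak_sidelobe_db(np.hanning(N)))   # Hann: ≈ -31 dB
```

The same function applied to np.hamming or scipy.signal.windows.blackmanharris reproduces the other figures in this section.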
The Kaiser window w[n] = I0(β sqrt(1 − ((n − (N−1)/2) / ((N−1)/2))²)) / I0(β) has a shape parameter β that trades main-lobe width for side-lobe suppression: β = 0 is rectangular, β ≈ 5 is similar to Hamming, β ≈ 8.6 approximates Blackman. The Kaiser is essentially a tunable knob spanning the whole window pantheon, and it is the default in many filter-design tools (scipy.signal.firwin).
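These trade-offs can be checked directly: zero-pad a window's FFT and read off the peak side-lobe level. A minimal numpy sketch, using the window formulas quoted above (the helper name is ours):

```python
import numpy as np

def peak_sidelobe_db(window, zero_pad_factor=64):
    """Peak side-lobe level relative to the main lobe, in dB."""
    spec = np.abs(np.fft.rfft(window, n=zero_pad_factor * len(window)))
    spec /= spec.max()
    # walk down the main lobe to its first null, then take the max beyond it
    i = 0
    while i < len(spec) - 1 and spec[i + 1] < spec[i]:
        i += 1
    return 20 * np.log10(spec[i:].max())

N = 1024
n = np.arange(N)
rect = np.ones(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / (N - 1)))
hamming = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))
# rectangular ≈ -13 dB, Hann ≈ -31 dB, Hamming ≈ -43 dB
```

Running this reproduces the side-lobe figures cited above to within a fraction of a dB.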
Two subtler windows matter for measurement. The flat-top window maximises amplitude accuracy: a sinusoid that does not fall exactly on a bin centre will still have its true amplitude read at the peak, at the cost of a very wide main lobe. This is why audio analysers use it for SPL calibration and THD measurement, but it is rarely used in ML. The Gaussian window has the minimum time-frequency uncertainty product (it achieves equality in the Heisenberg-Gabor bound) and is the window implicit in the Gabor transform.
A practical question is: does window choice matter for neural networks? Empirically, most ASR and audio-tagging benchmarks are insensitive to Hann vs Hamming — the learned representation dominates any reasonable window. But when a pipeline is extraordinarily sensitive to fine spectral structure — very-low-SNR keyword spotting, high-fidelity music tagging, precise pitch tracking — the choice moves the needle by a percent or two. Always report the window in papers; always reproduce it at inference.
Finally, windows interact with overlap-add reconstruction. A window w[n] with hop H satisfies the constant-overlap-add (COLA) condition when Σm w[n − mH] is constant; when the same window is applied at both analysis and synthesis, the relevant condition is on the squared window, Σm w[n − mH]². The periodic Hann window satisfies Σ w = constant = 1 at 50% overlap and Σ w² = constant = 1.5 at 75% overlap. Violating these conditions produces amplitude modulation at the frame rate, audible as a buzz. Modern ISTFT implementations check this and normalise by the window-envelope sum, but if you write your own overlap-add you must get it right.
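The overlap-add conditions are easy to verify numerically: sum shifted copies of the window (or its square) and inspect the steady-state region. A small numpy sketch, using the periodic ("DFT-even") Hann window:

```python
import numpy as np

def overlap_add(window, hop, reps=8):
    """Sum shifted copies of `window`; return the steady-state middle part."""
    N = len(window)
    out = np.zeros(N + hop * (reps - 1))
    for m in range(reps):
        out[m * hop : m * hop + N] += window
    return out[N : len(out) - N]  # discard partially covered edges

N = 512
n = np.arange(N)
hann = 0.5 * (1 - np.cos(2 * np.pi * n / N))  # periodic ("DFT-even") Hann

amp_cola = overlap_add(hann, N // 2)        # Σ w  at 50% overlap -> 1.0
pow_cola = overlap_add(hann ** 2, N // 4)   # Σ w² at 75% overlap -> 1.5
```

Note the periodic window (denominator N, not N−1): the symmetric variant only satisfies these identities approximately.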
The spectrogram is the visual and computational representation that dominates modern audio. It is the matrix |X[m, k]| (or its square, or its log) where m indexes time frames and k indexes frequency bins. Plotted with time on the x-axis and frequency on the y-axis, a spectrogram lets a human read speech phonemes, identify musical notes, detect dog barks, and spot engine-knock signatures. Fed into a CNN or a transformer, a spectrogram is the input to essentially every non-end-to-end audio model.
The magnitude spectrogram |X[m, k]| and the power spectrogram |X[m, k]|² are related by squaring; the power spectrogram is more convenient for the mel filterbank because energy is additive, whereas magnitudes are not. Both are strictly non-negative, and their dynamic range is enormous — a loud vowel can be 60 dB louder than a quiet background, so the raw magnitude varies by a factor of a thousand. Without compression, most pixels in a spectrogram plot would be black.
The universal fix is to log-compress the magnitudes: S[m, k] = log(|X[m, k]|² + ε), or in dB: S[m, k] = 10 log10(|X[m, k]|² + ε). The small constant ε (often 1e-10) prevents log(0). Log scaling mirrors human perception (dB is the perceptual unit of loudness), compresses dynamic range so both loud and quiet content are visible / numerically well-behaved, and is the standard input normalisation for speech models. A log-power-spectrogram is the default feature fed to Conformer-based ASR systems; a log-mel-spectrogram (Section 10) is even more common.
Two variants matter. The wide-band spectrogram uses a short window (≈ 5 ms): it resolves rapid time events (stops, click tracks) but has poor frequency resolution, so individual harmonics blur together. The narrow-band spectrogram uses a long window (≈ 25 ms): it resolves harmonics clearly but smears time. Phoneticians read both. For ASR, 25 ms windows at a 10 ms hop is the near-universal compromise; for music transient analysis, 10 ms windows at 5 ms hop work better.
A linear-frequency spectrogram (raw STFT) gives equal Hz-width bins. This is rarely what we want perceptually: most useful speech content lies below 4 kHz, so at a 16 kHz sample rate half the bins cover a region carrying comparatively little information. Mapping to a log-frequency or mel-frequency axis (Sections 9–10) is almost always an improvement. Librosa's specshow displays spectrograms; the y_axis argument controls whether the display is linear, log, or mel.
Normalisation is subtle. Per-utterance mean/variance normalisation (subtract the mean log-spectrogram, divide by the standard deviation) reduces channel and speaker variation and is standard in speech. Global normalisation (using dataset-wide statistics) is preferred when long-term averages matter, for instance in audio-event detection. The choice affects how spectrograms look and how models generalise; verify it matches the pre-trained model you are using at inference.
Finally, do not overlook SpecAugment (Park 2019), the augmentation strategy that treats the spectrogram as an image and randomly masks out contiguous vertical stripes (frequency masking), horizontal stripes (time masking), and slight time warps. SpecAugment was the single largest breakthrough in end-to-end ASR training between 2017 and 2020 and remains the default augmentation for any spectrogram-based pipeline. Its success is the most direct evidence that spectrograms are best treated as images.
The mel scale is a perceptual frequency scale whose step size matches how humans perceive pitch. Doubling a frequency from 100 Hz to 200 Hz sounds like a larger step than doubling from 10 000 Hz to 20 000 Hz — even though both are an octave. The mel scale, proposed by Stevens, Volkmann, and Newman in 1937, assigns "mels" such that equal mel differences sound like equal pitch differences, anchored by the convention that 1 000 Hz = 1 000 mel.
The practical mel-to-Hz mapping most commonly used is: m(f) = 2595 · log10(1 + f / 700). Its inverse is f(m) = 700 · (10^(m/2595) − 1). This formula (the O'Shaughnessy 1987 form, adopted by HTK) gives a near-linear mapping below about 1 kHz and a near-logarithmic mapping above. Other formulations exist — Stevens's original measurements, Fant's 1968 variant, the Slaney Auditory-Toolbox formula — and librosa exposes both "htk" and "slaney" implementations; the differences are small but matter if you need bit-exact reproduction of a paper's results.
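In code, the forward and inverse mappings are one-liners; a numpy sketch of the HTK-style formulas above:

```python
import numpy as np

def hz_to_mel(f):
    """HTK-style (O'Shaughnessy) mel mapping."""
    return 2595.0 * np.log10(1.0 + np.asarray(f, dtype=float) / 700.0)

def mel_to_hz(m):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (np.asarray(m, dtype=float) / 2595.0) - 1.0)

anchor = hz_to_mel(1000.0)   # the 1 kHz = 1000 mel anchor, to within rounding
```

The 1 kHz anchor comes out at essentially 1000 mel, which is how the 2595 and 700 constants were chosen.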
Why does the mel scale look roughly log above 1 kHz? Because the physics of the cochlea make it so. The basilar membrane is a tapered strip that vibrates maximally at a frequency that varies with position; the mapping is nearly exponential above roughly 500 Hz. Perceptual studies of critical bands — narrow ranges of frequency within which tones mask each other — give a closely related Bark scale, and another related scale is the equivalent rectangular bandwidth (ERB). All three (mel, Bark, ERB) encode the same basic fact: the ear's frequency resolution is coarser at high frequencies.
The mel scale is not the last word on psychoacoustics — modern auditory-physiology models (Zilany-Bruce, CARFAC) are considerably more faithful, as are the gammatone and gammachirp filterbanks that match cochlear impulse responses — but it is the right compromise for machine learning. It is cheap to compute (it is just a triangular-filter re-weighting of a linear STFT), it preserves enough perceptual structure, and it has forty years of empirical validation in speech and audio models.
Choosing the number of mel bins is a hyperparameter. Common values: 40 (classical ASR, MFCCs), 64 (YAMNet, VGGish, many audio-tagging models), 80 (Whisper, Conformer, Tacotron 2, modern ASR/TTS), 128 (high-fidelity TTS, music tagging). More bins preserve more spectral detail but inflate compute, and if the STFT FFT size is too small relative to the bin count, the narrow low-frequency filters can degenerate to a single STFT bin or come out empty.
The lower and upper frequency limits of the mel filterbank also matter. For speech, typical choices are 0 – 4000 Hz (telephone audio at 8 kHz sampling) or roughly 20 – 8000 Hz (wide-band audio at 16 kHz, with the bottom stripped to avoid DC noise). For music, 20 – 11 025 Hz or higher. Whisper uses 0 – 8000 Hz at 16 kHz; most TTS systems use 0 – 8000 Hz or 0 – 12 000 Hz. Mismatched limits between training and inference are a frequent source of silent degradation.
A last note: the mel scale is not sacred. There is active research on learned filterbanks (LEAF, SincNet), and the self-supervised waveform models (wav2vec 2.0, HuBERT) bypass mel entirely. The mel's staying power is a statement about the efficiency of a well-tuned prior — not a claim that nothing can improve on it.
The mel-spectrogram is the single most widely used feature in audio ML. It is the power spectrogram re-weighted by a bank of triangular filters that are uniformly spaced on the mel scale. Librosa's melspectrogram, torchaudio's MelSpectrogram, TensorFlow's tf.signal.linear_to_mel_weight_matrix, and Whisper's preprocessing all compute essentially the same thing, differing only in normalisation and htk-vs-slaney conventions.
The pipeline is: compute the power STFT; multiply by a mel filterbank matrix M of shape (n_mels, n_fft/2 + 1) whose rows are triangular filters centred on mel-spaced frequencies; take the log. The triangular filters overlap — each one covers roughly two critical bands — and integrate power over a range of linear-frequency bins into a single mel bin. The output mel-spectrogram has shape (n_mels, n_frames), typically (80, T) for Whisper and Conformer models.
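The filterbank matrix itself is straightforward to build. The sketch below follows the HTK-style mel formula and the (n_mels, n_fft/2 + 1) shape convention; real libraries differ in details such as Slaney area normalisation, so treat it as illustrative rather than bit-exact:

```python
import numpy as np

def mel_filterbank(sr=16000, n_fft=400, n_mels=80, fmin=0.0, fmax=None):
    """Triangular filters uniformly spaced on the (HTK-style) mel scale.
    Returns a matrix of shape (n_mels, n_fft // 2 + 1)."""
    fmax = fmax or sr / 2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    # n_mels + 2 edge frequencies, uniform in mel, mapped back to Hz
    edges_hz = imel(np.linspace(mel(fmin), mel(fmax), n_mels + 2))
    fft_hz = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, c, hi = edges_hz[i : i + 3]
        rising = (fft_hz - lo) / (c - lo)
        falling = (hi - fft_hz) / (hi - c)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

fb = mel_filterbank()   # (80, 201) for the default 400-point FFT at 16 kHz
```

Applied as log(fb @ power_spectrogram + eps), this reproduces the mel-spectrogram pipeline described above.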
The log compression comes next: S = log(M · |X|² + ε). Some pipelines use natural log; some use log10 with a floor (clamp to a minimum value before logging). Whisper takes log10 with a 1e-10 floor, clamps to within 8 of the per-example maximum, and rescales to roughly [−1, 1]; Tacotron 2 uses natural log with a small epsilon. The exact formula matters for bit-exact reproduction but not for empirical quality.
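As an illustration of the clamp-then-rescale pattern, here is a numpy paraphrase of Whisper-style log compression (simplified, not a bit-exact copy of the released code):

```python
import numpy as np

def whisper_style_log_mel(mel_power):
    """Log-compress in the style of Whisper's preprocessing: log10 with a
    floor, clamp to within 8 of the per-example maximum, rescale to ~[-1, 1].
    A numpy paraphrase, not a bit-exact copy."""
    log_spec = np.log10(np.maximum(mel_power, 1e-10))
    log_spec = np.maximum(log_spec, log_spec.max() - 8.0)
    return (log_spec + 4.0) / 4.0

rng = np.random.default_rng(0)
S = whisper_style_log_mel(rng.random((80, 100)))   # toy (n_mels, frames) input
```

The clamp bounds the dynamic range of every example to 8 log10-units (80 dB), so the rescaled feature spans at most 2 units regardless of input level.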
Why a mel-spectrogram instead of a raw STFT spectrogram? Four reasons. First, compression: 80 mel bins vs 257 STFT bins is 3× smaller, directly reducing compute in downstream models. Second, perceptual alignment: mel bins are wider at high frequencies where humans hear less detail, so the features concentrate capacity where it matters. Third, noise robustness: the triangular integration averages out noise in high-frequency bins. Fourth, standardisation: half of all audio models expect an 80-mel log-mel-spectrogram as input, which makes transfer learning trivial.
Mel-spectrograms have their own augmentation recipe. SpecAugment applies random time masks (zero out a contiguous block of frames) and frequency masks (zero out a contiguous block of mel bins); this alone lifts ASR accuracy by several percent and prevents overfitting. SpecSwap and Mixup-on-spectrograms extend the idea. Pitch-shifting the raw waveform and regenerating the mel-spectrogram is a strong augmentation for music tagging.
Inverting a mel-spectrogram to a waveform is ill-posed twice over: the mel filterbank is a non-invertible many-to-few projection (you must estimate the linear STFT from the mel), and the magnitude-only spectrogram discards phase. Classical inversion (mel → linear magnitude via pseudo-inverse, then Griffin-Lim for phase) produces robotic audio suitable only for debugging. Modern solutions train a neural vocoder — WaveNet, WaveRNN, Parallel WaveGAN, HiFi-GAN, BigVGAN, iSTFTNet — that takes a mel-spectrogram as input and outputs a waveform. Every modern TTS system uses one; Tacotron 2 + WaveNet was the first end-to-end pipeline to do so in 2017, and HiFi-GAN (2020) made the approach fast enough for production.
A practical note: do not conflate the log-mel filterbank feature with the MFCC feature. MFCCs are the DCT of the log-mel energies (Section 11); the log-mel-spectrogram is the log-mel energies themselves. Modern deep learning uses log-mel; classical HMM-GMM ASR used MFCCs. If someone says "filterbank features," they almost always mean log-mel.
Mel-frequency cepstral coefficients (MFCCs) were the dominant audio feature in speech recognition from roughly 1980 until 2015. They are the coefficients of the discrete cosine transform (DCT) of the log-mel-energy vector — a decorrelated, compact summary of the spectral envelope. Although deep learning has largely displaced MFCCs with log-mel spectrograms, they remain the default in speaker recognition, keyword spotting on microcontrollers, and a long tail of legacy systems.
The pipeline: compute log-mel energies (Section 10) → apply DCT along the mel axis → keep the first 13 (sometimes 20, or 40) coefficients. The DCT is a real-valued orthogonal transform closely related to the DFT; it decorrelates the log-mel vector under reasonable stationarity assumptions, concentrating signal energy in the first few coefficients. MFCC0 is the log-energy (DC); MFCC1 captures the overall spectral tilt; later coefficients capture progressively finer spectral shape.
The cepstrum more generally is the inverse Fourier transform of the log magnitude spectrum. It is a mathematically natural way to separate a source (e.g. the vocal folds, whose pitch period appears as a peak at a specific quefrency) from a filter (e.g. the vocal tract, whose smooth shape appears at low quefrency). The DCT-based "MFCC" approximates the low-quefrency part of the cepstrum and is remarkably stable across speakers and channels.
MFCCs are typically augmented with delta and delta-delta features — first and second time derivatives approximated by finite differences — to capture dynamic information. The standard feature vector in Kaldi and HTK is 13 + 13 + 13 = 39 coefficients. Modern neural models often do not need deltas because CNNs / transformers can implicitly compute them, but Kaldi-style x-vector pipelines still use them.
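A minimal numpy version of the DCT and delta computations (an illustrative sketch; Kaldi's exact regression windowing differs slightly):

```python
import numpy as np

def dct_ii(x, n_out=13):
    """Orthonormal DCT-II along axis 0 (the mel axis), keeping n_out rows."""
    N = x.shape[0]
    n = np.arange(N)
    basis = np.cos(np.pi * np.arange(n_out)[:, None] * (2 * n + 1) / (2 * N))
    basis *= np.sqrt(2.0 / N)
    basis[0] /= np.sqrt(2.0)
    return basis @ x

def deltas(feat, width=2):
    """Regression-style delta features over the time axis, edge-padded."""
    T = feat.shape[1]
    clip = lambda t: np.clip(t, 0, T - 1)
    num = sum(k * (feat[:, clip(np.arange(T) + k)] - feat[:, clip(np.arange(T) - k)])
              for k in range(1, width + 1))
    return num / (2.0 * sum(k * k for k in range(1, width + 1)))

rng = np.random.default_rng(0)
log_mel = rng.standard_normal((40, 100))            # stand-in log-mel energies
mfcc = dct_ii(log_mel, 13)                          # (13, frames)
feat39 = np.vstack([mfcc, deltas(mfcc), deltas(deltas(mfcc))])  # 13+13+13
```

A useful sanity check: the DCT of a constant vector puts all its energy in coefficient 0, which is what makes MFCC0 the log-energy term.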
Normalisation is essential. Cepstral mean normalisation (CMN) subtracts the per-utterance mean MFCC, which cancels any time-invariant channel effect (a microphone's transfer function appears as an additive term in the log spectrum and therefore an additive term in MFCCs). Cepstral mean and variance normalisation (CMVN) additionally divides by the per-utterance standard deviation. Both are standard in speech and critical in real-world deployment where channels vary.
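CMVN itself is two lines of numpy, with per-utterance statistics taken over the time axis:

```python
import numpy as np

def cmvn(feat, eps=1e-8):
    """Per-utterance cepstral mean and variance normalisation.
    feat: (n_coeffs, n_frames); statistics are over the time axis, so any
    time-invariant additive channel term in the cepstrum cancels."""
    mu = feat.mean(axis=1, keepdims=True)
    sd = feat.std(axis=1, keepdims=True)
    return (feat - mu) / (sd + eps)

rng = np.random.default_rng(0)
norm = cmvn(3.0 * rng.standard_normal((13, 200)) + 5.0)  # offset+scale removed
```

Dropping the variance division recovers plain CMN.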
MFCCs are not dead. The i-vector representation (Dehak 2011) — a low-dimensional factor-analysis embedding derived from GMM statistics over MFCCs — was the dominant speaker-recognition feature for a decade and is still used as a baseline. Its successor, the x-vector (Snyder 2018), is a TDNN-learned embedding typically trained on MFCCs or filterbank features. For keyword spotting on microcontrollers (Arduino, ESP32) with kilobytes of memory, MFCCs remain attractive because they are tiny — 13 coefficients per frame vs 80 mel bins — and because they survive fixed-point arithmetic gracefully.
A theoretical caveat: the DCT in MFCC computation is a linear projection, so a sufficiently deep neural network operating on the log-mel-spectrogram can always re-derive the MFCCs internally. That is why modern ASR systems skip the DCT step and feed the log-mel directly: the network learns its own optimal projection. The MFCC's decorrelation was a boon to GMMs, which assumed diagonal covariance; it is neutral (or a weak information loss) for neural models, which do not.
Mel-spectrograms and MFCCs are the default, but they are far from the only audio features that matter. Music-information retrieval, speaker identification, environmental-sound classification, and audio quality assessment each draw on a distinct set of hand-engineered descriptors — many of which retain value as auxiliary inputs, as interpretable probes, or simply as items you will encounter in older literature.
Chroma features collapse the STFT along the octave dimension into 12 pitch classes (C, C#, D, …, B), summing all the energy at frequencies corresponding to the same pitch class regardless of octave. Chromas are the natural input for chord recognition, key estimation, and cover-song identification. Librosa's chroma_stft and chroma_cqt compute them. They are compact (12 dimensions per frame) and interpretable; modern chord-recognition models still use them as an input or a target.
Spectral shape features summarise the STFT with a handful of scalars per frame. Spectral centroid: the weighted mean frequency — correlates with perceived brightness. Spectral bandwidth: weighted standard deviation — correlates with tonality vs noisiness. Spectral roll-off: the frequency below which 85% of energy lies — separates speech from many noises. Spectral flatness (Wiener entropy): the ratio of the geometric to arithmetic mean power — near 1 for white noise, near 0 for a pure tone. Spectral flux: the frame-to-frame change in magnitude spectrum — the foundation of onset detection (Section 13). All of these are cheap to compute and together constitute the "MIR toolbox" of audio classification.
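Four of these descriptors fall out of a few reductions over the magnitude spectrogram; a numpy sketch of the standard formulas:

```python
import numpy as np

def spectral_stats(mag, sr=16000):
    """Per-frame spectral shape descriptors from a magnitude spectrogram
    `mag` of shape (n_bins, n_frames): centroid, bandwidth, 85% roll-off,
    and flatness (geometric / arithmetic mean of power)."""
    n_bins = mag.shape[0]
    freqs = np.linspace(0.0, sr / 2, n_bins)[:, None]
    power = mag ** 2
    total = power.sum(axis=0) + 1e-12
    centroid = (freqs * power).sum(axis=0) / total
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum(axis=0) / total)
    cum = np.cumsum(power, axis=0) / total
    rolloff = freqs[np.argmax(cum >= 0.85, axis=0), 0]
    flatness = np.exp(np.log(power + 1e-12).mean(axis=0)) / (power.mean(axis=0) + 1e-12)
    return centroid, bandwidth, rolloff, flatness

tone = np.zeros((257, 4))
tone[64] = 1.0                  # a pure tone at 2 kHz (bin 64 of 257 at 16 kHz)
noise = np.ones((257, 4))       # a flat "white" spectrum
```

On the two toy inputs, the tone's centroid and roll-off sit at 2 kHz with flatness near 0, while the flat spectrum has flatness near 1 and a centroid at 4 kHz.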
Gammatone filterbanks approximate the impulse response of the basilar membrane more faithfully than mel filters: the gammatone kernel is t^(n−1) exp(−2π b t) cos(2π f_c t). Gammatone features (GTFs) and their derivatives (GFCCs = the MFCC-style DCT of log-gammatone energies) outperform MFCCs slightly in noisy speech recognition, and they are a common front end in biologically motivated auditory models.
Perceptual linear prediction (PLP, Hermansky 1990) is a close cousin of MFCCs: it uses a Bark-scale auditory warping, applies equal-loudness pre-emphasis, does an LP analysis on the warped spectrum, and takes cepstral coefficients. PLP features were considered marginally better than MFCCs for a decade and are still included in Kaldi. RASTA filtering (Hermansky and Morgan 1994) applies a bandpass filter to each cepstral trajectory to suppress slowly varying channel effects — a training-free alternative to CMN.
For speaker tasks, the feature lineage runs: MFCCs → GMM supervector → i-vector (factor analysis) → x-vector (TDNN embedding) → d-vector / ECAPA-TDNN (modern neural speaker embeddings). Early stages used MFCCs and their deltas; modern stages use log-mel filterbanks directly. Speaker-recognition benchmarks (VoxCeleb 1/2, SITW) are reported with equal error rate and minimum detection cost and still organise around these embeddings; Chapter 04 of Part VIII covers this in depth.
For music transcription and pitch analysis, useful features include harmonic summation, autocorrelation-derived features, subharmonic summation (SHS), pitch salience, and the pitch-class histogram. These feed polyphonic-pitch-estimation systems (MPE-Net, Onsets and Frames, BasicPitch).
For audio quality assessment, the reference-based metrics PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), ViSQOL, and the modern learned DNSMOS and SQuId are essentially feature-based classifiers that predict subjective listener scores from spectral comparisons. They are covered more fully in Section 15.
The general rule: hand-engineered features are rarely the sole input to a modern model, but they are often useful as auxiliary channels (concatenate chroma to a mel-spectrogram for music tagging), as inexpensive baselines, as interpretable probes (does my model's representation encode the spectral centroid?), or as frozen front ends for edge deployment. Know what exists; pick carefully.
Two of the oldest problems in audio analysis are what pitch is sounding right now (pitch / fundamental-frequency estimation) and when did a new note begin (onset detection). Both are deceptively simple for a single clean sinusoid and surprisingly difficult for real speech or polyphonic music. They underlie speech prosody analysis, music transcription, beat tracking, singing-voice synthesis, and query-by-humming — and they expose the gap between what classical DSP can do and where learned models take over.
The fundamental frequency (f0) of a periodic signal is the inverse of its period. For voiced speech it is the vibration rate of the vocal folds — typically 80 – 180 Hz for men, 150 – 250 Hz for women, 250 – 450 Hz for children. For a musical instrument it is the lowest spectral peak whose harmonics form the perceived pitch. For noisy or aperiodic signals, f0 is undefined; a good f0 estimator reports both an estimate and a confidence.
The classical monophonic f0 algorithms fall into two camps. Time-domain methods find the lag at which the autocorrelation is maximised (or the AMDF — average magnitude difference function — is minimised): YIN (de Cheveigné and Kawahara 2002) is the dominant member of this family, followed by pYIN (probabilistic YIN) which adds an HMM over the candidate lags. Frequency-domain methods find the fundamental from the harmonic comb in the spectrum: harmonic product spectrum (HPS), harmonic summation, SWIPE (Sawtooth Waveform Inspired Pitch Estimator, Camacho 2007). YIN / pYIN are widely used in speech research; SWIPE is competitive; all of them struggle with octave errors and with breathy or creaky voice.
The modern standard is CREPE (Kim 2018): a 6-layer CNN trained to output a 360-dimensional probability distribution over pitches spaced at 20-cent increments. CREPE is near-ground-truth on isolated voice and instrument recordings and is widely used by downstream MIR systems. Librosa provides pyin as a classical baseline; CREPE itself ships in dedicated libraries (crepe, torchcrepe), and basic-pitch builds on related CNN ideas. A newer sibling, SPICE, uses self-supervised pitch estimation; PESTO is a recent lightweight self-supervised alternative.
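Before reaching for a learned estimator, it is worth seeing how far plain autocorrelation gets. This toy is not YIN (no cumulative-mean normalisation, no voicing decision, no octave-error handling), but it recovers the pitch of a clean sinusoid:

```python
import numpy as np

def f0_autocorr(x, sr, fmin=50.0, fmax=500.0):
    """Toy autocorrelation pitch estimate: the lag of the autocorrelation
    peak within the plausible period range [sr/fmax, sr/fmin]."""
    ac = np.correlate(x, x, mode="full")[len(x) - 1 :]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

sr = 16000
t = np.arange(4000) / sr
est = f0_autocorr(np.sin(2 * np.pi * 220.0 * t), sr)   # close to 220 Hz
```

On breathy voice, polyphony, or low SNR, exactly this estimator commits the octave errors that motivated YIN's normalisation and CREPE's learned approach.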
Onset detection finds the moments when new musical events start. The core observation is that new notes produce sudden increases in energy, changes in spectral content, or both. The canonical detection function is spectral flux: the half-wave-rectified L1 or L2 norm of the frame-to-frame magnitude-spectrum difference. Librosa's onset_strength and onset_detect implement this, with optional mel-weighting and adaptive peak picking. Complex-domain flux, energy flux, and high-frequency content (HFC) are variants that suit different instruments.
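Spectral flux itself is nearly a one-liner; the toy below builds a spectrogram with an abrupt note onset and shows the flux peaking at the transition:

```python
import numpy as np

def spectral_flux(mag):
    """Half-wave-rectified L1 spectral flux: per-transition onset strength
    from a magnitude spectrogram `mag` of shape (n_bins, n_frames)."""
    d = np.diff(mag, axis=1)
    return np.maximum(d, 0.0).sum(axis=0)

# a toy spectrogram in which a "note" (8 active bins) starts at frame 10
mag = np.zeros((64, 20))
mag[:8, 10:] = 1.0
flux = spectral_flux(mag)   # peaks at the 9->10 frame transition
```

The half-wave rectification is what makes this an onset detector rather than a change detector: energy decreases (note offsets) contribute nothing.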
Modern onset detection uses learned models — convolutional networks with binary onset/no-onset targets, trained on annotated datasets (MAESTRO for piano, MedleyDB for multi-track audio). The madmom library packages state-of-the-art pre-trained onset, beat, and downbeat detectors; it is the default choice for rhythm analysis in research.
Beat and tempo tracking extends onset detection to periodicity analysis. The standard pipeline: compute an onset-strength envelope, apply autocorrelation to estimate tempo, then use dynamic programming or an HMM to place beats at spacing matching the tempo estimate. BeatNet, RNN-Beat-Tracker, and BeatTransformer are the learned successors. Beat tracking feeds every music-information-retrieval task with a temporal grid, and its evaluation (the MIREX beat-tracking contest) has driven steady progress for two decades.
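The tempo-estimation step of that pipeline can be sketched with an autocorrelation over a synthetic onset-strength envelope (a toy; real trackers add smoothing and a prior over tempi):

```python
import numpy as np

def tempo_from_onset_env(onset_env, frame_rate, bpm_lo=60.0, bpm_hi=180.0):
    """Estimate tempo as the autocorrelation peak of the onset-strength
    envelope, searched within the beat-period range implied by the BPM bounds."""
    ac = np.correlate(onset_env, onset_env, mode="full")[len(onset_env) - 1 :]
    lo = int(frame_rate * 60.0 / bpm_hi)
    hi = int(frame_rate * 60.0 / bpm_lo)
    lag = lo + np.argmax(ac[lo:hi])
    return 60.0 * frame_rate / lag

# clicks every 50 frames at a 100 Hz frame rate -> a 120 BPM pulse
env = np.zeros(1000)
env[::50] = 1.0
bpm = tempo_from_onset_env(env, frame_rate=100.0)
```

The dynamic-programming beat-placement step then snaps beat times to this period; the BPM search range plays the role of the tempo prior.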
Source separation — given a mixture, recover the individual sources — is the canonical hard problem of audio. When the mixture is music (vocals + drums + bass + other), the task is music source separation. When it is speech (multiple simultaneous speakers), it is the cocktail-party problem. When it is speech in noise (one speaker plus environmental noise), it becomes speech enhancement (Section 15). All three are under-determined in the single-channel case and have driven some of the most creative work in audio over the last decade.
The classical approach was non-negative matrix factorisation (NMF): given a magnitude spectrogram V of shape (F, T), factor it as V ≈ W H with W (F, K) a non-negative "dictionary" of spectral templates and H (K, T) non-negative activations. Each source occupies a subset of dictionary atoms; grouping atoms by source yields the reconstructed source spectrograms. NMF was the dominant method from 2005 to 2015, supported by extensions (convolutive NMF, semi-supervised NMF, Itakura-Saito divergence), but it was eclipsed by neural models by 2017.
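The multiplicative-update rules of Lee and Seung make NMF a ten-line algorithm. A numpy sketch for the Euclidean loss (the KL and Itakura-Saito variants change only the update ratios):

```python
import numpy as np

def nmf(V, K, n_iter=500, eps=1e-9, seed=0):
    """Lee-Seung multiplicative updates minimising ||V - WH||².
    V: non-negative (F, T) matrix, e.g. a magnitude spectrogram.
    W: (F, K) spectral templates; H: (K, T) activations."""
    rng = np.random.default_rng(seed)
    F, T = V.shape
    W = rng.random((F, K)) + eps
    H = rng.random((K, T)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)   # ratios stay non-negative,
        W *= (V @ H.T) / (W @ H @ H.T + eps)   # so W, H stay non-negative
    return W, H

rng = np.random.default_rng(1)
V = rng.random((20, 2)) @ rng.random((2, 30))   # an exactly rank-2 "spectrogram"
W, H = nmf(V, K=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

The multiplicative form is the trick: each update is a gradient step with a step size chosen so the parameters are rescaled rather than shifted, preserving non-negativity for free.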
The transition was via masking-based separation: train a network to predict a time-frequency mask M that, applied to the mixture spectrogram, recovers a source. The ideal binary mask (IBM) assigns each bin to whichever source dominates; the ideal ratio mask (IRM) assigns a soft fraction in [0, 1]; the phase-sensitive mask (PSM) incorporates phase. Training a CNN or BLSTM to predict the IRM, and applying it to the mixture's STFT, gave the first strong music-separation results (DeepKaraoke, MSS-Unmix).
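The oracle masks are trivial to compute when the isolated sources are available, which is exactly how training targets are made; a numpy sketch:

```python
import numpy as np

def ideal_masks(S1, S2):
    """Ideal binary and ratio masks for source 1, given both sources'
    magnitude spectrograms (oracle quantities, used as training targets)."""
    ibm = (S1 > S2).astype(float)
    irm = S1 / (S1 + S2 + 1e-12)
    return ibm, irm

rng = np.random.default_rng(0)
S1, S2 = rng.random((257, 50)), rng.random((257, 50))
ibm, irm = ideal_masks(S1, S2)
est1 = irm * (S1 + S2)   # mask applied to an idealised mixture magnitude
```

The example treats the mixture magnitude as the sum of source magnitudes, which holds only approximately for real STFTs (phases interfere); that gap is one reason phase-sensitive and complex masks followed.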
Open-Unmix (Stöter 2019) is the canonical open-source music separator: a BLSTM per source predicting magnitude masks on STFT. Demucs (Défossez 2019) runs in the time domain with a U-Net over raw waveforms, setting state-of-the-art on MUSDB18 and becoming Facebook / Meta's default separator. Spleeter (Deezer 2019) was the consumer-friendly release — four-stem vocal / drums / bass / other separation, packaged as a Docker image, widely adopted by hobbyists. Hybrid Demucs combines time- and frequency-domain processing; Band-Split RNN (BSRNN) and MDX23-winning MDX-Net variants extend the approach.
Speech separation has its own lineage. Deep clustering (Hershey 2016) embeds each time-frequency bin in a high-dimensional space where bins belonging to the same speaker cluster together — sidestepping the speaker-permutation problem by using clustering rather than direct source assignment. Permutation-invariant training (PIT, Yu 2017) trains the model with a loss that is invariant to the permutation of output source channels.
TasNet (Luo and Mesgarani 2018) was the first time-domain speech separator competitive with STFT-based methods; ConvTasNet (2019) replaced the LSTM with a dilated-convolution TCN and pushed SI-SDRi on WSJ0-2mix dramatically; DPRNN (Dual-Path RNN, 2019) and SepFormer (Subakan 2021) progressed further. The latest transformer-based separators approach 22–23 dB SI-SDRi on WSJ0-2mix — a staggering jump from the ~10 dB of IBM-based baselines.
A recent twist: diffusion-based separation (MSDM, DiffSep) and language-conditioned separation (AudioSep, LASS, "separate the violin") open new directions. They also blur the line between separation and generation — if you can generate a convincing isolated vocal, you can "separate" without ever seeing the mixture. The field is moving fast; MDX Challenges (annual music-separation contest) are the empirical barometer.
Speech enhancement is source separation with a specific goal: recover clean speech from noisy speech. Every video-conferencing app, every voice-assistant wake-word detector, every hearing aid, and every industrial-headset system runs an enhancement stage. The field is old — Ephraim and Malah's MMSE estimator is from 1984 — and the modern neural methods have pushed both perceptual quality and real-time compute dramatically.
The classical methods work in the STFT domain. Spectral subtraction estimates the noise magnitude spectrum (e.g. during silent frames detected by a VAD) and subtracts it from each noisy frame; it is brutally simple and introduces the "musical noise" artefact familiar from early noise-reduction. Wiener filtering derives the optimal linear estimator of the clean spectrum under assumed spectral models of speech and noise; it is less aggressive than spectral subtraction and produces cleaner output. MMSE-STSA and log-MMSE (Ephraim and Malah 1984, 1985) give the Bayesian optimal magnitude estimators under Gaussian assumptions and remained the de-facto standard for two decades.
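Both classical estimators reduce to a per-bin gain on the noisy spectrum; a numpy sketch of power spectral subtraction (with over-subtraction and a floor) and a subtraction-based Wiener gain:

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, floor=0.05):
    """Power spectral subtraction with over-subtraction factor `alpha` and a
    spectral floor — the floor is the classic remedy for musical noise."""
    power = noisy_mag ** 2 - alpha * noise_mag ** 2
    return np.sqrt(np.maximum(power, (floor * noisy_mag) ** 2))

def wiener_gain(noisy_power, noise_power, eps=1e-12):
    """Per-bin Wiener gain S/(S+N), with the speech power S estimated by
    power subtraction from the noisy observation."""
    speech = np.maximum(noisy_power - noise_power, 0.0)
    return speech / (noisy_power + eps)
```

Applied bin-by-bin to the noisy STFT magnitude (with the noisy phase reused for resynthesis), these two functions are a complete, if crude, enhancement system.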
Neural enhancement begins around 2014. RNN-based enhancement (Xu 2014, Weninger 2015) trains an LSTM to predict clean log-magnitude spectra from noisy; SEGAN (Pascual 2017) was the first GAN-based enhancement; DeepXi estimates a priori SNR with a CNN. The modern workhorses are CRN (Convolutional Recurrent Network), DCCRN (Deep Complex CRN — operates on complex STFT so it models phase explicitly), DEMUCS-Denoiser (Défossez 2020, time-domain), and the industrial favourites RNNoise (Valin, Xiph.Org) and Microsoft's NSNet.
Evaluation of enhancement is notoriously tricky. The classical reference metrics are PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862) and STOI (Short-Time Objective Intelligibility, Taal 2011); the more modern SI-SDR (scale-invariant signal-to-distortion ratio) is preferred for separation / enhancement literature because it is insensitive to output gain. All three are reference-based — they compare enhanced to clean. The frontier is non-reference learned metrics: DNSMOS (Microsoft, trained on crowd listener ratings), ViSQOL, NORESQA, SQuId. These predict subjective quality from the enhanced audio alone and drive the DNS Challenge leaderboards.
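SI-SDR is short enough to define inline; the sketch below shows the projection step and the gain invariance that motivates the metric:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant SDR in dB. Project the estimate onto the reference to
    get the target component; everything else counts as distortion. Any gain
    applied to the estimate cancels out, which is the point."""
    est = est - est.mean()
    ref = ref - ref.mean()
    target = (est @ ref) / (ref @ ref + eps) * ref
    noise = est - target
    return 10.0 * np.log10((target @ target + eps) / (noise @ noise + eps))

t = np.arange(8000) / 8000.0
x = np.sin(2 * np.pi * 100.0 * t)
scaled = si_sdr(0.3 * x, x)     # extremely high: rescaling is not a distortion
rng = np.random.default_rng(0)
noisy = si_sdr(x + 0.1 * rng.standard_normal(x.size), x)   # roughly 17 dB
```

Plain SDR would penalise the rescaled estimate; SI-SDR's projection makes the two gain-equivalent, which is why separation papers report SI-SDR improvement (SI-SDRi) over the unprocessed mixture.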
Beamforming uses multiple microphones to improve SNR before any single-channel stage. Classical beamformers (delay-and-sum, MVDR) exploit time-of-arrival differences between microphones to null out noise from specific directions; neural beamformers (Mask-MVDR, ADL-MVDR) estimate the spatial covariance from deep-learning-predicted masks. Meeting-room transcription (Microsoft ASR, Zoom's meeting summariser) and far-field smart-speaker wake-word detection (Amazon Echo's 7-mic array) rely heavily on beamforming.
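Delay-and-sum is the simplest beamformer to write down. A toy with integer-sample steering delays (real arrays need fractional delays and per-channel calibration):

```python
import numpy as np

def delay_and_sum(mics, steering_delays):
    """Delay-and-sum beamformer: advance each channel by its steering delay
    so the target direction aligns across microphones, then average.
    mics: (n_channels, T); delays in whole samples (a toy simplification)."""
    out = np.zeros(mics.shape[1])
    for ch, d in zip(mics, steering_delays):
        out += np.roll(ch, -d)
    return out / len(mics)

# a target arriving 3 samples later at the second mic is re-aligned exactly
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
mics = np.stack([x, np.roll(x, 3)])
y = delay_and_sum(mics, [0, 3])
```

Aligned signals add coherently (amplitude scales with the channel count) while uncorrelated noise adds incoherently, which is where the SNR gain comes from; MVDR generalises this by also placing nulls on interferers.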
Dereverberation is the specific case where the distortion is room-impulse-response convolution rather than additive noise. WPE (Weighted Prediction Error, Nakatani 2010) is the classical method; neural dereverberation (Yoshioka 2015, Han 2020) is the modern answer; and a line of work treats the joint problem (dereverberation + denoising + separation) as a single network training problem. The DNS Challenge (Deep Noise Suppression, Microsoft, annual since 2020) and the REVERB challenge (2014, reactivated periodically) have driven the field.
A subtle but important deployment issue: evaluate your enhancement on the same acoustic conditions your product will encounter. A model tuned for high-SNR office conditions will underperform on low-SNR conference-room scenarios and may actively hurt in very high-SNR (clean-studio) recordings. The DNS Challenge's emphasis on diverse noise sets and Microsoft's rule of "never degrade clean" are both direct responses to this.
The argument for hand-engineered front ends (mel-spectrograms, MFCCs) is that they encode perceptual structure for free. The counter-argument is that any fixed feature is a bet — and data-driven bets are usually better. The last five years have seen a steady migration toward learnable audio front ends, from parameterised filterbanks operating on raw waveforms, to fully self-supervised transformer models that treat the waveform as just another token sequence.
SincNet (Ravanelli and Bengio 2018) was the first influential learnable front end. Instead of a fixed mel filterbank, SincNet parameterises a bank of band-pass filters by only their cut-off frequencies — each filter is a sinc function in time — so the network learns which frequency bands to attend to while preserving the interpretability of a filterbank. SincNet consistently matches or exceeds mel front ends on speaker-recognition and keyword-spotting tasks with dramatically fewer parameters.
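The core parameterisation is easy to reproduce: a band-pass kernel built as the difference of two windowed sinc low-passes, determined entirely by its two cut-offs. A fixed numpy sketch (SincNet makes the cut-offs learnable torch parameters):

```python
import numpy as np

def sinc_bandpass(f_lo, f_hi, n_taps=101, sr=16000):
    """Band-pass FIR kernel as the difference of two windowed sinc low-pass
    kernels — the SincNet parameterisation, with only two free numbers."""
    t = np.arange(n_taps) - (n_taps - 1) / 2
    lowpass = lambda fc: 2.0 * fc / sr * np.sinc(2.0 * fc / sr * t)
    return (lowpass(f_hi) - lowpass(f_lo)) * np.hamming(n_taps)

h = sinc_bandpass(1000.0, 2000.0)                # pass 1-2 kHz at 16 kHz
H = np.abs(np.fft.rfft(h, 1024))                 # frequency response
freqs = np.fft.rfftfreq(1024, d=1.0 / 16000)
```

Two parameters per filter versus n_taps free weights for an ordinary conv kernel is the source of SincNet's parameter savings, and the learned filters stay interpretable as frequency bands.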
LEAF (Zeghidour 2021) generalises the idea: a Gabor-like learnable filterbank followed by compression and per-channel-gain, trained end-to-end with the downstream task. LEAF was pitched as a drop-in replacement for mel-spectrograms and matches mel on keyword spotting, audio tagging, and speech recognition.
A parallel line runs through fully convolutional acoustic models: Jasper (NVIDIA), QuartzNet, and Wav2Letter++ stack large 1D convolutions over simple spectral features (and, in Wav2Letter's early variants, raw waveforms), absorbing most front-end design into the network. They reach competitive ASR WER at roughly the cost of a bigger network. Conformer, the current ASR architecture king, still uses mel — but it is now a choice rather than a default.
Contrastive Predictive Coding (CPC, van den Oord 2018) was the ancestor: predict future audio embeddings from past context with InfoNCE. wav2vec (Schneider 2019) and wav2vec 2.0 (Baevski 2020) refined the approach with quantised latent codes and transformer encoders. Wav2vec 2.0 trained on 53 000 hours of Libri-Light audio set a new ASR state-of-the-art at the time and made semi-supervised speech recognition practical — fine-tune on a few hours of labelled data, get close to fully supervised performance.
HuBERT (Hsu 2021) replaced wav2vec's contrastive objective with masked prediction of cluster labels (a BERT-like MLM over pseudo-tokens obtained by k-means clustering of MFCCs, then progressively of HuBERT's own hidden states). HuBERT is slightly simpler to train and matches or exceeds wav2vec 2.0 on most benchmarks. WavLM (Chen 2022) extended HuBERT with denoising and adversarial auxiliaries, improving robustness; data2vec (Baevski 2022) unified the self-supervised recipe across speech, text, and vision.
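HuBERT's pseudo-token targets are nothing more exotic than k-means cluster ids over feature frames. A toy sketch with a hand-rolled k-means (real pipelines use a proper implementation over MFCCs, then over HuBERT's own hidden states; the data here is synthetic):

```python
import numpy as np

def kmeans_pseudo_labels(frames, k=8, iters=20, seed=0):
    """Assign each feature frame a discrete pseudo-token via plain k-means.

    frames: (T, d) acoustic features (MFCCs in HuBERT's first iteration).
    Returns (T,) integer cluster ids — the targets for masked prediction.
    """
    rng = np.random.default_rng(seed)
    centroids = frames[rng.choice(len(frames), k, replace=False)].copy()
    for _ in range(iters):
        # Assign each frame to its nearest centroid (squared Euclidean).
        d2 = ((frames[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # Recompute centroids; keep the old one if a cluster empties.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = frames[labels == j].mean(0)
    return labels

rng = np.random.default_rng(1)
# Toy "MFCC" stream: two well-separated frame populations.
frames = np.vstack([rng.normal(0, 1, (50, 13)), rng.normal(8, 1, (50, 13))])
tokens = kmeans_pseudo_labels(frames, k=2)
```

The transformer then does BERT-style masked prediction of these ids — the clustering quality bounds what the model can learn, which is exactly why HuBERT re-clusters on its own hidden states in later iterations.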
The SUPERB benchmark (Yang 2021) evaluates self-supervised speech encoders across a dozen downstream tasks (phoneme recognition, ASR, speaker identification, emotion, intent, slot filling, query-by-example, voice conversion) by freezing the pre-trained model and training only a small task head. SUPERB rankings are the standard way to compare encoders and have driven the field's empirical progress. A follow-up SUPERB-SG extends to generation tasks.
For multilingual and cross-lingual work, XLS-R (Babu 2021, 128-language wav2vec 2.0) and Meta's MMS (Massively Multilingual Speech, 2023, 1 400 languages) took the self-supervised recipe to enormous language coverage. These now underlie much of the low-resource ASR literature; Whisper (OpenAI 2022) took a different path — fully supervised weak-label training on 680 000 hours of web audio — and is itself a widely used learned front end now (its encoder feeds many downstream audio-tagging and translation pipelines).
The practical upshot: unless you are training at genuine frontier scale, a pre-trained self-supervised encoder (WavLM-large, HuBERT-X-large, Whisper encoder, MMS-1B) is now the preferred front end for speech tasks. For music and general audio, BEATs, CLAP (audio-text contrastive), AudioMAE, M2D, and EfficientAT serve the same role. Mel-spectrograms survive for TTS (where generative models still condition on them directly) and for edge deployment (where a small pre-trained encoder is still too heavy).
The final layer of the modern audio stack is waveform modelling: turning a spectrogram, a set of discrete tokens, or a text prompt into an audible waveform. This layer has undergone three revolutions in a decade — from concatenative / Griffin-Lim synthesis, to autoregressive neural vocoders (WaveNet), to GAN-based parallel vocoders (HiFi-GAN), to neural audio codecs (SoundStream, EnCodec, DAC) that compress waveforms to discrete tokens and enable the audio-as-tokens paradigm that powers AudioLM, MusicLM, VALL-E, and GPT-4o's voice.
WaveNet (van den Oord 2016) was the breakthrough: a dilated 1D CNN, autoregressive over waveform samples μ-law-companded to 8 bits (256 levels). WaveNet sounded astonishingly natural but ran ~100× slower than real time, so the first production use (Google Assistant, 2017) required the Parallel WaveNet distillation (van den Oord 2018). WaveRNN (Kalchbrenner 2018), LPCNet (Valin 2019), and Parallel WaveGAN (Yamamoto 2019) progressively closed the speed gap.
The breakthrough for deployment was HiFi-GAN (Kong 2020): a GAN-based vocoder that maps a mel-spectrogram to a waveform in a single forward pass with high perceptual quality. HiFi-GAN runs several hundred times faster than real time on a single GPU and is the de-facto vocoder for TTS research in 2022 – 2025. Extensions: iSTFTNet (uses an iSTFT output head), UnivNet (multi-resolution discriminators), BigVGAN (NVIDIA 2022, universality across speakers and music).
SoundStream (Google 2021) introduced residual vector quantisation (RVQ) to neural codecs: the encoder's latent is quantised iteratively by a cascade of codebooks, each fitting the residual left by the previous stage. Eight-codebook RVQ at 75 Hz with 1 024 entries per codebook gives 8 × 10 bits × 75 frames/s = 6 kbps. EnCodec (Meta 2022) followed the same recipe with refined multi-scale discriminators and an optional small transformer LM for entropy coding; its 24 kHz model is the de-facto default for speech tokenisation and feeds Bark, VALL-E, and MusicGen. DAC (Descript 2023) improved on both with better loss balancing and snake activations, achieving near-transparent 44.1 kHz audio at 8 kbps.
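The RVQ cascade and the bitrate arithmetic both fit in a few lines. A NumPy sketch with toy random codebooks — real codecs learn the codebooks jointly with the encoder, so the reconstruction here is only illustrative:

```python
import numpy as np

def rvq_encode(z, codebooks):
    """Residual VQ: each stage quantises whatever residual the previous
    stages left unexplained.  z: (d,); codebooks: list of (K, d) arrays."""
    recon = np.zeros_like(z)
    codes = []
    for cb in codebooks:
        residual = z - recon
        idx = int(((cb - residual) ** 2).sum(axis=1).argmin())  # nearest entry
        codes.append(idx)
        recon = recon + cb[idx]
    return codes, recon

def bitrate_bps(n_codebooks, codebook_size, frame_rate_hz):
    """Bits per second = stages x bits-per-stage x frames-per-second."""
    return n_codebooks * np.log2(codebook_size) * frame_rate_hz

# Toy random codebooks (real codecs learn them end-to-end).
rng = np.random.default_rng(0)
books = [0.1 * rng.normal(size=(1024, 128)) for _ in range(8)]
z = rng.normal(size=128)
codes, recon = rvq_encode(z, books)
```

The cascade structure is what makes the bitrate adjustable at inference: dropping the last codebooks degrades quality gracefully instead of breaking the stream.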
Once audio is tokens, the entire LLM toolkit applies. AudioLM (Borsos 2022) generates audio with a hierarchical transformer over two token streams: semantic tokens from a w2v-BERT encoder and acoustic tokens from SoundStream. MusicLM (2023) extends AudioLM to text-conditioned music. VALL-E (Microsoft 2023) is a zero-shot TTS system: given a 3-second prompt and a text, it generates an EnCodec-token continuation that clones the prompt's voice. Bark (Suno 2023) does similar things with open weights; SpeechT5 is an open unified speech model, though it operates on spectrograms rather than codec tokens.
The codec itself is a careful DSP-meets-deep-learning design. The encoder is a stack of strided 1D convolutions that downsample the waveform to an embedding sequence at roughly 75 – 150 Hz. RVQ quantises each frame's embedding into a short stack of code indices. The decoder mirrors the encoder with transposed convolutions. The training loss combines a time-domain reconstruction loss, a multi-scale STFT loss (spectral fidelity), adversarial losses (perceptual naturalness), and a VQ commitment loss. Each component corresponds to a DSP desideratum that a single L2 loss in the time domain would fail to capture.
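The multi-scale STFT term is worth seeing concretely. A NumPy sketch in the spirit of Parallel WaveGAN and EnCodec — the spectral-convergence plus log-magnitude-L1 combination is standard, but the exact resolutions, windows, and weightings here are illustrative choices, not any specific codec's:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT with a Hann window (no padding, for brevity)."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.array(frames), axis=1))

def multiscale_stft_loss(x, y, resolutions=((512, 128), (1024, 256), (2048, 512))):
    """Spectral convergence + log-magnitude L1, summed over several STFT
    resolutions so no single window-length bias dominates."""
    total = 0.0
    for n_fft, hop in resolutions:
        X, Y = stft_mag(x, n_fft, hop), stft_mag(y, n_fft, hop)
        sc = np.linalg.norm(X - Y) / (np.linalg.norm(X) + 1e-8)      # convergence
        log_l1 = np.abs(np.log(X + 1e-5) - np.log(Y + 1e-5)).mean()  # log-mag L1
        total += sc + log_l1
    return total

t = np.arange(16000) / 16000
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * np.random.default_rng(0).normal(size=16000)
```

Using several resolutions is the DSP time-frequency trade-off made explicit: the short windows catch transients the long windows smear, and vice versa.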
Codecs also replace the mel-spectrogram in modern TTS: VALL-E and NaturalSpeech 3 condition directly on codec tokens, and Tortoise-TTS autoregresses over its own learned discrete acoustic tokens, rather than on mel. This is the cleanest demonstration of the trend — the mel-spectrogram, so central to Tacotron and FastSpeech, is being replaced by learned discrete tokens. The hand-engineered front end is finally dissolving; but note that EnCodec's encoder and decoder are still, at heart, DSP-inspired 1D-convolutional filterbanks. The ideas of this chapter are not retired — they are absorbed.
The classical and modern audio pipeline described in this chapter is the foundation the rest of Part VIII builds on. This closing section surveys how the pieces fit together in real ML workflows — the libraries, the GPU-accelerated toolkits, the typical preprocessing pipeline, the streaming and real-time constraints, and the handoff between this chapter and its successors.
The Python audio ecosystem revolves around a handful of libraries. librosa is the feature-rich research library: load audio, resample, compute STFT / mel / MFCC / chroma / pitch, apply pitch-shift and time-stretch, detect onsets and beats — all on CPU with NumPy. It is slow but encyclopedic and has been the teaching tool for a decade. torchaudio mirrors librosa's feature set with PyTorch tensors and GPU support — torchaudio.transforms.MelSpectrogram is the canonical mel-spectrogram in any modern training pipeline. soundfile is the go-to for reading / writing WAV / FLAC. pydub wraps ffmpeg for any format ffmpeg supports. SoX and ffmpeg themselves are the command-line workhorses for dataset preparation at scale.
Industrial toolkits sit on top of these libraries. NVIDIA NeMo is the production-grade ASR / TTS / speaker framework — Conformer-CTC, Conformer-Transducer, FastPitch, HiFi-GAN, Sortformer diarisation. SpeechBrain is the friendly academic framework — modular recipes for ASR, speaker recognition, enhancement, separation. ESPnet is the Kaldi successor for speech recognition and translation. Hugging Face transformers hosts every self-supervised speech encoder (wav2vec 2.0, HuBERT, WavLM, Whisper) with a uniform interface. Kaldi itself is the C++ classical ASR toolkit, still used for speaker diarisation and low-resource languages.
A canonical preprocessing pipeline for supervised audio training looks something like: (1) load waveforms and convert to float32 in [-1, 1]; (2) resample to the target rate (usually 16 kHz); (3) downmix to mono; (4) optionally apply VAD and trim silence; (5) optionally normalise loudness to -23 LUFS; (6) optionally augment (SpecAugment, noise addition from the MUSAN or DEMAND datasets, room-impulse-response convolution from BUT ReverbDB, speed perturbation); (7) compute log-mel-spectrogram; (8) apply cepstral / per-speaker normalisation; (9) feed to model. Each step deserves its own config file because small variations (16 vs 22.05 kHz, 80 vs 128 mel bins, SpecAugment on vs off) move accuracy by several percent.
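The early steps of that pipeline can be sketched in pure NumPy. The sketch below covers steps 1 – 5 with deliberately crude stand-ins — linear-interpolation resampling instead of a polyphase filter, an energy gate instead of a real VAD, RMS dBFS instead of true LUFS — so it shows the shape of the pipeline, not production DSP:

```python
import numpy as np

def preprocess(wave, sr, target_sr=16000, target_dbfs=-23.0):
    """Steps 1-5 of the canonical pipeline, with crude stand-ins (see above)."""
    x = np.asarray(wave, dtype=np.float64)
    if x.ndim == 2:                                    # (channels, samples) -> mono
        x = x.mean(axis=0)
    if sr != target_sr:                                # naive linear resample
        n_out = int(len(x) * target_sr / sr)
        x = np.interp(np.linspace(0, len(x) - 1, n_out), np.arange(len(x)), x)
    frame = 400                                        # 25 ms at 16 kHz
    n_frames = len(x) // frame
    energy = (x[: n_frames * frame].reshape(n_frames, frame) ** 2).mean(axis=1)
    active = np.flatnonzero(energy > energy.max() * 1e-4)   # -40 dB energy gate
    x = x[active[0] * frame : (active[-1] + 1) * frame]
    rms = np.sqrt((x ** 2).mean())                     # RMS loudness normalisation
    gain = 10 ** (target_dbfs / 20) / (rms + 1e-9)
    return (x * gain).astype(np.float32)

sr = 48000
t = np.arange(sr) / sr                                 # 1 s of 220 Hz tone
wave = np.concatenate([np.zeros(sr // 2), np.sin(2 * np.pi * 220 * t), np.zeros(sr // 2)])
out = preprocess(wave, sr)
```

In a real pipeline each of these steps would be a library call (torchaudio, librosa, pyloudnorm, a proper VAD) behind a config flag — the point of the sketch is that every step changes the signal the model sees, which is why each one deserves its own config entry.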
Streaming and real-time services add a further constraint: features must be computed incrementally, frame by frame, with state (window overlap, normalisation statistics) carried across chunks, and several mature implementations exist for exactly this (streaming_features, NVIDIA Riva's frame store, Google's Lyra). If you are writing a streaming service, do not reinvent this — use one of the existing implementations.
GPU-accelerated preprocessing matters at scale. Computing a log-mel-spectrogram on CPU for a few hundred hours of training data is trivial; computing it inside every training step for several thousand hours of wav2vec 2.0 pretraining demands either pre-cached features or GPU spectrograms. torchaudio.transforms implements GPU STFT / mel; nnAudio does the same for CQT and gammatone; NVIDIA's DALI offers a C++-accelerated audio-loading + feature-extraction path for enormous datasets. Pre-caching log-mel-spectrograms to disk is a common optimisation when features don't depend on training-time randomness.
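The pre-caching pattern is simple enough to sketch: compute each utterance's features once, write them as .npy files, and memory-map them at training time. The helper names are ours, and the framewise log-energy featurizer is a stand-in for a real log-mel transform:

```python
import numpy as np
import tempfile
from pathlib import Path

def cache_features(utterances, cache_dir, featurizer):
    """Compute each utterance's features once and store them as .npy files;
    later epochs load (or memory-map) instead of recomputing."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    for utt_id, wave in utterances.items():
        path = cache_dir / f"{utt_id}.npy"
        if not path.exists():                          # idempotent across runs
            np.save(path, featurizer(wave))
    return {u: cache_dir / f"{u}.npy" for u in utterances}

def load_cached(path):
    return np.load(path, mmap_mode="r")                # lazy read at train time

def log_energy(wave, frame=400, hop=160):
    """Stand-in featurizer: framewise log-energy (swap in a real log-mel)."""
    n = 1 + (len(wave) - frame) // hop
    frames = np.stack([wave[i * hop : i * hop + frame] for i in range(n)])
    return np.log((frames ** 2).mean(1) + 1e-10).astype(np.float32)

utts = {"utt1": np.random.default_rng(0).normal(size=16000).astype(np.float32)}
with tempfile.TemporaryDirectory() as d:
    paths = cache_features(utts, d, log_energy)
    feats = np.array(load_cached(paths["utt1"]))       # copy out of the mmap
```

The idempotence check is what makes this safe to run from every worker of a distributed job; the trade-off against on-the-fly computation is that cached features cannot depend on training-time randomness such as SpecAugment.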
The rest of Part VIII builds directly on this foundation. Chapter 02 (Automatic Speech Recognition) takes log-mel spectrograms or wav2vec 2.0 embeddings and runs CTC / RNN-T / attention-based decoders to produce text — Whisper's encoder consumes exactly the log-mel spectrograms of Section 10. Chapter 03 (Text-to-Speech) learns the inverse path — text → log-mel spectrogram → waveform via HiFi-GAN / BigVGAN (Section 17). Chapter 04 (Speaker Recognition) trains x-vectors / ECAPA-TDNN on log-mel features (Sections 10–11) and uses PLDA / cosine scoring for verification. Chapter 05 (Audio Classification) feeds log-mel spectrograms into CNNs / ASTs / transformers for environmental-sound and music-tag prediction. Chapter 06 (Music Generation) either predicts mel spectrograms with diffusion (AudioLDM, Stable Audio) or predicts EnCodec / SoundStream tokens with autoregressive transformers (MusicLM, MusicGen).
Audio signal processing also feeds the broader compendium. Part VI's Transformer Architecture and Pretraining Paradigms supply the backbone for every speech transformer. Part V's Convolutional Neural Networks and Sequence Models provide the architecture toolkit for 1D-conv front ends and RNN-based streaming ASR. Part VII's Vision-Language Models preview the multimodal extension — VLMs plus audio encoders give you audio-language models (CLAP, AudioPaLM, Qwen-Audio, GPT-4o audio), which are the natural endpoint of this part of the compendium. Part I's Signal Processing chapter has the mathematical foundations — Fourier transforms, filter design, windowing — that we have used throughout. This is the most cross-linked chapter in the compendium because audio is the most cross-modal of tasks.
The practical claim is that a practitioner who understands the chapter they just read — sample rates, the STFT, mel-spectrograms, MFCCs, CREPE, separation / enhancement, wav2vec, EnCodec — has the full vocabulary to read any paper or debug any pipeline in Part VIII. The theoretical claim is stronger: audio is the closest that machine learning gets to working with physical reality in real time. Every decision in this chapter — window length, hop, bit depth, sample rate — is a statement about what matters in the physical world. That is why the chapter is longer than it first appears, and why it rewards a careful second reading once the later chapters have given it context.
Audio signal processing draws on eight decades of DSP plus the last ten years of deep-learning audio work. The canon below organises the key references as the chapter does — physics and perception, sampling and quantisation, the Fourier workhorse, spectrograms and mel features, MFCCs, pitch and onsets, source separation, enhancement, learnable front ends, neural codecs, datasets, and the modern software ecosystem.
Discrete-Time Signal Processing — Oppenheim & Schafer (3rd ed., 2009)
The graduate-textbook reference for DSP. Chapters on sampling, DFT/FFT, the STFT, and filter design are the canonical treatments; every audio ML engineer should have read at least the sampling, Fourier, and STFT chapters.
Speech and Language Processing — Jurafsky & Martin (3rd ed. draft)
Chapters 16 (Feature Extraction for Speech) and 17 (Speech Recognition) are the best single-source treatment of MFCCs, filterbanks, and modern ASR preprocessing. Chapter 16 in particular is a compressed version of this chapter.
Fundamentals of Music Processing — Meinard Müller (2nd ed., 2021)
The standard music-information-retrieval textbook. Covers chroma, CQT, onset / beat tracking, chord / key estimation, and music segmentation in beautifully clear prose. Much of Section 12 and Section 13 of this chapter follow Müller's presentation.
Think DSP — Allen B. Downey (2016)
An introductory DSP textbook in Python — every concept is illustrated with NumPy code. The gentlest path for an ML engineer who needs to learn what sampling / Fourier / STFT actually do.
A Survey of Deep Learning for Audio Signal Processing — Purwins et al. (2019)
Pre-wav2vec 2.0 but still the best cross-domain (speech, music, environmental-sound) survey of deep-learning audio.
Communication in the Presence of Noise — Shannon (1949)
The original sampling theorem paper. Worth reading for the clarity of Shannon's exposition.
Loudness, Its Definition, Measurement and Calculation — Fletcher & Munson (1933)
The equal-loudness contours that anchor the dB-weighting standards and shape every psychoacoustically motivated feature.
A Scale for the Measurement of the Psychological Magnitude Pitch — Stevens, Volkmann & Newman (1937)
The original mel scale. The scale everyone cites but few read; it is short and illuminating.
Acoustic Theory of Speech Production — Gunnar Fant (1960)
The source-filter model of speech. Everything about why MFCCs decorrelate so well traces back to the vocal-tract all-pole interpretation here.
An Algorithm for the Machine Calculation of Complex Fourier Series — Cooley & Tukey (1965)
The (re)discovery of the FFT — arguably the most consequential algorithm of the twentieth century.
On the Use of Windows for Harmonic Analysis with the Discrete Fourier Transform — Harris (1978)
The encyclopedia of window functions. The reference for anyone tuning an STFT for a measurement application; the Blackman-Harris windows come from this paper.
Signal Estimation from Modified Short-Time Fourier Transform — Griffin & Lim (1984)
The Griffin-Lim algorithm for phase reconstruction. Still used as a baseline and for quick-and-dirty mel-to-wave inversion.
Calculation of a Constant Q Spectral Transform — Brown (1991)
The CQT for music analysis — log-spaced bins aligned with musical pitches.
Comparison of Parametric Representations for Monosyllabic Word Recognition — Davis & Mermelstein (1980)
The MFCC paper. MFCCs dominated speech recognition for over thirty years after this.
Perceptual Linear Predictive (PLP) Analysis of Speech — Hermansky (1990)
PLP features — Bark-scale warping + equal-loudness pre-emphasis + LP analysis. The main rival to MFCCs through the 1990s.
RASTA Processing of Speech — Hermansky & Morgan (1994)
Band-pass filtering of cepstral trajectories to remove slow channel variation — a free alternative to cepstral mean normalisation.
Front-End Factor Analysis for Speaker Verification — Dehak et al. (2011)
The i-vector paper. Dominated speaker recognition for a decade; still a competitive baseline.
X-Vectors: Robust DNN Embeddings for Speaker Recognition — Snyder et al. (2018)
The neural successor to i-vectors. Establishes the pattern of embedding-based speaker modelling that ECAPA-TDNN and later models continue.
YIN, a Fundamental Frequency Estimator for Speech and Music — de Cheveigné & Kawahara (2002)
The autocorrelation-based pitch estimator that was the research default for fifteen years. Still competitive and cheap.
CREPE: A Convolutional Representation for Pitch Estimation — Kim et al. (2018)
The modern neural pitch tracker. Near-ground-truth on monophonic signals; the default for singing-voice synthesis and MIR.
Onsets and Frames: Dual-Objective Piano Transcription — Hawthorne et al. (2018)
The piano-transcription paper. Establishes the dual-target (onset + sustained frame) approach that polyphonic transcription still uses.
Basic Pitch: A Lightweight Pitch Detection Model — Bittner et al. (2022)
Fast, open-source polyphonic pitch estimation for vocals, instruments, and synthesis. Currently the default MIR pitch tool.
Algorithms for Non-Negative Matrix Factorization — Lee & Seung (2001)
The NMF paper that spawned a decade of non-negative audio separation work before deep learning.
Deep Clustering: Discriminative Embeddings for Segmentation and Separation — Hershey et al. (2016)
The embedding-and-cluster approach that sidestepped the speaker-permutation problem and opened the neural-separation era.
Conv-TasNet: Surpassing Ideal Time-Frequency Magnitude Masking — Luo & Mesgarani (2019)
The time-domain TCN separator that dramatically beat STFT-based baselines on WSJ0-2mix. Established that STFT is not required.
Demucs: Music Source Separation in the Waveform Domain — Défossez et al. (2019)
The music-separation workhorse: a Wave-U-Net in the time domain. State-of-the-art on MUSDB18 for years and still widely used.
Attention Is All You Need in Speech Separation (SepFormer) — Subakan et al. (2021)
The transformer-based separator that pushed SI-SDRi on WSJ0-2mix beyond 22 dB.
Open-Unmix — Stöter et al. (2019)
A widely used open-source BLSTM music separator. Good baseline and tutorial reference.
Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator — Ephraim & Malah (1984)
The MMSE-STSA estimator. The de-facto classical denoiser for two decades; still the baseline in many enhancement papers.
RNNoise: Learning Noise Suppression — Valin (2018)
A tiny hybrid DSP + RNN denoiser that runs on phones. Widely deployed in open-source apps.
DCCRN: Deep Complex Convolution Recurrent Network — Hu et al. (2020)
The complex-domain denoiser that won the 2020 DNS Challenge. Establishes that modelling phase explicitly improves enhancement.
DNSMOS: A Non-Intrusive Perceptual Speech Quality Metric — Reddy et al. (2021)
The learned non-reference quality metric behind every DNS Challenge since 2021. Replaces PESQ / STOI as the evaluation criterion of choice.
Speaker Recognition from Raw Waveform with SincNet — Ravanelli & Bengio (2018)
The learnable band-pass front end that replaces the mel filterbank. Opens the parameterised-filterbank line.
LEAF: A Learnable Frontend for Audio Classification — Zeghidour et al. (2021)
Gabor-like learnable filters with learned per-channel gain and compression. A clean modern learnable-mel alternative.
Representation Learning with Contrastive Predictive Coding — van den Oord et al. (2018)
CPC — the InfoNCE objective that spawned wav2vec and essentially every self-supervised speech model.
wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations — Baevski et al. (2020)
The self-supervised speech encoder that made semi-supervised ASR practical and transformed the field. Start here.
HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction — Hsu et al. (2021)
The BERT-for-speech recipe: predict cluster labels of masked frames. Simpler to train than wav2vec 2.0 and the current backbone of many pipelines.
WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing — Chen et al. (2022)
HuBERT + denoising + overlap-mixing. The top-performing self-supervised encoder on SUPERB for a long stretch; robust to noise.
SUPERB: Speech Processing Universal PERformance Benchmark — Yang et al. (2021)
The multi-task benchmark that organises all self-supervised speech encoder comparisons. The gold standard for front-end evaluation.
Robust Speech Recognition via Large-Scale Weak Supervision (Whisper) — Radford et al. (2022)
Not self-supervised (it is weakly supervised on 680 k hours of labelled web audio), but Whisper's encoder has become a widely used front end in its own right.
WaveNet: A Generative Model for Raw Audio — van den Oord et al. (2016)
The dilated autoregressive waveform model. Launched the neural-vocoder era and set the quality bar that every successor has aimed at.
HiFi-GAN: Generative Adversarial Networks for Efficient and High-Fidelity Speech Synthesis — Kong et al. (2020)
The GAN vocoder that made real-time neural TTS standard. Default vocoder in most TTS research through 2024.
BigVGAN: A Universal Neural Vocoder with Large-Scale Training — Lee et al. (2023)
A universal vocoder that handles unseen speakers and music. Uses snake activations and anti-aliased upsampling.
SoundStream: An End-to-End Neural Audio Codec — Zeghidour et al. (2021)
The first neural audio codec with residual vector quantisation. The ancestor of EnCodec and DAC.
High Fidelity Neural Audio Compression (EnCodec) — Défossez et al. (2022)
The de-facto audio codec of 2023 – 2024. Feeds Bark, MusicGen, and many other audio-token models.
High-Fidelity Audio Compression with Improved RVQGAN (DAC) — Kumar et al. (2023)
The current state-of-the-art codec for 44.1 / 48 kHz audio. Improves on EnCodec's reconstruction quality at similar bitrates.
AudioLM: A Language Modeling Approach to Audio Generation — Borsos et al. (2022)
The audio-as-tokens paradigm — generate SoundStream tokens with a hierarchical transformer. The blueprint for MusicLM, VALL-E, and every modern audio generator.
LibriSpeech: An ASR Corpus Based on Public-Domain Audio Books — Panayotov et al. (2015)
The default English ASR benchmark: 960 hours of clean read speech. Still the first dataset most ASR projects touch.
Common Voice: A Massively-Multilingual Speech Corpus — Ardila et al. (2020)
Crowdsourced multilingual speech. The foundation of open multilingual ASR and the most-used counterpart to LibriSpeech.
AudioSet: An Ontology and Human-Labeled Dataset for Audio Events — Gemmeke et al. (2017)
2 M clips, 527 classes. The foundational audio-tagging benchmark; training data for PANN, VGGish, YAMNet, AST, and BEATs.
MUSDB18: A Corpus for Music Separation — Rafii et al. (2017)
The standard music-source-separation benchmark. Vocals / drums / bass / other stems for 150 songs.
MUSAN: A Music, Speech, and Noise Corpus — Snyder et al. (2015)
The go-to noise / music / speech corpus for data augmentation in ASR and speaker recognition.
librosa: Audio and Music Signal Analysis in Python — McFee et al. (2015)
The research-default Python audio library. Ubiquitous; every audio ML paper uses it somewhere.
torchaudio: Building Blocks for Audio and Speech Processing — Yang et al. (2022)
PyTorch's audio library: GPU-accelerated STFT / mel / MFCC, resampling, augmentation, pre-trained models. The default for training pipelines.
SpeechBrain: A General-Purpose Speech Toolkit — Ravanelli et al. (2021)
A friendly, research-oriented PyTorch toolkit for ASR, speaker recognition, enhancement, separation. Good for reproducing and extending published recipes.
NVIDIA NeMo: Conversational AI Toolkit — NVIDIA (ongoing)
The production-grade speech toolkit. Conformer-CTC, Parakeet, Canary, FastPitch, Sortformer. The reference if you want state-of-the-art speech models off the shelf.
ESPnet: End-to-End Speech Processing Toolkit — Watanabe et al. (2018)
The academic PyTorch successor to Kaldi — E2E ASR, TTS, and speech translation. Widely used for low-resource and multilingual work.
Kaldi Speech Recognition Toolkit — Povey et al. (2011)
The classical C++ HMM-GMM / DNN-HMM ASR toolkit. Still widely used for speaker diarisation and as a baseline; the vocabulary of Kaldi-trained features (13-d MFCC + deltas, i-vectors) pervades the literature.
madmom: A New Python Audio and Music Signal Processing Library — Böck et al. (2016)
The reference MIR library for onset / beat / downbeat / chord detection with pre-trained models. Used in most rhythm-analysis research.