Most audio in the world is not speech. Glass shatters, engines idle, birds call, crowds roar — and every one of these sounds carries information a system might need to act on. Audio classification assigns semantic labels to acoustic events; sound event detection locates them in time; and audio-language models like CLAP tie the whole taxonomy to natural language, enabling zero-shot recognition of any sound a person can describe.
The chapter builds from features to architectures to task formulations, then broadens to multimodal grounding. Sections 1–2 define the task landscape and the acoustic features used. Sections 3–5 cover the three architectural eras: CNNs (VGGish, CNN14, PANNs), Transformers (AST, PaSST, BEATs), and self-supervised pre-training. Sections 6–7 address sound event detection — the temporally resolved version of tagging — and its weakly supervised regime. Section 8 covers music tagging and MIR. Sections 9–10 cover CLAP and audio-language models. Section 11 surveys evaluation metrics and benchmark datasets.
Prerequisites: Chapter 01 (Audio Signal Processing) for log-mel spectrograms and filterbanks; familiarity with CNNs (Part V, Chapter 04) and Transformers (Part V, Chapter 06). The CLAP and zero-shot sections parallel the vision-language material in Part VII, Chapter 06 — reading both together is worthwhile.
Audio classification is not one task but a family, distinguished by what is being labeled, at what temporal resolution, and whether the label set is fixed or open.
The three primary task formulations are audio tagging (clip-level, multi-label: "this 10-second clip contains a dog bark and traffic noise"), sound event detection (frame-level, temporally located: "a dog bark occurs from 2.3s to 3.7s"), and audio retrieval (ranking: "find all clips in the database matching 'coffee shop ambience'"). Tagging and detection differ in output resolution and in what training data is required; retrieval connects audio to language.
The domain coverage is broad. Environmental sounds are the most studied: breaking glass, clapping, rain, traffic, animal calls. Industrial and safety sounds include machinery anomalies (for predictive maintenance), glass breaks and gunshots (for security), and HVAC noise (for smart buildings). Medical audio covers respiratory sounds (coughs, wheezes — used in COVID-19 screening), heart sounds, and neonatal cry classification. Music understanding covers genre, mood, tempo, key, and instrument — the domain of Music Information Retrieval (MIR).
Unlike speech recognition, audio classification treats content as the primary signal rather than a nuisance. The same acoustic event — a sustained "shhh" — is noise to a speech recognizer and informative to a scene classifier. The design choices that help speech recognition (suppressing non-linguistic variation) often hurt audio classification (which relies on precisely that variation).
The temporal structure of the task matters enormously for model choice. A clip-level tag can be learned from a single global pooling operation over frame features. A sound event onset requires the model to locate a transient in time with sub-second precision. These are different generalization demands, and the gap between them is one of the central tensions in the field.
The dominant input for audio classification is the log-mel spectrogram — a 2D time-frequency representation that compresses perceptually relevant information into a compact matrix suitable for both CNNs and Transformers.
A standard log-mel spectrogram for audio classification uses a 25ms window with 10ms hop, 128 mel filterbanks, and log compression. A 10-second AudioSet clip becomes a matrix of roughly $998 \times 128$ values. This is intentionally analogous to a grayscale image, which is why image CNN architectures transfer well: the model sees "ridges" in the time-frequency plane corresponding to tonal sounds, "spots" for transients, and broadband textures for noise.
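This recipe can be sketched in pure NumPy. The window, hop, filterbank count, and FFT size below follow the parameters just described; a production system would use a library such as librosa or torchaudio, but the computation is the same: frame, window, magnitude FFT, mel projection, log.

```python
import numpy as np

def log_mel_spectrogram(wav, sr=16000, win_ms=25, hop_ms=10, n_mels=128, n_fft=512):
    """Minimal log-mel sketch: frame -> Hann window -> |FFT| -> mel filterbank -> log."""
    win = int(sr * win_ms / 1000)          # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)          # 160 samples at 16 kHz
    n_frames = 1 + (len(wav) - win) // hop
    idx = np.arange(win)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = wav[idx] * np.hanning(win)
    spec = np.abs(np.fft.rfft(frames, n_fft))            # (T, n_fft//2 + 1)

    # Triangular mel filterbank between 0 Hz and Nyquist.
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)   # rising slope
        fb[m - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)   # falling slope
    return np.log(spec @ fb.T + 1e-6)                    # (T, n_mels)

x = np.random.default_rng(0).standard_normal(10 * 16000)  # a 10-second "clip"
S = log_mel_spectrogram(x)
print(S.shape)   # (998, 128), matching the dimensions quoted above
```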
An alternative is to process raw waveforms directly with 1D convolutions, as in SincNet: the first layer learns parametric bandpass filters (sinc functions with learned cutoff frequencies) rather than fixed mel filterbanks. Raw waveform models preserve phase information lost in the STFT magnitude spectrum, which matters for some binaural and spatial audio tasks. However, on large-scale benchmarks like AudioSet, spectrogram-based models consistently outperform raw waveform models at equivalent parameter counts — the mel spectrogram's perceptual compression is doing useful feature engineering.
AudioSet and similar datasets suffer from extreme class imbalance (speech: millions of examples; rare industrial sounds: hundreds). Augmentation is essential. The most important techniques are Mixup — blending two clips and their labels with $\lambda \in [0,1]$ drawn from a $\mathrm{Beta}(\alpha, \alpha)$ distribution: $\tilde{x} = \lambda x_i + (1-\lambda) x_j$, $\tilde{y} = \lambda y_i + (1-\lambda) y_j$ — and SpecAugment — masking random time steps and frequency bands in the spectrogram before input. Both are simple and produce reliable gains. Additional augmentations include time-stretching (changing tempo without changing pitch), pitch-shifting, and adding background noise at random SNRs.
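Both techniques fit in a few lines. The sketch below is illustrative, not a reference implementation: the mask counts and maximum mask widths are arbitrary choices, and real pipelines apply these on GPU inside the data loader.

```python
import numpy as np

rng = np.random.default_rng(0)

def mixup(x1, y1, x2, y2, alpha=0.2):
    """Blend two examples and their multi-hot label vectors with lam ~ Beta(alpha, alpha)."""
    lam = float(rng.beta(alpha, alpha))
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2

def spec_augment(spec, n_time_masks=2, n_freq_masks=2, max_t=50, max_f=16):
    """SpecAugment-style masking: zero random time steps and frequency bands
    of a (T, F) log-mel spectrogram."""
    spec = spec.copy()
    T, F = spec.shape
    for _ in range(n_time_masks):
        t0 = int(rng.integers(0, T - max_t))
        spec[t0:t0 + int(rng.integers(1, max_t + 1)), :] = 0.0
    for _ in range(n_freq_masks):
        f0 = int(rng.integers(0, F - max_f))
        spec[:, f0:f0 + int(rng.integers(1, max_f + 1))] = 0.0
    return spec

x_mix, y_mix = mixup(np.ones((998, 128)), np.array([1.0, 0.0]),
                     np.zeros((998, 128)), np.array([0.0, 1.0]))
masked = spec_augment(np.ones((998, 128)))
```

Note that the mixed label vector is soft: after blending, the targets sum to 1 across the two source labels, which is why mixup is trained with a soft binary cross-entropy rather than hard one-hot targets.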
Treating the log-mel spectrogram as an image and applying image augmentation ideas (random crops, horizontal flip on time axis, colour jitter → frequency shift) is not just an analogy — it is a productive engineering strategy. The main caveat is that vertical flipping (frequency mirroring) has no acoustic meaning and should be avoided.
AudioSet is to audio classification what ImageNet was to image classification: the dataset that forced the field to scale up, exposed the gap between small- and large-model performance, and defined the benchmark against which all systems are measured.
Released by Google in 2017, AudioSet contains approximately 1.79 million 10-second YouTube clips annotated with 527 sound classes drawn from a hierarchical ontology. The ontology groups classes into broad categories: Human sounds, Animal, Music, Natural sounds, Sounds of things, and Channel/environment noise. Classes are organized as a directed acyclic graph — a "dog bark" is a child of "dog" which is a child of "domestic animals" — enabling hierarchical evaluation.
Labels in the main AudioSet release are weak: a clip labeled "guitar" contains a guitar somewhere in its 10 seconds, but no onset or offset information is provided. A smaller strongly labeled subset (the AudioSet Strongly Annotated subset used in DCASE challenges) provides frame-level annotations for a subset of classes. The gap between weak and strong supervision is one of the central challenges of sound event detection (Section 7).
AudioSet's class distribution spans five orders of magnitude: "speech" has over 1.7 million examples; "cricket chirping" has 339. Labels were assigned by a classifier trained on human ratings, introducing substantial noise — estimated at 20–40% for some classes. Models trained naively on AudioSet learn to predict "speech" reliably and struggle on rare classes. Effective systems use class-balanced sampling, focal loss (which down-weights easy examples), or mixup to mitigate both problems.
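Focal loss is the simplest of these mitigations to state precisely: it multiplies the per-example binary cross-entropy by $(1 - p_t)^\gamma$, where $p_t$ is the predicted probability of the true label, so confidently correct (mostly frequent-class) examples contribute almost nothing to the gradient. A minimal NumPy sketch:

```python
import numpy as np

def focal_bce(p, y, gamma=2.0):
    """Binary cross-entropy modulated by (1 - p_t)^gamma, which down-weights
    examples the model already classifies confidently (Lin et al.'s focal loss)."""
    p = np.clip(p, 1e-7, 1 - 1e-7)
    pt = np.where(y == 1, p, 1 - p)     # probability assigned to the true label
    return float(np.mean(-((1 - pt) ** gamma) * np.log(pt)))

easy = focal_bce(np.array([0.99]), np.array([1]))   # confident and correct
hard = focal_bce(np.array([0.60]), np.array([1]))   # uncertain
```

With $\gamma = 2$, the confident example's loss is suppressed by a factor of $10^4$ relative to plain cross-entropy, letting rare-class errors dominate the gradient.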
AudioSet's mAP is macro-averaged: each of the 527 classes contributes equally, regardless of how many examples it has. Strong performance on frequent classes like "speech" therefore cannot mask failure on rare ones — but the single headline number still hides which classes fail, which is why per-class AP curves and class-stratified analysis are essential alongside it.
The first decade of large-scale audio classification was dominated by convolutional networks applied to log-mel spectrograms. Three systems defined the progression: VGGish established the paradigm, CNN14 optimised it, and PANNs democratised it via open pre-trained weights.
VGGish (Hershey et al., Google, 2017) applied a VGG-like CNN to 96×64 log-mel patches with 0.96-second duration. Trained on YouTube-100M with weak audio labels, it produces 128-dimensional clip embeddings and became the default audio feature extractor — embedded in dozens of downstream systems without any task-specific training. Its principal limitation is that it was trained before AudioSet was public and on a proprietary dataset, making reproducibility difficult.
CNN14 (Kong et al., University of Surrey, 2020) is a 14-layer CNN trained directly on AudioSet with global average pooling and global max pooling fused before the classifier. It achieved mAP of 0.431 on AudioSet eval — a 60% relative improvement over earlier CNNs — and was released as part of the PANNs (Pre-trained Audio Neural Networks) library, which provides CNN6, CNN10, CNN14, ResNet38, ResNet54, and MobileNetV1 checkpoints pretrained on AudioSet.
PANNs work as general-purpose audio feature extractors. The standard recipe: freeze CNN14 up to the penultimate layer, add a task-specific head (linear classifier, LSTM, attention pooling), and fine-tune on the target dataset. This is effective even with very small downstream datasets (under 1000 examples) because AudioSet pre-training covers a broad acoustic distribution. CNN14 embeddings have been used for COVID-19 cough detection, urban sound tagging, bird call identification, and gunshot detection — all from the same frozen backbone.
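The frozen-backbone recipe reduces to: embed every clip once, then train only a small head on the embeddings. The sketch below illustrates this with a placeholder `cnn14_embed` function standing in for the frozen CNN14 forward pass (the real model maps a waveform to a 2048-dimensional embedding) and a single logistic unit as the task head; the toy target and hyperparameters are assumptions for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

def cnn14_embed(batch):
    """Placeholder for a frozen CNN14 forward pass (waveform -> 2048-d embedding)."""
    return rng.standard_normal((len(batch), 2048))

X = cnn14_embed(range(200))              # embeddings for 200 downstream clips
y = (X[:, 0] > 0).astype(float)          # toy binary target for the sketch

# Train only the head: full-batch gradient descent on a logistic unit.
w, b = np.zeros(2048), 0.0
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid
    g = p - y                                # gradient of BCE w.r.t. the logits
    w -= 0.1 * X.T @ g / len(y)
    b -= 0.1 * g.mean()

p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
acc = float(((p > 0.5) == (y > 0.5)).mean())
```

Because the backbone never updates, each clip is embedded once and cached — which is what makes the recipe practical for downstream datasets of a few hundred examples.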
Attention-based models surpassed CNNs on AudioSet around 2021, following the same trajectory as in vision. Three systems illustrate the architectural evolution: AST adapted ViT directly, PaSST introduced patchout training, and BEATs added self-supervised pre-training with iterative tokenizer refinement.
Audio Spectrogram Transformer (AST) (Gong et al. 2021, MIT) applies a Vision Transformer directly to overlapping 16×16 patches of a log-mel spectrogram. The key modification from ViT is positional embedding interpolation: AST is initialised from ImageNet-pretrained ViT weights, but the input grid dimensions differ between images and audio spectrograms, so 2D positional embeddings are bilinearly interpolated. AST achieved mAP 0.485 on AudioSet with ensemble — at the time, a substantial advance over CNN14.
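The interpolation step itself is ordinary bilinear resampling of the positional-embedding grid. The NumPy sketch below is illustrative (AST's released code does this with PyTorch interpolation utilities), and the grid sizes in the example are hypothetical:

```python
import numpy as np

def interpolate_pos_embed(pos, new_hw):
    """Bilinearly resample a (H, W, D) grid of positional embeddings to a new
    (H2, W2) patch grid, e.g. from an image layout to a spectrogram layout."""
    H, W, D = pos.shape
    H2, W2 = new_hw
    ys = np.linspace(0, H - 1, H2)
    xs = np.linspace(0, W - 1, W2)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None, None]            # fractional row offsets
    wx = (xs - x0)[None, :, None]            # fractional column offsets
    top = pos[y0][:, x0] * (1 - wx) + pos[y0][:, x1] * wx
    bot = pos[y1][:, x0] * (1 - wx) + pos[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

# Hypothetical sizes: a square ViT patch grid resampled to a long, narrow
# spectrogram patch grid (many time patches, few frequency patches).
vit_grid = np.random.default_rng(0).standard_normal((24, 24, 768))
ast_grid = interpolate_pos_embed(vit_grid, (101, 12))
```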
PaSST (Patchout faSt Spectrogram Transformer, Koutini et al. 2021) introduced patchout: randomly dropping a fraction of input patches during training, analogous to dropout but operating at the token level. Patchout dramatically reduces memory usage, enabling larger batch sizes and longer training, and also acts as a regularizer. PaSST with patchout achieves mAP 0.496 on the AudioSet evaluation set and scales efficiently to training on the full AudioSet release.
BEATs (Bidirectional Encoder representation from Audio Transformers, Chen et al. 2022, Microsoft) introduces iterative self-supervised pre-training. The training alternates between two phases: first, a tokenizer is trained to assign discrete acoustic tokens to spectrogram patches; second, the Transformer is pre-trained with a masked prediction objective — predicting the discrete tokens of masked patches, analogous to BERT for text. The tokenizer and Transformer are then jointly refined for several iterations, progressively improving the token vocabulary and the representation. BEATs achieves mAP 0.486 on AudioSet without using AudioSet labels during pre-training — demonstrating that labeled data can be replaced with large-scale unlabeled audio for pre-training, followed by a small fine-tuning stage on labeled clips.
| Model | Architecture | AudioSet mAP | Year |
|---|---|---|---|
| CNN14 | 14-layer CNN | 0.431 | 2020 |
| AST | ViT-B/16, ImageNet init | 0.485 | 2021 |
| PaSST | ViT + patchout | 0.496 | 2021 |
| BEATs | Transformer, SSL pretrain | 0.486 | 2022 |
| EfficientAT | EfficientNet-M, pretrained | 0.497 | 2023 |
Audio tagging assigns a label to a clip. Sound event detection goes further: it must locate each event in time, producing onset and offset timestamps alongside the class label. This requires frame-level, not clip-level, predictions.
The canonical SED architecture is the CRNN (Convolutional Recurrent Neural Network): CNN layers extract local spectro-temporal features, a bidirectional GRU or LSTM processes the resulting frame sequence, and a sigmoid-activated linear layer produces per-class activation probabilities at each time step. The output is a matrix of shape $T' \times C$ where $T'$ is the number of output frames (after CNN pooling) and $C$ is the number of sound classes. Onset and offset times are extracted by thresholding per-class activation curves and finding rising and falling edges.
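The post-processing step — thresholding per-class activations and finding rising and falling edges — is simple enough to show directly. A minimal sketch, assuming a 20ms output frame shift; real systems usually add median filtering and per-class thresholds before edge-finding:

```python
import numpy as np

def decode_events(probs, threshold=0.5, hop_s=0.02, class_names=None):
    """Convert a (T', C) frame-probability matrix into a list of
    (class, onset_s, offset_s) events by thresholding and edge-finding."""
    events = []
    active = (probs >= threshold).astype(int)
    for c in range(probs.shape[1]):
        padded = np.concatenate(([0], active[:, c], [0]))  # pad so edges at clip ends count
        edges = np.diff(padded)
        onsets = np.where(edges == 1)[0]                   # rising edges: event starts
        offsets = np.where(edges == -1)[0]                 # falling edges: event ends
        name = class_names[c] if class_names else c
        events += [(name, float(on * hop_s), float(off * hop_s))
                   for on, off in zip(onsets, offsets)]
    return events

probs = np.zeros((10, 2))
probs[3:6, 0] = 0.9          # class 0 active for frames 3-5
probs[7:9, 1] = 0.8          # class 1 active for frames 7-8
events = decode_events(probs)
```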
More recently, Transformer-based SED systems have surpassed CRNNs by replacing the GRU with self-attention. The key challenge is that attention over long sequences is expensive, and SED requires fine temporal resolution (10–20ms frame shift). Efficient variants use dilated attention, local windowed attention, or hierarchical downsampling-upsampling architectures (U-Net style) to combine global context with high-resolution output. DCASE 2023 Task 4 was won by systems combining PaSST-based feature extractors with conformer decoders — showing that large-scale pre-training transfers to SED even when the fine-tuning dataset is only a few hundred clips.
Think of SED as the audio analogue of object detection in vision: tagging is classification (is there a car?), detection is localization (where is the car, and for how long?). Just as image classification features are reused for detection via feature pyramid networks, audio classification backbones (CNN14, AST) are reused for SED by replacing global pooling with frame-level output heads.
Frame-level annotations are expensive to produce. Most real SED data provides only clip-level labels — a "weak" supervision signal. The challenge is to train a temporally precise detector from imprecise labels.
Multiple Instance Learning (MIL) provides the theoretical foundation. A clip (the "bag") is positive for a class if at least one of its frames (the "instances") is positive. Under MIL, the clip-level prediction is computed as: $$\hat{y}_{\text{clip}} = \text{pool}(\{\hat{y}_t\}_{t=1}^{T'})$$ where the pooling function determines how frame-level evidence aggregates. Global max pooling — $\hat{y}_{\text{clip}} = \max_t \hat{y}_t$ — focuses the model on the single most activating frame. Global average pooling treats all frames equally. In practice, attention-based pooling learns a weighted average: $$\hat{y}_{\text{clip}} = \sum_t \alpha_t \hat{y}_t, \quad \alpha_t = \text{softmax}(\mathbf{w}^\top \tanh(\mathbf{W} \mathbf{h}_t))$$ where $\alpha_t$ are learned per-frame weights. At inference, the per-frame $\hat{y}_t$ (before pooling) serve as the SED output, and the attention weights $\alpha_t$ can be inspected to understand which frames drove the clip-level decision.
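The attention-pooling equations above translate directly to code. A minimal NumPy sketch with randomly initialized attention parameters (in a real system, $\mathbf{W}$ and $\mathbf{w}$ are trained end-to-end with the backbone, and attention is often computed per class):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_pool(h, y_frame, W, w):
    """MIL attention pooling. h: (T', D) frame features, y_frame: (T', C)
    per-frame class probabilities, W: (A, D) and w: (A,) attention parameters."""
    scores = np.tanh(h @ W.T) @ w            # (T',) unnormalized attention scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                     # softmax over time
    y_clip = alpha @ y_frame                 # (C,) weighted average = clip prediction
    return y_clip, alpha

h = rng.standard_normal((100, 32))           # 100 frames of 32-d features
y_frame = rng.uniform(size=(100, 5))         # per-frame probabilities for 5 classes
W = rng.standard_normal((16, 32))
w = rng.standard_normal(16)
y_clip, alpha = attention_pool(h, y_frame, W, w)
```

Because the clip prediction is a convex combination of the frame predictions, it always lies between each class's minimum and maximum frame activation — max pooling and average pooling are the two extremes of this family.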
Frame-level evaluation with a standard threshold fails for weakly supervised systems because their activation values are not well-calibrated. The Polyphonic Sound Detection Score (PSDS) integrates performance across all possible thresholds and decision parameters, weighting temporal precision differently from class precision. PSDS replaced event-based F1 as the primary metric in DCASE Task 4 from 2021 onward, producing more robust comparisons between systems with different threshold sensitivities.
Music is structured sound — and that structure makes it both easier and harder to classify than environmental audio. Genre, mood, and instrument can be captured from global spectral shape; beat and harmony require temporal parsing at a much finer scale.
Music auto-tagging assigns categorical labels to music clips: genre (rock, jazz, classical), mood (energetic, melancholic, calm), instrumentation (piano, drums, vocals), and production characteristics (acoustic, electric, live). The MagnaTagATune dataset (25,000 clips, 188 tags) and Million Song Dataset (1M tracks, Last.fm tags) are the canonical resources. Tagging models are typically the same CNN or Transformer architectures used for environmental sound — the difference is in the training data, not the architecture.
Beat tracking is the task of estimating the times at which musical beats occur. The standard pipeline: compute a tempogram (local autocorrelation of the onset strength function) to estimate tempo, then use dynamic programming to find the beat sequence consistent with the estimated tempo that maximises onset strength. Neural beat trackers (TCN-based, Böck et al.) now outperform classical methods on most benchmarks by learning directly from beat-annotated audio without handcrafted tempo models.
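The dynamic-programming step can be sketched compactly, in the spirit of Ellis-style beat tracking: each frame's score is its onset strength plus the best predecessor score, penalised by deviation from the target beat period. The period, search window, and tightness constant below are illustrative assumptions, not a reference implementation.

```python
import numpy as np

def dp_beats(onset_env, period, hop_s=0.01, tightness=100.0):
    """DP beat selection: maximize accumulated onset strength while keeping
    inter-beat intervals close to `period` (in frames)."""
    T = len(onset_env)
    score = onset_env.astype(float).copy()
    backlink = np.full(T, -1)
    for t in range(T):
        lo, hi = max(0, t - 2 * period), t - period // 2   # plausible predecessors
        if hi <= lo:
            continue
        prev = np.arange(lo, hi)
        # Penalize inter-beat intervals that deviate from the target period.
        cost = -tightness * np.log((t - prev) / period) ** 2
        best = int(np.argmax(score[prev] + cost))
        score[t] += score[prev[best]] + cost[best]
        backlink[t] = prev[best]
    beats = [int(np.argmax(score))]          # backtrace from the best final beat
    while backlink[beats[-1]] >= 0:
        beats.append(int(backlink[beats[-1]]))
    return np.asarray(beats[::-1]) * hop_s

# A synthetic onset envelope with an impulse every 20 frames (0.2 s):
onset = np.zeros(200)
onset[::20] = 1.0
beats = dp_beats(onset, period=20)
```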
Chord recognition identifies the harmonic content (C major, Am7, etc.) at each time step. Modern systems use CRNNs trained on JAMS-annotated data; the output is a sequence of chord labels at frame resolution (typically around 10 Hz). Key estimation is a coarser version: classify the overall tonal centre of a segment. Both tasks are used in music transcription pipelines and DJ software.
Music poses a different classification geometry than environmental sound: genres overlap and evolve (is trip-hop electronic or hip-hop?), moods are culturally and individually variable, and the same chord progression can evoke completely different moods depending on tempo and instrumentation. Soft, multi-label outputs with calibrated uncertainty are more honest than hard single-label predictions for music.
CLAP extends the CLIP paradigm from vision to audio, training a shared embedding space for audio clips and text descriptions. The result is a zero-shot classifier: any sound can be recognised if you can write a sentence describing it.
CLAP (Elizalde et al., Microsoft 2022) trains two encoders jointly: an audio encoder (HTSAT Transformer or CNN14) and a text encoder (RoBERTa or BERT). For each training pair $(\text{audio}_i, \text{caption}_i)$, the contrastive loss pushes the corresponding audio and text embeddings close while pushing non-corresponding pairs apart — the InfoNCE loss over a batch of $N$ pairs: $$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(\mathbf{a}_i \cdot \mathbf{t}_i / \tau)}{\sum_{j=1}^{N} \exp(\mathbf{a}_i \cdot \mathbf{t}_j / \tau)}$$ where $\tau$ is a learned temperature. After training, zero-shot classification of a new audio clip proceeds by comparing its embedding to the embeddings of class name text strings ("the sound of a dog barking", "traffic noise on a busy street"), and returning the class with the highest cosine similarity.
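Both the InfoNCE loss and zero-shot classification reduce to similarity computations on normalized embeddings. The NumPy sketch below uses synthetic embeddings in place of real encoder outputs (the batch size, dimensionality, and noise level are assumptions); it shows the audio-to-text direction of the loss — CLAP also trains the symmetric text-to-audio term.

```python
import numpy as np

rng = np.random.default_rng(0)

def info_nce(audio_emb, text_emb, tau=0.07):
    """Audio-to-text InfoNCE over N aligned pairs of L2-normalized embeddings:
    each audio should score highest against its own caption in the batch."""
    logits = audio_emb @ text_emb.T / tau                 # (N, N) similarity matrix
    m = logits.max(axis=1, keepdims=True)
    log_softmax = logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))          # diagonal = matched pairs

def l2(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

txt = l2(rng.standard_normal((8, 64)))                    # "caption" embeddings
aud = l2(txt + 0.1 * rng.standard_normal((8, 64)))        # audio near its own caption

loss_aligned = info_nce(aud, txt)
loss_shuffled = info_nce(aud, txt[::-1])                  # mismatched pairs: higher loss

# Zero-shot classification: treat each text embedding as a class prototype
# and return the nearest one by cosine similarity.
pred = int(np.argmax(txt @ aud[3]))
```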
CLAP was trained on a mixture of paired audio-text data: AudioSet with machine-generated captions derived from class labels, FreeSound with user-provided text descriptions, BBC Sound Effects with metadata, and AudioCaps with human-written captions. The diversity and quality of this text is crucial — pairing AudioSet clips with their raw label names ("dog", "bark") produces weak supervision; pairing with natural language descriptions ("a large dog barking repeatedly in an outdoor environment") produces much stronger alignment.
CLAP maps audio into a text-compatible space; audio captioning inverts this — generating a natural language description from a clip. Together they open a richer interface between audio systems and the rest of the language-model ecosystem.
Audio captioning is the task of generating a free-form text sentence that describes the acoustic content of a clip. The canonical dataset is AudioCaps (Kim et al. 2019): roughly 50,000 AudioSet clips annotated with human-written captions (one per training clip, five per validation and test clip). Models are typically encoder-decoder: a CNN or Transformer audio encoder produces a sequence of feature vectors, which a transformer decoder attends over to generate tokens. Evaluation uses BLEU, METEOR, ROUGE, and SPIDEr (a combination of SPICE and CIDEr originally developed for image captioning).
WavCaps (Mei et al. 2023) dramatically scales captioning training data by weakly pairing FreeSound, BBC Sound Effects, and SoundBible clips with ChatGPT-refined metadata descriptions — producing 400k+ audio-text pairs without human annotation. WavCaps pre-training improves downstream captioning and CLAP retrieval performance, demonstrating that LLM-generated synthetic captions are useful even if individually noisy.
More recent systems connect audio encoders directly to large language models. Qwen-Audio (Alibaba, 2023) and SALMONN (Tsinghua, 2023) add audio encoders as additional input modalities to instruction-tuned LLMs, enabling open-ended question answering about audio clips: "What instruments are playing?", "What is the speaker's emotion?", "How many gunshots are there?". These models can also perform speech recognition and speaker description in the same architecture — unifying ASR, audio classification, and captioning in a single system.
CLAP enables zero-shot classification of any sound class describable in language — without retraining. A single CLAP model can classify gunshots, baby cries, and specific bird species simply by updating the text prompts. This fundamentally changes the scalability economics of audio classification: instead of collecting labeled data for every new class, you write a description.
Audio classification evaluation is complicated by multi-label outputs, class imbalance, and the gap between tagging and detection. No single metric captures all failure modes, and benchmark choice matters as much as metric choice.
The dominant metric for multi-label audio tagging is mean Average Precision (mAP). For each class, the Average Precision (AP) is the area under the precision-recall curve computed by varying the classification threshold. mAP is the mean of AP over all classes. Because it is threshold-independent and class-averaged, mAP is more informative than accuracy or AUC for multi-label, imbalanced problems. AudioSet's 527-class mAP is the primary number reported for general-purpose audio classification models.
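AP has a simple ranked-list form: sort examples by score and average the precision measured at each positive. A minimal sketch of per-class AP and macro-averaged mAP:

```python
import numpy as np

def average_precision(scores, labels):
    """AP as the mean of precision-at-k over the positions of the positives —
    equivalent to the area under the precision-recall curve for a ranked list."""
    order = np.argsort(-scores)                       # rank by descending score
    hits = np.asarray(labels)[order].astype(float)
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float(precision_at_k[hits == 1].mean())

def mean_average_precision(score_mat, label_mat):
    """Macro-averaged AP over classes: every class weighted equally,
    regardless of how many positive examples it has."""
    return float(np.mean([average_precision(score_mat[:, c], label_mat[:, c])
                          for c in range(score_mat.shape[1])]))

# A perfect ranking gives AP = 1.0 regardless of class frequency.
ap = average_precision(np.array([0.9, 0.8, 0.2, 0.1]), np.array([1, 1, 0, 0]))
```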
For SED, two metrics coexist. Event-based F1 computes precision and recall over detected events (onset within 200ms, offset within 200ms or 20% of event duration). PSDS (Polyphonic Sound Detection Score) integrates over thresholds and penalises systems that trade temporal resolution for clip-level accuracy — making it a better proxy for the practical quality of a detection system. DCASE challenges now use PSDS as the primary metric for SED tasks.
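A stripped-down sketch of event-based matching shows the mechanics: greedily match each estimated event to an unmatched reference event of the same class within an onset collar, then compute F1 from the match counts. This checks only the onset tolerance — the full DCASE criterion also applies the offset condition described above — so it is illustrative, not a replacement for the `sed_eval`-style reference implementation.

```python
def event_f1(ref, est, tol=0.2):
    """Event-based F1 with a 200 ms onset collar.
    ref, est: lists of (class, onset_s, offset_s) tuples."""
    matched, tp = set(), 0
    for cls, onset, _ in est:
        for i, (rcls, ronset, _) in enumerate(ref):
            if i not in matched and rcls == cls and abs(onset - ronset) <= tol:
                tp += 1
                matched.add(i)
                break
    precision = tp / max(len(est), 1)
    recall = tp / max(len(ref), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

ref = [("dog_bark", 2.3, 3.7), ("siren", 5.0, 8.0)]
est = [("dog_bark", 2.4, 3.6), ("siren", 5.9, 8.0)]   # siren onset is 0.9 s late
score = event_f1(ref, est)
```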
| Dataset | Task | Classes | Primary metric |
|---|---|---|---|
| AudioSet (eval) | Multi-label tagging | 527 | mAP |
| ESC-50 | Environmental sound classification | 50 | Accuracy (5-fold) |
| UrbanSound8K | Urban sound tagging | 10 | Accuracy (10-fold) |
| DCASE Task 4 | Weakly supervised SED | 10 | PSDS |
| AudioCaps | Audio captioning | open | SPIDEr |
| MagnaTagATune | Music auto-tagging | 188 | AUC-ROC, AP |
ESC-50 (Piczak 2015) remains widely reported despite its small size (2,000 clips, 5-fold CV) because it provides a quick sanity check across diverse environmental categories — animals, natural sounds, human non-speech, domestic sounds, urban noise. Human accuracy on ESC-50 is 81.3%; CNN14 achieves 94.7%; AST with AudioSet pre-training achieves 95.6%, meaning the benchmark is now essentially saturated. Progress on AudioSet, which remains far from saturated, is a more informative signal.