Part VIII · Speech, Audio & Music · Chapter 07
Music Generation & Music AI: when models learn to sing.
From MIDI-era Markov chains to billion-parameter text-to-music systems, we trace how machines came to write harmony, generate waveforms, and understand the structure of sound as human expression.
Chapter Preface
Music AI operates at a unique intersection of structure and feeling. A melody is simultaneously a sequence of discrete tokens, a waveform in continuous time, a cultural artifact, and an emotional act. That duality — computable structure, ineffable experience — makes music one of the richest testbeds for generative modeling.
This chapter covers two intertwined threads. The first is music generation: symbolic (MIDI, tokens) and audio-domain (waveform, spectrogram, codecs), from early RNNs through Music Transformer, AudioLM, MusicLM, MusicGen, and diffusion-based approaches. The second is music understanding — the Music Information Retrieval (MIR) tasks of beat tracking, chord recognition, key detection, and structural segmentation that underpin both downstream applications and evaluation. We close with the evaluation landscape and the unresolved ethical questions surrounding training data and copyright.
Prerequisite chapters: Audio Signal Processing (VIII·01), Audio Classification (VIII·06), and a working knowledge of transformer architectures and variational/diffusion generative models.
Section 01
The Music Generation Landscape
Before choosing an architecture, you must choose what you are generating — and at what level of abstraction music lives for your purpose.
Music AI systems differ along three fundamental axes: the representation domain (symbolic vs. audio), the conditioning modality (unconditional, genre/style, text, melody, image, or continuation), and the generation paradigm (autoregressive, diffusion, flow-matching, or hybrid). Each choice entails different data requirements, evaluation protocols, and creative affordances.
Symbolic vs. Audio
Symbolic generation operates on discrete score representations — MIDI note events, ABC notation strings, or custom token vocabularies — and is analogous to generating text. Outputs are compact, editable, and instrument-agnostic; they require a separate synthesizer or vocoder to become sound. Audio generation operates in the raw waveform or spectrogram domain and directly produces listenable output, but the search space is vast: a 30-second stereo clip at 44.1 kHz contains over 2.6 million samples.
Modern systems increasingly bridge both worlds via neural audio codecs — learned compressors that map waveforms to short sequences of discrete tokens, enabling language-model machinery to operate on audio with manageable sequence lengths. This hybrid approach underlies AudioLM, MusicLM, and MusicGen.
Task Taxonomy
The major generative tasks are: unconditional generation (sample from the prior — mainly used for evaluation); text-to-music (generate audio from a natural-language description, the dominant consumer task); music continuation (extend a given prompt); accompaniment generation (add accompaniment tracks to a given melody or vocals); style transfer (re-render audio in a different genre or instrument timbre); and arrangement and transcription (convert audio to score or rearrange across instruments).
Symbolic generation excels at controllable, editable composition; audio generation is necessary for timbre, dynamics, performance expression, and direct playback. The practical frontier — typified by MusicGen and Stable Audio — combines both by training language or diffusion models on codec tokens derived from audio.
Section 02
Music Representations
The choice of representation determines what the model can learn, what it can control, and how easily humans can inspect or edit its outputs.
Symbolic Representations
MIDI (Musical Instrument Digital Interface) encodes music as a stream of events — note_on, note_off, pitch (0–127), velocity, and time — stored as delta-tick offsets. A MIDI file for a four-minute piece might contain only a few thousand events. MIDI is lossless with respect to score content but loses timbre and performance nuance. Most symbolic generation papers use MIDI as the primary data format, often tokenizing it as a flat sequence of events.
The piano roll is a dense 2D matrix where rows represent MIDI pitches and columns represent time frames; a cell is 1 if the note is active and 0 otherwise. Piano rolls are convenient for CNN-based models but grow large with duration, and converting back to MIDI requires onset/offset detection. ABC notation is a text encoding for traditional folk music, amenable to direct language-model training. Newer approaches build custom event-based token vocabularies that encode pitch, duration, velocity, instrument, and time-shift as distinct token types — e.g. the REMI (REvamped MIDI-derived events) tokenizer used by Pop Music Transformer.
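To make the event-based idea concrete, here is a toy REMI-style tokenizer. The vocabulary (Bar / Position / Pitch / Duration tokens) and the 16-steps-per-bar grid are illustrative simplifications, not the exact REMI specification:

```python
# Sketch of event-based (REMI-style) tokenization: note events become
# Bar / Position / Pitch / Duration tokens. Vocabulary and time grid are
# illustrative, not the exact REMI specification.

def tokenize(notes, steps_per_bar=16):
    """notes: list of (start_step, pitch, dur_steps), sorted by start."""
    tokens, current_bar = [], -1
    for start, pitch, dur in notes:
        bar, pos = divmod(start, steps_per_bar)
        if bar != current_bar:          # emit a Bar token at each new bar
            tokens.append("Bar")
            current_bar = bar
        tokens.append(f"Position_{pos}")
        tokens.append(f"Pitch_{pitch}")
        tokens.append(f"Duration_{dur}")
    return tokens

melody = [(0, 60, 4), (4, 64, 4), (8, 67, 8), (16, 72, 16)]
print(tokenize(melody))
```

The resulting flat token stream is exactly what a decoder-only transformer consumes with a standard cross-entropy objective.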
Audio Representations
Raw PCM waveforms carry all information but are extremely high-dimensional. The mel spectrogram — a 2D time-frequency image with mel-scaled frequency bins — is the standard intermediate for most audio-domain models. Spectrograms enable CNN and vision-transformer architectures; the main limitation is phase information loss, which is recovered using vocoders (Griffin-Lim, HiFi-GAN, BigVGAN) when converting back to audio.
Neural audio codecs — EnCodec, SoundStream, DAC — represent the modern alternative: a learned encoder-decoder that compresses 24 kHz audio to 50–75 frames/second, where each frame holds indices into multiple residual codebooks. These are covered fully in Section 05.
Section 03
Classical & Neural Symbolic Generation
Before transformers dominated, Markov chains, HMMs, and recurrent networks produced surprisingly musical results — and revealed the deep challenge of long-range structure.
Statistical Precursors
Markov chains over pitch n-grams were among the first algorithmic composition tools. A first-order Markov chain learns the transition probability $P(\text{pitch}_{t+1} \mid \text{pitch}_t)$ from a corpus and samples iteratively. Higher-order chains capture short melodic phrases but explode in state space. Hidden Markov Models extended this with latent harmonic or stylistic states, enabling, for instance, models of chord progressions that emit melodic notes.
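A first-order pitch Markov chain fits in a few lines. This pure-Python sketch counts bigram transitions in a toy corpus (the melodies are made up for illustration) and samples a new sequence:

```python
import random
from collections import defaultdict

# Minimal first-order Markov melody model: estimate P(next_pitch | pitch)
# by counting bigrams in a (toy) corpus, then sample a new melody.

def fit_transitions(melodies):
    counts = defaultdict(lambda: defaultdict(int))
    for mel in melodies:
        for a, b in zip(mel, mel[1:]):
            counts[a][b] += 1
    # Normalize counts into (pitches, probabilities) per source pitch.
    return {a: (list(nxt), [c / sum(nxt.values()) for c in nxt.values()])
            for a, nxt in counts.items()}

def sample(transitions, start, length, rng):
    mel = [start]
    for _ in range(length - 1):
        pitches, probs = transitions[mel[-1]]   # assumes no dead-end pitches
        mel.append(rng.choices(pitches, probs)[0])
    return mel

corpus = [[60, 62, 64, 62, 60], [60, 64, 62, 60, 62]]
t = fit_transitions(corpus)
print(sample(t, 60, 8, random.Random(0)))
```

Higher-order chains simply replace the single-pitch key with an n-gram tuple, which is exactly where the state-space explosion mentioned above comes from.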
Google Brain's Magenta project (2016–) brought deep learning to open-source music generation. Its initial models — MelodyRNN and ImprovRNN — were LSTM-based next-token predictors over MIDI event sequences, trained on large MIDI corpora such as the Lakh MIDI dataset. MelodyRNN introduced lookback encoding and attention encoding to help the RNN track beat position and repeated musical phrases, substantially improving musical coherence over vanilla LSTMs.
MuseNet and the GPT Pivot
OpenAI's MuseNet (2019) applied a large GPT-style transformer to MIDI event sequences, training on a diverse corpus spanning classical, jazz, country, and pop, combining up to ten instruments. MuseNet could generate four-minute pieces with consistent style and showed that scale — more parameters, more data — transferred from language to music almost directly. Its token vocabulary included pitch, duration, velocity, and instrument; conditioning tokens prepended at the start of the sequence steered genre and orchestration.
MuseNet established the standard symbolic generation recipe: flatten music into a causal token sequence, pretrain a decoder-only transformer with cross-entropy loss, and condition via prefix tokens or special control tokens. The main limitation remained long-range structural coherence: a standard transformer with absolute positional embeddings has difficulty tying a reprise at bar 32 to the theme introduced at bar 1.
Section 04
Music Transformer
Relative attention — tracking relationships between tokens rather than their absolute positions — lets transformers learn the repetition and variation that make music coherent over time.
The Long-Range Structure Problem
Music is defined by repetition: motifs recur, themes develop, sections return. A transformer with absolute positional embeddings treats position 1 and position 257 as unrelated even if they both hold the same melodic motif. Standard self-attention learns $Q$, $K$, $V$ from token content plus absolute position; when the same pattern appears at different positions, the model must learn the equivalence implicitly — a tall order for long sequences.
Relative Global Attention
Huang et al. (Google Brain, 2018) introduced Music Transformer, which replaces absolute with relative positional attention. For query position $i$ and key position $j$, the attention logit becomes:
$$e_{ij} = \frac{q_i \cdot k_j + q_i \cdot r_{i-j}}{\sqrt{d_k}}$$where $r_{i-j}$ is a learned embedding for the relative offset $i-j$ (clipped to a maximum distance $L_{\text{rel}}$). The $q_i \cdot r_{i-j}$ term is computed via a memory-efficient "skewing" trick that avoids materializing a $(T \times T \times d)$ tensor of relative embeddings, reducing the memory cost from $O(T^2 d)$ to $O(Td)$. This allows the model to attend equally strongly to the same motif regardless of where in the piece it occurs, directly modeling musical repetition.
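The skewing trick itself is a pad-reshape-slice. In this NumPy sketch, `qe = Q @ Er.T` has shape $(T, T)$ with column $k$ holding $q_i \cdot e_r$ for relative offset $r = k - (T-1)$; after skewing, entry $(i, j)$ holds $q_i \cdot e_{j-i}$ for the causal region $j \le i$ (entries above the diagonal are garbage and get masked):

```python
import numpy as np

# Memory-efficient "skewing" for relative attention logits, sketched in
# NumPy: pad a dummy column, reshape row-major, drop the first row.

def skew(qe):
    T = qe.shape[0]
    padded = np.pad(qe, ((0, 0), (1, 0)))    # dummy column on the left
    return padded.reshape(T + 1, T)[1:, :]   # aligned so s[i, j] = qe[i, j - i + T - 1]

# Check against the naive definition on random data.
rng = np.random.default_rng(0)
T, d = 5, 8
q = rng.normal(size=(T, d))
er = rng.normal(size=(T, d))                 # er[k]: embedding for offset k - (T-1)
s = skew(q @ er.T)
for i in range(T):
    for j in range(i + 1):                   # causal entries only
        assert np.allclose(s[i, j], q[i] @ er[j - i + T - 1])
print("skewing matches the naive relative logits")
```

The point is that only the $(T, T)$ product `qe` is ever materialized; no per-pair gather of relative embeddings is needed.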
Music Transformer trained on a piano performance dataset showed dramatically improved phrase-level coherence compared to absolute-position baselines. Listening tests revealed the model could sustain a theme across 20+ seconds and introduce plausible variations — behavior that had required hand-crafted structure rules in earlier systems.
Pop Music Transformer (Huang & Yang, 2020) extended the idea with the REMI tokenization scheme, which adds explicit beat-position and chord tokens into the MIDI event stream. This gave the model structured rhythmic anchors, further improving harmonic coherence and genre fidelity.
Section 05
Neural Audio Codecs
Residual vector quantization compresses audio into short discrete token sequences — the lingua franca linking language-model machinery to the audio domain.
The Codec Problem
To apply autoregressive language models to audio, we need a discrete representation that is both compact (short enough for transformer sequence limits) and high-fidelity (recoverable to perceptually high-quality audio). Raw PCM at 24 kHz gives 24,000 samples/second. A mel spectrogram at 100 frames/second is more compact, but discrete coding of mel bins loses too much information. The solution is a neural audio codec: a learned encoder–quantizer–decoder that compresses audio to ~75 frames/second with reconstruction quality rivaling conventional lossy codecs at far higher bitrates.
Residual Vector Quantization (RVQ)
The core mechanism is residual vector quantization. The encoder $E$ maps the audio waveform to a continuous latent $z = E(x)$. The first codebook $\mathcal{C}_1$ (size $K$) quantizes $z$ to the nearest code $\hat{z}_1 = \arg\min_c \|z - c\|$. The residual $r_1 = z - \hat{z}_1$ is then quantized by $\mathcal{C}_2$, and so on through $N_q$ codebooks. The total code per frame is the tuple $(\hat{z}_1, \hat{z}_2, \ldots, \hat{z}_{N_q})$, each index in $\{1,\ldots,K\}$. The decoder $D$ reconstructs from the summed quantized vectors: $\hat{x} = D(\hat{z}_1 + \hat{z}_2 + \cdots + \hat{z}_{N_q})$.
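The per-frame mechanism can be sketched directly from those definitions. In this toy NumPy version the codebooks are random rather than learned (real codecs train them jointly with the encoder/decoder), so only the structure — each stage quantizing the previous stage's residual — is meaningful:

```python
import numpy as np

# Toy residual vector quantizer: stage n quantizes the residual left by
# stages 1..n-1 against its own codebook. Codebooks here are random
# stand-ins; real codecs learn them end-to-end.

def rvq_encode(z, codebooks):
    """z: (d,) latent; codebooks: list of (K, d) arrays.
    Returns (indices, quantized sum)."""
    residual, indices = z.copy(), []
    quantized = np.zeros_like(z)
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # nearest code
        indices.append(idx)
        quantized += cb[idx]
        residual = residual - cb[idx]     # next stage sees what is left over
    return indices, quantized

rng = np.random.default_rng(0)
d, K, n_q = 16, 256, 8
codebooks = [rng.normal(size=(K, d)) for _ in range(n_q)]
z = rng.normal(size=d)
idx, z_hat = rvq_encode(z, codebooks)
print(idx)   # one index per codebook, each in [0, K)
```

Decoding is just the sum of the selected codes, which is exactly what $D$ receives in the formulation above.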
The crucial property is the hierarchy: coarser codebooks (RVQ-1) capture global pitch, rhythm, and broad timbre; finer codebooks (RVQ-4 through RVQ-8) capture subtle acoustic details. This hierarchy enables coarse-to-fine generation strategies.
Key Systems
Google's SoundStream (Zeghidour et al., 2021) was the first high-quality neural codec, trained end-to-end with a combination of reconstruction, commitment (VQ-VAE), and adversarial losses from a multi-scale discriminator. It operates at 6 kbps with $N_q = 8$ codebooks of size $K = 1024$, encoding 24 kHz audio at 75 frames/second. Meta's EnCodec (Défossez et al., 2022) followed a similar design but added a language-model entropy coder and a perceptual loss term; its open weights made it the codec of choice for MusicGen and much subsequent open research. DAC (Descript Audio Codec, Kumar et al., 2023) further improved fidelity with stronger discriminator designs and can encode 44.1 kHz audio at ~8 kbps.
With 8 codebooks of 1024 codes each at 75 frames/second: $8 \times \log_2(1024) \times 75 = 6000$ bits/second = 6 kbps. Using only the first 4 codebooks halves the bitrate but retains sufficient quality for many language-model applications. MusicGen uses 4 codebooks at 50 frames/second (32 kHz).
Section 06
Language Models over Audio Tokens
AudioLM showed that a hierarchical cascade of language models — semantic tokens first, acoustic tokens second — could generate coherent long-form audio without any explicit musical supervision.
AudioLM (Google, 2022)
AudioLM (Borsos et al.) introduced a two-level token hierarchy that separates semantic content from acoustic detail. The key insight: a self-supervised speech model (w2v-BERT) produces discrete tokens that capture phonetic and musical structure without encoding acoustic surface details like speaker timbre. An acoustic codec (SoundStream) captures the surface. By modeling these separately and in sequence, a cascade of transformers achieves both long-range coherence and high perceptual quality.
The three-stage pipeline is: (1) Semantic modeling — a transformer generates k-means-quantized w2v-BERT tokens; (2) Coarse acoustic modeling — a second transformer predicts the first few SoundStream codebook levels conditioned on semantic tokens; (3) Fine acoustic modeling — a third transformer fills in the remaining fine codebook levels given the coarser tokens. All three are autoregressive LMs trained with next-token prediction.
AudioLM was demonstrated on piano continuations and speech without text supervision — it learned musical structure (chord progressions, phrase endings, repetition) purely from next-token prediction over the audio corpus. Piano continuations were judged by listeners to be in the same style and key as the prompt with no explicit key or harmony labels provided to the model.
SoundStorm (Google, 2023)
SoundStorm replaced the sequential coarse-to-fine acoustic modeling stages with a single parallel, iterative masked generation approach inspired by MaskGIT. Given semantic tokens, it generates all codec codebook levels using multiple rounds of masked prediction — each round fills in the highest-confidence tokens, masking the rest for the next round. This replaces the $O(T \cdot N_q)$ sequential decoding steps with a small, fixed number of parallel rounds (on the order of $R \approx 16$ for the coarsest codebook level, fewer for the rest), yielding faster-than-real-time acoustic generation on a TPU.
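The confidence-based unmasking loop is simple to sketch. Here the `predict` function is a random stand-in for the real model (it returns Dirichlet-sampled distributions), so only the decoding schedule — unmask the most confident masked positions each round, finish in a fixed number of rounds — reflects the actual technique:

```python
import numpy as np

# Schematic MaskGIT-style parallel decoding, as used by SoundStorm for
# acoustic tokens. `predict` is a random stand-in, not a real model.

def iterative_decode(T, vocab, rounds, predict):
    tokens = np.full(T, -1)                      # -1 marks a masked position
    for r in range(rounds):
        probs = predict(tokens)                  # (T, vocab) proposal distribution
        conf, choice = probs.max(axis=1), probs.argmax(axis=1)
        masked = tokens == -1
        # Unmask enough of the most confident masked positions to finish on time.
        n_keep = int(np.ceil(masked.sum() / (rounds - r)))
        order = np.argsort(-conf * masked)       # confident masked positions first
        for pos in order[:n_keep]:
            tokens[pos] = choice[pos]
    return tokens

rng = np.random.default_rng(0)
predict = lambda toks: rng.dirichlet(np.ones(32), size=len(toks))
out = iterative_decode(T=20, vocab=32, rounds=4, predict=predict)
print(out)  # all positions filled after 4 rounds
```

The sequential cost is now the number of rounds, not the number of tokens, which is where the latency win comes from.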
Section 07
Text-Conditioned Music Generation
MusicLM, MusicGen, and Stable Audio established the text-to-music paradigm — mapping natural language descriptions to high-fidelity audio through learned cross-modal alignment.
MusicLM (Google, 2023)
MusicLM (Agostinelli et al.) extended AudioLM to text conditioning by introducing MuLan — a joint music-text embedding model trained contrastively on 44 million music-text pairs (audio clips paired with associated text from the web). MuLan produces a 128-dimensional embedding from either audio or text that lives in a shared space, analogous to CLIP for images. During generation, the MuLan text embedding is prepended as a conditioning token to the AudioLM hierarchy, guiding each stage of LM generation without changing the architecture.
MusicLM could generate coherent 30-second clips from prompts like "a calming jazz piano trio with a steady walking bass" or "80s synth-pop, upbeat, drum machine." It also supported melody conditioning: hum or whistle a melody, encode its contour with a dedicated melody embedding model, and the model generates an arrangement in the described style that tracks the melodic contour. The paper introduced MusicCaps, a benchmark dataset of 5,521 music clips each annotated by musicians with detailed natural-language captions, which became the standard evaluation set for text-to-music systems.
MusicGen (Meta, 2023)
MusicGen (Copet et al.) simplified the multi-stage cascade by operating on EnCodec tokens with a single-stage language model using a novel codebook interleaving strategy. Rather than generating all $N_q$ codebook levels sequentially, MusicGen uses delay patterns: the $k$-th codebook's tokens at time $t$ are predicted alongside the first codebook's tokens at time $t + k$, flattened into a single interleaved stream. This eliminates the need for separate coarse and fine models while preserving the quality benefits of multi-codebook quantization.
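The delay pattern is easy to visualize as a grid transformation: codebook $k$ is shifted right by $k$ steps, so each autoregressive step emits one token per codebook while the coarse level always stays $k$ steps ahead of level $k$. A minimal sketch (using $-1$ as an illustrative pad symbol):

```python
import numpy as np

# MusicGen-style delay pattern: codebook k's token for frame t is emitted
# at step t + k. PAD (-1, an illustrative placeholder) fills the triangle.

def to_delay_pattern(codes):
    """codes: (n_q, T) token grid -> (n_q, T + n_q - 1) delayed grid."""
    n_q, T = codes.shape
    out = np.full((n_q, T + n_q - 1), -1)
    for k in range(n_q):
        out[k, k:k + T] = codes[k]       # shift codebook k right by k steps
    return out

codes = np.arange(8).reshape(4, 2)       # 4 codebooks, 2 frames
print(to_delay_pattern(codes))
```

At any column of the delayed grid, the model predicts codebook 0 for frame $t$ together with codebook $k$ for frame $t-k$, which is exactly the interleaving the text describes.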
Text conditioning uses a T5 text encoder to produce a sequence of text embeddings; these are injected via cross-attention into the transformer decoder. MusicGen was released in three sizes (300M, 1.5B, 3.3B parameters) trained on 20,000 hours of licensed music, including Shutterstock and Pond5 stock audio. The 3.3B model produces 30-second clips at 32 kHz and scores mean opinion scores significantly above baselines such as Riffusion and Moûsai in blind listening tests on MusicCaps.
Stable Audio and Flow-Matching Approaches
Stability AI's Stable Audio (Evans et al., 2024) took a different path: rather than an autoregressive LM, it uses a latent diffusion model over VAE-compressed audio latents, conditioned on CLAP text embeddings plus timing embeddings that specify the clip's start offset and total duration. Trained on 800,000 licensed tracks from AudioSparx, it generates up to 95-second stereo audio at 44.1 kHz. Stable Audio 2 moved to a diffusion-transformer backbone over a more compressed latent space, extending coherent generation toward full-length tracks and improving temporal coherence at long durations. Tango 2 (Majumder et al., 2024) further refined text-to-audio quality using direct preference optimization (DPO) on human-rated audio pairs, borrowing the RLHF-style alignment pipeline from LLM post-training.
Section 08
Diffusion-Based Audio Generation
Spectrogram diffusion and latent audio diffusion models offer an alternative to autoregressive token generation — trading sequential sampling for parallel denoising.
Riffusion
Riffusion (2022) demonstrated a remarkably simple idea: fine-tune a Stable Diffusion image model on mel spectrogram images, then convert generated spectrograms back to audio via Griffin-Lim or a neural vocoder. Because mel spectrograms are images in $[0,1]^{H \times W}$, the entire image-diffusion pipeline transfers directly. Riffusion could generate audio from text prompts using Stable Diffusion's CLIP conditioning, and produced surprisingly high-quality short clips. Its main limitation is short duration (5–10 seconds) and phase recovery artifacts.
AudioLDM and AudioLDM 2
Liu et al.'s AudioLDM (2023) trained a latent diffusion model over VAE-compressed mel spectrograms, conditioned on CLAP audio-language embeddings. The CLAP embedding serves as a shared representation space: during training the model conditions on the audio's own CLAP embedding, and at inference the text embedding is substituted, with classifier-free guidance controlling prompt adherence. AudioLDM 2 replaced direct CLAP conditioning with a GPT-2 model that translates text into sequences of AudioMAE features (a "language of audio"), improving compositional understanding (e.g., "a dog barking, then rain, then piano") and multi-event temporal ordering.
Moûsai
Moûsai (Schneider et al., 2023) applies a two-stage cascaded diffusion model directly on compressed audio: a diffusion prior in a compressed latent space produces a coarse representation, and a second diffusion model upsamples to full resolution. Training on 2,500 hours of music, Moûsai could generate 30-second stereo samples at 48 kHz, demonstrating that waveform diffusion — though slower than spectrogram-based approaches — produces the highest perceptual quality for music due to avoiding spectrogram phase estimation.
Autoregressive LM generation over codec tokens scales well with sequence length and achieves strong coherence due to explicit causal structure, but sampling is sequential and slow. Diffusion models generate all timesteps in parallel (in latent time), can be steered by classifier-free guidance with a single weight $\omega$, and generalize well to arbitrary-length outputs with tiling, but may lack the global structural coherence of LMs. Current best systems (Stable Audio 2, MusicGen) excel at different aspects; no single approach dominates across all metrics.
Section 09
Music Information Retrieval
Before machines could compose music, they had to learn to read it — extracting tempo, chords, keys, and structure from raw audio signal.
Music Information Retrieval (MIR) is the older sibling of music generation: the collection of tasks that extract structured information from audio. MIR results feed directly into generation (conditioning signals, evaluation metrics) and into applications like automatic DJ mixing, music recommendation, and score following.
Beat and Tempo Tracking
Beat tracking estimates the positions of musical beats — the underlying pulse listeners tap their foot to. Classical methods combine onset detection (spectral flux, high-frequency content) with dynamic programming over a tempo grid (Ellis's DP beat tracker) or probabilistic inference (the dynamic Bayesian network tracker in madmom). Deep learning approaches — notably BeatNet (Heydari et al., 2021) — use a convolutional-recurrent network over spectral features followed by a particle filter to track tempo and meter in real time. The madmom library (Böck et al.) provides the most widely used open-source implementations. Evaluation uses F-measure at ±70 ms tolerance and AMLt (allowing for tempo half/double errors).
Chord Recognition
Automatic chord recognition (ACR) maps audio to a sequence of chord labels (e.g., C:maj, G:min7). The standard pipeline extracts chroma features — a 12-dimensional vector representing energy in each pitch class collapsed across octaves — then classifies each frame. Deep architectures (BiLSTM-CRF, CNN-LSTM-CRF) trained on the McGill Billboard dataset achieve weighted chord symbol recall (WCSR) around 85% on the MIREX chord recognition task for major/minor vocabularies. BTC (Bi-directional Transformer for Chord recognition, Park et al., 2019) uses self-attention over chroma to capture long-range harmonic context and achieved state-of-the-art results on the Billboard dataset.
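The simplest frame classifier is template matching: correlate the chroma vector with binary triad templates in all twelve rotations. Deep ACR systems replace this scoring with a learned classifier, but the sketch shows what the chroma feature buys you:

```python
import numpy as np

# Template-matching chord recognizer: score a chroma vector against
# binary major/minor triad templates in all 12 rotations.

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
MAJOR = np.array([1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0], float)  # root, M3, P5
MINOR = np.array([1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0], float)  # root, m3, P5

def classify_chord(chroma):
    best, best_score = None, -np.inf
    for root in range(12):
        for name, tpl in (("maj", MAJOR), ("min", MINOR)):
            score = chroma @ np.roll(tpl, root)   # rotate template to this root
            if score > best_score:
                best, best_score = f"{NOTES[root]}:{name}", score
    return best

# A chroma vector with energy on C, E, G should decode as C major.
chroma = np.zeros(12)
chroma[[0, 4, 7]] = 1.0
print(classify_chord(chroma))  # C:maj
```

Real systems add temporal smoothing (HMM or CRF decoding over frames), which is precisely what the BiLSTM-CRF architectures above learn.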
Key Detection
Key detection maps a piece (or segment) to one of 24 major/minor keys. The classical Krumhansl-Schmuckler key-finding algorithm correlates the piece's pitch-class profile against 24 key templates derived from music-psychology experiments. Deep approaches (e.g. Korzeniowski & Widmer, 2017) use CNNs over chroma with global pooling and reach ~80% weighted accuracy on the GiantSteps key dataset, which covers a wide range of electronic music styles. Key detection accuracy degrades significantly for chromatic or atonal music.
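The classical algorithm fits in a dozen lines. This sketch uses the standard Krumhansl-Kessler probe-tone profiles and Pearson correlation against all 24 rotations; the input histogram is a made-up C-major example:

```python
import numpy as np

# Krumhansl-Schmuckler key finding: correlate the pitch-class histogram
# with the 24 rotated Krumhansl-Kessler key profiles.

MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def detect_key(pc_hist):
    scores = {}
    for root in range(12):
        for mode, profile in (("major", MAJOR), ("minor", MINOR)):
            r = np.corrcoef(pc_hist, np.roll(profile, root))[0, 1]
            scores[f"{NOTES[root]} {mode}"] = r
    return max(scores, key=scores.get)

# Pitch-class histogram of a C-major scale, weighted toward the tonic triad.
hist = np.array([4, 0, 2, 0, 3, 2, 0, 3, 0, 2, 0, 1], float)
print(detect_key(hist))  # C major
```

Chroma-based deep models effectively learn richer, data-driven versions of these templates.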
Structural Segmentation
Music structural analysis segments a piece into sections (intro, verse, chorus, bridge, outro) and labels repeated structures. The dominant approach uses self-similarity matrices (SSM): computing the pairwise cosine similarity of chroma or timbre features across time reveals repetitive block structure as off-diagonal stripes. Spectral clustering on the SSM (McFee & Ellis, 2014) then segments the piece. Deep approaches fine-tune audio transformers (MERT, MusicFM) on structure annotation datasets like RWC and SALAMI, achieving boundary detection F-measures near 0.60 at 3-second tolerance.
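A minimal boundary detector on the SSM is Foote's checkerboard-kernel novelty curve: slide a kernel that rewards within-block similarity and penalizes cross-block similarity along the diagonal; peaks mark section boundaries. The features here are random block-structured stand-ins for chroma:

```python
import numpy as np

# SSM + Foote checkerboard novelty: boundaries appear where the SSM
# switches between homogeneous blocks. Features are synthetic stand-ins.

def novelty_curve(features, kernel_size=8):
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-9)
    ssm = f @ f.T                                    # cosine self-similarity
    half = kernel_size // 2
    # Checkerboard kernel: +1 on the diagonal quadrants, -1 off-diagonal.
    sign = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((half, half)))
    nov = np.zeros(len(ssm))
    for t in range(half, len(ssm) - half):
        nov[t] = np.sum(sign * ssm[t - half:t + half, t - half:t + half])
    return nov

rng = np.random.default_rng(0)
a, b = rng.normal(size=12), rng.normal(size=12)
feats = np.array([a] * 20 + [b] * 20) + 0.1 * rng.normal(size=(40, 12))
nov = novelty_curve(feats)
print(int(np.argmax(nov)))  # boundary detected near frame 20
```

Spectral-clustering methods generalize this by treating the SSM as a graph affinity matrix and segmenting via its Laplacian eigenvectors.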
Automatic Music Transcription
Automatic music transcription (AMT) converts audio to a score or MIDI representation — the inverse of synthesis. Piano transcription is the most studied task. MT3 (Gardner et al., 2022) frames AMT as a sequence-to-sequence problem with a T5 encoder-decoder: audio spectrograms are encoded, and the decoder autoregressively generates MIDI event tokens. Trained on multi-instrument data, MT3 achieves piano note F-measure around 87% on MAPS and extends transcription to recordings with multiple simultaneous instruments, which earlier piano-centric systems could not handle. Omnizart provides open-source multi-instrument AMT with instrument-specific models.
| MIR Task | Key Dataset | Best Metric | Reference System |
|---|---|---|---|
| Beat tracking | SMC, HJDB | F1 ~0.90 (pop), ~0.72 (complex) | BeatNet, DBN |
| Chord recognition | Billboard, RWC | WCSR ~85% (maj/min) | BTC, CNN-LSTM-CRF |
| Key detection | GiantSteps, GiantSteps-MTG | WA ~80% | CNN + chroma |
| Structural segmentation | SALAMI, RWC | F1 ~0.60 @ 3 s | Spectral clustering + SSM |
| Piano transcription | MAPS, MAESTRO | F1 ~0.87 | MT3 |
Section 10
Evaluation
Evaluating generated music is genuinely hard — the metrics that correlate with human preference are expensive to collect, and automated proxies are imperfect surrogates.
Fréchet Audio Distance (FAD)
The Fréchet Audio Distance is the audio analogue of Fréchet Inception Distance (FID) for images. Audio clips are embedded with VGGish (a CNN trained on AudioSet), and the Fréchet distance between the Gaussian-fitted distributions of real and generated embeddings is computed:
$$\text{FAD} = \|\mu_r - \mu_g\|^2 + \text{tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r\Sigma_g\right)^{1/2}\right)$$Lower FAD indicates that generated audio has a distribution closer to real music. FAD is fast to compute and correlates moderately with human quality judgments on aggregate, but it is sensitive to the reference set and does not capture text-prompt adherence or long-range structural quality. Recent work by Gui et al. (2024) proposes FADTK with improved embeddings (MERT, CLAP) that correlate better with listening tests.
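Given embedding statistics, the formula above is a few lines of linear algebra. This NumPy-only sketch computes the matrix square root of $\Sigma_r\Sigma_g$ by eigendecomposition (the product of two SPD matrices has positive real eigenvalues); the "embeddings" are Gaussian stand-ins, not real VGGish features:

```python
import numpy as np

# FAD from embedding statistics. sqrt of Sigma_r @ Sigma_g is computed by
# eigendecomposition; embeddings here are synthetic stand-ins for VGGish.

def fad(mu_r, sigma_r, mu_g, sigma_g):
    diff = mu_r - mu_g
    w, v = np.linalg.eig(sigma_r @ sigma_g)
    sqrt_prod = (v * np.sqrt(w.astype(complex))) @ np.linalg.inv(v)
    return float(diff @ diff
                 + np.trace(sigma_r + sigma_g - 2 * sqrt_prod).real)

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 8))             # stand-in "real" embeddings
fake = rng.normal(loc=0.5, size=(1000, 8))    # same shape, shifted mean
stats = lambda x: (x.mean(axis=0), np.cov(x, rowvar=False))
score = fad(*stats(real), *stats(fake))
print(round(score, 3))  # ~ 8 * 0.5^2 = 2.0 for shifted, same-shape Gaussians
```

In practice, libraries such as FADTK wrap exactly this computation around a choice of embedding model, which is where most of the metric's behavior actually comes from.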
CLAP Score
For text-conditioned systems, the CLAP score measures semantic alignment: embed both the generated audio and the text prompt with a CLAP model, and compute cosine similarity. Higher CLAP scores indicate the audio content matches the prompt description. It is analogous to CLIP score for text-to-image. The limitation is that CLAP measures the coarse semantic similarity captured by the CLAP training distribution, missing stylistic and genre subtleties not well represented in the training data.
KL Divergence on PaSST Features
Following AudioLM and MusicGen, KL divergence between the label distributions predicted by a pre-trained audio classifier (PaSST or PANN) on real vs. generated audio measures distributional overlap at the category level. Lower KL$_{\text{passt}}$ indicates category-level similarity between generated and real music distributions.
Human Evaluation: MUSHRA and MOS
MUSHRA (MUlti Stimulus test with Hidden Reference and Anchor) is the gold-standard perceptual test for audio quality. Listeners rate multiple stimuli on a 0–100 scale relative to a hidden reference; anchors are included to calibrate ratings. MUSHRA is expensive but captures nuanced quality and artifact judgments. For music generation, MOS (Mean Opinion Score) over two dimensions — overall quality and text relevance — is commonly reported alongside automated metrics. MusicCaps listening tests (the benchmark from MusicLM) use a panel of professional musicians, making them among the most reliable but also most costly evaluations.
| Metric | Measures | Range | Direction |
|---|---|---|---|
| FAD (VGGish) | Distributional quality | 0 – ∞ | ↓ lower is better |
| CLAP Score | Text–audio alignment | −1 to 1 | ↑ higher is better |
| KL (PaSST) | Category distribution | 0 – ∞ | ↓ lower is better |
| MUSHRA | Perceptual quality vs. reference | 0 – 100 | ↑ higher is better |
| MOS quality | Absolute quality opinion | 1 – 5 | ↑ higher is better |
| MOS relevance | Prompt adherence opinion | 1 – 5 | ↑ higher is better |
No existing automated metric reliably captures long-range musical coherence, stylistic authenticity, or emotional effect — properties that human listeners notice immediately. FAD rewards staying close to the training distribution; CLAP score rewards surface-level keyword matching. Investing in listening studies with musically trained participants remains the only robust evaluation for these higher-level properties.
Section 11
Ethics, Copyright & Creative Practice
Music AI raises questions about whose creative labor trains these systems, who benefits, and what it means to compose in an age when models can imitate any style on demand.
Training Data and Copyright
Most large music generation models are trained on vast quantities of copyrighted recordings without explicit licensing. The legal status of this practice is actively contested. In the United States, several lawsuits (including cases against Suno and Udio filed in 2024 by major record labels) argue that training on copyrighted audio without a license constitutes infringement, regardless of whether individual training samples are reproduced in outputs. The fair use defense — commonly invoked by AI developers — is uncertain in this context, as courts have not yet ruled definitively on whether training-on-copyrighted-data constitutes transformative use.
Europe's AI Act and the text-and-data mining exception under the EU Copyright Directive require that training data be obtained from lawful sources and that rights-holders have a meaningful opt-out mechanism. Commercially released systems (MusicGen, Stable Audio) have moved toward licensed training sets (AudioSparx, Pond5, Shutterstock) specifically to reduce legal exposure, though the coverage and quality of these datasets differ substantially from internet-scraped alternatives.
Style Imitation and Artist Identity
Music AI enables on-demand stylistic imitation: a user can prompt for "a track in the style of [Artist]." While musical style is not copyrightable per se, the combination of voice timbre cloning, style replication, and commercial deployment raises serious concerns about artist identity and economic harm. Several platforms have implemented artist name filters; others argue that stylistic imitation is no different from a human musician learning from their influences. The tension is unresolved. The human professional music community — session musicians, composers, sound designers — is among the most economically exposed to automation in the creative sector.
Synthetic Media and Provenance
Deepfake audio — cloned voice performances attributed to real artists — presents a distinct and more immediate harm than style imitation. Several jurisdictions have passed or are considering legislation specifically targeting non-consensual synthetic voice replication. Technical responses include audio watermarking (imperceptible signals embedded during generation, detectable by a trained classifier; Meta's AudioSeal is a prominent example) and provenance metadata standards (C2PA, Content Authenticity Initiative) that attach cryptographically signed generation logs to audio files.
Music AI as Creative Tool
The productive framing for creative practitioners is music AI as an instrument rather than a replacement: a tool for rapid prototyping, for exploring harmonic or orchestral ideas, for generating reference tracks, or for creating stems that human producers then shape. MusicGen's open weights, Stable Audio's API, and symbolic tools such as the Anticipatory Music Transformer are already in active use in professional studios. The field is developing rapidly enough that the creative and economic equilibrium will shift substantially before any regulatory framework can fully adapt.
Music AI is technically impressive and moves fast. The community of practice — practitioners, researchers, musicians, and policymakers — is actively negotiating norms around attribution, licensing, and authorship that will shape which creative and economic configurations become stable. Staying technically literate about what these systems can and cannot do is a prerequisite for participating usefully in that negotiation.
Further Reading
Foundational Papers
- **Music Transformer: Generating Music with Long-Term Structure**. Introduces relative global attention for music sequence modeling, enabling long-range structural coherence. Essential: explains why relative position matters and provides the efficient skewing implementation.
- **SoundStream: An End-to-End Neural Audio Codec**. Introduces residual vector quantization for high-quality neural audio compression at 6 kbps. Foundation paper for all subsequent codec-based generation systems.
- **AudioLM: a Language Modeling Approach to Audio Generation**. Establishes the hierarchical semantic + acoustic token cascade for coherent audio generation without text supervision. The conceptual template for MusicLM and all subsequent LM-based audio generators.
- **Simple and Controllable Music Generation (MusicGen)**. Single-stage LM with delay-pattern codebook interleaving, trained on 20,000 hours of licensed music. Model weights released. The most widely used open text-to-music system; read alongside the EnCodec paper.
- **MusicLM: Generating Music From Text**. Text-conditioned music generation via MuLan cross-modal embeddings conditioning the AudioLM hierarchy. Introduces the MusicCaps evaluation benchmark. Canonical reference for the text-to-music paradigm.
Audio Codecs & Audio Language Models
- **High Fidelity Neural Audio Compression (EnCodec)**. EnCodec with streaming-mode encoder, multi-scale discriminator, and entropy coder. Open weights, widely used. Implementation reference for anyone building on top of codec-based generation.
- **SoundStorm: Efficient Parallel Audio Generation**. Replaces sequential coarse-to-fine acoustic modeling with parallel masked generation (MaskGIT-style), achieving near real-time audio synthesis. Key for understanding latency-efficient audio LM inference.
Diffusion & Flow-Matching Audio
- **AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining**. Latent diffusion over compressed mel spectrograms conditioned on GPT-2-generated audio language model representations. Best single paper covering the diffusion approach to open-domain audio and music generation.
- **Stable Audio Open**. Open-weight latent diffusion model for music and sound effects at 44.1 kHz stereo, trained on Creative Commons audio. Accessible starting point for hands-on experimentation with high-quality music diffusion.
Music Information Retrieval
- **Fundamentals of Music Processing**. Comprehensive textbook covering audio features, beat tracking, chord recognition, structure analysis, and MIR evaluation. The definitive reference for MIR; essential companion for the understanding side of music AI.
- **MT3: Multi-Task Multitrack Music Transcription**. T5-based sequence-to-sequence model for multi-instrument automatic transcription from audio to MIDI tokens. State-of-the-art open transcription; bridges MIR and symbolic generation.
Tools & Libraries
- **AudioCraft (Meta)**. PyTorch library containing MusicGen, AudioGen, and EnCodec with pretrained weights and generation utilities. The practical starting point for codec-based music generation; run MusicGen locally within minutes.
- **Librosa**. Python library for audio analysis: spectrograms, chroma, MFCC, onset detection, beat tracking, CQT. The de facto standard for MIR feature extraction. Use for preprocessing, feature extraction, and evaluation scaffolding in any music AI project.
- **Madmom**. Python library with state-of-the-art implementations of beat tracking, tempo estimation, onset detection, chord recognition, and key detection using deep neural networks and probabilistic models. The standard toolkit for rhythm and harmony MIR tasks.