Text-to-Speech & Voice Synthesis, the art of giving machines a voice.

Text-to-speech is the inverse of ASR: given a sequence of characters, produce a waveform a listener cannot distinguish from a human. The problem sounds straightforward but encompasses prosody, pronunciation, style, speaker identity, and the physics of the vocal tract — and the field has been through four architectural revolutions in under a decade, from concatenative and formant synthesis to WaveNet, Tacotron, FastSpeech, VITS, and now large language models generating discrete audio tokens. This chapter traces that arc, explains each approach's trade-offs, and shows where the technology stands today.

How to read this chapter

The chapter proceeds from fundamentals to state of the art. Sections 1–3 set up the problem: what TTS actually has to solve, how the classical pipeline is structured, and what the text front end (normalization, phonemization, prosody prediction) contributes. Sections 4–9 cover the neural acoustic model arc — WaveNet, Tacotron and its descendants, neural vocoders, non-autoregressive FastSpeech, and the end-to-end VITS family. Sections 10–15 cover the surrounding ecosystem: voice cloning, expressive synthesis, multilingual TTS, the VALL-E-style codec language model approach, evaluation methodology, and the deployment picture.

Notation: waveforms are sequences of samples $x_t$; mel-spectrograms are matrices $M \in \mathbb{R}^{T \times F}$ where $T$ is frames and $F$ is mel bins; text inputs are token sequences. The prerequisite is Chapter 01 (Audio Signal Processing) — particularly the mel-spectrogram and vocoder concepts. Familiarity with attention mechanisms (Part V, Chapter 06) and autoregressive models helps in Sections 6 and 13.

Contents

  1. What does TTS actually need to do? – The problem, the components, the evaluation criteria
  2. Classical synthesis: concatenative and formant – Unit selection, formant synthesis, the pre-neural baseline
  3. The text front end – Normalization, grapheme-to-phoneme, stress, prosody prediction
  4. WaveNet – Dilated causal convolutions, autoregressive waveform modelling, 2016
  5. Tacotron and the seq2seq acoustic model – Attention over characters, mel prediction, Tacotron 2
  6. Neural vocoders – WaveGlow, MelGAN, HiFi-GAN, BigVGAN — mel-to-waveform
  7. FastSpeech and non-autoregressive TTS – Duration prediction, parallel decoding, 50× speedup
  8. VITS and end-to-end TTS – VAE + normalizing flow + GAN, jointly trained acoustic model and vocoder
  9. Voice cloning and speaker adaptation – Speaker embeddings, zero-shot cloning, SV2TTS, YourTTS, XTTS
  10. Expressive TTS and prosody control – Style tokens, GST, pitch and energy conditioning, emotion transfer
  11. Multilingual and cross-lingual TTS – Shared phoneme sets, language embeddings, low-resource approaches
  12. Neural codec language models: VALL-E and beyond – RVQ tokens, in-context speaker cloning, SoundStorm, VoiceCraft
  13. Evaluation – MOS, UTMOS, intelligibility, speaker similarity, naturalness vs diversity
  14. Deployment and operational considerations – Latency, streaming, real-time factor, Coqui, OpenTTS, cloud APIs
  15. TTS in the broader ML ecosystem – Connections to ASR, speaker recognition, LLMs, multimodal models

Section 01

What does TTS actually need to do?

Converting text to speech sounds like a lookup problem — match phonemes to sounds — but the perceptual reality is far richer. A system that merely gets phonemes right will sound robotic; a system that sounds natural must also handle prosody, rhythm, speaker identity, style, and the acoustic physics of a human vocal tract.

The goal of text-to-speech synthesis (TTS) is to generate a waveform $\hat{x}$ from a text input $t$ such that a listener cannot reliably distinguish $\hat{x}$ from a natural recording of the same content. That simple criterion hides an enormous scope of sub-problems, and it is worth enumerating them before diving into any particular architecture.

The three-layer decomposition

Classical TTS decomposes the problem into three layers. The text front end converts raw text into a linguistic specification: normalize numerals and abbreviations, predict pronunciation, assign word stress, and estimate a prosody target (which words are emphasized, how the pitch should rise and fall, where pauses should land). The acoustic model converts that linguistic specification into an acoustic representation — classically a sequence of mel-spectrogram frames. The vocoder converts the acoustic representation back into a time-domain waveform. Each layer has been revolutionized independently by deep learning.

The classical three-stage TTS pipeline. Modern end-to-end systems collapse some or all of these stages into a single model.

Why naturalness is hard

Human speech is not simply phonemes strung together. The same phoneme sounds different depending on its neighbors (coarticulation), its position in a word, the word's stress, and the speaker's intent. A sentence spoken as a question, a statement, or a command carries different prosody on top of identical words. Emotional state, speaking rate, vocal effort, and individual voice quality further modulate the signal. A TTS system that ignores these factors will be intelligible but obviously synthetic — it passes the transcription test but fails the listening test.

Key idea

The gap between intelligible and natural is the central challenge of TTS. Intelligibility (can a listener transcribe the words?) has been effectively solved since the 1980s. Naturalness (does it sound human?) has approached human parity only with neural models, from roughly 2016–2018 onward, and even then only for specific speakers and styles.

The evaluation criteria triangle

TTS systems are evaluated along three axes that often trade off against each other. Naturalness measures how human-like the output sounds, typically via Mean Opinion Score (MOS) from human raters. Intelligibility measures the word-error rate of a reference ASR system on the TTS output. Speaker similarity measures how well the generated voice matches a target speaker, scored via a pre-trained speaker verification model. End-to-end systems trained on large datasets tend to maximize naturalness; streaming systems on edge devices trade naturalness for low latency; voice-cloning systems prioritize similarity.

Section 02

Classical synthesis: concatenative and formant

Before neural networks, two paradigms dominated: rule-based formant synthesis, which models the vocal tract with explicit filters, and unit-selection synthesis, which stitches together recordings of a real speaker. Understanding them clarifies what neural models actually improved.

Formant synthesis

The human voice is produced by glottal airflow shaped by the vocal tract — a tube of varying cross-section. Vowels and consonants correspond to different resonant frequencies (formants) of this tube. Formant synthesizers model this with a bank of resonance filters: a cascade or parallel configuration of second-order IIR filters, each tuned to one formant. Given control parameters for the fundamental frequency $F_0$, formant frequencies $F_1, F_2, F_3$, and bandwidths, the synthesizer produces an intelligible signal from first principles. The Klatt synthesizer and the DECtalk system of the 1980s are the canonical examples of this approach.

Formant synthesis is fully parametric, extremely fast, and requires no audio data from a real speaker — it can be deployed on a microcontroller. But the output is unmistakably robotic: mapping text to formant trajectories requires an intricate set of hand-crafted rules, and the synthesis model captures only a coarse approximation of real vocal tract dynamics.
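The resonance filter at the heart of this approach is compact enough to sketch directly. Below is a minimal numpy version: three cascaded second-order resonators applied to an impulse-train glottal source. The coefficient normalization and the /a/ formant values are textbook approximations, not any real system's rules.

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """Second-order IIR resonance filter modelling a single formant."""
    r = np.exp(-np.pi * bw / fs)              # pole radius from bandwidth
    theta = 2 * np.pi * freq / fs             # pole angle from centre frequency
    a1, a2 = 2 * r * np.cos(theta), -r * r
    b0 = 1 - a1 - a2                          # normalize to unity gain at DC
    y = np.zeros_like(x)
    for n in range(len(x)):
        # y[-1], y[-2] read zeros before being written, so startup is clean
        y[n] = b0 * x[n] + a1 * y[n - 1] + a2 * y[n - 2]
    return y

fs = 16000
src = np.zeros(fs // 10)                      # 100 ms of signal
src[:: fs // 120] = 1.0                       # impulse train: F0 = 120 Hz
out = src
for f, bw in [(700, 80), (1220, 90), (2600, 120)]:   # rough formants of /a/
    out = resonator(out, f, bw, fs)
```

A full formant synthesizer adds time-varying formant tracks, noise sources for fricatives, and amplitude envelopes on top of this skeleton.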

Concatenative and unit-selection synthesis

Unit-selection synthesis avoids explicit acoustic modelling by recording a large amount of speech from a single speaker, segmenting it into units (typically diphones — pairs of adjacent phonemes spanning their boundary), and finding the best sequence of recorded units that covers the desired text. At synthesis time, the system searches the unit inventory to minimize a cost function balancing phonetic context similarity (target cost) and acoustic discontinuity at joins (join cost).
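The search itself is a shortest-path problem over the lattice of candidate units. A toy dynamic-programming version, with hypothetical per-slot candidates and cost functions standing in for real target and join costs:

```python
def select_units(target_costs, join_cost):
    """Min-total-cost unit sequence: target_costs[i] maps candidate unit ->
    target cost at slot i; join_cost(u, v) prices the acoustic join."""
    best = [{u: (c, None) for u, c in target_costs[0].items()}]
    for i in range(1, len(target_costs)):
        layer = {}
        for u, tc in target_costs[i].items():
            prev, cost = min(((p, best[i - 1][p][0] + join_cost(p, u))
                              for p in best[i - 1]), key=lambda t: t[1])
            layer[u] = (cost + tc, prev)
        best.append(layer)
    u = min(best[-1], key=lambda k: best[-1][k][0])   # cheapest final unit
    path = [u]
    for i in range(len(best) - 1, 0, -1):             # follow back-pointers
        u = best[i][u][1]
        path.append(u)
    return path[::-1]

# Two slots, two candidate units each; only the y -> q join is smooth
slots = [{"x": 0.0, "y": 0.0}, {"p": 0.0, "q": 0.0}]
smooth = lambda a, b: 0.0 if (a, b) == ("y", "q") else 5.0
```

Here `select_units(slots, smooth)` returns `["y", "q"]`: with equal target costs, the join cost dominates and the search routes through the smooth concatenation.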

Large unit-selection voices (from companies like Nuance, AT&T, SVOX) store tens of hours of speech and can sound remarkably natural within the style of the training recordings. The limitations are equally clear: the voice is fixed to the recording speaker, adding a new speaking style requires re-recording the corpus, and edge cases that fall outside the inventory produce audible glitches.

Statistical parametric synthesis (SPSS), pioneered by the HTS (HMM-based TTS) system, sits between the two extremes: a hidden Markov model generates acoustic parameters frame-by-frame, giving a compact voice model that generalizes better than unit selection but sounds smoother and less natural. SPSS was the state of the art circa 2010–2015 and remains the conceptual precursor to neural acoustic models.

Note

Unit selection remains competitive for certain deployed systems, particularly when the application domain matches the training recordings closely and low latency is paramount. Not every TTS problem requires a neural model.

Section 03

The text front end

Even the most powerful acoustic model fails if it receives bad input. The text front end converts raw orthographic text into a clean linguistic specification — a task that is both harder than it looks and more important than practitioners often admit.

Text normalization

Raw text contains tokens that cannot be read aloud without interpretation: numerals ("1,234" → "one thousand two hundred thirty-four"), currency ("$50" → "fifty dollars"), dates ("03/07" → "March seventh" or "the third of July" depending on locale), abbreviations ("Dr." → "Doctor" in titles, "Drive" in addresses), acronyms ("NASA" → "NASA" as a word vs "N.A.S.A." as initials), and URLs. Text normalization resolves these ambiguities into spoken-form words before phonemization. It is a classification or sequence-to-sequence problem in its own right and remains a source of hard-to-catch errors in production systems — a medical abbreviation misread aloud can be embarrassing or dangerous.
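A toy regex-based normalizer conveys the flavor. Real systems use weighted FSTs or neural seq2seq models; the number speller below covers only integers up to the thousands, and the token classes handled are a tiny subset.

```python
import re

ONES = ["zero","one","two","three","four","five","six","seven","eight","nine",
        "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen",
        "seventeen","eighteen","nineteen"]
TENS = ["","","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"]

def spell_int(n):
    """Spell a non-negative integer below one million as English words."""
    if n < 20: return ONES[n]
    if n < 100:
        t, o = divmod(n, 10)
        return TENS[t] + ("-" + ONES[o] if o else "")
    if n < 1000:
        h, r = divmod(n, 100)
        return ONES[h] + " hundred" + (" " + spell_int(r) if r else "")
    th, r = divmod(n, 1000)
    return spell_int(th) + " thousand" + (" " + spell_int(r) if r else "")

def normalize(text):
    # Currency first, so the digits are consumed with their unit
    text = re.sub(r"\$(\d+)", lambda m: spell_int(int(m.group(1))) + " dollars", text)
    # Then bare (possibly comma-grouped) numerals
    text = re.sub(r"\b(\d{1,3}(?:,\d{3})*)\b",
                  lambda m: spell_int(int(m.group(1).replace(",", ""))), text)
    return text

normalize("The fee is $50 for 1,234 items.")
# -> "The fee is fifty dollars for one thousand two hundred thirty-four items."
```

Note the ordering: currency must be expanded before bare numerals, or "$50" would lose its unit. Ordering interactions like this are exactly where production normalizers accumulate bugs.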

Grapheme-to-phoneme conversion

Once text is normalized, each word must be mapped to a sequence of phonemes — the abstract sound units of the language. In English this is non-trivial: "read" is /riːd/ in present tense and /rɛd/ in past, "bass" is /beɪs/ (music) or /bæs/ (fish), and "Colonel" has no phonetic relationship to its spelling. A grapheme-to-phoneme (G2P) model learns this mapping, traditionally from a pronunciation lexicon (the CMU Pronouncing Dictionary for English) supplemented by a sequence model for out-of-vocabulary words. Modern systems use character-level or subword-level sequence-to-sequence transformers for G2P, achieving phoneme accuracy above 95% on standard English benchmarks. For languages with shallow orthography (Finnish, Spanish, Italian), G2P is easier; for Chinese it requires word segmentation and polyphone disambiguation first.
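A toy lexicon-plus-fallback G2P illustrates the structure (the entries and the fallback are illustrative, not CMUdict's actual content):

```python
# ARPAbet-style pronunciations; homographs keep all variants.
LEXICON = {
    "read":   [["R", "IY1", "D"], ["R", "EH1", "D"]],
    "bass":   [["B", "EY1", "S"], ["B", "AE1", "S"]],
    "speech": [["S", "P", "IY1", "CH"]],
}

def g2p(word, variant=0):
    """Phonemes for `word`; `variant` picks among homograph pronunciations.
    Choosing the right variant from sentence context is a separate
    disambiguation task not modelled here."""
    prons = LEXICON.get(word.lower())
    if prons:
        return prons[min(variant, len(prons) - 1)]
    # Crude OOV fallback: one pseudo-phoneme per letter. A real system
    # backs off to a trained seq2seq G2P model at this point.
    return [ch.upper() for ch in word if ch.isalpha()]
```

The `variant` argument makes the homograph problem explicit: the lexicon can store both pronunciations of "read", but only a context model (part-of-speech tagging, or the acoustic model's own attention over words) can choose between them.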

Prosody prediction

Given a phoneme sequence, a prosody model predicts the target fundamental frequency ($F_0$) contour, phoneme durations, energy envelope, and pause structure. These targets become conditioning inputs or training targets for the acoustic model. Classical approaches used rule-based ToBI (Tone and Break Indices) annotation; modern approaches learn prosody directly from data, either as a separate module or jointly inside the acoustic model. Explicit prosody prediction is most important for expressive TTS — for neutral, single-speaker synthesis on simple text, current end-to-end systems learn reasonable prosody implicitly.

Section 04

WaveNet

In 2016, DeepMind published WaveNet — a deep autoregressive model that generates raw audio one sample at a time. Its output was instantly recognizable as a qualitative leap over everything that came before it.

WaveNet (van den Oord et al., 2016) models the joint distribution of a waveform as a product of per-sample conditionals:

$$ p(x) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1}) $$

Each conditional is predicted by a deep neural network conditioned on all previous samples. Rather than an RNN (which would struggle with very long sequences), WaveNet uses dilated causal convolutions — 1D convolutions with a receptive field that grows exponentially with depth by skipping samples at increasing intervals.

Dilated causal convolutions in WaveNet. Each layer doubles the dilation, so the receptive field grows exponentially with depth while parameters grow only linearly.
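The exponential receptive-field growth is easy to verify numerically. A minimal kernel-size-2 dilated causal convolution in numpy, tracing an impulse through four layers with dilations 1, 2, 4, 8:

```python
import numpy as np

def dilated_causal_conv(x, w, dilation):
    """Kernel-size-2 causal conv: y[t] = w[0]*x[t - dilation] + w[1]*x[t]."""
    pad = np.concatenate([np.zeros(dilation), x])   # left-pad: no future leakage
    return w[0] * pad[:len(x)] + w[1] * pad[dilation:]

T = 32
x = np.zeros(T); x[0] = 1.0          # impulse: trace which outputs it reaches
w = np.array([1.0, 1.0])
y = x
for d in [1, 2, 4, 8]:               # dilation doubles at each layer
    y = dilated_causal_conv(y, w, d)
# Receptive field = 1 + (1 + 2 + 4 + 8) = 16: the impulse reaches t = 0..15
```

Four layers of two-tap filters already cover 16 samples; thirty layers arranged in repeated dilation cycles cover the thousands of samples WaveNet needs.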

The causal constraint means each $x_t$ sees only past samples. Dilation lets a shallow network have a very wide effective receptive field — a key requirement for audio, where correlations span thousands of samples. Audio samples are µ-law quantized to 256 categories and predicted as a categorical distribution, so cross-entropy loss applies directly.
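The µ-law companding step is itself a two-line transform. A sketch of the encode/decode pair that turns continuous samples into 256 classes for the categorical output:

```python
import numpy as np

MU = 255  # 8-bit mu-law, as in the WaveNet paper

def mulaw_encode(x):
    """x in [-1, 1] -> integer bin in [0, 255]."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)   # round to nearest bin

def mulaw_decode(q):
    """Inverse: bin index -> approximate sample value."""
    y = 2 * (q.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU
```

The logarithmic warp allocates most of the 256 bins to small amplitudes, matching both speech statistics and human loudness perception, which is why 8-bit µ-law sounds far better than 8-bit linear quantization.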

For TTS (as opposed to unconditional audio generation), WaveNet is conditioned on a linguistic or acoustic feature sequence via upsampled local conditioning vectors $\mathbf{h}_t$ added to the activations at each layer:

$$ z = \tanh(W_f * x + V_f * \mathbf{h}) \odot \sigma(W_g * x + V_g * \mathbf{h}) $$

This gated activation unit is borrowed from PixelCNN. The resulting model produced audio that listeners rated far closer to natural speech than any previous system, cutting the MOS gap to human recordings by more than half. The catch: autoregressive generation at 24 kHz means producing 24,000 samples per second strictly sequentially, which is parallelizable only during training (teacher forcing), not at inference. Original WaveNet required on the order of 90 minutes of CPU time to generate one second of audio.

Key idea

WaveNet proved that raw waveform autoregressive modelling could reach human-level naturalness. It did not provide a practical TTS system — the inference bottleneck took years of follow-up work to resolve (WaveRNN, Parallel WaveNet distillation, Parallel WaveGAN, and eventually GAN-based vocoders).

Section 05

Tacotron and the seq2seq acoustic model

WaveNet required hand-crafted linguistic features as input. Tacotron (Google, 2017) replaced the entire text front end with an end-to-end sequence-to-sequence model that learns to attend over characters and emit mel-spectrogram frames directly.

Tacotron (Wang et al., 2017) frames TTS as a sequence-to-sequence task: the encoder reads a character sequence; the decoder autoregressively emits mel-spectrogram frames; attention aligns encoder outputs to decoder steps. The architecture combines a CBHG encoder (convolution bank + highway network + bidirectional GRU), a location-sensitive attention mechanism, and an RNN decoder that predicts multiple mel frames per step and emits a "stop token" to end generation.

Tacotron 2

Tacotron 2 (Shen et al., 2018) streamlined the architecture considerably: a simpler encoder (embedding + convolutions + bidirectional LSTM), location-sensitive attention, an LSTM decoder predicting 80-bin mel-spectrogram frames, and a separate WaveNet vocoder trained to invert the predicted mel. The result set a new naturalness record of 4.526 MOS vs 4.582 for human speech on a matched single-speaker corpus — the closest any system had come to human parity.

Tacotron 2 architecture: an encoder–attention–decoder stack predicts mel-spectrogram frames, which a separate vocoder converts to a waveform.

Tacotron-family models have two persistent weaknesses. First, because the decoder is autoregressive, generation speed scales linearly with output length — long utterances are slow. Second, attention can fail: on out-of-distribution text or unusual word sequences, the attention head sometimes repeats, skips, or loses alignment, producing doubled phonemes or garbled words. This robustness problem motivated the non-autoregressive approaches in Section 7.

Section 06

Neural vocoders

A vocoder converts an acoustic representation — usually a mel-spectrogram — back into a waveform. The neural vocoder is the component that most determines perceived quality, and it has undergone the most dramatic evolution: from slow autoregressive models to fast GAN-based generators that run in real time.

Autoregressive vocoders: WaveNet and WaveRNN

Original WaveNet vocoders produce high-quality audio but are slow. WaveRNN (Kalchbrenner et al., 2018) replaces dilated convolutions with a single-layer RNN and achieves near-real-time generation on GPU by splitting each 16-bit sample into coarse and fine 8-bit halves predicted with a dual softmax. Parallel WaveNet uses knowledge distillation from a teacher WaveNet into a flow-based student that generates all samples in parallel, recovering speed at a small quality cost; Parallel WaveGAN instead trains a compact non-autoregressive WaveNet directly with adversarial and multi-resolution STFT losses, with no distillation step.

Flow-based vocoders: WaveGlow

WaveGlow (Prenger et al., 2019) is a normalizing flow trained to transform a Gaussian noise vector into an audio waveform conditioned on a mel-spectrogram. The flow is invertible, so training maximizes the exact log-likelihood. Generation is fully parallel: sample noise, run the inverse flow, get audio. WaveGlow produces quality competitive with WaveNet at a fraction of the latency.

GAN vocoders: MelGAN, HiFi-GAN, BigVGAN

GAN-based vocoders proved to be the best practical trade-off. MelGAN (Kumar et al., 2019) uses a generator with residual dilated convolutions and multiple discriminators operating at different temporal scales, achieving real-time generation. HiFi-GAN (Kong et al., 2020) improved further: its generator uses multi-receptive-field fusion (MRF) blocks combining dilated convolutions with different dilation rates, and its discriminator adds a multi-period discriminator (MPD) that assesses waveform patterns at different periodicities. HiFi-GAN became the de facto standard vocoder for most TTS research after 2020.

BigVGAN (Lee et al., 2022) scales HiFi-GAN — larger model, anti-aliased activations (snake functions instead of LeakyReLU), and training on diverse speech data — to achieve near-human MOS scores even on unseen speakers and recording conditions. It is the current reference vocoder for high-quality offline TTS.

Intuition

A GAN vocoder's discriminator is asking: "does this waveform look like real audio at multiple timescales and periodicities?" The MPD in HiFi-GAN captures pitch-periodicity structure (period 2, 3, 5, 7, 11) while the multi-scale discriminator captures timbral texture at different resolutions. Together they enforce that both fine phonetic detail and long-range prosodic structure are realistic.
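The "view at a period" operation in the MPD is just a reshape of the waveform into a 2-D grid, so that a 2-D convolution sees every p-th sample aligned in a column. A sketch (the zero-padding scheme here is illustrative):

```python
import numpy as np

def to_period_2d(x, period):
    """Reshape a 1-D waveform to (T // period, period) for period-p analysis."""
    T = len(x)
    if T % period:                       # pad so the length divides the period
        x = np.pad(x, (0, period - T % period))
    return x.reshape(-1, period)

x = np.arange(10, dtype=float)
grid = to_period_2d(x, 3)                # shape (4, 3), two zeros of padding
```

For period 3, column 0 holds samples 0, 3, 6, 9, so any structure repeating at that lag lines up vertically and becomes visible to ordinary 2-D convolutions. HiFi-GAN uses prime periods (2, 3, 5, 7, 11) so the views overlap as little as possible.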

Diffusion vocoders

DiffWave and PriorGrad apply denoising diffusion to waveform generation, conditioned on a mel-spectrogram. Diffusion vocoders match or exceed GAN vocoder quality in controlled evaluations but are slower at inference (requiring many denoising steps), making them a quality ceiling benchmark rather than a practical deployment choice.

Section 07

FastSpeech and non-autoregressive TTS

Tacotron is slow and occasionally misaligns. FastSpeech (2019) eliminates both problems by predicting phoneme durations explicitly and running the acoustic decoder entirely in parallel.

FastSpeech (Ren et al., 2019) replaces the attention-based autoregressive decoder with a two-stage transformer. The encoder produces one hidden vector per input phoneme. A length regulator replicates each phoneme vector by its predicted duration, expanding the sequence to match the mel-spectrogram length. The decoder then maps the expanded sequence to mel-spectrogram frames in a single parallel forward pass.

$$ \text{Mel} = \text{Decoder}\left(\text{LR}\left(\text{Encoder}(\text{phonemes}),\ d\right)\right) $$

Duration targets for training are extracted from a pre-trained Tacotron model by reading off attention alignment. At inference, the duration predictor controls speech rate: multiplying predicted durations by a scalar slows down or speeds up the voice continuously, with no quality degradation — a feature that was impossible to achieve cleanly in attention-based models.
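The length regulator is a one-line tensor operation, and the rate control described above is a single scalar multiply. A numpy sketch:

```python
import numpy as np

def length_regulate(h, durations, rate=1.0):
    """h: (N, D) phoneme hiddens; durations: (N,) frames per phoneme.
    rate > 1 stretches durations (slower speech), rate < 1 compresses."""
    d = np.maximum(1, np.round(np.asarray(durations) * rate)).astype(int)
    return np.repeat(h, d, axis=0)       # (sum(d), D): one row per mel frame

h = np.eye(3)                            # 3 phonemes, one-hot for illustration
out = length_regulate(h, [2, 1, 3])              # 6 frames
slow = length_regulate(h, [2, 1, 3], rate=2.0)   # 12 frames, half speed
```

Because alignment is now an explicit integer sequence rather than a learned attention map, the decoder's input length is fixed before it runs, which is what makes fully parallel decoding possible.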

FastSpeech 2 and related work

FastSpeech 2 (Ren et al., 2021) adds predictors for pitch and energy alongside duration, conditioning the decoder on all three. Pitch is extracted per frame with a pitch tracker and modelled in the continuous-wavelet-transform domain for smoother prediction; energy is the per-frame RMS amplitude. These explicit conditioning variables make prosody more controllable and improve naturalness. The same architecture with minor modifications was adopted by dozens of follow-on systems (LightSpeech, PortaSpeech, SpeedySpeech, NaturalSpeech).

FastSpeech 2: encoder → variance adaptor (duration, pitch, energy predictors) → parallel decoder → vocoder. No autoregressive loop; all mel frames generated simultaneously.

Non-autoregressive TTS trades some naturalness for speed and robustness. The parallel decoder is 50–200× faster than Tacotron 2 at inference. Attention failures are impossible because alignment is handled by the explicit duration predictor. The residual quality gap relative to autoregressive models narrowed substantially with FastSpeech 2 and essentially vanished with VITS (Section 8) and codec LM approaches (Section 12).

Section 08

VITS and end-to-end TTS

FastSpeech and Tacotron both treat acoustic modelling and vocoding as separate stages. VITS (2021) trains a VAE encoder, a normalizing flow, and a GAN vocoder end-to-end in a single model, eliminating the two-stage pipeline and its error-propagation problem.

VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech, Kim et al., 2021) combines three components into one jointly trained system. A posterior encoder takes a waveform and encodes it to a latent variable $z$ via a VAE. A prior conditioned on text and duration predicts the distribution of $z$ given the text, using a normalizing flow for expressiveness. A GAN decoder (HiFi-GAN generator) decodes $z$ directly to a waveform. The training objective is:

$$ \mathcal{L} = \mathcal{L}_{\text{recon}} + \mathcal{L}_{\text{KL}} + \mathcal{L}_{\text{dur}} + \mathcal{L}_{\text{adv}} $$

where the reconstruction loss compares mel-spectrograms, the KL loss aligns posterior and prior, the duration loss supervises the length regulator, and the adversarial loss pushes the generated waveform toward the real distribution. At inference only the prior and decoder are used: sample $z$ from the text-conditioned prior via the flow, decode to a waveform.

VITS surpassed Tacotron 2 + HiFi-GAN on MOS while eliminating the separate vocoder. Because the entire pipeline is differentiable end-to-end, errors in mel prediction cannot cascade into vocoder artifacts — the decoder sees the true latent representation. VITS also supports variable-rate synthesis via duration predictor scaling.

VITS 2 and descendants

VITS 2 (Kong et al., 2023) improves duration modelling with a transformer-based duration predictor and adapts the architecture for multi-speaker training. NaturalSpeech 2 (Microsoft, 2023) replaces the GAN decoder with a latent diffusion model and trains on large diverse corpora, reporting naturalness on par with human recordings on held-out speakers. The VITS line of models — along with its direct derivatives YourTTS, Coqui TTS XTTS, and MMS-TTS — represents the current practical state of the art for non-codec TTS systems.

Section 09

Voice cloning and speaker adaptation

A TTS system trained on one speaker produces one voice. Voice cloning extends a model to reproduce an arbitrary speaker's voice from a short audio sample, ranging from a few seconds to a few hours of reference audio.

Speaker embeddings

The simplest multi-speaker TTS approach adds a speaker embedding — a learned vector per speaker — as a conditioning input to the acoustic model. Speaker embeddings are concatenated to encoder outputs or added to decoder inputs. Training on $N$ speakers produces $N$ embeddings; at inference the desired embedding is selected. This approach (used in Tacotron 2 with speaker embeddings, FastSpeech 2 multi-speaker) works well when the target speaker is in the training set.
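A minimal sketch of table-lookup speaker conditioning; the dimensions and variable names here are illustrative, not from any particular system:

```python
import numpy as np

def add_speaker(encoder_out, speaker_emb):
    """Broadcast one speaker embedding over all encoder frames.
    encoder_out: (T, D); speaker_emb: (S,) -> conditioned (T, D + S)."""
    tiled = np.tile(speaker_emb, (encoder_out.shape[0], 1))
    return np.concatenate([encoder_out, tiled], axis=1)

enc = np.random.randn(50, 256)           # 50 encoder frames, 256-dim
spk_table = np.random.randn(10, 64)      # learned table: 10 speakers, 64-dim
cond = add_speaker(enc, spk_table[3])    # select speaker 3 -> (50, 320)
```

In training, the table rows are ordinary learned parameters updated by backpropagation; switching speakers at inference is just indexing a different row.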

Speaker encoder and zero-shot cloning

Zero-shot voice cloning synthesizes a new speaker's voice from a short enrollment recording without any model retraining. The SV2TTS pipeline (Jia et al., 2018) adds a pre-trained speaker encoder (trained on a speaker verification task) that maps a reference audio clip to a fixed-dimension embedding. This embedding replaces the learned speaker embedding, conditioning both the Tacotron acoustic model and the WaveNet vocoder. At inference, providing a 5-second recording of any speaker produces an imitation of that voice. Quality correlates strongly with the representativeness of the reference — a 30-second sample is substantially better than a 5-second one.

Later systems (Meta Voicebox, Microsoft VALL-E, ElevenLabs, Coqui XTTS) improved zero-shot cloning dramatically. YourTTS adapts the VITS architecture for zero-shot cloning via a speaker encoder trained with angular prototypical loss. XTTS (Coqui, 2023) extends YourTTS with decoder cross-attention over the reference audio, achieving competitive naturalness and similarity with as few as 3 seconds of reference audio.

Warning

Voice cloning technology has serious misuse potential. Synthesizing someone's voice without consent for disinformation, fraud, or harassment is actively possible with open-source tools. Responsible deployment requires speaker consent verification, audio watermarking, and detection classifiers. Several jurisdictions are enacting laws specifically addressing voice-clone fraud.

Section 10

Expressive TTS and prosody control

Neutral, monotone TTS is no longer technically impressive. The frontier is expressive synthesis: systems that can speak with emotion, vary their style on demand, and transfer speaking style from a reference recording.

Human speech varies enormously in style, pace, loudness, and affect even when the words are identical. A news anchor reads with different prosody than a bedtime storyteller; an excited announcement sounds different from a hesitant apology. Capturing this variation requires conditioning the acoustic model on something beyond the text.

Global style tokens

Global Style Tokens (GST, Wang et al., 2018) add a bank of learned style embedding vectors to Tacotron. A reference audio encoder (style encoder) computes attention over the token bank, producing a style embedding that is added to the encoder output. At training time the reference is the target audio (so the model learns to ignore style in the text and encode it separately). At inference, style can be set by providing a reference recording or by manually setting the token weights. GST enables controllable speaking rate, stress, pitch height, and roughness without explicit labelling.

Pitch and energy conditioning

FastSpeech 2's explicit pitch and energy predictors enable a more direct route to expressiveness: provide explicit $F_0$ and energy contours as conditioning. Systems like ControlVAE and Prosody-TTS disentangle style into interpretable latent dimensions (speaking rate, pitch range, loudness), allowing fine-grained control via sliders or text prompts. Recent work (Suno, ElevenLabs, Hume AI) conditions expressive TTS on emotion labels or free-text descriptions ("speak with warmth and slight hesitation"), using a language model to map the description to a style embedding.

Section 11

Multilingual and cross-lingual TTS

Training a separate TTS model per language is expensive and produces poor quality for low-resource languages. Multilingual TTS shares a single model across languages, enabling cross-lingual voice transfer and quality on par with high-resource languages even for languages with limited data.

Multilingual TTS adds a language embedding alongside the speaker embedding. The phoneme inventory must be unified across languages: one approach uses International Phonetic Alphabet (IPA) symbols as a shared phoneme set; another uses byte-level or character-level inputs that require no language-specific preprocessing. The acoustic model learns language-specific phoneme realizations through the language conditioning while sharing all other parameters.

Cross-lingual voice transfer uses a multilingual model to synthesize speech in language B with a speaker's voice from language A. This is possible because speaker identity (encoded in the speaker embedding) and language (encoded in the language embedding or phoneme sequence) are disentangled in the model. The quality depends on phonetic overlap: transferring a native English voice to French is easier than to Mandarin, because French shares more phonemes with English.

Meta's MMS-TTS (Massively Multilingual Speech, 2023) trains a single VITS model on 1100+ languages using the New Testament recording corpus plus public-domain data, producing intelligible synthesis in languages that had no prior TTS system at all. At very low-resource regimes (minutes of training data), techniques from few-shot learning — model-agnostic meta-learning and adapter modules — allow rapid adaptation from a pre-trained multilingual backbone.

Section 12

Neural codec language models: VALL-E and beyond

The most recent architectural shift treats TTS as a language modelling problem over discrete audio tokens rather than a regression over spectrograms. VALL-E (2023) demonstrated that scaling this approach to 60,000 hours of data produces voice cloning from a 3-second prompt with no fine-tuning.

Residual vector quantization and audio tokens

Neural audio codecs (EnCodec, SoundStream, DAC — covered in Chapter 01) compress a waveform to a sequence of discrete tokens using residual vector quantization (RVQ). A waveform at 24 kHz becomes roughly 75–150 token positions per second, each position carrying a stack of 8 codebook indices. The first codebook captures coarse structure (speaker identity, prosody); later codebooks refine fine acoustic detail.
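The RVQ encode step can be sketched in a few lines. Codebooks here are random for illustration; a real codec learns them jointly with its encoder and decoder, so that each level genuinely refines the reconstruction.

```python
import numpy as np

def rvq_encode(x, codebooks):
    """x: (D,) latent vector; codebooks: list of (K, D) arrays.
    Each codebook quantizes the residual left by the previous one."""
    residual, indices = x.copy(), []
    recon = np.zeros_like(x)
    for cb in codebooks:
        i = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))  # nearest code
        indices.append(i)
        recon += cb[i]
        residual = residual - cb[i]
    return indices, recon

rng = np.random.default_rng(0)
x = rng.normal(size=16)
# 8 codebooks of 256 codes; shrinking scale mimics coarse-to-fine levels
books = [rng.normal(size=(256, 16)) * 0.5 ** level for level in range(8)]
idx, recon = rvq_encode(x, books)
```

The output is exactly the token layout VALL-E models: one index per codebook per frame, with level 1 carrying the coarse structure that the autoregressive stage predicts and levels 2–8 filled in non-autoregressively.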

VALL-E

VALL-E (Wang et al., 2023) treats TTS as a conditional codec language modelling task. Given a 3-second speech prompt $\mathbf{a}_{1:T}$ and a text transcript $c$, the model predicts the codec tokens $\hat{\mathbf{a}}_{T+1:T+L}$ that would continue the prompt's speaker voice while saying the desired text. The first RVQ codebook is predicted autoregressively with a decoder-only transformer (standard language model); the remaining 7 codebooks are predicted in parallel with a non-autoregressive model conditioned on both the text and the first codebook. Trained on 60,000 hours of diverse speech, VALL-E achieves speaker similarity scores that exceed the prior zero-shot state of the art (YourTTS) by a large margin.

Key idea

VALL-E treats the problem the same way GPT treats text: model the distribution over tokens, condition on a prompt, and sample. The prompt (3 seconds of the target voice) acts like an in-context example, and the model generalizes speaker identity to new text without any gradient update. This is in-context learning for audio.

VALL-E 2, VoiceCraft, SoundStorm, and others

VALL-E 2 (Chen et al., 2024) adds grouped codec language modelling and repetition-aware sampling to address the stability problems of VALL-E (repeated words, skipped phrases), claiming human parity on LibriSpeech. VoiceCraft (Peng et al., 2024) adapts a codec LM for speech editing as well as TTS, enabling word-level edits of existing recordings. SoundStorm (Google DeepMind, 2023) uses a masked language model (à la MaskGIT) to fill all RVQ codebooks in parallel, achieving real-time generation with quality comparable to VALL-E. Voicebox (Meta, 2023) applies flow-matching to audio token generation, unifying TTS, denoising, style transfer, and cross-lingual dubbing in one model.

The codec LM paradigm has two practical challenges compared to VITS-style systems. First, autoregressive token prediction is slow for long utterances; second, the discrete bottleneck can lose fine spectral detail unless the codec has enough codebooks. Both are active areas of research, and hybrid systems combining codec LMs with diffusion vocoders are emerging.

Section 13

Evaluation

TTS evaluation is harder than ASR evaluation. There is no single ground-truth output; quality is perceptual, multidimensional, and partially subjective. Understanding the standard metrics — and their limitations — is essential for comparing systems honestly.

Mean Opinion Score (MOS)

The canonical TTS metric is Mean Opinion Score: human raters listen to samples and rate naturalness on a 1–5 scale. The mean across raters gives the MOS. A score of 4.5 is generally considered human-level naturalness; in the Tacotron 2 evaluation, Tacotron 2 scored 4.53 against 4.58 for human recordings (the linguistic-feature WaveNet baseline scored 4.34). MOS is expensive (it requires large crowd-sourced rater panels) and suffers from listener inconsistency, ceiling effects near human parity, and systematic context biases (longer evaluation sessions make raters harsher).
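Because MOS is a sample mean over noisy raters, papers conventionally report it with a confidence interval; comparisons between systems whose intervals overlap are not meaningful. A minimal computation, using a normal approximation for the 95% interval:

```python
import math
import statistics as stats

def mos_with_ci(ratings, z=1.96):
    """Mean Opinion Score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = stats.fmean(ratings)
    half_width = z * stats.stdev(ratings) / math.sqrt(n)
    return mean, half_width

# Ten hypothetical 1-5 naturalness ratings for one system.
ratings = [5, 4, 4, 5, 3, 4, 5, 4, 4, 4]
mos, ci = mos_with_ci(ratings)
print(f"MOS = {mos:.2f} ± {ci:.2f}")  # MOS = 4.20 ± 0.39
```

Real evaluations aggregate over hundreds of ratings per system and balance utterances across raters; with only ten ratings, as here, the interval is far too wide to separate systems near the 4.5 ceiling.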

Automatic MOS prediction: UTMOS

UTMOS (Saeki et al., 2022) and related models train a neural network to predict human MOS from audio, using MOS-labelled recordings from previous listening tests as training signal. UTMOS correlates strongly with crowd-sourced MOS (Spearman $\rho \approx 0.95$ on VoiceMOS Challenge 2022 data) and can be run as a free, fast, reproducible proxy. It has become the standard automatic quality metric in TTS papers.

Intelligibility and speaker similarity

Word Error Rate (WER) measures intelligibility: pass TTS output through a reference ASR system (usually Whisper large-v2) and measure the WER on a standard test set. Systems with WER above 5% on simple utterances are considered to have intelligibility problems. For voice cloning, speaker similarity is measured by extracting speaker embeddings (ECAPA-TDNN or WavLM-SV) from both the synthesized and reference audio and computing cosine similarity.
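The speaker-similarity computation itself is a cosine between two embedding vectors. A minimal sketch (the embeddings would come from a speaker encoder such as ECAPA-TDNN; here they are just toy vectors):

```python
import numpy as np

def speaker_similarity(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings, e.g. ECAPA-TDNN
    vectors extracted from the reference and the synthesized audio."""
    a = np.asarray(emb_a, dtype=float)
    b = np.asarray(emb_b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: parallel embeddings score 1.0, orthogonal ones 0.0.
same = speaker_similarity([1.0, 0.0], [2.0, 0.0])
diff = speaker_similarity([1.0, 0.0], [0.0, 1.0])
print(same, diff)  # 1.0 0.0
```

In practice embeddings are a few hundred dimensions, and cloning papers report the mean similarity over a test set, often alongside the verification system's equal-error-rate threshold as a reference point.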

Prosody and naturalness decomposition

Beyond global MOS, researchers decompose naturalness into prosody quality ($F_0$ naturalness, duration naturalness), spectral quality (voice timbre, breathiness), and co-articulation quality (smoothness of phoneme transitions). Objective metrics for each exist but correlate imperfectly with perception. The gold standard remains a carefully designed listening test with multiple evaluation conditions balanced within subjects — time-consuming but necessary for fine-grained comparisons.

Section 14

Deployment and operational considerations

A TTS system that scores 4.5 MOS in an offline evaluation is not automatically deployable. Latency, streaming, memory footprint, and cost all impose constraints that reshape the architecture choices.

Latency and real-time factor

The real-time factor (RTF) is synthesis time divided by audio duration. RTF < 1 is necessary for interactive applications; RTF < 0.1 is desirable for conversational systems that must respond in under 200 ms. Non-autoregressive systems (FastSpeech 2 + HiFi-GAN) typically achieve RTF of 0.01–0.05 on a modern GPU. Autoregressive systems (Tacotron 2 + WaveNet) can be above RTF 1 without specialized hardware. Codec LM systems (VALL-E) depend on output length and token rate: at 75 tokens per second of autoregressive sampling, RTFs of 0.5–2 are typical for long utterances, motivating non-autoregressive decoders like SoundStorm.
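Measuring RTF is a one-liner worth getting right: time the synthesis call with a monotonic clock and divide by the duration of the audio produced. A minimal harness (the lambda stands in for a real `synthesize` function):

```python
import time

def real_time_factor(synthesize, text, sample_rate=22050):
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    t0 = time.perf_counter()
    waveform = synthesize(text)            # returns a 1-D sequence of samples
    elapsed = time.perf_counter() - t0
    duration_s = len(waveform) / sample_rate
    return elapsed / duration_s

# Toy synthesizer: emits 1 s of silence almost instantly, so RTF << 1.
rtf = real_time_factor(lambda text: [0.0] * 22050, "hello")
print(rtf < 1.0)  # True
```

For honest numbers, warm the model up first (the first call pays compilation and cache costs), average over many utterances, and report the hardware, batch size, and precision alongside the RTF.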

Streaming TTS

For voice assistants and interactive applications, synthesis must begin before the full text is known (text may still be streaming from an LLM) and audio must begin playing before the full waveform is generated. Streaming TTS requires the acoustic model to operate on partial sentences and the vocoder to produce audio in real-time chunks. HiFi-GAN is well-suited to streaming because it is locally conditioned — it can generate audio for a window of mel frames without seeing the full spectrogram. Autoregressive acoustic models (Tacotron) are poorly suited; non-autoregressive models (FastSpeech) are naturally streaming-compatible.
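The chunked-vocoding idea reduces to a generator that consumes mel frames as they arrive and yields playable audio immediately. A sketch under stated assumptions: `dummy_vocoder` is a hypothetical stand-in for a locally conditioned vocoder like HiFi-GAN, and the 256-sample hop mirrors a common mel/waveform upsampling ratio.

```python
import numpy as np

HOP = 256  # waveform samples per mel frame (the vocoder's upsampling factor)

def dummy_vocoder(mel_chunk):
    # Stand-in for a locally conditioned vocoder such as HiFi-GAN:
    # each mel frame maps to HOP waveform samples.
    return np.zeros(mel_chunk.shape[0] * HOP, dtype=np.float32)

def stream_tts(mel_frames, chunk_frames=32):
    """Yield playable audio chunks as mel frames become available."""
    for start in range(0, len(mel_frames), chunk_frames):
        chunk = mel_frames[start:start + chunk_frames]
        yield dummy_vocoder(chunk)   # playback can begin before the mel is complete

mel = np.zeros((100, 80), dtype=np.float32)      # 100 frames x 80 mel bins
audio = np.concatenate(list(stream_tts(mel)))
print(len(audio))  # 100 * 256 = 25600 samples
```

A real implementation also carries a few frames of overlap context between chunks and cross-fades at the boundaries; vocoding each chunk in total isolation, as this sketch does, produces audible seams.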

Open-source and commercial systems

Coqui TTS provides VITS, XTTS, and several multi-speaker models, and was the dominant open-source TTS toolkit as of 2024; the company behind it shut down in early 2024, but the toolkit and model checkpoints remain available on Hugging Face. PaddleSpeech (Baidu) and ESPnet-TTS cover a wider model zoo. Commercial APIs from ElevenLabs, Microsoft Azure Cognitive Services (Neural TTS), Google Cloud TTS, and Amazon Polly provide production-grade voice synthesis with per-character pricing and voice library access. OpenAI's TTS API (based on a Whisper-adjacent architecture) supports 6 voices and streams audio with low latency.

Section 15

TTS in the broader ML ecosystem

TTS does not exist in isolation. It is the output layer of voice assistants, the inverse model of ASR, a component of audio foundation models, and an increasingly central tool for training data augmentation and accessibility infrastructure.

The most direct connection is to automatic speech recognition (Chapter 02). TTS is the inverse function: ASR maps waveforms to text, TTS maps text to waveforms. This symmetry is useful practically — TTS can generate unlimited synthetic training data for ASR (a technique called TTS-augmentation), and ASR-based WER is a standard TTS evaluation metric. The two pipelines share front-end components: mel-spectrograms computed for TTS training use the same parameters as those in ASR pipelines.

The connection to speaker recognition (Chapter 04) runs through speaker embeddings. The same speaker encoder trained for verification — ECAPA-TDNN, WavLM-SV — is used to condition multi-speaker TTS and to evaluate voice cloning similarity. A voice cloning attack is, from one angle, a speaker verification attack.

The connection to large language models is deepening rapidly. Codec LM TTS (VALL-E, VoiceCraft) is architecturally identical to a text LM operating on a different vocabulary. GPT-4o and Gemini 2.0 integrate speech output natively into the generation pipeline, producing spoken responses without a separate TTS API call. The emerging pattern is a unified speech-language model where text tokens and audio tokens co-reside in a single autoregressive decoder — a full circle back to the language modelling framing, now applied to both modalities simultaneously.

For practitioners, TTS is often a component rather than a product: a voice assistant needs TTS for responses, an audiobook pipeline needs TTS for narration, an accessibility tool needs TTS for screen reading. In each case the requirements differ enough that no single architecture is universally best — which makes understanding the trade-offs across the WaveNet, Tacotron, FastSpeech, VITS, and codec LM families the practical payoff of this chapter.

Further reading

Foundational papers

Surveys and tutorials

Going deeper