Part I · Mathematical Foundations · Chapter 08

Signals, the language of data in motion.

Modern machine learning inherited an enormous amount of its vocabulary — convolution, filter, frequency, bandwidth, Fourier, aliasing — from an older engineering discipline built around one central question: how do you extract structured information from functions of time or space? This chapter develops signal processing from first principles: signals, linear time-invariant systems, convolution, the Fourier transform and its fast implementation, sampling and aliasing, filter design, the Z-transform, time-frequency analysis, wavelets, and random signals. Every tool here reappears verbatim in modern audio, image, and sequence models.

How to read this chapter

Sections build on one another, so read in order the first time through. Prose carries the exposition; equations make it precise; diagrams keep it visual. The chapter draws on linear algebra (Chapter 01), calculus (Chapter 02), probability (Chapter 04), and information theory (Chapter 06) — a first pass will make sense without them, but the connections click faster if those earlier chapters are fresh.

Notation: continuous-time signals are written $x(t)$ with $t \in \mathbb{R}$; discrete-time signals are written $x[n]$ with $n \in \mathbb{Z}$. Frequencies in Hertz use $f$; angular frequencies use $\omega = 2\pi f$. The Fourier transform is denoted $X(f)$ for continuous and $X(e^{j\omega})$ for the DTFT; the $N$-point DFT uses $X[k]$. The imaginary unit is $j = \sqrt{-1}$ (the engineering convention used by every standard signal-processing library). Convolution is $*$ and multiplication remains juxtaposition. Code names appear in monospace; key terms appear in a slightly different color when first defined.

Contents

  1. Why signal processing? – Motivation
  2. Signals, continuous and discrete – The raw material
  3. Linear time-invariant systems – The one structural assumption
  4. Convolution – LTI in one equation
  5. Complex exponentials and frequency – Why Fourier works
  6. The Fourier series – Periodic signals
  7. The continuous Fourier transform – Spectrum of any signal
  8. The DFT and the FFT – The algorithm that changed everything
  9. Sampling and aliasing – Nyquist–Shannon
  10. Filtering – FIR, IIR, filter design
  11. The Z-transform and stability – Poles and zeros
  12. Time-frequency: STFT, spectrograms, uncertainty – Non-stationary signals
  13. Wavelets and multiresolution – Adaptive windows
  14. Random signals and power spectra – Wiener–Khinchin
  15. Where it shows up in ML – Payoff
Section 01

Why signal processing?

Modern machine learning inherited an enormous amount of its vocabulary — convolution, filter, frequency, bandwidth, Fourier, aliasing — from an older engineering tradition. Knowing what those words meant in their original setting explains a great deal about why the modern tools work.

A signal is a function that carries information: a voltage over time, an audio waveform, a 2D image, a video, an EEG trace, a vibration sensor reading on a rotating shaft. Signal processing is the mathematics of extracting, transforming, and reasoning about such functions. It gives us a small number of remarkably powerful ideas — linearity, time-invariance, frequency decomposition, sampling, convolution — and a lot of machinery for deploying them.

Those ideas are inescapable in modern machine learning. Convolutional neural networks are literally named after convolution; audio pipelines from ASR to music generation start with spectrograms; the noise that diffusion models add and remove is naturally analysed through its frequency content; state-space models like S4 and Mamba are signal-processing constructs with trainable parameters; even attention can be read as a learned, content-adaptive filter. And the way we think about resolution, bandwidth, and the trade-off between time and frequency localisation came straight out of this field.

Key idea

Signal processing is the mathematics of functions as data. Where probability treats random variables and linear algebra treats vectors, signal processing treats functions of time (or space) as the first-class object — and develops a decomposition (Fourier), a composition (convolution), and a sampling theory that map almost perfectly onto modern deep learning.

The plan for this chapter: define what signals are, introduce the two key structural assumptions (linearity and time-invariance), build up convolution and Fourier analysis as the two sides of the same coin, sort out sampling and aliasing, cover filter design in enough detail to read a spectrogram, and end with the modern machine-learning applications that reuse this machinery verbatim.

Section 02

Signals, continuous and discrete.

Signals live in two worlds — the continuous world of physics and the discrete world of digital sampling — and almost every result in this chapter has a version in each.

A continuous-time signal is a function $x(t)$ where $t \in \mathbb{R}$. The sound pressure at your eardrum is a continuous signal. Voltage in a circuit is continuous. A photograph, as a pattern of light intensity, is a continuous 2D signal $x(u, v)$.

A discrete-time signal is a sequence $x[n]$ where $n \in \mathbb{Z}$. Your computer's audio file, EEG traces, stock closing prices, sensor readings — all discrete. In machine learning, the inputs to a model are almost always discrete signals, produced by sampling a continuous one.

Elementary signals

A handful of building blocks appear everywhere. The unit impulse $\delta[n]$ (discrete) is $1$ at $n = 0$ and $0$ everywhere else; its continuous counterpart is the Dirac delta $\delta(t)$. The unit step $u[n]$ or $u(t)$ is $0$ for $n < 0$ and $1$ for $n \geq 0$. The complex exponential $e^{j\omega t}$ — purely imaginary exponent — encodes a pure frequency and is the single most important signal in the chapter.

$$ e^{j\omega t} \;=\; \cos(\omega t) + j \sin(\omega t). $$

The identity is Euler's formula; the reason it matters is that complex exponentials are the eigenfunctions of linear time-invariant systems. Filtering a complex exponential produces a scaled complex exponential of the same frequency — the scalar is the filter's frequency response. Everything else in Fourier analysis is a consequence of that one fact.

Energy, power, and signal classes

Real signals fall into two categories. Finite-energy signals satisfy $\int |x(t)|^2 dt < \infty$; a short audio clip or a finite-support wavelet is in this class. Finite-power signals have finite average power but infinite energy; a steady sinusoid is the canonical example. The distinction matters because Fourier transforms behave differently on the two classes — transforms in the usual sense exist for finite-energy signals; the sinusoid's "transform" is a Dirac delta that only makes sense in a distributional setting.

A notational warning

Signal processing uses $j = \sqrt{-1}$ (the electrical engineering convention), not $i$ (the mathematician's convention). Both appear in ML papers, often on the same page. We follow the engineering convention throughout this chapter because it is the one the downstream libraries — NumPy, SciPy, PyTorch, librosa — all use in their documentation.

Section 03

Linear time-invariant systems.

Almost every tool in this chapter — convolution, the Fourier transform, transfer functions — exists because one particular class of systems is easy to analyse. That class is the LTI class, and understanding why it is special is the single most important idea in signal processing.

A system is an operator $T$ that maps an input signal $x$ to an output signal $y = T\{x\}$. The system is linear if

$$ T\{\alpha x_1 + \beta x_2\} \;=\; \alpha\, T\{x_1\} + \beta\, T\{x_2\}, $$

for any scalars $\alpha, \beta$ and inputs $x_1, x_2$. It is time-invariant if shifting the input shifts the output by the same amount: $T\{x(t - t_0)\} = y(t - t_0)$. A system that is both is a linear time-invariant (LTI) system.

The impulse response characterises everything

Here is the pivotal observation. Any discrete signal can be written as a sum of shifted, scaled impulses:

$$ x[n] \;=\; \sum_{k = -\infty}^{\infty} x[k]\, \delta[n - k]. $$

By linearity, the output of an LTI system applied to $x$ is the sum of the outputs applied to each impulse. By time-invariance, those outputs are shifted copies of a single response $h[n] = T\{\delta[n]\}$. Putting the two together:

$$ y[n] \;=\; \sum_{k = -\infty}^{\infty} x[k]\, h[n - k] \;=\; (x * h)[n]. $$

An LTI system is completely characterised by its response to a single impulse — the impulse response $h$. Knowing $h$, you can compute the output for any input. That is an enormous reduction in the amount of information you need.
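A quick numerical check of this reduction. The sketch below builds a hypothetical LTI system (a 3-tap moving average, implemented directly), measures its impulse response by feeding in $\delta[n]$, and confirms that convolving any input with that response reproduces the system's output:

```python
import numpy as np

# A hypothetical LTI system: a 3-tap moving average, implemented directly.
def system(x):
    x = np.asarray(x, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        for k in range(3):
            if n - k >= 0:
                y[n] += x[n - k] / 3.0
    return y

# Measure the impulse response: feed in a unit impulse.
impulse = np.zeros(8)
impulse[0] = 1.0
h = system(impulse)                      # [1/3, 1/3, 1/3, 0, ...]

# For any input, the direct output equals convolution with h.
x = np.array([1.0, 2.0, -1.0, 0.5, 3.0])
y_direct = system(x)
y_conv = np.convolve(x, h[:3])[:len(x)]
assert np.allclose(y_direct, y_conv)
```

The same check works for any LTI system: measure $h$ once, then convolution predicts everything.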

Why LTI rules the world

Not every system of interest is LTI. But many real systems are approximately LTI over regions of interest, and the LTI analysis is dramatically simpler than anything else on offer. The Fourier transform exists because LTI systems are diagonal in the frequency domain — in a basis we will build up over the next five sections.

The continuous version is identical in spirit: the impulse response is $h(t) = T\{\delta(t)\}$, the output is the convolution integral $y(t) = \int x(\tau) h(t - \tau) d\tau$, and the system is again completely characterised by $h$.

Section 04

Convolution.

The operator hidden inside every LTI system, every CNN layer, and every probability-density sum. Once you read the definition geometrically, it becomes the most visual idea in the chapter.

The convolution of two signals is

$$ (x * h)(t) \;=\; \int_{-\infty}^{\infty} x(\tau)\, h(t - \tau)\, d\tau, $$ $$ (x * h)[n] \;=\; \sum_{k = -\infty}^{\infty} x[k]\, h[n - k]. $$

Geometrically: flip $h$ in time (that is the $h(t - \tau)$ term, read as a function of $\tau$), slide it across $x$, and at each shift compute the inner product of $x$ and the flipped-shifted $h$. The result at shift $t$ is the output at time $t$.

Core properties

Convolution is commutative — $x * h = h * x$ — which means there is no privileged "signal" and "filter"; either can play either role. It is associative, so cascading two LTI filters is itself an LTI filter with impulse response $h_1 * h_2$. It is distributive over addition. Its identity is the Dirac delta: $x * \delta = x$. And — this is the important one — it corresponds to multiplication in the Fourier domain.

$$ \mathcal{F}\{x * h\} \;=\; X(f)\, H(f). $$

This is the convolution theorem, and it is the reason the Fourier transform matters. Convolutions cost $O(N^2)$ naively and $O(N \log N)$ via FFT; in deep learning at large kernel sizes, the FFT implementation of a convolution is routinely faster than the direct one.
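The theorem is easy to verify numerically: zero-pad both signals to the full output length, multiply their spectra, inverse-transform, and compare against the direct sum. A sketch with NumPy's FFT:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)
h = rng.standard_normal(32)

# Direct O(N^2) linear convolution.
y_direct = np.convolve(x, h)

# FFT route: pad to the full output length, multiply spectra, invert.
N = len(x) + len(h) - 1
y_fft = np.fft.irfft(np.fft.rfft(x, N) * np.fft.rfft(h, N), N)

assert np.allclose(y_direct, y_fft)
```

The zero-padding matters: without it, the FFT computes a *circular* convolution, which wraps the tail of the output back onto its head.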

Convolution in CNNs and in probability

In convolutional neural networks, the "convolution" is usually cross-correlation (convolution without the flip), but the name stuck. The structural point is the same: each layer's linear part is a shift-invariant filtering operation applied locally across the input. The learned kernels $h$ play the role of impulse responses — they are the filters the network has discovered are useful.

In probability, the density of a sum of two independent random variables is the convolution of their densities: $p_{X + Y} = p_X * p_Y$. This is not a coincidence. Convolution is what adding becomes when the objects you are adding are distributions (or, more generally, impulse-response characterisations of linear dynamics).

Section 05

Complex exponentials and frequency.

The miracle that makes Fourier analysis possible: complex exponentials pass through LTI systems unchanged in shape, only scaled in amplitude and shifted in phase.

Apply an LTI system with impulse response $h$ to the signal $x(t) = e^{j\omega t}$. The output is

$$ y(t) \;=\; \int h(\tau) e^{j\omega(t - \tau)} d\tau \;=\; e^{j\omega t} \int h(\tau) e^{-j\omega \tau} d\tau \;=\; H(\omega)\, e^{j\omega t}. $$

The output is the same complex exponential, scaled by a complex number $H(\omega) = \int h(\tau) e^{-j\omega \tau} d\tau$. The scalar depends on frequency; it is the frequency response of the filter. The complex exponential is an eigenfunction of the LTI operator, and $H(\omega)$ is the corresponding eigenvalue.

This is the signal-processing version of a fact from linear algebra: a linear operator on a vector space acts particularly simply when written in a basis of its eigenvectors. For an LTI system on signals, the eigenvectors are complex exponentials, and Fourier analysis is the change-of-basis into that diagonalising basis.

Magnitude and phase

Write $H(\omega) = |H(\omega)| e^{j\angle H(\omega)}$. The magnitude response $|H(\omega)|$ tells you how much each frequency is amplified or attenuated. The phase response $\angle H(\omega)$ tells you how much each frequency is delayed. A pure delay of $\tau$ seconds has $H(\omega) = e^{-j\omega\tau}$ — magnitude $1$ at every frequency, phase linear in frequency. This is why linear-phase filters are coveted: they delay everything by the same amount, preserving waveform shape.
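The pure-delay example can be checked directly: the DFT of an impulse shifted by $d$ samples has unit magnitude at every frequency and phase $-\omega d$, exactly linear:

```python
import numpy as np

# Impulse response of a pure 5-sample delay.
N, d = 64, 5
h = np.zeros(N)
h[d] = 1.0

H = np.fft.rfft(h)
omega = 2 * np.pi * np.arange(len(H)) / N

# Magnitude 1 at every frequency; phase -omega*d, linear in frequency.
assert np.allclose(np.abs(H), 1.0)
assert np.allclose(np.unwrap(np.angle(H)), -omega * d)
```

`np.unwrap` is needed because `np.angle` returns the phase wrapped into $(-\pi, \pi]$.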

Why sinusoids are the "right" basis

You could, in principle, decompose signals into any basis. Polynomial bases, wavelet bases, learned bases — all have their uses. What is special about the sinusoidal basis is that LTI operators are diagonal in it. Multiplication of signals in time becomes convolution in frequency; convolution in time becomes multiplication in frequency; differentiation becomes multiplication by $j\omega$. Every structural property of signals becomes algebraic in the right basis, and that basis is the complex exponentials.

Section 06

The Fourier series.

Periodic signals admit an exact decomposition into a sum of sinusoids at integer-multiple frequencies. The result is a finite-energy, discrete-spectrum picture that bridges the finite-dimensional case and the infinite one.

A continuous-time signal $x(t)$ with period $T$ admits the Fourier series:

$$ x(t) \;=\; \sum_{k = -\infty}^{\infty} c_k\, e^{j 2\pi k t / T}, \qquad c_k \;=\; \frac{1}{T} \int_0^T x(t)\, e^{-j 2\pi k t / T}\, dt. $$

The coefficients $c_k$ are the Fourier coefficients. They form a discrete spectrum: the signal has nonzero content only at frequencies $k / T$ for integer $k$. For real signals, $c_{-k} = \overline{c_k}$, so the coefficients at negative frequencies are redundant — a one-sided spectrum suffices.

Convergence and Parseval

The series converges in mean-square whenever $x$ is square-integrable over one period. Pointwise convergence is more delicate (the Gibbs phenomenon near discontinuities is a famous example) but mean-square is usually what you care about in practice.

Parseval's identity equates energy in the time and frequency domains:

$$ \frac{1}{T}\int_0^T |x(t)|^2\, dt \;=\; \sum_{k = -\infty}^{\infty} |c_k|^2. $$

This is the signal-processing analogue of the Pythagorean theorem in Hilbert space, and it is the reason you can reason about signal energy by staring at either the waveform or its spectrum — whichever is more convenient.
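A sketch that makes both ideas concrete: compute the Fourier coefficients of a $\pm 1$ square wave by numerical integration (a Riemann sum over one period), then check Parseval. The small residual is the energy in the harmonics beyond the truncation:

```python
import numpy as np

# One period of a +/-1 square wave (period T = 1), sampled finely.
M = 4096
t = np.arange(M) / M
x = np.where(t < 0.5, 1.0, -1.0)

# c_k = (1/T) * integral of x(t) e^{-j 2 pi k t / T} dt, via a Riemann sum.
ks = np.arange(-200, 201)
c = np.array([np.mean(x * np.exp(-2j * np.pi * k * t)) for k in ks])

# Only odd harmonics survive, with |c_k| = 2 / (pi |k|).
# Parseval: the mean power of x equals the sum of |c_k|^2.
power_time = np.mean(np.abs(x) ** 2)        # = 1 for a +/-1 square wave
power_freq = np.sum(np.abs(c) ** 2)
assert abs(power_time - power_freq) < 0.01
```

Widening the range of `ks` shrinks the residual, since each extra odd harmonic contributes $4/(\pi k)^2$ more of the unit power.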

Section 07

The continuous Fourier transform.

Extend the Fourier series from periodic signals to arbitrary finite-energy signals, and the sum becomes an integral. The spectrum goes from discrete to continuous, and the single most important decomposition in applied mathematics is complete.

For a continuous-time, finite-energy signal $x(t)$, the Fourier transform is

$$ X(f) \;=\; \int_{-\infty}^{\infty} x(t)\, e^{-j 2\pi f t}\, dt, $$

with inverse transform

$$ x(t) \;=\; \int_{-\infty}^{\infty} X(f)\, e^{j 2\pi f t}\, df. $$

Read $X(f)$ as "how much of frequency $f$ is present in $x$." The forward transform breaks a signal into sinusoidal components; the inverse reassembles it. Some authors use angular frequency $\omega = 2\pi f$ and pick up a $1/2\pi$ somewhere; both conventions are unavoidable and the literature is split roughly evenly.

Properties worth memorising

With $\longleftrightarrow$ denoting a Fourier transform pair:

$a\, x(t) + b\, y(t) \;\longleftrightarrow\; a\, X(f) + b\, Y(f)$ (linearity)
$x(t - t_0) \;\longleftrightarrow\; X(f)\, e^{-j 2\pi f t_0}$ (time shift)
$x(t)\, e^{j 2\pi f_0 t} \;\longleftrightarrow\; X(f - f_0)$ (modulation)
$x(at) \;\longleftrightarrow\; \tfrac{1}{|a|}\, X(f/a)$ (scaling)
$(x * h)(t) \;\longleftrightarrow\; X(f)\, H(f)$ (convolution)
$x(t)\, h(t) \;\longleftrightarrow\; (X * H)(f)$ (multiplication)
$\tfrac{d}{dt} x(t) \;\longleftrightarrow\; j 2\pi f\, X(f)$ (differentiation)
$\int_{\mathbb{R}} x(t)\, dt \;=\; X(0)$ (DC value)

Important pairs

A Gaussian transforms to a Gaussian — the single most important example in the book, since it explains why Gaussian smoothing is also Gaussian blurring in frequency. A rectangular pulse transforms to a sinc function, which is why brick-wall frequency cuts produce ringing in time. The Dirac delta transforms to a constant, and vice versa — the most localised signal in time has the least localised spectrum, and the other way around. This is the uncertainty principle we will see formalised in Section 12.

Parseval's identity, continuous version:

$$ \int_{-\infty}^{\infty} |x(t)|^2\, dt \;=\; \int_{-\infty}^{\infty} |X(f)|^2\, df. $$

Energy is conserved by the Fourier transform. It is a change of basis in Hilbert space, and orthonormal changes of basis preserve lengths.

Section 08

The DFT and the FFT.

The Fourier transform you actually run on a computer — on a finite sample of a signal — is the DFT. The algorithm that makes it practical is the FFT, and the speedup from $O(N^2)$ to $O(N \log N)$ is one of the most consequential algorithmic advances of the twentieth century.

Given an $N$-point discrete sequence $x[0], x[1], \dots, x[N-1]$, the Discrete Fourier Transform is

$$ X[k] \;=\; \sum_{n = 0}^{N-1} x[n]\, e^{-j 2\pi k n / N}, \qquad k = 0, 1, \dots, N - 1, $$

with inverse

$$ x[n] \;=\; \frac{1}{N} \sum_{k = 0}^{N-1} X[k]\, e^{j 2\pi k n / N}. $$

Read $X[k]$ as the content at discrete frequency $k / N$ cycles per sample, or equivalently $k \cdot f_s / N$ Hz if the sampling frequency is $f_s$. The DFT is implicitly periodic in both time and frequency: it treats the $N$ input samples as one period of an infinitely repeating signal.

The FFT in one sentence

The DFT as written is $O(N^2)$ to compute. The Fast Fourier Transform (Cooley-Tukey, 1965) exploits the structure of the twiddle factors $e^{-j 2\pi k n / N}$ to compute the same result in $O(N \log N)$ time, by recursively splitting an $N$-point DFT into two $N/2$-point DFTs. For $N = 10^6$, the speedup is roughly $50{,}000\times$, and essentially all practical frequency analysis is computed this way.
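The definition translates directly into code. A naive $O(N^2)$ implementation of the sum agrees with NumPy's FFT, which computes exactly the same quantity in $O(N \log N)$:

```python
import numpy as np

def dft_naive(x):
    """Direct O(N^2) evaluation of the DFT definition."""
    N = len(x)
    n = np.arange(N)
    W = np.exp(-2j * np.pi * np.outer(n, n) / N)   # twiddle-factor matrix
    return W @ x

rng = np.random.default_rng(1)
x = rng.standard_normal(128)

assert np.allclose(dft_naive(x), np.fft.fft(x))
```

Timing the two versions at growing $N$ is a worthwhile exercise: the gap between $N^2$ and $N \log N$ becomes unmissable well before $N = 10^4$.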

Why the FFT reshapes ML

The FFT turns $O(N^2)$ operations — convolutions, autocorrelations, cross-correlations — into $O(N \log N)$ ones by moving to the frequency domain, pointwise-multiplying, and inverse-transforming. In large-kernel CNNs, diffusion model score estimators, and any algorithm that needs long-range correlations in 1D or 2D data, an FFT-based implementation beats the direct one by orders of magnitude.

Leakage, windowing, zero-padding

Because the DFT assumes the finite block is one period of a periodic signal, discontinuities at the ends of the block cause spectral leakage — energy that should have been at one frequency gets smeared across many. Windowing — multiplying the signal by a smooth, tapering function (Hann, Hamming, Blackman, Kaiser) before transforming — controls leakage at the cost of frequency resolution. Zero-padding (appending zeros) interpolates the spectrum at finer frequency bins without changing its underlying information content. All three are standard operations in any spectrogram pipeline and are one-line calls in NumPy/SciPy.
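A minimal leakage demonstration, assuming a tone whose (hypothetical) frequency falls between DFT bins: the Hann window suppresses the energy smeared far from the true frequency by orders of magnitude relative to the rectangular (no-window) case:

```python
import numpy as np

fs, N = 1000.0, 256
t = np.arange(N) / fs
# A sinusoid whose frequency falls between DFT bins -> leakage.
x = np.sin(2 * np.pi * 102.3 * t)

spec_rect = np.abs(np.fft.rfft(x))                   # no window
spec_hann = np.abs(np.fft.rfft(x * np.hanning(N)))   # Hann window

# Compare energy far from the true frequency: windowing kills the leakage.
far = np.arange(len(spec_rect)) * fs / N > 300       # bins above 300 Hz
leak_rect = spec_rect[far].sum()
leak_hann = spec_hann[far].sum()
assert leak_hann < leak_rect / 10
```

Rectangular-window sidelobes fall off like $1/f$; the Hann window's fall off like $1/f^3$, which is why the far-field energy collapses.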

Section 09

The sampling theorem and aliasing.

Every continuous signal in the real world becomes a digital one by being sampled at discrete time points. When sampling is done correctly, no information is lost. When it is not, the result is aliasing — and the damage is irreversible.

The Nyquist-Shannon sampling theorem is the single most important practical result in this chapter. Suppose a continuous signal $x(t)$ has no frequency content above $f_{\max}$ — it is band-limited. If we sample it at a rate

$$ f_s \;>\; 2 f_{\max} \qquad (\text{where } 2 f_{\max} \text{ is the Nyquist rate}), $$

then $x(t)$ can be perfectly reconstructed from its samples by the sinc interpolation formula

$$ x(t) \;=\; \sum_n x[n]\, \operatorname{sinc}\!\left(\frac{t - n T_s}{T_s}\right), \qquad T_s = 1 / f_s. $$

Perfect means exactly equal, not approximately. No information is lost by sampling a band-limited signal at more than twice its maximum frequency. This is the result that made the entire digital audio, imaging, and telecommunications industry possible.
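A sketch of the reconstruction formula in action, with hypothetical parameters: sample a 7 Hz tone well above its Nyquist rate, then rebuild it on a fine grid from the samples alone. The only error comes from truncating the (infinite) sinc sum to a finite run of samples:

```python
import numpy as np

fs = 100.0                                  # sampling rate (Hz), > 2 * 7 Hz
Ts = 1.0 / fs
n = np.arange(-1000, 1001)                  # a long but finite run of samples
x_n = np.cos(2 * np.pi * 7.0 * n * Ts)      # band-limited: a single 7 Hz tone

# Rebuild the continuous signal on a fine grid via sinc interpolation.
t = np.linspace(-0.5, 0.5, 501)
x_rec = np.array([np.sum(x_n * np.sinc(tt / Ts - n)) for tt in t])
x_true = np.cos(2 * np.pi * 7.0 * t)

# Near-exact away from the ends of the finite sample run.
assert np.max(np.abs(x_rec - x_true)) < 1e-2
```

Note that `np.sinc` is the normalised sinc, $\sin(\pi x)/(\pi x)$, which is exactly the kernel in the formula above.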

Aliasing: what goes wrong when you undersample

If $f_s \leq 2 f_{\max}$, frequencies above $f_s / 2$ do not disappear — they fold into the low-frequency range and become indistinguishable from real low-frequency content. The high-frequency signal $\cos(2\pi f_0 t)$ sampled at rate $f_s < 2 f_0$ looks exactly like the lower-frequency signal $\cos(2\pi (f_s - f_0) t)$ — this is aliasing, and the two signals cannot be told apart from the samples alone.

(Figure: a high-frequency tone and its low-frequency alias passing through the same sample points; the samples are consistent with both.)

Once sampled, the two are identical. There is no algorithm, statistical or otherwise, that can separate them. The fix is to anti-alias filter — apply a low-pass filter before sampling, cutting off everything above $f_s / 2$ so that nothing aliases. This is why every modern ADC has an analogue low-pass filter in front of it, and every audio recording pipeline includes an anti-aliasing step as the very first thing.
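The fold-over is easy to demonstrate numerically: sampled at 100 Hz, a 70 Hz tone and its 30 Hz alias produce sample sequences that agree to machine precision:

```python
import numpy as np

fs = 100.0                        # sampling rate (Hz)
f0 = 70.0                         # above Nyquist: fs / 2 = 50 Hz
n = np.arange(64)

# Samples of the 70 Hz tone and of its 30 Hz alias (fs - f0 = 30 Hz):
x_high = np.cos(2 * np.pi * f0 * n / fs)
x_alias = np.cos(2 * np.pi * (fs - f0) * n / fs)

# Identical: no algorithm can tell them apart from the samples.
assert np.allclose(x_high, x_alias)
```
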

The CD standard, worked out

Human hearing tops out around $20$ kHz. The Compact Disc specification samples at $44.1$ kHz, just over twice that, leaving a narrow guard band for the anti-aliasing filter to roll off gracefully. The $44.1$ kHz number is not arbitrary: it sits just above the Nyquist rate for audible sound ($2 \times 20 = 40$ kHz), with an engineering margin for the filter's transition band. Every digital audio standard in use today is descended from this calculation.

Section 10

Filtering.

A filter is an LTI system whose frequency response shapes the spectrum in a useful way — passing some frequencies, suppressing others. It is the single most common signal-processing operation and the direct ancestor of every convolutional layer in deep learning.

Filters are classified by what they pass. A low-pass filter passes low frequencies and attenuates high ones; it smooths. A high-pass does the opposite; it sharpens. A band-pass keeps a middle range; a band-stop (or notch) removes one — the $60$ Hz hum filter on every power-line-adjacent recording is a band-stop.

An ideal filter has a brick-wall frequency response: $|H(f)| = 1$ in the passband, $0$ in the stopband, zero transition. It is unrealisable — its impulse response is an infinite sinc, which cannot be implemented in finite time and causes severe ringing when truncated. Real filter design is the art of trading off passband flatness, stopband attenuation, transition-band sharpness, group delay, and computational cost.

FIR vs. IIR

A finite impulse response (FIR) filter has $h[n]$ nonzero for only finitely many $n$. Its output is a weighted moving average:

$$ y[n] \;=\; \sum_{k = 0}^{M} b_k\, x[n - k]. $$

FIR filters are always stable, can be made exactly linear-phase (no waveform distortion), and are the standard tool for audio EQ, image smoothing, and anywhere phase behaviour matters. They are also what a "convolution layer" in a CNN actually is — a short FIR filter with learned taps $b_k$.

An infinite impulse response (IIR) filter has recursive feedback:

$$ y[n] \;=\; \sum_{k = 0}^{M} b_k\, x[n - k] \;-\; \sum_{k = 1}^{N} a_k\, y[n - k]. $$

IIR filters reach the same frequency response with far fewer coefficients than FIRs, at the cost of potential instability (poles outside the unit circle diverge) and non-linear phase. Classical Butterworth, Chebyshev, and elliptic filters are IIR, and they remain the default for anti-aliasing, anti-imaging, and any application where filter length matters.
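A sketch contrasting the two families: a one-pole low-pass IIR (feedback, with a geometrically decaying impulse response that never reaches exactly zero) against a 5-tap FIR moving average (finite support, settles exactly):

```python
import numpy as np

def iir_onepole(x, a=0.9):
    """y[n] = (1 - a) x[n] + a y[n-1]: a one-pole low-pass, pole at z = a."""
    y = np.zeros(len(x))
    prev = 0.0
    for n, xn in enumerate(x):
        prev = (1 - a) * xn + a * prev
        y[n] = prev
    return y

# IIR: the impulse response decays geometrically but never hits zero.
impulse = np.zeros(50)
impulse[0] = 1.0
h = iir_onepole(impulse)
assert np.allclose(h, 0.1 * 0.9 ** np.arange(50))

# FIR: a 5-tap moving average; a step input settles exactly at 1 after 5 taps.
h_fir = np.ones(5) / 5
y = np.convolve(np.ones(20), h_fir)[:20]
assert np.allclose(y[5:], 1.0)
```

One coefficient of feedback ($a$) buys the IIR filter an infinitely long memory; the FIR filter needs one tap per sample of memory it wants.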

A deep-learning link

A state-space model like S4, S5, or Mamba is, structurally, a learnable linear IIR filter with a diagonal-plus-low-rank state transition. The "state update" equation is exactly the feedback form above, generalised to matrix coefficients. When you read the Mamba paper and see $y_t = C h_t$, $h_t = A h_{t-1} + B x_t$, you are looking at an IIR filter with trained $A$, $B$, $C$.

Section 11

The Z-transform and stability.

For discrete signals, the Z-transform is what the Laplace transform is for continuous ones: it turns recursive difference equations into algebra, and it characterises filter stability in a single picture on the complex plane.

The Z-transform of a discrete-time sequence is

$$ X(z) \;=\; \sum_{n = -\infty}^{\infty} x[n]\, z^{-n}, $$

where $z$ is a complex variable. Restricting $z$ to the unit circle $z = e^{j\omega}$ recovers the discrete-time Fourier transform $X(e^{j\omega}) = \sum x[n] e^{-j\omega n}$. The Z-transform is the DTFT extended off the unit circle to the whole complex plane, and it converges only in certain regions (the region of convergence).

Poles, zeros, and stability

For a rational Z-transform (which is what every difference equation produces),

$$ H(z) \;=\; \frac{B(z)}{A(z)}, $$

the zeros are the roots of $B(z)$ and the poles are the roots of $A(z)$. An LTI system is stable (bounded input gives bounded output) if and only if all its poles lie strictly inside the unit circle.

This is the cleanest diagnostic in all of filter analysis. Want to know whether an IIR filter will blow up? Plot its poles. All inside the unit disk? Stable. One outside? Unstable. It is the discrete-time analogue of the left-half-plane criterion for Laplace transforms, and it is how every dynamics problem — including the stability of deep recurrent networks — gets analysed when the signal-processing lens is applied.
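The diagnostic in code, for a hypothetical second-order difference equation: the poles are the roots of $A(z)$, and simulating the recursion confirms what their magnitudes predict:

```python
import numpy as np

# Difference equation  y[n] = x[n] + 1.2 y[n-1] - 0.4 y[n-2]
# has denominator A(z) = 1 - 1.2 z^{-1} + 0.4 z^{-2}.
a = np.array([1.0, -1.2, 0.4])
poles = np.roots(a)                # roots of z^2 - 1.2 z + 0.4

assert np.all(np.abs(poles) < 1)   # all inside the unit circle: stable

# Verify by simulation: the impulse response decays toward zero.
y, y1, y2 = [], 0.0, 0.0
for n in range(200):
    yn = (1.0 if n == 0 else 0.0) + 1.2 * y1 - 0.4 * y2
    y2, y1 = y1, yn
    y.append(yn)
assert abs(y[-1]) < 1e-6
```

Here the poles are $0.6 \pm 0.2j$ with magnitude $\sqrt{0.4} \approx 0.63$, so the impulse response decays like $0.63^n$ — exactly what the simulation shows. Flip the sign of a coefficient to push a pole outside the unit circle and the same loop diverges.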

Section 12

Time-frequency analysis: STFT, spectrograms, and uncertainty.

Most real signals are non-stationary — their frequency content changes over time. The Fourier transform sees only the whole signal at once; to get a time-varying picture, we need to look at local pieces, and that introduces a fundamental trade-off.

The short-time Fourier transform (STFT) slides a window $w$ across the signal and takes the Fourier transform of each windowed segment:

$$ X(t, f) \;=\; \int_{-\infty}^{\infty} x(\tau)\, w(\tau - t)\, e^{-j 2\pi f \tau}\, d\tau. $$

The squared magnitude $|X(t, f)|^2$ is a spectrogram — a 2D image with time on one axis, frequency on the other, and colour indicating energy. Spectrograms are the single most common input representation for speech recognition, music analysis, bioacoustic monitoring, and any audio-ML pipeline.
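A minimal STFT built from first principles — Hann-windowed frames, one FFT per frame — applied to a hypothetical linear chirp. The spectrogram's ridge (the strongest bin in each frame) tracks the rising instantaneous frequency:

```python
import numpy as np

fs, dur = 1000.0, 2.0
t = np.arange(int(fs * dur)) / fs
# Linear chirp: instantaneous frequency sweeps 50 -> 250 Hz over 2 seconds.
x = np.cos(2 * np.pi * (50 * t + 50 * t ** 2))

# Minimal STFT: Hann-windowed frames, FFT per frame.
nfft, hop = 256, 64
win = np.hanning(nfft)
frames = [x[i:i + nfft] * win for i in range(0, len(x) - nfft, hop)]
S = np.abs(np.fft.rfft(frames, axis=1))      # spectrogram magnitudes

# The ridge (strongest bin per frame) rises over time.
ridge_hz = np.argmax(S, axis=1) * fs / nfft
assert ridge_hz[0] < 100 and ridge_hz[-1] > 200
```

Production pipelines would reach for `scipy.signal.stft` or a library front-end, but the computation is exactly this: window, transform, stack.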

The Heisenberg-Gabor uncertainty principle

Narrow window = good time resolution, poor frequency resolution. Wide window = good frequency resolution, poor time resolution. This is not a design choice; it is a theorem. For any signal of finite energy, if $\sigma_t$ measures the time-spread of $x$ and $\sigma_f$ the frequency-spread of $X$,

$$ \sigma_t \, \sigma_f \;\geq\; \frac{1}{4\pi}. $$

The Gaussian window achieves equality — it is the most "concentrated" window possible in both domains, and is the theoretical basis of the Gabor transform. This is the same mathematical uncertainty principle that appears in quantum mechanics; the identity is not a metaphor, it is the same inequality applied to different physical interpretations of the same Fourier pair.
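The bound can be checked numerically for a Gaussian pulse (hypothetical width $s$), which saturates it: computing the RMS spreads of $|x(t)|^2$ and $|X(f)|^2$ on a fine grid recovers $\sigma_t \sigma_f = 1/(4\pi)$:

```python
import numpy as np

# A Gaussian pulse of (hypothetical) width s, finely sampled.
s = 0.05
dt = 1e-4
t = np.arange(-2, 2, dt)
x = np.exp(-t ** 2 / (2 * s ** 2))

def spread(axis, density, step):
    """RMS width of a squared-magnitude distribution on a uniform grid."""
    p = density / (density.sum() * step)     # normalise to a pdf
    return np.sqrt(np.sum(axis ** 2 * p) * step)

sigma_t = spread(t, np.abs(x) ** 2, dt)

X = np.fft.fft(x) * dt                       # approximates the continuous FT
f = np.fft.fftfreq(len(t), dt)
sigma_f = spread(f, np.abs(X) ** 2, f[1] - f[0])

# The Gaussian saturates the bound: sigma_t * sigma_f = 1 / (4 pi).
assert abs(sigma_t * sigma_f - 1 / (4 * np.pi)) < 1e-4
```

Replacing the Gaussian with, say, a rectangular pulse makes the product strictly larger — a worthwhile experiment.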

Mel-spectrograms in ML

Almost every modern audio model — Whisper, AudioLM, MusicLM, ASR systems generally — operates on log-mel-spectrograms: STFT magnitudes, squared, mapped through a mel-scaled filter bank that mimics human hearing's log-frequency resolution, and log-compressed. The first layer of the model is a learned convolution on top of this fixed, hand-designed signal-processing representation. The front-end is a century of signal processing; the rest is a transformer.

Section 13

Wavelets and multiresolution.

The STFT uses the same window at every frequency, which is a blunt compromise. Wavelets use wider windows at low frequencies and narrower ones at high frequencies — a trade-off that matches the structure of most natural signals.

A wavelet is a short, oscillatory function $\psi(t)$ with zero mean. The continuous wavelet transform decomposes a signal using scaled and translated copies of $\psi$:

$$ W(a, b) \;=\; \frac{1}{\sqrt{|a|}} \int_{-\infty}^{\infty} x(t)\, \psi\!\left(\frac{t - b}{a}\right) dt. $$

The scale $a$ plays the role of an inverse frequency — large $a$ means a stretched, low-frequency wavelet; small $a$ a compressed, high-frequency one. The translation $b$ localises the analysis in time. The output is a 2D time-scale representation, similar in spirit to an STFT but with frequency-dependent window length.

Multiresolution analysis

The discrete wavelet transform (DWT) lives on dyadic scales — $a = 2^{-j}$, $b = k 2^{-j}$ — and Mallat's multiresolution framework gives it a fast, $O(N)$ implementation: successively split a signal into low- and high-frequency components, recurse on the low-frequency half, and stop when you have enough levels. The output is a small number of coefficients at each resolution level.
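A one-level Haar step makes the recursion concrete: split into coarse (low-pass) and detail (high-pass) halves, recurse on the coarse half, and reconstruct perfectly. This is a sketch; production code would use a wavelet library such as PyWavelets:

```python
import numpy as np

def haar_level(x):
    """One level of the Haar DWT: coarse and detail halves, downsampled by 2."""
    x = x.reshape(-1, 2)
    coarse = (x[:, 0] + x[:, 1]) / np.sqrt(2)
    detail = (x[:, 0] - x[:, 1]) / np.sqrt(2)
    return coarse, detail

def haar_inverse(coarse, detail):
    x = np.empty(2 * len(coarse))
    x[0::2] = (coarse + detail) / np.sqrt(2)
    x[1::2] = (coarse - detail) / np.sqrt(2)
    return x

rng = np.random.default_rng(2)
x = rng.standard_normal(16)

# Recurse on the coarse half, as in Mallat's algorithm (two levels here).
c1, d1 = haar_level(x)
c2, d2 = haar_level(c1)

# Perfect reconstruction, and energy is preserved across the decomposition.
x_rec = haar_inverse(haar_inverse(c2, d2), d1)
assert np.allclose(x_rec, x)
assert np.isclose(np.sum(x**2), np.sum(c2**2) + np.sum(d2**2) + np.sum(d1**2))
```

Each level halves the data it touches, so the full decomposition costs $O(N)$ — cheaper than even the FFT.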

The engineering payoff is image and signal compression: natural images concentrate most of their energy in a few large wavelet coefficients, with the rest quantisable or outright dropped. JPEG 2000 uses wavelets for exactly this reason; so does a long line of neuroscience, seismology, and time-series anomaly detection.

Where ML quietly reuses wavelets

Image pyramids, Laplacian pyramids in diffusion models, multi-scale feature maps in U-Nets and Feature Pyramid Networks — all are engineering-grade implementations of the same multi-resolution idea. The deeper architectural choice of "operate at multiple spatial scales" in modern CNNs is the wavelet idea, with the fixed basis replaced by learned filters.

Section 14

Random signals and power spectra.

When the signal itself is random — noise, a stochastic process, a random telegraph signal, financial returns — the Fourier transform in the usual sense need not exist. The right analogue is the power spectrum, and the bridge between time and frequency is the Wiener-Khinchin theorem.

A wide-sense stationary random process $X(t)$ has a constant mean and an autocorrelation that depends only on the time lag:

$$ R_X(\tau) \;=\; \mathbb{E}[X(t)\, X(t + \tau)]. $$

The Wiener-Khinchin theorem says that the Fourier transform of the autocorrelation is the power spectral density:

$$ S_X(f) \;=\; \int_{-\infty}^{\infty} R_X(\tau)\, e^{-j 2\pi f \tau}\, d\tau. $$

$S_X(f)$ tells you the average power of the process at frequency $f$. It is always real and non-negative, and it satisfies $\int S_X(f)\, df = R_X(0) = \mathbb{E}[X(t)^2]$ — total power is the zero-lag autocorrelation. This is the frequency-domain description of a process that has no individual Fourier transform but whose statistics are Fourier-transformable.
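A sample-level version of the theorem: for a finite sequence, the DFT of its circular autocorrelation equals its periodogram $|X[k]|^2 / N$, the standard estimator of the power spectral density:

```python
import numpy as np

rng = np.random.default_rng(3)
N = 512
x = rng.standard_normal(N)

# Circular autocorrelation of the sample, computed directly...
R = np.array([np.sum(x * np.roll(x, -tau)) for tau in range(N)]) / N

# ...transforms to the periodogram |X[k]|^2 / N (Wiener-Khinchin, sample form).
S_from_R = np.fft.fft(R).real
S_periodogram = np.abs(np.fft.fft(x)) ** 2 / N
assert np.allclose(S_from_R, S_periodogram)
```

In practice the raw periodogram is noisy; averaged variants (Bartlett, Welch) trade frequency resolution for variance reduction.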

White noise and coloured noise

White noise has flat power spectral density: $S_X(f) = \sigma^2$ at every frequency. Its autocorrelation is $\sigma^2 \delta(\tau)$ — uncorrelated across any nonzero lag. "White" is by analogy to white light; it contains all frequencies equally. Coloured noise (pink, brown, blue) has a spectrum that emphasises some frequencies; each colour encodes a different correlation structure, and each appears somewhere in time-series modelling.

Matched filters

Given a known signal buried in additive white noise, the filter that maximises output signal-to-noise ratio is the time-reversed complex conjugate of the signal itself — the matched filter. This is the foundation of radar, sonar, GPS correlator receivers, pulsar astronomy, and LIGO gravitational-wave detection. Everywhere you need to find a faint known pattern in noise, the matched filter is the statistically optimal tool, and it is a direct consequence of Wiener-Khinchin.
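A sketch of matched filtering with a hypothetical random pulse buried in white noise: correlating the observation with the time-reversed pulse puts the score's peak at the pulse position:

```python
import numpy as np

rng = np.random.default_rng(4)

# A known (hypothetical) pulse buried at an unknown position in white noise.
pulse = rng.standard_normal(32)
N, pos = 1024, 400
x = 0.5 * rng.standard_normal(N)
x[pos:pos + 32] += pulse

# Matched filter: correlate with the time-reversed pulse.
score = np.convolve(x, pulse[::-1], mode="valid")
detected = int(np.argmax(score))
assert abs(detected - pos) <= 2
```

Convolving with the reversed pulse is exactly cross-correlation with the pulse, so the score at each shift is the inner product the matched-filter theory says to maximise.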

Section 15

Where signal processing shows up in ML.

Modern machine learning is signal processing in a trench coat. Here is the map — every entry below is an ML topic whose intellectual scaffolding came directly from the previous fourteen sections.

A useful reading exercise: pick any recent deep-learning paper in audio, video, or physics-informed modelling, and try to name the signal-processing idea that its architecture is recapitulating. There is almost always one, often deliberately acknowledged, sometimes quietly rediscovered. The substrate of classical engineering mathematics that this chapter developed is the one every modern model silently builds on.

That closes Part I. Eight chapters — linear algebra, calculus, optimisation, probability, statistics, information theory, Bayesian reasoning, and signal processing — make up the mathematical foundation on which everything in the rest of this compendium rests.

Further reading

Where to go next

Signal processing has a wonderfully stable canon — the standard textbooks have been used for decades, with good reason. Pair one undergraduate text (Oppenheim & Schafer or Proakis & Manolakis) with one modern bridging book (Smith's free DSP guide, or Vetterli's signal-processing-for-data-science text) and you will own the field.


From here, the compendium turns to Part II and its tour of programming and software engineering.