Graph Neural Networks: Learning on the Structure Between Things
Most data sits naturally in tables or grids — but molecules, social networks, knowledge bases, recommendation systems, transport networks, and protein interactions don't. Graphs are the data structure for relationships, and graph neural networks are the family of architectures that learn directly on those relationships. This chapter develops the message-passing framework that unifies modern GNNs, the major architectures (GCN, GraphSAGE, GAT), the extension to heterogeneous graphs, and the frontier where graphs meet transformers and foundation models.
Prerequisites & orientation
This chapter assumes basic familiarity with linear algebra (Part I Ch 01), neural-network fundamentals (Part V Ch 01–02), and attention mechanisms (Part V Ch 06) for the GAT and graph-transformer sections. The probabilistic-graphical-models material of Part IV Ch 06 is helpful background but not required — graph neural networks reuse the graph as a data structure, not the conditional-independence semantics. No prior graph-ML experience is assumed.
Two threads run through the chapter. The first is the message-passing abstraction: nearly every modern GNN can be described as repeated rounds of "each node aggregates information from its neighbours, then updates its own representation." Once you understand message passing, the difference between GCN, GraphSAGE, and GAT is just the difference in how aggregation and update are parameterised. The second thread is the scalability and expressivity trade-off: simple GNNs scale to billion-node graphs but have known representational limits; expressive GNNs solve more problems but cost more to run. The chapter is organised so the simplest methods appear first and the architectural complications come in as their motivations become clear.
Why Graphs Need Their Own Architectures
Most of deep learning's success has been on data with a fixed structure: images live on regular pixel grids, text lives in linear sequences, audio lives on a one-dimensional time axis. Graphs have none of those properties — node count varies, neighbour count varies, there is no canonical ordering — and the architectures that won on grids and sequences fail when applied directly to graphs. Graph neural networks exist because graphs are an irreducibly different data type, and learning on them well requires architectures designed around their structure rather than borrowed from a different one.
The structural mismatch with CNNs and transformers
A CNN works because images have a fixed grid: every pixel has exactly the same number of neighbours in exactly the same spatial relationship, and the same convolutional kernel applies everywhere. A transformer works because text has a fixed ordering: position embeddings give each token a coordinate that downstream attention can use. Neither property holds in a graph. Node A might have three neighbours, node B might have a thousand, and there is no reason to think A's "first neighbour" plays the same role as B's "first neighbour" — the neighbour set is fundamentally unordered. Any architecture for graphs must be permutation invariant over the neighbour set, which rules out the order-dependent operations that drive CNNs and transformers.
The variable-size, variable-shape problem
A second mismatch: graphs vary in size and shape across examples in a dataset. A single image-classification dataset has all images at the same resolution; a single text dataset has documents of different lengths but the same vocabulary. A graph dataset can have a 7-atom molecule and a 10,000-atom protein in the same training batch, with completely different connectivity. Any architecture for graphs must produce sensible outputs across orders-of-magnitude size variation, and any training procedure must handle the batching pathology that comes with it.
Where graphs come from in practice
The applications that drive graph-ML research and deployment include: molecular and protein modelling (atoms as nodes, bonds as edges; the AlphaFold lineage and most modern computational chemistry); recommendation (users and items as nodes, interactions as edges; Pinterest's PinSage, Twitter's GraphJet, every modern recommender); knowledge graphs (entities as nodes, relationships as edges; used as memory for retrieval-augmented LLMs and as substrate for question answering); fraud detection and finance (accounts and transactions as a heterogeneous graph; the dominant approach to detecting transaction-network fraud); physical simulation (particles or mesh nodes connected by physical interactions; modern GNN-based weather and fluid simulators); and combinatorial optimisation (problem instances as graphs, learned heuristics on top). Each application has different scale, different homogeneity, and different evaluation criteria, but the underlying GNN machinery is shared.
Every legitimate operation on a graph's neighbour set must produce the same output regardless of how the neighbours are listed. This sounds obvious but is the single hardest constraint to design around — most natural operations (concatenate, multiply by a fixed-size weight matrix, treat as a sequence) violate it. The major GNN architectures of this chapter are different ways of building expressive operations that still respect permutation invariance.
The Graph as Data: Notation and Tasks
Before any architecture, the data structure. A graph in machine learning is a small set of named pieces — node features, edge features, adjacency information — and the prediction task is one of three flavours depending on whether the answer is per-node, per-edge, or per-whole-graph. Naming these pieces clearly is the precondition for understanding every algorithm in the chapter.
Notation
A graph G = (V, E) consists of a set of nodes V (also called vertices) and a set of edges E. Each node v ∈ V typically carries a feature vector xv ∈ ℝd; each edge (u, v) ∈ E may also carry a feature vector. The connectivity is encoded by the adjacency matrix A ∈ ℝ|V|×|V|, where Aij = 1 if an edge connects node i to node j, and 0 otherwise. The adjacency matrix is symmetric for undirected graphs and asymmetric for directed ones. Sparse storage formats (edge lists, CSR, COO) are essential in practice — most real graphs are sparse, and a dense adjacency matrix is infeasible at scale.
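The dense and sparse encodings above can be made concrete with a toy graph. A minimal sketch (the graph itself is made up for illustration) of a dense adjacency matrix next to the COO edge-index form that sparse libraries use:

```python
# Toy undirected graph with 4 nodes: edges as a list of pairs.
edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
n = 4

# Dense adjacency matrix: O(n^2) memory, infeasible at scale.
A = [[0] * n for _ in range(n)]
for u, v in edges:
    A[u][v] = 1
    A[v][u] = 1  # symmetric, since the graph is undirected

# Sparse COO form: two parallel index lists, one entry per directed edge.
src = [u for u, v in edges] + [v for u, v in edges]
dst = [v for u, v in edges] + [u for u, v in edges]

# Both encode the same neighbourhoods.
neighbours_of_2 = sorted(j for j in range(n) if A[2][j])
assert neighbours_of_2 == [0, 1, 3]
```

The COO form stores 2|E| integers instead of |V|² entries, which is why it (or CSR) is the default in every graph-ML library.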
The three task types
Graph ML problems come in three flavours, distinguished by what gets predicted.
- Node-level prediction: classify or regress on each individual node. Examples: paper categorisation in a citation network (the OGB datasets), user-segmentation in a social network, atom-type prediction in molecular graphs. The model produces a vector for each node; a downstream classifier or regressor maps each vector to its prediction.
- Edge-level prediction: classify or regress on each pair of nodes, including pairs that may not currently have an edge. The most common version is link prediction — given a graph, predict which currently-absent edges are likely to exist. This is the substrate of recommendation, knowledge-graph completion, and drug-target interaction prediction.
- Graph-level prediction: classify or regress on a whole graph. Examples: molecular property prediction (is this drug toxic?), social-network classification (is this an organic community or a botnet?), program-graph classification (does this code have a bug?). The model produces a single vector for the whole graph by pooling node representations after several rounds of message passing.
The same architectural backbone — the GNN layer — serves all three tasks. The difference is in what gets pooled and what loss is computed at the end.
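The shared-backbone point can be sketched with toy numbers: given per-node vectors (standing in for the output of a GNN stack), the three task types differ only in how those vectors are consumed. The feature values and scoring functions below are made up for illustration.

```python
# Hypothetical per-node representations, node id -> vector.
h = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

# Node-level: one prediction per node vector (here a toy score).
node_scores = {v: sum(vec) for v, vec in h.items()}

# Edge-level (link prediction): score any node pair, e.g. by dot product.
def edge_score(u, v):
    return sum(a * b for a, b in zip(h[u], h[v]))

# Graph-level: pool all node vectors into one graph vector (mean pooling).
dim = len(next(iter(h.values())))
graph_vec = [sum(vec[i] for vec in h.values()) / len(h) for i in range(dim)]

assert node_scores[2] == 2.0
assert edge_score(0, 1) == 0.0
```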
Message Passing: The Unifying Framework
Nearly every modern graph neural network is an instance of the same recipe: each node aggregates information from its neighbours, then updates its own representation based on what it received. Repeat for several rounds, and information propagates through the graph. Once you understand message passing, the dozens of named GNN variants reduce to different choices of two functions.
The MPNN recipe
Gilmer et al. (2017) formalised the message-passing neural network (MPNN) framework that subsumes most graph architectures published before or since. Each layer of an MPNN updates every node's hidden state in two steps:
message: mv(k+1) = AGG( { MSG( hv(k), hu(k), euv ) : u ∈ N(v) } )
update: hv(k+1) = UPD( hv(k), mv(k+1) )
Here N(v) is v's neighbour set, MSG computes a message per edge (optionally using the edge feature euv), AGG is a permutation-invariant aggregation over the incoming messages, and UPD combines the aggregated message with the node's current state.
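One round of this recipe fits in a few lines. A minimal sketch in which AGG is an elementwise sum and UPD is plain addition — deliberately trivial stand-ins for the learned functions of a real layer, on a made-up 3-node graph:

```python
adj = {0: [1, 2], 1: [0], 2: [0]}                  # neighbour lists
h = {0: [1.0, 2.0], 1: [3.0, 0.0], 2: [0.0, 1.0]}  # node states

def mpnn_round(h, adj):
    new_h = {}
    for v, neigh in adj.items():
        # AGG: sum the neighbours' current states (permutation invariant).
        m = [sum(h[u][i] for u in neigh) for i in range(len(h[v]))]
        # UPD: combine own state with the aggregated message.
        new_h[v] = [a + b for a, b in zip(h[v], m)]
    return new_h

h1 = mpnn_round(h, adj)
assert h1[0] == [4.0, 3.0]  # own [1,2] + neighbour sum [3,0]+[0,1]
```

Stacking K calls to `mpnn_round` gives each node a K-hop receptive field, which is exactly the depth trade-off discussed next.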
Receptive field and depth
After one MPNN layer, each node's representation reflects information from its immediate neighbours. After two layers, neighbours of neighbours. After K layers, the receptive field is the K-hop neighbourhood. This is the GNN analogue of receptive field in CNNs, and it has the same trade-off: deeper networks capture more global structure but are harder to train. As the next sections will show, naive depth in GNNs runs into specific pathologies — oversmoothing and oversquashing — that limit how deep useful GNNs can practically go. Most production GNNs use 2–4 layers; the architectural innovations of recent years are largely about pushing this number higher without the pathologies.
Choices that distinguish GNN architectures
Within the MPNN template, the architectural design space is two functions and a few choices. For AGG: sum, mean, max, attention-weighted, or some learned permutation-invariant function. For UPD: linear projection, MLP, gated recurrent unit, residual connection. Whether to use edge features, how to use them, and whether to make the layer spatial or spectral. The next four sections cover the most influential instances of these choices.
Graph Convolutional Networks (GCN)
The Graph Convolutional Network of Kipf and Welling (2017) is the canonical first instance of message passing — a single, simple layer formula that turned out to be surprisingly effective on a wide range of node-classification tasks. It is the GNN you implement first; everything else in the chapter is some refinement of it.
The GCN layer
The GCN layer can be written in matrix form, which is how it is usually implemented:
H(k+1) = σ( D̃^(−1/2) Ã D̃^(−1/2) H(k) W(k) )
where Ã = A + I is the adjacency matrix with self-loops added and D̃ is its diagonal degree matrix.
Reading the formula as message passing: the matrix multiplication Ã H sums each node's neighbours' representations (with self-loop included). The normalisation by degrees is a per-edge weight that prevents high-degree nodes from dominating. The multiplication by W is a learned linear transformation applied uniformly to every node. The non-linearity is applied element-wise. Two functions — the mean-with-degree-correction aggregation and the linear-plus-ReLU update — together produce a layer that has a clean spectral interpretation, a clean message-passing interpretation, and surprisingly strong empirical performance.
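A pure-Python sketch of the propagation rule H' = σ( D̃^(−1/2) Ã D̃^(−1/2) H W ), with toy features and an identity "weight" matrix standing in for a trained W (a real implementation would of course use a tensor library):

```python
import math

def gcn_layer(A, H, W):
    n = len(A)
    # A~ = A + I: add self-loops so each node keeps its own signal.
    At = [[A[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in At]
    # Symmetric normalisation: weight edge (i, j) by 1/sqrt(d_i * d_j).
    norm = [[At[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate (norm @ H), transform (@ W), then apply ReLU elementwise.
    agg = [[sum(norm[i][k] * H[k][j] for k in range(n))
            for j in range(len(H[0]))] for i in range(n)]
    out = [[max(0.0, sum(agg[i][k] * W[k][j] for k in range(len(W))))
            for j in range(len(W[0]))] for i in range(n)]
    return out

A = [[0, 1], [1, 0]]          # two connected nodes
H = [[1.0, 0.0], [0.0, 1.0]]  # toy features
W = [[1.0, 0.0], [0.0, 1.0]]  # identity "weights" for illustration
H1 = gcn_layer(A, H, W)
assert H1[0] == H1[1] == [0.5, 0.5]  # symmetric graph, symmetric output
```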
The spectral motivation
GCN was originally derived from spectral graph theory as a first-order approximation to spectral convolution, where convolutions are defined via the eigenvectors of the graph Laplacian. The full spectral convolution is expensive (requires the eigendecomposition of the Laplacian), but Kipf and Welling showed that a Chebyshev-polynomial approximation truncated to first order recovers the simple symmetric-normalised aggregation above. The spectral derivation is mathematically pretty but practically incidental — the layer can be motivated entirely from the message-passing perspective, and most modern presentations skip the spectral derivation. What matters is that GCN works, and works well enough to remain a strong baseline for any new GNN method.
Practical use
GCN is the right starting point for any node-classification task on a moderate-sized graph. Two layers (so each node sees its 2-hop neighbourhood) is the standard configuration. The implementation in PyTorch Geometric or DGL is a few lines. On benchmark datasets — Cora, Citeseer, PubMed (small citation networks), and the OGB suite at larger scale — a properly-tuned GCN sets the baseline that subsequent methods need to beat. Many "improvements" reported in papers fail to beat a well-tuned GCN once the comparison is fair.
Limitations
GCN has known weaknesses. It is transductive — the model is trained on a specific graph and cannot directly handle new nodes added after training without re-running the whole training. It treats all neighbours equivalently weighted by degree, with no way to learn that some neighbours matter more. And it suffers from oversmoothing with depth: stacking many GCN layers makes node representations converge to a single homogeneous state, defeating the purpose of going deep. The next two sections introduce GraphSAGE (which addresses inductivity and scaling) and GAT (which addresses the equal-weighting issue); Section 8 returns to oversmoothing as a more general problem.
GraphSAGE: Inductive Learning on Large Graphs
GCN works on graphs that fit in memory and that don't change. For real-world deployments — recommendation systems with millions of users joining daily, social networks with constant new accounts, fraud-detection systems with new transaction nodes every second — GCN's transductive setup is a non-starter. GraphSAGE (Hamilton, Ying & Leskovec, 2017) introduced the inductive, sampling-based variant that turned GNNs into a production-deployable technology.
Inductive vs. transductive
A transductive model is fit on a fixed graph and produces predictions for nodes that were in that graph; it cannot generalise to new nodes without retraining. An inductive model learns a function — given a node's features and the features of its neighbours, produce a representation — that can be applied to any node, including new ones never seen during training. The distinction is the difference between "this model classifies the papers in this citation network" and "this model classifies any paper, including ones added next week."
GraphSAGE is inductive by construction: it learns aggregator and update functions that take feature vectors as input and produce representations as output, without ever using node identities directly. Apply the trained functions to a new node and they produce a representation as long as the node's features and neighbour features are available.
Neighbour sampling
The second contribution is more pragmatic: neighbour sampling. A real graph might have nodes with thousands or millions of neighbours; aggregating from all of them at every layer is infeasible. GraphSAGE samples a fixed-size random subset of neighbours at each layer, with separate sample sizes per layer. A common configuration: sample 25 neighbours at the first layer, 10 at the second. The total computational cost is bounded by the sample sizes, independent of the actual graph size — making GraphSAGE feasible on graphs with billions of nodes.
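The sampling step itself is simple. A sketch of GraphSAGE-style fixed-size fan-out on a made-up graph where one node has 1,000 neighbours — the per-layer cost is capped by the sample size, not the true degree:

```python
import random

def sample_neighbours(adj, node, k, rng):
    neigh = adj[node]
    if len(neigh) <= k:
        return list(neigh)        # fewer than k neighbours: take them all
    return rng.sample(neigh, k)   # otherwise a fixed-size random subset

rng = random.Random(0)
adj = {0: list(range(1, 1001)), 1: [0], 2: [0]}  # node 0: 1000 neighbours
s = sample_neighbours(adj, 0, 25, rng)
assert len(s) == 25
assert set(s) <= set(adj[0])
```

With the 25/10 configuration from the text, a 2-layer computation for one target node touches at most 25 × 10 sampled nodes, regardless of graph size.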
The aggregator zoo
GraphSAGE's third contribution was flexibility in the aggregator function. The original paper introduced three options:
- Mean aggregator: average the neighbour features. Closest to GCN; the simplest and often the strongest in benchmarks.
- Pool aggregator: apply a small MLP to each neighbour, then take an element-wise max. More expressive on tasks where extreme features matter.
- LSTM aggregator: feed the neighbours through an LSTM (after random shuffling, since LSTMs are order-dependent and we need permutation invariance in expectation). The most expressive but the most expensive.
In practice, the mean aggregator is the default for production deployments because of its simplicity and empirical strength. The pool and LSTM aggregators show up in research code and specialised applications.
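Sketches of the mean and pool aggregators, with the pool aggregator's per-neighbour MLP stubbed by a fixed elementwise map purely for illustration. Both are permutation invariant, which the final assertions check directly:

```python
def mean_agg(neigh_feats):
    n, d = len(neigh_feats), len(neigh_feats[0])
    return [sum(f[i] for f in neigh_feats) / n for i in range(d)]

def pool_agg(neigh_feats, mlp=lambda x: [2 * v for v in x]):
    # Transform each neighbour independently, then take an elementwise max.
    transformed = [mlp(f) for f in neigh_feats]
    d = len(transformed[0])
    return [max(t[i] for t in transformed) for i in range(d)]

feats = [[1.0, 4.0], [3.0, 0.0]]
assert mean_agg(feats) == [2.0, 2.0]
assert pool_agg(feats) == [6.0, 8.0]
# Outputs are unchanged if the neighbour list is permuted:
assert mean_agg(feats[::-1]) == mean_agg(feats)
assert pool_agg(feats[::-1]) == pool_agg(feats)
```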
Where GraphSAGE shipped
GraphSAGE became the foundation of PinSage (Ying et al. 2018), Pinterest's production recommendation system, which scaled GraphSAGE-style architectures to graphs with billions of nodes (pins, boards, users) and trillions of edges. PinSage was one of the first widely-publicised demonstrations that GNNs were not just an academic curiosity but could power production systems at the largest scale, and it remains the canonical case study for inductive GNN deployment. Most subsequent production GNN systems — at Twitter, Alibaba, Uber, and elsewhere — use some variant of the GraphSAGE pattern.
Graph Attention Networks (GAT)
GCN treats every neighbour as equally informative (modulo a degree correction); GraphSAGE treats them as equally informative within the sampled set. Both choices feel wrong: in real graphs, some neighbours genuinely matter more than others, and a good model should learn which. The Graph Attention Network (Veličković et al. 2018) is the answer — replace the fixed aggregation weights with learned, content-dependent attention scores, exactly the way transformer attention works on sequences.
Attention over neighbours
For a target node v and a candidate neighbour u, GAT computes an attention coefficient αvu that depends on both endpoints' features. The coefficient is then used as the weight in the aggregation, so high-attention neighbours contribute more and low-attention neighbours contribute less. The mechanics are:
evu = LeakyReLU( aᵀ [ W hv ‖ W hu ] )
αvu = softmaxu ∈ N(v)( evu )
hv(k+1) = σ( Σu ∈ N(v) αvu W hu )
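The score-then-softmax step can be sketched for a single target node; the projection W is folded into the toy feature vectors and the attention vector a is a made-up value, so this illustrates only the mechanics:

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_weights(h_v, neigh_feats, a):
    # e_vu = LeakyReLU( a . [h_v || h_u] ), one raw score per neighbour.
    scores = [leaky_relu(sum(ai * xi for ai, xi in zip(a, h_v + h_u)))
              for h_u in neigh_feats]
    # Softmax over the neighbourhood turns scores into weights.
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

alpha = gat_weights([1.0], [[1.0], [0.0]], a=[1.0, 1.0])
assert abs(sum(alpha) - 1.0) < 1e-12
assert alpha[0] > alpha[1]  # the feature-similar neighbour gets more weight
```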
Multi-head attention
As in transformers, GAT uses multi-head attention: K independent attention mechanisms run in parallel, each with its own learnable parameters, and their outputs are either concatenated (intermediate layers) or averaged (final layer). Multiple heads stabilise training and let different heads specialise in different aspects of the neighbourhood — one head might attend to structurally similar neighbours, another to feature-similar ones. Typical GAT configurations use 8 heads at intermediate layers and 1–8 heads averaged at the output.
GATv2: fixing static attention
The original GAT formulation has a subtle expressiveness limit, identified by Brody, Alon & Yahav (2022) and named static attention: the ranking of neighbours by attention score is the same regardless of the query node. In other words, if neighbour A "wins" attention from query v, it wins from every query w on the same graph — the attention mechanism cannot express "A matters to v but B matters to w". GATv2 fixes this by reordering the operations: apply the LeakyReLU before the dot product with the attention vector rather than after, which restores fully dynamic attention. GATv2 is the recommended default in modern implementations; the original GAT remains in production largely for backward compatibility.
When GAT helps and when it doesn't
GAT shines on graphs where neighbour relevance is heterogeneous — knowledge graphs, where edges have semantic meaning; social networks, where some connections are stronger than others; biological networks, where some interactions are more functional than others. On homogeneous graphs with low feature variance among neighbours (citation networks, simple molecular graphs), the gains over a well-tuned GCN are modest and the extra computational cost of attention may not be worth it. The empirical pattern across benchmarks is that GAT and GCN are within a couple of percentage points on most node-classification tasks; the bigger reasons to choose GAT are interpretability (the attention weights are inspectable) and robustness to noisy edges.
Heterogeneous Graphs and Relational GNNs
Real-world graphs almost never have a single node type and a single edge type. A user-product graph has users and products as different node types, with "purchased", "viewed", and "rated" as different edge types. A knowledge graph has entities of dozens of types (Person, Place, Organisation, Drug, Gene...) and hundreds of relation types. The vanilla GNN architectures of the previous sections all assume a single homogeneous graph; making them work on heterogeneous data requires explicit machinery.
The heterogeneous-graph data model
A heterogeneous graph (also called a knowledge graph in some contexts) has multiple node types and multiple edge types. Formally: a graph G = (V, E, TV, TE) where every node has a type in TV and every edge has a type (or relation) in TE. Different node types typically have different feature schemas — a user node has age and country features; a product node has price and category — and the model must handle this gracefully.
R-GCN: relation-specific weights
The Relational Graph Convolutional Network (Schlichtkrull et al. 2018) is the simplest serious answer: use a separate weight matrix per relation type. The aggregation for a node v becomes a sum over relations of the relation-specific aggregation:
hv(k+1) = σ( Σr ∈ R Σu ∈ Nr(v) (1 / cv,r) Wr(k) hu(k) + W0(k) hv(k) )
where Nr(v) is v's neighbour set under relation r, Wr is the relation-specific weight, W0 is the self-loop weight, and cv,r is a normalisation constant (typically |Nr(v)|).
R-GCN was the first widely-adopted GNN for knowledge-graph completion (predicting missing edges of a specific relation type) and remains a strong baseline. Its limitations are mostly the GCN limitations carried over: same uniform aggregation within each relation type, no neighbour-attention, transductive in its standard form.
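The relation-specific structure can be sketched with scalar "weights" standing in for the real weight matrices (the graph, relations, and values below are all made up; scalars just keep the sketch readable):

```python
# Per-relation neighbour lists for a toy 3-node graph.
adj_by_rel = {
    "cites":    {0: [1], 1: [], 2: []},
    "same_lab": {0: [2], 1: [], 2: [0]},
}
h = {0: 1.0, 1: 2.0, 2: 4.0}
W = {"cites": 0.5, "same_lab": 0.25, "self": 1.0}  # scalar stand-ins for W_r

def rgcn_update(v):
    out = W["self"] * h[v]                 # self-loop term W_0 h_v
    for rel, adj in adj_by_rel.items():
        neigh = adj[v]
        if neigh:
            msg = sum(h[u] for u in neigh) / len(neigh)  # normalised sum
            out += W[rel] * msg            # relation-specific weight
    return out

assert rgcn_update(0) == 1.0 * 1.0 + 0.5 * 2.0 + 0.25 * 4.0  # = 3.0
```

The parameter count grows linearly in the number of relations, which is why the original paper also proposed basis-decomposition sharing across the Wr.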
HGT: heterogeneous graph transformer
The Heterogeneous Graph Transformer (Hu et al. 2020) is the more modern answer: combine R-GCN's relation-specificity with GAT's attention, producing a model where attention coefficients depend on both the node-type pair and the edge type. HGT typically outperforms R-GCN by clear margins on the Open Graph Benchmark heterogeneous datasets, and is the default for new heterogeneous-graph deployments in 2026. Its parameter count is larger, but the attention-based design tends to scale better with the number of relations than naive R-GCN.
Metapaths and heterogeneous embeddings
An older but still relevant family of methods uses metapaths — predefined sequences of node and edge types that capture meaningful patterns in the heterogeneous graph. For a recommendation graph, a metapath like User → purchased → Product → made_by → Brand defines a User-Brand affinity. Models such as metapath2vec sample random walks following metapaths and produce embeddings; the metapath-GNN family extends this by aggregating along metapath instances rather than uniformly. Metapath approaches require domain knowledge to specify the paths but produce highly interpretable representations and remain competitive on small-to-medium heterogeneous graphs where good metapaths are easy to enumerate.
Expressivity, Oversmoothing, and Limits
By 2019 the GNN field had a stable of architectures (GCN, GraphSAGE, GAT, GIN, ...) but a deeper question loomed: what can GNNs actually represent, and what can't they? Two parallel literatures answered this — one from theoretical computer science about graph isomorphism, one from empirical practice about why deep GNNs collapse. Both are essential context for choosing and training GNN architectures today.
The Weisfeiler-Lehman connection
The Weisfeiler-Lehman test (1-WL) is a classical algorithm for testing whether two graphs are isomorphic. It works by iteratively refining node "colours": initially every node has the same colour; at each step, each node's new colour is a hash of (its current colour, the multiset of its neighbours' colours). If at some step the colour distributions of two graphs differ, the graphs are not isomorphic. The 1-WL test is incomplete — there exist non-isomorphic graphs that 1-WL cannot distinguish — but it is the standard yardstick against which GNN expressivity is measured.
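The refinement loop is short enough to sketch in full; integer relabelling of signatures plays the role of the hash. The two test graphs (a triangle and a 3-node path) are chosen so 1-WL separates them:

```python
from collections import Counter

def wl_histogram(adj, rounds=3):
    colours = {v: 0 for v in adj}  # every node starts with the same colour
    for _ in range(rounds):
        # New signature: (own colour, sorted multiset of neighbour colours).
        signatures = {v: (colours[v],
                          tuple(sorted(colours[u] for u in adj[v])))
                      for v in adj}
        # Relabel signatures with small integers (a canonical "hash").
        table = {sig: i
                 for i, sig in enumerate(sorted(set(signatures.values())))}
        colours = {v: table[signatures[v]] for v in adj}
    return Counter(colours.values())

triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
path = {0: [1], 1: [0, 2], 2: [1]}
# Different colour histograms => provably non-isomorphic.
assert wl_histogram(triangle) != wl_histogram(path)
```

The failure cases are pairs of non-isomorphic regular graphs where every node always sees the same neighbour-colour multiset, so the histograms never diverge — exactly the graphs MPNNs cannot tell apart either.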
Xu et al. (2019) proved a striking equivalence: most message-passing GNNs (including GCN, GraphSAGE, GAT) are at most as powerful as 1-WL. Two graphs that 1-WL cannot distinguish, no MPNN can distinguish either. The proof uses the fact that aggregation-via-mean or aggregation-via-max collapses information in ways that limit which functions of multisets can be expressed. Their architecture, the Graph Isomorphism Network (GIN), uses a sum aggregator and an MLP update that together provably match 1-WL exactly — making GIN the maximally expressive MPNN.
Beyond 1-WL: higher-order GNNs
The expressivity-extension literature builds GNNs that match higher orders of the WL hierarchy: k-WL tests use k-tuples of nodes rather than single nodes and distinguish strictly more graphs. Architectures such as k-GNN, Provably Powerful Graph Networks, and various subgraph GNNs deliver more expressivity at higher cost. They mostly remain in the research literature; production GNN systems still rely on 1-WL-equivalent architectures plus features that supply the missing structural information (node identifiers, positional encodings, distance-to-target features).
Oversmoothing
Oversmoothing is the empirical pathology that makes deep GCNs fail. As you stack more message-passing layers, every node's representation drifts toward an average of the whole graph — after many layers, all nodes look the same and node-classification accuracy collapses. Li, Han & Wu (2018) showed that stacking GCN layers is mathematically equivalent to repeated application of a low-pass graph filter, which provably converges to a single fixed-point representation. Two layers is the practical sweet spot for GCN; more is rarely better.
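The collapse is easy to reproduce numerically. A sketch using repeated mean aggregation with self-loops (the un-learned core of a GCN layer) on a made-up 3-node path graph — the spread between node values shrinks geometrically:

```python
adj = {0: [1], 1: [0, 2], 2: [1]}
h = {0: 0.0, 1: 0.0, 2: 9.0}  # one node starts distinctly

def smooth(h):
    # Mean over {self} ∪ neighbours: a low-pass filter on the graph.
    return {v: (h[v] + sum(h[u] for u in adj[v])) / (1 + len(adj[v]))
            for v in adj}

def spread(h):
    return max(h.values()) - min(h.values())

s0 = spread(h)
for _ in range(50):
    h = smooth(h)
assert spread(h) < 0.01 * s0  # representations have nearly collapsed
```

After 50 rounds the node values are numerically indistinguishable; a classifier downstream of such representations has nothing left to separate.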
Mitigations include residual connections (let the deep representation include the input directly), jumping-knowledge networks (concatenate representations from multiple layers and choose per-node), and graph normalisation tricks (PairNorm, NodeNorm) that explicitly preserve representation distinctness across layers. In modern architectures these tricks let depth go to 8–32 layers, but rarely deeper.
Oversquashing
The complement of oversmoothing is oversquashing: information from distant nodes has to be compressed through bottleneck regions of the graph, and the compression destroys signal. A node's representation can in principle reflect K-hop information after K layers, but in practice the signal from each distant node is exponentially attenuated by the graph's expansion properties. Alon & Yahav (2021) gave the canonical analysis. Fixes include graph rewiring (add edges to short-circuit bottlenecks) and graph-transformer architectures (Section 9) that bypass the locality constraint entirely.
For most node-classification problems, 2–3 message-passing layers, possibly with residual connections, is the right choice. If you need more layers, add residuals or jumping connections rather than just stacking. If long-range dependency seems important — molecules, code, mesh simulation — consider a graph transformer instead of more depth.
Graph Transformers and Modern Architectures
Transformers came late to graph learning — natural, since attention-over-everything seems to throw away the structure that GNNs exist to exploit. By 2021–2022, however, several papers showed that a transformer with the right structural augmentations matches or beats specialised GNNs on the hardest graph benchmarks. The graph transformer is the dominant architecture for molecule- and protein-scale problems in 2026.
The naive failure and the structural fix
Apply a standard transformer to a graph by treating nodes as tokens, and attention is global: every node attends to every other node, ignoring edges entirely. This destroys the structural signal that makes graphs useful. The graph-transformer literature found two routes to put the structure back: encode it in the input (positional encodings), or restrict it in the attention (structural attention biases).
Positional encodings for graphs
Sequence transformers use sine/cosine or learned positional encodings to give each token a position. Graphs need an analogue, and the literature has converged on Laplacian positional encodings: the leading eigenvectors of the graph Laplacian provide a per-node coordinate that captures global structure. Random-walk-based encodings (the kth-step return probability) and shortest-path-distance features are alternatives. Architectures like Graphormer (Ying et al. 2021) and the GPS framework (Rampášek et al. 2022) combine these with a standard transformer block, plus an additional bias term in attention that depends on the structural distance between nodes.
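Of the encodings mentioned, the random-walk one is the easiest to sketch without an eigensolver: the k-step return probability of each node, computed by naive matrix powering (fine for a toy graph; real implementations use sparse matrix products):

```python
def rw_return_probs(adj, steps):
    n = len(adj)
    # Row-stochastic transition matrix of the simple random walk.
    P = [[(1 / len(adj[i]) if j in adj[i] else 0.0) for j in range(n)]
         for i in range(n)]
    M = [row[:] for row in P]
    encodings = []
    for _ in range(steps):
        encodings.append([M[i][i] for i in range(n)])  # diagonal = returns
        M = [[sum(M[i][k] * P[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    return encodings  # encodings[k-1][v] = k-step return probability of v

adj = {0: [1], 1: [0, 2], 2: [1]}  # a 3-node path
enc = rw_return_probs(adj, 2)
assert enc[0] == [0.0, 0.0, 0.0]   # no 1-step returns on a simple path
assert enc[1] == [0.5, 1.0, 0.5]   # the middle node always returns in 2
```

The resulting per-node vectors distinguish endpoints from the middle node purely from structure, which is exactly what a structure-blind transformer needs as input.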
Graphormer and shortest-path bias
The Graphormer architecture (which won the 2021 Open Catalyst Project competition and the OGB-LSC benchmark) introduced three additions to the standard transformer: a centrality encoding that adds learned embeddings for each node's degree; a spatial encoding that biases attention scores by the shortest-path distance between nodes (so attention falls off with distance, like in MPNNs but smoothly); and an edge encoding that adds edge-feature information to the attention bias. These three modifications turn a transformer into a structure-aware model competitive with the best GNNs on molecular property prediction.
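The spatial-encoding idea reduces to two steps: all-pairs shortest-path distances (BFS per node on an unweighted graph), then a per-distance bias added to the raw attention scores. A sketch with a made-up bias table standing in for the learned embeddings:

```python
from collections import deque

def bfs_distances(adj, source):
    dist = {source: 0}
    q = deque([source])
    while q:
        v = q.popleft()
        for u in adj[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                q.append(u)
    return dist

adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}       # a 4-node path
spd = {v: bfs_distances(adj, v) for v in adj}      # all-pairs SPD
bias = {0: 0.0, 1: -0.1, 2: -0.5, 3: -2.0}         # toy learned bias table

# Attention score between i and j = content score + spatial bias.
def biased_score(content, i, j):
    return content + bias[spd[i][j]]

assert spd[0][3] == 3
assert biased_score(1.0, 0, 3) == -1.0             # distant pair penalised
```

Because the bias is learned per distance rather than hard-coded, attention can fall off with distance smoothly instead of being cut off at a fixed hop count.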
The GPS framework
The General, Powerful, Scalable framework (GPS) generalises the graph-transformer recipe: every layer combines a local MPNN block (capturing nearby structure efficiently) with a global attention block (capturing long-range dependencies that MPNN cannot reach). The MPNN block can be GIN, GAT, or any other; the attention block can be a standard transformer, a Performer, or a BigBird-style sparse variant. This local-plus-global hybrid has proved robust across benchmarks and is the foundation for many recent graph-foundation-model efforts.
When graph transformers are worth it
Graph transformers excel on tasks with long-range dependency: molecular properties that depend on global topology, protein folding where distant residues interact, mesh-based physical simulation where distant nodes propagate force. They underperform pure MPNNs on tasks where local structure is sufficient — citation networks, simple recommendation graphs — and they are more expensive (quadratic in graph size for full attention, though linear-attention variants exist). The deployment rule of 2026: use a fast MPNN for cheap, dense node-classification on huge graphs; use a graph transformer when the underlying task genuinely needs long-range information.
Frontier: Foundation Models and Graph Applications
For most of the deep-learning era graph models were trained from scratch on a single graph and a single task. The last few years have seen the first serious attempts at graph foundation models — pretrained, transferable representations for molecules, proteins, and knowledge graphs — and the integration of GNNs with the rest of the modern AI stack, especially LLMs. This is where the most active research lives in 2026.
Pretraining on graphs
The dominant pretraining tasks for graph foundation models are contrastive (different views of the same graph should embed similarly), masked (predict masked node features or edges), and denoising (recover graph properties under random perturbation). For molecules, models such as GROVER and the more recent Uni-Mol use self-supervised pretraining on tens of millions of unlabelled molecules (ChemBERTa does the same on SMILES strings rather than graphs) and produce embeddings that transfer across downstream tasks (toxicity, solubility, activity). On protein structure, the ESM family of language models implicitly learns the contact graph from sequence; explicitly graph-based pretrained models such as ProteinMPNN and the various AlphaFold-derived embeddings are the structural counterpart.
AlphaFold and the protein-structure revolution
AlphaFold 2 (Jumper et al. 2021) and its successors are the most spectacular application of graph-attention machinery to date. The architecture treats a protein as a graph of amino-acid residues, with attention modules that update both per-residue features and per-pair (edge) features through a deep stack of "Evoformer" blocks. Its structure-prediction performance approached experimental accuracy on most proteins, transforming structural biology and downstream drug discovery. AlphaFold 3 (2024) extended the model to protein-protein, protein-ligand, and protein-DNA complexes, with a diffusion-model output head. The architectural lessons (graph attention plus structural priors plus deep stacks) inform every subsequent biological GNN.
Recommendation at scale
Pinterest's PinSage remains the canonical case study, but every major recommendation system in 2026 uses a GNN-based backbone for some component: candidate generation, re-ranking, or feature engineering. Twitter's GraphJet, Alibaba's EGES, Uber's order-graph embeddings, LinkedIn's connection-suggestion model — all are large-scale GNNs running on heterogeneous graphs of users, items, and interactions. The deployment patterns share themes: GraphSAGE-style sampling for inductivity, mixed-precision training to fit at scale, and aggressive caching of intermediate embeddings.
LLMs and knowledge graphs
The integration of GNNs with large language models is a major active area. Several lines of work stand out: retrieval over knowledge graphs (an LLM uses a GNN-encoded knowledge graph as a memory/retrieval source); knowledge-graph-augmented generation (the LLM emits structured outputs that update a graph, which then conditions further generation); and graph-aware reasoning (the LLM explicitly traverses graph structure during chain-of-thought). The 2024–2026 literature on LLM-plus-graph hybrids is large and unsettled, but the convergence on hybrid retrieval-plus-generation systems with explicit graph backbones looks durable.
What's next
The current frontiers worth watching include: graph foundation models that pretrain on diverse graph types and transfer across domains (still nascent, mostly limited to molecules); physical simulation using GNN-based mesh networks (DeepMind's MeshGraphNets, ECMWF's GraphCast for weather); combinatorial optimisation via learned graph heuristics (TSP, scheduling, theorem proving); causal discovery on heterogeneous networks (see Part XIII Ch 03, Section IX); and differentiable database operations where the graph is a learned data structure used inside larger models. The unifying theme is that "structure between things" is a primitive that classical neural architectures cannot represent natively, and the GNN family is the design space for building it in.
What this chapter does not cover
Several adjacent or specialised areas are deliberately out of scope. Spectral GNN theory beyond the brief mention in Section 4 — the Chebyshev expansions, ChebNet, the spectral graph signal processing literature — is well-developed but rarely needed for application work. Dynamic and temporal graphs, where edges appear and disappear over time (TGAT, TGN, the JODIE family), are a research field of their own and warrant separate treatment. Hyperbolic GNNs for hierarchical data, signed-graph methods for trust networks, and the graph generative family (GraphRNN, MolGAN, diffusion-on-graphs) all merit their own chapters. The classical graph-embedding methods that preceded GNNs (DeepWalk, node2vec, LINE) are historically important but have been largely superseded by GNN methods in production. And the combinatorial-optimisation-via-RL literature, while heavily GNN-flavoured, is best treated through the reinforcement-learning lens rather than the GNN lens.
Further reading
Foundational papers and surveys for graph neural networks. Most are open-access on arXiv; the canonical paper plus a survey is the right starting kit for anyone going deeper.
- Semi-Supervised Classification with Graph Convolutional Networks. The GCN paper. The single most-cited GNN reference and the right starting point for anyone implementing their first graph neural network. Sets up the spectral motivation, derives the symmetric-normalised aggregation, and demonstrates strong empirical results on citation-network benchmarks. Reading it grounds every subsequent message-passing architecture. The reference for GCN and the canonical GNN intro.
- Inductive Representation Learning on Large Graphs (GraphSAGE). The GraphSAGE paper. Introduces inductive learning, neighbour sampling, and the aggregator-zoo flexibility that turned GNNs into a deployable production technology. The natural reading after the GCN paper for understanding how the field scaled to billion-node graphs. The reference for inductive GNNs and large-scale deployment.
- Graph Attention Networks. The original GAT paper. Introduces attention-over-neighbours, multi-head attention for graphs, and the experimental comparisons that established GAT as a default architecture choice. Pair with Brody, Alon & Yahav (2022) on GATv2 for the modern recommended variant. The reference for attention-based GNNs.
- How Powerful are Graph Neural Networks? (GIN). The expressivity paper. Establishes the connection between MPNNs and 1-WL, derives the conditions under which an MPNN matches 1-WL, and introduces GIN as the maximally-expressive instance. The right reading for understanding what GNNs can and cannot represent, and the entry point to the broader expressivity literature. The reference for GNN expressivity and 1-WL.
- Modeling Relational Data with Graph Convolutional Networks (R-GCN). The R-GCN paper. The first widely-adopted GNN for heterogeneous and knowledge graphs. Introduces relation-specific weight matrices and the basis-decomposition trick for parameter efficiency. The natural starting point for any heterogeneous-graph project before reaching for HGT. The reference for relational and heterogeneous GNNs.
- Do Transformers Really Perform Bad for Graph Representation? (Graphormer). The Graphormer paper. Introduces the centrality, spatial, and edge encodings that turn a standard transformer into a structure-aware graph model competitive with the best GNNs. Pair with the GPS framework paper (Rampášek et al. 2022) for the modern hybrid local+global recipe. The reference for graph transformers.
- Highly Accurate Protein Structure Prediction with AlphaFold. The AlphaFold 2 paper. The most spectacular GNN-flavoured application to date — a deep stack of graph-attention modules that approach experimental accuracy on protein structure prediction. The architectural lessons (graph attention plus structural priors plus pretraining on evolutionary data) inform every subsequent biological GNN. AlphaFold 3 (2024) extends to protein-ligand and protein-DNA complexes. The reference for graph-attention applied to biology.
- Graph Representation Learning. The textbook. A 150-page synthesis of the GNN literature up to 2020 covering message passing, the major architectures, expressivity, and applications. Free online and the right comprehensive reference for someone treating GNNs as a primary research area rather than a tool. The textbook reference for the field.
- PyTorch Geometric / DGL. The two dominant GNN frameworks in 2026. PyTorch Geometric (PyG) and the Deep Graph Library (DGL) implement essentially every GNN architecture mentioned in this chapter, plus efficient sampling, batching, and heterogeneous-graph support. Either is the right tool for a real project; the choice is mostly preference and ecosystem fit. The libraries you will actually use.