Protein Science & AI, from amino acids to AlphaFold and beyond.
If DNA is the data and RNA is the messaging, proteins are the mechanism — they are what does, builds, regulates, and reacts. This chapter develops the working vocabulary an AI reader needs to engage with computational protein science (Sections 2–8 — amino acids, structure, functional classes, experimental techniques, evolution, families, modifications) and then turns to the AI methodology that has made the protein sub-field a frontier of computational biology since 2020: AlphaFold and structure prediction (Sections 10, 12), protein language models (Section 11), generative protein design (Sections 13, 14), function and variant-effect prediction (Sections 15, 16), antibody and enzyme engineering (Section 17), and the open-science evaluation infrastructure (Section 18). A single chapter combines what the field treats as inseparable: the biology that frames the problems and the AI methods that have substantially reshaped what the field can do about them.
Prerequisites & orientation
This chapter is both a domain primer and an AI-methods chapter. The first half (Sections 1–8) assumes no biology background beyond what most readers retain from high school, and it builds on the chemistry vocabulary developed in Ch 02. The second half (Sections 9–18) assumes the working machinery of modern deep learning (Part VI), the equivariance methodology of Ch 01 Section 8 (SE(3)-equivariant networks are the architectural backbone for many protein methods), the diffusion-model material of Part X (essential context for Sections 13–14), and the foundation-model material of Part X (which informs Section 11). Readers with a strong biology background can skim Sections 2–8; readers with strong ML but no biology should take their time with the first half before engaging with the second.
Three threads run through the chapter. The first is the sequence-determines-structure-determines-function hierarchy: a protein's amino-acid sequence (encoded by its gene) determines its three-dimensional fold, which in turn determines what the protein does — and each layer is its own prediction problem with its own AI methodology. The second is the functional diversity of proteins: from the same 20-letter amino-acid alphabet, biology constructs catalysts, structural materials, transporters, sensors, motors, signals, and recognition systems, which is why proteins are the dominant drug-target class. The third is the open-vs-closed tension that has shaped the post-AlphaFold AI methodology: AlphaFold 2 was released as open source with weights; AlphaFold 3 was initially restricted; ESM, RoseTTAFold, and RFdiffusion are open. Section 18 returns to this. Together the threads frame both why proteins are the most-developed AI-for-biology subdomain and how the methodology has matured.
Why Proteins, and Why Protein-AI
Proteins are the mechanism of the cell — they are what does, builds, regulates, and reacts. Almost every cellular process you can name happens because of protein activity, and almost every disease you can name involves protein dysfunction. This section maps the biology itself; Section 9 frames what makes protein-AI methodologically distinctive from an ML perspective, and Sections 10–18 develop the AI methodology that has substantially reshaped the field since 2020.
Proteins are the cellular workforce
The metabolism that turns food into energy is enzyme-catalysed; the immune response that fights infection is antibody-mediated; the muscle contractions that let you move are motor-protein-driven; the signalling that lets cells communicate is receptor-and-kinase-based. To understand any biological process at a mechanistic level is, almost always, to understand which proteins are involved and how. Section 4 surveys the major functional classes — enzymes, structural proteins, transporters, signalling proteins, antibodies, motors, regulatory proteins — to give a feel for the range.
Disease as protein dysfunction
The medical implications follow directly. The vast majority of human diseases involve protein dysfunction. Sickle cell disease is caused by a single amino-acid change in haemoglobin that makes red blood cells deform. Cystic fibrosis is caused by mutations in CFTR, a chloride-channel protein. Cancer is, at the molecular level, dysregulation of growth-controlling proteins (oncogenes, tumour suppressors). Alzheimer's disease involves misfolded proteins (amyloid beta, tau) accumulating in the brain. The cumulative result is that proteins are also the dominant drug target class: roughly 90% of all approved small-molecule drugs work by binding to a specific protein and modulating its activity. Understanding proteins is therefore not just a basic-science exercise but the substrate of essentially all of medicine.
The sequence-structure-function hierarchy
The conceptual spine of protein science is a three-tier hierarchy. The amino-acid sequence is encoded in DNA; the sequence determines a three-dimensional structure through folding (Anfinsen's principle); the structure determines biochemical function. Each layer is a distinct prediction problem with its own AI methodology — sequence-to-structure (Sections 10, 12), sequence-to-function and structure-to-function (Section 15), and the inverse problems of designing sequences for desired structures or functions (Sections 13, 14). The hierarchy is the reason proteins have been the most-tractable AI-for-biology subdomain: the layers are well-defined, the data is rich at each layer, and the inverse problems have practical importance for therapeutic-protein development.
How this chapter is organised
Sections 2–8 develop the working vocabulary an AI reader needs to engage with computational protein science: the amino-acid alphabet, the four levels of structure, the major functional classes, the experimental techniques that produce structural data, sequence alignment and evolution, the family-and-domain organisation of protein space, and the post-translational modifications that regulate protein activity. Section 9 turns to the AI methodology proper, framing how protein-AI looks from a machine-learning perspective — its data substrate, the structure-as-target oddity, multiple training signals, the rapidity of progress, and the empirical-vs-mechanistic tension. Sections 10–18 develop the methods: AlphaFold (Section 10), protein language models (11), structure-prediction successors (12), de novo design (13), inverse folding (14), function and property prediction (15), variant-effect prediction (16), antibody and enzyme design (17), and the open-science evaluation infrastructure plus the active frontier (18).
Proteins do almost everything that requires action in a cell: catalysis, structure, transport, signalling, recognition, motion, regulation. They are also the dominant drug-target class — roughly 90% of approved small-molecule drugs work by binding a specific protein and modulating its activity. The biology vocabulary developed in Sections 2–8 is the prerequisite for understanding why protein-AI has been the most-developed AI-for-biology subdomain since 2020, which the chapter develops from Section 9 onward.
The Amino-Acid Alphabet
Proteins are polymers of amino acids — twenty distinct building blocks linked by peptide bonds — and the chemistry of those amino acids drives essentially everything about how proteins behave. This section introduces the alphabet.
What proteins are made of
An amino acid is a small molecule with a consistent core architecture and one variable region. The core is a central carbon (the "alpha carbon") bonded to four things: an amino group (NH₂), a carboxyl group (COOH), a hydrogen atom, and a variable side chain (the "R group"). The twenty standard amino acids found in essentially all proteins differ only in their side chains, which range from a single hydrogen (glycine, the smallest) through bulky aromatic rings (phenylalanine, tryptophan, tyrosine) to charged groups (lysine and arginine carry positive charge; aspartate and glutamate carry negative). Two amino acids link together when the carboxyl group of one reacts with the amino group of the other, forming a peptide bond and releasing a water molecule. Long chains of amino acids linked this way are called polypeptides, and proteins are typically polypeptides of 50 to several thousand amino acids.
Side-chain chemistry shapes everything
The chemistry of the side chains drives almost everything that follows. Hydrophobic side chains (alanine, valine, leucine, isoleucine, phenylalanine) are oily and avoid water; in a folded protein they cluster on the inside, away from the aqueous cellular environment. Hydrophilic side chains (serine, threonine, asparagine, glutamine) like water and tend to sit on the protein's surface. Charged side chains form salt bridges and engage in electrostatic interactions, often at active sites or binding interfaces. A handful of amino acids play special roles: cysteine can form disulfide bonds with another cysteine, creating covalent links that stabilise structure; proline is rigid and breaks regular structural patterns; glycine is so small it provides flexibility wherever a chain needs to turn sharply. The 20-letter amino-acid alphabet, with its mix of hydrophobic, hydrophilic, charged, and special characters, gives proteins enormous functional range from a relatively small chemical vocabulary.
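The groupings above can be sketched in a few lines. The class assignments below follow common textbook conventions and are a simplification (borderline residues such as cysteine and histidine are classified differently by different sources):

```python
# A minimal sketch of side-chain chemistry classes using one-letter codes.
# The assignments are conventional and gloss over borderline cases.
from collections import Counter

RESIDUE_CLASS = {
    **dict.fromkeys("AVLIFMW", "hydrophobic"),  # oily, tend to be buried
    **dict.fromkeys("STNQY",   "polar"),        # hydrophilic, surface-exposed
    **dict.fromkeys("KRH",     "positive"),     # positively charged at physiological pH
    **dict.fromkeys("DE",      "negative"),     # negatively charged
    **dict.fromkeys("GPC",     "special"),      # flexible / rigid / disulfide-forming
}

def composition(seq):
    """Fraction of residues in each chemical class for a one-letter sequence."""
    counts = Counter(RESIDUE_CLASS[aa] for aa in seq.upper())
    n = len(seq)
    return {cls: round(c / n, 2) for cls, c in counts.items()}

# First residues of the human haemoglobin beta chain (UniProt P68871):
print(composition("MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRF"))
```

Even this crude featurisation captures something real: globular proteins show a characteristic balance of buried hydrophobic and surface hydrophilic residues, which is one of the oldest sequence-level signals of foldability.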
Structure: The Four Levels and the Folding Problem
A linear amino-acid chain in solution does not stay linear — it folds, in milliseconds to seconds, into a specific three-dimensional shape determined by the sequence. The folded structure is what does the work; an unfolded protein is essentially useless.
The four levels of structure
Biologists describe protein structure at four levels. The primary structure is the linear amino-acid sequence (read directly from the gene). The secondary structure is the local folding pattern — chains commonly form alpha helices (right-handed spirals stabilised by hydrogen bonds along the backbone) and beta sheets (extended strands aligned in parallel or antiparallel to form pleated surfaces). The tertiary structure is the complete 3D fold of a single polypeptide chain — how the helices, sheets, and connecting loops pack together. The quaternary structure describes how multiple folded polypeptide chains (subunits) assemble into a functional complex; haemoglobin, for example, is a quaternary assembly of four subunits.
The folding problem and AlphaFold
The folding problem — predict the three-dimensional structure from the amino-acid sequence — was articulated in the early 1960s and resisted solution for half a century. The difficulty is that the search space is astronomical (a 100-residue chain can in principle adopt on the order of 10¹⁰⁰ configurations), and the energy landscape that selects the correct fold is exquisitely sensitive to sequence. Experimental methods (X-ray crystallography, NMR, cryo-electron microscopy) could solve individual structures at substantial cost; computational prediction was poor until DeepMind's AlphaFold 2 system effectively solved the problem in 2020. The result is that as of 2026, essentially every protein sequence in UniProt has a predicted structure available in the AlphaFold Database — a transformation that has fundamentally reshaped how researchers work with proteins. Section 10 develops the AlphaFold story in detail.
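The astronomical-search-space claim is worth making concrete with back-of-envelope arithmetic, the form in which Levinthal's paradox is usually stated. The numbers below are illustrative assumptions, not measurements:

```python
# Back-of-envelope version of the search-space argument (Levinthal's paradox).
# Assume the chain can test one full configuration per picosecond -- generous --
# and that a 100-residue chain has ~10^100 configurations, as in the text.
n_residues = 100
conformations = 10 ** n_residues
tests_per_second = 1e12                 # one configuration per picosecond
seconds_needed = conformations / tests_per_second
age_of_universe_s = 4.3e17              # ~13.8 billion years in seconds
print(f"{seconds_needed / age_of_universe_s:.1e} universe-ages for exhaustive search")
```

Since real proteins fold in milliseconds to seconds, folding cannot be exhaustive search; the energy landscape must funnel the chain toward the native state, which is exactly the structure that sequence-sensitive prediction methods have to capture.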
The Major Functional Classes
Proteins do almost everything that requires action in a cell. The major functional classes:
Enzymes
Enzymes catalyse chemical reactions — they accelerate reactions by factors of 10⁶ to 10¹² or more, turning transformations whose uncatalysed half-lives can run to years into ones that complete in milliseconds. Without enzymes, life as we know it would be too slow to exist. Examples: DNA polymerase (copies DNA), hexokinase (first step of glucose metabolism), trypsin (digests proteins in your gut). The 2024 estimate is that humans have roughly 5,000 enzymes; bacteria like E. coli have ~2,000.
Structural proteins
Structural proteins give cells and tissues shape, strength, and elasticity. Collagen is the most abundant protein in mammals — it forms the matrix of skin, tendons, bone, and cartilage. Keratin makes up hair, nails, and the outer layer of skin. Actin and tubulin form the cytoskeleton that gives cells their shape and lets them move. The mechanical properties we associate with biological materials (tensile strength of tendons, hardness of nails, flexibility of skin) come from the architecture of structural proteins.
Transport proteins
Transport proteins move molecules around. Haemoglobin carries oxygen from lungs to tissues. Membrane channels and pumps ferry ions and small molecules across cell membranes — sodium-potassium pumps maintain the electrical gradients that nerves use to signal; aquaporins let water across membranes; glucose transporters import sugar. Lipoproteins (LDL, HDL) carry cholesterol and fats through the bloodstream. Membrane proteins as a class (transporters, channels, receptors) account for a substantial fraction of all proteins and the majority of small-molecule drug targets, even though they crystallise poorly; Section 12 returns to the AI-prediction challenges they pose.
Signalling proteins
Signalling proteins let cells communicate. Hormones (insulin, growth hormone, thyroxine) are protein or peptide messengers that travel between tissues. Receptors sit on cell surfaces and detect external signals — when a hormone binds a receptor, it triggers a cascade of intracellular events. Kinases and other enzymes propagate signals inside cells by adding phosphate groups to other proteins, switching them on or off. The signalling apparatus is a major drug-discovery target: many drugs work by binding receptors (agonists) or blocking them (antagonists).
Antibodies
Antibodies (also called immunoglobulins) are the immune system's recognition proteins. Each antibody has a distinctive shape that binds to a specific molecular feature on a pathogen (a viral coat protein, a bacterial surface protein, a toxin). The immune system can produce antibodies against essentially any foreign molecule, which is the basis of vaccination and the modern monoclonal-antibody drug class (the various "-mab" drugs, including major COVID and cancer therapeutics). Section 17 develops antibody design as an AI application.
Motor proteins
Motor proteins generate mechanical force. Myosin walks along actin filaments and is the basis of muscle contraction. Kinesin and dynein walk along microtubules and transport cargo through cells. ATP synthase is a literal rotary motor that produces the cell's energy currency by spinning. Motor proteins are the closest biology comes to designed mechanical systems, and their study has yielded substantial insights into how molecular machines work.
Regulatory and other classes
Regulatory proteins control which genes are expressed when. Transcription factors bind specific DNA sequences and turn nearby genes on or off. Chromatin remodelers reposition nucleosomes to expose or hide DNA from the transcription machinery. The regulatory layer is a major source of cell-type identity — every cell has the same genome, but different cells express different proteins because different regulatory proteins are active. Other classes include storage proteins (ferritin stores iron, ovalbumin stores amino acids in egg white), defensive proteins beyond antibodies (defensins, antimicrobial peptides), and scaffold proteins that organise other proteins into functional complexes. A typical human cell is estimated to express 10,000–20,000 distinct proteins at any given time, drawn from the ~20,000 protein-coding genes plus splice and post-translational variants.
Experimental Techniques: X-ray, NMR, Cryo-EM
Before AlphaFold, every protein structure had to be determined experimentally. The three dominant techniques — X-ray crystallography, NMR spectroscopy, and cryo-electron microscopy — produced essentially all of the ~220,000 structures in the Protein Data Bank, and they remain the gold standard for atomic-resolution validation. Understanding what they measure, what they can and can't do, and how their data feeds into modern AI methods is essential context for the AI methodology of Sections 10–18.
X-ray crystallography
X-ray crystallography has been the dominant technique for protein structure determination since the 1950s and accounts for roughly 85% of structures in the PDB. The methodology: purify a protein, crystallise it (a substantial art — many proteins resist crystallisation), shoot the crystal with a powerful X-ray beam, record the diffraction pattern as the crystal rotates, mathematically invert the diffraction pattern to recover the electron density, and interpret the density to build an atomic model. The diffraction pattern is the Fourier transform of the electron density; the inversion requires phase information that is not directly measured (the "phase problem"), which historically required experimental techniques (multiple isomorphous replacement, anomalous scattering) or molecular replacement using related known structures. Modern AlphaFold predictions are increasingly used as molecular-replacement templates, substantially shortening the experimental workflow.
X-ray crystallography produces the highest-resolution protein structures available — sub-1.0 Å resolution is achievable for well-ordered crystals, with typical resolutions of 1.5–3 Å for routine work. The technique works for a wide range of protein sizes, from small peptides to massive ribosomes. Its limitations are real: many proteins (intrinsically disordered ones, membrane proteins, large flexible complexes) are hard or impossible to crystallise; the crystal environment may distort biological structures (lattice contacts, non-physiological pH or salts); only a static structure is captured (no information about dynamics). Despite these limitations, X-ray crystallography remains the gold standard for atomic-resolution structures, and the technique continues to mature, with synchrotron and X-ray-free-electron-laser sources providing ever-shorter exposures and higher-quality data.
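The phase problem can be demonstrated in a few lines of numpy. A toy 1D "electron density" is fully recoverable from its complex Fourier transform, but not from the magnitudes the detector actually records — a sketch of the information gap, not a model of real crystallographic phasing:

```python
# 1D toy of the phase problem: diffraction measures only Fourier magnitudes.
# Magnitudes plus correct phases invert exactly; magnitudes plus random
# phases give an unrelated "density", so phases must be supplied separately
# (experimental phasing, molecular replacement, or an AlphaFold template).
import numpy as np

rng = np.random.default_rng(0)
density = rng.random(64)                    # stand-in for a 1D electron density

F = np.fft.fft(density)
magnitudes = np.abs(F)                      # what the detector records
random_phases = np.exp(1j * rng.uniform(0, 2 * np.pi, F.shape))
wrong = np.fft.ifft(magnitudes * random_phases).real  # same magnitudes, wrong phases
right = np.fft.ifft(F).real                 # magnitudes + correct phases

assert np.allclose(right, density)          # full complex data inverts exactly
print("max error with correct phases:", np.abs(right - density).max())
print("max error with random phases: ", np.abs(wrong - density).max())
```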
NMR spectroscopy for structure
NMR spectroscopy for protein structure determination uses many of the same nuclear-spin principles introduced in Ch 02 Section 9, but applied at scale to determine atomic-resolution structures of proteins in solution rather than in crystals. The methodology measures distances between specific atomic pairs (typically through Nuclear Overhauser Effects between hydrogens within ~5 Å of each other) and dihedral angles, then computationally assembles these constraints into a 3D structure. Modern protein NMR uses isotopically-labelled samples (¹³C, ¹⁵N, sometimes ²H) and multidimensional pulse sequences to disentangle the thousands of overlapping signals.
NMR has substantial advantages: structures are determined in solution, closer to physiological conditions; dynamic information (which residues are flexible, on what time scales) is accessible; ligand-binding studies can be done in real time. The principal limitation is size — solution NMR works well for proteins up to ~30 kDa (~250 residues) and becomes increasingly difficult above that, though TROSY-based methods extend the range somewhat. NMR represents about 10% of PDB structures and remains particularly important for intrinsically-disordered proteins (which crystallise poorly), small ligand-binding studies, and dynamics characterisation.
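The constraint-assembly step — turning pairwise distances into coordinates — can be sketched with classical multidimensional scaling, under the unrealistic assumption of a complete, exact distance matrix. Real NMR structure calculation works from sparse, noisy distance bounds and uses restrained molecular dynamics; this shows only the core geometric idea:

```python
# Classical MDS: recover 3D coordinates (up to a rigid transform) from a
# complete matrix of pairwise distances. Double-centre the squared-distance
# matrix, eigendecompose, keep the three largest components.
import numpy as np

rng = np.random.default_rng(1)
true_xyz = rng.random((10, 3))                         # 10 "atoms" in 3D
D = np.linalg.norm(true_xyz[:, None] - true_xyz[None, :], axis=-1)

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n                    # centring matrix
B = -0.5 * J @ (D ** 2) @ J                            # Gram matrix of coordinates
eigvals, eigvecs = np.linalg.eigh(B)
top = np.argsort(eigvals)[::-1][:3]                    # three largest eigenvalues
recovered = eigvecs[:, top] * np.sqrt(eigvals[top])

# Recovered distances match the inputs, even though the coordinates are only
# determined up to rotation/reflection/translation.
D2 = np.linalg.norm(recovered[:, None] - recovered[None, :], axis=-1)
print("max distance error:", np.abs(D - D2).max())
```

The rigid-transform ambiguity is fundamental: distances alone cannot fix an absolute frame, which is also why structure-prediction losses and metrics (e.g. RMSD after superposition) are defined up to alignment.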
Cryo-electron microscopy (cryo-EM)
Cryo-electron microscopy has been the most-transformative protein-structure-determination technique of the 2010s and 2020s. The methodology: flash-freeze a thin film of protein solution into a glassy ice (preserving native conformation), image the frozen sample with a transmission electron microscope, computationally average tens of thousands to millions of individual particle images to produce a high-resolution 3D reconstruction. The 2017 Nobel Prize in Chemistry recognised Henderson, Frank, and Dubochet for the technique's development; the empirical performance reached "atomic resolution" in the 2014–2017 "resolution revolution" driven by direct-electron-detector cameras and improved image-processing algorithms.
Cryo-EM excels precisely where X-ray crystallography struggles: large complexes (ribosomes, membrane-protein complexes, viral capsids), membrane proteins (which crystallise poorly), and conformationally heterogeneous samples (where the imaging-and-classification workflow can resolve multiple coexisting states). Modern cryo-EM accounts for ~5% of PDB structures and is growing rapidly; the proportion of new high-resolution structures coming from cryo-EM exceeded X-ray's contribution for some classes of target by 2022. The technique is computationally intensive — modern workflows use ML methods extensively for particle picking, classification, and reconstruction, and the methodology of cryo-EM data processing is itself an active AI-application area.
The Protein Data Bank
The Protein Data Bank (PDB), established in 1971, is the world's repository for experimentally-determined macromolecular structures. The PDB holds atomic coordinates, experimental conditions, validation data, and rich metadata for each deposition. The format (originally fixed-width text PDB files; increasingly the more-extensible mmCIF format) is the lingua franca of structural biology. As of 2026 the PDB holds ~220,000 structures, with weekly deposition rates that reflect both the maturity of established techniques and the cryo-EM revolution. The database is free, openly accessible, and the foundation of essentially all structural-biology AI training data — AlphaFold 2's training set was the PDB minus held-out test cases; ESM's language models are trained on UniProt sequences, with ESMFold's structure module supervised on PDB-derived structures; the various 2024–2026 protein foundation models all build on the PDB substrate.
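To make the fixed-width legacy format concrete, here is a sketch of parsing one ATOM record by column position. The column layout follows the published PDB format specification; the record below is an illustrative example, not a specific deposition:

```python
# Parsing one fixed-width ATOM record from a legacy PDB file. Column ranges
# follow the PDB format specification (1-based columns in the comments);
# mmCIF replaces this layout with self-describing key-value tables.
def parse_atom(line):
    return {
        "serial":  int(line[6:11]),      # columns 7-11:  atom serial number
        "name":    line[12:16].strip(),  # columns 13-16: atom name
        "resname": line[17:20].strip(),  # columns 18-20: residue name
        "chain":   line[21],             # column  22:    chain identifier
        "resseq":  int(line[22:26]),     # columns 23-26: residue number
        "x":       float(line[30:38]),   # columns 31-38: x coordinate (Å)
        "y":       float(line[38:46]),   # columns 39-46: y coordinate (Å)
        "z":       float(line[46:54]),   # columns 47-54: z coordinate (Å)
    }

record = "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67           N"
atom = parse_atom(record)
print(atom["resname"], atom["chain"], (atom["x"], atom["y"], atom["z"]))
```

In practice one uses a library (Biopython, gemmi) rather than hand-slicing, but the exercise shows why the field is migrating to mmCIF: fixed-width fields overflow for large structures, which is exactly where cryo-EM is pushing.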
How AI complements rather than replaces experiment
A common misconception is that AlphaFold has made experimental structure determination obsolete. The reality is more nuanced. AlphaFold predictions are accurate on average but routinely make specific errors that matter for downstream applications — wrong loop conformations, missed conformational states, incorrect ligand-binding-site geometry. Experimental structures remain the gold standard for atomic-resolution validation, particularly for drug-discovery applications where small geometric errors can mean the difference between active and inactive compounds. The post-AlphaFold workflow has shifted: structural biologists increasingly start from AlphaFold predictions, identify the regions where prediction is uncertain (AlphaFold's pLDDT confidence scores help with this), and focus experimental effort on those regions. The complementarity is real, and the discipline has reorganised around it rather than replaced one approach with the other.
Sequence Alignment, Evolution, and MSAs
Proteins evolve. Most proteins in modern organisms are descended, with modifications, from ancestral proteins in earlier organisms. The methods of sequence alignment compare protein sequences to each other to identify shared ancestry, conserved features, and functional sites. The methodology predates modern AI by decades but feeds directly into AlphaFold's evoformer, ESM's training data preparation, and essentially every modern protein-AI method.
Why alignment matters
If two proteins share a common ancestor, their sequences typically share substantial similarity even after hundreds of millions of years of independent evolution. The fraction of residues that match — called sequence identity — varies from near-100% for recently-diverged orthologs to ~25% for distantly-related members of the same superfamily. Below ~25% identity, alignment becomes unreliable using sequence alone, and structural information helps. Sequence alignment is the bridge that lets researchers transfer functional annotations from well-characterised proteins to less-studied ones, identify the functionally-critical residues conserved across evolution, and reveal the evolutionary history of protein families. Modern AI methods exploit alignments extensively — often more thoroughly than humans do — because the patterns of co-variation across alignments encode information about which residues interact in 3D space.
Pairwise alignment
The foundational algorithms for protein sequence alignment are dynamic-programming methods that find the optimal alignment of two sequences under a scoring scheme. Needleman-Wunsch (1970) finds the optimal global alignment — aligning the full length of both sequences. Smith-Waterman (1981) finds the optimal local alignment — the best matching subsequences regardless of overall length. Both algorithms run in O(MN) time for sequences of lengths M and N, building up a scoring matrix and tracing back to recover the alignment. The scoring schemes — substitution matrices like BLOSUM62 (Henikoff & Henikoff 1992) or PAM (Dayhoff 1978) plus gap penalties — encode evolutionary observations: certain amino-acid substitutions are common (Asp↔Glu, Lys↔Arg) because the substituted residues have similar chemistry; others are rare. The methodology is mature, mathematically elegant, and the substrate for everything that follows.
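A minimal Needleman-Wunsch implementation makes the dynamic-programming recurrence concrete. This sketch uses a flat match/mismatch score and a linear gap penalty; real tools use BLOSUM/PAM substitution matrices and affine gap penalties:

```python
# Minimal Needleman-Wunsch global alignment (flat scores, linear gaps).
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    m, n = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1): score[i][0] = i * gap
    for j in range(1, n + 1): score[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Traceback: rebuild one optimal alignment from the filled matrix.
    out_a, out_b, i, j = [], [], m, n
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i-1] == b[j-1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + sub:
            out_a.append(a[i-1]); out_b.append(b[j-1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            out_a.append(a[i-1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j-1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[m][n]

print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE"))
```

Smith-Waterman differs only in clamping each cell at zero and tracing back from the maximum cell rather than the corner, which is what turns a global alignment into a local one.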
BLAST and database searching
Pairwise alignment of two specific sequences is fast; searching one query sequence against millions of database sequences is harder. BLAST (Basic Local Alignment Search Tool, Altschul et al. 1990) is the dominant heuristic for the latter: identify high-scoring "seed" matches between query and database sequences, then extend them into local alignments using Smith-Waterman-style logic. BLAST sacrifices guaranteed optimality for speed: its seed-and-extend heuristic fills in only a tiny fraction of the dynamic-programming matrix that an exhaustive Smith-Waterman search of the database would require, and the resulting alignments are nearly always as good as the exhaustive search would produce. BLAST has been the workhorse sequence-search tool for thirty-five years; it is the first thing most biologists do with a new sequence.
Modern variants include PSI-BLAST (Position-Specific Iterative BLAST, which iterates by building a profile from initial hits and re-searching), HHblits (using profile-vs-profile alignment for sensitive remote-homology detection), and MMseqs2 (a substantially faster successor that uses optimised k-mer indexing). Among newer tools, DIAMOND provides very fast protein searches, and ML methods are increasingly incorporated, with neural-network-based homology detectors (DeepBLAST and its 2023–2025 successors) targeting ultra-remote homology.
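The seed-and-extend idea can be sketched as follows. The k-mer index, exact-match seeds, and drop-off extension below are simplifications of what BLAST actually does (scored neighbourhood seeds, substitution matrices, E-value statistics), shown for DNA-style strings to keep the example short:

```python
# Skeleton of seed-and-extend: index the target by k-mers, look up exact seed
# matches for the query, then extend each seed rightward while the running
# score stays within a drop-off of its best value ("X-drop" termination).
from collections import defaultdict

def find_seeds(query, target, k=3):
    index = defaultdict(list)
    for i in range(len(target) - k + 1):
        index[target[i:i+k]].append(i)
    return [(q, t) for q in range(len(query) - k + 1)
                   for t in index.get(query[q:q+k], [])]

def extend(query, target, q, t, k=3, match=1, mismatch=-1, dropoff=3):
    """Extend a seed at (q, t) to the right; return best score and segment."""
    score = best = k * match
    end = k
    i = k
    while q + i < len(query) and t + i < len(target):
        score += match if query[q+i] == target[t+i] else mismatch
        if score > best:
            best, end = score, i + 1
        if best - score >= dropoff:      # stop when the score has fallen too far
            break
        i += 1
    return best, query[q:q+end]

seeds = find_seeds("GATTACA", "CGATTACAGATT")
print(seeds)
print(extend("GATTACA", "CGATTACAGATT", *seeds[0]))
```

The payoff is that most database sequences share no seed with the query and are never examined at all, which is where the speedup over exhaustive dynamic programming comes from.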
Multiple sequence alignments
A multiple sequence alignment (MSA) aligns three or more sequences simultaneously, producing a matrix where each column represents a single position across all sequences. MSAs reveal patterns invisible to pairwise alignment: which positions are universally conserved (likely functionally critical), which positions vary in correlated ways (likely interacting in 3D), which insertions and deletions distinguish subfamilies. The standard MSA-construction tools — Clustal, MUSCLE, MAFFT, the various successors — use progressive alignment heuristics: build a guide tree of sequence similarity, then align sequences in tree order, accumulating alignments at each tree node.
For protein-AI applications, MSAs are the central data structure. AlphaFold 2's evoformer trunk takes an MSA as input and uses the column-wise patterns to predict 3D structure. ESM and other protein language models are essentially trained to model the same patterns implicitly, by learning from billions of individual protein sequences. The 2022–2023 wave of "MSA-free" methods (including ESMFold and OmegaFold) showed that protein language models can sometimes match MSA-based methods even without an explicit MSA at inference time — but the underlying signal that the language models exploit is still evolutionary co-variation, just learned implicitly during pretraining.
Conservation, co-variation, and contacts
An MSA column where the same amino acid appears in essentially every sequence is conserved — typically because that residue is functionally critical and mutations are rapidly purged by selection. Conservation analysis identifies likely active sites (catalytic residues in enzymes, binding residues in receptors) and structurally critical positions (buried hydrophobic core residues, disulfide-bond cysteines).
More subtly, two columns may co-vary: when one position changes in evolution, another position tends to change correspondingly. The most common reason is that the two residues are physically in contact in 3D space and one mutation needs to be compensated by another to maintain the contact. Direct Coupling Analysis (Morcos et al. 2011) and successors (PSICOV, GREMLIN, EVcouplings) extract co-variation patterns from MSAs to predict residue contacts — and this approach was the most-effective pre-AlphaFold strategy for structure prediction. AlphaFold 2's evoformer formalised the same insight as a learnable architecture and substantially exceeded the explicit co-variation methods. The methodology connects AI-for-protein-science directly to the deep evolutionary record encoded in protein sequences.
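Conservation and co-variation can both be computed from an MSA in a few lines. The toy below uses column entropy for conservation and pairwise mutual information, the simplest co-variation statistic; DCA-family methods instead fit global models precisely because plain MI cannot separate direct couplings from indirect, transitively-correlated ones:

```python
# Toy conservation and co-variation analysis on a hand-made 4-sequence MSA.
# Column entropy: 0 bits = perfectly conserved. Mutual information between
# column pairs: high when two positions change together across sequences.
import math
from collections import Counter

msa = [
    "ACDKG",
    "ACERG",
    "ACDKN",
    "ACERN",
]

def column(i):
    return [seq[i] for seq in msa]

def entropy(col):
    n = len(col)
    return -sum((c / n) * math.log2(c / n) for c in Counter(col).values())

def mutual_information(i, j):
    joint = list(zip(column(i), column(j)))
    return entropy(column(i)) + entropy(column(j)) - entropy(joint)

# Columns 0 and 1 are perfectly conserved (entropy 0); columns 2-4 vary.
print([round(entropy(column(i)), 2) for i in range(5)])
# Columns 2 and 3 change in lockstep (D with K, E with R): MI = 1 bit.
print(round(mutual_information(2, 3), 2))
# Columns 2 and 4 vary independently: MI = 0 bits.
print(round(mutual_information(2, 4), 2))
```

In a real analysis the lockstep pair would be a contact-prediction candidate: a charge-swap at one position compensated by a swap at its 3D neighbour is the classic co-variation signature.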
Phylogenetics in brief
A phylogenetic tree reconstructs the evolutionary history of related sequences. Methods include distance-based approaches (UPGMA, neighbour-joining) that build trees from pairwise distances, parsimony-based methods that minimise total mutations, and probabilistic methods (maximum-likelihood, Bayesian) that explicitly model evolutionary rates. Phylogenetics has its own large literature; for an AI reader, the relevant points are that phylogenetic trees provide the structure on top of MSAs (which sequences are most closely related, which subfamilies exist, when groups diverged), and that this structure is increasingly used by modern protein-AI methods to weight or partition training data appropriately.
Protein Families and Domains
Proteins are not random sequences; they are organised into families, superfamilies, and a relatively limited set of structural folds. Understanding this organisation is essential for transferring functional information across related proteins, predicting function from sequence, and grounding the AI methods that operate at the family level.
The domain as the unit of evolution
Most proteins are not single uniform structures but assemblages of domains — independently-folding structural units of typically 50–250 amino acids that recur across many proteins. A single protein may contain one domain (small enzymes, many transcription factors) or many (large signalling proteins, multi-functional enzymes). The domain is roughly the protein-science analogue of the function in software: a self-contained unit with defined inputs, outputs, and behaviour, composable with other domains to build larger systems. Evolution operates extensively at the domain level — domains get duplicated, fused, swapped between proteins, mutated to acquire new functions — and the resulting domain architectures are often more informative about function than full sequence comparison.
Fold space and structural classification
How many distinct three-dimensional folds exist in nature? Surveys of the PDB suggest the answer is in the low thousands — roughly 1,000–1,500 distinct folds based on hierarchical classifications. Two databases dominate the field. SCOP (Structural Classification of Proteins, originally Murzin, Brenner, Hubbard, Chothia 1995, with subsequent SCOP2 and SCOPe descendants) classifies structures hierarchically: class (alpha, beta, alpha+beta, etc.), fold, superfamily, family. CATH (Class, Architecture, Topology, Homology, Orengo et al. 1997) uses a similar hierarchical framework with somewhat different criteria. Both produce roughly comparable counts of distinct folds at the top of their hierarchies.
The fold-space classification has substantial implications. Most newly-discovered proteins fold into already-known folds; truly novel folds are rare. This means that for a new sequence, the most likely structural outcome can be inferred from sequence-similarity searches into the existing fold catalogue. AlphaFold 2's success substantially supports this view — the system was trained on existing PDB structures and generalises remarkably well to new sequences, suggesting that most new structures are recombinations of known structural patterns rather than fundamentally novel folds.
Pfam and InterPro
The dominant sequence-level domain database is Pfam (Protein Families, originally Sonnhammer, Eddy, Durbin 1997, now part of InterPro). Pfam organises proteins into families based on shared evolutionary origin, with each family represented by a hidden Markov model (HMM) profile built from a curated multiple sequence alignment. As of 2025 Pfam covers ~20,000 families spanning ~80% of UniProt sequences. InterPro integrates Pfam with several other classification resources (PROSITE, SMART, CDD, TIGRFAMs, and various others) into a unified framework. The HMM-based methodology is mature and substantially more sensitive than direct BLAST searches, particularly for distantly-related family members.
For an AI reader, Pfam serves several roles. It is a source of training data: many protein-AI methods are trained or evaluated using Pfam family annotations. It is a source of features: domain composition is a powerful per-protein feature. It is a source of biological priors: the family structure encodes substantial functional information that pure sequence approaches may not extract automatically. The 2024–2026 generation of protein-AI methods increasingly integrates Pfam-style family information explicitly, with substantial empirical gains in some settings.
Common folds worth knowing
A small number of folds account for most known structures, and an AI reader benefits from knowing the most common ones by name. Alpha-beta barrels (the TIM barrel is the canonical example) consist of an inner beta barrel surrounded by alpha helices and are common in enzymes. Rossmann folds bind nucleotide cofactors and appear in many oxidoreductases and kinases. Immunoglobulin folds are the antibody scaffold and recur in many cell-surface and signalling proteins. Globin folds bind heme groups for oxygen transport (haemoglobin, myoglobin). Beta-propeller folds (typically 4–8 blades arranged radially) appear in WD40 repeats, GTPase regulators, and many protein-protein interaction modules. The fold catalogue is finite and recognisable; structural biologists can usually classify a structure by inspection.
How domain architecture predicts function
Many proteins are multi-domain, with the function of the whole determined by the combination and arrangement of its domains. A typical receptor tyrosine kinase has an extracellular ligand-binding domain, a single transmembrane helix, and an intracellular kinase domain — the architecture itself indicates "membrane-spanning kinase that is regulated by extracellular signals." Knowing the domain composition often substantially constrains function even when the specific kinase substrates aren't yet identified. Modern function-prediction methods (Section 15) increasingly use domain composition as a feature, and the methodology of remote homology detection often boils down to identifying conserved domain architectures even when full-length sequence similarity is undetectable.
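The simplest version of domain composition as a per-protein feature is a bag-of-domains count vector over a fixed domain vocabulary. The sketch below uses invented Pfam-style accession names; real function-prediction pipelines use far richer encodings (ordered architectures, learned embeddings), but the counting idea is the starting point.

```python
def domain_architecture_features(proteins, vocab):
    """Encode each protein as a bag-of-domains count vector over a fixed
    domain vocabulary. Accession names here are hypothetical."""
    index = {dom: i for i, dom in enumerate(vocab)}
    feats = []
    for domains in proteins:            # one list of domain hits per protein
        v = [0] * len(vocab)
        for dom in domains:
            if dom in index:            # ignore domains outside the vocabulary
                v[index[dom]] += 1
        feats.append(v)
    return feats

vocab = ["PF_kinase", "PF_SH2", "PF_TM"]
# a hypothetical signalling protein with two SH2 domains and one kinase domain
feats = domain_architecture_features([["PF_SH2", "PF_kinase", "PF_SH2"]], vocab)
# feats == [[1, 2, 0]]
```

Note that the bag-of-domains encoding deliberately discards domain order; methods that care about architecture (N-to-C arrangement) treat the domain string itself as a sequence.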
The post-AlphaFold reorganisation
A specific consequence of AlphaFold 2 worth flagging: with predicted structures available for essentially every protein, the boundary between sequence-based and structure-based protein classification has blurred. The 2024 wave of "structural alignment" tools (Foldseek being the most-cited) can perform fast structure-vs-structure searches across the entire AlphaFold database, identifying structural homologues that sequence-based methods miss. This has substantially extended the reach of protein-family detection — proteins that look unrelated by sequence often share clear structural similarity, revealing distant evolutionary relationships and previously-hidden functional connections. Tools like Foldseek (van Kempen et al. 2023) now process structural-similarity queries at speeds comparable to BLAST, which has begun to reorganise how protein-family databases are constructed.
Modifications and Regulation
A protein's behaviour in cells is rarely determined by its sequence alone. After translation, proteins are extensively modified, regulated, trafficked, and ultimately degraded — and these post-translational layers constitute most of the cell's actual control system. Understanding them is essential for understanding why proteins behave differently in different contexts, why drug-target activity varies across cell types, and why functional prediction from sequence alone has fundamental limits.
Post-translational modifications
Post-translational modifications (PTMs) are chemical modifications to amino-acid side chains made after a protein is synthesised. They are essentially universal — most cellular proteins carry at least one PTM, and many carry dozens. The major classes by abundance and functional importance:
Phosphorylation (addition of a phosphate group to serine, threonine, or tyrosine residues) is the dominant signalling modification. Roughly one-third of human proteins are phosphorylated at some point; the modifications are dynamic, reversible, and used to switch protein activities on or off. The enzymes that add phosphate (kinases) and remove it (phosphatases) form an integrated regulatory layer — the human genome encodes ~520 kinases (the "kinome") and ~150 phosphatases (the "phosphatome"). Kinases are among the most-studied drug targets — Imatinib (Gleevec) for chronic myeloid leukaemia, Sorafenib for kidney cancer, and dozens of other "-nib" drugs all target kinases.
Ubiquitination attaches the small protein ubiquitin to lysine residues. Single ubiquitin molecules signal trafficking or activity changes; chains of ubiquitins (typically polyubiquitin chains attached at specific lysines) target the protein for degradation by the proteasome (see below). The 2004 Nobel Prize in Chemistry recognised the discovery of the ubiquitin-proteasome system. Ubiquitination has become a major drug-discovery target through PROTACs (Proteolysis-Targeting Chimeras) — small molecules with two binding domains that physically link a target protein to a ubiquitin ligase, forcing the cell to degrade the target. The methodology has produced late-stage clinical candidates and a substantial development pipeline.
Glycosylation attaches sugar chains (glycans) to specific residues, typically asparagine (N-linked) or serine/threonine (O-linked). Glycosylation is essentially universal on proteins that are secreted from cells or sit on the cell surface, and the glycans determine cellular trafficking, protein folding, and recognition by other proteins (the immune system reads glycan patterns to distinguish self from non-self). Glycosylation is heterogeneous — the same protein backbone can carry many different glycan structures depending on cell type, conditions, and enzymatic context — which substantially complicates structural and functional characterisation.
Methylation and acetylation are smaller modifications that fine-tune protein activity. Histone methylation and acetylation are the canonical examples — these modifications on histone tails encode the histone code that governs chromatin state and gene expression (Ch 04 develops the genomic context). Beyond histones, methylation and acetylation modify many other proteins, often regulating their interactions and stability.
The full PTM catalogue includes dozens of additional modifications — sumoylation, neddylation, palmitoylation (lipid modification), O-GlcNAcylation, ADP-ribosylation, oxidation, nitration, and various others. Resource databases like PhosphoSitePlus and UniProt's PTM annotations catalogue them comprehensively. From an AI-prediction perspective, PTM prediction is a substantial sub-field with its own methods (the various "PhosphoNet"-style classifiers, the modern transformer-based PTM predictors), and PTM-aware structure prediction is an active 2024–2026 research area.
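The simplest PTM-site predictors are motif scans. The sketch below finds candidate N-linked glycosylation sites using the well-known sequon Asn-X-Ser/Thr (X not Pro); the sequon is necessary but far from sufficient, which is exactly why the field moved to ML-based predictors that weigh the surrounding sequence context.

```python
import re

def nglyc_candidates(seq):
    """Return 0-based positions of Asn residues sitting in a candidate
    N-glycosylation sequon N-X-[S/T] with X != P. A lookahead is used so
    that overlapping sequons are all reported."""
    return [m.start() for m in re.finditer(r"(?=(N[^P][ST]))", seq)]

# toy sequence (invented): two candidate sequons, one blocked by proline
hits = nglyc_candidates("MNGSANPTNNTA")   # [1, 8]
```

Position 5 is correctly rejected (N-P-T violates the X != P rule), illustrating why even a "simple" motif needs careful pattern logic.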
Localisation and trafficking
Proteins do not freely wander cells; they are actively localised to specific compartments. Proteins destined for the cell membrane or for secretion carry an N-terminal signal peptide that targets them to the endoplasmic reticulum, where they are processed, often glycosylated, and routed toward their destinations. Proteins destined for the nucleus typically carry a nuclear localisation signal (NLS) — a short basic-residue motif recognised by import machinery. Mitochondrial proteins typically have N-terminal cleavable mitochondrial-targeting sequences. The methodology of localisation prediction from sequence is mature (DeepLoc and successors) and important for understanding what a protein actually does in a cell.
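Classical monopartite NLSs are short runs rich in lysine and arginine, so a crude first-pass detector is just a basic-residue density scan. The sketch below is a toy heuristic (the window size and threshold are invented for illustration); real localisation predictors like DeepLoc learn far subtler signals from full sequences.

```python
def basic_rich_windows(seq, win=7, min_basic=5):
    """Flag sequence windows dense in Lys/Arg, a crude stand-in for
    classical monopartite NLS detection. win and min_basic are arbitrary
    illustrative choices, not calibrated parameters."""
    hits = []
    for i in range(len(seq) - win + 1):
        w = seq[i:i + win]
        if sum(c in "KR" for c in w) >= min_basic:
            hits.append((i, w))
    return hits

# the classic SV40 large-T NLS "PKKKRKV" embedded in a toy sequence
hits = basic_rich_windows("MAPKKKRKVGSA")
```

The scan fires on the SV40 motif and its immediate neighbours; a real predictor must also reject basic-rich regions that are buried in the folded structure and therefore invisible to the import machinery.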
Protein degradation
Proteins do not last forever. Most cellular proteins are degraded within hours to days, and the rate of degradation is regulated, signal-dependent, and tightly coupled to PTMs. The two major degradation systems are the ubiquitin-proteasome system (UPS) — polyubiquitinated proteins are recognised by the 26S proteasome and chopped into peptides — and the autophagy-lysosome system — entire regions of cytoplasm are engulfed by autophagosomes, fused with lysosomes, and broken down. Both systems are essential for cellular homeostasis; failures cause specific diseases (Parkinson's involves accumulating alpha-synuclein that should have been degraded; certain cancers exploit altered proteasome activity for survival). Drug development targeting these systems (proteasome inhibitors like bortezomib for myeloma; autophagy modulators in clinical trials for various conditions) is an active area, and AI-for-drug-discovery methods (Ch 07) increasingly engage with degradation as a target.
Allosteric regulation and conformational change
Beyond PTMs, many proteins are regulated through allosteric mechanisms — binding of a regulator at one site changes the protein's activity at a distant site by inducing conformational change. The classical example is haemoglobin's cooperative oxygen binding (binding one O₂ increases affinity for the next) mediated by allosteric transitions. Allosteric drugs (which bind sites distinct from a protein's active site) have become an important pharmaceutical category — they often offer better selectivity than active-site inhibitors because allosteric sites are less conserved across protein families. The methodology of allosteric site identification and allosteric drug design is an active AI-for-drug-discovery area.
Why this matters for AI
The post-translational layer matters for AI methods because most current protein-AI tools operate on canonical sequences without explicit modelling of PTMs, localisation, or degradation. Predictions of "function" from canonical sequence are necessarily incomplete — the same protein in different cell types or developmental stages may have substantially different effective behaviours due to PTM and localisation differences. The 2024–2026 frontier of AI-for-protein-science increasingly engages with this: PTM-aware foundation models, dynamic-state prediction methods, and integration with cellular-context models are all active research areas. An AI reader entering protein-AI work should know that the canonical-sequence layer captured by AlphaFold and ESM is one layer of a multi-layer cellular reality, and that the additional layers are increasingly tractable for ML methods.
From Protein Science to ML: An Orientation
The previous eight sections established the protein science. This one is the bridge to the methodology that follows. Most ML practitioners come to protein science assuming the methods will transfer cleanly — it's just sequences and structures, just classification and regression. The methods do transfer, but several properties of the protein-AI subfield make it methodologically distinctive: a particular shape of public-data substrate, the unusual character of structure-as-a-prediction-target, the availability of multiple complementary training signals, the unusual rapidity of progress that has made eighteen-month-old methods feel dated, and the empirical-vs-mechanistic tension that shapes how results are interpreted. This section orients the ML practitioner; Sections 10–18 develop the methods within that frame.
The data substrate, from an AI perspective
Protein science has the largest, cleanest, most-accessible labelled data of any AI application area. UniProt's ~250M annotated sequences are essentially a free, curated, high-quality training corpus — the protein-science analogue of Common Crawl, but with substantially better quality control. PDB's ~220K experimental structures provide structural ground truth at atomic resolution. UniRef clusters at 50%, 90%, and 100% identity provide pretraining-data deduplication out of the box. BFD and MGnify add metagenomic sequences that substantially expand training-data diversity beyond the well-studied model organisms. MaveDB aggregates deep mutational scanning experiments providing labelled variant-effect ground truth. PhosphoSitePlus, UniProt's functional annotations, and the various specialised databases provide labels for downstream tasks. From a pure ML perspective, the data substrate is enviable: the corpus is large, the labels are high-quality, the evaluation benchmarks are well-established, and the data is genuinely open and downloadable.
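The identity-clustering idea behind UniRef-style deduplication can be sketched greedily, in the spirit of CD-HIT: longer sequences seed clusters, and later sequences join the first representative they match above a threshold. The ungapped identity measure below is a toy stand-in for real alignment-based identity.

```python
def identity(a, b):
    """Crude percent identity over the shorter length, with no alignment.
    Real pipelines use alignment-based identity; this is illustrative only."""
    n = min(len(a), len(b))
    return sum(x == y for x, y in zip(a, b)) / n

def greedy_cluster(seqs, thresh=0.9):
    """Greedy clustering: process sequences longest-first; each sequence
    joins the first existing representative it matches above thresh,
    otherwise it becomes a new representative."""
    reps = []
    assignment = {}
    for s in sorted(seqs, key=len, reverse=True):
        for r in reps:
            if identity(s, r) >= thresh:
                assignment[s] = r
                break
        else:
            reps.append(s)
            assignment[s] = s
    return assignment

# two near-duplicates collapse into one cluster; the outlier stands alone
clusters = greedy_cluster(["MKTAYIAKQR", "MKTAYIAKQL", "GGGGGGGGGG"])
```

Deduplicating at (say) 50% identity before pretraining prevents the model from being rewarded for memorising heavily-sequenced families, which is why UniRef50 is the default PLM training corpus.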
Structure-as-a-target is unusual
Most ML applications predict scalar values, classifications, or token sequences. Protein structure prediction is unusual because the target is a set of 3D atomic coordinates with rotational and translational symmetries — predicting a structure is fundamentally a regression problem in SE(3), the special Euclidean group of rigid-body transformations. This forces the methodology toward equivariant architectures (Ch 01 Section 8) that respect the symmetries by construction. AlphaFold 2's structure module uses an SE(3)-equivariant attention mechanism for this reason; subsequent methods (RFdiffusion, the various 2024–2026 successors) all use related equivariant constructions. The methodology connects directly to the broader equivariant-network literature in physics, chemistry, and computer graphics, and protein-AI has been one of the field's most-productive testbeds for equivariance methods.
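Because a predicted structure is only defined up to rotation and translation, structure losses and metrics are computed after optimal rigid superposition. A minimal numpy sketch of the standard Kabsch algorithm used for this, with a numerical check that the resulting RMSD is invariant to an arbitrary rigid-body transform:

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between two (N, 3) coordinate sets after optimal rigid
    superposition (Kabsch algorithm): centre both, take the SVD of the
    covariance, and correct the sign to avoid an improper rotation."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))       # reflection guard
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T      # optimal rotation
    return float(np.sqrt(((P @ R.T - Q) ** 2).sum() / len(P)))

rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
theta = 0.7                                       # arbitrary rotation angle
R0 = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0,            0.0,           1.0]])
Q = P @ R0.T + np.array([1.0, -2.0, 3.0])         # rotated + translated copy
# kabsch_rmsd(P, Q) is ~0: the metric ignores the rigid-body transform
```

Equivariant architectures push this symmetry into the network itself rather than handling it only in the loss, but superposition-based metrics like this remain how predictions are scored.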
Multiple complementary training signals
A specific methodological richness worth flagging: protein-AI methods can exploit several distinct training signals simultaneously. Sequence-only signals from masked-language-modelling on UniProt produce substantial structural and functional information without any structure data (ESM-1 demonstrated this in 2019; subsequent work refined and extended). Structure signals from the PDB train end-to-end structure prediction directly. Evolutionary signals from multiple sequence alignments encode co-variation patterns that imply contacts and conservation. Functional signals from labelled databases provide downstream supervision. Experimental fitness signals from deep mutational scans provide direct readouts of how mutations affect function. Most modern protein-AI methods combine several of these signals — AlphaFold 2 uses sequence, structure, and MSA simultaneously; ESM-3 unifies sequence, structure, and function in a single model — and the methodology has matured substantially around how to integrate them.
The unusual rapidity of progress
The protein-AI subfield has progressed faster than nearly any comparable AI domain. From CASP12 (2016) to CASP14 (2020), structure-prediction accuracy went from ~50 GDT-TS to ~92 GDT-TS — roughly a doubling of the most-cited metric, on the same benchmark, in four years. Subsequent post-AlphaFold progress has been similarly compressed: ESMFold's 2022 release, RFdiffusion's 2023 release, AlphaFold 3 in May 2024, the 2024 Nobel Prize in Chemistry. The rapidity has practical consequences for an AI practitioner entering the field — methods released eighteen months ago may already be displaced, the literature is in continuous flux, and "best practice" in 2026 differs substantially from 2024. This chapter aims to capture the durable conceptual structure beneath the rapid surface change.
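The GDT-TS numbers quoted above have a simple definition worth internalising: the mean, over distance cutoffs of 1, 2, 4, and 8 Angstroms, of the fraction of C-alpha atoms within that cutoff of the reference. The sketch below is a toy version that assumes the structures are already superposed; the real metric maximises each fraction over superpositions.

```python
import numpy as np

def gdt_ts(pred, true):
    """Toy GDT-TS on pre-superposed (N, 3) C-alpha coordinate arrays:
    mean over the 1/2/4/8 Angstrom cutoffs of the fraction of residues
    within that distance of the reference, scaled to 0-100."""
    d = np.linalg.norm(pred - true, axis=1)       # per-residue CA distance
    return 100.0 * np.mean([(d <= c).mean() for c in (1.0, 2.0, 4.0, 8.0)])

true = np.zeros((10, 3))
# shift every residue by 3 Angstroms: fails the 1 and 2 A cutoffs,
# passes the 4 and 8 A cutoffs, so the score is exactly 50
pred = true + np.array([3.0, 0.0, 0.0])
```

The multi-cutoff averaging is what makes GDT-TS more forgiving than RMSD: a single badly-placed loop drags RMSD up quadratically but only dents the per-cutoff fractions.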
The empirical-vs-mechanistic tension
A specific methodological tension shapes the field. AI methods routinely produce predictions that work without producing mechanistic explanations for why they work. AlphaFold 2 predicts structures accurately, but the model's internal representation does not straightforwardly reveal the physical principles that govern folding. ESM's protein embeddings cluster by function, but the clustering is not interpretable as a set of biological rules. This pattern is familiar from other ML domains, but it has specific consequences in protein science where mechanistic understanding is the traditional goal. The methodology has accommodated this tension with substantial interpretability work (attention pattern analysis, in silico mutagenesis, embedding probing), but the underlying tension between predictive accuracy and mechanistic insight remains, and Section 18 returns to it.
Abundant clean data, structural-target problems with rich symmetry, multiple complementary training signals, and a watershed empirical result (AlphaFold 2): together these make protein-AI the most-developed AI-for-Science subdomain. The methodology developed here generalises to other scientific-prediction problems with similar structure: rich symmetries, abundant data, structural targets, multi-signal supervision.
The AlphaFold Story
No single result has shaped the modern AI-for-biology field more than DeepMind's AlphaFold. This section tells the story in detail — both because it is the technology most readers have already heard of and because understanding what AlphaFold did, how it did it, and why it mattered is the best on-ramp to the rest of the chapter.
A fifty-year problem
The protein folding problem — predict a protein's three-dimensional structure from its amino-acid sequence — was articulated in the early 1960s. Christian Anfinsen's Nobel-winning experiments showed that an unfolded ribonuclease enzyme would spontaneously refold to its functional shape from sequence alone, implying that the sequence completely determined the structure. The problem was clear; the solution was not. For half a century, computational biologists tried to find the rules that translated sequence into structure, and for half a century the problem resisted them. Experimental methods (X-ray crystallography from the 1950s onward, NMR from the 1980s, cryo-electron microscopy from the 2010s) could solve individual structures, but each took months and tens to hundreds of thousands of dollars. By 2020, the Protein Data Bank (PDB) held about 170,000 experimentally-determined structures while UniProt held over 200 million sequenced proteins — a 1,000-to-1 gap that grew every day.
The pre-AlphaFold computational landscape included homology modelling (use a related protein with known structure as a template), Rosetta and the various physics-based methods (David Baker's lab, the dominant academic effort), and a long tail of statistical and machine-learning approaches. Progress was real but incremental. The biennial CASP competition (Critical Assessment of protein Structure Prediction, run since 1994) provided a community benchmark: organisers collect protein sequences whose structures have just been solved but not yet released; teams submit predictions; the predictions are scored against the held-back structures. CASP results through CASP12 (2016) showed steady but slow improvement, with the best methods producing predictions of moderate accuracy (~50 GDT-TS, where 100 is perfect and 50 is the rough boundary between "useful" and "not") for novel proteins.
AlphaFold 1: the first wake-up call
DeepMind entered CASP13 (December 2018) with AlphaFold 1, a deep-learning system that combined convolutional networks operating on multiple-sequence-alignment-derived features (residue covariation across evolutionarily-related proteins) with a gradient-descent step on inferred inter-residue distance constraints. The result substantially outperformed every other entry — for the first time, a deep-learning system was the best in the world at protein-structure prediction. The community's reaction mixed admiration with unease. The architecture was not particularly elegant — it was effectively a CNN pipeline producing distance-restraint scores that an external optimisation step folded into structures — and many participants felt that AlphaFold 1 was a step toward the eventual solution rather than the solution itself. The 2019 CASP analysis paper called it "a major step forward" while reserving judgment on whether the trend would continue. It would.
AlphaFold 2: the watershed (CASP14, November 2020)
The November 2020 CASP14 result was the watershed. AlphaFold 2 produced predictions whose median accuracy across the entire CASP target set reached approximately 92 GDT-TS — for most targets, the predicted structure was indistinguishable from experimentally-determined structure within standard experimental error. The CASP14 organisers and participating scientists described it as effectively solving the long-standing problem; major newspapers (the New York Times, the Financial Times, The Economist) ran cover-story-level coverage of "a turning point in biology." The 2021 Nature paper that documented the system became one of the most-cited scientific papers of the decade.
The architecture — substantially redesigned from AlphaFold 1 — combined several innovations into a coherent system. The evoformer trunk iteratively refined two coupled representations: a multiple-sequence-alignment representation (capturing evolutionary co-variation across thousands of related sequences) and a pair representation (capturing inferred geometric relationships between residue pairs). The trunk's iterative refinement let early information about residue-pair distances inform the MSA interpretation, which in turn refined the pair representation, in a feedback loop unique to the architecture. The structure module then converted the refined pair representation into 3D atomic coordinates via an iterative SE(3)-equivariant attention mechanism that updated each residue's position and orientation while respecting rotational and translational symmetries. The whole system was trained end-to-end with a loss that combined per-atom coordinate accuracy with auxiliary structural losses (distogram prediction, MSA reconstruction). The methodology represented several years of dedicated engineering work; no single innovation explains the performance jump, but the combination did.
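The evoformer's coupled-representation idea can be caricatured in a few lines of numpy. The sketch below is a schematic single-head block, not the real architecture (which adds triangle updates, gating, many heads, and dozens of layers); the weight names and shapes are invented for illustration. It shows the two-way coupling described above: the pair representation biases attention across the MSA, and an outer-product mean over the MSA feeds back into the pair representation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def evoformer_block(msa, pair, W_op, W_bias):
    """Schematic evoformer-style coupling.
    msa:  (S, L, d) representation of S aligned sequences of length L
    pair: (L, L, d) representation of residue pairs
    W_op: (d, d) invented mixing weights; W_bias: (d,) invented projection."""
    S, L, d = msa.shape
    bias = pair @ W_bias                          # (L, L) pair-derived bias
    for s in range(S):                            # row attention over columns
        scores = (msa[s] @ msa[s].T) / np.sqrt(d) + bias
        msa[s] = softmax(scores) @ msa[s]
    # outer-product mean: MSA co-variation updates the pair representation
    op = np.einsum("sid,sje->ijde", msa, msa) / S   # (L, L, d, d)
    pair = pair + op.mean(-1) @ W_op
    return msa, pair
```

Stacking such blocks is what lets early distance hypotheses reshape the MSA interpretation and vice versa; the real system interleaves this with triangle multiplicative updates that enforce geometric consistency among residue triples.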
The AlphaFold Database and the structural-biology revolution
The most-far-reaching consequence of AlphaFold 2 was not the algorithm but the database. In July 2021, DeepMind and the European Bioinformatics Institute (EBI) jointly released the AlphaFold Protein Structure Database with predicted structures for ~365,000 proteins, including the entire human proteome and the proteomes of 20 model organisms. By July 2022, the database had grown to ~200 million predicted structures — essentially every known protein in UniProt. The release was free, open, and accessible by direct URL or programmatic API; researchers could simply look up any protein and see the predicted structure within seconds. The PDB, the crown jewel of structural biology since 1971, suddenly had a complementary resource three orders of magnitude larger.
The downstream impact has been substantial. Drug-discovery pipelines now routinely use AlphaFold structures as starting points for structure-based drug design (Ch 08 develops this in detail), with major pharmaceutical companies (Roche, AstraZeneca, Bayer) reporting that AlphaFold predictions have shortened early-stage target-selection timelines by months. Basic-biology researchers use the structures to generate hypotheses about protein function — given an uncharacterised protein, the predicted structure often reveals binding pockets, active sites, or domain architectures that suggest experiments. Structural biologists themselves now routinely use AlphaFold predictions as molecular replacement templates for crystallographic phasing — speeding up the experimental workflow rather than replacing it. The 2024 wave of academic publications that depend in some way on AlphaFold predictions is in the tens of thousands; the technology has become part of the basic infrastructure of biology.
AlphaFold 3 and the multi-component frontier
AlphaFold 2 solved single-protein structure prediction. The rest of biology involves proteins binding ligands (small-molecule drugs), nucleic acids (DNA, RNA), and other proteins; predicting these "complex" structures is essential for drug discovery and for understanding cellular machinery. AlphaFold 3 (Abramson et al., Nature, May 2024) extended the methodology to handle these problems with a single unified architecture. The key architectural change replaced AlphaFold 2's structure module with a diffusion-based decoder (drawing on the diffusion-model material of Part X), which generates atomic coordinates for arbitrary biomolecular components in a single pass. The empirical case was strong: AlphaFold 3 substantially outperformed previous best methods on protein-protein binding prediction, protein-DNA binding prediction, and protein-ligand docking benchmarks, with implications for both drug discovery and structural biology more broadly.
The 2024 release also marked a methodological tension in the field. AlphaFold 3 was initially released only via a web server with substantial usage limits (no open-source code, no released weights), which generated substantial community pushback — academics felt that a proprietary structure-prediction tool released by what was effectively a commercial entity (DeepMind, owned by Alphabet) ran against the open-science traditions that the field had previously relied on. DeepMind partially walked this back in November 2024 with the release of inference-only weights for non-commercial use, but the broader question of how foundational scientific tools should be governed remains contested. Section 18 returns to this.
The 2024 Nobel Prize and the broader implications
In October 2024, the Nobel Prize in Chemistry was jointly awarded to David Baker (University of Washington, for protein design — see Sections 13–14 below) and Demis Hassabis & John Jumper (DeepMind, for AlphaFold). The award explicitly framed AlphaFold as a transformative contribution to structural biology, the kind of recognition that the protein-folding problem received only after the methodology had matured. The Nobel committee's commentary explicitly acknowledged the broader stakes: AlphaFold demonstrated that deep learning could solve grand-challenge scientific problems where decades of physics-based and statistics-based methods had reached only modest progress. The 2020 watershed plus the 2024 Nobel together provide the empirical case that has catalysed the broader AI-for-Science wave that the rest of Part XV documents — a case that AI methods can produce not just incremental progress but qualitative breakthroughs in scientific problems that have resisted traditional methods.
What AlphaFold doesn't solve
A specific caveat is worth stating clearly to avoid the over-confidence the technology can encourage. AlphaFold predicts a single static structure, but proteins are dynamic. They flex, partially unfold, oscillate between alternative conformations, and engage in induced-fit binding events involving substantial structural rearrangement. AlphaFold's predictions are accurate for the most-populated conformation in solution but are silent on dynamics, on alternative conformers, and on the rare configurations that often dominate drug-target interactions. The methodology is also less reliable for intrinsically disordered proteins (which lack stable structure by definition — and disordered regions account for ~30% of human proteome residues), for protein regions that interact only in specific cellular contexts, and for membrane proteins where the lipid environment shapes the structure in ways static prediction cannot capture. The 2024–2026 wave of "AlphaFold-with-dynamics" methods — Boltzmann generators, ensemble-AlphaFold variants, the various conformer-sampling tools, MSA-subsampling tricks for eliciting alternative conformational states — represents an active frontier on top of the static-structure foundation, and the open problems are substantial. Section 12 develops this.
Relevance today
For an AI-for-biology practitioner in 2026, AlphaFold is not just a tool but a methodological paradigm. The architecture (iterative attention over coupled representations, evolutionary input via MSAs, equivariant geometry handling) has become a template that the broader field repeatedly applies — to RNA structure prediction (RoseTTAFold-RNA, AlphaFold 3 directly), to protein-protein interactions, to molecular-dynamics surrogates. The AlphaFold-derived AlphaMissense for variant pathogenicity (Section 16) is one of many spin-offs; the broader pattern is that AlphaFold's components are increasingly used as backbones for adjacent prediction problems. The remaining sections of this chapter develop the methodology in detail: protein language models (Section 11), structure-prediction successors (Section 12), the protein-design tools that complete the read-write cycle for proteins (Sections 13–14), and the function- and variant-prediction methods that build on top (Sections 15–16).
Protein Language Models
The most-productive analogy of the post-2018 protein-AI era has been treating proteins as language. Amino-acid sequences are sequences over a 20-letter alphabet; the toolkit developed for natural language (transformers, attention, masked-language-model pretraining, in-context learning) transfers to proteins with substantial empirical success. The resulting protein language models (PLMs) are the foundational substrate of much of modern protein-AI, sitting alongside structure-prediction methods (Section 12) and protein-design methods (Sections 13–14) as one of three pillars.
The biology-as-language framing
Why does the language analogy work for proteins? Several reasons converge. Proteins are linear sequences over a small discrete alphabet, like text. They have hierarchical structure (motifs, domains, fold-level architecture) analogous to morphemes, words, and phrases. They contain long-range dependencies (residues that interact across hundreds of positions in sequence) that attention-based architectures excel at. Most importantly, the evolutionary record provides a self-supervised training signal: the patterns of which residues co-occur in real proteins encode substantial information about which combinations are functionally viable. A model trained to predict masked residues from context (BERT-style masked language modelling) implicitly learns this evolutionary signal — and the resulting representations turn out to encode structure, function, and stability information that transfers to many downstream tasks without explicit supervision.
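The BERT-style pretraining objective described above is mechanically simple. The sketch below builds the (inputs, targets) pairs for masked-residue prediction on a protein sequence; the masking rate and token are conventional choices, and real pipelines add the usual refinements (random-residue substitution for a fraction of masked positions, batching, tokenisation of special symbols).

```python
import random

def mask_sequence(seq, rate=0.15, mask_token="<mask>", rng=None):
    """BERT-style masking for a protein sequence: select ~rate of the
    positions, replace each with a mask token, and return (inputs,
    targets) where targets maps masked position -> original residue."""
    rng = rng or random.Random(0)     # seeded for reproducibility
    tokens = list(seq)
    targets = {}
    for i, aa in enumerate(tokens):
        if rng.random() < rate:
            targets[i] = aa
            tokens[i] = mask_token
    return tokens, targets

tokens, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQ", rate=0.3)
```

Training a transformer to fill in the `targets` from the surrounding context is the entire pretraining recipe; everything the model learns about structure and function is a by-product of getting good at this game.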
The early PLMs: UniRep, TAPE, ProtTrans
The first wave of protein language models predates AlphaFold. UniRep (Alley et al. 2019) used an LSTM trained on UniRef50 sequences with autoregressive next-token prediction, demonstrating that sequence-only pretraining produces representations useful for downstream tasks. TAPE (Rao et al. 2019) provided a benchmark suite of protein-prediction tasks for evaluating different architectures, including transformers and LSTMs trained with various objectives. ProtTrans (Elnaggar et al. 2021) systematically scaled transformer-based PLMs (ProtBERT, ProtT5, ProtXLNet, etc.) and demonstrated that scaling produces consistent improvements. By 2021, the case for transformer-based PLMs was empirically clear, even if the methodology had not yet produced a single dominant model.
ESM-1, ESM-1b, ESM-1v
The model family that became dominant was ESM (Evolutionary Scale Modeling) from Meta AI Research. ESM-1 (Rives et al. 2019/2021) was a 670M-parameter transformer trained with masked-language-modelling on UniRef50; the paper demonstrated that the learned representations encoded substantial structural information despite the model never seeing a structure. ESM-1b (Rives et al. 2021) refined the training recipe at similar scale. ESM-1v (Meier et al. 2021) showed that the same architecture could perform zero-shot variant-effect prediction with strong performance on deep mutational scanning benchmarks — without any task-specific fine-tuning, the model's likelihood of a variant served as a pathogenicity score. The empirical results were strong enough that ESM-1v became a standard baseline for variant-effect prediction (Section 16 develops the full landscape).
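The zero-shot scoring rule is simple enough to sketch directly. The per-position log-probability matrix below is a random stand-in for real PLM output, and `variant_score` is an illustrative helper rather than any library's API, but the rule itself — log-likelihood of the mutant residue minus the wild-type residue at the mutated position — is the masked-marginal heuristic described above:

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"          # the 20-letter amino-acid alphabet
IDX = {a: i for i, a in enumerate(AA)}

def variant_score(log_probs, position, wt, mut):
    """Masked-marginal variant score: log P(mut) - log P(wt) at the
    mutated position. Negative scores suggest the mutation fits the
    sequence context worse than the wild-type residue."""
    return log_probs[position, IDX[mut]] - log_probs[position, IDX[wt]]

# Toy stand-in for a PLM's per-position log-probabilities (L x 20).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 20))
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

s = variant_score(log_probs, position=3, wt="A", mut="W")
```

In a real pipeline `log_probs` would come from masking position 3 and reading the model's output distribution there; everything else stays the same.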
ESM-2 and ESMFold
The 2023 release of ESM-2 (Lin et al. 2023, Science) was the watershed scaling result for PLMs. The 15-billion-parameter ESM-2 was trained on ~65M sequences with masked-language-modelling, with intermediate model sizes (650M, 3B) for scaling-law analysis. The paper demonstrated clean scaling laws — perplexity, structure-prediction accuracy, and downstream task performance all improved smoothly with scale. The most-impactful follow-on was ESMFold, a structure-prediction system built directly on ESM-2's internal representations: the protein language model alone (no MSA, no template database) produced structure predictions competitive with AlphaFold 2 at roughly an order-of-magnitude faster inference, with the trade-off of slightly lower accuracy on harder targets where MSA-derived information matters most.
ESMFold's architectural insight is worth explicit attention. ESM-2's pretrained transformer learns representations whose pairwise dot products correlate with residue-residue distance — meaning the model has implicitly learned structure during sequence-only pretraining. ESMFold extracts these implicit features through a "folding trunk" (a smaller transformer mapping ESM embeddings to structure-relevant features) and a structure module (similar to AlphaFold 2's). The methodology proved that protein language models contain sufficient structural information for direct prediction, which has substantial consequences for downstream applications — fast, MSA-free structure prediction becomes feasible for high-throughput screening of designed proteins.
ESM-3 and the unification
The June 2024 release of ESM-3 (Hayes et al. 2024, EvolutionaryScale) marked another significant step: a unified model that handles sequence, structure, and function as three modalities of the same data. ESM-3's tokenisation includes amino-acid tokens (sequence), structure tokens (a vector-quantised tokenisation of local backbone geometry), and function tokens (annotations from InterPro and similar databases). The model is trained with masked-language-modelling across all three modalities simultaneously, which lets it generate (not just predict) at any modality given any subset of others. The largest release was 98B parameters; a 1.4B "open" version was released with non-commercial licensing.
The empirical case is substantial. ESM-3 generates novel functional proteins zero-shot, achieves state-of-the-art on standard prediction benchmarks, and demonstrated the design of esmGFP (a fluorescent protein with sequence ~58% identity to known fluorescent proteins, with the methodology team estimating it represents ~500 million years of evolutionary divergence). The unification of structure-and-sequence-and-function in a single foundation model is methodologically distinct from earlier PLMs, and the 2025 wave of follow-on work (improved tokenisation schemes, more efficient training, larger model variants) suggests this direction will dominate the next several years.
Pretraining objectives and scaling
The dominant pretraining objective for PLMs has been masked language modelling: hide ~15% of residues at random, predict them from context. The objective's strength is that it forces the model to use bidirectional context to disambiguate masked positions, producing rich representations. Autoregressive objectives (predict next token given previous) have been used in some PLMs (ProGen, Tranception) and are particularly natural for generative protein design. Span-masking objectives (mask contiguous spans rather than individual residues) are used in some variants. Multi-task pretraining that combines several objectives is increasingly common.
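The masking step itself is mechanically simple. A minimal sketch of BERT-style masking for proteins (the `#` mask token and the `mask_sequence` helper are illustrative, not any particular library's API):

```python
import random

AA = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"   # stand-in mask token

def mask_sequence(seq, rate=0.15, seed=None):
    """BERT-style masking for protein MLM pretraining: hide ~rate of
    residues at random; the model must predict the originals from the
    surrounding (bidirectional) context."""
    rng = random.Random(seed)
    n = max(1, round(len(seq) * rate))
    positions = rng.sample(range(len(seq)), n)
    masked = list(seq)
    targets = {}                     # position -> original residue
    for p in positions:
        targets[p] = masked[p]
        masked[p] = MASK
    return "".join(masked), targets

masked, targets = mask_sequence("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", seed=0)
```

The training loss is then cross-entropy between the model's predictions at the masked positions and the entries of `targets`; span-masking variants would sample contiguous runs of positions instead of independent ones.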
Scaling laws for PLMs follow patterns familiar from natural-language models, with some specifics. Loss scales roughly as a power law in compute, parameters, and data — the empirical relations are similar to language-model scaling laws but with somewhat different exponents (the protein "scaling exponents" tend to be smaller, partly because the protein corpus is more homogeneous than natural language). Downstream-task performance generally improves with model size, but the relationship is task-dependent — variant-effect prediction continues improving past 15B parameters, while basic structure prediction saturates earlier. The 2024–2026 frontier of PLM scaling is partly empirical (do bigger models keep paying off?) and partly methodological (are masked-language-modelling objectives optimal, or are alternatives better?).
What PLM embeddings are good for
PLM embeddings are useful as features for many downstream tasks. The standard recipe: take a pretrained PLM, run it on a protein sequence, extract per-residue embeddings (typically the final transformer layer's hidden states) or per-protein embeddings (typically a mean or special-token aggregate), and use these as input to a small task-specific head. The methodology produces strong baselines for nearly any per-residue or per-protein prediction task: secondary structure prediction, contact prediction, function annotation, stability prediction, binding-site identification, post-translational-modification site prediction, and various other classification and regression problems. Modern protein-AI pipelines treat PLM embeddings as the equivalent of pretrained image features in vision — a reusable substrate that any task-specific model can build on. Hosted models and embedding services (Hugging Face's hosted ESM models, various commercial offerings) make this as straightforward as it is in NLP.
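The embed-pool-head recipe can be sketched with stand-in arrays. In a real pipeline the per-residue matrices below would be a PLM's final-layer hidden states; here they are random, and the logistic-regression "head" is the smallest possible task-specific model:

```python
import numpy as np

def per_protein_embedding(residue_embeddings):
    """Mean-pool per-residue embeddings (L x d) into a single d-vector."""
    return residue_embeddings.mean(axis=0)

# Toy stand-ins for final-layer PLM hidden states of two proteins.
rng = np.random.default_rng(1)
emb_a = rng.normal(size=(120, 64))   # 120 residues, 64-dim embeddings
emb_b = rng.normal(size=(75, 64))    # different length, same dimension

# Pooling makes variable-length proteins comparable as fixed-size vectors.
X = np.stack([per_protein_embedding(emb_a), per_protein_embedding(emb_b)])

# A minimal task head: logistic regression on the pooled embeddings.
w = rng.normal(size=64)              # would be learned from labelled data
scores = 1.0 / (1.0 + np.exp(-X @ w))
```

Per-residue tasks (secondary structure, binding sites) skip the pooling step and attach the head to each row of the `(L, d)` matrix instead.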
Structure Prediction Beyond AlphaFold
AlphaFold 2 was the watershed but not the only structure-prediction methodology. Section 10 told the AlphaFold story; this section surveys the broader landscape — the academic alternative (RoseTTAFold), the language-model-only methods (ESMFold, OmegaFold), the multi-component extensions (AlphaFold-Multimer), and the dynamics frontier (ensemble methods, Boltzmann generators, conformer sampling) where the static-structure foundation is being extended.
RoseTTAFold
The major academic alternative to AlphaFold is RoseTTAFold (Baek et al. 2021, Science) from David Baker's lab at the University of Washington. The architecture is conceptually similar to AlphaFold 2's — an MSA-aware iterative refinement trunk feeding a structure module — but with several distinctive design choices. RoseTTAFold uses a "three-track" architecture that operates simultaneously on 1D sequence features, 2D pairwise features, and 3D coordinates, with information flowing between all three tracks at each layer. The training methodology was different (Baker's lab used a smaller model with more aggressive multi-task training); the empirical accuracy in 2021 was slightly below AlphaFold 2 but well within useful range, and the open-source release made RoseTTAFold the de facto academic methodology.
RoseTTAFold's importance extends beyond structure prediction. Its open release made the methodology accessible to academic groups who needed to extend or modify the architecture, and the extensions became substantial. RoseTTAFoldNA (Baek et al. 2024) extended the architecture to protein–nucleic-acid complexes, predicting protein–DNA and protein–RNA structures. RoseTTAFold All-Atom handles arbitrary biomolecular complexes. RFdiffusion (Section 13) was built on the RoseTTAFold backbone. The Baker lab's open-science philosophy, combined with the architectural flexibility of RoseTTAFold's three-track design, made it the workhorse of much of the post-2021 protein-AI research community.
ESMFold and language-model-only prediction
Section 11 introduced ESMFold; it deserves expansion here as a structure-prediction methodology in its own right. The key methodological claim — that a sufficiently-large protein language model contains enough structural information that explicit MSA-based prediction is unnecessary — was empirically validated by ESMFold's release (Lin et al. 2023, Science). The architecture extracts pairwise representations from ESM-2's internal layers, processes them through a much smaller "folding trunk," and decodes them with a structure module similar to AlphaFold 2's. The accuracy on standard benchmarks (CASP14 free-modelling targets, the various CAMEO evaluations) is below AlphaFold 2 but within ~5–10 GDT-TS points for most targets — useful enough that ESMFold is the dominant tool for high-throughput structure-prediction applications where speed matters.
The speed advantage is substantial. AlphaFold 2 requires MSA generation (often the bottleneck — building an MSA can take minutes per protein from large databases) plus inference (~10 seconds for moderate-sized proteins on a GPU). ESMFold skips the MSA step entirely; inference takes ~1 second per protein. For applications like screening millions of designed sequences (Section 13's design pipelines), this speed difference is decisive. ESMFold has become the default structure-prediction tool for design loops.
OmegaFold, HelixFold-Single, and the language-only alternative landscape
Several other methods take similar approaches with different choices. OmegaFold (Wu et al. 2022) pairs its own single-sequence protein language model with a geometry-aware folding module, skipping the MSA entirely; it achieves competitive accuracy with similarly fast inference. HelixFold-Single (Baidu) and several other industrial entries have explored the design space. RGN2 combines a protein language model with a recurrent geometric network for structure prediction. The landscape of "MSA-free structure prediction" is genuinely competitive, with each method making different trade-offs between speed, accuracy, and methodological simplicity. Most production protein-AI pipelines choose between AlphaFold (highest accuracy, slowest) and ESMFold (fast, slightly lower accuracy) as the two serious options.
AlphaFold-Multimer and protein complexes
AlphaFold 2 predicts single-chain structures. Most biology happens in multi-chain complexes — receptors with their ligands, signalling complexes, ribosomes, viruses. AlphaFold-Multimer (Evans et al. 2022) extended the AlphaFold 2 architecture with training data and architectural modifications for multi-chain prediction. The methodology paired chains during training and added complex-specific features to the Evoformer's input. The empirical performance on protein-protein complex prediction benchmarks (CAPRI, various heterodimer benchmarks) was substantially better than single-chain methods used naively, but accuracy on novel complexes remained challenging — particularly for transient interactions, weak binders, and complexes where extensive conformational change accompanies binding. AlphaFold-Multimer was the dominant complex-prediction method until AlphaFold 3 (Section 10 covers the 2024 release).
Dynamics: the frontier
AlphaFold 2 and its successors predict static structures — a single point estimate of where the atoms are. Real proteins are dynamic: they breathe, flex, sample alternative conformations, and engage in induced-fit binding events that involve substantial structural change. Predicting these dynamic properties is the major frontier post-AlphaFold, and several methodological directions are active.
Subsampling-based ensemble methods reduce the number of input MSA sequences fed to AlphaFold, which produces more diverse predictions that approximate the conformational ensemble. The methodology is empirical (there is no theoretical guarantee that subsampled predictions correspond to physical conformers) but produces useful conformer sampling for many proteins. AlphaFlow and related methods extend AlphaFold with diffusion-based or flow-matching architectures to generate distributions over structures rather than point estimates. Boltzmann generators (Noé et al. 2019) train flow-based models to sample from the Boltzmann distribution of a molecular system at a target temperature, which in principle yields the full conformational ensemble. The methodology is more rigorous than subsampling but harder to scale; the 2024–2026 wave of "Boltzmann generators for proteins" research has produced substantial progress. A parallel line of work builds conformational-ensemble information directly into protein-model pretraining objectives.
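The subsampling idea reduces to a simple loop. The `toy_predictor` below is a stand-in — a real pipeline would call AlphaFold with a reduced-depth MSA — but the structure of the method (many shallow random subsamples, one prediction each, spread across the ensemble as a diversity signal) is as described:

```python
import numpy as np

rng = np.random.default_rng(2)
msa = rng.normal(size=(256, 16))   # stand-in MSA features: 256 aligned sequences

def toy_predictor(msa_subset):
    """Stand-in for an MSA-conditioned structure predictor; the
    'prediction' is just a vector that depends on which rows it saw."""
    return msa_subset.mean(axis=0)

def subsampled_ensemble(msa, depth=16, n_models=8):
    """Predict once per random shallow subsample of the MSA."""
    preds = []
    for _ in range(n_models):
        rows = rng.choice(len(msa), size=depth, replace=False)
        preds.append(toy_predictor(msa[rows]))
    return np.stack(preds)

ensemble = subsampled_ensemble(msa)
spread = ensemble.std(axis=0).mean()   # nonzero spread: the ensemble disagrees
```

With the full MSA the predictor would return one answer; with shallow subsamples the predictions disagree, and that disagreement is what is interpreted (heuristically) as conformational diversity.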
Membrane proteins and intrinsically disordered proteins
Two specific protein categories deserve attention because they are systematically harder. Membrane proteins (Section 4 introduced them) are partly embedded in lipid bilayers, and their structure depends on the lipid environment in ways that AlphaFold's training data only partially captures. The empirical accuracy is lower than for soluble proteins, particularly for the orientation and packing of transmembrane helices. Specialised methods (membrane-protein-aware fine-tuning, lipid-bilayer-conditioned architectures) are active research areas. Intrinsically disordered proteins and regions (IDPs/IDRs) lack stable structure by definition; they exist as dynamic ensembles. AlphaFold typically predicts them as low-confidence regions (low pLDDT scores), which is at least honest, but predicting the actual conformational ensemble of an IDP is an active research frontier with methods drawing on both AlphaFold-derived approaches and dedicated IDR-specialised models.
The evaluation problem
A subtle but important point: structure-prediction accuracy on benchmarks doesn't always correspond to utility for downstream applications. AlphaFold predictions can be highly accurate on average but specifically wrong about features that matter — wrong loop conformations, missed alternative conformers, incorrect side-chain rotamer choices in active sites. For drug discovery applications, where small geometric errors can mean the difference between active and inactive compounds, this matters substantially. The methodology has developed substantial machinery around confidence estimation (AlphaFold's pLDDT scores, predicted aligned error, the various uncertainty quantification methods), but careful users still validate predictions experimentally for high-stakes applications. Section 18 returns to this evaluation problem.
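Confidence-aware use of predictions often begins with something as simple as flagging low-pLDDT stretches before trusting downstream geometry. A minimal sketch (the threshold and helper name are illustrative; pLDDT is on AlphaFold's 0–100 scale, with values below ~70 conventionally treated as low confidence):

```python
import numpy as np

def low_confidence_regions(plddt, threshold=70.0):
    """Return (start, end) index ranges where per-residue pLDDT falls
    below the threshold -- regions to treat with caution downstream."""
    mask = np.asarray(plddt) < threshold
    regions, start = [], None
    for i, low in enumerate(mask):
        if low and start is None:
            start = i                      # open a low-confidence run
        if not low and start is not None:
            regions.append((start, i))     # close the run at i
            start = None
    if start is not None:
        regions.append((start, len(mask)))
    return regions

plddt = [92, 95, 88, 60, 55, 48, 90, 91, 35, 40]
regions = low_confidence_regions(plddt)    # → [(3, 6), (8, 10)]
```

For high-stakes uses (active-site geometry, docking), the flagged regions are exactly where the prediction should be validated experimentally rather than trusted.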
Protein Design and RFdiffusion
Structure prediction (Sections 10, 12) solves the forward problem: given a sequence, predict its structure. Protein design is the inverse: given a desired structure or function, design a sequence that produces it. The methodology has gone from being a slow, expertise-intensive discipline (the Baker lab's Rosetta tradition) to producing usable results from generative diffusion models in a span of about three years, with substantial implications for therapeutic protein development, enzyme engineering, and synthetic biology. The 2024 Nobel Prize in Chemistry recognised both AlphaFold (Section 10) and David Baker's protein-design work, marking the field's arrival at scientific maturity.
The Rosetta tradition
Before the AI wave, the dominant approach to protein design was the Rosetta software suite developed by David Baker's lab over twenty-plus years. The methodology was physics-based: parameterise the energy of a protein structure (van der Waals, electrostatics, hydrogen bonding, solvation, conformational entropy) and search for sequences that minimise the energy in a target structure. The search space is combinatorially large (20 amino acids per position, 50–500 positions per protein), but careful sampling combined with a well-tuned energy function produced impressive demonstrations: novel protein folds (Top7 in 2003 was the first computationally-designed protein with a novel fold), self-assembling cages, mini-proteins binding therapeutic targets, and increasingly enzymatic catalysts. The methodology was slow (days to weeks of CPU time per design), required substantial expert tuning, and failed often — but the successes accumulated, establishing protein design as a real science.
The 2020–2022 design landscape
The post-AlphaFold years (2020–2022) saw a substantial methodological reorganisation. With AlphaFold 2 available, the design pipeline could be inverted: propose a sequence, predict its structure with AlphaFold, score it against the target. This design-by-AlphaFold-validation approach proved unexpectedly effective — the Baker lab's ProteinMPNN (Section 14 covers it in detail) plus AlphaFold validation produced binders to therapeutic targets at success rates orders of magnitude higher than prior methods. The methodology was empirical (proposing many sequences, AlphaFold-validating each) but it worked, and it set the stage for the full diffusion-based generative methods that would follow.
RFdiffusion: the watershed
The 2023 release of RFdiffusion (Watson et al. 2023, Nature) was the watershed for generative protein design. The architectural insight was elegant: take RoseTTAFold's structure-prediction backbone, freeze it, and train a diffusion model that operates on protein backbones in 3D space. The model learns to denoise corrupted backbones step by step, eventually producing realistic protein backbones from pure noise. By conditioning the diffusion process on partial information — a target binding site, a desired fold, a specified active-site geometry — RFdiffusion can generate backbones that satisfy specified constraints. The methodology connects directly to the diffusion-model material of Part X, applied to an SE(3)-equivariant 3D structural domain rather than 2D images.
Architecturally, RFdiffusion uses a noise schedule that perturbs both backbone coordinates (Cα positions) and orientations (rotation matrices), running diffusion in the SE(3) Lie group rather than ordinary Euclidean space. The denoising network is a fine-tuned RoseTTAFold predicting clean structures from noisy inputs at each timestep. Training data comes from the PDB; the model learns to denoise structures it has seen during training, but the learned distribution generalises to producing novel structures with realistic protein-like geometry.
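A toy, Euclidean-only sketch may help make the noising algebra concrete. RFdiffusion's actual process runs over SE(3) frames (positions and rotations) with a learned denoiser; here the chain is just Cα positions, and the "denoiser" is an oracle that inverts known noise, purely to show the forward/reverse relationship:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy "clean" backbone: Ca coordinates of a 10-residue chain (10 x 3),
# built as a random walk so consecutive residues are near each other.
x0 = np.cumsum(rng.normal(size=(10, 3)), axis=0)

betas = np.linspace(1e-4, 0.05, 50)      # variance (noise) schedule
alpha_bar = np.cumprod(1.0 - betas)      # cumulative signal retention

def forward_noise(x0, t):
    """q(x_t | x_0): shrink the clean coordinates toward the origin and
    add Gaussian noise, as in standard DDPM-style diffusion."""
    eps = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps, eps

t = 40
x_t, eps = forward_noise(x0, t)

# Oracle "denoiser" (stand-in for the fine-tuned RoseTTAFold network):
# given x_t and the true noise, invert the forward step exactly.
x0_hat = (x_t - np.sqrt(1 - alpha_bar[t]) * eps) / np.sqrt(alpha_bar[t])
```

The trained network's job is to predict `eps` (or `x0_hat`) from `x_t` alone; RFdiffusion additionally carries a rotation matrix per residue through an analogous noising process on SO(3).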
Empirical performance has been striking. RFdiffusion plus ProteinMPNN (for sequence design from generated backbones) plus AlphaFold validation produced binders to dozens of therapeutic targets in the months following release, often with higher affinity and better selectivity than antibody-based approaches. The success rates are substantially higher than prior methods — typical ratios are 10–50% of designs producing functional binders, compared to <1% for pre-RFdiffusion methods. The methodology has been adopted across academic and industrial protein-design groups and is the dominant approach as of 2026.
Backbone diffusion plus sequence design
RFdiffusion's standard pipeline has two stages. Stage 1: generate a backbone with RFdiffusion, conditioned on whatever constraints (a target binding site to dock against, a specified fold, a desired symmetry). Stage 2: use ProteinMPNN to design a sequence that folds to that backbone (Section 14 develops this). The two-stage approach has empirical advantages over end-to-end methods: backbone diffusion handles the geometric problem (proposing a sensible 3D structure), while sequence design handles the chemistry problem (finding amino acids that stabilise the geometry). The decomposition is also methodologically tractable — both stages have well-defined evaluation criteria and can be improved independently.
Conditional design
RFdiffusion's most-impactful capability is conditional generation. Given a target protein and a desired binding interface, RFdiffusion can generate backbones that physically dock against the target with appropriate complementary geometry. This is the canonical "design a binder for protein X" workflow, and it has been the basis for the substantial 2024–2026 wave of AI-designed therapeutic candidates entering clinical trials. Variants of conditional design include motif scaffolding (anchor a known functional motif and design a stable protein around it), symmetric design (generate self-assembling oligomers with specified symmetry), enzyme active-site design (place specified catalytic residues in the right geometry), and multi-state design (proteins that adopt different conformations in different conditions). Each is a substantial sub-research-area with its own published demonstrations.
Successors and the broader landscape
The post-RFdiffusion landscape has expanded rapidly. Chroma (Ingraham et al. 2023) is an alternative diffusion model with explicit programmable constraints. FrameDiff and Genie are academic alternatives with various architectural innovations. RFdiffusion All-Atom extends the methodology to handle non-protein components (small-molecule ligands, cofactors). The 2024–2026 wave of flow-matching alternatives to diffusion (which can be more sample-efficient and offer principled trade-offs between fidelity and diversity) is an active area. Industry deployments include Generate Biomedicines, Cradle, A-Alpha Bio, EvolutionaryScale, and a substantial number of well-funded protein-design startups, all deploying variants of the diffusion-plus-language-model methodology at scale.
The 2024 Nobel and the broader implications
The October 2024 Nobel Prize in Chemistry was jointly awarded to David Baker for protein design and to Demis Hassabis & John Jumper for AlphaFold. The pairing was significant: the prize recognised both the prediction problem (sequence→structure) and the design problem (function→sequence) as paired achievements that together constitute the modern AI-for-protein-science methodology. The Nobel committee's commentary explicitly acknowledged the broader stakes — these are general-purpose methods with substantial therapeutic, industrial, and basic-science implications. The 2025–2026 wave of protein-design demonstrations (AI-designed enzymes, AI-designed antibody alternatives, AI-designed assemblies for materials applications) suggests the methodology has reached deployment maturity, with the next several years likely to see substantial industrial and clinical impact.
Limitations and open problems
Despite the empirical successes, substantial challenges remain. Function prediction for designed proteins is harder than structure prediction — a designed protein may fold correctly but exhibit unexpected behaviour in cellular contexts. Solubility and aggregation are persistent failure modes: many designs that look good in silico form aggregates or precipitate when expressed. Immunogenicity for therapeutic protein design is hard to predict in silico — designed proteins may trigger immune responses that natural proteins don't. Catalytic activity design has been substantially harder than binding design — designing enzymes with high catalytic efficiency remains an open frontier with mixed empirical results. The 2024–2026 generation of methods is making progress on each of these, but none is fully solved, and the gap between "AlphaFold-validated design" and "experimentally-functional protein" remains real.
Inverse Folding
Inverse folding is the structure-to-sequence problem: given a desired 3D backbone, design an amino-acid sequence that folds to it. This is the second half of the design pipeline (Section 13's RFdiffusion handles the structure; inverse folding handles the sequence), and it is also useful in its own right for protein engineering, antibody humanisation, and structural-biology workflows where redesigning portions of an existing protein is the goal.
The problem statement
Inverse folding is conceptually simpler than structure prediction: given the backbone (Cα, N, C, O coordinates for each residue), predict per-position amino-acid distributions that would fold to that backbone. The output is typically a categorical distribution over 20 amino acids per position, from which sequences can be sampled. The methodology is conditional generation in the sequence space, conditioned on structural information.
Why is this hard? The input is geometrically rich (a backbone defines a local environment for each residue, with specific angles, distances, and hydrophobic-vs-hydrophilic surroundings), but the answer is multi-modal — many different sequences can fold to the same backbone, with substantially different stability and other properties. The methodology has to balance recovering "the right answer" (sequences that actually fold to the target) against generating diversity (sampling different valid sequences for downstream selection).
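The recovery-vs-diversity balance is typically exposed as a sampling temperature over the per-position distributions. A minimal sketch (the logits are random stand-ins for a real inverse-folding model's output; the function name is illustrative):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(4)

def sample_sequence(logits, temperature=1.0):
    """Sample one sequence from per-position amino-acid logits (L x 20).
    Low temperature concentrates on the most likely residues (high
    recovery); high temperature trades recovery for diversity."""
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max(axis=1, keepdims=True))  # stable softmax
    probs /= probs.sum(axis=1, keepdims=True)
    return "".join(AA[rng.choice(20, p=p)] for p in probs)

logits = rng.normal(size=(30, 20))   # stand-in inverse-folding output
seq_cold = sample_sequence(logits, temperature=0.05)   # near-argmax
seq_hot = sample_sequence(logits, temperature=2.0)     # more diverse
```

Design campaigns typically sample many sequences per backbone at moderate temperature and let downstream filtering (structure-prediction validation) pick the winners.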
ProteinMPNN
The dominant inverse-folding method is ProteinMPNN (Dauparas et al. 2022, Science) from the Baker lab. The architecture uses a graph neural network operating on the protein backbone: nodes are residues with positional features (Cα coordinates, backbone geometry), edges connect spatially-nearby residues with relative-position features. The network produces per-residue amino-acid logits using auto-regressive decoding (positions are sampled in random order, with each position conditioned on already-chosen residues). The methodology was elegant — the GNN architecture naturally handles the 3D structure, the autoregressive decoding produces coherent sequences, and the training data (PDB structures) provided substantial supervision.
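The graph-construction step can be sketched in a few lines of numpy: a k-nearest-neighbour graph over Cα coordinates is the kind of input a ProteinMPNN-style encoder consumes (the value of k and the toy coordinates here are illustrative; the real model also attaches richer geometric edge features):

```python
import numpy as np

def knn_graph(ca_coords, k=3):
    """Build a residue graph: each residue connects to its k spatially
    nearest neighbours, with the edge distances returned as features."""
    diff = ca_coords[:, None, :] - ca_coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # all-pairs Ca-Ca distances
    np.fill_diagonal(dist, np.inf)         # exclude self-edges
    neighbours = np.argsort(dist, axis=1)[:, :k]
    return neighbours, np.take_along_axis(dist, neighbours, axis=1)

# Toy backbone: a 12-residue random walk standing in for real Ca coordinates.
rng = np.random.default_rng(5)
ca = np.cumsum(rng.normal(scale=2.2, size=(12, 3)), axis=0)
nbrs, edge_dist = knn_graph(ca, k=3)
```

Message passing over this graph gives each residue a view of its spatial (not just sequential) neighbourhood, which is what lets the decoder choose residues that pack well in 3D.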
The empirical case was strong. ProteinMPNN substantially outperformed prior inverse-folding methods (the various Rosetta sequence-design protocols, earlier neural-network attempts) on standard benchmarks. More importantly, ProteinMPNN-designed sequences had measurably better experimental success rates: when paired with structure prediction validation (run the designed sequence through AlphaFold, verify it folds to the target), success rates of 30–60% were common compared to <5% for prior methods. The combination of ProteinMPNN-for-sequence-design with AlphaFold-for-validation became the default protein-design pipeline within months of release.
ESM-IF and language-model approaches
ESM-IF (Hsu et al. 2022) is the language-model-based alternative to ProteinMPNN. Instead of a graph neural network, ESM-IF adapts the transformer-based ESM architecture to handle structural input — backbone coordinates are encoded through learned geometric embeddings, and the model produces sequences via autoregressive decoding; its distinctive training move was augmenting the PDB with millions of AlphaFold-predicted structures. The empirical performance is comparable to ProteinMPNN on standard benchmarks; from a results perspective, the architectural choice matters less than the shared structure-conditioned autoregressive recipe. The choice between ProteinMPNN and ESM-IF in production is typically based on integration with other parts of the design pipeline (ProteinMPNN if working with RFdiffusion in the Baker-lab tradition; ESM-IF if working in an ESM-2/ESM-3 ecosystem).
The full design pipeline
The full design pipeline that emerged in 2023–2024 has four stages. (1) Specification: define what you want — a binder to a target, an enzyme catalysing a specific reaction, a stable scaffold around a known motif. (2) Backbone generation: use RFdiffusion (or alternatives) to generate candidate 3D backbones consistent with the specification. (3) Sequence design: use ProteinMPNN (or ESM-IF) to design sequences that fold to each backbone. (4) Validation: run candidate sequences through AlphaFold (or ESMFold for speed) to verify they fold to the target backbone, then filter by structural-similarity metrics.
The pipeline produces hundreds to thousands of candidate sequences in hours, of which typically 10–50% pass in-silico validation, and ~10–30% of validated candidates produce experimentally-functional proteins when synthesised and tested. These rates have improved dramatically — pre-RFdiffusion methods rarely exceeded 1% experimental success — but they are still well below 100%, and effective design in practice still involves substantial empirical iteration.
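The four stages reduce to a generate-design-validate loop. Every function below is a toy stand-in for the real tool (RFdiffusion, ProteinMPNN, and ESMFold/AlphaFold respectively), but the control flow and the similarity-threshold filter mirror the pipeline described above:

```python
import numpy as np

rng = np.random.default_rng(6)
AA = "ACDEFGHIKLMNPQRSTVWY"

# Stage 2 stand-in: backbone generation (would be RFdiffusion).
def generate_backbone(spec):
    return rng.normal(size=(spec["length"], 3))

# Stage 3 stand-in: sequence design (would be ProteinMPNN / ESM-IF).
def design_sequence(backbone):
    return "".join(rng.choice(list(AA)) for _ in range(len(backbone)))

# Stage 4 stand-in: validation score (would be a TM-score-like
# similarity between the predicted and target backbones).
def validate(sequence, backbone):
    return rng.uniform(0.0, 1.0)

spec = {"target": "protein X", "length": 80}     # stage 1: specification
candidates = []
for _ in range(100):
    bb = generate_backbone(spec)
    seq = design_sequence(bb)
    score = validate(seq, bb)
    if score >= 0.8:                             # keep only well-validated designs
        candidates.append((seq, score))
```

The surviving `candidates` are what would go to synthesis and experimental testing; the 10–50% in-silico pass rates quoted above correspond to how often real designs clear the threshold.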
Conditional and constrained variants
Inverse folding methods support several useful conditioning modes. Fixed-position design holds specified residues constant and designs the rest — useful for redesigning portions of existing proteins while preserving known active sites. Tied-position design couples residues at specified positions (forcing them to be the same amino acid) — useful for symmetric oligomers where structurally-equivalent positions should match. Composition constraints bias the output toward specified amino-acid frequencies — useful for therapeutic applications where certain amino acids (cysteines, methionines) are problematic for stability. Region-specific objectives can prioritise different criteria in different parts of a protein (high stability in the core, high diversity in surface positions). Each of these has been extensively used in the 2024–2026 protein-design literature.
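Tied-position conditioning, for instance, can be implemented by averaging logits across a tied group before committing to a residue, so structurally-equivalent positions receive the same amino acid. A minimal greedy sketch (function name and toy logits are illustrative; real implementations tie positions inside the autoregressive sampler):

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
rng = np.random.default_rng(7)

def design_with_ties(logits, tied_groups):
    """Greedy per-position design with tied positions: each tied group
    shares averaged logits and therefore gets one common residue --
    the recipe used for symmetric-oligomer design."""
    choice = logits.argmax(axis=1)           # untied positions: argmax
    for group in tied_groups:
        avg = logits[list(group)].mean(axis=0)
        choice[list(group)] = avg.argmax()   # one shared residue per group
    return "".join(AA[i] for i in choice)

logits = rng.normal(size=(12, 20))           # stand-in model output
seq = design_with_ties(logits, tied_groups=[(0, 4, 8), (1, 5, 9)])
```

Fixed-position design is the complementary trick: clamp `choice` at the preserved positions before (or instead of) reading the model's logits there.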
Antibody-specific inverse folding
Antibody design (Section 17 develops this further) has its own inverse-folding methodology. IgFold is an antibody-specific structure predictor and AbLang an antibody-specific language model; AbMPNN and similar methods adapt ProteinMPNN for antibody-specific constraints (preserving framework regions, designing CDR loops). The methodology exploits the substantial structural homology among antibody scaffolds (essentially all antibodies share the immunoglobulin fold) while focusing design effort on the hypervariable CDR-H3 loop where most binding-specificity differences live.
Limitations
Inverse-folding methods inherit several limitations from their training data and assumptions. They predict sequences likely to fold to the target backbone, but folding success in vitro depends on factors beyond sequence-backbone compatibility — translation efficiency, chaperone-assisted folding, post-translational modifications, expression-host-specific issues. They optimise for structural compatibility, not function — a designed sequence may fold correctly but not bind its target with sufficient affinity, or may bind off-targets in cellular contexts. They typically produce sequences within "natural" amino-acid frequency distributions, which means designs may underuse rare amino acids that could be optimal in specific contexts. The methodology continues to evolve, with 2025–2026 work explicitly addressing function-aware design and expressibility-aware design as extensions of the basic inverse-folding paradigm.
Function and Property Prediction
Predicting structure (Sections 10, 12) is one piece of protein-AI methodology; predicting function and various biophysical properties is the other major class of supervised problems. The methodology spans GO-term annotation, enzyme classification, stability prediction, solubility, expression, immunogenicity, and various binding-affinity prediction problems. Most of these tasks now use PLM embeddings (Section 11) as a substrate, with task-specific heads trained on labelled data.
The function-prediction problem
Most proteins in UniProt lack experimental functional characterisation. Of the ~250M sequences, only the ~570K Swiss-Prot subset has hand-curated functional annotations; the remaining ~99% have computational annotations transferred from homologues, often with substantial uncertainty. Function prediction aims to assign Gene Ontology (GO) terms, Enzyme Commission (EC) numbers, or other functional categories to uncharacterised proteins from sequence (and increasingly structure). The methodology is fundamentally a multi-label classification problem, with the wrinkle that the label space is hierarchically organised (GO has parent-child relations) and extremely large (~50K GO terms; functional annotations can include hundreds of terms per protein).
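Hierarchical consistency is one concrete wrinkle: under GO's true-path rule, a protein annotated with a term is implicitly annotated with all of that term's ancestors, so predicted scores should be monotone up the DAG. A minimal post-processing sketch (the DAG and term IDs are toy placeholders, not real GO identifiers):

```python
def propagate_go_scores(scores, parents):
    """Enforce the true-path rule: a parent term's score must be at
    least the maximum of its descendants' scores. Pushes child scores
    upward until a fixed point (fine for a small toy DAG)."""
    out = dict(scores)
    changed = True
    while changed:
        changed = False
        for child, ps in parents.items():
            for p in ps:
                if out.get(p, 0.0) < out.get(child, 0.0):
                    out[p] = out[child]
                    changed = True
    return out

# Toy GO fragment: child -> list of parent terms (hypothetical IDs).
parents = {"GO:B": ["GO:A"], "GO:C": ["GO:A"], "GO:D": ["GO:B", "GO:C"]}
scores = {"GO:A": 0.2, "GO:B": 0.1, "GO:C": 0.5, "GO:D": 0.9}
consistent = propagate_go_scores(scores, parents)
```

Raw multi-label classifiers routinely violate this monotonicity, which is why CAFA-style pipelines apply a propagation step like this before evaluation.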
The CAFA challenges (Critical Assessment of Functional Annotation, run periodically since 2010) provide the standard benchmarking infrastructure. Methods are evaluated on hold-out proteins whose functions were experimentally determined after the prediction submission deadline — a temporal split that resembles real prospective deployment. CAFA results have shown steady improvement as PLM-based methods have replaced earlier homology-transfer approaches, though function-prediction accuracy remains well below the level that structure prediction has reached.
EC number prediction
For enzymes specifically, the Enzyme Commission number provides a hierarchical functional classification: the four-part code (e.g., EC 3.4.21.4) identifies the catalysed reaction at increasing levels of specificity. EC prediction is an important sub-task because enzymes are major drug-discovery targets and industrial biotechnology workhorses. Methods like DeepEC, CLEAN (Yu et al. 2023), and the various 2024–2026 successors achieve substantial accuracy on standard benchmarks, with the trade-off between precision and recall managed differently by different methods. CLEAN in particular uses contrastive learning to map enzymes into an embedding space where catalytic similarity is preserved, then performs classification by retrieval — an approach that handles the long-tail distribution of EC labels (most EC codes are rare) substantially better than direct multi-class classifiers.
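The classification-by-retrieval idea can be sketched in a few lines — the embeddings and EC labels below are toy values standing in for learned CLEAN-style embeddings:

```python
# Classification-by-retrieval in an embedding space, in the spirit of CLEAN:
# assign a query enzyme the EC number of its nearest labelled neighbour.
# Embeddings and EC labels below are toy values, not real model output.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def predict_ec(query, reference):
    """reference: list of (embedding, ec_number) pairs."""
    best_ec, best_sim = None, -1.0
    for emb, ec in reference:
        sim = cosine(query, emb)
        if sim > best_sim:
            best_ec, best_sim = ec, sim
    return best_ec, best_sim

reference = [
    ([0.9, 0.1, 0.0], "3.4.21.4"),   # serine-protease-like cluster
    ([0.0, 0.2, 0.9], "1.1.1.1"),    # oxidoreductase-like cluster
]
ec, sim = predict_ec([0.8, 0.2, 0.1], reference)
# Nearest neighbour is the serine-protease-like cluster.
```

Because the label never enters the classifier directly, rare EC codes need only a single labelled exemplar in the reference set to be predictable — the property that helps with the long tail.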
Stability prediction
Protein stability — typically measured as the melting temperature (Tm) at which a protein unfolds, or as the change in Gibbs free energy of folding (ΔΔG) caused by a mutation — is essential for both protein engineering and pharmaceutical development. The methodology is ML applied to labelled stability data (FireProtDB, ProThermDB, and the various successor databases). Methods range from classical regressors over hand-crafted sequence and structure features to PLM-based approaches that fine-tune ESM-2 or similar models on stability tasks. The state of the art predicts Tm to within ~5°C and ΔΔG to within ~1 kcal/mol on held-out test sets, which is useful but not yet at the level where in-silico stability prediction can fully replace experimental measurement.
Solubility, expression, and aggregation
Three properties matter for using designed or engineered proteins in practice. Solubility — whether a protein stays dissolved in solution rather than aggregating — is essential for therapeutic proteins and biotechnology workflows. Expression level — how much protein is produced when the gene is expressed in a host (E. coli, yeast, mammalian cells) — varies dramatically across proteins and constrains practical applications. Aggregation propensity — the tendency to form non-functional clumps — is the dominant failure mode for both therapeutic proteins and AI-designed proteins.
Each of these is a distinct prediction problem with its own benchmark data and methodology. SoluProt, SOLart, and various successor methods predict solubility from sequence with reasonable accuracy. Aggrescan and CamSol predict aggregation-prone regions. Production protein-design pipelines (Generate Biomedicines, Cradle, and other biotech companies) increasingly include solubility-and-aggregation filtering as a standard post-design step, with the methodology integrated into their design loops.
Binding-site prediction
Where on a protein does a small molecule bind? Where does another protein bind? Where do PTMs occur? These binding-site prediction problems have substantial drug-discovery and basic-biology importance, and they have been productive AI applications. P2Rank uses random-forest predictions on geometric and physicochemical features to identify ligand-binding pockets. DeepPocket and similar deep-learning successors improve on the methodology with CNN-based approaches operating on 3D voxel grids of protein surfaces. EquiPocket uses SE(3)-equivariant networks. The empirical performance is good enough that AlphaFold-predicted structures plus deep-learning binding-site predictions form a substantial part of the modern early-stage drug-discovery pipeline (Ch 08 develops the broader methodology).
Foundation models for protein properties
The 2024–2026 generation of methods increasingly uses pretrained protein foundation models as the backbone for property prediction. Saprot (Su et al. 2024) explicitly combines sequence and structure tokens during pretraining for stability and function prediction. ProstT5 uses sequence-to-structure-token translation as a pretraining objective. ESM-3 handles function prediction natively as one of its three modalities. The pattern across the literature is that scaling foundation models with multi-task pretraining produces representations useful across many property-prediction tasks, with task-specific fine-tuning providing further gains. The trade-off is computational — large foundation models are expensive to fine-tune — but parameter-efficient fine-tuning methods (LoRA, adapters) make this tractable for moderate-data tasks.
The labelled-data bottleneck
Despite the abundance of sequence data, labelled functional data is the bottleneck for property prediction. Stability has thousands to tens of thousands of labelled measurements; expression and solubility have similar; immunogenicity has substantially fewer. The 2020s wave of deep mutational scanning experiments (DMS — measure the functional consequence of every single-amino-acid mutation in a protein, in a high-throughput experiment) has substantially expanded labelled data for variant-effect-style prediction. MaveDB aggregates DMS data across hundreds of proteins. ProteinGym (Notin et al. 2023) provides a benchmark over the DMS data for variant-effect prediction methods. The methodology of effective property prediction increasingly involves combining limited labelled data with large pretrained foundation models — the protein-AI version of the broader transfer-learning paradigm.
Variant Effect Prediction
A central practical problem at the intersection of protein science and human genetics: given a specific genetic variant — a single-amino-acid change in a protein-coding gene — predict its functional consequences. The methodology has substantial pre-AI history (PolyPhen-2, SIFT, CADD; Ch 06 developed these in the genomics context) that AI methods extend, and the post-AlphaFold generation of variant predictors has substantially raised the empirical bar.
Why variant prediction matters
Each human carries roughly 5,000 protein-coding variants that differ from the human reference genome — most are benign, some predispose to disease, some affect drug response. The clinical-genetics workflow asks: when a patient is sequenced and a "variant of unknown significance" (VUS) is identified, is it pathogenic? The ACMG-AMP guidelines (Richards et al. 2015) provide a framework for combining multiple lines of evidence into a clinical interpretation, with computational variant-effect predictors as one input. AI-based predictions have substantially improved over the past five years and are increasingly central to clinical decision-making.
Zero-shot variant prediction with PLMs
The simplest AI-based variant-effect predictor is zero-shot: take a pretrained protein language model, compute the model's likelihood of the variant amino acid given the wild-type context, and use that likelihood (or related quantities) as a pathogenicity score. The intuition is straightforward — a protein language model trained on UniProt has implicitly learned what amino acids "fit" each position; variants that the model considers unlikely are likely deleterious. ESM-1v (Meier et al. 2021) demonstrated that this approach achieves competitive performance on deep-mutational-scanning benchmarks without any task-specific fine-tuning, and the methodology became a standard zero-shot baseline.
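A minimal version of the scoring rule, with a hand-made probability table standing in for a real PLM's per-position output (e.g., ESM-1v logits after a softmax):

```python
# Zero-shot variant scoring from a PLM's per-position output distribution:
# score = log p(mutant aa) - log p(wild-type aa) at the mutated position.
# The probability table is a toy stand-in for real PLM output.
import math

def variant_score(position_probs, wt_aa, mut_aa):
    """Log-likelihood ratio; more negative = more likely deleterious."""
    return math.log(position_probs[mut_aa]) - math.log(position_probs[wt_aa])

# Hypothetical model output for one position: the model strongly
# prefers the conserved wild-type glycine.
probs = {"G": 0.85, "A": 0.10, "W": 0.001}

benign = variant_score(probs, "G", "A")   # modest penalty
severe = variant_score(probs, "G", "W")   # large penalty
```

The entire "training" is the PLM's pretraining; the scoring itself requires no labels at all, which is what makes the approach zero-shot.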
The empirical case for zero-shot PLM variant prediction is strong. On the ProteinGym benchmark (Notin et al. 2023), aggregating zero-shot predictions from large PLMs typically outperforms supervised methods that don't use PLMs, and approaches the performance of supervised methods that do use PLMs. The methodology is essentially "transfer learning from sequence pretraining" with no task-specific adaptation, which makes it both robust and computationally cheap. Production deployments at major sequencing companies (Illumina, Genomics England) increasingly include PLM-based zero-shot predictions as a standard variant-annotation feature.
AlphaMissense in detail
The most-impactful variant predictor of the post-AlphaFold era is AlphaMissense (Cheng et al. 2023, Science) from DeepMind. The architecture starts from AlphaFold 2's evoformer trunk — the iterative MSA-and-pair representation that drives structure prediction — but instead of decoding to 3D coordinates, AlphaMissense decodes to a per-position categorical distribution over the 20 amino acids. The output is a protein language model conditioned on structural context, where the "structure" is implicit in the evoformer's learned representations.
AlphaMissense's training combines two complementary signals. The first is a masked-language-modelling loss standard for PLMs: hide some positions in the input MSA, predict them. The second is a population-frequency signal derived from gnomAD (Genome Aggregation Database, ~750K humans sequenced as of 2024). The intuition: variants common in the population are presumed benign (selection has not eliminated them); rare variants matched against simulated mutations are presumed enriched for deleterious effects. AlphaMissense is trained to assign higher probability to common variants and lower to rare variants, which produces calibrated pathogenicity scores. The methodology converts a hard supervised problem (where labelled clinical data is scarce and biased) into a self-supervised problem (the entire human proteome and its variant landscape) plus a weak supervised signal from population genetics.
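The population-frequency side of this can be caricatured as a weak-labelling rule. The threshold and label names below are invented for illustration, and AlphaMissense's actual scheme is considerably more involved:

```python
# Deliberately simplified sketch of weak labelling from population frequency:
# common variants serve as benign proxies, unobserved variants as
# pathogenic proxies. Threshold and label names are invented.

def weak_label(allele_frequency, common_threshold=1e-3):
    """Return a training label from population frequency alone."""
    if allele_frequency >= common_threshold:
        return "likely_benign"
    if allele_frequency == 0.0:
        return "pathogenic_proxy"
    return "unlabelled"   # rare but observed: ambiguous, left out

labels = [weak_label(f) for f in (0.02, 1e-5, 0.0)]
```

The point of the caricature is the structure of the trick: scarce, biased clinical labels are replaced by a plentiful (if noisy) signal derived from population genetics.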
Empirically, AlphaMissense outperformed previous methods on ClinVar-derived benchmarks and produced calibrated pathogenicity probabilities for all 71 million possible human missense variants. The released catalogue covers the entire human proteome and has been incorporated into clinical pipelines globally — UK Biobank, Genomics England, the major commercial sequencing services all integrate AlphaMissense scores into their variant interpretation workflows. The downstream impact has been substantial: many VUS reclassifications, improved diagnostic yields for rare diseases, and a strong pull on subsequent methodology development.
Deep mutational scanning as ground truth
How do we evaluate variant-effect predictors when most clinical labels are themselves uncertain? Deep mutational scanning (DMS) provides cleaner ground truth. The methodology: construct a library of every possible single-amino-acid variant of a protein (or a defined region), express them all simultaneously, and measure functional consequences (binding, catalysis, fluorescence, expression, whatever the protein does) via high-throughput assays. The output is a position-by-amino-acid matrix of measured fitness consequences — typically tens of thousands of labelled mutations per protein. MaveDB aggregates such datasets across hundreds of proteins.
DMS data has substantial advantages for evaluating variant predictors. It is dense (every possible mutation is measured, not just the disease-associated ones in clinical databases). It is unbiased (no ascertainment bias toward already-suspected pathogenic mutations). It is quantitative (continuous fitness measurements, not binary pathogenic/benign labels). The trade-off is that DMS measurements are protein-specific — what's measured depends on the assay design — and may not perfectly correspond to clinical pathogenicity. The 2024–2026 wave of variant-effect methods routinely evaluates against ProteinGym (Notin et al. 2023, the standard DMS-based benchmark) in addition to traditional ClinVar evaluations.
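Because DMS fitness values and predictor scores live on different scales, evaluation is typically reported as a Spearman rank correlation between the two. A self-contained sketch with toy values (no tie handling):

```python
# Evaluating a variant predictor against DMS ground truth, ProteinGym-style:
# Spearman rank correlation between predicted scores and measured fitness.
# The toy scores below are illustrative.

def spearman(xs, ys):
    """Spearman rho via Pearson correlation of ranks (assumes no ties)."""
    def ranks(vals):
        order = sorted(range(len(vals)), key=lambda i: vals[i])
        r = [0] * len(vals)
        for rank, i in enumerate(order):
            r[i] = rank
        return r

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mean = (n - 1) / 2
    cov = sum((a - mean) * (b - mean) for a, b in zip(rx, ry))
    var = sum((a - mean) ** 2 for a in rx)
    return cov / var

predicted = [-6.7, -2.1, -0.3, -4.5]   # model scores per variant
measured = [0.05, 0.60, 0.95, 0.20]    # DMS fitness per variant
rho = spearman(predicted, measured)
# The toy relationship is perfectly monotone, so rho is 1.0.
```

Rank correlation is the right choice precisely because it asks only whether the predictor orders variants correctly, not whether its raw scores match the assay's units.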
Connection to GWAS and PRS
Section 4 of Ch 05 (AI for Biology) develops genome-wide association studies (GWAS) and polygenic risk scores (PRS) — methods that aggregate effects across many common variants to predict disease risk. The methodology connects to variant-effect prediction in subtle ways. GWAS identifies variants statistically associated with traits; AlphaMissense and related predictors identify variants likely to be functionally consequential. The two evidence types are complementary, and the 2024–2026 generation of clinical-genetics methodology increasingly combines them — using AlphaMissense scores to weight GWAS variants by predicted functional effect, or using PRS to identify protein-protective variants for design (e.g., the PCSK9 loss-of-function variants associated with reduced cardiovascular disease are now being explored as protein-design targets for cholesterol-lowering therapies).
Limitations and open problems
Despite substantial progress, several limitations remain. Most variant-effect predictors handle missense variants well but struggle with indels (insertions or deletions of one or more amino acids) and splice-affecting variants (which require sequence-context modelling specific to the splicing mechanism — Ch 06 covers SpliceAI). Loss-of-function variants (premature stop codons, frameshifts) are typically annotated by simpler rules rather than ML predictions. Gain-of-function variants (where the protein acquires a new harmful activity) are systematically harder to predict than loss-of-function variants because the prediction target depends on the specific gained activity, which the ML methods don't always capture. The 2024–2026 frontier addresses these via specialised methods for each variant class, plus increasingly sophisticated multi-modal models that handle structure, sequence, and clinical context together.
Antibody and Enzyme Design
Two specific protein-design application areas have substantial commercial and clinical importance and methodological richness worth dedicated treatment: antibody design (the dominant therapeutic-protein class, with substantial commercial demand for AI methods) and enzyme design (the engineering challenge of producing novel catalysts for industrial and pharmaceutical applications). Both build on the general methodology of Sections 13–14 with domain-specific extensions.
The antibody-design problem
Antibodies are the dominant class of biological therapeutics — the various "-mab" drugs accounted for the largest share of the pharmaceutical market by 2024, with approved antibodies for cancer (Herceptin, Keytruda, Opdivo), autoimmune diseases (Humira, Stelara), infectious diseases (Evusheld, Trogarzo), and many other indications. Designing antibodies — engineering them with desired binding specificity, affinity, stability, and developability — is a substantial commercial activity, with most major pharma companies running internal antibody-discovery groups and an ecosystem of dedicated antibody-discovery startups. The methodology has substantial pre-AI history (display-based selection methods like phage display and yeast display) that AI methods increasingly augment.
Antibodies have a specific structural architecture that shapes the methodology. The immunoglobulin fold is highly conserved — essentially all antibodies share the same scaffold of two beta-sheet sandwiches per variable domain. Binding specificity comes from six complementarity-determining regions (CDRs) — three on the heavy chain and three on the light chain — loops that together form the antigen-contacting surface. The most important is CDR-H3, the hypervariable loop on the heavy chain, whose length and sequence vary substantially between antibodies and which accounts for most of the binding specificity. AI methods for antibody design focus most of their attention on CDR-H3.
Antibody-specific structure prediction
Standard structure-prediction methods (AlphaFold 2, ESMFold) work on antibodies but with substantial caveats. The CDR-H3 loop is hypervariable, so the MSAs that drive AlphaFold's evoformer have low information content for it; predictions are typically less accurate on CDR-H3 than on framework regions. Specialised methods address this. IgFold (Ruffolo et al. 2023) is an antibody-specific structure-prediction model trained on the substantial PDB subset of antibody structures; it predicts antibody structures faster than AlphaFold 2 with comparable or better accuracy on CDR loops. ABodyBuilder3 and similar methods provide fast alternatives. AlphaFold-Multimer can predict antibody-antigen complexes when the antigen is known, with accuracy that has been steadily improving.
Antibody language models
Several PLMs are specialised for antibodies. AbLang (Olsen et al. 2022) is a BERT-style model trained on antibody sequences from the OAS (Observed Antibody Space) database; it provides antibody-specific embeddings and zero-shot likelihood estimates. IgLM (Shuai et al. 2023) is an autoregressive antibody language model useful for sequence generation. AntiBERTa and related models target heavy-chain variable regions specifically. The 2024–2026 generation increasingly uses paired heavy-and-light-chain training data (rather than treating chains independently), with substantial gains for predictions that depend on chain pairing (binding affinity, expression, stability).
Antibody design pipelines
Modern antibody-design pipelines integrate the components into end-to-end workflows. The general pattern: (1) specify the target antigen (a protein, an epitope on a protein, a small molecule); (2) generate candidate antibody backbones using RFdiffusion or domain-specific diffusion models; (3) design CDR sequences using ProteinMPNN-style or antibody-specific inverse-folding methods; (4) validate with AlphaFold-Multimer or specialised antibody-antigen complex predictors; (5) filter for developability properties (stability, solubility, expression, low immunogenicity) using ML-based predictors. The full pipeline produces hundreds to thousands of candidate antibodies, of which dozens may proceed to experimental validation.
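Step 5 of the pipeline — developability filtering — reduces to gating candidates on a set of predicted properties. A sketch with hypothetical predictors and thresholds standing in for trained ML models:

```python
# Sketch of the developability-filtering step of an antibody-design pipeline.
# The predictor functions, thresholds, and sequences are hypothetical
# stand-ins for trained ML property models and real candidates.

def passes_developability(candidate, predictors, thresholds):
    """Keep a candidate only if every predicted property clears its threshold."""
    return all(
        predictors[name](candidate) >= cutoff
        for name, cutoff in thresholds.items()
    )

# Toy predictors: in practice these would be trained models for
# stability, solubility, expression, and immunogenicity.
predictors = {
    "stability": lambda seq: 0.8 if "W" not in seq else 0.3,
    "solubility": lambda seq: 0.9,
}
thresholds = {"stability": 0.5, "solubility": 0.5}

candidates = ["QVQLVQSG", "QWQLVQSG"]
kept = [c for c in candidates if passes_developability(c, predictors, thresholds)]
# Only the first candidate survives the (toy) stability filter.
```

In production pipelines this filter is what turns thousands of generated designs into the dozens that proceed to experimental validation.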
Empirical progress has been substantial. The 2024–2026 wave includes published demonstrations of AI-designed antibodies entering clinical trials, with companies like Generate Biomedicines, Absci, Aionics, and others reporting success rates that substantially exceed traditional discovery pipelines. The commercial implications are large enough that essentially every major pharma company has a dedicated AI-antibody program by 2026.
The humanisation problem
A specific antibody-engineering problem worth mentioning is humanisation. Many therapeutic antibodies are originally derived from mice or other model organisms (the "-omab" suffix indicates mouse origin); using them as drugs in humans risks immunogenic responses against the foreign antibody framework. Humanisation grafts the CDR loops from the original antibody onto a human framework, preserving binding specificity while reducing immunogenicity. The methodology was originally developed in the 1980s with ad-hoc grafting protocols; modern AI methods (humanisation predictors based on PLMs, structure-aware grafting tools) have substantially improved the workflow. The "-zumab" and "-umab" suffixes indicate humanised and fully-human antibodies respectively.
Enzyme design
The other major design application is enzyme design. Where antibodies are designed for binding specificity, enzymes are designed for catalytic activity — a substantially harder problem, because catalysis requires precise active-site geometry for transition-state stabilisation, and small geometric errors substantially reduce catalytic efficiency. The classical approach (David Baker's lab, decades of effort) used Rosetta-based design with elaborate active-site modelling, achieving modest catalytic activity for enzymes catalysing reactions like the Diels-Alder cycloaddition, the Kemp elimination, and various retro-aldol reactions. The activities of these designed enzymes were typically thousands of times below natural enzymes — useful as proof of concept but not industrially viable.
The post-RFdiffusion era has improved this. The 2023 wave of enzyme design with RFdiffusion + ProteinMPNN + experimental optimisation produced enzymes with substantially better catalytic activities, in some cases approaching natural enzyme rates. The methodology integrates: (1) specify the desired transition state geometry; (2) use RFdiffusion to scaffold proteins around it; (3) use ProteinMPNN to design sequences; (4) experimentally test and iterate. The Baker lab's 2024 retro-aldolase work showed the methodology could produce designs with catalytic activities competitive with directed-evolution-engineered enzymes.
De novo catalyst design as a frontier
The aspiration is genuinely de novo enzyme design — proposing a novel catalytic activity that doesn't exist in nature, designing the active-site geometry to accomplish it, and engineering a protein scaffold that places the active site correctly. This has been an aspiration for thirty years and has not yet been fully demonstrated, but the empirical landscape in 2026 is substantially more optimistic than in 2020. Specific industrial applications under active development include: enzymes for plastic degradation (cracking PET into recyclable monomers), enzymes for synthetic biology (incorporating non-canonical amino acids into proteins), enzymes for chemical manufacturing (replacing harsh chemical processes with mild biocatalysis), and enzymes for therapeutic applications (replacing missing enzymatic activities in inborn errors of metabolism). The 2024 Nobel Prize to Baker explicitly recognised the protein-design contribution including enzyme design, marking it as a genuine scientific maturation.
The data problem for design
A specific challenge worth flagging: design methods are trained mostly on natural protein structures (PDB) but are asked to produce novel proteins that may not resemble training data. This is a substantial out-of-distribution generalisation problem, and the methodology has accumulated empirical observations about when designs work and when they fail. Designs close to natural folds tend to fold correctly; designs far from natural folds fail more often. Designs that recapitulate evolutionary conservation patterns (PLM-likelihood-favoured) tend to fold correctly; designs that violate them fail more often. The 2024–2026 frontier of design methodology includes substantial work on quantifying and managing the design distribution — explicit conditioning on PLM likelihoods, retrieval-augmented design that leans on structural homologues, and active-learning loops that experimentally validate uncertain designs.
Evaluation, Open Science, and the Frontier
The previous sections developed the established methodology of AI for protein science. This final section turns to the evaluation infrastructure that grounds empirical claims, the open-vs-closed tensions that have shaped the field's recent development, and the open frontiers that will likely dominate the next several years.
The evaluation infrastructure
Several long-running benchmarks anchor empirical evaluation in different sub-areas. CASP (Critical Assessment of protein Structure Prediction) has run every two years since 1994 and is the canonical structure-prediction benchmark — the venue where AlphaFold 1 (2018), AlphaFold 2 (2020), and subsequent advances were demonstrated. Targets are sequences whose structures have just been solved but not yet released; teams submit predictions; the community evaluates them after the fact. CAMEO (Continuous Automated Model EvaluatiOn) provides weekly evaluation rather than biennial, with newly-deposited PDB structures used as hold-out targets immediately after deposition. CAFA (Critical Assessment of Functional Annotation) evaluates function-prediction methods. CAPRI (Critical Assessment of PRedicted Interactions) evaluates protein-protein docking. ProteinGym (Notin et al. 2023) provides a benchmark for variant-effect prediction over hundreds of DMS datasets.
The benchmark culture has substantially shaped methodological development. Methods are routinely evaluated against multiple benchmarks before publication, and the evaluation practice has matured along with the methods themselves. Common practices include reporting variance across multiple random seeds, careful temporal splitting to avoid data leakage from continually-updated databases, and explicit reporting of failure modes alongside aggregate accuracy. The 2023–2024 wave of papers includes substantially more rigorous evaluation than the 2018–2020 wave; the field has reached a level of evaluation discipline that compares favourably to other AI application areas.
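The temporal-splitting practice can be sketched directly — anything deposited after the training cutoff is held out for evaluation only. The entries and dates below are illustrative:

```python
# Temporal splitting to avoid leakage from continually updated databases:
# anything deposited after the training cutoff is evaluation-only.
# The entry IDs and dates below are illustrative.
from datetime import date

def temporal_split(entries, cutoff):
    """entries: list of (id, deposition_date); returns (train, test)."""
    train = [e for e in entries if e[1] <= cutoff]
    test = [e for e in entries if e[1] > cutoff]
    return train, test

entries = [
    ("1ABC", date(2019, 3, 1)),
    ("7XYZ", date(2022, 6, 15)),
    ("8QRS", date(2024, 1, 10)),
]
train, test = temporal_split(entries, cutoff=date(2021, 12, 31))
# Structures deposited after the cutoff never appear in training.
```

This is the same logic CASP and CAFA enforce organisationally: the evaluation targets simply did not exist, as public data, when the models were trained.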
The open-vs-closed tension
A specific methodological controversy has shaped the post-AlphaFold landscape: the tension between open and closed releases of foundational scientific tools. AlphaFold 2 (2021) was released open-source with weights and detailed architectural documentation. ESM-1, ESM-2, and most academic methods have followed similar open-release norms. RoseTTAFold and the Baker-lab tooling have been consistently open. RFdiffusion, ProteinMPNN, AlphaMissense are all open.
AlphaFold 3 (May 2024) broke this pattern. The initial release was a web server with substantial usage limits — no open-source code, no released weights. The justification cited commercial considerations (Isomorphic Labs, DeepMind's drug-discovery spinoff, has commercial interests in the technology) and competitive concerns. The community pushback was substantial — academics felt that a proprietary structure-prediction tool from what was effectively a commercial entity (Alphabet-owned DeepMind) ran against the open-science traditions the field had operated under, and that the resulting two-tier access risked creating an asymmetric playing field. DeepMind partially walked this back in November 2024 with the release of inference-only weights for non-commercial use, but the broader pattern — major scientific foundation models being released with restrictive licensing — has continued in subsequent releases (parts of ESM-3 are non-commercial; some of the 2024–2026 protein-design models are closed). The methodology of how to govern foundational scientific tools remains contested, and the resolution is unclear as of 2026.
Industrial vs academic deployment
The protein-AI field as of 2026 includes substantial industrial deployment. Beyond the major academic labs (Baker at UW, Anandkumar group at Caltech, Dauparas group, the various DeepMind-Isomorphic-EvolutionaryScale ecosystems) there are now dozens of well-funded protein-AI startups — Generate Biomedicines, Cradle, Absci, A-Alpha Bio, Aionics, Profluent, Nabla Bio, Cyrus Bio, Enveda, the various others. Major pharma companies (Roche, Pfizer, AstraZeneca, Bayer, Genentech, the various others) have dedicated AI-protein groups. The result is that "protein-AI" in 2026 is a commercial industry as much as an academic discipline, with substantial implications for talent flow, intellectual-property dynamics, and the scope of what gets published openly versus kept proprietary.
The active research frontier
Several frontier directions are particularly active in 2026.
Foundation models for protein-and-everything-else — extending the protein-foundation-model paradigm to include nucleic acids, small molecules, lipids, and the full biomolecular landscape. ESM-3 partly does this; AlphaFold 3 partly does this; the 2025 wave of multi-modal foundation models (e.g., Boltz-1, the various academic and industrial efforts) goes further. The aspiration is a single model that handles essentially any biomolecular structure-prediction or design task across modalities.
Dynamics-aware methods — going beyond static-structure prediction to handle conformational ensembles, induced-fit binding, and time-dependent behaviour. The 2024–2026 wave includes substantial work on Boltzmann generators, ensemble-AlphaFold variants, conformer-sampling methods, and explicit dynamics integration. The methodology is genuinely early but with active progress.
Cellular-context integration — connecting protein-level predictions to cellular-level outcomes. A protein's behaviour depends on which other proteins it interacts with, what PTMs it carries, where in the cell it sits, and what cellular conditions exist. The 2024–2026 wave of cellular-context-aware protein methods (integrating with the Ch 06 single-cell methodology) is an active frontier with substantial implications for drug discovery and basic biology.
Functional prediction at scale — annotating function for the millions of proteins in UniProt that lack experimental characterisation. ESM-3, AlphaMissense, and the various 2024–2026 functional foundation models have substantially expanded what's tractable, but the methodology continues to mature.
Therapeutic design at deployment scale — moving AI-designed proteins through clinical development. The 2024–2026 wave includes the first AI-designed therapeutic candidates entering clinical trials; the next several years will reveal whether AI-designed therapeutics achieve the success rates traditional approaches do, with implications for the broader pharma industry's adoption of AI methods.
The data flywheel
A specific feedback loop worth flagging. AI-designed proteins, when synthesised and tested, generate experimental data on what works and what doesn't. This data feeds back into training sets for the next generation of methods, which produce better designs, which produce more data, and so on. The major industrial deployments (Generate, Cradle, the various others) operate this loop at scale, with proprietary experimental data providing competitive moats on top of the better-known publicly released methods. The methodology of effective deployment increasingly involves not just the AI methods but the experimental-data-generation pipeline, and the major commercial differences between competing protein-AI companies often live in the data infrastructure rather than the model architectures.
What 2028 might look like
Speculative but useful. By 2028, several developments seem likely. Foundation models that handle proteins, nucleic acids, and small molecules in a single architecture will be standard, with AlphaFold-3-style multimodal models displacing single-modality approaches. AI-designed proteins will routinely enter clinical trials, with a small number of approved AI-designed therapeutics setting precedents for regulatory pathways. The dynamics frontier will substantially mature, with AlphaFold-with-dynamics methods becoming routine. Cellular-context integration will close some of the gap between protein-level predictions and biology-level outcomes. The open-vs-closed tension will resolve in some direction — perhaps with regulatory frameworks for "scientific foundation models," perhaps with stable open alternatives that displace closed releases.
What this chapter does not cover
Several adjacent areas are out of scope. The substantial molecular-dynamics literature (which simulates protein behaviour at the atomic level using physics-based methods) intersects this chapter at several points but is mostly skipped. The biochemistry of specific protein classes (G-protein-coupled receptors, ion channels, kinases, the various others) has its own methodology that we touch on only briefly. Detailed pharmacology of antibody-drug conjugates, bispecific antibodies, and the various engineered therapeutic protein formats is largely covered in Ch 07–08. The substantial structural-biology methodology (cryo-EM image processing, the various other experimental-data analysis pipelines) connects to AI methods but is its own discipline. And the broader bioethics of AI-designed therapeutics, AI-designed enzymes, and protein-design dual-use concerns is a substantial topic that the chapter touches only briefly. The methodology developed here is the practical AI-for-protein-science discipline; the broader landscape is genuinely vast.
Further reading
A combined reading list for protein science and AI for protein science. The biology-foundation references — Anfinsen 1973, Brändén & Tooze, Berg/Tymoczko/Gatto/Stryer, Murzin et al. 1995 (SCOP), Sonnhammer et al. 1997 (Pfam), Altschul et al. 1990 (BLAST), van Kempen et al. 2024 (Foldseek), Yip et al. 2020 (cryo-EM), Hornbeck et al. 2015 (PhosphoSitePlus) — establish the protein-science substrate. The AI-methodology references — Jumper et al. 2021 (AlphaFold 2), Abramson et al. 2024 (AlphaFold 3), Lin et al. 2023 (ESM-2), Watson et al. 2023 (RFdiffusion), Dauparas et al. 2022 (ProteinMPNN), Cheng et al. 2023 (AlphaMissense) — establish the methodology. The field's literature evolves rapidly, and the 2024–2026 wave continues to add foundational results.
- Principles That Govern the Folding of Protein Chains (Anfinsen 1973). The foundational paper that established the central tenet of protein science: the amino-acid sequence completely determines the three-dimensional structure. Anfinsen's experiments on ribonuclease — denaturing the protein and then watching it spontaneously refold — provided the empirical basis for everything that followed, including the protein-folding problem and ultimately AlphaFold. The 1972 Nobel Prize in Chemistry recognised this work. The reference for sequence-determines-structure.
- Introduction to Protein Structure (Brändén & Tooze). The classic textbook on protein structure. Comprehensive coverage of the four levels of structure, common folds, the major structural families, and the experimental techniques that determine structure. Predates AlphaFold but remains the foundational reference for understanding the structural-biology landscape. The right starting reference for an AI reader engaging with the field seriously. The reference protein-structure textbook.
- Biochemistry (Berg, Tymoczko, Gatto & Stryer). The standard biochemistry textbook. Comprehensive coverage of amino acids and peptide bonds, the major functional classes of proteins, enzyme mechanism, regulatory mechanisms including PTMs, and the integration of proteins into broader cellular processes. The right reference for the chemistry-biology bridge that the early sections of this chapter operate on, and substantially deeper than Brändén & Tooze on functional and biochemical aspects. The reference biochemistry textbook.
- SCOP: A Structural Classification of Proteins Database (Murzin et al. 1995). The original SCOP paper. Establishes the hierarchical structural classification (class, fold, superfamily, family) that has been the working organisation of fold space for thirty years. SCOP has been succeeded by SCOP2 and SCOPe with various refinements; the original 1995 paper remains the foundational reference for understanding how protein folds are catalogued. The reference for structural classification.
- The Pfam protein families database (Sonnhammer et al. 1997). The original Pfam paper. Establishes the HMM-based protein-family classification that has become the dominant sequence-level domain database. Pfam has been continuously maintained and expanded for nearly thirty years and is now part of the InterPro consortium; the methodology has substantially shaped both the working vocabulary of protein bioinformatics and the training-data preparation of modern protein-AI methods. The reference for protein families.
- Basic Local Alignment Search Tool (Altschul et al. 1990). The BLAST paper. Among the most-cited papers in all of science, with well over 100,000 citations. Establishes the heuristic seed-and-extend algorithm for fast sequence-database search that has been the workhorse of molecular biology for thirty-five years. The methodology has been refined extensively (PSI-BLAST, HHblits, MMseqs2, and other successors) but the original 1990 paper remains the foundational reference. The reference for sequence search.
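The seed-and-extend idea at BLAST's core fits in a few lines. The sketch below is a toy illustration, not BLAST itself: it uses a flat match/mismatch score where the real tool uses BLOSUM substitution matrices, and extends seeds in one direction only; all function names are invented here.

```python
def build_kmer_index(target, k=3):
    """Index every k-mer ("word") of the database sequence by position."""
    index = {}
    for i in range(len(target) - k + 1):
        index.setdefault(target[i:i + k], []).append(i)
    return index

def ungapped_extend(query, target, qi, ti, k, match=2, mismatch=-1, x_drop=4):
    """Extend an exact k-mer seed to the right, stopping when the running
    score falls x_drop below the best seen (BLAST's X-drop heuristic)."""
    score = best = match * k
    length = best_len = k
    while qi + length < len(query) and ti + length < len(target):
        score += match if query[qi + length] == target[ti + length] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif score <= best - x_drop:
            break
    return best, best_len

def best_local_hit(query, target, k=3):
    """Seed on shared k-mers, extend each seed, and return the top-scoring
    hit as (query_pos, target_pos, score, length), or None if no seed."""
    index = build_kmer_index(target, k)
    hits = [(qi, ti, *ungapped_extend(query, target, qi, ti, k))
            for qi in range(len(query) - k + 1)
            for ti in index.get(query[qi:qi + k], [])]
    return max(hits, key=lambda h: h[2]) if hits else None
```

For example, `best_local_hit("MKTAYIAK", "GGMKTAYIAKGG")` recovers the full eight-residue match beginning at target position 2. Real BLAST layers two-hit seeding, gapped extension, and E-value statistics on top of this skeleton.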
- Foldseek: fast and accurate protein structure search (van Kempen et al. 2024). The Foldseek paper. The dominant tool (as of 2024) for fast structure-vs-structure search across the AlphaFold Database, processing queries at speeds comparable to BLAST while comparing structures rather than sequences. Cited here as the methodology that has begun to reorganise how protein-family relationships are detected — structural homology often reveals connections that sequence similarity alone cannot. The reference for fast structural search.
- Single-particle cryo-EM at atomic resolution (Yip et al. 2020). A landmark cryo-EM paper demonstrating ~1.25 Å resolution for the iron-storage protein apoferritin — atomic resolution by any standard, achieved with cryo-EM rather than X-ray crystallography. Cited here as a marker of the cryo-EM resolution revolution that has transformed structural biology since the mid-2010s. The natural reading for understanding the post-2017 cryo-EM landscape. The reference for atomic-resolution cryo-EM.
- PhosphoSitePlus, 2014: mutations, PTMs and recalibrations (Hornbeck et al. 2015). The reference paper for PhosphoSitePlus, the canonical curated database of post-translational modifications. Catalogues experimentally observed phosphorylation, ubiquitination, acetylation, methylation, sumoylation, and other PTMs across the human proteome with detailed citations. The methodology of modern PTM-aware protein analysis routinely uses PhosphoSitePlus annotations. The reference for PTM data.
- Highly accurate protein structure prediction with AlphaFold (Jumper et al. 2021). The AlphaFold 2 paper. The most-cited paper in the field, documenting the methodology that effectively solved the protein-folding problem at CASP14 in 2020. Covers the Evoformer trunk, the SE(3)-equivariant structure module, the end-to-end training methodology, and the empirical results that made structure prediction a mature science. The 2024 Nobel Prize in Chemistry recognised this work. The watershed protein-AI paper.
- Accurate structure prediction of biomolecular interactions with AlphaFold 3 (Abramson et al. 2024). The AlphaFold 3 paper. Extends the AlphaFold methodology to handle protein-protein, protein-DNA, protein-RNA, and protein-ligand complexes via a unified diffusion-based architecture. The diffusion decoder replaces AlphaFold 2's structure module, and the model handles arbitrary biomolecular components with substantially improved accuracy on docking benchmarks. Marked the beginning of the open-vs-closed tension that has shaped the post-2024 field. The reference for multi-component structure prediction.
- Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM-2; Lin et al. 2023). The ESM-2 paper. Demonstrates that a 15-billion-parameter protein language model trained on ~65M sequences with masked language modelling produces representations that contain sufficient structural information for direct prediction (ESMFold) without an explicit MSA. Establishes scaling laws for protein language models and the methodology that has become the dominant PLM substrate. The reference for protein language models at scale.
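The masked-language-modelling objective behind ESM-style training is easy to state concretely. Below is a minimal sketch of the BERT-style corruption step (the 80/10/10 split is the convention BERT popularised; the mask rate, token names, and function are illustrative, not ESM-2's actual preprocessing code):

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "<mask>"

def corrupt_for_mlm(seq, mask_rate=0.15, rng=None):
    """Select ~mask_rate of positions as prediction targets; of those,
    replace 80% with the mask token, 10% with a random amino acid, and
    leave 10% unchanged. A model is then trained with cross-entropy loss
    to recover `targets` from `tokens` at the selected positions."""
    rng = rng or random.Random(0)
    tokens = list(seq)
    targets = {}  # position -> true residue the model must predict
    for i, aa in enumerate(seq):
        if rng.random() < mask_rate:
            targets[i] = aa
            r = rng.random()
            if r < 0.8:
                tokens[i] = MASK
            elif r < 0.9:
                tokens[i] = rng.choice(AMINO_ACIDS)
            # else: keep the original residue (model still predicts it)
    return tokens, targets
```

The "structural information emerges" result is that a transformer trained on nothing but this objective, at sufficient scale, yields embeddings from which contacts and folds can be read out directly.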
- De novo design of protein structure and function with RFdiffusion (Watson et al. 2023). The RFdiffusion paper. Establishes the diffusion-based methodology for generative protein backbone design that has become the dominant approach since 2023. The paper documents the architecture (a RoseTTAFold trunk plus a diffusion process over backbones in SE(3)), the training methodology, and a substantial range of design demonstrations (binders, symmetric oligomers, motif scaffolds, novel folds). David Baker's share of the 2024 Nobel Prize in Chemistry partly recognised this line of work. The reference for generative protein design.
- Robust deep learning–based protein sequence design using ProteinMPNN (Dauparas et al. 2022). The ProteinMPNN paper. Establishes the graph-neural-network-based methodology for inverse folding (designing sequences that fold to specified backbones) that has become the dominant approach. The paper documents the autoregressive decoding methodology, the substantial improvement over prior methods, and experimental validation showing that ProteinMPNN-designed sequences fold correctly far more often than Rosetta-designed sequences. The complementary half of the design pipeline alongside RFdiffusion. The reference for inverse folding.
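The random-order autoregressive decoding that ProteinMPNN uses can be sketched in the abstract: positions are visited in a shuffled order, and each residue choice conditions on everything already placed. In the sketch below, `score_fn` is a stand-in for the network's structure-conditioned logits, and all names are invented for illustration:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def decode_sequence(n_positions, score_fn, rng=None):
    """Greedy random-order autoregressive decoding: visit positions in a
    shuffled order and pick, at each, the residue that score_fn rates
    highest given the partially decoded sequence (None = not yet decoded)."""
    rng = rng or random.Random(0)
    order = list(range(n_positions))
    rng.shuffle(order)
    partial = [None] * n_positions
    for pos in order:
        partial[pos] = max(AMINO_ACIDS, key=lambda aa: score_fn(pos, aa, partial))
    return "".join(partial)
```

With a toy score function that prefers leucine at even positions and lysine at odd ones, any decoding order yields "LKLKLK" for six positions. In the real model the scores come from a message-passing encoder over backbone geometry, and decoding samples from a temperature-controlled softmax rather than taking the argmax.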
- Accurate proteome-wide missense variant effect prediction with AlphaMissense (Cheng et al. 2023). The AlphaMissense paper. Extends AlphaFold's Evoformer machinery to predict missense-variant pathogenicity for all 71 million possible human missense variants. Combines masked language modelling with a gnomAD population-frequency signal to produce calibrated pathogenicity scores; outperforms previous methods on benchmark sets and has been incorporated into clinical variant-interpretation pipelines globally. The reference for variant-effect prediction.
- Accurate prediction of protein structures and interactions using a three-track neural network (RoseTTAFold; Baek et al. 2021). The RoseTTAFold paper. The major academic alternative to AlphaFold 2, developed concurrently by David Baker's lab. The three-track architecture (1D sequence, 2D pairwise, 3D coordinates, with information flowing between tracks at each layer) became the substrate for many subsequent extensions including RFdiffusion. Open-source release made the methodology widely accessible to the academic community. The reference open-source structure prediction system.
- Simulating 500 million years of evolution with a language model (ESM-3; Hayes et al. 2025). The ESM-3 paper. Establishes a unified protein foundation model handling sequence, structure, and function as three modalities of the same data, with masked language modelling across all three. The largest, 98B-parameter variant demonstrated state-of-the-art results on standard benchmarks, plus generation of esmGFP, a novel fluorescent protein with ~58% identity to known proteins. The methodology represents the next step in PLM development beyond ESM-2. The reference for unified protein foundation models.
- Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences (ESM-1; Rives et al. 2021). The original ESM paper. The first demonstration that masked-language-modelling pretraining on 250 million UniParc sequences produces representations encoding substantial structural and functional information — the foundational result that catalysed the protein-language-model wave and set the stage for ESM-2, ESM-3, and the broader PLM ecosystem. The reference for the protein-as-language framing.
- ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design (Notin et al. 2023). The ProteinGym benchmark paper. Aggregates ~250 deep mutational scanning experiments into a unified benchmark for variant-effect prediction methods. Provides standardised evaluation across diverse protein families and assay types, enabling apples-to-apples comparison of zero-shot and supervised methods. The 2024–2026 wave of variant-effect papers routinely evaluates against ProteinGym. The reference benchmark for variant-effect prediction.
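ProteinGym's headline metric for zero-shot variant-effect prediction is the Spearman rank correlation between a method's scores and the measured DMS fitness values. A self-contained version of the standard definition (tie-aware ranking followed by Pearson correlation of the ranks; this is the textbook formula, not ProteinGym's exact evaluation code):

```python
def average_ranks(values):
    """1-based ranks, with ties assigned the average rank of their run.
    Assumes a non-empty, non-constant input."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(predicted, measured):
    """Spearman rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(predicted), average_ranks(measured)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

A perfectly rank-preserving predictor scores 1.0 and a perfectly inverted one scores -1.0; the benchmark reports this per DMS assay and aggregates across assays, which is what makes zero-shot and supervised methods directly comparable.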