Biology, Genomics & AI, from cells and the central dogma to genomic foundation models and the virtual cell.
Biology is the most data-rich science of the early twenty-first century. This chapter develops both the working vocabulary an AI reader needs to engage with computational biology (Sections 2–9 — cells and organelles, the central dogma, the genome, gene expression, Mendelian genetics, sequencing technologies, epigenetics, CRISPR) and the AI methodology that has substantially reshaped the field since 2015: sequence models for DNA and RNA (Section 11), genomic foundation models (Section 12), variant-effect prediction (Section 13), single-cell analysis (Section 14), spatial transcriptomics (Section 15), gene regulatory networks (Section 16), multi-omics integration (Section 17), and evaluation/interpretability/the frontier (Sections 18–19). The single chapter combines what the field treats as inseparable: the biology that frames the problems and the AI methods that address them.
Prerequisites & orientation
This chapter is both a domain primer and an AI-methods chapter. The first half (Sections 1–9) assumes no biology background beyond what most readers retain from high school. The second half (Sections 10–19) assumes the working machinery of modern deep learning (Part VI on transformers and CNNs in particular), the graph-neural-network material of Part XIII Ch 05 (useful for Section 16), the foundation-model material of Part X (the substrate for Sections 12 and 14), the survival-analysis methods of Part XIII Ch 06 (which connect to the variant-effect-prediction material of Section 13), and the equivariance methodology of Ch 01 Section 8. Readers with a strong biology background can skim Sections 2–9; readers with strong ML but no biology should take their time with the first half before engaging with the second.
Three threads run through the chapter. The first is the central dogma: information flows from DNA to RNA to protein, and most of computational biology is organised around predicting one layer from another. The second is the biology-as-language framing: DNA, RNA, and protein are sequences, and the toolkit developed for natural-language processing — embeddings, attention, masked-language-model pretraining, in-context learning — transfers to biological sequences with substantial empirical success. The framing is not the only useful one (some biological problems are genuinely structural, and the methodological details differ in important ways from NLP), but it has been the dominant productive analogy of the past five years and shapes most of the architectures the second half develops. The third is the evaluation problem: biological ground truth is expensive, indirect, and often noisy, and the field's evaluation practice has had to develop substantial discipline around what counts as a real result. Section 18 develops this in detail; it appears in passing throughout.
Why Biology, and Why Bio-AI
Modern biology is a science of information. Every cell carries a roughly three-billion-letter genetic instruction manual, copies and reads it billions of times in a lifetime, and uses the resulting molecules to build, maintain, and adapt itself. The methods of molecular biology over the past seventy years — discovering the double helix in 1953, decoding the genetic code in the 1960s, sequencing the first human genome in 2003, building CRISPR genome editing in 2012 — have steadily revealed the layers of this information system. This chapter develops both the working biology vocabulary an AI reader needs (Sections 2–9) and the AI methodology that has substantially reshaped the field since 2015 (Sections 10–19). Section 10 frames what makes AI-for-biology methodologically distinctive; this section maps the biology itself.
The information view of biology
The most useful framing of biology for an AI reader is the information view. DNA is a four-letter code (A, C, G, T) about three billion characters long in humans, organised into chromosomes; it is copied with high fidelity each cell division and transmitted from parent to offspring. The information is read by molecular machinery (RNA polymerase) that produces messenger RNA, which is then translated by other machinery (ribosomes) into proteins, which fold into three-dimensional shapes and do most of the actual work of cells. The whole system has elaborate copy-editing, regulation, error-correction, and modification machinery that varies the program by cell type, life stage, and environment. Each of these layers — the code, the copying, the reading, the translation, the regulation — is a target for both biological measurement and AI methods.
The scale problem
Biology spans roughly fifteen orders of magnitude in length: atoms (10⁻¹⁰ m), small molecules (10⁻⁹ m), proteins (10⁻⁸ m), organelles (10⁻⁶ m), cells (10⁻⁵ m), tissues (10⁻³ m), organs (10⁻¹ m), organisms (1 m), and ecosystems (10⁵ m). It also spans more than twenty-five orders of magnitude in time: chemical reactions (10⁻¹² s), protein folding (10⁻⁶ s), gene expression (minutes), cell division (hours), human lifetime (10⁹ s), evolution (10¹⁴ s). Different scales have different data, different methods, and different questions. This chapter operates primarily at the molecular and cellular scales where most AI applications live; the organismal and evolutionary scales appear briefly in Section 6.
What modern biology looks like
The 2025 working biologist spends most of their time on a computer. Wet-lab experiments still happen — mice, cell cultures, microscopes, gels — but the primary product of most experiments is now a large dataset (sequences, expression measurements, image stacks, mass-spectrometry traces) that has to be analysed computationally. Bioinformatics, once a sub-field, is now central. Most major biology labs include computational scientists; most major computational labs collaborate with experimentalists. The interface where biology and computation meet is the natural home of the AI methods this chapter develops. The cost of sequencing fell faster than the cost of the computing needed to process it; by 2020 the bottleneck had shifted from data generation to analysis, which is much of why AI methods became central rather than peripheral.
How this chapter is organised
Sections 2–9 develop the working biology vocabulary: cells and organelles, the central dogma, the genome, gene expression, Mendelian genetics and evolution, sequencing technologies, epigenetics, and CRISPR. Section 10 turns to the AI methodology proper, framing what makes the AI-for-biology subfield distinctive — its data substrate from an ML perspective, the biology-as-language analogy, the evaluation problem, the deployment gap, the interpretability premium, and the AlphaFold catalyst. Sections 11–19 develop the methods: sequence models for DNA and RNA (11), genomic foundation models (12), variant-effect prediction (13), single-cell analysis (14), spatial transcriptomics (15), gene regulatory networks (16), multi-omics integration (17), evaluation and interpretability (18), and the frontier (19). The single chapter combines what the field treats as inseparable: the biology that frames the problems and the AI methods that have substantially advanced what the field can do about them.
An AI reader approaching biology should think of it as: a four-letter digital code (DNA), a copying-and-reading system (transcription, translation), a regulatory layer (gene expression, epigenetics), and an editing toolkit (CRISPR). The wet-lab biology of cells, tissues, and organisms is the substrate, but the conceptual machinery is increasingly informational. Sections 2–9 develop what each of these means; Section 10 onward turns to the AI methodology that has begun to read, predict, and write biology at scale.
Cells and Cellular Architecture
Cells are the fundamental unit of biology. The 1839 Schleiden-Schwann cell theory holds: all living things are made of one or more cells; the cell is the basic structural and functional unit of life; all cells come from pre-existing cells. Modern biology has filled in the architecture, biochemistry, and dynamics, but the cell remains the anchor. Understanding cellular structure is necessary for understanding everything else.
Prokaryotes and eukaryotes
Cells fall into two broad categories. Prokaryotic cells (bacteria, archaea) are smaller (~1–10 μm), simpler, and lack a membrane-bound nucleus — their DNA floats free in the cytoplasm. Eukaryotic cells (animals, plants, fungi, protists) are larger (~10–100 μm), more complex, and have a nucleus and many internal compartments (organelles). The split is ancient: prokaryotes appeared roughly 3.5 billion years ago, eukaryotes roughly 2 billion years ago. Most AI-for-biology applications work with eukaryotic cells (typically human or mouse), but the prokaryotic system is essential for understanding the simpler version of the regulatory machinery and is the substrate for much of microbiology, infectious-disease research, and synthetic biology.
The eukaryotic cell, organised by compartment
A typical eukaryotic cell contains: a plasma membrane (the outer boundary, made of a phospholipid bilayer with embedded proteins); a cytoplasm (the interior, filled with cytoskeleton and organelles); a nucleus (which holds the DNA, organised into chromosomes, and is where transcription happens); mitochondria (which produce ATP, the cell's energy currency, and have their own small genome — a vestige of an ancient bacterial endosymbiosis); the endoplasmic reticulum and Golgi apparatus (which manufacture, modify, and traffic proteins); ribosomes (the molecular machines that translate mRNA into proteins, present in both prokaryotes and eukaryotes); lysosomes (waste-processing compartments in animal cells); and a cytoskeleton (microtubules, actin filaments, intermediate filaments) that gives the cell shape and enables movement. Plant cells additionally have chloroplasts (photosynthesis) and a cell wall (rigid outer shell).
The cell cycle
Most cells reproduce by division. The cell cycle has four canonical phases: G1 (growth and protein synthesis), S (DNA replication, when each chromosome is copied), G2 (further growth and preparation), and M (mitosis, the actual division of the cell into two daughters). Checkpoints between phases enforce quality control — cells that detect DNA damage halt at the G1/S or G2/M checkpoints to repair it; persistent damage triggers apoptosis, programmed cell death. Cancer is, at its core, a failure of these regulatory mechanisms: cells that escape cell-cycle controls and proliferate when they shouldn't. AI applications in cancer biology and drug discovery routinely engage with this machinery.
Cell types and differentiation
Multicellular organisms have many cell types — the human body has roughly 200 distinct cell types (neurons, muscle, hepatocytes, T cells, etc.), all sharing the same DNA but expressing different subsets of genes. Differentiation is the process by which a generic precursor cell (a stem cell, in extreme cases) commits to a specialised cell-type fate. The 2010s and 2020s revolution in single-cell biology — measuring gene expression, epigenetic state, and other properties one cell at a time, across tens of thousands or millions of cells — has substantially refined our understanding of cell types. The Human Cell Atlas project (2017–present) is the canonical reference effort, and AI methods (Section 14) are central to its analysis.
Tissues, organs, and organisms
Cells assemble into tissues (groups of similar cells working together — epithelium, connective, muscle, nervous), tissues into organs (heart, liver, brain), and organs into organ systems (cardiovascular, digestive, nervous). Most AI-for-biology applications operate at the molecular or cellular scale; the tissue and organ scales are increasingly tractable through spatial transcriptomics (Section 15) and tissue-imaging methods (the various 2024–2026 multimodal atlases), but they are harder. The organismal scale shows up in genetics (Section 6) and in clinical applications (Ch 05 of Part XIV); the population/evolutionary scale appears briefly in Section 6.
The Central Dogma
Francis Crick's 1958 statement of the central dogma remains the conceptual backbone of molecular biology: DNA makes RNA makes protein. The information flows in one direction; translation from RNA to protein is essentially never reversed. The details have been refined substantially since 1958, with reverse transcription (RNA back to DNA, in retroviruses) and various exceptions added, but the core picture holds. Understanding the dogma is a prerequisite for understanding everything that follows.
DNA: the storage layer
Deoxyribonucleic acid (DNA) is a long polymer made of four building blocks (nucleotides): adenine (A), cytosine (C), guanine (G), and thymine (T). The molecule is famously a double helix — two complementary strands wound around each other, with A pairing with T and C pairing with G. The pairing makes the molecule self-correcting and self-replicating: each strand is a template for synthesising the other, which is the basis of replication. The complementary structure also explains how mutations work (a single change in one strand creates a temporary mismatch that gets either repaired or propagated to both strands at next replication).
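To make the complementarity concrete in code: given one strand, the partner strand is fully determined — complement each base and read in the opposite direction, since the two strands run antiparallel. A minimal illustrative sketch, not tied to any particular library:

```python
# Complementarity in code: each strand fully determines the other.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(strand: str) -> str:
    """Return the partner strand, read 5' to 3'.

    A pairs with T and C pairs with G; the partner runs antiparallel,
    so we complement each base and then reverse the result.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))

assert reverse_complement("ATGC") == "GCAT"
# A mutation on one strand creates a transient mismatch with the partner,
# which repair machinery either corrects or the next replication fixes in place.
```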
RNA: the messenger
Ribonucleic acid (RNA) is a related polymer, with three differences from DNA: it uses the sugar ribose instead of deoxyribose; it uses uracil (U) in place of thymine; and it is typically single-stranded. RNA serves several functions. Messenger RNA (mRNA) carries the gene sequence from nucleus to cytoplasm, where ribosomes translate it. Transfer RNA (tRNA) brings amino acids to the ribosome during translation. Ribosomal RNA (rRNA) is part of the ribosome itself. Non-coding RNAs (lncRNAs, miRNAs, siRNAs) play regulatory roles that have been a major area of biology since the early 2000s. The mRNA category is the most important for AI applications because gene expression is conventionally measured by counting mRNAs — RNA-seq is the standard technique.
Proteins: the runtime
Proteins are polymers of amino acids — twenty distinct building blocks linked by peptide bonds. The amino-acid sequence (the primary structure) is encoded by the gene; the sequence determines a complex folded three-dimensional structure; the structure determines the function. Proteins do most of the actual work of cells: they catalyse chemical reactions (enzymes), provide structural support (collagen, actin), transport molecules (haemoglobin, channel proteins), regulate gene expression (transcription factors), and serve countless other roles. Understanding proteins is essentially understanding what cells do — and Ch 03 of Part XV (Protein Science & AI) develops both the protein-science vocabulary and the AlphaFold-era methodology that has transformed the field since 2020.
The genetic code
How does a four-letter DNA code encode a twenty-amino-acid protein language? The answer is codons — triplets of nucleotides. Each three-letter codon specifies one amino acid (or a "stop" signal). With four letters and three positions, there are 4³ = 64 possible codons, and the code is degenerate: most amino acids are specified by multiple codons. The genetic code was deciphered in the 1960s (Nirenberg, Khorana, others) and is nearly universal across life — bacteria, plants, and humans use essentially the same code. This universality is one of the strongest pieces of evidence for common descent of all known life, and it has practical importance for biotechnology: a gene cloned from a human can be expressed in a bacterial system, and the bacterial ribosomes will translate it correctly.
Transcription and translation
The mechanics of going from gene to protein involves two main steps. Transcription is the synthesis of RNA from a DNA template, carried out by an enzyme called RNA polymerase. Translation is the synthesis of protein from mRNA, carried out by ribosomes with the help of tRNAs that bring the right amino acid for each codon. Both processes have many regulatory mechanisms (Section 5 develops this), and both have been studied in molecular detail. The methodology of modern molecular biology — and the data substrate of much of AI for biology — comes from increasingly sophisticated techniques for measuring transcription and translation across the genome simultaneously.
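A toy version of the two steps makes the codon logic of the previous paragraphs concrete. The sketch below transcribes a coding-strand DNA sequence into mRNA (replace T with U — a simplification that ignores splicing and regulation) and translates non-overlapping codons using the standard genetic code; the 64-character string encodes NCBI translation table 1 with codon positions ordered T, C, A, G:

```python
# Toy central dogma: coding-strand DNA -> mRNA -> protein.
# The 64-character string is the standard genetic code (NCBI table 1),
# with codons ordered T, C, A, G at each of the three positions.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
BASE_INDEX = {"T": 0, "C": 1, "A": 2, "G": 3}

def transcribe(coding_dna: str) -> str:
    """The mRNA carries the coding strand's sequence with U in place of T."""
    return coding_dna.upper().replace("T", "U")

def translate(coding_dna: str) -> str:
    """Read non-overlapping codons until a stop codon ('*') is reached."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        codon = coding_dna[i:i + 3].upper()
        amino_acid = AMINO_ACIDS[16 * BASE_INDEX[codon[0]]
                                 + 4 * BASE_INDEX[codon[1]]
                                 + BASE_INDEX[codon[2]]]
        if amino_acid == "*":          # stop signal: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

# ATG (methionine, the start codon), GGC (glycine), TAA (stop):
assert transcribe("ATGGGCTAA") == "AUGGGCUAA"
assert translate("ATGGGCTAA") == "MG"
```

The degeneracy of the code is visible directly in the string: glycine, for example, appears four times in a row because all four GGN codons encode it.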
Exceptions and refinements
The central dogma has accumulated exceptions over seven decades, but they are exceptions, not refutations. Reverse transcription (RNA back to DNA) happens in retroviruses (HIV, the canonical example) and in some normal cellular processes (telomerase). Some viruses have RNA genomes that get translated directly. Prions (the cause of Creutzfeldt-Jakob disease and BSE) are infectious proteins that propagate without nucleic acid intermediates. Each exception has its own biology and its own AI applications, but the central dogma's information-flow framing remains the right starting point.
DNA, the Genome, and Chromosomes
Section 3 introduced DNA at the molecular level; this section develops the organisation, content, and dynamics of DNA at the genome scale. The genome is the complete set of DNA in a cell, and understanding its structure, content, and behaviour is essential for understanding most of modern biology.
The double helix
The structure of DNA was solved in 1953 by James Watson and Francis Crick at Cambridge, drawing heavily on X-ray crystallography data from Rosalind Franklin and Maurice Wilkins at King's College London. The model — two antiparallel strands of nucleotides wound into a right-handed double helix, with A-T and G-C base pairs holding the strands together — was immediately recognised as a Rosetta Stone for biology: the structure suggested both how the molecule encodes information (the sequence of bases) and how it is replicated (each strand serves as a template for the other). The 1953 paper is one of the most important single contributions to twentieth-century science.
Chromosomes
The DNA in a typical eukaryotic cell would, if stretched out, be roughly two metres long. To fit inside a 10 μm nucleus, it is highly compacted. The basic compaction unit is the nucleosome: roughly 147 base pairs of DNA wrapped around a complex of eight histone proteins. Nucleosomes assemble into 30-nm fibres, then higher-order loops and domains, ultimately producing the structure visible under a microscope as a chromosome. Humans have 46 chromosomes (23 pairs) — 22 autosomal pairs plus one pair of sex chromosomes (XX in females, XY in males). Other organisms have different counts: dogs have 78, fruit flies have 8. The chromosome count is not a measure of complexity; what matters is the content.
The human genome at a glance
The human genome is roughly three billion base pairs long. About 1–2% of it codes directly for proteins (~20,000 protein-coding genes). The other 98% includes regulatory regions (promoters, enhancers — Section 5), structural elements (centromeres, telomeres), repetitive sequences (transposable elements, simple repeats), and genuinely unknown regions whose function is still being investigated. The "junk DNA" framing of the 1970s has substantially shifted with the ENCODE project (2003–present), which has shown that much of the non-coding genome is biochemically active. The current consensus is that the genome is more functionally rich than the protein-coding fraction would suggest, but exactly how much of the non-coding is functional vs. neutral remains an active research question.
DNA replication
Before each cell division, the entire genome must be copied. The process is carried out by DNA polymerase with the help of a substantial supporting cast of enzymes (helicases that unwind the helix, primases that initiate synthesis, ligases that seal nicks, topoisomerases that relieve torsional stress). Replication is fast (~50 base pairs per second per polymerase) and accurate (~one error per 10¹⁰ base pairs, with proofreading). The 2024 understanding of replication is detailed enough that it is the substrate of substantial therapeutic development — many antibiotics and chemotherapies target replication enzymes.
Mutations and DNA repair
Despite the high accuracy of replication, mutations happen — typos in the genome that propagate to daughter cells. Most mutations are neutral (in non-coding regions, or silent at the protein level), some are deleterious, and a small fraction are beneficial. Point mutations (single-base changes) include both transitions (purine↔purine or pyrimidine↔pyrimidine: A↔G, C↔T) and the rarer transversions (purine↔pyrimidine). Insertions and deletions (indels) shift the reading frame in coding regions, often producing premature stop codons. Larger structural variations (copy-number changes, translocations, inversions) reshape the genome at scales of kilobases to megabases.
Cells maintain a portfolio of DNA repair mechanisms, each specialised for a different damage type. Mismatch repair (MMR) handles replication errors that escape polymerase proofreading — the MSH/MLH machinery recognises mismatched base pairs and replaces the daughter-strand stretch using the parent strand as template; loss-of-function mutations in MMR genes (MSH2, MLH1) cause Lynch syndrome. Base-excision repair (BER) handles small chemical lesions to individual bases (oxidation, deamination, alkylation) — DNA glycosylases excise the damaged base, leaving an abasic site that AP endonuclease processes. Nucleotide-excision repair (NER) handles bulky helix-distorting damage (UV-induced thymine dimers, chemical adducts) by excising a short oligonucleotide and re-synthesising; loss-of-function causes xeroderma pigmentosum. Homologous recombination (HR) repairs double-strand breaks accurately by using a sister chromatid as template; the BRCA1/BRCA2 proteins are central to this pathway, and their loss substantially elevates cancer risk. Non-homologous end joining (NHEJ) repairs double-strand breaks by direct ligation, faster but error-prone. Failures across these pathways produce specific cancer-predisposition syndromes, and the mechanistic understanding has produced clinically-deployed drugs — PARP inhibitors (Olaparib, Rucaparib) exploit the synthetic lethality between HR-deficient cancers and BER inhibition. AI methods for variant-effect prediction (Section 13) increasingly need to handle these repair-pathway-specific contexts to predict pathogenicity correctly.
Why genomes vary
Two unrelated humans differ at roughly one in every thousand base pairs — about three million single-nucleotide polymorphisms (SNPs) between any two individuals, plus various structural variants. Most of this variation is benign; some predisposes to disease, affects drug response, or shapes traits. Population genetics studies how this variation arose, how it propagates, and what it tells us about human history. The 1000 Genomes Project (2008–2015), the UK Biobank (~500K participants recruited from 2006), and the All of Us Research Program (enrolling toward ~1M participants since 2018) are the canonical reference resources, and AI methods (Section 13) are central to their analysis.
Genes and Gene Expression
A gene is more than just a stretch of DNA encoding a protein. The full picture of how genes are read, when, and at what level — gene expression — is the conceptual core of modern molecular biology. AI methods for biology engage extensively with the regulatory machinery that controls expression, and understanding this layer is essential for understanding most AI-for-biology applications.
Anatomy of a gene
A typical eukaryotic gene has several parts. The coding sequence (the exons) encodes the eventual protein; in many genes, exons are interrupted by introns (non-coding sequences that are removed before translation). The promoter sits upstream of the gene and binds RNA polymerase to initiate transcription. Enhancers are regulatory regions (often distant from the gene) that boost transcription. Silencers repress it. The 5′ UTR and 3′ UTR are untranslated regions of the mRNA that contain regulatory signals controlling translation, mRNA stability, and localisation. Understanding these elements is essential for predicting how a given DNA sequence will behave.
Transcription factors
The proteins that read regulatory sequences and turn genes on or off are called transcription factors (TFs). Humans have roughly 1,600 TFs, each binding to a specific short DNA sequence (the TF's binding motif). A typical gene's expression is controlled by a combination of dozens of TFs, with the integration logic (which TFs turn it on, which turn it off, and how the signals combine) constituting a substantial part of regulatory biology. Predicting TF binding sites from sequence and predicting gene expression from TF activity are canonical AI-for-biology problems (Section 11 develops the methods).
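The classical computational object behind "a TF's binding motif" is the position weight matrix (PWM): one probability per base per motif position, scored against a background. A small sketch with a hypothetical four-position motif (the deep-learning methods of Section 11 effectively learn such matrices as convolutional filters):

```python
import numpy as np

# Scanning a sequence with a position weight matrix (PWM), the classical
# model of a transcription-factor binding motif. The matrix below is
# hypothetical; real motifs come from databases such as JASPAR.
BASES = "ACGT"
# Rows: A, C, G, T. Columns: motif positions. Entries: base probabilities.
pwm = np.array([[0.80, 0.10, 0.10, 0.70],   # A
                [0.05, 0.70, 0.10, 0.10],   # C
                [0.10, 0.10, 0.70, 0.10],   # G
                [0.05, 0.10, 0.10, 0.10]])  # T
log_odds = np.log2(pwm / 0.25)              # score against a uniform background

def best_hit(sequence: str) -> tuple:
    """Slide the motif along the sequence; return (best position, best score)."""
    width = pwm.shape[1]
    scores = [
        sum(log_odds[BASES.index(base), j]
            for j, base in enumerate(sequence[start:start + width]))
        for start in range(len(sequence) - width + 1)
    ]
    best = int(np.argmax(scores))
    return best, scores[best]

# The embedded ACGA (the motif's consensus) scores highest, at position 2.
print(best_hit("TTACGATT"))
```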
RNA processing
The journey from gene to protein has several intermediate steps. Splicing removes introns from pre-mRNA and joins exons; alternative splicing can produce multiple distinct proteins from the same gene by including or excluding different exons. Capping adds a chemical modification to the 5′ end of the mRNA. Polyadenylation adds a tail of A nucleotides to the 3′ end. Each modification is regulated and contributes to mRNA stability, translation efficiency, and localisation. The 2020s wave of AI-for-biology methods includes substantial work on predicting splicing patterns from sequence (SpliceAI, Pangolin, the various successors) and on predicting RNA structure and modification patterns.
Translation regulation
Even after an mRNA is made and processed, its translation into protein is regulated. The untranslated regions (UTRs) contain motifs that affect ribosome recruitment, translation rate, and mRNA half-life. microRNAs bind to specific mRNAs and either degrade them or block translation. Phosphorylation of translation factors can globally up- or down-regulate translation in response to stress. The result is that the mRNA level (which RNA-seq measures) and the protein level (the actual functional output) often correlate only modestly — a complication that shapes the design of AI methods that aim to predict cellular state from gene-expression data.
The expression atlas
Modern biology has produced extensive measurements of gene expression across cell types, tissues, conditions, and species. The Genotype-Tissue Expression project (GTEx) provides expression measurements across ~50 human tissues. The Tabula Sapiens and Tabula Muris atlases provide single-cell expression measurements across many human and mouse cell types. The Allen Brain Atlas provides spatial expression measurements across the mouse and human brain. These atlases are reference resources for biology and the substrate for many AI applications. Knowing they exist and what they contain is essential for understanding where AI-for-biology gets its training data.
Mendelian Genetics and Evolution
Sections 2–5 covered the molecular layer; this section steps up to the inheritance layer — how genetic variation passes between generations, how it produces traits, and how it changes over evolutionary time. The methodology connects directly to population genetics and to the genome-wide association studies (GWAS) that are a major application area of AI-for-biology.
Mendel's laws
Gregor Mendel's 1866 paper on pea plants established the foundational laws of inheritance, decades before the molecular substrate (DNA) was understood. The law of segregation: each parent transmits one of two alleles per gene to offspring. The law of independent assortment: alleles of different genes segregate independently. Dominance: some alleles are dominant (one copy is sufficient to express the trait); others are recessive (both copies needed). Mendel's framework, refined and extended through the twentieth century, remains the conceptual backbone of classical genetics.
Alleles, genotypes, and phenotypes
An allele is a variant of a gene; genotype is the combination of alleles an individual carries; phenotype is the observable trait that results. The relationship between genotype and phenotype is rarely as clean as Mendel's pea plants suggested. Most traits are polygenic (influenced by many genes); many show incomplete penetrance (the genotype increases trait risk but doesn't determine it); environment interacts with genetics extensively. Modern polygenic risk scores aggregate the effects of thousands of variants to predict trait probability, and AI methods are increasingly central to their development.
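A polygenic risk score is, at its simplest, a weighted sum: count how many copies of each risk allele an individual carries and weight each count by the variant's estimated effect size. A minimal sketch with made-up numbers (real scores use GWAS summary statistics for thousands to millions of variants, plus careful handling of ancestry and linkage disequilibrium):

```python
import numpy as np

# A polygenic risk score as a weighted sum of allele counts.
# Effect sizes and genotypes are invented for illustration.
effect_sizes = np.array([0.12, -0.05, 0.30, 0.02])   # per-allele effects at four variants
genotypes = np.array([
    [0, 1, 2, 1],   # individual 1: copies of the effect allele at each variant
    [2, 0, 0, 1],   # individual 2
])
scores = genotypes @ effect_sizes
print(scores)        # one score per individual: [0.57 0.26]
```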
Meiosis and recombination
Sexual reproduction involves a special form of cell division called meiosis that produces gametes (eggs and sperm) with one copy of each chromosome rather than two. During meiosis, chromosomes from the two parents undergo recombination — physical exchange of DNA segments — which shuffles alleles between chromosomes and generates the genetic diversity in offspring. The recombination rate varies along the genome; "recombination hotspots" are well-mapped regions where exchanges happen frequently. Understanding recombination is essential for understanding why some alleles travel together (linkage disequilibrium) and how GWAS works.
Evolution and natural selection
Over generations, the frequencies of alleles in populations change. Natural selection increases the frequency of alleles that improve survival or reproduction; genetic drift changes frequencies randomly, especially in small populations; migration introduces alleles from other populations; mutation creates new alleles. These four forces, formalised by Fisher, Haldane, Wright, and others in the 1920s–1940s "modern synthesis," produce the patterns of genetic variation we observe today. Modern population genetics uses these principles to infer demographic history, identify selection signatures, and understand the genetic architecture of disease.
GWAS and the genetics of common disease
The most-impactful application of population genetics in the last twenty years has been the genome-wide association study (GWAS). Sequence (or genotype at common SNPs) thousands to millions of individuals; record their phenotypes (height, BMI, disease status); test which SNPs are associated with which phenotypes. The methodology has identified tens of thousands of trait-associated variants since 2007, with substantial implications for our understanding of disease mechanism, drug-target identification, and personalised medicine. Most associations have small effect sizes — common diseases are highly polygenic — and the AI methods of Section 13 are increasingly central to extracting actionable insight from GWAS data.
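The core statistical step of a GWAS, repeated once per variant, is an association test between genotype and phenotype. The simplest case-control version compares allele counts in a 2×2 table; production pipelines use regression with covariates (ancestry principal components, age, sex) and demand genome-wide significance (conventionally p < 5×10⁻⁸) to compensate for the roughly one million independent tests. A sketch with synthetic counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# One SNP's association test: do allele counts differ between cases and controls?
# The counts below are synthetic.
#                  effect allele   other allele
table = np.array([[1200,           800],     # cases
                  [1000,          1000]])    # controls
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi-square = {chi2:.1f}, p = {p_value:.1e}")
# A real GWAS repeats this (or a regression equivalent) for millions of SNPs
# and keeps only associations passing genome-wide significance.
```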
Sex and the X-Y
A specific genetic complication worth flagging: the X and Y chromosomes. Females are typically XX, males typically XY. Genes on the X chromosome are inherited differently than autosomal genes (males have only one copy; X-linked recessive disorders like haemophilia are much more common in males). The Y chromosome is small and largely degenerate but contains the SRY gene that triggers male development. The 2023 telomere-to-telomere sequencing of the human Y completed our reference catalogue. AI-for-biology methods routinely have to handle X- and Y-specific complications, and naive autosomal-only analyses miss substantial biology.
Sequencing Technologies
The data substrate of modern molecular biology is sequencing — the experimental technique for reading DNA (or RNA) base by base. The cost has fallen by roughly six orders of magnitude since 2000, making sequencing ubiquitous and the resulting data the largest publicly-available biological dataset. Understanding the technology is essential for understanding what the data actually represents.
Sanger sequencing
The first widely-used DNA sequencing method, developed by Fred Sanger in 1977, uses chain-terminating dideoxynucleotides to produce a series of fragments of varying lengths, separated by gel electrophoresis. The method is accurate but slow and expensive — read length is up to ~1,000 bases, and the throughput is one read at a time. Sanger sequencing produced the original Human Genome Project draft (2000–2003) at a cost of roughly $3 billion. It remains the gold standard for low-volume, high-accuracy applications (validating individual variants, sequencing single genes), but for most modern applications it has been displaced.
Next-generation sequencing
Starting around 2007, a new generation of sequencing technologies — next-generation sequencing (NGS), or "short-read" sequencing — substantially changed the economics. Illumina, the dominant player, uses sequencing-by-synthesis: DNA is fragmented into short pieces (~150–300 bases), amplified into clusters on a flow cell, and read in parallel using fluorescent nucleotides and a camera. A modern Illumina NovaSeq X can produce roughly 16 Tb of sequence data per run — billions of reads — at a cost of a few hundred dollars per genome. The cost reduction has been the single most important factor in the rise of computational biology; nearly all the public datasets cited above were produced on Illumina or similar platforms.
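The throughput numbers translate into genomes per run with simple arithmetic — coverage is total bases sequenced divided by genome size. A back-of-envelope sketch (all figures approximate):

```python
# Back-of-envelope sequencing arithmetic (approximate figures).
run_output_bases = 16e12      # ~16 Tb per high-end NovaSeq X run
read_length = 150             # bases per read (typical short read)
genome_size = 3.1e9           # human genome, base pairs
target_coverage = 30          # standard whole-genome depth

reads_per_run = run_output_bases / read_length
genomes_per_run = run_output_bases / (genome_size * target_coverage)
print(f"~{reads_per_run:.1e} reads per run, ~{genomes_per_run:.0f} genomes at {target_coverage}x coverage")
```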
Long-read sequencing
Short reads have a fundamental limitation: they cannot resolve large structural variants, repetitive regions, or the haplotype phase between distant variants. Long-read sequencing (PacBio, Oxford Nanopore Technologies) addresses this by reading much longer fragments — tens of thousands of bases per read on PacBio HiFi, hundreds of thousands on Oxford Nanopore. The technologies have higher per-base error rates than Illumina but are increasingly accurate, and they produce qualitatively different information. The 2022 telomere-to-telomere reference assembly (T2T-CHM13) was made possible by long-read sequencing, and 2024–2026 production deployments increasingly combine short- and long-read data.
RNA-seq and the transcriptome
The same sequencing technology can be used to measure RNA: reverse-transcribe RNA to cDNA, then sequence. The resulting RNA-seq data measures gene expression by counting mRNAs. Variants of the technique include single-cell RNA-seq (one cell at a time, transformative for cell-type discovery), spatial transcriptomics (preserving spatial information about where each cell sits in tissue), and single-nucleus RNA-seq (for tissues that don't dissociate cleanly). These methods have produced the major single-cell atlases of the 2020s and are central to the AI-for-biology methods of Section 14.
The data explosion
The combined effect of cheap NGS and ubiquitous variants of the technique has been an exponential increase in publicly-available sequencing data. The Sequence Read Archive (NCBI's primary repository) holds tens of petabytes of raw sequencing reads as of 2025, with growth that continues to accelerate. The European Nucleotide Archive, the DNA Data Bank of Japan, and the various specialised repositories add similar volumes. The Cancer Genome Atlas, GTEx, ENCODE, the UK Biobank, the Human Cell Atlas, and the various single-cell atlases together represent the largest body of biological data ever assembled, and they are the substrate of essentially all modern AI-for-biology work. The methodology has shifted from "I need to generate data" to "I need to find and analyse the right existing data" — an important orientation for any AI reader entering the field.
Epigenetics and Chromatin
The DNA sequence alone does not fully determine cellular behaviour. A liver cell and a neuron in the same person have the same genome, but they express dramatically different gene sets — and this difference is heritable through cell divisions. The layer of regulatory information beyond the DNA sequence is called epigenetics, and it is one of the most active research areas in biology with substantial AI-for-biology applications.
What epigenetics means
The word epigenetics literally means "above genetics" — heritable changes in gene expression that don't involve changes to the DNA sequence itself. The main epigenetic mechanisms are: DNA methylation (chemical modification of the DNA itself, typically at CpG sites); histone modifications (chemical marks on the histone proteins that DNA wraps around); and chromatin accessibility (whether a region of the genome is open and accessible to transcription factors or compacted and silent). These mechanisms collectively encode the cell-type-specific regulatory state.
DNA methylation
The most-studied epigenetic mark is methylation of cytosine residues, particularly at CpG dinucleotides. In mammals, methylation tends to repress gene expression; CpG-rich regions called CpG islands often sit at gene promoters and are unmethylated in expressed genes. Bisulfite sequencing measures methylation across the genome and produces methylation maps for thousands of cell types and conditions. The 2020s wave of AI methods has produced foundation models for predicting methylation from sequence (the various 2024 successors to early CpG-prediction methods), with applications in cancer biology, ageing, and developmental biology.
Histone modifications
Histone proteins (the spool DNA wraps around) carry various chemical modifications at specific residues on their N-terminal tails — and these modifications turn out to be a substantial regulatory layer in their own right. The chemistry connects directly to Ch 02: methylation (addition of one, two, or three methyl groups to a lysine or arginine side chain — chemically inert, mostly affects which other proteins can bind), acetylation (addition of an acetyl group to a lysine — neutralises the lysine's positive charge, weakens DNA binding, opens chromatin), phosphorylation (addition of a phosphate to serine, threonine, or tyrosine — adds negative charge, often signal-dependent), and ubiquitination (addition of a small ubiquitin protein, often as a marker for downstream regulation).
Different modification patterns correspond to specific functional states by recruiting "reader" proteins (which bind specific marks) that then drive downstream effects — opening or closing chromatin, recruiting transcription machinery, or attracting still other regulators. The most-studied marks worth knowing by name: H3K4me3 (trimethylation of lysine 4 on histone H3) marks active promoters; H3K27me3 marks Polycomb-repressed regions; H3K9me3 marks heterochromatin (constitutive silencing); H3K36me3 marks active gene bodies; H3K9ac and H3K27ac mark active transcription and active enhancers respectively. The combinations form what biologists call the histone code: distinct combinations correlate with distinct chromatin states (active promoter, active enhancer, polycomb-repressed, heterochromatin, transcribed gene body). ChIP-seq (chromatin immunoprecipitation followed by sequencing) measures modifications genome-wide by using antibodies specific to each mark; the ENCODE and Roadmap Epigenomics projects together produced reference histone-modification maps across hundreds of human cell types. AI-for-biology methods (Section 11) routinely consume ChIP-seq tracks as both training labels and input features.
Chromatin accessibility
Whether a region of the genome is "open" (accessible to transcription factors) or "closed" (compacted into nucleosomes and inaccessible) is a fundamental regulatory layer. ATAC-seq (assay for transposase-accessible chromatin using sequencing) measures accessibility genome-wide. DNase-seq serves a similar purpose with a different enzymatic mechanism. The resulting accessibility maps reveal which regulatory regions are active in which cell types, and they form the substrate of much modern regulatory biology and many AI applications.
Imprinting and X-inactivation
Two specific epigenetic phenomena are worth flagging. Genomic imprinting is the parent-of-origin-specific silencing of certain genes — about 100 genes in humans are imprinted, with the maternal or paternal copy silenced depending on the gene. Loss of imprinting causes specific developmental syndromes (Prader-Willi, Angelman, Beckwith-Wiedemann). X-chromosome inactivation is the random silencing of one X chromosome in each cell of female mammals, equalising X-linked gene dosage between males and females. Both mechanisms are mechanistically epigenetic and have substantial implications for disease genetics and AI methods that treat the genome as if it were uniform.
Epigenetic clocks and ageing
A specific application worth flagging is the epigenetic clock (Horvath 2013, refined extensively since). DNA methylation patterns at specific CpG sites change predictably with age, and a regression model trained on methylation data predicts chronological age within a few years. More importantly, the deviation between predicted and actual age (the "biological age") correlates with health outcomes and has become a standard biomarker in ageing research. The methodology has obvious AI-for-biology extensions, and the 2024–2026 wave of "biological-age prediction" tools has substantially refined the original Horvath clock.
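The original clock is methodologically simple: elastic-net regression from CpG methylation beta values (fraction methylated, between 0 and 1) to chronological age, with the penalty selecting a sparse set of informative sites (353 CpGs in Horvath's 2013 model). A sketch of the recipe on synthetic data — the in-sample fit below only shows the shape of the pipeline, not an evaluation:

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Epigenetic-clock recipe in miniature: methylation beta values -> sparse
# linear model -> predicted age. All data here are synthetic.
rng = np.random.default_rng(0)
n_samples, n_cpgs = 200, 500
ages = rng.uniform(20, 80, n_samples)

# Simulate a few age-associated CpG sites plus many uninformative ones.
true_weights = np.zeros(n_cpgs)
true_weights[:10] = rng.normal(0.0, 0.005, 10)
betas = np.clip(0.5 + np.outer(ages, true_weights)
                + rng.normal(0.0, 0.02, (n_samples, n_cpgs)), 0.0, 1.0)

clock = ElasticNetCV(l1_ratio=0.5, cv=5).fit(betas, ages)
predicted_age = clock.predict(betas)
print(f"{(clock.coef_ != 0).sum()} CpGs selected; "
      f"in-sample mean absolute error {np.mean(np.abs(predicted_age - ages)):.1f} years")
```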
CRISPR and Modern Genome Engineering
For most of biology's history, scientists could read genomes but not write them. The 2012 development of CRISPR-Cas9 as a programmable genome-editing tool changed this — biologists can now make targeted changes to genomes at specific locations, in essentially any organism, with relative ease. The implications for biology, medicine, and biotechnology have been profound, and the methodology continues to evolve rapidly.
The CRISPR system
CRISPR (Clustered Regularly Interspaced Short Palindromic Repeats) was discovered in bacteria and archaea, where it serves as an adaptive immune system against viruses. Bacteria capture short pieces of viral DNA, store them in CRISPR arrays, transcribe them into guide RNAs, and use these to direct Cas (CRISPR-associated) nucleases to cut matching viral DNA on subsequent infection. The 2012 Doudna-Charpentier paper — followed in 2013 by the Zhang and Church labs' demonstrations in mammalian cells — showed that the system could be programmed: by designing a custom guide RNA, scientists could target Cas9 to any specific site in any genome and make a double-stranded break, which the cell would then repair (often imperfectly, producing edits). The 2020 Nobel Prize in Chemistry recognised Doudna and Charpentier for this work.
The basic toolkit
The core CRISPR toolkit has several components. Cas9 from Streptococcus pyogenes is the original and most-studied nuclease, recognising sites with an adjacent PAM (protospacer-adjacent motif) sequence. The guide RNA (gRNA) contains a ~20-nucleotide spacer that matches the target site. The repair template is an optional DNA sequence that the cell can use as a template for repair, enabling precise edits rather than just disruption. Various Cas variants (Cas12, Cas13 for RNA editing, the various engineered Cas9 with relaxed PAM requirements) extend the toolkit.
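The pattern-matching core of guide design is easy to state: on a given strand, a candidate SpCas9 site is any 20-nucleotide protospacer immediately followed by an NGG PAM. A simplified sketch (real guide-design tools also scan the reverse strand, score genome-wide off-target matches, and filter on GC content and secondary structure):

```python
import re

# Candidate SpCas9 target sites on one strand: 20-nt protospacer + NGG PAM.
def candidate_guides(sequence: str) -> list:
    sequence = sequence.upper()
    guides = []
    # Lookahead so overlapping candidates are all reported.
    for match in re.finditer(r"(?=([ACGT]{20})([ACGT]GG))", sequence):
        guides.append({"position": match.start(1),
                       "protospacer": match.group(1),
                       "pam": match.group(2)})
    return guides

demo = "TTGACTGAATCGGATCCAGGCTTACAGGTACCTTAAGGCTAGCTAGG"
for guide in candidate_guides(demo):
    print(guide)
```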
Base editing and prime editing
Two more-recent extensions deserve specific mention. Base editing (Liu lab, 2016 onward) uses a catalytically-impaired Cas9 fused to a deaminase enzyme to convert one DNA base to another (C to T, or A to G) without making a double-stranded break. The methodology is gentler than standard CRISPR and produces precise single-base changes. Prime editing (Liu lab, 2019 onward) uses a Cas9-reverse-transcriptase fusion guided by an extended pegRNA that encodes both the target site and the desired edit — enabling small insertions, deletions, and substitutions with substantially better precision than standard CRISPR. Both are active research areas with rapidly improving capabilities.
CRISPR therapeutics
The first FDA-approved CRISPR therapy was Casgevy (Vertex, 2023), for sickle-cell disease and beta-thalassemia. The therapy edits patients' bone marrow stem cells ex vivo to reactivate fetal haemoglobin, then re-infuses the edited cells. Subsequent CRISPR therapies in development target inherited blindness (Editas), familial hypercholesterolaemia (Verve), various cancers, and other conditions. The 2024–2026 pipeline includes both ex vivo therapies (edit cells outside the body, return them) and in vivo therapies (deliver CRISPR machinery into the patient's body), with the in vivo applications substantially harder due to delivery and off-target challenges.
The ethics layer
Genome editing raises substantial ethical questions, made acute by the 2018 He Jiankui case (a Chinese scientist who edited human embryos and produced live-born twins, in violation of international consensus norms). The current scientific consensus draws a line between somatic editing (changes to body cells, not transmitted to offspring — broadly accepted, increasingly clinical) and germline editing (changes to embryos, transmitted to offspring — broadly proscribed). The line is fragile and contested, and the policy debates continue. The 2018 international consensus statement, the 2020 NASEM-Royal Society heritable-genome-editing report, and the various subsequent expert commissions provide the current institutional framework, but the technology continues to outpace the policy.
Synthetic biology and beyond
Beyond editing existing genomes, the broader field of synthetic biology aims to design and build genetic circuits, organisms, and systems from scratch. The 2010 Venter Institute synthesis of an entire bacterial genome (Mycoplasma mycoides JCVI-syn1.0) was a landmark; the 2024 wave of "genome design" projects and the various AI-assisted design platforms (the Microsoft and DeepMind biology efforts, the various 2024 startup deployments) extend this trajectory. AI methods for biology are increasingly the design layer for synthetic biology, with the protein-design and molecular-design chapters that follow this one developing the methodology in detail.
From Biology to ML: An Orientation
The previous nine sections established the biology. This one is the bridge to the methodology that follows. Most ML practitioners come to biology assuming the methods will transfer cleanly — it's just sequences, just classification, just regression. The methods do transfer, but the discipline of doing AI for biology well requires confronting several specific properties that distinguish it from mainstream ML domains: a particular shape of public-data substrate, a productive but imperfect biology-as-language analogy, an unusually hard evaluation problem, a wide gap between benchmark and clinical utility, an interpretability premium the field cannot relax, and the methodological aftershocks of the AlphaFold result. This section orients the ML practitioner; Sections 11–19 develop the methods within that frame.
Data scale and the public-data substrate
Biology has more freely-available labelled data than almost any other ML domain. The Sequence Read Archive (NCBI) holds tens of petabytes of sequencing reads. UniProt holds 250+ million protein sequences. The Cancer Genome Atlas (TCGA), the UK Biobank, the All of Us Research Program, the Tabula Sapiens single-cell atlas, the GTEx tissue-expression atlas — together they represent the largest body of biological data ever assembled, all freely accessible to researchers. The methodology of AI for biology has been fundamentally shaped by this substrate: most successful methods build on top of standardised public datasets rather than collecting bespoke training data. The economics differs sharply from, say, manufacturing AI (Part XIV Ch 09), where data is a substantial cost centre.
Biology as language
Ch 03 developed the protein-as-language framing in detail; this chapter extends it to DNA, RNA, and single-cell expression. The masked-language-modelling toolkit transfers from natural language to biological sequences across all four modalities — protein (ESM, ESM-2/3), DNA (DNABERT, Nucleotide Transformer, Section 12), RNA (the various 2024–2026 RNA-LM efforts), and single-cell "cells as sentences of gene tokens" (scGPT, Geneformer, Section 14). The analogy is not perfect — biology has structural and 3D constraints language doesn't — but the empirical case is that the same architectural recipe (transformer + masked-LM pretraining on a large corpus) works across all four domains, with each modality requiring its own tokenisation choices and pretraining-corpus considerations.
The evaluation problem
Biological ground truth is expensive, indirect, and often noisy. A "correct" prediction of variant effect requires either thousands of dollars of laboratory experimentation per variant or large-scale screens that have been run for only a small fraction of the human variant landscape. A "correct" prediction of cell type depends on cell-type definitions that are themselves contested. A "correct" prediction of gene expression has measurement noise comparable to the predicted signal in many regimes. The methodology of AI for biology has had to develop substantial discipline around what counts as a valid evaluation, what kinds of predictions are testable, and how to compare methods fairly. Section 18 develops this in detail; the conceptual point is that biology is harder to evaluate than vision or NLP, and naive transfer of evaluation practice from those fields produces misleading results.
The deployment gap
An ML method that works well on a benchmark may or may not produce useful biology when deployed. The gap between benchmark performance and biological utility is real and well-documented. AlphaFold predictions are accurate on average but routinely mispredict specific structural features that matter for drug design. Variant-effect predictors agree on common variants but diverge on rare ones — exactly the variants where prediction is most useful. Single-cell methods produce different cell-type assignments depending on hyperparameter choices that benchmark performance doesn't capture. The methodology of effective AI-for-biology deployment requires substantial domain expertise and careful pairing with experimental validation, and Section 18 returns to this.
The interpretability premium
Biology has stronger interpretability requirements than most ML domains. A pure black-box predictor that gets right answers without explanation is less useful than a slightly worse predictor that produces interpretable mechanistic insights. Drug development pipelines need to know why a model predicts a variant is pathogenic; clinical genetics needs to provide actionable explanations to patients and families; basic biology values mechanistic understanding intrinsically. The field has developed substantial machinery for biology-specific interpretability — saliency on sequence regions, attention-pattern analysis on transformer-based models, in silico mutagenesis to probe model behaviour — and this machinery is part of the methodology rather than an afterthought.
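Of these, in silico mutagenesis is the most mechanical to describe: mutate every position of an input sequence to every alternative base, re-run the model, and record how the prediction moves. The sketch below uses a toy stand-in for the model (it simply counts occurrences of a GATA-like motif); in practice the scoring function would be a trained network's output for accessibility, binding, or expression:

```python
import numpy as np

BASES = "ACGT"

def model_score(seq: str) -> float:
    """Toy stand-in for a trained sequence model."""
    return float(seq.count("GATA"))

def in_silico_mutagenesis(seq: str) -> np.ndarray:
    """Return a 4 x L matrix of (mutant score - reference score)."""
    reference = model_score(seq)
    effects = np.zeros((4, len(seq)))
    for pos in range(len(seq)):
        for row, base in enumerate(BASES):
            if base == seq[pos]:
                continue                      # reference base: effect stays 0
            mutant = seq[:pos] + base + seq[pos + 1:]
            effects[row, pos] = model_score(mutant) - reference
    return effects

effects = in_silico_mutagenesis("TTGATATT")
# Positions inside the GATA match show negative effects when mutated away.
print(effects.min(axis=0))
```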
The AlphaFold catalyst
This chapter assumes the AlphaFold story (told in detail in Ch 03) as background. The downstream consequences for genomics-AI specifically are the AlphaFold-derived methods that operate at the variant level — AlphaMissense for missense pathogenicity (Section 13) is the prime example — and the broader pattern that protein-AI's success catalysed investment, talent, and methodological exchange across the entire AI-for-biology field. Most of this chapter operates at the genome-wide and cellular scales rather than at the per-protein structural-biology scale; where protein structure matters here, it appears as a downstream input rather than a central object.
What makes the field distinctive is not the data — biology has more freely-available data than almost any ML domain. It is the evaluation, the interpretability requirements, and the gap between benchmark performance and biological utility. The methodology of the chapter is a discipline shaped by these constraints; the headline architectures (transformers, foundation models) are familiar, but the surrounding engineering and evaluation practice differs substantially from mainstream ML.
Sequence Models for DNA and RNA
The first wave of deep learning for biology, roughly 2014–2019, focused on convolutional and recurrent models for DNA and RNA sequences. The methodology established the core problem framings — predict a functional measurement (transcription-factor binding, chromatin accessibility, gene expression) from raw nucleotide sequence — and the substantial public-benchmark culture that the rest of the field rests on.
The DeepBind era
The watershed early paper was DeepBind (Alipanahi et al. 2015): use a CNN on DNA sequence to predict transcription-factor binding. The architecture treated DNA as a 1D signal with four channels (one per base), applied convolutions to learn motif-like filters, pooled, and predicted binding. The results matched or exceeded the best classical methods (position weight matrices and their refinements) across hundreds of TF binding datasets. Basset (Kelley et al. 2016) extended the approach to chromatin-accessibility prediction; Basenji (Kelley et al. 2018) extended it further to long sequences and quantitative gene-expression prediction. The methodology was simple in retrospect (CNNs on DNA) but established the empirical case for deep learning in regulatory genomics.
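The design pattern is compact enough to show in full: one-hot DNA in, convolutional filters acting as learned motif detectors, a global max-pool asking "was the motif seen anywhere in the window?", and a small head producing a binding probability. A minimal PyTorch sketch of the pattern (not the published DeepBind architecture or hyperparameters):

```python
import torch
import torch.nn as nn

class TinyDeepBind(nn.Module):
    """One-hot DNA -> motif-like conv filters -> global max pool -> probability."""

    def __init__(self, n_filters: int = 16, filter_width: int = 11):
        super().__init__()
        self.conv = nn.Conv1d(in_channels=4, out_channels=n_filters,
                              kernel_size=filter_width)
        self.head = nn.Linear(n_filters, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 4, sequence_length), one-hot encoded DNA
        h = torch.relu(self.conv(x))       # (batch, n_filters, L - width + 1)
        h = h.max(dim=-1).values           # did each filter fire anywhere?
        return torch.sigmoid(self.head(h)).squeeze(-1)

model = TinyDeepBind()
dummy = torch.zeros(8, 4, 200)             # batch of 8 windows, 200 bp each
dummy[:, 0, :] = 1.0                       # all-A placeholder input
print(model(dummy).shape)                  # torch.Size([8])
```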
Architectural details: dilated convolutions and receptive fields
The Basset-Basenji architectural recipe deserves explicit detail because it remains influential. The input is a one-hot encoding of DNA (a 4×L matrix, L the sequence length). The first convolutional layer learns ~300–500 filters of width 19–25, intentionally sized to capture short TF-binding motifs (which average 8–15 bp); after a ReLU and max-pool, these motif activations feed into deeper layers. The deeper layers use dilated convolutions — convolutions with gaps in their receptive field — to expand effective context without exploding parameter count. With dilation rates of 1, 2, 4, 8, …, 128 across successive layers, the receptive field grows exponentially while the number of parameters per layer stays constant. Basenji's full receptive field reaches ~32 kb on a ~131 kb input window. The architectural choice was a substantial advance: enhancers can sit tens of kilobases from their target genes, and a model with a 1 kb receptive field literally cannot see them.
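The receptive-field arithmetic is worth seeing explicitly: a layer with kernel width k and dilation d adds (k − 1) × d positions of context, so doubling the dilation at each layer grows the receptive field geometrically while the per-layer parameter count stays fixed. Illustrative numbers (not the exact Basenji configuration):

```python
# Receptive-field growth with exponentially increasing dilation.
kernel_width = 3
receptive_field = 25        # say, contributed by a first motif-scanning conv layer
for dilation in [1, 2, 4, 8, 16, 32, 64, 128]:
    receptive_field += (kernel_width - 1) * dilation
    print(f"after dilation {dilation:>3}: ~{receptive_field} positions of context")
# Eight dilated layers grow ~25 positions of context to ~535; because deeper
# layers typically operate on pooled bins rather than raw bases, the span in
# base pairs is larger still.
```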
BPNet and base-resolution profile prediction
A specific architectural innovation worth flagging is BPNet (Avsec, Weilert et al. 2021), which predicts the full base-resolution read-density profile of a TF binding experiment rather than a single binding probability per region. The architecture combines dilated convolutions with a deconvolution head that produces per-position predictions. The methodology is more demanding (the model has to capture not just where binding happens but the precise shape of the profile) and substantially more interpretable: the learned filters reveal not just which TFs bind where but how their binding interacts with neighbouring factors, with directly observable cooperative and competitive patterns. BPNet's interpretability methodology — DeepLIFT contribution scores on individual nucleotides, TF-MoDISco for clustering motif instances — has become a standard toolkit for sequence-based regulatory-genomics interpretation.
Long-range context and Enformer
Regulatory elements (enhancers) can be hundreds of thousands of base pairs from the genes they control, and CNN-based models with limited receptive fields cannot capture this long-range dependency. Enformer (Avsec et al. 2021) addressed this with a transformer-based architecture, taking 196,608-base-pair input windows and predicting thousands of regulatory and expression measurements simultaneously. The paper reported substantial improvements over Basenji on expression quantitative trait locus (eQTL) effect prediction, splice-site prediction, and regulatory-element annotation. Enformer is among the most-cited regulatory-genomics ML papers of the past five years and has substantially shaped subsequent architecture choices.
Attention mechanics for genomics
Enformer's specific architecture deserves mention because it pioneered the design pattern most subsequent genomic transformers have followed. The input passes through a CNN tower (seven convolutional blocks with downsampling) to produce a sequence of ~1,500 binned representations at 128 bp per bin. These representations then feed into eleven transformer blocks with multi-head attention plus relative positional encodings whose distance basis functions are tailored to genomic scales. The output head predicts ~5,000 tracks (CAGE expression, ChIP-seq, ATAC-seq, DNase-seq) simultaneously across the central ~115 kb. The architectural lesson is that pure transformers don't work well on raw single-base DNA — the sequence is too long for attention to handle directly — but a hybrid CNN-tower-then-transformer approach captures both local motif structure and long-range regulatory grammar. The 2024 wave of genomic transformers (Borzoi, the various successors) inherits this design with refinements.
SpliceAI and splice prediction
A specific high-impact application is splice-site prediction — predicting which positions in a pre-mRNA will be used as splice donors and acceptors. SpliceAI (Jaganathan et al. 2019, Illumina) is a CNN that predicts splicing patterns with sufficient accuracy to identify cryptic splice variants causing disease. The model is now deployed in clinical genetics pipelines worldwide and is a canonical example of AI-for-biology producing genuine clinical utility. Pangolin (Zeng & Li 2022) is a successor that handles splicing across multiple species. The methodology has been a quiet success story of the field.
The empirical recipe
The accumulated 2015–2024 experience produced a relatively stable empirical recipe for sequence-based regulatory genomics. Use one-hot encoding of nucleotides; combine convolutional layers (early layers learn motif-like filters; later layers compose them) with attention or recurrence for long-range context; use multi-task learning across many regulatory measurements simultaneously to share representation across tasks; train on standard benchmarks (ENCODE, FANTOM, Roadmap Epigenomics) with held-out chromosomes for evaluation. The recipe has been remarkably stable, and the 2024 wave of foundation models (Section 12) builds on this foundation rather than replacing it.
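The one-hot encoding step of this recipe is simple enough to show directly; a minimal sketch follows (mapping ambiguous bases to all-zero columns, one common convention among several).

```python
# A minimal sketch of the standard 4 x L one-hot encoding used by
# sequence-to-function models; ambiguous bases ("N") map to all-zeros here.
import numpy as np

def one_hot(seq: str) -> np.ndarray:
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        if base in lookup:
            x[lookup[base], i] = 1.0
    return x

print(one_hot("ACGTN"))
# [[1. 0. 0. 0. 0.]
#  [0. 1. 0. 0. 0.]
#  [0. 0. 1. 0. 0.]
#  [0. 0. 0. 1. 0.]]
```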
Limitations
Sequence-based models have well-documented limitations. They model the canonical sequence, not the actual cellular context (which depends on chromatin state, cell type, and developmental stage). They cannot easily distinguish causal from correlational features (that a sequence feature co-occurs with a regulatory signal does not establish that the feature is functional). Their predictions on rare variants — exactly the variants of clinical interest — are often less reliable than predictions on common variants where the training data is dense. The methodology has substantial machinery for working around these limitations, but the fundamental challenge that sequence is necessary but not sufficient for cellular behaviour remains.
Genomic Foundation Models
The 2022–2026 wave of biology AI has been substantially shaped by the foundation-model paradigm: pretrain a large transformer on biological sequences with a self-supervised objective (typically masked-language-modelling), then fine-tune for downstream tasks. The methodology has produced state-of-the-art results across many genomics tasks and is the conceptual core of much current research.
Early DNA language models
The foundational paper was DNABERT (Ji et al. 2021): treat DNA k-mers as tokens, pretrain a BERT-style transformer with masked-language-modelling on the human genome, fine-tune for downstream tasks. The empirical results were modestly better than CNN baselines, but the methodology was important: it showed that the language-model paradigm could transfer to DNA. Subsequent methods refined the tokenisation (single-nucleotide vs k-mer), the architecture (transformer vs Mamba/Hyena state-space models), and the training data (human only vs full tree of life), with steady improvements.
The Nucleotide Transformer
The most-cited modern DNA foundation model is the Nucleotide Transformer (Dalla-Torre et al. 2023, InstaDeep). It was pretrained on 850+ genomes from across the tree of life, with model sizes from 50M to 2.5B parameters. The largest variants matched or exceeded specialised baselines on a wide range of regulatory-genomics tasks (chromatin profile prediction, splice-site prediction, promoter prediction) without task-specific architecture engineering. The paper is the most-comprehensive empirical demonstration to date that the foundation-model paradigm transfers to DNA, and the released models are a substantial public resource.
Tokenisation choices for DNA
A specific design question deserves explicit treatment because it affects everything downstream: how to tokenise DNA. Three approaches dominate. Single-nucleotide tokenisation uses a 4-token vocabulary (A, C, G, T, plus N for ambiguous), which preserves all sequence information but produces very long token sequences (a 1 Mb genomic region becomes a 1M-token sequence). k-mer tokenisation uses overlapping k-letter substrings as tokens — DNABERT used 6-mers (a 4,096-token vocabulary), which compresses the sequence ~6× at the cost of redundancy (adjacent tokens share k-1 bases). Byte-pair encoding (BPE) learns the vocabulary from data, similarly to language models; the methodology is less common for DNA because it produces variable-length tokens that complicate position-dependent biological reasoning. The 2024–2026 consensus has shifted toward single-nucleotide tokenisation paired with efficient long-context architectures (state-space models, discussed in the next subsection), since the k-mer compression sacrifices information that turns out to matter for fine-grained variant-effect prediction.
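A small illustration of the first two schemes, on a made-up sequence, shows why the trade-off is sequence length versus vocabulary size and redundancy:

```python
# A small illustration of the two most common DNA tokenisation schemes:
# single-nucleotide tokens vs overlapping k-mer tokens (DNABERT-style 6-mers).
def single_nucleotide_tokens(seq: str) -> list[str]:
    return list(seq)

def kmer_tokens(seq: str, k: int = 6) -> list[str]:
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

seq = "ACGTACGTACGT"
print(single_nucleotide_tokens(seq))   # 12 tokens, vocabulary of 4 (+N)
print(kmer_tokens(seq))                # 7 tokens, vocabulary of 4**6 = 4096; adjacent tokens overlap by 5 bases
```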
State-space models and long context
Standard transformer attention scales O(N²) in sequence length, which limits practical context to thousands of tokens; biology routinely needs context lengths of hundreds of thousands or millions of bases. State-space models (SSMs) and closely related implicit-convolution architectures — S4, Mamba, Hyena — replace attention with convolution-like operators parameterised in continuous time, computing in O(N log N) or O(N) depending on the variant. HyenaDNA (Nguyen et al. 2023) used Hyena to extend context to one million bases at single-nucleotide resolution, with strong empirical results on long-range regulatory tasks. The architecture replaces attention with implicit long convolutions whose kernels are themselves parameterised by small MLPs operating on relative positions. The methodology trades some expressivity (no fully content-aware mixing across all positions) for asymptotic scaling that makes single-nucleotide million-base context tractable. The empirical pattern through 2026 is that SSM-based and transformer-based DNA models are roughly comparable on tasks where both can fit the context, with SSMs winning when context length is the binding constraint.
Reverse-complement equivariance, mathematically
DNA is double-stranded; the two strands carry the same information in complementary form. A model that predicts function from DNA should produce the same output whether given the forward strand or the reverse complement of that strand — a property called reverse-complement (RC) equivariance. The naive approach is data augmentation: train on both strands. The principled approach is to bake equivariance into the architecture. Caduceus (Schiff et al. 2024) achieves this with a "BiMamba" backbone where each layer's weights are explicitly constrained so that processing the reverse-complement input produces the reverse-ordered output. The methodology connects directly to the equivariant-network material of Ch 01 Section 8 — the symmetry group is the order-2 group {identity, reverse-complement}, and equivariant architectures factorise the model into RC-symmetric and RC-antisymmetric components. The empirical case is substantial: Caduceus matches or exceeds HyenaDNA's performance with similar parameter counts, and the gain comes specifically from explicit RC equivariance rather than capacity.
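For contrast with the architectural route, the naive workaround is easy to sketch: predict on both strands and average, mapping the reverse-complement prediction back to forward coordinates. The snippet below assumes a per-position predictor called `model` and is a schematic of test-time RC averaging, not the Caduceus construction.

```python
# A minimal sketch of the non-architectural workaround: average a model's
# prediction on a sequence and on its reverse complement (with the output
# track reversed), which enforces RC symmetry at test time rather than by
# construction as Caduceus does. `model` is any per-position predictor.
import numpy as np

COMP = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMP)[::-1]

def rc_averaged_prediction(model, seq: str) -> np.ndarray:
    fwd = model(seq)                                    # shape (L,) per-position scores
    rev = model(reverse_complement(seq))[::-1]          # map back to forward coordinates
    return 0.5 * (fwd + rev)
```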
Evo: generative DNA models
The 2024 release of Evo (Nguyen et al. 2024, Stanford and Arc Institute) marked a watershed: a 7B-parameter DNA foundation model trained on 2.7 million prokaryotic and bacteriophage genomes with single-nucleotide resolution and 131K-base context. Evo can both predict (variant effects, regulatory annotations) and generate (synthesise plausible new genomes, design CRISPR-Cas systems, generate functional proteins from genomic context). The model has substantial implications for both basic biology and synthetic-biology applications. Evo 2 (2025) extended this with 9.3 trillion base pairs of training data spanning the full tree of life.
The pretrain-finetune workflow
The empirical workflow for using genomic foundation models has stabilised. Take a pretrained model (Nucleotide Transformer, Evo, Caduceus, the various successors). Add a task-specific head (linear classifier, regression head, sequence-output decoder). Fine-tune on the task at hand, possibly with parameter-efficient methods (LoRA, adapters) when compute is limited. Evaluate against task-specific benchmarks. The workflow is essentially the same as in NLP, with biology-specific differences in data preparation and evaluation. The broader pattern — that genomics has fully adopted the foundation-model paradigm of the broader ML field — is among the most-significant recent developments in computational biology.
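A schematic of the fine-tuning step, with any released backbone passed in as `backbone`; the class, shapes, and pooling choice here are illustrative, not a specific library's API.

```python
# A schematic of the pretrain-finetune workflow. `backbone` is a stand-in for
# any released encoder (Nucleotide Transformer, Caduceus, ...); the names and
# shapes here are hypothetical, not a specific library's API.
import torch
import torch.nn as nn

class FineTuneClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, hidden_dim: int, n_classes: int,
                 freeze_backbone: bool = True):
        super().__init__()
        self.backbone = backbone
        if freeze_backbone:                      # linear probing; unfreeze (or use LoRA) for full fine-tuning
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.head = nn.Linear(hidden_dim, n_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h = self.backbone(tokens)               # (batch, seq_len, hidden_dim) embeddings
        return self.head(h.mean(dim=1))         # pool over positions, then classify
```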
What hasn't transferred
Despite the success, the foundation-model paradigm has limits in biology. The training-loss-to-downstream-task relationship is weaker than in NLP — DNA models with similar perplexity can perform very differently on downstream tasks. The scaling laws for biological data are less clean than for natural language, with diminishing returns appearing earlier in many cases. The cellular-context problem persists: a foundation model trained on canonical genomes does not natively know about cell-type-specific chromatin state, developmental stage, or environmental conditions. The methodology continues to evolve, and the best practice in 2026 will likely look different from the best practice in 2028.
Variant Effect Prediction
A central practical problem in genetics is variant effect prediction: given a specific genetic variant (a single-nucleotide change, an indel, a structural variant), predict its functional consequences. The methodology has substantial pre-AI history — PolyPhen-2, SIFT, CADD, polygenic risk scores — that modern AI methods build on rather than replace.
The variant-prediction problem
Roughly 5,000 variants per individual are expected to affect protein function relative to the human reference; orders of magnitude more fall in non-coding regions, where the effects are subtler. Most of these variants are benign; some cause disease; some affect drug response or trait variation. The classical clinical-genetics workflow asks: given a patient's variant of unknown significance (a "VUS" in the ACMG terminology), is it likely pathogenic? AI methods have substantially improved the answers to this question, particularly for missense variants in coding regions and for splice-affecting variants in non-coding regions.
The classical predictors
Several classical methods anchor the field. CADD (Combined Annotation Dependent Depletion, 2014) trains a logistic-regression model on the contrast between human-derived alleles (presumed neutral) and simulated mutations (presumed enriched for deleterious), using annotations across many feature types. PolyPhen-2 (2010) and SIFT (2003) score missense variants based on conservation and structural features. REVEL (2016) is an ensemble of multiple methods. These classical methods remain widely deployed in clinical and research pipelines; modern AI methods either augment them or, in some cases, substantially outperform them.
Missense-variant prediction in the genomic context
The dominant contemporary missense-variant predictor is AlphaMissense (Cheng et al. 2023). Ch 03 Section 16 develops the architecture in detail (the evoformer-derived backbone, the gnomAD population-frequency training signal, the masked-language-modelling loss producing surprise-as-pathogenicity scores). For genomics-AI purposes, the relevant points are how it integrates with the genomic and clinical pipelines that consume its output. AlphaMissense releases pre-computed scores for all 71 million possible human missense variants as a downloadable VCF-compatible annotation track, which integrates directly into standard variant-annotation toolchains (snpEff, VEP, Ensembl). Major sequencing services (Illumina TruSight, Genomics England's clinical pipeline, the various US-based clinical labs) ship AlphaMissense scores as a standard feature alongside CADD, REVEL, and the classical predictors discussed above.
The score's calibration matters substantially for clinical workflows. AlphaMissense produces probabilities, not raw scores, and the probabilities are calibrated against ClinVar pathogenic/benign distributions — which means clinical labs can use ACMG-AMP-compatible thresholds (a probability ≥0.564 typically classifies as "likely pathogenic") rather than picking arbitrary cut-offs. The methodology has substantially shifted the practical landscape for variant-of-uncertain-significance (VUS) reclassification: a 2024 analysis reported that AlphaMissense alone reclassifies ~30% of clinically-relevant VUSes with high confidence, with corresponding implications for diagnostic yield in rare-disease workups. The downstream effect on clinical genetics has been substantial enough that AlphaMissense scores are increasingly cited in published clinical interpretations and in the evidence supporting variant classifications in ClinVar.
Polygenic risk score algorithms
For common diseases, no single variant has large effect; instead, hundreds or thousands of variants each contribute a tiny amount to overall risk. Polygenic risk scores (PRS) sum these effects to produce an aggregate risk prediction. The methodology has substantial pre-AI history (LDpred, PRS-CS, the various Bayesian regression methods) that AI methods extend. Modern PRS for cardiovascular disease, breast cancer, and several psychiatric conditions are sufficiently predictive to inform clinical decisions, and the 2024 wave of LLM-and-PRS integration is a frontier for personalised medicine. The cross-population transferability of PRS — how well a score trained on European-ancestry individuals works on African- or Asian-ancestry individuals — remains an active research and equity concern.
Polygenic risk score mathematics
The basic PRS for an individual is a weighted sum: PRS = Σ_i β_i g_i, where g_i is the number of risk alleles at SNP i (0, 1, or 2) and β_i is the effect size estimated from a GWAS. Naive estimation uses GWAS-reported effect sizes directly, but this is suboptimal because nearby SNPs are correlated through linkage disequilibrium (LD) — the GWAS effect at one SNP partially reflects the true causal effect at nearby SNPs. LDpred (Vilhjálmsson et al. 2015) addresses this with a Bayesian model that uses the LD structure (estimated from a reference panel like 1000 Genomes) to shrink correlated effects toward each other. PRS-CS (Ge et al. 2019) uses a continuous shrinkage prior with a global scaling parameter, producing better-calibrated effect estimates and improved out-of-sample performance. The 2024 wave of PRS-mix methods combines multiple per-trait scores (using a meta-regression over scores from multiple GWAS), and the various ML-based methods (LDpred-funct, the BayesR variants, increasingly deep-learning approaches) extend this further. The methodology is increasingly central to precision-medicine deployment, and the algorithmic detail matters: a poorly-calibrated PRS can produce confident-but-wrong clinical predictions.
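The weighted sum itself is a one-liner; a toy numpy sketch (with illustrative genotypes and effect sizes, not real GWAS estimates) makes the indexing explicit.

```python
# A minimal PRS computation: a weighted sum of risk-allele counts. In practice
# the effect sizes come from LD-aware shrinkage (LDpred, PRS-CS) rather than
# raw GWAS estimates; the numbers below are illustrative only.
import numpy as np

genotypes = np.array([[0, 1, 2, 0],          # individuals x SNPs, risk-allele counts g_i
                      [1, 1, 0, 2]])
beta = np.array([0.12, -0.05, 0.30, 0.08])   # per-SNP effect sizes beta_i (illustrative)

prs = genotypes @ beta                       # PRS = sum_i beta_i * g_i, one score per individual
print(prs)
```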
Regulatory variants
Most disease-associated variants identified by GWAS lie in non-coding regulatory regions, not in protein-coding sequence. Predicting their effects is substantially harder than predicting missense effects because the regulatory grammar is more complex. The methodology connects to the sequence models of Section 11 — Enformer, the various foundation models — applied to evaluate the effect of a variant by comparing predictions on the variant vs. reference allele. Production deployments at major sequencing providers and national genomics programmes (Illumina, Genomics England) increasingly include AI-based regulatory-variant prediction, with the empirical accuracy still substantially below that of missense-variant prediction.
Saturation mutagenesis and deep mutational scanning
A specific evaluation methodology worth flagging: deep mutational scanning (DMS) experiments measure the functional effect of every possible single-amino-acid mutation in a protein, providing dense ground-truth data for variant-effect prediction. The MaveDB database collects such multiplexed assays of variant effect (MAVE) datasets across hundreds of proteins. AI-based variant predictors are increasingly evaluated against DMS data rather than against indirect proxies (clinical-significance annotations), with substantial improvements in evaluation rigour. The 2024–2026 wave of large-scale DMS data has been a quiet but important development.
Single-Cell Analysis
The single-cell revolution of the 2010s and 2020s — measuring gene expression, chromatin state, and other properties one cell at a time — produced the most data-rich subfield of modern biology and has been a major application area for AI methods. The methodology spans dimensionality reduction, cell-type identification, trajectory inference, and increasingly foundation models for cellular state.
The scRNA-seq workflow
The standard single-cell RNA-seq analysis workflow has stabilised over a decade of refinement. (1) Quality control — remove cells with too few or too many UMIs (unique molecular identifiers, a proxy for captured transcripts), filter mitochondrial-RNA-dominated cells (likely dying). (2) Normalisation — adjust for library-size differences. (3) Feature selection — pick highly-variable genes for downstream analysis. (4) Dimensionality reduction — principal-component analysis, then UMAP or t-SNE for visualisation. (5) Clustering — Leiden or Louvain community detection on the kNN graph. (6) Cell-type annotation — match clusters to known cell-type signatures or use ML-based assignment. (7) Differential expression and downstream biology. The Scanpy and Seurat ecosystems implement this workflow and are widely used.
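A condensed Scanpy version of steps (1)–(7) follows, assuming a placeholder input file and typical default parameters; exact function arguments vary slightly across Scanpy versions.

```python
# A condensed version of the standard workflow using the Scanpy API; argument
# values are typical defaults, not recommendations. "pbmc.h5ad" is a
# placeholder input file.
import scanpy as sc

adata = sc.read_h5ad("pbmc.h5ad")

# (1) QC: flag mitochondrial genes, compute QC metrics, filter cells
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
sc.pp.filter_cells(adata, min_genes=200)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()

# (2) normalisation and log transform
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# (3) feature selection, (4) dimensionality reduction
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
sc.pp.pca(adata, n_comps=50)

# (5) clustering on the kNN graph, then UMAP for visualisation
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.leiden(adata)
sc.tl.umap(adata)

# (6)-(7) annotation and differential expression per cluster
sc.tl.rank_genes_groups(adata, groupby="leiden")
```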
Variational methods for single-cell
The methodological backbone of modern single-cell AI is the variational autoencoder applied to single-cell data. scVI (Lopez et al. 2018) was the foundational paper: model single-cell counts with a probabilistic decoder, learn a latent embedding via amortised variational inference, use the embedding for clustering, batch correction, and downstream analysis. Subsequent methods (totalVI for CITE-seq, peakVI for ATAC-seq, MultiVI for joint multi-omics) extend this framework. The scvi-tools ecosystem is the canonical implementation and has become standard infrastructure.
scVI in detail: ZINB likelihoods and amortised inference
The scVI architecture deserves explicit detail because it is the methodological substrate for so much subsequent work. The observation model is a zero-inflated negative binomial (ZINB) distribution per gene per cell — the right choice because scRNA-seq counts are integer-valued, overdispersed (variance > mean), and contain excess zeros from technical dropout (an mRNA was present but missed by the assay). The encoder is an MLP that maps a cell's gene-expression vector to the parameters of a Gaussian variational posterior over a latent z (typically 10–30 dimensional). The decoder is another MLP that maps z to the ZINB parameters: rate (mean expression), dispersion (overdispersion parameter), and dropout probability per gene. Training maximises the evidence lower bound (ELBO): reconstruction quality of the count data minus a KL divergence between the posterior and a standard-normal prior on z. Batch correction falls out naturally: passing batch labels as conditional inputs to the decoder lets the model learn batch-specific technical effects while keeping the latent z batch-free. The architecture's elegance is that one model handles dimensionality reduction, batch correction, normalisation, differential expression, and imputation simultaneously.
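In practice most users interact with this model through the scvi-tools API rather than reimplementing it. A minimal sketch follows, assuming an AnnData object `adata` holding raw counts and a `batch` column; argument names can differ between scvi-tools releases.

```python
# A minimal sketch of the scvi-tools workflow; assumes `adata` contains raw
# counts and a "batch" column in adata.obs. Exact call signatures vary
# slightly across scvi-tools versions.
import scvi

scvi.model.SCVI.setup_anndata(adata, batch_key="batch")        # register counts + batch labels
model = scvi.model.SCVI(adata, n_latent=20, gene_likelihood="zinb")
model.train()                                                   # maximises the ELBO
adata.obsm["X_scVI"] = model.get_latent_representation()       # batch-corrected latent z
denoised = model.get_normalized_expression()                    # decoder-based denoising/imputation
```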
UMAP and the embedding question
The standard visualisation tool in single-cell biology is UMAP (Uniform Manifold Approximation and Projection, McInnes et al. 2018). The algorithm builds a fuzzy simplicial complex from the high-dimensional kNN graph of cells, then optimises a low-dimensional layout to preserve this topological structure under a cross-entropy loss. The resulting 2D embedding preserves local neighbourhood structure better than t-SNE while running ~10× faster on large datasets, which made it the de facto standard. A specific caveat worth stating: UMAP coordinates are not meaningful in the way PCA coordinates are. Distances between distant points are not interpretable, axes have no biological meaning, and the algorithm's stochastic initialisation produces visually-different layouts on rerun. Modern best practice uses UMAP for visualisation only; downstream analysis (clustering, trajectory inference) operates on the high-dimensional embedding (PCA, scVI latent z) where distances are meaningful. The 2023 wave of PaCMAP, TriMap, and other UMAP successors offers various tradeoffs between local and global structure preservation, but UMAP remains dominant for routine use.
Trajectory inference: pseudotime and RNA velocity
Many biological processes happen continuously in time, but typical single-cell experiments capture a snapshot. Trajectory inference methods order cells along a continuous developmental trajectory using their similarity in gene-expression space. Monocle 3 (Cao et al. 2019) builds a kNN graph of cells, identifies branching structure via reverse graph embedding, and assigns each cell a continuous "pseudotime" value reflecting its progress along the trajectory. PAGA (Wolf et al. 2019) operates on cell-cluster level rather than cell level, producing a coarse-grained graph of cluster-to-cluster transitions. RNA velocity (La Manno et al. 2018) takes a different approach: by separately counting unspliced (intronic) and spliced (mature) reads from the same cell, the method estimates the rate of change of mRNA abundance at the moment of measurement, which extrapolates into a directional velocity in expression space. The 2020 scVelo extension generalised the model to include cell-specific kinetic parameters; the 2022 CellRank extension treated the resulting velocity field as a Markov process for principled lineage prediction. The methodology has substantial caveats — RNA velocity assumes specific kinetic models that aren't always realistic — but the framework remains a productive way to extract dynamical information from snapshot data.
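The kinetic model underlying RNA velocity is small enough to write down and simulate; the sketch below uses the standard first-order equations (du/dt = α − βu, ds/dt = βu − γs) with illustrative rate constants.

```python
# The first-order kinetic model behind RNA velocity (La Manno et al. 2018):
#   du/dt = alpha - beta*u   (transcription minus splicing, unspliced mRNA u)
#   ds/dt = beta*u - gamma*s (splicing minus degradation, spliced mRNA s)
# A small Euler integration makes the extrapolation idea concrete; the
# parameter values are illustrative.
import numpy as np

alpha, beta, gamma = 2.0, 0.3, 0.1      # transcription, splicing, degradation rates
u, s = 0.0, 0.0
dt = 0.1
for _ in range(200):
    du = alpha - beta * u
    ds = beta * u - gamma * s
    u, s = u + dt * du, s + dt * ds

velocity = beta * u - gamma * s         # ds/dt at the observed state = "RNA velocity"
print(round(u, 2), round(s, 2), round(velocity, 3))
```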
scGPT and Geneformer: how cells become tokens
The single-cell foundation models deserve explicit architectural detail because their tokenisation choices are unusual. Geneformer (Theodoris et al. 2023) ranks genes within each cell by expression level and uses the rank as the input token — a cell becomes a sequence of gene IDs ordered by expression, truncated to ~2,048 tokens. The pretraining task is masked gene prediction: hide some genes' positions in the rank list and predict them. The tokenisation effectively treats expression as ordinal rather than continuous, which is robust to technical noise but throws away magnitude information. scGPT (Cui et al. 2024) instead pairs each gene with its expression value, producing a sequence of (gene, expression) pairs; the architecture handles both via separate gene embeddings and value embeddings that are summed. The pretraining task includes masked expression prediction and contrastive learning across cells. Both architectures use standard transformer backbones with cell-level aggregation via a special CLS-token analogue. The empirical case for these models is mixed — some benchmarks (gene network inference, perturbation prediction) show clear improvements over scVI baselines; others (basic cell-type classification on standard datasets) show comparable or worse performance. The 2024–2026 generation of single-cell foundation models continues to refine the tokenisation question, and the "right" answer is genuinely unsettled.
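A toy sketch of the Geneformer-style rank tokenisation step (schematic only; the published preprocessing also normalises each gene by its corpus-wide median expression before ranking):

```python
# A sketch of rank-value tokenisation: order a cell's expressed genes by
# expression and keep the top of the list as the token sequence. Schematic of
# the idea, not the published Geneformer preprocessing.
import numpy as np

def rank_tokens(expression: np.ndarray, gene_ids: np.ndarray, max_len: int = 2048) -> np.ndarray:
    nonzero = expression > 0
    order = np.argsort(-expression[nonzero])          # highest expression first
    return gene_ids[nonzero][order][:max_len]         # the cell becomes a sequence of gene IDs

expression = np.array([0.0, 5.2, 1.1, 0.0, 9.7])
gene_ids = np.array([101, 102, 103, 104, 105])
print(rank_tokens(expression, gene_ids))              # [105 102 103]
```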
Atlases and reference cell-type catalogues
The Human Cell Atlas (HCA, 2017–present), Tabula Sapiens (2022), Tabula Muris (2018), and the various organ-specific atlases have produced reference cell-type catalogues for human and mouse tissues. The methodology of using these as reference for query datasets — "given my new dataset, which cell types are present?" — is a standard application area, with tools like SingleR, Azimuth, and the various reference-mapping methods of scvi-tools. The methodology is increasingly important for interpretable single-cell analysis and is the substrate for many disease-comparison studies.
Spatial methods deserve their own section
Single-cell methods historically dissociate cells from their tissue context, losing information about where each cell sits in the body. The spatial-transcriptomics revolution of the 2020s has been changing this; Section 15 develops the methodology in detail.
Spatial Transcriptomics and Imaging
Most biological function depends not just on which cells are in a tissue but where they are — which cells are neighbours, which form structures, which are organised into functional niches. Spatial transcriptomics (ST) and related imaging methods preserve this information, and AI methods for spatial data are an active and rapidly-developing application area.
The major spatial platforms
Several platforms anchor the field in 2026. 10x Visium (sequencing-based, ~50 μm resolution) is the most-deployed, capturing whole-transcriptome data at ~5,000–10,000 spots per slide. Visium HD (2024 release) extends this to single-cell resolution. Slide-seq and Stereo-seq are sequencing-based alternatives with subcellular resolution. MERFISH, seqFISH+, and CosMx are imaging-based methods that detect tens to hundreds of pre-selected RNA species at subcellular resolution with sequential hybridisation. Xenium (10x's imaging platform) and Curio Bioscience's Trekker are recent commercial entries. Each has different trade-offs between coverage, resolution, and cost.
Spatial GNN architectures
The fundamental computational question for spatial data is: how does a cell's behaviour depend on its neighbours? The dominant methodology applies graph neural networks (Part XIII Ch 05) to the spatial graph, with edges encoding both physical proximity and (often) histology-image similarity. SpaGCN (Hu et al. 2021) builds a graph where nodes are spatial spots and edges connect spatially-adjacent spots; edge weights combine spatial distance with histology image similarity (features extracted from H&E or DAPI images via a CNN), producing tissue domains that respect both spatial and morphological structure. STAGATE (Dong & Zhang 2022) uses a graph attention autoencoder where attention coefficients between adjacent spots are learned from gene-expression similarity, producing spot embeddings that reflect both location and expression. GraphST (Long et al. 2023) adds contrastive learning across spatially-augmented views, and several 2024–2026 methods extend the framework with explicit cell-cell-communication priors (incorporating ligand-receptor interaction databases as soft edge weights). The spatial-biology specifics are: the choice of edge construction (k-nearest spatial neighbours, fixed-radius neighbours, or Delaunay triangulation), the inclusion of histology image features as edge weights, and the careful handling of tissue boundaries where graph connectivity should drop sharply.
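A sketch of the edge-construction step, combining spatial proximity and histology-feature similarity into Gaussian-kernel edge weights in the SpaGCN spirit; the kernel widths and mixing weight below are illustrative.

```python
# A sketch of spatial-graph construction: connect each spot to its k nearest
# spatial neighbours and weight edges by a blend of spatial distance and
# histology-feature similarity. Kernel widths and the blending weight alpha
# are illustrative, not any published method's defaults.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def spatial_graph(coords: np.ndarray, histology: np.ndarray,
                  k: int = 6, sigma_xy: float = 100.0, sigma_h: float = 1.0,
                  alpha: float = 0.5):
    knn = NearestNeighbors(n_neighbors=k + 1).fit(coords)
    dist, idx = knn.kneighbors(coords)                # position 0 is the spot itself
    edges, weights = [], []
    for i in range(coords.shape[0]):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            w_xy = np.exp(-(d / sigma_xy) ** 2)       # spatial proximity kernel
            w_h = np.exp(-np.sum((histology[i] - histology[j]) ** 2) / sigma_h ** 2)
            edges.append((i, int(j)))
            weights.append(alpha * w_xy + (1 - alpha) * w_h)
    return edges, np.array(weights)
```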
MERFISH and image-based decoding algorithms
Imaging-based spatial methods like MERFISH (Multiplexed Error-Robust Fluorescence In Situ Hybridization, Chen et al. 2015) use a different computational backbone than sequencing-based methods. The experimental design encodes each gene with a binary barcode of length N (typically 16) and uses N rounds of FISH imaging, lighting up a different subset of probes each round. After imaging, an algorithm decodes the per-pixel signal across rounds into a binary string, then matches that string to the codebook to identify the gene. The decoding algorithm uses error-correcting codes (Hamming distance ≥ 4) to handle the high fluorescence-imaging noise, and modern variants apply ML methods (CNNs for spot detection, learned decoders for codebook matching) to improve sensitivity. The output is a list of (x, y, gene) triples per cell, which then feeds into the spatial-analysis pipeline. The methodology shares concepts with classical spot-detection in microscopy but with substantial domain-specific tuning, and 2024–2026 methods include explicit deep-learning replacements for the classical decoding pipeline that improve sensitivity by 30–50%.
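The core decoding step reduces to nearest-codeword matching under a Hamming-distance budget; a toy sketch with a short illustrative codebook (real MERFISH codes are 16 bits with fixed weight) follows.

```python
# A schematic of codebook decoding: compare each pixel's binary on/off pattern
# across imaging rounds to the codebook and accept the nearest codeword if it
# is within one bit (which a minimum-Hamming-distance-4 code can correct).
# Real pipelines operate on intensity traces with per-round normalisation;
# this shows the core idea only, with a toy 4-bit codebook.
import numpy as np

def decode(pixel_bits: np.ndarray, codebook: np.ndarray, gene_names: list[str]):
    # pixel_bits: (n_rounds,) binary vector; codebook: (n_genes, n_rounds) binary matrix
    hamming = np.sum(codebook != pixel_bits, axis=1)
    best = int(np.argmin(hamming))
    return gene_names[best] if hamming[best] <= 1 else None   # reject ambiguous pixels

codebook = np.array([[1, 1, 0, 0], [0, 0, 1, 1], [1, 0, 1, 0]])
print(decode(np.array([1, 1, 0, 1]), codebook, ["GeneA", "GeneB", "GeneC"]))  # "GeneA" (1-bit error)
```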
Image-based deep learning for tissue
Beyond molecular spatial measurements, classical tissue imaging — H&E (haematoxylin and eosin) staining, immunohistochemistry, multiplexed imaging like CODEX — produces histology images that AI methods can analyse. Computational-pathology models classify tumour vs normal tissue, predict survival from histology images, and increasingly infer molecular features (gene expression, mutation status) directly from images. The 2024 wave of pathology foundation models (UNI from Harvard, Virchow from Paige, the various successors) has substantially advanced this area, with potential to transform pathology workflows.
Spatial foundation models
The 2024–2026 wave of spatial-biology methods includes early foundation models for spatial data. The methodology is harder than for plain single-cell because spatial data is 2D-structured and varies across platforms; the field is genuinely early. Notable efforts include the various 2024 spatial transformers, the 2025 spatial-omics foundation models, and the cross-modal models that combine histology images with spatial transcriptomics. Empirical performance is improving but has not yet reached the level of confidence found in protein-language models.
The integration challenge
A general theme: spatial data integrates with single-cell data, with histology images, with bulk tissue measurements, with patient-level clinical data — and these integrations are where the most-impactful biology lives. The methodology of cross-modal data integration in spatial biology connects to the multi-omics material of Section 17 and to the broader foundation-model wave. The 2024–2026 frontier is increasingly defined by methods that can pool information across these modalities at scale.
Gene Regulatory Networks and Pathways
Beyond the per-cell analysis of single-cell methods, biology cares about the regulatory structure that connects genes — which genes regulate which other genes, which pathways are active, what perturbations would do to the system. AI methods for network and pathway analysis are an active area, with the methodology spanning network inference from data, perturbation prediction, and integration with prior biological knowledge.
Gene regulatory network inference
The classical problem is gene regulatory network (GRN) inference: from gene-expression data, infer which genes regulate which others. Pre-ML methods — Gaussian graphical models, mutual-information-based methods like ARACNe, ensemble methods like GENIE3 — have been the workhorses for two decades. ML-based methods (the various deep-learning-based GRN inference papers, the 2024 single-cell GRN methods) extend the toolkit with substantial improvements in some regimes, though the empirical case is mixed. The general challenge is that gene-expression correlations are abundant but causal regulatory relationships are sparse, and distinguishing the two requires perturbation data or strong priors.
Classical GRN inference algorithms in detail
Three classical algorithms define the field's substrate. ARACNe (Margolin et al. 2006) computes pairwise mutual information between every gene pair, then uses the data processing inequality to remove indirect interactions: if I(X; Z) is less than both I(X; Y) and I(Y; Z), then X→Z is more likely indirect (mediated through Y) and is pruned. Because mutual information is symmetric, the raw output is an undirected network of putative direct interactions; directionality is imposed by restricting one end of each edge to known transcription factors. GENIE3 (Huynh-Thu et al. 2010) trains, for each target gene, a random-forest regression to predict that gene's expression from all other genes; the importance scores from the trees serve as inferred edge weights. The methodology won the DREAM5 GRN inference challenge and remains a standard baseline. SCENIC (Aibar et al. 2017) extends GENIE3 specifically for single-cell data by combining GENIE3-inferred edges with TF-binding-motif filtering: candidate target genes for a TF must contain the TF's binding motif in their promoter, which removes spurious correlations. SCENIC+ (2023) additionally integrates ATAC-seq data to constrain inferred regulons by chromatin accessibility, substantially improving precision.
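A GENIE3-flavoured sketch shows how little machinery the core idea needs: one random forest per target gene, with feature importances read off as candidate edge weights (illustrative scale only; the published method normalises importances and restricts candidate regulators to known TFs).

```python
# A GENIE3-flavoured sketch: for each target gene, fit a random forest that
# predicts its expression from all other genes and read off feature
# importances as candidate regulatory-edge weights.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def grn_edge_scores(expr: np.ndarray) -> np.ndarray:
    # expr: (n_samples_or_cells, n_genes)
    n_genes = expr.shape[1]
    scores = np.zeros((n_genes, n_genes))            # scores[i, j] = importance of gene i for target j
    for target in range(n_genes):
        predictors = np.delete(np.arange(n_genes), target)
        rf = RandomForestRegressor(n_estimators=100, random_state=0)
        rf.fit(expr[:, predictors], expr[:, target])
        scores[predictors, target] = rf.feature_importances_
    return scores

expr = np.random.default_rng(0).poisson(3.0, size=(200, 10)).astype(float)
print(grn_edge_scores(expr).shape)                   # (10, 10) candidate edge-weight matrix
```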
Perturbation prediction
A specific problem with substantial recent progress is perturbation prediction: given a baseline cellular state, predict what happens after a specific perturbation (gene knockout, drug treatment, environmental change). The methodology combines several lines of work. CPA (Compositional Perturbation Autoencoder, Lotfollahi et al. 2023) uses a VAE-based architecture trained on perturbation screens. scGPT and Geneformer include perturbation-prediction capabilities via in-silico mutation. Genome-scale Perturb-seq screens (massively parallel CRISPR perturbations with single-cell readout) have generated substantial training data. Empirical performance has improved substantially since 2022 but remains imperfect, particularly for combinatorial perturbations the model hasn't seen.
How CPA models perturbations
The Compositional Perturbation Autoencoder architecture deserves explicit detail because its central trick — disentangling perturbation effects from baseline state — recurs across many perturbation-prediction methods. The encoder maps a (cell expression, perturbation label, dose, covariates) tuple to a basal latent z_basal stripped of perturbation information via an adversarial loss (a discriminator tries to predict the perturbation from z_basal; the encoder is penalised when it succeeds). Separately, the perturbation label and dose are encoded into a perturbation-effect vector via a learned embedding. The decoder takes z_basal plus the perturbation-effect vector and reconstructs the post-perturbation expression. At inference, novel combinations are generated by adding perturbation-effect vectors that the model has learned individually — the architecture is genuinely compositional in that perturbation effects can be added linearly in latent space. The empirical case is strong for in-distribution perturbations and combinations of seen perturbations; out-of-distribution generalisation (predicting effects of perturbations the model has never seen during training) remains hard. GEARS (Roohani et al. 2024) extends this direction with graph-based representations of gene-gene relationships, improving combinatorial prediction substantially.
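The compositional step is easy to sketch in isolation; the snippet below omits the encoder and adversarial discriminator entirely and just shows perturbation embeddings being added to a basal latent before decoding (all shapes illustrative).

```python
# A schematic of the compositional step at the heart of CPA-style models:
# perturbation effects live as vectors in latent space, and novel combinations
# are composed by addition before decoding. Encoder, dose scaling, and the
# adversarial discriminator are omitted; shapes are illustrative.
import torch
import torch.nn as nn

latent_dim, n_perturbations, n_genes = 32, 50, 2000
pert_embedding = nn.Embedding(n_perturbations, latent_dim)
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(), nn.Linear(256, n_genes))

z_basal = torch.randn(1, latent_dim)                         # perturbation-free cell state
pert_ids = torch.tensor([[3, 17]])                           # an unseen combination of two seen perturbations
z_perturbed = z_basal + pert_embedding(pert_ids).sum(dim=1)  # compositional: effects add in latent space
predicted_expression = decoder(z_perturbed)                  # (1, n_genes) post-perturbation profile
```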
Pathway analysis and gene-set enrichment
A more conservative methodology operates on curated pathway databases (Reactome, KEGG, MSigDB) rather than inferring networks from scratch. Gene set enrichment analysis (GSEA, Subramanian et al. 2005) tests whether differentially-expressed genes are concentrated in known pathways. The methodology is a workhorse of computational biology and connects to AI methods through the increasing use of ML-based pathway scoring, transformer-based pathway-aware analysis, and the various foundation-model-based pathway methods of the 2024–2026 generation.
CellOracle and prior-knowledge-based methods
A specific influential method worth flagging: CellOracle (Kamimoto et al. 2023). The approach combines TF-binding-motif-derived prior networks with single-cell expression data to simulate perturbations in silico. The methodology is conservative — using structural priors rather than purely learning from data — and has empirically validated predictions of cell-fate-engineering experiments. The broader pattern of combining AI methods with prior biological knowledge (literature-derived networks, ontologies, conservation across species) has been a productive theme that the chapter has touched on throughout.
Pathway-aware foundation models
The 2024–2026 wave of biological foundation models increasingly incorporates pathway information. Models like scFoundation and the various 2025 pathway-aware single-cell models combine sequence-level information with curated pathway annotations during training. The empirical case is still developing, but the methodology represents a genuine convergence between classical pathway biology and modern foundation-model architectures, and the integration is one of the active frontiers.
Multi-Omics Integration
Modern biology produces many parallel measurement types: genomic, transcriptomic, proteomic, metabolomic, epigenomic, spatial. Each captures a different layer of cellular state. Multi-omics integration — combining these measurements into a unified representation — is one of the most active areas of biology AI as of 2026.
The integration problem
Different omics modalities have different data types, scales, noise patterns, and missing-data structures. Bulk RNA-seq is continuous and high-dimensional; ATAC-seq is sparse and binary-like; proteomics is moderately dimensional but with high noise; metabolomics has substantial coverage gaps. Naive concatenation of features fails because the modalities differ qualitatively. The methodology of effective multi-omics integration handles these differences explicitly, typically through latent-variable models that learn a shared representation across modalities.
Classical methods: MOFA and friends
The classical multi-omics tool is MOFA (Multi-Omics Factor Analysis, Argelaguet et al. 2018, with successors). The method is essentially a Bayesian extension of PCA that handles multiple modalities with different distributions. MOFA and its successors (MOFA+, MEFISTO for spatial-temporal data) remain widely deployed for moderate-scale multi-omics datasets. The methodology is interpretable, well-validated, and computationally tractable; deep-learning alternatives match or exceed its performance only for larger datasets.
Joint VAE methods
The deep-learning multi-omics workhorse is the joint variational autoencoder. Methods like totalVI (CITE-seq integration), MultiVI (joint scRNA-seq + scATAC-seq), and the various 2023–2026 multi-omics VAE variants train a single model that jointly embeds multiple modalities into a shared latent space. The methodology produces strong performance on standardised benchmarks and has substantial deployment in single-cell multi-omics work.
Joint VAE architectures in detail
The joint-VAE architecture follows a consistent recipe across methods. Each modality has its own encoder (an MLP from modality-specific features to the parameters of a Gaussian variational posterior over a shared latent z) and its own decoder (mapping z back to modality-specific likelihood parameters: ZINB for RNA counts, Bernoulli for ATAC-seq peaks, negative binomial for protein counts). Critically, all encoders map to the same latent space, which forces the model to learn representations that capture shared cellular state across modalities. Training maximises a sum of per-modality reconstruction terms minus a single KL divergence on the shared posterior. MultiVI handles the case of partially-paired data — some cells measured in both modalities, others in only one — by treating missing modalities as a special token in the encoder, with the loss only computed on observed modalities. The methodology gracefully handles the practical reality that most multi-omics datasets combine cells with different measurement subsets.
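A stripped-down two-modality joint VAE makes the recipe concrete; the sketch below substitutes Gaussian (MSE) reconstruction losses for the ZINB/Bernoulli likelihoods of the real methods and pools the two modality posteriors by simple averaging, which is one of several pooling choices.

```python
# A stripped-down joint VAE in the spirit of MultiVI/totalVI: one encoder and
# decoder per modality, a shared latent z, reconstruction terms summed across
# modalities plus a single KL term. Real implementations use ZINB/Bernoulli
# likelihoods and handle partially-paired cells; Gaussian losses and mean
# pooling of the posteriors keep this sketch short.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointVAE(nn.Module):
    def __init__(self, dim_rna: int, dim_atac: int, latent: int = 20):
        super().__init__()
        self.enc_rna = nn.Sequential(nn.Linear(dim_rna, 128), nn.ReLU(), nn.Linear(128, 2 * latent))
        self.enc_atac = nn.Sequential(nn.Linear(dim_atac, 128), nn.ReLU(), nn.Linear(128, 2 * latent))
        self.dec_rna = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim_rna))
        self.dec_atac = nn.Sequential(nn.Linear(latent, 128), nn.ReLU(), nn.Linear(128, dim_atac))

    def forward(self, rna: torch.Tensor, atac: torch.Tensor):
        # average the two modality-specific posteriors into one shared posterior
        mu_r, logvar_r = self.enc_rna(rna).chunk(2, dim=-1)
        mu_a, logvar_a = self.enc_atac(atac).chunk(2, dim=-1)
        mu, logvar = (mu_r + mu_a) / 2, (logvar_r + logvar_a) / 2
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation trick
        recon = F.mse_loss(self.dec_rna(z), rna) + F.mse_loss(self.dec_atac(z), atac)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return recon + kl, z                                       # ELBO-style loss, shared embedding
```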
Optimal transport for cross-modal alignment
A different methodology worth flagging is optimal transport-based alignment, exemplified by methods like SCOT (Demetci et al. 2022) and UnionCom. The idea: even when two modalities have completely different feature spaces (RNA expression vs. chromatin accessibility) and no paired cells, if the underlying cell types are shared, the modalities should produce similar geometric structures in their respective embeddings. Optimal transport finds the assignment between cells in modality A and cells in modality B that minimises a transport cost, producing a soft alignment. The methodology connects to the broader optimal-transport literature in ML (the various Wasserstein-distance-based methods) and provides an alternative to VAE-based integration when the data is unpaired. Empirical performance is competitive with VAE methods on standard benchmarks, particularly when the modalities share substantial cellular structure.
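A sketch of the idea using the POT library: because the two feature spaces are incomparable, the coupling is computed from intra-modality distance matrices via entropic Gromov-Wasserstein (function names follow recent POT releases; the embeddings and epsilon below are illustrative).

```python
# A sketch of optimal-transport alignment of unpaired modalities in the spirit
# of SCOT: the coupling matches intra-modality distance structure
# (Gromov-Wasserstein) rather than comparing features directly.
import numpy as np
import ot

rng = np.random.default_rng(0)
emb_rna = rng.normal(size=(300, 30))      # stand-ins for per-cell embeddings of each modality
emb_atac = rng.normal(size=(250, 15))

C1 = ot.dist(emb_rna, emb_rna)            # intra-modality distance matrices
C2 = ot.dist(emb_atac, emb_atac)
C1, C2 = C1 / C1.max(), C2 / C2.max()
p = ot.unif(emb_rna.shape[0])             # uniform marginals over cells
q = ot.unif(emb_atac.shape[0])

coupling = ot.gromov.entropic_gromov_wasserstein(C1, C2, p, q, "square_loss", epsilon=5e-3)
print(coupling.shape)                     # (300, 250) soft cell-to-cell alignment
```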
Cross-modal translation
A specific application worth flagging is cross-modal translation: predict one modality from another. Predict ATAC-seq accessibility from RNA expression; predict protein abundance from RNA expression; predict spatial transcriptomics from histology images. The methodology is increasingly used as a way of imputing missing modalities or extending limited measurements to larger datasets, and the 2024–2026 wave of cross-modal models represents substantial progress over earlier methods.
The benchmark problem
Multi-omics integration evaluation is hard. There is no single gold-standard "correct" integration; different methods preserve different aspects of the data. The Open Problems in Single-Cell Analysis benchmark (Luecken et al. 2022 and subsequent updates) is the most-developed standard, comparing methods across multiple metrics (biological-conservation scores, batch-correction scores, consistency across modalities). The methodology of careful benchmark-driven evaluation is increasingly important, and the 2024 wave of methods includes substantially more rigorous evaluation than the 2020 wave.
Foundation models for multi-omics
The most ambitious multi-omics approaches of 2026 are foundation models trained jointly on multiple modalities. The scFoundation family, UCE (Universal Cell Embedding), and the various 2025 cross-modal foundation models train transformers on combined multi-omics data with self-supervised objectives. The empirical case is genuinely early — most published benchmarks are mixed — but the methodology represents a substantial extension of the foundation-model paradigm into one of biology's hardest problems.
Evaluation, Interpretability, and Validation
Section 10 introduced the evaluation problem; this section develops it. Doing AI for biology well requires substantial discipline around what counts as a valid result, how to compare methods, and how to validate predictions experimentally. The methodology has matured substantially over the past five years but remains uneven across subfields.
The evaluation hierarchy
A useful frame is to think of biological evaluation as a hierarchy. At the lowest level is in silico evaluation: hold out part of the existing data, train on the rest, measure prediction accuracy. This is fast and cheap but vulnerable to dataset-specific quirks. At the next level is cross-dataset evaluation: train on one dataset, test on a different one with the same task. This catches some overfitting but not all. At the next level is cross-organism or cross-cell-type evaluation: train on data from one biological context, test on a different one. At the highest level is experimental validation: predict, then run an experiment to test the prediction. The methodology of effective AI for biology increasingly emphasises the higher levels, and the 2024–2026 generation of papers includes substantially more experimental validation than the 2018 generation.
The interpretability toolkit
Section 10 introduced the interpretability premium; here are the working tools. In silico mutagenesis systematically perturbs every position in an input sequence and measures how predictions change, producing position-level importance scores. Attention pattern analysis in transformer-based models reveals which input positions the model attended to; the methodology is fragile (attention is not always explanation) but useful as a starting point. Saliency-based methods (integrated gradients, DeepLIFT) produce per-input importance scores. Concept activation vectors probe whether a model's intermediate representations encode specific biological concepts. None of these is perfect, but the combination provides a working toolkit.
How interpretability methods actually work
The specific algorithmic detail of the major interpretability methods is worth laying out. In silico mutagenesis (ISM) for a sequence model: for each position i, replace the base with each of the three alternatives, run the model, record the change in prediction. The result is a 4×L matrix of per-position-per-base importance scores; visualising it as a sequence logo reveals the motif structure the model learned. DeepLIFT (Shrikumar et al. 2017) computes contribution scores by comparing the model's activations on the input to its activations on a reference (typically a sequence of background nucleotides), backpropagating differences from output to input. The methodology produces gradient-like importance scores but is more stable than raw gradients and handles non-linearities (ReLU, max-pool) more gracefully. TF-MoDISco (Shrikumar et al. 2018) goes further: cluster the per-sequence ISM or DeepLIFT scores across many sequences to discover recurring motif patterns, producing a learned-from-scratch motif catalogue. The full BPNet+ISM+DeepLIFT+TF-MoDISco pipeline is the canonical interpretability stack for sequence-based regulatory models, and it has produced biological discoveries (cooperative TF interactions, cell-type-specific binding logic) that are testable experimentally.
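A short sketch of ISM for a scalar-output sequence model, with `model` as a stand-in for any trained predictor:

```python
# A sketch of in silico mutagenesis. `model` maps a 4 x L one-hot matrix to a
# scalar prediction and is a stand-in for any trained regulatory model.
import numpy as np

BASES = "ACGT"

def one_hot(seq: str) -> np.ndarray:
    x = np.zeros((4, len(seq)), dtype=np.float32)
    for i, b in enumerate(seq.upper()):
        if b in BASES:
            x[BASES.index(b), i] = 1.0
    return x

def ism_scores(model, seq: str) -> np.ndarray:
    ref = one_hot(seq)
    ref_pred = model(ref)
    scores = np.zeros((4, len(seq)))
    for i in range(len(seq)):
        for b, base in enumerate(BASES):
            if seq[i].upper() == base:
                continue                            # reference base: score stays 0
            mut = ref.copy()
            mut[:, i] = 0.0
            mut[b, i] = 1.0
            scores[b, i] = model(mut) - ref_pred    # effect of substituting this base at this position
    return scores                                   # visualise as a sequence logo / heatmap
```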
Benchmark culture
Several public benchmarks anchor empirical assessment in different subfields. The Genomic Benchmarks dataset for DNA sequence tasks. The Open Problems in Single-Cell Analysis for single-cell methods. The CAFA challenges for protein function prediction. The CASP competitions for protein structure (Ch 03). The DREAM challenges for various computational-biology problems. The methodology of careful benchmark-driven research has substantially matured the field, and 2024–2026 publications routinely include results across multiple standardised benchmarks. The challenge is that benchmarks can drift toward measuring what's easy rather than what's biologically important; the field has had to actively manage this tension.
The replication crisis in biology AI
Computational biology has had its own version of the broader scientific replication crisis. Many published methods fail to reproduce on held-out data; many "improvements" over baselines disappear when carefully re-evaluated; the gap between paper claims and practical performance is often substantial. The 2020s wave of careful re-evaluation papers (Domazet-Lošo et al., the various reanalysis efforts, the 2023 reproduction studies) has documented this and produced incremental community shift toward better evaluation practice. Releasing trained models, providing reproduction scripts, and using standardised benchmarks are increasingly the norm; the field has not arrived but is moving in the right direction.
Experimental validation
The gold standard remains experimental validation: predict, then run an experiment to test. The methodology requires substantial computational-experimental collaboration, with computational scientists generating predictions and experimental biologists running the validations. Production AI-for-biology efforts increasingly include in-house experimental validation; major academic-industry collaborations (DeepMind-EMBL, Microsoft Research-Genentech, the various biotech-AI partnerships) have similar structure. The methodology is expensive but is what distinguishes empirically-validated AI for biology from purely-theoretical method development.
The deployment-validation gap
Even for well-validated methods, the gap between research-grade evaluation and clinical deployment remains substantial. Clinical pipelines need calibrated probabilities, regulatory documentation, ongoing monitoring, and domain-specific safety checks. The methodology of bringing AI-for-biology results into clinical practice is its own discipline, with substantial overlap with the healthcare-AI material of Part XIV Ch 05. Most production deployments require significant adaptation between research method and clinical tool.
The Frontier: Virtual Cells and Beyond
The previous sections developed the established methodology of AI for biology; this final section turns to the open frontiers and the patterns that will likely shape the field over the next several years.
The virtual cell programme
The most-ambitious framing of the AI-for-biology frontier is the virtual cell: a computational model that can predict cellular behaviour comprehensively — gene expression, protein abundance, metabolism, signalling, response to perturbation, evolution over time. The 2024 announcement of the Chan Zuckerberg Initiative's "Virtual Cell" project, the various large biotech AI efforts (e.g., the Anthropic and OpenAI biology partnerships), and academic efforts at Stanford, Harvard, and elsewhere collectively represent substantial investment in this direction. The methodology genuinely doesn't exist yet — current models capture parts of cellular behaviour but not the whole — but the framing has crystallised the long-term goal of the field.
Perturbation atlases at scale
A specific data revolution underway is the construction of perturbation atlases: systematic measurement of cellular response to thousands or millions of perturbations. Perturb-seq, CRISPR-Cas9 screens, and the various combinatorial-perturbation methods produce single-cell measurements of cellular response to genetic perturbations at unprecedented scale. The 2024 release of the Perturbation Cell Atlas and similar large-scale efforts represent a substantial new training-data substrate for AI methods. The empirical case is that these atlases will enable substantially better perturbation-prediction models than current methods support.
LLMs as biology research assistants
A specific frontier worth flagging: the use of LLMs as biology research assistants. The 2024–2026 wave of biology-specific LLMs (the various biotech deployments, the Microsoft Research and Anthropic biology efforts, the academic agentic-research deployments) provide research-assistant capabilities that complement the specialised models the chapter has developed. The methodology spans literature search and synthesis, hypothesis generation, experimental design, and analysis interpretation. The empirical evidence is mixed but encouraging — LLMs as assistants outperform unaided researchers on some tasks, with substantial caveats about hallucination and the need for verification.
Evolutionary AI and synthetic biology
The reverse direction of the AI-for-biology arrow — using AI to design new biology — is an active and rapidly-progressing frontier. Generative DNA models like Evo (Section 12) can propose novel genomic sequences with predicted function. Protein-design methods (Ch 03) can generate new proteins with specified properties. Pathway-design methods can propose new metabolic engineering targets. The methodology connects to synthetic biology and has substantial commercial implications, with major biotech investment in design-build-test-learn cycles where AI handles the design step.
Equity, access, and the data commons
A meta-frontier worth naming: the equity and access implications of AI for biology. Most published genetic data is from European-ancestry individuals, and AI methods trained on this data perform worse on under-represented populations. Pharmacogenomic predictions, polygenic risk scores, and rare-variant-pathogenicity assessments all show measurable disparities. The 2024–2026 generation of methods is increasingly incorporating ancestry-aware training, fairness-evaluation procedures, and explicit attention to under-represented populations. The All of Us Research Program, the H3Africa cohort, and the various ancestry-diverse data initiatives represent partial responses; the full equity problem remains an active research and policy frontier.
What this chapter does not cover
Several adjacent areas are out of scope. Protein-structure prediction and protein design are developed in Ch 03 of Part XV (Protein Science & AI). Drug discovery and molecular design are in Ch 06 (with Ch 02 Intro to Chemistry and Ch 05 Intro to Pharmacology as prerequisites). Healthcare AI applications (clinical risk prediction, medical imaging, ICU monitoring) are in Ch 05 of Part XIV. The substantial systems-biology literature on dynamical models of cellular processes (ODE-based models, stoichiometric models) intersects this chapter but is mostly skipped. The agricultural and ecological applications of biology AI are also out of scope. This chapter focused on the genomic and cellular-biology applications most central to modern computational biology; the broader landscape of biology AI is genuinely vast.
Further reading
A combined reading list for biology, genomics, and AI for biology. The biology-foundation references — Alberts et al.'s Molecular Biology of the Cell, Lewin's Genes, Watson-Crick 1953, Jinek et al. 2012 (CRISPR-Cas9), the Human Genome Project, the ENCODE consortium, the Telomere-to-Telomere project, the Tabula Muris atlas, and Horvath 2013 (epigenetic clock) — establish the biology substrate. The AI-methodology references — Avsec et al.'s Enformer, Cheng et al.'s AlphaMissense, Lopez et al.'s scVI, the Nucleotide Transformer, plus the various sequencing and single-cell foundation-model papers — establish the methodology. The field is rapidly evolving as of 2026.
- Molecular Biology of the Cell. The standard graduate-level cell biology textbook. Comprehensive coverage of cell structure, the central dogma, gene expression, signalling, and development. Roughly 1,500 pages but written exceptionally well; the right starting reference for any AI reader entering biology seriously. The reference cell-biology textbook.
- Genes XII. The classic molecular-genetics textbook (the Lewin lineage), updated by Krebs et al. Covers DNA structure, replication, transcription, translation, gene regulation, and modern genome-engineering methods. Substantial overlap with Alberts but with a stronger genetics emphasis and more depth on regulatory mechanisms. The reference molecular-genetics textbook.
- Molecular Structure of Nucleic Acids: A Structure for Deoxyribose Nucleic Acid. The original DNA double-helix paper. One page in the original, and a quietly elegant scientific document. The natural starting reference for understanding molecular biology's intellectual foundation. The reference for the discovery of DNA structure.
- A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. The foundational CRISPR-Cas9 paper. Demonstrates that the bacterial CRISPR system can be programmed with a custom guide RNA to direct Cas9 to any target DNA sequence — the discovery that opened genome editing as a routine technology. Doudna and Charpentier received the 2020 Nobel Prize in Chemistry for this work. The reference for programmable genome editing.
- Initial sequencing and analysis of the human genome. The Human Genome Project's initial publication. The methodology, scale, and findings of the first human-genome assembly remain a foundational reference for understanding what a genome contains and how it was first read. The 2022 telomere-to-telomere assembly (T2T-CHM13) is the modern complete reference; the 2001 paper remains the historical anchor. The reference for the human genome.
- An integrated encyclopedia of DNA elements in the human genome. The ENCODE integrated-analysis paper. Establishes the empirical case for substantial functional annotation of the non-coding genome through systematic measurement of regulatory elements, transcription, and chromatin state. The substrate of much subsequent regulatory biology and many AI-for-biology applications. The reference for the regulatory-genome catalogue.
- A complete reference genome improves analysis of human genetic variation (Telomere-to-Telomere). The Telomere-to-Telomere consortium's complete human genome assembly. The first finished human reference, including the previously unsequenced repetitive regions (centromeres, ribosomal-RNA arrays, satellite repeats). Demonstrates what long-read sequencing can do that short-read cannot, and the natural reading for understanding the modern reference-genome landscape. The reference for the complete human genome.
- Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. The Tabula Muris paper, the canonical mouse single-cell atlas, and the methodological substrate for the subsequent Tabula Sapiens human equivalent. Establishes the modern single-cell atlas as a routine tool for cell-type biology. The natural reading for understanding the single-cell revolution and the data substrate of much modern AI-for-biology. The reference for single-cell atlases.
- DNA methylation age of human tissues and cell types (Horvath clock). The original epigenetic-clock paper. Establishes that DNA methylation patterns at specific CpG sites can predict chronological age across tissues with surprising accuracy, and that deviation from predicted age (biological age) correlates with disease and mortality. The substrate of substantial subsequent ageing-biology work and a canonical example of regression-based AI for biology. The reference for the epigenetic clock.
-
Predicting effects of noncoding variants with deep learning–based sequence model (DeepSEA)
One of the foundational deep-learning-for-genomics papers. Establishes the methodology of training CNNs on DNA sequence to predict regulatory features (TF binding, chromatin accessibility, histone modifications) and using the trained model to score variants. Predates Enformer by six years and is the natural starting reading for the deep-learning-for-DNA lineage. The reference for early deep-learning genomics.
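A minimal PyTorch sketch of that recipe follows: one-hot DNA goes through a 1D CNN with one sigmoid output per regulatory track, and a variant is scored by the change in predictions between the reference and alternate sequences. The layer sizes loosely follow DeepSEA's published configuration, but the model is untrained and many details (dropout, exact pooling, the fully connected layer) are omitted.

```python
# DeepSEA-style sketch: a 1D CNN maps one-hot DNA to multiple regulatory tracks,
# and a variant is scored by the change in predicted probabilities (ref vs. alt).
import torch
import torch.nn as nn

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(seq: str) -> torch.Tensor:
    x = torch.zeros(4, len(seq))
    for i, b in enumerate(seq):
        x[BASES[b], i] = 1.0
    return x

class RegulatoryCNN(nn.Module):
    def __init__(self, n_tracks: int = 919):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(4, 320, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(320, 480, kernel_size=8), nn.ReLU(), nn.MaxPool1d(4),
            nn.Conv1d(480, 960, kernel_size=8), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.head = nn.Linear(960, n_tracks)  # one sigmoid output per regulatory track

    def forward(self, x):                     # x: (batch, 4, seq_len)
        h = self.conv(x).squeeze(-1)
        return torch.sigmoid(self.head(h))

model = RegulatoryCNN()
ref = "ACGT" * 250                            # 1,000-bp reference window
alt = ref[:500] + "T" + ref[501:]             # single-nucleotide substitution at the centre
with torch.no_grad():
    p_ref = model(one_hot(ref).unsqueeze(0))
    p_alt = model(one_hot(alt).unsqueeze(0))
variant_score = (p_alt - p_ref).abs().max()   # largest predicted regulatory change
print(variant_score.item())
```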
-
Effective gene expression prediction from sequence by integrating long-range interactions (Enformer)
The Enformer paper. The most-cited modern paper on transformer-based models for regulatory genomics, with substantial empirical improvements over CNN baselines for long-range regulatory prediction. The natural reading for the sequence-models material of Section 11 and the methodological substrate for much subsequent work. The reference for transformer-based regulatory genomics.
-
The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics
The most comprehensive empirical demonstration of foundation models for DNA. Establishes that BERT-style pretraining on cross-species genomes produces representations that fine-tune well to many downstream regulatory-genomics tasks. The natural reading for understanding the foundation-model material of Section 12 and the substrate for subsequent DNA foundation models. The reference for DNA foundation models.
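The pretraining objective is the familiar masked-language-model loss applied to DNA tokens. The sketch below tokenises a sequence into non-overlapping 6-mers, masks 15% of positions, and trains a tiny transformer encoder to recover them; the model sizes are illustrative and positional embeddings are omitted, so this is an outline of the objective rather than the released architectures.

```python
# Masked-language-model pretraining sketch for DNA in the Nucleotide Transformer style:
# non-overlapping 6-mer tokens, 15% masking, a small transformer encoder, and
# cross-entropy on the masked positions only. Sizes are illustrative; no positional
# embeddings are included.
import torch
import torch.nn as nn
import itertools, random

KMER = 6
VOCAB = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=KMER))}
MASK_ID = len(VOCAB)                        # extra token id for [MASK]
VOCAB_SIZE = len(VOCAB) + 1

def tokenize(seq: str) -> list[int]:
    return [VOCAB[seq[i:i + KMER]] for i in range(0, len(seq) - KMER + 1, KMER)]

class DnaMLM(nn.Module):
    def __init__(self, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=256,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, VOCAB_SIZE)

    def forward(self, tokens):               # tokens: (batch, seq_len)
        return self.lm_head(self.encoder(self.embed(tokens)))

random.seed(0)
seq = "".join(random.choice("ACGT") for _ in range(6 * 100))   # 100 6-mer tokens
tokens = torch.tensor([tokenize(seq)])
labels = tokens.clone()

mask = torch.rand(tokens.shape) < 0.15       # mask 15% of token positions
inputs = tokens.masked_fill(mask, MASK_ID)

model = DnaMLM()
logits = model(inputs)
loss = nn.functional.cross_entropy(logits[mask], labels[mask])
print(loss.item())                            # one pretraining step would backprop this loss
```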
-
Sequence modeling and design from molecular to genome scale with Evo
The Evo paper. A 7B-parameter DNA foundation model with single-nucleotide resolution and 131K-base context, trained on 2.7 million prokaryotic and bacteriophage genomes. Demonstrates both predictive (variant effects, regulatory annotations) and generative (synthesise plausible new genomes, design CRISPR-Cas systems) capabilities. The watershed result for generative DNA modelling and a milestone in the genomic-foundation-model lineage. The reference for generative DNA models.
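The zero-shot variant-scoring recipe used by autoregressive DNA models of this kind is worth making concrete: the effect score is the log-likelihood difference between the variant and reference sequences. The sketch below uses a toy causal model (a GRU standing in for Evo's StripedHyena backbone) purely to show the scoring arithmetic, not Evo's architecture, weights, or API.

```python
# Zero-shot variant scoring with an autoregressive DNA model: the effect score is
# log P(alt) - log P(ref). The tiny causal model below is an untrained stand-in for
# a real genomic foundation model; only the scoring recipe is the point.
import torch
import torch.nn as nn
import torch.nn.functional as F

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

class TinyCausalLM(nn.Module):
    def __init__(self, d_model=64):
        super().__init__()
        self.embed = nn.Embedding(4, d_model)
        self.rnn = nn.GRU(d_model, d_model, batch_first=True)   # stand-in backbone
        self.head = nn.Linear(d_model, 4)

    def log_likelihood(self, seq: str) -> float:
        ids = torch.tensor([[BASES[b] for b in seq]])
        h, _ = self.rnn(self.embed(ids[:, :-1]))                # predict each next base
        logp = F.log_softmax(self.head(h), dim=-1)
        targets = ids[:, 1:]
        return logp.gather(-1, targets.unsqueeze(-1)).sum().item()

model = TinyCausalLM()
ref = "ATGGCGTACGTTAGCCTGA"
alt = ref[:9] + "A" + ref[10:]                # single-base substitution
with torch.no_grad():
    delta = model.log_likelihood(alt) - model.log_likelihood(ref)
# A negative delta means the variant makes the sequence less probable under the model,
# which likelihood-based evaluations read as a proxy for functional disruption.
print(delta)
```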
-
Accurate proteome-wide missense variant effect prediction with AlphaMissense
The AlphaMissense paper. Adapts AlphaFold's machinery, combined with protein-language-model training, to predict missense-variant pathogenicity for all 71 million possible human missense variants. Outperforms previous methods on benchmark sets, and the released variant-effect predictions have been incorporated into clinical pipelines globally. The natural reading for the variant-prediction material of Section 13. The reference for missense-variant prediction.
-
Predicting splicing from primary sequence with deep learning (SpliceAI)
The SpliceAI paper. A CNN that predicts splice-site usage from sequence with sufficient accuracy to identify cryptic splice variants causing disease. Now deployed in clinical-genetics pipelines worldwide and a canonical example of AI-for-biology producing genuine clinical utility. The reference for splice-variant prediction.
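The quantity reported in practice is a delta score: run the model on the reference and variant sequences and take the largest change in predicted splice-site probability near the variant. The sketch below shows that comparison with a toy stand-in predictor; `predict_splice_probs` is a hypothetical placeholder, not the released SpliceAI package.

```python
# SpliceAI-style delta score: compare per-position splice-site probabilities for the
# reference and variant sequences and take the largest change near the variant.
# `predict_splice_probs` is a toy placeholder for a trained model (SpliceAI itself is
# a deep dilated CNN over roughly 10 kb of flanking context).
import numpy as np

def predict_splice_probs(seq: str) -> np.ndarray:
    """Toy stand-in: returns (len(seq), 2) probabilities for acceptor/donor per position."""
    rng = np.random.default_rng(abs(hash(seq)) % (2**32))
    return rng.uniform(0, 1, size=(len(seq), 2))

def delta_score(ref: str, alt: str, variant_pos: int, window: int = 50) -> float:
    """Max absolute change in splice probability within +/- window of the variant."""
    p_ref = predict_splice_probs(ref)
    p_alt = predict_splice_probs(alt)
    lo = max(0, variant_pos - window)
    hi = min(len(ref), variant_pos + window + 1)
    return float(np.abs(p_alt[lo:hi] - p_ref[lo:hi]).max())

rng = np.random.default_rng(1)
ref = "".join(rng.choice(list("ACGT"), size=2000))
pos = 1000
alt = ref[:pos] + "G" + ref[pos + 1:]
print(delta_score(ref, alt, pos))   # high scores flag candidate cryptic splice variants
```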
-
Deep generative modeling for single-cell transcriptomics (scVI)
The scVI paper. The foundational variational autoencoder for single-cell RNA-seq, providing principled probabilistic modelling of single-cell counts with a learnable latent representation. The methodological backbone for the scvi-tools ecosystem and the substrate for substantial subsequent multi-omics integration work (totalVI, MultiVI, peakVI). The natural reading for the single-cell material of Section 14. The reference for variational single-cell methods.
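The core construction is a VAE whose decoder parameterises a count likelihood scaled by each cell's library size. A deliberately compressed sketch follows, using a plain negative-binomial likelihood and no batch covariates or zero inflation; it illustrates the objective rather than reproducing the scvi-tools implementation.

```python
# Minimal scVI-flavoured VAE sketch: encode raw counts to a latent z, decode to the
# mean of a negative-binomial likelihood scaled by library size. Batch covariates,
# zero inflation, and dispersion options from the real model are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CountVAE(nn.Module):
    def __init__(self, n_genes: int, n_latent: int = 10, n_hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(n_genes, n_hidden), nn.ReLU())
        self.mu = nn.Linear(n_hidden, n_latent)
        self.logvar = nn.Linear(n_hidden, n_latent)
        self.decoder = nn.Sequential(nn.Linear(n_latent, n_hidden), nn.ReLU(),
                                     nn.Linear(n_hidden, n_genes))
        self.log_theta = nn.Parameter(torch.zeros(n_genes))      # per-gene inverse dispersion

    def forward(self, counts):
        library = counts.sum(dim=1, keepdim=True)                # per-cell library size
        h = self.encoder(torch.log1p(counts))
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterisation trick
        rho = F.softmax(self.decoder(z), dim=-1)                 # expected expression fractions
        nb_mean = rho * library                                  # scale by library size
        theta = torch.exp(self.log_theta)
        # Negative-binomial log-likelihood (mean/dispersion parameterisation).
        log_nb = (torch.lgamma(counts + theta) - torch.lgamma(theta) - torch.lgamma(counts + 1)
                  + theta * torch.log(theta / (theta + nb_mean))
                  + counts * torch.log(nb_mean / (theta + nb_mean)))
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=1)
        return -(log_nb.sum(dim=1) - kl).mean()                  # negative ELBO to minimise

counts = torch.poisson(torch.rand(64, 200) * 5)                  # toy 64-cell, 200-gene batch
model = CountVAE(n_genes=200)
loss = model(counts)
loss.backward()
print(loss.item())
```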
-
Transfer learning enables predictions in network biology (Geneformer)
The Geneformer paper. Treats single-cell expression profiles as ranked gene "sentences," pretrains a transformer on tens of millions of cells, and demonstrates transfer to downstream tasks (perturbation prediction, disease-state classification, in silico knockout). One of the foundational single-cell foundation-model papers, with substantial empirical validation on cardiomyopathy biology. The natural reading for the single-cell foundation-model material of Section 14. The reference for single-cell foundation models.
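The distinctive step is the rank-value encoding that turns a cell into a token sequence: each gene's expression is normalised by that gene's typical (median) expression across the corpus, and the genes are then ordered by the normalised value. The sketch below illustrates this encoding on toy data; the corpus, medians, vocabulary, and context length are all stand-ins.

```python
# Geneformer-style rank-value encoding: a cell becomes a "sentence" of gene tokens,
# ordered by the cell's expression of each gene normalised by that gene's median
# expression across the corpus (so ubiquitously high genes do not dominate every cell).
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes, max_len = 1000, 500, 64
gene_names = [f"GENE{i}" for i in range(n_genes)]

counts = rng.poisson(lam=rng.uniform(0.5, 20, size=n_genes),
                     size=(n_cells, n_genes)).astype(float)
cpm = counts / counts.sum(axis=1, keepdims=True) * 1e4          # depth-normalise each cell

# Non-zero median per gene across the corpus (the real pipeline precomputes this dictionary).
gene_medians = np.array([np.median(col[col > 0]) if (col > 0).any() else 1.0
                         for col in cpm.T])

def to_sentence(cell_cpm: np.ndarray) -> list[str]:
    """Rank genes by median-normalised expression; return the top tokens, highest first."""
    normalised = np.where(cell_cpm > 0, cell_cpm / gene_medians, 0.0)
    order = np.argsort(-normalised)
    ranked = [gene_names[i] for i in order if normalised[i] > 0]
    return ranked[:max_len]                                      # truncate to the context length

sentence = to_sentence(cpm[0])
print(sentence[:10])   # token sequences of this kind are what the transformer is pretrained on
```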
-
Comprehensive Integration of Single-Cell Data (Seurat)
The Seurat 3 paper, establishing one of the canonical single-cell workflows. The methodology, integrating datasets across batches and modalities via mutual nearest neighbours, has been the workhorse alternative to scVI-style methods for much of the single-cell community. The natural reading for understanding the single-cell methodological landscape and the practical workflows that researchers actually use. The reference for the Seurat workflow.
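The anchoring idea is simple to state: embed both batches in a shared low-dimensional space and keep cell pairs that are each other's nearest neighbours across batches. The sketch below finds such mutual pairs after a joint PCA; Seurat itself uses CCA and anchor weighting, so this is a simplification of the anchor-finding step only.

```python
# Mutual-nearest-neighbour anchor sketch in the spirit of Seurat integration:
# embed two batches in a shared space, then keep cross-batch cell pairs that are
# each other's nearest neighbours. Downstream correction and weighting are omitted.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
batch_a = rng.normal(0.0, 1.0, size=(300, 2000))      # toy expression matrices
batch_b = rng.normal(0.3, 1.0, size=(250, 2000))      # second batch with a global shift

pcs = PCA(n_components=20).fit_transform(np.vstack([batch_a, batch_b]))
pcs_a, pcs_b = pcs[:300], pcs[300:]

k = 5
nn_in_b = NearestNeighbors(n_neighbors=k).fit(pcs_b)
nn_in_a = NearestNeighbors(n_neighbors=k).fit(pcs_a)
a_to_b = nn_in_b.kneighbors(pcs_a, return_distance=False)   # each A cell's neighbours in B
b_to_a = nn_in_a.kneighbors(pcs_b, return_distance=False)   # each B cell's neighbours in A

anchors = [(i, j) for i in range(len(pcs_a)) for j in a_to_b[i]
           if i in b_to_a[j]]                                # keep only mutual pairs
print(f"{len(anchors)} anchor pairs")   # anchors drive the downstream batch correction
```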
-
Multi-Omics Factor Analysis—a framework for unsupervised integration of multi-omics data sets (MOFA)
The MOFA paper. The most widely deployed classical multi-omics integration method, providing a principled Bayesian extension of PCA that handles multiple modalities with different distributions. Remains widely used and competitive with deep-learning alternatives for moderate-scale datasets. The natural reading for the multi-omics material of Section 17 and the canonical methodological reference for the field. The reference for classical multi-omics integration.
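The underlying model is a shared-factor decomposition: each view X_m is approximated by a common factor matrix Z (samples by factors) times a view-specific loading matrix W_m. The sketch below shows that structure with a plain alternating-least-squares fit on synthetic data; MOFA itself uses a sparse Bayesian formulation with non-Gaussian likelihoods, so this only illustrates the shared-factor idea.

```python
# Shared-factor sketch in the spirit of MOFA: two omics views of the same samples are
# decomposed as X_m ~ Z @ W_m, with Z shared across views and W_m view-specific.
# Fit here by alternating least squares, without MOFA's sparsity priors.
import numpy as np

rng = np.random.default_rng(0)
n_samples, k = 200, 5
z_true = rng.normal(size=(n_samples, k))
x_rna  = z_true @ rng.normal(size=(k, 1000)) + 0.5 * rng.normal(size=(n_samples, 1000))
x_meth = z_true @ rng.normal(size=(k, 400))  + 0.5 * rng.normal(size=(n_samples, 400))
views = [x_rna - x_rna.mean(0), x_meth - x_meth.mean(0)]      # centre each view

Z = rng.normal(size=(n_samples, k))
for _ in range(50):
    # Given Z, solve each view's loadings by least squares; then refit Z on all views jointly.
    Ws = [np.linalg.lstsq(Z, X, rcond=None)[0] for X in views]
    stacked_W = np.hstack(Ws)                                  # (k, total features)
    stacked_X = np.hstack(views)                               # (samples, total features)
    Z = np.linalg.lstsq(stacked_W.T, stacked_X.T, rcond=None)[0].T
Ws = [np.linalg.lstsq(Z, X, rcond=None)[0] for X in views]     # final loadings for current Z

# Variance explained per view indicates which factors are shared vs. view-specific.
for name, X, W in zip(["rna", "methylation"], views, Ws):
    r2 = 1 - np.sum((X - Z @ W) ** 2) / np.sum(X ** 2)
    print(f"{name}: variance explained = {r2:.2f}")
```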