AI for Mathematics, where deep learning meets formal reasoning.
Mathematics has been one of the most-watched AI-for-Science domains since 2022. AlphaProof and AlphaGeometry achieved silver-medal performance on the 2024 International Mathematical Olympiad. FunSearch (DeepMind, 2023) used large language models in a program-search loop to discover new constructions in extremal combinatorics, the first published instance of an LLM-based system producing genuinely new mathematical results. Lean's Mathlib library has crossed 1.5 million lines of formalised mathematics, providing the data substrate that ML methods consume. Symbolic regression methods (AI Feynman, PySR) routinely rediscover physics equations from data and are increasingly used for hypothesis generation. Large language models are now embedded into mathematical reasoning workflows, both as tactic generators inside interactive theorem provers and as standalone problem solvers. This chapter develops the methodology with the depth an AI professional needs to engage with the field — the architectures, training-data realities, evaluation practices, and the operational considerations that distinguish research demonstrations from deployable mathematical-reasoning systems.
Prerequisites & orientation
This chapter assumes the working machinery of modern deep learning (Part VI on transformers and CNNs), the reinforcement-learning material of Part VII (essential for §3 on AlphaProof and the various RL-for-proof-search methods), the symbolic-regression material of the Physics chapter (Ch 07 §17), the LLM material of Part IX (essential throughout, particularly for §7 on language models for mathematics, §6 on FunSearch's LLM-driven program search, and §9 on tool-use), and the search-and-tree-methods material of Part VII Ch 04 (Monte Carlo Tree Search, essential for §3). The chapter does not assume prior exposure to formal mathematics, but readers will benefit from comfort with proof-by-induction and basic logic.
Three threads run through the chapter. The first is the verifiability advantage: mathematics is the science where the correctness of a proof is mechanically decidable, since a candidate proof either type-checks in a formal system or it does not. This makes mathematics an unusually well-posed AI target: every proposed proof can be checked, every settled conjecture is settled unambiguously, and there is no labelling noise in the ground truth. The second is the combinatorial-explosion challenge: the proof-search space for non-trivial theorems is astronomically large, and the central methodological question is how to use neural-network heuristics to prune it intelligently. The third is the discovery-vs-verification distinction: AI methods for mathematics divide cleanly into proof methods (verifying or constructing proofs of given theorems) and discovery methods (generating new conjectures, finding new mathematical objects, rediscovering or refining known equations). Different methodologies dominate in each regime; the chapter is structured around this split.
Why AI for Mathematics Is Distinctive
AI for mathematics has gone from a research curiosity to a demonstrated capability between 2022 and 2025. The methodology is now mature enough that DeepMind's AlphaProof reached silver-medal performance on the 2024 International Mathematical Olympiad, FunSearch produced new mathematical constructions in extremal combinatorics, and Lean's Mathlib library — the largest formalised-mathematics corpus — is increasingly being constructed with AI assistance. This section orients the AI reader to what makes mathematics methodologically distinctive as an AI target.
The verifiability advantage
Mathematics differs from every other AI-for-Science domain in one fundamental respect: the correctness of a claimed proof is mechanically decidable. A proof either type-checks in a formal proof assistant or it does not, and there is no ambiguity in the ground truth. This is in sharp contrast to medicine (where ground truth is uncertain), climate (where forecasts are evaluated against noisy observations), or even physics (where measurement uncertainties propagate). The verifiability advantage makes mathematics an unusually well-posed AI target: search-based methods can self-verify their outputs, hallucinations are detectable in principle, and there is no labelling-noise tax on training data. The challenge is generating valid proofs, not assessing whether a proof is valid.
The combinatorial-explosion challenge
The other side of mathematical reasoning is that the proof-search space for non-trivial theorems is astronomically large. A proof in Lean with 100 tactic steps, each chosen from on the order of 100 plausible tactics, has roughly 100^100 ≈ 10^200 candidate sequences, and most of them are dead ends. The central methodological question for proof-finding AI is therefore not "can we represent mathematics?" (we can, with formal proof assistants) but "how do we use neural-network heuristics to prune the search efficiently?" Several approaches exist: premise selection (predict which lemmas from a large library are likely useful for the current goal), tactic prediction (given a current proof state, predict the next tactic), value estimation (estimate how close a proof state is to being closed), and policy networks (sample tactics weighted by learned probabilities). All four feed into Monte-Carlo-Tree-Search-style proof search.
The discovery-versus-verification split
AI methods for mathematics divide into two methodological families. Verification-mode methods (Sections 3–5) take a target theorem and try to prove it: AlphaProof, GPT-f, HOList, Magnushammer, and related work. Discovery-mode methods (Sections 6–8) generate new mathematical objects, conjectures, or equations: FunSearch, symbolic regression, the Ramanujan Machine, the DeepMind knot-theory work that produced new pure-mathematical results in 2021. The two regimes use different architectures (search-heavy vs. generative-heavy), benchmark differently (proven theorems vs. discovery quality), and have different operational frontiers. Section 9 sits at the boundary, since LLM-based mathematical reasoning is increasingly hybrid.
The downstream view
Operationally, AI-for-mathematics today looks like this: training data comes from formalised libraries (Mathlib for Lean, the Coq stdlib, Isabelle/HOL); models are trained to predict tactics, premises, or proofs given goals; at inference time, the model serves as a heuristic inside a proof-search loop (MCTS, beam search, or evolutionary search); successful proofs feed back into the training corpus (self-play in the AlphaProof style, or curated additions to Mathlib). For discovery-mode work, the operational shape is different: the LLM produces candidate functions or equations, an automated evaluator scores them against a target metric, and an evolutionary loop iterates. The remainder of this chapter develops each piece: §2 the data substrate, §3 automated theorem proving, §4 interactive theorem proving, §5 AlphaProof, §6 FunSearch, §7 symbolic regression, §8 conjecture generation, §9 LLMs for mathematics, §10 the operational frontier.
The Mathematical-AI Data Substrate
Before discussing methods, it is worth understanding the data substrate concretely: what the inputs look like, where the labels come from, what counts as ground truth, and what cross-validation means in this domain. The substrate is formalised-mathematics libraries (the largest training corpora for proof-finding work), competition problem sets (the most-cited evaluation suites), and the natural-language mathematics literature on arXiv (the substrate for LLM-based mathematical reasoning).
Formalised-mathematics libraries
The largest single training resource for mathematical AI is Lean's Mathlib library, which crossed 1.5 million lines and roughly 200,000 theorems in 2025. Mathlib is a community-curated, peer-reviewed corpus covering substantial parts of undergraduate and early-graduate mathematics: real and complex analysis, abstract algebra, topology, category theory, measure theory, and growing coverage of more specialised subfields. Each Mathlib theorem is accompanied by its proof in Lean's tactic language, providing supervised training data for tactic-prediction networks. The Coq stdlib and Isabelle/HOL's Archive of Formal Proofs are smaller but methodologically comparable corpora.
Benchmark datasets
The most-cited evaluation benchmarks include miniF2F (Zheng et al. 2022), a curated set of olympiad and undergraduate problems formalised in Lean, Isabelle, HOL Light, and Metamath; ProofNet (Azerbayev et al. 2023), undergraduate-textbook theorem-proof pairs in Lean and natural language; the Putnam dataset (William Lowell Putnam Mathematical Competition problems, formalised by various groups); the IMO set (International Mathematical Olympiad problems, the highest-difficulty target as of 2025); and the MATH and GSM8K datasets for natural-language mathematical-reasoning evaluation.
The natural-language corpus
For LLM-based mathematical reasoning (§9), training data is dominated by the arXiv mathematics archive (hundreds of thousands of mathematics papers posted since 1991), the ProofWiki and nLab structured-encyclopedia resources, mathematics textbooks (in various legal-status corpora), and competition-problem databases. Pre-training on this corpus produces models that can reason about mathematics in natural language but cannot independently verify their reasoning — that is the role of formal-proof assistants when LLM methods are combined with verification.
Discovery-mode data
For discovery-mode work, the data substrate is task-specific. OEIS (the Online Encyclopedia of Integer Sequences) is the classical reference for conjecture-generation work — its hundreds of thousands of integer sequences are the substrate for the Ramanujan Machine and related projects. The L-functions and Modular Forms Database (LMFDB) provides number-theoretic data. For symbolic regression, the data substrate is whatever physical or mathematical relationship is being modelled (physics datasets for AI Feynman, etc.).
Labels and ground truth
Ground truth in mathematical AI is unusually clean. For proof methods, a proof is correct if and only if it type-checks; this is mechanically decidable. For conjecture generation, a conjecture is correct if it is proved (eventually) or refuted by counterexample; in the meantime it can be empirically validated against many cases. For symbolic regression, the candidate equation can be evaluated against held-out data with quantitative error metrics. The label-cleanness advantage is real and methodologically distinctive — it removes a major source of noise that handicaps other AI domains.
Evaluation conventions
Evaluation in mathematical AI uses several conventions. Pass rate at fixed budget (what fraction of held-out problems can the system solve given N seconds of search per problem) is the standard for proof-finding work; the comparison must specify the budget (1 second, 1 minute, 1 hour). Pass@k (success rate when the system is allowed k attempts) is also widely used. Solve-rate by difficulty (success on miniF2F-easy vs miniF2F-hard, IMO vs Putnam) is used for capability profiling. For discovery-mode work, the evaluation is intrinsically harder — the value of a new conjecture or a new mathematical object is judged by mathematicians.
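To make the pass@k convention concrete, the following sketch implements the standard unbiased estimator popularised by Chen et al. 2021: given n sampled attempts per problem of which c verified, it estimates the probability that at least one of k attempts would succeed. The function name is ours; any evaluation harness would wrap something equivalent.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n sampled attempts with c successes:
    the probability that a random size-k subset of the attempts contains
    at least one success, i.e. 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:          # every size-k subset must contain a success
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 of 64 sampled proofs type-checked; estimate pass@8.
print(pass_at_k(n=64, c=4, k=8))
```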
Automated Theorem Proving
Automated theorem proving (ATP) is the classical task of mechanically constructing proofs of given theorems. Pre-ML ATP methods — resolution, saturation, superposition, the various calculus-of-clauses methods — date back to the 1960s and remain the workhorses of certain deployment contexts (program verification, hardware verification). Modern AI methods augment classical ATP with neural-network heuristics that guide the proof search, producing dramatic performance improvements on harder benchmarks.
Classical ATP and its limits
Classical ATP proceeds by transforming the target theorem and its negation into a clause normal form and applying inference rules (resolution, paramodulation) to derive a contradiction. The dominant systems are E, Vampire, SPASS, and Z3 (an SMT solver that handles richer logics). Classical ATP works well for first-order logic with limited theories but struggles with the higher-order logic and dependent types that modern formal mathematics uses.
Premise selection
For theorem-proving in libraries with thousands of available lemmas (Mathlib, the Mizar library), the major bottleneck is premise selection: from a corpus of tens of thousands of theorems, predict which subset is likely useful for the current goal. Earlier methods used hand-engineered symbolic features with simple learners (naive Bayes or k-NN over symbol-occurrence features, as in Sledgehammer's relevance filters); modern methods use transformer-based models trained on goal-premise relevance pairs. DeepMath (Alemi et al. 2016) was an early demonstration; subsequent work (the Magnushammer system, the various 2022–2025 follow-ups) has refined the methodology.
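A minimal sketch of the retrieval half of this idea is below, assuming goal and lemma statements have already been embedded by some encoder; the function and variable names are illustrative rather than any particular system's API, and systems like Magnushammer add a more expensive re-ranking stage on top of this kind of first-pass retrieval.

```python
import numpy as np

def select_premises(goal_embedding, premise_embeddings, premise_names, k=32):
    """Retrieve the k library lemmas whose embeddings are most similar
    to the current goal.

    goal_embedding:     (d,) vector from an encoder applied to the goal
    premise_embeddings: (N, d) matrix, one row per library lemma
    Returns the top-k premise names ranked by cosine similarity.
    """
    g = goal_embedding / np.linalg.norm(goal_embedding)
    P = premise_embeddings / np.linalg.norm(premise_embeddings, axis=1, keepdims=True)
    scores = P @ g                      # cosine similarity against every lemma
    top = np.argsort(-scores)[:k]       # indices of the k best matches
    return [premise_names[i] for i in top]
```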
Neural tactic prediction
For interactive theorem provers (Lean, Coq), the analogous task is tactic prediction: given the current proof state, predict the next tactic to apply. Tactic-prediction networks are typically transformer models trained on (goal, tactic) pairs extracted from the formalised library. GPT-f (Polu and Sutskever 2020), which generated proof steps for Metamath, was an early landmark; Magnushammer (Mikuła et al. 2023) advanced premise selection for Isabelle, and LeanCopilot (Song et al. 2024) provides tactic suggestion inside Lean. Tactic prediction interacts with proof search: predicted tactics are tried in priority order, and a proof-search procedure (best-first or MCTS) explores the resulting tree.
HOList and the benchmarking culture
The HOList benchmark (Bansal et al. 2019) was an early standard for higher-order theorem proving with deep learning. Subsequent benchmarks include miniF2F (the de-facto standard since 2022) and the various competition-problem datasets. The benchmarking culture in mathematical AI is unusually rigorous because the verification is mechanical — there is no debate about whether a proof is correct, only about whether the proof is "interesting" or whether the search budget was reasonable.
Proof-search engines
Modern proof-finding systems combine a tactic-prediction network with a search engine. Beam search over tactic sequences is the simplest. Best-first search with value-network estimates of proof-state proximity to closure is more sophisticated. Monte Carlo Tree Search (MCTS) with policy and value networks — the AlphaZero pattern — is the current state-of-the-art for hard problems and is the basis of AlphaProof. The choice of search engine substantially affects performance, and the methodological work on which engine matches which problem class is ongoing.
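The sketch below shows the simplest of these engines, best-first search driven by a learned policy and value function. The callables predict_tactics, apply_tactic, and value_estimate are placeholders for the trained networks and the prover interface, and the scoring rule (policy prior times value estimate) is one reasonable choice among several, not a description of any specific published system.

```python
import heapq

def best_first_proof_search(root_goal, predict_tactics, apply_tactic,
                            value_estimate, max_expansions=10_000):
    """Best-first proof search guided by learned heuristics.

    predict_tactics(goal)      -> list of (tactic, prior) pairs
    apply_tactic(goal, tactic) -> list of remaining subgoals, or None on failure
    value_estimate(goals)      -> heuristic score, higher = closer to closure
    """
    tie = 0                                   # tie-breaker for equal scores
    frontier = [(-value_estimate([root_goal]), tie, [root_goal], [])]
    expansions = 0
    while frontier and expansions < max_expansions:
        _, _, goals, proof = heapq.heappop(frontier)
        if not goals:                         # no open goals left: done
            return proof                      # list of (goal, tactic) steps
        goal, rest = goals[0], goals[1:]
        expansions += 1
        for tactic, prior in predict_tactics(goal):
            subgoals = apply_tactic(goal, tactic)
            if subgoals is None:              # tactic failed on this goal
                continue
            new_goals = subgoals + rest
            tie += 1
            score = prior * value_estimate(new_goals)
            heapq.heappush(frontier,
                           (-score, tie, new_goals, proof + [(goal, tactic)]))
    return None                               # budget exhausted, no proof found
```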
Production deployment
Several ATP systems are in production deployment beyond research benchmarks. Z3 (with neural extensions) is widely used for software-verification work. Sledgehammer in Isabelle/HOL is the standard "try to discharge this goal automatically" tool used by working formal-mathematicians. LeanCopilot (and successors) provide IDE-integrated tactic suggestions inside Lean. The production-deployment pattern is "AI as collaborator inside a verified-by-construction proof system" rather than "AI generates standalone proofs," because the verification step provides safety against errors.
Interactive Theorem Proving and Formalisation
Interactive theorem provers (ITPs) — Lean, Coq, Isabelle, Agda, HOL Light — are the operational substrate for modern mathematical AI. An ITP combines a small trusted kernel that checks proofs against the rules of a logical foundation with a tactic language that lets users construct proofs interactively. AI methods enter at the tactic-suggestion stage, accelerating but not replacing human formalisation work. Understanding the ITP landscape is essential for understanding what proof-finding AI is operating against.
Lean and Mathlib
Lean (Leonardo de Moura, Microsoft Research → Lean FRO) is a dependently-typed proof assistant whose latest version (Lean 4) has become the dominant ITP for new mathematical formalisation work since approximately 2022. Mathlib is the community-curated mathematics library, peer-reviewed by a network of formalisers and now exceeding 1.5 million lines. The Lean+Mathlib stack is the single largest training resource for mathematical AI and the substrate of nearly all leading 2024–2026 mathematical-AI systems including AlphaProof.
Coq, Isabelle, and the older ecosystems
Coq (a calculus-of-inductive-constructions system, 1989–present) has been the ITP of choice for software verification and substantial pure-mathematics formalisations (the Feit-Thompson theorem, the Four Colour Theorem). Isabelle/HOL (a higher-order-logic system) has the Archive of Formal Proofs and a different methodological tradition emphasising automation. HOL Light (a minimalist HOL system) was the substrate for Hales' formal proof of the Kepler Conjecture. Agda is a dependently-typed system used more for programming-language research than for mathematics. Each system has its own AI-methodology development; cross-system transfer is non-trivial.
The formalisation effort and its scale
The effort to formalise mathematics is substantial. The Liquid Tensor Experiment (a challenge posed by Scholze in 2020 and completed by the Lean community in 2022) formalised a key theorem from Scholze's condensed-mathematics programme, demonstrating that a recent research-level result could be formalised within roughly eighteen months of its announcement. The PFR project (launched by Tao in 2023) formalised the proof of the Polynomial Freiman-Ruzsa conjecture, due to Gowers, Green, Manners, and Tao, in Lean within weeks of the paper's announcement. These successes establish that contemporary research mathematics can be formalised at scale by well-organised communities, and they motivate the push for AI-assisted tooling to accelerate the process.
Tactic languages and proof structure
An ITP proof is constructed in a tactic language: a script of high-level operations (introduce a variable, apply a lemma, rewrite with an equation, induct on a structure) that the system expands into the underlying type-theoretic terms. The tactic language is what AI tactic-prediction systems learn. Lean's tactic language is particularly compact and amenable to ML modelling; Coq's is more elaborate. Isabelle's Isar language emphasises proof readability over conciseness, which has different ML-modelling implications.
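For concreteness, here is a minimal illustrative Lean 4 tactic script (written for this chapter, not taken from Mathlib). Each tactic line transforms the current goal state, and the resulting sequence of (goal state, tactic) pairs is exactly the training signal described above.

```lean
-- Swap the two halves of a conjunction.  Each tactic below acts on the
-- current goal; the kernel checks the fully elaborated term at the end.
theorem and_swap (p q : Prop) : p ∧ q → q ∧ p := by
  intro h          -- introduce the hypothesis h : p ∧ q
  constructor      -- split the goal q ∧ p into two subgoals
  · exact h.2      -- first subgoal q, from the right half of h
  · exact h.1      -- second subgoal p, from the left half of h
```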
Mathlib autoformalisation
A specific frontier is autoformalisation: translating natural-language mathematics (textbook theorems, arXiv papers) into formal Lean/Coq/Isabelle. The 2022–2025 work (Patel et al., Wu et al., Jiang et al.) has made substantial progress with LLM-based autoformalisation, though the success rate on research-level mathematics remains modest. As LLM capabilities continue to improve, autoformalisation is increasingly viable as a pipeline stage: take a natural-language paper, autoformalise into Lean drafts, have humans (or AlphaProof-style systems) close the proof gaps.
The verifiability guarantee in deployment
The crucial operational property of ITP-based AI is that the trusted kernel guarantees correctness. An AI system can be wrong about which tactics work, but it cannot cause an incorrect proof to be accepted as correct — the kernel checks every step. This makes ITP-integrated AI substantially safer than standalone LLM-based mathematical reasoning, where claimed "proofs" can contain undetected errors. The verifiability guarantee is a major reason why the leading mathematical-AI systems of 2024–2026 are ITP-integrated.
AlphaProof and AlphaGeometry
DeepMind's AlphaProof and AlphaGeometry 2, announced in July 2024, achieved silver-medal performance on the 2024 International Mathematical Olympiad, the first time an AI system reached medal level on the IMO. Together they solved 4 of 6 problems for 28 of 42 points, one point below the gold threshold. The methodology is a synthesis of reinforcement learning, MCTS proof search, and synthetic-proof generation that builds on the AlphaGo lineage. This section unpacks the methodology in detail, since it is the current capability frontier.
The AlphaProof architecture
AlphaProof tackles non-geometric IMO problems (algebra, number theory, combinatorics) using Lean as the formal-proof backend. The architecture is: (1) a policy network that proposes Lean tactics given the current proof state, (2) a value network that estimates the proof state's distance from closure, (3) MCTS over tactic sequences with policy/value-guided rollout, and (4) a self-play loop where successful proofs are added to the training corpus. The network is initialised from a Gemini-based foundation model and fine-tuned on Mathlib + AI-generated synthetic problem-proof pairs.
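DeepMind has not published AlphaProof's search internals in full; the "MCTS with policy/value-guided rollout" above refers to the AlphaZero selection pattern, whose core scoring rule is sketched below in generic form (the constant c_puct and the function name are illustrative).

```python
import math

def puct_score(mean_value, visit_count, parent_visits, prior, c_puct=1.5):
    """AlphaZero-style PUCT score for choosing which tactic edge to explore:
    the subtree's mean value (exploitation) plus a bonus that favours edges
    with a high policy prior and few visits so far (exploration)."""
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + visit_count)
    return mean_value + exploration
```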
Synthetic-proof generation
The key innovation is the scale of synthetic training data: a fine-tuned Gemini model autoformalises roughly a million natural-language problems into a much larger library of formal problem variants, and the AlphaZero-style loop attempts to prove or disprove them, adding successful proofs back into the training corpus. This self-play data substantially exceeds the size of human-curated mathematical libraries. At the 2024 IMO itself the problem statements were manually translated into Lean; at inference the policy and value networks drive the search over candidate proofs, and the self-training loop is also run at test time on self-generated variations of the contest problem until a full proof is found.
AlphaGeometry
AlphaGeometry (Trinh et al. 2024) handles IMO geometry problems using a different methodology: a neural-symbolic system that combines a transformer-based construction proposer (predicts auxiliary points, lines, and circles to add to the diagram) with a symbolic deduction engine (a fast theorem prover that handles geometric reasoning given the augmented diagram). Geometry is methodologically separate because IMO geometry has a particularly rich auxiliary-construction structure that benefits from neural proposal followed by symbolic verification. AlphaGeometry's training data was augmented by ~100M synthetic geometry problems generated via random construction.
The IMO 2024 result
The 2024 IMO had 6 problems. AlphaProof solved problems 1, 2, and 6 (two algebra problems and one number-theory problem); AlphaGeometry 2 solved problem 4, the geometry problem. Neither system solved problems 3 and 5, the two combinatorics problems. The combined score of 28 out of 42 points corresponded to silver-medal level, one point below the gold threshold. The non-geometry problems took from minutes up to three days of compute each, substantially longer in the worst case than the 4.5-hour exam sessions allowed to human contestants, but the result establishes capability at a level not previously demonstrated.
Methodological lessons
Several methodological lessons emerge from AlphaProof. First, scale matters: the synthetic-proof corpus is much larger than Mathlib alone, and the policy network is much larger than pre-2024 tactic predictors. Second, RL plus MCTS works for proof search in the same way it worked for Go and chess — the AlphaZero pattern transfers. Third, autoformalisation is a key bridge: the system handles natural-language inputs by first formalising them, which both broadens its capability and produces proofs that are mechanically verified. Fourth, geometry needs a different stack: AlphaGeometry's neural-symbolic split outperforms a pure-RL approach for IMO geometry.
What's next
The natural next step is IMO gold: solving 5 or 6 of the 6 problems, ideally within a per-problem time budget closer to the roughly 90 minutes that human contestants average per problem. Successive IMOs provide the obvious public benchmarks. Beyond the IMO, the operational frontier is research mathematics: can AlphaProof-class systems contribute to active mathematical research, formalising recent results faster than human formalisers and proposing original lemmas? The 2024–2026 work suggests that the answer is increasingly yes.
FunSearch and Program-Synthesis Discovery
FunSearch (DeepMind, 2023) was the first LLM-based system to discover new mathematical knowledge: specifically, an improved lower-bound construction for the cap set problem in extremal combinatorics and improved heuristics for online bin packing. The methodology combines large language models (proposing candidate functions) with evolutionary search (selecting promising candidates and iterating), with both verification and scoring done programmatically. FunSearch is methodologically distinct from theorem-proving AI: rather than verifying claims, it discovers new constructions.
The FunSearch architecture
FunSearch operates on tasks framed as: "find a function f(x) that maximises a target metric M(f)." The architecture is: (1) start with a seed program (a baseline function), (2) prompt an LLM to propose mutations of the program (incorporate ideas, change strategies, add structure), (3) evaluate each candidate by running it and measuring M, (4) maintain a programs database with diversity-preserving selection, (5) iterate. The LLM's role is to generate creative candidates; the evolutionary loop's role is to select and accumulate progress.
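A minimal sketch of this loop is below, under the assumption that llm_propose and evaluate are supplied by the deployment (an LLM API call and a task-specific scorer respectively). The published FunSearch system additionally uses an island-based programs database to preserve diversity, which the pure truncation selection used here does not capture.

```python
def funsearch_loop(seed_program, llm_propose, evaluate,
                   generations=1000, pool_size=50):
    """Skeleton of the FunSearch pattern: an LLM mutates programs, an
    automated evaluator scores them, and a bounded pool accumulates progress.

    llm_propose(parents) -> source text of a new candidate program
    evaluate(program)    -> numeric score, or None if the program fails
    """
    pool = [(evaluate(seed_program), seed_program)]
    for _ in range(generations):
        # Show the LLM a few of the best programs found so far as context.
        parents = [prog for _, prog in sorted(pool, reverse=True)[:3]]
        candidate = llm_propose(parents)
        score = evaluate(candidate)
        if score is None:                 # discard candidates that crash
            continue
        pool.append((score, candidate))
        pool = sorted(pool, reverse=True)[:pool_size]   # truncation selection
    return max(pool)                      # (best_score, best_program)
```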
The cap set discovery
The cap set problem asks for the largest subset of (Z/3)^n containing no three-term arithmetic progression. The asymptotic behaviour of the largest cap sets has been a long-standing open problem, with substantial work since the 1990s. FunSearch ran for several days of compute time and produced a new construction, a cap set of size 512 in dimension n=8, larger than any previously known. The construction was a programmatic specification of a set of points, generated by the LLM and validated by automatic checking. The discovery is genuine: the previous best construction had stood for years, and the new one has been independently verified.
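The automated evaluator for this task reduces to a check like the one below: in characteristic 3, three distinct points form an arithmetic progression exactly when they sum to zero coordinatewise, so a candidate set scores its size if it passes and is discarded otherwise. This is a brute-force sketch; the real evaluator is more efficient.

```python
from itertools import combinations

def is_cap_set(points, n):
    """Check whether `points` (a collection of length-n tuples with entries
    in {0, 1, 2}) is a cap set in (Z/3Z)^n, i.e. contains no three distinct
    points in arithmetic progression.  In characteristic 3, distinct points
    x, y, z are in progression exactly when x + y + z = 0 coordinatewise."""
    pts = list(set(points))                       # deduplicate
    for x, y, z in combinations(pts, 3):
        if all((x[i] + y[i] + z[i]) % 3 == 0 for i in range(n)):
            return False                          # found a 3-term progression
    return True
```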
Online bin packing improvement
FunSearch also produced new heuristics for the online bin-packing problem, a classical combinatorial optimisation problem in operations research. The methodology was the same: LLM-proposed mutation, programmatic evaluation. The new heuristics outperformed handcrafted classical heuristics on benchmark distributions. This second result demonstrated that the methodology is not specific to extremal combinatorics — it works on practical optimisation problems where the objective function is tractable to evaluate.
Why this works
FunSearch works because it combines two complementary strengths. The LLM brings candidate diversity: it can incorporate mathematical ideas from its training corpus into novel constructions, generating candidates that random mutation would not produce. The evolutionary loop brings progress accumulation: it selects, recombines, and refines candidates over many generations, building structure that no single LLM call could produce. The combination is more powerful than either alone — pure LLM generation produces ideas without iteration, pure evolutionary search produces iteration without structured proposals.
Limitations and operational concerns
FunSearch has substantial limitations as of 2026. The methodology requires an automated evaluation function: the target metric M must be computable in seconds per candidate. This restricts FunSearch to problems where verification is fast (combinatorial enumeration, optimisation objectives). Theorems whose validity requires a full proof (rather than a counterexample-style check) are out of scope. Compute requirements are nontrivial: the cap set discovery consumed days of large-scale LLM sampling. The interpretability of discovered results is variable: some FunSearch outputs are crisp and human-comprehensible, others are programmatic specifications that produce correct outputs but resist human-readable mathematical understanding.
Successors and the broader programme
Several follow-up systems extend the FunSearch methodology. PromptBreeder (Fernando et al. 2023) and the various 2024–2025 LLM-evolution papers apply similar architectures to non-mathematical domains. The methodology of LLM-driven program search is increasingly viewed as a general-purpose discovery technique — useful wherever an LLM can generate candidates and an automated evaluator can score them. The cap set discovery remains the canonical demonstration that the methodology works for genuinely new mathematical results.
Symbolic Regression and Equation Discovery
Symbolic regression searches for compact mathematical expressions that fit observed data. Unlike standard ML regression, the output is an interpretable closed-form expression — a polynomial, a transcendental, a piecewise function — rather than a neural network. Symbolic regression has been used to rediscover physics laws from data, propose new conservation laws, and increasingly serve as a hypothesis-generation tool in scientific work. The methodology connects the AI-for-mathematics agenda to the AI-for-physics work of Ch 07 §17.
The classical and ML approaches
Classical symbolic regression used genetic programming: maintain a population of candidate expression trees, mutate and recombine them, select for low fit-error and small size. The seminal system was Eureqa (Schmidt and Lipson 2009), which rediscovered Hamilton's equations and Lagrangians from physical data and was widely used in scientific applications through the 2010s. ML methods entered the field around 2020 with neural-network-augmented genetic programming and pure-neural approaches.
AI Feynman
AI Feynman (Udrescu and Tegmark 2020; AI Feynman 2.0, Udrescu et al. 2020) is a hybrid system that combines neural-network function-fitting with symbolic-regression search. The methodology is to first fit a neural network to the data, then use the neural network as an oracle to identify symmetries (translation invariance, separability, dimensional analysis) that constrain the search space, then search the constrained space with classical symbolic methods. AI Feynman recovered all 100 equations in its benchmark drawn from the Feynman Lectures on Physics, and that benchmark remains the standard for physics-equation discovery.
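One property test in this spirit can be sketched in a few lines: additive separability of a two-argument function implies an exchange identity on probe points, which is cheap to test against the neural-network fit. The function below is a toy version of that idea, not the AI Feynman codebase; the full pipeline also tests dimensional consistency, symmetries, and multiplicative separability.

```python
import numpy as np

def additively_separable(f, x_range, y_range, n_trials=200, tol=1e-3, rng=None):
    """Test whether a fitted model f(x, y) is consistent with the form
    f(x, y) = g(x) + h(y).  If it is, then for any probe points (a, b)
    and (c, d) we must have f(a,b) + f(c,d) == f(a,d) + f(c,b).
    Run this against the neural-network fit, not the noisy raw data."""
    rng = rng or np.random.default_rng(0)
    for _ in range(n_trials):
        a, c = rng.uniform(*x_range, size=2)
        b, d = rng.uniform(*y_range, size=2)
        residual = abs(f(a, b) + f(c, d) - f(a, d) - f(c, b))
        if residual > tol:
            return False        # exchange identity violated: not separable
    return True
```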
PySR and modern toolkits
PySR (Cranmer 2023) is the current state-of-the-art symbolic-regression library. It uses a hybrid genetic-programming-and-simulated-annealing search engine, a multi-objective Pareto-front formulation (trading expression accuracy against expression size), and substantial software-engineering work to make symbolic regression usable in production scientific workflows. PySR has been used in cosmology (rediscovering scaling relations from N-body simulations), neuroscience (extracting compact equations for neural dynamics), and economics. The Python library is widely used and is the recommended starting point for symbolic-regression work as of 2026.
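A typical PySR invocation looks like the sketch below, based on the documented PySRRegressor interface; argument defaults and exact output formatting may differ across versions, so treat it as orientation rather than a pinned recipe. The fitted object exposes a Pareto front of expressions trading accuracy against complexity, and the practitioner normally picks the expression at the knee of that curve rather than the single lowest-error one.

```python
import numpy as np
from pysr import PySRRegressor

# Toy data: y = 2.5 * cos(x0) + x1**2, with a little noise.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))
y = 2.5 * np.cos(X[:, 0]) + X[:, 1] ** 2 + 0.01 * rng.normal(size=500)

model = PySRRegressor(
    niterations=40,                          # search budget
    binary_operators=["+", "-", "*", "/"],
    unary_operators=["cos", "sin", "exp"],
    maxsize=20,                              # parsimony: cap expression size
)
model.fit(X, y)
print(model)   # shows the discovered expressions across the Pareto front
```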
Applications in physics
Symbolic regression has had its most-cited successes in physics. Cranmer et al. 2020 extracted symbolic formulas for dark-matter halo properties from graph networks trained on cosmological simulations. The orbital mechanics work has rediscovered Newtonian and Keplerian relations. The field-equation discovery work is more aspirational — discovering the Schrödinger equation or the Maxwell equations from raw data has been demonstrated under strong constraints but is not robustly automatic. The Physics chapter (Ch 07 §17) develops the physics-application methodology in detail.
Applications in pure mathematics
Symbolic regression in pure mathematics is more recent. The Ramanujan Machine (Raayoni et al. 2021) uses symbolic search to propose new continued-fraction representations for fundamental mathematical constants (π, e, the Riemann ζ function). Several conjectured identities produced by the Ramanujan Machine have been subsequently proved. The Ramanujan Machine is a particularly clean example of AI as a mathematical-conjecture-generation tool: the proposed identities are mechanically verifiable, and the human work is to identify which proposals are mathematically interesting.
Limitations and methodological concerns
Symbolic regression has several limitations. Search-space explosion: the space of possible expressions grows combinatorially with allowed depth and operator set; pruning heuristics are essential. Overfitting: a sufficiently expressive expression class can fit any finite dataset, so cross-validation and parsimony pressure are essential. Interpretability paradox: a "discovered" equation that is technically a fit to the data may not be the canonical equation a physicist would write; the question of which equation is the right one is partly subjective. Despite these limits, symbolic regression is increasingly a practical tool for hypothesis generation in scientific work.
Conjecture Generation and Pattern Discovery
Conjecture generation — proposing new mathematical statements that are likely true — has been one of the most striking AI-for-mathematics applications since 2021. DeepMind's collaboration with mathematicians at Oxford and Sydney produced new conjectures in knot theory and representation theory that humans subsequently proved. The methodology uses ML to identify statistical relationships in mathematical data (knot invariants, group-representation features) and surface candidate conjectures that mathematicians refine and prove.
The DeepMind knot theory work
The Davies et al. 2021 paper (published in Nature) demonstrated that ML can guide pure-mathematical research. The setup: knot invariants (algebraic invariants computed from a knot diagram) form a substantial empirical dataset; neural networks were trained to predict one invariant from another; attribution analysis on the trained networks revealed unexpected statistical relationships between invariants; mathematicians then formulated and proved theorems formalising those relationships. The result was a new theorem in low-dimensional topology — a relationship between the signature of a knot and its hyperbolic invariants.
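This workflow can be imitated with standard tooling, as in the sketch below: fit a predictor of the target invariant from the others, then ask which inputs carry the signal. The data loader and feature names are hypothetical, and permutation importance stands in for the gradient-based saliency analysis used in the paper.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Hypothetical dataset: rows are knots, columns are hyperbolic invariants
# (volume, cusp shape, short geodesics, ...), target is the signature.
# X, sigma = load_knot_invariants()   # placeholder loader, not a real API

def attribute_invariants(X, sigma, feature_names):
    """Fit a predictor of one invariant from the others, then rank which
    input features carry the predictive signal."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, sigma, test_size=0.2,
                                              random_state=0)
    model = GradientBoostingRegressor().fit(X_tr, y_tr)
    imp = permutation_importance(model, X_te, y_te, n_repeats=10,
                                 random_state=0)
    ranked = sorted(zip(feature_names, imp.importances_mean),
                    key=lambda t: -t[1])
    return model.score(X_te, y_te), ranked   # R^2 plus ranked importances
```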
Representation theory
The same paper demonstrated a parallel methodology in the representation theory of symmetric groups. ML methods identified a relationship between Kazhdan-Lusztig polynomials and the combinatorial structure of the associated Bruhat intervals, leading to a new conjecture (subsequently proved) about how the polynomials can be computed from that combinatorial data. The methodology has since been extended to other branches of mathematics — combinatorial enumeration, algebraic geometry, dynamical systems.
The Ramanujan Machine
The Ramanujan Machine (Raayoni et al. 2021) is a different methodology with the same goal: generate new conjectures about mathematical constants. It uses brute-force search over candidate continued-fraction representations of fundamental constants (π, e, the Riemann ζ at integer points), automatically validating candidates against high-precision numerical computation. Conjectured identities include several new representations of π and the Catalan constant, some of which have been subsequently proved.
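The core loop is easy to sketch: enumerate continued fractions whose partial numerators and denominators are small polynomials in n, evaluate them to high precision, and flag numerical coincidences with known constants as conjectures to be proved or refuted. The sketch below uses mpmath and linear polynomials only; the real system searches far larger families with more efficient matching algorithms.

```python
from itertools import product
from mpmath import mp, mpf, e, pi

mp.dps = 60   # work at 60 decimal digits of precision

def gcf(a, b, depth=400):
    """Evaluate the generalised continued fraction
    b(0) + a(1)/(b(1) + a(2)/(b(2) + ...)) truncated at a fixed depth."""
    val = mpf(0)
    for n in range(depth, 0, -1):
        val = mpf(a(n)) / (mpf(b(n)) + val)
    return mpf(b(0)) + val

# Brute-force search over small linear polynomials a(n) = p*n + q,
# b(n) = r*n + s, flagging any that numerically match a target constant.
targets = {"e": +e, "pi": +pi}
for p, q, r, s in product(range(-3, 4), repeat=4):
    a = lambda n, p=p, q=q: p * n + q
    b = lambda n, r=r, s=s: r * n + s
    try:
        val = gcf(a, b)
    except ZeroDivisionError:
        continue                      # degenerate fraction; skip it
    for name, target in targets.items():
        if abs(val - target) < mpf(10) ** (-40):
            print(f"conjecture: {name} ?= gcf with a(n)={p}n+{q}, b(n)={r}n+{s}")
```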
OEIS pattern mining
The Online Encyclopedia of Integer Sequences (OEIS) has been a substrate for conjecture-generation work for decades. Modern ML methods (the various 2022–2025 OEIS-mining projects) train transformers on the sequence database and surface candidate conjectures: relationships between sequences, generating-function identities, asymptotic behaviour. The methodology produces a high candidate volume; expert mathematicians filter for the genuinely interesting ones.
The structural-conjecture frontier
A more ambitious frontier is generating conjectures about mathematical structures rather than numerical relationships: which categories admit which functors, which algebraic structures are equivalent under which transformations, which classes of objects are classified by which invariants. This is harder than numerical-conjecture generation because the structural data is more sparse and less standardised. The 2024–2025 work has begun to demonstrate capability here, though the results are less mature than the knot-theory and Ramanujan-Machine demonstrations.
The mathematician-in-the-loop pattern
The operational pattern that emerges from this body of work is mathematician-in-the-loop: ML methods surface candidate conjectures; mathematicians filter for which are interesting; mathematicians (or AlphaProof-style systems) prove the interesting ones. This is methodologically distinct from autonomous proof generation; it positions AI as a research-collaboration tool rather than a replacement. As of 2026 the pattern has produced a substantial number of new mathematical results, and is increasingly mainstream in pure-mathematics research workflows.
Large Language Models for Mathematics
Large language models have become surprisingly capable mathematical reasoners since 2022. GPT-4 (2023) effectively saturated GSM8K and substantially advanced the state of the art on the harder MATH benchmark. Subsequent models (GPT-4o, Claude 3, Gemini 2.5, the various 2024–2026 successors) have pushed performance further with chain-of-thought, tool use, and self-verification. LLM-based mathematical reasoning is methodologically distinct from formal-proof systems — there is no mechanical verification — but the ergonomic accessibility makes it the dominant mathematical-AI interface for non-specialists.
Chain-of-thought and the reasoning revolution
The 2022–2023 discovery that chain-of-thought prompting (Wei et al. 2022) — instructing the model to "think step by step" — substantially improves mathematical-reasoning performance was the methodological turning point. Subsequent work (the various o1-style models from OpenAI, Claude's extended-thinking mode, and the open-source DeepSeek-R1 series of 2025) built extended reasoning into the models themselves through reasoning-focused training. As of 2026, frontier LLMs reach near-perfect performance on grade-school mathematics (GSM8K), strong performance on competition problems at the AIME and olympiad level, and competitive performance on Putnam-style problems.
Tool use and verification
The combination of LLM reasoning with tool use — calling out to calculators, computer algebra systems (SymPy, Mathematica), or proof assistants — substantially extends capability. An LLM that can invoke a CAS to symbolically integrate, factor, or simplify expressions can correctly solve problems that exceed its native arithmetic and algebraic capability. The 2024–2026 work on integrating LLMs with Lean (the various LLM-LeanCopilot projects) demonstrates that LLM-driven tactic suggestion plus mechanical verification produces a hybrid system stronger than either component alone.
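The kind of sub-task delegated to a CAS is illustrated below with SymPy; the point is that exact symbolic manipulation is done by the tool, with the LLM responsible only for deciding what to ask and interpreting the result.

```python
import sympy as sp

x = sp.symbols('x')

# Exact symbolic work a reasoning model would delegate rather than
# attempt by token prediction alone.
integral = sp.integrate(x**2 * sp.exp(-x), (x, 0, sp.oo))   # evaluates to 2
series = sp.series(sp.sin(x) / x, x, 0, 6)                  # Taylor expansion
factored = sp.factor(x**4 - 1)                               # (x-1)(x+1)(x^2+1)

print(integral, series, factored, sep="\n")
```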
Self-correction and verifier-augmented inference
A standing weakness of LLM mathematical reasoning is silent error: the model produces an incorrect derivation that looks correct, and there is no mechanical check. Several methodological responses exist. Self-consistency (Wang et al. 2022) — sample multiple chains-of-thought, take the majority answer — substantially reduces error. Verifier-augmented inference trains a separate verifier model to score chain-of-thought traces; the system samples multiple traces and accepts the highest-verifier-score one. Self-correction prompts the model to critique and refine its own initial answer. All three methodologies are in production use as of 2026.
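A minimal sketch of self-consistency decoding follows, assuming a placeholder sample_solution callable that wraps whatever LLM call and answer-extraction logic a deployment actually uses; verifier-augmented inference replaces the majority vote with an argmax over a learned verifier score.

```python
from collections import Counter

def self_consistency(problem, sample_solution, n_samples=20):
    """Self-consistency decoding (Wang et al. 2022): sample several
    chain-of-thought solutions at non-zero temperature, extract each
    final answer, and return the majority answer.

    sample_solution(problem) -> (reasoning_text, final_answer), a
    placeholder for the deployment's actual LLM call."""
    answers = []
    for _ in range(n_samples):
        _, answer = sample_solution(problem)
        if answer is not None:            # skip samples with no parsable answer
            answers.append(answer)
    if not answers:
        return None
    majority, count = Counter(answers).most_common(1)[0]
    return majority, count / len(answers)   # answer plus empirical agreement
```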
Benchmarks and the contamination problem
LLM mathematical-reasoning benchmarks include GSM8K (grade school), MATH (competition mathematics), AIME (American Invitational Mathematics Examination), Putnam, and the various olympiad sets. A persistent methodological concern is benchmark contamination: the test set may have leaked into the pre-training corpus, inflating apparent performance. The 2023–2025 work on contamination measurement (the various data-decontamination papers) has substantially clarified which benchmarks are reliable and which are compromised. Frontier evaluation increasingly uses held-out competitions (problems from competitions held after the model's training cutoff) where contamination is impossible.
The ARC-AGI and AGI-style benchmarks
Beyond standard mathematical benchmarks, the ARC-AGI challenge (Chollet 2019, with substantial AI-progress through 2024–2025) tests abstract pattern reasoning with novel visual problems. ARC has been a particular target for o1-class extended-reasoning models; the 2024–2025 progress (OpenAI o3 reaching competitive scores) marked a watershed. Whether ARC-AGI scores reflect genuine generalisation or specialised training remains disputed, but the benchmark continues to drive methodological development.
The hybrid-system frontier
The methodological frontier as of 2026 is hybrid systems combining LLM reasoning with formal-proof backends. LLM-LeanCopilot-style systems use the LLM for tactic suggestion and Lean for verification. Theorem-prover-as-tool patterns let the LLM call out to ATP systems for sub-goal closure. The trajectory is toward systems that combine LLM ergonomic flexibility with formal-proof mechanical reliability. This is also the most-likely route to solving research-mathematics problems in the next 5–10 years.
The Frontier and the Operational Question
AI for mathematics has moved from research curiosity to demonstrated capability between 2022 and 2026. The frontier as of 2026 has several dimensions: pushing IMO-class systems to gold-medal performance, scaling formalisation efforts to research mathematics, integrating LLM reasoning with formal verification, and addressing the fundamental methodological questions about creativity, originality, and what counts as mathematical understanding. This section traces the open methodological questions and the directions the field is moving in.
IMO gold and the next benchmarks
The natural next milestone after AlphaProof's 2024 silver is IMO gold: solving 5 or 6 of the 6 problems, ideally within a per-problem compute budget closer to the human limit. Successive IMOs, starting with 2025, are the obvious public tests of progress and methodological robustness. Beyond the IMO, the next benchmarks include the Putnam (whose hardest problems are comparable in difficulty to IMO problems but target undergraduate material) and competitive performance on research-level conjectures.
The research-mathematics frontier
The most ambitious frontier is contributing to active mathematical research. Work from 2021 to 2025 has demonstrated several proof-of-concept successes: the Liquid Tensor Experiment formalising recent research-level mathematics in Lean, FunSearch producing new combinatorial constructions, the DeepMind knot-theory work contributing to new theorems. As of 2026, there are increasing signs that AI systems are routinely participating in research mathematics workflows — formalising newly-published theorems, suggesting lemmas, identifying patterns. The threshold for "AI co-author on substantial new mathematical results" is increasingly being crossed, with the methodological work to track who-contributed-what an active topic.
The autoformalisation gap
A persistent operational bottleneck is autoformalisation: translating natural-language mathematics into Lean/Coq/Isabelle. The 2025 state of the art handles textbook-level mathematics with reasonable accuracy but still requires substantial human cleanup for research-level mathematics. Progress here is critical because the formalised-mathematics corpus currently grows only as fast as human formalisation effort, but could scale with compute if autoformalisation reaches research-level reliability. The 2026–2030 period will likely see substantial methodological progress here.
Open problems as targets
Several long-standing open problems are increasingly discussed as AI targets. The Riemann Hypothesis remains far out of reach, though ML-driven exploration has contributed computational evidence around related objects. The P vs NP problem is too foundational for current methods, though symbolic-search methods can probe specific lemmas. There is exploratory work applying ML to sieve-method optimisation relevant to the Twin Prime Conjecture. The methodology of "AI-assisted attack on grand-challenge problems" is genuinely emerging and is among the most-watched threads of 2024–2026.
Methodological questions
Several deep methodological questions remain open. What does creativity mean for AI mathematics — is the FunSearch cap-set construction more or less "creative" than a human mathematician's contribution? What is the right division of labour between AI and human mathematicians? Do we want AI systems that produce proofs that humans can read and learn from, or is mechanically-verified-but-opaque acceptable? How should research-mathematics credit be assigned? These questions are increasingly mainstream in 2025–2026 mathematical-AI discourse.
What this chapter has not covered
Several adjacent areas are out of scope. The substantial program-synthesis literature beyond mathematics has only been touched in passing (FunSearch). Theorem-prover-internal architectural details (the Lean kernel, the Coq calculus of inductive constructions) are out of scope despite their substantial relevance to AI methodology. Pre-LLM ATP systems (Vampire, E, Z3) are summarised but not developed in depth. Educational and tutoring applications of AI mathematics have not been covered. The chapter aimed at the methodological core of AI-for-mathematics work; the broader landscape — particularly at the intersection with computer science, theoretical computer science, and mathematical education — is genuinely vast.
Further reading
Foundational papers and references for AI in mathematics. The DeepMind AlphaProof/AlphaGeometry blog post, the Davies et al. 2021 knot-theory paper, the FunSearch paper (Romera-Paredes et al. 2024), the AI Feynman papers (Udrescu & Tegmark), the PySR paper (Cranmer 2023), the miniF2F benchmark paper, the Magnushammer paper, and the various LLM-mathematical-reasoning papers (chain-of-thought, self-consistency, o1-class extended reasoning) form the right starting kit. The Lean Mathlib documentation and Isabelle's Archive of Formal Proofs are essential operational references.
- Advancing mathematics by guiding human intuition with AI (Davies et al. 2021). The DeepMind knot-theory and representation-theory paper. Demonstrates that ML can guide pure-mathematical research by surfacing statistical patterns that humans then formalise into theorems. The reference paper for AI-assisted conjecture generation.
- Mathematical discoveries from program search with large language models (Romera-Paredes et al. 2024). The FunSearch paper. Introduces LLM-driven evolutionary program search and demonstrates new lower bounds on the cap set problem and improved bin-packing heuristics. The reference paper for LLM-driven mathematical discovery.
- AI achieves silver-medal standard solving International Mathematical Olympiad problems (DeepMind, 2024). The AlphaProof / AlphaGeometry 2 announcement. Describes the methodology: RL plus MCTS over Lean tactic sequences, with synthetic-proof-generation training augmentation, and a separate neural-symbolic system for geometry. The reference for the current state-of-the-art proof-finding system.
- Solving olympiad geometry without human demonstrations (Trinh et al. 2024). The AlphaGeometry paper. Introduces the neural-symbolic methodology for IMO geometry: a transformer proposes auxiliary constructions, a symbolic deduction engine verifies. The reference for AlphaGeometry.
- AI Feynman: a Physics-Inspired Method for Symbolic Regression (Udrescu and Tegmark 2020). The AI Feynman paper. Combines neural-network function fitting with symbolic-regression search to rediscover physics equations from data. The reference paper for symbolic regression in physics.
- Interpretable Machine Learning for Science with PySR and SymbolicRegression.jl (Cranmer 2023). The PySR paper. Documents the current state-of-the-art symbolic-regression library, with Pareto-front multi-objective optimisation and substantial software-engineering work for production scientific use. The reference for symbolic regression in current practice.
- miniF2F: a cross-system benchmark for formal Olympiad-level mathematics (Zheng et al. 2022). The miniF2F benchmark paper. Establishes the de-facto standard evaluation for formal mathematics ML, with curated olympiad-level problems formalised in Lean, Isabelle, HOL Light, and Metamath. The reference benchmark for proof-finding work.
- Magnushammer: A Transformer-Based Approach to Premise Selection (Mikuła et al. 2023). The Magnushammer paper. Introduces a transformer-based premise-selection methodology that substantially advances the state of the art for Isabelle/HOL theorem proving. The reference for modern premise-selection work and a key methodological reference for §3.
- Generative Language Modeling for Automated Theorem Proving (GPT-f; Polu and Sutskever 2020). The GPT-f paper. An early substantial demonstration that transformer-based proof-step generation works for theorem proving. Foundational for the subsequent generation of neural-tactic systems including AlphaProof.
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models (Wei et al. 2022). The chain-of-thought paper. Introduces the prompting methodology that substantially advanced LLM mathematical reasoning and is the foundation of the o1-class extended-reasoning systems that followed. The reference for LLM-based mathematical reasoning.
- The Liquid Tensor Experiment. A community formalisation in Lean of a major theorem from Scholze's condensed-mathematics programme, completed in 2022 in response to a challenge Scholze posed in 2020. The reference demonstration that contemporary research mathematics can be formalised at scale.
- The Lean Mathlib Library. Mathlib is the community-curated mathematics library for Lean. As of 2026 it exceeds 1.5 million lines and roughly 200,000 theorems, covering substantial parts of undergraduate and graduate mathematics. The single largest training resource for mathematical AI and the substrate of nearly every leading proof-finding system.