Intro to Chemistry, the working vocabulary for the AI chemist.
Chemistry is the science of matter and its transformations. It sits between physics (which describes the fundamental behaviour of particles) and biology (which describes how matter organises into living systems), and it is the natural language for everything that happens at the molecular scale: how atoms bond, how molecules react, how drugs bind their targets, how materials conduct electricity, how proteins fold, how enzymes catalyse. The AI applications that fill the rest of Part XV — protein science, drug discovery, materials design — all rest on this foundation. This chapter develops the working vocabulary an AI reader needs to engage with that material: atomic structure, the periodic table, chemical bonding, molecular geometry, reactions and kinetics, thermodynamics, organic chemistry, the essentials of quantum chemistry that ground modern computational methods, and the spectroscopic techniques that produce most of the data ML methods consume. The chapter is intentionally thorough — chemistry is the substrate for too many downstream chapters to skim — but it is written for an AI audience rather than for a chemistry undergraduate, prioritising the concepts that ML methods engage with.
Prerequisites & orientation
This chapter assumes only the high-school chemistry most readers retain: that matter is made of atoms, that atoms combine into molecules, that reactions transform one set of molecules into another. No undergraduate chemistry background is assumed, but the chapter does cover material at an undergraduate-introductory level of depth in some sections (particularly Section 8 on quantum and physical chemistry). Readers with a strong existing chemistry background can skim Sections 2–4 and engage seriously with Sections 6–10, where the AI-relevant content is densest. Readers approaching chemistry for the first time should take their time with Sections 2–5, which introduce the foundational vocabulary that everything else builds on.
Two threads run through the chapter. The first is the structure-determines-property hierarchy: atomic composition determines bonding patterns, bonding patterns determine molecular geometry, geometry determines reactivity and physical properties. Each level of the hierarchy is a distinct prediction problem for ML methods (Ch 08 develops them in detail for drug discovery, Ch 14 for materials), and the chapter pays explicit attention to which level each topic operates at. The second thread is the chemistry-as-graphs framing: from the AI perspective, molecules are graphs (atoms are nodes, bonds are edges, with rich attributes on both), and most modern ML methods for chemistry exploit this representation. The framing is not the only legitimate one — quantum chemistry treats molecules as wavefunctions, classical mechanics treats them as ball-and-stick approximations — but it is the most useful for an AI reader, and the chapter pays attention to what gets gained and lost in the graph view.
Why an AI Reader Needs Chemistry
Chemistry is the molecular layer of science: its objects are larger than the particles of physics and smaller than the organisms of biology, and it is the natural language for everything that happens between atoms. AI applications in drug discovery, materials science, catalysis, batteries, and protein engineering all rest on chemistry, and most of those applications struggle when the practitioner lacks the underlying vocabulary. This section frames why the chapter exists and what the rest of Part XV asks of it.
Chemistry is the substrate of multiple AI subfields
An AI reader entering modern computational science cannot avoid chemistry. Drug discovery (Ch 08) is essentially small-molecule chemistry meeting biology — designing molecules that bind specific protein targets, predicting their pharmacokinetics, optimising their selectivity. Materials science (Ch 14) is solid-state chemistry plus condensed-matter physics — predicting which combinations of elements produce useful catalysts, semiconductors, batteries, alloys. Protein science (Ch 04) is biochemistry — amino acids are organic molecules; protein folding is driven by chemistry (hydrogen bonds, hydrophobic effects, electrostatics); enzymes are chemistry made macromolecular. Without chemistry, the rest of Part XV reads as opaque jargon. With it, the methods make sense and the empirical results are interpretable.
The data substrate is enormous
Chemistry has accumulated an extraordinary public-data substrate over its history. PubChem (NIH) holds ~120 million compounds with associated assay results. ChEMBL (EBI) holds ~2.4 million bioactive molecules with measured activity against ~15,000 protein targets. Reaxys and SciFinder (commercial) catalogue tens of millions of synthetic reactions from the chemical literature. ZINC (UCSF) holds ~37 billion virtually-synthesisable molecules for screening. The Materials Project, OQMD, and AFLOW hold computed properties for millions of crystalline materials. Quantum-chemistry databases (QM9, ANI-1, the various 2024–2026 successors) hold computed energies and properties for millions of small molecules at quantum-chemical accuracy. The combined data substrate is among the largest in any science, comparable in scale to biology, and the AI methods that engage with it are correspondingly sophisticated.
Chemistry as graphs
A central conceptual shift for an AI reader is the framing of molecules as graphs. Atoms are nodes (with attributes like element, charge, hybridisation); bonds are edges (with attributes like type, order, stereochemistry); the topology captures most of what matters for chemical behaviour. This framing is not the only legitimate one — quantum chemistry treats molecules as wavefunctions, classical mechanics treats them as ball-and-stick models with bond springs — but it has been the most productive for ML. Graph neural networks (Part XIII Ch 05) operating on molecular graphs are the workhorse architecture for chemistry-AI applications, and the SMILES string representation (which serialises a molecular graph as a sequence of characters) is the dominant data format. The chapter introduces both perspectives and their relationship.
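The graph framing can be made concrete in a few lines. The sketch below represents ethanol as a plain-dictionary graph; the atom indices, the `element` and `order` attribute names, and the `degree` helper are all illustrative choices rather than any fixed format (production work would use a cheminformatics library such as RDKit):

```python
# A minimal sketch of the molecules-as-graphs view, using plain Python
# dictionaries rather than a cheminformatics library. Ethanol (CH3-CH2-OH),
# heavy atoms only; all field names are illustrative.

ethanol = {
    "nodes": {  # atom index -> attributes
        0: {"element": "C"},
        1: {"element": "C"},
        2: {"element": "O"},
    },
    "edges": {  # (i, j) -> bond attributes
        (0, 1): {"order": 1},
        (1, 2): {"order": 1},
    },
}

def degree(mol, i):
    """Number of bonds involving atom i (the graph degree)."""
    return sum(1 for (a, b) in mol["edges"] if i in (a, b))

# The middle carbon (index 1) is bonded to the other carbon and the oxygen.
assert degree(ethanol, 1) == 2
```

Node and edge attribute dictionaries like these are essentially what a graph neural network consumes after featurisation.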
What this chapter does and doesn't try to do
The chapter is a primer aimed at AI readers. It does not try to substitute for an undergraduate chemistry sequence — those courses run for one to two years and produce graduates who can synthesise molecules, identify spectra by hand, and balance arbitrarily complex reactions. Instead it covers the concepts AI applications repeatedly invoke, with enough depth that the AI methods make sense. Sections 2–4 cover atomic-and-molecular structure (the substrate). Sections 5–6 cover what happens when atoms react (chemical change and its energetics). Section 7 covers organic chemistry (the chemistry of life and most drugs). Section 8 covers the quantum-chemistry essentials that ground modern computational methods (DFT in particular). Section 9 covers the spectroscopic techniques that produce most chemistry data. Section 10 closes by pointing at the AI-and-chemistry frontier that Ch 08 and Ch 14 develop in detail.
Chemistry is the molecular-scale science. Below it lies physics (atoms and the rules they follow); above it lies biology (molecules organised into living systems) and materials science (molecules organised into condensed matter). AI methods for drug discovery, protein science, materials, and catalysis all engage with chemistry as their substrate. This chapter develops the working vocabulary; the AI applications of Ch 04, Ch 08, and Ch 14 build on it.
Atoms, Elements, and the Periodic Table
The fundamental object of chemistry is the atom. Every chemical phenomenon is, ultimately, atoms doing things to each other. Understanding atomic structure, the periodic table, and the trends that organise the elements is the foundation for everything that follows.
Atomic structure
An atom consists of a tiny dense nucleus (containing protons and neutrons) surrounded by a much larger cloud of electrons. The proton has a positive electric charge of +1; the electron has a negative charge of −1; the neutron is uncharged. A neutral atom has equal numbers of protons and electrons. The number of protons (the atomic number, denoted Z) defines the element: every atom with one proton is hydrogen; every atom with six protons is carbon; every atom with eighty-eight protons is radium. The number of neutrons can vary for the same element, producing different isotopes (carbon-12 with 6 neutrons, carbon-13 with 7, carbon-14 with 8) that have nearly identical chemistry but different masses and stability.
The mass of an atom is concentrated in the nucleus (each proton or neutron weighs about 1 atomic mass unit, while the electron is roughly 1/1836 as heavy). But the chemistry of an atom is determined almost entirely by its electrons — specifically by the electrons in its outermost layer, called the valence electrons. When chemists say "carbon has four valence electrons," they mean carbon's outermost layer has four electrons available to participate in bonding, and that fact dominates carbon's chemical behaviour.
Electron configuration and orbitals
Electrons in atoms occupy regions of space called orbitals. An orbital is not a planetary orbit — it is a probability distribution describing where the electron is likely to be found, derived from quantum mechanics (Section 8 develops the math). Each orbital has a characteristic shape and energy and can hold up to two electrons (with opposite spins, by the Pauli exclusion principle). Orbitals come in types labelled by letters: s orbitals are spherically symmetric, p orbitals are dumbbell-shaped (three of them per shell, oriented along x, y, z axes), d orbitals are more complex (five per shell), and f orbitals are even more so (seven per shell).
Electrons fill orbitals in order of increasing energy: 1s, 2s, 2p, 3s, 3p, 4s, 3d, 4p, and so on. The fill order produces the periodic table's structure, and writing out the configuration for an element (carbon: 1s² 2s² 2p²) reveals how many valence electrons it has and which orbitals they occupy. The patterns are not arbitrary; they follow from quantum mechanics, and the chapter returns to the underlying math in Section 8.
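The fill order itself is algorithmic: it follows the Madelung (n + l) rule, with ties broken by lower n. The sketch below generates the order and writes out configurations; the function names are illustrative, and the code ignores the handful of real elements (Cr, Cu, and several heavier ones) whose ground states deviate from the simple rule:

```python
# Sketch of the aufbau fill order via the Madelung (n + l) rule.
CAPACITY = {"s": 2, "p": 6, "d": 10, "f": 14}
L_OF = {"s": 0, "p": 1, "d": 2, "f": 3}

def orbital_order(max_n=5):
    """All (n, subshell) orbitals up to max_n, in aufbau fill order."""
    orbitals = [(n, sub) for n in range(1, max_n + 1)
                for sub in "spdf" if L_OF[sub] < n]
    # Madelung rule: increasing n + l, ties broken by lower n.
    return sorted(orbitals, key=lambda o: (o[0] + L_OF[o[1]], o[0]))

def configuration(z):
    """Ground-state configuration for atomic number z (ignoring the
    real-world exceptions such as Cr and Cu)."""
    parts = []
    for n, sub in orbital_order():
        if z <= 0:
            break
        k = min(z, CAPACITY[sub])
        parts.append(f"{n}{sub}{k}")
        z -= k
    return " ".join(parts)

assert configuration(6) == "1s2 2s2 2p2"   # carbon, as in the text
```

Note that the generated order reproduces the 4s-before-3d inversion quoted above: sorting by (n + l, n) puts 4s (n + l = 4) ahead of 3d (n + l = 5).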
The periodic table
Dmitri Mendeleev's 1869 organisation of the elements into the periodic table is among the most elegant intellectual achievements of nineteenth-century science. Mendeleev ordered the elements by atomic mass; the modern table arranges them by atomic number, with rows ("periods") corresponding to electron shells and columns ("groups") corresponding to similar valence-electron configurations. Elements in the same group have similar chemistry: the alkali metals (Group 1: Li, Na, K, Rb, Cs) all have one valence electron and react vigorously with water; the noble gases (Group 18: He, Ne, Ar, Kr, Xe, Rn) all have full outer shells and are essentially inert; the halogens (Group 17: F, Cl, Br, I, At) all have seven valence electrons and form salt-like compounds with metals.
The modern table contains 118 known elements, organised into the s-block (groups 1–2), p-block (groups 13–18), d-block (groups 3–12, the transition metals), and f-block (the lanthanides and actinides shown separately at the bottom). The block structure reflects which orbital is being filled across that section of the table. Knowing roughly where elements sit in the table is essential vocabulary; an AI reader doesn't need to memorise all 118 names but should recognise the blocks and the major groups.
The elements that matter most
For biological and pharmaceutical chemistry, six elements dominate: CHNOPS (carbon, hydrogen, nitrogen, oxygen, phosphorus, sulphur). These six make up >99% of the atoms in living organisms. Carbon is the structural backbone (Section 7 develops why); hydrogen forms the most numerous bonds; nitrogen and oxygen are essential for proteins and nucleic acids; phosphorus is in DNA, RNA, ATP, and membrane phospholipids; sulphur is in two amino acids (cysteine, methionine) and many cofactors.
Beyond CHNOPS, several other elements appear in important biological roles: halogens (fluorine, chlorine, bromine, iodine) appear frequently in pharmaceuticals and as biological signalling molecules; alkali and alkaline earth metals (Na, K, Mg, Ca) are essential ions for cellular function (sodium-potassium gradients drive nerve signalling; calcium is a ubiquitous signalling molecule; magnesium is required by many enzymes); transition metals (Fe, Cu, Zn, Mn, Co, Ni, Mo) sit at the active sites of many enzymes (haemoglobin's iron, vitamin B12's cobalt, nitrogenase's molybdenum). For materials science, the relevant elements are broader still — essentially the entire d-block (transition metals) and most of the p-block (Si, Ge for semiconductors; Al, Ga, In for various optoelectronics; the noble gases for inert atmospheres).
Periodic trends
Several properties vary predictably across the periodic table. Atomic radius increases down a group (more electron shells) and decreases left-to-right across a period (more protons pull electrons in tighter). Ionisation energy (the energy to remove the outermost electron) decreases down a group and increases left-to-right. Electronegativity (the tendency to attract bonding electrons) follows similar patterns, with fluorine the most electronegative element and the alkali metals among the least. Electron affinity (energy released when an atom gains an electron) shows analogous trends with more variation. These trends are essential for predicting how elements will bond and react, and they show up repeatedly in the chapters that follow.
Chemical Bonding
When atoms come together, they may stick. The various ways they stick are called chemical bonds, and the type of bond determines almost everything about the resulting molecule or material — its shape, its reactivity, its physical properties, its melting point, its solubility. This section catalogues the major bond types and what they imply.
Why atoms bond at all
The basic principle is energy: atoms bond when the result is more stable (lower energy) than the unbonded atoms. Most isolated atoms (other than the noble gases) have incompletely filled outer electron shells and are correspondingly reactive. By sharing or transferring electrons with other atoms, they can achieve filled-shell configurations that lower the system's energy. The various types of bonds correspond to different ways atoms accomplish this energy reduction.
Ionic bonds
An ionic bond forms when one atom transfers electrons to another, leaving the donor positively charged (a cation) and the acceptor negatively charged (an anion). The opposite charges attract electrostatically, holding the ions together. Ionic bonds are typical between elements with very different electronegativities: alkali metals (low electronegativity, easy to ionise) bonded to halogens (high electronegativity, eager to gain an electron). Sodium chloride (Na⁺Cl⁻, table salt) is the canonical example. Ionic compounds typically form crystalline solids with high melting points, and they dissociate into separate ions when dissolved in water — which is why salt water conducts electricity.
Covalent bonds
A covalent bond forms when two atoms share a pair of electrons rather than transferring them. Each atom effectively counts the shared pair toward its valence shell, achieving stability without becoming charged. Covalent bonds dominate in molecules built primarily from non-metals (C, H, N, O, P, S, halogens) — i.e., almost all of biology and most of organic chemistry. A single covalent bond shares one pair of electrons; a double bond shares two pairs; a triple bond shares three. Higher-order bonds are stronger and shorter. The C=C double bond in ethylene (1.34 Å, 614 kJ/mol) is shorter and stronger than the C–C single bond in ethane (1.54 Å, 348 kJ/mol).
Polar covalent bonds and partial charges
When two atoms in a covalent bond have different electronegativities, the shared electrons are pulled toward the more electronegative atom, producing a polar covalent bond with partial charges (denoted δ⁺ and δ⁻) at the two ends. The water molecule (H–O–H) is the canonical example: oxygen's higher electronegativity pulls electron density away from the hydrogens, leaving the oxygen partially negative and each hydrogen partially positive. The polar nature of water drives most of biology — it explains why water dissolves salts, why proteins fold the way they do, and why hydrogen bonding (next subsection) shapes molecular structure throughout life.
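A common textbook heuristic classifies a bond's character from the Pauling electronegativity difference of its two atoms. The sketch below applies the usual approximate cutoffs (~0.4 and ~1.7); these are conventional rules of thumb, not sharp physical boundaries, and the function name is illustrative:

```python
# Rough classifier for bond character from the Pauling electronegativity
# difference. Cutoffs (0.4 and 1.7) are the usual textbook heuristics.

PAULING = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44, "F": 3.98,
           "Na": 0.93, "Cl": 3.16, "K": 0.82}

def bond_character(a, b):
    diff = abs(PAULING[a] - PAULING[b])
    if diff < 0.4:
        return "nonpolar covalent"
    if diff < 1.7:
        return "polar covalent"
    return "ionic"

assert bond_character("O", "H") == "polar covalent"    # the bonds in water
assert bond_character("Na", "Cl") == "ionic"           # table salt
assert bond_character("C", "H") == "nonpolar covalent"
```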
Hydrogen bonds
A hydrogen bond is a particularly strong dipole-dipole attraction in which a hydrogen atom covalently bonded to an electronegative atom (typically O, N, or F) is electrostatically attracted to another nearby electronegative atom. Hydrogen bonds are weaker than covalent bonds (~5–30 kJ/mol vs. ~200–500 kJ/mol for covalent) but substantially stronger than other intermolecular forces, and they are everywhere in biology. Water's high boiling point comes from extensive hydrogen bonding. DNA's double helix is held together by hydrogen bonds between complementary base pairs (A-T pairs share two hydrogen bonds; G-C pairs share three). Protein secondary structures (alpha helices, beta sheets) are stabilised by hydrogen bonds along the polypeptide backbone. Most of structural biology is hydrogen-bond engineering.
Van der Waals forces
Even nonpolar molecules attract each other weakly through van der Waals forces (also called London dispersion forces). The mechanism is subtle: at any instant, the electrons in a molecule may not be perfectly symmetrically distributed, producing a fleeting dipole that induces a complementary dipole in a neighbouring molecule, and the two dipoles attract. Van der Waals forces are weak (typically <5 kJ/mol per contact) but additive — a large protein binding to its target through hundreds of van der Waals contacts can have substantial total binding energy. Drug binding affinities are often dominated by van der Waals interactions plus the hydrophobic effect (Section 7 develops this), and the methodology of molecular docking and binding-affinity prediction (Ch 08) is essentially the methodology of summing weak interactions to predict strong outcomes.
Metallic bonding
In metals, atoms contribute their valence electrons to a delocalised "sea" of electrons that flows freely through the lattice. This metallic bond explains why metals conduct electricity (electrons can move) and heat (vibrations propagate through the electron sea), why they are malleable (the lattice can deform without breaking specific bonds), and why they are typically opaque and shiny (the electron sea reflects light). Metallic bonding is the foundation of materials science (Ch 14) and the substrate for catalysis at metal surfaces.
Bond strengths and what they mean
Bond strengths span a wide range. Triple bonds like C≡N (in cyanide) are among the strongest at ~890 kJ/mol. Double bonds like C=O (in carbonyls) are typically 700–800 kJ/mol. Single covalent bonds are 150–500 kJ/mol depending on which atoms. Hydrogen bonds are 5–30 kJ/mol. Van der Waals contacts are typically <5 kJ/mol. The thermal energy scale at room temperature (RT = R·T ≈ 2.5 kJ/mol, where R is the gas constant and T ≈ 298 K) is small compared to covalent bond energies, which is why molecules are stable, but comparable to weak interactions, which is why hydrogen bonds and van der Waals contacts can be transient and dynamic. The hierarchy is what makes biology possible: covalent bonds give molecules their persistent identity, while weaker interactions provide the flexibility needed for binding, folding, and reaction.
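The stability hierarchy can be made quantitative with the Boltzmann factor exp(−E/RT), which estimates the fraction of thermal fluctuations energetic enough to break an interaction of energy E. A sketch using representative energies from the text (the specific values are illustrative):

```python
import math

# Why covalent bonds persist while hydrogen bonds flicker: the Boltzmann
# factor exp(-E/RT) estimates the fraction of thermal fluctuations that
# carry enough energy E. Representative energies; T = 298 K.

R = 8.314e-3   # gas constant, kJ/(mol K)
T = 298.0

def boltzmann_fraction(e_kj_mol):
    return math.exp(-e_kj_mol / (R * T))

covalent = boltzmann_fraction(350.0)   # a typical C-C single bond
hbond = boltzmann_fraction(20.0)       # a typical hydrogen bond

# Breaking a covalent bond thermally is astronomically unlikely; breaking
# a hydrogen bond is merely uncommon, so H-bonds stay dynamic.
assert covalent < 1e-60
assert hbond > 1e-4
```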
Molecular Geometry and Stereochemistry
Molecules are not flat lists of atoms; they are three-dimensional objects with specific shapes. The shape determines the function: drug molecules bind targets only when their shape matches; enzymes catalyse only when their active sites can accommodate substrates; materials conduct only when atoms align in particular ways. Understanding molecular geometry is essential for understanding chemistry.
VSEPR theory
The simplest theory for predicting molecular shape is VSEPR (Valence Shell Electron Pair Repulsion). The idea: electron pairs around an atom repel each other, so they arrange themselves to maximise their separation. Counting the electron pairs (bonding plus lone pairs) around the central atom predicts the geometry. Two pairs produce linear geometry (180° apart, e.g., CO₂). Three produce trigonal planar (120° apart, e.g., BF₃). Four produce tetrahedral (109.5° apart, e.g., methane CH₄). Five produce trigonal bipyramidal (a triangular base with axial atoms above and below, e.g., PF₅). Six produce octahedral (90° apart, e.g., SF₆).
Lone pairs (non-bonding electron pairs) take up more space than bonding pairs and distort geometries. Water has four electron pairs around oxygen (two bonds, two lone pairs) — the underlying geometry is tetrahedral, but the actual H-O-H angle is 104.5° rather than 109.5° because the lone pairs squeeze the bonds together. Ammonia (NH₃) has the same four-pair pattern (three bonds, one lone pair) and a pyramidal shape. These shape distortions matter because polarity, hydrogen-bonding patterns, and reactivity all depend on the actual geometry.
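Since VSEPR amounts to a lookup from electron-pair counts to geometries, it reduces to a small table. A sketch, using the pair counts and parent geometries given above (the table and function names are illustrative):

```python
# Minimal VSEPR lookup: electron-pair count (bonding + lone pairs around
# the central atom) -> parent geometry and ideal angle, as in the text.

VSEPR = {
    2: ("linear", 180.0),
    3: ("trigonal planar", 120.0),
    4: ("tetrahedral", 109.5),
    5: ("trigonal bipyramidal", None),   # mixed 90/120 degree angles
    6: ("octahedral", 90.0),
}

def parent_geometry(bonding_pairs, lone_pairs):
    return VSEPR[bonding_pairs + lone_pairs][0]

# Water: two bonds + two lone pairs -> tetrahedral parent geometry,
# even though the observed H-O-H angle is squeezed to ~104.5 degrees.
assert parent_geometry(2, 2) == "tetrahedral"
assert parent_geometry(2, 0) == "linear"   # CO2
```

Lone-pair distortions (water's 104.5°, ammonia's pyramid) are exactly what this parent-geometry lookup does not capture; they require the refinements described above.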
Hybridization
For carbon (and other p-block elements), the actual geometry derives from orbital hybridization: the s and p atomic orbitals mix to form new hybrid orbitals with different shapes. sp³ hybridization (one s + three p, four hybrid orbitals) gives tetrahedral geometry — the bonding pattern of saturated carbons in alkanes (methane, ethane, the carbon backbone of biomolecules). sp² hybridization (one s + two p, three hybrid orbitals plus one unhybridised p) gives trigonal planar geometry with a p orbital available for π bonding — the bonding pattern of carbon in C=C double bonds, in carbonyls (C=O), and in aromatic rings. sp hybridization (one s + one p, two hybrid orbitals plus two unhybridised p) gives linear geometry with two perpendicular π bonds — the bonding pattern of carbon in triple bonds (C≡C, C≡N).
The hybridization state of every atom in a molecule largely determines its shape and reactivity. Modern AI methods for molecular property prediction routinely include hybridization as a per-atom feature, and the methodology connects the discrete graph view of molecules (Section 1) to the underlying quantum-chemical reality (Section 8).
Conformations and configurations
Two distinctions matter. Conformations are different shapes a molecule can adopt by rotating around single bonds — they interconvert without breaking bonds. Butane has multiple conformations (anti, gauche, eclipsed) that differ in energy by a few kJ/mol and interconvert millions of times per second at room temperature. Configurations are different shapes that require breaking bonds to interconvert — they are stable distinct entities. The cis/trans isomers of 2-butene (with the methyl groups on the same vs. opposite sides of the C=C double bond) are configurations; you cannot convert one to the other without breaking the double bond.
Chirality
A specific configurational distinction worth flagging is chirality: many molecules can exist as enantiomers — mirror images of each other that cannot be superimposed, like left and right hands. The standard case is a carbon with four different substituents: there are two ways to arrange them in 3D space, and the resulting molecules are enantiomers labelled R or S by convention. Enantiomers have identical physical properties in symmetric environments but opposite behaviours in chiral environments — and biology is intensely chiral. Most amino acids in proteins are L-isomers; most sugars are D-isomers; almost every drug-target interaction is sensitive to chirality.
The medical consequences of chirality can be severe. The classic case is thalidomide: the (R)-enantiomer is an effective sedative; the (S)-enantiomer causes severe birth defects (a distinction complicated in practice by the fact that the two enantiomers interconvert in the body). Marketed as a racemic mixture (50/50 R/S) in the late 1950s, thalidomide caused thousands of cases of phocomelia in children before being withdrawn. Modern pharmaceutical regulation requires evaluating each enantiomer separately, and many drugs are now sold as the single active enantiomer. The methodology of computational drug design (Ch 08) pays substantial attention to chirality.
Isomers more broadly
Isomers are molecules with the same molecular formula but different structures. The major categories: structural (constitutional) isomers have different bonding patterns — pentane and 2-methylbutane are both C₅H₁₂ but have different connectivities. Stereoisomers have the same connectivity but different 3D arrangements — these include cis/trans and R/S as discussed above, plus diastereomers of molecules with several stereocentres (axial/equatorial arrangements in cyclohexanes, by contrast, are conformations, since a ring flip interconverts them). The number of distinct isomers grows combinatorially with molecular size, which is why chemical-design problems are hard: the search space of "molecules with this molecular formula" is vast.
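The combinatorial growth is visible in the standard published counts of acyclic alkane constitutional isomers CₙH₂ₙ₊₂ (OEIS sequence A000602); the snippet below simply checks that each added carbon roughly doubles the count in this range:

```python
# Combinatorial explosion of constitutional isomers, using the standard
# published counts of acyclic alkane isomers CnH(2n+2) (OEIS A000602).

ALKANE_ISOMERS = {1: 1, 2: 1, 3: 1, 4: 2, 5: 3,
                  6: 5, 7: 9, 8: 18, 9: 35, 10: 75}

# From pentane onward, each added carbon multiplies the count by ~2x.
ratios = [ALKANE_ISOMERS[n + 1] / ALKANE_ISOMERS[n] for n in range(5, 10)]
assert all(r >= 1.6 for r in ratios)
assert ALKANE_ISOMERS[4] == 2   # butane and isobutane
```

Allowing heteroatoms, rings, and stereochemistry makes the growth far steeper still, which is what makes molecular design a search problem at chemical scale.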
Reactions, Kinetics, and Equilibria
A chemical reaction transforms one set of molecules (the reactants) into another (the products). Predicting which reactions happen, how fast, and to what extent is the central problem of chemistry — and a major application area for ML methods (retrosynthesis, reaction prediction, catalyst design).
The major reaction types
Chemical reactions can be classified by what kind of bond reorganisation happens. Substitution reactions replace one group with another (replacing the OH in an alcohol with a Br to make an alkyl halide). Addition reactions add atoms across a multiple bond (adding H₂ across a C=C double bond to saturate it). Elimination reactions remove atoms to form a multiple bond (the reverse of addition). Rearrangement reactions reorganise the carbon skeleton without adding or losing atoms. Each reaction type has its own mechanism — the detailed sequence of bond-making and bond-breaking events that connect reactants to products.
Two cross-cutting reaction categories are particularly important. Acid-base reactions transfer protons (H⁺) between molecules; the strength of an acid (its tendency to give up a proton) is measured by its pKa, and pKa values are essential for predicting drug behaviour because most drugs are weakly acidic or basic and their charge state in the body affects absorption, distribution, and binding. Redox reactions (reduction-oxidation) transfer electrons; oxidation is loss of electrons, reduction is gain. Combustion, respiration, photosynthesis, battery operation, and metabolism are all redox processes, and the methodology of computational electrochemistry rests on understanding electron-transfer reactions.
Stoichiometry
A balanced chemical equation specifies how many molecules of each reactant produce how many molecules of each product. The methane combustion equation CH₄ + 2 O₂ → CO₂ + 2 H₂O says that one methane molecule plus two oxygen molecules produces one carbon dioxide and two waters. Stoichiometry is conservation of atoms — every atom on the left must appear on the right — and it is the bookkeeping layer that grounds quantitative chemistry. Stoichiometric calculations are bread-and-butter for chemists but rarely the bottleneck in modern computational work; the harder problems lie in mechanism, kinetics, and equilibrium, which the rest of this section develops.
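Stoichiometric bookkeeping is mechanical enough to verify in code: a balanced equation conserves the count of every element. The sketch below checks the methane combustion equation from the text; the simple regex parser handles formulas without parentheses or charges, and all names are illustrative:

```python
import re
from collections import Counter

# Atom bookkeeping for CH4 + 2 O2 -> CO2 + 2 H2O: a balanced equation
# has identical element counts on both sides.

def atoms(formula):
    """Element counts for a simple formula like 'CO2' or 'H2O'
    (no parentheses, no charges)."""
    counts = Counter()
    for element, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(num) if num else 1
    return counts

def is_balanced(reactants, products):
    """Each side is a list of (coefficient, formula) pairs."""
    def total(side):
        return sum((Counter({el: c * n for el, n in atoms(f).items()})
                    for c, f in side), Counter())
    return total(reactants) == total(products)

assert is_balanced([(1, "CH4"), (2, "O2")], [(1, "CO2"), (2, "H2O")])
assert not is_balanced([(1, "CH4"), (1, "O2")], [(1, "CO2"), (2, "H2O")])
```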
Reaction rates and rate laws
How fast does a reaction happen? The reaction rate is the rate of change of reactant or product concentrations over time. For a simple reaction A → B, the rate depends on [A] (the concentration of A) according to a rate law: rate = k[A]ⁿ, where k is the rate constant and n is the reaction order (typically 0, 1, or 2). The rate constant k depends on the reaction's intrinsic difficulty (specifically, on its activation energy — see below) and on temperature.
The mathematical tools are familiar to anyone with a calculus background: a first-order reaction (rate = k[A]) has exponentially decaying [A] over time; a second-order reaction (rate = k[A]²) has the inverse of [A] increasing linearly with time. Reaction orders are determined experimentally and can sometimes hint at the underlying mechanism — a second-order reaction in a single reactant suggests a step in the mechanism involves two molecules colliding.
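The first-order case integrates in closed form: [A](t) = [A]₀·exp(−kt), with half-life t½ = ln 2 / k. A quick numerical check (the rate-constant value is illustrative):

```python
import math

# Integrated first-order rate law: for rate = k[A],
# [A](t) = [A]0 * exp(-k t). Values below are illustrative.

def first_order_conc(a0, k, t):
    return a0 * math.exp(-k * t)

k = 0.01                       # rate constant, 1/s (illustrative)
half_life = math.log(2) / k    # t_1/2 = ln 2 / k for first order

# After one half-life, half of A remains (to floating-point accuracy).
assert abs(first_order_conc(1.0, k, half_life) - 0.5) < 1e-12
```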
Activation energy and the Arrhenius equation
Why do most reactions need heat to proceed? Because the path from reactants to products usually passes through a high-energy arrangement of atoms called the transition state, and reaching the transition state requires energy. (A transition state is an energy maximum along the reaction path, not a stable intermediate.) The energy difference between reactants and the transition state is called the activation energy (Ea), and it determines how fast the reaction goes at a given temperature. The relationship is the Arrhenius equation: k = A·exp(−Ea/RT), where A is a pre-exponential factor, R is the gas constant, T is absolute temperature. The exponential dependence means small changes in activation energy or temperature produce large changes in rate — a 10°C increase in temperature roughly doubles most reaction rates near room temperature.
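The doubling rule of thumb follows directly from the Arrhenius equation. The sketch below checks it for an illustrative activation energy of ~50 kJ/mol, a typical mid-range value:

```python
import math

# Arrhenius check of the rule of thumb: for Ea ~ 50 kJ/mol (illustrative),
# a 10 C rise near room temperature roughly doubles the rate constant.

R = 8.314e-3   # gas constant, kJ/(mol K)

def arrhenius_k(a, ea, t):
    return a * math.exp(-ea / (R * t))

ratio = arrhenius_k(1.0, 50.0, 308.0) / arrhenius_k(1.0, 50.0, 298.0)
assert 1.8 < ratio < 2.1   # roughly a factor of two
```

For smaller Ea the factor is below two and for larger Ea it is above, which is why the rule is only a rule of thumb.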
Catalysis
A catalyst speeds up a reaction by providing an alternative path with lower activation energy, without being consumed. Catalysts are central to both biology (enzymes are catalysts; Section 7) and industry (the Haber-Bosch process for ammonia, catalytic converters in cars, the thousands of industrial catalytic processes). Computational catalyst design — predicting which combinations of materials and structures will catalyse a given reaction — is a major application of AI methods (Ch 14 develops it), and the methodology connects to the materials-science material throughout.
Equilibrium and Le Chatelier's principle
Most reactions don't go to completion; they reach a balance point at which the forward and reverse rates are equal — the reaction is at equilibrium. The position of equilibrium is described by the equilibrium constant Keq = [products]/[reactants] (with appropriate stoichiometric exponents), which depends on temperature but not on starting concentrations. A large Keq (≫1) means the reaction goes mostly to products; a small Keq (≪1) means the reaction barely proceeds. The thermodynamic interpretation of Keq (Section 6) connects equilibrium to free-energy differences.
Le Chatelier's principle describes how equilibria respond to perturbations: if you increase one reactant's concentration, the equilibrium shifts toward more products; if you increase pressure on a gas-phase reaction, the equilibrium shifts toward the side with fewer gas molecules; if you increase temperature on an exothermic reaction, the equilibrium shifts toward reactants. The principle is qualitative but useful for predicting how chemical systems respond to changes — and increasingly relevant for industrial-process design where ML methods optimise reactor conditions.
Thermodynamics
Whether a reaction is fast (kinetics, Section 5) and whether it goes at all (thermodynamics, this section) are different questions. Thermodynamics provides the rules that govern energy changes in chemical systems, and understanding them is essential for understanding why some reactions happen spontaneously while others don't.
The first law: energy conservation
The first law of thermodynamics says that energy can be neither created nor destroyed — it can only be converted between forms. Chemical reactions involve energy changes: bonds in reactants are broken (requiring energy input) and bonds in products are formed (releasing energy). The net energy change is the difference. Exothermic reactions release net energy (typically as heat); endothermic reactions require net energy input. Combustion is exothermic; melting ice is endothermic.
Enthalpy
For most chemistry done at constant pressure (i.e., open to the atmosphere), the energy change of interest is the enthalpy change ΔH, measured in kJ/mol. ΔH < 0 for exothermic reactions; ΔH > 0 for endothermic. The reaction enthalpy is approximately the energy needed to break the reactant bonds minus the energy released in forming the product bonds — strong product bonds and weak reactant bonds give a negative ΔH (exothermic, energetically favourable on enthalpy grounds alone). But enthalpy alone is not the whole story.
The second law: entropy
The second law of thermodynamics says that the total entropy of an isolated system never decreases. Entropy (S) is a measure of disorder — more accessible microscopic configurations mean higher entropy. A pile of Lego bricks in a box has higher entropy than the same bricks arranged into a model; a mixture of gases has higher entropy than the same gases segregated; an unfolded protein has higher entropy than a folded one (approximately, with caveats around protein-water entropy that Section 7 develops).
The second law's statistical-mechanical interpretation: entropy quantifies how many distinct microscopic arrangements are consistent with a given macroscopic state. Spontaneous processes proceed in the direction that increases the total number of accessible arrangements — toward more disorder, on average. The connection to information theory (S = k·ln(W), where W is the number of arrangements and k is Boltzmann's constant) is real and runs both directions: thermodynamic entropy and Shannon's information entropy are mathematically the same object.
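The S = k·ln(W) relation is simple enough to compute directly. A minimal sketch (the microstate counts and helper name are illustrative):

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K (exact in the 2019 SI)

def boltzmann_entropy(n_microstates: int) -> float:
    """S = k * ln(W): entropy of a macrostate with W equally likely microstates."""
    return K_B * math.log(n_microstates)

# Doubling the number of accessible arrangements adds k*ln(2) of entropy,
# i.e. exactly one bit's worth in thermodynamic units -- the Shannon connection.
delta_s = boltzmann_entropy(2_000_000) - boltzmann_entropy(1_000_000)
print(math.isclose(delta_s, K_B * math.log(2)))  # True: the increment is independent of W
```

The logarithm is what makes entropy additive: multiplying arrangement counts (independent subsystems) adds entropies.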
Gibbs free energy
For chemistry at constant temperature and pressure (i.e., most laboratory and biological chemistry), the relevant quantity that determines spontaneity is Gibbs free energy: G = H − TS. The change ΔG = ΔH − TΔS combines enthalpy and entropy contributions. A reaction proceeds spontaneously when ΔG < 0; it will not proceed (and the reverse will run instead) when ΔG > 0; at ΔG = 0 the system is at equilibrium. The relationship between Gibbs free energy and the equilibrium constant is exact: ΔG° = −RT·ln(Keq), where ΔG° is the standard free-energy change. This equation is among the most-used in chemistry — it lets you convert between equilibrium constants (which are measurable) and free-energy changes (which are interpretable in terms of bond energies and entropy contributions).
Both terms in ΔG = ΔH − TΔS matter. Some spontaneous reactions are enthalpy-driven (large negative ΔH outweighs unfavourable ΔS) — combustion is the canonical example. Some are entropy-driven (positive ΔS outweighs unfavourable ΔH) — the dissolution of certain salts is endothermic but spontaneous because the entropy gain dominates. Most biological processes operate near the boundary, with ΔH and TΔS comparable, and small changes in conditions can flip the direction. Understanding which factor dominates a given reaction is a routine analytical exercise for chemists and biochemists.
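The ΔG° = −RT·ln(Keq) relation above is a two-line conversion in code; a sketch at 298 K (function names are my own):

```python
import math

R = 8.314  # gas constant, J/(mol*K)

def delta_g_standard(k_eq: float, temp_k: float = 298.15) -> float:
    """Standard free-energy change (kJ/mol) from an equilibrium constant."""
    return -R * temp_k * math.log(k_eq) / 1000.0

def k_eq_from_delta_g(dg_kj_mol: float, temp_k: float = 298.15) -> float:
    """Inverse: equilibrium constant from a standard free-energy change."""
    return math.exp(-dg_kj_mol * 1000.0 / (R * temp_k))

# Each factor of 10 in K_eq is worth about 5.7 kJ/mol at 298 K
print(round(delta_g_standard(10.0), 1))                     # -5.7
print(round(k_eq_from_delta_g(delta_g_standard(10.0)), 6))  # 10.0 (round trip)
```

The ~5.7 kJ/mol-per-decade figure is worth internalising: it is why modest free-energy changes produce equilibrium constants spanning many orders of magnitude.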
Coupling reactions: ATP and biology
Biology faces a thermodynamic problem: many essential reactions (synthesising proteins, synthesising DNA, transporting molecules across membranes against gradients) are non-spontaneous on their own. The solution is reaction coupling — pair the non-spontaneous reaction with a strongly spontaneous one, so the combined ΔG is negative. The universal energy-coupling currency is ATP (adenosine triphosphate). Hydrolysis of ATP to ADP plus phosphate releases ~30 kJ/mol of free energy, which is enough to drive many otherwise non-spontaneous biological reactions when properly coupled. Cells continuously make and break ATP — the typical adult human turns over their entire body weight of ATP every day, recycling it through metabolism. The methodology of biochemistry rests substantially on understanding which reactions need ATP coupling and how it works.
The hydrophobic effect
A specific thermodynamic phenomenon worth flagging because it dominates biology: the hydrophobic effect. Why do oil and water separate? Why do hydrophobic amino acid side chains cluster on the inside of folded proteins? The mechanism is entropic: water molecules around a non-polar surface are forced to adopt restricted orientations (to maintain hydrogen bonding among themselves), reducing their entropy. When non-polar surfaces aggregate, fewer water molecules are forced into these restricted positions, and the freed water gains entropy. The effect is paradoxical (entropy increases when oil-like molecules clump) but real, and it is the dominant force driving protein folding, membrane formation, and most ligand-protein binding events.
Organic Chemistry
Organic chemistry is the chemistry of carbon-based compounds. The name dates from a nineteenth-century distinction between substances of biological origin ("organic") and substances from non-living sources ("inorganic"); the modern definition is more technical (carbon-containing) but the original biological connection is real — almost all of biology is organic chemistry, and almost all small-molecule drugs are organic compounds.
Why carbon is special
Why does carbon play this central role? Several reasons converge. Carbon has four valence electrons, which lets it form four covalent bonds — more than the alternatives (oxygen forms two, nitrogen three) and producing a wide variety of structural possibilities. Carbon-carbon bonds are unusually strong and stable, allowing the formation of long chains, rings, and complex networks. Carbon's small size (compared to silicon, the next element down its group) makes its bonds compact and its compounds biologically tractable. The combination — four-bond capability, strong C-C bonds, small size — produces a uniquely diverse chemistry that other elements cannot match. Silicon-based life has been speculated about; carbon-based life is what we observe.
Hydrocarbons: the foundation
The simplest organic compounds are hydrocarbons — molecules containing only carbon and hydrogen. They come in several flavours. Alkanes have only single bonds (CH₄ methane, C₂H₆ ethane, C₃H₈ propane, etc., extending to crude-oil constituents and beyond). Alkenes have at least one C=C double bond (ethylene C₂H₄, the basic monomer for polyethylene). Alkynes have at least one C≡C triple bond (acetylene). Aromatic compounds have specific resonance-stabilised ring structures, the canonical example being benzene (C₆H₆), a six-carbon ring with alternating single and double bonds (or, more accurately, a delocalised π-system that gives aromatic compounds their distinctive stability). Aromatic rings are everywhere in biology and pharmaceuticals.
Functional groups
Beyond hydrocarbons, organic chemistry is organised around functional groups — characteristic atomic groupings that confer specific chemical behaviour regardless of the rest of the molecule. The major functional groups every AI reader should recognise:
Alcohols (R-OH) — a hydroxyl group on a carbon. Methanol, ethanol, glycerol. Polar, hydrogen-bond donors and acceptors, often biologically active. Ethers (R-O-R) — an oxygen between two carbons. Less reactive than alcohols, common as solvents (diethyl ether). Amines (R-NH₂, R₂NH, R₃N) — nitrogen with one, two, or three carbons attached. Basic; can accept a proton to become positively charged. Found in essentially all biology (amino acids, neurotransmitters, alkaloids).
Carbonyls (R-C=O) are the family of functional groups containing a C=O double bond, including aldehydes (carbonyl with at least one H attached), ketones (carbonyl flanked by two carbons), carboxylic acids (R-COOH, where the carbonyl carbon also has an OH — these are acidic, with pKa typically ~4–5), esters (R-COO-R, where the OH is replaced by an OR group — the basis of fats and many fragrances), and amides (R-CO-NH-R, where the OH is replaced by an amine nitrogen — the linkage between amino acids in proteins, called the peptide bond when it occurs in proteins). Carbonyl-based functional groups are everywhere in biology and pharmaceuticals.
Halides (R-X where X is F, Cl, Br, I) appear frequently in pharmaceuticals because they often improve binding affinity or metabolic stability. Phosphates (R-OPO₃²⁻) are essential for biology — DNA, RNA, ATP, and membrane phospholipids all contain phosphates. Sulphates (R-OSO₃⁻), sulphonamides, and thiols (R-SH) round out the major classes. Recognising functional groups by name and structure is essential vocabulary; it is roughly the chemistry analogue of recognising parts of speech in a sentence.
Reactions of organic chemistry
Organic chemistry has a rich repertoire of reactions, organised by which functional groups they involve. A practising organic chemist learns hundreds of named reactions (the Diels-Alder, the Wittig, the Suzuki coupling, the aldol condensation, and dozens of others); an AI reader does not need this depth but should recognise broad categories. Nucleophilic substitution replaces one substituent with another on a carbon (the SN1 and SN2 mechanisms differ in their kinetics and stereochemistry). Electrophilic addition adds atoms across multiple bonds (typical of alkene chemistry). Elimination reactions form new double bonds. Coupling reactions (typically transition-metal-catalysed) form new C-C bonds — these are the workhorse of modern pharmaceutical synthesis, with the 2010 Nobel Prize in Chemistry going to Heck, Negishi, and Suzuki for their development.
Retrosynthesis — working backwards from a target molecule to identify the simpler precursors that could synthesise it — is the central problem-solving skill of synthetic chemistry. Modern AI methods (Ch 08) include substantial retrosynthesis-prediction work, with transformer-based systems trained on millions of literature reactions to suggest synthetic routes.
Biomolecules
Most of biology reduces to four classes of organic molecules. Carbohydrates (sugars) are polyhydroxy aldehydes or ketones — glucose, fructose, sucrose; their polymers (starch, glycogen, cellulose) are the major energy-storage and structural molecules of life. Lipids are fats and related hydrophobic molecules — triglycerides for energy storage, phospholipids for cell membranes, steroids (cholesterol, the various hormones) for signalling. Amino acids (twenty standard ones) are organic molecules with both amine and carboxylic acid groups; they polymerise into proteins (Ch 03). Nucleotides are sugar-phosphate-base assemblies; they polymerise into DNA and RNA (Ch 05). Each class has its own reactions, its own AI applications, and its own connection to the methodology of computational biology.
Drug-like molecules and Lipinski's rules
A specific application worth flagging: most small-molecule drugs are organic molecules with restricted size and physical properties. Christopher Lipinski's 1997 "rule of five" summarised the empirical pattern: orally-bioavailable drugs typically have molecular weight < 500 Da, log P < 5 (a measure of lipophilicity), < 5 hydrogen-bond donors, < 10 hydrogen-bond acceptors. The rule isn't strict — many marketed drugs violate one or more criteria — but it captures the regime where most successful small-molecule drugs live. Modern AI methods for drug discovery (Ch 08) routinely use Lipinski properties as features and increasingly as constraints during generative design.
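As a sketch of how the rule of five is applied in practice (the dataclass and the aspirin property values are illustrative; real pipelines compute these properties with a toolkit such as RDKit):

```python
from dataclasses import dataclass

@dataclass
class MolProps:
    # Physicochemical properties, in practice computed from structure with a toolkit
    mol_weight: float   # molecular weight, Da
    log_p: float        # octanol-water partition coefficient (lipophilicity)
    h_donors: int       # hydrogen-bond donors (OH + NH counts)
    h_acceptors: int    # hydrogen-bond acceptors (N + O counts)

def lipinski_violations(m: MolProps) -> int:
    """Count rule-of-five violations; 0 or 1 is conventionally 'drug-like'."""
    return sum([
        m.mol_weight > 500,
        m.log_p > 5,
        m.h_donors > 5,
        m.h_acceptors > 10,
    ])

# Aspirin: MW ~180 Da, logP ~1.2, 1 donor, 4 acceptors -> no violations
print(lipinski_violations(MolProps(180.2, 1.2, 1, 4)))  # 0
```

In generative-design pipelines the same check typically runs as a filter or a reward term rather than a hard gate, reflecting the rule's empirical, non-strict character.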
Quantum and Physical Chemistry Essentials
The previous sections described chemistry phenomenologically — atoms bond, molecules react, equilibria settle. The deeper layer that explains why these things happen is quantum mechanics. AI methods for chemistry increasingly engage with this quantum layer (DFT-trained interatomic potentials, neural quantum-chemistry methods), so understanding what's going on is essential.
The Schrödinger equation and atomic orbitals
The electron's behaviour in an atom is described by quantum mechanics. The fundamental equation is the Schrödinger equation: Ĥψ = Eψ, where Ĥ is the Hamiltonian operator (encoding the kinetic and potential energies), ψ is the wavefunction (a complex-valued function whose squared magnitude gives the probability of finding the electron at each location), and E is the corresponding energy. Solving the Schrödinger equation for the hydrogen atom (one electron, one proton) gives the familiar atomic orbitals — 1s, 2s, 2p, 3s, 3p, 3d, and so on — with their characteristic shapes and energies.
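The hydrogen solution yields closed-form energies, E_n = −13.6 eV/n², depending only on the principal quantum number n, which makes for a compact worked example (helper name is my own):

```python
RYDBERG_EV = 13.605693  # hydrogen ionisation energy from n=1, in eV

def hydrogen_level(n: int) -> float:
    """Exact hydrogen-atom energy E_n = -13.6 eV / n^2; negative = bound."""
    return -RYDBERG_EV / n**2

# The n=2 -> n=1 transition emits a ~10.2 eV photon (the Lyman-alpha line)
print(round(hydrogen_level(2) - hydrogen_level(1), 1))  # 10.2
```

The 1/n² spacing is special to the Coulomb potential; multi-electron atoms break the degeneracy between s, p, and d orbitals of the same n, which is what gives the periodic table its block structure.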
For atoms with more electrons, exact solutions to the Schrödinger equation become intractable, and chemists use approximations. The standard procedure: solve approximately for each electron in the average field of all the others, then iterate (the self-consistent field, SCF, method). This produces the multi-electron orbital picture that grounds most of chemistry. The orbitals that emerge — the 1s through 4f shapes, with their symmetries and nodal patterns — are the basis vocabulary for everything quantum chemistry does.
Molecular orbital theory
When atoms bond into molecules, their atomic orbitals combine into molecular orbitals (MOs) that span the whole molecule. The combination follows specific rules: orbitals add in phase to produce bonding orbitals (lower energy than the constituent atomic orbitals); they add out of phase to produce antibonding orbitals (higher energy). Electrons fill bonding orbitals first, which lowers the system's energy and produces the bond. The methodology of molecular orbital theory (MO theory) is the dominant framework for understanding bonding quantitatively; it explains everything from why O₂ is paramagnetic (it has unpaired electrons in degenerate antibonding π* orbitals) to why benzene is so stable (it has six π electrons spread across three bonding π orbitals).
Two specific molecular orbitals deserve names because they appear repeatedly: the HOMO (Highest Occupied Molecular Orbital, the highest-energy filled orbital) and the LUMO (Lowest Unoccupied Molecular Orbital). Reactivity is largely about HOMO-LUMO interactions: a nucleophile (electron donor) uses its HOMO; an electrophile (electron acceptor) uses its LUMO; the reaction's feasibility depends on the energy gap between them. Modern AI methods often predict HOMO and LUMO energies as standard outputs, since they correlate with so many practical properties.
Hartree-Fock and post-HF methods
The standard ab initio approach to solving the molecular Schrödinger equation is the Hartree-Fock (HF) method — essentially the molecular extension of the SCF procedure for atoms. HF treats each electron in the average field of all others, with appropriate antisymmetry to satisfy the Pauli principle. It is conceptually simple, broadly accurate for many properties, and serves as the starting point for more accurate methods. Its main limitation: HF doesn't capture electron correlation — the way electrons avoid each other beyond the average-field treatment. Post-HF methods (Møller-Plesset perturbation theory, coupled-cluster theory, configuration interaction) add correlation in various ways, with progressively higher accuracy and computational cost. The most accurate, CCSD(T) (coupled cluster with singles, doubles, and perturbative triples), is sometimes called "the gold standard" of computational quantum chemistry, but its O(N⁷) scaling in basis-set size limits it to small molecules.
Density Functional Theory
The dominant practical approach for medium-to-large systems is Density Functional Theory (DFT). Instead of working with the full multi-electron wavefunction (a function of 3N coordinates for N electrons), DFT works with the electron density (a function of just 3 coordinates). The Hohenberg-Kohn theorems (1964) showed that the ground-state density determines all properties of the system, and Kohn-Sham DFT (1965) provided a practical procedure: introduce auxiliary single-particle orbitals whose density matches the true density, and approximate the troublesome exchange-correlation contribution with a functional. The 1998 Nobel Prize in Chemistry went to Walter Kohn for these foundations, and DFT has been the workhorse method of computational chemistry for forty years.
DFT's central trade-off is the choice of exchange-correlation functional. Different functionals (LDA, GGA, hybrid functionals like B3LYP, the more recent meta-GGAs, the various range-separated functionals) trade off accuracy and computational cost. Choosing an appropriate functional is a major skill for computational chemists, and DFT-predicted properties depend non-trivially on the choice. The methodology is still empirically tuned in ways that bother some theorists, but the empirical track record on practical problems has been strong, and DFT-based property prediction is the substrate of most modern AI-for-materials applications (Ch 14).
Why quantum chemistry is computationally expensive
Why can't we just solve the Schrödinger equation directly for whatever we want? The fundamental difficulty is scaling. Even DFT scales as O(N³) or O(N⁴) with system size for routine functionals, while CCSD(T) scales as O(N⁷). A typical drug-target binding event involves ~5,000 atoms (counting solvent); a typical protein has ~10,000–100,000 atoms; a typical solid-state simulation cell has ~100–1,000 atoms. Direct quantum-chemical treatment of these systems with high-accuracy methods is prohibitive, which is why ML-trained interatomic potentials (Ch 14) — neural networks that learn to mimic DFT or CCSD(T) at orders-of-magnitude lower cost — are the active frontier of the field. The methodology connects directly to the equivariant-network material of Ch 01 Section 8 (chemistry has rotational and translational symmetries that ML potentials must respect) and forms a substantial part of the modern AI-for-chemistry methodology.
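The scaling exponents quoted above translate directly into cost multipliers; a toy calculation (asymptotic only, constant factors ignored):

```python
# Relative cost of growing a calculation from n0 to n1 "size units"
# under a given asymptotic scaling exponent -- illustrative only.
def cost_ratio(n0: int, n1: int, power: int) -> float:
    return (n1 / n0) ** power

for name, p in [("DFT ~O(N^3)", 3), ("CCSD(T) ~O(N^7)", 7)]:
    print(f"{name}: 2x system -> {cost_ratio(1, 2, p):.0f}x cost")
# DFT ~O(N^3): 2x system -> 8x cost
# CCSD(T) ~O(N^7): 2x system -> 128x cost
```

An ML potential, by contrast, typically scales near-linearly in atom count at inference time, which is the quantitative case for the amortisation strategy: pay the quantum-chemistry cost once to generate training data, then run the cheap surrogate.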
Spectroscopy and Characterization
Once you've made a chemical, how do you know what it is? Spectroscopy is the answer — using the interaction of matter with electromagnetic radiation to identify and characterise molecules. Different spectroscopic techniques probe different aspects of molecular structure, and together they form the empirical foundation of modern chemistry. They are also increasingly used as ML training data.
NMR spectroscopy
Nuclear Magnetic Resonance (NMR) is the single most-informative technique for organic structure determination. The principle: certain atomic nuclei (¹H, ¹³C, ¹⁵N, ³¹P most commonly) have intrinsic spin and behave like tiny magnets. In a strong magnetic field, they precess at frequencies determined by their local electronic environment; radio-frequency pulses excite them and the relaxation back is measured. The result is a spectrum showing one peak per chemically-distinct nucleus, with positions (chemical shifts) characteristic of the local chemistry and intensities proportional to abundance.
Modern NMR can determine complete molecular structures for moderate-sized molecules and partial structures for proteins (this is one of the techniques mentioned in Ch 03 for protein structure determination). Multidimensional NMR (2D, 3D, 4D) correlates signals from different nuclei to reveal which atoms are bonded to which. The methodology is mature and accuracy is high, but interpretation requires expertise. Modern AI methods can predict NMR spectra from structures (forward problem) and increasingly identify structures from spectra (inverse problem), making NMR a fertile area for ML applications.
Infrared (IR) spectroscopy
Infrared spectroscopy measures vibrations of molecular bonds. Each bond type vibrates at a characteristic frequency in the infrared range — C=O bonds vibrate around 1700 cm⁻¹, O-H around 3200–3600, N-H around 3300, aromatic C-H around 3000. An IR spectrum shows absorption peaks at the frequencies of bonds present, providing a fingerprint that identifies functional groups. IR is fast and cheap (a typical spectrum takes seconds to minutes), and it is the standard method for confirming functional-group composition during synthesis.
UV-Visible spectroscopy
UV-Vis spectroscopy probes electronic transitions — electrons jumping from occupied molecular orbitals to unoccupied ones. The energy required corresponds to ultraviolet or visible light, and absorption produces colour. UV-Vis is most useful for molecules with extensive π-systems (aromatic rings, conjugated double bonds, dyes); aromatic compounds typically absorb in the UV, and chromophores (highly-conjugated systems like β-carotene or anthocyanins) absorb in the visible producing colour. The methodology is straightforward, the data is quantitative (Beer-Lambert law: absorbance is proportional to concentration), and it is the standard method for measuring concentrations of light-absorbing compounds.
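The Beer-Lambert calculation is a one-liner; a sketch with illustrative numbers (the molar absorptivity is hypothetical):

```python
# Beer-Lambert law: A = epsilon * l * c
# epsilon: molar absorptivity (L mol^-1 cm^-1), l: path length (cm), c: concentration (mol/L)
def concentration(absorbance: float, epsilon: float, path_cm: float = 1.0) -> float:
    """Concentration in mol/L from a measured absorbance."""
    return absorbance / (epsilon * path_cm)

# A strongly absorbing dye with epsilon = 50,000, measured at A = 0.5 in a 1 cm cuvette
print(concentration(0.5, 5e4))  # 1e-05 mol/L, i.e. 10 micromolar
```

The linearity holds only at moderate absorbance (roughly A < ~1–2 on typical instruments); above that, stray light and detector limits bend the curve, which is why quantitative work dilutes into the linear range.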
Mass spectrometry
Mass spectrometry (MS) measures the mass-to-charge ratio of ionised molecules. The methodology: ionise the sample, separate the ions by m/z in a magnetic or electric field, detect them. The result is a spectrum showing peak intensities at various m/z values, from which the molecular weight can be read directly. Modern high-resolution MS (Orbitrap, Q-TOF) can distinguish isotopologues (¹²C vs. ¹³C versions of the same molecule) and provide enough mass accuracy to determine elemental compositions uniquely. Combined with chromatographic separation (LC-MS, GC-MS), MS is the workhorse technique for analysing complex mixtures — proteomics samples, metabolomics samples, drug-pharmacokinetics samples, environmental samples. AI methods for spectrum interpretation (predicting fragmentation patterns from structure, identifying compounds from spectra) are an active research area.
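A worked example of why high mass accuracy pins down elemental composition: exact isotope masses are not integers, so each formula carries a distinctive fractional "mass defect" (the mass table is abbreviated and the helper is my own):

```python
# Monoisotopic masses of the most abundant isotopes, in Da (values rounded)
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915}

def monoisotopic_mass(formula: dict[str, int]) -> float:
    """Sum exact isotope masses -- the mass a high-resolution MS peak corresponds to."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items())

# Glucose, C6H12O6: nominal mass 180, exact monoisotopic mass ~180.063 Da.
# Different formulas with the same nominal mass have different fractional parts,
# which is what lets a high-resolution instrument assign the formula uniquely.
print(round(monoisotopic_mass({"C": 6, "H": 12, "O": 6}), 3))  # 180.063
```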
X-ray crystallography
X-ray crystallography determines atomic-resolution 3D structures by diffracting X-rays through crystallised samples. The methodology is described in detail in Ch 03 (for proteins) and Ch 14 (for materials); briefly, the diffraction pattern is the Fourier transform of the crystal's electron density, and inverse Fourier transformation (with phase information from various sources) reconstructs the 3D structure. X-ray crystallography produced essentially all of structural biology before AlphaFold and is still the gold standard for atomic-resolution structures. The Protein Data Bank (PDB) holds about 200,000 X-ray structures of proteins; the Cambridge Structural Database (CSD) holds about 1.2 million X-ray structures of small molecules.
Spectroscopic data as ML substrate
The 2024–2026 wave of AI methods increasingly uses spectroscopic data as training input. Predicting NMR spectra from structures is a routine application; the inverse problem (predicting structure from spectra) is a more ambitious frontier. Mass-spectrometry-trained foundation models for metabolomics are an active area. Deep learning on chromatographic and MS data substantially improves identification of compounds in complex mixtures. The general pattern: spectroscopic data is abundant (every chemistry experiment produces some), labelled (chemists annotate their spectra), and information-rich (each spectrum encodes substantial structural information), making it well-matched to modern ML methods.
Chemistry Meets AI: Representations and the Frontier
The previous sections developed chemistry. This final section connects chemistry to AI — what representations get used, what problems are tractable, what frontier the field is operating on, and what comes next in Part XV.
Molecular representations for ML
How do you give a molecule to a neural network? Several approaches dominate. SMILES (Simplified Molecular Input Line Entry System) serialises a molecule as a string of characters: methane is "C", ethanol is "CCO", benzene is "c1ccccc1", aspirin is "CC(=O)Oc1ccccc1C(=O)O". The encoding is compact, human-readable, and amenable to language-model architectures (RNN, transformer-based). It has known weaknesses — multiple valid SMILES strings can represent the same molecule, requiring canonicalisation; small string changes can produce invalid molecules — but it remains the dominant string format. SELFIES (Self-referencing Embedded Strings, Krenn et al. 2020) addresses some of SMILES's invalidity issues with a grammar that guarantees every string corresponds to a valid molecule.
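To make the string-to-model pipeline concrete, here is a minimal atom-wise SMILES tokenizer of the kind used to feed molecules to sequence models (the regex is a simplified illustration covering common organic-subset tokens, not a full SMILES grammar):

```python
import re

# Two-letter elements (Br, Cl) and bracket atoms must be matched before
# single characters, or "Cl" would wrongly split into "C" + "l".
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFIbcnops]|[0-9]|=|#|\(|\)|\+|-|/|\\|@|\.)"
)

def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into chemically meaningful tokens."""
    return SMILES_TOKEN.findall(smiles)

print(tokenize("CCO"))       # ['C', 'C', 'O']  (ethanol)
print(tokenize("c1ccccc1"))  # ['c', '1', 'c', 'c', 'c', 'c', 'c', '1']  (benzene)
```

Tokenising at the atom level rather than the character level is standard practice; it keeps multi-character tokens like `Cl` and `[NH3+]` atomic, which markedly improves sequence-model behaviour.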
Molecular graphs represent molecules directly as graph data structures: nodes are atoms (with element, charge, hybridisation, formal connectivity attributes), edges are bonds (with type, stereochemistry attributes). This representation matches the underlying chemistry directly and is the natural input for graph neural networks (Part XIII Ch 05). Modern AI for chemistry (Ch 08, Ch 14) uses graph-based methods extensively, with architectures like GIN, MPNN, GNN-based variants tailored to chemistry, and the equivariant successors (NequIP, MACE) that respect the rotational symmetries of 3D molecules.
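A sketch of the graph representation for ethanol (featurisation deliberately minimal and hydrogens left implicit, as most chemistry GNN pipelines do; real node features include charge, hybridisation, and more):

```python
# Ethanol (CCO) as an explicit molecular graph: nodes carry the element,
# edges carry the bond order -- the structure a GNN's message passing runs over.
atoms = ["C", "C", "O"]             # node features (element symbol only, here)
bonds = [(0, 1, 1.0), (1, 2, 1.0)]  # edge list: (atom i, atom j, bond order)

# Adjacency-list view: for each atom, its bonded neighbours with bond orders
adjacency: dict[int, list[tuple[int, float]]] = {i: [] for i in range(len(atoms))}
for i, j, order in bonds:
    adjacency[i].append((j, order))
    adjacency[j].append((i, order))  # bonds are undirected

print(adjacency[1])  # [(0, 1.0), (2, 1.0)] -- the middle carbon bonds to both neighbours
```

A message-passing layer then updates each node's feature vector from its entries in this adjacency structure, which is why the representation needs no canonicalisation: graph isomorphism replaces string identity.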
Molecular fingerprints are fixed-length binary or count vectors that capture substructure information. The classical Morgan/ECFP fingerprints encode the presence or absence of each circular fragment up to a given radius around each atom; the resulting vector is fixed-length, comparable across molecules, and amenable to similarity-based methods. Fingerprints predate modern deep learning but remain useful baselines and are increasingly combined with neural representations in production cheminformatics pipelines.
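Similarity search on binary fingerprints reduces to set arithmetic; a sketch with hypothetical bit positions (real pipelines fold hashed circular substructures into e.g. 2048-bit vectors):

```python
# Tanimoto similarity on binary fingerprints: |A & B| / |A | B|.
# Fingerprints are represented here as sets of "on" bit positions.
def tanimoto(fp_a: set[int], fp_b: set[int]) -> float:
    if not fp_a and not fp_b:
        return 1.0  # convention: two empty fingerprints are identical
    return len(fp_a & fp_b) / len(fp_a | fp_b)

a = {1, 5, 9, 42, 101}       # hypothetical fingerprint bits for molecule A
b = {1, 5, 9, 42, 200, 300}  # hypothetical fingerprint bits for molecule B
print(tanimoto(a, b))  # 4 shared bits / 7 total bits = 0.571...
```

A Tanimoto of ~0.85 or above on ECFP4 fingerprints is a common (and contested) working threshold for "similar enough to share activity" in virtual screening.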
Cheminformatics: the older discipline
Cheminformatics is the pre-deep-learning discipline that developed methods for storing, searching, and analysing chemical data. The classical tools — RDKit (the dominant open-source cheminformatics library), chemical-similarity searching (Tanimoto coefficients on fingerprints), QSAR (quantitative structure-activity relationship) modelling, ADMET property prediction (Lipinski-style rules and successors) — remain widely deployed and form the substrate on top of which modern AI methods build. An AI reader entering chemistry should know the major cheminformatics tools exist and what they do; Ch 08 develops the AI-driven extensions in detail.
Where AI for chemistry lives in this compendium
The applied AI-for-chemistry chapters of Part XV develop specific application areas. Ch 04 (AI for Protein Science) covers AlphaFold and the broader protein-structure-prediction methodology, which engages with chemistry at the amino-acid level. Ch 06 (AI for Biology) covers the genomic and cellular-biology AI methods, which engage with chemistry mostly through the molecular biology of nucleic acids and proteins. Ch 08 (AI for Drug Discovery & Molecular Design) is the most chemistry-intensive of the applied chapters — molecular representations, generative chemistry, docking, ADMET prediction, retrosynthesis. Ch 14 (AI for Materials Science) applies similar methods to solid-state chemistry — property prediction, ML interatomic potentials, generative crystal-structure design. Together they exercise nearly every concept this chapter has introduced.
The frontier as of 2026
Several frontiers are particularly active. Foundation models for chemistry — large pretrained models on hundreds of millions of molecules, fine-tuned to specific downstream tasks — have produced strong empirical results but are still less mature than their language-model counterparts. Generative chemistry — designing novel molecules with desired properties — has produced impressive demonstrations (the various 2024–2026 AI-designed drug candidates moving toward clinical trials) but limited validation at scale. ML interatomic potentials trained to reproduce DFT or CCSD(T) at modest cost have begun to displace classical force fields in molecular dynamics, with substantial implications for materials science and drug discovery. Reaction prediction and retrosynthesis — predicting what products will form from given reactants, or what reactants could synthesise a target — has matured substantially with transformer-based methods. The field is rapidly evolving and the methodology in 2028 will look different from 2026.
What this chapter does not cover
Several chemistry topics are out of scope. The substantial inorganic-chemistry literature on transition-metal chemistry, organometallics, and main-group chemistry beyond the basics is mostly skipped. Industrial-process chemistry (the chemical engineering of producing chemicals at scale, with its catalysts, reactor designs, and separation operations) is a major discipline that intersects this chapter but is not developed. Polymer chemistry — the chemistry of long-chain molecules with industrial applications — is its own substantial field. Surface chemistry and electrochemistry, while essential for catalysis and battery research, are barely touched. The chapter aimed at the chemistry concepts AI methods most often invoke; the broader landscape of chemistry is genuinely vast and several of these adjacent areas have their own active AI research communities.
Further reading
Foundational references for chemistry. Atkins' Chemical Principles, Clayden et al.'s Organic Chemistry, a quantum-chemistry text (Levine or Szabo & Ostlund), and an introductory cheminformatics reference together form the right starting kit for an AI reader serious about the field.
- Chemical Principles: The Quest for Insight. The standard general-chemistry textbook. Comprehensive coverage of atomic structure, bonding, thermodynamics, kinetics, equilibrium, and the major chemistry concepts at the level this chapter operates. Roughly 1,000 pages but well-written; the right starting reference for any AI reader entering chemistry seriously. The reference general-chemistry textbook.
- Organic Chemistry. The standard graduate-level organic-chemistry textbook in the UK and increasingly elsewhere. Strong on mechanism and the conceptual structure of organic reactions; the right reference for understanding the functional-group chemistry of Section 7 in depth, particularly for an AI reader who needs to engage with retrosynthesis or generative chemistry. The reference organic-chemistry textbook.
- Quantum Chemistry. A standard introduction to quantum chemistry at the graduate level. Covers the Schrödinger equation, atomic and molecular structure, Hartree-Fock, post-HF methods, and DFT. The natural reading for the Section 8 material in depth, and the foundation for understanding modern computational chemistry methodology. The reference quantum-chemistry textbook.
- Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory. A classic accessible introduction to ab initio quantum chemistry. Substantially shorter and more focused than Levine, with strong coverage of Hartree-Fock and the post-HF methods. Available as an inexpensive Dover reprint. The right reading for understanding the algorithms underlying modern quantum-chemistry codes. The classic ab initio reference.
- Inhomogeneous Electron Gas (Hohenberg-Kohn). The foundational paper of Density Functional Theory. Establishes that the ground-state electron density determines all properties of the system — the theorem that grounds DFT and underlies most modern computational chemistry. The 1998 Nobel Prize in Chemistry recognised this work and Kohn's subsequent contributions. The foundational DFT paper.
- Self-Consistent Equations Including Exchange and Correlation Effects (Kohn-Sham). The companion paper that made DFT computationally practical. Introduces the auxiliary single-particle orbitals (Kohn-Sham orbitals) and the exchange-correlation functional, producing the equations that underlie essentially every modern DFT code. The reference for practical DFT.
- Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings (Lipinski's Rule of Five). The original "rule of five" paper. Establishes the empirical pattern that orally-bioavailable drugs share specific physicochemical properties. The rule has become canonical in pharmaceutical chemistry and is referenced extensively in Ch 08; the 1997 paper remains the foundational citation. The reference for drug-like physicochemical properties.
- SMILES, a chemical language and information system. The original SMILES paper. Establishes the string-based molecular representation that has become the dominant data format for chemistry-AI applications. The methodology has been refined substantially over four decades but the original 1988 specification remains the foundation. The reference for SMILES notation.
- RDKit: Open-source cheminformatics software. The dominant open-source cheminformatics library. Provides the practical infrastructure for working with molecules computationally — SMILES parsing, structure manipulation, fingerprint computation, similarity searching, descriptor calculation, and integration with most modern Python ML frameworks. Essentially every AI-for-chemistry project in 2026 uses RDKit at some point. The reference cheminformatics infrastructure.
- Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation. The SELFIES paper. Addresses the SMILES-invalidity problem with a grammar-based representation where every string corresponds to a chemically-valid molecule. Increasingly used for generative chemistry where the model must produce novel valid molecules — particularly in VAE and diffusion-model-based design. The natural successor to SMILES for many AI applications. The reference for invalidity-robust molecular strings.