AI for Science

The chapter assumes the generative-modelling material from the Generative Models chapter, the Foundation Models chapter for the FM-as-substrate framing, and Deep Learning for architectural background. It is deliberately application-driven: each domain section develops one or two landmark systems mechanistically and refers to the literature for breadth.

Scope and What This Chapter Is About

The chapter develops AI for Science — the use of modern machine learning to address scientific problems where the goal is not just commercial deployment but discovery, prediction, design, or understanding of the natural world. We cover the major scientific domains where AI has produced consequential results (proteins, mathematics, chemistry, materials, biology, climate, physics), the methodological substrate (equivariance, scientific foundation models, generative design, learned simulators), and the evolving paradigm of AI-assisted scientific research. Open problems are flagged inline and consolidated in §14.

§1. Motivation and Scope

Three worked instances to anchor the chapter

The clearest way to define AI for Science is by example. Three landmark systems from 2020–2024, each from a different scientific domain:

1. AlphaFold 2 predicts protein structures from amino-acid sequences. Trained on the Protein Data Bank (~180K experimentally-determined structures) plus evolutionary information from sequence databases. The system takes an amino-acid sequence and predicts the 3D coordinates of every atom in the folded protein. For most proteins, the predicted structure matches experimental structures to within a few angstroms — accuracy approaching the precision of experimental methods themselves. The result transformed structural biology: a problem that had been the central open question of the field for 50 years was effectively solved by a single deep-learning system in 2020. The follow-up, AlphaFold 3 (2024), generalizes to protein-ligand, protein-DNA, and protein-protein complexes.

2. GNoME discovers stable crystal materials at scale. A graph neural network trained on density-functional-theory (DFT) calculations of crystal stability. Used in an active-learning loop with high-throughput DFT validation, GNoME (Merchant et al., 2023) predicted 380,000 stable crystal structures — comparable to the entire accumulated discovery of stable crystals by traditional chemistry. The result is now feeding downstream applications in batteries, catalysts, and superconductors.

3. AlphaProof proves mathematics at the level of Olympiad medalists. A combination of LLM-based reasoning and Lean (a formal proof assistant), trained with reinforcement learning. In 2024 AlphaProof, together with the geometry solver AlphaGeometry 2, solved four of the six problems of that year's International Mathematical Olympiad, a score at silver-medal level. It was the first AI system to reach medal-level performance on hard, newly written competition problems.

These three instances span structural biology, materials chemistry, and mathematics. They share little at the application level — different inputs, different outputs, different evaluation. They share method: each combines a deep neural network with a domain-specific structural representation (3D coordinates, crystal lattices, formal proofs) and a domain-specific training signal (experimental structures, DFT calculations, formal verification). The pattern is what this chapter calls AI for Science: deep learning applied to scientific prediction, discovery, or design, with substantive scientific consequence.

What AI for Science is

A working definition. AI for Science (AI4S) is the application of modern machine learning — primarily deep learning, often in the Foundation Model paradigm — to problems where:

The goal is to predict, discover, or design something in the natural world (a protein structure, a catalyst, a theorem, a weather forecast), not to deploy a product feature.
The success metric is scientific — accuracy against experimental measurements, novelty of discovery, generality of the predicted artefact — rather than user engagement or revenue.
The application interacts with traditional scientific methodology — published in scientific journals, evaluated against domain-specific benchmarks, validated experimentally.

The three components together define the subfield. The middle one is most important: AI for Science is differentiated from “ML applied to data that happens to be scientific” by its commitment to scientific success criteria. A neural-network classifier for medical imaging that improves clinical workflows is healthcare ML, not AI4S. A neural-network predictor for chemical reaction yields that discovers new catalysts is AI4S.

What AI for Science is not

Several boundaries worth flagging because they are sometimes blurred.

Not all ML on scientific data is AI4S. Statistical modelling, data analysis, and classical machine learning on scientific datasets have a long history. AI4S in the modern sense — deep learning, foundation models, generative design — is a specific subset emerging since roughly 2018.

Not all “science” is AI4S. Theoretical AI, computer-science research about AI, and AI methodology research are not AI4S even when published in scientific venues. The criterion is whether the work targets external scientific problems.

Not a separate discipline. AI4S researchers come from many places — AI labs, universities, pharmaceutical and biotech companies, climate-modelling centres, theoretical-physics groups. The methods substantially overlap with mainstream AI; the applications and evaluation are domain-specific.

Not “the next era” of AI. AI4S is one major application area among several (language, vision, robotics, content generation). It is consequential but not displacing other AI applications.

Why AI for Science matters in 2026

Four motivations a 2026 researcher should care about.

1. Several scientific problems have been substantially solved by AI. Protein structure prediction (AlphaFold 2/3); medium-range weather forecasting (GraphCast, Pangu-Weather); some theorem-proving subdomains (AlphaGeometry). These are not small results; they are large-scale scientific achievements with widespread downstream impact.

2. AI is becoming a scientific instrument. A 2026 structural biologist routinely uses AlphaFold-predicted structures as a starting point; a 2026 weather forecaster routinely uses GraphCast forecasts alongside traditional numerical weather prediction; a 2026 chemist uses ML-predicted properties to triage candidate molecules. AI4S systems are now part of the everyday scientific toolkit, not just published results.

3. The methodological substrate is rich and growing. Equivariant neural networks, scientific foundation models, generative design systems, learned simulators — each is a distinct methodological line with its own theoretical and engineering depth. AI4S is producing genuinely new ML methodology, not just applying off-the-shelf systems.

4. The scientific stakes are high. Drug discovery, materials for clean energy, climate adaptation, fundamental biology — the problems AI4S addresses have substantial real-world consequences. The potential upside justifies substantial investment, and the field is seeing that investment ($1B+ annually in industrial labs alone by 2026).

Cross-cutting methodological themes

Several themes recur across the chapter’s domain-specific sections. Worth flagging them at the outset:

Equivariance and symmetries. Physical-world data has symmetries — rotations, translations, permutations of identical particles. Models that respect these symmetries by construction (equivariant neural networks) are dramatically more sample-efficient and generalize better than models that learn the symmetries from data. SE(3)-equivariant networks (Thomas et al., 2018; Geiger and Smidt, 2022) appear in protein design, materials chemistry, and molecular generation; permutation-invariant graph neural networks appear throughout. We develop equivariance in §3, §5, §6 as needed.
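The difference between invariance and equivariance can be checked numerically. The sketch below is a toy illustration, assuming NumPy; `pairwise_distances` and `toy_force` are hypothetical stand-ins rather than components of any cited system. Distances are unchanged by a rigid motion (invariance), a vector-valued output rotates with the input (equivariance), and relabelling particles relabels the outputs (permutation equivariance).

```python
import numpy as np

def random_rotation(rng):
    """Draw a proper rotation matrix (det = +1) from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))          # fix column signs
    if np.linalg.det(q) < 0:             # turn a reflection into a rotation
        q[:, 0] = -q[:, 0]
    return q

def pairwise_distances(x):
    """Invariant feature: all inter-particle distances, shape (n, n)."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def toy_force(x):
    """Equivariant toy map: each particle is pulled toward the others,
    weighted by exp(-distance). Output is one 3-vector per particle."""
    diff = x[None, :, :] - x[:, None, :]            # diff[i, j] = x_j - x_i
    weights = np.exp(-np.linalg.norm(diff, axis=-1))
    return (weights[..., None] * diff).sum(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))
R = random_rotation(rng)
t = rng.normal(size=3)
x_moved = x @ R.T + t                                # rotate, then translate

invariant_ok = np.allclose(pairwise_distances(x), pairwise_distances(x_moved))
equivariant_ok = np.allclose(toy_force(x_moved), toy_force(x) @ R.T)

perm = rng.permutation(5)
permutation_ok = np.allclose(toy_force(x[perm]), toy_force(x)[perm])
```

A model built only from such operations satisfies the symmetry for every input, so no training data has to be spent teaching it that a rotated molecule is the same molecule.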

Foundation-model-as-substrate. Many AI4S systems start from a pretrained foundation model — protein language model (ESM, Evo), molecular foundation model (MolFormer), genomic foundation model (Enformer, AlphaGenome), even general-purpose LLMs (GPT-4, Claude) as scientific reasoning systems. The FM paradigm of the Foundation Models chapter applies here, with domain-specific pretraining objectives and data.

Generative design. AI4S frequently uses generative models (from the Generative Models chapter) for design tasks: generate novel proteins, novel molecules, novel materials, novel theorems. The §3, §5, §6 sections develop this in domain depth.

Learned simulators. Many scientific applications involve simulating physical systems (molecular dynamics, climate, fluid dynamics, particle physics). Learned simulators — neural networks trained to mimic physics — are substantially faster than traditional simulators (sometimes by 10⁶ or more), trading some accuracy for speed. §6, §8, §9 develop instances.

Active learning loops. Scientific data is expensive — a DFT calculation, an experimental synthesis, a clinical trial. AI4S workflows often combine ML prediction with selective experimental validation: predict, choose the most informative candidates, validate experimentally, retrain. This is the active-learning loop, central to GNoME, autonomous chemistry, and several other systems.
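The predict, choose, validate, retrain cycle can be sketched in a few lines. Everything below is a toy under stated assumptions: `expensive_oracle` is a hypothetical stand-in for a DFT run or an experiment, the surrogate is a small bootstrap ensemble of ridge regressors rather than a GNN, and candidates are chosen by ensemble disagreement (uncertainty sampling), which is only one of several possible selection rules.

```python
import numpy as np

def expensive_oracle(x):
    """Stand-in for a DFT run or wet-lab measurement (hypothetical toy function)."""
    return np.sin(3.0 * x) + 0.5 * x ** 2

def rbf_features(x, centers, width=0.4):
    """Radial-basis features so a linear model can fit a nonlinear curve."""
    return np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2.0 * width ** 2))

def fit_ensemble(x, y, centers, rng, n_models=20, ridge=1e-3):
    """Bootstrap ensemble of ridge regressors; member disagreement is the uncertainty proxy."""
    weights = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))        # resample the labelled set
        phi = rbf_features(x[idx], centers)
        gram = phi.T @ phi + ridge * np.eye(len(centers))
        weights.append(np.linalg.solve(gram, phi.T @ y[idx]))
    return np.stack(weights)                               # (n_models, n_centers)

rng = np.random.default_rng(0)
pool = np.linspace(-2.0, 2.0, 200)                         # candidate designs
centers = np.linspace(-2.0, 2.0, 12)
labelled = rng.choice(len(pool), size=5, replace=False).tolist()

for _ in range(5):
    x_lab = pool[labelled]
    y_lab = expensive_oracle(x_lab)                        # validate: pay for labels (cached in practice)
    w = fit_ensemble(x_lab, y_lab, centers, rng)           # retrain the surrogate
    preds = rbf_features(pool, centers) @ w.T              # predict: (n_pool, n_models)
    uncertainty = preds.std(axis=1)
    uncertainty[labelled] = -np.inf                        # never re-query a labelled point
    batch = np.argsort(uncertainty)[-5:]                   # choose the most uncertain candidates
    labelled.extend(batch.tolist())
```

After five rounds the loop has paid for 30 oracle calls out of 200 candidates. A discovery-oriented loop would typically replace the uncertainty score with one that also rewards predicted quality, for example predicted stability.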

Boundaries with adjacent chapters

This chapter sits among many others; the boundaries are as follows.

Generative Models is the source for the modelling techniques (diffusion, flow-matching, equivariant generators) that drive protein, molecule, and material design. This chapter develops the applications; the Generative Models chapter develops the machinery.
Foundation Models provides the FM-as-substrate framing. Scientific FMs (ESM, AlphaGenome, MolFormer) instantiate the FM pattern in scientific domains.
Reinforcement Learning provides the algorithmic basis for RL-based scientific systems (AlphaProof, RL for molecule design, RL for materials).
Self-Supervised Learning is the pretraining paradigm for scientific FMs.
Deep Learning provides architectures (GNNs, equivariant networks, Transformers) used throughout.
Multimodal Models (planned) develops cross-modal systems; protein-structure-and-language models are a scientific instance.
Theoretical Foundations of Learning §10 develops sample-complexity theory relevant to AI4S’s data-efficiency concerns.
Causality (planned) is relevant to the causal-vs-correlational critique of AI4S; we touch on it in §13 and §14.
Evaluation (planned) develops cross-cutting benchmarking methodology; this chapter’s §11 is domain-specific.

What this chapter does not try to do

Several explicit exclusions.

We do not provide a complete survey of every AI4S system or paper. The field is too large; we develop one or two landmark systems per domain in mechanistic detail and refer the reader to surveys.
We do not derive scientific results in domain depth. The AlphaFold 2 architecture is described mechanistically here; the biophysics of protein folding lives in biochemistry textbooks.
We do not cover all scientific domains. Some active AI4S areas — astronomy, neuroscience, economics, social science — are mentioned only briefly. The selection emphasizes domains with high-impact 2020–2026 systems.
We do not treat the industrial AI4S landscape (Isomorphic Labs, NVIDIA BioNeMo, Microsoft AI4Science, Google DeepMind Science, OpenAI partnerships) systematically. These are flagged where relevant.
We do not develop the philosophy-of-science questions that AI4S raises (what counts as discovery? does AI “understand” the systems it predicts?). These are flagged in §13 and live in the broader philosophy-of-AI literature.

Position taken in this chapter

The chapter is organized by scientific domain (proteins, mathematics, chemistry, materials, biology, climate, physics) because the methods, data, and evaluation criteria vary substantially across domains. Cross-cutting methodological themes are flagged throughout. The chapter is application-driven: each domain section develops one or two landmark systems in mechanistic detail and shows the data flow, training objective, and key innovations.

The chapter takes a cautious-but-positive stance toward AI4S. The achievements are real and substantial (AlphaFold 2 is one of the most consequential results in computational biology of the last 50 years). The limitations are also real (correlational vs causal, generalization beyond training distribution, the reproducibility crisis, the impact-vs-capability gap). We develop both dimensions.

§2. Historical Context

This section traces AI for Science from its statistical-ML roots through the deep-learning era and the 2020–2024 inflection that produced AlphaFold 2 and its successors. The history is essential for understanding why AI4S consolidated as a subfield when it did.

A timeline of the inflection points:

1960s–1980s

Early computational science

Classical ML and statistics applied to chemistry (QSAR), biology (sequence alignment), and physics (model fitting).
1990s–2000s

Chemoinformatics and bioinformatics mature

Random Forests for QSAR, kernel methods for property prediction. BLAST and HMMs for sequence analysis. Computational chemistry develops DFT and simulation methods.
2009

ImageNet and the early deep-learning era

Deep learning begins to emerge but does not yet substantially affect scientific applications.
2012–2015

Deep learning enters scientific domains piecemeal

Convolutional networks for medical imaging; deep networks for QSAR (Merck challenge, Dahl et al.).
2016–2017

Graph neural networks for molecules

GNNs emerge as a natural representation for molecules (MPNN, Gilmer et al. 2017). Sequence-level deep models for biology (DeepBind, Basset). Materials Project scales.
2018

AlphaFold 1 (CASP 13)

Senior et al. produce the first major deep-learning system for protein structure prediction. Substantial improvement over earlier methods but not yet experimental-quality.
2020

AlphaFold 2 (CASP 14)

Jumper et al. achieve near-experimental-accuracy protein structure prediction — a single-paper transformation of structural biology.
2021

AlphaFold 2 published; Open Catalyst Project

AlphaFold 2 paper published in Nature; AlphaFold Protein Structure Database released. Open Catalyst Project at Meta scales chemistry datasets to millions of DFT calculations.
2022–2023

Protein design, weather forecasting, materials discovery

RFdiffusion (Watson et al. 2023) makes generative protein design practical. ESM-2 (Lin et al. 2023) is a protein language model at scale. GNoME (Merchant et al. 2023) predicts 380K stable crystals. GraphCast (Lam et al. 2023) beats ECMWF on key weather-forecast metrics.
2023

FunSearch and AlphaGeometry

FunSearch (Romera-Paredes et al.): LLM + evolutionary search for combinatorial discoveries. AlphaGeometry (Trinh et al., published January 2024): near-Olympiad-level geometry solver.
2024

AlphaFold 3, AlphaProof, MatterGen, GenCast, Evo, AI Scientist

AlphaFold 3 (Abramson et al.) generalizes to protein complexes and small-molecule binding. AlphaProof (DeepMind) achieves silver-medal performance on IMO 2024. MatterGen (Microsoft) enables generative crystal design. GenCast (DeepMind) brings probabilistic weather forecasting via diffusion. Evo (Arc Institute) is a DNA foundation model; AlphaGenome (DeepMind, announced in 2025) later adds variant-effect prediction at scale. AI Scientist (Sakana) is an early autonomous research agent.
2025–2026

AI4S consolidates as a subfield

Shared infrastructure emerges (Hugging Face for scientific models, common benchmarks, cross-domain reading lists). Industrial labs (Isomorphic Labs, BioNeMo, Microsoft AI4Science) operate at substantial scale. Next-generation protein, materials, and climate foundation models proliferate.

We develop each phase below.

The pre-deep-learning era: classical scientific ML

Computational science substantially predates deep learning. Three threads of development:

Chemoinformatics. Started in the 1960s with substructure matching for chemical databases; matured through the 1990s with Quantitative Structure-Activity Relationships (QSAR) — statistical models relating molecular structure to biological activity. By the 2000s, Random Forests and SVMs on hand-engineered chemical descriptors (Morgan fingerprints, RDKit features) were the standard tool. The data was limited (often hundreds to thousands of molecules per task), so deep learning offered limited advantages until much larger datasets emerged.

Bioinformatics. Sequence alignment (BLAST, 1990), Hidden Markov Models for sequence analysis (HMMER), and homology-based methods dominated structural biology. Pre-deep-learning protein-structure prediction was based on fragment assembly (Rosetta) and threading (matching to known structures). The CASP (Critical Assessment of Structure Prediction) competition, started in 1994, was the field’s benchmark — and through CASP 12 (2016), no method achieved experimental-quality predictions.

Computational chemistry and physics. DFT (Density Functional Theory) and other electronic-structure methods provided physics-based simulation of molecular and material properties; they were accurate but expensive (a single DFT calculation can take hours of CPU time for a moderate-size system). Materials databases (Materials Project, started 2011) accumulated DFT-computed properties at scale.

The pre-deep-learning state. Each subfield had its own statistical-ML toolkit, optimized for its data characteristics. Cross-subfield methodology was rare; AI for Science as a unified subfield did not yet exist.

2012–2015: deep learning enters piecemeal

The general deep-learning revolution (AlexNet 2012, etc.) did not produce immediate parallel revolutions in scientific domains. Several reasons: scientific datasets were typically too small for deep networks (no ImageNet equivalent); domain-specific featurization was already strong; the deep-learning community’s attention was on vision and language.

A few early adopters demonstrated potential:

Merck Molecular Activity Challenge (2012). A Kaggle competition won by a deep-learning approach (Dahl et al.) over more conventional QSAR methods. Modest improvement on a domain with limited deep-learning history.
Medical-imaging deep learning developed in parallel through 2012–2015 (radiology, dermatology, ophthalmology). Not strictly AI4S (medical AI has commercial-deployment goals), but adjacent.
Deep learning for sequence biology. DeepBind (Alipanahi et al., 2015) for transcription-factor binding; Basset (Kelley et al., 2016) for chromatin accessibility. First deep models for genomics.

These were incremental advances rather than paradigm shifts. The field was still primarily statistical-ML-with-deep-models, not a coherent new approach.

2016–2017: graph neural networks and the chemistry-data-scaling thread

Two related developments mattered.

Graph neural networks emerged as the natural representation for molecules. The Message Passing Neural Network framework (Gilmer et al., 2017) unified existing approaches; GNN-based models began outperforming classical chemoinformatics on standard property-prediction benchmarks (QM9, ZINC).
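A single message-passing round is easy to write down. The sketch below is a deliberately stripped-down version of the idea, assuming NumPy: it omits the edge features and learned update networks of the Gilmer et al. framework, and the weight matrices `w_msg` and `w_upd` are random placeholders rather than trained parameters. It also checks that a sum readout gives the same molecular fingerprint when the atoms are renumbered.

```python
import numpy as np

def message_passing_step(h, bonds, w_msg, w_upd):
    """One round: every atom sums transformed neighbour states, then updates its own state."""
    messages = np.zeros((h.shape[0], w_msg.shape[1]))
    for i, j in bonds:                      # undirected bond: send in both directions
        messages[i] += h[j] @ w_msg
        messages[j] += h[i] @ w_msg
    return np.tanh(h @ w_upd + messages)

def readout(h):
    """Sum over atoms, so the result does not depend on atom numbering."""
    return h.sum(axis=0)

def embed(h, bonds, w_msg, w_upd, rounds=2):
    for _ in range(rounds):
        h = message_passing_step(h, bonds, w_msg, w_upd)
    return readout(h)

# Heavy atoms of ethanol, C-C-O, with one-hot features [is_carbon, is_oxygen].
h0 = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
bonds = [(0, 1), (1, 2)]
rng = np.random.default_rng(0)
w_msg, w_upd = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))
fingerprint = embed(h0, bonds, w_msg, w_upd)

# Renumber the atoms: new atom k is old atom perm[k].
perm = np.array([2, 0, 1])
inverse = np.argsort(perm)                  # old atom i becomes new atom inverse[i]
bonds_perm = [(int(inverse[i]), int(inverse[j])) for i, j in bonds]
fingerprint_perm = embed(h0[perm], bonds_perm, w_msg, w_upd)
```

The two fingerprints agree up to floating-point rounding, which is the permutation invariance that makes graphs a natural representation for molecules.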

Materials Project and, later, the Open Catalyst Project scaled chemistry datasets dramatically. The Open Catalyst Project (Meta, 2020-) eventually contained over 100 million DFT calculations on catalyst surfaces, enough data to train deep models that approximate DFT results at a small fraction of the computational cost. Because these models are trained on DFT labels, the gain is speed rather than accuracy beyond DFT itself.

These threads set up the conditions for AlphaFold 2: deep learning ready, graph-style representations mature, and at-scale scientific data accumulating.

2018: AlphaFold 1 — the first major breakthrough

At CASP 13 (2018), DeepMind submitted AlphaFold 1 (Senior et al., 2020 — published with detail two years later). The system substantially outperformed all other entries, leveraging deep convolutional networks on multiple-sequence-alignment (MSA) features.

AlphaFold 1’s results were strong but not transformative. Average GDT-TS (a structure-accuracy metric) was around 60 — better than other methods but not at the ~90 level that corresponds to experimental-quality predictions. The field saw the result as significant but not a “solution.”
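For readers unfamiliar with the metric, GDT-TS averages the percentage of Cα atoms that lie within 1, 2, 4, and 8 Å of their reference positions. The sketch below is a simplification, assuming NumPy and assuming the predicted and reference coordinates are already superimposed; the full metric additionally searches over superpositions for each cutoff. The coordinates are made-up toy values.

```python
import numpy as np

def gdt_ts(pred_ca, ref_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """GDT-TS on pre-superimposed C-alpha coordinates, on a 0-100 scale."""
    deviation = np.linalg.norm(pred_ca - ref_ca, axis=-1)       # per-residue error in angstroms
    return 100.0 * np.mean([(deviation <= c).mean() for c in cutoffs])

# Four residues with errors of 0.5, 1.5, 3.0 and 10.0 angstroms.
ref = np.zeros((4, 3))
pred = ref.copy()
pred[:, 0] = [0.5, 1.5, 3.0, 10.0]

score = gdt_ts(pred, ref)   # fractions 1/4, 2/4, 3/4, 3/4 average to 56.25
```

A score near 60 therefore means that a substantial share of residues are still several angstroms from their true positions, while a score near 90 requires most residues to sit within about 1 to 2 Å.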

What AlphaFold 1 did establish: deep learning could substantially advance protein-structure prediction, and DeepMind was investing serious resources in the problem. The 2018-2020 period was a focused effort to push the recipe further.

2020–2021: AlphaFold 2 transforms structural biology

At CASP 14 (November 2020), DeepMind submitted AlphaFold 2 (Jumper et al., 2021). The results were extraordinary: median GDT-TS of ~92, with many predictions matching experimental structures to within the experimental precision itself. For the first time, a computational method achieved experimental-quality protein structure prediction across a diverse range of targets.

The CASP organizers described the result as solving the 50-year-old protein-structure-prediction problem. The biological community’s reaction was a mix of celebration and disbelief. The peer-reviewed Nature paper appeared in July 2021 along with code release; the AlphaFold Protein Structure Database (predicting structures for ~200 million known protein sequences) followed.

The mechanism (developed in §3). AlphaFold 2 used a novel architecture (Evoformer + Structure Module) trained end-to-end on the PDB plus distillation, combined with multiple-sequence-alignment features and recycling (passing predictions back as inputs). The architectural and engineering depth substantially exceeded all prior protein-structure systems.

The impact. AlphaFold 2 demonstrated that AI could substantively solve a major scientific problem — not merely improve incrementally on prior approaches. The result rapidly became a template for AI4S: deep neural networks with domain-specific architectures, trained on substantial scientific datasets, with careful engineering, can produce transformative scientific results.

2022–2023: the AI4S expansion

The years following AlphaFold 2 saw rapid expansion across scientific domains. Several landmark results:

ESMFold (Lin et al., 2022). Protein structure prediction using only a language model (ESM-2) over amino-acid sequences, with no MSAs required. Less accurate than AlphaFold 2 but much faster, enabling structure prediction at metagenomic scale (the ESM Metagenomic Atlas covers roughly 600M sequences).
RFdiffusion (Watson et al., 2023). Generative protein design via SE(3)-equivariant diffusion on protein backbones. The first practical method for designing novel proteins with desired properties; substantial uptake in synthetic biology.
GraphCast (Lam et al., 2023). Graph-neural-network weather forecasting that outperformed the European Centre’s Integrated Forecast System (the world’s leading numerical weather prediction system) on most key metrics. Computationally orders of magnitude cheaper. A direct challenge to traditional numerical weather prediction.
GNoME (Merchant et al., 2023). Materials discovery via GNN prediction + DFT validation in an active-learning loop. 380K predicted stable crystals — comparable to the entire previous accumulated discovery.
FunSearch (Romera-Paredes et al., 2023). LLM-driven evolutionary search for combinatorial problems. Discovered new constructions and heuristics for long-standing problems (cap-set problem, online bin-packing), among the first LLM-driven discoveries on established open mathematical problems with publishable scientific consequence.
AlphaGeometry (Trinh et al., 2024). Olympiad-level Euclidean geometry solver combining a language model that proposes auxiliary constructions with a symbolic deduction engine that derives and checks the proof.

Several themes consolidated. Equivariance became standard for 3D scientific data. Generative design (diffusion-based) became the dominant paradigm for protein/molecule/material design. Foundation-model-style pretraining was applied across scientific domains.

2024: the AI4S consolidation year

The cluster of landmark results in 2024 effectively made AI4S a coherent subfield:

AlphaFold 3 (Abramson et al., 2024). Extended to protein-protein, protein-DNA, protein-RNA, and protein-ligand complexes. The Pairformer architecture replaced AlphaFold 2’s Evoformer; the model substantially generalized the application domain.
AlphaProof (DeepMind, 2024). RL-trained theorem proving in Lean. Together with AlphaGeometry 2, it reached silver-medal level on IMO 2024, the first AI system at medal level on hard, newly written competition problems.
MatterGen (Microsoft, 2024). Generative diffusion for crystal materials design.
GenCast (DeepMind, 2024). Probabilistic weather forecasting via diffusion — substantively better than GraphCast for ensemble forecasts.
Biology foundation models. Evo (Arc Institute, 2024): DNA-scale foundation model. scGPT, Universal Cell Embedding (UCE): single-cell foundation models. AlphaGenome (DeepMind, 2025) followed with variant-effect prediction.
AI Scientist (Sakana, 2024). Early end-to-end autonomous research system — LLM agent that proposes hypotheses, runs experiments, writes papers. Limited but suggestive of the autonomous-research direction.

The aggregate effect. By end-2024, AI for Science had its own shared infrastructure (Hugging Face for scientific models, cross-domain benchmarks, methodologies that span domains), its own communities (NeurIPS AI4Science workshop track, dedicated conferences, industrial labs), and its own narrative (AI as scientific instrument).

2025–2026: consolidation and impact

The 2025–2026 period (the chapter’s present) is one of consolidation rather than further revolutions. The major AI4S systems have stabilized into production-grade tools used by working scientists. Industrial AI4S labs are at substantial scale (Isomorphic Labs has hundreds of researchers; NVIDIA BioNeMo is multi-million-dollar infrastructure; Microsoft AI4Science has dozens of substantial systems in development). Academic AI4S is well-funded and embedded in major universities.

The unsolved problems are also clearer. AlphaFold-style success has not generalized to protein dynamics (folding kinetics, conformational changes). AI weather forecasting works at 1-10 day timescales but not for climate. Mathematics AI can solve known-hard problems but has not yet discovered important new mathematics. Drug discovery has seen incremental improvements but no AlphaFold-level transformation.

Where this leaves us in 2026

The current state. AI for Science is a substantial subfield with:

Several landmark systems (AlphaFold, GraphCast, AlphaProof, GNoME, RFdiffusion) at production-grade quality.
Methodological substrates (equivariance, scientific FMs, generative design, learned simulators) that are mature and shared across domains.
Active research at substantial scale in industry and academia.
Major unsolved problems (dynamics, long-horizon, novelty, causal reasoning) that constitute the field’s open frontier.

The remaining sections develop the major domains in mechanistic detail. §3 covers proteins (the flagship subfield). §4 covers mathematics. §5–§7 cover chemistry, materials, and biology beyond proteins. §8 covers climate and weather. §9 covers physics and scientific simulation. §10 covers reasoning agents and autonomous research. §11 covers evaluation. §12–§16 close out.

Editorial note. AI for Science is a rapidly-evolving subfield, and parts of the chapter will date faster than the more mature paradigm-chapters (Foundation Models, LLMs). The 2024 landmark systems are likely to persist as references; specific evaluation results and benchmark numbers will shift. We treat the chapter as a snapshot of the methodological landscape, not a comprehensive survey of every recent result.

§3. Proteins: Structure and Design

This is the flagship subfield of AI for Science — the area where AI has produced its most consequential results to date, and where the field’s methodology is most developed. We cover protein structure prediction (AlphaFold 2/3, ESMFold), protein design (RFdiffusion, Chroma), and the limitations that remain open.

The protein-structure problem

The biology, briefly. Proteins are linear polymers of amino acids — sequences over a 20-letter alphabet, typically 50–1000 residues long. The linear sequence is encoded by DNA and synthesized by ribosomes; once synthesized, the chain spontaneously folds into a specific 3D structure that determines the protein’s biological function. The same amino-acid sequence reliably folds to the same structure (modulo physiological conditions); the structure is determined by the sequence.

The problem. Given an amino-acid sequence, predict the 3D coordinates of every atom in the folded protein.

Why this is hard. The protein folding problem (predicting 3D structure from sequence) was the central open problem of structural biology for 50 years. Physical-simulation approaches (molecular dynamics from sequence) are computationally infeasible for all but the smallest proteins — folding requires sampling a vast conformational space at timescales (microseconds to seconds) far beyond simulation budgets. Empirical approaches (predicting from sequence-sequence similarity to known structures) work when close homologs are known but fail otherwise. By 2016, decades of effort had produced average GDT-TS scores around 40–50 — useful for some applications, far from experimental quality.

Why solving it matters. Protein structure determines function. Knowing the structure of an unknown protein enables understanding its biology, designing drugs that bind it, engineering enzymes for industrial use, and identifying disease-causing mutations. Experimental structure determination (X-ray crystallography, cryo-EM) requires months of work per protein and fails for many targets (intrinsically disordered proteins, large complexes, membrane proteins). Computational structure prediction at experimental quality unlocks structure-based biology at the scale of the entire proteome.

AlphaFold 2: the architecture

AlphaFold 2 (Jumper et al., 2021) achieved median GDT-TS ~92 at CASP 14, which is experimental-quality accuracy. The system has four main parts (MSA-based input, the Evoformer, the Structure Module, and recycling), each with substantial novel engineering.

1. Input: amino-acid sequence.

2. Input featurization: residue features; MSA features; template features (if any); pair features (relations between residue positions).

3. Evoformer (48 blocks): each block jointly updates the MSA representation ($N_{\text{seq}} \times N_{\text{res}}$) and the pair representation ($N_{\text{res}} \times N_{\text{res}}$), exchanging information via attention and triangle multiplications.

4. Intermediate output: single-sequence representation + pair representation.

5. Structure Module (8 layers): Invariant Point Attention iteratively places residues in 3D space; outputs 3D coordinates of all atoms.

6. Recycling (up to 4 iterations): pass the predicted structure back as input; refine.

7. Output: 3D coordinates + per-residue confidence (pLDDT).

Multiple Sequence Alignment (MSA) input. Evolution provides a powerful signal. Across the millions of natural protein sequences in databases, residues that are physically close in 3D space tend to co-evolve — a mutation at one position is often compensated by a correlated mutation at a contact partner. AlphaFold 2 retrieves homologous sequences (via search tools like JackHMMER) and feeds the resulting MSA as input. The MSA carries evolutionary couplings that strongly constrain the 3D structure.
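The co-evolution signal can be seen even in a tiny alignment. The sketch below uses mutual information between alignment columns, a classical way to detect correlated mutations; it is an illustration of the signal only, not what AlphaFold 2 computes, since the network consumes the raw MSA and learns its own features. The six-sequence alignment is invented for the example.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * math.log(c * n / (count_a[a] * count_b[b]))
        for (a, b), c in count_ab.items()
    )

# Toy MSA: column 0 is conserved; columns 1 and 3 mutate together (K with E, R with D);
# column 2 varies independently of the others.
msa = ["AKLE", "ARLD", "AKLE", "ARLD", "AKVE", "ARVD"]
columns = list(zip(*msa))

scores = {
    (i, j): mutual_information(columns[i], columns[j])
    for i, j in combinations(range(len(columns)), 2)
}
best_pair = max(scores, key=scores.get)     # the most strongly co-varying pair of positions
```

Here positions 1 and 3 stand out, which in a real protein would suggest that the two residues are in contact. The conserved column carries no pairwise signal at all, which is one reason deep, diverse alignments matter.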

The Evoformer. The architectural heart. A stack of 48 transformer-like blocks that operate on two representations: the MSA representation (an $N_{\text{seq}} \times N_{\text{res}}$ matrix of per-position-per-sequence features) and the pair representation (an $N_{\text{res}} \times N_{\text{res}}$ matrix of residue-pair features). Each block updates both representations, with the two exchanging information through:

Attention along rows and columns of the MSA.
Triangle multiplication operations on the pair representation, capturing that, in a self-consistent set of distances, the distance from residue $i$ to $j$ constrains the distance from $j$ to $k$ and from $k$ to $i$. The triangle inequality forces consistency.
Outer-product mean updates from MSA to pair representation.

The Evoformer is not a standard Transformer. The pair representation and the triangle multiplications are domain-specific innovations that encode the geometric structure of 3D distance matrices. This is one of the architectural keys.
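A minimal sketch of the index pattern behind the "outgoing" triangle multiplicative update, assuming one scalar feature per residue pair and omitting the layer normalization, gating, and learned projections that the real block applies. The function name and toy inputs are illustrative and are not AlphaFold 2 code.

```python
def triangle_multiply_outgoing(a, b):
    """Return u[i][j] = sum_k a[i][k] * b[j][k] for square matrices a, b.

    a and b stand in for two projections of the pair representation;
    here they are plain lists of lists with one scalar per residue pair.
    """
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


# Toy 3-residue example: edge values for pairs (i, k) and (j, k).
a = [[0.0, 1.0, 2.0],
     [1.0, 0.0, 1.0],
     [2.0, 1.0, 0.0]]
b = [[0.0, 0.5, 1.0],
     [0.5, 0.0, 0.5],
     [1.0, 0.5, 0.0]]

update = triangle_multiply_outgoing(a, b)
```

Every entry of `update` aggregates over all third residues `k`, which is how information about two edges of a triangle reaches the third edge.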

The Structure Module. Takes the final pair representation and the “single” representation (collapsed from the MSA) and produces 3D coordinates. It uses Invariant Point Attention (IPA), an attention mechanism whose output is invariant to global rotations and translations of the protein. Because residue positions are updated relative to each residue's local frame, the module as a whole is equivariant: a rotated input produces a correspondingly rotated output. The Structure Module iteratively refines the placement of all residues over 8 layers.
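To make the symmetry requirement concrete, the sketch below checks numerically that pairwise distances between points are unchanged when the whole point set is rotated and translated. It only illustrates the kind of invariant quantity that geometric attention can safely depend on; it is not an implementation of IPA, and the helper names are hypothetical.

```python
import math


def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis by `angle` radians."""
    x, y, z = p
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)


def translate(p, t):
    """Shift a 3D point by the vector t."""
    return tuple(pi + ti for pi, ti in zip(p, t))


def pairwise_distances(points):
    return [[math.dist(p, q) for q in points] for p in points]


points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 3.8, 0.0), (0.0, 3.8, 2.0)]
moved = [translate(rotate_z(p, 0.7), (5.0, -2.0, 1.0)) for p in points]

before = pairwise_distances(points)
after = pairwise_distances(moved)

# Largest change in any pairwise distance caused by the rigid motion.
max_change = max(
    abs(x - y) for row_b, row_a in zip(before, after) for x, y in zip(row_b, row_a)
)
```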

Recycling. A novel training-and-inference scheme. The model’s output (predicted pair distances + sequence representation) is fed back as additional input; the model refines its prediction over up to 4 iterations. Recycling is part of the network — the model is trained to use its own output as input. Empirically substantial: structures predicted with recycling are noticeably better than without.
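The control flow of recycling can be sketched in a few lines. This is a toy under stated assumptions: `model(features, previous)` is a hypothetical callable, the default of three extra passes is an illustrative choice, and `toy_model` is a stand-in that simply moves halfway toward a known target on each pass.

```python
def predict_with_recycling(model, features, n_recycle=3):
    """Run `model` once, then feed its output back in `n_recycle` more times.

    `model(features, previous)` returns a new prediction given the inputs
    and the previous prediction (None on the first pass).
    """
    previous = None
    for _ in range(n_recycle + 1):
        previous = model(features, previous)
    return previous


def toy_model(features, previous):
    # Stand-in for the network: move halfway from the previous guess
    # (or from zero) toward the "true" value stored in features.
    start = 0.0 if previous is None else previous
    return start + 0.5 * (features["target"] - start)


one_pass = predict_with_recycling(toy_model, {"target": 8.0}, n_recycle=0)
recycled = predict_with_recycling(toy_model, {"target": 8.0}, n_recycle=3)
```

In the toy, each extra pass halves the remaining error, which mirrors the qualitative claim that recycled predictions are better than single-pass ones.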

Training and the per-residue confidence

Trained on the Protein Data Bank (~180K experimental structures) with several specific techniques:

Self-distillation. AlphaFold 2’s predictions on unlabelled sequences are used as additional training data after a confidence filter — substantially expanding the effective training set.
Auxiliary losses. The model is trained against many losses simultaneously (FAPE loss for structure, distogram loss for pairs, masked-MSA loss for sequence reconstruction). This multi-task training stabilizes learning.
pLDDT confidence. The model produces a per-residue confidence score (pLDDT, predicted local distance difference test, on a 0–100 scale) that is reasonably well calibrated against actual per-residue accuracy. Low-confidence regions often correspond to intrinsically disordered loops or to predictions the user should distrust.

The pLDDT score is one of the most-used outputs in practice. A working scientist using AlphaFold predictions knows to trust high-pLDDT regions and to treat low-pLDDT ones with skepticism.
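In practice, pLDDT is usually read straight from the structure file: AlphaFold database PDB files store it in the B-factor column of each ATOM record. The sketch below assumes fixed-column PDB text and uses the confidence bands shown on the AlphaFold database pages (above 90, 70 to 90, 50 to 70, below 50). The function names and the two synthetic records are illustrative.

```python
def plddt_by_residue(pdb_text):
    """Map (chain, residue number) -> pLDDT, read from the B-factor column.

    Assumes fixed-column PDB ATOM records in which the B-factor field
    (columns 61-66) stores pLDDT.
    """
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            chain = line[21]
            resnum = int(line[22:26])
            scores[(chain, resnum)] = float(line[60:66])
    return scores


def confidence_band(plddt):
    if plddt > 90:
        return "very high"
    if plddt > 70:
        return "confident"
    if plddt > 50:
        return "low"
    return "very low"


def atom_line(serial, resname, chain, resnum, xyz, bfactor):
    """Format a minimal fixed-column ATOM record for a C-alpha atom."""
    x, y, z = xyz
    return (
        f"ATOM  {serial:>5}  CA  {resname:>3} {chain}{resnum:>4}    "
        f"{x:8.3f}{y:8.3f}{z:8.3f}{1.0:6.2f}{bfactor:6.2f}           C"
    )


# Two synthetic residues: one low-confidence, one high-confidence.
example = "\n".join([
    atom_line(1, "MET", "A", 1, (10.0, 10.0, 10.0), 42.10),
    atom_line(2, "LYS", "A", 2, (13.8, 10.0, 10.0), 93.55),
])
scores = plddt_by_residue(example)
bands = {key: confidence_band(value) for key, value in scores.items()}
```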

AlphaFold 3: from structures to complexes

AlphaFold 3 (Abramson et al., 2024) extends the recipe to biomolecular complexes: protein-protein interactions, protein-DNA, protein-RNA, protein-ligand. The architectural changes:

The Evoformer is replaced by the Pairformer, a simplified architecture that handles the broader range of inputs.
The Structure Module is replaced by a diffusion-based structure predictor — the network predicts denoising directions on noisy 3D coordinates, similarly to image diffusion. (The connection to the Generative Models chapter, §6, is explicit.)
Inputs include not just amino-acid sequences but also DNA/RNA sequences, small-molecule ligands (SMILES strings), and ions.

The result. AlphaFold 3 makes high-quality predictions for protein-ligand complexes (a long-standing drug-discovery challenge), protein-DNA binding, and protein-protein interactions. The same model handles all of these, replacing what was previously a zoo of specialized tools.

The diffusion-structure-prediction move connects AlphaFold 3 architecturally to RFdiffusion (below) and to the broader generative-modelling toolkit. The two threads — structure prediction and structure design — converge methodologically.

ESMFold: structure from language models alone

A parallel development. ESMFold (Lin et al., 2022) demonstrated that protein structure could be predicted from a protein language model (ESM-2, a large Transformer pretrained on amino-acid sequences) without MSAs. The architecture: feed amino-acid sequence to ESM-2; extract internal representations; use a small structure-prediction module on top.

The trade-off vs AlphaFold 2. ESMFold is less accurate (median GDT-TS around 80 vs 92) but much faster: predictions take seconds rather than minutes, no MSA search required. At scale, ESMFold enabled the ESM Metagenomic Atlas — predicted structures for over 600 million proteins, including many that have no characterized homologs and would be unreachable with AlphaFold’s MSA dependency.

ESMFold demonstrates an important methodological point: protein language models can serve as substrates for downstream tasks, the same way general-purpose LLMs can serve as substrates for general reasoning. The FM-as-substrate paradigm of the Foundation Models chapter applies directly to proteins.

Protein design: the inverse problem

Structure prediction takes a sequence and produces a structure. Protein design takes a desired structure (or property) and produces a sequence. This is the inverse problem, and it is where generative modelling enters scientifically.

The classical protein-design problem is hard. The space of possible sequences is enormous ( $20^N$ for an $N$ -residue protein), most sequences do not fold to any defined structure, and the relationship from sequence to function is highly non-linear. Pre-deep-learning approaches (Rosetta-based design) achieved modest success on simple targets but did not scale.

RFdiffusion: generative protein design

RFdiffusion (Watson et al., 2023) made generative protein design practical. The recipe:

Take RoseTTAFold, the Baker lab’s structure-prediction network developed in the wake of AlphaFold 2.
Train it to perform denoising on protein backbone coordinates. Just as a diffusion image model is trained to remove noise from images, RFdiffusion is trained to remove noise from protein structures.
The architecture is SE(3)-equivariant: a rotated and translated input produces a correspondingly rotated and translated output. The network respects the physical symmetries of 3D space by construction.

Generation: start from random noise (a noisy “structure” consisting of random 3D positions), apply the trained denoiser iteratively. After many denoising steps, the noise becomes a plausible protein backbone — one that respects the geometric constraints of real proteins.

$t = T$ : random noise (random 3D points)
→ apply denoiser
$t = T-1$ : slightly less noisy
→ apply denoiser (repeat)
$t = 0$ : plausible protein backbone

With conditioning, the user can specify: target binding pocket geometry; symmetry constraints (homooligomers); sequence motif requirements; functional-site placement.
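The iterative-denoising loop can be sketched on bare 3D points. This is a toy, not RFdiffusion's sampler: the real model diffuses residue frames with a trained network, whereas here the "denoiser" is a hypothetical oracle that always predicts a straight chain with 3.8 angstrom spacing, and the update rule and noise schedule are simple illustrative choices.

```python
import math
import random


def rmsd(a, b):
    """Root-mean-square deviation between two equal-length point lists."""
    sq = sum((ai - bi) ** 2 for p, q in zip(a, b) for ai, bi in zip(p, q))
    return math.sqrt(sq / len(a))


def sample_backbone(denoiser, n_residues, n_steps, rng):
    """Iteratively denoise random 3D points into a backbone trace."""
    # t = T: pure noise.
    x = [tuple(rng.gauss(0.0, 10.0) for _ in range(3)) for _ in range(n_residues)]
    trajectory = [x]
    for t in range(n_steps, 0, -1):
        x0_hat = denoiser(x, t)  # the denoiser's guess at the clean structure
        noise_scale = 0.5 * (t - 1) / n_steps  # zero at the final step
        x = [
            tuple(
                xi + (x0i - xi) / t + rng.gauss(0.0, noise_scale)
                for xi, x0i in zip(p, p0)
            )
            for p, p0 in zip(x, x0_hat)
        ]
        trajectory.append(x)
    return trajectory


# Hypothetical stand-in for a trained denoiser.
TARGET = [(3.8 * i, 0.0, 0.0) for i in range(10)]


def oracle_denoiser(x, t):
    return TARGET


trajectory = sample_backbone(oracle_denoiser, n_residues=10, n_steps=50,
                             rng=random.Random(0))
start_error = rmsd(trajectory[0], TARGET)
final_error = rmsd(trajectory[-1], TARGET)
```

Conditioning would enter through the denoiser: a conditional model receives the constraint (for example a fixed motif) as extra input at every step.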

The breakthroughs.

Conditional generation. RFdiffusion supports conditional generation — design proteins that meet specific functional or structural constraints. Examples: design a protein that binds a specified target site; design a symmetric homooligomer with a specified rotational axis; design a scaffold around a known functional motif.

Experimental validation. Designed proteins were synthesized and characterized. The hit rate (fraction of designs that fold as predicted and have the intended function) is substantially higher than previous methods — order-of-magnitude better than pre-deep-learning protein design.

Functional design. Subsequent work has applied RFdiffusion to design enzymes for specific reactions, binders for therapeutic targets, and scaffolds for self-assembling structures. The hit rate is still modest but the technique works.

Chroma and other generative-protein systems

Chroma (Ingraham et al., 2023). A programmable generative protein model with conditioning on symmetry, shape, function, and class. Similar diffusion-based architecture but with more flexible conditioning interfaces.

ProteinMPNN (Dauparas et al., 2022). A different problem: given a desired backbone structure, design a sequence that folds to it. This is the fixed-backbone design problem. ProteinMPNN uses a graph neural network with message passing; designs have high experimental success rates.

The pattern: structure prediction (AlphaFold 2) plus structure design (RFdiffusion) plus sequence design (ProteinMPNN) gives a protein design pipeline — specify a desired function, design backbones that could support it, design sequences for those backbones, validate predictions with AlphaFold 2 before synthesis. This pipeline is now standard in computational protein-design labs.

Limitations: dynamics, allostery, intrinsic disorder

The honest accounting of what remains open.

Protein dynamics. AlphaFold-family models predict the folded ground-state structure. Real proteins are dynamic — they breathe, undergo conformational changes, transition between functional states. Predicting dynamics (which states a protein samples, how often, how transitions occur) is much less developed. Several recent systems (Distributional Graphformer, AlphaFlow) extend the recipe toward conformational ensembles, but the field is far behind static-structure prediction.

Allostery. Many proteins regulate function via allostery — binding at one site changes the structure or dynamics at a distant site. Predicting allosteric behaviour is largely unsolved; AlphaFold-style static-structure prediction can show the ends of an allosteric transition but not the path or thermodynamics.

Intrinsically disordered regions. Many proteins contain regions that do not have a defined 3D structure — they fluctuate among an ensemble of conformations. These regions have biological function (signalling, regulation) and are common in eukaryotic proteomes. Static-structure prediction is ill-defined for these regions; AlphaFold flags them with low pLDDT but offers no positive characterization.

Generalization to evolutionary outliers. Proteins very different from anything in the training data (extreme sequence design, designed-from-scratch proteins, very large proteins) sometimes produce poor AlphaFold predictions even when they are well-folded.

Function prediction. Knowing the structure of an unknown protein helps identify function (via structural similarity) but does not directly predict function. Function-prediction systems (DeepFRI, ProteInfer, modern LM-based methods) exist but are substantially less mature than structure prediction.

These limitations are the open frontier of AI for protein biology in 2026. The field continues to advance; AlphaFold-style transformative results in the open problems above have not yet appeared.

Where AI for proteins sits in 2026

The summary. Protein structure prediction is largely solved at the proteome scale. Protein design is mature for backbone generation and binder design; less mature for enzyme design. Protein dynamics, allostery, and intrinsic disorder are open problems. Function prediction is improving but trails structure prediction.

In production: the AlphaFold Protein Structure Database (~200M predicted structures, freely available) is a routine starting point for any structural biology project. RFdiffusion and successors are used in commercial protein-design pipelines. AlphaFold 3 is gated commercially but widely used in industrial labs.

The field is the canonical AI4S success story — and also the most-developed exemplar of what remains hard even after a major breakthrough.

§4. Mathematics

A different domain entirely — and one where AI’s progress is more recent and more limited but no less consequential. Mathematics tests AI’s reasoning capabilities at a level no other domain does, because mathematical proofs require strict logical correctness that cannot be approximated.

The reasoning challenge of mathematics for AI

Mathematics is structurally hostile to standard AI approaches. Three properties make it hard.

Verification is strict. A mathematical proof is either correct or incorrect; there is no “mostly correct” credit. A 49-step proof with one logical error has zero value as a proof. This is unlike image generation (slightly-wrong images can still be useful) or protein structure (slightly-wrong structures can still inform research). Mathematics requires exact correctness.

Solution paths are long. Hard mathematical problems often require long chains of reasoning — dozens or hundreds of inferential steps. Each step depends on the previous; a wrong step early in the chain invalidates everything that follows. This is the credit assignment problem in extreme form.

The reward signal is sparse. Most candidate proof steps are wrong; the few that work are not obvious in advance. Standard ML training signals (gradient descent on continuous objectives) do not apply directly; mathematical reasoning has a fundamentally combinatorial character.

These properties made mathematics one of the hardest domains for AI through 2023. Standard LLMs could solve grade-school arithmetic but failed on competition-level problems; specialized symbolic systems (computer algebra, classical theorem provers) had narrow capabilities. The 2023–2024 breakthroughs (FunSearch, AlphaGeometry, AlphaProof) demonstrated that the gap was bridgeable with the right combination of techniques.

Theorem proving with formal systems

The technical substrate for AI mathematical reasoning is interactive theorem provers: software systems where mathematical proofs are written in a formal language that the system can verify. The dominant systems:

Lean (Lean 4, 2021–). The most active modern system. Used by AlphaProof. Built around a dependent type theory.
Coq. Long-established. Used in formalized mathematics (the four-colour theorem, the odd-order theorem) and software verification.
Isabelle. Higher-order logic. Used in some major formalizations.

Each system has a kernel that performs strict logical verification. If a proof is accepted by the kernel, it is correct — modulo bugs in the kernel itself (kernels are small enough to be carefully audited).

Mathematical research is increasingly formalized in these systems: major theorems have been verified, and large libraries of formal definitions and proofs (mathlib for Lean, the Mizar Mathematical Library) are growing. Mathlib (the community-maintained mathematics library for Lean 4, distinct from Lean's core standard library) contains over 100,000 theorems as of 2026.

The AI angle. A theorem prover provides an unambiguous reward signal for an AI proof system: does the proof kernel accept the proof? This is exactly what mathematics’s strict-verification requirement demands. AI for mathematics can be framed as RL with a verifier-based reward (RL §10–§12) — and that is precisely what AlphaProof did.
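A small Lean 4 example shows what "the kernel accepts the proof" means in practice. The theorem name is arbitrary; `Nat.add_comm` is a lemma from Lean's core library.

```lean
-- A statement and a proof term. The kernel checks that the term
-- really has the stated type; if it does, the theorem is accepted.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Swapping the proof for an incorrect one (for instance `rfl`, which
-- fails here because the two sides are not equal by mere computation)
-- makes Lean reject the declaration. Accept / reject is the binary
-- signal an RL system can use as reward.
```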

AlphaProof (2024): RL meets Lean

AlphaProof (DeepMind, 2024), together with the geometry specialist AlphaGeometry 2, achieved silver-medal-level performance on the 2024 International Mathematical Olympiad, solving 4 of 6 problems. It was the first time an AI system reached medal-level performance on these hard, previously unseen competition problems.

The architecture. The system pairs a neural-network proof candidate generator (an LLM-style model) with the Lean proof verifier. Training:

Pretraining. A base LLM is trained on natural-language mathematics and Lean code, producing a model that can generate plausible Lean proof candidates.
Statement translation. Olympiad problems are stated in natural language, while the prover works on Lean statements (the formal goal to prove). A fine-tuned LLM formalizes a large corpus of natural-language problems to build the training curriculum; for the IMO 2024 evaluation itself, the problems were translated into Lean by hand. Formalization is non-trivial; some IMO problems require careful interpretation of “find all” vs “prove” formulations.
RL with Lean as verifier. The model generates many candidate proofs for each statement; Lean verifies each. Successful proofs (Lean accepts) give positive reward; failures give negative reward. DeepMind describes the training as AlphaZero-style RL (search-guided self-improvement; cf. RL §10), which teaches the model to produce more successful proofs over time.
Test-time search. At inference, the model searches over many candidate proofs; the Lean-verified successes are returned. Substantial inference-time compute (from minutes up to days per problem) is allocated to hard problems.
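The generate-then-verify pattern can be illustrated on a toy task with an exact checker. This is a sketch of the general idea only, not AlphaProof's training or search procedure: the "statement" is an integer to factor, the "proof" is a pair of factors, and `propose` is a hypothetical random stand-in for the neural generator.

```python
import random


def verify(statement, candidate):
    """Exact checker: accept only a nontrivial factorization of `statement`."""
    a, b = candidate
    return a > 1 and b > 1 and a * b == statement


def propose(statement, rng):
    """Stand-in for the neural generator: guess a divisor at random."""
    a = rng.randint(2, statement - 1)
    return (a, statement // a)


def search(statement, n_samples, rng):
    """Sample candidates, keep the first verified one, and log rewards."""
    rewards = []
    solution = None
    for _ in range(n_samples):
        candidate = propose(statement, rng)
        ok = verify(statement, candidate)
        rewards.append(1 if ok else -1)  # verifier-based reward signal
        if ok and solution is None:
            solution = candidate
    return solution, rewards


solution, rewards = search(statement=221, n_samples=2000, rng=random.Random(0))
```

Because the checker is exact, a returned solution is correct no matter how unreliable the generator is. In an RL setting the logged rewards would drive updates to the generator; here they are only recorded.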

The result. The combined 4-of-6 result at IMO 2024 is striking. AlphaProof solved two algebra problems and one number theory problem; the geometry problem was solved by AlphaGeometry 2. Neither system solved the two combinatorics problems, whose solutions tend to be shorter and more “clever” and were not found within the compute budget.

AlphaGeometry: a specialist for Euclidean geometry

AlphaGeometry (Trinh et al., 2024) is a more specialized system for Euclidean geometry — the subset of mathematics with the most direct geometric reasoning structure. The architecture:

A symbolic deduction engine that performs forward chaining: given known facts, derive new facts using geometry rules.
A neural network that proposes auxiliary constructions — the clever auxiliary points and lines that make hard geometry problems tractable. (Constructing the right auxiliary line is often the creative insight; once it’s in place, the deduction is mechanical.)

The combination is powerful. The symbolic engine handles the routine deduction; the neural network handles the creative construction. AlphaGeometry solved 25 of 30 IMO geometry problems from 2000–2022 — near-Olympiad-medalist level on geometry specifically.
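The division of labour can be sketched with a tiny forward-chaining engine. This is a toy under stated assumptions: facts are plain strings, rules are hand-written, and the list of candidate constructions is a hypothetical stand-in for the neural network's proposals. It is not AlphaGeometry's deduction engine.

```python
def forward_chain(facts, rules):
    """Apply rules until no new fact appears; return the deductive closure.

    Each rule is (premises, conclusion), with premises a frozenset of facts.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


def solve(facts, rules, goal, constructions):
    """Try pure deduction first, then each auxiliary construction in turn."""
    if goal in forward_chain(facts, rules):
        return "deduction only"
    for name, new_facts in constructions:
        if goal in forward_chain(set(facts) | set(new_facts), rules):
            return name
    return None


# Isosceles triangle: from AB = AC, show that the base angles are equal.
rules = [
    (frozenset({"AB = AC", "BM = CM", "AM = AM"}),
     "triangle ABM congruent to triangle ACM"),   # side-side-side
    (frozenset({"triangle ABM congruent to triangle ACM"}),
     "angle ABC = angle ACB"),                    # corresponding angles
]
facts = {"AB = AC"}
goal = "angle ABC = angle ACB"
constructions = [("midpoint M of BC", {"BM = CM", "AM = AM"})]

without_help = solve(facts, rules, goal, [])
with_help = solve(facts, rules, goal, constructions)
```

Deduction alone stalls, and adding the midpoint unlocks a purely mechanical proof, which is the pattern the paragraph above describes.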

The lesson. Specialist systems with strong domain inductive biases can substantially outperform general-purpose systems on the domains they specialize for. AlphaGeometry is a clean example: the system is not “smart” in any general sense, but it is engineered specifically for the structure of Euclidean geometry, and the engineering pays off.

FunSearch: discovery via LLM-guided search

A different mode of AI mathematics. FunSearch (Romera-Paredes et al., 2023) “Mathematical discoveries from program search with large language models” used LLMs in an evolutionary search loop to discover solutions to combinatorial problems.

The setup. Take a problem with a verifiable objective, e.g., the cap-set problem (find the largest subset of $\mathbb{F}_3^n$ with no three points on a line, equivalently no three distinct points in arithmetic progression). Write a Python skeleton that builds a solution from a candidate function and scores the result. Use an LLM to mutate the candidate function (small changes, additions, rewrites). Keep the variants that improve the score.

Initial candidate function
→ LLM proposes mutated variants
→ Each variant evaluated against the verifiable objective
→ Best variants kept; loop repeats
→ After many iterations: improved candidate function (potentially a new mathematical discovery)
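A toy version of this loop on the cap-set problem, under loud assumptions: the "program" being evolved is just three weights that define a priority function for a greedy construction, and a random perturbation stands in for the LLM's proposed rewrite. The exact evaluator is the only part that resembles the real setup.

```python
import itertools
import random


def is_cap_set(points):
    """True if no three distinct points of F_3^n sum to zero (no line)."""
    for x, y, z in itertools.combinations(points, 3):
        if all((a + b + c) % 3 == 0 for a, b, c in zip(x, y, z)):
            return False
    return True


def greedy_cap_set(n, weights):
    """Build a cap set greedily, visiting vectors in priority order.

    The priority of a vector is the sum of weights[s] over its symbols s.
    """
    def priority(v):
        return sum(weights[s] for s in v)

    vectors = sorted(itertools.product(range(3), repeat=n),
                     key=priority, reverse=True)
    chosen = []
    for v in vectors:
        # v may join only if it completes no line with two chosen points.
        completes_line = any(
            all((a + b + c) % 3 == 0 for a, b, c in zip(v, x, y))
            for x, y in itertools.combinations(chosen, 2)
        )
        if not completes_line:
            chosen.append(v)
    return chosen


def evolve(n, generations, rng):
    best = [0.0, 0.0, 0.0]
    best_size = len(greedy_cap_set(n, best))
    history = [best_size]
    for _ in range(generations):
        # Random perturbation stands in for the LLM's mutation.
        candidate = [w + rng.gauss(0.0, 1.0) for w in best]
        size = len(greedy_cap_set(n, candidate))
        if size >= best_size:  # keep variants that do not score worse
            best, best_size = candidate, size
        history.append(best_size)
    return best, best_size, history


best, best_size, history = evolve(n=3, generations=30, rng=random.Random(0))
```

Every candidate is scored by an exact check, so a larger set found by the loop is a valid construction regardless of how the mutation was produced.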

The result. FunSearch discovered new constructions for the cap-set problem, including what the authors describe as the largest improvement in 20 years to the asymptotic lower bound. The discoveries were human-publishable: they generalized to families of constructions and were validated by mathematicians.

The mode of AI mathematics here is different from AlphaProof. FunSearch is not proving theorems but discovering combinatorial constructions. The LLM is acting as a creative mutation operator; the evaluation is symbolic; the loop is evolutionary search. The pattern generalizes: combinatorial-discovery problems with verifiable objectives are well-suited to this approach.

Conjecturing and pattern-finding

A third mode. Davies et al. (2021) “Advancing mathematics by guiding human intuition with AI” used deep neural networks to discover patterns in knot theory and representation theory — patterns that human mathematicians then proved into theorems.

The recipe. Compute many examples of a mathematical object; train a neural network to predict some property of the object; use saliency analysis on the trained network to identify which features of the input drive the prediction; conjecture that those features are mathematically significant; prove the conjecture by traditional means.

This is AI-assisted mathematics rather than AI-only mathematics. The AI provides the conjecture; humans provide the proof. The conjectures Davies et al. discovered were non-trivial — humans had been studying these objects for decades without spotting the patterns. AI’s value here is as a pattern-detection instrument in high-dimensional spaces that humans find hard to navigate.

The deep open question: AI-discovered important mathematics

The four systems above represent different modes of AI mathematics. They all share a limitation: the problems they have solved were known and posed by humans. AlphaProof solved IMO problems; AlphaGeometry solved IMO geometry problems; FunSearch improved on known combinatorial constructions; Davies et al. found patterns humans then proved.

What AI has not done in 2026: pose and solve a substantial new mathematical question that humans had not already framed. The “discovery” in FunSearch is a new construction for an old problem; the “discovery” in Davies et al. is a new pattern in old objects. The genuinely new direction — AI proposing a research programme, suggesting which problems are worth attacking, choosing the productive abstractions — is open.

Whether this open frontier is a quantitative gap (AI needs more data, more compute, better RL) or a qualitative gap (mathematics requires something AI fundamentally lacks) is debated. The chapter does not adjudicate. OP-S-3 (AI-discovered important mathematics) is the formal flag for this question.

Where AI for mathematics sits in 2026

The summary. Substantial progress on solving known hard problems (AlphaProof at IMO level, AlphaGeometry on geometry). Some progress on discovering new mathematical objects (FunSearch). Some progress on conjecturing patterns for human proof (Davies et al.). All of these are real advances; none constitute “AI doing mathematics” in the way a working mathematician does mathematics.

The systems are also computationally expensive. AlphaProof on a single IMO problem may use days of GPU-time; the FunSearch loop is similar. This is not the productivity regime where AI replaces mathematicians; it is the regime where AI provides expensive but capable tools for specific tasks.

The mathematics community’s reception of these results has been broadly positive. Major mathematicians (Terence Tao, Tim Gowers) have engaged seriously with AI tools; the Lean community has grown substantially; AI-assisted mathematical research is becoming a recognized methodology. Whether the trajectory continues to AI-as-research-mathematician remains to be seen.

§5. Chemistry and Molecules

Chemistry is the AI4S domain with the longest history of computational methods and some of the broadest current AI activity. We cover representation choices, property prediction with modern chemistry FMs, retrosynthesis (planning chemical reactions), generative molecule design, the Open Catalyst Project as a landmark scaling effort, and ML-accelerated quantum-chemistry simulation.

Molecular representations

The first design choice for any chemistry AI system: how to represent a molecule. Three dominant representations, each with characteristic strengths.

SMILES strings. A linear text encoding (Simplified Molecular Input Line Entry System). The molecule benzene becomes c1ccccc1; aspirin becomes CC(=O)OC1=CC=CC=C1C(=O)O. SMILES is compact, machine-readable, and Transformer-compatible — a molecule can be processed by a standard language-modelling architecture treating each atom or bond character as a token.

   SMILES EXAMPLE: aspirin
       SMILES: CC(=O)OC1=CC=CC=C1C(=O)O

       structure (drawn):
              O
              ║
       CH3 ── C ── O ──┐
                       │
                  ┌────┴────┐
                  │ benzene │
                  │   ring  │
                  └────┬────┘
                       │
                  HO ──C═O
                   carboxylic
                   acid group

       SMILES fragments, left to right:
         CC(=O)O        acetyl group (CH3, C=O) joined through an ester oxygen
         C1=CC=CC=C1    benzene ring (the digit 1 marks ring opening and closure)
         C(=O)O         carboxylic acid group

The disadvantage: SMILES is not unique (the same molecule can have multiple valid SMILES), and the linear ordering imposes an artificial sequence that does not reflect molecular structure. Canonicalization algorithms produce a unique SMILES, partially solving this.
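The tokenization step that makes SMILES Transformer-compatible can be sketched with a regular expression. The pattern below is a simplified illustration covering common organic-subset SMILES, not a full parser and not the tokenizer of any particular model.

```python
import re

SMILES_TOKEN = re.compile(
    r"\[[^\]]+\]"             # bracket atoms such as [NH4+] or [C@@H]
    r"|Br|Cl"                 # two-letter organic-subset atoms
    r"|[BCNOPSFI]"            # one-letter organic-subset atoms
    r"|[bcnops]"              # aromatic atoms
    r"|%\d{2}|\d"             # ring-closure labels
    r"|[()=#\-+/\\.:~*$@]"    # branches, bonds, and other symbols
)


def tokenize(smiles):
    """Split a SMILES string into tokens; fail on unrecognized characters."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unrecognized characters in {smiles!r}")
    return tokens


aspirin = tokenize("CC(=O)OC1=CC=CC=C1C(=O)O")
benzene_aromatic = tokenize("c1ccccc1")
benzene_kekule = tokenize("C1=CC=CC=C1")
halide = tokenize("ClCBr")
```

The two benzene strings describe the same molecule but yield different token sequences, which is the non-uniqueness problem in concrete form.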

Molecular graphs. Atoms as nodes, bonds as edges. The natural representation for graph neural networks. Captures the molecular structure directly with no artificial ordering. Most chemistry-specific deep learning (graph neural networks, message-passing networks) uses this representation.

A typical graph-based architecture (MPNN, Gilmer et al., 2017): each atom has a feature vector (element type, charge, hybridization); each bond has a feature vector (bond order, aromaticity); a graph neural network performs message passing — each atom updates its features based on its neighbours’ features over several rounds. The final atom embeddings are pooled to produce a molecular embedding.
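A stripped-down sketch of two message-passing rounds followed by sum pooling. It assumes the simplest possible choices (messages are neighbour features, the update is addition, bond features are ignored); a real MPNN uses learned message and update functions.

```python
def message_passing_round(features, edges):
    """One round: each atom adds the sum of its neighbours' feature vectors."""
    neighbours = {i: [] for i in features}
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    updated = {}
    for i, h in features.items():
        message = [sum(features[j][d] for j in neighbours[i])
                   for d in range(len(h))]
        updated[i] = [hi + mi for hi, mi in zip(h, message)]
    return updated


def readout(features):
    """Sum-pool atom embeddings into one molecule-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(h[d] for h in features.values()) for d in range(dim)]


# Ethanol heavy atoms, C-C-O, with one-hot features [is_carbon, is_oxygen].
atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]

h1 = message_passing_round(atoms, bonds)
h2 = message_passing_round(h1, bonds)
molecule_embedding = readout(h2)
```

After two rounds the terminal carbon's embedding carries information about the oxygen two bonds away, showing how the receptive field grows with the number of rounds.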

3D coordinates. For tasks where geometry matters (binding affinity, conformational stability, spectroscopic properties), the 3D structure of the molecule is essential. 3D representations call for architectures that respect rotational and translational symmetry: a rotated molecule is the same molecule, so scalar predictions such as energy should be invariant and vector predictions such as forces should rotate with the input. Equivariant networks (NequIP, MACE, e3nn-based architectures) are standard for 3D molecular tasks.

The trade-offs. SMILES is convenient but loses geometric information; graphs capture topology but not geometry; 3D representations capture geometry but require equivariance machinery. Modern systems often use combinations — a Transformer on SMILES tokens with auxiliary graph or 3D features.

Property prediction with chemistry foundation models

A typical chemistry task: given a molecule, predict a property — solubility, toxicity, binding affinity, catalytic activity. The traditional approach used hand-engineered molecular descriptors plus Random Forest or kernel methods.

Modern chemistry FMs replace this with pretrained representations. The recipe:

Pretrain. Train a large model (Transformer on SMILES; GNN on graphs; equivariant network on 3D coordinates) on millions of unlabelled molecules using self-supervised objectives — masked-atom prediction, contrastive learning between molecule representations, denoising.
Fine-tune. Adapt the pretrained model to a specific property-prediction task with a small labelled dataset (often only a few hundred to thousands of measured molecules).

Notable chemistry FMs:

ChemBERTa (Chithrananda et al., 2020). BERT-style masked-language-model pretraining on SMILES. Demonstrated FM-style transfer for molecular property prediction.
MoLFormer (Ross et al., 2022). Larger SMILES-based Transformer with rotary position embeddings. Strong on standard property-prediction benchmarks.
MolBERT, ChemGPT, SELFormer. Variations on the SMILES-FM theme.
GNN-based foundation models (Uni-Mol, GROVER, GraphMVP). Graph-based pretraining; more competitive on geometry-dependent tasks.
3D equivariant FMs (3D-MPP, GeoMolformer). Pretrained on 3D conformations for geometric property prediction.

The empirical state: chemistry FMs typically improve over hand-engineered-feature baselines by modest amounts (5–20% on standard benchmarks), with larger gains on small target datasets (where pretraining substantially reduces sample complexity). The pattern is familiar from general FM-as-substrate (Foundation Models chapter) — pretraining substitutes for labelled data.

Retrosynthesis: planning chemical syntheses

A different chemistry task: given a target molecule, plan a sequence of chemical reactions that synthesize it from commercially available starting materials. This is retrosynthesis — working backward from the target through intermediate molecules until the inputs are commercial reagents.

The space is combinatorially large. A typical synthesis is 5–20 reactions; at each step, many possible reactions could produce the target intermediate from different precursors. Choosing well requires deep chemistry knowledge.

The AI approach. Modern retrosynthesis systems combine:

A retrosynthetic-step predictor: a model that, given a target molecule, predicts what reactions could produce it. Trained on millions of known reactions (USPTO database, Reaxys).
A search algorithm: typically Monte Carlo Tree Search (MCTS) or beam search through the space of possible synthesis routes.
A commercial-reagent database: terminates the search when all leaf molecules are purchasable.
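The three components can be sketched together on a toy problem. Assumptions: a hand-written dictionary stands in for the learned one-step predictor, plain depth-first search stands in for MCTS or beam search, and molecules are referred to by name rather than structure.

```python
def plan_route(target, templates, purchasable, max_depth=5):
    """Depth-first retrosynthetic search.

    `templates` maps a molecule to a list of precursor tuples that could
    produce it in one step. Returns a list of (product, precursors) steps,
    or None if no route ends in purchasable molecules within `max_depth`.
    """
    if target in purchasable:
        return []
    if max_depth == 0:
        return None
    for precursors in templates.get(target, []):
        route = [(target, precursors)]
        for p in precursors:
            sub = plan_route(p, templates, purchasable, max_depth - 1)
            if sub is None:
                break  # this disconnection dead-ends; try the next one
            route.extend(sub)
        else:
            return route
    return None


templates = {
    "aspirin": [
        ("intermediate X",),                     # dead end: no known route
        ("salicylic acid", "acetic anhydride"),  # acetylation
    ],
    "salicylic acid": [("phenol", "carbon dioxide")],  # Kolbe-Schmitt
}
purchasable = {"phenol", "carbon dioxide", "acetic anhydride"}

route = plan_route("aspirin", templates, purchasable)
no_route = plan_route("aspirin", templates, purchasable=set())
```

The search backtracks past the dead-end disconnection and stops as soon as every leaf is purchasable. Production systems replace the dictionary with a model that ranks thousands of candidate disconnections, which is why guided search becomes necessary.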

Notable systems:

Synthia (Chematica) and AiZynthFinder (open-source). Production-grade retrosynthesis tools used by pharmaceutical chemists.
Chemformer (Irwin et al., 2022). Transformer-based reaction-prediction model.
RetroSynRL (Schreck et al., 2019) and later RL-based variants. RL to learn route-selection policies.

The state of practice. Retrosynthesis AI is useful but not autonomous. Pharmaceutical chemists use retrosynthesis tools as suggestions to evaluate, often combined with manual route design. The systems work best for routine syntheses; for novel chemistry (new bond-formation reactions, unusual functional groups), they fall back on suggestions chemists must validate.

Generative molecule design

The other inverse problem: given desired properties (binds a specified target, low toxicity, good solubility), design a novel molecule satisfying them.

Four technique families.

Sequence-based generation. Train an autoregressive model on SMILES; sample new molecules from the distribution. For conditional generation, condition on the target properties (predict the SMILES given a property vector). This is the simplest approach and works well for near-distribution design but struggles for novel chemistry.

Graph-based generation. Build molecules incrementally as graphs (add atoms one at a time, add bonds, terminate). Junction Tree VAE (Jin et al., 2018), GraphVAE, and modern variants. Stronger structural awareness; harder to train.

Equivariant diffusion on 3D coordinates. Treat molecule generation as a diffusion problem over atomic positions. EDM (Hoogeboom et al., 2022), GeoLDM, and successors. SE(3)-equivariant denoising networks. The dominant approach for 3D-aware molecular generation in 2026.

Latent-space generation. Encode molecules into a continuous latent (often with a VAE), optimize in latent space, decode. ChemVAE and successors. Useful for property optimization but produces lower-quality molecules than diffusion methods.

The practical use case: drug discovery and catalyst design. Modern pharmaceutical pipelines use AI molecular generation as a hypothesis-generation tool — propose candidate molecules with desired properties; chemists evaluate, refine, synthesize promising candidates. The AI generates many proposals; the chemist filters; the synthesis is human-driven.

The hit rate (fraction of AI-proposed molecules that, when synthesized, have the predicted property) has improved substantially from 2020 to 2026 but is still well below 50% in production drug-discovery settings. The systems are useful but require chemist filtering.

Open Catalyst Project: scaling chemistry data

A landmark scaling effort worth detailed treatment. Open Catalyst Project (OC) (Chanussot et al., 2021; Tran et al., 2023) at Meta AI is the chemistry analogue of ImageNet: a massive dataset (over 200 million DFT calculations as of 2024) on catalyst-surface interactions, released with strong baselines and ongoing community competitions.

The goal. Many important industrial processes (electrochemical CO2 reduction, hydrogen evolution, ammonia synthesis) depend on catalysts — surfaces that lower the activation energy of specific reactions. Discovering better catalysts could enable clean-energy applications. The traditional method is DFT simulation of each candidate, which is computationally expensive ( $10^3$ to $10^5$ CPU-hours per system). At hundreds of thousands of candidates of interest, the total simulation budget is enormous.
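As a rough worked figure under assumed round numbers (the candidate count $10^5$ and the $10^4$-core cluster are illustrative choices, not values from the Open Catalyst papers), the per-system costs quoted above imply:

$$
10^{5}\ \text{candidates} \times \left(10^{3}\ \text{to}\ 10^{5}\right)\ \frac{\text{CPU-hours}}{\text{candidate}} = 10^{8}\ \text{to}\ 10^{10}\ \text{CPU-hours}
$$

$$
\frac{10^{8}\ \text{CPU-hours}}{10^{4}\ \text{cores}} = 10^{4}\ \text{hours} \approx 1.1\ \text{years of wall-clock time}
$$

Even the low end of the range would occupy a large cluster for about a year.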

The OC approach. Use ML to approximate DFT — train neural-network potentials that predict energies and forces given atomic configurations. Once trained, the neural potential is 6 orders of magnitude faster than DFT, enabling screening at scale that DFT alone could not support.

The data. OC released sub-datasets at scale: OC20 (2020, about 1.3M DFT relaxations, comprising hundreds of millions of single-point calculations, on 200K candidate catalyst configurations); OC22 (2022, oxides); OC25 (2025, more diverse chemistry). Each dataset is paired with a benchmark and a leaderboard.

The architectural lessons. Best-performing methods on OC are equivariant (NequIP, MACE, EquiformerV2 architectures), use graph attention over the molecular structure, and benefit from large model sizes. The OC benchmarks have driven substantial methodological progress in chemistry deep learning.

The downstream impact. Several catalyst-discovery efforts have used OC-style ML potentials as preprocessing for downstream applications. The scaling-and-benchmark approach pioneered by OC has been replicated in other chemistry subdomains.

ML-accelerated DFT and quantum-chemistry simulation

A broader programme. Density Functional Theory (DFT) is the workhorse of computational chemistry, but it has well-known limitations: limited accuracy for certain functional choices, computational cost that scales as $O(N^3)$ in system size. ML can help in two modes.

Replace DFT with neural potentials. Train a neural network to predict energies and forces given atomic configurations (the OC approach above). Models built on local atomic environments predict in $O(N)$ time and, for chemistry well covered by the training data, can approach the accuracy of the underlying DFT. They cannot exceed it, because the DFT results are the training labels.

Improve DFT itself. Use ML to correct DFT predictions toward higher-accuracy references (CCSD(T), experiment). Delta-learning approaches: train a neural network to predict the difference between DFT and a higher-accuracy method. This makes high-accuracy quantum chemistry computationally affordable for larger systems.
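A minimal sketch of delta-learning, under toy assumptions: `e_low` and `e_high` are hypothetical one-dimensional stand-ins for a cheap method and a high-accuracy reference, and a least-squares line stands in for the neural network. The point is only the structure: the model learns the difference, and the corrected prediction is the cheap result plus the learned difference.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def e_low(x):
    """Hypothetical cheap method (plays the role of DFT)."""
    return x ** 2

def e_high(x):
    """Hypothetical reference: the cheap result plus a smooth systematic error."""
    return x ** 2 + 0.3 * x - 0.1

# Only a handful of expensive reference calculations are needed for training.
train_x = [0.0, 0.5, 1.0, 1.5, 2.0]
a, b = fit_line(train_x, [e_high(x) - e_low(x) for x in train_x])

def e_corrected(x):
    """Cheap method plus the learned correction."""
    return e_low(x) + a * x + b

x_new = 1.2
print("uncorrected error:", abs(e_low(x_new) - e_high(x_new)))
print("corrected error:  ", abs(e_corrected(x_new) - e_high(x_new)))
```

The approach works when the difference between the two methods is smoother and easier to learn than either energy surface on its own, which is the assumption built into this toy example.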

Notable systems:

NequIP (Batzner et al., 2022). Equivariant message-passing network for ML potentials; sample-efficient.
MACE (Batatia et al., 2022). Higher-order equivariant network; better accuracy with similar sample budget.
Allegro (Musaelian et al., 2023). Local equivariant network that scales better than message-passing alternatives.
JAX-MD and similar frameworks. Differentiable molecular dynamics with ML potentials.
MatterSim (Microsoft, 2024). Universal interatomic potential covering broad chemistry.

The combined effect. ML-accelerated quantum chemistry is now a substantial subfield with major applications in materials (§6), drug discovery, and basic chemistry. The recipes are mature; the open frontier is coverage (how broad a chemistry can a single model handle?) and transferability (do models trained on one chemistry class work on another?).

Where chemistry AI sits in 2026

The summary. Property prediction is mature; FMs improve over baselines but not dramatically. Retrosynthesis is useful but not autonomous. Generative molecule design is widely used in industry as a hypothesis-generation tool, but not as a stand-alone discovery system. Open Catalyst and ML potentials have substantively changed how computational chemistry is done — neural-potential-accelerated quantum chemistry is now standard for many tasks.

The unsolved problems. De novo drug discovery (designing a drug from scratch with AI alone) remains aspirational. Novel-chemistry retrosynthesis (truly new bond-formation reactions) remains hard. Transferability across chemistry families is limited. The field is making progress but is still substantially behind the protein subfield in terms of paradigm-shifting impact.

§6. Materials Discovery

Materials science — the discovery and characterization of solid materials with desired properties — is one of the most-impacted AI4S subfields. We cover the representation choices for crystals, equivariant neural networks for materials, universal interatomic potentials, and the two landmark systems (GNoME for stable-crystal discovery, MatterGen for generative crystal design).

Crystals, lattices, materials representation

Materials, broadly. Crystalline materials consist of atoms arranged in a repeating lattice — a unit cell that tiles 3D space. Specifying a crystal requires:

The chemical composition (which elements, in what stoichiometric ratio).
The unit cell (lattice vectors, defining the 3D periodicity).
The atomic positions within the unit cell.
The space group (the symmetry group of the crystal — there are 230 distinct 3D crystallographic space groups).
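The four ingredients above map directly onto a small data structure. The sketch below is illustrative only: the class and field names are assumptions, not any library's API, and the CsCl lattice constant is approximate. The composition is implied by the per-atom species list.

```python
from dataclasses import dataclass

@dataclass
class Crystal:
    species: list       # element symbol per atom in the unit cell
    lattice: list       # three lattice vectors (rows), in angstroms
    frac_coords: list   # fractional coordinates per atom, each in [0, 1)
    space_group: int    # international space-group number, 1..230

    def cartesian(self):
        """Cartesian position = sum over k of frac[k] * lattice_vector[k]."""
        return [
            [sum(f[k] * self.lattice[k][d] for k in range(3)) for d in range(3)]
            for f in self.frac_coords
        ]

    def volume(self):
        """Unit-cell volume |a . (b x c)|."""
        a, b, c = self.lattice
        cross = [
            b[1] * c[2] - b[2] * c[1],
            b[2] * c[0] - b[0] * c[2],
            b[0] * c[1] - b[1] * c[0],
        ]
        return abs(sum(a[i] * cross[i] for i in range(3)))

# Caesium chloride: simple cubic cell, space group 221 (Pm-3m).
cscl = Crystal(
    species=["Cs", "Cl"],
    lattice=[[4.12, 0.0, 0.0], [0.0, 4.12, 0.0], [0.0, 0.0, 4.12]],
    frac_coords=[[0.0, 0.0, 0.0], [0.5, 0.5, 0.5]],
    space_group=221,
)
print(cscl.cartesian()[1], round(cscl.volume(), 2))
```

Storing fractional rather than Cartesian coordinates keeps atomic positions meaningful when the lattice is strained or rescaled, which is one reason crystal file formats and generative models commonly use them.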

For amorphous materials (glasses, polymers) the periodicity is absent, but the same atom-position framework applies at the local-structure level.

The properties of interest include thermodynamic stability (does this material exist as a solid at relevant temperatures?), mechanical properties (hardness, elasticity), electronic properties (band gap, conductivity), magnetic properties, and chemical reactivity. Predicting these from structure-and-composition alone is the goal of computational materials science.

The data. Materials databases (Materials Project, started 2011) provide millions of DFT-computed materials properties. The database of known stable materials (those that have been synthesized) was around 50,000 before 2023; this number has been transformed by AI4S.

Equivariant neural networks for materials

Materials data is inherently 3D and periodic. The natural neural-network substrate is equivariant GNNs that respect:

Rotational equivariance: rotating the unit cell rotates the prediction correspondingly.
Translational equivariance: the lattice can be translated without changing intrinsic properties.
Periodicity: the unit cell repeats; predictions must respect this.
Permutation equivariance: identical atoms are interchangeable.
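For a scalar property such as total energy, these symmetries reduce to invariance, and the simplest way to obtain it is to build the prediction from interatomic distances. The sketch below checks this numerically for a toy pair energy. It is not any of the architectures discussed here; it assumes a single atomic species and an isolated cluster, so periodicity is not handled.

```python
import math
import random

def pair_energy(positions):
    """Toy energy: sum over atom pairs of a Lennard-Jones term in r_ij."""
    total = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            total += 4.0 * (r ** -12 - r ** -6)
    return total

def rotate_z(p, angle):
    """Rotate a 3D point about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * p[0] - s * p[1], s * p[0] + c * p[1], p[2]]

atoms = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.0, 1.2, 0.3], [0.9, 1.0, 1.1]]

# Apply a rotation, then a translation, then a relabelling of the atoms.
moved = [rotate_z(p, 0.7) for p in atoms]
moved = [[x + 2.0, y - 1.0, z + 0.5] for x, y, z in moved]
random.Random(0).shuffle(moved)

print(pair_energy(atoms), pair_energy(moved))  # equal up to rounding
```

Vector-valued outputs such as forces must instead rotate with the input, which is the equivariance that the architectures below build into their internal features.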

Multiple equivariant architectures have matured for materials:

SchNet (Schütt et al., 2017). Early continuous-filter convolutional network for atomistic systems. Rotation-invariant; works on atomic-position data.

NequIP (Batzner et al., 2022). E(3)-equivariant message-passing with tensor-product operations; substantially more sample-efficient than non-equivariant alternatives. The architectural advance was using higher-order equivariant features (vectors, tensors) rather than just scalars.

MACE (Batatia et al., 2022). Higher body-order messages; better accuracy than NequIP at similar compute. Has become a standard choice for materials simulation.

EquiformerV2 (Liao et al., 2023). Transformer-style architecture with equivariant attention; strong on the Open Catalyst benchmarks.

M3GNet (Chen and Ong, 2022). Materials-3-body graph network with directed graph representation. Production-grade universal potential.

Universal interatomic potentials

A landmark achievement of 2022–2024. Universal interatomic potentials (UIPs) are neural-network potentials trained on data spanning the entire periodic table — covering essentially all chemical elements and most structural patterns. Once trained, they can perform molecular dynamics and property prediction for any element combination, replacing per-system DFT for many tasks.

Notable UIPs:

M3GNet (Chen and Ong, 2022). One of the first universal potentials; trained on Materials Project data.
MACE-MP-0 (Batatia et al., 2024). Trained on roughly 1.6 million DFT configurations drawn from Materials Project relaxation trajectories (the MPtrj dataset); covers most chemistry with reasonable accuracy.
MatterSim (Microsoft, 2024). Production-grade universal interatomic potential.
GNoME’s underlying potential (Merchant et al., 2023). Used internally as the screening front-end before DFT validation.

The impact. UIPs enable molecular dynamics simulation for arbitrary materials at near-DFT accuracy at $10^5$ to $10^6$ times the speed of DFT. Materials properties that previously required dedicated DFT campaigns (phonon spectra, thermal conductivity, defect formation energies) can now be computed routinely.

The open issues. UIPs are accurate for bulk properties of typical materials. They are less reliable for surfaces, interfaces, defects, and exotic chemistry. Continued improvement is the work of the field; the trajectory has been steady from 2022 to 2026.

GNoME: 380K stable crystals

The landmark crystal-discovery system. GNoME (Graph Networks for Materials Exploration; Merchant, Batzner, Schoenholz et al., 2023, “Scaling deep learning for materials discovery”) used an active-learning loop in which a GNN predicts stability and DFT validates the most promising candidates.

The recipe:

flowchart TD
  Gen["$$\text{Generate candidate crystals}$$\nchemical substitutions; random structure search; structure prototypes"]
  Predict["$$\text{GNN predicts stability}$$\nenergy above hull"]
  Prioritize["$$\text{Prioritize}$$\nlowest predicted energy-above-hull candidates"]
  DFT["$$\text{DFT validation (expensive)}$$\ncompute actual energy; confirm stability"]
  Retrain["$$\text{Add DFT-validated structures to training data; retrain GNN}$$"]

  Gen --> Predict --> Prioritize --> DFT --> Retrain
  Retrain -. loop back .-> Gen

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class Gen,Predict,Prioritize,DFT,Retrain pill

The candidate-generation step uses several strategies: chemical substitution (take a known stable structure; substitute elements; predict the new structure’s stability); random structure search (sample random structures and predict); structure prototypes (combine known structure types with new compositions).

The result. After many iterations of the active-learning loop, GNoME reported roughly 2.2 million crystal structures predicted to be stable relative to previously known materials. Of these, about 380,000 lie on the updated convex hull (the hull recomputed with the new discoveries included) and are the headline stable structures. The structures spanned diverse chemistries: alkaline earth fluorides, transition-metal oxides, rare-earth borides, etc.
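The stability criterion used throughout, energy above the convex hull, can be made concrete for a binary A-B system. In the sketch below the formation energies are made-up numbers (eV/atom) and the function names are illustrative; real pipelines build the hull over many elements at once.

```python
def lower_hull(points):
    """Lower convex hull of (x, energy) points via a monotone-chain scan."""
    hull = []
    for p in sorted(points):
        # Drop the last vertex while it lies on or above the chord to p.
        while len(hull) >= 2:
            (x1, y1), (x2, y2) = hull[-2], hull[-1]
            cross = (x2 - x1) * (p[1] - y1) - (y2 - y1) * (p[0] - x1)
            if cross <= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def hull_energy(hull, x):
    """Linearly interpolate the hull energy at composition x."""
    for (x1, y1), (x2, y2) in zip(hull, hull[1:]):
        if x2 > x1 and x1 <= x <= x2:
            return y1 + (y2 - y1) * (x - x1) / (x2 - x1)
    raise ValueError("composition outside the hull range")

def energy_above_hull(hull, x, energy):
    return energy - hull_energy(hull, x)

# x = fraction of B; hypothetical formation energies of the known phases.
known = [(0.0, 0.0), (0.5, -0.4), (1.0, 0.0)]
hull = lower_hull(known)

print(energy_above_hull(hull, 0.25, -0.15))  # positive: above the hull
print(energy_above_hull(hull, 0.75, -0.30))  # negative: a new stable phase
```

A candidate with a negative value lies below the current hull, so adding it changes the hull for every later candidate. This is why a count of structures that are stable against earlier data can be much larger than the count that remain on the final, updated hull.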

The validation. A subset of GNoME predictions were synthesized in collaborator labs; the hit rate (predicted-stable structures that were actually synthesizable) was substantially higher than random selection. The 380K number is also validated computationally — by full DFT calculation — which is itself a strong but not perfect criterion.

The interpretation. GNoME demonstrated that scaling active-learning loops in materials science could produce orders-of-magnitude expansions in known stable materials. The 380K predicted structures are a starting point for downstream applications: candidate batteries, candidate catalysts, candidate superconductors, candidate magnetic materials. Whether these candidates lead to industrially useful materials is the next question.

MatterGen: generative crystal design

The inverse problem in materials. Given desired properties (band gap, magnetism, density), design a crystal with those properties. MatterGen (Microsoft, 2024) applies equivariant diffusion to this problem.

The setup. Train a diffusion model on the joint distribution of (composition, lattice, atomic positions) over the known-stable-materials database. The model can be conditioned on desired properties — generate materials with band gap in a target range, or with specified magnetic structure.

The architecture uses equivariant denoising — the model respects the symmetries of crystallographic space groups. The diffusion is over the joint structural and compositional space, requiring careful handling of the discrete-composition and continuous-position degrees of freedom.

The result. MatterGen produces novel crystal candidates with target properties. As of 2024–2026 the system is in active use as a hypothesis-generation tool in materials labs. The hit rate (predicted-stable, target-property-matching, experimentally-synthesizable materials) is non-trivial but well below 100% — humans-in-the-loop filtering is still required.

The relationship between GNoME and MatterGen. They are complementary. GNoME generates many candidates and filters for stability; MatterGen generates conditioned-on-property candidates. Both feed the experimental-validation pipeline. The combination is an instance of the generate-and-validate AI4S workflow that is becoming standard.

The active-learning / experimental-validation loop

The cross-cutting methodology of materials AI4S. The loop:

Predict. Use a fast ML model to predict candidate-material properties.
Prioritize. Choose candidates worth experimental validation (based on predicted properties + uncertainty).
Validate. Run the experiment (synthesis, characterization) or higher-accuracy simulation.
Update. Add validated data to training set; retrain the ML model.

This loop is not unique to materials — it appears in chemistry (catalyst discovery), drug discovery, and increasingly in scientific reasoning agents (§10). Materials is the cleanest exemplar because the validation step (DFT) is computationally tractable and the loop can iterate at scale.

The open question: how to choose candidates well. Active-learning theory (which inputs are most informative to validate) has decades of development, but applying it well to scientific discovery loops is non-trivial. Modern approaches use uncertainty quantification (Bayesian neural networks, ensembles, conformal prediction) to estimate which predictions are uncertain enough to be worth validating.
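The loop and the uncertainty-driven prioritization can be sketched together in a few lines. Everything here is a toy assumption: `oracle` stands in for the expensive validation step, the ensemble members are nearest-neighbour predictors fit to bootstrap resamples, and candidates are ranked by a lower confidence bound (predicted value minus `kappa` times ensemble spread) because lower energy is better.

```python
import random
import statistics

def oracle(x):
    """Stand-in for the expensive validation step (e.g. a DFT energy)."""
    return (x - 0.7) ** 2

def nearest_value(train, x):
    """Predict the label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def ensemble_predict(train, x, rng, n_members=8):
    """Mean and spread over members fit to bootstrap resamples of the data."""
    preds = []
    for _ in range(n_members):
        sample = [rng.choice(train) for _ in train]
        preds.append(nearest_value(sample, x))
    return statistics.mean(preds), statistics.pstdev(preds)

def active_learning(candidates, train, rounds, kappa=1.0, seed=0):
    rng = random.Random(seed)
    pool = list(candidates)
    for _ in range(rounds):
        def score(x):
            # Predict: the ensemble is refit on the current data at each call.
            mean, spread = ensemble_predict(train, x, rng)
            return mean - kappa * spread
        pick = min(pool, key=score)          # Prioritize
        pool.remove(pick)
        train.append((pick, oracle(pick)))   # Validate, then update
    return train

candidates = [i / 20 for i in range(21)]
train = [(0.0, oracle(0.0)), (1.0, oracle(1.0))]
train = active_learning(candidates, train, rounds=6)
best_x, best_y = min(train, key=lambda pair: pair[1])
print(best_x, best_y)
```

Setting `kappa` to zero gives pure exploitation of the current model; larger values spend more of the validation budget on candidates where the ensemble members disagree.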

Where materials AI sits in 2026

The summary. Universal interatomic potentials are mature and widely used; they have substantively replaced per-system DFT for many tasks. GNoME-style active-learning discovery has produced order-of-magnitude expansions in known stable materials. Generative materials design (MatterGen) is in production use as a hypothesis tool. The experimental-validation loop is the methodological substrate.

The unsolved problems. Synthesizability — many predicted-stable materials cannot actually be synthesized with current methods; predicting synthesizability is harder than predicting thermodynamic stability. Property generalization — UIPs trained on bulk crystal properties don’t always work on surfaces, defects, or exotic phases. Time-to-impact — even the best AI-discovered materials require years of additional engineering before commercial deployment.

Materials AI4S is one of the success stories alongside proteins. The field has moved from “potentially useful” to “production tool” in the 2022–2026 period, with active-learning loops the dominant methodology.

§7. Biology Beyond Proteins

Protein structure and design (§3) is the most-developed area of AI4S, but biology is much broader. This section covers the rest: genomics (modelling DNA and its regulatory consequences), variant-effect prediction (which mutations cause disease), single-cell biology (modelling individual cells from their transcriptomes), DNA-scale generative models, and the emerging cross-domain biology foundation models.

Genomics: foundation models for DNA

DNA is, in one sense, the simplest biological data: a sequence over a 4-letter alphabet (A, C, G, T), with the human genome at $\sim 3 \times 10^9$ letters. But the functional consequences of DNA are vastly complex — regulatory elements interact across megabase distances, sequence variants affect phenotype through many pathways, and most of the genome’s function is not yet understood.

Modern genomics AI uses the Foundation Model paradigm. Train large sequence models on DNA; use the pretrained representations for downstream tasks.

Enformer (Avsec et al., 2021). A Transformer-based model for predicting gene expression from DNA sequence. Trained on the ENCODE database, Enformer predicts thousands of regulatory tracks (transcription-factor binding, chromatin accessibility, gene expression) across cell types from a 200kb genomic window. The architectural advance: Transformer attention captured long-range regulatory interactions that previous CNN-based methods (Basset, ExPecto) could not.

AlphaGenome (DeepMind, 2025). A successor and major scaling. AlphaGenome predicts many genomic-output tracks from sequence at substantially higher accuracy than Enformer, using a Transformer-based architecture scaled to much longer input windows. The model is trained on a much larger collection of functional-genomics data and produces predictions used in variant-effect interpretation and regulatory-element discovery.

Nucleotide Transformer (Dalla-Torre et al., 2024). An InstaDeep-led collaboration; multiple models pretrained on diverse genomes (the human reference, 1000 Genomes Project variants, multiple species), with weights distributed through Hugging Face. Provides general-purpose genomic embeddings for downstream tasks.

The pattern. DNA foundation models follow the LLM template: large Transformer; masked-language-model or causal-LM pretraining on genome sequences; fine-tune for downstream prediction tasks. The downstream tasks include variant effect, regulatory-element prediction, sequence design, and cross-species comparison.
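The data-preparation side of masked-language-model pretraining is simple enough to sketch. The choices below (non-overlapping 6-mers, a 15% mask rate, a literal `[MASK]` token) are assumptions for illustration; individual models differ in tokenization and masking details.

```python
import random

def kmer_tokens(seq, k=6):
    """Split a DNA string into non-overlapping k-mers (drops a short tail)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Return (masked input, targets); a target is None where nothing was masked."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_rate:
            masked.append("[MASK]")
            targets.append(tok)   # the model is trained to recover this token
        else:
            masked.append(tok)
            targets.append(None)  # no loss at this position
    return masked, targets

tokens = kmer_tokens("ACGTACGTACGTAC", k=6)
masked, targets = mask_tokens(tokens * 10)
print(tokens)
print(sum(t is not None for t in targets), "of", len(targets), "tokens masked")
```

The pretraining loss is computed only at masked positions. A causal-LM variant would skip the masking and train the model to predict each token from the tokens before it.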

Variant-effect prediction: AlphaMissense

A specific, consequential task. Millions of distinct missense variants (single-nucleotide changes that alter an encoded amino acid) have been observed across human genomes. Some cause disease (sickle-cell anemia, some forms of cystic fibrosis, many forms of cancer); most are benign. Predicting which is which has been a long-standing problem.
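What makes a single-nucleotide change missense, as opposed to synonymous or nonsense, follows directly from the genetic code. The sketch below builds the standard codon table and classifies a substitution within one codon; the function name and return format are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position.
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

def classify_snv(codon, position, new_base):
    """Classify a single-nucleotide substitution within one codon."""
    mutated = codon[:position] + new_base + codon[position + 1:]
    before, after = CODON_TABLE[codon], CODON_TABLE[mutated]
    if after == before:
        kind = "synonymous"   # same amino acid
    elif after == "*":
        kind = "nonsense"     # premature stop codon
    elif before == "*":
        kind = "stop-lost"    # stop codon replaced by an amino acid
    else:
        kind = "missense"     # different amino acid
    return before, after, kind

# The sickle-cell substitution in beta-globin: GAG (Glu) -> GTG (Val).
print(classify_snv("GAG", 1, "T"))
print(classify_snv("GAG", 2, "A"))  # GAA still encodes Glu
print(classify_snv("GAG", 0, "T"))  # TAG is a stop codon
```

The classification says nothing about pathogenicity: whether a given amino-acid swap is tolerated depends on its structural and functional context, which is what the predictors discussed next try to capture.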

AlphaMissense (Cheng et al., 2023) is the AI4S landmark for this task. The system adapts the AlphaFold architecture, adding a missense-effect prediction head. The pipeline:

For a given protein and a candidate missense variant, feed the variant sequence and its multiple-sequence alignment through the AlphaFold-derived network, so that the model sees both structural and evolutionary context.
Use the resulting features to score the variant’s pathogenicity directly; the model does not predict how the variant changes the structure.

AlphaMissense is fine-tuned on weak labels derived from population data rather than on clinical annotations: variants commonly observed in human and primate populations are treated as benign, and unobserved variants as putatively pathogenic. It provides pathogenicity predictions for ~71 million possible human missense variants. The predictions agree with experimental and clinical annotations at substantially higher accuracy than prior methods.

The clinical impact. Variant-effect databases (ClinVar, gnomAD) now incorporate AlphaMissense predictions as supplementary annotations. Clinical-genetics labs use the predictions to triage variants of uncertain significance (VUSs) — a major bottleneck in clinical diagnosis.

Single-cell biology: scGPT and friends

A different scale. Modern single-cell sequencing produces gene-expression profiles for millions of individual cells. Each cell is a vector of expression levels across $\sim 20{,}000$ genes; large datasets contain tens of millions of such vectors across tissues, conditions, and species.

The task. Given a single-cell expression profile, predict its cell type, cell state, response to perturbation, or developmental trajectory. Cross-experiment integration (combining data from different labs, technologies, conditions) is a major challenge — the batch effects in single-cell data are substantial.

The FM approach. Pretrain a Transformer-style model on millions of single-cell expression profiles using self-supervised objectives (masked-gene prediction, contrastive learning). Use the pretrained model as a substrate for downstream cell-type classification, batch correction, and perturbation prediction.

Notable systems:

scGPT (Cui et al., 2023). Generative Pretrained Transformer for single-cell biology. Trained on tens of millions of cells; provides general-purpose single-cell embeddings.
Universal Cell Embedding (UCE) (Rosen et al., 2023). Cross-species single-cell foundation model; pretrains on cells from many organisms simultaneously.
Geneformer (Theodoris et al., 2023). Transformer pretrained on gene-expression rankings; strong on perturbation-response prediction.
scFoundation (Hao et al., 2024). Large-scale (~100M parameter) single-cell FM.
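One concrete way to turn an expression vector into Transformer input is the rank-based encoding mentioned for Geneformer: represent each cell as its genes ordered by expression. The sketch below is a simplified illustration, not the published preprocessing. The optional `scale` dictionary (a hypothetical per-gene typical expression level) shows how normalization can push ubiquitously expressed genes down the ranking.

```python
def rank_encode(expression, gene_names, scale=None, max_len=None):
    """Order expressed genes by (optionally normalized) expression, highest first."""
    scale = scale or {}
    scored = [
        (value / scale.get(gene, 1.0), gene)
        for gene, value in zip(gene_names, expression)
        if value > 0                      # unexpressed genes are dropped
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by name
    tokens = [gene for _, gene in scored]
    return tokens[:max_len] if max_len else tokens

genes = ["ACTB", "GAPDH", "CD3E", "MS4A1"]
counts = [120, 80, 6, 0]                  # made-up counts for one cell

# Hypothetical typical expression levels across a reference corpus.
typical = {"ACTB": 100.0, "GAPDH": 100.0, "CD3E": 2.0, "MS4A1": 2.0}

print(rank_encode(counts, genes))
print(rank_encode(counts, genes, scale=typical))
```

With normalization, the cell-type marker outranks the housekeeping genes even though its raw count is far lower, so the token sequence emphasizes what distinguishes this cell.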

The empirical state. Single-cell FMs improve over task-specific baselines on standard benchmarks but the gains are modest, and the field is debating whether the FM paradigm is the right fit for single-cell biology — the data has fundamentally different structure from text or images, and the FM gains seen in those domains may not transfer cleanly. This is OP-S-5 (whether biology FMs match LLM-style scaling).

Cell-state and perturbation modelling

A specific application worth flagging. Drug discovery often asks: given a perturbation (drug treatment, gene knockout, environmental change), how do cells respond? Predicting cellular responses to novel perturbations from training data on observed perturbations is a key drug-discovery task.

The Tahoe-100M dataset (2025) contains roughly $100$ million single-cell gene-expression profiles, measured across on the order of a thousand drug perturbations applied to dozens of cancer cell lines. This is a massive scale-up of perturbation data. Models trained on this data (Tahoe family) aim to make zero-shot predictions for new perturbations.

GEARS (Roohani et al., 2023). Predicts gene-expression responses to combinatorial perturbations using a graph neural network over the gene-regulatory network.

The challenge. The combinatorial space of (cell type) × (perturbation) × (dose) × (time) is vast; even with massive datasets, most points are unmeasured. Generalization across this space is the hard problem.

DNA generative models: Evo and the genome-scale frontier

A different mode. Just as protein language models can generate novel proteins (§3), DNA-scale models can generate novel genomes.

Evo (Nguyen et al., Arc Institute, 2024). A 7-billion-parameter model trained on prokaryotic genomes (bacteria, archaea), built on the StripedHyena architecture, which interleaves attention with long-convolution layers rather than using a pure Transformer. Demonstrates generative DNA modelling at scale: Evo can generate biologically-plausible sequences of bacterial-genome-scale length and can be conditioned to generate sequences with specific properties (CRISPR systems, specific gene functions, etc.).

The systems-biology applications. Evo and successors are being used to design synthetic genomes, predict the effects of large-scale genomic changes, and understand the global structure of microbial sequence space. The applications are early-stage but the capability is real.

The 2026 question. Whether DNA-generative models will produce protein-like transformations of microbiology and synthetic biology is open. The substrate is in place; the downstream applications are developing.

Multi-omics integration

Biology in 2026 is a multi-omics discipline. A single experiment can produce:

Genomics (DNA sequence and variants).
Transcriptomics (RNA expression).
Proteomics (protein abundance and modifications).
Metabolomics (small-molecule levels).
Epigenomics (chromatin state and DNA methylation).
Spatial information (where in tissue each measurement comes from).

Each modality provides a partial view of biological state; integration across modalities is essential for full understanding.

The AI4S approach. Multi-modal foundation models that consume multiple omics types simultaneously. OmiCLIP and similar systems use contrastive learning across modalities. scMultiOmics integrates RNA and chromatin data at single-cell resolution. The architectural pattern is borrowed from multimodal vision-language models (cross-attention across modality-specific encoders, contrastive objectives, joint embedding spaces).

This is one of the most active areas of biology AI4S in 2026 and is likely to expand substantially in the coming years.

Where biology AI sits in 2026

The summary. Protein structure (§3) is the success story. Genomics and variant-effect prediction (AlphaMissense, AlphaGenome) are now production tools used in clinical workflows. Single-cell biology FMs are mature but have shown only modest gains. DNA-scale generative models (Evo) are early but with substantial potential. Multi-omics integration is the active frontier.

The unsolved problems. Generalization across species is limited (most data is human or model organism; transfer to non-model organisms is poor). Causal interpretation of model predictions is weak (correlational signals dominate; experimental validation is essential). Translational impact (from model prediction to clinical or biotechnological deployment) is slow and uneven.

Biology AI4S is a substantial subfield where the protein-structure transformation is the exception rather than the rule. Most subdomains are seeing incremental improvements rather than transformative breakthroughs. Whether the AI4S trajectory in biology will produce more AlphaFold-scale results — perhaps in dynamics, cellular response prediction, or multi-omics integration — is an open empirical question.

§8. Climate, Weather, and Earth Sciences

A domain where AI4S has produced direct competitive impact: deep-learning weather forecasting models now outperform, on many key metrics, physics-based numerical weather prediction systems refined over 70 years. This is one of the clearest cases where ML has displaced a well-established traditional methodology.

Numerical weather prediction: 70 years of physics simulation

The context. Numerical Weather Prediction (NWP) has been the workhorse of forecasting since the 1950s. The recipe: discretize the atmosphere into a 3D grid; integrate the physical equations of fluid dynamics, thermodynamics, and radiative transfer; produce a forecast.

The system. Modern NWP systems run at global resolution of 5-10 km grid spacing, with $10^9$ to $10^{10}$ grid points covering atmosphere and oceans. The integration uses ensembles of many simulations to capture uncertainty. The leading systems:

ECMWF IFS (Integrated Forecast System, European Centre for Medium-Range Weather Forecasts). The world-leading deterministic forecast. Generates global forecasts every 6 hours at 9 km resolution.
GFS (Global Forecast System, NOAA). The US counterpart.
MetOffice UM, JMA GSM, etc. Other national systems.

The computational cost is enormous. ECMWF’s supercomputer is among the largest scientific computing systems in the world; a single 10-day forecast takes hours on this hardware. The systems are physically grounded — every term in the discretized equations corresponds to a physical process.

The data. Decades of weather observations and forecast outputs are systematically archived. The ERA5 reanalysis (ECMWF) provides a consistent 80-year retrospective dataset of atmospheric state at hourly intervals — over a petabyte of data. This dataset is the substrate that made deep-learning weather forecasting possible.

The 2022–2023 disruption: GraphCast, Pangu-Weather, FourCastNet

For decades, deep-learning weather forecasting was attempted with limited success. Two factors changed in 2022:

ERA5 made high-quality global atmospheric data available at scale. Previous attempts were data-limited; ERA5 provided enough.
Transformer and GNN architectures with the right inductive biases (handling spherical geometry, multi-scale dynamics, conservation laws) became available.

The breakthrough: three independent systems demonstrated that deep-learning models could outperform IFS on key metrics.

FourCastNet (Pathak et al., 2022). NVIDIA’s Adaptive Fourier Neural Operator approach. The first deep-learning model to match IFS on some metrics at substantially lower compute cost.

Pangu-Weather (Bi et al., 2023). Huawei’s hierarchical Transformer on Earth-system data. Outperformed IFS on standard headline-skill scores (RMSE on geopotential height, temperature, humidity) at 1–5 day lead times. Computationally orders of magnitude cheaper.

GraphCast (Lam et al., DeepMind, 2023). A graph neural network on an icosahedral mesh of Earth. Trained on roughly 40 years of ERA5. Outperformed IFS on 90% of standard verification targets for forecasts out to 10 days. The result that made the meteorological community take deep-learning forecasting seriously.

The GraphCast architecture, briefly:

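flowchart TD
  State["$$\text{Atmospheric state at time } t$$\nvariables on Earth’s surface and at 37 pressure levels"]
  Encoder["$$\text{Encoder GNN}$$\nmaps to multi-resolution icosahedral graph"]
  Processor["$$\text{Processor: 16 GNN layers}$$\nmessage passing on the icosahedral graph"]
  Decoder["$$\text{Decoder GNN}$$\nmaps back to grid"]
  Next["$$\text{Atmospheric state at time } t + 6 \text{ hours}$$"]

  State --> Encoder --> Processor --> Decoder --> Next
  Next -. iterate .-> State

  classDef pill fill:#fff,stroke:#1a1a1a,stroke-width:1px,color:#1a1a1a
  class State,Encoder,Processor,Decoder,Next pill

To forecast longer, iterate: apply the model again to its own output, up to 10 days (40 iterations of 6-hour steps).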

The key architectural choices: the icosahedral mesh respects the spherical geometry of Earth (avoids the pole problems of standard latitude-longitude grids); the graph neural network captures local atmospheric interactions; the multi-resolution structure handles both local and global scales; autoregressive rollout (iterating the model) handles arbitrary forecast horizons.
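Autoregressive rollout is the simplest of these choices to show in code. In the sketch below, `toy_step` is a hypothetical stand-in for the learned 6-hour model (here a damped shift on a four-cell ring); only the rollout loop and the step-count arithmetic reflect the text.

```python
def rollout(step_fn, state, n_steps):
    """Apply a one-step model repeatedly, feeding each output back as input."""
    trajectory = [state]
    for _ in range(n_steps):
        state = step_fn(state)
        trajectory.append(state)
    return trajectory

def toy_step(state):
    """Hypothetical 6-hour step: each cell mixes itself with its neighbour."""
    return [0.5 * state[i - 1] + 0.49 * state[i] for i in range(len(state))]

STEP_HOURS = 6
lead_time_days = 10
n_steps = lead_time_days * 24 // STEP_HOURS  # 40 model applications

forecast = rollout(toy_step, [1.0, 0.0, 0.0, 0.0], n_steps)
print(n_steps, len(forecast))
print(sum(forecast[0]), sum(forecast[-1]))  # the field total drifts over the rollout
```

Because each step consumes the previous step's output, any systematic one-step error compounds. In this toy the field total shrinks by 1% per step, which is invisible after 6 hours and obvious after 10 days.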

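The training. GraphCast was trained on roughly 40 years of ERA5 reanalysis as supervised regression: given the atmospheric state at time $t$ , predict the state at $t + 6$ hours. Training took roughly four weeks on 32 TPU v4 devices, which is substantial but a one-off cost. Once trained, the model produces a 10-day forecast in under a minute on a single TPU v4 device, against the hours of supercomputer time that a physics-based forecast requires.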

The deployment. ECMWF now runs machine-learning forecast models alongside the physics-based IFS, including its own AIFS, which draws on GraphCast-style graph-network ideas. Production forecasts in 2026 are hybrids: combinations of physical NWP and ML-based predictions, with the ML systems providing speed and certain skill advantages, the physical systems providing physical-consistency guarantees.

Why deep learning works here

An observation worth flagging. Weather forecasting is an unusual regime in which deep learning beats a mature physics-based method. Several factors enable this:

Massive high-quality data. 40+ years of consistent global atmospheric measurements. The data is the substrate.
Stationary dynamics. Weather statistics (the climate distribution) are roughly stationary over the data period. Models trained on past data generalize to near-term future.
Spatial-temporal locality. Atmospheric dynamics are local in space and time. GNN architectures match this structure.
Smooth, predictable underlying process. Despite the chaos at long timescales, short-range weather is dominated by smooth, low-dimensional dynamics that ML can capture.

These factors do not hold for many AI4S domains — most do not have ERA5-scale data, and many do not have stationary dynamics. Weather is a particularly favorable setting for AI4S.

GenCast: probabilistic ensemble forecasting

A more recent advance. GenCast (Price et al., DeepMind, 2024) extends the GraphCast recipe to probabilistic forecasting. The system uses diffusion (Generative Models §6) to sample ensemble forecasts — many plausible weather futures rather than a single deterministic prediction.

The architectural pattern: take a GraphCast-style network; train it as a denoising network on noisy atmospheric states; at inference, sample diverse plausible forecasts by running the diffusion sampler. Each forecast is a complete coherent atmospheric evolution; the ensemble is a probabilistic picture of forecast uncertainty.

GenCast outperformed ECMWF’s ensemble forecast system on the standard probabilistic-forecasting verification metrics (CRPS, reliability) by substantial margins. The result extended the deep-learning-beats-physical-NWP story from deterministic to probabilistic forecasting.
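CRPS (continuous ranked probability score) for a finite ensemble has a compact closed form: the mean absolute error of the members minus half the mean absolute difference between members. The sketch below implements that standard estimator for a single scalar variable; lower is better.

```python
def crps_ensemble(members, observation):
    """CRPS estimate: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    m = len(members)
    error = sum(abs(x - observation) for x in members) / m
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return error - 0.5 * spread

obs = 2.0
print(crps_ensemble([2.0, 2.0], obs))  # sharp and correct
print(crps_ensemble([1.0, 3.0], obs))  # centred on the truth, but spread out
print(crps_ensemble([5.0, 5.0], obs))  # sharp but wrong
print(crps_ensemble([5.0], obs))       # one member: plain absolute error
```

The score rewards ensembles that are both accurate and appropriately confident: a spread-out ensemble centred on the truth scores worse than a sharp correct one, but far better than a sharp wrong one. In verification practice the score is averaged over grid points, variables, and forecast dates.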

The 2026 production state. Major operational forecasting centres are integrating GenCast-style ensemble approaches into their forecast pipelines, alongside or replacing parts of the physics-based ensemble systems.

Climate-scale modelling

A distinct problem. Climate modelling asks about decadal-to-centennial atmospheric and oceanic evolution — projections of warming, sea-level rise, regional climate change. The dynamics are similar to weather but the time horizon is longer (decades vs days), the boundary conditions matter more (greenhouse-gas concentrations evolve), and the validation data is sparser (we have one historical climate record; we cannot run controlled experiments on Earth).

The traditional approach: Earth System Models (ESMs) are extended NWP models that integrate atmosphere, ocean, ice, land surface, and biosphere over centuries. The major systems (CESM, GFDL CM4, UKESM, EC-Earth) are at the heart of IPCC climate assessments.

The AI angle. AI4S in climate is more nascent than AI4S in weather. Three modes:

ML emulators for ESMs. Train neural networks to emulate slow ESM components (cloud microphysics, ocean mixing, ice dynamics). The trained emulators run faster than the underlying simulations; ML-accelerated ESMs can run more scenarios at lower cost. Multiple groups (NCAR, DeepMind, NVIDIA Earth-2) are pursuing this.

Hybrid climate models. Combine physical simulation of resolved processes (atmospheric flow at $\sim 100$ km resolution) with ML representations of sub-grid processes (clouds, turbulence). This is an emerging approach with substantial promise but no mature production system as of 2026.

Direct AI climate projection. Train large models on climate-model outputs to predict future climates from current state. Less mature; raises concerns about physical-consistency and extrapolation. The community is cautious here.

The honest accounting. AI4S for climate is much less developed than AI4S for weather. The decadal time horizon, the non-stationary dynamics (climate is changing, so past data does not perfectly predict future), and the high-stakes nature of climate projections combine to make the field appropriately cautious about AI-only approaches. The expected near-term role of AI is as accelerator and emulator for physical models, not as replacement.

Limitations: extreme events, long-horizon, physical consistency

Three issues that affect deep-learning weather and climate models.

Extreme events. Standard forecasting metrics (RMSE, anomaly correlation) measure performance on typical weather. Extreme events (hurricanes, heat waves, floods) are tails of the distribution and may be poorly captured by ML models trained on aggregate skill. Recent benchmark work (extreme-event-specific verification metrics) is making this explicit; the picture is mixed (some extreme-event categories handled well by GraphCast/GenCast; others less so).
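The gap between aggregate and tail skill can be made concrete with a small sketch. Everything here is synthetic and assumed for illustration: the "truth" is standard-normal noise standing in for standardized anomalies, the "forecast" is a hypothetical over-smoothed model that shrinks every anomaly by 30%, and the function names are ours. It is not one of the verification metrics used in the benchmark work cited above, only the simplest possible tail-conditioned score.

```python
import numpy as np

def rmse(forecast, truth):
    """Root-mean-square error over all samples."""
    return float(np.sqrt(np.mean((forecast - truth) ** 2)))

def tail_rmse(forecast, truth, quantile=0.95):
    """RMSE restricted to samples whose observed value lies in the upper tail."""
    threshold = np.quantile(truth, quantile)
    mask = truth >= threshold
    return rmse(forecast[mask], truth[mask])

rng = np.random.default_rng(0)
truth = rng.standard_normal(10_000)   # synthetic standardized anomalies
forecast = 0.7 * truth                # hypothetical over-smoothed forecast

overall = rmse(forecast, truth)
extreme = tail_rmse(forecast, truth)
print(f"overall RMSE: {overall:.3f}, top-5% RMSE: {extreme:.3f}")
```

The smoothed forecast looks acceptable on the aggregate score but is markedly worse on the top 5% of events, which is the pattern the paragraph above warns about.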

Long-horizon forecasts. Deep-learning models excel at 1-10 day forecasts. At longer horizons (weeks to months), forecast skill degrades for both physical and ML models. ML models can have additional failure modes here — the autoregressive rollout can drift, producing physically implausible states at long horizons. Constraining ML models to remain physically plausible (energy conservation, mass conservation) is an active research direction.
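Rollout drift and its simplest remedy can be illustrated with a toy system. The sketch below assumes a two-dimensional rotation as the "true" dynamics (so the squared norm plays the role of a conserved energy) and a hypothetical learned emulator with a 0.2% one-step gain error; the names `rollout` and `conserve_energy` are ours. Real weather models have far richer invariants, so this is an illustration of the mechanism, not of any production method.

```python
import numpy as np

def rotation(theta):
    """2-D rotation matrix; preserves the squared norm of the state."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def rollout(step_matrix, state, n_steps, conserve_energy=False):
    """Apply a one-step model autoregressively.

    If conserve_energy is True, the state is rescaled after every step so
    that its squared norm matches the initial value (a projection onto the
    conserved-energy surface).
    """
    target_energy = float(state @ state)
    energies = []
    for _ in range(n_steps):
        state = step_matrix @ state
        if conserve_energy:
            state = state * np.sqrt(target_energy / float(state @ state))
        energies.append(float(state @ state))
    return state, np.array(energies)

learned_step = 1.002 * rotation(0.1)   # hypothetical emulator, 0.2% gain error per step
x0 = np.array([1.0, 0.0])

_, free_energy = rollout(learned_step, x0, 1000)
_, fixed_energy = rollout(learned_step, x0, 1000, conserve_energy=True)
print(f"energy after 1000 steps: free={free_energy[-1]:.1f}, constrained={fixed_energy[-1]:.3f}")
```

A one-step error that is negligible at short range compounds multiplicatively over the rollout; the projection step removes the drift in this toy case, at the cost of assuming the conserved quantity is known exactly.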

Physical consistency. A traditional NWP model is physically consistent by construction: every variable evolves according to physical laws. ML models are not so constrained; the predicted state might violate conservation laws or thermodynamic constraints. For most forecasting purposes this does not matter; for some downstream applications (driving an ocean model with atmospheric forcing, computing impacts) it does. Hybrid physics-ML approaches and physics-constrained training (PINN-style loss terms, Lagrangian regularization) address this, but the problem is not fully solved.

Where climate and weather AI sits in 2026

The summary. Weather forecasting has been substantively transformed by deep learning. Production forecasts now include AI components; ensemble forecasting is being revolutionized by GenCast and successors. The deep-learning approach is faster, cheaper, and on most metrics more accurate than the physical methods it has displaced.

Climate modelling is less transformed. AI emulators are accelerating ESM components; hybrid approaches are emerging; but the field’s response to AI4S has been (appropriately) cautious. Climate’s stakes, time horizons, and physical-consistency requirements demand higher standards than weather forecasting.

The trajectory. Weather AI4S is largely a success story. Climate AI4S is at an earlier stage with substantial potential but real challenges. The next 5–10 years will likely see climate AI4S grow substantially as the methodologies mature.

§9. Physics and Scientific Simulation

A broader and more heterogeneous domain than the previous sections. Physics spans astronomy, particle physics, fluid dynamics, plasma physics, quantum systems, and condensed matter — each with its own data, methods, and AI applications. We cover the major threads: neural-network PDE solvers, physics-informed neural networks (PINNs), symbolic regression, ML in high-energy and astrophysics, and neural-network quantum states.

Neural-network PDE solvers

Many physical systems are described by partial differential equations (PDEs) — Navier-Stokes for fluid flow, Maxwell’s equations for electromagnetism, the Schrödinger equation for quantum systems, the Einstein equations for general relativity. Solving these PDEs numerically is the workhorse of computational physics.

The traditional approach. Discretize the equation on a spatial grid, march forward in time, apply boundary conditions. Methods include finite differences, finite elements, spectral methods, and lattice-Boltzmann. These are physically grounded and well-understood; they are also expensive — high-resolution simulations of turbulent flows or plasma physics consume major supercomputing resources.

The AI angle. Train a neural network to learn the solution operator of a class of PDEs — a map from initial conditions and boundary conditions to solutions, learned from a database of pre-computed solutions. Once trained, the network is much faster than direct simulation.

Notable architectures:

Fourier Neural Operators (FNO) (Li et al., 2021). Represent solutions in Fourier space; learn convolutional filters in Fourier representation. Particularly effective for periodic and translation-invariant PDEs.
DeepONet (Lu et al., 2021). Branch-and-trunk architecture for learning operators between function spaces. More general but typically less sample-efficient than FNO on translation-invariant problems.
Graph Neural Operator (GNO, Li et al., 2020). For non-uniform grids and irregular geometries.
Transformer-based operators (Cao, 2021; OFormer, GAOT). Recently popular; competitive with FNO on standard benchmarks.
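The core operation of a Fourier-space operator layer can be sketched in a few lines. This is a minimal single-channel illustration, not the FNO reference implementation: the function name `spectral_conv_1d` is ours, the weights are set to ones rather than learned, and the pointwise linear path and nonlinearity that a full layer would add are omitted.

```python
import numpy as np

def spectral_conv_1d(u, weights):
    """One Fourier layer: FFT, scale the lowest modes by complex weights, inverse FFT.

    u:       real array of shape (n_points,), a function sampled on a periodic grid
    weights: complex array of shape (n_modes,), the per-mode multipliers that
             would be learned during training
    """
    n_points = u.shape[0]
    n_modes = weights.shape[0]
    u_hat = np.fft.rfft(u)
    out_hat = np.zeros_like(u_hat)
    out_hat[:n_modes] = weights * u_hat[:n_modes]   # higher modes are truncated
    return np.fft.irfft(out_hat, n=n_points)

def sample(n_points):
    """Sample sin(x) + 0.5 sin(20x) on a periodic grid of the given resolution."""
    x = np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False)
    return x, np.sin(x) + 0.5 * np.sin(20.0 * x)

weights = np.ones(8, dtype=complex)   # placeholder for learned weights

x64, u64 = sample(64)
x128, u128 = sample(128)
low64 = spectral_conv_1d(u64, weights)
low128 = spectral_conv_1d(u128, weights)
```

Because the weights act on Fourier modes rather than grid points, the same eight parameters apply unchanged at both resolutions; here they pass the low-frequency component and discard the mode-20 component on either grid.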

The strengths and limits. Trained neural operators run 10³ to 10⁶ times faster than the underlying numerical PDE solvers. They generalize well within the training distribution (same equation family, similar boundary conditions, similar parameter ranges). They generalize poorly outside — applied to PDE parameters or geometries unlike training, they degrade rapidly. For now, neural-network PDE solvers are best understood as fast approximators within a well-defined problem class, not as general-purpose physics simulators.

Physics-informed neural networks (PINNs)

A different mode. PINNs (Raissi, Perdikaris, Karniadakis, 2019) embed physical constraints directly into the neural-network loss function. Train a neural network $u_\theta(x, t)$ to satisfy:

Boundary conditions (penalize deviation at the domain boundary).
Initial conditions (penalize deviation at $t = 0$).
PDE residual (penalize the PDE’s residual computed via automatic differentiation through the network: $\partial u / \partial t - f(u, \nabla u, \nabla^2 u) = 0$).

The loss combines these terms. Training is unsupervised — the network learns to satisfy the PDE without needing pre-computed solutions.

Figure: the PINN training pipeline. PDE $u_t = F(u, u_x, u_{xx})$ with boundary conditions → sample collocation points (inside the domain; on the boundary; at $t = 0$) → neural network $u_\theta(x, t)$ → use autograd to compute $u_t, u_x, u_{xx}$ automatically through the network → loss = MSE residual of PDE + boundary terms + initial-condition terms → gradient descent on $\theta$.
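The three-term loss can be written out for a concrete case. The sketch below makes several assumptions that should be read as such: the PDE is the 1-D heat equation $u_t = u_{xx}$ on $x \in [0, 1]$, $t \in [0, 0.1]$ with zero boundary values and initial condition $\sin(\pi x)$; the terms are summed with equal weight; derivatives are taken by central finite differences instead of autograd so that the example needs only NumPy; and the loss is evaluated on two fixed candidate functions rather than minimized over network parameters. The function name `pinn_loss` is ours.

```python
import numpy as np

T_END = 0.1   # assumed time horizon for the example

def pinn_loss(u, n_collocation=2000, n_edge=200, h=1e-3, seed=0):
    """PDE-residual + boundary + initial-condition loss for u_t = u_xx.

    u is any callable u(x, t) on arrays. A real PINN would use a neural
    network here and obtain u_t, u_xx by automatic differentiation.
    """
    rng = np.random.default_rng(seed)

    # Interior collocation points (kept h away from the edges for the stencils).
    x = rng.uniform(h, 1.0 - h, n_collocation)
    t = rng.uniform(h, T_END - h, n_collocation)
    u_t = (u(x, t + h) - u(x, t - h)) / (2.0 * h)
    u_xx = (u(x + h, t) - 2.0 * u(x, t) + u(x - h, t)) / h**2
    residual = u_t - u_xx

    # Boundary points: u(0, t) = u(1, t) = 0.
    t_b = rng.uniform(0.0, T_END, n_edge)
    boundary = np.concatenate([u(np.zeros_like(t_b), t_b), u(np.ones_like(t_b), t_b)])

    # Initial points: u(x, 0) = sin(pi x).
    x_0 = rng.uniform(0.0, 1.0, n_edge)
    initial = u(x_0, np.zeros_like(x_0)) - np.sin(np.pi * x_0)

    return float(np.mean(residual**2) + np.mean(boundary**2) + np.mean(initial**2))

def exact(x, t):
    """Analytic solution of the assumed problem."""
    return np.exp(-np.pi**2 * t) * np.sin(np.pi * x)

def static(x, t):
    """Wrong candidate: satisfies boundary and initial conditions but never decays."""
    return np.sin(np.pi * x) + 0.0 * t

loss_exact = pinn_loss(exact)
loss_static = pinn_loss(static)
print(f"loss(exact) = {loss_exact:.2e}, loss(static) = {loss_static:.2e}")
```

The static candidate matches the boundary and initial data perfectly, so only the PDE-residual term separates it from the true solution; that term is what lets a PINN train without pre-computed solutions.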

The appeal. PINNs work for individual problems — no training data required, just the PDE. They are flexible (any PDE, any geometry, any boundary conditions). They are interpretable (the loss explicitly encodes the physics).

The limits. PINNs are slow. For a single problem instance, a PINN may take longer to converge than the underlying numerical solver. They struggle with stiff PDEs (multi-scale dynamics, sharp gradients) and with long-time-horizon dynamics, where the loss landscape becomes very hard to optimize and training often stalls. Recent variants (XPINN, hp-VPINN, Gauss-Newton optimizers for PINNs) address some of these problems, but no approach dominates all problem classes.

The honest assessment. PINNs are an interesting research direction with promising results for specific PDE classes; they are not yet a replacement for traditional numerical solvers in production scientific computing. The hype around PINNs in 2019–2021 has cooled appropriately; the field is now working on the genuine technical problems.

Symbolic regression and Lagrangian discovery

A different kind of physics AI. Symbolic regression is the problem of finding a symbolic formula (a closed-form mathematical expression) that fits a dataset — not a neural-network black box, but an interpretable equation.

The motivating example. Given experimental measurements, can an AI rediscover Kepler’s laws? Newton’s law of gravitation? Maxwell’s equations? The hope is that AI could not only fit data but discover laws — symbolic relationships that have explanatory and predictive power beyond the training data.

Notable systems:

Eureqa (Schmidt and Lipson, 2009). An evolutionary-search system for symbolic regression. Demonstrated discovery of simple physical laws from data.
AI Feynman (Udrescu and Tegmark, 2020). Combines neural-network-based regression with symbolic-decomposition strategies. Rediscovered many of the equations in Feynman’s lectures.
PySR (Cranmer, 2023). A modern, scalable symbolic-regression library. Used in several genuine scientific applications.
Lagrangian Neural Networks (Cranmer et al., 2020). Learn the Lagrangian of a dynamical system from trajectory data; the equations of motion then follow from the Euler-Lagrange equations (Hamilton’s principle), and conserved quantities follow from the Lagrangian’s symmetries via Noether’s theorem.

The applications. AI symbolic regression has produced real scientific discoveries in restricted settings — equations of state in materials science, scaling laws in physics, conservation laws in cosmology. The discoveries are typically rediscoveries of known physics (validating the technique) or modest extensions in specific subdomains.

The deep limitation. The space of possible symbolic formulas is vast; without strong inductive biases, symbolic regression is computationally intractable for complex problems. The successful applications all have substantial prior physics knowledge baked in (allowed operators, dimensional analysis, expected structure). The fully-open question — can AI discover fundamental new physical laws from raw data alone — remains genuinely open and is unlikely to be solved soon.
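The role of inductive bias can be seen in a deliberately tiny symbolic search. The sketch below assumes a hypothesis space of pure power laws $T = a^{p/q}$ with small integer $p$ and $q$, and uses rounded orbital data for six planets in units (astronomical units and years) that make the proportionality constant equal to one. Both choices are strong priors supplied by us; this is exhaustive enumeration, not the search strategy of Eureqa, AI Feynman, or PySR.

```python
from fractions import Fraction
import numpy as np

# Semi-major axis (AU) and orbital period (years), Mercury through Saturn.
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457])

def relative_mse(pred, target):
    """Mean squared relative error, so inner and outer planets count equally."""
    return float(np.mean(((pred - target) / target) ** 2))

# Hypothesis space: T = a ** (p / q) for small integers p and q.
candidates = sorted({Fraction(p, q) for p in range(1, 5) for q in range(1, 4)})
scores = {k: relative_mse(a ** float(k), T) for k in candidates}
best = min(scores, key=scores.get)
print(f"best formula: T = a^({best}), relative MSE = {scores[best]:.2e}")
```

With nine candidate exponents the search recovers Kepler's third law immediately. Allowing arbitrary operators, constants, and nesting depth would make the same enumeration explode combinatorially, which is the intractability described above.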

ML in high-energy and astrophysics

Two physics subdomains with substantial AI activity worth flagging.

High-energy particle physics. Data from the Large Hadron Collider (LHC) is petabyte-scale per year. Reconstructing particle trajectories from detector readouts, identifying particle species, separating signal from background — these are tasks where ML has been used routinely since the 2010s. Modern applications:

Jet tagging. Identifying jet origin (quark, gluon, b-quark, Higgs decay) from jet substructure using graph neural networks or Transformers on particle constituents.
Anomaly detection. Searching for new physics by identifying events unlike standard-model predictions. Autoencoder-based and contrastive approaches dominate.
Calorimeter simulation. GANs and normalizing flows used to generate simulated detector events much faster than full Geant4 simulation.

The ATLAS, CMS, and other LHC collaborations now integrate ML throughout the experimental pipeline. The field is mature; further gains are incremental rather than transformative.

Astrophysics and cosmology. Several active areas:

Gravitational-wave detection. ML used to classify LIGO/Virgo signal candidates and to denoise detector data. Real discoveries (binary black-hole mergers, neutron-star mergers) have been ML-assisted at multiple stages.
Galaxy classification. Convolutional networks classify galaxy morphology from telescope images; have processed millions of galaxies from Sloan Digital Sky Survey, DES, Euclid, and (soon) LSST.
Cosmological parameter estimation. Simulation-based inference (SBI) uses ML to invert cosmological simulations — given observed large-scale structure, infer the cosmological parameters that produced it. Truncated Marginal Neural Ratio Estimation (TMNRE) and Sequential Neural Likelihood Estimation (SNLE) are dominant techniques.
N-body simulation acceleration. Neural networks learn to predict the evolution of cosmological N-body systems faster than direct simulation. HaloFlow, CAMELS-trained emulators, and others.
Exoplanet detection. ML classification of transit signals in Kepler/TESS data.

The pattern. Astrophysics has structured, well-understood data and clear physics. ML is used widely as a fast classifier and emulator. Discovery-of-novel-physics applications are rare; the AI is mostly accelerating known workflows.

Neural-network quantum states

A research frontier with a different flavour. The problem: representing quantum many-body wavefunctions is generally exponentially expensive. Variational Monte Carlo approximates the ground state by parameterizing a wavefunction and minimizing the energy expectation value. Neural networks have proven effective as variational ansätze for many-body wavefunctions.
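Variational Monte Carlo is easiest to see on a problem with a known answer. The sketch below assumes the 1-D harmonic oscillator $H = -\tfrac{1}{2}\,d^2/dx^2 + \tfrac{1}{2}x^2$ (units with $\hbar = m = \omega = 1$) and replaces the neural-network ansatz with a one-parameter Gaussian $\psi_\alpha(x) = e^{-\alpha x^2}$, for which the local energy is $E_L(x) = \alpha + x^2(\tfrac{1}{2} - 2\alpha^2)$. Because $|\psi_\alpha|^2$ is Gaussian it is sampled directly, where a realistic calculation would use Metropolis sampling, and the parameter is chosen by grid search rather than gradient descent.

```python
import numpy as np

def local_energy(x, alpha):
    """E_L = (H psi) / psi for psi = exp(-alpha x^2)."""
    return alpha + x**2 * (0.5 - 2.0 * alpha**2)

def vmc_energy(alpha, n_samples=20_000, seed=0):
    """Monte Carlo estimate of <H> under |psi|^2, a Gaussian with variance 1 / (4 alpha)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, np.sqrt(1.0 / (4.0 * alpha)), n_samples)
    return float(np.mean(local_energy(x, alpha)))

alphas = [0.3, 0.4, 0.5, 0.6, 0.7]
energies = {alpha: vmc_energy(alpha) for alpha in alphas}
best_alpha = min(energies, key=energies.get)

for alpha in alphas:
    print(f"alpha = {alpha:.1f}  E = {energies[alpha]:.4f}")
```

The minimum sits at $\alpha = 0.5$, where the ansatz is the exact ground state and the local energy is constant, so the estimate has zero variance. Neural-network quantum states follow the same loop with a far more expressive $\psi_\theta$ and many parameters.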

Notable work:

NetKet / RBM-based ansätze (Carleo and Troyer, 2017). Restricted Boltzmann machines as variational wavefunctions; demonstrated competitive performance on quantum spin systems.
FermiNet (Pfau et al., 2020). A neural-network ansatz for many-electron systems satisfying fermion antisymmetry. Achieves chemical-accuracy ground-state energies for small molecules with a fundamentally different architecture from quantum-chemistry tradition.
PauliNet (Hermann et al., 2020). Related architecture with chemistry-specific structure.

The applications. Neural-network quantum states are used for strongly-correlated systems where conventional quantum chemistry (DFT, CCSD(T)) is inaccurate or intractable — transition-metal complexes, magnetic materials, model condensed-matter systems. The technique competes with quantum Monte Carlo and tensor-network methods rather than replacing them.

The state in 2026. An active research area with substantial promise for hard quantum problems; not yet a dominant tool in the physics workflow. The methods are sufficient for research demonstrations and certain niche applications; production-grade adoption is partial.

Where physics AI sits in 2026

The summary. Physics AI4S is heterogeneous. Some subdomains (jet tagging, galaxy classification, calorimeter simulation) are mature and routine. Others (PINNs, symbolic regression, neural quantum states) are active research with promising but partial results. Others still (fundamental-law discovery) are essentially open problems.

The unifying observation. Physics is unusual among AI4S domains in that it has strong theoretical foundations — equations and conservation laws derived from first principles, often with experimental confirmation to many decimal places. AI’s role in physics is mostly acceleration and pattern-finding, not replacement of theory. The trajectory in physics AI4S has been gradual rather than transformative — no AlphaFold-equivalent breakthrough has yet appeared. Whether one is coming is open.

§10. Scientific Reasoning Agents and Autonomous Research

The most recent and most speculative section of the chapter. The 2024–2026 period has seen the emergence of AI systems that conduct scientific research — proposing hypotheses, designing experiments, executing them, analyzing results, writing them up. The capability is at an early stage; the trajectory is rapid; the implications are substantial.

LLMs as scientific reasoning systems

The substrate. Modern LLMs (GPT-4, Claude, Gemini) demonstrate substantial scientific reasoning capability off the shelf. A well-prompted LLM can:

Explain scientific concepts at expert level across most natural-science fields.
Generate hypotheses about new phenomena, given background context.
Analyze experimental designs and identify flaws.
Interpret figures and tables (in their multimodal variants).
Write code to perform scientific computations.
Read and summarize scientific literature.

The LLM’s scientific reasoning is broad but shallow. It is comparable to an interdisciplinary postdoc — able to engage substantively with most scientific topics but rarely with the depth of a specialist. The reasoning is also unreliable — LLMs hallucinate citations, misremember details, and confidently produce incorrect technical claims.

The implication. LLMs are useful research tools when used carefully: literature search, brainstorming, drafting, code generation. They are unsafe as scientific authorities — claims need verification.

Tool-use for science

The natural extension. Augment the LLM with tools — external computation, search, simulation, lab equipment. The LLM becomes an agent that can take actions in the scientific world.

Standard tools for scientific agents:

Literature search. Semantic Scholar, arXiv, PubMed APIs. The agent retrieves papers and reads them.
Code execution. Python with NumPy, SciPy, RDKit, etc. The agent writes and runs analysis code.
Domain-specific computation. DFT (for chemistry), molecular-dynamics simulators (for biology), telescopes (for astronomy), instruments (in lab settings).
Database queries. Materials Project, PDB, UniProt, etc.
Symbolic mathematics. SymPy, Wolfram Alpha, Lean (for proof checking).
Visualization. matplotlib, ggplot, RDKit drawing, etc.

The agentic loop is the same as in any LLM tool-use system (LLM §8): the LLM decides which tools to call, parses the results, decides next actions. The scientific specifics are in the tools chosen and the evaluation framework (what does scientific success look like for this task?).
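The control flow of such a loop is small enough to sketch. In the example below the LLM is replaced by a hand-written `scripted_policy`, the two tools and the three-entry mass table are hypothetical stand-ins, and the task (the molar mass of water) is chosen only because the answer is easy to check. None of this corresponds to a specific agent framework's API.

```python
from typing import Any, Callable, Dict, List, Tuple

ATOMIC_MASS = {"H": 1.008, "C": 12.011, "O": 15.999}   # g/mol, small lookup table

# Hypothetical tool registry: name -> callable.
TOOLS: Dict[str, Callable[..., Any]] = {
    "atomic_mass": lambda symbol: ATOMIC_MASS[symbol],
    "weighted_sum": lambda values, weights: sum(v * w for v, w in zip(values, weights)),
}

Action = Tuple[str, tuple]   # (tool name or "final", positional arguments)

def scripted_policy(history: List[Tuple[Action, Any]]) -> Action:
    """Stand-in for the LLM: pick the next action from the observations so far."""
    observations = [result for _, result in history]
    if len(history) == 0:
        return ("atomic_mass", ("H",))
    if len(history) == 1:
        return ("atomic_mass", ("O",))
    if len(history) == 2:
        return ("weighted_sum", (observations, [2, 1]))   # H2O = 2 H + 1 O
    return ("final", (observations[-1],))

def run_agent(policy, tools, max_steps=10):
    """Decide, call the tool, record the result, repeat until a final answer."""
    history: List[Tuple[Action, Any]] = []
    for _ in range(max_steps):
        name, args = policy(history)
        if name == "final":
            return args[0], history
        result = tools[name](*args)
        history.append(((name, args), result))
    raise RuntimeError("step budget exhausted without a final answer")

answer, trace = run_agent(scripted_policy, TOOLS)
print(f"molar mass of H2O: {answer:.3f} g/mol after {len(trace)} tool calls")
```

In a real system the policy is a model call that receives the history as context, and the step budget and error handling matter much more, since the model can choose invalid tools or malformed arguments.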

ChemCrow and Coscientist: autonomous chemistry agents

Two landmark systems from 2023–2024.

ChemCrow (Bran et al., 2023, “Augmenting large language models with chemistry tools”). An LLM agent with chemistry-specific tools — retrosynthesis planners, reaction-yield predictors, molecule-property estimators, RDKit utilities. Demonstrated competence at chemistry tasks (proposing syntheses, predicting reaction outcomes, designing molecules with target properties) at a level approaching that of a junior chemist.

The architecture: standard LLM-with-tools, with the tools selected for chemistry. The LLM (GPT-4 in the original work) handled high-level planning and tool selection; the tools handled domain-specific computation.

Coscientist (Boiko et al., 2023, “Autonomous chemical research with large language models”). A more autonomous system that executes chemistry experiments using a robotic lab. Given a target reaction, Coscientist plans the synthesis, selects reagents, programs the robotic platform, executes the experiment, and analyzes the results.

The result. Coscientist successfully planned and executed several non-trivial chemistry experiments without per-experiment human intervention. The system demonstrates that the LLM-agent + robotic-lab combination can produce autonomous scientific output.

The honest accounting. ChemCrow and Coscientist work within bounded problem classes — well-characterized chemistry with known reagents and predictable outcomes. They fail outside those bounds. The systems are valuable demonstrations of capability; they are not autonomous scientists in the unbounded sense.

AI Scientist: end-to-end research automation

AI Scientist (Lu et al., Sakana AI, 2024) is an early demonstration of end-to-end AI research. The system:

Generates research ideas in a given area.
Writes code to implement experiments.
Runs experiments (in the original work, on standard ML benchmarks).
Analyzes results.
Writes a paper describing the findings, including figures and references.
Performs (LLM-based) peer review of its own paper.

The output. AI Scientist produced multiple complete research papers in machine learning. The papers are real — they have experimental results, contain LaTeX, are formatted like NeurIPS submissions, and report findings the authors did not have in advance. Several were submitted to (and rejected from) actual conferences.

The realistic assessment. The papers are at the level of weak research output — they make incremental contributions on well-studied problems, sometimes with methodological flaws that human reviewers caught. They are not approaching the contribution level of strong human research. But they exist, and the system runs autonomously.

The trajectory. The Sakana team and others have followed up with improvements: better hypothesis generation, more reliable code execution, more sophisticated experimental design. The 2025–2026 systems are substantively better than the 2024 demonstration. Where the trajectory leads is one of the most-watched open questions in AI4S.

Lab automation and self-driving labs

A complementary thread. Robotic and automated laboratories have existed for decades for high-throughput tasks (drug screening, materials synthesis). The combination with AI agents is producing the self-driving laboratory (SDL).

The vision. A laboratory in which the experimental cycle is fully automated:

AI agent proposes experiment.
Robotic equipment prepares samples.
Instruments measure.
AI agent analyzes results.
AI agent decides next experiment.
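The cycle above can be sketched as a closed loop around a simulated instrument. Everything in this example is assumed for illustration: `simulated_instrument` is a noise-free stand-in for the robot and measurement steps, with a yield peak at 72 °C that the agent does not know, and the "agent" is a simple step-halving local search rather than an LLM or a Bayesian optimizer.

```python
def simulated_instrument(temperature_c: float) -> float:
    """Hypothetical stand-in for sample preparation plus measurement.

    Returns a yield (percent) that peaks at 72 C; the agent never sees this formula.
    """
    return 100.0 - (temperature_c - 72.0) ** 2 / 10.0

def self_driving_loop(measure, start, step, min_step=0.5, budget=50):
    """Propose, measure, analyze, decide; repeat until the step is small or the budget is spent."""
    best_x, best_y = start, measure(start)
    log = [(best_x, best_y)]
    while step >= min_step and len(log) < budget:
        proposals = [best_x - step, best_x + step]         # agent proposes experiments
        results = [(x, measure(x)) for x in proposals]     # robot prepares, instrument measures
        log.extend(results)
        x_new, y_new = max(results, key=lambda r: r[1])    # agent analyzes results
        if y_new > best_y:                                 # agent decides the next experiment
            best_x, best_y = x_new, y_new
        else:
            step /= 2.0
    return best_x, best_y, log

best_temp, best_yield, log = self_driving_loop(simulated_instrument, start=50.0, step=16.0)
print(f"best temperature: {best_temp:.1f} C, yield {best_yield:.1f}% after {len(log)} runs")
```

The measurement budget is the quantity that matters in practice: each call to `measure` is a physical experiment, which is why real SDLs use sample-efficient planners and why each lab remains bespoke to its experimental modality.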

Notable implementations:

Ada (Roch et al., 2018). Early self-driving optical-materials lab.
Mobile robotic chemist (Burger et al., 2020, from the Cooper group at Liverpool). Self-driving lab for photocatalyst development.
Polybot (Argonne National Lab, 2020s). Multi-instrument materials synthesis and characterization.
A-Lab (Berkeley, 2023). Materials-synthesis lab for the GNoME-predicted crystals — partial automation of synthesis cycles.

The state. SDLs are practical for well-defined experimental cycles in chemistry and materials. They are not yet general — each SDL is bespoke for its specific experimental modality. The cost is substantial; the throughput improvement is real but bounded.

Limits: novelty, hypothesis quality, experimental rigor

The honest critiques of autonomous research agents.

Novelty is hard. AI systems excel at combining known ideas — applying technique X to problem Y, where both are well-documented. They are weaker at truly novel directions — proposing problems no one has framed, suggesting techniques no one has tried. The AI Scientist papers are illustrative: they make incremental contributions on well-studied benchmarks, not paradigm-shifting proposals.

Hypothesis quality varies. AI-generated hypotheses are often plausible-sounding but unimportant: the system can produce hundreds of testable predictions but cannot reliably identify which ones matter. Human scientists do this through taste, experience, and engagement with the field; AI systems, when they do it well, draw on the LLM’s broad knowledge, but they do so inconsistently.

Experimental rigor. AI-conducted experiments can have methodological flaws — inappropriate controls, statistical errors, data leakage, p-hacking. Detecting these requires careful review; autonomous systems that don’t pause for review can compound errors. The need for human-in-the-loop verification is real.

Reproducibility. AI-driven research carries amplified reproducibility risks — LLM outputs are inherently stochastic, computational pipelines may not be deterministic, code may not be archived. Best practices are emerging but not yet standard.

Where autonomous research sits in 2026

The summary. AI-as-tool for science is substantively useful: literature search, code generation, brainstorming, and draft writing are standard scientist workflows that now involve LLM assistance. AI-as-collaborator is emerging: systems like ChemCrow and Coscientist demonstrate that autonomous action within bounded domains is possible. AI-as-researcher is aspirational: AI Scientist demonstrates that the capability exists, but only at the quality of weak research output.

The trajectory is rapid. The 2024 AI Scientist was demonstrably weak; 2025–2026 successors are stronger; systems in 2027–2028 will likely be more capable still. Where this ends (at human-PhD level? human-postdoc level? human-PI level? somewhere different from human research entirely?) is one of the most consequential open questions in AI4S and in AI broadly.

The institutional response is also evolving. Journals are developing AI-disclosure policies; conferences are wrestling with AI-authored submissions; granting agencies are considering AI-impact assessments. The community-level adjustments are at least as consequential as the technical capabilities.

Editorial note. §10 and the surrounding material on autonomous research will date faster than other parts of the chapter. The 2024–2026 systems described here are the state of the art at the time of writing; the systems available a year after publication will likely be substantively more capable. We treat this section as a snapshot of a rapidly evolving frontier rather than a description of a stable state.

§11. Evaluation in Scientific Domains

Evaluation in AI for Science is harder than evaluation in mainstream ML. Standard ML benchmarks have well-defined train/test splits, clean ground truth, and reproducible scoring. Scientific domains often have none of these. This section develops what evaluation looks like in AI4S, the specific failure modes that arise, and the practices that have emerged.

The “ground truth” problem in science

Standard ML evaluation assumes a labelled test set with ground-truth answers. For an image classifier, ground truth is a human-assigned label. For an LM benchmark, ground truth is a correct response. The evaluator computes performance against this ground truth.

Scientific domains have several different relationships to “ground truth.”

Experimentally measured ground truth. Protein structures (from X-ray crystallography, cryo-EM, NMR), reaction outcomes (from synthesis), materials properties (from physical characterization). The ground truth exists but is expensive to acquire — a single experimental protein structure can take months of work. Test sets are constrained by what has been experimentally measured.

Computationally derived ground truth. DFT calculations for materials properties, molecular-dynamics simulations for protein dynamics. The “ground truth” here is itself an approximation — DFT is exact only in principle, and the functional choice introduces error. Models trained against DFT are evaluated against DFT, with the actual physics one step removed.

Theoretical ground truth. Mathematics has the cleanest case: a proof is either correct or not, and a proof-assistant kernel can verify it mechanically. Variant-effect prediction is a useful contrast: it has clean ground truth for known pathogenic variants but uncertain ground truth for variants of uncertain significance (the variants of greatest practical interest).

No clear ground truth. Drug efficacy at the molecular level, long-term climate projections, the “interestingness” of a discovered material — these have no single accepted ground-truth measure. Evaluation here is necessarily multi-faceted and partial.

Each domain has its own evaluation conventions shaped by these constraints.

Domain-specific benchmarks

The major AI4S benchmarks worth knowing.

Proteins.

CASP (Critical Assessment of Structure Prediction). The biennial competition that revealed AlphaFold 2. Tests on prospective (released-during-the-competition) protein structures with experimental ground truth. The gold standard.
CASP-Q and CAMEO (Continuous Automated Model EvaluatiOn). Continuous monitoring of structure-prediction systems against new experimental structures.

Materials.

MatBench (Dunn et al., 2020). Benchmark for materials property prediction. Multiple tasks (formation energy, band gap, refractive index, etc.); standard splits.
Open Catalyst Challenge (OC20, OC22, OC25). Catalyst-surface property prediction.
NOMAD challenges. Materials-informatics community challenges.

Chemistry.

MoleculeNet (Wu et al., 2018). Curated benchmark with multiple property-prediction tasks. The de facto standard for chemistry-FM evaluation despite well-known limitations.
Therapeutics Data Commons (TDC). Drug-discovery-focused benchmarks.

Mathematics.

MiniF2F (Zheng, Han, Polu, 2022). A formal-mathematics benchmark with Olympiad-level problems translated into Lean and Isabelle. Used by AlphaProof and competitors.
ProofNet, FIMO, ProofGym. Other formal-mathematics benchmarks.

Weather and climate.

WeatherBench (Rasp et al., 2020; Rasp et al., 2024 v2). The standardised benchmark for ML-based weather forecasting on ERA5 data. WeatherBench 2 expanded the metric suite and is used by GraphCast, Pangu-Weather, GenCast.
ClimateBench, ChaosBench. Long-horizon and climate-specific benchmarks.

Single-cell biology.

Open Problems in Single-Cell Analysis. Community benchmark for cell-type prediction, batch correction, perturbation response.
CellxGene Census. Standardized single-cell-data resource that benchmarks build upon.

Each benchmark has its own community conventions, its own known limitations, and its own pace of evolution. The benchmark ecosystem is decentralized and uneven; some domains (proteins, weather) have authoritative benchmarks, others (drug discovery, materials beyond MatBench) have multiple competing or partial benchmarks.

Held-out vs prospective evaluation

A specific methodological issue. Most benchmarks use retrospective splits — given an existing dataset, designate some fraction as “test” and don’t train on it. This works when train and test are truly independent; it fails when they are not.

Failure modes:

Random splits leak signal. A random 80/20 split of the PDB into train/test gives the model many close homologs of test proteins in training — making structure prediction artificially easy. Best practice: split by sequence similarity (no train-test homolog above 30% identity), or by temporal cutoff (train on pre-date, test on post-date).

Temporal splits. Use date as the splitting criterion: train on all data published before a cutoff; test on data published after. This is the closest retrospective analogue to prospective evaluation (testing on truly new data). CASP uses this implicitly — the test structures are released during the competition and were unknown to participants. Most other benchmarks use temporal splits in their best-practice versions.

Distribution shift across splits. Even with temporal or similarity-based splits, the test data may differ systematically from train (different organism distributions, different chemistry families). This is intrinsic to scientific data, where data accumulation is not random.
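A similarity-based split can be sketched as clustering followed by cluster-level assignment. The example below is a toy under stated assumptions: `identity` is position-wise matching on equal-length strings with no alignment, the eight sequences are invented, and the quadratic all-pairs comparison would not scale to the PDB (real pipelines use dedicated clustering tools for this step). The 30% threshold follows the best-practice figure quoted above.

```python
from itertools import combinations

def identity(a: str, b: str) -> float:
    """Toy sequence identity: fraction of matching positions, no alignment."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def similarity_split(seqs, threshold=0.3, test_fraction=0.25):
    """Split indices so that no train/test pair exceeds the identity threshold.

    Sequences are linked whenever their identity exceeds the threshold; the
    connected components are then assigned whole to either train or test.
    """
    parent = list(range(len(seqs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if identity(seqs[i], seqs[j]) > threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)

    train, test = [], []
    for members in sorted(clusters.values(), key=len):   # smallest clusters fill test first
        (test if len(test) < test_fraction * len(seqs) else train).extend(members)
    return train, test

seqs = [
    "MKTAYIAKQR", "MKTAYIAKQL", "MKTGYIAKQR",   # invented family A
    "GSHMLEDPVA", "GSHMLEDPVG",                 # invented family B
    "WWCCNNPPHH",                               # singleton
    "QQEEDDRRKK", "QQEEDDRRKT",                 # invented family C
]
train, test = similarity_split(seqs)
print("train:", [seqs[i] for i in train])
print("test: ", [seqs[i] for i in test])
```

Using connected components rather than a random split guarantees the stated property by construction, at the cost of less control over the exact test-set size. A temporal split replaces the clustering step with a comparison against a release-date cutoff.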

Experimental validation as ultimate metric

The cleanest evaluation. Synthesize the AI-designed protein/material/molecule; measure its actual properties; compare to predictions. This is prospective experimental validation and is the gold standard for AI4S that involves design.

The cost. Each experimental validation is expensive — for proteins, weeks of synthesis and characterization; for materials, similar; for drugs, months to years. Benchmarks based on experimental validation are necessarily small (tens to hundreds of validated cases) compared to computational benchmarks (millions of test points).

The interpretation. Experimental validation rates (“the model produces successful designs X% of the time”) are the most meaningful AI4S metrics but the hardest to obtain at scale. RFdiffusion’s experimental validation involved synthesizing dozens of designed proteins; the high hit rate was the headline result. Similar validation campaigns are emerging in materials (GNoME synthesizable subsets) and chemistry (autonomous-chemistry experiments).
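Because validation campaigns are small, a reported hit rate should carry an uncertainty interval. The sketch below uses the Wilson score interval for a binomial proportion at roughly 95% confidence ($z = 1.96$); the two campaigns are hypothetical numbers chosen to have the same 20% hit rate at different scales and do not describe any system discussed in this chapter.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion."""
    p = successes / trials
    denom = 1.0 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half, center + half

# Hypothetical campaigns: identical hit rate, different numbers of validated designs.
small = wilson_interval(10, 50)
large = wilson_interval(200, 1000)
print(f"10/50    -> [{small[0]:.3f}, {small[1]:.3f}]")
print(f"200/1000 -> [{large[0]:.3f}, {large[1]:.3f}]")
```

With 50 validated designs the interval spans roughly 11% to 33%, so two methods whose hit rates differ by ten percentage points may not be distinguishable at that scale.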

Memorization and data leakage in scientific FMs

A serious concern in 2026. Scientific foundation models are trained on essentially all publicly available data in their domains: the PDB for proteins, ChEMBL for molecules, PubMed for literature. Test sets in the literature are mostly subsets of the same corpora.

The risk: a scientific FM may memorize test-set entries through pretraining, and then “predict” them at evaluation time by recall rather than generalization. The model is then evaluated as if it generalized when it actually memorized.

Detection techniques:

Date-cutoff filtering. Evaluate only on data published after the model’s pretraining cutoff. The model could not have seen it.
Held-out-by-construction sets. Curated test sets specifically excluded from any public pretraining data.
Per-example overlap detection. Search the pretraining data for near-duplicates of test examples; flag matches.
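Two of these checks, date-cutoff filtering and per-example overlap detection, combine naturally into one filter. The sketch below is illustrative only: the sequences, identifiers, release dates, and cutoff are invented, near-duplicates are detected by Jaccard similarity over 3-mers, and the 0.5 overlap threshold is an arbitrary assumption. A scan against a real pretraining corpus would need indexing rather than this linear pass.

```python
from datetime import date

def kmers(seq: str, k: int = 3) -> set:
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

def clean_test_set(test_items, pretraining_seqs, cutoff, max_overlap=0.5):
    """Keep test items released after the cutoff that have no near-duplicate in pretraining."""
    corpus = [kmers(s) for s in pretraining_seqs]
    kept, flagged = [], []
    for item in test_items:
        too_old = item["released"] <= cutoff
        overlap = max((jaccard(kmers(item["seq"]), c) for c in corpus), default=0.0)
        (flagged if too_old or overlap > max_overlap else kept).append(item["id"])
    return kept, flagged

# Invented data for illustration.
pretraining = ["MKTAYIAKQRQISFVK", "GSHMLEDPVAGGTTLL"]
test_items = [
    {"id": "t1", "seq": "MKTAYIAKQRQISFVR", "released": date(2025, 6, 1)},   # near-duplicate
    {"id": "t2", "seq": "WWCCNNPPHHDDEEQQ", "released": date(2023, 1, 1)},   # before cutoff
    {"id": "t3", "seq": "WYWYCHCHNPNPDEDE", "released": date(2025, 6, 1)},   # clean
]
kept, flagged = clean_test_set(test_items, pretraining, cutoff=date(2024, 12, 31))
print("kept:", kept, "flagged:", flagged)
```

The first item shows why the date filter alone is not enough: it was released after the cutoff, yet it differs from a pretraining sequence by a single residue and would reward recall rather than generalization.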

The honest accounting. Many widely-reported AI4S results are at least partially contaminated by data leakage. The community is beginning to require date-cutoff and similarity-based controls; the practice is uneven. Specific recent results (some single-cell FM claims, some chemistry FM claims) have been called into question on contamination grounds.

The Theoretical Foundations of Learning chapter §8 develops a related discussion in the context of standard ML; the scientific domain has its own specifics but the underlying problem is the same. This is reflected in OP-S-8 and the §13 critiques.

Where AI4S evaluation sits in 2026

The summary. AI4S evaluation is less mature than mainstream ML evaluation. Some domains have authoritative benchmarks (CASP, WeatherBench); others have fragmented or contested practices. Experimental validation is the gold standard but is expensive. Memorization concerns are real and increasingly recognized. Best practices (temporal splits, similarity-based controls, prospective evaluation, experimental validation) exist but are not uniformly applied.

The trajectory is toward more rigorous evaluation. The community is becoming more attentive to leakage, more demanding of experimental validation, and more careful about benchmark hygiene. The 2026 standards are higher than 2022 standards; further improvement is likely.

§12. Connections to Other Chapters

This chapter is dense with cross-references; the dependency statements below say what each related chapter contributes and where it is used here.

Generative Models is the methodological substrate for protein design (§3, RFdiffusion), molecule design (§5, equivariant diffusion), and crystal design (§6, MatterGen). The diffusion and flow-matching machinery of Generative Models §6 and §7 applies directly to scientific domains; AlphaFold 3’s diffusion-based structure prediction (§3) is a particularly clean instance.
Foundation Models is the framing for scientific foundation models — ESM (proteins), AlphaGenome (DNA), MolFormer (molecules), scGPT (cells), MatterSim (materials), and weather FMs. FM §6 (scaling laws) and FM §7 (adaptation) apply to scientific FMs with domain-specific variations.
Reinforcement Learning §10 develops verifiable-reward RL; AlphaProof (§4 of this chapter) is the canonical AI4S application. RL also appears in retrosynthesis (§5, RetroSynRL) and in autonomous chemistry agents (§10 of this chapter via Coscientist’s policy layer).
Self-Supervised Learning is the pretraining substrate for scientific FMs across all subdomains — masked-amino-acid prediction for proteins, masked-token prediction for molecules and DNA, contrastive learning for cells. SSL §4 (generative objectives) and §5 (contrastive) provide the technique base.
Deep Learning provides the architectures — equivariant networks (§6 of DL extends to molecular/material equivariance), graph neural networks (§4 of DL covers GNN basics), Transformers, attention mechanisms. The cross-cutting equivariance theme of this chapter is the DL chapter applied to physical-symmetry data.
Large Language Models is the substrate for scientific reasoning agents (§10 of this chapter). LLM §8 (Tool Use) is directly instantiated by ChemCrow, Coscientist, AI Scientist. LLM §12 (Reasoning Models) is the substrate for AlphaProof’s RL-on-reasoning training.
Multimodal Models (planned) develops cross-modal systems; protein-language-and-structure models (some AlphaFold 3 variants), protein-and-molecular-property models, and multi-omics integration are scientific instances.
Theoretical Foundations of Learning §8 (modern generalization puzzle) is directly relevant to OP-S-1 (generalization in scientific domains) — scientific FMs face the same overparameterization-generalization puzzle as general FMs but with additional constraints from physical principles and sparse experimental data.
Causality (planned) underlies a substantial critique of AI4S — many AI4S predictions are correlational rather than causal, and the gap matters for applications like drug discovery and clinical genomics. OP-S-8 is the cross-reference.
Evaluation (planned) develops cross-cutting evaluation methodology; this chapter’s §11 is the scientific-domain instance.
Mechanistic Interpretability (planned) studies internal representations of models, including scientific ones. Interpretability of AlphaFold-style and chemistry FM representations is an active area; what these models have learned about physics, chemistry, or biology is partly open.
Alignment / Ethics treats safety implications of capable AI systems, including AI4S systems. Dual-use concerns (biology), autonomous-agent risks (autonomous research), and reproducibility-and-validation requirements all interact with the broader alignment discussion.

§13. Critiques and Alternative Perspectives

This section presents critiques of AI for Science as substantive intellectual positions held by working researchers. The chapter does not adjudicate.

“Scientific AI is curve-fitting, not understanding”

The classical critique. AI systems make predictions without producing theoretical understanding. AlphaFold predicts protein structures with high accuracy but does not produce new biophysical theories about why proteins fold; GraphCast forecasts weather without elucidating atmospheric dynamics; GNoME discovers stable crystals without illuminating principles of crystal stability.

The deeper critique. Science is supposed to produce understanding, not just predictions. A predictive system without theoretical content is a useful tool but not science in the traditional sense. AI4S systems are tools, not theories; calling them “AI for Science” risks confusion about what they are.

The pushback. Tools have always been part of science — microscopes, telescopes, computers, statistical software. AI4S systems are tools that enable science by accelerating routine prediction; the resulting bandwidth allows scientists to focus on theoretical questions. Furthermore, AI4S sometimes suggests theoretical content — AlphaFold’s representation may encode physical principles that could be extracted by interpretability methods (a hope rather than a present reality).

The honest position. AI4S is not a replacement for theoretical understanding; it is a powerful complement that has changed which questions are tractable. Whether AI4S will eventually contribute to theory development directly is open.

The reproducibility crisis in AI4S

A concrete and current critique. AI4S has its own version of the reproducibility crisis that has affected biomedical sciences and ML more broadly. Specific issues:

Code availability. Many AI4S papers do not release code, or release code that is incomplete or hard to run.
Data availability. Training data is sometimes proprietary; test data is sometimes not clearly specified.
Compute requirements. Frontier AI4S results require substantial compute; independent reproduction is expensive.
Random-seed sensitivity. Some results depend non-trivially on training-time randomness; reported metrics may be best-case rather than typical-case.
Benchmark contamination. As discussed in §11, data leakage between train and test affects many reported results.

The community response is uneven. Some venues (NeurIPS Datasets and Benchmarks track, journal special issues on AI4S reproducibility) push for stronger standards; many do not enforce them. The 2026 state is “improving but not adequate.”

The reproducibility crisis is shared with mainstream ML but is more consequential in AI4S because the downstream applications (clinical genomics, drug discovery, materials engineering) depend on the reported results. A non-reproducible result that affects a clinical decision has stakes that a non-reproducible result on an image-classification benchmark does not.
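
The random-seed point above is easy to make concrete. The sketch below assumes five hypothetical test accuracies from retraining one configuration with different seeds (the numbers are invented for illustration) and reports the typical case alongside the best case.

```python
import statistics

def summarize_seeds(scores):
    """Summarize one metric across independent training seeds."""
    return {
        "n": len(scores),
        "best": max(scores),
        "mean": statistics.mean(scores),
        "std": statistics.stdev(scores),  # sample standard deviation (n - 1 denominator)
    }

# Hypothetical test accuracies from five seeds of the same configuration.
scores = [0.81, 0.79, 0.86, 0.80, 0.78]
summary = summarize_seeds(scores)
print(
    f"best={summary['best']:.2f}  "
    f"mean={summary['mean']:.3f} +/- {summary['std']:.3f}  (n={summary['n']})"
)
```

Reporting only the best seed here would overstate the typical result by roughly five points of accuracy; the mean and spread over seeds is the more reproducible quantity.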

Benchmarking artefacts and data leakage

A specific instance of the reproducibility critique. Many reported AI4S improvements turn out, on closer inspection, to be benchmark artefacts:

Test sets contain close homologs of training proteins — apparent generalization is partly memorization.
Training and test data share systematic biases — high apparent accuracy reflects shared bias rather than meaningful prediction.
Leaderboard rankings are sensitive to small score differences; a gap between two methods may reflect noise rather than signal.

The community has become more attentive to these issues, with stricter splitting practices, contamination audits, and (in some cases) public retraction-and-correction of overclaim. The trajectory is improving; the past record is uneven.
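
One of the stricter splitting practices mentioned above is to split by similarity cluster, so that close homologs never straddle train and test. The sketch below is a toy version under stated assumptions: the helper names are hypothetical, the "similarity" is a shared four-character prefix standing in for a real homology or scaffold criterion, and clustering is plain single-linkage via union-find.

```python
def cluster_by_similarity(items, similar):
    """Single-linkage clustering via union-find: items are joined whenever similar(a, b) is True."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if similar(items[i], items[j]):
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), []).append(item)
    return list(clusters.values())

def cluster_split(items, similar, test_fraction=0.25):
    """Assign whole clusters to test (smallest first) until the target size is reached.

    The realized test fraction is approximate because clusters are never broken up.
    """
    clusters = sorted(cluster_by_similarity(items, similar), key=len)
    target = test_fraction * len(items)
    train, test = [], []
    for cluster in clusters:
        (test if len(test) < target else train).extend(cluster)
    return train, test

# Toy stand-in for homology: sequences sharing a 4-character prefix count as similar.
def same_family(a, b):
    return a[:4] == b[:4]

seqs = ["MKTAAA", "MKTAAC", "MKTAGG", "GSHMLE", "GSHMLD", "WPCHNY", "QLRTVI", "QLRTVA"]
train, test = cluster_split(seqs, same_family)
print("train:", train)
print("test:", test)
```

A random split of the same eight sequences would usually place members of one family on both sides, which is exactly the homolog-leakage artefact described in this section.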

The capabilities-vs-impact gap

A specific 2024–2026 observation. AI4S has produced striking technical capabilities (AlphaFold 2/3, GraphCast, GNoME, AlphaProof) — but the scientific impact of these capabilities has not always matched the capability hype. Specifically:

AlphaFold 2 has enabled substantial structural biology but has not, on its own, cured a single disease or directly produced a single approved drug. The impact pathway is slow.
GNoME predicted 380K stable crystals but only a small fraction have been experimentally validated; even fewer have been engineered into useful materials.
AI Scientist produces papers, but the papers are not high-impact and the productivity gain over human researchers is unclear.

The critique. AI4S systems are technically impressive, but the science downstream of them is what matters. Treating capability demonstrations as scientific accomplishments confuses tool development with science.

The pushback. Capability gains take time to translate into impact. The transistor took decades to produce the personal computer; AlphaFold may take a similar time to produce its clinical and biotechnological impact. Premature dismissal of AI4S based on near-term lack of impact is the symmetric error to premature celebration.

The position taken in this chapter. Both errors are real; the chapter has tried to describe capabilities accurately without overclaim about impact. The trajectory matters more than the snapshot.

The role of domain expertise

A practical critique. The successful AI4S projects (AlphaFold, GraphCast, GNoME, RFdiffusion) involved deep collaboration between ML researchers and domain scientists: structural biologists on AlphaFold; meteorologists on GraphCast; materials chemists on GNoME; protein biochemists at the Baker lab on RFdiffusion.

The critique: AI4S can fail when ML teams over-rely on their ML expertise and under-rely on domain expertise. Specific failure modes:

ML researchers building “the obvious system” without consulting domain experts on what is actually useful.
Domain scientists asked to evaluate ML outputs they don’t understand technically.
Cross-domain claims (we built X for chemistry; we will now scale it to biology) that miss domain-specific structure.

The healthy state of AI4S involves substantial intellectual integration — ML and domain expertise jointly shaping the project from the start. This is what made AlphaFold succeed where decades of pure-ML and pure-biology approaches had failed. The interdisciplinary integration is harder than it sounds and is one of the underrated requirements for AI4S success.

A meta-critique: AI4S as a field

A final position. AI4S is not uniformly impressive across domains. Proteins are a substantial success; weather is a substantial success; mathematics is making real progress; chemistry and materials are improving incrementally; biology beyond proteins is mixed; physics is heterogeneous; autonomous research is aspirational.

Generalizing from the success cases (AlphaFold) to all of AI4S overclaims. Generalizing from the underperforming cases (some biology FMs, climate-AI-replacing-physical-models) underclaims. The honest summary is the heterogeneous one: AI4S works very well in some domains, modestly in others, and is genuinely uncertain in still others.

§14. Limitations and Open Problems

Consolidated open-problems list. Each item carries an OP-S-N identifier so other chapters can cross-reference.

OP-S-1. Generalization beyond training distribution in scientific domains. A specific instance of the general open problem (OP-TH-1, OP-FM-3). Scientific models often perform well within the training distribution and degrade sharply outside it. AlphaFold underperforms on proteins very different from training-set homologs; chemistry FMs degrade on novel chemistry; weather models trained on ERA5-era data may not extrapolate to climate-shifted future weather. Whether structural inductive biases (equivariance, physics-informed constraints, foundation-model pretraining) can be made strong enough to ensure out-of-distribution generalization is open.
OP-S-2. Protein dynamics, allostery, and disordered regions. AlphaFold-style models predict static ground-state structures. Real proteins are dynamic — they fluctuate, transition between conformations, and have functionally important disordered regions. Predicting dynamics (which states, how often, transition kinetics) is much less developed. Several systems (Distributional Graphformer, AlphaFlow) extend toward conformational ensembles; the field is far behind static-structure prediction.
OP-S-3. AI-discovered important mathematics. AlphaProof solves known hard problems; FunSearch improves on known combinatorial constructions; Davies et al. find patterns in known mathematical objects. AI has not yet posed and solved a substantial new mathematical question that humans had not already framed. Whether this is a quantitative or qualitative gap is debated and unresolved.
OP-S-4. Active learning and experimental-loop efficiency. AI4S workflows increasingly use active-learning loops (predict, validate, retrain). Choosing which candidates to validate is a substantial optimization problem; standard active-learning theory provides foundations but doesn’t fully solve the practical problem at scientific scales. Reducing the number of expensive experiments required to converge on a discovery is the practical metric.
OP-S-5. Foundation models for biology that match LLM-style scaling. Single-cell and genomics FMs show modest but not dramatic improvements over task-specific baselines. The protein-language-model case (ESM, ProtT5) is partial — large gains in some downstream tasks but not paradigm-shifting transformations. Whether the FM-as-substrate pattern can be made as dominant in biology as it is in language is an open empirical question.
OP-S-6. Long-horizon climate modelling with physical consistency. Deep-learning weather forecasting works at days-to-weeks; climate-scale modelling at decades-to-centuries is much less developed. ML emulators for Earth system models (not to be confused with the ESM protein language models) are accelerating physics-based models; ML climate models with physical-consistency guarantees are an open frontier. The stakes (climate-projection trustworthiness for policy decisions) are high enough that progress here is consequential.
OP-S-7. Trustworthy autonomous research agents. Current autonomous-research systems (AI Scientist, Coscientist) produce output of weak research quality and have known failure modes (poor hypothesis selection, methodological flaws, lack of novelty). Building autonomous research agents that produce reliably trustworthy science, not just impressively formatted papers, is the central open challenge of §10. Whether this is months away or decades away is contested.
OP-S-8. Causal vs correlational scientific predictions. Many AI4S systems make correlational predictions — they predict outputs from inputs based on training-data patterns. Scientific understanding often requires causal predictions — what happens if I intervene on the system. Drug discovery, clinical genomics, materials-property design, and policy-driven climate scenarios all need causal reasoning. The Causality chapter develops the broader framework; AI4S-specific causal methodology is largely open.
OP-S-9. Reproducibility and benchmark hygiene. The §13 critique. AI4S systems are sometimes evaluated on contaminated benchmarks, with sensitivity to random seeds and infrastructure choices not fully reported. Establishing community standards that exceed mainstream ML standards (because of the higher downstream stakes) is in progress but uneven.
OP-S-10. Bridge from capability to scientific impact. Capability demonstrations (AlphaFold 2 at CASP14) do not automatically translate to scientific impact (curing a disease, designing an industrial material). The bridge requires careful integration with experimental science, downstream engineering, and institutional adoption. How to make this bridge faster and more reliable is a meta-question that affects all AI4S subfields.

§15. Further Reading

Opinionated annotated list. Not exhaustive; intended as a reading-order recommendation for someone entering AI for Science.

Cross-cutting overviews

Wang, H., et al. (2023). “Scientific discovery in the age of artificial intelligence.” Nature. A high-level overview of AI4S progress as of 2023.
Bommasani, R., et al. (2021). “On the Opportunities and Risks of Foundation Models.” arXiv preprint. The FM-as-substrate framing relevant to scientific FMs.

Proteins

Jumper, J., et al. (2021). “Highly accurate protein structure prediction with AlphaFold.” Nature. The AlphaFold 2 paper.
Abramson, J., et al. (2024). “Accurate structure prediction of biomolecular interactions with AlphaFold 3.” Nature. AlphaFold 3.
Lin, Z., et al. (2023). “Evolutionary-scale prediction of atomic-level protein structure.” Science. ESMFold.
Watson, J. L., et al. (2023). “De novo design of protein structure and function with RFdiffusion.” Nature. RFdiffusion.
Ingraham, J. B., et al. (2023). “Illuminating protein space with a programmable generative model.” Nature. Chroma.
Dauparas, J., et al. (2022). “Robust deep learning-based protein sequence design using ProteinMPNN.” Science. ProteinMPNN.

Mathematics

DeepMind blog (2024). “AlphaProof and AlphaGeometry 2.” Technical overview of IMO-level theorem proving.
Trinh, T. H., et al. (2024). “Solving olympiad geometry without human demonstrations.” Nature. AlphaGeometry.
Romera-Paredes, B., et al. (2024). “Mathematical discoveries from program search with large language models.” Nature. FunSearch.
Davies, A., et al. (2021). “Advancing mathematics by guiding human intuition with AI.” Nature. Conjecturing.

Chemistry and materials

Merchant, A., et al. (2023). “Scaling deep learning for materials discovery.” Nature. GNoME.
Zeni, C., et al. (2025). “A generative model for inorganic materials design.” Nature. MatterGen.
Chanussot, L., et al. (2021). “Open Catalyst 2020 (OC20) Dataset and Community Challenges.” ACS Catalysis. OC20.
Batzner, S., et al. (2022). “E(3)-equivariant graph neural networks for data-efficient and accurate interatomic potentials.” Nature Communications. NequIP.
Batatia, I., et al. (2024). “A foundation model for atomistic materials chemistry.” arXiv preprint. MACE-MP-0.

Biology beyond proteins

Cheng, J., et al. (2023). “Accurate proteome-wide missense variant effect prediction with AlphaMissense.” Science.
Avsec, Ž., et al. (2021). “Effective gene expression prediction from sequence by integrating long-range interactions.” Nature Methods. Enformer.
DeepMind (2025). “AlphaGenome.” Technical overview.
Nguyen, E., et al. (2024). “Sequence modeling and design from molecular to genome scale with Evo.” Science. Evo.
Cui, H., et al. (2024). “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nature Methods. scGPT.

Climate and weather

Lam, R., et al. (2023). “Learning skillful medium-range global weather forecasting.” Science. GraphCast.
Bi, K., et al. (2023). “Accurate medium-range global weather forecasting with 3D neural networks.” Nature. Pangu-Weather.
Price, I., et al. (2024). “Probabilistic weather forecasting with machine learning.” Nature. GenCast.
Rasp, S., et al. (2024). “WeatherBench 2: A benchmark for the next generation of data-driven global weather models.” Journal of Advances in Modeling Earth Systems. WeatherBench 2.

Physics

Karniadakis, G. E., et al. (2021). “Physics-informed machine learning.” Nature Reviews Physics. PINN review.
Li, Z., et al. (2021). “Fourier Neural Operator for Parametric Partial Differential Equations.” ICLR. FNO.
Pfau, D., et al. (2020). “Ab initio solution of the many-electron Schrödinger equation with deep neural networks.” Physical Review Research. FermiNet.
Cranmer, M., et al. (2020). “Discovering symbolic models from deep learning with inductive biases.” NeurIPS. Symbolic regression.

Autonomous research and agents

Bran, A. M., et al. (2024). “Augmenting large language models with chemistry tools.” Nature Machine Intelligence. ChemCrow.
Boiko, D. A., et al. (2023). “Autonomous chemical research with large language models.” Nature. Coscientist.
Lu, C., et al. (2024). “The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery.” arXiv preprint. Sakana AI.

Evaluation and methodology

Beam, A. L., and Kohane, I. S. (2018). “Big Data and Machine Learning in Health Care.” JAMA. Early articulation of evaluation issues in clinical ML.
Walters, W. P., and Murcko, M. (2020). “Assessing the impact of generative AI on medicinal chemistry.” Nature Biotechnology. Reproducibility-and-evaluation discussion.

Reading-order recommendation

For someone entering AI for Science: start with the Wang et al. (2023) overview for orientation. Then read AlphaFold 2 carefully — the architecture, the training, the impact. Then GraphCast and GNoME as the canonical non-protein landmark systems. Then add domain-specific papers as research interest dictates. The MACE/NequIP papers are good entry points to the equivariance machinery; the AI Scientist paper is the canonical autonomous-research reference; AlphaProof is the mathematics frontier.

§16. Exercises and Experiments

Research-style exercises spanning the chapter’s domains. Each is designed to develop hands-on understanding of one or two AI4S techniques.

E1. AlphaFold 2 on a CASP target. Pick a recent CASP target (publicly released after AlphaFold 2’s training cutoff). Run AlphaFold 2 (via ColabFold or local installation). Compare the predicted structure to the experimental ground truth using GDT_TS and RMSD. Investigate the relationship between pLDDT confidence and actual accuracy across residues. Identify low-confidence regions and check whether they correspond to disordered or flexible parts.
E2. Equivariant GNN on QM9. Train a small NequIP or MACE variant on the QM9 dataset for molecular property prediction. Compare to a non-equivariant graph neural network (MPNN) baseline at matched parameter counts. Verify the sample-efficiency advantage of equivariance by training both on progressively smaller subsets of the data.
E3. Molecular generation on ZINC. Train a simple molecular generative model (a small VAE on SMILES, or a graph-based generator) on a subset of ZINC. Generate 1000 molecules; check what fraction are valid (parseable, charge-balanced) and what fraction are novel (not in training set). Compute distribution statistics of generated vs training molecules (molecular weight, logP, ring counts) to assess distributional match.
E4. Lean tooling and basic theorem. Install Lean 4 and mathlib4. Formalize a basic theorem (e.g., “the sum of two even numbers is even”) from scratch in Lean. Then try an LLM-in-the-loop workflow in which an LLM proposes candidate tactic proofs and Lean verifies each one. Reflect on the experience: what’s hard, what’s surprising.
E5. GraphCast-style small forecast. Using public ERA5 data (or a downsampled version), train a small graph-neural-network weather model on a subset of the variables (e.g., 500hPa geopotential height, 2m temperature). Evaluate on held-out dates; compare to a persistence baseline and to a simple regression baseline. Plot forecast skill as a function of lead time.
E6. Scientific reasoning agent. Build a small LLM-with-tools agent for chemistry. Tools: a SMILES validator, a property estimator (RDKit), a literature-search wrapper (PubMed API). Give the agent a chemistry task (e.g., “find a molecule similar to aspirin with higher predicted logP”); evaluate its output qualitatively. Identify failure modes (hallucinated molecules, misused tools, off-target outputs).
E7. Data-leakage audit. Pick a published AI4S benchmark result. Audit it for data leakage: check whether train-test similarity is properly controlled, whether the model’s pretraining data could have included the test set, whether reported metrics differ from what a careful split would produce. Write up findings.
E8. Symbolic regression on physics data. Use PySR or AI Feynman on a synthetic physics dataset (e.g., Kepler-like orbital motion with noise). Verify the system can rediscover the correct equation. Then add real-world complications (additional dimensions, distractor variables) and observe how the system degrades.
E9. PINN for a simple PDE. Implement a PINN for the 1D heat equation or 1D advection equation. Compare to a finite-difference solver on the same problem. Compare training time, accuracy, and computational cost. Investigate failure modes by changing the equation parameters (advection speed, diffusion coefficient) outside the range you initially trained on.
E10. Reproduce a small AI4S result. Choose a published AI4S paper with available code (e.g., a property-prediction model on MoleculeNet). Reproduce one of the headline results. Document the experience: what worked, what didn’t, how many hours of compute were required, what was unclear from the paper. Reflect on what this says about reproducibility in the field.
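
As a starting point for E4, the statement named there can be formalized in a few lines. This is one possible sketch, not a reference solution: IsEven is a hypothetical from-scratch predicate defined here (it is not mathlib’s Even), and the proof assumes a recent Lean 4 toolchain in which the obtain tactic is available (it is also provided by mathlib4).

```lean
/-- Hypothetical from-scratch evenness predicate (not mathlib's `Even`). -/
def IsEven (n : Nat) : Prop := ∃ k, n = 2 * k

/-- The sum of two even numbers is even. -/
theorem isEven_add {m n : Nat} (hm : IsEven m) (hn : IsEven n) :
    IsEven (m + n) := by
  -- Unpack the witnesses: m = 2 * a and n = 2 * b.
  obtain ⟨a, ha⟩ := hm
  obtain ⟨b, hb⟩ := hn
  -- The witness for m + n is a + b; the equation follows by distributivity.
  exact ⟨a + b, by rw [ha, hb, Nat.mul_add]⟩
```

For the LLM-in-the-loop part of the exercise, the final line is the natural place to substitute candidate tactic proofs and let Lean accept or reject each one.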