Reasoning Models

All sixteen sections are in draft status. Open problems are flagged inline and consolidated in §14.

This chapter is new; it has no AIMA 4e antecedent. Reasoning models are large language models specifically trained or prompted to perform extended deliberative computation - generating long chains of intermediate reasoning (“thinking”) before producing answers. The 2024-2026 reasoning-model paradigm (o1, o3, DeepSeek-R1, Claude extended thinking, Gemini Thinking) represents a substantive shift in how LLMs solve hard problems: from answering directly, with no intermediate deliberation, toward iterative, search-like deliberation supported by reinforcement learning on verifiable rewards.

The chapter consolidates reasoning-specific material referenced across other chapters: LLM §7 (chain-of-thought prompting), RL §12 (reasoning RL with GRPO), Alignment §4 (RLHF and preference-tuning), AI Agents §5 (agent planning), Evaluation §4 (reasoning benchmarks). This chapter develops the comprehensive treatment.

The chapter assumes the Large Language Models chapter, the Reinforcement Learning chapter (especially §10-§12 on RLHF and reasoning-RL), and basic familiarity with neural-network training.

Scope and What This Chapter Is About

The chapter develops reasoning models - LLMs that perform extended deliberative computation. We cover the conceptual framework (what reasoning means in the LLM context; the inference-time-compute scaling axis), the prompting-based reasoning lineage (CoT, self-consistency, ToT), the modern reasoning-RL paradigm (o1, R1, GRPO), process supervision, inference-time scaling laws, distillation and small reasoning models, multimodal and tool-augmented reasoning, evaluation, deployment, and connections to agents and alignment.

Approximate length target: 15,000–22,000 words.

§1. Motivation and Scope

Three worked instances

Three concrete instances spanning the modern reasoning-model landscape.

Instance 1: o3 solves a hard physics problem. A user gives o3 (OpenAI, announced December 2024) a problem from Humanity’s Last Exam: a multi-step physics calculation that requires deriving an equation, applying domain-specific knowledge, and computing numerical answers. The model generates extended internal reasoning (visible only as a summary in the API; raw chain hidden) - explores possible approaches, applies physical principles, checks dimensional consistency, computes numerics, verifies. After perhaps 30,000 tokens of internal deliberation lasting several minutes, the model produces a concise final answer. The same model with reasoning disabled (or a comparable-size non-reasoning model) would typically fail the same problem.

Instance 2: DeepSeek-R1 solves a competitive-programming task. A developer asks DeepSeek-R1 (DeepSeek, January 2025) to solve a hard LeetCode problem. The model generates a visible chain of thought: parses the problem, considers data structures, sketches a brute-force solution, recognizes that it is too slow, develops an optimized algorithm, traces through it on the example, identifies edge cases, refines, produces the final code. The visible reasoning trace makes the model’s process inspectable. The model was trained with GRPO on a verifiable code-execution reward - the reward signal was “does the code pass the test cases” - and learned to produce these long deliberative traces.

Instance 3: Claude with extended thinking debugs a research codebase. A researcher gives Claude (with extended thinking enabled, mid-2025) a codebase with a subtle bug: training-loss diverges after epoch 5 for unclear reasons. The model produces extended thinking - examines the code structure, hypothesizes potential causes (learning-rate schedule, data leakage, numerical instability), traces through the training loop, identifies that the gradient-clipping threshold was set incorrectly relative to the batch size after a recent refactor. The thinking is exposed to the user (transparency) but separated from the final response. The reasoning is exploratory (multiple hypotheses, some discarded) rather than linear.

These three instances span hard-knowledge problems (math/physics), verifiable algorithmic tasks (competitive programming), and exploratory practical problems (debugging). They share the long-deliberation-before-answer pattern; they differ in whether the chain is visible, the source of training signal, and the application style.

What reasoning models are

A working definition. Reasoning models are LLMs specifically designed to perform extended deliberative computation before producing final answers. The defining properties:

Extended generation. Reasoning models generate substantially more tokens for hard problems than standard LLMs - often 10x-100x more.
Trained for deliberation. They are typically post-trained with reinforcement learning on reasoning tasks (reasoning-RL) to produce long internal chains.
Inference-time scaling. Quality improves with more thinking time/tokens, in addition to model size scaling.
Separation of thinking and answer. The reasoning trace is often distinguished from the final output (visible as a “thinking” section, hidden behind summaries, or held entirely internally).

The basic architecture:

   REASONING-MODEL PIPELINE

   User query
       │
       ▼
   <thinking>
     extended chain of reasoning
     - hypothesize
     - try approach
     - verify
     - backtrack
     - refine
     ...
   </thinking>
       │
       ▼
   Final answer (concise)

The crucial property. The reasoning content is functionally distinct from the answer: it provides intermediate computation that improves answer quality but isn’t itself the response.

Why reasoning models matter

Four structural reasons.

1. Hard problems require deliberation. Many problems (proofs, complex code, multi-step reasoning) cannot be solved reliably when the model must commit to an answer immediately, with only a fixed amount of computation per token. Extended chain-of-thought makes these problems tractable.

2. Inference-time compute as new scaling axis. Pre-training scaling laws (Kaplan et al. 2020; Chinchilla, Hoffmann et al. 2022) identified model size and training data as the primary scaling axes. Reasoning models add inference-time compute as a third axis - more thinking time → better answers. This has substantial implications for compute allocation and the future trajectory of model capability.

3. Verifiable reward enables RL. RLHF (Alignment §4) relied on preference data - expensive, subjective, hard to scale. Reasoning RL (RL §12) uses verifiable rewards (does the code run? does the answer match?) - cheap, objective, scalable. This unlocks substantially more RL data than RLHF.

4. Reasoning quality improves with scaled reasoning RL. The 2024-2026 evidence (o1, o3, R1) shows reasoning quality improves substantially with scaled reasoning-RL training. This is a new capability frontier.

The combined argument. Reasoning models address persistent LLM limitations (one-shot generation failure on hard problems; pre-training scaling diminishing returns; expensive RLHF data) with a different training and inference recipe rather than a new architecture. By 2026, reasoning models are state-of-the-art for hard tasks across math, science, coding, and complex reasoning.

The 2024-2026 reasoning inflection

A specific industry transition. Pre-September 2024, LLMs were predominantly single-pass next-token generators with optional CoT prompting. After o1’s release (September 2024), the field rapidly developed reasoning models as a distinct paradigm.

The trajectory.

September 2024. OpenAI releases o1-preview and o1-mini. Demonstrates substantial gains on math, science, coding via inference-time reasoning. Reasoning chains hidden; only summaries shown.
December 2024. OpenAI announces o3 with substantial gains on ARC-AGI, FrontierMath, code competitions. Reported per-task costs reflect heavy inference-time compute use.
January 2025. DeepSeek-R1 released. Open weights; visible reasoning chains; methodology paper. Demonstrates reasoning-RL is reproducible outside top labs.
February 2025. Anthropic releases Claude with extended thinking. Visible thinking; user-controlled thinking-time.
2025. Most frontier labs ship reasoning-capable variants. Reasoning becomes standard offering.
2025-2026. Small reasoning models via distillation (R1-distilled variants); reasoning specialized for domains (math, code, agents).

The 2026 state. Reasoning models are the dominant paradigm for hard problems; standard non-reasoning LLMs remain the default for low-latency, low-complexity tasks.

Boundaries with adjacent chapters

Large Language Models §7 covers CoT prompting; this chapter covers the broader reasoning-model paradigm including RL-based reasoning.
Reinforcement Learning §12 covers reasoning-RL methodology; this chapter applies it to the reasoning-model context.
AI Agents §5 covers planning; this chapter covers reasoning as one form of internal planning.
Alignment §4 covers RLHF; this chapter covers reasoning-RL as related but distinct.
Evaluation §4 covers reasoning benchmarks; this chapter develops what these benchmarks measure for reasoning models.
Multimodal Models §10 covers VLMs; this chapter covers multimodal reasoning extensions.
Mechanistic Interpretability §10 covers reasoning-circuit analysis (partial).

What this chapter does not try to do

We do not cover classical search and planning in depth (covered in classical-AI textbooks and AIMA Ch 10-11). We focus on neural-LLM reasoning.
We do not develop formal mathematical reasoning theory.
We do not provide implementation tutorials for reasoning-RL frameworks.

Position taken in this chapter

The chapter takes reasoning models seriously as a substantive new paradigm. The 2024-2026 advances are real and substantial; the methodology is replicable (R1 demonstrated this); the open questions (process vs outcome supervision, reasoning-RL data, multimodal reasoning, agent integration) are substantive. The chapter is neutral on contested questions about whether models “truly reason” - we describe what reasoning models do and what evidence supports their capabilities.

§2. Historical Context

This section traces reasoning models from classical AI reasoning through modern LLM-based extended deliberation.

A timeline of the inflection points:

   1950s-2000s  Classical AI reasoning. Logic-based systems
                  (resolution, Prolog); planning (STRIPS,
                  PDDL); search (A*, MCTS); expert systems.
                  Symbolic reasoning era.
                                  │
                                  ▼
   2017-2020    Transformer-based language models emerge.
                  GPT-2, GPT-3 show emergent capabilities
                  including basic reasoning. No explicit
                  reasoning training.
                                  │
                                  ▼
   2022 Jan     Chain-of-thought prompting (Wei et al.,
                  Google 2022). Demonstrates that few-shot
                  prompts with worked step-by-step exemplars
                  improve reasoning task performance.
                                  │
                                  ▼
   2022 Mar     Self-consistency (Wang et al., Google 2022).
                  Sample multiple chains; take majority vote.
                  Demonstrates inference-time scaling helps.
                                  │
                                  ▼
   2023         Tree-of-thoughts (Yao et al., 2023). Explicit
                  search over reasoning steps with evaluation
                  and pruning. Extends CoT to tree search.
                                  │
                                  ▼
   2023         Process supervision (Lightman et al., OpenAI
                  2023): "Let's Verify Step by Step."
                  Process reward models trained on per-step
                  correctness labels.
                                  │
                                  ▼
   2024 Sept    OpenAI o1-preview and o1-mini released.
                  First major reasoning-RL model. Extended
                  thinking via RL; substantial gains on
                  math/science/coding benchmarks.
                                  │
                                  ▼
   2024 Dec     OpenAI o3 announced. ARC-AGI 87.5% (vs 25%
                  for o1); FrontierMath 25.2% (vs 2% for
                  o1). Substantial inference-time compute
                  cost; demonstrates scaling.
                                  │
                                  ▼
   2025 Jan     DeepSeek-R1 released. Open-weights reasoning
                  model; methodology paper. GRPO training
                  with verifiable rewards; visible chains.
                  Demonstrates reasoning-RL reproducible
                  outside top labs.
                                  │
                                  ▼
   2025 Feb     Anthropic Claude with extended thinking.
                  Visible thinking; user-controlled budget.
                                  │
                                  ▼
   2025         Most frontier labs ship reasoning variants.
                  Gemini Thinking (Google), Qwen-QwQ
                  (Alibaba), small distilled models proliferate.
                                  │
                                  ▼
   2025-2026    Reasoning specializations: math-specific
                  (DeepSeek-Math, AlphaProof), code-specific
                  (o3-mini, CodeR1), agent-tuned variants
                  (DeepResearch). Inference-time-compute
                  budget management as standard.

We develop each phase below.

Classical AI reasoning

Pre-LLM reasoning systems. Classical AI focused on symbolic reasoning - manipulating logical formulas, performing planning over discrete state spaces, applying expert-defined rules.

Logic-based systems. Resolution theorem proving (Robinson 1965), Prolog (Colmerauer-Roussel 1972). Reasoning as logical inference over explicit knowledge representations.

Classical planning. STRIPS (Fikes-Nilsson 1971), PDDL standard. Reasoning as search over action sequences from initial state to goal.

Expert systems. MYCIN (1970s), DENDRAL. Reasoning via rule-firing on encoded expert knowledge.

Game-playing search. Minimax with alpha-beta (1950s); Monte Carlo Tree Search (Coulom 2006). Reasoning as tree search.

The 2026 perspective. Classical reasoning approaches have substantial limitations (knowledge-acquisition bottleneck; combinatorial explosion; brittleness). The LLM-based reasoning paradigm takes a different approach - learn reasoning patterns from massive data rather than encode them explicitly. Some classical methods (MCTS especially) still play roles in modern systems (AlphaZero, AlphaProof).

LLM-emergent reasoning

The 2020-2021 baseline. GPT-3 (Brown et al. 2020) demonstrated that scaled language models acquire some reasoning capability without explicit reasoning training - solving multi-step arithmetic, answering chained queries.

The pattern. Pretraining on text that contained worked examples (math homework, programming tutorials, scientific explanations) embedded reasoning patterns in model weights. Models could apply these patterns to new problems.

The limits. Pre-CoT GPT-3 was substantially limited on harder reasoning tasks. Models often produced confident-sounding wrong answers; multi-step arithmetic failed; complex logical chains broke down. Models had reasoning capability but not reliable deliberative computation.

Chain-of-thought prompting

The 2022 inflection. Wei et al. (Google Research, January 2022) “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Demonstrated that prompting models to produce step-by-step reasoning substantially improved performance on math and reasoning benchmarks.

The recipe. Include in the few-shot prompt examples of step-by-step reasoning. Model learns from in-context examples to produce step-by-step reasoning on new problems.

The result. Substantial gains on GSM8K (math word problems): from ~18% (direct prompting) to ~57% (CoT prompting) with PaLM-540B. Established that eliciting reasoning was substantially helpful.

The trajectory. CoT became standard prompting practice. Many subsequent papers refined: zero-shot CoT (“Let’s think step by step”); least-to-most prompting; self-consistency; tree-of-thoughts.

Self-consistency (Wang et al., March 2022). Sample multiple reasoning chains; take majority-vote on final answer. Substantial improvement over single-sample CoT. Demonstrates inference-time scaling - using more compute at inference helps.

Tree-of-thoughts (Yao et al., 2023). Explicit search over reasoning steps. Generate multiple thoughts; evaluate; expand promising; prune. Substantial gains on Game of 24, creative writing.

The pattern. By 2023, prompting-based reasoning was substantial but had clear limits - required clever prompting; quality depended on prompt engineering; substantial variance across runs.

Process supervision

A 2023 advance. Lightman et al. (OpenAI, 2023) “Let’s Verify Step by Step.” Demonstrated that training reward models on per-step correctness (process supervision) substantially outperformed training on only final-answer correctness (outcome supervision).

The recipe. Use human annotators to label per-step correctness of model-generated reasoning chains. Train a process reward model (PRM) on per-step labels. Use the PRM for best-of-N selection or RL.

The result. PRM-based selection substantially outperformed ORM (outcome reward model) selection on MATH benchmark. Suggested that intermediate-reasoning quality matters for reasoning quality.
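
The difference between PRM-based and ORM-based best-of-N selection can be illustrated with a small sketch. The candidate data, the score values, and the helper names below are hypothetical; aggregating per-step scores by product (or by minimum) is one reasonable choice assumed here, not a claim about the exact procedure in any particular paper.

```python
from math import prod
from typing import Callable, Dict, List, Sequence


def prm_solution_score(step_scores: Sequence[float], aggregate: str = "product") -> float:
    """Collapse per-step correctness probabilities into one solution-level score."""
    if not step_scores:
        raise ValueError("a solution needs at least one scored step")
    if aggregate == "product":
        return prod(step_scores)  # one bad step drags the whole chain down
    if aggregate == "min":
        return min(step_scores)   # score the chain by its weakest step
    raise ValueError(f"unknown aggregate: {aggregate}")


def best_of_n(candidates: List[Dict], score_fn: Callable[[Dict], float]) -> Dict:
    """Return the candidate with the highest score under score_fn."""
    return max(candidates, key=score_fn)


# Hypothetical candidates: an outcome score for the final answer (ORM-style)
# and per-step scores for the reasoning chain (PRM-style).
candidates = [
    {"answer": "12", "orm_score": 0.70, "step_scores": [0.90, 0.20, 0.95]},
    {"answer": "14", "orm_score": 0.60, "step_scores": [0.80, 0.85, 0.90]},
]

orm_pick = best_of_n(candidates, lambda c: c["orm_score"])
prm_pick = best_of_n(candidates, lambda c: prm_solution_score(c["step_scores"]))

print("ORM picks:", orm_pick["answer"])
print("PRM picks:", prm_pick["answer"])
```

In this toy data the outcome score prefers the first candidate, while the per-step scores expose a weak middle step and prefer the second.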

The trajectory. Process supervision became one approach to reasoning improvement. Later work (DeepSeek-R1 specifically) demonstrated that outcome-only RL with sufficient scale could achieve comparable or better results - process supervision is not strictly necessary. The debate remains open.

The o1 inflection

September 2024. OpenAI released o1-preview and o1-mini. The first major reasoning-RL model: extended thinking via RL training; substantial gains on math, science, coding benchmarks.

The methodology (partially disclosed). RL training on reasoning tasks with rewards; produces models that generate long internal reasoning chains before answers. OpenAI hid the raw chains, showing only summaries - citing safety considerations and competitive concerns.

The benchmark results.

AIME 2024 (math competition). o1-preview: 56.7% (vs GPT-4o: 13.4%).
GPQA Diamond. o1: 78.0% (vs GPT-4o: 50.6%).
Codeforces percentile. o1: 89th (vs GPT-4o: 11th).

The pattern. Substantial gains on hard reasoning tasks; limited gains on knowledge-heavy or non-reasoning tasks. The model didn’t get smarter at everything - it got substantially better at reasoning-intensive tasks.

The cost. o1 inference was substantially more expensive than GPT-4o per request - both API price and latency. The “thinking time” was minutes for hard problems.

o3 and December 2024

OpenAI o3 announced December 2024. Substantial further gains, particularly on benchmarks designed to resist memorization.

ARC-AGI. o3: 87.5% in its high-compute configuration (vs ~25% for o1, ~5% for GPT-4o). The first model to solve the large majority of ARC-AGI tasks (a benchmark explicitly designed for generalization, not memorization).
FrontierMath. o3: 25.2% (vs ~2% for o1). FrontierMath problems are research-level math.
SWE-bench Verified. o3: 71.7% (vs ~49% for o1).

The cost. o3 ran in two modes - “low compute” (~$20/task) and “high compute” (~$3,000/task on ARC-AGI). Demonstrated inference-time-compute scaling - more thinking time → better answers, at substantially scaled cost.

DeepSeek-R1 and the open reasoning era

January 2025. DeepSeek released R1 with open weights and a methodology paper. Reproduced reasoning-RL outside top US labs; visible reasoning chains; demonstrated GRPO training pipeline.

The methodology (disclosed). R1 was trained in stages:

R1-Zero. Apply RL directly to a pretrained base model (DeepSeek-V3 base) with verifiable rewards (math correctness, code execution). Skip the supervised fine-tuning step entirely.
Cold-start data. Generate some examples of good reasoning from R1-Zero; use as cold-start for SFT.
Reasoning-oriented RL. Apply GRPO with verifiable rewards.
Rejection sampling + SFT. Sample high-quality reasoning from the RL model; use as SFT data.
RL for helpfulness and safety. Final RL pass for general-helpfulness.

The R1 paper revealed that outcome-only RL with sufficient scale produces substantial reasoning capability - process supervision is helpful but not strictly necessary.
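
The core of GRPO with a verifiable reward can be sketched in a few lines: sample a group of answers for one prompt, score each with an automatic check, and normalize each reward against the group's mean and standard deviation. This is a minimal sketch of the advantage computation only, not the full policy-gradient update; the exact-match reward, the use of population standard deviation, and the `eps` constant are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def exact_match_reward(model_answer: str, reference: str) -> float:
    """Verifiable reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if model_answer.strip() == reference.strip() else 0.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its own sampling group (no learned critic)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps avoids division by zero
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled final answers for one prompt.
sampled_answers = ["42", "41", "42 ", "7"]
rewards = [exact_match_reward(a, "42") for a in sampled_answers]
advantages = group_relative_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Correct samples receive positive advantages and incorrect ones negative. When every sample in a group earns the same reward, all advantages are zero, so that group contributes no learning signal.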

The “aha moment” finding. The R1 paper described an “aha moment” - during training, the model spontaneously developed sophisticated reasoning behaviors (backtracking, alternative-approach generation, self-verification) that were not explicitly demonstrated in training data. The reasoning capability emerged from RL training, not just from imitation.

The impact. R1 demonstrated reasoning-RL was reproducible with available compute (substantial but not Anthropic/OpenAI scale). Several follow-up papers replicated; the methodology became standard.

The 2025-2026 reasoning landscape

The expansion. Through 2025, most frontier labs shipped reasoning-capable models:

OpenAI. o1-mini, o1, o3, o3-mini (with reasoning effort tuning).
DeepSeek. R1, R1-Distill variants.
Anthropic. Claude with extended thinking (user-controlled budget).
Google. Gemini 2.0 Flash Thinking, Gemini 2.5 Pro Deep Think.
xAI. Grok 3 with thinking.
Alibaba. Qwen-QwQ, Qwen3 Reasoning.

The specialization trend. Reasoning models specialized for domains:

Mathematics. DeepSeek-Math, AlphaProof, math-specific reasoning models.
Code. o3-mini-high (code-tuned variants), CodeR1.
Agents. DeepResearch, computer-use-tuned reasoning models.
Science. Domain-specific reasoning for biology, chemistry.

The small-model trend. R1-Distill (32B, 14B, 7B parameter distilled variants) demonstrated that reasoning capability could be substantially transferred to smaller models. Standard distillation: train smaller model on reasoning traces from larger model.

Where this leaves us in 2026

Reasoning models are the dominant paradigm for hard problems. The methodology (post-training RL on verifiable rewards; extended generation; inference-time-compute scaling) is standard. Active frontiers include: process vs outcome supervision; scaling reasoning RL; multimodal reasoning; agent integration; evaluation and reliability.

§3. The Reasoning-Model Framework

What “reasoning” means in this context

A specific operational definition. Reasoning in the LLM context refers to:

Multi-step computation. The model generates a sequence of intermediate steps to solve a problem.
Working-memory-like usage of tokens. Intermediate tokens serve as scratch space for computation.
Deliberative behavior. The model explores possible approaches, makes hypotheses, verifies, backtracks.

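The distinction from direct generation. A standard LLM answering directly begins emitting the answer immediately, so the computation behind each answer token is bounded by a single forward pass through the network. Reasoning generation first produces a long sequence of intermediate tokens that the model conditions on when it writes the eventual answer.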

The argument for token-as-scratch-space. Each generated token is processed by the full network on subsequent steps - the model can use intermediate computations as working memory. This enables substantially more computation than the fixed network depth allows in one forward pass.

The reasoning-model architecture

Reasoning models share architecture with standard LLMs (decoder-only transformers - cross-reference LLM §4) but are trained differently to produce extended deliberative outputs. The key architectural elements:

Standard transformer backbone. Same architecture as base LLM; same layer count, head count, embedding dimension.
Long context support. Reasoning models often have longer context windows than non-reasoning counterparts (to accommodate long reasoning chains).
Optional separator tokens. Some implementations use special tokens (<thinking>...</thinking>) to separate reasoning from final answer.

The training makes a model a reasoning model, not the architecture.

The thinking-vs-answer separation

A common implementation pattern. The model produces two distinct outputs:

   REASONING-MODEL OUTPUT STRUCTURE

   <thinking>
     Let me approach this problem...
     First, I'll consider...
     Wait, that's not quite right because...
     Let me try a different approach...
     ...
     OK, so the answer is...
   </thinking>

   <answer>
     The answer is 42.
   </answer>

The implementations vary.

OpenAI o-series. Reasoning hidden by default; only summaries visible. Optional “show reasoning” for some endpoints.
DeepSeek-R1. Reasoning visible in <think>...</think> tags before final answer.
Anthropic Claude extended thinking. Reasoning visible in dedicated thinking section; user controls thinking-time budget.
Gemini Thinking. Reasoning visible.
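
For models that emit visible reasoning in tags, client code usually needs to separate the trace from the answer. The following is a minimal sketch assuming R1-style `<think>...</think>` delimiters in the raw text; the function and type names are hypothetical, and other providers return thinking through their own response structures rather than inline tags.

```python
import re
from typing import NamedTuple


class ParsedOutput(NamedTuple):
    thinking: str
    answer: str


# Non-greedy match so only the first thinking block is captured.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(raw: str) -> ParsedOutput:
    """Split raw model text into (thinking, answer); tolerate missing tags."""
    match = _THINK_RE.search(raw)
    if match is None:
        return ParsedOutput(thinking="", answer=raw.strip())
    thinking = match.group(1).strip()
    # Everything outside the thinking block is treated as the answer.
    answer = (raw[: match.start()] + raw[match.end():]).strip()
    return ParsedOutput(thinking=thinking, answer=answer)


raw = "<think>\n2 cans of 3 is 6. 5 + 6 = 11.\n</think>\nThe answer is 11."
parsed = split_thinking(raw)
print(parsed.answer)
```

Keeping the trace in a separate field lets an application log or display it without mixing it into the response shown to end users.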

The trade-offs of visibility. Visible reasoning enables transparency, debugging, trust. Hidden reasoning protects training methodology, prevents distillation, may permit raw unfiltered thinking. Field consensus by mid-2025: visible is preferable for most use cases; transparency arguments outweigh competitive concerns.

The inference-time compute axis

A central concept. Reasoning models introduce inference-time compute as a quality-improving axis, alongside the classical model size and training compute axes.

The classical scaling axes (Kaplan et al. 2020, Chinchilla 2022 - cross-reference Foundation Models §6).

Model size $N$. Bigger model → better.
Training data $D$. More data → better.
Training compute $C$. $C \approx 6ND$ FLOPs for a dense transformer.

The new axis.

Inference compute. More tokens at inference → better, for reasoning models.

The implication. For a fixed model, quality is a function of inference budget. A reasoning model with 32K-token budget may substantially outperform the same model with 4K-token budget on hard problems. This changes the deployment economics - you choose model AND inference budget per query.
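
A back-of-the-envelope calculation makes the budget trade-off concrete. The sketch below uses the standard approximations of about $6N$ FLOPs per training token and about $2N$ FLOPs per generated token for a dense transformer; it ignores attention cost, which grows with context length, and the 70B parameter count is a hypothetical example rather than a figure for any named model.

```python
def training_flops(n_params: float, n_train_tokens: float) -> float:
    """Approximate training compute, C ~ 6 * N * D, for a dense transformer."""
    return 6.0 * n_params * n_train_tokens


def generation_flops(n_params: float, n_generated_tokens: int) -> float:
    """Approximate forward-pass compute for generation, ~ 2 * N per token.

    Ignores attention over the growing context, so it understates long chains.
    """
    return 2.0 * n_params * n_generated_tokens


N_PARAMS = 70e9  # hypothetical dense model size

short_budget = generation_flops(N_PARAMS, 4_096)   # 4K-token thinking budget
long_budget = generation_flops(N_PARAMS, 32_768)   # 32K-token thinking budget
cost_ratio = long_budget / short_budget

print(f"4K budget:  {short_budget:.2e} FLOPs per query")
print(f"32K budget: {long_budget:.2e} FLOPs per query")
print(f"ratio:      {cost_ratio:.0f}x")
```

Under these assumptions the per-query cost scales linearly with the thinking budget, which is why the budget is worth choosing per query rather than fixing globally.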

The reasoning-quality knob

A specific deployment pattern. Reasoning models often expose reasoning effort as a user-controllable parameter:

OpenAI o3. “reasoning_effort” parameter: low/medium/high.
Claude extended thinking. Token budget parameter.
Gemini Thinking. Thinking-time parameter.

The pattern. Higher reasoning effort = more thinking tokens = higher quality on hard problems = higher cost and latency. The user chooses the trade-off per query.

The 2026 norm. Most reasoning-model APIs expose this knob. Standard practice: low for simple queries, high for hard ones.

Capability profile

What reasoning models are good at. Empirical observations from 2024-2026:

Mathematics. Substantial gains on AIME, FrontierMath, math competitions.
Coding. Substantial gains on competitive programming, SWE-bench, complex debugging.
Science. Gains on GPQA, multi-step scientific reasoning.
Logical puzzles. Gains on logical-deduction tasks, multi-step reasoning.

What reasoning models are less differentiated on.

Pure knowledge questions. Don’t help much beyond base-model knowledge.
Simple instruction-following. Standard models suffice.
Creative writing. Limited evidence reasoning helps significantly.
Conversational tasks. Often unnecessary; can add latency without quality gain.

The 2026 pattern. Reasoning models are substantially better for hard reasoning tasks; similar for simple tasks; sometimes worse (over-thinking) for trivial queries.

The deployment trade-off

Reasoning models are typically:

Slower. Long generation → longer wall-clock time. Often 10-100x.
More expensive. Output-token cost dominates; reasoning models generate substantially more tokens.
Sometimes worse at simple things. Over-thinking can hurt on simple queries.

Production pattern in 2026: route queries by complexity. Simple queries → standard LLM. Hard queries → reasoning model. Routing itself often uses a small LLM classifier.
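
The routing pattern can be sketched as follows. Everything here is illustrative: the model names, keyword list, weights, and thresholds are hypothetical, and the keyword heuristic stands in for the small LLM classifier that a production router would more plausibly use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    model: str                       # hypothetical model identifier
    reasoning_effort: Optional[str]  # None for the non-reasoning model


# Hypothetical cues that a query needs deliberation.
HARD_HINTS = ("prove", "debug", "derive", "optimize", "step by step")


def estimate_difficulty(query: str) -> float:
    """Toy difficulty score in [0, 1]; a stand-in for a learned classifier."""
    text = query.lower()
    hint_hits = sum(hint in text for hint in HARD_HINTS)
    length_term = min(len(text.split()) / 200, 1.0)
    return min(1.0, 0.4 * hint_hits + 0.6 * length_term)


def route(query: str, hard: float = 0.3, very_hard: float = 0.7) -> Route:
    """Send easy queries to a standard LLM and hard ones to a reasoning model."""
    difficulty = estimate_difficulty(query)
    if difficulty < hard:
        return Route(model="standard-llm", reasoning_effort=None)
    effort = "high" if difficulty >= very_hard else "medium"
    return Route(model="reasoning-llm", reasoning_effort=effort)


print(route("What is the capital of France?"))
print(route("Debug this function"))
print(route("Prove that the sum of two even integers is even, then derive the general case."))
```

The design choice worth noting is that the router decides both the model and the effort level, so a single classifier output controls latency and cost together.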

§4. Chain-of-Thought Prompting

The original CoT

The 2022 paper. Wei et al. (Google Research, 2022) “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Demonstrated that few-shot prompts including step-by-step reasoning substantially improved performance on reasoning benchmarks.

The recipe.

   FEW-SHOT COT PROMPT EXAMPLE

   Q: Roger has 5 tennis balls. He buys 2 more cans of
   tennis balls. Each can has 3 tennis balls. How many
   tennis balls does he have now?
   A: Roger started with 5 balls. 2 cans of 3 balls each
   is 6 balls. 5 + 6 = 11. The answer is 11.

   Q: [actual question]
   A: [model produces step-by-step reasoning + answer]

The result on GSM8K (Cobbe et al. 2021 math word problems): PaLM-540B accuracy went from ~18% (direct prompting) to ~57% (CoT prompting). Substantial gain.

The interpretation. The model has reasoning capability but doesn’t use it spontaneously - prompting elicits it.
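
Assembling a few-shot CoT prompt is plain string construction. This sketch reuses the tennis-ball exemplar from the example above; the function name and the Q/A layout are illustrative assumptions, and the second question is made up for the demonstration.

```python
from typing import List, Tuple

# Each exemplar pairs a question with a worked, step-by-step answer.
EXEMPLARS: List[Tuple[str, str]] = [
    (
        "Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?",
        "Roger started with 5 balls. 2 cans of 3 balls each is 6 balls. "
        "5 + 6 = 11. The answer is 11.",
    ),
]


def build_few_shot_cot_prompt(exemplars: List[Tuple[str, str]], question: str) -> str:
    """Concatenate worked exemplars, then leave the final answer slot open."""
    blocks = [f"Q: {q}\nA: {a}" for q, a in exemplars]
    blocks.append(f"Q: {question}\nA:")  # the model continues from here
    return "\n\n".join(blocks)


prompt = build_few_shot_cot_prompt(
    EXEMPLARS,
    "A baker has 3 trays of 12 rolls and sells 10 rolls. How many rolls are left?",
)
print(prompt)
```

Because the exemplar answers show their reasoning, the model's continuation tends to follow the same step-by-step format before stating an answer.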

Zero-shot CoT

Kojima et al. (2022) “Large Language Models are Zero-Shot Reasoners.” Demonstrated that simply appending “Let’s think step by step” to a question elicits CoT-style reasoning without explicit examples.

The recipe.

   ZERO-SHOT COT

   Prompt: [Question]
           Let's think step by step.
   Model: [Step-by-step reasoning]
          Therefore, the answer is X.

Substantial gains on math benchmarks. Zero-shot CoT often approaches few-shot CoT quality.
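
The zero-shot variant needs only a trigger phrase and a way to pull the final answer out of the completion. The sketch below assumes numeric answers phrased as "the answer is X"; the regex and function names are illustrative, and the completion string is a hand-written stand-in for model output.

```python
import re
from typing import Optional

TRIGGER = "Let's think step by step."


def build_zero_shot_cot_prompt(question: str) -> str:
    """Append the trigger phrase so the model starts with reasoning."""
    return f"Q: {question}\nA: {TRIGGER}"


# Assumption: the final answer is a number introduced by "the answer is".
_ANSWER_RE = re.compile(r"the answer is\s*\$?(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(completion: str) -> Optional[str]:
    """Return the last stated numeric answer, or None if none is found."""
    matches = _ANSWER_RE.findall(completion)
    return matches[-1] if matches else None


prompt = build_zero_shot_cot_prompt("Roger has 5 balls and buys 6 more. How many now?")
completion = "Roger starts with 5. He adds 6. 5 + 6 = 11. Therefore, the answer is 11."
print(extract_answer(completion))
```

Taking the last match rather than the first is a deliberate choice: a chain may mention a tentative answer before revising it.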

Self-consistency

Wang et al. (Google, 2022) “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Sample multiple reasoning chains; take majority vote on the final answer.

The procedure.

   SELF-CONSISTENCY

   For i in 1..N:
       chain_i = sample_cot(prompt, temperature > 0)
       answer_i = extract_answer(chain_i)

   final = majority_vote(answer_1, ..., answer_N)

The result. Substantial improvement over single-sample CoT. With N=40, GSM8K accuracy with PaLM-540B reached ~75% (vs ~57% single-sample CoT).

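The interpretation. Different sampled reasoning paths tend to make different errors, so wrong answers scatter across many values while correct chains agree; the majority vote filters out these uncorrelated errors. This is an early form of inference-time scaling - spending more compute at inference improves quality.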
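
The pseudocode above can be made runnable with a stubbed sampler. In this sketch, `sample_chain` is any zero-argument callable that returns one reasoning chain; here it cycles through a fixed list of hand-written chains so the result is deterministic, where a real system would call a model with temperature above zero. The answer-extraction regex is an assumption about the output format.

```python
import itertools
import re
from collections import Counter
from typing import Callable, Optional

_ANSWER_RE = re.compile(r"answer is\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def extract_answer(chain: str) -> Optional[str]:
    """Pull the last stated numeric answer out of a reasoning chain."""
    matches = _ANSWER_RE.findall(chain)
    return matches[-1] if matches else None


def self_consistency(sample_chain: Callable[[], str], n_samples: int) -> Optional[str]:
    """Sample n chains and return the most common extracted answer."""
    answers = []
    for _ in range(n_samples):
        answer = extract_answer(sample_chain())
        if answer is not None:  # skip chains with no parseable answer
            answers.append(answer)
    if not answers:
        return None
    # Ties resolve to the answer that was seen first.
    return Counter(answers).most_common(1)[0][0]


# Deterministic stand-in for stochastic sampling from a model.
_fake_chains = itertools.cycle([
    "5 + 2 * 3 = 11. The answer is 11.",
    "5 + 2 = 7, times 3 is 21. The answer is 21.",
    "2 cans of 3 is 6. 5 + 6 = 11. The answer is 11.",
    "5 + 3 = 8. The answer is 8.",
    "6 new balls plus 5 is 11. The answer is 11.",
])

final = self_consistency(lambda: next(_fake_chains), n_samples=5)
print(final)
```

Two of the five toy chains are wrong, but they disagree with each other, so the three agreeing chains win the vote.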

Least-to-most prompting

Zhou et al. (2022) “Least-to-Most Prompting Enables Complex Reasoning in Large Language Models.” Decompose problem into easier sub-problems; solve sequentially.

The pattern.

   LEAST-TO-MOST

   1. Prompt model to decompose problem into sub-problems.
   2. Solve sub-problems in order, each sub-problem
      conditioned on solutions of earlier ones.
   3. Combine to final answer.

Substantial gains on multi-step problems (compositional generalization).
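
The control flow of least-to-most prompting can be sketched with two callables standing in for model calls. Both `decompose` and `solve` are hypothetical stubs here, hard-coded for one toy problem; the sketch also assumes the last sub-problem restates the original question, so its answer serves as the combined final answer.

```python
from typing import Callable, List, Optional, Tuple


def least_to_most(
    problem: str,
    decompose: Callable[[str], List[str]],
    solve: Callable[[str, str], str],
) -> Optional[str]:
    """Solve sub-problems in order, feeding earlier Q/A pairs into later ones."""
    solved: List[Tuple[str, str]] = []
    for sub in decompose(problem):
        # Context carries every earlier sub-problem and its answer.
        context = "\n".join(f"Q: {q}\nA: {a}" for q, a in solved)
        solved.append((sub, solve(context, sub)))
    # Assumption: the final sub-problem is the original question.
    return solved[-1][1] if solved else None


def toy_decompose(problem: str) -> List[str]:
    """Stub for the decomposition prompt."""
    return [
        "How long does each trip take?",
        "How many times can she slide before it closes?",
    ]


def toy_solve(context: str, sub: str) -> str:
    """Stub for the solving prompt; the second step reads the first answer."""
    if "each trip" in sub:
        return str(4 + 1)  # 4 minutes up, 1 minute down
    trip_minutes = int(context.rsplit("A: ", 1)[-1])
    return str(15 // trip_minutes)


problem = (
    "It takes Amy 4 minutes to climb a slide and 1 minute to slide down. "
    "The slide closes in 15 minutes. How many times can she slide?"
)
result = least_to_most(problem, toy_decompose, toy_solve)
print(result)
```

The second stubbed step cannot run without the first step's answer in its context, which is the dependency the method is built around.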

Other prompting variants

The 2022-2024 prompting-research explosion produced many variants:

Step-back prompting. Ask model to consider abstract principles before specifics.
Self-ask. Model asks itself sub-questions and answers them.
Plan-and-solve. Model generates plan; then executes plan.
Chain-of-verification. Generate; then verify each claim; revise.
Self-refine. Iterative refinement with self-feedback (cross-reference §8).

The 2026 picture. Prompting-based reasoning is substantial but limited. Best prompting approaches plateau around the model’s baseline capability - they elicit existing capability rather than create new capability. The 2024 reasoning-RL paradigm substantially extends beyond prompting.

Why prompting-only reasoning has limits

Three structural limitations.

Quality ceiling. Prompting elicits the model’s existing capability. If the model can’t reason well, no prompt fixes that.

Brittleness. Results are sensitive to the prompt: small changes in wording or exemplar choice can shift accuracy noticeably.

Format dependence. Different formats (Q&A, plain text, JSON) yield different quality.

The implication. To substantially improve reasoning, you need to train the model for reasoning - not just prompt it. The RL-based approach (§5-§7) does exactly this.

§5. Tree-of-Thoughts and Search-Based Reasoning

Tree of Thoughts

Yao et al. (2023) “Tree of Thoughts: Deliberate Problem Solving with Large Language Models.” Extends CoT with explicit search over reasoning steps.

The structure.

   TREE-OF-THOUGHTS PATTERN

                    [Problem]
                       │
            ┌──────────┼──────────┐
            ▼          ▼          ▼
        Thought_1  Thought_2  Thought_3  ← Generation
            │          │          │
        [Evaluator scores each]              ← Evaluation
            │          │          │
            ▼          ▼          ▼
        Score: 0.3  Score: 0.8  Score: 0.5
                       │                      ← Selection
                       ▼                      (pick top-K)
            ┌──────────┼──────────┐
            ▼          ▼          ▼
        Thought_2a Thought_2b Thought_2c     ← Expansion
            │          │          │
            ...continue until terminal nodes
            │
            ▼
        Best final answer

The four components.

Thought generation. Model produces candidate next-steps.
Evaluation. Model (or learned evaluator) scores thought quality.
Search. BFS or DFS or beam over the tree.
Pruning. Drop low-scoring branches.

The result. Substantial gains on Game of 24 (mathematical puzzle), creative writing, mini-crosswords.
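
A minimal sketch. The four components can be combined into a beam-style search, assuming caller-supplied propose, score, and is_terminal functions (all hypothetical stand-ins for model calls). The toy task at the bottom replaces the LLM with simple string operations so the block runs on its own; it is not one of the tasks from Yao et al.

```python
from typing import Callable, List


def tree_of_thoughts(
    root: str,
    propose: Callable[[str], List[str]],   # thought generation
    score: Callable[[str], float],         # evaluation
    is_terminal: Callable[[str], bool],
    beam_width: int = 2,
    max_depth: int = 4,
) -> str:
    """Beam search over thoughts; returns the best state found."""
    frontier = [root]
    for _ in range(max_depth):
        # Generation: expand every state kept in the beam.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        # Evaluation and pruning: keep only the top-scoring states.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if is_terminal(frontier[0]):
            break
    return frontier[0]


# Toy task (illustrative only): build a digit string whose digits sum to 6.
def propose(state: str) -> List[str]:
    return [state + d for d in "123"]


def digit_sum(state: str) -> int:
    return sum(int(ch) for ch in state)


def score(state: str) -> float:
    return -abs(6 - digit_sum(state))


def is_terminal(state: str) -> bool:
    return digit_sum(state) == 6


result = tree_of_thoughts("", propose, score, is_terminal)
```

In a real system each propose and score call is a model invocation, so the token cost grows roughly with beam_width times the branching factor times the depth.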

When tree search helps

The empirical pattern. Tree search helps most when:

Problem has decomposable structure. Sub-problems are tractable.
Quality evaluation is feasible. Model can score intermediate states.
Search budget is available. Each branch explored costs tokens.

It helps less when:

Problem is monolithic. Hard to decompose.
Evaluation is unreliable. Model can’t reliably score partial solutions.
Linear chain works. Many problems don’t need branching exploration.

Search in modern reasoning models

A key question. Do o1, o3, R1 do explicit tree search? Probably not. The visible reasoning chains in R1 (and likely o-series) look more like linear chains with backtracking - the model explores a path, identifies issues, returns to an earlier state, tries differently. This is structurally similar to depth-first search with backtracking but executed within a single linear chain.

The interpretation. Modern reasoning models have internalized search-like behaviors into linear generation. Rather than an external search loop calling the model, the model itself generates exploratory-and-backtracking content.

MCTS in reasoning systems

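AlphaProof and AlphaGeometry 2 (Google DeepMind, 2024). AlphaProof couples a neural network with AlphaZero-style search over formal proof steps in Lean; AlphaGeometry 2 pairs a language model with a symbolic geometry engine. Together they achieved IMO 2024 silver-medal performance.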

The pattern.

Tree search over proof steps.
Neural-network policy prioritizes promising steps.
Verifier (Lean theorem prover) certifies proof correctness.

The integration. Reasoning + symbolic verification produces substantial capability. Cross-reference AI for Science §4.

The contrast with pure-LLM reasoning. AlphaProof uses an external search loop; o-style reasoning models have search-like behavior internalized in their generation. Both approaches work; trade-offs exist.

Search-augmented agents

Agent systems that combine reasoning models with explicit search loops. Cross-reference AI Agents §5. Patterns include:

Generate-and-test for code generation. Generate multiple solutions; test each; keep best.
MCTS for agent planning. Search over action sequences with reasoning-model rollouts.
Beam search over reasoning traces. Maintain top-K reasoning traces; expand best.

The 2026 status. External search around reasoning models is one approach. Internal-search-like behavior in reasoning models is another. Both used in practice; combinations exist.

§6. The Reasoning-RL Paradigm

The motivating shift

The 2024 transition. Pre-2024 reasoning improvements came primarily from prompting. The 2024 inflection (o1, R1) introduced training models for reasoning - using reinforcement learning to optimize reasoning quality directly.

The motivation. Prompting has a quality ceiling. RL can push beyond - train the model to produce reasoning chains that lead to correct answers. This is a substantively different paradigm.

Reinforcement learning for reasoning

A specific RL setup. Cross-reference RL §10-§12 for general RL background.

The setup.

Environment. Reasoning tasks (math problems, coding problems, scientific reasoning).
State. Current partial reasoning + problem.
Action. Next token.
Reward. Final-answer correctness (verifiable: does the math work? does the code pass tests?).

The objective. Train the policy (the LLM) to produce reasoning chains that maximize the probability of correct final answers.

The training procedure (canonical reasoning-RL):

   REASONING-RL TRAINING LOOP

   Inputs:
     - Pretrained base LLM
     - Training prompts (math/code/reasoning problems)
     - Verifiable reward function (correctness checker)

   Loop:
     1. Sample a batch of prompts.
     2. For each prompt:
        Generate K rollouts (reasoning chains + answers).
     3. Score each rollout via verifiable reward.
     4. Compute advantages (group-relative).
     5. Update policy via clipped surrogate (PPO/GRPO).
     6. Periodic checkpoint.

   Output: Reasoning-tuned model.

The key innovations vs RLHF.

Verifiable rewards. No reward model needed; reward is computed automatically (math: does answer match? code: do tests pass?).
Scalable data. Verifiable problems are abundant (math libraries, code-test-suite repos).
Long generations emerge. Length is not rewarded directly; the policy learns that longer reasoning chains more often reach correct answers.

GRPO: the canonical reasoning-RL algorithm

Group Relative Policy Optimization (GRPO) - introduced in DeepSeekMath (Shao et al., 2024) and used prominently in DeepSeek-R1. Cross-reference RL §12 for full algorithmic details.

The key idea. Standard PPO uses a value function (critic) to estimate baselines. GRPO eliminates the critic - uses group-relative baselines instead.

The procedure.

   GRPO STEP

   For each prompt q:
     1. Sample G rollouts: o_1, ..., o_G from current policy.
     2. Compute rewards r_1, ..., r_G via verifiable reward.
     3. Compute group-relative advantages:
          A_i = (r_i - mean(r_1..r_G)) / std(r_1..r_G)
     4. Compute per-token GRPO loss (similar to PPO clipped):
          L = -E[ min(ratio * A, clip(ratio, 1-eps, 1+eps) * A) ]
          + beta * KL(policy, reference_policy)
     5. Take gradient step.

Why GRPO works for reasoning RL.

No critic needed. PPO's value network is typically comparable in size to the policy, so dropping it removes a large share of training memory and compute.
Group baselines are natural for verifiable rewards. When you sample multiple solutions and some are right, the right ones are clearly better than average - natural baseline.
Stable training. KL constraint to reference policy prevents drift.

The empirical evidence. DeepSeek-R1 paper demonstrates GRPO produces substantial reasoning capability.
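
A worked sketch. Steps 3 and 4 of the GRPO procedure can be illustrated with plain Python on scalar values. The assumptions here are not from the source: population standard deviation, a small eps to avoid division by zero when all rewards in a group are equal, a per-sample rather than per-token loss, and the KL penalty left out.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """A_i = (r_i - mean(r)) / (std(r) + eps), computed within one prompt's group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; eps guards sigma == 0
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_surrogate_loss(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped objective for one sample, negated so lower is better."""
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return -min(ratio * advantage, clipped_ratio * advantage)


# Four rollouts for one prompt: two correct (reward 1), two incorrect (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

With binary rewards, a group in which every rollout is right or every rollout is wrong yields all-zero advantages and therefore no gradient, which is why the difficulty of the training prompts matters for sample efficiency.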

The “aha moment” finding

A specific empirical observation. DeepSeek-R1 paper (DeepSeek, 2025) described an “aha moment” during R1-Zero training - the model spontaneously developed sophisticated reasoning behaviors:

Backtracking. “Wait, let me reconsider...”
Self-verification. “Let me check this answer by...”
Alternative approaches. “Actually, a different approach is...”
Longer chains. Average chain length grew during training.

The interpretation. These behaviors emerged from RL training on verifiable rewards. They were not explicitly demonstrated in training data; the RL signal alone produced them. The model learned that reasoning behaviors led to correct answers.

The significance. Provides evidence that reasoning capability can be elicited via RL alone, without explicit demonstrations. This is a substantively different finding from imitation-learning-based approaches.

Verifiable reward sources

What “verifiable” means in practice.

Mathematics. Final-answer correctness (compare against known answer).
Code. Tests pass / fail.
Logical puzzles. Solver verification.
Theorem proving. Lean/Coq proof check.

What’s not easily verifiable.

Creative writing quality. Subjective.
Conversational helpfulness. Subjective.
Open-ended scientific questions. Often no ground truth.

The 2026 pattern. Reasoning RL works substantially better in verifiable domains. RLHF (for subjective domains) and reasoning RL (for verifiable domains) coexist.
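
A minimal sketch. A verifiable reward for mathematics can be as simple as an answer-key comparison. The sketch below assumes completions end with a line of the form "Answer: ..." (a hypothetical output convention, not one the source specifies) and compares numeric answers exactly as rationals, falling back to string equality for non-numeric answers.

```python
import re
from fractions import Fraction
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker (assumed format), if any."""
    matches = re.findall(r"Answer:\s*([^\n]+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parsable answer counts as incorrect
    try:
        # Exact rational comparison, so "0.5" and "1/2" agree.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers: fall back to exact string match.
        return 1.0 if answer == reference.strip() else 0.0
```

Real checkers usually need more normalization (units, LaTeX, equivalent symbolic forms); overly strict matching penalizes correct answers, and overly loose matching invites reward hacking.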

Outcome vs process supervision

A key methodological question. Should reward signal apply at:

Outcome level. Final answer correct/wrong → single reward per chain.
Process level. Each step correct/wrong → per-step rewards.

Process Reward Models (PRMs) (Lightman et al., OpenAI 2023). Train reward model on per-step labels (often human-labeled). Substantial gains over outcome-only supervision.

Outcome Reward Models (ORMs) are simpler. Just check final answer.

The 2024-2025 finding. R1 demonstrated that outcome-only RL with sufficient scale produces substantial reasoning capability. Process supervision is helpful but not strictly necessary.

The trade-offs.

Process supervision. Higher signal density; more data per chain; requires per-step labels (expensive).
Outcome supervision. Lower signal density; only final reward; cheap data (just need answer key).

The 2026 picture. Both approaches exist; outcome is more scalable; process is more sample-efficient. Hybrid approaches (PRM for initial training; outcome for scaling) are common.

Scaling reasoning RL

The empirical scaling. R1 paper and follow-ups demonstrate substantial scaling.

More training compute. Substantial gains.
More diverse problems. Better generalization.
Larger base models. Reasoning capability composes with base-model capability.

The cost. Reasoning RL is expensive. Each training step requires generating long rollouts (thousands of tokens). The compute is substantial.

The 2026 picture. Reasoning RL is the dominant approach for state-of-the-art reasoning. Scaling continues; substantive gains continue. Likely the primary driver of reasoning capability improvement in 2026 and beyond.

§7. Inference-Time Compute Scaling

The new scaling axis

A substantive 2024-2026 development. Reasoning models introduced inference-time compute as a quality-improving axis.

The classical view (pre-2024). Quality is primarily a function of model size and training compute. Larger models trained on more data are better.

The new view (2024-2026). Quality is also a function of inference compute. Reasoning models can substantially improve quality by thinking longer.

The implication. The compute-cost-quality landscape is now multi-dimensional. You can trade off:

Train bigger model + standard inference vs
Train smaller model + more inference compute

These produce different cost/quality profiles for different applications.

Empirical inference-time scaling

The o3 ARC-AGI result. o3 ran in two modes - “low compute” and “high compute” - with substantially different ARC-AGI scores. Demonstrated explicit inference-time-compute scaling.

The ARC-AGI numbers.

Low compute (~$20/task). ARC-AGI: 75.7%.
High compute (~$3,000/task). ARC-AGI: 87.5%.

The interpretation. Substantial quality improvement from substantial inference-compute increase. The high-compute regime is not economical for production but demonstrates the scaling.

Inference-time scaling vs training-time scaling

A substantive comparison. Both axes provide quality. Which is more cost-efficient?

The empirical observation. Both matter; both scale. For some problems, inference-time scaling is highly cost-effective (more thinking solves hard problems faster than retraining). For other problems, training-time scaling is necessary (knowledge problems benefit more from larger training).

The 2026 picture. The cost-effective frontier combines moderate model size + substantial inference compute for hard reasoning tasks. Pre-training scaling shows signs of diminishing returns; inference-time scaling is comparatively fresh territory.

Best-of-N and verifier-based inference scaling

A specific pattern. Generate N reasoning chains; pick the best one. The “best” can be determined by:

Self-consistency (majority vote). Pick the most common answer.
Verifier (PRM/ORM). Score each chain with a separate reward model; pick the highest-scoring.
External tool. For math, run a checker. For code, run tests.

The recipe.

   BEST-OF-N INFERENCE

   For i in 1..N:
       chain_i = sample_reasoning(prompt, temperature)
       answer_i = extract(chain_i)
       score_i = verifier(chain_i, answer_i)

   final = chain with highest score

The result. Substantial quality improvement over single-sample decoding. Quality grows sub-linearly in N (diminishing returns), while cost grows linearly.
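
A minimal sketch. The recipe above covers both selection rules in one function, assuming a caller-supplied sample(prompt) that returns a (chain, answer) pair and an optional verifier(chain, answer) score (both hypothetical). With no verifier it falls back to majority vote.

```python
from collections import Counter
from typing import Callable, Optional, Tuple

Sample = Tuple[str, str]  # (reasoning chain, extracted answer)


def best_of_n(
    prompt: str,
    sample: Callable[[str], Sample],
    n: int,
    verifier: Optional[Callable[[str, str], float]] = None,
) -> str:
    """Draw n samples and return the selected final answer."""
    candidates = [sample(prompt) for _ in range(n)]
    if verifier is None:
        # Self-consistency: the most common answer wins.
        votes = Counter(answer for _, answer in candidates)
        return votes.most_common(1)[0][0]
    # Verifier-based selection: the highest-scoring chain wins.
    _, answer = max(candidates, key=lambda c: verifier(c[0], c[1]))
    return answer
```

The two rules can disagree: majority vote favors the answer the model reaches most often, while a verifier can pick a minority answer if it scores that chain highest.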

Process reward models for inference

Process reward models trained per §6 can be used for inference-time scaling:

   PRM-GUIDED INFERENCE

   At each reasoning step:
       Sample K candidate next-steps.
       Score each with PRM.
       Select top step or sample weighted by PRM score.

   Continue until terminal step.

This is similar to guided search - PRM guides the model toward better reasoning chains.

The 2026 status. PRM-guided inference is one technique among several, with substantial empirical gains in some settings. The computational overhead is also substantial, since the PRM is a second model that must score every candidate step.
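
A minimal sketch. The greedy variant of PRM-guided inference (always take the top-scoring step) looks as follows, assuming caller-supplied propose, prm_score, and is_final functions (hypothetical stand-ins for the policy model and the PRM).

```python
from typing import Callable, List


def prm_guided_decode(
    prompt: str,
    propose: Callable[[str, List[str], int], List[str]],  # (prompt, steps so far, k) -> candidates
    prm_score: Callable[[str, List[str], str], float],    # (prompt, steps so far, candidate) -> score
    is_final: Callable[[str], bool],
    k: int = 4,
    max_steps: int = 16,
) -> List[str]:
    """Build a reasoning chain step by step, keeping the PRM's top choice each time."""
    steps: List[str] = []
    for _ in range(max_steps):
        candidates = propose(prompt, steps, k)
        if not candidates:
            break
        best = max(candidates, key=lambda s: prm_score(prompt, steps, s))
        steps.append(best)
        if is_final(best):
            break
    return steps
```

Each step costs k policy samples plus k PRM evaluations, which is where the overhead comes from. Replacing max with score-weighted sampling gives the stochastic variant.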

Reasoning effort knobs

The 2026 production pattern. APIs expose reasoning effort as a tunable parameter:

OpenAI o3. reasoning_effort: low/medium/high.
Claude. budget_tokens inside the thinking parameter (token budget).
Gemini. thinking_budget in the thinking configuration (token budget).

The user choice. Per-query selection of effort. Production systems often:

Default to medium for general traffic.
Use low for simple categorized queries (latency-sensitive).
Use high for hard categorized queries (quality-sensitive).

The optimization. Substantial cost savings possible from intelligent routing.
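
A minimal sketch. The routing policy above reduces to a threshold rule once some difficulty estimate is available. The difficulty function and both cut-offs are hypothetical; in practice the estimate might come from a small classifier.

```python
from typing import Callable


def choose_effort(
    query: str,
    difficulty: Callable[[str], float],  # assumed to return a score in [0, 1]
    low_cut: float = 0.3,
    high_cut: float = 0.7,
) -> str:
    """Map an estimated difficulty to a reasoning-effort level."""
    d = difficulty(query)
    if d < low_cut:
        return "low"      # simple, latency-sensitive traffic
    if d >= high_cut:
        return "high"     # hard, quality-sensitive traffic
    return "medium"       # default for everything in between
```

The returned string is meant to be passed on as whatever effort parameter the serving API exposes.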

Inference-time compute limits

The bottlenecks.

Latency. Long reasoning chains take many seconds to minutes. User-facing applications have latency constraints.
Cost. Long chains cost substantially more per query. May not be economical for high-volume applications.
Diminishing returns. Quality improves with compute, but with diminishing returns.

The pattern. Inference-time scaling is substantial but bounded by economics. Not every query gets unlimited thinking time.

The training-compute / inference-compute trade-off

A specific 2026 question. For a fixed total compute budget (training + inference for expected query volume), how should the budget be allocated?

The classical answer (pre-reasoning models). Train the biggest model the budget allows; inference is cheap.

The 2026 answer. Depends. For hard-reasoning applications, substantially shift toward inference. The crossover depends on problem type, query volume, model size.

This is an active research area. Cross-reference OP-RM-2 below.
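
A worked sketch. Under a deliberately simple linear cost model (an assumption, not a result from the literature), the crossover can be computed directly. Option "big" has a higher training cost and cheaper queries; option "small" has a lower training cost and spends more inference compute per query. All numbers are hypothetical.

```python
def total_cost(train_cost: float, per_query_cost: float, n_queries: float) -> float:
    """Lifetime cost under a linear model: training plus inference."""
    return train_cost + per_query_cost * n_queries


def breakeven_queries(
    train_big: float, query_big: float, train_small: float, query_small: float
) -> float:
    """Query volume at which the two options cost the same.

    Below this volume the small model with heavy inference is cheaper;
    above it the big model with light inference is cheaper.
    """
    if query_small <= query_big:
        raise ValueError("expects the small-model option to cost more per query")
    return (train_big - train_small) / (query_small - query_big)


# Hypothetical units: training costs 100 vs 40, per-query costs 1 vs 4.
n_star = breakeven_queries(100.0, 1.0, 40.0, 4.0)
```

The model ignores quality differences between the two options, so it only answers the cost half of the question.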

§8. Self-Refinement and Verification

Self-refine

A specific pattern. After generating an initial answer, prompt the model to critique and improve it.

Self-Refine (Madaan et al., 2023). Iterative refinement: generate → critique → revise → critique → revise → ...

The procedure.

   SELF-REFINE LOOP

   answer = generate(prompt)
   for i in 1..max_iters:
       feedback = critique(prompt, answer)
       if feedback indicates good enough: break
       answer = revise(prompt, answer, feedback)

   return answer

The result. Substantial gains on writing tasks; mixed results on reasoning tasks. The challenge: the model providing feedback may have the same blind spots as the model generating.

Reflexion

Reflexion (Shinn et al., 2023). Add memory across episodes. Model reflects on past failures and incorporates lessons.

The procedure.

   REFLEXION

   memory = []
   For attempt in 1..max_attempts:
       answer = generate(prompt, memory)
       evaluation = evaluate(answer)   (against ground truth or tests, if available)
       if evaluation indicates success: return answer
       reflection = reflect(prompt, answer, evaluation)
       memory.append(reflection)

The 2026 status. Self-refinement and Reflexion are substantial techniques for some tasks. Reasoning models with built-in deliberation often subsume these patterns (the deliberation includes critique and revision).

Chain-of-verification

Chain-of-Verification (CoVe) (Dhuliawala et al., 2023). After generating an answer, generate verification questions and check the answer against them.

The recipe.

   COVE

   1. Generate initial answer.
   2. Generate verification questions about the answer.
   3. Answer verification questions independently.
   4. Refine final answer based on verification.

The result. Substantial gains on factual-answer tasks. Reduces hallucination.

Self-verification in reasoning models

A pattern internalized in modern reasoning models. R1, o-series models often spontaneously self-verify within their reasoning chains:

“Let me check this answer...”
“Wait, let me verify...”
“Actually, let me redo this calculation...”

The interpretation. Reasoning-RL encourages self-verification because verification leads to corrections that lead to correct answers. The “aha moment” finding (§6) included self-verification as a learned behavior.

Critic models and reward modeling

A different approach. Train a separate model to critique/verify reasoning chains. Use the critic for inference-time selection or training feedback.

Process Reward Models (PRMs). Trained on per-step correctness. Used for best-of-N selection. Used as RL reward signal.

Verifier models. Trained to predict whether final answer is correct. Used for best-of-N selection.

LLM-as-judge. Use a strong LLM as critic. Cross-reference Evaluation §7.

The 2026 picture. Critic-based approaches complement internal self-verification. Critic models add capability beyond what reasoning models do internally.

§9. Distillation and Small Reasoning Models

The distillation pattern

Knowledge distillation transfers capability from a large teacher model to a smaller student model. Cross-reference LLM §5 distillation subsection.

The reasoning-specific application. Train smaller models on reasoning traces from larger reasoning models. The student learns to produce reasoning chains similar to the teacher.

The procedure.

   REASONING DISTILLATION

   1. Generate reasoning traces from teacher (e.g., R1).
      For each problem, sample chain + answer.
   2. Filter for high-quality traces (correct answers,
      good reasoning structure).
   3. SFT a smaller base model on the filtered traces.
   4. Optionally further RL-tune the student.

The result. Smaller models can acquire substantial reasoning capability via distillation.
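
A minimal sketch. Step 2 of the procedure (filtering) and the hand-off to step 3 can be illustrated as below. The Trace record, the length cap, and the "keep the shortest correct trace per problem" rule are assumptions for illustration; the source does not specify the filtering criteria.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Trace:
    problem: str
    chain: str      # teacher's reasoning text
    answer: str     # teacher's final answer
    reference: str  # known-correct answer


def build_sft_examples(traces: List[Trace], max_chain_chars: int = 60_000) -> List[Dict[str, str]]:
    """Keep correct, bounded-length traces; one (the shortest) per problem."""
    best: Dict[str, Trace] = {}
    for t in traces:
        if t.answer.strip() != t.reference.strip():
            continue  # drop traces with wrong final answers
        if len(t.chain) > max_chain_chars:
            continue  # crude structure filter: drop runaway chains
        kept = best.get(t.problem)
        if kept is None or len(t.chain) < len(kept.chain):
            best[t.problem] = t
    # Prompt/completion pairs ready for supervised fine-tuning.
    return [
        {"prompt": t.problem, "completion": f"{t.chain}\nAnswer: {t.answer}"}
        for t in best.values()
    ]
```

Filtering on final-answer correctness alone lets through chains that reach the right answer by flawed reasoning, so real pipelines often add further checks.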

The R1-Distill series

DeepSeek-R1-Distill (released January 2025). Distilled R1’s reasoning capability into smaller open-weights models.

The models.

R1-Distill-Qwen-32B. Distilled into 32B-parameter Qwen base.
R1-Distill-Llama-70B. Distilled into 70B Llama.
R1-Distill-Qwen-14B.
R1-Distill-Qwen-7B.
R1-Distill-Qwen-1.5B. Substantially smaller, still meaningful reasoning.

The methodology. Pure SFT on R1-generated reasoning data. No further RL. Demonstrated substantial reasoning transfer to smaller models.

The benchmark numbers (illustrative).

R1-Distill-Qwen-32B. AIME 2024: ~70%, MATH-500: ~94%.
R1-Distill-Qwen-7B. AIME 2024: ~55%, MATH-500: ~92%.
R1-Distill-Qwen-1.5B. AIME 2024: ~29%, MATH-500: ~83%.

The pattern. Reasoning capability transfers substantially. Even 1.5B-parameter models acquire meaningful reasoning. Larger distilled models approach R1 quality.

Why distillation works for reasoning

Three structural reasons.

Reasoning is a learnable behavior. Not just a knowledge property. The student learns to produce reasoning structure (decomposition, verification, backtracking) - this is structural pattern learning.

Long traces are information-rich. Each reasoning trace contains substantial signal (~5-30K tokens per problem). Students learn substantially from these traces.

Teacher provides correct demonstrations. Filtered traces eliminate teacher errors. Student learns from teacher’s best behaviors.

Limitations of distillation

The gaps.

Quality ceiling. Students rarely exceed their teachers on the distilled skill. Some applications need teacher-level quality.
Specialization risk. Heavy distillation on math problems may hurt other capabilities.
Distribution shift. Student’s behavior may diverge from teacher’s on out-of-distribution problems.

The economic implications

A substantive 2026 development. Reasoning capability can be substantially compressed into smaller models. This enables:

Edge deployment. Smaller reasoning models can run on consumer hardware.
Cost reduction. Smaller models are cheaper to serve.
Open access. Open-weights small reasoning models democratize reasoning capability.

The trade-off. Smaller models are still smaller - knowledge is reduced, some hard problems remain unsolved. But the ratio of reasoning capability to model size has improved substantially.

Synthetic data from reasoning models

A related pattern. Use reasoning models to generate training data for other models.

Generate reasoning chains. Train other models on these chains.
Generate synthetic problems. Reasoning models can produce diverse problem variations.
Generate verified solutions. Reasoning models with verifier check can produce high-quality SFT data.

The cross-reference. LLM §5 synthetic-data subsection. Reasoning models are substantial sources of synthetic data.

The model-collapse concern. If reasoning models train future reasoning models, does the cycle stabilize or degrade? Active research. Cross-reference Generative Models §10 model-collapse.

§10. Multimodal and Tool-Augmented Reasoning

Multimodal reasoning

The extension. Reasoning over multimodal inputs (images, audio, video). Cross-reference Multimodal Models §4-§7.

Visual reasoning. Solving problems that require reasoning about images:

Geometric problems. “What is the angle in this diagram?”
Scientific figures. “What does this graph show about X?”
Document understanding. “What does this table report?”
Physical reasoning. “What will happen next in this scene?”

The 2026 systems.

OpenAI o1/o3 with image input. Reasoning over visual problems.
Claude with image + extended thinking. Visual reasoning with deliberation.
Gemini multimodal Thinking. Native multimodal reasoning.

The benchmarks. MathVista, MMMU (knowledge), ChartQA, DocVQA. Reasoning models substantially improve on these.

The pattern. Multimodal inputs are processed through the multimodal encoder; reasoning chains include references to visual content; final answers grounded in visual reasoning.

Tool-augmented reasoning

A specific pattern. Reasoning models with access to tools (calculators, code execution, search, databases). Cross-reference AI Agents §4-§5.

The 2026 examples.

OpenAI o-series with code interpreter. Model can execute Python code during reasoning.
Claude with computer use. Reasoning + computer interaction.
Deep Research (OpenAI, 2025). Reasoning + search for research tasks.

The integration pattern.

   TOOL-AUGMENTED REASONING

   Reasoning:
     "Let me check this by computing..."
     <tool_call>execute Python: x = 10**6 + 7; print(x)</tool_call>
     <tool_result>1000007</tool_result>
     "OK so the value is 1,000,007. Now..."

The benefit. Tools substantially extend capability. Math problems become solvable that aren’t with pure-LLM reasoning. Code can be tested for correctness. External information can be retrieved.
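
A minimal sketch. The integration pattern amounts to a loop that alternates generation with tool execution. The sketch assumes a caller-supplied generate(transcript) function (hypothetical) and the tag format shown above, and it replaces the Python sandbox with a small arithmetic evaluator so that nothing untrusted is executed.

```python
import ast
import operator
import re
from typing import Callable

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow,
}


def safe_eval(expr: str):
    """Evaluate a pure arithmetic expression without calling eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)


TOOL_CALL = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def run_with_tools(generate: Callable[[str], str], prompt: str, max_turns: int = 5) -> str:
    """Alternate model generation and tool execution until no tool is requested."""
    transcript = prompt
    for _ in range(max_turns):
        chunk = generate(transcript)
        transcript += chunk
        call = TOOL_CALL.search(chunk)
        if call is None:
            return transcript  # the model finished without requesting a tool
        result = safe_eval(call.group(1).strip())
        transcript += f"<tool_result>{result}</tool_result>"
    return transcript
```

A production harness would add timeouts and resource limits (a large exponent can still be expensive here) and would return tool errors to the model instead of raising.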

Computer-use reasoning

A specific 2024-2026 frontier. Reasoning models that interact with computer interfaces.

Anthropic Computer Use (October 2024). Reasoning + screen interaction.
OpenAI Operator (January 2025). Browser agent with reasoning.
Deep Research (OpenAI, 2025). Research agent with browsing and reasoning.

The pattern.

   COMPUTER-USE REASONING

   Reasoning:
     "I need to book a flight. Let me check Google Flights..."
     <action>screenshot</action>
     <observation>[Google Flights interface visible]</observation>
     "I see the search form. Let me fill in..."
     <action>click [Date input]</action>
     ...

Cross-reference AI Agents §7 for the computer-use pattern.

Reasoning + retrieval

Reasoning models combined with retrieval-augmented generation (RAG). Cross-reference Retrieval-Augmented Generation §9.

The 2026 pattern.

Agentic retrieval. Reasoning model decides when to retrieve, what queries to issue.
Multi-hop retrieval. Reasoning model decomposes complex queries; retrieves for each sub-query; integrates.
Deep Research-style. Reasoning model orchestrates extended research with multiple retrievals.

The cross-reference. RAG §9 agentic-RAG pattern. Reasoning models substantially enable agentic RAG.

Math + tools

A specific high-value combination. Reasoning model + calculator/symbolic-math tool.

WolframAlpha integration. Standard mathematical operations.
Python with NumPy/SciPy/SymPy. Numerical and symbolic math.
Lean/Coq. Formal proof verification.

The pattern. Reasoning model identifies what to compute; tool computes; reasoning model interprets result; continues.

The result. Substantial accuracy gains on numerical and symbolic problems. Reduces arithmetic errors. Enables formal verification.

Code + execution

A specific high-value combination. Reasoning model + code-execution environment.

Code interpreter (Python sandbox). Run generated code.
Test execution. Run tests against generated code.
Iterative debugging. Generate → test → identify errors → fix.

The pattern. Reasoning model writes code; execution confirms or rejects; reasoning model refines.

The 2026 state. Code + execution is standard for coding tasks in reasoning models.

§11. Evaluating Reasoning Models

What’s hard about evaluating reasoning

Specific evaluation challenges.

Quality of reasoning vs quality of answer. Reasoning may be correct even if answer is wrong (computational error in final step). Conversely, answer may be right by luck despite bad reasoning.

Process vs outcome. Should evaluation measure final answer (outcome) or reasoning quality (process)?

Reasoning length. Longer reasoning is sometimes better, sometimes worse. How to measure?

Domain-specific. Math reasoning ≠ scientific reasoning ≠ planning ≠ code reasoning.

Reasoning benchmarks

The 2024-2026 standard benchmarks. Cross-reference Evaluation §4.

Mathematics.

MATH (Hendrycks et al. 2021). High-school competition math. Largely saturated by 2024.
AIME 2024-2025. American Invitational Mathematics Examination. Used to evaluate top reasoning models (o1, R1).
MATH-500. Subset of MATH for evaluation.
FrontierMath (Epoch AI, 2024). Research-level mathematics. Hard for o1; substantial progress with o3.
USAMO/Putnam. Olympiad-level. Still challenging.

Code.

HumanEval (Chen et al. 2021). Programming function-completion. Largely saturated.
MBPP. Programming basics. Largely saturated.
SWE-bench (Jimenez et al. 2024). Real-world software engineering. Substantial 2024-2025 progress (12% → 70%+ in 18 months).
SWE-bench Verified. Filtered SWE-bench.
LiveCodeBench. Newer code-generation benchmark.
Codeforces. Competitive programming. o-series and R1 reach competitive ratings.

Science.

GPQA Diamond (Rein et al., 2023). Graduate-level science questions. Substantial improvements with reasoning models.
MMLU (Hendrycks et al. 2020). Knowledge across subjects. Largely saturated.

General reasoning.

Humanity’s Last Exam (2025). Hard knowledge + reasoning across domains. Designed to be contamination-resistant.
ARC-AGI (Chollet 2019; ARC-AGI-2 in 2025). Abstraction and reasoning. o3 achieved a substantial breakthrough on the original ARC-AGI.
BIG-Bench Hard. Diverse reasoning challenges.

Reasoning quality metrics

Beyond accuracy. Various attempts to measure reasoning quality.

Step-level correctness. Per-step verification of reasoning chain. Requires per-step labels (expensive).

Reasoning length. Token count. Crude proxy for thinking time.

Self-consistency. Variance across multiple samples. Lower variance = more confident reasoning.

Process reward model scores. PRM-assessed quality.

LLM-as-judge. Use a strong LLM to evaluate reasoning quality. Cross-reference Evaluation §7.

The 2026 picture. Accuracy is the dominant metric; reasoning-quality metrics complement. No standard reasoning-quality metric has emerged.

Inference-time-compute benchmarks

Specifically measuring inference-time-compute scaling.

Pass@K. Probability that at least one of K samples is correct. Standard for inference-time scaling.

Compute-vs-accuracy curves. Plot accuracy as a function of inference compute. Reveals scaling efficiency.

Cost-effectiveness. Quality per dollar. Increasingly important for production.
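
A worked sketch. Pass@K is usually estimated from n ≥ K samples per problem, of which c are correct, using the unbiased estimator introduced with HumanEval (Chen et al. 2021): 1 - C(n-c, K) / C(n, K). The function name below is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct (requires k <= n)."""
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    # Probability that a random size-k subset contains no correct sample.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 10 samples and c = 3 correct, pass_at_k(10, 3, 1) is 0.3 up to floating-point rounding, which matches the per-sample accuracy. Averaging the estimate over problems gives the benchmark score.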

Reasoning evaluation pitfalls

Specific issues.

Contamination. Reasoning models trained on reasoning data may have seen evaluation problems. Cross-reference Evaluation §8.

Saturation. Many benchmarks saturated quickly. Need fresh, hard problems.

Specialization. Models specifically tuned for benchmarks may not generalize.

Evaluation gaming. Models that produce reasoning chains that look impressive but aren’t reliably correct.

The 2026 response. Continued benchmark development (FrontierMath, Humanity’s Last Exam, ARC-AGI-2). Living benchmarks. Multi-benchmark evaluation. Process evaluation alongside outcome evaluation.

Reasoning faithfulness

A subtle issue. Does the model’s reasoning chain actually explain its answer? Or does the model produce a plausible chain that doesn’t reflect its actual computation?

Faithfulness research. Lanham et al. (2023) “Measuring Faithfulness in Chain-of-Thought Reasoning.” Demonstrated some CoT reasoning is faithful, some is not.

Why this matters. If reasoning isn’t faithful, we can’t trust the model’s reasoning as explanation. Cross-reference Mechanistic Interpretability §10.

The 2026 picture. Faithfulness is uneven. Reasoning models do some of their computation in the visible chain, but not all. Some computation remains in hidden states.

The implication. Treat visible reasoning as evidence of model thinking, not as complete explanation.

The honest 2026 status

Reasoning evaluation is substantial but imperfect. We have good benchmarks for narrow capabilities (math, code, science). We have weaker tools for general reasoning, faithfulness, and quality. The field continues to develop evaluation methodology.

§12. Production Reasoning Systems

Mathematical and scientific assistants

Reasoning models deployed for math/science tasks.

OpenAI o1/o3. General reasoning assistant; math/science-strong.
DeepSeek-R1. Open-weights reasoning model.
AlphaProof (Google DeepMind). Specialized for formal mathematical proofs (Lean).
Med-PaLM 2 and successors. Medical reasoning.

The deployment pattern. Strong on hard problems; substantial inference cost; users tolerate latency for quality.

Coding assistants

Reasoning models deployed for software engineering.

GitHub Copilot Workspace. Code generation with reasoning.
Cursor with o-series. IDE-integrated reasoning.
Devin (Cognition Labs). Autonomous coding agent with reasoning.
Claude Code. Anthropic’s CLI agent with extended thinking.
Aider, Cline, etc. Open-source coding assistants with reasoning models.

The deployment pattern. Reasoning helps for complex bugs, architectural decisions, multi-file changes. Simple completions don’t need reasoning.

Research assistants

A specific deployment pattern. Reasoning + retrieval + extended deliberation for research tasks.

OpenAI Deep Research (2025). Reasoning model with web search and extended research. Multi-step research projects.
Gemini Deep Research. Similar pattern.
Claude with web search and extended thinking. Research workflows.

The deployment pattern. User submits research question; system performs extended research (multiple searches, reasoning over sources, synthesis); produces multi-page report.

Agent backbones

Reasoning models as the reasoning component of agent systems. Cross-reference AI Agents §5.

The pattern. Agent loop uses reasoning model for planning, decision-making, complex reasoning. Tool use is delegated to the agent harness; reasoning model produces plans and decisions.

The 2026 examples. Most production agents use reasoning models for the “thinking” component:

Devin, Cognition’s autonomous coding agent.
Microsoft Copilot Studio agents.
Custom enterprise agents.

Specialized domain reasoners

Reasoning models specialized for specific domains.

Math. DeepSeekMath, AlphaProof.
Code. o3-mini at high reasoning effort (strong on code), CodeR1.
Theorem proving. Lean-tuned models.
Medical reasoning. Med-specific reasoning models (limited public availability).

The pattern. Domain specialization via continued training (cross-reference LLM §5 continual pretraining) or fine-tuning on domain-specific reasoning data.

Deployment patterns

Common production patterns.

Reasoning model as backend. Reasoning model behind a simpler interface. User doesn’t see the reasoning; just gets answers.

Reasoning model with thinking visible. Transparency-emphasized deployment. User sees thinking trace.

Reasoning model + non-reasoning router. Cheap classifier routes queries - simple to non-reasoning, hard to reasoning. Cost optimization.

Reasoning model with effort knob. User chooses reasoning effort per query. Quality-cost trade-off.
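
The router and effort-knob patterns can be combined in a few lines. The sketch below is illustrative only: the keyword heuristic, the thresholds, and the labels `"non_reasoning"`, `"reasoning"`, `"medium"`, and `"high"` are all assumptions made for the example. A production router would typically be a trained classifier.

```python
from typing import Tuple

# Hypothetical keyword hints; chosen only to make the example concrete.
HARD_HINTS = ("prove", "derive", "debug", "optimize", "step by step")


def difficulty_score(query: str) -> float:
    """Crude difficulty estimate in [0, 1] from keyword hits and length."""
    text = query.lower()
    hits = sum(hint in text for hint in HARD_HINTS)
    length_term = min(len(text.split()) / 100.0, 1.0)
    return min(0.3 * hits + 0.5 * length_term, 1.0)


def route(query: str, threshold: float = 0.3) -> Tuple[str, str]:
    """Return (model_kind, effort) for a query.

    Queries scoring below `threshold` go to the cheap non-reasoning model;
    the rest go to the reasoning model with an effort level set by the score.
    """
    score = difficulty_score(query)
    if score < threshold:
        return ("non_reasoning", "none")
    effort = "high" if score >= 0.6 else "medium"
    return ("reasoning", effort)


easy = route("What is the capital of France?")
hard = route("Prove that the sum of two even numbers is even, step by step.")
```

The design choice worth noting is that the router itself must be much cheaper than the reasoning model it guards; otherwise the routing cost erodes the saving.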

Cost-performance management

A substantive production concern. Reasoning models are expensive - both per-query and in aggregate.

Per-query cost. Long reasoning chains generate many output tokens. Output tokens often cost 2-5x input tokens. Reasoning queries can cost 10-100x simple queries.
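
A worked example of the arithmetic, under stated assumptions: the prices below are hypothetical (not any provider's actual rates), the output price is set to 4x the input price, and thinking tokens are assumed to be billed as output tokens.

```python
def query_cost(input_tokens: int, output_tokens: int,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one query given per-million-token prices."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Hypothetical prices: $2 per million input tokens, $8 per million output tokens.
INPUT_PRICE, OUTPUT_PRICE = 2.0, 8.0

# Simple query: 500 tokens in, 300 tokens out.
simple = query_cost(500, 300, INPUT_PRICE, OUTPUT_PRICE)

# Reasoning query: same prompt and answer, plus 20,000 thinking tokens
# (assumed to be billed as output).
reasoning = query_cost(500, 300 + 20_000, INPUT_PRICE, OUTPUT_PRICE)

ratio = reasoning / simple  # roughly 48x under these assumptions
```

With these numbers the thinking tokens account for almost the entire bill, which is why the multiplier scales with chain length rather than with prompt length.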

Cost-optimization strategies.

Caching. Cache common reasoning chains. Particularly effective for hot queries.
Routing. Send only complex queries to reasoning models.
Effort tuning. Use low effort for routine queries.
Distilled models. Use smaller distilled reasoning models when teacher-quality not needed.
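
As an illustration of the caching strategy, here is a minimal exact-match cache keyed on a normalized query string. The class name `ReasoningCache` and the `solve` callable are hypothetical; `solve` stands in for the expensive reasoning-model call. This sketch only catches repeated queries that differ in case or whitespace, not paraphrases.

```python
import hashlib
from typing import Callable, Dict


class ReasoningCache:
    """Exact-match cache in front of an expensive reasoning call."""

    def __init__(self, solve: Callable[[str], str]):
        self._solve = solve              # hypothetical model call
        self._store: Dict[str, str] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(query: str) -> str:
        # Normalize case and whitespace so trivially different
        # spellings of the same query share one cache entry.
        normalized = " ".join(query.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def answer(self, query: str) -> str:
        key = self._key(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        result = self._solve(query)
        self._store[key] = result
        return result
```

Usage: wrap the model call once, for example `cache = ReasoningCache(call_model)`, and send every request through `cache.answer(query)`; the `hits` and `misses` counters give the hit rate needed to judge whether the cache is paying for itself.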

The 2026 economics. Reasoning models are a substantial infrastructure cost, so production deployments invest in cost optimization. The cost is justified for tasks where reasoning provides substantial value.

Latency considerations

Reasoning models are slow. Production reasoning queries often take seconds to minutes. This constrains use cases.

Latency-tolerant use cases. Research, complex analysis, code review, planning. Multi-minute responses acceptable.

Latency-sensitive use cases. Real-time chat, autocomplete, real-time decisions. Reasoning models often unsuitable.

The 2026 pattern. Streaming reasoning chains (showing thinking to the user in real time) improves perceived latency even though total response time is unchanged.

§13. Connections to Other Chapters

This chapter sits at the intersection of multiple chapters:

Large Language Models (§3-§5) provides the base-model substrate. Reasoning models are LLMs trained specifically for reasoning. §7 covers CoT prompting (predecessor to reasoning RL).
Reinforcement Learning (§10, §12) provides the reasoning-RL methodology: §10 covers RLHF and §12 covers the reasoning-RL paradigm.
AI Agents (§5, §10) covers planning and agent backbones. Reasoning models are the reasoning component of modern agents.
Alignment (§4, §8) covers RLHF (related but distinct) and reasoning-specific safety concerns.
Evaluation (§4, §8) covers reasoning-relevant benchmarks (MATH, GSM8K, FrontierMath, ARC-AGI) and evaluation methodology.
Multimodal Models (§4, §10) covers visual reasoning extension.
Retrieval-Augmented Generation (§9) covers reasoning + retrieval combinations.
Mechanistic Interpretability (§10) covers analyzing reasoning circuits.
AI for Science (§4) covers AlphaProof and mathematical reasoning systems.
Foundation Models (§6) covers scaling laws (precursor to inference-time-compute scaling).
Generative Models (§9) covers generation in general; reasoning models generate extended chains.
Self-Supervised Learning (§5) covers pretraining (substrate for reasoning models).

The pattern. Reasoning models depend on substantial infrastructure from other chapters; reasoning capability extends and transforms many adjacent areas.

§14. Critiques, Limitations, and Open Problems

The reasoning-model paradigm is new, and both the critiques and the open problems are significant.

Critique 1: Reasoning chains aren’t truly reasoning

The position. Visible reasoning chains may not reflect what the model is actually doing. The chain is generated by next-token prediction; nothing forces faithfulness between chain and answer.

The evidence. Lanham et al. (2023) faithfulness research shows partial faithfulness. Models can produce reasoning chains that don’t accurately explain their outputs.

The response. Reasoning chains are evidence of model thinking, not a perfect explanation. Reasoning RL pressure increases faithfulness somewhat (correct chains lead to correct answers). But faithfulness is incomplete.

The honest reading. Reasoning models do substantial reasoning visibly; some computation happens in hidden states. Reasoning chains are useful but imperfect explanations.

Critique 2: Just memorization with extra steps

The position. Reasoning models memorize patterns from training data. They don’t truly reason; they pattern-match.

The evidence. Models often fail on slightly modified versions of training problems. GSM-Symbolic (Mirzadeh et al., Apple, 2024) demonstrated that minor variations to GSM8K problems, such as changed numerical values or added irrelevant clauses, reduce performance, in some cases substantially.

The response. Pattern-matching versus reasoning may be a false dichotomy. Modern models exhibit substantial generalization within reasoning patterns even if they don’t perfectly compositionally generalize. The “true reasoning” criterion may be too strict.

The honest reading. Reasoning models exhibit substantial capability that wasn’t predicted from pre-2024 baselines. Whether to call it “reasoning” is partly definitional.

Critique 3: Inefficient compared to specialized systems

The position. Reasoning models use general LLM-based reasoning. Specialized systems (theorem provers, dedicated math solvers) are substantially more efficient on their domains.

The evidence. Automated provers and solvers (SAT/SMT solvers, automation tactics inside proof assistants such as Lean) dispatch some classes of problems in milliseconds that take reasoning models minutes. Specialized chess engines (Stockfish, AlphaZero) substantially exceed reasoning models on chess.

The response. Reasoning models are general - they handle diverse problems without domain-specific tuning. The flexibility is worth the inefficiency for many applications.

The honest reading. For narrow well-defined problems, specialized systems are better. For broad reasoning across domains, reasoning models excel.

Critique 4: Concerns about transparency and safety

The position. Hidden reasoning chains (o-series) raise safety and trust concerns. Models could engage in deceptive reasoning that we can’t inspect.

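The evidence. Hubinger et al. (2024) “Sleeper Agents” showed that deliberately trained backdoor behaviors, including ones accompanied by chain-of-thought reasoning about deceiving the training process, can persist through standard safety training. Greenblatt et al. (2024) “Alignment Faking” found a model strategically complying during training, with its scratchpad reasoning stating that intent.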

The response. Visible reasoning (R1, Claude extended thinking) addresses some of these concerns. Faithfulness research provides partial assurance. Mechanistic interpretability research targets deeper inspection.

The honest reading. Transparency of reasoning is a substantial open issue. The field is moving toward visible reasoning; faithfulness is improving but incomplete.

Critique 5: Reasoning RL may have a ceiling

The position. Verifiable rewards constrain reasoning RL to problems with verifiable answers. Many important reasoning tasks (creative problem-solving, scientific hypothesis generation, ethical reasoning) lack verifiable answers.

The evidence. Reasoning models are substantially better at math and code than at open-ended reasoning.

The response. Verifiable reasoning capability transfers somewhat to non-verifiable tasks. Hybrid approaches (verifiable + RLHF for non-verifiable) extend capability. Reasoning research continues.

The honest reading. Verifiable-reward training has a substantive ceiling. Extending reasoning RL to non-verifiable domains is an open research problem.
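
To make the constraint concrete, here is a sketch of what a verifiable reward looks like for a math task. The `Answer:` output convention and both function names are assumptions made for illustration, not the format used by any particular training pipeline.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches the reference."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Compare numerically when both sides parse as numbers.
        return 1.0 if abs(float(answer) - float(reference)) < 1e-9 else 0.0
    except ValueError:
        # Otherwise fall back to exact string match.
        return 1.0 if answer == reference.strip() else 0.0
```

The function needs a `reference` to compare against. For an open-ended task such as an ethical argument or a research hypothesis there is no such reference to pass in, which is the ceiling this critique describes.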

Open problems

OP-RM-1. Process vs outcome supervision. Optimal balance unclear. Process supervision is data-expensive but information-rich; outcome is cheap but sparse. The R1 paper suggested outcome alone suffices with scale, but the optimal mix for different problem types remains open.

OP-RM-2. Training-compute / inference-compute trade-off. Given a total compute budget, how to allocate between training and inference? The 2024-2026 evidence suggests shifting toward inference for hard reasoning; precise optimal allocation depends on problem distribution.

OP-RM-3. Reasoning RL for non-verifiable tasks. How to apply reasoning RL when reward signal is subjective? Hybrid with RLHF? Reward modeling? Self-play? Active research area.

OP-RM-4. Multimodal reasoning at scale. How to extend reasoning RL to multimodal inputs? Limited progress as of 2026; substantial open opportunity.

OP-RM-5. Reasoning faithfulness. How to ensure visible reasoning faithfully represents model computation? Critical for transparency and safety. Mechanistic interpretability provides partial answers; substantial work remaining.

OP-RM-6. Reasoning compositionality. Can reasoning models compose multiple skills (math + physics + coding) on novel multi-step problems? Empirical evidence mixed.

OP-RM-7. Long-horizon reasoning. Reasoning over hours or days (vs minutes). Requires persistent state, memory, planning. Cross-reference Agents §5-§6. Open frontier.

OP-RM-8. Reasoning model alignment. Reasoning models have distinctive alignment concerns (long deliberation could enable deceptive reasoning). How to align reasoning specifically? Active research.

OP-RM-9. Reasoning model evaluation. Current benchmarks largely measure accuracy on tractable problems. How to evaluate genuine reasoning capability? Faithfulness? Process quality? Open.

OP-RM-10. Distillation limits. What’s the smallest model that can acquire substantial reasoning capability? R1-Distill-1.5B has meaningful reasoning; how small can we go? Implications for edge deployment.

§15. Further Reading

Below is an annotated reading list for further study. Selection emphasizes high-impact, accessible-yet-deep references.

Foundational chain-of-thought

Wei et al. (2022). “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models.” Foundational CoT paper.
Wang et al. (2022). “Self-Consistency Improves Chain of Thought Reasoning in Language Models.” Self-consistency.
Kojima et al. (2022). “Large Language Models are Zero-Shot Reasoners.” Zero-shot CoT.
Zhou et al. (2022). “Least-to-Most Prompting.” Decomposition.
Yao et al. (2023). “Tree of Thoughts.” Search-based reasoning.

Process supervision

Lightman et al. (2023). “Let’s Verify Step by Step.” Process reward models.
Uesato et al. (2022). “Solving math word problems with process- and outcome-based feedback.”

Reasoning RL

DeepSeek-R1 paper (DeepSeek-AI, 2025). “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.” The methodology paper.
DeepSeekMath (Shao et al., 2024). “DeepSeekMath.” GRPO algorithm introduction.
OpenAI o1 system card (OpenAI, September 2024). Limited detail; references RL for reasoning.

Self-refinement and verification

Madaan et al. (2023). “Self-Refine.” Iterative refinement.
Shinn et al. (2023). “Reflexion.” Memory across episodes.
Dhuliawala et al. (2023). “Chain-of-Verification.”

Search and planning

Romera-Paredes et al. (2023). “Mathematical discoveries from program search with large language models.” FunSearch.
Trinh et al. (2024). “Solving Olympiad Geometry without human demonstrations.” AlphaGeometry.
AlphaProof and AlphaGeometry 2 announcement (Google DeepMind, 2024). IMO silver-medal-standard result.

Inference-time scaling

Snell et al. (2024). “Scaling LLM Test-Time Compute Optimally.” Inference-time-compute scaling laws.
Brown et al. (2024). “Large Language Monkeys.” Sampling-based inference scaling.

Distillation

DeepSeek-R1-Distill release notes (DeepSeek, 2025). Methodology and results.

Reasoning faithfulness and safety

Lanham et al. (2023). “Measuring Faithfulness in Chain-of-Thought Reasoning.”
Hubinger et al. (2024). “Sleeper Agents.” Reasoning safety implications.
Greenblatt et al. (2024). “Alignment Faking.”

Evaluation

Hendrycks et al. (2021). “Measuring Mathematical Problem Solving with the MATH Dataset.”
Cobbe et al. (2021). “Training Verifiers to Solve Math Word Problems.” GSM8K.
Rein et al. (2023). “GPQA: A Graduate-Level Google-Proof Q&A Benchmark.”
Glazer et al. (2024). “FrontierMath.” Research-level math benchmark.
Chollet (2019). “On the Measure of Intelligence.” ARC introduction.
Mirzadeh et al. (2024). “GSM-Symbolic.” Reasoning generalization.

Surveys

Huang and Chang (2023). “Towards Reasoning in Large Language Models: A Survey.”
Plaat et al. (2024). “Reasoning with Large Language Models: A Survey.”

Reading order

For a structured study path:

Start with CoT origins. Wei 2022, Kojima 2022.
Then self-consistency and ToT. Wang 2022, Yao 2023.
Then process supervision. Lightman 2023.
Then the reasoning-RL paradigm. DeepSeek-R1 paper, DeepSeekMath GRPO.
Then inference-time scaling. Snell 2024.
Then faithfulness and safety. Lanham 2023, Hubinger 2024.
Then evaluation. Cobbe, Hendrycks, Rein, Glazer, Chollet, Mirzadeh.

§16. Exercises and Experiments

The following exercises are research-style. Each is designed to develop hands-on understanding of reasoning models.

E1. CoT prompting baseline. Use a strong LLM (GPT-4o, Claude, Gemini) on GSM8K with and without CoT prompting. Measure accuracy difference.
E2. Self-consistency scaling. With a non-reasoning model on GSM8K, measure accuracy as a function of self-consistency N (1, 5, 10, 40). Plot the scaling curve.
E3. Tree-of-thoughts implementation. Implement ToT on Game of 24. Compare to single-pass CoT.
E4. Reasoning RL on toy domain. Implement GRPO on a small reasoning task (e.g., simple arithmetic). Train a small base model. Document the “aha moments” if any emerge.
E5. Inference-time-compute scaling. Take a reasoning model (e.g., R1-Distill-7B). Measure accuracy as a function of inference compute (token budget). Plot the scaling curve.
E6. Process vs outcome supervision. On a math task, train two models: one with process supervision (PRM-style), one with outcome supervision. Compare.
E7. Distillation experiment. Distill R1’s reasoning into a small base model on a specific domain. Measure capability transfer.
E8. Reasoning faithfulness audit. Take a reasoning model. For 100 problems, manually inspect whether visible reasoning chains correctly explain answers. Measure faithfulness rate.
E9. Reasoning + tool use. Implement a math reasoning agent: reasoning model + Python execution tool. Measure improvement over reasoning-only on a math benchmark.
E10. Reasoning-quality vs cost analysis. For a fixed task, vary reasoning effort (low/medium/high). Plot quality vs cost Pareto frontier. Identify the optimal operating point for representative use cases.
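
For E10, a small helper for extracting the Pareto frontier from measured (cost, quality) pairs may be a useful starting point. This is a sketch: the function name is hypothetical and the measurements are made-up placeholders, not results from any model.

```python
from typing import List, Tuple

Point = Tuple[str, float, float]  # (label, cost, quality)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the points not dominated by a cheaper, better alternative.

    A point is kept if its quality is strictly higher than that of every
    point with lower or equal cost.
    """
    frontier: List[Point] = []
    best_quality = float("-inf")
    # Sort by cost ascending; on cost ties, consider higher quality first.
    for label, cost, quality in sorted(points, key=lambda p: (p[1], -p[2])):
        if quality > best_quality:
            frontier.append((label, cost, quality))
            best_quality = quality
    return frontier


# Placeholder measurements (relative cost, accuracy); replace with your own.
measurements = [
    ("low", 1.0, 0.62),
    ("medium", 4.0, 0.78),
    ("high", 15.0, 0.81),
    ("medium-b", 6.0, 0.74),  # dominated: costs more than "medium", scores lower
]
frontier = pareto_frontier(measurements)
```

Which frontier point is the "optimal operating point" depends on how much the use case values each additional point of quality, so the frontier narrows the choice rather than making it.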
