Three years ago, a deceptively simple insight transformed how we think about language model capabilities: asking a model to show its work dramatically improves its answers. What began as a prompting trick has evolved into a foundational technique that underpins modern AI reasoning systems.
Chain-of-thought (CoT) prompting, introduced by Wei et al. in their landmark 2022 paper, demonstrated that large language models could solve complex reasoning tasks simply by being prompted to generate intermediate reasoning steps before arriving at a final answer. The technique has since accumulated over 14,000 citations and spawned an entire research field.
The Core Insight: Why Intermediate Steps Matter
Standard Prompting vs. Chain-of-Thought
Consider a simple arithmetic word problem:
Standard prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?
A: 11
Chain-of-thought prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each,
so he bought 2 * 3 = 6 balls. Therefore, he has 5 + 6 = 11 balls.
The answer is 11.
The difference seems trivial, yet the Wei et al. (2022) paper demonstrated that this simple modification produces dramatic improvements on complex reasoning benchmarks:
| Benchmark | Standard Prompting | Chain-of-Thought | Relative Improvement |
|---|---|---|---|
| GSM8K (Math) | 17.9% | 58.0% | +224% |
| SVAMP (Math) | 68.9% | 79.0% | +15% |
| StrategyQA | 65.4% | 73.4% | +12% |
| AQuA (Algebra) | 24.4% | 45.3% | +86% |
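To make the setup concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled. The exemplar simply reuses the Roger example above; `build_few_shot_cot_prompt` and `call_model` are hypothetical helpers, with `call_model` standing in for whatever completion API you use.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. He bought 2 cans with 3 balls each, "
    "so he bought 2 * 3 = 6 balls. Therefore, he has 5 + 6 = 11 balls. "
    "The answer is 11.\n"
)

def build_few_shot_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM completion API is in use."""
    raise NotImplementedError("Replace with a real completion call.")

# answer_text = call_model(build_few_shot_cot_prompt("A farmer has 12 eggs ..."))

The key design point is that the exemplar ends with "The answer is 11.", which nudges the model to close its own reasoning with an answer line that is easy to parse.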
The Mathematical Framework
Decomposing the Reasoning Process
Let's formalize what happens during chain-of-thought reasoning. Given a question $q$, standard prompting models the direct probability of an answer $a$:

$$p_\theta(a \mid q)$$

where $\theta$ represents the model parameters. Chain-of-thought introduces intermediate reasoning steps $r = (r_1, \ldots, r_k)$:

$$p_\theta(a \mid q) = \sum_{r} p_\theta(a \mid q, r)\, p_\theta(r \mid q)$$

In practice, we typically take the greedy or sampled path rather than marginalizing:

$$p_\theta(a \mid q, \hat{r}), \qquad \hat{r} \sim p_\theta(r \mid q)$$

where $\hat{r}$ is the generated reasoning chain.
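The gap between marginalizing over chains and committing to a single sampled chain can be seen with toy numbers. The probabilities below are invented purely for illustration; they are not measurements from any model.

# Toy illustration with invented probabilities: marginalizing over reasoning
# chains vs. committing to the single most likely chain.
p_r_given_q = {"chain_A": 0.6, "chain_B": 0.3, "chain_C": 0.1}   # p(r | q)
p_a_given_qr = {                                                 # p(a | q, r)
    "chain_A": {"11": 0.9, "10": 0.1},
    "chain_B": {"11": 0.2, "13": 0.8},
    "chain_C": {"11": 0.5, "10": 0.5},
}

# Marginalization: p(a | q) = sum_r p(a | q, r) * p(r | q)
p_a = {}
for chain, p_chain in p_r_given_q.items():
    for answer, p_ans in p_a_given_qr[chain].items():
        p_a[answer] = p_a.get(answer, 0.0) + p_chain * p_ans
print(p_a)  # approximately {'11': 0.65, '10': 0.11, '13': 0.24}

# Greedy practice: condition on the single most likely chain only.
best_chain = max(p_r_given_q, key=p_r_given_q.get)
print(p_a_given_qr[best_chain])  # {'11': 0.9, '10': 0.1} under chain_A alone

Self-consistency, discussed below, is essentially a Monte Carlo approximation of this sum: sample several chains and vote.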
Why Does This Work?
The power of CoT stems from several interconnected mechanisms:
1. Computational Decomposition
Complex problems often require multiple computational steps that exceed what a single forward pass can reliably compute. By generating intermediate tokens, the model effectively gains additional "compute time":
Each generated token allows the model to perform additional attention operations, effectively extending its reasoning capacity.
2. Working Memory Externalization
Transformers have limited working memory within their hidden states. By writing intermediate results to the output sequence, models can reference these values in subsequent steps through attention:
Step 1: Calculate 2 * 3 = 6 [stored in context]
Step 2: Calculate 5 + 6 = 11 [can attend to "6" from Step 1]
3. Error Localization
When reasoning is explicit, errors in intermediate steps can be identified and potentially corrected. This mirrors how humans debug their own thinking.
Self-Consistency: Voting Over Reasoning Paths
The Problem with Greedy Decoding
Standard CoT uses greedy decoding or low-temperature sampling, producing a single reasoning path. But what if that path contains an error?
Wang et al. (2022) introduced self-consistency, a technique that samples multiple diverse reasoning paths and selects the most consistent answer through majority voting.
The Self-Consistency Algorithm
def self_consistency(question, model, num_samples=40, temperature=0.7):
    """
    Generate multiple reasoning paths and vote on the final answer.

    Args:
        question: The input question
        model: Language model with CoT capability
        num_samples: Number of reasoning paths to generate
        temperature: Sampling temperature for diversity

    Returns:
        most_consistent_answer: The answer with highest vote count
    """
    answers = {}
    reasoning_paths = []

    for i in range(num_samples):
        # Sample a reasoning path with temperature > 0 for diversity
        reasoning, answer = model.generate_cot(
            question,
            temperature=temperature,
        )
        reasoning_paths.append(reasoning)

        # Aggregate answers
        if answer not in answers:
            answers[answer] = 0
        answers[answer] += 1

    # Return the most common answer (majority vote)
    most_consistent_answer = max(answers.keys(), key=lambda a: answers[a])
    return most_consistent_answer
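As a quick sanity check, here is a toy stand-in model (not a real LLM) whose `generate_cot` errs 30% of the time; majority voting over 20 samples almost always recovers the correct answer.

import random

class ToyModel:
    """Toy stand-in for an LLM: returns a wrong reasoning chain 30% of the time."""
    def generate_cot(self, question, temperature=0.7):
        if random.random() < 0.7:
            return ("5 + 2 * 3 = 5 + 6 = 11", "11")   # correct reasoning
        return ("5 + 2 + 3 = 10", "10")               # misreads the problem

print(self_consistency("Roger's tennis ball question", ToyModel(), num_samples=20))
# Prints "11" on the vast majority of runs.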
Mathematical Formulation
Given $m$ sampled reasoning paths $r^{(1)}, \ldots, r^{(m)}$ with corresponding answers $a^{(1)}, \ldots, a^{(m)}$, self-consistency selects:

$$\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}\left[a^{(i)} = a\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function.

This can also be interpreted as a Monte Carlo estimate of the marginalized probability:

$$p_\theta(a \mid q) = \sum_{r} p_\theta(a \mid q, r)\, p_\theta(r \mid q) \approx \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left[a^{(i)} = a\right]$$
Why Self-Consistency Works
The key insight is that correct reasoning paths tend to converge on the same answer, while incorrect paths produce diverse (wrong) answers. Consider:
Path 1: 5 + (2 * 3) = 5 + 6 = 11 ✓
Path 2: 5 + 2 + 3 = 10 ✗ (misread problem)
Path 3: 5 + (2 * 3) = 5 + 6 = 11 ✓
Path 4: 5 * 2 + 3 = 13 ✗ (wrong operation)
Path 5: 5 + (2 * 3) = 5 + 6 = 11 ✓
Vote count: {11: 3, 10: 1, 13: 1}
Winner: 11 ✓
Performance Improvements
Self-consistency delivers consistent gains across benchmarks:
| Benchmark | CoT (Greedy) | CoT + Self-Consistency | Relative Improvement |
|---|---|---|---|
| GSM8K | 58.0% | 74.4% | +28% |
| SVAMP | 79.0% | 86.8% | +10% |
| AQuA | 45.3% | 55.5% | +23% |
| ARC-c | 85.2% | 90.6% | +6% |
Zero-Shot Chain-of-Thought
The Magic of "Let's Think Step by Step"
A remarkable discovery by Kojima et al. (2022): simply appending "Let's think step by step" to a question enables chain-of-thought reasoning without any few-shot examples.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls,
and half of the golf balls are blue. How many blue golf balls?
Let's think step by step.
A: Half of 16 balls are golf balls, so there are 16 / 2 = 8 golf balls.
Half of the golf balls are blue, so there are 8 / 2 = 4 blue golf balls.
The answer is 4.
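In practice, zero-shot CoT is usually run as two passes: the first elicits the reasoning, the second appends an answer-extraction cue so the final answer can be parsed cleanly. A minimal sketch, assuming a generic `call_model(prompt)` completion function not tied to any particular API:

def zero_shot_cot(question: str, call_model) -> str:
    """Two-pass zero-shot CoT: elicit reasoning, then extract a clean answer."""
    # Pass 1: trigger step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_model(reasoning_prompt)

    # Pass 2: append an extraction cue so the final answer is easy to parse.
    extraction_prompt = (
        f"{reasoning_prompt} {reasoning}\n"
        "Therefore, the answer is"
    )
    return call_model(extraction_prompt).strip()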
Limitations of Basic Zero-Shot CoT
While powerful, "Let's think step by step" has limitations:
- Missing steps: May skip crucial intermediate calculations
- Variable ordering: Steps may not follow logical sequence
- Error propagation: Early mistakes compound without correction
Plan-and-Solve Prompting: Structured Zero-Shot CoT
The Innovation
Wang et al. (2023) proposed Plan-and-Solve (PS) prompting, which improves zero-shot CoT by explicitly separating planning from execution:
Q: [Question]
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan and solve the problem step by step.
The PS+ Variant
The enhanced PS+ prompting adds three additional instructions:
Q: [Question]
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
- Extract relevant variables and their corresponding numerals.
- Calculate intermediate results (pay attention to calculation errors).
- Check your answer for reasonableness.
Implementation
def plan_and_solve(question, model, use_plus=True):
    """
    Implement Plan-and-Solve prompting for improved zero-shot CoT.

    Args:
        question: The input question
        model: Language model exposing a generate() method
        use_plus: If True, use the enhanced PS+ prompt (usually stronger)
    """
    ps_prompt = f"""
Q: {question}
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
"""
    ps_plus_prompt = f"""
Q: {question}
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
- Extract relevant variables and their corresponding numerals.
- Calculate intermediate results (pay attention to calculation errors).
- Check your answer for reasonableness.
"""
    # PS+ typically outperforms standard PS
    response = model.generate(ps_plus_prompt if use_plus else ps_prompt)
    return response
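A hypothetical usage, with a thin wrapper object standing in for a real model so the `model.generate()` interface above is satisfied:

class PromptOnlyModel:
    """Thin wrapper exposing generate(); replace the body with a real LLM call."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("Plug in your completion API here.")

# response = plan_and_solve(
#     "Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many now?",
#     model=PromptOnlyModel(),
#     use_plus=True,   # the PS+ variant with the three extra instructions
# )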
Performance Comparison
Plan-and-Solve significantly outperforms basic zero-shot CoT:
| Method | GSM8K | AQuA | SVAMP | MultiArith |
|---|---|---|---|---|
| Zero-Shot | 17.9% | 24.4% | 68.9% | 78.7% |
| Zero-Shot CoT | 58.0% | 45.3% | 79.0% | 90.5% |
| Plan-and-Solve | 62.7% | 47.2% | 81.8% | 92.3% |
| PS+ | 65.4% | 49.2% | 83.1% | 93.5% |
Multimodal Chain-of-Thought
Extending CoT Beyond Text
Zhang et al. (2023) extended chain-of-thought reasoning to multimodal problems involving both images and text, addressing a critical limitation: standard CoT can hallucinate when visual information is crucial.
The Two-Stage Framework
Multimodal CoT separates rationale generation from answer inference:
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: RATIONALE GENERATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Image] + [Question] ──▶ Language Model ──▶ [Rationale] │
│ (with vision) │
│ │
│ Example Output: │
│ "The image shows a triangle with sides labeled 3, 4, │
│ and an unknown hypotenuse. Using the Pythagorean │
│ theorem: c² = a² + b² = 3² + 4² = 9 + 16 = 25" │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: ANSWER INFERENCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Image] + [Question] + [Rationale] ──▶ Model ──▶ [Answer]│
│ │
│ Example Output: │
│ "Therefore, c = √25 = 5. The hypotenuse is 5 units." │
│ │
└─────────────────────────────────────────────────────────────┘
Mathematical Formulation
Let $v$ denote the image features, $q$ the question, and $r$ the rationale. The two-stage process computes:

Stage 1 (Rationale Generation):

$$r = \arg\max_{r'} \; p_{\theta_1}(r' \mid q, v)$$

Stage 2 (Answer Inference):

$$a = \arg\max_{a'} \; p_{\theta_2}(a' \mid q, v, r)$$

Note that $\theta_1$ and $\theta_2$ can be the same or different models, and crucially, both stages have access to the image features $v$.
Why Two Stages?
The key insight from Zhang et al. is that single-stage multimodal CoT often produces hallucinated rationales that ignore visual evidence. By training separate models (or using separate prompts) for each stage, the system learns to:
- Ground rationales in visual evidence (Stage 1)
- Derive answers from grounded rationales (Stage 2); a minimal inference sketch follows below
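As an inference-time sketch only (not the training setup from the paper), the two stages can be wired together as follows; the `generate(image=..., text=...)` interface is a hypothetical vision-language model API:

def multimodal_cot(image, question, rationale_model, answer_model):
    """Two-stage Multimodal CoT at inference time: rationale first, then answer."""
    # Stage 1: generate a rationale grounded in the image.
    rationale = rationale_model.generate(
        image=image,
        text=f"Question: {question}\nRationale:",
    )
    # Stage 2: infer the answer conditioned on image, question, and rationale.
    answer = answer_model.generate(
        image=image,
        text=f"Question: {question}\nRationale: {rationale}\nAnswer:",
    )
    return rationale, answer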
Performance on Science QA
Multimodal CoT achieved state-of-the-art results on the ScienceQA benchmark:
| Model | Accuracy |
|---|---|
| Human | 88.4% |
| GPT-4 (2-shot) | 82.7% |
| Multimodal-CoT (Large) | 91.7% |
| Multimodal-CoT (Base) | 84.9% |
Advanced CoT Techniques
Tree-of-Thoughts
Building on CoT, Tree-of-Thoughts explores multiple reasoning branches:
[Question]
│
┌────────────┼────────────┐
▼ ▼ ▼
[Thought 1] [Thought 2] [Thought 3]
│ │ │
Evaluate Evaluate Evaluate
│ │ │
▼ ▼ ▼
Score=0.3 Score=0.8 Score=0.5
│
▼
[Continue]
│
┌────────┴────────┐
▼ ▼
[Thought 2a] [Thought 2b]
│ │
... ...
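A compact sketch of the search loop the diagram implies, written as a greedy beam search; `propose_thoughts` and `score_thought` are hypothetical LLM-backed helpers, and real Tree-of-Thoughts implementations vary in their search strategy (BFS, DFS, beam):

def tree_of_thoughts(question, propose_thoughts, score_thought,
                     depth=3, breadth=3, keep=1):
    """Greedy beam search over partial reasoning states (sketch only)."""
    frontier = [""]  # each state is the reasoning accumulated so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            # Propose several continuations of the current partial reasoning.
            for thought in propose_thoughts(question, state, n=breadth):
                new_state = f"{state}\n{thought}".strip()
                # Score each candidate state, e.g. by asking the model to rate it.
                candidates.append((score_thought(question, new_state), new_state))
        # Keep only the highest-scoring states for the next level.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [state for _, state in candidates[:keep]]
    return frontier[0]  # best complete reasoning found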
Least-to-Most Prompting
Decomposes complex problems into simpler subproblems:
Q: How many tennis balls does Roger have if he starts with 5
and buys 2 cans with 3 balls each?
Decomposition:
1. How many balls are in the cans Roger bought?
2. How many balls does Roger have in total?
Solving:
1. Roger bought 2 cans with 3 balls each = 2 * 3 = 6 balls
2. Total = 5 + 6 = 11 balls
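The decompose-then-solve flow can be sketched as two prompting stages, again assuming a generic, hypothetical `call_model(prompt)` completion function:

def least_to_most(question: str, call_model) -> str:
    """Least-to-most prompting: decompose, then solve subquestions in order."""
    # Stage 1: ask the model to list simpler subquestions.
    decomposition = call_model(
        f"Q: {question}\n"
        "Break this question into simpler subquestions, one per line."
    )
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve each subquestion, feeding earlier answers back as context.
    context = f"Q: {question}"
    answer = ""
    for sub in subquestions:
        answer = call_model(f"{context}\nSubquestion: {sub}\nAnswer:")
        context += f"\nSubquestion: {sub}\nAnswer: {answer}"
    return answer  # the last subquestion's answer resolves the original question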
Implementation Best Practices
Prompt Engineering for CoT
# Effective CoT prompt structure.
# Only {few_shot_examples} and {question} are filled in by the caller via
# str.format(); the numbered scaffold below is left for the model to complete.
COT_PROMPT_TEMPLATE = """
{few_shot_examples}

Question: {question}

Let's approach this step-by-step:
1. First, I'll identify the key information.
2. Next, I'll determine the required calculations.
3. Now, I'll execute each step.
4. Finally, I'll verify the answer.

The answer is:
"""
Temperature and Sampling Strategy
| Use Case | Temperature | Num Samples | Strategy |
|---|---|---|---|
| Simple questions | 0.0 | 1 | Greedy |
| Moderate complexity | 0.5-0.7 | 5-10 | Self-consistency |
| High complexity | 0.7-1.0 | 20-40 | Self-consistency |
| Creative reasoning | 1.0+ | Multiple | Diverse exploration |
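If you prefer configuration over prose, the table maps directly onto a small helper; the category names and presets below simply mirror the rows above and are not taken from any paper:

from dataclasses import dataclass

@dataclass
class CoTDecodingConfig:
    temperature: float
    num_samples: int
    strategy: str

# Presets mirroring the table above; names are this article's own shorthand.
DECODING_PRESETS = {
    "simple":   CoTDecodingConfig(temperature=0.0, num_samples=1,  strategy="greedy"),
    "moderate": CoTDecodingConfig(temperature=0.7, num_samples=10, strategy="self-consistency"),
    "complex":  CoTDecodingConfig(temperature=0.7, num_samples=40, strategy="self-consistency"),
    "creative": CoTDecodingConfig(temperature=1.0, num_samples=20, strategy="diverse exploration"),
}

print(DECODING_PRESETS["complex"])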
When to Use Each Technique
┌─────────────────────────────────────────────────────────────┐
│ DECISION FLOWCHART │
└─────────────────────────────────────────────────────────────┘
Is the task complex (multi-step reasoning)?
├── No ──▶ Standard prompting is sufficient
└── Yes ──▶ Do you have few-shot examples?
├── Yes ──▶ Few-shot CoT
│ └── Is accuracy critical?
│ ├── Yes ──▶ + Self-Consistency
│ └── No ──▶ Greedy decoding
└── No ──▶ Zero-shot CoT (PS+)
└── Is the task multimodal?
├── Yes ──▶ Multimodal CoT
└── No ──▶ Standard PS+
Theoretical Perspectives
CoT as Computational Extension
From a computational complexity perspective, CoT allows transformers to solve problems that require more sequential computation than a fixed-depth network can provide. The number of generated tokens effectively increases the model's computational depth:
$$\text{effective sequential depth} \approx L \times T$$

where $L$ is the number of transformer layers and $T$ is the number of generated tokens.
Emergent Abilities
CoT reasoning appears to be an emergent ability—it only manifests in models above a certain scale (typically >100B parameters). Below this threshold, CoT prompting can actually hurt performance:
| Model Size | Standard | CoT | Delta |
|---|---|---|---|
| 350M | 2.5% | 1.2% | -52% |
| 1.3B | 4.1% | 3.8% | -7% |
| 6.7B | 8.2% | 12.1% | +48% |
| 175B | 17.9% | 58.0% | +224% |
This suggests that CoT leverages capabilities that only emerge at scale.
The Road Ahead
Chain-of-thought reasoning has fundamentally changed how we prompt and deploy large language models. The technique bridges the gap between the pattern matching that transformers excel at and the sequential reasoning that complex problems require.
Current research frontiers include:
- Faithful reasoning: Ensuring that generated rationales actually reflect the model's internal computation
- Automatic CoT: Learning to generate reasoning chains without manual prompt engineering
- Efficient CoT: Reducing the computational cost of generating long reasoning chains
- Verification: Training models to check and correct their own reasoning
The papers discussed here—from Wei et al.'s original CoT work to self-consistency, Plan-and-Solve, and multimodal extensions—have established chain-of-thought as a cornerstone of modern AI reasoning. Three years on, we're still discovering new ways to make machines think step by step.
This article draws on peer-reviewed research from Semantic Scholar. For complete bibliographic details, see the hyperlinked citations throughout the text.
