Three years ago, a deceptively simple insight transformed how we think about language model capabilities: asking a model to show its work dramatically improves its answers. What began as a prompting trick has evolved into a foundational technique that underpins modern AI reasoning systems.
Chain-of-thought (CoT) prompting, introduced by Wei et al. in their landmark 2022 paper, demonstrated that large language models could solve complex reasoning tasks simply by being prompted to generate intermediate reasoning steps before arriving at a final answer. The technique has since accumulated over 14,000 citations and spawned an entire research field.
The Core Insight: Why Intermediate Steps Matter
Standard Prompting vs. Chain-of-Thought
Consider a simple arithmetic word problem:
Standard prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?
A: 11
Chain-of-thought prompting:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 balls. How many tennis balls does he have now?
A: Roger started with 5 balls. He bought 2 cans with 3 balls each,
so he bought 2 * 3 = 6 balls. Therefore, he has 5 + 6 = 11 balls.
The answer is 11.
The difference seems trivial, yet the Wei et al. (2022) paper demonstrated that this simple modification produces dramatic improvements on complex reasoning benchmarks:
| Benchmark | Standard Prompting | Chain-of-Thought | Relative Improvement |
|---|---|---|---|
| GSM8K (Math) | 17.9% | 58.0% | +224% |
| SVAMP (Math) | 68.9% | 79.0% | +15% |
| StrategyQA | 65.4% | 73.4% | +12% |
| AQuA (Algebra) | 24.4% | 45.3% | +86% |
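To make the setup concrete, here is a minimal sketch of how a few-shot CoT prompt can be assembled. The exemplar simply reuses the Roger example above; `build_few_shot_cot_prompt` and `call_model` are hypothetical helpers, with `call_model` standing in for whatever completion API you use.

COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. He bought 2 cans with 3 balls each, "
    "so he bought 2 * 3 = 6 balls. Therefore, he has 5 + 6 = 11 balls. "
    "The answer is 11.\n"
)

def build_few_shot_cot_prompt(question: str) -> str:
    """Prepend a worked exemplar so the model imitates step-by-step reasoning."""
    return f"{COT_EXEMPLAR}\nQ: {question}\nA:"

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM completion API is in use."""
    raise NotImplementedError("Replace with a real completion call.")

# answer_text = call_model(build_few_shot_cot_prompt("A farmer has 12 eggs ..."))

The key design point is that the exemplar ends with "The answer is 11.", which nudges the model to close its own reasoning with an answer line that is easy to parse.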
The Mathematical Framework
Decomposing the Reasoning Process
Let's formalize what happens during chain-of-thought reasoning. Given a question $q$, standard prompting models the direct probability of an answer $a$:

$$p_\theta(a \mid q)$$

where $\theta$ represents the model parameters. Chain-of-thought introduces intermediate reasoning steps $r = (r_1, \ldots, r_k)$:

$$p_\theta(a \mid q) = \sum_{r} p_\theta(a \mid q, r)\, p_\theta(r \mid q)$$

In practice, we typically take the greedy or sampled path rather than marginalizing:

$$p_\theta(a \mid q, \hat{r}), \qquad \hat{r} \sim p_\theta(r \mid q)$$

where $\hat{r}$ is the generated reasoning chain.
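The gap between marginalizing over chains and committing to a single sampled chain can be seen with toy numbers. The probabilities below are invented purely for illustration; they are not measurements from any model.

# Toy illustration with invented probabilities: marginalizing over reasoning
# chains vs. committing to the single most likely chain.
p_r_given_q = {"chain_A": 0.6, "chain_B": 0.3, "chain_C": 0.1}   # p(r | q)
p_a_given_qr = {                                                 # p(a | q, r)
    "chain_A": {"11": 0.9, "10": 0.1},
    "chain_B": {"11": 0.2, "13": 0.8},
    "chain_C": {"11": 0.5, "10": 0.5},
}

# Marginalization: p(a | q) = sum_r p(a | q, r) * p(r | q)
p_a = {}
for chain, p_chain in p_r_given_q.items():
    for answer, p_ans in p_a_given_qr[chain].items():
        p_a[answer] = p_a.get(answer, 0.0) + p_chain * p_ans
print(p_a)  # approximately {'11': 0.65, '10': 0.11, '13': 0.24}

# Greedy practice: condition on the single most likely chain only.
best_chain = max(p_r_given_q, key=p_r_given_q.get)
print(p_a_given_qr[best_chain])  # {'11': 0.9, '10': 0.1} under chain_A alone

Self-consistency, discussed below, is essentially a Monte Carlo approximation of this sum: sample several chains and vote.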
Why Does This Work?
The power of CoT stems from several interconnected mechanisms:
1. Computational Decomposition
Complex problems often require multiple computational steps that exceed what a single forward pass can reliably compute. By generating intermediate tokens, the model effectively gains additional "compute time":
Each generated token allows the model to perform additional attention operations, effectively extending its reasoning capacity.
2. Working Memory Externalization
Transformers have limited working memory within their hidden states. By writing intermediate results to the output sequence, models can reference these values in subsequent steps through attention:
Step 1: Calculate 2 * 3 = 6 [stored in context]
Step 2: Calculate 5 + 6 = 11 [can attend to "6" from Step 1]
3. Error Localization
When reasoning is explicit, errors in intermediate steps can be identified and potentially corrected. This mirrors how humans debug their own thinking.
Self-Consistency: Voting Over Reasoning Paths
The Problem with Greedy Decoding
Standard CoT uses greedy decoding or low-temperature sampling, producing a single reasoning path. But what if that path contains an error?
Wang et al. (2022) introduced self-consistency, a technique that samples multiple diverse reasoning paths and selects the most consistent answer through majority voting.
The Self-Consistency Algorithm
def self_consistency(question, model, num_samples=40, temperature=0.7):
    """
    Generate multiple reasoning paths and vote on the final answer.

    Args:
        question: The input question
        model: Language model with CoT capability
        num_samples: Number of reasoning paths to generate
        temperature: Sampling temperature for diversity

    Returns:
        most_consistent_answer: The answer with highest vote count
    """
    answers = {}
    reasoning_paths = []

    for i in range(num_samples):
        # Sample a reasoning path with temperature > 0 for diversity
        reasoning, answer = model.generate_cot(
            question,
            temperature=temperature,
        )
        reasoning_paths.append(reasoning)

        # Aggregate answers
        if answer not in answers:
            answers[answer] = 0
        answers[answer] += 1

    # Return the most common answer (majority vote)
    most_consistent_answer = max(answers.keys(), key=lambda a: answers[a])
    return most_consistent_answer
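As a quick sanity check, here is a toy stand-in model (not a real LLM) whose `generate_cot` errs 30% of the time; majority voting over 20 samples almost always recovers the correct answer.

import random

class ToyModel:
    """Toy stand-in for an LLM: returns a wrong reasoning chain 30% of the time."""
    def generate_cot(self, question, temperature=0.7):
        if random.random() < 0.7:
            return ("5 + 2 * 3 = 5 + 6 = 11", "11")   # correct reasoning
        return ("5 + 2 + 3 = 10", "10")               # misreads the problem

print(self_consistency("Roger's tennis ball question", ToyModel(), num_samples=20))
# Prints "11" on the vast majority of runs.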
Mathematical Formulation
Given $m$ sampled reasoning paths $r^{(1)}, \ldots, r^{(m)}$ with corresponding answers $a^{(1)}, \ldots, a^{(m)}$, self-consistency selects:

$$\hat{a} = \arg\max_{a} \sum_{i=1}^{m} \mathbb{1}\left[a^{(i)} = a\right]$$

where $\mathbb{1}[\cdot]$ is the indicator function.

This can also be interpreted as a Monte Carlo estimate of the marginalized probability:

$$p_\theta(a \mid q) = \sum_{r} p_\theta(a \mid q, r)\, p_\theta(r \mid q) \approx \frac{1}{m} \sum_{i=1}^{m} \mathbb{1}\left[a^{(i)} = a\right]$$
Why Self-Consistency Works
The key insight is that correct reasoning paths tend to converge on the same answer, while incorrect paths produce diverse (wrong) answers. Consider:
Path 1: 5 + (2 * 3) = 5 + 6 = 11 ✓
Path 2: 5 + 2 + 3 = 10 ✗ (misread problem)
Path 3: 5 + (2 * 3) = 5 + 6 = 11 ✓
Path 4: 5 * 2 + 3 = 13 ✗ (wrong operation)
Path 5: 5 + (2 * 3) = 5 + 6 = 11 ✓
Vote count: {11: 3, 10: 1, 13: 1}
Winner: 11 ✓
Performance Improvements
Self-consistency delivers consistent gains across benchmarks:
| Benchmark | CoT (Greedy) | CoT + Self-Consistency | Relative Improvement |
|---|---|---|---|
| GSM8K | 58.0% | 74.4% | +28% |
| SVAMP | 79.0% | 86.8% | +10% |
| AQuA | 45.3% | 55.5% | +23% |
| ARC-c | 85.2% | 90.6% | +6% |
Zero-Shot Chain-of-Thought
The Magic of "Let's Think Step by Step"
A remarkable discovery by Kojima et al. (2022): simply appending "Let's think step by step" to a question enables chain-of-thought reasoning without any few-shot examples.
Q: A juggler can juggle 16 balls. Half of the balls are golf balls,
and half of the golf balls are blue. How many blue golf balls?
Let's think step by step.
A: Half of 16 balls are golf balls, so there are 16 / 2 = 8 golf balls.
Half of the golf balls are blue, so there are 8 / 2 = 4 blue golf balls.
The answer is 4.
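In practice, zero-shot CoT is usually run as two passes: the first elicits the reasoning, the second appends an answer-extraction cue so the final answer can be parsed cleanly. A minimal sketch, assuming a generic `call_model(prompt)` completion function not tied to any particular API:

def zero_shot_cot(question: str, call_model) -> str:
    """Two-pass zero-shot CoT: elicit reasoning, then extract a clean answer."""
    # Pass 1: trigger step-by-step reasoning.
    reasoning_prompt = f"Q: {question}\nA: Let's think step by step."
    reasoning = call_model(reasoning_prompt)

    # Pass 2: append an extraction cue so the final answer is easy to parse.
    extraction_prompt = (
        f"{reasoning_prompt} {reasoning}\n"
        "Therefore, the answer is"
    )
    return call_model(extraction_prompt).strip()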
Limitations of Basic Zero-Shot CoT
While powerful, "Let's think step by step" has limitations:
- Missing steps: May skip crucial intermediate calculations
- Variable ordering: Steps may not follow logical sequence
- Error propagation: Early mistakes compound without correction
Plan-and-Solve Prompting: Structured Zero-Shot CoT
The Innovation
Wang et al. (2023) proposed Plan-and-Solve (PS) prompting, which improves zero-shot CoT by explicitly separating planning from execution:
Q: [Question]
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan and solve the problem step by step.
The PS+ Variant
The enhanced PS+ prompting adds three additional instructions:
Q: [Question]
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
- Extract relevant variables and their corresponding numerals.
- Calculate intermediate results (pay attention to calculation errors).
- Check your answer for reasonableness.
Implementation
def plan_and_solve(question, model, use_plus=True):
    """
    Implement Plan-and-Solve prompting for improved zero-shot CoT.

    Args:
        question: The input question
        model: Language model exposing a generate() method
        use_plus: If True, use the enhanced PS+ prompt (usually stronger)
    """
    ps_prompt = f"""
Q: {question}
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
"""
    ps_plus_prompt = f"""
Q: {question}
Let's first understand the problem and devise a plan to solve it.
Then, let's carry out the plan to solve the problem step by step.
- Extract relevant variables and their corresponding numerals.
- Calculate intermediate results (pay attention to calculation errors).
- Check your answer for reasonableness.
"""
    # PS+ typically outperforms standard PS
    response = model.generate(ps_plus_prompt if use_plus else ps_prompt)
    return response
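A hypothetical usage, with a thin wrapper object standing in for a real model so the `model.generate()` interface above is satisfied:

class PromptOnlyModel:
    """Thin wrapper exposing generate(); replace the body with a real LLM call."""
    def generate(self, prompt: str) -> str:
        raise NotImplementedError("Plug in your completion API here.")

# response = plan_and_solve(
#     "Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many now?",
#     model=PromptOnlyModel(),
#     use_plus=True,   # the PS+ variant with the three extra instructions
# )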
Performance Comparison
Plan-and-Solve significantly outperforms basic zero-shot CoT:
| Method | GSM8K | AQuA | SVAMP | MultiArith |
|---|---|---|---|---|
| Zero-Shot | 17.9% | 24.4% | 68.9% | 78.7% |
| Zero-Shot CoT | 58.0% | 45.3% | 79.0% | 90.5% |
| Plan-and-Solve | 62.7% | 47.2% | 81.8% | 92.3% |
| PS+ | 65.4% | 49.2% | 83.1% | 93.5% |
Multimodal Chain-of-Thought
Extending CoT Beyond Text
Zhang et al. (2023) extended chain-of-thought reasoning to multimodal problems involving both images and text, addressing a critical limitation: standard CoT can hallucinate when visual information is crucial.
The Two-Stage Framework
Multimodal CoT separates rationale generation from answer inference:
┌─────────────────────────────────────────────────────────────┐
│ STAGE 1: RATIONALE GENERATION │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Image] + [Question] ──▶ Language Model ──▶ [Rationale] │
│ (with vision) │
│ │
│ Example Output: │
│ "The image shows a triangle with sides labeled 3, 4, │
│ and an unknown hypotenuse. Using the Pythagorean │
│ theorem: c² = a² + b² = 3² + 4² = 9 + 16 = 25" │
│ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ STAGE 2: ANSWER INFERENCE │
├─────────────────────────────────────────────────────────────┤
│ │
│ [Image] + [Question] + [Rationale] ──▶ Model ──▶ [Answer]│
│ │
│ Example Output: │
│ "Therefore, c = √25 = 5. The hypotenuse is 5 units." │
│ │
└─────────────────────────────────────────────────────────────┘
Mathematical Formulation
Let $v$ denote the image features, $q$ the question, and $r$ the rationale. The two-stage process computes:

Stage 1 (Rationale Generation):

$$r = \arg\max_{r'} \; p_{\theta_1}(r' \mid q, v)$$

Stage 2 (Answer Inference):

$$a = \arg\max_{a'} \; p_{\theta_2}(a' \mid q, v, r)$$

Note that $\theta_1$ and $\theta_2$ can be the same or different models, and crucially, both stages have access to the image features $v$.
Why Two Stages?
The key insight from Zhang et al. is that single-stage multimodal CoT often produces hallucinated rationales that ignore visual evidence. By training separate models (or using separate prompts) for each stage, the system learns to:
- Ground rationales in visual evidence (Stage 1)
- Derive answers from grounded rationales (Stage 2); a minimal inference sketch follows below
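As an inference-time sketch only (not the training setup from the paper), the two stages can be wired together as follows; the `generate(image=..., text=...)` interface is a hypothetical vision-language model API:

def multimodal_cot(image, question, rationale_model, answer_model):
    """Two-stage Multimodal CoT at inference time: rationale first, then answer."""
    # Stage 1: generate a rationale grounded in the image.
    rationale = rationale_model.generate(
        image=image,
        text=f"Question: {question}\nRationale:",
    )
    # Stage 2: infer the answer conditioned on image, question, and rationale.
    answer = answer_model.generate(
        image=image,
        text=f"Question: {question}\nRationale: {rationale}\nAnswer:",
    )
    return rationale, answer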
Performance on Science QA
Multimodal CoT achieved state-of-the-art results on the ScienceQA benchmark:
| Model | Accuracy |
|---|---|
| Human | 88.4% |
| GPT-4 (2-shot) | 82.7% |
| Multimodal-CoT (Large) | 91.7% |
| Multimodal-CoT (Base) | 84.9% |
Advanced CoT Techniques
Tree-of-Thoughts
Building on CoT, Tree-of-Thoughts explores multiple reasoning branches:
[Question]
│
┌────────────┼────────────┐
▼ ▼ ▼
[Thought 1] [Thought 2] [Thought 3]
│ │ │
Evaluate Evaluate Evaluate
│ │ │
▼ ▼ ▼
Score=0.3 Score=0.8 Score=0.5
│
▼
[Continue]
│
┌────────┴────────┐
▼ ▼
[Thought 2a] [Thought 2b]
│ │
... ...
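A compact sketch of the search loop the diagram implies, written as a greedy beam search; `propose_thoughts` and `score_thought` are hypothetical LLM-backed helpers, and real Tree-of-Thoughts implementations vary in their search strategy (BFS, DFS, beam):

def tree_of_thoughts(question, propose_thoughts, score_thought,
                     depth=3, breadth=3, keep=1):
    """Greedy beam search over partial reasoning states (sketch only)."""
    frontier = [""]  # each state is the reasoning accumulated so far
    for _ in range(depth):
        candidates = []
        for state in frontier:
            # Propose several continuations of the current partial reasoning.
            for thought in propose_thoughts(question, state, n=breadth):
                new_state = f"{state}\n{thought}".strip()
                # Score each candidate state, e.g. by asking the model to rate it.
                candidates.append((score_thought(question, new_state), new_state))
        # Keep only the highest-scoring states for the next level.
        candidates.sort(key=lambda pair: pair[0], reverse=True)
        frontier = [state for _, state in candidates[:keep]]
    return frontier[0]  # best complete reasoning found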
Least-to-Most Prompting
Decomposes complex problems into simpler subproblems:
Q: How many tennis balls does Roger have if he starts with 5
and buys 2 cans with 3 balls each?
Decomposition:
1. How many balls are in the cans Roger bought?
2. How many balls does Roger have in total?
Solving:
1. Roger bought 2 cans with 3 balls each = 2 * 3 = 6 balls
2. Total = 5 + 6 = 11 balls
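The decompose-then-solve flow can be sketched as two prompting stages, again assuming a generic, hypothetical `call_model(prompt)` completion function:

def least_to_most(question: str, call_model) -> str:
    """Least-to-most prompting: decompose, then solve subquestions in order."""
    # Stage 1: ask the model to list simpler subquestions.
    decomposition = call_model(
        f"Q: {question}\n"
        "Break this question into simpler subquestions, one per line."
    )
    subquestions = [line.strip() for line in decomposition.splitlines() if line.strip()]

    # Stage 2: solve each subquestion, feeding earlier answers back as context.
    context = f"Q: {question}"
    answer = ""
    for sub in subquestions:
        answer = call_model(f"{context}\nSubquestion: {sub}\nAnswer:")
        context += f"\nSubquestion: {sub}\nAnswer: {answer}"
    return answer  # the last subquestion's answer resolves the original question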
Implementation Best Practices
Prompt Engineering for CoT
# Effective CoT prompt structure.
# Only {few_shot_examples} and {question} are filled in by the caller via
# str.format(); the numbered scaffold below is left for the model to complete.
COT_PROMPT_TEMPLATE = """
{few_shot_examples}

Question: {question}

Let's approach this step-by-step:
1. First, I'll identify the key information.
2. Next, I'll determine the required calculations.
3. Now, I'll execute each step.
4. Finally, I'll verify the answer.

The answer is:
"""
Temperature and Sampling Strategy
| Use Case | Temperature | Num Samples | Strategy |
|---|---|---|---|
| Simple questions | 0.0 | 1 | Greedy |
| Moderate complexity | 0.5-0.7 | 5-10 | Self-consistency |
| High complexity | 0.7-1.0 | 20-40 | Self-consistency |
| Creative reasoning | 1.0+ | Multiple | Diverse exploration |
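If you prefer configuration over prose, the table maps directly onto a small helper; the category names and presets below simply mirror the rows above and are not taken from any paper:

from dataclasses import dataclass

@dataclass
class CoTDecodingConfig:
    temperature: float
    num_samples: int
    strategy: str

# Presets mirroring the table above; names are this article's own shorthand.
DECODING_PRESETS = {
    "simple":   CoTDecodingConfig(temperature=0.0, num_samples=1,  strategy="greedy"),
    "moderate": CoTDecodingConfig(temperature=0.7, num_samples=10, strategy="self-consistency"),
    "complex":  CoTDecodingConfig(temperature=0.7, num_samples=40, strategy="self-consistency"),
    "creative": CoTDecodingConfig(temperature=1.0, num_samples=20, strategy="diverse exploration"),
}

print(DECODING_PRESETS["complex"])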
When to Use Each Technique
┌─────────────────────────────────────────────────────────────┐
│ DECISION FLOWCHART │
└─────────────────────────────────────────────────────────────┘
Is the task complex (multi-step reasoning)?
├── No ──▶ Standard prompting is sufficient
└── Yes ──▶ Do you have few-shot examples?
├── Yes ──▶ Few-shot CoT
│ └── Is accuracy critical?
│ ├── Yes ──▶ + Self-Consistency
│ └── No ──▶ Greedy decoding
└── No ──▶ Zero-shot CoT (PS+)
└── Is the task multimodal?
├── Yes ──▶ Multimodal CoT
└── No ──▶ Standard PS+
Theoretical Perspectives
CoT as Computational Extension
From a computational complexity perspective, CoT allows transformers to solve problems that require more sequential computation than a fixed-depth network can provide. The number of generated tokens effectively increases the model's computational depth:
$$\text{effective sequential depth} \approx L \times T$$

where $L$ is the number of transformer layers and $T$ is the number of generated tokens.
Emergent Abilities
CoT reasoning appears to be an emergent ability—it only manifests in models above a certain scale (typically >100B parameters). Below this threshold, CoT prompting can actually hurt performance:
| Model Size | Standard | CoT | Delta |
|---|---|---|---|
| 350M | 2.5% | 1.2% | -52% |
| 1.3B | 4.1% | 3.8% | -7% |
| 6.7B | 8.2% | 12.1% | +48% |
| 175B | 17.9% | 58.0% | +224% |
This suggests that CoT leverages capabilities that only emerge at scale.
The Road Ahead
Chain-of-thought reasoning has fundamentally changed how we prompt and deploy large language models. The technique bridges the gap between the pattern matching that transformers excel at and the sequential reasoning that complex problems require.
Current research frontiers include:
- Faithful reasoning: Ensuring that generated rationales actually reflect the model's internal computation
- Automatic CoT: Learning to generate reasoning chains without manual prompt engineering
- Efficient CoT: Reducing the computational cost of generating long reasoning chains
- Verification: Training models to check and correct their own reasoning
The papers discussed here—from Wei et al.'s original CoT work to self-consistency, Plan-and-Solve, and multimodal extensions—have established chain-of-thought as a cornerstone of modern AI reasoning. Three years on, we're still discovering new ways to make machines think step by step.
This article draws on peer-reviewed research from Semantic Scholar. For complete bibliographic details, see the hyperlinked citations throughout the text.
