Blogpost · March 21, 2026

Advanced Prompt Engineering: Chain-of-Thought, ReAct, and Tree-of-Thoughts Explained

Master modern prompting techniques that unlock reasoning, planning, and multi-step problem-solving in large language models

by Perivitta 24 mins read Advanced

Introduction

A prompt is not just a question you type into a chat box. It is a set of instructions that shapes how the model thinks, what format it follows, and how deeply it reasons. Crafting prompts effectively is one of the highest-leverage skills in AI engineering.

Early applications relied on simple prompts: "Summarize this text" or "Answer this question." These work for straightforward tasks but fail when problems require multi-step reasoning, tool use, or exploring multiple solution paths.

Modern prompt engineering techniques — Chain-of-Thought (CoT), ReAct, and Tree-of-Thoughts (ToT) — unlock significantly better performance on complex tasks by changing how the model structures its thinking. Each technique comes with a different cost and complexity tradeoff, and knowing when to use which is as important as knowing how they work.


The Foundation: Zero-Shot vs Few-Shot Prompting

Before diving into advanced techniques, let us cover the two baseline approaches that everything else builds on.

Zero-shot prompting

Zero-shot prompting means asking the model to perform a task without providing any examples. The model relies entirely on what it learned during training.

Classify the sentiment of this review as positive, negative, or neutral:
"The food was okay but the service was terrible."

Sentiment:

Zero-shot works well for common, well-defined tasks. It fails on specialized, domain-specific, or nuanced tasks where the model has not seen enough training examples to generalize.

Few-shot prompting

Few-shot prompting provides a few examples before the actual task. The model learns the expected input-output format from the examples and applies it to the new input.

Classify the sentiment as positive, negative, or neutral:

Review: "Amazing experience! Highly recommend."
Sentiment: Positive

Review: "Waste of money. Never coming back."
Sentiment: Negative

Review: "It was fine, nothing special."
Sentiment: Neutral

Review: "The food was okay but the service was terrible."
Sentiment:

Few-shot prompting significantly improves performance on custom tasks and unusual output formats. The examples show the model exactly what you want, removing ambiguity.
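In application code, few-shot prompts are usually assembled from a list of labeled examples rather than written by hand. Here is a minimal sketch of that idea; the function name `build_few_shot_prompt` and the `EXAMPLES` list are illustrative choices, not part of any library.

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          examples: List[Tuple[str, str]],
                          query: str) -> str:
    """Assemble the instruction, the labeled examples, and the unlabeled query."""
    blocks = [instruction]
    for review, label in examples:
        blocks.append(f'Review: "{review}"\nSentiment: {label}')
    # The final block leaves the label empty for the model to complete.
    blocks.append(f'Review: "{query}"\nSentiment:')
    return "\n\n".join(blocks)


EXAMPLES = [
    ("Amazing experience! Highly recommend.", "Positive"),
    ("Waste of money. Never coming back.", "Negative"),
    ("It was fine, nothing special.", "Neutral"),
]

prompt = build_few_shot_prompt(
    "Classify the sentiment as positive, negative, or neutral:",
    EXAMPLES,
    "The food was okay but the service was terrible.",
)
print(prompt)
```

Keeping the examples in a data structure makes it easy to swap, reorder, or trim them when you evaluate the prompt later.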

When to use each

  • Zero-shot: Common tasks, simple queries, when latency and cost are the priority.
  • Few-shot: Domain-specific tasks, custom output formats, when accuracy matters more than speed.

Chain-of-Thought (CoT) Prompting

Chain-of-Thought prompting is a breakthrough technique: instead of asking the model to jump directly to an answer, you ask it to work through the problem step by step. This mirrors how humans solve hard problems — by reasoning aloud, checking each step, and arriving at the answer through a logical chain rather than guessing.

Example: Without CoT

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?

A: 11

The model might get this right through pattern matching, but it has no verified reasoning — and on harder problems, it is much more likely to go wrong.

Example: With CoT

Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 balls. How many tennis balls does he have now?

Let's think step by step:
1. Roger starts with 5 tennis balls.
2. He buys 2 cans of tennis balls.
3. Each can contains 3 balls, so 2 cans contain 2 × 3 = 6 balls.
4. Total balls = initial balls + new balls = 5 + 6 = 11 balls.

A: Roger now has 11 tennis balls.

By generating intermediate steps, the model conditions each later token on the reasoning it has already written, which tends to improve accuracy on multi-step problems. The steps also act as a check: if the model makes a logical error, it is visible and catchable.

Few-shot chain-of-thought

The most effective CoT approach provides a few examples with full reasoning chains:

Q: If there are 3 cars in the parking lot and 2 more cars arrive, how many cars are there total?
A: Let's think step by step. There are initially 3 cars. 2 more cars arrive. 3 + 2 = 5. The answer is 5.

Q: John has 4 apples. He gives 2 to his friend. How many apples does John have left?
A: Let's think step by step. John starts with 4 apples. He gives away 2 apples. 4 - 2 = 2. The answer is 2.

Q: A baker made 15 cupcakes. She sold 8 of them. Then she made 6 more. How many cupcakes does she have now?
A: Let's think step by step.

The model learns the reasoning pattern from the examples and applies it to the new question.

Zero-shot chain-of-thought

Kojima et al. (2022) reported a surprising result: simply adding the phrase "Let's think step by step" to a prompt elicits chain-of-thought reasoning without any examples at all.

Q: A farmer has 12 chickens and 8 cows. Each chicken has 2 legs and each cow has 4 legs. How many total legs are there?

Let's think step by step:

The phrase cues the model to write out intermediate steps before committing to an answer. It is a remarkably effective technique for its simplicity.
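In practice, zero-shot CoT is often run in two stages: one call to elicit the reasoning, and a second call that feeds the reasoning back and asks only for the final answer. The sketch below follows that two-stage idea. The answer-extraction wording is modeled on the style used by Kojima et al. and should be treated as an assumption, and `fake_llm` is a stand-in so the example runs without an API key.

```python
import re
from typing import Callable, Optional

COT_TRIGGER = "Let's think step by step."
# Assumed wording for the answer-extraction stage; adjust to your task.
ANSWER_TRIGGER = "Therefore, the answer (arabic numerals) is"


def zero_shot_cot(question: str, llm: Callable[[str], str]) -> Optional[str]:
    """Two-stage zero-shot CoT: elicit reasoning, then extract the answer."""
    # Stage 1: ask for the reasoning chain.
    reasoning_prompt = f"Q: {question}\nA: {COT_TRIGGER}"
    reasoning = llm(reasoning_prompt)
    # Stage 2: append the reasoning and ask for just the final answer.
    answer_prompt = f"{reasoning_prompt} {reasoning}\n{ANSWER_TRIGGER}"
    raw = llm(answer_prompt)
    match = re.search(r"-?\d+(?:\.\d+)?", raw)
    return match.group(0) if match else None


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    if prompt.rstrip().endswith(ANSWER_TRIGGER):
        return " 56."
    return "Chickens: 12 x 2 = 24 legs. Cows: 8 x 4 = 32 legs. 24 + 32 = 56."


question = ("A farmer has 12 chickens and 8 cows. Each chicken has 2 legs "
            "and each cow has 4 legs. How many total legs are there?")
answer = zero_shot_cot(question, fake_llm)
print(answer)
```

Separating the two stages keeps answer parsing simple, because the second response is short and contains little besides the answer.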

When CoT helps most

  • Multi-step arithmetic and algebra problems.
  • Logical reasoning and deduction tasks.
  • Complex question answering requiring inference.
  • Code debugging and algorithm design.
  • Planning and scheduling problems.

CoT helps less for simple factual recall or tasks where reasoning is not required.


Self-Consistency: Improving CoT Reliability

A single reasoning chain can make mistakes. Self-consistency addresses this by generating multiple independent reasoning paths for the same problem and voting on the final answer.

How it works

  1. Generate multiple CoT reasoning chains for the same problem (using a higher temperature setting to get diverse outputs).
  2. Extract the final answer from each chain.
  3. Select the most common answer — majority vote.

The intuition: if five different reasoning paths all arrive at the same answer, that answer is much more likely to be correct than a single chain's output.
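A rough way to quantify this intuition is a binomial model. Assume, as a simplification, that each chain is correct independently with the same probability p, and that the vote succeeds when more than half of the chains are correct. Real chains from one model are correlated, so treat the numbers as an optimistic illustration rather than a prediction. The function name `majority_correct_prob` is made up for this sketch.

```python
from math import comb


def majority_correct_prob(p: float, n: int) -> float:
    """P(more than half of n independent chains are correct), each with prob p."""
    needed = n // 2 + 1  # smallest strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k)
               for k in range(needed, n + 1))


# Compare a single chain with votes over 5 and 15 chains at p = 0.6.
for n in (1, 5, 15):
    print(n, round(majority_correct_prob(0.6, n), 3))
```

Under these assumptions, voting helps only when p is above 0.5. If individual chains are wrong more often than right, a larger vote makes the result worse.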

Implementation

import re
from collections import Counter
from typing import Optional

import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment

def extract_final_answer(text: str) -> Optional[str]:
    """Return the last number in the text, or None if there is none.

    Assumes numeric answers. For other answer formats, adapt the parsing
    (regex, structured output, or another LLM call).
    """
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return numbers[-1] if numbers else None

def self_consistency(question: str, num_samples: int = 5) -> Optional[str]:
    """Generate multiple CoT reasoning chains and majority-vote on the answer"""

    prompt = f"{question}\n\nLet's think step by step:"

    answers = []
    for _ in range(num_samples):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            temperature=0.7,  # Higher temperature for diversity
        )

        # Extract final answer from response
        text = response.choices[0].message.content or ""
        answer = extract_final_answer(text)
        if answer is not None:  # Skip chains with no parseable answer
            answers.append(answer)

    if not answers:
        return None

    # Majority vote (ties resolve to the answer seen first)
    most_common = Counter(answers).most_common(1)[0][0]
    return most_common

Tradeoffs

  • Pros: Significantly higher accuracy on reasoning tasks.
  • Cons: Cost scales with the number of samples: N samples cost roughly N× a single call (typically 5–10×). Latency also rises unless the calls run in parallel.

Use self-consistency for high-stakes decisions where accuracy is worth the extra cost.


ReAct: Reasoning and Acting

ReAct (Reasoning + Acting) combines chain-of-thought reasoning with the ability to take actions — calling tools, searching the web, querying databases, running calculations. The model alternates between reasoning about what to do next and actually doing it.

This is the foundation for modern AI agents. Without ReAct (or something similar), an LLM can only work with what it learned in training and what is in its context window. With ReAct, it can gather new information dynamically.

The ReAct pattern

Each step follows a Thought → Action → Observation cycle:

  • Thought: The model reasons about what information it needs next.
  • Action: The model calls a tool or function to get that information.
  • Observation: The result comes back from the tool.

This cycle repeats until the model has enough information to answer.

Example: Question answering with search

Question: What is the population of the capital of France?

Thought 1: I need to know the capital of France first.
Action 1: Search["capital of France"]
Observation 1: Paris is the capital of France.

Thought 2: Now I need the population of Paris.
Action 2: Search["population of Paris"]
Observation 2: The population of Paris is approximately 2.2 million.

Thought 3: I now have the answer.
Answer: The population of the capital of France (Paris) is approximately 2.2 million.

ReAct prompt template

You can use the following tools:
- Search[query]: Search for information
- Calculator[expression]: Perform calculations
- Finish[answer]: End with final answer

Use this format:
Thought: [your reasoning]
Action: [tool to use]
Observation: [result will be provided]
... (repeat as needed)
Thought: I now know the final answer
Finish: [final answer]

Question: {user_question}

Let's begin:
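To see how a text template like this drives a loop, here is a minimal sketch that parses `Action: Tool[input]` lines with a regular expression, runs the matching tool, and appends the observation to the transcript. All of it is illustrative: `run_react`, `scripted_llm`, and the toy `Search` tool are hypothetical names, and the scripted replies stand in for a real model so the example runs offline.

```python
import re
from typing import Callable, Dict, List, Optional

ACTION_RE = re.compile(r"^Action(?: \d+)?:\s*(\w+)\[(.*)\]\s*$", re.MULTILINE)
FINISH_RE = re.compile(r"^Finish:\s*(.+)$", re.MULTILINE)


def run_react(question: str,
              llm: Callable[[str], str],
              tools: Dict[str, Callable[[str], str]],
              max_steps: int = 5) -> Optional[str]:
    """Run Thought -> Action -> Observation cycles until a Finish line appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        output = llm(transcript)
        transcript += output + "\n"
        finish = FINISH_RE.search(output)
        if finish:
            return finish.group(1).strip()
        action = ACTION_RE.search(output)
        if action is None:
            break  # The model produced neither an action nor a final answer.
        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        observation = tool(arg) if tool else f"Unknown tool: {name}"
        transcript += f"Observation: {observation}\n"
    return None


def scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model: returns canned replies in order."""
    it = iter(replies)
    return lambda transcript: next(it)


FACTS = {
    "capital of France": "Paris is the capital of France.",
    "population of Paris": "The population of Paris is approximately 2.2 million.",
}
tools = {"Search": lambda q: FACTS.get(q, "No result.")}

llm = scripted_llm([
    "Thought: I need to know the capital of France first.\n"
    "Action: Search[capital of France]",
    "Thought: Now I need the population of Paris.\n"
    "Action: Search[population of Paris]",
    "Thought: I now know the final answer\n"
    "Finish: approximately 2.2 million",
])
result = run_react("What is the population of the capital of France?", llm, tools)
print(result)
```

With a real model you would also stop generation at "Observation:" (for example with a stop sequence), so the model does not invent tool results itself.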

Implementation with function calling

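import json

import openai  # assumes openai>=1.0 with OPENAI_API_KEY set in the environment

def react_agent(question: str, tools: dict, max_steps: int = 10) -> str:
    """ReAct agent implementation using the API's native tool calling.

    build_react_prompt and format_tools_for_api are helpers you supply:
    the first renders the system prompt, the second converts `tools`
    (a name -> callable mapping) into the API's tool schema.
    """

    conversation = [
        {"role": "system", "content": build_react_prompt(tools)},
        {"role": "user", "content": question},
    ]

    for step in range(max_steps):
        response = openai.chat.completions.create(
            model="gpt-4",
            messages=conversation,
            tools=format_tools_for_api(tools),
            tool_choice="auto",
        )

        message = response.choices[0].message

        # Check if done
        if message.content and "Finish:" in message.content:
            return message.content.split("Finish:", 1)[1].strip()

        if not message.tool_calls:
            # No tool call and no Finish marker: treat any text as the answer
            return message.content or "Failed to reach conclusion"

        # Record the assistant turn, then answer every tool call it made.
        # The API rejects the next request if any tool_call_id is unanswered.
        conversation.append(message)
        for tool_call in message.tool_calls:
            try:
                tool_args = json.loads(tool_call.function.arguments)
                observation = tools[tool_call.function.name](**tool_args)
            except Exception as exc:  # Report failures so the model can recover
                observation = f"Error: {exc}"

            # Add to conversation
            conversation.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": str(observation),
            })

    return "Failed to reach conclusion"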
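The agent code relies on two helpers, `build_react_prompt` and `format_tools_for_api`, that are not defined in this post. One possible sketch follows, assuming `tools` maps names to plain Python functions whose arguments are all strings. The output follows the `{"type": "function", "function": {...}}` tool schema used by the Chat Completions API. Typing every parameter as a string is a simplification made for this example.

```python
import inspect
from typing import Callable, Dict, List


def format_tools_for_api(tools: Dict[str, Callable]) -> List[dict]:
    """Describe each callable as a function tool (all parameters typed as strings)."""
    specs = []
    for name, fn in tools.items():
        params = inspect.signature(fn).parameters
        specs.append({
            "type": "function",
            "function": {
                "name": name,
                "description": inspect.getdoc(fn) or name,
                "parameters": {
                    "type": "object",
                    "properties": {p: {"type": "string"} for p in params},
                    # Parameters without a default value are required.
                    "required": [p for p, v in params.items()
                                 if v.default is inspect.Parameter.empty],
                },
            },
        })
    return specs


def build_react_prompt(tools: Dict[str, Callable]) -> str:
    """Render a system prompt that lists the tools and the finish convention."""
    lines = [f"- {name}: {inspect.getdoc(fn) or 'no description'}"
             for name, fn in tools.items()]
    return ("You can use the following tools:\n" + "\n".join(lines) +
            "\nReason step by step. When you are done, reply with "
            "'Finish: <answer>'.")


def search(query: str) -> str:
    """Search for information"""
    return f"results for {query}"


specs = format_tools_for_api({"search": search})
print(specs[0]["function"]["parameters"])
print(build_react_prompt({"search": search}))
```

For real tools you would want accurate JSON Schema types and descriptions per parameter, since the model uses them to decide how to call each function.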

When to use ReAct

  • Questions requiring external information (web search, database lookup).
  • Multi-step tasks involving tools or APIs.
  • Scenarios where the model needs to gather information dynamically before answering.
  • Workflows that benefit from explicit, inspectable reasoning traces.

Tree-of-Thoughts: Exploring Multiple Paths

Tree-of-Thoughts (ToT) extends chain-of-thought by exploring multiple reasoning paths simultaneously, evaluating each, and selecting the most promising ones — like a chess engine that considers many possible moves before choosing the best one.

With standard CoT, the model follows a single linear chain. If it makes a wrong turn early, it is stuck with that mistake. ToT allows backtracking and branch exploration.

ToT process

  1. Generate: At each step, produce multiple possible next thoughts (typically 3–5).
  2. Evaluate: Score each thought for promise — is this a good direction?
  3. Select: Choose which thoughts to expand, using BFS, DFS, or beam search.
  4. Backtrack: If a path fails, discard it and try alternatives.

Figure: A tree graph structure. Tree-of-Thoughts uses exactly this topology: the root is the initial problem state, each interior node is a partial reasoning step, and each edge is a possible continuation. BFS explores level by level; DFS follows a branch to its end; the model scores nodes to decide which to expand. Source: ZeroOne / Wikimedia Commons (Public Domain)
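The generate, evaluate, and select steps can be sketched as a breadth-first search that keeps only the top few states per level (beam search). To keep the example runnable without a model, a toy problem replaces the LLM: `expand` appends a letter and `score` counts positions that match a target word. In a real system both would be LLM calls. All names here are illustrative.

```python
from typing import Callable, List, TypeVar

S = TypeVar("S")


def tot_beam_search(root: S,
                    expand: Callable[[S], List[S]],
                    score: Callable[[S], float],
                    depth: int,
                    beam_width: int) -> S:
    """Level-by-level search that keeps the beam_width best states per level."""
    frontier = [root]
    for _ in range(depth):
        # Generate: expand every state currently in the beam.
        candidates = [child for state in frontier for child in expand(state)]
        if not candidates:
            break
        # Evaluate and select: keep only the highest-scoring candidates.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


TARGET = "cat"


def expand(state: str) -> List[str]:
    """Toy generator: propose each letter as the next step."""
    return [state + ch for ch in "act"]


def score(state: str) -> float:
    """Toy evaluator: number of positions that match the target."""
    return sum(a == b for a, b in zip(state, TARGET))


best = tot_beam_search("", expand, score, depth=3, beam_width=2)
print(best)
```

A beam width above 1 is what lets the search recover from a misleading early score: a second-best branch stays alive for one more level instead of being discarded immediately.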

Example: Creative writing task

Task: Write a coherent story with exactly 4 sentences about a detective solving a mystery.

Step 1: Generate 3 possible first sentences
Option A: "Detective Miller arrived at the crime scene on a rainy Tuesday morning."
Option B: "The old mansion had been empty for years until tonight."
Option C: "Everyone knew the butler did it, except Detective Sarah."

Step 2: Evaluate each option
Evaluation: Option A provides good setup, B creates intrigue, C subverts expectations.
Selected: Option A (most conventional, easier to build on)

Step 3: Generate next sentences based on Option A
... (continue expanding the tree)

Implementation sketch

def tree_of_thoughts(problem: str, depth: int = 3, breadth: int = 3) -> str:
    """
    Simplified Tree-of-Thoughts implementation.

    This sketch is greedy: it keeps only the single best thought per level
    (a beam of width 1) and never backtracks. call_llm, parse_numbered_list,
    and extract_score are helpers you supply.
    """

    def generate_thoughts(current_state: str, num: int = 3) -> list:
        """Generate possible next reasoning steps"""
        prompt = f"Given this partial solution:\n{current_state}\n\nGenerate {num} different possible next steps. Output as numbered list."
        response = call_llm(prompt)
        return parse_numbered_list(response)

    def evaluate_thought(current_state: str, thought: str) -> float:
        """Score a thought's promise (0-1) in the context of the path so far"""
        prompt = (
            f"Partial solution so far:\n{current_state}\n\n"
            f"Rate this proposed next step for correctness and promise (0-10):\n{thought}"
        )
        response = call_llm(prompt)
        return extract_score(response) / 10

    # Initialize with problem
    current_best = problem

    for level in range(depth):
        # Generate multiple thoughts
        candidates = generate_thoughts(current_best, breadth)
        if not candidates:
            break  # Nothing to expand

        # Evaluate each against the current path
        scored = [(t, evaluate_thought(current_best, t)) for t in candidates]

        # Select best
        best_thought = max(scored, key=lambda x: x[1])[0]

        # Expand
        current_best += "\n" + best_thought

    return current_best

When to use ToT

  • Problems with multiple valid solution paths where choosing the wrong one early is costly.
  • Creative tasks (writing, design, brainstorming) where exploration is valuable.
  • Puzzles and games that require search over possible states.
  • High-value decisions where solution quality justifies the extra cost.

Tradeoffs

  • Pros: Explores the solution space thoroughly, finds better solutions on hard problems.
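  • Cons: Very expensive. A fully expanded tree has up to breadth^depth leaves, so token usage grows exponentially with depth unless the search prunes (beam search, for example, keeps the cost linear in depth). High latency.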

Reserve ToT for genuinely difficult problems. For most tasks, CoT is sufficient.
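To get a feel for the cost, you can count how many thoughts have to be generated and scored under different search strategies. The sketch below compares a fully expanded tree with beam search. It counts nodes only, so actual token usage also depends on prompt length at each level. The function names are illustrative.

```python
def full_tree_nodes(breadth: int, depth: int) -> int:
    """Thoughts generated when every node is expanded at every level."""
    return sum(breadth ** level for level in range(1, depth + 1))


def beam_nodes(breadth: int, depth: int, beam_width: int) -> int:
    """Thoughts generated when only beam_width states survive each level."""
    total, frontier = 0, 1
    for _ in range(depth):
        generated = frontier * breadth
        total += generated
        frontier = min(beam_width, generated)
    return total


for breadth, depth in [(3, 3), (5, 4)]:
    print(
        f"breadth={breadth} depth={depth}:",
        "full =", full_tree_nodes(breadth, depth),
        "beam(2) =", beam_nodes(breadth, depth, beam_width=2),
        "greedy =", beam_nodes(breadth, depth, beam_width=1),
    )
```

Each counted node costs at least one evaluation call, which is why the pruning strategy dominates the bill for anything beyond small trees.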


Comparison: Prompting Techniques

| Technique | Reasoning Type | Cost | Latency | Best For |
|---|---|---|---|---|
| Zero-Shot | Direct answer | Low | Fast | Simple queries |
| Few-Shot | Pattern learning | Low–Medium | Fast | Custom formats |
| Chain-of-Thought | Step-by-step | Medium | Medium | Reasoning tasks |
| Self-Consistency | Multiple chains + vote | High | Slow | Critical decisions |
| ReAct | Reasoning + action | Medium–High | Slow | Multi-step with tools |
| Tree-of-Thoughts | Explore multiple paths | Very High | Very Slow | Complex problems |

Prompt Optimization Techniques

1. Role assignment

Explicitly tell the model what role to play. This primes it to adopt domain-specific knowledge and tone:

You are an expert data scientist with 15 years of experience.
Analyze the following dataset and provide insights.

2. Output format specification

Be explicit about the structure of the response you want:

Provide your answer in this format:
1. Summary (2-3 sentences)
2. Key Findings (bullet points)
3. Recommendations (numbered list)

3. Constraint setting

Specify hard requirements to prevent the model from veering off course:

Requirements:
- Answer must be under 100 words
- Use simple language (8th grade reading level)
- Include at least one concrete example
- Avoid jargon
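Models do not always honor constraints, so it helps to verify the mechanical ones in code after the response comes back. Here is a minimal sketch that checks the word limit and a caller-supplied list of jargon terms. The function name and the banned-term list are assumptions for illustration, and constraints such as reading level or "includes an example" need a different kind of check.

```python
from typing import List, Sequence


def check_constraints(answer: str,
                      max_words: int = 100,
                      banned_terms: Sequence[str] = ()) -> List[str]:
    """Return a list of violated constraints (an empty list means it passes)."""
    problems = []
    word_count = len(answer.split())
    if word_count >= max_words:  # The requirement is "under" max_words.
        problems.append(f"too long: {word_count} words (must be under {max_words})")
    lowered = answer.lower()
    for term in banned_terms:
        if term.lower() in lowered:
            problems.append(f"contains jargon: {term}")
    return problems


print(check_constraints("Restart the app to apply the update.",
                        banned_terms=("idempotent",)))
print(check_constraints("This operation is idempotent.",
                        banned_terms=("idempotent",)))
```

When a check fails, a common pattern is to send the list of problems back to the model and ask for a revised answer.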

4. Negative examples

Show what not to do. This is often more effective than just specifying what to do:

Good answer: "The project will cost approximately $50,000 and take 3 months."

Bad answer: "It depends on many factors and could cost anything."

Now answer this question: How long will the migration take?

5. Iterative refinement

Use multi-turn conversations to progressively improve a response:

Turn 1: Generate initial answer
Turn 2: "Now make it more concise"
Turn 3: "Add a concrete example"
Turn 4: "Verify the math is correct"
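In code, iterative refinement amounts to keeping the full message history and appending one follow-up instruction per turn. Here is a minimal sketch using the common role/content message format. `refine` and `fake_chat` are hypothetical names, and `fake_chat` stands in for a real chat-completion call so the example runs offline.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


def refine(chat: Callable[[List[Message]], str],
           task: str,
           follow_ups: List[str]) -> List[Message]:
    """Run the initial task, then apply each follow-up with full history."""
    messages: List[Message] = [{"role": "user", "content": task}]
    messages.append({"role": "assistant", "content": chat(messages)})
    for instruction in follow_ups:
        messages.append({"role": "user", "content": instruction})
        # The model sees every earlier turn, including its own drafts.
        messages.append({"role": "assistant", "content": chat(messages)})
    return messages


def fake_chat(messages: List[Message]) -> str:
    """Stand-in model: labels each draft by the number of user turns so far."""
    turns = sum(1 for m in messages if m["role"] == "user")
    return f"draft {turns}"


history = refine(
    fake_chat,
    "Estimate how long the migration will take.",
    ["Now make it more concise", "Add a concrete example",
     "Verify the math is correct"],
)
final_answer = history[-1]["content"]
print(final_answer)
```

Each turn resends the whole history, so token cost grows with every refinement step.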

Production Best Practices

1. Version your prompts

Treat prompts like code. Version them, review changes, and track which version is deployed:

# prompt_templates.py

PROMPTS = {
    "customer_query_v1": "Answer customer questions politely...",
    "customer_query_v2": "You are a helpful customer service agent...",
    "customer_query_v3": "...",  # Latest version
}

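def get_prompt(name: str, version: str = "latest") -> str:
    """Look up a prompt by name and version, e.g. get_prompt("customer_query", "v2")"""
    prefix = f"{name}_v"
    if version == "latest":
        # Compare version numbers numerically so v10 sorts after v9
        numbers = [int(k[len(prefix):]) for k in PROMPTS
                   if k.startswith(prefix) and k[len(prefix):].isdigit()]
        version = f"v{max(numbers)}"
    return PROMPTS[f"{name}_{version}"]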

2. A/B test prompts

When you change a prompt, test the new version against the old one on real traffic before fully deploying:

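import hashlib

def route_to_prompt(user_id: str) -> str:
    """A/B test different prompt versions"""
    # Python's built-in hash() is salted per process for strings, so it can
    # assign the same user to different variants after a restart.
    # A stable digest keeps each user in the same variant.
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    variant = digest[0] % 2

    if variant == 0:
        return PROMPTS["customer_query_v2"]
    else:
        return PROMPTS["customer_query_v3"]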

3. Log and monitor performance

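from datetime import datetime, timezone

def track_prompt_performance(prompt_version: str, response: str, user_feedback: float):
    """Track which prompts perform best"""
    # log_to_database is a placeholder for your own storage layer
    log_to_database({
        "prompt_version": prompt_version,
        "response_length": len(response),
        "user_satisfaction": user_feedback,
        "timestamp": datetime.now(timezone.utc),  # Timezone-aware UTC timestamp
    })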

4. Handle edge cases explicitly

Anticipate common failure modes and address them directly in the prompt:

Important rules:
- If you don't know the answer, say "I don't have enough information"
- If the question is ambiguous, ask for clarification
- If the request violates policy, politely decline
- Never make up facts or statistics

Prompt Engineering Tools

LangChain prompt templates

from langchain.prompts import PromptTemplate

template = """
You are a {role}.
Task: {task}
Context: {context}

Output format: {format}
"""

prompt = PromptTemplate(
    input_variables=["role", "task", "context", "format"],
    template=template
)

final_prompt = prompt.format(
    role="senior software engineer",
    task="Review this code for bugs",
    context=code_snippet,
    format="Bullet points with severity ratings"
)

Prompt optimization with DSPy

DSPy is a framework that automatically optimizes prompts and few-shot examples by treating prompt engineering as an optimization problem:

import dspy

# Define signature
class QuestionAnswering(dspy.Signature):
    """Answer questions based on context"""
    context = dspy.InputField()
    question = dspy.InputField()
    answer = dspy.OutputField()

# Create module
qa = dspy.ChainOfThought(QuestionAnswering)

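# Optimize prompts automatically
# (answer_correctness is a metric function you define; examples is your list of dspy.Example items)
optimizer = dspy.BootstrapFewShot(metric=answer_correctness)
optimized_qa = optimizer.compile(qa, trainset=examples)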

Common Pitfalls

Over-engineering prompts

Always start with the simplest prompt that could work. Add complexity only when you have evidence it helps. Over-engineered prompts are harder to maintain and can actually confuse the model with contradictory instructions.

Ignoring context window limits

Few-shot examples and CoT reasoning consume tokens quickly. For very long prompts, you may hit the model's context limit. Monitor token usage and trim prompts if necessary.
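One simple way to trim is to give the prompt a token budget and include few-shot examples only while they fit. The sketch below uses a crude estimate of about four characters per token, which is a rule of thumb and not a real tokenizer. Pass your model's tokenizer as `count` for accurate numbers. The function names are illustrative.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude estimate (about 4 characters per token); not a real tokenizer."""
    return max(1, len(text) // 4)


def fit_examples(instruction: str,
                 examples: List[str],
                 query: str,
                 budget: int,
                 count: Callable[[str], int] = rough_token_count) -> List[str]:
    """Keep examples in order until adding one would exceed the token budget."""
    used = count(instruction) + count(query)
    kept = []
    for example in examples:
        cost = count(example)
        if used + cost > budget:
            break
        kept.append(example)
        used += cost
    return kept


examples = ["x" * 40, "y" * 40, "z" * 40]
kept = fit_examples("Classify:", examples, "abcd" * 5, budget=30)
print(len(kept), "of", len(examples), "examples fit")
```

Ordering the examples from most to least important means the budget cut drops the least valuable ones first.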

Not testing edge cases

Test your prompts with: empty inputs, extremely long inputs, inputs in unexpected languages, adversarial inputs designed to break your format, and inputs that are ambiguous or contain errors. Prompts that work perfectly on typical inputs often fail on edge cases.

Assuming determinism

Even at temperature 0, LLM outputs are not guaranteed to be identical: they can vary slightly between runs and across API versions and model updates. Design your system to handle variability gracefully, and re-test prompts after model upgrades.


Conclusion

Prompt engineering is both an art and a science. The techniques here — from simple few-shot examples to full Tree-of-Thoughts search — span a wide range of cost and complexity. The key principle is proportionality: use the simplest technique that solves your problem.

Start with zero-shot. If that fails, add few-shot examples. If the task requires reasoning, add CoT. If it requires tools, use ReAct. Reserve self-consistency and ToT for cases where the extra cost is justified by the stakes. And always version, test, and monitor your prompts in production.
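The escalation order above can be written down as a small decision helper. This is a deliberately simplified sketch with made-up flag names. Real decisions also weigh budget, latency, and measured accuracy.

```python
def choose_technique(*,
                     zero_shot_failed: bool = False,
                     needs_reasoning: bool = False,
                     needs_tools: bool = False,
                     high_stakes: bool = False,
                     needs_exploration: bool = False) -> str:
    """Pick the simplest technique that covers the stated needs."""
    if needs_tools:
        return "ReAct"
    if needs_reasoning:
        # Pay for the expensive variants only when the stakes justify it.
        if high_stakes and needs_exploration:
            return "Tree-of-Thoughts"
        if high_stakes:
            return "CoT + self-consistency"
        return "Chain-of-Thought"
    return "few-shot" if zero_shot_failed else "zero-shot"


print(choose_technique())
print(choose_technique(zero_shot_failed=True))
print(choose_technique(needs_reasoning=True, high_stakes=True))
```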


Key Takeaways

  • Chain-of-Thought prompting — triggered simply by "Let's think step by step" — significantly improves accuracy on multi-step reasoning tasks by forcing the model to externalize its logic rather than guess.
  • Self-consistency (majority vote over multiple CoT samples) trades cost for reliability and is worth the expense for high-stakes decisions.
  • ReAct is the right pattern for tasks that require external tools (search, APIs, databases), interleaving explicit reasoning with action calls at each step.
  • Version, A/B test, and monitor your prompts in production the same way you would any other critical code path — prompt changes frequently cause silent regressions.

References

  • Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022.
  • Wang, X., et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
  • Yao, S., et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
  • Yao, S., et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. NeurIPS 2023.
  • Kojima, T., et al. (2022). Large Language Models are Zero-Shot Reasoners. NeurIPS 2022.
