
Reasoning Models Explained: How o1, o3, and DeepSeek R1 Think Before They Answer

Executive Summary

Reasoning models are a new class of large language models trained to generate extended internal reasoning chains before producing a final answer, dramatically improving accuracy on hard multi-step tasks.

The key mechanism is test-time compute scaling: instead of scaling model size or training data, these models spend more computation at inference time to think longer and more carefully.

OpenAI's o1 and o3, together with DeepSeek R1, are the leading examples, each using reinforcement learning to teach the model when and how to reason rather than just what to say.

Reasoning models are not universally better: they are slower, more expensive per query, and often overkill for simple tasks like summarisation or basic question answering.

Understanding when to deploy a reasoning model versus a standard model is one of the most practically important decisions an AI engineer can make in 2026.

Introduction

For most of the history of large language models, the dominant strategy for making a model smarter was straightforward: train a bigger model on more data. More parameters, more tokens, more compute spent during training. The assumption was that intelligence lived inside the weights, baked in during the training run, and inference was just a fast lookup.

That assumption started to crack in 2024 and crumbled in 2025. A new generation of models, beginning with OpenAI's o1 in late 2024, demonstrated that you could make a model dramatically smarter not by changing its size but by changing how it uses time at inference. By training models to think out loud, to draft and revise chains of reasoning before committing to an answer, researchers unlocked a new axis of improvement that had nothing to do with parameter count.

By mid-2026, reasoning models have become a standard part of the AI engineer's toolkit. o3 handles cutting-edge research and mathematics. DeepSeek R1 brought strong reasoning capability to the open-source community at a fraction of the cost. A growing ecosystem of tools, frameworks, and APIs is built around the assumption that some tasks deserve deliberate, structured thinking rather than instant response.

This post explains what reasoning models actually are, how they work under the hood, where they shine, and where they fall flat, with the focus on clear explanations of the mechanisms that matter.

The Problem: What Standard LLMs Get Wrong on Hard Tasks

Standard autoregressive language models generate text one token at a time, left to right, with no ability to revise what they have already written. Each token is conditioned on everything that came before it, and the model has no explicit mechanism to plan ahead, check its work, or explore multiple solution paths before committing.

For many tasks, this is perfectly fine. Summarising a document, translating a sentence, generating a product description: these do not require multi-step planning. A single forward pass through a large model produces a good enough answer most of the time.

The problems emerge with tasks that require chains of deduction. Consider a problem that asks you to figure out the optimal scheduling of five workers across three shifts given a set of constraints. Or a coding problem where the correct solution requires noticing a subtle edge case buried three layers deep in the logic. Or a formal proof that requires holding several intermediate conclusions in mind simultaneously.

When standard models tackle these tasks, they make a characteristic class of errors. They rush to plausible-looking answers without checking whether intermediate steps are consistent. They make arithmetic mistakes because exact calculation is not something a model can pattern-match the way it pattern-matches syntax. They fail to backtrack when an early assumption turns out to be wrong. These are not failures of knowledge; they are failures of process.

The insight behind reasoning models is that these failures are addressable not by making the model know more, but by giving the model the computational space to think more carefully before answering.

Core Concepts and Terminology

Term	Definition
Reasoning model	A language model specifically trained to generate extended intermediate reasoning steps (a "thinking trace") before producing its final answer to a query.
Test-time compute	Computational resources spent at inference time (when the model is answering a question), as opposed to training time. Reasoning models scale performance by spending more test-time compute on harder problems.
Chain-of-thought (CoT)	A prompting technique where a model is instructed to write out its reasoning steps before giving a final answer. Reasoning models internalise this behaviour through training rather than relying on prompts alone.
Inference-time scaling	The broader principle that model performance can be improved by allocating more computation during inference, for example by generating multiple candidate answers and selecting the best one, or by allowing longer reasoning traces.
Reward model	A separate model trained to score the quality of outputs. During reinforcement learning training, the reward model provides the signal that teaches the reasoning model what constitutes a good answer.
Process reward model (PRM)	A specialised reward model that scores individual reasoning steps rather than just the final answer. PRMs can make reasoning more reliable because they reward correct intermediate logic rather than just lucky final answers, although not every reasoning model is trained with one.
o1 / o3	OpenAI's reasoning model series. o1 was released in late 2024 as the first widely deployed reasoning model. o3 was announced in December 2024 and released in April 2025, with substantially stronger performance, particularly on mathematics, coding, and scientific reasoning.
DeepSeek R1	An open-source reasoning model from DeepSeek, released in January 2025. It demonstrated that strong reasoning capability could be achieved at significantly lower training cost, and it made the weights of a frontier-level reasoning model publicly available.
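
The inference-time scaling entry above mentions generating several candidate answers and selecting the best one. The sketch below illustrates two common selection rules, majority voting and best-of-N with a scoring function. It is a toy simulation, not any vendor's method: `sample_answer`, the distractor list, and the 60% per-sample accuracy in the usage note are all assumptions made for illustration.

```python
import random
from collections import Counter
from typing import Callable, List

CORRECT = "42"                      # hypothetical ground-truth answer
WRONG_ANSWERS = ["41", "43", "24"]  # hypothetical distractors


def sample_answer(rng: random.Random, p_correct: float) -> str:
    """Stand-in for one model sample: correct with probability p_correct."""
    if rng.random() < p_correct:
        return CORRECT
    return rng.choice(WRONG_ANSWERS)


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the first one seen)."""
    return Counter(answers).most_common(1)[0][0]


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate that the scoring function rates highest."""
    return max(candidates, key=score)


def simulate_accuracy(n_samples: int, p_correct: float,
                      trials: int = 2000, seed: int = 0) -> float:
    """Estimate how often majority voting over n_samples is correct."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        answers = [sample_answer(rng, p_correct) for _ in range(n_samples)]
        hits += majority_vote(answers) == CORRECT
    return hits / trials
```

Comparing `simulate_accuracy(1, 0.6)` with `simulate_accuracy(15, 0.6)` shows the basic trade: fifteen samples cost fifteen times the compute, and in this toy setting the voted answer is right far more often than a single sample.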

How It Works

Understanding reasoning models requires separating two things: how they are trained, and what they actually do at inference time. These are related but distinct.

Start with a capable base model. Reasoning models are not trained from scratch. They begin with a strong foundation model that already has broad knowledge and language ability. The reasoning capability is layered on top through further training, not built from the ground up.
Use reinforcement learning with verifiable rewards. The key training signal is not human preference ratings (as used in standard RLHF) but verifiable correctness. For mathematics problems, the final answer is either right or wrong. For coding tasks, the code either passes tests or it does not. This provides a clear, scalable signal that does not require expensive human labelling for every example.
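Where possible, train on process, not just outcome. Outcome rewards score only the final answer. Process reward models score individual steps in the reasoning chain, so a model trained with a PRM learns that correct intermediate steps are valuable even when they appear in a trace that ultimately reaches the wrong conclusion. This discourages shortcut reasoning and rewards genuine logical progress. Not every pipeline uses one: the DeepSeek R1 paper reports relying on rule-based outcome rewards and lists process reward models among the approaches that did not work well at scale.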
Let the model discover reasoning strategies autonomously. One of the surprising findings from DeepSeek R1 is that when models are trained with strong enough reinforcement learning signals on hard problems, they spontaneously develop strategies like self-correction, hypothesis testing, and backtracking, without those behaviours being explicitly programmed. The model learns that thinking longer tends to produce better rewards.
At inference time, the model generates a thinking trace first. When you submit a query to a reasoning model, it does not immediately produce an answer. Instead, it generates a long sequence of internal reasoning tokens, working through the problem step by step. This trace is often hidden from the user in deployed products (OpenAI's interface hides it by default), but it is genuinely being computed and it directly informs the final answer.
The final answer is conditioned on the full reasoning trace. Once the thinking process is complete, the model generates its response. Because the response is conditioned on hundreds or thousands of tokens of structured reasoning, it reflects a much richer computational process than a standard single-pass generation.
Scaling compute improves results non-linearly. One of the remarkable properties of reasoning models is that allocating more tokens to the thinking trace tends to improve accuracy in a predictable way, but with diminishing returns: published scaling curves typically show accuracy rising roughly with the logarithm of the compute spent, so each further gain costs more tokens than the last. Harder problems benefit more from longer thinking. This is the core of inference-time scaling: you can trade latency for accuracy dynamically, depending on what the task requires.
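
The "verifiable rewards" step above can be made concrete with a small sketch. It assumes a completion ends with a line of the form `Answer: ...` (a hypothetical convention, not a format any lab has confirmed) and that a coding task is represented as a plain Python callable plus test cases. A real pipeline would run untrusted code in a sandbox rather than calling it directly.

```python
import re
from typing import Callable, Iterable, Optional, Tuple


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text of the last 'Answer: ...' line, or None if absent."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", completion, flags=re.MULTILINE)
    return matches[-1] if matches else None


def math_reward(completion: str, reference: str) -> float:
    """Binary outcome reward: 1.0 if the final answer matches exactly."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if answer == reference.strip() else 0.0


def code_reward(func: Callable[..., object],
                cases: Iterable[Tuple[tuple, object]]) -> float:
    """Binary outcome reward: 1.0 only if every test case passes."""
    for args, expected in cases:
        try:
            if func(*args) != expected:
                return 0.0
        except Exception:
            # A crash counts as a failed test, not as a training error.
            return 0.0
    return 1.0
```

Both functions score only the outcome. Nothing here inspects the reasoning trace, which is what makes the signal cheap to compute at scale and also why a lucky guess earns the same reward as a sound derivation.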

A Practical Example

Suppose you ask an AI model the following problem: "Alice, Bob, and Carol each have a different favourite colour from the set red, blue, and green. Alice does not like red. Bob does not like blue or green. Who likes which colour?"

A standard language model, responding in a single pass, might correctly identify that Bob must like red (since he dislikes blue and green), and then work out the rest. But it might also jump to an incorrect conclusion if the associative patterns in its training data suggest a plausible-sounding but wrong assignment.

A reasoning model approaches this differently. Its internal thinking trace might look something like: "There are three people and three colours. Bob cannot like blue or green, so Bob must like red. That eliminates red for Alice and Carol. Alice cannot like red (already eliminated) and cannot like red anyway. So Alice can like blue or green. Carol gets whatever Alice does not pick. Since there is no further constraint distinguishing Alice and Carol's preferences between blue and green, I need to re-read the problem... the problem states Alice does not like red, which is already accounted for. Both assignments for Alice (blue or green) appear valid unless I have missed a constraint. The problem as stated has two valid solutions."

That last step is something a standard model almost never does: recognise when a problem is underspecified and say so explicitly rather than picking one answer with false confidence. The reasoning trace gives the model the space to audit its own logic and catch the ambiguity before it commits to a final answer.

On harder problems, the difference is even more pronounced. In a multi-step scheduling problem with six constraints, a reasoning model can explicitly track which constraints it has applied, notice when two constraints conflict, and either resolve the conflict or flag it to the user. A standard model will typically produce an answer that satisfies most constraints but violates one or two without acknowledging the failure.
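
The colour puzzle above is small enough to check exhaustively. The sketch below enumerates every assignment of the three colours to the three people and keeps those that satisfy both stated constraints. The function names are illustrative only.

```python
from itertools import permutations
from typing import Dict, List

PEOPLE = ("Alice", "Bob", "Carol")
COLOURS = ("red", "blue", "green")


def satisfies(assignment: Dict[str, str]) -> bool:
    """Apply the two constraints stated in the puzzle."""
    alice_ok = assignment["Alice"] != "red"
    bob_ok = assignment["Bob"] not in ("blue", "green")
    return alice_ok and bob_ok


def solve() -> List[Dict[str, str]]:
    """Try every one-to-one assignment of colours to people."""
    solutions = []
    for perm in permutations(COLOURS):
        assignment = dict(zip(PEOPLE, perm))
        if satisfies(assignment):
            solutions.append(assignment)
    return solutions


if __name__ == "__main__":
    for solution in solve():
        print(solution)
```

The search finds two valid assignments, both with Bob on red and with Alice and Carol swapping blue and green, which matches the conclusion in the thinking trace that the problem is underspecified.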

Advantages

Dramatically higher accuracy on hard tasks. Reasoning models outperform standard models by large margins on mathematical olympiad problems, competitive programming, scientific reasoning, and multi-step logical deduction. On some benchmarks, the improvement is not incremental but categorical.
Better calibration and self-correction. Because the model audits its own work during the thinking trace, it is more likely to catch errors before they reach the final answer. It is also more likely to express uncertainty appropriately rather than confabulating a confident-sounding but wrong response.
Dynamic compute allocation. A reasoning model can spend more time on hard problems and less on easy ones by varying the length of its thinking trace. A standard model has no comparable lever: it spends the same fixed amount of computation on each generated token, however hard the question is.
Emergent reasoning strategies. Models trained with reinforcement learning on hard problems discover strategies like decomposition, hypothesis testing, and analogy that were not explicitly taught. This gives them a robustness to novel problem types that purely pattern-matching models lack.
Improved multi-step tool use. In agentic workflows where the model must plan a sequence of tool calls, reasoning models produce significantly more coherent and correct plans. They are better at noticing when a tool call returned unexpected output and adapting accordingly.
Open-source accessibility. DeepSeek R1 demonstrated that state-of-the-art reasoning capability can exist in openly available weights, making these techniques accessible to researchers and engineers who cannot afford proprietary API costs at scale.

Limitations and Trade-offs

High latency. Generating a reasoning trace of several hundred to several thousand tokens before producing the final answer adds significant wall-clock time to every response. For interactive applications where users expect sub-second responses, this is often unacceptable.
Higher token costs. Every thinking token costs money and compute. A reasoning model might use five to twenty times as many tokens as a standard model to answer the same question. At scale, this has a major impact on infrastructure costs.
Overkill for simple tasks. Using a reasoning model to answer "What is the capital of France?" is wasteful. The extended thinking process adds latency without improving the answer. Over-relying on reasoning models for simple queries is a common and expensive mistake.
The thinking trace is not always reliable. Research has shown that the visible reasoning trace is not always a faithful representation of how the model reached its answer. In some cases, the reasoning trace is post-hoc rationalisation rather than the actual computation driving the output. This is sometimes called "unfaithful chain-of-thought."
Worse on creative and open-ended tasks. Reasoning models are optimised for tasks with verifiable correct answers. On open-ended creative writing, brainstorming, or tasks where quality is subjective, their tendency to search for "the right answer" can produce outputs that feel over-structured or unnecessarily analytical.
Training complexity and cost. Turning a base model into a reasoning model requires careful curriculum design, reliable verifiers, and, where they are used, process reward models that themselves require effort to train. The training pipeline is substantially more complex than standard supervised fine-tuning.
Context window pressure. Long reasoning traces consume context window space. In tasks that require both a long input and extended reasoning, the model can run into context limits that would not be a problem for a standard model.
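
The token-cost and context-window points above come down to simple arithmetic. The sketch below assumes thinking tokens are billed at the output-token rate and uses made-up prices. The `Pricing` class and both function names are hypothetical, so substitute your provider's actual rates and limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    input_per_million: float   # hypothetical price per 1M input tokens
    output_per_million: float  # hypothetical price per 1M output tokens


def query_cost(pricing: Pricing, input_tokens: int,
               thinking_tokens: int, answer_tokens: int) -> float:
    """Cost of one query, assuming thinking tokens bill as output tokens."""
    output_tokens = thinking_tokens + answer_tokens
    total = (input_tokens * pricing.input_per_million
             + output_tokens * pricing.output_per_million)
    return total / 1_000_000


def fits_context(context_window: int, input_tokens: int,
                 thinking_tokens: int, answer_tokens: int) -> bool:
    """True if prompt, trace, and answer together fit in the window."""
    return input_tokens + thinking_tokens + answer_tokens <= context_window
```

With `Pricing(1.0, 4.0)`, a 1,000-token prompt and a 500-token answer cost 0.003 units without thinking and 0.021 units with a 4,500-token trace: seven times as much for the same visible output.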

Common Mistakes

Using reasoning models for every task by default. Some developers, excited by benchmark results, route all queries through reasoning models regardless of complexity. The result is a slow, expensive system that provides no accuracy improvement on the majority of queries that did not need extended reasoning.
Evaluating reasoning models with simple benchmarks. Reasoning models often appear worse than standard models on benchmarks that measure speed, cost per query, or performance on simple tasks. Evaluating them on the wrong tasks leads to incorrect conclusions about their value.
Treating the thinking trace as ground truth. Developers sometimes log and analyse the reasoning trace as if it were a reliable audit trail of the model's logic. Given the evidence for unfaithful CoT, this can produce misleading conclusions about model behaviour.
Adding chain-of-thought prompts to reasoning models. Reasoning models already generate internal reasoning. Adding explicit instructions like "think step by step" to their system prompt is redundant at best and can interfere with their native reasoning behaviour at worst.
Ignoring the latency budget. In production systems with strict SLA requirements, deploying a reasoning model without profiling its response time distribution is a significant operational risk. The tail latency of reasoning models can be very long on hard problems.
Assuming reasoning models do not hallucinate. The extended thinking process reduces but does not eliminate hallucination. Reasoning models still invent facts, misremember training data, and produce confidently wrong answers, particularly on questions that require recent or highly specific factual knowledge.
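
Profiling the response-time distribution, as the latency point above recommends, can start with a few percentiles over logged timings. This is a minimal sketch using Python's standard `statistics.quantiles`. The function names and the choice of p95 as the SLA metric are assumptions for illustration.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_s: Sequence[float]) -> Dict[str, float]:
    """Return p50, p95, and p99 from a sample of response times in seconds."""
    # n=100 yields 99 cut points; index k-1 holds the k-th percentile.
    cuts = statistics.quantiles(latencies_s, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def violates_sla(latencies_s: Sequence[float], p95_budget_s: float) -> bool:
    """True if the observed p95 latency exceeds the budget."""
    return latency_percentiles(latencies_s)["p95"] > p95_budget_s
```

Collect the timings from a representative sample of real queries, including the hard ones. A mean or median alone will hide exactly the long-tail cases this check is meant to catch.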

Best Practices

Route by task complexity. Build a routing layer that sends simple queries (classification, extraction, short-form Q&A) to a fast standard model and reserves the reasoning model for tasks that genuinely require multi-step deduction, planning, or mathematical reasoning.
Use reasoning models for verification, not just generation. Even when you generate content with a standard model, a reasoning model can serve as a high-quality verifier that checks the output for logical consistency, constraint satisfaction, or correctness.
Pair reasoning models with tool use carefully. Reasoning models benefit significantly from access to tools (calculators, code interpreters, search), because these offload the parts of reasoning where LLMs are weakest. But tool-use prompting should be simple and explicit rather than elaborate.
Set appropriate timeout and token limits. In production, cap the maximum reasoning tokens to prevent runaway latency on edge-case inputs. Most problems that benefit from reasoning do not require more than a few thousand thinking tokens.
Benchmark on your actual task distribution. Published benchmarks (MATH, AIME, SWE-bench, GPQA) are useful signals but may not represent your application's query mix. Always evaluate reasoning models on a representative sample of your real workload before committing.
Consider open-source reasoning models for cost-sensitive workloads. DeepSeek R1 and its derivatives offer strong reasoning performance that can be self-hosted. For high-volume use cases where API costs are a constraint, running an open-source reasoning model on dedicated infrastructure may be more economical.
Do not over-engineer prompts. Reasoning models are relatively robust to prompt variation compared to standard models. Elaborate system prompts with extensive instructions about how to reason can conflict with the model's trained behaviour. Keep system prompts focused on context and constraints, not reasoning methodology.
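
The routing and token-limit practices above could be sketched as follows. The model names, task labels, keyword list, and 4,000-token cap are all hypothetical placeholders. A production router would more likely use a trained classifier than keyword matching.

```python
from dataclasses import dataclass

# Hypothetical task labels that rarely need multi-step deduction.
SIMPLE_TASKS = {"classification", "extraction", "short_qa", "summarisation"}

# Hypothetical keywords hinting at multi-step reasoning.
REASONING_HINTS = ("prove", "schedule", "optimise", "optimize", "debug", "derive")


@dataclass(frozen=True)
class Route:
    model: str
    max_thinking_tokens: int  # 0 means no extended reasoning


FAST = Route(model="fast-standard-model", max_thinking_tokens=0)
DELIBERATE = Route(model="reasoning-model", max_thinking_tokens=4000)


def route(query: str, task_type: str) -> Route:
    """Send known-simple task types to the fast model, hard ones to reasoning."""
    if task_type in SIMPLE_TASKS:
        return FAST
    text = query.lower()
    if any(hint in text for hint in REASONING_HINTS):
        return DELIBERATE
    # Default to the cheap path; escalate only on evidence of difficulty.
    return FAST
```

Defaulting to the fast model reflects the cost argument made earlier in the post. Whether that default is right for a given application depends on how expensive a wrong answer is compared with a slow one.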

Model Comparison

Model	Strengths	Weaknesses	Best Use Case
OpenAI o1	Reliable multi-step reasoning, strong on STEM tasks, well-integrated with OpenAI ecosystem, consistent safety behaviour	Slower and more expensive than standard GPT-4-class models, thinking trace is hidden, no open weights	Production use cases requiring high accuracy on structured reasoning tasks (scientific Q&A, legal analysis, code generation with complex requirements)
OpenAI o3	State-of-the-art on mathematical and scientific reasoning, significantly stronger than o1 on frontier benchmarks, best-in-class on agentic coding tasks	High cost, highest latency in the series, overkill for most enterprise applications	Research-grade tasks, frontier mathematics, competitive programming, autonomous coding agents where correctness is paramount
DeepSeek R1	Open weights available for self-hosting, competitive with o1 on many benchmarks, much lower cost at scale, strong multilingual reasoning	Safety alignment less robust than OpenAI models, requires infrastructure to self-host, some instability on very long reasoning chains	Cost-sensitive production deployments, research experimentation, applications that require on-premises deployment or fine-tuning
Standard GPT-4-class model	Fast response time, low cost per token, excellent on language tasks, strong instruction following, broad tool ecosystem	Poor accuracy on hard multi-step reasoning, no dynamic compute allocation, prone to confident errors on constraint-heavy problems	High-volume applications where most queries are simple (summarisation, translation, classification, short-form generation, retrieval-augmented Q&A)

Frequently Asked Questions

Is o1 just doing chain-of-thought prompting?

No, and this is one of the most common misconceptions. Chain-of-thought prompting is a technique where the prompt itself instructs the model to write out its reasoning, for example by adding "let's think step by step." o1 and its successors generate internal reasoning as a result of training, not prompting. The model has learned through reinforcement learning to produce structured thinking traces because doing so earns higher rewards. You cannot replicate o1's reasoning capability by adding CoT instructions to a standard model, though CoT prompting does improve standard model accuracy by a meaningful amount on many tasks.

Why does thinking longer actually help?

Longer thinking helps because it gives the model more computational steps to work with. In a standard forward pass, the model has a fixed, relatively small number of transformer operations between seeing the question and producing each output token. With a reasoning trace, each token in the trace is a genuine computational step that updates the model's internal representation of the problem. A longer trace means more opportunity to correct early errors, apply additional constraints, and explore alternative solution paths before committing to a final answer. It is analogous to how a human mathematician writes out working rather than computing the answer entirely in their head.

Can reasoning models replace formal verification tools?

Not reliably, and this matters for anyone building safety-critical systems. Reasoning models significantly reduce errors but do not eliminate them. Formal verification tools guarantee correctness through mathematical proof. Reasoning models improve the probability of correctness through better statistical inference. For high-stakes applications (financial systems, medical diagnostics, safety-critical code), reasoning models are a valuable first pass but should not replace formal methods or human expert review.

Why did DeepSeek R1 matter so much?

DeepSeek R1 was significant for two reasons. First, it demonstrated that strong reasoning capability could be trained at a fraction of the cost previously assumed, challenging the narrative that only a handful of well-resourced labs could build frontier reasoning models. Second, it released model weights openly, allowing researchers, companies, and individuals to run, fine-tune, and study a state-of-the-art reasoning model without going through an API. This democratised access to reasoning model research and spurred a wave of follow-on work in the open-source community.

Do reasoning models work better with more context or with tools?

Both help, but in different ways. More context helps when the task requires synthesising information from a long document or conversation history. Tools (code interpreters, calculators, search) help when the task requires operations the model's weights are intrinsically bad at, like precise arithmetic, real-time information retrieval, or executing code. The combination of a reasoning model with access to a code interpreter is particularly powerful: the model can write code to verify its own mathematical reasoning, catching errors that a pure language-based reasoning trace would miss.
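
To illustrate the kind of check a code interpreter makes possible, the sketch below recomputes an arithmetic claim exactly instead of trusting the number written in a reasoning trace. It is a toy evaluator restricted to integers and the four basic operators. `evaluate` and `check_claim` are hypothetical helpers, not part of any model's tool API.

```python
import ast
import operator
from fractions import Fraction

# Only these binary operators are allowed in an expression.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def evaluate(expr: str) -> Fraction:
    """Exactly evaluate an integer arithmetic expression, rejecting anything else."""
    def walk(node: ast.AST) -> Fraction:
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if (isinstance(node, ast.Constant) and isinstance(node.value, int)
                and not isinstance(node.value, bool)):
            return Fraction(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError(f"unsupported expression element: {type(node).__name__}")

    return walk(ast.parse(expr, mode="eval"))


def check_claim(expr: str, claimed: str) -> bool:
    """True if the claimed value (e.g. '395' or '1/2') equals the exact result."""
    return evaluate(expr) == Fraction(claimed)
```

For example, `check_claim("17 * 23 + 4", "395")` confirms a correct step, while a trace that wrote 385 for the same expression would be flagged. Using `Fraction` avoids the floating-point rounding that would otherwise blur exact comparisons.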

References

OpenAI. (2024). OpenAI o1 System Card. OpenAI. https://openai.com/index/openai-o1-system-card/
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948. https://arxiv.org/abs/2501.12948
Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. NeurIPS 2022. https://arxiv.org/abs/2201.11903
Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv:2305.20050. https://arxiv.org/abs/2305.20050
Snell, C., et al. (2024). Scaling LLM Test-Time Compute Optimally Can be More Effective than Scaling Model Parameters. arXiv:2408.03314. https://arxiv.org/abs/2408.03314
OpenAI. (2025). OpenAI o3 and o4-mini System Card. OpenAI. https://openai.com/index/o3-and-o4-mini-system-card/
Turpin, M., et al. (2023). Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting. NeurIPS 2023. https://arxiv.org/abs/2305.04388

Key Takeaways

Reasoning models generate extended internal thinking traces before answering, trading latency for accuracy on hard multi-step tasks.
The key innovation is inference-time compute scaling: performance improves by spending more tokens on thinking, not by making the model larger.
Training uses reinforcement learning with verifiable rewards, in some pipelines supplemented by process reward models that score intermediate reasoning steps, not just final answers.
o1, o3, and DeepSeek R1 represent three different points on the capability-cost-openness spectrum, and choosing between them depends on your accuracy requirements, latency budget, and infrastructure constraints.
Reasoning models are not a universal upgrade: they are slower, more expensive, and unnecessary for simple tasks. The right engineering decision is to route by task complexity.
The field is moving fast: inference-time scaling, process reward models, and open-source reasoning weights are all active research areas with significant improvements expected throughout 2026 and beyond.