Artificial-intelligence · May 25, 2026

OpenAI Codex Explained: How LLMs Learn to Write Code

A deep dive into the training data, architecture, and learning objectives that taught a language model to program

by Perivitta

Introduction

AI writes code today. It autocompletes your functions, generates boilerplate, translates natural language into SQL, and passes university-level programming exams. But almost no one explains how. How does a model that has never executed a program learn to write one that works? How does it know that list.append(x) returns None and not the modified list? How does it write a correct merge sort without ever having run merge sort?

This post answers those questions. We'll go inside Codex, the model OpenAI released in 2021 that quietly powered GitHub Copilot and launched the AI coding era, from the raw training data all the way to the evaluation benchmarks and failure modes that still matter today.

1. What Is Codex? A Brief History

In August 2021, OpenAI published a paper titled Evaluating Large Language Models Trained on Code and released an API under the name Codex. On the surface, it was a code-completion engine. In practice, it was a rethinking of what language models could do.

Codex was not a new architecture. It was a GPT-3-family model fine-tuned on a massive corpus of publicly available code. (The models evaluated in the paper range up to 12 billion parameters, well below the 175 billion of the largest GPT-3.) The insight was simple but powerful: GPT-3 already understood structure, context, and meaning in text. Code is structured text with deterministic semantics. If you showed the model enough of it, it would learn to write it.

Three things made Codex notable:

  • Scale of the code corpus. Code was collected from 54 million public GitHub repositories and filtered down to roughly 159 GB of unique Python files for the models described in the paper. The production API models also handled other languages.
  • Longer context. While GPT-3 handled 2,048 tokens, Codex extended this to 4,096 tokens. That matters enormously when a function defined 400 lines ago needs to be called correctly in the current scope.
  • A new evaluation standard. OpenAI released HumanEval, 164 hand-written programming problems with unit tests, and gave an unbiased estimator for the pass@k metric, which has since become the standard for measuring code generation quality.

By June 2022, GitHub Copilot launched into general availability, built entirely on Codex. By March 2023, Codex was deprecated in favour of GPT-3.5 Turbo and GPT-4, which had absorbed code training at far greater scale. But the architecture decisions, training techniques, and failure modes Codex introduced are still inside every AI coding tool you use today.

Figure 1: The Codex timeline: GPT-3 (May 2020), the Codex API (Aug 2021), GitHub Copilot general availability (Jun 2022), deprecation in favour of GPT-4 (Mar 2023), and the Codex CLI agent (Apr 2025).

2. The Training Corpus: Teaching a Model to Read Code

The most underappreciated part of Codex is its training data. The architecture is a transformer. The data is what changed everything.

2.1 Where the Code Came From

OpenAI collected a snapshot of public GitHub repositories in May 2020, approximately 54 million repositories. The paper describes the Python portion of this snapshot; the production Codex models served through the API were also proficient in other languages such as JavaScript, TypeScript, Ruby, Go, PHP, and Shell, although the multi-language training mix was not published in the same detail.

The Python slice contained 179 GB of unique files under 1 MB, which quality filtering reduced to 159 GB. This is far larger than most earlier code datasets.

Critically, Codex was not trained from scratch on this code. The model was initialised with GPT-3's pre-trained weights and then fine-tuned on code. This gave it a running start: GPT-3 had already learned a great deal about language, structure, and documentation from its natural language pre-training on Common Crawl, books, and Wikipedia, so Codex did not need to learn language and code simultaneously. The paper reports that this initialisation mainly sped up convergence rather than raising final accuracy, probably because the fine-tuning corpus was so large.

2.2 Data Filtering: Quality Over Quantity

Raw GitHub is noisy. The Codex team applied strict filtering before training, removing files that were:

  • Too large: files larger than 1 MB were excluded (likely generated or binary-encoded).
  • Suspiciously long lines: an average line length above 100 characters or a maximum line length above 1,000 characters indicated minified JavaScript or auto-generated files.
  • Low alphanumeric content: files with a small share of alphanumeric characters (25% is a typical cut-off) were mostly punctuation, brackets, or encoded blobs, not real code.
  • Duplicate content: duplicate files were removed so that the model would not memorise boilerplate rather than learn to reason.

These filters removed a substantial fraction of the raw corpus. The resulting data skewed toward human-authored, readable code, the kind that appears in textbooks, tutorials, open-source libraries, and professional projects.
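The filters above can be sketched in a few lines of Python. This is a minimal illustration rather than OpenAI's actual pipeline: the function name `keep_file` and the constant names are hypothetical, and the thresholds are simply the ones listed above.

```python
MAX_FILE_BYTES = 1_000_000     # roughly the 1 MB size cap
MAX_AVG_LINE_LEN = 100         # average line length limit
MAX_LINE_LEN = 1000            # maximum line length limit
MIN_ALNUM_FRACTION = 0.25      # illustrative alphanumeric threshold


def keep_file(source: str) -> bool:
    """Return True if a source file passes every heuristic quality filter."""
    if not source.strip():
        return False
    if len(source.encode("utf-8")) > MAX_FILE_BYTES:
        return False
    line_lengths = [len(line) for line in source.splitlines()]
    if max(line_lengths) > MAX_LINE_LEN:
        return False
    if sum(line_lengths) / len(line_lengths) > MAX_AVG_LINE_LEN:
        return False
    alnum_fraction = sum(ch.isalnum() for ch in source) / len(source)
    return alnum_fraction >= MIN_ALNUM_FRACTION


readable = "def add(a, b):\n    return a + b\n"
minified = "x = '" + "a" * 2000 + "'\n"
print(keep_file(readable), keep_file(minified))
```

Each check is cheap and purely textual, which is what makes this kind of filtering practical at the scale of millions of files.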

2.3 Why More Data Is Not Always Better

A common assumption is that more training data always improves model quality. For code models, this is partially false. The quality and diversity of code matters as much as the volume.

Consider what GitHub actually contains: millions of abandoned tutorial repositories, nearly identical homework submissions, Stack Overflow copy-pastes, and auto-generated CRUD scaffolding. If these dominate the training distribution, the model learns to write common patterns rather than correct patterns. It becomes fluent in beginner idioms while underperforming on production patterns that appear less frequently.

This is a partial explanation for one of Codex's key limitations: it is significantly better at well-trodden paths (sorting algorithms, API wrappers, data parsing) than at novel problem-solving or complex multi-file reasoning.

3. Architecture: The Transformer That Codes

3.1 Decoder-Only Transformer

Codex uses the same decoder-only transformer architecture as GPT-3. There is no encoder. There is no cross-attention over a separate input sequence. The model receives a sequence of tokens and predicts the next token, repeatedly and autoregressively, until it produces a complete function, class, or file.
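The autoregressive loop itself is short enough to sketch. In the example below, `toy_model` and `TOY_TABLE` are hypothetical stand-ins for a trained network (a real model scores the whole vocabulary given the full prefix); only the shape of the loop is the point.

```python
from typing import Callable, List


def generate(next_token: Callable[[List[str]], str],
             prompt: List[str],
             max_new_tokens: int,
             stop_token: str = "<EOS>") -> List[str]:
    """Greedy autoregressive decoding: append one predicted token at a time."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        token = next_token(tokens)   # the model only ever sees the prefix
        if token == stop_token:
            break
        tokens.append(token)
    return tokens


# Hypothetical stand-in for a trained model: a lookup on the last token.
TOY_TABLE = {
    "def": "add",
    "add": "(a, b):",
    "(a, b):": "return",
    "return": "a + b",
    "a + b": "<EOS>",
}


def toy_model(tokens: List[str]) -> str:
    return TOY_TABLE.get(tokens[-1], "<EOS>")


completion = generate(toy_model, ["def"], max_new_tokens=10)
print(" ".join(completion))
```

Every generated token is appended to the input for the next step, which is why an early mistake can propagate through the rest of a completion.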

Multiple model sizes were released under the Codex name. The most capable was exposed through the API as davinci-codex; OpenAI did not publish its parameter count. A smaller, faster model called cushman-codex was aimed at latency-sensitive applications like Copilot's inline completions. The largest model evaluated in the paper had 12B parameters.

Figure 2: Codex's decoder-only transformer architecture. Input tokens (for example "def merge_sort(arr):") receive token and position embeddings, pass through N stacked transformer blocks (masked self-attention, feed-forward network, layer norm and residual connections) over a 4,096-token context window, and are projected through a language-model head to a softmax over the vocabulary that predicts the next token autoregressively.

3.2 Why Longer Context Changes Everything for Code

Doubling the context window from GPT-3's 2,048 tokens to Codex's 4,096 tokens is not a minor upgrade. For code, context length is architecturally critical.

Consider a realistic Python file:

  • A dataclass defined at the top of the file is referenced in a method 200 lines down.
  • A utility function imported at line 5 is called with specific argument types at line 350.
  • A constant defined in a config block at line 20 determines the logic of a method at line 400.

At 2,048 tokens, much of this context falls outside the model's attention window. Errors cascade: the model calls a function with wrong argument types because it no longer "sees" the definition. At 4,096 tokens, more of these long-range dependencies stay in view.
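A toy example makes the cut-off concrete. The sketch below assumes the simplest possible strategy, keeping only the most recent tokens; real editor integrations choose context more carefully, and the names `visible_context` and `file_tokens` are hypothetical.

```python
from typing import List


def visible_context(tokens: List[str], window: int) -> List[str]:
    """Return the most recent `window` tokens, dropping everything earlier."""
    return tokens[-window:] if window > 0 else []


# Hypothetical file: a class definition at the top, then 3,000 tokens of other code.
file_tokens = ["class", "Config", ":"] + ["tok"] * 3000

sees_definition_2k = "Config" in visible_context(file_tokens, 2048)
sees_definition_4k = "Config" in visible_context(file_tokens, 4096)
print(sees_definition_2k, sees_definition_4k)
```

With the smaller window the definition at the top of the file has already been dropped by the time the model reaches the end, so any call to it has to be guessed.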

Modern code models have pushed context further still: StarCoder2 handles 16,384 tokens, and the latest GPT-4 models support 128,000. This is one of the most direct drivers of measurable quality improvement in code generation.

3.3 Tokenisation for Code

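Codex starts from GPT-3's Byte Pair Encoding (BPE) tokeniser and its vocabulary of 50,257 tokens. That vocabulary was built for natural language, so the Codex paper adds extra tokens for whitespace runs of different lengths, which it reports reduces the token count for code by roughly 30%. A natural-language BPE vocabulary creates several inefficiencies for code: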

| Challenge | Why It Matters | Example |
|---|---|---|
| Indentation | Python uses spaces for syntax. In GPT-3's tokeniser, 8 spaces of indentation become multiple tokens, wasting context. | "return x" indented by 8 spaces uses more tokens than its semantic weight deserves. |
| Numbers | Numeric literals are split at digit boundaries, inflating token counts for constants. | 3.14159265 becomes multiple tokens instead of one semantic unit. |
| Identifiers | The same camelCase or snake_case identifier can tokenise differently depending on adjacent characters, such as a leading space. | getUserById may split as get / User / By / Id. |
| Repeated punctuation | Separators like ==== or ---- (common in Python docstrings and Markdown) consume multiple tokens. | 8 equals signs = up to 8 tokens. |

Later code-specific models such as StarCoder and DeepSeek-Coder went further and trained new BPE tokenisers on code-heavy corpora. These map common whitespace sequences (e.g., 4 spaces) to single tokens and handle numeric literals more consistently. Such changes meaningfully increase the effective information density per token.
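The effect of merging whitespace runs can be shown with a toy tokeniser. This is deliberately not a BPE implementation: `count_tokens` is a hypothetical helper that splits on identifiers, digits, symbols, and spaces, and the only thing it varies is whether a run of spaces counts as one token or many.

```python
import re


def count_tokens(code: str, merge_whitespace_runs: bool) -> int:
    """Count tokens under a toy scheme: identifiers, digit runs, symbols, spaces."""
    space_pattern = r" +" if merge_whitespace_runs else r" "
    pattern = rf"[A-Za-z_]\w*|\d+|{space_pattern}|\n|[^\w\s]"
    return len(re.findall(pattern, code))


# One function with a body indented by 8 spaces.
snippet = "def f(x):\n        return x\n"

print(count_tokens(snippet, merge_whitespace_runs=False))
print(count_tokens(snippet, merge_whitespace_runs=True))
```

Even in this tiny snippet the indentation accounts for a large share of the unmerged count; in deeply nested code the saving compounds on every line.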

4. How the Model Learns to Write Code

4.1 Next-Token Prediction

The core training objective is identical to GPT-3: predict the next token, given all previous tokens. The loss function is cross-entropy over the vocabulary:

\[ \mathcal{L} = -\frac{1}{T} \sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}) \]

where \(x_t\) is the token at position \(t\) and \(T\) is the sequence length. This objective sounds deceptively simple. But applied to 159 GB of Python code, it forces the model to internalise:

  • Syntax: the rules of indentation, brackets, and colons that make Python valid.
  • Semantics: that list.sort() sorts in-place while sorted(list) returns a new list.
  • APIs: that pd.read_csv() returns a DataFrame with specific column types and methods.
  • Idioms: that Pythonic list comprehensions are preferred over explicit loops in most contexts.
  • Documentation conventions: that a function starting with a docstring is likely well-written and its body should match the described behaviour.

None of this was explicitly taught. It all emerged from predicting tokens.
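To make the loss concrete, here is a minimal sketch that computes it from the probabilities a model assigned to the correct tokens. The function name `next_token_loss` and the example probabilities are illustrative assumptions, not values from the paper.

```python
import math
from typing import Sequence


def next_token_loss(target_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the correct tokens (cross-entropy)."""
    if not target_probs:
        raise ValueError("need at least one token")
    return -sum(math.log(p) for p in target_probs) / len(target_probs)


# Probabilities the model gave to the token that actually came next.
confident = next_token_loss([0.9, 0.8, 0.95])
uncertain = next_token_loss([0.2, 0.1, 0.3])
print(round(confident, 3), round(uncertain, 3))
```

The loss only falls when the model puts more probability on what actually comes next, so getting a return type or an argument order wrong is penalised exactly like any other misprediction.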

4.2 Fill-in-the-Middle (FIM)

Pure left-to-right generation has a fundamental limitation: in practice, a developer writing code in the middle of a file needs the model to see both what came before and what comes after the cursor. A raw autoregressive model can only condition on the prefix.

The solution, described by OpenAI in 2022 (Bavarian et al.) and later adopted by models like StarCoder and CodeLlama, is Fill-in-the-Middle (FIM) training.

During training, a random contiguous span is extracted from the middle of a code sample. The remaining content is split into a prefix and suffix. The training input is restructured as:

<PRE> {prefix} <SUF> {suffix} <MID> → predict {middle span}

This teaches the model to condition on context from both directions simultaneously. The result is dramatically better inline completions: the model can complete a function body knowing what the function returns, or fill in a missing argument knowing how it is later used.
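The data transformation is simple enough to sketch. The version below splits at the character level with two random cut points; the sentinel strings are the ones written above, while the helper names `split_for_fim` and `make_fim_example` are hypothetical and real pipelines differ in details such as how often the transformation is applied.

```python
import random
from typing import Tuple

PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"   # sentinel markers as written above


def split_for_fim(document: str, rng: random.Random) -> Tuple[str, str, str]:
    """Cut a document at two random points into (prefix, middle, suffix)."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    return document[:i], document[i:j], document[j:]


def make_fim_example(document: str, rng: random.Random) -> Tuple[str, str]:
    """Return (model_input, target) with the middle span moved to the end."""
    prefix, middle, suffix = split_for_fim(document, rng)
    model_input = f"{PRE}{prefix}{SUF}{suffix}{MID}"
    return model_input, middle


rng = random.Random(0)
doc = "def process(data):\n    result = clean(data)\n    return result\n"
model_input, target = make_fim_example(doc, rng)
print(repr(model_input))
print(repr(target))
```

Because the middle span is simply moved to the end of the sequence, the ordinary next-token objective still applies; no architectural change is needed.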

Figure 3: Fill-in-the-Middle (FIM) vs standard left-to-right generation. Without FIM, the model sees only the prefix "def process(data):" and has no knowledge of what the function must return. With FIM, the input "<PRE> def process(data): <SUF> return result <MID>" lets the model condition on both prefix and suffix, enabling accurate inline completions inside existing code blocks.

4.3 Instruction Fine-Tuning and RLHF

A model trained purely on next-token prediction on code is good at completing code that looks like code it has seen. It is not naturally good at following instructions like "Write a Python function that takes a list of dictionaries and groups them by a specified key."

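To bridge this gap, OpenAI's instruction-following models add two further training phases, described in the InstructGPT paper (Ouyang et al., 2022). The original Codex paper itself covers only a supervised variant, Codex-S, fine-tuned on correctly implemented standalone functions.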

Supervised Fine-Tuning (SFT): Human contractors wrote (prompt, ideal completion) pairs. The model was fine-tuned on these pairs to learn the instruction-following format, to treat natural language as a specification and code as the solution.

Reinforcement Learning from Human Feedback (RLHF): The model generated multiple completions for each prompt. Human raters ranked them. A separate reward model was trained to predict these rankings. Then the code model was optimised with Proximal Policy Optimisation (PPO) to maximise the reward model's score minus a KL penalty:

\[ J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}\left[r_\phi(x, y)\right] - \beta \cdot \text{KL}\left[\pi_\theta \| \pi_{\text{ref}}\right] \]

Here \(x\) is a prompt drawn from the dataset \(\mathcal{D}\), \(y\) is a completion sampled from the policy, \(r_\phi\) is the reward model, \(\pi_\theta\) is the fine-tuned policy, \(\pi_{\text{ref}}\) is the original SFT model, and \(\beta\) sets the strength of the KL penalty. The penalty prevents the policy from deviating too far from the reference, which guards against reward hacking: outputs that maximise the reward model's score but are bizarre in ways human raters didn't anticipate.
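For a single sampled completion, the quantity being maximised can be estimated in one line. The sketch below assumes the log-probabilities of the completion under both models are already available; the function name and the numbers are illustrative only.

```python
def penalised_reward(reward: float,
                     logp_policy: float,
                     logp_ref: float,
                     beta: float) -> float:
    """Single-sample estimate of r(x, y) - beta * KL[policy || reference].

    For a completion y sampled from the policy, the log-probability gap
    log pi_theta(y|x) - log pi_ref(y|x) is a one-sample estimate of the KL term.
    """
    kl_estimate = logp_policy - logp_ref
    return reward - beta * kl_estimate


# A high-reward completion the reference model finds wildly unlikely...
exotic = penalised_reward(reward=2.0, logp_policy=-5.0, logp_ref=-25.0, beta=0.1)
# ...versus a slightly lower-reward completion close to the reference.
ordinary = penalised_reward(reward=1.5, logp_policy=-10.0, logp_ref=-11.0, beta=0.1)
print(exotic, ordinary)
```

With these illustrative numbers the penalty erases the exotic completion's advantage, which is the behaviour the KL term is there to produce.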

For code, execution is a natural additional signal: does the generated code pass the unit tests? This execution feedback can be used to supplement human preferences, since it is cheaper and more objective.
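An execution-based signal can be as simple as the fraction of unit tests a candidate passes. The sketch below is a hypothetical harness, not any production reward function, and it calls exec() directly, which a real system would replace with a proper sandbox.

```python
from typing import Callable, Dict, List


def fraction_of_tests_passed(candidate_source: str,
                             function_name: str,
                             tests: List[Callable[[Callable], bool]]) -> float:
    """Execute candidate code and return the share of tests it passes.

    WARNING: exec() runs arbitrary code. Real systems sandbox this step.
    """
    namespace: Dict[str, object] = {}
    try:
        exec(candidate_source, namespace)
        func = namespace[function_name]
    except Exception:
        return 0.0   # code that does not even load earns nothing
    passed = 0
    for test in tests:
        try:
            if test(func):
                passed += 1
        except Exception:
            pass     # a crash counts as a failed test
    return passed / len(tests) if tests else 0.0


tests = [lambda f: f([3, 1, 2]) == [1, 2, 3], lambda f: f([]) == []]
good = "def sort_list(xs):\n    return sorted(xs)\n"
bad = "def sort_list(xs):\n    return xs.sort()\n"   # list.sort() returns None
print(fraction_of_tests_passed(good, "sort_list", tests))
print(fraction_of_tests_passed(bad, "sort_list", tests))
```

The second candidate looks plausible and parses cleanly, but it returns None, and only running it reveals that.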

5. Evaluating Code Models: HumanEval and pass@k

The Codex paper introduced HumanEval, a benchmark that has since become the standard for measuring code generation quality. It consists of 164 hand-written Python programming problems, each containing:

  • A function signature and docstring describing what the function should do.
  • A set of unit tests that the generated function must pass (the paper reports an average of 7.7 per problem).

The metric is pass@k: given \(k\) independently sampled completions per problem, what is the probability that at least one of them passes all unit tests?

\[ \text{pass@}k = 1 - \frac{\binom{n - c}{k}}{\binom{n}{k}} \]

where \(n\) is the total number of completions sampled for a problem (with \(n \ge k\)), and \(c\) is the number that pass. Averaged over problems, this gives an unbiased estimate of the true pass@k, with lower variance than sampling exactly \(k\) completions and checking whether any of them passes.
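The estimator translates directly into code. This sketch uses Python's exact integer binomial coefficient; the function name `pass_at_k` and the sample counts are illustrative, and a benchmark score would average this value over all problems.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples failed, giving pass@k = 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers: 200 samples for one problem, 20 of which pass the tests.
for k in (1, 10, 100):
    print(k, round(pass_at_k(n=200, c=20, k=k), 4))
```

For k = 1 the formula reduces to c/n, the plain fraction of passing samples; larger k rewards a model for having at least one correct answer somewhere in its samples.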

The motivation for pass@k over simple accuracy is practical: in real use, a developer can look at multiple suggestions and pick the best one. Measuring only pass@1 undervalues models that generate diverse, plausible completions.

| Model | pass@1 | pass@10 | pass@100 |
|---|---|---|---|
| GPT-3 (no code fine-tuning) | 0.0% | 0.0% | 3.6% |
| GPT-J (6B) | 11.4% | 15.7% | 27.7% |
| Codex (12B) | 28.8% | 46.8% | 72.3% |
| GPT-4 (2023) | ~67% | - | - |
| Claude 3.5 Sonnet (2024) | ~92% | - | - |
| o3 (2025) | ~99.7% | - | - |

HumanEval is now considered saturated at the top end. Harder benchmarks like SWE-bench, which requires resolving real GitHub issues across multi-file codebases, have emerged to differentiate frontier models. Even on its human-validated Verified subset, the best models still leave a meaningful share of issues unresolved.

6. Why Codex Gets Things Wrong

Codex passed 28.8% of HumanEval problems at pass@1 in 2021. The remaining 71.2% failed. Understanding the failure modes is as important as celebrating the successes, and these failure modes haven't fully disappeared in modern models.

6.1 API Hallucination

This is the most common and most dangerous failure. Codex confidently generates function calls, parameters, and return values that do not exist in the actual library.

Why does this happen? The model learned the pattern "when doing X with library Y, call Y.function(argument)" from thousands of examples. When a function was renamed, deprecated, or restructured between library versions, the model retained the old usage because most of the training data still used it. The model has no way to verify its output at inference time; it has never executed code.

Classic examples:

  • pd.DataFrame.append(): deprecated in pandas 1.4 and removed in 2.0. Models trained on older code kept suggesting it long after the deprecation; the replacement is pd.concat().
  • sklearn.cross_validation.train_test_split: the cross_validation module was deprecated in favour of model_selection in scikit-learn 0.18 (2016) and later removed. Models trained largely on older code still produce the old path.
  • plt.savefig(tight_layout=True): this parameter does not exist. The correct approach is plt.tight_layout(); plt.savefig(...).
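One cheap defence is to check that a suggested dotted path actually resolves in your installed environment before trusting it. The helper below is a hypothetical sketch using only the standard library. It verifies names, not keyword arguments, and it imports the modules it checks, so it should only be pointed at packages you already trust.

```python
import importlib


def api_exists(dotted_path: str) -> bool:
    """Return True if a path such as 'json.loads' resolves to a real attribute."""
    parts = dotted_path.split(".")
    # Try the longest importable module prefix, then walk the remaining attributes.
    for split in range(len(parts), 0, -1):
        module_name = ".".join(parts[:split])
        try:
            obj = importlib.import_module(module_name)
        except ImportError:
            continue
        for attr in parts[split:]:
            if not hasattr(obj, attr):
                return False
            obj = getattr(obj, attr)
        return True
    return False


print(api_exists("json.loads"))    # a real standard-library function
print(api_exists("json.parse"))    # plausible-looking, but not part of json
```

Run against the libraries in the list above, the same check would flag a stale module path on any installation where it has been removed.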

6.2 Logic Errors vs Syntax Errors

Codex is extremely good at syntax. It almost never produces Python that fails to parse. But syntactic validity is the lowest bar; what matters is semantic correctness.

Logic errors are much harder for the model to avoid because they are rarely visible in the token sequence. An off-by-one error in a loop produces syntactically valid Python that the model has no way to detect as wrong during training. The model optimised for token prediction, not program correctness. It learned the surface form of correct code, not the underlying invariants.

This is why unit tests, the HumanEval methodology, are the right evaluation framework. Syntactic correctness is necessary but not sufficient. Execution is the only oracle.
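A small example shows how invisible such an error is on the page. Both functions below are illustrative: they parse, read naturally, and differ by a single character sequence, yet only one survives a unit test.

```python
def sum_first_n_buggy(values, n):
    """Intended: sum the first n values. Bug: the loop stops one element early."""
    total = 0
    for i in range(n - 1):      # off-by-one: should be range(n)
        total += values[i]
    return total


def sum_first_n(values, n):
    """Correct version: sum the first n values."""
    total = 0
    for i in range(n):
        total += values[i]
    return total


def passes(func) -> bool:
    """A tiny unit test acting as the execution oracle."""
    return func([1, 2, 3, 4], 3) == 6 and func([5], 1) == 5


print(passes(sum_first_n_buggy), passes(sum_first_n))
```

Nothing in the token sequence marks the first version as wrong; the difference only appears when the code runs.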

6.3 The Distribution Problem

GitHub contains a heavily skewed distribution of code. Tutorial projects, beginner exercises, and introductory courses are vastly overrepresented relative to production systems. This means the model is trained on code written by people learning rather than code written by professionals solving hard problems.

The consequence: Codex excels at isolated function-level problems (exactly what HumanEval tests) and struggles on tasks that require reasoning across multiple files, understanding project-level conventions, or solving problems that are common in production but rare in tutorials: complex concurrency, careful error handling, strict type safety, and performance-critical algorithms.

7. From Codex to Modern AI Coding

7.1 GPT-4 and the Absorption of Code

GPT-4 launched in March 2023, the same month the Codex API models were deprecated. OpenAI has not published GPT-4's training data in detail, but code is widely understood to be part of its pre-training mixture rather than a separate fine-tuning stage. Code was no longer a specialisation; it was a first-class modality.

The consequences were measurable. GPT-4 passed the HumanEval benchmark at ~67% pass@1, more than doubling the 28.8% reported for the 12B Codex model. More significantly, it showed qualitatively different capabilities: it could reason about why code was wrong, explain the fix, and modify multi-step algorithmic logic in a single pass.

Code Interpreter (now Advanced Data Analysis) added a feedback loop: GPT-4 could execute code in a sandbox, observe the output or error, and revise its solution. This is arguably the single most important upgrade for practical coding tasks, and it is a change to the system around the model rather than to the model's architecture: the model can iterate rather than guess.
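The loop can be sketched in a few lines. Everything here is an assumption for illustration: `generate` stands in for a model call, `stub_generate` fakes one that fixes its bug on the second attempt, and exec() is used in place of a real sandbox.

```python
import traceback
from typing import Callable, Optional


def run_candidate(source: str, test_source: str) -> Optional[str]:
    """Run code plus tests; return None on success or the error text on failure.

    WARNING: exec() is not a sandbox; real tools isolate this step.
    """
    namespace: dict = {}
    try:
        exec(source, namespace)
        exec(test_source, namespace)
        return None
    except Exception:
        return traceback.format_exc(limit=1)


def solve_with_feedback(generate: Callable[[str, Optional[str]], str],
                        task: str, test_source: str,
                        max_rounds: int = 3) -> Optional[str]:
    """Generate, execute, and feed the error back until the tests pass."""
    error: Optional[str] = None
    for _ in range(max_rounds):
        source = generate(task, error)     # hypothetical model call
        error = run_candidate(source, test_source)
        if error is None:
            return source
    return None


# Stub "model": the first attempt has a bug, the retry (given an error) fixes it.
def stub_generate(task: str, error: Optional[str]) -> str:
    if error is None:
        return "def double(x):\n    return x + 1\n"
    return "def double(x):\n    return x * 2\n"


solution = solve_with_feedback(stub_generate, "double a number",
                               "assert double(4) == 8")
print(solution)
```

The error text is the new information: it gives the next attempt something concrete to condition on, which a single-shot completion never has.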

7.2 Open-Source Alternatives

The two years following Codex's deprecation produced a wave of powerful open-source code models:

| Model | Parameters | Training Data | Context (tokens) | Notable Feature |
|---|---|---|---|---|
| StarCoder (BigCode) | 15.5B | The Stack v1 (~6.4 TB) | 8,192 | FIM support, permissively licensed training data |
| CodeLlama (Meta) | 7B–70B | Llama 2 + code fine-tune | 16,384 | Infilling, instruction variants |
| DeepSeek-Coder | 1.3B–33B | 2T tokens, mostly code | 16,384 | Strong HumanEval scores across sizes |
| Qwen2.5-Coder | 0.5B–32B | 5.5T tokens | 128K | Near-frontier performance with open weights |

The gap between open-source and frontier proprietary models has narrowed dramatically. For many practical coding tasks, a locally run DeepSeek-Coder or Qwen2.5-Coder is sufficient, with the advantages of privacy, no per-token fees, and no rate limits.

7.3 Codex CLI (2025): Agentic Coding

In April 2025, OpenAI released a new product called Codex CLI, a name chosen to evoke the original Codex, but technically a different product altogether. Rather than being a completion model, the Codex CLI is an agentic coding assistant that runs in your terminal.

It operates on the same principle as tools like Claude Code: given a task in natural language, the agent reads files, writes code, runs commands in a sandboxed shell, reads the output, and iterates until the task is complete. Under the hood it uses OpenAI's o3 or o4-mini reasoning models.

The shift from completion to agent is profound. A completion model predicts the next token. An agent takes actions, observes consequences, and adapts. The bottleneck moved from "how well can the model write code?" to "how well can it plan, debug, and navigate a real codebase?"

8. What This Means for Developers and Data Scientists

Understanding how Codex works, and how its successors work, makes you a better user of these tools. A few practical principles follow directly from the architecture:

Provide explicit context. The model cannot see your private codebase, your library versions, or your team conventions. If you want code that works with pandas 2.1, say so. If you have a custom base class, paste its definition. The context window is yours to fill.

Treat generated code like a junior developer's PR. The model is confident regardless of correctness. It will produce syntactically valid, plausible-looking code that may contain subtle logic errors, outdated API calls, or missing edge case handling. Review it as you would review any unverified code.

Write tests first. The HumanEval benchmark works because it uses unit tests as the oracle. Apply the same thinking in your own workflow: write the test that defines correct behaviour before asking the model to implement the function. Then verify the output against the test. This is the closest you can get to pass@1 in a real project.

Use multiple completions. The pass@k metric exists because a single sample is an unreliable predictor of quality. If your editor or API call allows it, generate 3–5 completions and review them. The best answer is more likely to be in a set of five than in the first one alone.

Know the failure modes by domain. API hallucination is most dangerous in fast-moving libraries (LangChain, pandas, sklearn, PyTorch). Logic errors are most likely in algorithmic tasks with non-obvious edge cases. Context errors appear in multi-file tasks. Calibrate your review effort accordingly.

9. Conclusion

Codex was not a breakthrough in architecture; it was a breakthrough in what you point a transformer at. GPT-3's weights, fine-tuned on code from 54 million public repositories with a longer context window and a carefully filtered corpus, produced a model that changed what developers expected from their tools.

The lesson is not that code generation is solved. Pass@1 went from 28.8% on HumanEval in 2021 to near-100% on the same benchmark in 2025, but the real bar, resolving issues in production codebases, understanding project context, reasoning across thousands of lines, remains far harder. The models are better. The limitations are more subtle.

What has not changed is the fundamental mechanism: a transformer, trained to predict the next token, on enough human-written code, internalises patterns so deep and varied that the output looks like understanding. Whether it is understanding is a philosophical question. Whether it is useful is not.

References

  • Chen, M. et al. (2021). Evaluating Large Language Models Trained on Code. OpenAI. arXiv:2107.03374.
  • Brown, T. et al. (2020). Language Models are Few-Shot Learners. arXiv:2005.14165.
  • Bavarian, M. et al. (2022). Efficient Training of Language Models to Fill in the Middle. arXiv:2207.14255.
  • Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
  • Li, R. et al. (2023). StarCoder: may the source be with you! arXiv:2305.06161.
  • OpenAI (2023). GPT-4 Technical Report. arXiv:2303.08774.