Mixture of Experts (MoE): The Architecture Behind Frontier LLMs
- What you will learn: How MoE replaces dense feed-forward layers with banks of specialist networks, how the gating router works, and why this lets models scale capacity without scaling compute.
- Why it matters: MoE is the architecture behind Mixtral, Grok-1, and DeepSeek-V3, and is widely reported (though unconfirmed) to underlie GPT-4. Understanding it is essential for any engineer working with frontier-scale models.
- Key insight: Only 2 of 8 experts (or similar ratios) activate per token. Total parameters are large; active parameters per forward pass are small. That gap is where MoE's efficiency comes from.
- Watch out for: Without an auxiliary loss or another balancing mechanism, routing tends to collapse onto one or two experts. Training MoE is more complex than training a dense model of equivalent active parameters.
- Covered in depth: Gating mechanisms, token-choice vs expert-choice routing, load balancing, training challenges, a hand-worked routing example, PyTorch implementation, and a comparison of real-world MoE models.
When Mistral released Mixtral 8x7B in December 2023, it demonstrated something striking: a model with 46.7 billion total parameters that matched or outperformed LLaMA 2 70B on most benchmarks, while activating only about 12.9 billion of those parameters per token and so running inference substantially faster. The secret was not a better dataset or a bigger GPU budget. It was a fundamentally different architecture, one that had been theorised since the 1990s but only recently became practical at scale: Mixture of Experts.
The core insight behind MoE is deceptively simple. Not every token in a sequence requires the same kind of processing. A token like "photosynthesis" calls for biological and chemical knowledge; a token like "integrate" might call for mathematical reasoning. Why should both tokens activate exactly the same set of parameters? In a standard dense transformer, they do. Every token flows through every weight in every feed-forward layer, regardless of relevance.
MoE breaks this constraint. Instead of one monolithic feed-forward network (FFN) per transformer layer, MoE replaces it with multiple smaller networks called experts, and a lightweight router that decides, for each token, which subset of experts to activate. The result is a model with far greater total capacity, but no increase in the compute required per token.
This post gives you a rigorous, ground-up understanding of MoE: the theory, the architecture, the training challenges, a hand-worked numerical example, and a clean PyTorch implementation. By the end you will understand why this architecture dominates frontier model design from 2024 through 2026, and what trade-offs you accept when you use it.
The Problem MoE Solves
To appreciate why MoE exists, you need to understand the scaling wall that dense transformers hit.
The Dense Model Scaling Problem
Scaling laws, first documented rigorously by Kaplan et al. (2020) and later refined in the Chinchilla paper, show that a dense transformer's loss decreases predictably as you increase parameters and training tokens. More parameters means more capacity to memorise facts, learn syntax, and generalise across domains. Larger models are simply better, and the industry spent 2020 to 2023 proving this empirically.
But the cost of running a dense model scales linearly with its parameter count. If you double the number of parameters, you roughly double the FLOPs per forward pass, double the memory bandwidth required, and double the GPU memory needed to store the model. At 7 billion parameters this is manageable. At 70 billion parameters it requires careful engineering. At 700 billion parameters it becomes financially brutal at inference time.
The fundamental tension is this: you want capacity at training time, but you want cheapness at inference time. In a dense model, these two goals are in direct conflict. Every parameter you add for better quality is another parameter you pay to run on every token at inference.
The MoE Escape Hatch
MoE breaks the tight coupling between model capacity and per-token compute. You can have a model with, say, 46 billion parameters, but only activate 12 billion of them for any given token. Training teaches all 46 billion parameters to specialise, so the total knowledge in the model is large. But inference only pays for the 12 billion parameters that are actually used.
This is analogous to a hospital. A hospital employs hundreds of specialists: cardiologists, neurologists, dermatologists, oncologists. When a patient arrives, only the relevant specialists are called in. You do not call every specialist for every patient just because they are all on staff. The total expertise of the hospital is large, but the cost of treating any one patient is bounded by the number of specialists actually needed.
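As a rough sanity check on the gap between total and active parameters, the sketch below counts weights for a Mixtral-style model. The layer sizes (hidden size 4096, FFN size 14336, 32 layers, 32 attention heads with 8 key-value heads, vocabulary of 32,000) are assumptions taken from Mixtral 8x7B's published configuration rather than from this post, and `count_parameters` is a hypothetical helper written for illustration.

```python
# Assumed Mixtral 8x7B configuration values (not stated in this post).
HIDDEN_DIM = 4096
FFN_DIM = 14336
NUM_LAYERS = 32
NUM_EXPERTS = 8
TOP_K = 2
NUM_HEADS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128
VOCAB_SIZE = 32000


def count_parameters(experts_counted: int) -> int:
    """Count weights when `experts_counted` experts per layer are included."""
    expert = 3 * HIDDEN_DIM * FFN_DIM                  # three matrices per gated FFN expert
    router = HIDDEN_DIM * NUM_EXPERTS                  # gating matrix
    q_and_o = 2 * HIDDEN_DIM * NUM_HEADS * HEAD_DIM    # query and output projections
    k_and_v = 2 * HIDDEN_DIM * NUM_KV_HEADS * HEAD_DIM # grouped key and value projections
    norms = 2 * HIDDEN_DIM                             # two norm vectors per layer
    per_layer = experts_counted * expert + router + q_and_o + k_and_v + norms
    embeddings = 2 * VOCAB_SIZE * HIDDEN_DIM           # input embedding and output head
    final_norm = HIDDEN_DIM
    return NUM_LAYERS * per_layer + embeddings + final_norm


total_params = count_parameters(NUM_EXPERTS)  # everything stored in memory
active_params = count_parameters(TOP_K)       # what a single token actually touches

print(f"total:  {total_params / 1e9:.1f}B")
print(f"active: {active_params / 1e9:.1f}B")
```

Under these assumptions the two counts land close to the 46.7B total and 12.9B active figures quoted for Mixtral 8x7B. Attention, embeddings, and the router are shared by every token, which is why the active count is larger than one quarter of the total.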
Core Concepts and Terminology
| Term | Definition |
|---|---|
| Expert | A distinct feed-forward network within an MoE layer. Each expert has its own weights and learns to specialise in a subset of the input distribution. |
| Router / Gating Network | A small learned linear layer that takes a token's hidden representation and produces a probability score for each expert. Determines which experts process each token. |
| Top-k Routing | The routing strategy where each token activates exactly k experts (typically k=1 or k=2). Only the top-k scoring experts receive the token; others are bypassed. |
| Sparse Activation | The property that only a small fraction of all parameters are activated for any given token. In Mixtral 8x7B, 2 of 8 experts fire per token: 25% of MoE parameters are active. |
| Load Balancing | The goal of distributing tokens roughly evenly across experts so no single expert becomes a bottleneck while others are idle. |
| Expert Capacity | A hard limit on how many tokens each expert can process in a single batch, expressed as a multiple of the average expected load (the capacity factor). |
| Auxiliary Loss | An additional loss term added during training to encourage balanced routing. Without it, experts collapse: the router learns to always pick the same one or two experts. |
| Token Dropping | What happens when a token is routed to an expert whose capacity buffer is already full: that expert discards the token, which then passes through unmodified (or is handled by a fallback mechanism). |
| Hard MoE | Routing with a discrete top-k selection. The router makes a hard binary decision: a token either goes to an expert or it does not. Most production MoE models use this. |
| Soft MoE | Routing where every expert processes a weighted combination of all tokens, with weights from the router. Differentiable but computationally expensive; used in research. |
MoE Architecture Deep Dive
Where Experts Live in the Transformer
A standard transformer layer has two sub-layers: a multi-head self-attention (MHSA) module and a feed-forward network (FFN). In a dense model, both are present in every layer, and they run for every token in every batch.
In an MoE model, the FFN sub-layer in selected layers (typically every other layer, or every layer) is replaced by an MoE layer. The MHSA sub-layer is kept as-is. The MoE layer contains N independent FFN experts plus a gating network. The gating network routes each token to k of the N experts, runs only those k experts, and combines their outputs.
In a dense transformer, every token flows through the same feed-forward network weights. In an MoE layer, a lightweight gating network acts like a dispatcher: it evaluates each token, selects its top-k expert networks, runs only those forward passes, and combines their outputs using the gating scores as weights. Experts not selected perform no computation at all. That is how MoE achieves higher capacity without proportional compute cost.
The Gating Network
The gating network is a simple linear projection: a weight matrix of shape [hidden_dim, num_experts]. Given a token's hidden state vector of dimension hidden_dim, the gating network computes one logit per expert via a matrix multiply and then applies a softmax to get routing probabilities.
Concretely, for a token with hidden state h and gating weight matrix W_g:
- Compute logits: logits = h @ W_g (shape: [num_experts])
- Apply softmax: scores = softmax(logits) (shape: [num_experts])
- Select top-k indices by score
- Renormalise the top-k scores so they sum to 1
- These renormalised scores become the mixing weights
The gating network has very few parameters relative to the experts. In Mixtral 8x7B, each expert is a gated (SwiGLU) FFN with three linear layers, the same FFN design used in Mistral 7B; the "7B" names that model class, not the size of a single expert. The gating matrix adds only 4096 * 8 = 32,768 parameters per layer, negligible compared to billions of expert parameters.
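The gating steps listed above can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, not a production router: `route_token` is a hypothetical helper, and the toy hidden state and gating weights are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(hidden, gate_weights, k=2):
    """Return (expert indices, mixing weights) for one token.

    hidden:       list of hidden_dim floats (h)
    gate_weights: hidden_dim rows of num_experts floats (W_g)
    """
    num_experts = len(gate_weights[0])
    # Step 1: logits = h @ W_g, one logit per expert.
    logits = [
        sum(h * row[e] for h, row in zip(hidden, gate_weights))
        for e in range(num_experts)
    ]
    # Step 2: softmax turns logits into routing probabilities.
    scores = softmax(logits)
    # Step 3: keep the k highest-scoring experts.
    top = sorted(range(num_experts), key=lambda e: scores[e], reverse=True)[:k]
    # Steps 4 and 5: renormalise the kept scores into mixing weights.
    kept = sum(scores[e] for e in top)
    weights = [scores[e] / kept for e in top]
    return top, weights


# Toy example: hidden_dim=3, num_experts=4 (values are illustrative only).
h = [1.0, 0.5, -1.0]
W_g = [
    [0.2, 1.0, -0.5, 0.1],
    [0.4, 0.3, 0.8, -0.2],
    [-0.1, -0.6, 0.3, 0.5],
]
experts, weights = route_token(h, W_g, k=2)
print(experts, [round(w, 3) for w in weights])
```

The mixing weights always sum to 1 after renormalisation, regardless of how much probability mass the discarded experts held.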
Top-k Selection: Why k=1 or k=2
The choice of k has significant practical consequences.
k=1 (Switch Transformer style): Each token activates exactly one expert. This minimises compute but also means the model cannot hedge. If the router is slightly wrong, there is no fallback. The expert must handle the token entirely. Training with k=1 tends to be less stable because the gating network receives high-variance gradient signals.
k=2 (Mixtral style): Each token activates two experts, and their outputs are combined with weighted averaging. This is more robust: if expert A and expert B both partially specialise in the token's domain, both contribute. Training is more stable than k=1 because the gradient can flow through two paths. The cost is that you activate twice the expert FLOPs per token compared to k=1.
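k>2: Diminishing returns for coarse-grained designs. Each additional expert adds compute and reduces specialisation pressure, so models with a handful of large experts rarely go beyond k=2. Fine-grained designs that spread capacity across many small experts are the exception: DeepSeek-V2 and DeepSeek-V3 activate 6 and 8 experts per token respectively.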
Expert Capacity Buffer
During batched training and inference, multiple tokens in the same batch may route to the same expert. If 50% of tokens in a batch all want expert 3, expert 3 cannot process them all efficiently without becoming a serial bottleneck.
The capacity buffer solves this. Each expert is assigned a capacity: the maximum number of tokens it will process in one forward pass. The capacity is typically set as:
capacity = (batch_tokens / num_experts) * capacity_factor
Here batch_tokens counts token-to-expert assignments, so with top-k routing it is the number of tokens multiplied by k. A capacity factor of 1.0 means each expert handles exactly its fair share. A capacity factor of 1.25 gives a 25% buffer to absorb natural load variation. If more tokens are routed to an expert than its capacity allows, the excess tokens are dropped: they bypass that expert and their hidden state is passed through unchanged. During training, token dropping is tolerable if rare; during inference, it degrades output quality.
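To make the capacity rule concrete, here is a small sketch that computes the capacity and applies it first come, first served. Both helpers are hypothetical, the `top_k` argument is an assumption that generalises the formula to k>1, and the skewed routing pattern is invented for the example.

```python
def expert_capacity(batch_tokens, num_experts, capacity_factor, top_k=1):
    """Capacity per expert; top_k scales the load when each token picks k experts."""
    return int(batch_tokens * top_k / num_experts * capacity_factor)


def apply_capacity(assignments, num_experts, capacity):
    """Split (token, expert) assignments into kept and dropped, in arrival order."""
    used = [0] * num_experts
    kept, dropped = [], []
    for token, expert in assignments:
        if used[expert] < capacity:
            used[expert] += 1
            kept.append((token, expert))
        else:
            dropped.append((token, expert))
    return kept, dropped


# 8 tokens, 4 experts, k=1: the fair share is 2 tokens per expert.
capacity = expert_capacity(batch_tokens=8, num_experts=4, capacity_factor=1.25)

# Hypothetical skewed routing: five of the eight tokens pick expert 3.
routing = [(0, 3), (1, 3), (2, 0), (3, 3), (4, 1), (5, 3), (6, 3), (7, 2)]
kept, dropped = apply_capacity(routing, num_experts=4, capacity=capacity)

print("capacity:", capacity)
print("dropped:", dropped)
```

With so few tokens the 25% buffer rounds down to nothing (int(2.5) is 2), so three of the five tokens sent to expert 3 are dropped. Real batches are large enough that the buffer translates into actual spare slots.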
Data Flow for One Token Through an MoE Layer
Let us trace a single token through an MoE layer step by step. Assume 8 experts, k=2 routing, and the token's hidden state is a vector of dimension 4096.
- Gating computation: The hidden state (shape [4096]) is multiplied by the gating weight matrix (shape [4096, 8]) to produce 8 logits.
- Softmax: The 8 logits become 8 probabilities summing to 1.0.
- Top-2 selection: The two highest probabilities are identified, say expert 3 (score 0.41) and expert 7 (score 0.33).
- Score renormalisation: The two selected scores are renormalised: expert 3 gets weight 0.41/(0.41+0.33) = 0.554, expert 7 gets weight 0.446.
- Capacity check: Both experts check whether their capacity buffers have room. If yes, the token is added to their input buffers.
- Expert forward passes: Expert 3 and expert 7 each run their FFN independently on the token's hidden state, producing two output vectors.
- Weighted combination: The two output vectors are combined: output = 0.554 * expert3_output + 0.446 * expert7_output.
- Residual add: The combined output is added back to the input hidden state (standard transformer residual connection).
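The last steps of the trace (renormalisation, weighted combination, residual add) can be checked numerically. The sketch below reuses the scores 0.41 and 0.33 from the walkthrough; the two-dimensional hidden state and the expert outputs are made-up stand-ins for real 4096-dimensional vectors, and `combine` is a hypothetical helper.

```python
def combine(hidden, expert_outputs, scores):
    """Renormalise the selected scores, mix the expert outputs, add the residual."""
    total = sum(scores)
    weights = [s / total for s in scores]
    mixed = [
        sum(w * out[i] for w, out in zip(weights, expert_outputs))
        for i in range(len(hidden))
    ]
    # Residual connection: the MoE output is added to the layer input.
    return weights, [h + m for h, m in zip(hidden, mixed)]


hidden = [1.0, -2.0]           # illustrative 2-d hidden state
expert3_output = [0.5, 0.5]    # illustrative output of expert 3
expert7_output = [-1.0, 2.0]   # illustrative output of expert 7

weights, output = combine(hidden, [expert3_output, expert7_output], [0.41, 0.33])
print([round(w, 3) for w in weights])  # [0.554, 0.446]
print([round(v, 3) for v in output])
```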
Routing Mechanisms
The gating function is the heart of MoE. Different routing strategies trade off between training stability, load balance, and computational tractability.
Token-Choice Routing (Standard)
In token-choice routing, each token independently selects its top-k experts. The router processes each token and outputs a distribution over experts; the top-k are activated. This is the most common scheme, used in Mixtral, Switch Transformer, and most other production MoE models.
Mechanism: For each token, compute gating scores for all N experts, take the top-k, renormalise, and combine expert outputs with those weights.
Advantages: Simple to implement. Each token gets its preferred experts. Easy to understand.
Disadvantages: Load imbalance is common. Popular experts get overloaded; unpopular experts starve. Requires auxiliary loss to prevent collapse. Token dropping is necessary when capacity is exceeded.
Expert-Choice Routing
In expert-choice routing, the perspective is flipped. Instead of each token choosing its top-k experts, each expert chooses its top-k tokens from the batch. Each expert is guaranteed to process exactly k tokens, eliminating capacity overflow by construction.
Mechanism: For each expert, compute affinity scores between that expert and all tokens, take the top-k tokens, and process them. Each expert processes exactly k tokens regardless of batch composition.
Advantages: Perfect load balance. No token dropping. No auxiliary loss needed for balancing.
Disadvantages: Some tokens may not be processed by any expert (if no expert selects them), or may be selected by multiple experts (redundant compute). Variable coverage per token makes masking and loss computation more complex. Not used in most production models at scale, though it appeared in Google's research.
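A minimal sketch of expert-choice selection shows both the guaranteed balance and the uneven token coverage described above. The affinity scores are hypothetical, and `expert_choice` is an illustrative helper rather than the routine used in any published system.

```python
def expert_choice(scores, tokens_per_expert):
    """Each expert keeps its highest-affinity tokens.

    scores[t][e] is the affinity between token t and expert e.
    Returns a dict mapping expert index to the sorted token indices it chose.
    """
    num_tokens, num_experts = len(scores), len(scores[0])
    chosen = {}
    for e in range(num_experts):
        ranked = sorted(range(num_tokens), key=lambda t: scores[t][e], reverse=True)
        chosen[e] = sorted(ranked[:tokens_per_expert])
    return chosen


# Hypothetical affinities for 4 tokens and 2 experts.
scores = [
    [0.9, 0.8],  # token 0: attractive to both experts
    [0.6, 0.1],  # token 1
    [0.2, 0.3],  # token 2: attractive to neither
    [0.5, 0.7],  # token 3
]
assignment = expert_choice(scores, tokens_per_expert=2)

covered = {t for tokens in assignment.values() for t in tokens}
unprocessed = sorted(set(range(len(scores))) - covered)

print(assignment)   # {0: [0, 1], 1: [0, 3]}
print(unprocessed)  # [2]
```

Every expert processes exactly two tokens, but token 0 is handled twice and token 2 is handled by no expert at all, which is the coverage problem noted in the disadvantages.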
Soft MoE
Soft MoE, proposed by Google in 2023, avoids the hard top-k selection entirely. Instead of routing each token to a discrete set of experts, Soft MoE constructs a weighted "slot" for each expert that is a convex combination of all tokens, weighted by routing scores. Each expert then processes its slot, and the outputs are recombined.
Mechanism: For each expert, compute a softmax-weighted sum of all token representations. This "input slot" is processed by the expert. The output slot is then distributed back to tokens via another softmax weighting.
Advantages: Fully differentiable. No discrete routing decisions, so no gradient estimation issues. No token dropping by construction.
Disadvantages: Computationally expensive. Every expert sees a contribution from every token, so the total compute is closer to dense than sparse. Better thought of as a research baseline than a practical scaling strategy.
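The slot mechanism can be sketched without any framework. The version below assumes one slot per expert and uses the same logits for both softmaxes (over tokens for dispatch, over experts for combination); the toy tokens, logits, and stand-in experts are invented for illustration, and `soft_moe` is a hypothetical helper.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_moe(tokens, logits, experts):
    """Soft MoE with one slot per expert.

    tokens:  num_tokens vectors of equal length
    logits:  logits[t][e] scores token t against expert e's slot
    experts: list of callables mapping a vector to a vector
    """
    num_tokens, dim = len(tokens), len(tokens[0])
    outputs = [[0.0] * dim for _ in range(num_tokens)]
    # Combine weights: for each token, a softmax over experts.
    combine = [softmax(logits[t]) for t in range(num_tokens)]
    for e, expert in enumerate(experts):
        # Dispatch weights: for this expert, a softmax over all tokens.
        dispatch = softmax([logits[t][e] for t in range(num_tokens)])
        # Input slot: a convex combination of every token in the batch.
        slot = [
            sum(dispatch[t] * tokens[t][i] for t in range(num_tokens))
            for i in range(dim)
        ]
        slot_output = expert(slot)
        # Distribute the slot output back to every token.
        for t in range(num_tokens):
            for i in range(dim):
                outputs[t][i] += combine[t][e] * slot_output[i]
    return outputs


# Toy inputs: 3 tokens of dimension 2 and two stand-in experts.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
logits = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
experts = [lambda v: [2.0 * x for x in v], lambda v: [-x for x in v]]

mixed = soft_moe(tokens, logits, experts)
print([[round(x, 3) for x in row] for row in mixed])
```

Every token contributes to every slot and receives output from every expert, so there is no discrete selection anywhere in the computation.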
Training Challenges
Expert Collapse
The most serious failure mode in MoE training is expert collapse. Early in training, by random chance, one expert produces slightly better outputs than the others. The router's gradient signal reinforces sending tokens to that expert. That expert then receives more training signal and improves faster, widening the gap. Eventually, nearly all tokens route to one or two experts, and the rest are effectively unused.
A collapsed MoE model has the compute cost of the full model at training time, but the effective capacity of only one or two experts. It is the worst of both worlds.
Load Imbalance and the Capacity Factor
Even without full collapse, natural imbalance degrades efficiency. If 40% of tokens route to expert 1 and only 5% to expert 8, expert 1 overflows its buffer while expert 8 sits idle. The capacity factor must be set high enough to absorb real-world imbalance without excessive token dropping.
Setting the capacity factor too high wastes memory (pre-allocated buffers that go unused). Setting it too low causes token dropping and quality degradation. A common default is 1.25, but this requires tuning per-model.
The Auxiliary Load Balancing Loss
To counteract collapse and imbalance, practitioners add an auxiliary loss term to the total training objective. The idea is to penalise the router whenever its routing decisions are unequal across experts.
Conceptually, the auxiliary loss works as follows. For each expert, you compute two quantities: the fraction of tokens routed to it (call this the load fraction) and the average routing probability assigned to it across all tokens. You then multiply these two quantities for each expert and sum the results. This sum is minimised when routing is perfectly uniform.
In plain English: if expert 3 always gets high routing scores AND always gets chosen, the product is large and the loss penalises this. The router is pushed toward distributing both scores and selections more evenly.
The auxiliary loss is added to the main cross-entropy loss with a small coefficient, typically 0.01 or 0.001. Too large a coefficient over-regularises and prevents experts from specialising; too small and collapse occurs anyway. This coefficient is one of the most sensitive hyperparameters in MoE training.
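The load-fraction-times-mean-probability recipe described above is easy to check on toy data. The sketch below scales the sum by the number of experts so that perfect balance gives 1.0 (the coefficient would be applied on top of this); `aux_balance_loss` is a hypothetical helper and both routing patterns are invented.

```python
def aux_balance_loss(gate_scores, chosen_experts):
    """Unweighted load-balancing loss.

    gate_scores[t][e]: routing probability of expert e for token t
    chosen_experts[t]: list of experts selected for token t
    """
    num_tokens, num_experts = len(gate_scores), len(gate_scores[0])
    assignments = sum(len(chosen) for chosen in chosen_experts)
    # Fraction of all assignments that went to each expert.
    load_fraction = [
        sum(chosen.count(e) for chosen in chosen_experts) / assignments
        for e in range(num_experts)
    ]
    # Average routing probability each expert received.
    mean_score = [
        sum(row[e] for row in gate_scores) / num_tokens
        for e in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(load_fraction, mean_score))


# Perfectly balanced: uniform scores, each of 4 tokens picks a different expert.
balanced = aux_balance_loss(
    gate_scores=[[0.25, 0.25, 0.25, 0.25]] * 4,
    chosen_experts=[[0], [1], [2], [3]],
)

# Collapsed: every token is confident in expert 0 and picks it.
collapsed = aux_balance_loss(
    gate_scores=[[0.97, 0.01, 0.01, 0.01]] * 4,
    chosen_experts=[[0], [0], [0], [0]],
)

print(f"balanced:  {balanced:.2f}")   # 1.00
print(f"collapsed: {collapsed:.2f}")  # 3.88
```

With full collapse the value approaches the number of experts, so the penalty grows with the severity of the imbalance.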
Communication Overhead in Distributed Training
In dense models, tensor parallelism and pipeline parallelism distribute the computation of each layer across devices. In MoE models, experts naturally map to expert parallelism: different experts live on different devices. This is efficient when routing is balanced.
However, when a token on device A is routed to an expert on device B, the token's hidden state must be transferred across the network interconnect. This all-to-all communication is a latency bottleneck, especially at large scales. The Switch Transformer paper dedicated significant engineering effort to this problem. DeepSeek-V3 introduced novel communication-compute overlap techniques to mitigate it.
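A back-of-the-envelope sketch can show how routing decisions turn into network traffic. It assumes experts are placed contiguously (expert index divided by experts per device gives the device) and counts only the dispatch direction; the placement scheme, the helper name, and the toy routing are all assumptions made for illustration.

```python
def all_to_all_traffic(token_device, token_experts, experts_per_device,
                       hidden_dim, bytes_per_value=2):
    """Count hidden states that must cross devices during dispatch.

    token_device[t]:  device holding token t
    token_experts[t]: experts selected for token t
    Returns (number of cross-device sends, bytes transferred).
    """
    crossings = 0
    for device, experts in zip(token_device, token_experts):
        for expert in experts:
            # Assumed placement: experts 0..n-1 on device 0, n..2n-1 on device 1, ...
            if expert // experts_per_device != device:
                crossings += 1
    return crossings, crossings * hidden_dim * bytes_per_value


# 4 tokens on 2 devices, 4 experts (2 per device), k=2 routing.
token_device = [0, 0, 1, 1]
token_experts = [[0, 1], [1, 2], [2, 3], [0, 3]]

sends, num_bytes = all_to_all_traffic(
    token_device, token_experts, experts_per_device=2, hidden_dim=4096
)
print(sends, num_bytes)  # 2 16384
```

Expert outputs have to travel back to the token's home device as well, so the round trip roughly doubles this figure.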
Why MoE Models Are Harder to Fine-Tune
Fine-tuning an MoE model presents unique challenges. First, the full model must fit in GPU memory to allow gradient computation through all experts, which is expensive. Second, parameter-efficient fine-tuning (PEFT) methods like LoRA, when applied only to attention or dense layers, leave the expert weights frozen and may not adapt domain-specific knowledge effectively. Third, the routing distribution learned during pre-training may be miscalibrated for a new domain, causing suboptimal expert utilisation during fine-tuning. Finally, training instability (gradients through the sparse discrete routing) is more pronounced with smaller fine-tuning datasets.
Practical Example: Hand-Worked MoE Layer
Let us work through a concrete numerical example. We have a 4-expert MoE layer and a batch of 3 tokens. We use k=2 routing.
Step 1: Router Computes Scores
The gating network takes each token's hidden state and produces a score for each of the 4 experts. After applying softmax, we get the following routing probabilities:
| Token | Expert 1 Score | Expert 2 Score | Expert 3 Score | Expert 4 Score |
|---|---|---|---|---|
| Token A | 0.10 | 0.55 | 0.25 | 0.10 |
| Token B | 0.40 | 0.08 | 0.12 | 0.40 |
| Token C | 0.05 | 0.60 | 0.30 | 0.05 |
Step 2: Top-2 Selection Per Token
| Token | Selected Experts (top 2) | Raw Scores | Renormalised Weights |
|---|---|---|---|
| Token A | Expert 2, Expert 3 | 0.55, 0.25 | 0.688, 0.312 |
| Token B | Expert 1, Expert 4 | 0.40, 0.40 | 0.500, 0.500 |
| Token C | Expert 2, Expert 3 | 0.60, 0.30 | 0.667, 0.333 |
Step 3: Expert Activation Count
| Expert | Tokens Assigned | Load (of 3 tokens, k=2 so 6 total assignments) |
|---|---|---|
| Expert 1 | Token B | 1 token (16.7% of assignments) |
| Expert 2 | Token A, Token C | 2 tokens (33.3% of assignments) |
| Expert 3 | Token A, Token C | 2 tokens (33.3% of assignments) |
| Expert 4 | Token B | 1 token (16.7% of assignments) |
Expert 2 and Expert 3 are more loaded than Expert 1 and Expert 4. In a large training run, this imbalance would grow without the auxiliary loss pushing toward uniformity.
Step 4: Output Combination
After all activated experts run their forward passes, outputs are combined:
- Token A output = 0.688 * Expert2(h_A) + 0.312 * Expert3(h_A)
- Token B output = 0.500 * Expert1(h_B) + 0.500 * Expert4(h_B)
- Token C output = 0.667 * Expert2(h_C) + 0.333 * Expert3(h_C)
Each expert ran exactly once (processing the tokens assigned to it in a batched forward pass), and the weighted sum reconstructs the token-level output. Notice that Expert 2 processed both Token A and Token C in a single batched operation, which is computationally efficient.
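The tables in Steps 2 and 3 can be reproduced with a short script. It starts from the routing probabilities in Step 1 and derives the selected experts, the renormalised weights, and the per-expert load; variable names are illustrative.

```python
# Routing probabilities from Step 1 (experts are numbered 1 to 4).
scores = {
    "A": [0.10, 0.55, 0.25, 0.10],
    "B": [0.40, 0.08, 0.12, 0.40],
    "C": [0.05, 0.60, 0.30, 0.05],
}
TOP_K = 2

routing = {}                        # token -> (selected experts, weights)
load = {e: 0 for e in range(1, 5)}  # expert -> number of assigned tokens

for token, row in scores.items():
    # sorted() is stable, so Token B's tie keeps the lower-numbered expert first.
    ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
    top = ranked[:TOP_K]
    kept = sum(row[i] for i in top)
    experts = [i + 1 for i in top]           # convert to 1-based expert numbers
    weights = [row[i] / kept for i in top]   # renormalise to sum to 1
    routing[token] = (experts, weights)
    for e in experts:
        load[e] += 1

for token, (experts, weights) in routing.items():
    print(token, experts, [f"{w:.4f}" for w in weights])
print(load)
```

The loads come out as 1, 2, 2, 1 assignments for Experts 1 to 4, matching the 16.7% and 33.3% shares in the Step 3 table.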
Python Implementation
The following implementation covers the core MoE layer: gating, top-k routing with a capacity buffer, expert forward passes, and the auxiliary load balancing loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpertFFN(nn.Module):
    """A single expert: a SwiGLU feed-forward network with three linear projections."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # gate projection (goes through SiLU)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)  # down projection
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)  # up projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: element-wise product of SiLU(w1(x)) and w3(x), then projected back
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class MoELayer(nn.Module):
    """
    Sparse Mixture of Experts layer.

    Args:
        hidden_dim: Dimension of the token hidden states.
        ffn_dim: Inner dimension of each expert FFN.
        num_experts: Total number of experts (N).
        top_k: Number of experts activated per token (k).
        capacity_factor: Multiplier on the average expert load to set capacity.
        aux_loss_coef: Weight for the auxiliary load-balancing loss.
    """

    def __init__(
        self,
        hidden_dim: int,
        ffn_dim: int,
        num_experts: int = 8,
        top_k: int = 2,
        capacity_factor: float = 1.25,
        aux_loss_coef: float = 0.01,
    ):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.capacity_factor = capacity_factor
        self.aux_loss_coef = aux_loss_coef
        # Gating network: projects hidden_dim to num_experts logits
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        # Expert networks
        self.experts = nn.ModuleList(
            [ExpertFFN(hidden_dim, ffn_dim) for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor):
        """
        Args:
            x: Token hidden states, shape [batch_size, seq_len, hidden_dim].

        Returns:
            output: Same shape as x.
            aux_loss: Scalar auxiliary load-balancing loss.
        """
        batch_size, seq_len, hidden_dim = x.shape
        # Flatten tokens: treat batch and sequence as one dimension
        # Shape: [batch_size * seq_len, hidden_dim]
        # (reshape, unlike view, also accepts non-contiguous inputs)
        x_flat = x.reshape(-1, hidden_dim)
        num_tokens = x_flat.shape[0]

        # ── Gating ──────────────────────────────────────────────────────────
        # Raw logits from the gating network
        gate_logits = self.gate(x_flat)  # [num_tokens, num_experts]
        gate_scores = F.softmax(gate_logits, dim=-1)  # [num_tokens, num_experts]

        # ── Top-k selection ──────────────────────────────────────────────────
        # Select the top-k experts for each token
        topk_scores, topk_indices = torch.topk(gate_scores, self.top_k, dim=-1)
        # topk_scores: [num_tokens, top_k]
        # topk_indices: [num_tokens, top_k]
        # Renormalise the top-k scores so they sum to 1 per token
        topk_scores = topk_scores / topk_scores.sum(dim=-1, keepdim=True)

        # ── Auxiliary load-balancing loss ────────────────────────────────────
        # Fraction of tokens routed to each expert (discrete indicator)
        expert_mask = F.one_hot(topk_indices, num_classes=self.num_experts).float()
        # expert_mask: [num_tokens, top_k, num_experts]
        tokens_per_expert = expert_mask.sum(dim=[0, 1])  # [num_experts]
        fraction_routed = tokens_per_expert / (num_tokens * self.top_k)
        # Average routing probability for each expert
        mean_gate_scores = gate_scores.mean(dim=0)  # [num_experts]
        # Auxiliary loss: dot product of fraction routed and mean scores,
        # scaled by num_experts so the unweighted value is ~1.0 at perfect balance
        aux_loss = self.aux_loss_coef * self.num_experts * (
            fraction_routed * mean_gate_scores
        ).sum()

        # ── Expert capacity ──────────────────────────────────────────────────
        # Average tokens per expert (with top_k factor)
        avg_load = (num_tokens * self.top_k) / self.num_experts
        capacity = int(avg_load * self.capacity_factor)

        # ── Expert forward passes ────────────────────────────────────────────
        output_flat = torch.zeros_like(x_flat)
        for expert_idx, expert in enumerate(self.experts):
            # Find all (token, k_slot) positions assigned to this expert
            # expert_mask[:, :, expert_idx]: [num_tokens, top_k]
            token_positions = expert_mask[:, :, expert_idx].nonzero(as_tuple=False)
            # token_positions[:, 0] are token indices
            # token_positions[:, 1] are the k-slot indices (0 or 1 for top-2)
            if token_positions.numel() == 0:
                continue
            token_indices = token_positions[:, 0]
            # Apply capacity: drop tokens beyond capacity
            if token_indices.shape[0] > capacity:
                token_indices = token_indices[:capacity]
                token_positions = token_positions[:capacity]
            # Gather the routing weights for these (token, expert) pairs
            k_slot_indices = token_positions[:, 1]
            routing_weights = topk_scores[token_indices, k_slot_indices]  # [n_assigned]
            # Run expert on the selected tokens
            expert_inputs = x_flat[token_indices]  # [n_assigned, hidden_dim]
            expert_outputs = expert(expert_inputs)  # [n_assigned, hidden_dim]
            # Weight outputs and accumulate
            weighted_outputs = expert_outputs * routing_weights.unsqueeze(-1)
            output_flat.index_add_(0, token_indices, weighted_outputs)

        # Reshape back to [batch_size, seq_len, hidden_dim]
        output = output_flat.view(batch_size, seq_len, hidden_dim)
        return output, aux_loss


# ── Usage example ────────────────────────────────────────────────────────────
def demo():
    batch_size, seq_len, hidden_dim, ffn_dim = 2, 128, 512, 2048
    moe = MoELayer(
        hidden_dim=hidden_dim,
        ffn_dim=ffn_dim,
        num_experts=8,
        top_k=2,
        capacity_factor=1.25,
        aux_loss_coef=0.01,
    )
    x = torch.randn(batch_size, seq_len, hidden_dim)
    output, aux_loss = moe(x)
    print(f"Input shape: {x.shape}")  # [2, 128, 512]
    print(f"Output shape: {output.shape}")  # [2, 128, 512]
    print(f"Aux loss: {aux_loss.item():.4f}")
    # In a training loop, add aux_loss to the main loss:
    # total_loss = cross_entropy_loss + aux_loss
    # total_loss.backward()


if __name__ == "__main__":
    demo()
A few notes on the implementation above. The index_add_ operation accumulates weighted expert outputs into the output tensor, handling the case where multiple token-expert pairs share the same output slot. The capacity check truncates the token list to capacity tokens; in a production implementation, you would track dropped tokens for monitoring. The auxiliary loss computation follows the formulation from the Switch Transformer paper but is applied per-call rather than accumulated across steps.
Real-World Models Using MoE
| Model | Organisation | Total Parameters | Active Parameters (per token) | Total Experts | Active Experts (k) | Notes |
|---|---|---|---|---|---|---|
| Switch Transformer | Google | Up to 1.6T | ~1/N of MoE params | Up to 2048 | 1 | Early large-scale MoE transformer; k=1 routing |
| GLaM | Google | 1.2T | ~96B | 64 | 2 | Matched GPT-3 quality at 1/3 the training energy |
| GShard | Google | 600B | ~13B | 2048 | 2 | Multilingual translation; scaled to 2048 experts |
| Mixtral 8x7B | Mistral AI | 46.7B | ~12.9B | 8 | 2 | Open weights; matched LLaMA 2 70B at lower cost |
| Mixtral 8x22B | Mistral AI | 141B | ~39B | 8 | 2 | Strongest open-weights MoE at release in 2024 |
| GPT-4 | OpenAI | ~1.8T (rumoured) | ~220B (rumoured) | ~16 (rumoured) | 2 (rumoured) | Architecture unconfirmed; MoE widely reported by insiders |
| Grok-1 | xAI | 314B | ~86B | 8 | 2 | Open weights released March 2024; MoE confirmed |
| DeepSeek-V2 | DeepSeek | 236B | ~21B | 160 | 6 | Fine-grained MoE; also uses Multi-head Latent Attention |
| DeepSeek-V3 | DeepSeek | 671B | ~37B | 256 | 8 | Reported final-run training cost of about $5.6M; auxiliary-loss-free load balancing |
Note on GPT-4: OpenAI has not officially confirmed GPT-4's architecture. The MoE figures cited above originate from reporting by George Hotz and others, and should be treated as credible rumour rather than confirmed fact.
Advantages
- Compute efficiency at scale. Mixtral 8x7B matches LLaMA 2 70B in quality while activating about 12.9B parameters per token instead of 70B, roughly a fifth of the per-token compute. This is not a minor optimisation; at production scale it changes the economics entirely.
- Better scaling laws. MoE models follow more favourable scaling curves than dense models when parameter count is measured against compute budget. You get more capability per FLOP spent on training.
- Expert specialisation. Empirical studies show that individual experts develop preferences for particular token types: syntax-heavy text, mathematical expressions, code, specific languages. The model learns a natural division of labour.
- Parallelism-friendly architecture. Expert parallelism maps cleanly to multi-device setups. Each expert can live on a separate GPU or node, making very large models tractable to train and serve.
- Knowledge capacity. Total parameter count determines how much factual knowledge a model can store. MoE lets you grow this capacity cheaply, since adding experts does not increase per-token inference cost proportionally.
- Proven at frontier scale. Google, Mistral, xAI, and DeepSeek have all shipped MoE models, and OpenAI is widely reported to use the architecture as well. The technique has been validated across many independent training runs at different scales.
Limitations and Trade-offs
- Memory vs compute trade-off. The full model must be loaded into memory even though only a fraction of parameters are active per token. Serving Mixtral 8x7B requires loading all 46.7B parameters (roughly 93 GB of weights at 16-bit precision), not just the 12.9B that run for any given token. That is far more accelerator memory than a dense model with the same per-token compute would need.
- Communication costs in distributed inference. Serving an MoE model at scale with expert parallelism requires token-to-expert routing across devices, which introduces network latency. For latency-sensitive applications, this can be worse than a dense model served on a single large GPU.
- Training instability. MoE models are more sensitive to hyperparameters than dense models. The auxiliary loss coefficient, the learning rate schedule, and the warmup period all interact in complex ways. A misconfigured run can produce a collapsed model with poor quality.
- Fine-tuning difficulty. Full fine-tuning requires loading and updating all expert weights. PEFT methods that bypass expert weights may miss important domain adaptation. Routing distributions shift during fine-tuning and may diverge from the pre-training distribution in ways that hurt generalisation.
- Token dropping. When experts are overloaded, tokens are dropped. Dropped tokens receive no expert processing, which degrades output quality. Monitoring and minimising token dropping is essential for production systems.
- Reproducibility and debugging complexity. The non-deterministic routing (token permutations, capacity overflows) makes debugging MoE models harder than dense models. Bugs in the routing logic can silently degrade quality without obvious error signals.
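The memory versus compute trade-off in the first item can be put into numbers with a rough estimate. The sketch assumes 16-bit weights and the common rule of thumb of about 2 FLOPs per active parameter per token; it ignores activations and the KV cache, and the dense comparison model is hypothetical.

```python
def serving_footprint(total_params, active_params, bytes_per_param=2):
    """Return (weight memory in GB, approximate forward FLOPs per token)."""
    weight_gb = total_params * bytes_per_param / 1e9
    flops_per_token = 2 * active_params  # rule of thumb: 2 FLOPs per active weight
    return weight_gb, flops_per_token


# Mixtral 8x7B figures quoted in this post.
moe_gb, moe_flops = serving_footprint(total_params=46.7e9, active_params=12.9e9)

# Hypothetical dense model sized to match the MoE's active parameters.
dense_gb, dense_flops = serving_footprint(total_params=12.9e9, active_params=12.9e9)

print(f"MoE:   {moe_gb:.1f} GB of weights, {moe_flops / 1e9:.1f} GFLOPs per token")
print(f"Dense: {dense_gb:.1f} GB of weights, {dense_flops / 1e9:.1f} GFLOPs per token")
```

Both models cost the same compute per token, but the MoE needs more than three times the weight memory. That gap is the price paid for the extra capacity.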
Common Mistakes
- Ignoring the auxiliary loss entirely. Some practitioners omit the load balancing loss, assuming the model will naturally distribute load. It will not. Expert collapse is the default outcome without explicit regularisation. Always include the auxiliary loss, or a deliberate alternative such as an auxiliary-loss-free balancing scheme, and monitor expert utilisation during training.
- Setting the capacity factor too close to 1.0. A capacity factor of 1.0 means any imbalance causes token dropping. Real routing distributions are never perfectly uniform. Use at least 1.1, and prefer 1.25 as a starting point. Reduce only if memory is severely constrained.
- Applying PEFT only to attention and dense layers. LoRA or adapters applied exclusively to attention weights will not adapt the experts, which contain the bulk of domain-specific knowledge in an MoE model. Either fine-tune expert weights directly or apply LoRA to expert FFN weights as well.
- Confusing total parameters with active parameters. Reporting Mixtral 8x7B as a "7B model" is inaccurate (it has 46.7B parameters). Reporting it as a "46B model" overstates inference cost (only 12.9B parameters are active per token). Distinguish clearly between total parameter count (relevant for memory and storage) and active parameter count (relevant for compute and latency).
- Assuming MoE expert specialisation is guaranteed. Experts develop soft specialisation during training, but this is an emergent property, not a guaranteed one. If the auxiliary loss is too strong, experts become nearly identical to ensure balanced load, losing the benefit of specialisation.
- Underestimating the impact of token dropping at inference. Token dropping during training is a controlled regulariser. Token dropping during inference is a quality bug. Evaluate your model's drop rate on representative inference workloads and increase capacity factor if drops exceed 1-2%.
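The capacity factor and drop-rate points above can be made concrete with a small sketch. It assumes the common definition of expert capacity, ceil(capacity_factor × tokens × k / experts), and a first-come-first-served overflow rule; real frameworks differ in rounding and in which tokens they drop. The function names and the toy batch are hypothetical.

```python
import math
from collections import Counter

def expert_capacity(num_tokens, num_experts, k, capacity_factor):
    """Maximum number of token slots each expert accepts in one batch (assumed ceil rounding)."""
    return math.ceil(capacity_factor * num_tokens * k / num_experts)

def drop_rate(assignments, num_experts, k, capacity_factor):
    """Fraction of (token, expert) assignments that overflow expert capacity.

    assignments: one tuple of k expert indices per token, in batch order.
    Overflow is first-come-first-served, which is an assumption of this sketch.
    """
    capacity = expert_capacity(len(assignments), num_experts, k, capacity_factor)
    load = Counter()
    dropped = 0
    for experts in assignments:
        for e in experts:
            if load[e] < capacity:
                load[e] += 1
            else:
                dropped += 1
    return dropped / (len(assignments) * k)

# Toy batch: 8 tokens, 4 experts, k=1, with routing skewed towards expert 0.
toy = [(0,), (0,), (0,), (1,), (0,), (2,), (3,), (1,)]

print(expert_capacity(8, 4, 1, 1.0))   # 2 slots per expert
print(drop_rate(toy, 4, 1, 1.0))       # 0.25: two of expert 0's four tokens overflow
print(drop_rate(toy, 4, 1, 1.25))      # 0.125: capacity rises to 3, one token overflows
```

Running the same measurement on routing decisions logged from a representative inference workload gives the drop rate to compare against the 1-2% threshold.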
Best Practices
- Start with a well-validated configuration. If building on open-source infrastructure, start from a configuration that has already been shown to train stably, such as Mixtral's 8 experts with k=2, paired with commonly used defaults (a capacity factor around 1.25 and an auxiliary loss coefficient in the 0.01-0.02 range), and deviate only when you have a specific reason. Validated configurations save weeks of debugging.
- Monitor expert utilisation throughout training. Log the fraction of tokens routed to each expert at regular intervals. A healthy training run should show relatively uniform utilisation (no expert above 25-30% of load for 8-expert k=2). Early detection of imbalance allows you to adjust the auxiliary loss coefficient before the run completes.
- Tune the auxiliary loss coefficient carefully. Too high and experts become identical; too low and collapse occurs. Start at 0.01. If utilisation is still uneven after 10% of training, increase to 0.02. If experts are becoming near-identical (measured by cosine similarity of their weights), reduce to 0.005.
- Use expert parallelism for models with many experts. If you have 8 experts and 8 GPUs, assign one expert per GPU. This minimises cross-device communication. For models with more experts than GPUs, use expert groups and profile the all-to-all communication overhead carefully.
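- Prefer k=2 over k=1 for better training stability. k=1 routing (Switch Transformer style) is computationally cheaper but prone to instability. For most use cases with a small number of large experts, k=2 provides a better quality-stability balance and is the choice made by Mixtral and Grok-1. Fine-grained designs with many small experts, such as DeepSeek-V2 (k=6), activate more experts per token.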
- Use MoE when parameter count is the primary bottleneck. MoE is the right choice when you need to store more knowledge than a dense model can hold within your compute budget. If you need a small, fast, cheap model for latency-sensitive production, a well-distilled dense model is usually preferable. MoE excels at frontier-scale pretraining and large-scale inference services where throughput (tokens/second across many requests) matters more than per-request latency.
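The monitoring and auxiliary-loss advice above can be sketched in plain Python. The loss below assumes the Switch Transformer formulation, N × Σ f_i × P_i, where f_i is the fraction of assignments sent to expert i and P_i is the mean router probability for expert i; other implementations normalise differently. The helper names are hypothetical, and the 0.25 default threshold mirrors the 8-expert, k=2 guideline above.

```python
def expert_utilisation(assignments, num_experts):
    """Fraction of routed (token, expert) slots that went to each expert."""
    counts = [0] * num_experts
    for experts in assignments:
        for e in experts:
            counts[e] += 1
    total = sum(counts)
    return [c / total for c in counts]

def overloaded_experts(utilisation, threshold=0.25):
    """Indices of experts whose share of the load exceeds the threshold."""
    return [i for i, u in enumerate(utilisation) if u > threshold]

def load_balance_loss(assignments, router_probs, num_experts):
    """Switch-style auxiliary loss: num_experts * sum_i f_i * P_i.

    router_probs: one list of per-expert probabilities per token.
    The value is 1.0 when both routing and probabilities are uniform.
    """
    f = expert_utilisation(assignments, num_experts)
    num_tokens = len(router_probs)
    p = [sum(row[i] for row in router_probs) / num_tokens
         for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy data: 4 tokens, 4 experts, k=1.
balanced = [(0,), (1,), (2,), (3,)]
uniform_probs = [[0.25] * 4 for _ in balanced]
collapsed = [(0,)] * 4
skewed_probs = [[0.7, 0.1, 0.1, 0.1] for _ in collapsed]

print(load_balance_loss(balanced, uniform_probs, 4))         # 1.0
print(load_balance_loss(collapsed, skewed_probs, 4))         # well above 1.0
print(overloaded_experts(expert_utilisation(collapsed, 4)))  # [0]
```

In training, this scalar would be multiplied by the auxiliary loss coefficient and added to the language-modelling loss, while the utilisation vector is what you log at regular intervals.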
Frequently Asked Questions
Does GPT-4 really use MoE?
OpenAI has never officially confirmed GPT-4's architecture. The widespread belief that it uses MoE originates from claims made by George Hotz in June 2023, who said GPT-4 consists of 8 MoE experts each around 220B parameters, with 2 activated per token. This figure has been cited and repeated enough to become widely accepted, but it remains unverified by OpenAI. What we can say with confidence is that the compute economics and performance profile of GPT-4 are consistent with a large MoE architecture, and that OpenAI had access to all the prior MoE research that would make this choice natural.
Why doesn't Mixtral 8x7B have 56 billion parameters?
The "8x7B" naming is somewhat misleading. Mixtral 8x7B has 8 experts, each with roughly the FFN capacity of a 7B model. But the model is not simply 8 independent 7B models stacked together. The attention layers are shared across all experts, and there is only one set of attention weights per transformer layer, not 8. The total parameter count is approximately 46.7B because the non-MoE components (embeddings, attention, layer norms) are counted only once. Of those 46.7B parameters, roughly 12.9B are active for any given token (the shared components plus 2 of the 8 expert FFN blocks).
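The arithmetic can be checked with a short back-of-the-envelope sketch. The layer shapes below are assumptions taken from Mixtral 8x7B's publicly released model configuration rather than from this article, and the count ignores any implementation-specific buffers.

```python
# Assumed Mixtral 8x7B shapes (public model config; not stated in this article).
HIDDEN = 4096        # model (residual stream) width
FFN_HIDDEN = 14336   # inner width of each expert FFN
LAYERS = 32
HEADS = 32
KV_HEADS = 8         # grouped-query attention: fewer key/value heads
VOCAB = 32000
EXPERTS = 8
TOP_K = 2

head_dim = HIDDEN // HEADS

# Query and output projections are HIDDEN x HIDDEN; key and value
# projections are smaller because of grouped-query attention.
attention = 2 * HIDDEN * HIDDEN + 2 * HIDDEN * KV_HEADS * head_dim
expert_ffn = 3 * HIDDEN * FFN_HIDDEN   # gate, up and down projections
router = HIDDEN * EXPERTS              # one linear gate per layer
norms = 2 * HIDDEN                     # two normalisation layers per block

shared_per_layer = attention + router + norms
# Input embeddings, output head, and the final normalisation layer.
embeddings = 2 * VOCAB * HIDDEN + HIDDEN

def param_count(experts_counted):
    """Parameters when `experts_counted` expert FFNs per layer are included."""
    return LAYERS * (shared_per_layer + experts_counted * expert_ffn) + embeddings

total = param_count(EXPERTS)   # everything that must sit in memory
active = param_count(TOP_K)    # what a single token actually touches

print(f"total  = {total / 1e9:.1f}B")   # total  = 46.7B
print(f"active = {active / 1e9:.1f}B")  # active = 12.9B
```

Each expert FFN stack is about 5.6B parameters across all layers, so the "7B" in the name only makes sense once the shared attention and embedding weights are added to a single expert.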
Why can't I just run more experts for better quality?
Adding more experts helps only up to a point. First, more experts means more total parameters, which increases memory requirements even if active compute stays the same. Second, with fixed k, more experts means each expert sees fewer tokens per training step, which slows expert learning. Third, more experts increase all-to-all communication overhead in distributed settings. Fourth, load balancing becomes harder with more experts, as rarely selected experts may be poorly trained. DeepSeek-V2 showed that fine-grained MoE with many small experts (160 routed experts, k=6) can outperform coarse-grained MoE, but this comes with significant engineering complexity.
Is MoE better than dense for all tasks?
No. MoE is better when you need to maximise quality for a given training compute budget and can tolerate higher memory requirements. Dense models are preferable when you need the lowest possible inference latency (no routing overhead, no all-to-all communication), when you have very limited serving memory, when you need to fine-tune the model frequently on small datasets, or when you are operating at a scale where the memory bandwidth cost of a sparse model outweighs the compute savings. Many production deployments serve distilled dense models that were trained using larger MoE teacher models, combining the best of both approaches.
What is expert specialisation, and can I observe it?
Expert specialisation refers to the phenomenon where different experts in a trained MoE model develop preferences for different types of tokens. Studies of trained models have found that some experts preferentially handle punctuation and formatting, others handle numeric tokens, others activate for specific languages, and others handle domain-specific vocabulary. You can observe this by tracking, for each expert, which tokens most frequently route to it and analysing their linguistic properties. The degree of specialisation varies: models with stronger auxiliary loss (ensuring balance) tend to show weaker specialisation, while models with more lenient load balancing often develop more distinct expert personas.
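A minimal sketch of that tracking procedure follows. It assumes you can log, for each token, its string form and the expert indices the router selected; the function name and the toy log are hypothetical, and real analyses usually aggregate per layer over millions of tokens.

```python
from collections import Counter, defaultdict

def top_tokens_per_expert(routing_log, n=3):
    """Most frequent tokens routed to each expert.

    routing_log: iterable of (token_string, expert_indices) pairs.
    Returns {expert_index: [most common tokens, most frequent first]}.
    """
    per_expert = defaultdict(Counter)
    for token, experts in routing_log:
        for e in experts:
            per_expert[e][token] += 1
    return {
        e: [tok for tok, _ in counts.most_common(n)]
        for e, counts in sorted(per_expert.items())
    }

# Hypothetical routing log for one layer with k=1.
log = [
    (",", (0,)), (".", (0,)), (",", (0,)),
    ("42", (1,)), ("7", (1,)), ("42", (1,)),
    (",", (0,)), (".", (0,)),
]

print(top_tokens_per_expert(log, n=2))
# {0: [',', '.'], 1: ['42', '7']}
```

In this toy log, expert 0 looks like a punctuation specialist and expert 1 a numeric one; on a real model the same table is the starting point for the linguistic analysis described above.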
References
- Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). "Adaptive mixtures of local experts." Neural Computation, 3(1), 79-87. The original MoE paper.
- Shazeer, N., et al. (2017). "Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer." ICLR 2017. First application of MoE to large-scale NLP with LSTMs.
- Lepikhin, D., et al. (2021). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding." ICLR 2021. Scaled MoE to 600B parameters for multilingual translation.
- Fedus, W., Zoph, B., and Shazeer, N. (2022). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity." JMLR 2022. k=1 routing; demonstrated 1.6T parameter MoE transformers.
- Du, N., et al. (2022). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts." ICML 2022. 1.2T parameter MoE model matching GPT-3 at 1/3 the training energy.
- Zoph, B., et al. (2022). "ST-MoE: Designing Stable and Transferable Sparse Expert Models." arXiv:2202.08906. Comprehensive study of MoE training stability and fine-tuning.
- Jiang, A. Q., et al. (2024). "Mixtral of Experts." arXiv:2401.04088. Technical report for Mixtral 8x7B; first major open-weights MoE LLM.
- DeepSeek-AI (2024). "DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model." arXiv:2405.04434. Fine-grained MoE with 160 experts; introduces Multi-head Latent Attention.
- DeepSeek-AI (2024). "DeepSeek-V3 Technical Report." arXiv:2412.19437. 671B MoE model with auxiliary-loss-free load balancing and multi-token prediction.
- Puigcerver, J., et al. (2023). "From Sparse to Soft Mixtures of Experts." arXiv:2308.00951. Proposes Soft MoE as a fully differentiable alternative to hard top-k routing.
Key Takeaways
- MoE decouples model capacity from per-token compute. A model such as Mixtral 8x7B can hold 46.7B total parameters while activating only 12.9B per token. Total parameter count and active parameter count are two separate, independently important metrics.
- The router is the critical component. A well-trained router that achieves balanced expert utilisation is what separates a good MoE model from one that collapses to using a single expert. The auxiliary load balancing loss is not optional.
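- Top-2 routing is a practical default for coarse-grained MoE. With a handful of large experts, k=1 is cheaper but less stable, and k>2 gives diminishing returns at increasing compute cost; Mixtral and Grok-1 both use k=2. Fine-grained designs such as DeepSeek-V2 (160 routed experts, k=6) activate more, smaller experts instead.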
- Memory is the price you pay for compute efficiency. MoE models require loading all expert weights into memory even though only a fraction are active per token. This trade-off is worth it at large scale but may not be at smaller scales.
- Expert specialisation is emergent, not programmed. You do not explicitly assign domains to experts. The model learns its own division of labour through gradient descent. This specialisation is real and measurable, but it is fragile and can be destroyed by overly aggressive load balancing.
- MoE is now the dominant frontier architecture. GPT-4, Grok-1, Mixtral, and DeepSeek-V3 all use or are credibly reported to use MoE. Understanding this architecture is no longer optional for practitioners working at the frontier of language model engineering.
Related Articles