Blogpost · February 22, 2026

Why Your LLM Application Feels Slow

Understanding Latency Bottlenecks in Production Large Language Model Applications

by Perivitta · 14 mins read · Intermediate

Introduction

You spend weeks building an AI-powered application. The model gives great answers. Then users start complaining that it feels slow. You look at the model's response time and it seems fine — so what is the problem?

Here is the key insight: LLM latency is usually not the model's fault. In most production systems, the model itself is only one of many stages that add delay. The real bottlenecks are architectural — how requests flow through your system, which operations happen sequentially when they could happen in parallel, and whether you are streaming output or waiting for the full response.

This article breaks down where latency actually comes from and how to fix it.


The Three Latency Metrics That Actually Matter

Before you can optimise latency, you need to know which type of latency you are dealing with. LLM latency decomposes into three distinct metrics, each with different causes and different fixes.

Metric | Full Name | Definition | Target
TTFT | Time to First Token | Time from request sent to first token received | < 500ms
TBT | Time Between Tokens | Time between each successive output token | < 50ms
E2E | End-to-End Latency | Total time: TTFT + TBT × (number of output tokens − 1) | Context-dependent

Understanding which metric is degraded tells you where to look:

  • High TTFT — the model is taking too long to start. Usually caused by long input prompts (large context windows take longer to process) or slow retrieval before the model call. This is what users experience as "thinking time" or "the AI is loading."
  • High TBT — the model is generating tokens slowly. Usually a hardware or model efficiency issue (GPU memory bandwidth). Affects how smooth streaming feels.
  • High E2E — the total wall-clock time. For short responses, TTFT dominates. For long responses, TBT accumulates and dominates.
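
To make the last point concrete, here is a small arithmetic sketch. The figures (400ms TTFT, 40ms TBT) are illustrative assumptions, not measurements, and the helper names are made up for this example. It counts one TBT gap per token after the first, since the first token is already covered by TTFT.

```python
def e2e_latency_ms(ttft_ms: float, tbt_ms: float, output_tokens: int) -> float:
    """Estimate end-to-end latency: TTFT plus one TBT gap per token after the first."""
    return ttft_ms + tbt_ms * max(output_tokens - 1, 0)


def ttft_share(ttft_ms: float, tbt_ms: float, output_tokens: int) -> float:
    """Fraction of the end-to-end time spent waiting for the first token."""
    return ttft_ms / e2e_latency_ms(ttft_ms, tbt_ms, output_tokens)


# Assumed, illustrative values: 400 ms TTFT and 40 ms between tokens.
short_reply = e2e_latency_ms(400, 40, output_tokens=5)    # 400 + 40 * 4
long_reply = e2e_latency_ms(400, 40, output_tokens=500)   # 400 + 40 * 499

print(f"short reply: {short_reply:.0f}ms, TTFT share {ttft_share(400, 40, 5):.0%}")
print(f"long reply:  {long_reply:.0f}ms, TTFT share {ttft_share(400, 40, 500):.0%}")
```

With these assumed numbers, TTFT is most of the wait for the 5-token reply and only a small fraction of it for the 500-token reply.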

Measure all three in your own application before optimising. Here is a code snippet that captures them using streaming:

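import time
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
ttft = None
start = time.perf_counter()

with client.messages.stream(
    model="claude-opus-4-5",
    max_tokens=512,
    messages=[{"role": "user", "content": "Explain transformers in 3 paragraphs."}]
) as stream:
    for text in stream.text_stream:
        if ttft is None:
            # The first text chunk has arrived: this is the time to first token.
            ttft = time.perf_counter() - start
    # A streamed text chunk is not necessarily one token, so take the real
    # output token count from the usage data on the final message.
    tokens = stream.get_final_message().usage.output_tokens

e2e = time.perf_counter() - start
if ttft is None:
    ttft = e2e  # no text was streamed at all

# Average gap between tokens after the first one.
tbt = (e2e - ttft) / max(tokens - 1, 1)

print(f"TTFT: {ttft*1000:.0f}ms | TBT: {tbt*1000:.0f}ms | E2E: {e2e*1000:.0f}ms")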

Where Latency Actually Comes From

An LLM application is not a single API request. It is a pipeline of stages that each add delay. A typical request flows through:

Request Reception
Authentication and Validation
Embedding Generation (if using retrieval augmentation)
Vector Database Search
Context Assembly
Model Inference
Post-processing
Response Delivery

Even if each stage is fast, the delays accumulate. And if the stages run sequentially, each waiting for the previous one to finish before starting, the total delay is the sum of all stages.
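
One way to see where the time goes is to time each stage separately. The following is a minimal sketch, not a full tracing setup: the `timed` helper and the stage names are hypothetical, and the `time.sleep` calls stand in for real work.

```python
import time
from contextlib import contextmanager

stage_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_ms[stage] = (time.perf_counter() - start) * 1000


# Hypothetical stand-ins for real stages; replace the sleeps with actual calls.
with timed("validation"):
    time.sleep(0.005)
with timed("vector_search"):
    time.sleep(0.010)
with timed("model_inference"):
    time.sleep(0.020)

# Print the stages from slowest to fastest with their share of the total.
total = sum(stage_ms.values())
for stage, ms in sorted(stage_ms.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{stage:<16} {ms:7.1f}ms  ({ms / total:.0%} of total)")
```

In a real service the same measurements would more likely go to a metrics or tracing system than to `print`, but even this rough breakdown usually shows which stage deserves attention first.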

Figure: A directed acyclic graph (DAG) represents the dependency structure of an LLM request pipeline. Each node contributes latency, and the critical path from input to output determines total response time. Optimising the slowest stage on that path, often retrieval or the prefill phase, has the highest impact on end-to-end latency. Source: David W. / Wikimedia Commons (Public Domain)

Bottleneck 1: Long Prompts Slow Down the Model

The LLM takes longer to start generating when it has a larger input to process. This is the prefill phase — before the model can output its first token, it must read and process your entire prompt.

Every extra token in your prompt adds to TTFT:

TTFT ≈ f(number of input tokens), where f increases with prompt length

Common sources of unnecessarily large inputs:

  • Sending the full conversation history when only the last few messages are relevant
  • Inserting complete documents when only one section is needed
  • Repeating lengthy system instructions in every request
  • Inserting verbose tool outputs without trimming them first

What to do about it

  • Use rolling summarisation for long conversations instead of full history.
  • Retrieve only the relevant chunk from a document — not the whole thing.
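  • Use prompt caching for large, repeated system prompts, so the provider can reuse the already-processed prefix instead of reprocessing it on every request.
  • Set a strict output length limit with max_tokens. This does not shorten TTFT, but it caps how long generation can run and therefore bounds E2E latency.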
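
As a simple alternative to summarisation, the history can be trimmed to a fixed token budget. This is a rough sketch under stated assumptions: `estimate_tokens` and `trim_history` are hypothetical helpers, and the 4-characters-per-token estimate is a crude approximation that a real tokenizer should replace.

```python
def estimate_tokens(text: str) -> int:
    """Very rough estimate (about 4 characters per token); use a real tokenizer in practice."""
    return max(1, len(text) // 4)


def trim_history(messages: list[dict], budget_tokens: int) -> list[dict]:
    """Keep the most recent messages whose combined estimate fits the budget."""
    kept: list[dict] = []
    used = 0
    for message in reversed(messages):          # walk from newest to oldest
        cost = estimate_tokens(message["content"])
        if used + cost > budget_tokens:
            break
        kept.append(message)
        used += cost
    return list(reversed(kept))                 # restore chronological order


history = [
    {"role": "user", "content": "a" * 400},       # ~100 estimated tokens
    {"role": "assistant", "content": "b" * 400},  # ~100 estimated tokens
    {"role": "user", "content": "c" * 80},        # ~20 estimated tokens
]
trimmed = trim_history(history, budget_tokens=130)
print([m["content"][0] for m in trimmed])  # the oldest message no longer fits
```

A production version would also need to preserve the system prompt and, if your API requires it, make sure the trimmed list still starts with a user turn.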

Bottleneck 2: Network Overhead Accumulates

Network latency feels small per hop — but in a distributed AI system, there can be many hops.

Sources of network overhead:

  • TLS handshake overhead on every new connection
  • Cross-region latency if your inference service is in a different region from your application
  • Cold start delays in serverless environments (the function needs to initialise before it can respond)
  • Sequential API chaining — calling one service, waiting for the response, then calling another

What to do about it

  • Maintain persistent connections to frequently used services instead of opening a new connection each time.
  • Deploy inference services in the same region as your application server.
  • Warm serverless functions to avoid cold start penalties for time-sensitive endpoints.
  • Combine multiple small API calls into fewer larger calls where possible.
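
The first point can be sketched with the standard library alone. This is an illustration under assumptions, not a production client: `get_connection` and `get_status` are hypothetical helpers, the connection object is not thread-safe, and there is no retry if the server drops an idle connection. A pooled HTTP client such as `requests.Session` or `httpx.Client` covers the same idea more robustly.

```python
import http.client

# One shared connection object per host, created on first use.
_connections: dict[str, http.client.HTTPSConnection] = {}


def get_connection(host: str) -> http.client.HTTPSConnection:
    """Return the shared connection for a host instead of creating one per call."""
    conn = _connections.get(host)
    if conn is None:
        # No network traffic happens here; the socket opens on the first request.
        conn = http.client.HTTPSConnection(host, timeout=10)
        _connections[host] = conn
    return conn


def get_status(host: str, path: str) -> int:
    """Send a GET over the shared connection, so the TLS handshake is not repeated per call."""
    conn = get_connection(host)
    conn.request("GET", path)
    response = conn.getresponse()
    response.read()  # drain the body so the connection can be reused
    return response.status
```

Calling `get_status` repeatedly for the same host reuses the underlying connection as long as the server keeps it alive, which avoids paying the TCP and TLS setup cost on every request.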

Bottleneck 3: Retrieval Latency in RAG Systems

If you are using RAG (Retrieval-Augmented Generation), retrieval adds latency before the model call even begins. A typical RAG request involves:

  • Embedding the user's query (an API call)
  • Searching the vector database (a database query)
  • Optional reranking (another model call)
  • Assembling the context from retrieved chunks

All of this happens before the LLM sees anything. In poorly designed systems, retrieval can take longer than the LLM inference itself.

Common retrieval mistakes that add latency

  • Re-embedding the same query every time instead of caching frequent query embeddings
  • Retrieving far more documents than needed (top-50 when top-5 would suffice)
  • Running retrieval synchronously before any other processing begins
  • Using hybrid search (dense + keyword) for every query when semantic search alone would suffice

What to do about it

  • Cache embeddings for frequently repeated queries.
  • Limit retrieval top-k aggressively — start with 5 and increase only if quality suffers.
  • Start retrieval as soon as the user message arrives, in parallel with authentication and validation rather than after them, and discard the results if either check fails.
  • Pre-trim document chunks during indexing so you never send oversized context to the model.

Bottleneck 4: Sequential Pipeline Execution

This is one of the most common and fixable causes of slow LLM applications. Many early implementations run every step sequentially:

  1. Validate the request. (50ms)
  2. Generate the query embedding. (100ms)
  3. Search the vector database. (80ms)
  4. Call the LLM. (2000ms)
  5. Log the result. (30ms)

Total: 2,260ms. But steps 1 and 2 do not depend on each other, so they can run at the same time. Step 5 does not need to block the response either: logging can happen asynchronously after the answer is returned. With parallelisation:

  1. Validate + begin query embedding simultaneously.
  2. As soon as embedding is ready, start vector search.
  3. Call the LLM.
  4. Log asynchronously in the background.

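Total: roughly 2,180ms (100ms for validation and embedding in parallel, 80ms for search, 2,000ms for the LLM call, with logging off the critical path). The saving is small next to the LLM call, but it applies to every request.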
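
The parallel version can be sketched with asyncio. This is a toy illustration, not a real pipeline: `stage` is a hypothetical stand-in that sleeps instead of doing I/O, and the delays are the illustrative figures from the list above scaled down by 10× so the demo finishes quickly.

```python
import asyncio

events: list[str] = []  # records the order in which things finish


async def stage(name: str, seconds: float, result=None):
    """Hypothetical stand-in for a real pipeline stage; sleeps instead of doing I/O."""
    await asyncio.sleep(seconds)
    events.append(name)
    return result


async def handle_request(question: str):
    # Validation and embedding are independent, so run them concurrently.
    _, embedding = await asyncio.gather(
        stage("validate", 0.005),
        stage("embed", 0.010, result=[0.1, 0.2]),
    )
    # Search needs the embedding, and the LLM needs the search results.
    chunks = await stage("search", 0.008, result=[f"chunk for {len(embedding)}-dim query"])
    answer = await stage("llm", 0.200, result=f"answer using {len(chunks)} chunk(s)")
    # Logging is started but not awaited here, so it stays off the critical path.
    log_task = asyncio.create_task(stage("log", 0.003))
    return answer, log_task


async def main() -> str:
    answer, log_task = await handle_request("why is it slow?")
    events.append("responded")
    await log_task  # keep a reference and let the background task finish before exit
    return answer


answer = asyncio.run(main())
print(answer)
print(events)
```

The recorded order shows the response being returned before the log stage completes. In a long-running server the background task would be tracked rather than awaited at the end of the request.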

Design principles for lower latency

  • Start retrieval as early as possible — do not wait for unrelated validation steps.
  • Stream model output immediately after the first token arrives.
  • Move logging, analytics, and memory updates to asynchronous background tasks.
  • Identify which pipeline stages genuinely depend on each other and which can run in parallel.

Streaming: The Most Important User Experience Fix

Streaming means the model starts sending words as it generates them, instead of waiting until the full response is ready. This is one of the most impactful changes you can make to how your application feels.

Here is why it matters: humans are very sensitive to silence. If an app does not respond for 5 seconds, it feels broken — even if the final answer arrives at second 5 and is excellent. With streaming, the user sees words appearing at second 1, second 2, second 3 — the experience feels responsive even though the total generation time has not changed.

Streaming should not be treated as a "nice to have" feature. For any interactive application, it is a fundamental design requirement. Users who see nothing for 5 seconds will assume the app is broken and refresh or leave.
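
On the server side, streaming mostly means forwarding each chunk as soon as it arrives instead of collecting the full response first. The sketch below shows that relay pattern using the Server-Sent Events wire format. `fake_model_stream` is a hypothetical stand-in for a streaming model client, and the `[DONE]` marker is a common convention rather than part of the SSE format. It also assumes chunks contain no newlines, which would need to be split across several `data:` lines.

```python
import time
from typing import Iterable, Iterator


def fake_model_stream(text: str, delay_s: float = 0.01) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model client: yields one word at a time."""
    for word in text.split():
        time.sleep(delay_s)  # simulates the time between tokens
        yield word + " "


def to_sse(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap each chunk as a Server-Sent Events message the moment it arrives."""
    for chunk in chunks:
        yield f"data: {chunk}\n\n"
    yield "data: [DONE]\n\n"  # end-of-stream marker (a convention, not part of SSE)


start = time.perf_counter()
first_event_ms = None
for event in to_sse(fake_model_stream("Streaming shows progress immediately")):
    if first_event_ms is None:
        first_event_ms = (time.perf_counter() - start) * 1000
    print(repr(event))
total_ms = (time.perf_counter() - start) * 1000
print(f"first event after {first_event_ms:.0f}ms, full response after {total_ms:.0f}ms")
```

Because `to_sse` is a generator, the first event is emitted after a single chunk delay while the total time stays the same, which is exactly the perceived-latency gain described above.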


Key Takeaways

  • Latency is primarily a system architecture problem — audit the full pipeline before blaming the model.
  • TTFT (time to first token) dominates user perception of speed — optimise for it specifically, not total generation time.
  • In RAG systems, retrieval is frequently the dominant bottleneck — cache embeddings and limit top-k retrieval aggressively.
  • Parallelise independent pipeline stages and make streaming non-negotiable for interactive applications.

Conclusion

When your LLM application feels slow, the instinct is to blame the model. But the model is usually not the problem. The real causes are architectural: sequential stages that could run in parallel, large inputs that slow down the prefill phase, retrieval pipelines that add seconds before the model even starts, and missing streaming that makes users wait in silence.

Optimising latency means auditing your pipeline as a whole, not just the model call. Measure TTFT, TBT, and E2E separately. Find the actual bottleneck. Then fix that — not the model.

Speed matters as much as accuracy in real-world applications. A slow but accurate app will be abandoned. A fast and accurate app will be used.

