LLM Inference Optimization: Quantization, KV Cache, and Serving at Scale
Introduction
Running large language models in production is expensive. A single GPU can cost thousands of dollars per month, and models at the scale of Llama 70B require multiple GPUs just to fit in memory.
Inference latency is another challenge. Users expect responses in seconds, not minutes. But transformers process tokens sequentially, and large models can be frustratingly slow.
Throughput matters too. If your service handles thousands of requests per second, you need to maximize how many concurrent requests each GPU can handle.
This is where inference optimization comes in. Through techniques like quantization, KV cache management, continuous batching, and speculative decoding, you can:
- Reduce memory usage by 4-8×, allowing larger models on smaller GPUs.
- Increase throughput by 10-20× through better batching strategies.
- Decrease latency by 2-3× with smarter generation techniques.
- Lower costs by 5-10× by fitting more requests per GPU.
This post provides a comprehensive deep dive into production LLM inference optimization. You will learn the mathematics behind quantization, how KV cache works internally, advanced batching strategies, and how modern serving frameworks like vLLM and TGI achieve state-of-the-art performance.
The Inference Problem: Memory and Computation Bottlenecks
LLM inference has two distinct phases, each with different bottlenecks:
Phase 1: Prefill (Prompt Processing)
Process the entire input prompt in parallel. This is compute-bound because all tokens are processed simultaneously through matrix multiplications.
For a 2048-token prompt, this might take 100-500ms depending on model size and hardware.
Phase 2: Decode (Token Generation)
Generate output tokens one at a time. This is memory-bandwidth-bound because the model weights must be loaded from memory for each token.
For a 70B model with fp16 weights (140GB), generating each token requires reading those 140GB from GPU memory. Even on an A100 with 2TB/s bandwidth, this takes 70ms per token.
The bottleneck is memory access, not computation. The GPU compute units are mostly idle waiting for data.
Memory Breakdown
For a 70B parameter model in fp16:
- Model weights: 70B × 2 bytes = 140GB
- KV cache: Depends on batch size and sequence length. Can be 50-100GB.
- Activations: Temporary tensors during forward pass. 5-20GB.
- Total: 200-250GB for inference.
This does not fit on a single A100 (80GB). Even with multiple GPUs, memory is the limiting factor for throughput.
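The breakdown above is easy to sanity-check in a few lines of Python (the KV-cache and activation figures are the rough mid-range assumptions from the list, not measurements):

```python
def inference_memory_gb(params_b, bytes_per_param=2, kv_cache_gb=75, activations_gb=10):
    """Back-of-envelope inference footprint: weights + KV cache + activations."""
    weights_gb = params_b * bytes_per_param  # 1B params at 1 byte ≈ 1 GB
    return weights_gb + kv_cache_gb + activations_gb

print(inference_memory_gb(70))  # 225 GB in fp16: far beyond one 80GB A100
print(inference_memory_gb(70, bytes_per_param=0.5, kv_cache_gb=0, activations_gb=0))  # 35.0 GB of weights at 4-bit
```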
Quantization: Compressing Model Weights
Quantization reduces the precision of weights and activations from 16-bit floating point (fp16) to lower bit widths like int8, int4, or even lower.
Why Quantization Works
Neural network weights often cluster around zero with a Gaussian distribution. Many weights contribute little to the final output.
By representing these weights with fewer bits, we lose some precision but gain massive memory savings.
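To make this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy — a toy round-trip, not any production kernel:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # Gaussian weights near zero
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # 0.25: int8 takes a quarter of fp32 storage
print(bool(np.abs(w - w_hat).max() <= scale))  # True: error bounded by one step
```

Real schemes refine this with per-channel or per-group scales and zero points, but the core trade — fewer bits per weight for a bounded rounding error — is the same.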
Types of Quantization
Post-Training Quantization (PTQ)
Quantize a pre-trained model without retraining. Fast but can degrade quality.
Quantization-Aware Training (QAT)
Train the model with quantization in mind. Better quality but requires full retraining.
For LLMs, QAT is usually impractical due to training costs. PTQ is the standard approach.
Quantization Methods Deep Dive
1. GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
GPTQ quantizes to 4-bit or 3-bit per weight with minimal accuracy loss.
How GPTQ Works
- Pass calibration data through the model layer by layer.
- For each layer, compute the Hessian matrix (second derivatives of loss w.r.t. weights).
- Use the Hessian to determine which weights are most sensitive.
- Quantize less-sensitive weights more aggressively.
This is based on the Optimal Brain Quantization (OBQ) algorithm.
Mathematical Foundation
The quantization error for a weight \( w \) quantized to \( \hat{w} \) is:

\[ \delta w = \hat{w} - w \]

The impact on the loss \( L \) is approximated by the second-order Taylor expansion:

\[ \Delta L \approx \tfrac{1}{2}\, \delta w^\top H \,\delta w \]

Where \( H \) is the Hessian matrix (second derivatives of the loss w.r.t. the weights); the first-order term vanishes because a trained model sits near a local minimum. GPTQ minimizes this error when choosing quantization parameters.
Usage Example
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model with GPTQ quantization
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",  # Calibration dataset
    tokenizer=tokenizer
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config
)

# Model is now quantized to 4-bit
# 70B params × 0.5 bytes (4-bit) = 35GB instead of 140GB
2. AWQ (Activation-aware Weight Quantization)
AWQ observes that not all weights are equally important. Some channels have much larger activations and contribute more to the output.
Key Insight
Identify important weights by analyzing activation magnitudes. Protect important weights from quantization, quantize others more aggressively.
AWQ typically achieves better quality than GPTQ at the same bit width because it is activation-aware.
Implementation
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-70b-awq-4bit"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
3. GGUF (GPT-Generated Unified Format)
GGUF is the format used by llama.cpp for running models on CPUs and consumer GPUs.
Supports various quantization schemes:
- Q4_0: 4-bit quantization, no zero point
- Q4_K_M: 4-bit with k-quants (mixed precision)
- Q5_K_M: 5-bit k-quants
- Q8_0: 8-bit quantization
k-quants Explained
Different parts of the model use different bit widths. Attention layers might use 5-bit, while feed-forward layers use 4-bit.
Usage
# Using llama.cpp
./quantize llama-2-70b-fp16.gguf llama-2-70b-q4_k_m.gguf Q4_K_M
# Run inference
./main -m llama-2-70b-q4_k_m.gguf -p "Once upon a time"
Quantization Comparison
| Method | Bit Width | Quality Loss | Speed | Best For |
|---|---|---|---|---|
| FP16 (baseline) | 16-bit | 0% | 1× (base) | Maximum quality |
| INT8 | 8-bit | ~1% | 1.5-2× | Balanced |
| GPTQ-4bit | 4-bit | 2-5% | 2-3× | GPU, memory-limited |
| AWQ-4bit | 4-bit | 1-3% | 2-3× | GPU, best quality@4bit |
| GGUF Q4_K_M | 4-bit mixed | 2-4% | 2-3× | CPU, consumer GPUs |
| GPTQ-3bit | 3-bit | 5-10% | 3-4× | Extreme compression |
KV Cache: The Secret to Fast Autoregressive Decoding
During token generation, the model processes all previous tokens to generate the next one. Without optimization, this means recomputing the entire sequence every time.
The KV cache solves this by storing intermediate computations.
How Transformer Attention Works
Simplified self-attention:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]

Where:
- \( Q \) = Query matrix
- \( K \) = Key matrix
- \( V \) = Value matrix
For autoregressive generation, at step \( t \), we need to attend to all tokens \( 1, 2, \ldots, t \), including the current one.
The Naive Approach
Recompute \( K \) and \( V \) for all previous tokens at every step. The work per generated token grows linearly with the prefix, so total generation cost is \( O(n^2) \) in sequence length.
KV Caching
Store the \( K \) and \( V \) matrices for all previous tokens. When generating token \( t \):
- Compute new \( K_t \) and \( V_t \) for current token.
- Concatenate with cached \( K_{1:t-1} \) and \( V_{1:t-1} \).
- Compute attention using full \( K \) and \( V \).
- Cache the new \( K_t \) and \( V_t \).
This reduces per-token computation to \( O(n) \) but requires storing all past \( K \) and \( V \) values.
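The four steps above can be sketched with a toy single-head cache in NumPy (random weights, no batching or masking — just the append-and-attend mechanics):

```python
import numpy as np

def attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)          # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    x = rng.normal(size=d)               # current token's hidden state
    q_t, k_t, v_t = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k_t])  # append only the new K/V; reuse the rest
    V_cache = np.vstack([V_cache, v_t])
    out = attention(q_t, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached row per generated token
```

Each iteration projects only the current token; everything else comes from the cache.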
KV Cache Memory Footprint
For a model with:
- \( L \) layers
- \( h \) attention heads
- \( d_h \) head dimension
- \( n \) sequence length
- \( b \) batch size
KV cache size:

\[ \text{KV cache bytes} = 2 \times b \times n \times L \times h \times d_h \times \text{bytes per element} \]

(the leading 2 covers both \( K \) and \( V \)).

For Llama 70B with 80 layers, 64 heads, 128 head dim, batch size 32, sequence length 2048, fp16 (assuming full multi-head attention):

\[ 2 \times 32 \times 2048 \times 80 \times 64 \times 128 \times 2\,\text{bytes} \approx 160\,\text{GB} \]

This exceeds most GPU memory! This is why KV cache management is critical.
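As a sanity check on the cache-size arithmetic (assuming fp16 and full multi-head attention with these shapes; the second call uses Llama 2 70B's actual 8 grouped KV heads, covered below):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """2x covers K and V; one entry per layer, head, and token."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(32, 2048, 80, 64, 128) / 1024**3
gqa = kv_cache_bytes(32, 2048, 80, 8, 128) / 1024**3  # 8 KV heads instead of 64
print(mha)  # 160.0 GB: exceeds an 80GB A100 on its own
print(gqa)  # 20.0 GB after the 8x reduction
```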
KV Cache Optimization Strategies
1. KV Cache Quantization
Store KV cache in int8 instead of fp16. Reduces memory by 2×.
# Enable KV cache quantization in vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    kv_cache_dtype="fp8",  # Store the KV cache in 8-bit floating point
    quantization="awq"     # Optional: also quantize weights (needs an AWQ checkpoint)
)
2. Paged Attention
vLLM's key innovation: split KV cache into fixed-size blocks (pages) and manage them like virtual memory.
Benefits:
- No memory fragmentation.
- Share KV cache blocks between sequences (e.g., for beam search).
- Dynamic allocation as sequences grow.
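A toy allocator illustrates the page-table idea (block and pool sizes here are arbitrary; real vLLM also handles copy-on-write sharing between sequences):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: KV memory lives in fixed-size
    blocks, mapped per sequence through a block table like virtual memory."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> tokens stored so far

    def append(self, seq_id):
        """Reserve room for one token; grab a new block only on a boundary."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or none yet)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                      # 40 tokens need ceil(40/16) = 3 blocks
    cache.append("req-1")
print(len(cache.block_tables["req-1"]))  # 3: no over-allocation
cache.release("req-1")
print(len(cache.free_blocks))            # 64: every block back in the pool
```

Because blocks are allocated on demand and returned whole, no sequence has to reserve its maximum length up front, which is where the fragmentation savings come from.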
3. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Reduce the number of KV heads while keeping query heads the same.
Standard attention: 64 heads, each with its own K and V.
MQA: 64 query heads, but only 1 shared K and V.
GQA: 64 query heads, 8 KV heads (groups of 8 queries share KV).
GQA reduces KV cache by 8× with minimal quality loss. Used in Llama 2 70B.
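The grouping is just an index mapping from query heads to shared KV heads — a sketch with the Llama-2-70B-like numbers above:

```python
# Map each of 64 query heads to one of 8 shared KV heads (groups of 8).
num_q_heads, num_kv_heads = 64, 8
group_size = num_q_heads // num_kv_heads          # 8 query heads per KV head
kv_head_for_q = [q // group_size for q in range(num_q_heads)]

print(kv_head_for_q[:9])        # [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(len(set(kv_head_for_q)))  # 8: only 8 distinct K/V tensors are cached
```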
Continuous Batching: Maximizing Throughput
Traditional batching waits for all requests in a batch to finish before starting new ones. This wastes GPU cycles because sequences finish at different times.
The Problem with Static Batching
Imagine a batch of 8 requests. One generates 100 tokens, another generates 500 tokens. The GPU sits idle after the shorter requests finish, waiting for the longest one.
Continuous Batching (Iteration-Level Scheduling)
Instead of batch-level scheduling, schedule at the iteration level. As soon as a sequence finishes, replace it with a new request.
vLLM and TGI implement this:
- Start with a batch of requests.
- Generate one token for each active request.
- If any request finishes, remove it and add a new request from the queue.
- Repeat until queue is empty.
This keeps the GPU fully utilized and can increase throughput by 10-20×.
# vLLM automatically uses continuous batching
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")

# Requests can have varying lengths
prompts = [
    "Write a short poem",            # Might generate 50 tokens
    "Write a detailed essay on AI",  # Might generate 500 tokens
    "Hello",                         # Might generate 10 tokens
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
# vLLM handles batching and scheduling automatically
Speculative Decoding: Reducing Latency
Speculative decoding uses a small, fast "draft" model to generate candidate tokens, then a large "target" model to verify them in parallel.
How It Works
- Draft phase: Small model generates \( k \) candidate tokens autoregressively (e.g., 4 tokens).
- Verification phase: Large model verifies all \( k \) candidates in parallel.
- Accept candidates that match the large model's distribution.
- If a candidate is rejected, fall back to the large model's token and retry.
Why This Works
The small model is much faster (e.g., 10× faster). If it guesses correctly even 50% of the time, you save significant latency.
The large model verifies multiple tokens in parallel, which is faster than generating them one by one.
Mathematical Guarantee
Speculative decoding produces the exact same distribution as the target model. It is lossless in terms of output quality.
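A greedy toy version shows the accept/verify mechanics. Real speculative decoding uses rejection sampling over the two models' distributions to keep sampling exact; here draft_next and target_next are stand-in functions and verification is simple token matching:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step: the draft proposes k tokens,
    the target verifies them and contributes one token of its own."""
    # Draft phase: k cheap autoregressive guesses.
    candidates, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)
    # Verification phase: the target scores every position in one parallel
    # forward pass (emulated sequentially here); keep the matching prefix.
    accepted, ctx = [], list(prefix)
    for t in candidates:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token at the mismatch
    return accepted

# Stand-in "models": the draft agrees with the target except every 3rd token.
target = lambda ctx: len(ctx) % 10
draft = lambda ctx: len(ctx) % 10 if len(ctx) % 3 else (len(ctx) % 10) + 1
print(speculative_step(draft, target, prefix=[0]))  # [1, 2, 3]
```

Even with a mismatch at the third position, one verification pass emits three tokens: two accepted drafts plus the target's correction.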
Example
# Using speculative decoding (assisted generation) in transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.float16, device_map="auto"
)

# Load draft model (small)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Enable speculative decoding
outputs = target_model.generate(
    inputs,
    assistant_model=draft_model,
    do_sample=True,
    temperature=0.7
)
# Can be 2-3× faster for long generations
When Speculative Decoding Helps
- Long output sequences (>100 tokens).
- When draft model is good enough (>50% acceptance rate).
- When latency matters more than throughput.
For short sequences or high-throughput scenarios, the overhead may not be worth it.
vLLM: Production Serving Framework
vLLM is the state-of-the-art open-source inference engine.
Key Innovations
- PagedAttention: Efficient KV cache management.
- Continuous batching: Maximum GPU utilization.
- Optimized CUDA kernels: Faster attention and sampling.
- Tensor parallelism: Distribute models across multiple GPUs.
Setting Up vLLM
pip install vllm
# Serve a model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--dtype auto \
--max-model-len 4096
Advanced Configuration
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,        # Use 4 GPUs
    dtype="float16",
    max_model_len=4096,
    gpu_memory_utilization=0.95,   # Use 95% of GPU memory
    block_size=16,                 # PagedAttention block size
    quantization="awq",            # Requires an AWQ-quantized checkpoint
    kv_cache_dtype="fp8"           # Quantize KV cache
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
outputs = llm.generate(prompts, sampling_params)
Benchmarking vLLM
Compared to naive serving with HuggingFace Transformers:
- 10-20× higher throughput due to continuous batching.
- 2-5× better memory efficiency from PagedAttention.
- Lower latency at high concurrency.
Text Generation Inference (TGI): HuggingFace's Solution
TGI is HuggingFace's production serving framework, similar to vLLM.
Features
- Continuous batching.
- Tensor parallelism.
- Quantization support (GPTQ, AWQ, bitsandbytes).
- Flash Attention integration.
- Custom CUDA kernels.
Deployment
# Using Docker
docker run --gpus all --shm-size 1g -p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-70b-hf \
--num-shard 4 \
--quantize awq \
--max-batch-total-tokens 32768
API Usage
import requests
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Once upon a time",
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }
)
print(response.json()["generated_text"])
Flash Attention: Faster and More Memory-Efficient Attention
Standard attention has \( O(n^2) \) memory complexity. Flash Attention reduces memory to \( O(n) \) while also being faster: the compute is still \( O(n^2) \), but it avoids reading and writing the full attention matrix to slow GPU memory.
How Flash Attention Works
Instead of materializing the full attention matrix, Flash Attention computes attention in blocks (tiles) and fuses operations.
Benefits:
- 2-4× faster than standard attention.
- Linear memory usage (can handle very long sequences).
- No approximation—mathematically equivalent to standard attention.
# Enable Flash Attention in transformers
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    attn_implementation="flash_attention_2",  # Enable Flash Attention
    torch_dtype=torch.float16,
    device_map="auto"
)
Tensor Parallelism vs Pipeline Parallelism
For models too large for a single GPU, distribute across multiple GPUs.
Tensor Parallelism
Split individual layers across GPUs. Each GPU computes part of the layer, then communicates results.
Example: 70B model, 4 GPUs. Each GPU handles 17.5B parameters.
- Pros: Lower latency (all GPUs work on same batch).
- Cons: High communication overhead (all-reduce operations).
Pipeline Parallelism
Split model vertically. GPU 1 handles layers 1-20, GPU 2 handles 21-40, etc.
- Pros: Less communication.
- Cons: Higher latency (GPUs process sequentially). Pipeline bubbles waste compute.
In Practice
For inference, tensor parallelism is usually preferred because it keeps latency low; pipeline bubbles matter less in training, where throughput dominates.
Comparison: Serving Frameworks
| Framework | Throughput | Latency | Memory Efficiency | Best For |
|---|---|---|---|---|
| HF Transformers | Low | High | Poor | Research, prototyping |
| vLLM | Excellent | Good | Excellent | High-throughput serving |
| TGI | Excellent | Good | Excellent | HF ecosystem integration |
| TensorRT-LLM | Excellent | Excellent | Good | Lowest latency, NVIDIA GPUs |
| llama.cpp | Medium | Medium | Good | CPU, consumer GPUs, local |
Production Deployment Checklist
1. Choose the Right Quantization
- INT8 for minimal quality loss.
- AWQ-4bit for best quality at 4-bit.
- GPTQ-4bit for broader model support.
2. Enable KV Cache Optimization
- Use FP8 KV cache quantization.
- Configure appropriate max sequence length.
3. Configure Batching
- Use continuous batching (vLLM/TGI handle this).
- Tune max_batch_total_tokens based on GPU memory.
4. Set Up Monitoring
- Track GPU utilization.
- Monitor throughput (requests/second).
- Measure latency percentiles (p50, p95, p99).
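Percentiles are straightforward to compute from collected samples; a sketch with hypothetical latency numbers:

```python
import numpy as np

# Hypothetical per-request latency samples (ms) from a load test.
latencies_ms = np.array([120, 130, 135, 140, 150, 180, 210, 250, 400, 900])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(p50)  # 165.0 — the median hides the tail that p95/p99 expose
```

Averages look healthy here; the p99 reveals the slow outlier that a fraction of users actually experiences.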
- Watch memory usage.
5. Load Testing
Simulate production load before deployment.
# Load testing with locust
from locust import HttpUser, task

class LLMUser(HttpUser):
    @task
    def generate(self):
        self.client.post("/generate", json={
            "prompt": "Once upon a time",
            "max_tokens": 100
        })
Cost Optimization Strategies
1. Autoscaling
Scale GPU instances based on request volume.
2. Spot Instances
Use cloud spot instances for 50-70% cost savings. Handle interruptions gracefully.
3. Request Queuing
Queue requests during peak times rather than over-provisioning GPUs.
4. Model Selection
Use the smallest model that meets quality requirements. A well-prompted 13B model often outperforms a poorly-prompted 70B model.
Future of LLM Inference
- Multi-token prediction: Generate multiple tokens per forward pass.
- Better quantization: 2-bit and ternary quantization with minimal loss.
- Sparse models: Activate only subset of parameters per token (MoE).
- Hardware advances: Custom AI accelerators (Groq, Cerebras).
Conclusion
LLM inference optimization is the key to making large language models practical and cost-effective in production.
Through quantization, you can run 70B models on consumer GPUs. Through continuous batching and PagedAttention, you can serve hundreds of concurrent users on a single GPU. Through speculative decoding, you can reduce latency by 2-3×.
Modern serving frameworks like vLLM and TGI have made these optimizations accessible. You do not need to implement them from scratch—just understand the principles and configure them correctly.
The difference between naive inference and optimized inference can be 10-20× in throughput and 5-10× in cost. For production systems, this is not optional—it is essential.
Key Takeaways
- LLM inference is memory-bandwidth-bound during token generation.
- Quantization (GPTQ, AWQ) reduces model size by 4-8× with minimal quality loss.
- KV cache stores past key-value pairs to avoid recomputation.
- Continuous batching increases throughput by 10-20× over static batching.
- Speculative decoding can reduce latency by 2-3× for long sequences.
- vLLM's PagedAttention eliminates memory fragmentation in KV cache.
- Flash Attention is 2-4× faster than standard attention with linear memory.
- Use tensor parallelism for low-latency multi-GPU inference.
- vLLM and TGI are production-ready frameworks with built-in optimizations.
- Always benchmark on your specific workload before deployment.