LLM Inference Optimization: Quantization, KV Cache, and Serving at Scale
Introduction
Running large language models in production is expensive. A single GPU can cost thousands of dollars per month, and models at the scale of Llama 70B require multiple GPUs just to fit in memory.
Inference latency is another challenge. Users expect responses in seconds, not minutes. But transformers process tokens sequentially, and large models can be frustratingly slow.
Throughput matters too. If your service handles thousands of requests per second, you need to maximize how many concurrent requests each GPU can handle.
This is where inference optimization comes in. Through techniques like quantization, KV cache management, continuous batching, and speculative decoding, you can:
- Reduce memory usage by 4-8×, allowing larger models on smaller GPUs.
- Increase throughput by 10-20× through better batching strategies.
- Decrease latency by 2-3× with smarter generation techniques.
- Lower costs by 5-10× by fitting more requests per GPU.
This post provides a comprehensive deep dive into production LLM inference optimization. You will learn the mathematics behind quantization, how KV cache works internally, advanced batching strategies, and how modern serving frameworks like vLLM and TGI achieve state-of-the-art performance.
The Inference Problem: Memory and Computation Bottlenecks
LLM inference has two distinct phases, each with different bottlenecks:
Phase 1: Prefill (Prompt Processing)
Process the entire input prompt in parallel. This is compute-bound because all tokens are processed simultaneously through matrix multiplications.
For a 2048-token prompt, this might take 100-500ms depending on model size and hardware.
Phase 2: Decode (Token Generation)
Generate output tokens one at a time. This is memory-bandwidth-bound because the model weights must be loaded from memory for each token.
For a 70B model with fp16 weights (140GB), generating each token requires reading those 140GB from GPU memory. Even on an A100 with 2TB/s bandwidth, this takes 70ms per token.
The bottleneck is memory access, not computation. The GPU compute units are mostly idle waiting for data.
Memory Breakdown
For a 70B parameter model in fp16:
- Model weights: 70B × 2 bytes = 140GB
- KV cache: Depends on batch size and sequence length. Can be 50-100GB.
- Activations: Temporary tensors during forward pass. 5-20GB.
- Total: 200-250GB for inference.
This does not fit on a single A100 (80GB). Even with multiple GPUs, memory is the limiting factor for throughput.
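The breakdown above is easy to sanity-check in a few lines of Python (the KV-cache and activation figures are the rough mid-range assumptions from the list, not measurements):

```python
def inference_memory_gb(params_b, bytes_per_param=2, kv_cache_gb=75, activations_gb=10):
    """Back-of-envelope inference footprint: weights + KV cache + activations."""
    weights_gb = params_b * bytes_per_param  # 1B params at 1 byte ≈ 1 GB
    return weights_gb + kv_cache_gb + activations_gb

print(inference_memory_gb(70))  # 225 GB in fp16: far beyond one 80GB A100
print(inference_memory_gb(70, bytes_per_param=0.5, kv_cache_gb=0, activations_gb=0))  # 35.0 GB of weights at 4-bit
```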
Quantization: Compressing Model Weights
Quantization reduces the precision of weights and activations from 16-bit floating point (fp16) to lower bit widths like int8, int4, or even lower.
Why Quantization Works
Neural network weights often cluster around zero with a Gaussian distribution. Many weights contribute little to the final output.
By representing these weights with fewer bits, we lose some precision but gain massive memory savings.
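To make this concrete, here is a minimal sketch of symmetric per-tensor int8 quantization in NumPy — a toy round-trip, not any production kernel:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096).astype(np.float32)  # Gaussian weights near zero
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

print(q.nbytes / w.nbytes)  # 0.25: int8 takes a quarter of fp32 storage
print(bool(np.abs(w - w_hat).max() <= scale))  # True: error bounded by one step
```

Real schemes refine this with per-channel or per-group scales and zero points, but the core trade — fewer bits per weight for a bounded rounding error — is the same.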
Types of Quantization
Post-Training Quantization (PTQ)
Quantize a pre-trained model without retraining. Fast but can degrade quality.
Quantization-Aware Training (QAT)
Train the model with quantization in mind. Better quality but requires full retraining.
For LLMs, QAT is usually impractical due to training costs. PTQ is the standard approach.
Quantization Methods Deep Dive
1. GPTQ (Post-Training Quantization for Generative Pre-trained Transformers)
GPTQ quantizes to 4-bit or 3-bit per weight with minimal accuracy loss.
How GPTQ Works
- Pass calibration data through the model layer by layer.
- For each layer, compute the Hessian matrix (second derivatives of loss w.r.t. weights).
- Use the Hessian to determine which weights are most sensitive.
- Quantize less-sensitive weights more aggressively.
This is based on the Optimal Brain Quantization (OBQ) algorithm.
Mathematical Foundation
The quantization error for a weight \( w \) quantized to \( \hat{w} \) is:

\[ \delta w = \hat{w} - w \]

The impact on the loss \( L \) is approximated by the second-order Taylor expansion:

\[ \Delta L \approx \tfrac{1}{2}\, \delta w^\top H \,\delta w \]

Where \( H \) is the Hessian matrix (second derivatives of the loss w.r.t. the weights); the first-order term vanishes because a trained model sits near a local minimum. GPTQ minimizes this error when choosing quantization parameters.
Usage Example
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "meta-llama/Llama-2-70b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load model with GPTQ quantization
quantization_config = GPTQConfig(
    bits=4,
    dataset="c4",  # Calibration dataset
    tokenizer=tokenizer
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config
)

# Model is now quantized to 4-bit
# 70B params × 0.5 bytes (4-bit) = 35GB instead of 140GB
2. AWQ (Activation-aware Weight Quantization)
AWQ observes that not all weights are equally important. Some channels have much larger activations and contribute more to the output.
Key Insight
Identify important weights by analyzing activation magnitudes. Protect important weights from quantization, quantize others more aggressively.
AWQ typically achieves better quality than GPTQ at the same bit width because it is activation-aware.
Implementation
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
model_path = "meta-llama/Llama-2-70b-hf"
quant_path = "llama-70b-awq-4bit"
# Load model
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Quantize
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4}
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
model.save_quantized(quant_path)
3. GGUF (GPT-Generated Unified Format)
GGUF is the format used by llama.cpp for running models on CPUs and consumer GPUs.
Supports various quantization schemes:
- Q4_0: 4-bit quantization, no zero point
- Q4_K_M: 4-bit with k-quants (mixed precision)
- Q5_K_M: 5-bit k-quants
- Q8_0: 8-bit quantization
k-quants Explained
Different parts of the model use different bit widths. Attention layers might use 5-bit, while feed-forward layers use 4-bit.
Usage
# Using llama.cpp
./quantize llama-2-70b-fp16.gguf llama-2-70b-q4_k_m.gguf Q4_K_M
# Run inference
./main -m llama-2-70b-q4_k_m.gguf -p "Once upon a time"
Quantization Comparison
| Method | Bit Width | Quality Loss | Speed | Best For |
|---|---|---|---|---|
| FP16 (baseline) | 16-bit | 0% | 1× (base) | Maximum quality |
| INT8 | 8-bit | ~1% | 1.5-2× | Balanced |
| GPTQ-4bit | 4-bit | 2-5% | 2-3× | GPU, memory-limited |
| AWQ-4bit | 4-bit | 1-3% | 2-3× | GPU, best quality@4bit |
| GGUF Q4_K_M | 4-bit mixed | 2-4% | 2-3× | CPU, consumer GPUs |
| GPTQ-3bit | 3-bit | 5-10% | 3-4× | Extreme compression |
KV Cache: The Secret to Fast Autoregressive Decoding
During token generation, the model processes all previous tokens to generate the next one. Without optimization, this means recomputing the entire sequence every time.
The KV cache solves this by storing intermediate computations.
How Transformer Attention Works
Simplified self-attention:

\[ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V \]

Where:
- \( Q \) = Query matrix
- \( K \) = Key matrix
- \( V \) = Value matrix
For autoregressive generation, at step \( t \), we need to attend to all tokens \( 1, 2, \ldots, t \), including the current one.
The Naive Approach
Recompute \( K \) and \( V \) for all previous tokens at every step. The work per generated token grows linearly with the prefix, so total generation cost is \( O(n^2) \) in sequence length.
KV Caching
Store the \( K \) and \( V \) matrices for all previous tokens. When generating token \( t \):
- Compute new \( K_t \) and \( V_t \) for current token.
- Concatenate with cached \( K_{1:t-1} \) and \( V_{1:t-1} \).
- Compute attention using full \( K \) and \( V \).
- Cache the new \( K_t \) and \( V_t \).
This reduces per-token computation to \( O(n) \) but requires storing all past \( K \) and \( V \) values.
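The four steps above can be sketched with a toy single-head cache in NumPy (random weights, no batching or masking — just the append-and-attend mechanics):

```python
import numpy as np

def attention(q, K, V):
    """Single-head scaled dot-product attention for one query vector."""
    d = q.shape[-1]
    scores = K @ q / np.sqrt(d)          # one score per cached position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ V

rng = np.random.default_rng(0)
d = 8
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

K_cache = np.empty((0, d))
V_cache = np.empty((0, d))
for step in range(5):
    x = rng.normal(size=d)               # current token's hidden state
    q_t, k_t, v_t = Wq @ x, Wk @ x, Wv @ x
    K_cache = np.vstack([K_cache, k_t])  # append only the new K/V; reuse the rest
    V_cache = np.vstack([V_cache, v_t])
    out = attention(q_t, K_cache, V_cache)

print(K_cache.shape)  # (5, 8): one cached row per generated token
```

Each iteration projects only the current token; everything else comes from the cache.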
KV Cache Memory Footprint
For a model with:
- \( L \) layers
- \( h \) attention heads
- \( d_h \) head dimension
- \( n \) sequence length
- \( b \) batch size
KV cache size:

\[ \text{KV cache bytes} = 2 \times b \times n \times L \times h \times d_h \times \text{bytes per element} \]

(the leading 2 covers both \( K \) and \( V \)).

For Llama 70B with 80 layers, 64 heads, 128 head dim, batch size 32, sequence length 2048, fp16 (assuming full multi-head attention):

\[ 2 \times 32 \times 2048 \times 80 \times 64 \times 128 \times 2\,\text{bytes} \approx 160\,\text{GB} \]

This exceeds most GPU memory! This is why KV cache management is critical.
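As a sanity check on the cache-size arithmetic (assuming fp16 and full multi-head attention with these shapes; the second call uses Llama 2 70B's actual 8 grouped KV heads, covered below):

```python
def kv_cache_bytes(batch, seq_len, layers, kv_heads, head_dim, bytes_per_elem=2):
    """2x covers K and V; one entry per layer, head, and token."""
    return 2 * batch * seq_len * layers * kv_heads * head_dim * bytes_per_elem

mha = kv_cache_bytes(32, 2048, 80, 64, 128) / 1024**3
gqa = kv_cache_bytes(32, 2048, 80, 8, 128) / 1024**3  # 8 KV heads instead of 64
print(mha)  # 160.0 GB: exceeds an 80GB A100 on its own
print(gqa)  # 20.0 GB after the 8x reduction
```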
KV Cache Optimization Strategies
1. KV Cache Quantization
Store KV cache in int8 instead of fp16. Reduces memory by 2×.
# Enable KV cache quantization in vLLM
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    kv_cache_dtype="fp8",  # Store the KV cache in 8-bit floating point
    quantization="awq"     # Optional: also quantize weights (needs an AWQ checkpoint)
)
2. Paged Attention
vLLM's key innovation: split KV cache into fixed-size blocks (pages) and manage them like virtual memory.
Benefits:
- No memory fragmentation.
- Share KV cache blocks between sequences (e.g., for beam search).
- Dynamic allocation as sequences grow.
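A toy allocator illustrates the page-table idea (block and pool sizes here are arbitrary; real vLLM also handles copy-on-write sharing between sequences):

```python
class PagedKVCache:
    """Toy PagedAttention-style allocator: KV memory lives in fixed-size
    blocks, mapped per sequence through a block table like virtual memory."""
    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # sequence id -> list of physical block ids
        self.lengths = {}        # sequence id -> tokens stored so far

    def append(self, seq_id):
        """Reserve room for one token; grab a new block only on a boundary."""
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:  # current block is full (or none yet)
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool for reuse."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(40):                      # 40 tokens need ceil(40/16) = 3 blocks
    cache.append("req-1")
print(len(cache.block_tables["req-1"]))  # 3: no over-allocation
cache.release("req-1")
print(len(cache.free_blocks))            # 64: every block back in the pool
```

Because blocks are allocated on demand and returned whole, no sequence has to reserve its maximum length up front, which is where the fragmentation savings come from.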
3. Multi-Query Attention (MQA) and Grouped-Query Attention (GQA)
Reduce the number of KV heads while keeping query heads the same.
Standard attention: 64 heads, each with its own K and V.
MQA: 64 query heads, but only 1 shared K and V.
GQA: 64 query heads, 8 KV heads (groups of 8 queries share KV).
GQA reduces KV cache by 8× with minimal quality loss. Used in Llama 2 70B.
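The grouping is just an index mapping from query heads to shared KV heads — a sketch with the Llama-2-70B-like numbers above:

```python
# Map each of 64 query heads to one of 8 shared KV heads (groups of 8).
num_q_heads, num_kv_heads = 64, 8
group_size = num_q_heads // num_kv_heads          # 8 query heads per KV head
kv_head_for_q = [q // group_size for q in range(num_q_heads)]

print(kv_head_for_q[:9])        # [0, 0, 0, 0, 0, 0, 0, 0, 1]
print(len(set(kv_head_for_q)))  # 8: only 8 distinct K/V tensors are cached
```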
Continuous Batching: Maximizing Throughput
Traditional batching waits for all requests in a batch to finish before starting new ones. This wastes GPU cycles because sequences finish at different times.
The Problem with Static Batching
Imagine a batch of 8 requests. One generates 100 tokens, another generates 500 tokens. The GPU sits idle after the shorter requests finish, waiting for the longest one.
Continuous Batching (Iteration-Level Scheduling)
Instead of batch-level scheduling, schedule at the iteration level. As soon as a sequence finishes, replace it with a new request.
vLLM and TGI implement this:
- Start with a batch of requests.
- Generate one token for each active request.
- If any request finishes, remove it and add a new request from the queue.
- Repeat until queue is empty.
This keeps the GPU fully utilized and can increase throughput by 10-20×.
# vLLM automatically uses continuous batching
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-13b-hf")

# Requests can have varying lengths
prompts = [
    "Write a short poem",            # Might generate 50 tokens
    "Write a detailed essay on AI",  # Might generate 500 tokens
    "Hello",                         # Might generate 10 tokens
]

outputs = llm.generate(prompts, SamplingParams(temperature=0.7, max_tokens=512))
# vLLM handles batching and scheduling automatically
Speculative Decoding: Reducing Latency
Speculative decoding uses a small, fast "draft" model to generate candidate tokens, then a large "target" model to verify them in parallel.
How It Works
- Draft phase: Small model generates \( k \) candidate tokens autoregressively (e.g., 4 tokens).
- Verification phase: Large model verifies all \( k \) candidates in parallel.
- Accept candidates that match the large model's distribution.
- If a candidate is rejected, fall back to the large model's token and retry.
Why This Works
The small model is much faster (e.g., 10× faster). If it guesses correctly even 50% of the time, you save significant latency.
The large model verifies multiple tokens in parallel, which is faster than generating them one by one.
Mathematical Guarantee
Speculative decoding produces the exact same distribution as the target model. It is lossless in terms of output quality.
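A greedy toy version shows the accept/verify mechanics. Real speculative decoding uses rejection sampling over the two models' distributions to keep sampling exact; here draft_next and target_next are stand-in functions and verification is simple token matching:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """One greedy speculative-decoding step: the draft proposes k tokens,
    the target verifies them and contributes one token of its own."""
    # Draft phase: k cheap autoregressive guesses.
    candidates, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        candidates.append(t)
        ctx.append(t)
    # Verification phase: the target scores every position in one parallel
    # forward pass (emulated sequentially here); keep the matching prefix.
    accepted, ctx = [], list(prefix)
    for t in candidates:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            break
    accepted.append(target_next(ctx))  # target's own token at the mismatch
    return accepted

# Stand-in "models": the draft agrees with the target except every 3rd token.
target = lambda ctx: len(ctx) % 10
draft = lambda ctx: len(ctx) % 10 if len(ctx) % 3 else (len(ctx) % 10) + 1
print(speculative_step(draft, target, prefix=[0]))  # [1, 2, 3]
```

Even with a mismatch at the third position, one verification pass emits three tokens: two accepted drafts plus the target's correction.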
Example
# Using speculative decoding (assisted generation) in transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load target model (large)
target_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf", torch_dtype=torch.float16, device_map="auto"
)

# Load draft model (small)
draft_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16, device_map="auto"
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")
inputs = tokenizer("Once upon a time", return_tensors="pt").input_ids

# Enable speculative decoding
outputs = target_model.generate(
    inputs,
    assistant_model=draft_model,
    do_sample=True,
    temperature=0.7
)
# Can be 2-3× faster for long generations
When Speculative Decoding Helps
- Long output sequences (>100 tokens).
- When draft model is good enough (>50% acceptance rate).
- When latency matters more than throughput.
For short sequences or high-throughput scenarios, the overhead may not be worth it.
vLLM: Production Serving Framework
vLLM is the state-of-the-art open-source inference engine.
Key Innovations
- PagedAttention: Efficient KV cache management.
- Continuous batching: Maximum GPU utilization.
- Optimized CUDA kernels: Faster attention and sampling.
- Tensor parallelism: Distribute models across multiple GPUs.
Setting Up vLLM
pip install vllm
# Serve a model
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Llama-2-70b-hf \
--tensor-parallel-size 4 \
--dtype auto \
--max-model-len 4096
Advanced Configuration
from vllm import LLM, SamplingParams
llm = LLM(
    model="meta-llama/Llama-2-70b-hf",
    tensor_parallel_size=4,        # Use 4 GPUs
    dtype="float16",
    max_model_len=4096,
    gpu_memory_utilization=0.95,   # Use 95% of GPU memory
    block_size=16,                 # PagedAttention block size
    quantization="awq",            # Requires an AWQ-quantized checkpoint
    kv_cache_dtype="fp8"           # Quantize KV cache
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)
outputs = llm.generate(prompts, sampling_params)
Benchmarking vLLM
Compared to naive serving with HuggingFace Transformers:
- 10-20× higher throughput due to continuous batching.
- 2-5× better memory efficiency from PagedAttention.
- Lower latency at high concurrency.
Text Generation Inference (TGI): HuggingFace's Solution
TGI is HuggingFace's production serving framework, similar to vLLM.
Features
- Continuous batching.
- Tensor parallelism.
- Quantization support (GPTQ, AWQ, bitsandbytes).
- Flash Attention integration.
- Custom CUDA kernels.
Deployment
# Using Docker
docker run --gpus all --shm-size 1g -p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id meta-llama/Llama-2-70b-hf \
--num-shard 4 \
--quantize awq \
--max-batch-total-tokens 32768
API Usage
import requests
response = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Once upon a time",
        "parameters": {
            "max_new_tokens": 100,
            "temperature": 0.7,
            "top_p": 0.9
        }
    }
)
print(response.json()["generated_text"])
Flash Attention: Faster and More Memory-Efficient Attention
Standard attention has \( O(n^2) \) memory complexity. Flash Attention reduces memory to \( O(n) \) while also being faster: the compute is still \( O(n^2) \), but it avoids reading and writing the full attention matrix to slow GPU memory.
How Flash Attention Works
Instead of materializing the full attention matrix, Flash Attention computes attention in blocks (tiles) and fuses operations.
Benefits:
- 2-4× faster than standard attention.
- Linear memory usage (can handle very long sequences).
- No approximation—mathematically equivalent to standard attention.
# Enable Flash Attention in transformers
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    attn_implementation="flash_attention_2",  # Enable Flash Attention
    torch_dtype=torch.float16,
    device_map="auto"
)
Tensor Parallelism vs Pipeline Parallelism
For models too large for a single GPU, distribute across multiple GPUs.
Tensor Parallelism
Split individual layers across GPUs. Each GPU computes part of the layer, then communicates results.
Example: 70B model, 4 GPUs. Each GPU handles 17.5B parameters.
- Pros: Lower latency (all GPUs work on same batch).
- Cons: High communication overhead (all-reduce operations).
Pipeline Parallelism
Split model vertically. GPU 1 handles layers 1-20, GPU 2 handles 21-40, etc.
- Pros: Less communication.
- Cons: Higher latency (GPUs process sequentially). Pipeline bubbles waste compute.
In Practice
For inference, tensor parallelism is usually preferred because it keeps latency low; pipeline bubbles matter less in training, where throughput dominates.
Comparison: Serving Frameworks
| Framework | Throughput | Latency | Memory Efficiency | Best For |
|---|---|---|---|---|
| HF Transformers | Low | High | Poor | Research, prototyping |
| vLLM | Excellent | Good | Excellent | High-throughput serving |
| TGI | Excellent | Good | Excellent | HF ecosystem integration |
| TensorRT-LLM | Excellent | Excellent | Good | Lowest latency, NVIDIA GPUs |
| llama.cpp | Medium | Medium | Good | CPU, consumer GPUs, local |
Production Deployment Checklist
1. Choose the Right Quantization
- INT8 for minimal quality loss.
- AWQ-4bit for best quality at 4-bit.
- GPTQ-4bit for broader model support.
2. Enable KV Cache Optimization
- Use FP8 KV cache quantization.
- Configure appropriate max sequence length.
3. Configure Batching
- Use continuous batching (vLLM/TGI handle this).
- Tune max_batch_total_tokens based on GPU memory.
4. Set Up Monitoring
- Track GPU utilization.
- Monitor throughput (requests/second).
- Measure latency percentiles (p50, p95, p99).
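Percentiles are straightforward to compute from collected samples; a sketch with hypothetical latency numbers:

```python
import numpy as np

# Hypothetical per-request latency samples (ms) from a load test.
latencies_ms = np.array([120, 130, 135, 140, 150, 180, 210, 250, 400, 900])
p50, p95, p99 = np.percentile(latencies_ms, [50, 95, 99])
print(p50)  # 165.0 — the median hides the tail that p95/p99 expose
```

Averages look healthy here; the p99 reveals the slow outlier that a fraction of users actually experiences.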
- Watch memory usage.
5. Load Testing
Simulate production load before deployment.
# Load testing with locust
from locust import HttpUser, task

class LLMUser(HttpUser):
    @task
    def generate(self):
        self.client.post("/generate", json={
            "prompt": "Once upon a time",
            "max_tokens": 100
        })
Cost Optimization Strategies
1. Autoscaling
Scale GPU instances based on request volume.
2. Spot Instances
Use cloud spot instances for 50-70% cost savings. Handle interruptions gracefully.
3. Request Queuing
Queue requests during peak times rather than over-provisioning GPUs.
4. Model Selection
Use the smallest model that meets quality requirements. A well-prompted 13B model often outperforms a poorly-prompted 70B model.
Future of LLM Inference
- Multi-token prediction: Generate multiple tokens per forward pass.
- Better quantization: 2-bit and ternary quantization with minimal loss.
- Sparse models: Activate only subset of parameters per token (MoE).
- Hardware advances: Custom AI accelerators (Groq, Cerebras).
Conclusion
LLM inference optimization is the key to making large language models practical and cost-effective in production.
Through quantization, you can run 70B models on consumer GPUs. Through continuous batching and PagedAttention, you can serve hundreds of concurrent users on a single GPU. Through speculative decoding, you can reduce latency by 2-3×.
Modern serving frameworks like vLLM and TGI have made these optimizations accessible. You do not need to implement them from scratch—just understand the principles and configure them correctly.
The difference between naive inference and optimized inference can be 10-20× in throughput and 5-10× in cost. For production systems, this is not optional—it is essential.
Key Takeaways
- LLM inference is memory-bandwidth-bound during token generation.
- Quantization (GPTQ, AWQ) reduces model size by 4-8× with minimal quality loss.
- KV cache stores past key-value pairs to avoid recomputation.
- Continuous batching increases throughput by 10-20× over static batching.
- Speculative decoding can reduce latency by 2-3× for long sequences.
- vLLM's PagedAttention eliminates memory fragmentation in KV cache.
- Flash Attention is 2-4× faster than standard attention with linear memory.
- Use tensor parallelism for low-latency multi-GPU inference.
- vLLM and TGI are production-ready frameworks with built-in optimizations.
- Always benchmark on your specific workload before deployment.