LoRA and QLoRA: Fine-Tuning LLMs on a Single GPU
Introduction
Every team deploying an LLM eventually hits the same wall. The base model is capable but not specialised: it answers in the wrong tone, uses the wrong terminology, ignores domain conventions, or hallucinates domain-specific facts that a fine-tuned model would know to retrieve or refuse. The obvious fix is fine-tuning. The obvious problem is cost.
Full fine-tuning of a 7-billion-parameter model in fp32 requires approximately 112 GB of VRAM just to hold the weights, gradients, and optimiser states simultaneously. That is more than a single A100 80 GB card can hold, so the job has to be spread across several GPUs. At cloud pricing, a single training run costs hundreds of dollars. Iteration is slow and expensive.
LoRA (Low-Rank Adaptation) changes these economics. Instead of updating all 7 billion parameters, LoRA freezes the pretrained weights and trains a small pair of matrices per adapted layer whose product approximates the weight update. The number of trainable parameters drops by roughly 99%. QLoRA compounds this by loading the frozen base model in 4-bit precision, cutting the memory taken by the base weights by about 75% relative to 16-bit storage. The result: fine-tuning a 7B model on a single RTX 4090 in an afternoon for under $10.
1. Fine-Tuning vs RAG: Choosing the Right Tool
Fine-tuning and retrieval-augmented generation solve different problems. Using the wrong one wastes time and money.
| Problem | Better fit | Why |
|---|---|---|
| Model needs access to a large, changing knowledge base | RAG | Knowledge updates without retraining; retrieval is the bottleneck, not model weights |
| Model outputs in the wrong format or style | Fine-tuning | Style and format are learned in weights; RAG cannot change output structure |
| Model makes factual errors on domain knowledge | RAG first, fine-tuning if RAG fails | RAG is faster to implement; fine-tuning for facts baked deeply into the domain |
| Model needs to follow a rigid instruction schema consistently | Fine-tuning | Prompt engineering is fragile at scale; fine-tuning is more robust |
| Latency is critical (no retrieval round-trip tolerated) | Fine-tuning | RAG adds a retrieval step; fine-tuned model answers directly |
| Budget is tight and knowledge base is small | RAG | Cheaper to maintain a small vector store than to retrain periodically |
2. Why Full Fine-Tuning Is Unaffordable
Training a model updates its parameters using gradients. For each parameter, the optimiser (e.g. AdamW) also stores two moment estimates: a running mean and a running variance of the gradient. The total VRAM footprint per parameter in fp32 training is:
\[\underbrace{4}_{\text{weight}} + \underbrace{4}_{\text{gradient}} + \underbrace{8}_{\text{AdamW moments}} = 16\text{ bytes per parameter}\]
For a 7B parameter model: \(7 \times 10^9 \times 16 = 112\text{ GB}\). A single A100 80 GB cannot hold this. Mixed-precision training (bf16/fp16) keeps an fp32 master copy of the weights alongside the 16-bit model, so the total footprint remains close to 112 GB. Even a pure 16-bit regime without a master copy needs roughly 56 GB for parameters, gradients, and optimiser state alone. On an 80 GB card that leaves only 24 GB for activations, which is insufficient for typical batch sizes and sequence lengths, so in practice you still need multiple GPUs or significant memory optimisation.
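The arithmetic above is easy to reproduce. The following is a minimal sketch, not a profiler: the function name `training_memory_gb` is made up for illustration, 1 GB is taken as 10^9 bytes, and activations, CUDA context, and fragmentation are deliberately ignored.

```python
def training_memory_gb(n_params: float, weight_bytes: int,
                       grad_bytes: int, optim_bytes: int) -> float:
    """Static training footprint in GB (1 GB = 1e9 bytes), excluding activations."""
    return n_params * (weight_bytes + grad_bytes + optim_bytes) / 1e9


N_PARAMS_7B = 7e9

# fp32: 4-byte weight + 4-byte gradient + two 4-byte AdamW moments
fp32_gb = training_memory_gb(N_PARAMS_7B, 4, 4, 8)

# pure 16-bit, no fp32 master copy: 2 + 2 + two 2-byte moments
pure16_gb = training_memory_gb(N_PARAMS_7B, 2, 2, 4)

print(f"fp32 full fine-tuning: {fp32_gb:.0f} GB")
print(f"pure 16-bit:           {pure16_gb:.0f} GB")
print(f"left on an 80 GB card: {80 - pure16_gb:.0f} GB for activations")
```

Swapping in a different parameter count gives a quick first estimate of whether full fine-tuning is even worth attempting on the hardware at hand.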
LoRA's insight is that the weight updates themselves are low-rank. When you fine-tune a model, the matrices do not change arbitrarily. The effective change lives in a much smaller subspace. LoRA exploits this by constraining updates to that subspace explicitly.
3. LoRA: The Mathematics
For a weight matrix \(W_0 \in \mathbb{R}^{d \times k}\), instead of learning a full \(\Delta W \in \mathbb{R}^{d \times k}\), LoRA constrains the update to a product of two small matrices:
\[W = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)\]
During training, \(W_0\) is frozen and only \(B\) and \(A\) receive gradient updates. Initialisation is critical: \(A\) is initialised from a random Gaussian distribution and \(B\) is initialised to zero, so that \(BA = 0\) at the start of training. This guarantees that the fine-tuned model behaves identically to the pretrained model at step zero, with no random-noise injection into the forward pass. The rank \(r\) controls the expressiveness of the update: a higher rank can represent more complex changes but adds trainable parameters, memory, and training time. For inference, \(BA\) can be merged into \(W_0\), so a merged model carries no extra latency. The adapter disappears into the base weights.
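The three properties described above (zero-initialised update, merge equivalence, and the parameter saving) can be checked on a toy layer. This is an illustrative sketch in plain Python so that it runs without a GPU or any dependencies; the helper names (`matmul`, `add`, `lora_forward`, `lora_param_count`) and the toy sizes are assumptions, the "training" step is simulated by writing arbitrary values into `B`, and the \(\alpha / r\) scaling factor is omitted for brevity.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def add(X, Y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_forward(W0, B, A, x):
    """Unmerged path: W0 x + B (A x). The d x k update is never materialised."""
    return add(matmul(W0, x), matmul(B, matmul(A, x)))


def lora_param_count(d, k, r):
    """Trainable parameters in B (d x r) plus A (r x k)."""
    return r * (d + k)


random.seed(0)
d, k, r = 6, 4, 2  # toy sizes for illustration only

W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen, d x k
A = [[random.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # r x k, Gaussian init
B = [[0.0] * r for _ in range(d)]                                   # d x r, zero init
x = [[random.gauss(0, 1)] for _ in range(k)]                        # input column, k x 1

# Step zero: B = 0, so the adapted layer reproduces the pretrained output.
base_out = matmul(W0, x)
out_at_init = lora_forward(W0, B, A, x)

# Stand-in for training: give B arbitrary non-zero values.
B = [[random.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
unmerged = lora_forward(W0, B, A, x)

# Merge once for inference: W = W0 + BA, then a single matmul per call.
W_merged = add(W0, matmul(B, A))
merged = matmul(W_merged, x)
max_diff = max(abs(u[0] - m[0]) for u, m in zip(unmerged, merged))

print("identical at init:", out_at_init == base_out)
print("max |unmerged - merged|:", max_diff)
print("4096x4096 layer, r=16:", lora_param_count(4096, 4096, 16),
      "trainable vs", 4096 * 4096, "full")
```

For a 4096 × 4096 projection with \(r = 16\), the adapter trains 131,072 values instead of 16,777,216, which is under 1% of the full matrix.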
The scaling factor lora_alpha controls how strongly the adapter output is mixed into the base model. The effective scaling applied to \(BA\) is \(\alpha / r\). Setting lora_alpha = 2r (twice the rank) is a widely used convention that applies a 2× amplification, which empirically works well across many tasks.
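One practical consequence of the \(\alpha / r\) rule is that changing the rank while leaving lora_alpha untouched silently changes the adapter's strength. A small sketch makes this visible; `lora_scale` is a hypothetical helper written for this illustration, not a PEFT function.

```python
def lora_scale(lora_alpha: float, r: int) -> float:
    """Effective multiplier applied to BA under the alpha / r convention."""
    if r <= 0:
        raise ValueError("rank r must be a positive integer")
    return lora_alpha / r


print(f"{'r':>4} {'alpha=2r':>10} {'alpha=32 (fixed)':>18}")
for r in (8, 16, 32, 64):
    print(f"{r:>4} {lora_scale(2 * r, r):>10.2f} {lora_scale(32, r):>18.2f}")
```

With lora_alpha tied to 2r the multiplier stays at 2.0 for every rank. With lora_alpha fixed at 32, moving from r=16 to r=64 drops the multiplier from 2.0 to 0.5, so a rank sweep that does not adjust lora_alpha is also an unintended sweep over adapter strength.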
4. QLoRA: 4-Bit Quantisation Meets LoRA
QLoRA (Dettmers et al., 2023) combines LoRA with two additional innovations that together cut the memory needed for the frozen base weights by roughly 75% compared with 16-bit LoRA.
4-bit NormalFloat (NF4) quantisation stores each weight in 4 bits instead of 16. NF4 is information-theoretically optimal for normally distributed weights (which LLM weights approximately are): it allocates quantisation levels to match the weight distribution, minimising rounding error.
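The idea of placing quantisation levels where the weights actually are can be sketched in a few lines. Note the assumptions: the codebook below is built from evenly spaced quantiles of a standard normal and is NOT the exact NF4 table used by bitsandbytes (which, among other differences, contains an exact zero). The block size of 64 and all function names are also choices made for this illustration.

```python
import random
from statistics import NormalDist


def normal_quantile_levels(n_levels: int = 16) -> list:
    """Illustrative codebook: evenly spaced N(0, 1) quantiles rescaled to [-1, 1].

    This only approximates the idea behind NF4; it is not the real NF4 table.
    """
    nd = NormalDist()
    qs = [nd.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    top = max(abs(q) for q in qs)
    return [q / top for q in qs]


def quantize_block(block, levels):
    """Absmax-normalise one block, then store the index of the nearest level."""
    absmax = max(abs(w) for w in block) or 1.0
    codes = [min(range(len(levels)), key=lambda j: abs(levels[j] - w / absmax))
             for w in block]
    return codes, absmax


def dequantize_block(codes, absmax, levels):
    """Recover approximate weights from 4-bit codes and the block constant."""
    return [levels[c] * absmax for c in codes]


LEVELS = normal_quantile_levels(16)       # 16 levels = 4 bits per weight

random.seed(0)
block = [random.gauss(0, 0.02) for _ in range(64)]   # one block of fake weights
codes, absmax = quantize_block(block, LEVELS)
restored = dequantize_block(codes, absmax, LEVELS)

errors = [abs(w - v) for w, v in zip(block, restored)]
print("codes use", len(set(codes)), "of 16 levels")
print(f"mean abs error: {sum(errors) / len(errors):.5f} (absmax = {absmax:.5f})")
```

Each block stores 64 four-bit codes plus one scaling constant (`absmax`). Those per-block constants are the overhead that double quantisation targets.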
Double quantisation quantises the quantisation constants themselves. The constants needed to dequantise the 4-bit weights are stored in 8 bits rather than 32, saving an additional 0.37 bits per parameter on average.
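The 0.37-bit figure follows from simple bookkeeping. The block sizes used below (64 weights per first-level block, 256 first-level constants per second-level block) are taken from the QLoRA paper rather than from this article, so treat them as stated assumptions.

```python
BLOCK_SIZE = 64   # weights sharing one quantisation constant (QLoRA paper setting)
DQ_GROUP = 256    # first-level constants sharing one fp32 second-level constant

# Without double quantisation: one 32-bit constant per block of 64 weights.
plain_overhead_bits = 32 / BLOCK_SIZE

# With double quantisation: an 8-bit constant per block, plus one 32-bit
# constant amortised over BLOCK_SIZE * DQ_GROUP weights.
dq_overhead_bits = 8 / BLOCK_SIZE + 32 / (BLOCK_SIZE * DQ_GROUP)

saving_bits = plain_overhead_bits - dq_overhead_bits

print(f"overhead without DQ: {plain_overhead_bits:.3f} bits/param")
print(f"overhead with DQ:    {dq_overhead_bits:.3f} bits/param")
print(f"saving:              {saving_bits:.3f} bits/param")
print(f"total storage:       {4 + dq_overhead_bits:.3f} bits/param")
```

On a 7B model the saving works out to roughly 0.3 GB, which is small next to the 4-bit weights themselves but costs nothing in quality according to the paper.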
5. Setting Up the Environment
```bash
# Python 3.10+, CUDA 12.1+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers peft trl bitsandbytes datasets accelerate
pip install sentence-transformers  # used by the evaluation script (evaluate.py)
pip install huggingface_hub  # for model download and upload

# Verify CUDA and bitsandbytes
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
python -c "import bitsandbytes as bnb; print(bnb.__version__)"
```
```python
# verify_setup.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

print(f"CUDA available: {torch.cuda.is_available()}")
print(f"GPU: {torch.cuda.get_device_name(0)}")
print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Quick smoke test: load a tiny model in 4-bit
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=bnb)
print("4-bit loading: OK")
```
6. Dataset Preparation
Instruction fine-tuning requires data in a structured chat format: a system prompt, a user instruction, and the
expected assistant response. The TRL library's SFTTrainer handles tokenisation and packing
automatically if you provide data in this format.
```python
# dataset_prep.py
from datasets import Dataset

RAW_DATA = [
    {
        "system": "You are a customer support agent for an e-commerce platform. Be concise and helpful.",
        "user": "What is your return policy?",
        "assistant": "You can return any item within 30 days of delivery for a full refund. "
                     "Items must be unused and in original packaging. "
                     "Start a return at account.example.com/returns.",
    },
    {
        "system": "You are a customer support agent for an e-commerce platform. Be concise and helpful.",
        "user": "My order hasn't arrived yet. Order ID: 12345.",
        "assistant": "Let me check order 12345. Deliveries typically take 3–5 business days. "
                     "If it's been over 7 days, I'll open a trace with our carrier. "
                     "Could you confirm your delivery address is still current?",
    },
    # ... more examples
]


def format_as_chat(example: dict) -> dict:
    """Convert a raw record into the role/content message list used by chat templates."""
    messages = [
        {"role": "system", "content": example["system"]},
        {"role": "user", "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {"messages": messages}


dataset = Dataset.from_list([format_as_chat(r) for r in RAW_DATA])

# Split: 90% train, 10% validation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = split["train"]
eval_ds = split["test"]
print(f"Train: {len(train_ds)} samples")
print(f"Eval: {len(eval_ds)} samples")

# Optionally push to Hugging Face Hub for reuse
# dataset.push_to_hub("your-org/your-dataset-name")
```
Aim for at least 200 high-quality examples before training. Quality matters far more than quantity for instruction tuning: 500 clean, representative examples often outperform 5,000 noisy ones.
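Since quality is the main lever, a cheap pre-flight check on the raw records is worth running before any GPU time is spent. The sketch below is one possible check, assuming records shaped like the system/user/assistant dictionaries used in this section; `validate_examples` and `REQUIRED_FIELDS` are illustrative names, and the rules (non-empty strings, no duplicate user/assistant pairs) are a starting point rather than a complete quality gate.

```python
REQUIRED_FIELDS = ("system", "user", "assistant")


def validate_examples(rows: list) -> list:
    """Return human-readable problems found in raw instruction records."""
    problems = []
    seen_pairs = set()
    for i, row in enumerate(rows):
        # Every field must be present and contain non-whitespace text.
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"row {i}: missing or empty '{field}'")
        # Exact duplicates add no signal and skew the train/eval split.
        pair = (row.get("user"), row.get("assistant"))
        if pair in seen_pairs:
            problems.append(f"row {i}: duplicate user/assistant pair")
        seen_pairs.add(pair)
    return problems


sample_rows = [
    {"system": "Be concise.", "user": "What is your return policy?",
     "assistant": "30 days from delivery."},
    {"system": "Be concise.", "user": "What is your return policy?",
     "assistant": "30 days from delivery."},            # duplicate
    {"system": "Be concise.", "user": "Where is my order?",
     "assistant": "   "},                               # empty answer
]

for problem in validate_examples(sample_rows):
    print(problem)
```

Running the same function over the full record list and refusing to train while it returns anything keeps obviously broken rows out of the training set.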
7. Training with QLoRA
```python
# train_qlora.py
# Importing dataset_prep executes it and builds train_ds and eval_ds,
# so there is no separate step to run beforehand.
from dataset_prep import train_ds, eval_ds

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"  # 3B fits on 8 GB VRAM with QLoRA

# ── 4-bit quantisation ──────────────────────────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,   # compute in bfloat16 for numerical stability
    bnb_4bit_use_double_quant=True,          # quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
# Freezes the base weights and enables gradient checkpointing
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# ── LoRA adapter config ─────────────────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,              # rank, start here, raise to 32 or 64 for complex tasks
    lora_alpha=32,     # scaling factor: alpha/r = 2.0
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",   # attention projections
        "gate_proj", "up_proj", "down_proj",      # include FFN for stronger adaptation
    ],
    bias="none",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# For this 3B model with r=16 on all seven projections, expect roughly
# 24M trainable parameters, i.e. under 1% of the total
# (the exact totals printed depend on the PEFT version).

# ── Training configuration ──────────────────────────────────────────
# Render each message list into a single string with the model's chat template.
# Recent TRL versions can also consume the messages column directly; rendering it
# here makes the exact training text easy to inspect before training.
def apply_template(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}


# Drop the messages column so the trainer sees one unambiguous text field.
train_ds = train_ds.map(apply_template, remove_columns=["messages"])
eval_ds = eval_ds.map(apply_template, remove_columns=["messages"])

training_args = SFTConfig(
    output_dir="./checkpoints/llama3-support-bot",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,    # effective batch size = 2 × 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    eval_steps=50,
    save_steps=100,                   # must be a multiple of eval_steps for load_best_model_at_end
    save_total_limit=3,
    eval_strategy="steps",            # required for eval_steps and load_best_model_at_end to work
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    max_seq_length=2048,              # newer TRL releases rename this to max_length
    dataset_text_field="text",        # the plain-text field produced by apply_template
    report_to="none",                 # swap to "wandb" or "tensorboard" for tracking
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,       # replaces the deprecated tokenizer= argument in newer TRL releases
)
trainer.train()

# ── Save adapter only (small, typically 50–200 MB) ─────────────────
trainer.model.save_pretrained("./output/llama3-support-adapter")
tokenizer.save_pretrained("./output/llama3-support-adapter")
print("Adapter saved.")
```
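Before launching a run, it helps to translate the batch, accumulation, and step settings into concrete counts, because values such as eval_steps=50 mean very little on a small dataset. The sketch below is an approximation under stated assumptions: `training_schedule` is a hypothetical helper, one sample is one sequence (no packing), and the Trainer's own rounding may differ by a step.

```python
import math


def training_schedule(n_samples: int, per_device_batch: int, grad_accum: int,
                      epochs: int, warmup_ratio: float, n_devices: int = 1) -> dict:
    """Approximate optimiser-step counts implied by a training configuration."""
    effective_batch = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_samples / effective_batch)
    total_steps = steps_per_epoch * epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": math.ceil(total_steps * warmup_ratio),
    }


# 1,000 examples with 10% held out for evaluation, using the settings from this section
plan = training_schedule(n_samples=900, per_device_batch=2, grad_accum=8,
                         epochs=3, warmup_ratio=0.05)
evaluations = plan["total_steps"] // 50    # eval_steps=50
checkpoints = plan["total_steps"] // 100   # save_steps=100

print(plan)
print(f"evaluations: {evaluations}, checkpoints: {checkpoints}")
```

With 900 training examples the whole run is about 171 optimiser steps, so only three evaluations and one checkpoint occur. On datasets of a few hundred examples, lowering eval_steps and save_steps (keeping save_steps a multiple of eval_steps) gives load_best_model_at_end something to choose from.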
8. Loading and Running the Fine-Tuned Model
```python
# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER_PATH = "./output/llama3-support-adapter"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)

# Optional: merge the adapter into the base weights so PEFT is no longer needed at runtime.
# Caveat: merging into a 4-bit base dequantises the weights, adds the adapter, and
# requantises, which can introduce small rounding differences. For a cleaner merge,
# reload the base model in bf16 (roughly 6 GB for this 3B model) and merge there.
# Skip this line and use the PeftModel directly if VRAM is tight.
model = model.merge_and_unload()

# do_sample=True is set explicitly so that the low temperature actually takes effect
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, do_sample=True, temperature=0.1)

messages = [
    {"role": "system", "content": "You are a customer support agent. Be concise and helpful."},
    {"role": "user", "content": "I never received my refund. Order #99887."},
]
output = pipe(messages)
print(output[0]["generated_text"][-1]["content"])
```
9. Evaluating the Fine-Tuned Model
Evaluate on a held-out test set using both automatic metrics and LLM-as-judge scoring. Track three things: perplexity (lower is better), embedding similarity to reference responses, and LLM judge scores for tone and accuracy.
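```python
# evaluate.py
import numpy as np
import torch
from peft import PeftModel
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER_PATH = "./output/llama3-support-adapter"

# Load the fine-tuned model the same way as in inference.py
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(BASE_MODEL, quantization_config=bnb_config, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER_PATH)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)

# Greedy decoding keeps evaluation deterministic. temperature=0.0 is rejected
# when sampling is enabled, so sampling is switched off instead.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, do_sample=False)
embedder = SentenceTransformer("all-MiniLM-L6-v2")

test_cases = [
    {
        "input": "What is the return window?",
        "reference": "Items can be returned within 30 days of delivery for a full refund.",
    },
    {
        "input": "Can I return a used item?",
        "reference": "Items must be unused and in original packaging to qualify for a return.",
    },
]

similarities = []
for case in test_cases:
    messages = [
        {"role": "system", "content": "You are a customer support agent. Be concise and helpful."},
        {"role": "user", "content": case["input"]},
    ]
    response = pipe(messages)[0]["generated_text"][-1]["content"]
    # Normalised embeddings make the dot product equal to cosine similarity
    embs = embedder.encode([response, case["reference"]], normalize_embeddings=True)
    sim = float(np.dot(embs[0], embs[1]))
    similarities.append(sim)
    print(f"Q: {case['input']}")
    print(f"A: {response}")
    print(f"Similarity to reference: {sim:.3f}\n")

print(f"Mean similarity: {sum(similarities)/len(similarities):.3f}")
# Target: > 0.85 for a well-adapted model
```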
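The script above covers embedding similarity only. Perplexity, the first metric named in this section, is the exponential of the mean per-token negative log-likelihood, so it can be derived from the evaluation loss without extra tooling. The sketch below shows the relationship with made-up token log-probabilities; the helper names are illustrative.

```python
import math


def mean_token_nll(token_logprobs: list) -> float:
    """Mean negative log-likelihood over the tokens of the reference responses."""
    return -sum(token_logprobs) / len(token_logprobs)


def perplexity_from_loss(mean_nll: float) -> float:
    """Perplexity is exp(mean per-token cross-entropy, in nats)."""
    return math.exp(mean_nll)


# Hypothetical log-probabilities the model assigned to each reference token
logprobs = [math.log(0.5), math.log(0.25), math.log(0.5), math.log(0.125)]
loss = mean_token_nll(logprobs)

print(f"mean NLL:   {loss:.4f}")
print(f"perplexity: {perplexity_from_loss(loss):.4f}")
```

In the training script, the same conversion can be applied to the value stored under "eval_loss" in the dictionary returned by `trainer.evaluate()`, on the assumption that the loss is the usual mean token-level cross-entropy. Perplexities are only comparable between runs that share a tokenizer and an evaluation set.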
10. Cost and Performance Analysis
| Approach | Model size | VRAM needed | Training time (1K samples, 3 epochs) | Approx cost (cloud GPU) |
|---|---|---|---|---|
| Full fine-tuning | 7B | 112 GB (fp32) | 6–12 hours (8× A100) | $200–$500 |
| LoRA (fp16) | 7B | 28 GB | 2–4 hours (A100 40 GB) | $15–$40 |
| QLoRA (NF4) | 7B | 8–10 GB | 1–2 hours (RTX 4090 or T4) | $2–$8 |
| QLoRA (NF4) | 3B | 4–6 GB | 30–60 min (T4 or L4) | $0.50–$2 |
| QLoRA (NF4) | 13B | 14–16 GB | 2–3 hours (A10G or RTX 4090) | $5–$15 |
For many narrow production tasks, a 3B or 7B model fine-tuned with QLoRA can match or outperform a raw 70B model on the specific task it was trained for. The fine-tuned model has learned the exact output format, tone, and domain constraints that the larger model must be coaxed into through long, expensive prompts.
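The VRAM column in the table above is dominated by two things: the stored base weights and everything else (activations, adapter weights, optimiser state, CUDA overhead). The first part is simple to estimate. This is a rough sketch under stated assumptions: the bits-per-parameter figure for NF4 with double quantisation uses the QLoRA paper's block sizes (64 and 256), nominal parameter counts are used, and 1 GB is 10^9 bytes.

```python
NF4_DQ_BITS = 4 + 8 / 64 + 32 / (64 * 256)   # about 4.127 bits per parameter
BF16_BITS = 16


def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Storage for the frozen base weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


print(f"{'model':>6} {'bf16 (GB)':>10} {'NF4+DQ (GB)':>12}")
for label, n_params in (("3B", 3e9), ("7B", 7e9), ("13B", 13e9)):
    bf16 = weight_memory_gb(n_params, BF16_BITS)
    nf4 = weight_memory_gb(n_params, NF4_DQ_BITS)
    print(f"{label:>6} {bf16:>10.1f} {nf4:>12.1f}")
```

The gap between these weight-only numbers and the table's VRAM figures is the budget consumed by activations and training state, which is why sequence length and batch size still matter under QLoRA.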
11. Common Pitfalls
| Pitfall | Symptom | Fix |
|---|---|---|
| Rank too high | Model overfits; eval loss rises after epoch 1 | Drop r from 64 to 16; reduce epochs or add dropout |
| Learning rate too high | Train loss spikes then diverges | Use 1e-4 to 2e-4; enable warmup (warmup_ratio=0.05) |
| Wrong target modules | Very few trainable parameters; weak adaptation | Print model architecture and add FFN layers: gate_proj, up_proj, down_proj |
| Catastrophic forgetting | Fine-tuned model forgets general capabilities | Keep training data diverse; add 5–10% general instruction examples |
| Data format mismatch | Model outputs garbled text or wrong role labels | Print 2–3 tokenised examples before training to verify chat template is applied correctly |
| Padding side wrong | Attention mask errors; training crashes or gives nonsense | Set tokenizer.padding_side = "right" for training causal LMs (left padding is for batched generation) |
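The first pitfall in the table (eval loss rising after an early minimum) can be caught automatically from the logged evaluation losses. The function below is a minimal sketch of that check; `first_overfit_step`, the `patience` default, and the loss values are all illustrative assumptions.

```python
from typing import Optional


def first_overfit_step(eval_history: list, patience: int = 2) -> Optional[int]:
    """Return the step of the best eval loss once it has failed to improve
    for `patience` consecutive evaluations, or None if that never happens.

    eval_history is a list of (step, eval_loss) pairs in training order.
    """
    best_step, best_loss, stale = None, float("inf"), 0
    for step, loss in eval_history:
        if loss < best_loss:
            best_step, best_loss, stale = step, loss, 0
        else:
            stale += 1
            if stale >= patience:
                return best_step
    return None


# Made-up eval losses: improving until step 100, then drifting upwards
history = [(50, 1.92), (100, 1.48), (150, 1.55), (200, 1.67)]
print("best checkpoint before overfitting:", first_overfit_step(history))
print("still improving:", first_overfit_step([(50, 1.9), (100, 1.5), (150, 1.4)]))
```

Inside a Trainer-based run, the `EarlyStoppingCallback` from transformers applies the same idea during training when combined with load_best_model_at_end; the standalone version here is mainly useful for inspecting logs after the fact.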
12. Key Takeaways
- LoRA reduces trainable parameters by roughly 99%. By decomposing weight updates as \(BA\) where rank \(r \ll d\), LoRA trains only the small matrices while keeping the pretrained weights frozen. For inference, the adapter can be merged into the base weights with no latency cost.
- QLoRA makes 7B fine-tuning fit on a consumer GPU. NF4 4-bit quantisation plus double quantisation reduces the frozen model's VRAM footprint from 14 GB to roughly 4 GB. Adapter weights remain in bfloat16 for numerical stability during training.
- Rank r=16 is the right starting point. Raise to 32 or 64 for complex behavioural changes (instruction following, structured output schemas). Keep at 8 or 16 for style and tone adaptation.
- Data quality dominates data quantity. 500 clean, representative instruction examples often outperform 5,000 noisy ones. Invest in curation before scaling data collection.
- Target all attention and FFN layers. Adding gate_proj, up_proj, down_proj alongside the standard q/k/v/o projections gives substantially stronger adaptation with modest additional VRAM cost.
- Fine-tuning beats prompting at scale. A fine-tuned 3B model can match or outperform a prompted 70B model on its target task, at a fraction of the inference cost. Fine-tune when you have at least 200 high-quality examples and the task is stable enough to justify periodic retraining.
References
- Hu et al. (2021). LoRA: Low-Rank Adaptation of Large Language Models. arXiv:2106.09685
- Dettmers et al. (2023). QLoRA: Efficient Finetuning of Quantized LLMs. arXiv:2305.14314
- Hugging Face PEFT Documentation
- TRL SFTTrainer Documentation
- Transformers Quantization Documentation