LoRA and QLoRA: Fine-Tuning LLMs on a Single GPU

Introduction

Every team deploying an LLM eventually hits the same wall. The base model is capable but not specialised: it answers in the wrong tone, uses the wrong terminology, ignores domain conventions, or hallucinates domain-specific facts that a fine-tuned model would know to retrieve or refuse. The obvious fix is fine-tuning. The obvious problem is cost.

Full fine-tuning a 7 billion parameter model in fp32 requires approximately 112 GB of VRAM just to hold the weights, gradients, and optimiser states simultaneously. That is more than a single A100 80 GB card can hold, before counting activations. At cloud pricing, a single multi-GPU training run can cost hundreds of dollars. Iteration is slow and expensive.

LoRA (Low-Rank Adaptation) changes these economics entirely. Instead of updating all 7 billion parameters, LoRA freezes the pretrained weights and trains a tiny pair of matrices whose product approximates the weight update. The number of trainable parameters drops by roughly 99%. QLoRA compounds this by loading the frozen base model in 4-bit precision, cutting the memory taken by the base weights by about 75% relative to 16-bit. The result: fine-tuning a 7B model on a single RTX 4090 in an afternoon for under $10.

1. Fine-Tuning vs RAG: Choosing the Right Tool

Fine-tuning and retrieval-augmented generation solve different problems. Using the wrong one wastes time and money.

Problem	Better fit	Why
Model needs access to a large, changing knowledge base	RAG	Knowledge updates without retraining; retrieval is the bottleneck, not model weights
Model outputs in the wrong format or style	Fine-tuning	Style and format are learned in weights; RAG cannot change output structure
Model makes factual errors on domain knowledge	RAG first, fine-tuning if RAG fails	RAG is faster to implement; fine-tuning for facts baked deeply into the domain
Model needs to follow a rigid instruction schema consistently	Fine-tuning	Prompt engineering is fragile at scale; fine-tuning is more robust
Latency is critical (no retrieval round-trip tolerated)	Fine-tuning	RAG adds a retrieval step; fine-tuned model answers directly
Budget is tight and knowledge base is small	RAG	Cheaper to maintain a small vector store than to retrain periodically

2. Why Full Fine-Tuning Is Unaffordable

Training a model updates its parameters using gradients. For each parameter, the optimiser (e.g. AdamW) stores two momentum values. The total VRAM footprint per parameter in fp32 training is:

$$\text{VRAM} = P \times (4_{\text{weights}} + 4_{\text{grads}} + 8_{\text{Adam states}}) = P \times 16 \text{ bytes}$$

For a 7B parameter model: $7 \times 10^9 \times 16 = 112\text{ GB}$. A single A100 80 GB cannot hold this. Mixed-precision training (bf16/fp16) keeps an fp32 master copy of the weights alongside the 16-bit model, so the total footprint remains close to 112 GB. Even a pure fp16 regime without a master copy needs roughly 56 GB for parameters, gradients, and optimiser state alone. On an 80 GB card that leaves only 24 GB for activations, which is insufficient for typical batch sizes and sequence lengths, so in practice you still need multiple GPUs or significant memory optimisation.
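
The per-parameter arithmetic above can be sketched as a small calculator. The function name and the regime labels are illustrative rather than library terms; the byte counts are the ones stated in this section, and activations are deliberately left out.

```python
# Rough training-memory estimate from bytes per parameter (activations excluded).
# Regime names are illustrative labels, not library terms.
BYTES_PER_PARAM = {
    # fp32 weights + fp32 gradients + two fp32 Adam moments
    "fp32": 4 + 4 + 8,
    # 16-bit weights and gradients, fp32 master copy, two fp32 Adam moments
    "mixed_precision": 2 + 2 + 4 + 8,
    # everything in 16-bit, no master copy
    "pure_fp16": 2 + 2 + 4,
}

def training_memory_gb(num_params: float, regime: str) -> float:
    """Memory for weights, gradients and optimiser state, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[regime] / 1e9

for regime in BYTES_PER_PARAM:
    print(f"{regime:16s} {training_memory_gb(7e9, regime):6.1f} GB")
```

Swapping in a different parameter count (for example 3e9 or 13e9) gives a quick lower bound before any activation memory is added.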

LoRA's insight is that the weight updates themselves are low-rank. When you fine-tune a model, the matrices do not change arbitrarily. The effective change lives in a much smaller subspace. LoRA exploits this by constraining updates to that subspace explicitly.

3. LoRA: The Mathematics

For a weight matrix $W_0 \in \mathbb{R}^{d \times k}$, instead of learning a full $\Delta W \in \mathbb{R}^{d \times k}$, LoRA constrains the update to a product of two small matrices:

$$W = W_0 + \Delta W = W_0 + BA \quad \text{where} \quad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times k},\; r \ll \min(d, k)$$

During training, $W_0$ is frozen and only $B$ and $A$ receive gradient updates. Initialisation is critical: $A$ is initialised from a random Gaussian distribution and $B$ is initialised to zero, so that $BA = 0$ at the start of training. This guarantees that the fine-tuned model behaves identically to the pretrained model at step zero, with no random-noise injection into the forward pass. The rank $r$ controls the expressiveness of the update: higher rank learns more complex changes but uses more memory and is slower to train. During inference, $BA$ is merged into $W_0$, so there is no inference latency cost. The adapter disappears into the base weights.
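
To make the saving concrete, the sketch below counts trainable parameters for a single weight matrix. The 4096 × 4096 layer shape is an assumption chosen for illustration, and the helper names are hypothetical.

```python
def full_update_params(d: int, k: int) -> int:
    """Parameters in a dense update Delta W of shape d x k."""
    return d * k

def lora_params(d: int, k: int, r: int) -> int:
    """Parameters in B (d x r) plus A (r x k)."""
    return d * r + r * k

d = k = 4096   # assumed layer shape, for illustration only
for r in (8, 16, 64):
    full = full_update_params(d, k)
    lora = lora_params(d, k, r)
    print(f"r={r:3d}: {lora:>9,} trainable vs {full:,} ({100 * lora / full:.2f}%)")
```

At r = 16 this layer trains 131,072 values instead of 16,777,216, which is under 1% of the dense update.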

Figure 1: LoRA decomposes the weight update into two small matrices B and A. Only B and A are trained; W₀ is frozen. At inference, BA is merged into W₀ so there is no added latency.

The scaling factor lora_alpha controls how strongly the adapter output is mixed into the base model. The effective scaling applied to $BA$ is $\alpha / r$, so the update actually added to the frozen weight is $(\alpha / r)\,BA$. Setting lora_alpha = 2r (twice the rank) is a common heuristic that applies a 2× amplification and tends to work well across many tasks.
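
The zero initialisation, the $\alpha / r$ scaling, and the merge step can be checked on a toy layer. This is a minimal pure-Python sketch with made-up dimensions and hand-picked values for B; it is not how PEFT implements the layer, and all helper names are hypothetical.

```python
import random

def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(X, s):
    return [[s * a for a in row] for row in X]

def lora_forward(W0, B, A, x, alpha, r):
    """y = W0 x + (alpha / r) * B (A x), with x as a column vector."""
    return add(matmul(W0, x), scale(matmul(B, matmul(A, x)), alpha / r))

def merge(W0, B, A, alpha, r):
    """Fold the adapter into the base weight: W = W0 + (alpha / r) * B A."""
    return add(W0, scale(matmul(B, A), alpha / r))

random.seed(0)
d, k, r, alpha = 4, 3, 2, 4          # toy sizes; alpha = 2r
W0 = [[random.gauss(0, 1) for _ in range(k)] for _ in range(d)]
A = [[random.gauss(0, 1) for _ in range(k)] for _ in range(r)]   # Gaussian init
B = [[0.0] * r for _ in range(d)]                                # zero init
x = [[1.0], [2.0], [3.0]]

# Step zero: B = 0, so the adapted layer reproduces the frozen layer.
same_at_init = lora_forward(W0, B, A, x, alpha, r) == matmul(W0, x)

# Pretend training has moved B away from zero (values chosen by hand).
B = [[0.5, -1.0], [0.0, 2.0], [1.5, 0.25], [-0.5, 1.0]]
y_adapter = lora_forward(W0, B, A, x, alpha, r)
y_merged = matmul(merge(W0, B, A, alpha, r), x)
max_diff = max(abs(a[0] - b[0]) for a, b in zip(y_adapter, y_merged))

print(f"identical at init: {same_at_init}")
print(f"max |adapter - merged| = {max_diff:.2e}")
```

The merged and unmerged outputs agree up to floating-point rounding, which is why a merged adapter adds no inference latency.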

4. QLoRA: 4-Bit Quantisation Meets LoRA

QLoRA (Dettmers et al., 2023) combines LoRA with two additional innovations that together shrink the frozen base model's weight memory by roughly 75% compared to 16-bit LoRA.

4-bit NormalFloat (NF4) quantisation stores each weight in 4 bits instead of 16. NF4 is information-theoretically optimal for normally distributed weights (which LLM weights approximately are): it allocates quantisation levels to match the weight distribution, minimising rounding error.
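
The idea of matching quantisation levels to a normal distribution can be illustrated with a simplified codebook. The sketch below is not the exact NF4 table used by bitsandbytes (which, among other differences, reserves an exact zero level). It places 16 levels at the midpoints of equal-probability bins of a standard normal and scales each block by its absolute maximum. The block size of 64 and all function names are assumptions for illustration.

```python
import random
from statistics import NormalDist

def quantile_codebook(bits: int = 4) -> list[float]:
    """Levels at the midpoints of 2**bits equal-probability bins of N(0, 1), scaled to [-1, 1]."""
    n = 2 ** bits
    normal = NormalDist()
    levels = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(v) for v in levels)
    return [v / largest for v in levels]

def quantise_block(block, codebook):
    """Return (indices, absmax): each weight maps to its nearest codebook level."""
    absmax = max(abs(w) for w in block) or 1.0
    indices = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in block
    ]
    return indices, absmax

def dequantise_block(indices, absmax, codebook):
    """Map 4-bit indices back to approximate weights."""
    return [codebook[i] * absmax for i in indices]

random.seed(0)
codebook = quantile_codebook()
block = [random.gauss(0, 0.02) for _ in range(64)]   # one block of 64 synthetic weights
indices, absmax = quantise_block(block, codebook)
restored = dequantise_block(indices, absmax, codebook)
worst = max(abs(a - b) for a, b in zip(block, restored))
print(f"{len(codebook)} levels, worst-case error {worst:.5f} (absmax {absmax:.5f})")
```

Because the levels are denser near zero, where most weights sit, the typical rounding error is smaller than with 16 evenly spaced levels.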

Double quantisation quantises the quantisation constants themselves. The constants needed to dequantise the 4-bit weights are stored in 8 bits rather than 32, saving an additional 0.37 bits per parameter on average.
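
The 0.37-bit figure can be reproduced from the block sizes reported in the QLoRA paper: 64 weights per quantisation block, and 256 first-level constants per second-level block. Those block sizes come from the paper rather than from this article, and the variable names are illustrative.

```python
WEIGHT_BLOCK = 64      # weights sharing one quantisation constant
CONSTANT_BLOCK = 256   # first-level constants sharing one second-level constant

# Without double quantisation: one fp32 constant per block of weights.
overhead_plain = 32 / WEIGHT_BLOCK

# With double quantisation: one 8-bit constant per block of weights,
# plus one fp32 constant per block of 256 quantised constants.
overhead_double = 8 / WEIGHT_BLOCK + 32 / (WEIGHT_BLOCK * CONSTANT_BLOCK)

saving = overhead_plain - overhead_double
print(f"overhead without double quantisation: {overhead_plain:.3f} bits/param")
print(f"overhead with double quantisation:    {overhead_double:.3f} bits/param")
print(f"saving:                               {saving:.3f} bits/param")

# Approximate size of a 7B base model at 4 bits plus the remaining overhead.
base_gb = 7e9 * (4 + overhead_double) / 8 / 1e9
print(f"7B base weights: about {base_gb:.2f} GB")
```

The saving looks small per parameter, but across 7 billion parameters it amounts to roughly 0.3 GB.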

Figure 2: VRAM progression from full fine-tuning to QLoRA. A 7B model that required 4 A100s for full fine-tuning fits on a single RTX 4090 with QLoRA. Adapter weights remain in BF16; only the frozen base is 4-bit.

5. Setting Up the Environment


# Python 3.10+, CUDA 12.1+
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers peft trl bitsandbytes datasets accelerate
pip install huggingface_hub     # for model download and upload

# Verify CUDA and bitsandbytes
python -c "import torch; print(torch.cuda.is_available(), torch.version.cuda)"
python -c "import bitsandbytes as bnb; print(bnb.__version__)"


# verify_setup.py
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig

print(f"CUDA available:  {torch.cuda.is_available()}")
print(f"GPU:             {torch.cuda.get_device_name(0)}")
print(f"VRAM:            {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")

# Quick smoke test: load a tiny model in 4-bit
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                          bnb_4bit_compute_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model     = AutoModelForCausalLM.from_pretrained("facebook/opt-125m", quantization_config=bnb)
print("4-bit loading: OK")

6. Dataset Preparation

Instruction fine-tuning requires data in a structured chat format: a system prompt, a user instruction, and the expected assistant response. The TRL library's SFTTrainer handles tokenisation and packing automatically if you provide data in this format.


# dataset_prep.py
from datasets import Dataset

RAW_DATA = [
    {
        "system":    "You are a customer support agent for an e-commerce platform. Be concise and helpful.",
        "user":      "What is your return policy?",
        "assistant": "You can return any item within 30 days of delivery for a full refund. "
                     "Items must be unused and in original packaging. "
                     "Start a return at account.example.com/returns.",
    },
    {
        "system":    "You are a customer support agent for an e-commerce platform. Be concise and helpful.",
        "user":      "My order hasn't arrived yet. Order ID: 12345.",
        "assistant": "Let me check order 12345. Deliveries typically take 3–5 business days. "
                     "If it's been over 7 days, I'll open a trace with our carrier. "
                     "Could you confirm your delivery address is still current?",
    },
    # ... more examples
]

def format_as_chat(example: dict) -> dict:
    """Convert a raw record into the role/content messages format used by chat templates."""
    messages = [
        {"role": "system",    "content": example["system"]},
        {"role": "user",      "content": example["user"]},
        {"role": "assistant", "content": example["assistant"]},
    ]
    return {"messages": messages}

dataset = Dataset.from_list([format_as_chat(r) for r in RAW_DATA])

# Split: 90% train, 10% validation
split    = dataset.train_test_split(test_size=0.1, seed=42)
train_ds = split["train"]
eval_ds  = split["test"]

print(f"Train: {len(train_ds)} samples")
print(f"Eval:  {len(eval_ds)} samples")

# Optionally push to Hugging Face Hub for reuse
# dataset.push_to_hub("your-org/your-dataset-name")

Aim for at least 200 high-quality examples before training. Quality matters far more than quantity for instruction tuning: 500 clean, representative examples will often outperform 5,000 noisy ones.
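
One way to act on this is a small curation pass before building the dataset. The sketch below encodes one assumption about what "clean" might mean in practice: non-empty fields, no exact duplicates, and a cap on answer length. The function name and the 4,000-character threshold are hypothetical, not a recommended standard.

```python
def curate(records, max_chars=4000):
    """Drop records with empty fields, overlong answers, or exact duplicates."""
    seen = set()
    kept = []
    for rec in records:
        user = rec.get("user", "").strip()
        assistant = rec.get("assistant", "").strip()
        if not user or not assistant:
            continue                      # incomplete example
        if len(assistant) > max_chars:
            continue                      # likely a pasted document, not a reply
        key = (user.lower(), assistant.lower())
        if key in seen:
            continue                      # exact duplicate (case-insensitive)
        seen.add(key)
        kept.append(rec)
    return kept

sample = [
    {"user": "What is your return policy?", "assistant": "Returns are accepted within 30 days."},
    {"user": "What is your return policy?", "assistant": "Returns are accepted within 30 days."},
    {"user": "Where is my order?", "assistant": "   "},
    {"user": "Do you ship abroad?", "assistant": "Yes, to most countries."},
]
clean = curate(sample)
print(f"kept {len(clean)} of {len(sample)} records")
```

A pass like this catches only mechanical problems; wrong or off-tone answers still need human review.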

7. Training with QLoRA


# train_qlora.py
# Importing dataset_prep executes that module, which builds train_ds and eval_ds.
# Keep dataset_prep.py in the same directory as this file.
from dataset_prep import train_ds, eval_ds

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, TaskType, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer, SFTConfig

BASE_MODEL = "meta-llama/Llama-3.2-3B-Instruct"   # 3B fits on 8 GB VRAM with QLoRA

# ── 4-bit quantisation ──────────────────────────────────────────────
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat 4-bit
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bfloat16 for numerical stability
    bnb_4bit_use_double_quant=True,         # quantise the quantisation constants
)

model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
model = prepare_model_for_kbit_training(model)   # enable gradient checkpointing

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

# ── LoRA adapter config ─────────────────────────────────────────────
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank, start here, raise to 32 or 64 for complex tasks
    lora_alpha=32,       # scaling factor: alpha/r = 2.0 (a common heuristic)
    lora_dropout=0.05,
    target_modules=[     # attention projections plus the FFN layers
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",    # include FFN for stronger adaptation
    ],
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Expect roughly 24.3M trainable parameters for this 3B model at r=16 (under 1% of the total).
# The exact printout depends on the base model and library versions.

# ── Training configuration ──────────────────────────────────────────
# Render each example's messages into a single formatted string with the model's chat
# template, so the exact training text can be inspected before training. Recent TRL
# releases can also consume the "messages" column directly; it is dropped here so the
# trainer sees only the rendered "text" field.
def apply_template(example):
    text = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return {"text": text}

train_ds = train_ds.map(apply_template, remove_columns=["messages"])
eval_ds  = eval_ds.map(apply_template, remove_columns=["messages"])

training_args = SFTConfig(
    output_dir="./checkpoints/llama3-support-bot",
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size = 2 × 8 = 16
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    logging_steps=10,
    eval_steps=50,
    save_steps=100,
    save_total_limit=3,
    eval_strategy="steps",        # required for eval_steps and load_best_model_at_end to work
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    max_seq_length=2048,         # named max_length in newer TRL releases; check your installed version
    dataset_text_field="text",   # the plain-text field produced by apply_template
    report_to="none",            # swap to "wandb" or "tensorboard" for tracking
)

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_ds,
    eval_dataset=eval_ds,
    processing_class=tokenizer,  # replaces the deprecated tokenizer= argument in recent TRL releases
)

trainer.train()

# ── Save adapter only (small, typically 50–200 MB) ─────────────────
trainer.model.save_pretrained("./output/llama3-support-adapter")
tokenizer.save_pretrained("./output/llama3-support-adapter")
print("Adapter saved.")
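
Before launching a run with the configuration above, it can help to estimate how many optimiser steps it will take. The helper below is a rough approximation (the exact rounding differs between library versions and it ignores packing), and its name is hypothetical. It shows that with a few hundred examples the whole run may be shorter than eval_steps=50 or save_steps=100, in which case no evaluation or checkpoint is triggered and those values should be lowered.

```python
import math

def optimiser_steps(num_examples, per_device_batch, grad_accum, epochs, num_devices=1):
    """Approximate number of optimiser updates for a run."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_examples / effective_batch) * epochs

for n in (200, 1000, 5000):
    steps = optimiser_steps(n, per_device_batch=2, grad_accum=8, epochs=3)
    note = "shorter than eval_steps=50" if steps < 50 else "ok"
    print(f"{n:5d} examples -> ~{steps:4d} steps ({note})")
```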

8. Loading and Running the Fine-Tuned Model


# inference.py
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline
from peft import PeftModel

BASE_MODEL    = "meta-llama/Llama-3.2-3B-Instruct"
ADAPTER_PATH  = "./output/llama3-support-adapter"

bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                 bnb_4bit_compute_dtype=torch.bfloat16)

base      = AutoModelForCausalLM.from_pretrained(BASE_MODEL, quantization_config=bnb_config, device_map="auto")
model     = PeftModel.from_pretrained(base, ADAPTER_PATH)
tokenizer = AutoTokenizer.from_pretrained(ADAPTER_PATH)

# Optional: merge the adapter into the base weights so PEFT is no longer needed at runtime.
# Merging into a 4-bit base re-quantises the merged weights, so outputs can differ slightly
# from the unmerged adapter because of rounding. For an exact merge, load the base model in
# bf16 instead (roughly 6 GB rather than ~2 GB for this 3B model) and merge there.
# Skip this line and use the PeftModel directly if that trade-off is not worth it.
model = model.merge_and_unload()

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer,
                max_new_tokens=256, temperature=0.1)

messages = [
    {"role": "system",  "content": "You are a customer support agent. Be concise and helpful."},
    {"role": "user",    "content": "I never received my refund. Order #99887."},
]
output = pipe(messages)
print(output[0]["generated_text"][-1]["content"])

9. Evaluating the Fine-Tuned Model

Evaluate on a held-out test set using both automatic metrics and LLM-as-judge scoring. Track three things: perplexity (lower is better), embedding similarity to reference responses, and LLM judge scores for tone and accuracy.


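# evaluate.py
# Requires: pip install sentence-transformers
# Importing inference.py executes it and provides the loaded model and tokenizer.
import numpy as np
from transformers import pipeline
from sentence_transformers import SentenceTransformer
from inference import model, tokenizer

# do_sample=False gives deterministic greedy decoding; temperature=0.0 is not a valid sampling temperature.
pipe     = pipeline("text-generation", model=model, tokenizer=tokenizer, max_new_tokens=256, do_sample=False)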
embedder = SentenceTransformer("all-MiniLM-L6-v2")

test_cases = [
    {
        "input":     "What is the return window?",
        "reference": "Items can be returned within 30 days of delivery for a full refund.",
    },
    {
        "input":     "Can I return a used item?",
        "reference": "Items must be unused and in original packaging to qualify for a return.",
    },
]

similarities = []
for case in test_cases:
    messages = [
        {"role": "system", "content": "You are a customer support agent. Be concise and helpful."},
        {"role": "user",   "content": case["input"]},
    ]
    response = pipe(messages)[0]["generated_text"][-1]["content"]
    embs = embedder.encode([response, case["reference"]], normalize_embeddings=True)
    sim  = float(np.dot(embs[0], embs[1]))
    similarities.append(sim)
    print(f"Q: {case['input']}")
    print(f"A: {response}")
    print(f"Similarity to reference: {sim:.3f}\n")

print(f"Mean similarity: {sum(similarities)/len(similarities):.3f}")
# Rough target: > 0.85 suggests close agreement with the references; calibrate the threshold on your own data
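
The script above covers embedding similarity only. Perplexity, the first of the three metrics, can be derived from the mean per-token cross-entropy loss. The sketch below assumes the loss is in nats, as reported in the eval_loss field returned by trainer.evaluate(); the loss values in the loop are placeholders, not measured results.

```python
import math

def perplexity(mean_token_loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_token_loss)

# In the training script this would be fed from trainer.evaluate()["eval_loss"].
# The values below are placeholders for illustration only.
for loss in (2.0, 1.5, 1.0):
    print(f"eval_loss={loss:.2f} -> perplexity={perplexity(loss):.2f}")
```

Perplexity is only comparable between runs that share the same tokenizer and evaluation set.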

10. Cost and Performance Analysis

Approach	Model size	VRAM needed	Training time (1K samples, 3 epochs)	Approx cost (cloud GPU)
Full fine-tuning	7B	112 GB (fp32)	6–12 hours (8× A100)	$200–$500
LoRA (fp16)	7B	28 GB	2–4 hours (A100 40 GB)	$15–$40
QLoRA (NF4)	7B	8–10 GB	1–2 hours (RTX 4090 or T4)	$2–$8
QLoRA (NF4)	3B	4–6 GB	30–60 min (T4 or L4)	$0.50–$2
QLoRA (NF4)	13B	14–16 GB	2–3 hours (A10G or RTX 4090)	$5–$15

For many narrow production tasks, a 3B or 7B model fine-tuned with QLoRA can match or outperform a raw 70B model on the specific task it was trained for. The fine-tuned model has learned the exact output format, tone, and domain constraints that the larger model must be coaxed into through long, expensive prompts.

11. Common Pitfalls

Pitfall	Symptom	Fix
Rank too high	Model overfits; eval loss rises after epoch 1	Drop r from 64 to 16; reduce epochs or add dropout
Learning rate too high	Train loss spikes then diverges	Use 1e-4 to 2e-4; enable warmup (warmup_ratio=0.05)
Wrong target modules	Very few trainable parameters; weak adaptation	Print model architecture and add FFN layers: gate_proj, up_proj, down_proj
Catastrophic forgetting	Fine-tuned model forgets general capabilities	Keep training data diverse; add 5–10% general instruction examples
Data format mismatch	Model outputs garbled text or wrong role labels	Print 2–3 tokenised examples before training to verify chat template is applied correctly
Padding side wrong	Attention mask errors; training crashes or gives nonsense	Set tokenizer.padding_side = "right" for causal LMs

12. Key Takeaways

LoRA reduces trainable parameters by roughly 99%. By decomposing weight updates as $BA$ where rank $r \ll \min(d, k)$, LoRA trains only the small matrices while keeping the pretrained weights frozen. At inference, the adapter can be merged into the base weights with no latency cost.
QLoRA makes 7B fine-tuning fit on a consumer GPU. NF4 4-bit quantisation plus double quantisation reduces the frozen model's VRAM footprint from 14 GB to roughly 4 GB. Adapter weights remain in bfloat16 for numerical stability during training.
Rank r=16 is the right starting point. Raise to 32 or 64 for complex behavioural changes (instruction following, structured output schemas). Keep at 8 or 16 for style and tone adaptation.
Data quality dominates data quantity. 500 clean, representative instruction examples will often outperform 5,000 noisy ones. Invest in curation before scaling data collection.
Target all attention and FFN layers. Adding gate_proj, up_proj, down_proj alongside the standard q/k/v/o projections gives substantially stronger adaptation with modest additional VRAM cost.
Fine-tuning beats prompting at scale. A fine-tuned 3B model can outperform a prompted 70B model on its target task, at a fraction of the inference cost. Fine-tune when you have at least 200 high-quality examples and the task is stable enough to justify periodic retraining.