LLM Evaluation Pipelines: How to Test What Your Model Actually Does
Introduction
Every engineer who has deployed an LLM feature faces the same uncomfortable question a week later: is it working better or worse than it was last Tuesday? A prompt change, a model upgrade, a retrieval tweak: any of these can silently degrade response quality in ways that no unit test will catch. The model still returns a string. The string still looks plausible. Users just start complaining.
The problem is that standard software testing assumes deterministic behaviour: given input X, the output must exactly equal Y. LLMs break this assumption. The same question asked ten times can produce ten differently worded answers, all of them correct. Correctness is partial, contextual, and multi-dimensional, so you need a different testing paradigm.
This post builds that paradigm from the ground up: what dimensions to measure, which metrics to use for which tasks, how to run an LLM-as-judge, how to use RAGAS for RAG-specific evaluation, and how to wire everything into an automated pipeline that runs on every deployment and alerts on regressions.
1. Why Unit Tests Fail for LLMs
Consider a customer support bot answering: "What is your return policy?"
All of the following are correct answers:
- "We accept returns within 30 days of purchase with the original receipt."
- "Products can be returned for a full refund within 30 days."
- "You have 30 days to return any item. Just bring your receipt."
An exact-match test against any single phrasing would incorrectly flag the others as failures. And that is the simplest possible case. Real evaluation problems include: the model gives a correct but incomplete answer, the model gives a fluent but factually wrong answer, the model answers a different question than the one asked, or the model correctly answers but cites a policy that changed last week.
Effective LLM evaluation requires measuring along multiple dimensions simultaneously, with metrics that handle partial correctness, semantic variation, and grounding in source documents.
| Testing approach | What it catches | What it misses |
|---|---|---|
| Exact-match string comparison | Verbatim regressions | Paraphrased correct answers, partial credit, semantic equivalence |
| Keyword presence check | Missing required terms | Incorrect context, hallucinated claims, coherence failures |
| Human review | Nearly everything, including subtle domain errors | Little, but it does not scale and cannot run on every deploy |
| LLM-as-judge + embedding metrics | Quality dimensions at scale | Subtle domain knowledge errors that require expert review |
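To make the first two rows of the table concrete, here is a minimal sketch that runs both naive checks over the three return-policy answers from above. The helper names `exact_match` and `has_keywords` are illustrative and not taken from any library.

```python
# Illustrative only: two naive checks applied to the three correct answers above.
ANSWERS = [
    "We accept returns within 30 days of purchase with the original receipt.",
    "Products can be returned for a full refund within 30 days.",
    "You have 30 days to return any item. Just bring your receipt.",
]
REFERENCE = ANSWERS[0]  # assume the test author hard-coded the first phrasing


def exact_match(response: str, reference: str) -> bool:
    """Unit-test style assertion: passes only on a verbatim match."""
    return response.strip() == reference.strip()


def has_keywords(response: str, keywords: list[str]) -> bool:
    """Keyword presence check: passes if every keyword occurs (case-insensitive)."""
    text = response.lower()
    return all(keyword.lower() in text for keyword in keywords)


exact_results = [exact_match(answer, REFERENCE) for answer in ANSWERS]
keyword_results = [has_keywords(answer, ["30 days", "return"]) for answer in ANSWERS]

print(exact_results)    # [True, False, False]
print(keyword_results)  # [True, True, True]

# The keyword check also passes a wrong answer that happens to contain the terms.
wrong = "Returns are not possible, even within 30 days."
print(has_keywords(wrong, ["30 days", "return"]))  # True
```

Exact match rejects two correct answers, while the keyword check accepts an incorrect one. Neither failure is visible from the pass/fail signal alone.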
2. The Four Evaluation Dimensions
Every LLM evaluation question maps onto one or more of four quality dimensions. Defining these explicitly before choosing metrics prevents the common mistake of measuring what is easy rather than what matters.
3. LLM-as-Judge: Scoring with a Capable Model
The most flexible evaluation technique is using a capable model to score another model's output. You provide a rubric, the question, the response, and optionally a reference answer. The judge returns a structured score.
This scales to arbitrary output formats, requires no reference answer for some metrics, and generalises across languages and domains. The cost is roughly one judge call per evaluation sample, which is cheap relative to the human annotation it replaces.
3.1 Single-answer grading with a forced tool call
Using a forced tool call ensures the judge always returns parseable structured scores rather than free text that needs regex extraction.
# llm_judge.py
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Tool schema the judge is forced to call, so scores always arrive as structured JSON.
JUDGE_TOOL = {
    "name": "submit_scores",
    "description": "Submit evaluation scores for the model response.",
    "input_schema": {
        "type": "object",
        "properties": {
            "correctness": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "Factual accuracy of the response (1=factually wrong, 5=fully correct)"
            },
            "completeness": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "How fully the response addresses all parts of the question"
            },
            "clarity": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "How clear and readable the response is"
            },
            "reasoning": {
                "type": "string",
                "description": "Two-sentence explanation justifying the scores"
            }
        },
        "required": ["correctness", "completeness", "clarity", "reasoning"]
    }
}

JUDGE_SYSTEM = """You are an expert evaluator of AI assistant responses.
Score the given response on three dimensions, each 1 (very poor) to 5 (excellent):
- Correctness: factual accuracy relative to the question and reference answer
- Completeness: whether all aspects of the question are addressed
- Clarity: how clear, concise, and well-structured the response is
Be strict. A score of 5 means the response is essentially perfect on that dimension."""


def judge(question: str, response: str, reference: str = "") -> dict:
    """Score one response. Returns {} if the judge produced no tool call."""
    prompt = f"Question: {question}\n\nResponse to evaluate:\n{response}"
    if reference:
        prompt += f"\n\nReference answer (use as ground truth for correctness):\n{reference}"
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        temperature=0,  # reduces run-to-run variance so scores stay comparable
        system=JUDGE_SYSTEM,
        tools=[JUDGE_TOOL],
        tool_choice={"type": "tool", "name": "submit_scores"},
        messages=[{"role": "user", "content": prompt}],
    )
    for block in result.content:
        if block.type == "tool_use" and block.name == "submit_scores":
            return block.input
    return {}


# Usage (guarded so that importing judge() from another module makes no API call)
if __name__ == "__main__":
    scores = judge(
        question="What is XGBoost and when should I use it?",
        response="XGBoost is a gradient boosting library. Use it for tabular data.",
        reference="XGBoost is a gradient boosted decision tree framework using second-order Taylor "
                  "approximation and regularisation. It excels at tabular data, handles missing values "
                  "natively, and is the default choice for Kaggle-style regression and classification tasks.",
    )
    print(scores)
    # Illustrative output:
    # {'correctness': 3, 'completeness': 2, 'clarity': 4,
    #  'reasoning': 'Correct but very incomplete, misses Taylor approx, regularisation, missing value handling.'}
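The schema bounds guide the judge, but it is safer not to assume every returned payload respects them, and `judge` returns `{}` when no tool call comes back. A minimal defensive check could look like the sketch below. `validate_scores` is a hypothetical helper, not part of the Anthropic SDK.

```python
SCORE_KEYS = ("correctness", "completeness", "clarity")


def validate_scores(scores: dict) -> dict:
    """Return the scores unchanged if well-formed, otherwise raise ValueError.

    Hypothetical helper: checks that each dimension is an integer in 1..5
    and that a non-empty reasoning string is present.
    """
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"invalid or missing score for {key!r}: {value!r}")
    reasoning = scores.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("missing reasoning")
    return scores


good = {"correctness": 3, "completeness": 2, "clarity": 4,
        "reasoning": "Correct but incomplete."}
print(validate_scores(good)["correctness"])  # 3

try:
    validate_scores({})  # what judge() returns when no tool call is found
except ValueError as exc:
    print(f"rejected: {exc}")
```

Raising on a malformed payload, instead of silently defaulting it, keeps a failed judge call from being counted as a score.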
3.2 Pairwise comparison
When you want to compare two system versions directly, pairwise comparison is often more reliable than single-answer scoring because the judge does not have to calibrate an absolute scale. It only decides which response is better.
# Continues llm_judge.py: reuses the `client` created above.
PAIR_TOOL = {
    "name": "submit_preference",
    "description": "State which response is better and why.",
    "input_schema": {
        "type": "object",
        "properties": {
            "winner": {"type": "string", "enum": ["A", "B", "tie"],
                       "description": "Which response is better overall"},
            "reason": {"type": "string",
                       "description": "One sentence explaining the choice"}
        },
        "required": ["winner", "reason"]
    }
}


def compare(question: str, response_a: str, response_b: str) -> dict:
    """Ask the judge which of two responses is better. Returns {} if no tool call came back."""
    prompt = (
        f"Question: {question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Consider accuracy, completeness, and clarity."
    )
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        temperature=0,  # same setting as judge(), for comparable verdicts across runs
        system="You are an expert evaluator. Judge objectively which AI response better answers the question.",
        tools=[PAIR_TOOL],
        tool_choice={"type": "tool", "name": "submit_preference"},
        messages=[{"role": "user", "content": prompt}],
    )
    for block in result.content:
        if block.type == "tool_use":
            return block.input
    return {}
3.3 Known biases and how to mitigate them
| Bias | Description | Mitigation |
|---|---|---|
| Verbosity bias | Longer answers tend to score higher even if padding adds nothing | Add "penalise unnecessary padding" to the rubric; use reference-based scoring |
| Position bias | In pairwise, Response A is preferred ~55% of the time by default | Run each pair twice with A/B swapped; count wins from both orderings |
| Self-preference | Claude prefers Claude-style answers; GPT prefers GPT-style answers | Use a different model family as judge than the model under test |
| Calibration drift | Judge scores shift across model updates, making historical comparison unreliable | Fix the judge model version; validate scores against a human-labelled anchor set quarterly |
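The position-bias mitigation in the table can be sketched as a small wrapper. It assumes a pairwise function shaped like `compare` from section 3.2, returning a dict whose `winner` is "A", "B", or "tie". The wrapper name `compare_both_orders` and the two stub judges are hypothetical and exist only so the example runs without an API call.

```python
from collections.abc import Callable


def compare_both_orders(
    question: str,
    response_1: str,
    response_2: str,
    compare_fn: Callable[[str, str, str], dict],
) -> str:
    """Run a pairwise judge in both orderings and return "1", "2", or "tie"."""
    first = compare_fn(question, response_1, response_2).get("winner", "tie")
    second = compare_fn(question, response_2, response_1).get("winner", "tie")

    # Translate slot labels (A/B) back to the underlying responses (1/2).
    vote_first = {"A": "1", "B": "2"}.get(first, "tie")
    vote_second = {"A": "2", "B": "1"}.get(second, "tie")

    # Only declare a winner when both orderings agree; anything else is a tie.
    return vote_first if vote_first == vote_second else "tie"


def always_prefers_a(question: str, a: str, b: str) -> dict:
    """Stub judge with pure position bias: always picks whatever is in slot A."""
    return {"winner": "A", "reason": "stub"}


def prefers_longer(question: str, a: str, b: str) -> dict:
    """Stub judge that decides on content (length), regardless of slot."""
    return {"winner": "A" if len(a) > len(b) else "B", "reason": "stub"}


print(compare_both_orders("q", "short", "a much longer answer", always_prefers_a))  # tie
print(compare_both_orders("q", "short", "a much longer answer", prefers_longer))    # 2
```

Treating disagreement between the two orderings as a tie is a conservative choice. An alternative is to tally wins across both orderings over the whole dataset, as the table suggests.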
4. RAGAS: Purpose-Built Metrics for RAG Pipelines
General-purpose LLM judges are expensive and noisy for RAG evaluation. RAGAS provides metrics specifically designed for retrieval-augmented generation systems; the generator metrics (Faithfulness, Answer Relevancy) need no reference answer, while the retriever metrics do. It answers a distinct set of questions that a general judge cannot easily quantify.
4.1 Faithfulness
Faithfulness measures whether every claim in the generated answer can be inferred from the retrieved context. It detects hallucination: content the model invented rather than grounded in the source documents.
RAGAS computes this by first using an LLM to extract atomic claims from the answer, then verifying each claim against the retrieved context chunks with a second LLM call. A faithfulness score of 0.95 means 95% of the answer's claims are supported by the retrieved context; the remaining 5% are unsupported and are likely hallucinations.
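The final arithmetic is a simple ratio. The sketch below shows only that last step and assumes the per-claim verdicts are already available; in RAGAS the claim extraction and verification are LLM calls, which are not reproduced here. The function name is illustrative.

```python
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of extracted claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("no claims were extracted from the answer")
    return sum(claim_supported) / len(claim_supported)


# Hypothetical verdicts for an answer that was split into four atomic claims.
verdicts = [True, True, True, False]
print(faithfulness_score(verdicts))  # 0.75
```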
4.2 Answer Relevancy
Answer Relevancy measures whether the answer actually addresses the question. A common failure mode is an answer that is faithful to the context but answers a subtly different question than the one asked. RAGAS computes this by generating N questions from the answer, then measuring the cosine similarity between each generated question and the original.
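As a rough sketch of that computation, assuming the N generated questions have already been embedded: the score is the mean cosine similarity between the original question's embedding and each generated question's embedding. The toy 2-D vectors below stand in for real embeddings, and the function names are illustrative.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def answer_relevancy(original_q: list[float], generated_qs: list[list[float]]) -> float:
    """Mean cosine similarity between the original question and each generated question."""
    return sum(cosine(original_q, g) for g in generated_qs) / len(generated_qs)


# Toy embeddings: one generated question matches the original, one is orthogonal to it.
original = [1.0, 0.0]
generated = [[1.0, 0.0], [0.0, 1.0]]
print(answer_relevancy(original, generated))  # 0.5
```

An answer that drifts to a different question yields generated questions far from the original, which pulls the mean down.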
4.3 Context Precision and Context Recall
These two metrics evaluate the retriever, not the generator. Context Precision asks: are the most relevant chunks ranked at the top? A high-precision retriever wastes less of the LLM's context window on noise. Context Recall asks: does the retrieved context contain everything needed to answer the question? Both require ground truth answers to compute.
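The arithmetic behind both retriever metrics can be sketched as follows, based on the definitions in the RAGAS documentation as best understood here: context precision averages precision@k over the ranks that hold a relevant chunk, and context recall is the share of ground-truth statements attributable to the retrieved context. The relevance and attribution verdicts are inputs in this sketch; in RAGAS they come from LLM calls. Function names are illustrative.

```python
def context_precision(relevant: list[bool]) -> float:
    """Rank-aware precision; `relevant` lists chunk verdicts in retrieval order."""
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k, counted only at relevant ranks
    return weighted / hits if hits else 0.0


def context_recall(attributed: list[bool]) -> float:
    """Share of ground-truth statements that can be attributed to the retrieved context."""
    if not attributed:
        raise ValueError("no ground-truth statements to check")
    return sum(attributed) / len(attributed)


# Same two relevant chunks, ranked well versus ranked badly.
print(round(context_precision([True, False, True]), 3))  # 0.833
print(round(context_precision([False, True, True]), 3))  # 0.583
print(context_recall([True, True, False, True]))         # 0.75
```

The two precision values differ only in ranking, which is the point: the metric rewards putting relevant chunks first.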
4.4 Running RAGAS
pip install "ragas>=0.1,<0.2" datasets anthropic # pin to v0.1 API used in examples below
Note: RAGAS v0.2 renamed several column keys (question to user_input, ground_truth to reference, etc.). The examples below use the v0.1 schema, which remains widely deployed. Check the installed version with pip show ragas and consult the migration guide if you are on v0.2+.
# ragas_eval.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Each row: one evaluated sample from your RAG system
data = {
    "question": [
        "What is the return policy?",
        "How do I cancel my subscription?",
    ],
    "answer": [
        "You can return items within 30 days of purchase.",
        "To cancel, go to Account Settings and select Cancel Subscription.",
    ],
    "contexts": [  # one list of retrieved chunks per question
        ["Our policy allows returns within 30 days with the original receipt."],
        ["Subscriptions can be cancelled at any time via Account Settings > Billing > Cancel."],
    ],
    "ground_truth": [
        "Products can be returned within 30 days for a full refund.",
        "You can cancel your subscription from Account Settings.",
    ],
}
dataset = Dataset.from_dict(data)

# Assumption to verify for your version: with no llm=/embeddings= arguments,
# RAGAS v0.1 falls back to OpenAI models and expects OPENAI_API_KEY to be set.
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
# Illustrative values:
# {'faithfulness': 0.97, 'answer_relevancy': 0.94,
#  'context_precision': 0.89, 'context_recall': 0.96}

# Per-question breakdown
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])
5. Embedding-Based Similarity Metrics
When you have reference answers but cannot afford an LLM judge call on every sample, embedding-based similarity provides a cheap, fast alternative. It measures semantic overlap between the generated answer and the reference without calling a large model.
# embedding_eval.py
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")


def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between generated answer and reference. Range: [-1, 1]."""
    # With normalize_embeddings=True the vectors are unit length, so the dot product is the cosine.
    embs = embedder.encode([generated, reference], normalize_embeddings=True)
    return float(np.dot(embs[0], embs[1]))


def batch_eval(pairs: list[tuple[str, str]]) -> list[float]:
    """Evaluate a list of (generated, reference) pairs efficiently."""
    generated = [p[0] for p in pairs]
    references = [p[1] for p in pairs]
    gen_embs = embedder.encode(generated, normalize_embeddings=True)
    ref_embs = embedder.encode(references, normalize_embeddings=True)
    return [float(np.dot(g, r)) for g, r in zip(gen_embs, ref_embs)]


# Interpretation guide (rule-of-thumb bands; calibrate them on your own data):
# > 0.95: near-identical meaning
# 0.85 – 0.95: same information, different phrasing (good)
# 0.70 – 0.85: related but notable gaps
# < 0.70: likely missing key information or wrong answer

# Usage (guarded so that importing semantic_similarity() elsewhere does not run the demo)
if __name__ == "__main__":
    scores = batch_eval([
        ("You can return items within 30 days.", "Products can be returned within 30 days for a full refund."),
        ("Contact support for help.", "Go to Account Settings to cancel your subscription."),
    ])
    print(scores)  # illustrative: roughly [0.93, 0.42]
    # First answer: semantically close ✓
    # Second answer: off-topic ✗
Embedding similarity is fast enough to run on every CI pipeline run and cheap enough for production monitoring. Use it as a first-pass filter: samples below 0.85 similarity (the cut-off the scoring harness in section 6.2 uses) get routed to an LLM judge for detailed scoring.
6. Building an Automated Eval Pipeline
All three techniques above are only useful if they run automatically and track trends over time. An eval pipeline has three components: an eval dataset, a scoring harness, and a regression gate.
6.1 The eval dataset
Start with 50 to 200 curated (question, reference_answer) pairs covering your core use cases and known edge cases. Add a few adversarial examples (questions that previously caused bad answers). Store it in a versioned JSON or CSV file that lives in your repository alongside the code.
# eval_dataset.py, stores and loads the eval dataset
import json
from pathlib import Path
from dataclasses import dataclass


@dataclass
class EvalSample:
    id: str
    question: str
    reference: str
    contexts: list[str]  # expected retrieved chunks (for RAGAS)
    category: str        # "returns", "billing", "technical", etc.
    severity: str        # "critical" or "standard"


def load_dataset(path: str = "eval/dataset.json") -> list[EvalSample]:
    raw = json.loads(Path(path).read_text())
    return [EvalSample(**item) for item in raw]
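Because the dataset is reviewed in pull requests like code, a cheap structural check before loading can catch typos early. The sketch below is one possible shape for that check. The field set mirrors `EvalSample`, while the helper name `validate_raw_samples` and the sample entries are hypothetical.

```python
REQUIRED_FIELDS = {"id", "question", "reference", "contexts", "category", "severity"}
ALLOWED_SEVERITIES = {"critical", "standard"}


def validate_raw_samples(raw: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the data looks well-formed."""
    problems = []
    seen_ids = set()
    for index, item in enumerate(raw):
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            problems.append(f"item {index}: missing fields {sorted(missing)}")
            continue
        if item["id"] in seen_ids:
            problems.append(f"item {index}: duplicate id {item['id']!r}")
        seen_ids.add(item["id"])
        if item["severity"] not in ALLOWED_SEVERITIES:
            problems.append(f"item {index}: unknown severity {item['severity']!r}")
    return problems


# Hypothetical entries in the shape that eval/dataset.json is assumed to have.
samples = [
    {"id": "ret-001", "question": "What is your return policy?",
     "reference": "Products can be returned within 30 days for a full refund.",
     "contexts": ["Our policy allows returns within 30 days with the original receipt."],
     "category": "returns", "severity": "critical"},
    {"id": "ret-001", "question": "Can I return an opened item?",
     "reference": "Yes, within 30 days.", "contexts": [],
     "category": "returns", "severity": "high"},
]
for problem in validate_raw_samples(samples):
    print(problem)
```

Here the second entry is reported twice: once for reusing `ret-001` and once for a severity value outside the two the harness expects.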
6.2 The scoring harness
# eval_runner.py
import json, time
from datetime import datetime, timezone
from pathlib import Path

from llm_judge import judge
from embedding_eval import semantic_similarity


def run_eval(
    response_fn,  # callable: question -> str
    dataset: list,
    output_path: str = "eval/results_latest.json",
) -> dict:
    results = []
    for sample in dataset:
        t0 = time.perf_counter()
        response = response_fn(sample.question)
        latency_ms = (time.perf_counter() - t0) * 1000
        emb_score = semantic_similarity(response, sample.reference)

        # Cost optimisation: skip LLM judge for high-similarity responses.
        # A score >= 0.85 means "same information, different phrasing" (per the interpretation
        # guide above). This heuristic will miss edge cases where wording is similar but
        # correctness differs; raise the threshold to 0.90 or remove the shortcut for
        # safety-critical evals.
        if emb_score < 0.85:
            llm_scores = judge(sample.question, response, sample.reference)
        else:
            llm_scores = {"correctness": 5, "completeness": 5, "clarity": 5, "reasoning": "auto-pass (emb>=0.85)"}

        results.append({
            "id": sample.id,
            "question": sample.question,
            "response": response,
            "embedding_sim": round(emb_score, 4),
            "llm_scores": llm_scores,
            "latency_ms": round(latency_ms, 1),
            "category": sample.category,
            "severity": sample.severity,
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })

    summary = _aggregate(results)
    payload = {"summary": summary, "results": results}
    Path(output_path).write_text(json.dumps(payload, indent=2))
    return summary


def _aggregate(results: list) -> dict:
    def mean(vals):
        return round(sum(vals) / len(vals), 4) if vals else 0.0

    # Note: a missing score (for example when judge() returned {}) is counted as 5 here.
    return {
        "n": len(results),
        "mean_embedding_sim": mean([r["embedding_sim"] for r in results]),
        "mean_correctness": mean([r["llm_scores"].get("correctness", 5) for r in results]),
        "mean_completeness": mean([r["llm_scores"].get("completeness", 5) for r in results]),
        "mean_latency_ms": mean([r["latency_ms"] for r in results]),
        "critical_failures": sum(1 for r in results
                                 if r["embedding_sim"] < 0.70 and r.get("severity") == "critical"),
    }
7. Regression Testing: Catching Drops Before They Ship
Run the eval harness on every pull request and fail the build if any critical metric falls below its configured threshold; the gate below also prints the delta against a stored baseline so reviewers can see the trend. The pattern mirrors standard CI quality gates, but the thresholds are empirically calibrated rather than binary.
# regression_gate.py
import json, sys
from pathlib import Path

THRESHOLDS = {
    "mean_embedding_sim": 0.82,  # fail if the mean drops below this
    "mean_correctness": 3.8,     # out of 5
    "mean_completeness": 3.8,
    "critical_failures": 0,      # zero tolerance for critical sample failures
}


def check_regression(
    current_path: str = "eval/results_latest.json",
    baseline_path: str = "eval/results_baseline.json",
) -> bool:
    current = json.loads(Path(current_path).read_text())["summary"]
    baseline = json.loads(Path(baseline_path).read_text())["summary"]
    passed = True
    print("\n=== Eval Regression Report ===")
    for metric, threshold in THRESHOLDS.items():
        curr_val = current.get(metric, 0)
        base_val = baseline.get(metric, 0)
        delta = curr_val - base_val  # reported for context; pass/fail uses the threshold
        if metric == "critical_failures":
            ok = curr_val <= threshold  # a count: lower is better
        else:
            ok = curr_val >= threshold  # a quality score: higher is better
        status = "✓ PASS" if ok else "✗ FAIL"
        print(f" {status} {metric}: {curr_val:.3f} (baseline {base_val:.3f}, delta {delta:+.3f})")
        if not ok:
            passed = False
    print(f"\nOverall: {'PASS' if passed else 'FAIL, block merge'}")
    return passed


if __name__ == "__main__":
    ok = check_regression()
    sys.exit(0 if ok else 1)
Wire this into your CI as a GitHub Actions job that runs on every PR targeting main. Store the baseline JSON in the repository and update it manually after intentional quality changes are validated by human review.
# .github/workflows/eval.yml
name: LLM Quality Gate
on:
  pull_request:
    branches: [main]
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      # ragas is pinned to the same v0.1 range as the examples in section 4.4
      - run: pip install anthropic "ragas>=0.1,<0.2" datasets sentence-transformers
      - run: python eval/run_full_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: python eval/regression_gate.py
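The gate in regression_gate.py checks absolute thresholds and only reports the delta against the baseline. If you also want to block on a relative drop, a check along the following lines could sit beside it. This is a sketch under assumptions: the tolerance values and the function name `find_regressions` are hypothetical and would need calibrating against the run-to-run noise of your own eval set.

```python
# Hypothetical maximum allowed drop versus baseline, per metric.
TOLERANCES = {
    "mean_embedding_sim": 0.02,
    "mean_correctness": 0.2,
    "mean_completeness": 0.2,
}


def find_regressions(current: dict, baseline: dict, tolerances: dict) -> dict:
    """Return {metric: drop} for metrics that fell by more than their tolerance."""
    regressions = {}
    for metric, tolerance in tolerances.items():
        if metric not in current or metric not in baseline:
            continue  # skip metrics that one of the two summaries does not report
        drop = baseline[metric] - current[metric]
        if drop > tolerance:
            regressions[metric] = round(drop, 4)
    return regressions


baseline = {"mean_embedding_sim": 0.90, "mean_correctness": 4.5, "mean_completeness": 4.2}
current = {"mean_embedding_sim": 0.89, "mean_correctness": 4.0, "mean_completeness": 4.1}
print(find_regressions(current, baseline, TOLERANCES))  # {'mean_correctness': 0.5}
```

In this example every metric would still clear the absolute thresholds from section 7, yet correctness has dropped by half a point, which is the kind of slide a threshold-only gate lets through.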
8. Choosing the Right Metric for Your Task
| Task type | Primary metric | Secondary metric | Why |
|---|---|---|---|
| RAG question answering | RAGAS faithfulness + relevancy | Context precision + recall | Distinguishes retriever bugs from generation bugs |
| Open-ended Q&A (no retrieval) | LLM-as-judge correctness | Embedding similarity vs reference | LLM judge handles semantic variation; embedding provides fast pass/fail |
| Summarisation | Embedding similarity + faithfulness | LLM-as-judge completeness | Need both semantic closeness and grounding in source |
| Classification / extraction | Exact-match or F1 on extracted fields | LLM-as-judge for edge cases | Structured output allows exact comparison; LLM judge handles ambiguous cases |
| Conversational chat | LLM-as-judge pairwise (A/B test) | Safety classifier | Quality is preference-based; single-answer scoring is unreliable for chat |
| Code generation | Execution-based tests (unit tests) | LLM-as-judge readability | Code can be deterministically tested; execution is the ground truth |
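For the classification / extraction row, "F1 on extracted fields" can be computed without any model call. A minimal sketch, assuming both the prediction and the expected output are flat dicts with hashable values, and counting a field as correct only when its value matches exactly (the function name and the sample record are illustrative):

```python
def field_f1(predicted: dict, expected: dict) -> float:
    """Micro F1 over (field, value) pairs of two flat dicts."""
    pred_items = set(predicted.items())
    exp_items = set(expected.items())
    true_pos = len(pred_items & exp_items)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_items)  # share of predicted fields that are right
    recall = true_pos / len(exp_items)      # share of expected fields that were found
    return 2 * precision * recall / (precision + recall)


# Hypothetical extraction result: two fields right, one wrong, one spurious.
expected = {"order_id": "A123", "reason": "damaged", "refund": "full"}
predicted = {"order_id": "A123", "reason": "late", "refund": "full", "channel": "email"}
print(round(field_f1(predicted, expected), 3))  # 0.571
```

Unlike exact match on the whole record, this gives partial credit for the two correct fields while still penalising the wrong and the spurious one.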
9. Key Takeaways
- Unit tests cannot evaluate LLMs. Exact-match comparison misses semantically equivalent answers and provides no partial credit. You need metrics designed for probabilistic outputs.
- LLM-as-judge is the most flexible evaluator. Use a forced tool call for structured output, a different model family than the one under test to reduce self-preference bias, and run each pairwise comparison in both A/B orderings to cancel position bias.
- RAGAS separates retrieval from generation. Faithfulness and Answer Relevancy measure the generator. Context Precision and Recall measure the retriever. Knowing which component failed tells you where to fix the problem.
- Embedding similarity is your fast filter. A local model like all-MiniLM-L6-v2 runs at near-zero cost and screens out obvious failures in milliseconds. Route only low-similarity samples to the full LLM judge to cut judging costs by 50–80% on typical production traffic.
- Eval datasets are code. Version them in your repository, review additions in PRs, and treat regressions as blocking CI failures with the same severity as broken tests.
- Calibrate your judge against humans. Run 100 samples through both your LLM judge and human reviewers quarterly. Target a Pearson correlation above 0.80. A judge that drifts from human judgement silently corrupts all downstream metrics.
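The calibration check in the last takeaway reduces to a correlation between two score lists. A minimal sketch with a hand-rolled Pearson coefficient follows; the scores are made-up illustration data, not measurements.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Made-up correctness scores for the same five samples.
human_scores = [5, 4, 2, 1, 3]
judge_scores = [5, 5, 2, 1, 3]
r = pearson(human_scores, judge_scores)
print(round(r, 3))  # 0.972
print("judge calibrated" if r > 0.80 else "judge needs recalibration")
```

On Python 3.10+ `statistics.correlation` computes the same quantity. With only a handful of samples the coefficient is very noisy, which is one reason the takeaway suggests around 100.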
References
- RAGAS Documentation
- DeepEval Documentation
- Zheng et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. arXiv:2306.05685
- Es et al. (2023). RAGAS: Automated Evaluation of Retrieval Augmented Generation. arXiv:2309.15217
- Anthropic Tool Use Documentation