LLM Evaluation Pipelines: How to Test What Your Model Actually Does

Introduction

Every engineer who has deployed an LLM feature faces the same uncomfortable question a week later: is it working better or worse than it was last Tuesday? A prompt change, a model upgrade, a retrieval tweak: any of these can silently degrade response quality in ways that no unit test will catch. The model still returns a string. The string still looks plausible. Users just start complaining.

The problem is that standard software testing assumes deterministic behaviour: given input X, the output must exactly equal Y. LLMs break this assumption completely. The same question asked ten times can produce ten different, equally correct answers. Correctness is partial, contextual, and multi-dimensional. You need a different testing paradigm entirely.

This post builds that paradigm from the ground up: what dimensions to measure, which metrics to use for which tasks, how to run an LLM-as-judge, how to use RAGAS for RAG-specific evaluation, and how to wire everything into an automated pipeline that runs on every deployment and alerts on regressions.

1. Why Unit Tests Fail for LLMs

Consider a customer support bot answering: "What is your return policy?"

All of the following are correct answers:

"We accept returns within 30 days of purchase with the original receipt."
"Products can be returned for a full refund within 30 days."
"You have 30 days to return any item. Just bring your receipt."

An exact-match test against any single phrasing would incorrectly flag the others as failures. And that is the simplest possible case. Real evaluation problems include: the model gives a correct but incomplete answer, the model gives a fluent but factually wrong answer, the model answers a different question than the one asked, or the model correctly answers but cites a policy that changed last week.

Effective LLM evaluation requires measuring along multiple dimensions simultaneously, with metrics that handle partial correctness, semantic variation, and grounding in source documents.

Testing approach	What it catches	What it misses
Exact-match string comparison	Verbatim regressions	Paraphrased correct answers, partial credit, semantic equivalence
Keyword presence check	Missing required terms	Incorrect context, hallucinated claims, coherence failures
Human review	Everything	Nothing; but does not scale and cannot run on every deploy
LLM-as-judge + embedding metrics	Quality dimensions at scale	Subtle domain knowledge errors that require expert review
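
To make the first two rows of the table concrete, here is a minimal sketch that applies an exact-match check and a keyword check to the three return-policy answers above. The choice of reference string and the "30 days" keyword are illustrative assumptions, as is the deliberately wrong answer at the end.

```python
# The three correct answers from the example above.
answers = [
    "We accept returns within 30 days of purchase with the original receipt.",
    "Products can be returned for a full refund within 30 days.",
    "You have 30 days to return any item. Just bring your receipt.",
]
reference = answers[0]  # assumption: the first phrasing is the stored "expected" string


def exact_match(answer: str, expected: str) -> bool:
    """Classic unit-test style comparison."""
    return answer.strip() == expected.strip()


def has_keywords(answer: str, keywords: list[str]) -> bool:
    """True if every required keyword appears (case-insensitive)."""
    text = answer.lower()
    return all(k.lower() in text for k in keywords)


exact_results = [exact_match(a, reference) for a in answers]
keyword_results = [has_keywords(a, ["30 days"]) for a in answers]

print(exact_results)    # [True, False, False]: two correct answers flagged as failures
print(keyword_results)  # [True, True, True]

# The keyword check also passes an answer that is simply wrong.
wrong = "Returns are not accepted, even within 30 days."
print(has_keywords(wrong, ["30 days"]))  # True
```

Exact match rejects valid paraphrases, while the keyword check accepts an answer that states the opposite policy. Neither failure mode is visible from the pass/fail result alone.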

2. The Four Evaluation Dimensions

Every LLM evaluation question maps onto one or more of four quality dimensions. Defining these explicitly before choosing metrics prevents the common mistake of measuring what is easy rather than what matters.

Figure 1: The four dimensions of LLM output quality. Each has specialised metrics. Most production systems need to track two or three of these, not all four.

3. LLM-as-Judge: Scoring with a Capable Model

The most flexible evaluation technique is using a capable model to score another model's output. You provide a rubric, the question, the response, and optionally a reference answer. The judge returns a structured score.

This scales to arbitrary output formats, requires no reference answer for some metrics, and generalises across languages and domains. The cost is roughly one judge call per evaluation sample, which is cheap relative to the human annotation it replaces.

3.1 Single-answer grading with a forced tool call

Using a forced tool call ensures the judge always returns parseable structured scores rather than free text that needs regex extraction.


# llm_judge.py
import anthropic

client = anthropic.Anthropic()

JUDGE_TOOL = {
    "name": "submit_scores",
    "description": "Submit evaluation scores for the model response.",
    "input_schema": {
        "type": "object",
        "properties": {
            "correctness": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "Factual accuracy of the response (1=factually wrong, 5=fully correct)"
            },
            "completeness": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "How fully the response addresses all parts of the question"
            },
            "clarity": {
                "type": "integer", "minimum": 1, "maximum": 5,
                "description": "How clear and readable the response is"
            },
            "reasoning": {
                "type": "string",
                "description": "Two-sentence explanation justifying the scores"
            }
        },
        "required": ["correctness", "completeness", "clarity", "reasoning"]
    }
}

JUDGE_SYSTEM = """You are an expert evaluator of AI assistant responses.
Score the given response on three dimensions, each 1 (very poor) to 5 (excellent):
- Correctness: factual accuracy relative to the question and reference answer
- Completeness: whether all aspects of the question are addressed
- Clarity: how clear, concise, and well-structured the response is
Be strict. A score of 5 means the response is essentially perfect on that dimension."""

def judge(question: str, response: str, reference: str = "") -> dict:
    prompt = f"Question: {question}\n\nResponse to evaluate:\n{response}"
    if reference:
        prompt += f"\n\nReference answer (use as ground truth for correctness):\n{reference}"

    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=512,
        temperature=0,         # reduces run-to-run variance; scores are still not guaranteed to be identical
        system=JUDGE_SYSTEM,
        tools=[JUDGE_TOOL],
        tool_choice={"type": "tool", "name": "submit_scores"},
        messages=[{"role": "user", "content": prompt}],
    )
    for block in result.content:
        if block.type == "tool_use" and block.name == "submit_scores":
            return block.input
    return {}

# Usage
scores = judge(
    question="What is XGBoost and when should I use it?",
    response="XGBoost is a gradient boosting library. Use it for tabular data.",
    reference="XGBoost is a gradient boosted decision tree framework using second-order Taylor "
              "approximation and regularisation. It excels at tabular data, handles missing values "
              "natively, and is the default choice for Kaggle-style regression and classification tasks.",
)
print(scores)
# Illustrative output (actual scores and wording will vary):
# {'correctness': 3, 'completeness': 2, 'clarity': 4,
#  'reasoning': 'Correct but very incomplete, misses Taylor approx, regularisation, missing value handling.'}
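
The judge function above returns whatever the tool call contains, and returns an empty dict if no tool call is found. The sketch below is one way to check the result before it enters your metrics. The helper name validate_scores is hypothetical, and the assumption is that you would rather fail loudly than record a malformed score.

```python
SCORE_KEYS = ("correctness", "completeness", "clarity")


def validate_scores(scores: dict) -> dict:
    """Return the scores unchanged if well-formed, otherwise raise ValueError."""
    for key in SCORE_KEYS:
        value = scores.get(key)
        # bool is a subclass of int in Python, so reject it explicitly.
        if not isinstance(value, int) or isinstance(value, bool):
            raise ValueError(f"{key!r} is missing or not an integer: {value!r}")
        if not 1 <= value <= 5:
            raise ValueError(f"{key!r} is outside the 1-5 range: {value}")
    if not isinstance(scores.get("reasoning"), str):
        raise ValueError("'reasoning' is missing or not a string")
    return scores


good = {"correctness": 3, "completeness": 2, "clarity": 4, "reasoning": "Incomplete."}
print(validate_scores(good) == good)  # True

try:
    validate_scores({})  # what judge() returns when no tool call is found
except ValueError as err:
    print("rejected:", err)
```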

3.2 Pairwise comparison

When you want to compare two system versions directly, pairwise comparison is often more reliable than single-answer scoring because it removes the need to calibrate an absolute scale. The judge simply decides which response is better.


PAIR_TOOL = {
    "name": "submit_preference",
    "description": "State which response is better and why.",
    "input_schema": {
        "type": "object",
        "properties": {
            "winner": {"type": "string", "enum": ["A", "B", "tie"],
                      "description": "Which response is better overall"},
            "reason": {"type": "string",
                      "description": "One sentence explaining the choice"}
        },
        "required": ["winner", "reason"]
    }
}

def compare(question: str, response_a: str, response_b: str) -> dict:
    prompt = (
        f"Question: {question}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        "Which response is better? Consider accuracy, completeness, and clarity."
    )
    result = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=256,
        temperature=0,         # same setting as the single-answer judge, to reduce run-to-run variance
        system="You are an expert evaluator. Judge objectively which AI response better answers the question.",
        tools=[PAIR_TOOL],
        tool_choice={"type": "tool", "name": "submit_preference"},
        messages=[{"role": "user", "content": prompt}],
    )
    for block in result.content:
        if block.type == "tool_use" and block.name == "submit_preference":
            return block.input
    return {}

3.3 Known biases and how to mitigate them

Bias	Description	Mitigation
Verbosity bias	Longer answers tend to score higher even if padding adds nothing	Add "penalise unnecessary padding" to the rubric; use reference-based scoring
Position bias	In pairwise comparison, Response A (the first position) is preferred ~55% of the time by default	Run each pair twice with A/B swapped; count wins from both orderings
Self-preference	Claude prefers Claude-style answers; GPT prefers GPT-style answers	Use a different model family as judge than the model under test
Calibration drift	Judge scores shift across model updates, making historical comparison unreliable	Fix the judge model version; validate scores against a human-labelled anchor set quarterly
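
The position-bias mitigation in the table can be sketched as a small wrapper around any pairwise judge. The wrapper name debiased_compare and the two stub judges are hypothetical. Treating a disagreement between the two orderings as a tie is one conservative choice, not the only way to count wins.

```python
from typing import Callable


def debiased_compare(
    question: str,
    response_a: str,
    response_b: str,
    compare_fn: Callable[[str, str, str], dict],
) -> str:
    """Run the pairwise judge in both orders; return 'A', 'B', or 'tie'."""
    first = compare_fn(question, response_a, response_b).get("winner", "tie")
    second = compare_fn(question, response_b, response_a).get("winner", "tie")
    # In the second call the labels were swapped, so map the verdict back.
    second = {"A": "B", "B": "A"}.get(second, "tie")
    # Only declare a winner when both orderings agree.
    return first if first == second else "tie"


# Stub judge with total position bias: always prefers whatever is shown first.
def always_first(question: str, a: str, b: str) -> dict:
    return {"winner": "A", "reason": "stub"}


# Stub judge with a real preference: picks the longer response in either position.
def prefers_longer(question: str, a: str, b: str) -> dict:
    return {"winner": "A" if len(a) > len(b) else "B", "reason": "stub"}


print(debiased_compare("q", "short", "a longer answer", always_first))    # tie
print(debiased_compare("q", "short", "a longer answer", prefers_longer))  # B
```

In practice you would pass the compare function from section 3.2 as compare_fn. A purely position-driven verdict cancels out, while a consistent preference survives the swap.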

4. RAGAS: Purpose-Built Metrics for RAG Pipelines

General-purpose LLM judges are expensive and noisy for RAG evaluation. RAGAS provides metrics designed specifically for retrieval-augmented generation systems. Faithfulness and Answer Relevancy are reference-free; Context Precision and Context Recall need ground-truth answers. Together they answer a distinct set of questions that a general judge cannot easily quantify.

Figure 2: RAGAS maps onto the two phases of a RAG pipeline. Context Precision and Recall measure how good the retriever is. Faithfulness and Answer Relevancy measure how well the LLM uses what was retrieved.

4.1 Faithfulness

Faithfulness measures whether every claim in the generated answer can be inferred from the retrieved context. It detects hallucination: content the model invented rather than grounded in the source documents.

\text{Faithfulness} = \frac{|\text{claims supported by context}|}{|\text{total claims in answer}|}

RAGAS computes this by first using an LLM to extract atomic claims from the answer, then verifying each claim against the retrieved context chunks with a second LLM call. A faithfulness score of 0.95 means 95% of the answer's claims are supported by the retrieved context; the remaining 5% could not be verified against it and are treated as hallucinated.
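
The arithmetic behind the score is simple once the claims have verdicts. The sketch below hand-assigns the verdicts that RAGAS would obtain from its LLM calls; the claims and the function name faithfulness_score are illustrative, not part of the RAGAS API.

```python
def faithfulness_score(verdicts: list[bool]) -> float:
    """Fraction of extracted claims that the retrieved context supports."""
    if not verdicts:
        raise ValueError("no claims were extracted from the answer")
    return sum(verdicts) / len(verdicts)


# Hypothetical claims extracted from one answer, with hand-assigned verdicts.
claims = {
    "Returns are accepted within 30 days.": True,
    "The original receipt is required.": True,
    "Refunds are issued within 24 hours.": False,  # not stated anywhere in the context
}

score = faithfulness_score(list(claims.values()))
print(round(score, 3))  # 0.667
```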

4.2 Answer Relevancy

Answer Relevancy measures whether the answer actually addresses the question. A common failure mode is an answer that is faithful to the context but answers a subtly different question than the one asked. RAGAS computes this by generating N questions from the answer, embedding each one with an embedding model E, and averaging the cosine similarity between each generated question's embedding and the embedding of the original question.

\text{Answer Relevancy} = \frac{1}{N} \sum_{i=1}^{N} \cos\!\bigl(E(q_{\text{gen},i}),\; E(q_{\text{orig}})\bigr)
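
The formula can be worked through by hand with toy vectors. The two-dimensional "embeddings" below are made up for illustration; a real run would use the output of an embedding model for each generated question.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def answer_relevancy(generated: list[list[float]], original: list[float]) -> float:
    """Mean cosine similarity between each generated-question embedding and the original."""
    return sum(cosine(g, original) for g in generated) / len(generated)


q_orig = [1.0, 0.0]
q_gen = [
    [1.0, 0.0],  # same direction as the original question: cosine 1
    [0.0, 1.0],  # orthogonal, i.e. an unrelated question: cosine 0
    [1.0, 1.0],  # partially related: cosine 1/sqrt(2)
]

relevancy = answer_relevancy(q_gen, q_orig)
print(round(relevancy, 3))  # 0.569
```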

4.3 Context Precision and Context Recall

These two metrics evaluate the retriever, not the generator. Context Precision asks: are the most relevant chunks ranked at the top? A high-precision retriever wastes less of the LLM's context window on noise. Context Recall asks: does the retrieved context contain everything needed to answer the question? Both require ground truth answers to compute.
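
As a rough sketch of what these two numbers capture, the functions below compute an average-precision-style score over ranked chunks and a simple recall fraction. This is a simplification under stated assumptions: the relevance and attribution labels are hand-assigned here, whereas RAGAS derives them with LLM calls, and its exact weighting may differ between versions.

```python
def context_precision(relevant: list[bool]) -> float:
    """Average-precision-style score; relevant[0] is the top-ranked chunk."""
    hits = 0
    total = 0.0
    for k, is_relevant in enumerate(relevant, start=1):
        if is_relevant:
            hits += 1
            total += hits / k  # precision@k, counted only at relevant ranks
    return total / hits if hits else 0.0


def context_recall(attributable: list[bool]) -> float:
    """Fraction of ground-truth statements that the retrieved context supports."""
    return sum(attributable) / len(attributable) if attributable else 0.0


# Same two relevant chunks, ranked well versus ranked badly.
print(context_precision([True, True, False]))            # 1.0
print(round(context_precision([False, True, True]), 3))  # 0.583

# Three of four ground-truth statements can be found in the retrieved context.
print(context_recall([True, True, False, True]))          # 0.75
```

The two precision calls retrieve the same relevant chunks; only the ranking differs, which is exactly what the metric penalises.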

4.4 Running RAGAS


pip install "ragas>=0.1,<0.2" datasets anthropic   # pin to v0.1 API used in examples below

RAGAS v0.2 renamed several column keys (question to user_input, ground_truth to reference, etc.). The examples below use the v0.1 schema, which remains widely deployed. Check your installed version with pip show ragas and consult the migration guide if you are on v0.2+.


# ragas_eval.py
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Each row: one evaluated sample from your RAG system
data = {
    "question": [
        "What is the return policy?",
        "How do I cancel my subscription?",
    ],
    "answer": [
        "You can return items within 30 days of purchase.",
        "To cancel, go to Account Settings and select Cancel Subscription.",
    ],
    "contexts": [
        ["Our policy allows returns within 30 days with the original receipt."],
        ["Subscriptions can be cancelled at any time via Account Settings > Billing > Cancel."],
    ],
    "ground_truth": [
        "Products can be returned within 30 days for a full refund.",
        "You can cancel your subscription from Account Settings.",
    ],
}

dataset = Dataset.from_dict(data)

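# Assumption to verify for your setup: RAGAS v0.1 defaults to OpenAI models for its
# judge LLM and embeddings (needs OPENAI_API_KEY). To use another provider, pass
# wrapped models through the llm= and embeddings= arguments of evaluate().
result = evaluate(
    dataset=dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(result)
# Illustrative output (values depend on the judge model and the data):
# {'faithfulness': 0.97, 'answer_relevancy': 0.94,
#  'context_precision': 0.89, 'context_recall': 0.96}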

# Per-question breakdown
df = result.to_pandas()
print(df[["question", "faithfulness", "answer_relevancy"]])

5. Embedding-Based Similarity Metrics

When you have reference answers but cannot afford an LLM judge call on every sample, embedding-based similarity provides a cheap, fast alternative. It measures semantic overlap between the generated answer and the reference without calling a large model.


# embedding_eval.py
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_similarity(generated: str, reference: str) -> float:
    """Cosine similarity between generated answer and reference. Range: [-1, 1]."""
    embs = embedder.encode([generated, reference], normalize_embeddings=True)
    return float(np.dot(embs[0], embs[1]))

def batch_eval(pairs: list[tuple[str, str]]) -> list[float]:
    """Evaluate a list of (generated, reference) pairs efficiently."""
    generated = [p[0] for p in pairs]
    references = [p[1] for p in pairs]
    gen_embs = embedder.encode(generated, normalize_embeddings=True)
    ref_embs = embedder.encode(references, normalize_embeddings=True)
    return [float(np.dot(g, r)) for g, r in zip(gen_embs, ref_embs)]

# Interpretation guide (rough bands for this embedding model; calibrate on your own data):
# > 0.95: near-identical meaning
# 0.85 – 0.95: same information, different phrasing (good)
# 0.70 – 0.85: related but notable gaps
# < 0.70: likely missing key information or wrong answer

scores = batch_eval([
    ("You can return items within 30 days.", "Products can be returned within 30 days for a full refund."),
    ("Contact support for help.", "Go to Account Settings to cancel your subscription."),
])
print(scores)  # e.g. [0.93, 0.42] (illustrative; exact values depend on the model version)
# First answer: semantically close ✓
# Second answer: off-topic ✗

Embedding similarity is fast enough to run on every CI run and cheap enough for production monitoring. Use it as a first-pass filter: samples below 0.85 similarity get routed to an LLM judge for detailed scoring.

6. Building an Automated Eval Pipeline

All three techniques above are only useful if they run automatically and track trends over time. An eval pipeline has three components: an eval dataset, a scoring harness, and a regression gate.

6.1 The eval dataset

Start with 50 to 200 curated (question, reference_answer) pairs covering your core use cases and known edge cases. Add a few adversarial examples (questions that previously caused bad answers). Store it in a versioned JSON or CSV file that lives in your repository alongside the code.


# eval_dataset.py, stores and loads the eval dataset
import json
from pathlib import Path
from dataclasses import dataclass

@dataclass
class EvalSample:
    id:            str
    question:      str
    reference:     str
    contexts:      list[str]   # expected retrieved chunks (for RAGAS)
    category:      str         # "returns", "billing", "technical", etc.
    severity:      str         # "critical" or "standard"

def load_dataset(path: str = "eval/dataset.json") -> list[EvalSample]:
    raw = json.loads(Path(path).read_text())
    return [EvalSample(**item) for item in raw]
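
Because the dataset is reviewed in pull requests like code, a small validation step can catch malformed records before they reach the loader. The sketch below assumes the field names of EvalSample above; the helper name validate_records and the example record are hypothetical.

```python
import json

REQUIRED = {"id", "question", "reference", "contexts", "category", "severity"}
SEVERITIES = {"critical", "standard"}


def validate_records(records: list[dict]) -> list[str]:
    """Return human-readable problems; an empty list means the file is usable."""
    problems = []
    seen_ids = set()
    for i, rec in enumerate(records):
        missing = REQUIRED - rec.keys()
        if missing:
            problems.append(f"record {i}: missing fields {sorted(missing)}")
            continue
        if rec["id"] in seen_ids:
            problems.append(f"record {i}: duplicate id {rec['id']!r}")
        seen_ids.add(rec["id"])
        if rec["severity"] not in SEVERITIES:
            problems.append(f"record {i}: unknown severity {rec['severity']!r}")
    return problems


# One record in the shape eval/dataset.json is assumed to have.
example = json.loads("""
[
  {"id": "returns-001",
   "question": "What is your return policy?",
   "reference": "Products can be returned within 30 days for a full refund.",
   "contexts": ["Our policy allows returns within 30 days with the original receipt."],
   "category": "returns",
   "severity": "critical"}
]
""")

print(validate_records(example))  # []
```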

6.2 The scoring harness


# eval_runner.py
import json, time
from datetime import datetime, timezone
from pathlib import Path
from llm_judge import judge
from embedding_eval import semantic_similarity

def run_eval(
    response_fn,           # callable: question -> str
    dataset: list,
    output_path: str = "eval/results_latest.json",
) -> dict:
    results = []

    for sample in dataset:
        t0 = time.perf_counter()
        response = response_fn(sample.question)
        latency_ms = (time.perf_counter() - t0) * 1000

        emb_score = semantic_similarity(response, sample.reference)

        # Cost optimisation: skip the LLM judge for high-similarity responses.
        # A score >= 0.85 means "same information, different phrasing" (per the interpretation
        # guide above). This heuristic will miss cases where the wording is similar but
        # correctness differs; raise the threshold to 0.90 or remove the shortcut for
        # safety-critical evals. Auto-passed samples are recorded as 5/5/5, which pulls
        # the mean judge scores upward.
        if emb_score < 0.85:
            llm_scores = judge(sample.question, response, sample.reference)
        else:
            llm_scores = {"correctness": 5, "completeness": 5, "clarity": 5, "reasoning": "auto-pass (emb>=0.85)"}

        results.append({
            "id":               sample.id,
            "question":         sample.question,
            "response":         response,
            "embedding_sim":    round(emb_score, 4),
            "llm_scores":       llm_scores,
            "latency_ms":       round(latency_ms, 1),
            "category":         sample.category,
            "severity":         sample.severity,
            "timestamp":        datetime.now(timezone.utc).isoformat(),
        })

    summary = _aggregate(results)
    payload = {"summary": summary, "results": results}
    Path(output_path).write_text(json.dumps(payload, indent=2))
    return summary

def _aggregate(results: list) -> dict:
    def mean(vals): return round(sum(vals) / len(vals), 4) if vals else 0.0

    def judged(key):
        # Skip samples where the judge returned no score (e.g. an empty dict after a
        # failed call) instead of silently counting them as a perfect 5.
        return [r["llm_scores"][key] for r in results if key in r["llm_scores"]]

    return {
        "n":                  len(results),
        "mean_embedding_sim": mean([r["embedding_sim"] for r in results]),
        "mean_correctness":   mean(judged("correctness")),
        "mean_completeness":  mean(judged("completeness")),
        "mean_latency_ms":    mean([r["latency_ms"] for r in results]),
        "critical_failures":  sum(1 for r in results
                                  if r["embedding_sim"] < 0.70 and r.get("severity") == "critical"),
    }
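
A single overall mean can hide a drop that is confined to one category. Since every result row already carries a category field, a per-category breakdown is cheap to add. The sketch below is one possible shape for it; the function name by_category and the sample rows are hypothetical.

```python
from collections import defaultdict


def by_category(results: list[dict]) -> dict[str, dict]:
    """Group result rows by category and summarise embedding similarity."""
    groups: dict[str, list[float]] = defaultdict(list)
    for r in results:
        groups[r["category"]].append(r["embedding_sim"])
    return {
        cat: {
            "n": len(sims),
            "mean_embedding_sim": round(sum(sims) / len(sims), 4),
            "min_embedding_sim": min(sims),
        }
        for cat, sims in sorted(groups.items())
    }


# Hand-written rows in the shape produced by run_eval above.
rows = [
    {"category": "returns", "embedding_sim": 0.9},
    {"category": "returns", "embedding_sim": 0.8},
    {"category": "billing", "embedding_sim": 0.4},
]

for category, stats in by_category(rows).items():
    print(category, stats)
```

Here the overall mean of 0.7 would look like a mild dip, while the breakdown shows that billing is the category that failed.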

7. Regression Testing: Catching Drops Before They Ship

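Run the eval harness on every pull request and fail the build if any critical metric falls below its configured threshold. The pattern mirrors standard CI quality gates, but the thresholds are numeric floors calibrated empirically on your own data rather than binary pass/fail assertions. The gate below also prints each metric's change against a stored baseline for context; the delta is reported but does not itself fail the build.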


# regression_gate.py
import json, sys
from pathlib import Path

THRESHOLDS = {
    "mean_embedding_sim": 0.82,   # fail if the mean drops below this
    "mean_correctness":   3.8,    # out of 5
    "mean_completeness":  3.8,
    "critical_failures":  0,      # maximum allowed: zero tolerance for critical sample failures
}

def check_regression(
    current_path:  str = "eval/results_latest.json",
    baseline_path: str = "eval/results_baseline.json",
) -> bool:
    current  = json.loads(Path(current_path).read_text())["summary"]
    baseline = json.loads(Path(baseline_path).read_text())["summary"]

    passed = True
    print("\n=== Eval Regression Report ===")

    for metric, threshold in THRESHOLDS.items():
        curr_val = current.get(metric, 0)
        base_val = baseline.get(metric, 0)
        delta    = curr_val - base_val

        if metric == "critical_failures":
            ok = curr_val <= threshold   # a failure count: lower is better
        else:
            ok = curr_val >= threshold   # a quality score: higher is better

        status = "✓ PASS" if ok else "✗ FAIL"
        print(f"  {status}  {metric}: {curr_val:.3f}  (baseline {base_val:.3f}, delta {delta:+.3f})")

        if not ok:
            passed = False

    print(f"\nOverall: {'PASS' if passed else 'FAIL, block merge'}")
    return passed

if __name__ == "__main__":
    ok = check_regression()
    sys.exit(0 if ok else 1)

Wire this into your CI as a GitHub Actions job that runs on every PR targeting main. Store the baseline JSON in the repository and update it manually after intentional quality changes are validated by human review.


# .github/workflows/eval.yml
name: LLM Quality Gate

on:
  pull_request:
    branches: [main]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with: {python-version: "3.12"}
      - run: pip install anthropic "ragas>=0.1,<0.2" datasets sentence-transformers
      - run: python eval/run_full_eval.py
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
      - run: python eval/regression_gate.py
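
The gate in regression_gate.py enforces absolute floors and only reports the change against the baseline. If you also want to fail on a drop relative to the baseline, a minimal sketch could look like the following. The function name relative_regressions and the tolerance values are illustrative assumptions, not calibrated recommendations.

```python
def relative_regressions(current: dict, baseline: dict, tolerances: dict) -> list[str]:
    """List metrics whose value dropped by more than the allowed tolerance."""
    failures = []
    for metric, tolerance in tolerances.items():
        if metric not in current or metric not in baseline:
            failures.append(f"{metric}: missing from current or baseline summary")
            continue
        drop = baseline[metric] - current[metric]  # positive means quality went down
        if drop > tolerance:
            failures.append(f"{metric}: dropped {drop:.3f} (tolerance {tolerance:.3f})")
    return failures


TOLERANCES = {"mean_embedding_sim": 0.02, "mean_correctness": 0.2}  # illustrative values

baseline_summary = {"mean_embedding_sim": 0.90, "mean_correctness": 4.4}
current_summary = {"mean_embedding_sim": 0.84, "mean_correctness": 4.3}

print(relative_regressions(current_summary, baseline_summary, TOLERANCES))
# ['mean_embedding_sim: dropped 0.060 (tolerance 0.020)']
```

Both summaries in this example clear the absolute floors used earlier, yet the embedding similarity has fallen by 0.06 since the baseline. A relative check catches that kind of slow erosion.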

8. Choosing the Right Metric for Your Task

Task type	Primary metric	Secondary metric	Why
RAG question answering	RAGAS faithfulness + relevancy	Context precision + recall	Distinguishes retriever bugs from generation bugs
Open-ended Q&A (no retrieval)	LLM-as-judge correctness	Embedding similarity vs reference	LLM judge handles semantic variation; embedding provides fast pass/fail
Summarisation	Embedding similarity + faithfulness	LLM-as-judge completeness	Need both semantic closeness and grounding in source
Classification / extraction	Exact-match or F1 on extracted fields	LLM-as-judge for edge cases	Structured output allows exact comparison; LLM judge handles ambiguous cases
Conversational chat	LLM-as-judge pairwise (A/B test)	Safety classifier	Quality is preference-based; single-answer scoring is unreliable for chat
Code generation	Execution-based tests (unit tests)	LLM-as-judge readability	Code can be deterministically tested; execution is the ground truth

9. Key Takeaways

Exact-match unit tests cannot evaluate free-text LLM output. Exact-match comparison misses semantically equivalent answers and provides no partial credit. You need metrics designed for probabilistic outputs.
LLM-as-judge is the most flexible evaluator. Use a forced tool call for structured output, pick a judge from a different model family than the one under test to reduce self-preference bias, and run each pairwise comparison in both orders to cancel position bias.
RAGAS separates retrieval from generation. Faithfulness and Answer Relevancy measure the generator. Context Precision and Recall measure the retriever. Knowing which component failed tells you where to fix the problem.
Embedding similarity is your fast filter. A local model like all-MiniLM-L6-v2 runs at near-zero cost and screens out obvious failures in milliseconds. Route only low-similarity samples to the full LLM judge, which can cut judging costs by roughly 50–80% depending on how many samples clear the threshold.
Eval datasets are code. Version them in your repository, review additions in PRs, and treat regressions as blocking CI failures with the same severity as broken tests.
Calibrate your judge against humans. Run 100 samples through both your LLM judge and human reviewers quarterly. Target a Pearson correlation above 0.80. A judge that drifts from human judgement silently corrupts all downstream metrics.
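
The calibration check in the last takeaway can be sketched with a hand-rolled Pearson correlation. The score lists below are made-up illustrations of six paired ratings, and the helper names are hypothetical; a real check would use the roughly 100 paired samples suggested above.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / (spread_x * spread_y)


def judge_is_calibrated(judge: list[float], human: list[float], target: float = 0.80) -> bool:
    """True if the judge's scores track human scores at or above the target."""
    return pearson(judge, human) >= target


# Illustrative 1-5 correctness scores for the same six samples.
judge_scores = [5, 4, 2, 3, 1, 4]
human_scores = [5, 5, 2, 3, 1, 3]

print(round(pearson(judge_scores, human_scores), 2))
print(judge_is_calibrated(judge_scores, human_scores))  # True
```

Because the 1-5 scores are ordinal, a rank-based correlation is a reasonable alternative to try alongside Pearson.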