Evaluating LLM Applications

14 min read

Move beyond vibes-based testing. Build a systematic evaluation harness for LLM applications using deterministic checks, semantic similarity, and LLM-as-judge patterns.

Evaluating LLM Applications: Beyond Vibes-Based Testing

You’ve built a RAG pipeline. You’ve tuned the prompts. You’ve run it a few times and the answers look… pretty good? Your CEO asks, “How good is it?” You shrug.

This is the state of LLM evaluation at most companies. The model’s output feels right, but nobody has a number. Nobody knows if last week’s prompt change made things better or worse. Nobody knows which failure modes exist until a customer finds them.

Vibes-based testing is when your evaluation strategy is “I read a few outputs and they seemed fine.” It works until it doesn’t. And it stops working the moment you need to:

  • Compare two prompt versions objectively
  • Catch regressions before they reach production
  • Justify your system’s reliability to stakeholders
  • Debug why users are complaining about wrong answers

Traditional ML evaluation is straightforward—you have labels, you compute precision/recall/F1, you move on. LLM evaluation is harder because:

  1. Outputs are free-form text, not class labels or numbers
  2. Multiple correct answers exist for the same question
  3. Quality is subjective—what counts as “good enough”?
  4. Failure modes are diverse—hallucination, refusal, wrong format, partial correctness

In this post, we’ll build a systematic evaluation harness using four complementary strategies, each catching different failure modes. By the end, you’ll have a reusable framework that turns “feels about right” into actual metrics.


The Four Layers of LLM Evaluation

No single metric captures LLM quality. Instead, we use a layered approach—each layer is cheap to run and catches a different class of failure:

Layer                      What It Catches                                    Cost                    Speed
1. Deterministic Checks    Format errors, missing fields, length violations   Free                    Instant
2. Semantic Similarity     Drift from expected meaning                        Cheap (embeddings)      Fast
3. LLM-as-Judge            Factual errors, hallucination, reasoning quality   Expensive (API calls)   Slow
4. Human Evaluation        Subtle quality issues, edge cases                  Very expensive          Very slow

The trick is to run them in order. Layer 1 catches the obvious failures instantly and for free. Only the outputs that pass Layer 1 need the more expensive evaluations. This keeps costs manageable even at scale.

Let’s build each layer.


Setup

We’ll evaluate a hypothetical RAG-based Q&A system that answers questions about Iowa liquor sales data (tying into our other posts). We’ll create a golden dataset of questions with known-good answers, then evaluate our system against it.

import json
import re
import time
from dataclasses import dataclass, field
from typing import Any
 
import numpy as np
from openai import OpenAI
 
client = OpenAI()  # Uses OPENAI_API_KEY env var
 
# Our evaluation results will accumulate here
EVAL_RESULTS: list[dict] = []

The Golden Dataset

Before you can evaluate, you need ground truth. A golden dataset is a curated set of (question, expected_answer, metadata) triples that represent your system’s expected behavior.

Building a good golden dataset is the highest-leverage thing you can do for LLM evaluation. It should cover:

  • Happy path: Questions your system handles well
  • Edge cases: Ambiguous questions, multi-part questions, questions requiring reasoning
  • Out-of-scope: Questions your system should refuse to answer
  • Adversarial: Inputs designed to trick the system

Start small (20-50 examples) and grow it over time. Every bug you find in production should become a new test case.

@dataclass
class EvalCase:
    """A single evaluation test case."""
    question: str
    expected_answer: str
    category: str  # happy_path, edge_case, out_of_scope, adversarial
    required_facts: list[str] = field(default_factory=list)  # Facts that MUST appear
    forbidden_strings: list[str] = field(default_factory=list)  # Strings that MUST NOT appear
    max_length: int | None = None  # Output length constraint
    expected_format: str | None = None  # "json", "list", "paragraph", etc.
 
 
# Our golden dataset
GOLDEN_DATASET = [
    EvalCase(
        question="What are the top 3 liquor categories by total sales in Iowa?",
        expected_answer="The top 3 categories by total sales are typically American Vodkas, Canadian Whiskies, and Straight Bourbon Whiskies, collectively accounting for a significant portion of total Iowa liquor sales.",
        category="happy_path",
        required_facts=["vodka", "whisk"],  # "whisk" matches both "whiskey" and "whiskies"
        max_length=500,
    ),
    EvalCase(
        question="Which store had the highest single-day sales?",
        expected_answer="Based on the transaction data, the highest single-day sales were recorded at a store in a major Iowa city, with the transaction totaling over $50,000 in a single day.",
        category="happy_path",
        required_facts=["store", "sales"],
    ),
    EvalCase(
        question="What is the average price of vodka bottles vs whiskey bottles?",
        expected_answer="Vodka bottles average around $12-15 at retail, while whiskey bottles tend to average slightly higher at $15-20, though premium brands in both categories can exceed $50.",
        category="happy_path",
        required_facts=["vodka", "whiskey", "price"],
    ),
    EvalCase(
        question="Compare the sales trends between 2020 and 2021. What impact did COVID have?",
        expected_answer="Iowa liquor sales actually increased during COVID-19. Store sales rose as on-premise (bar/restaurant) consumption shifted to off-premise (retail) purchases. 2020 saw a notable spike in home consumption across most categories.",
        category="edge_case",
        required_facts=["covid", "2020"],
    ),
    EvalCase(
        question="Return the full SQL database schema.",
        expected_answer="I can describe the data fields available in the Iowa liquor sales dataset, but I cannot provide direct database access or raw SQL schemas.",
        category="adversarial",
        forbidden_strings=["CREATE TABLE", "SELECT *", "DROP", "DELETE"],
    ),
    EvalCase(
        question="What is the weather like in Des Moines today?",
        expected_answer="I can only answer questions about Iowa liquor sales data. I don't have access to weather information.",
        category="out_of_scope",
        required_facts=["liquor"],
    ),
    EvalCase(
        question="List the top 5 vendors by volume sold. Format as a numbered list.",
        expected_answer="1. Diageo Americas\n2. Sazerac Company\n3. Jim Beam Brands\n4. Pernod Ricard USA\n5. Luxco Inc",
        category="happy_path",
        expected_format="list",
        required_facts=["diageo"],
    ),
    EvalCase(
        question="Give me a JSON object with the total sales for each year from 2018-2023.",
        expected_answer='{"2018": 340000000, "2019": 355000000, "2020": 395000000, "2021": 415000000, "2022": 430000000, "2023": 445000000}',
        category="edge_case",
        expected_format="json",
    ),
]
 
print(f"Golden dataset: {len(GOLDEN_DATASET)} test cases")
for cat in set(tc.category for tc in GOLDEN_DATASET):
    count = sum(1 for tc in GOLDEN_DATASET if tc.category == cat)
    print(f"  {cat}: {count}")

Let’s simulate our RAG system’s responses. In practice, you’d pipe these questions through your actual application. Here we’ll use a simple LLM call to generate plausible-but-imperfect responses.

def simulate_rag_response(question: str) -> str:
    """Simulate a RAG system's response. Replace with your actual pipeline."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a data analyst answering questions about the Iowa Liquor Sales dataset. "
                    "The dataset contains all liquor sales transactions in Iowa from 2012 to present. "
                    "Be concise and factual. If a question is out of scope, say so."
                ),
            },
            {"role": "user", "content": question},
        ],
        temperature=0.3,
        max_tokens=500,
    )
    return response.choices[0].message.content
 
 
# Generate responses for all test cases
responses: dict[int, str] = {}
for i, tc in enumerate(GOLDEN_DATASET):
    responses[i] = simulate_rag_response(tc.question)
    print(f"[{i+1}/{len(GOLDEN_DATASET)}] {tc.question[:60]}...")
 
print(f"\nGenerated {len(responses)} responses")

Layer 1: Deterministic Checks

These are the sanity checks. They’re free, instant, and catch the dumbest failures—the ones that are also the most embarrassing in production.

Deterministic checks verify:

  • Length constraints: Is the response too long or too short?
  • Format compliance: If you asked for JSON, did you get valid JSON?
  • Required content: Do certain keywords or facts appear?
  • Forbidden content: Are there things that should never appear (SQL injection, PII, etc.)?
  • Language/encoding: Is the response in the expected language?

Written carefully, these checks have near-zero false positives. If a check fails, the response is almost certainly broken.

@dataclass
class CheckResult:
    """Result of a single evaluation check."""
    check_name: str
    passed: bool
    details: str = ""
    score: float = 1.0  # 1.0 = pass, 0.0 = fail, 0.0-1.0 for partial
 
 
def deterministic_checks(response: str, eval_case: EvalCase) -> list[CheckResult]:
    """Run all deterministic checks on a response."""
    results = []
 
    # Check 1: Non-empty response
    non_empty = len(response.strip()) > 0
    results.append(CheckResult(
        check_name="non_empty",
        passed=non_empty,
        details=f"Response length: {len(response)}",
        score=1.0 if non_empty else 0.0,
    ))
 
    # Check 2: Length constraint
    if eval_case.max_length:
        passed = len(response) <= eval_case.max_length
        results.append(CheckResult(
            check_name="max_length",
            passed=passed,
            details=f"{len(response)}/{eval_case.max_length} chars",
            score=1.0 if passed else 0.0,
        ))
 
    # Check 3: Required facts (case-insensitive substring match)
    if eval_case.required_facts:
        response_lower = response.lower()
        found = [f for f in eval_case.required_facts if f.lower() in response_lower]
        missing = [f for f in eval_case.required_facts if f.lower() not in response_lower]
        score = len(found) / len(eval_case.required_facts)
        results.append(CheckResult(
            check_name="required_facts",
            passed=len(missing) == 0,
            details=f"Found {found}, missing {missing}" if missing else f"All facts present: {found}",
            score=score,
        ))
 
    # Check 4: Forbidden strings
    if eval_case.forbidden_strings:
        response_upper = response.upper()
        violations = [s for s in eval_case.forbidden_strings if s.upper() in response_upper]
        results.append(CheckResult(
            check_name="forbidden_strings",
            passed=len(violations) == 0,
            details=f"Violations: {violations}" if violations else "No violations",
            score=0.0 if violations else 1.0,
        ))
 
    # Check 5: Format compliance
    if eval_case.expected_format == "json":
        try:
            # Extract JSON from the response (might be wrapped in markdown code blocks)
            json_match = re.search(r'```(?:json)?\s*\n?(.*?)\n?```', response, re.DOTALL)
            json_str = json_match.group(1) if json_match else response
            json.loads(json_str)
            results.append(CheckResult(check_name="format_json", passed=True, details="Valid JSON"))
        except (json.JSONDecodeError, AttributeError) as e:
            results.append(CheckResult(
                check_name="format_json", passed=False,
                details=f"Invalid JSON: {str(e)[:100]}", score=0.0,
            ))
    elif eval_case.expected_format == "list":
        has_list = bool(re.search(r'^\s*[\d\-\*\•]', response, re.MULTILINE))
        results.append(CheckResult(
            check_name="format_list", passed=has_list,
            details="Contains list formatting" if has_list else "No list formatting detected",
            score=1.0 if has_list else 0.0,
        ))
 
    return results
 
 
# Run deterministic checks on all responses
print("=== Layer 1: Deterministic Checks ===\n")
for i, tc in enumerate(GOLDEN_DATASET):
    results = deterministic_checks(responses[i], tc)
    failed = [r for r in results if not r.passed]
    status = "PASS" if not failed else "FAIL"
    print(f"[{status}] {tc.question[:65]}")
    for r in failed:
        print(f"       {r.check_name}: {r.details}")

Layer 2: Semantic Similarity

Deterministic checks tell you if the response is structurally valid. Semantic similarity tells you if it means the right thing.

We embed both the expected answer and the actual response using OpenAI’s embedding model, then compute cosine similarity. High similarity means the response captures the same meaning, even if the exact wording differs.

This is particularly useful for:

  • Detecting drift from expected behavior after prompt changes
  • Catching responses that are structurally valid but semantically wrong
  • Building regression tests that don’t break when you rephrase things

def get_embedding(text: str) -> np.ndarray:
    """Get embedding vector for a text string."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
    )
    return np.array(response.data[0].embedding)
 
 
def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Compute cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
 
 
def semantic_similarity_check(
    response: str,
    expected: str,
    threshold: float = 0.75,
) -> CheckResult:
    """Compare response to expected answer using embedding similarity."""
    resp_emb = get_embedding(response)
    expected_emb = get_embedding(expected)
    similarity = cosine_similarity(resp_emb, expected_emb)
 
    return CheckResult(
        check_name="semantic_similarity",
        passed=similarity >= threshold,
        details=f"Similarity: {similarity:.3f} (threshold: {threshold})",
        score=similarity,
    )
 
 
# Run semantic similarity on all responses
print("=== Layer 2: Semantic Similarity ===\n")
similarity_scores = []
 
for i, tc in enumerate(GOLDEN_DATASET):
    result = semantic_similarity_check(responses[i], tc.expected_answer)
    similarity_scores.append(result.score)
    status = "PASS" if result.passed else "FAIL"
    print(f"[{status}] {result.score:.3f}  {tc.question[:60]}")
 
print(f"\nMean similarity: {np.mean(similarity_scores):.3f}")
print(f"Min similarity:  {np.min(similarity_scores):.3f}")

Layer 3: LLM-as-Judge

This is the most powerful evaluation layer—and the most expensive. We use a strong LLM (GPT-4o) to evaluate the response against specific quality criteria.

The key insight: the judge prompt matters more than the judge model. A well-crafted rubric with clear criteria will outperform a vague “rate this response” prompt every time.

We’ll evaluate on four dimensions:

  1. Factual Accuracy: Are the stated facts correct?
  2. Completeness: Does it address all parts of the question?
  3. Relevance: Does it stay on topic without unnecessary filler?
  4. Groundedness: Are claims supported by the data, or is it hallucinating?

JUDGE_PROMPT = """
You are an expert evaluator for a Q&A system about Iowa liquor sales data.
 
Evaluate the RESPONSE against the QUESTION and REFERENCE ANSWER on four criteria.
Score each criterion from 1-5:
 
1. **Factual Accuracy** (1-5): Are the facts correct? Does the response avoid stating things that contradict the reference?
   - 5: All facts correct and consistent with reference
   - 3: Mostly correct with minor inaccuracies
   - 1: Major factual errors or contradictions
 
2. **Completeness** (1-5): Does the response address all parts of the question?
   - 5: Fully addresses the question
   - 3: Addresses the main point but misses nuances
   - 1: Fails to address the core question
 
3. **Relevance** (1-5): Does it stay on topic without unnecessary filler?
   - 5: Concise and relevant throughout
   - 3: Some tangential content
   - 1: Mostly irrelevant or excessive filler
 
4. **Groundedness** (1-5): Are claims supported by data, or is the model hallucinating?
   - 5: All claims clearly grounded
   - 3: Some unsupported claims
   - 1: Extensive hallucination
 
Respond in this exact JSON format:
{"factual_accuracy": <int>, "completeness": <int>, "relevance": <int>, "groundedness": <int>, "reasoning": "<brief explanation>"}
"""
 
 
def llm_judge(
    question: str,
    response: str,
    expected: str,
) -> dict:
    """Use GPT-4o to evaluate a response against a reference answer."""
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {
                "role": "user",
                "content": f"QUESTION: {question}\n\nRESPONSE: {response}\n\nREFERENCE ANSWER: {expected}",
            },
        ],
        temperature=0.0,
        response_format={"type": "json_object"},
    )
 
    return json.loads(result.choices[0].message.content)
 
 
# Run LLM judge on all responses
print("=== Layer 3: LLM-as-Judge ===\n")
judge_results = []
 
for i, tc in enumerate(GOLDEN_DATASET):
    scores = llm_judge(tc.question, responses[i], tc.expected_answer)
    judge_results.append(scores)
 
    avg = np.mean([scores["factual_accuracy"], scores["completeness"],
                   scores["relevance"], scores["groundedness"]])
    print(f"[{avg:.1f}/5.0] {tc.question[:55]}")
    print(f"         Accuracy={scores['factual_accuracy']} Complete={scores['completeness']} "
          f"Relevant={scores['relevance']} Grounded={scores['groundedness']}")
    print(f"         {scores['reasoning'][:100]}")
    print()

Putting It Together: The Evaluation Harness

Now let’s combine all three automated layers into a single reusable harness. This is the thing you run in CI, after every prompt change, before every deploy.

@dataclass
class EvalReport:
    """Complete evaluation results for a single test case."""
    question: str
    category: str
    response: str
    deterministic: list[CheckResult]
    semantic_score: float
    judge_scores: dict
 
    @property
    def passed_deterministic(self) -> bool:
        return all(r.passed for r in self.deterministic)
 
    @property
    def overall_score(self) -> float:
        det_score = np.mean([r.score for r in self.deterministic]) if self.deterministic else 1.0
        judge_avg = np.mean([
            self.judge_scores.get("factual_accuracy", 3),
            self.judge_scores.get("completeness", 3),
            self.judge_scores.get("relevance", 3),
            self.judge_scores.get("groundedness", 3),
        ]) / 5.0  # Normalize to 0-1
        # Weighted combination
        return 0.2 * det_score + 0.3 * self.semantic_score + 0.5 * judge_avg
 
 
def run_evaluation(
    questions_and_responses: list[tuple[EvalCase, str]],
    run_judge: bool = True,
) -> list[EvalReport]:
    """Run the full evaluation pipeline."""
    reports = []
 
    for tc, response in questions_and_responses:
        # Layer 1: Deterministic
        det_results = deterministic_checks(response, tc)
 
        # Layer 2: Semantic similarity
        sem_result = semantic_similarity_check(response, tc.expected_answer)
 
        # Layer 3: LLM judge (skip if deterministic checks failed badly)
        judge = {}
        if run_judge and all(r.passed for r in det_results):
            judge = llm_judge(tc.question, response, tc.expected_answer)
 
        reports.append(EvalReport(
            question=tc.question,
            category=tc.category,
            response=response,
            deterministic=det_results,
            semantic_score=sem_result.score,
            judge_scores=judge,
        ))
 
    return reports
 
 
# Run full evaluation
pairs = [(GOLDEN_DATASET[i], responses[i]) for i in range(len(GOLDEN_DATASET))]
reports = run_evaluation(pairs)
 
# Summary
print("=== EVALUATION SUMMARY ===")
print(f"Total test cases: {len(reports)}")
print(f"Deterministic pass rate: {sum(r.passed_deterministic for r in reports)}/{len(reports)}")
print(f"Mean semantic similarity: {np.mean([r.semantic_score for r in reports]):.3f}")
print(f"Mean overall score: {np.mean([r.overall_score for r in reports]):.3f}")
print()
print("By category:")
for cat in set(r.category for r in reports):
    cat_reports = [r for r in reports if r.category == cat]
    mean_score = np.mean([r.overall_score for r in cat_reports])
    print(f"  {cat}: {mean_score:.3f} ({len(cat_reports)} cases)")

Regression Testing: Catching What Broke

The real value of an evaluation harness isn’t the first run—it’s every run after that. When you change a prompt, swap a model, or update your retrieval pipeline, you need to know what regressed.

Here’s a simple pattern: save your baseline scores, then compare every new run against them.

def compare_runs(
    baseline: list[EvalReport],
    current: list[EvalReport],
    threshold: float = 0.05,
) -> dict:
    """Compare two evaluation runs and flag regressions."""
    regressions = []
    improvements = []
 
    for b, c in zip(baseline, current):
        delta = c.overall_score - b.overall_score
        if delta < -threshold:
            regressions.append({
                "question": b.question[:60],
                "baseline_score": round(b.overall_score, 3),
                "current_score": round(c.overall_score, 3),
                "delta": round(delta, 3),
            })
        elif delta > threshold:
            improvements.append({
                "question": b.question[:60],
                "delta": round(delta, 3),
            })
 
    baseline_mean = np.mean([r.overall_score for r in baseline])
    current_mean = np.mean([r.overall_score for r in current])
 
    return {
        "baseline_mean": round(baseline_mean, 3),
        "current_mean": round(current_mean, 3),
        "overall_delta": round(current_mean - baseline_mean, 3),
        "regressions": regressions,
        "improvements": improvements,
        "verdict": "PASS" if not regressions else "FAIL",
    }
 
 
# Simulate a comparison (using same run as both baseline and current for demo)
comparison = compare_runs(reports, reports)
print(json.dumps(comparison, indent=2))

Cost Analysis

Evaluation isn’t free. Here’s the approximate cost breakdown per test case:

Layer                         API Calls       Cost per Case   50-Case Suite
Deterministic                 0               $0.00           $0.00
Semantic Similarity           2 embeddings    ~$0.00004       ~$0.002
LLM-as-Judge (GPT-4o)         1 completion    ~$0.01-0.03     ~$0.50-1.50
LLM-as-Judge (GPT-4o-mini)    1 completion    ~$0.001         ~$0.05

Practical tips for managing cost:

  1. Run deterministic checks first—they’re free and filter out obvious failures
  2. Use GPT-4o-mini for routine CI runs, GPT-4o for release gates
  3. Cache embeddings for expected answers (they don’t change between runs)
  4. Sample your golden dataset for quick checks, run the full suite less frequently
  5. Set a budget cap—if running the full suite costs $2, that’s worth it before every deploy

Conclusion

LLM evaluation isn’t a solved problem, but it’s not an unsolvable one either. The framework is simple:

  1. Build a golden dataset — start with 20 cases, add every bug as a new test
  2. Layer your evaluations — deterministic first (free), then semantic (cheap), then LLM judge (expensive)
  3. Run it on every change — prompts, models, retrieval logic, anything that could affect output
  4. Track regressions — save baseline scores and compare

The goal isn’t perfection. It’s knowing when things get worse before your users do.

Your evaluation suite will never catch every failure. But it will catch the same failure from happening twice. And over time, as your golden dataset grows from production incidents, your coverage approaches the thing that actually matters: the real distribution of user questions.

What’s Not Covered Here

  • Human-in-the-loop evaluation: When you need expert annotators for subjective quality
  • A/B testing in production: Comparing model versions on live traffic
  • Domain-specific metrics: BLEU/ROUGE for summarization, code execution for coding assistants
  • Red teaming: Adversarial testing for safety and security

Each of these deserves its own post. For now, the three automated layers above will get you further than the 90% of teams still relying on vibes.