Hashorn

MLOps

Building Your First LLM Evaluation Harness: A Step-by-Step Guide

How to build an evaluation harness for LLM features from scratch. The eval set design, the scoring approach, how to wire it into CI, and what to watch in production.

By Hashorn TeamMay 31, 2026 6 min read

LLM features without an evaluation harness drift in quality silently. You change a prompt to fix one bug, you break another, no one notices for weeks until a customer complains. The evaluation harness is the regression test for the AI feature. This post walks through building one from scratch for a typical AI feature.

The harness shape

Four moving parts

Four pieces. Eval set defines what to test. Runner executes the feature. Scorer evaluates. Report shows the result.

Step 1: Define the eval set

The most important artefact of an AI product team.

For a "summarise this customer support email" feature, an eval set entry looks like:

{
  "id": "support-summary-001",
  "input": {
    "email_body": "Hi, my login isn't working since this morning. I tried resetting the password and got an error. Help!"
  },
  "expected": {
    "must_include": ["login", "password", "error"],
    "must_not_include": ["billing", "refund"],
    "tone": "concise",
    "max_length_words": 30
  }
}

Some checks are deterministic (must include certain keywords, length bounds). Others are softer (tone, helpfulness) and need LLM-as-judge or human review.

Aim for the eval set to cover:

  • Happy path examples. 60-70 percent.
  • Edge cases. Empty input, very long input, multilingual, unusual format. 20-25 percent.
  • Adversarial inputs. Prompt injection attempts, abusive content, off-topic. 10-15 percent.
  • Known-bad past failures. Whenever a bug ships, the next version of the eval set has the case that would have caught it.

Step 2: Build the runner

The runner takes each eval example, calls your AI feature, and captures the output.

A minimal Python runner:

import json
from your_app import summarise_email

with open("eval-set.json") as f:
    examples = json.load(f)

results = []
for example in examples:
    output = summarise_email(example["input"]["email_body"])
    results.append({
        "id": example["id"],
        "input": example["input"],
        "expected": example["expected"],
        "actual": output,
    })

with open("eval-results.json", "w") as f:
    json.dump(results, f, indent=2)

That's it. The runner is simple. Don't over-engineer it.

Step 3: Build the scorer

The scorer checks each result against the expected behaviour.

Three scoring layers:

Layer 1: Deterministic checks

Cheap, fast, exact. Run on every eval example.

def deterministic_score(actual, expected):
    score = {}
    for keyword in expected.get("must_include", []):
        score[f"includes_{keyword}"] = keyword.lower() in actual.lower()
    for keyword in expected.get("must_not_include", []):
        score[f"excludes_{keyword}"] = keyword.lower() not in actual.lower()
    if "max_length_words" in expected:
        score["within_length"] = len(actual.split()) <= expected["max_length_words"]
    return score

Layer 2: LLM-as-judge

For softer criteria. Tone, helpfulness, accuracy.

JUDGE_PROMPT = """
You are evaluating a customer-support email summary.

Original email:
{input}

Generated summary:
{actual}

Expected criteria:
- Tone should be: {tone}
- Summary should be helpful and concise.

Score on a 1-5 scale:
- 1: bad
- 5: excellent

Return JSON: {{"score": 1-5, "rationale": "..."}}
"""

def llm_judge_score(actual, expected, input_data):
    response = call_judge_model(
        prompt=JUDGE_PROMPT.format(
            input=input_data["email_body"],
            actual=actual,
            tone=expected.get("tone", "neutral"),
        )
    )
    return json.loads(response)

Layer 3: Human review

For the most important examples, or a random sample. A senior reviewer reads the input and output, scores 1-5.

Most teams don't need to score every eval example with a human. 10 to 20 percent sampling is enough to keep the LLM judge honest.

Step 4: Wire into CI

The harness runs on every PR that touches AI code, prompts, or models.

The CI workflow:

- name: Run LLM evals
  run: python eval/run.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Check eval thresholds
  run: python eval/check_thresholds.py
  # exits non-zero if score regresses past threshold

The threshold check is the gate. Typical thresholds:

  • Deterministic: 95+ percent pass rate.
  • LLM judge average: 4.0 or higher on a 1-5 scale.
  • No regression: average score must not drop by more than 0.2 from the main branch baseline.

Post the results as a PR comment so reviewers can see the change.

Step 5: Track in production

The eval harness runs offline. Production sampling watches the live feature.

Sample 1-5 percent of production calls. Run the same scorer over them. Track the score in your LLM observability dashboard. Compare offline eval score vs production sample score over time.

If they diverge, your eval set is missing real-world cases. Add them.

Cost considerations

Running 50 eval examples through GPT-4 or Claude on every PR costs roughly $0.50 to $5 depending on prompt size. Most teams budget $50 to $200 per month for eval infrastructure. Compared to the cost of a customer-impacting AI feature regression, this is rounding error.

If cost is a concern, run the cheap deterministic checks on every PR and run the LLM-judge layer only on merge to main.

Tools that help

  • Promptfoo: lightweight, CLI-first, runs eval sets and produces reports.
  • LangSmith Evaluations: integrated with LangSmith observability.
  • Braintrust: more polished UI, eval set management built in.
  • Custom Python with pytest: for teams that want to keep the harness in-repo with the test code.

We've used all four. Pick based on your team's preference for managed vs in-repo.

Pre-launch eval harness checklist

    Common mistakes

    • No eval set at all. Single biggest reason AI features regress silently.
    • Eval set drawn from training data. Always uses fresh real-world examples.
    • LLM-as-judge with a noisy prompt. Test the judge separately to confirm it scores consistently.
    • Treating eval scores as absolute. Trend matters more than absolute number.
    • Ignoring eval failures because "the human looking at it thinks it's fine." The eval is calibrated. If it fails, investigate seriously before overriding.

    How Hashorn builds eval harnesses

    Hashorn's MLOps engagement always starts with the eval harness. We build the eval set, the runner, the scorer, and the CI integration. We pair this with AI software development so the team's prompts and AI features go through the harness as part of normal development. For teams shipping AI in regulated environments, we pair MLOps with security engineering for prompt injection defence and output safety.

    Conclusion

    The LLM evaluation harness in 2026 is the regression test for AI features. Without it, AI products degrade silently. With it, AI products stay reliable as prompts and models evolve. Build the small version first. 30 examples. Deterministic + LLM-judge scoring. CI integration. Then grow the eval set as you discover edge cases. Two weeks of investment for a year of reliability.

    Frequently asked questions

    Need help building AI-powered software, QA automation, or secure cloud systems?

    Talk to Hashorn's engineering team. Dedicated senior engineers, QA, and security with same-week ramp.

    Have an engineering challenge you'd like a hand with?

    Tell us what you're building, we'll tell you how we'd ship it.

    Book an intro call →