How many evaluation examples do we need to start?

30 to 50 is enough for the first version. 200 to 500 is mature. Beyond 500 you're usually splitting into multiple eval sets (one per feature). Starting smaller is better than not starting; the eval set grows as you discover edge cases in production.

Should we use LLM-as-judge or human review for scoring?

Both. LLM-as-judge for the live dashboard and CI gates. Human review (30 to 100 samples per week by a senior reviewer) as the ground truth. They check each other. If LLM-as-judge says quality is fine but humans disagree, your judge prompt needs tuning.

Where do we store the eval set?

In your repo, as JSON or YAML files, version-controlled with your code. Treat the eval set like test fixtures. Some teams use a separate database for it; this is fine but introduces operational overhead the small benefit doesn't justify until you have thousands of examples.

Building Your First LLM Evaluation Harness: A Step-by-Step Guide

LLM features without an evaluation harness drift in quality silently. You change a prompt to fix one bug, you break another, no one notices for weeks until a customer complains. The evaluation harness is the regression test for the AI feature. This post walks through building one from scratch for a typical AI feature.

The harness shape

Four moving parts

Four pieces. Eval set defines what to test. Runner executes the feature. Scorer evaluates. Report shows the result.

Step 1: Define the eval set

The most important artefact of an AI product team.

For a "summarise this customer support email" feature, an eval set entry looks like:

{
  "id": "support-summary-001",
  "input": {
    "email_body": "Hi, my login isn't working since this morning. I tried resetting the password and got an error. Help!"
  },
  "expected": {
    "must_include": ["login", "password", "error"],
    "must_not_include": ["billing", "refund"],
    "tone": "concise",
    "max_length_words": 30
  }
}

Some checks are deterministic (must include certain keywords, length bounds). Others are softer (tone, helpfulness) and need LLM-as-judge or human review.

Aim for the eval set to cover:

Happy path examples. 60-70 percent.
Edge cases. Empty input, very long input, multilingual, unusual format. 20-25 percent.
Adversarial inputs. Prompt injection attempts, abusive content, off-topic. 10-15 percent.
Known-bad past failures. Whenever a bug ships, the next version of the eval set has the case that would have caught it.

Step 2: Build the runner

The runner takes each eval example, calls your AI feature, and captures the output.

A minimal Python runner:

import json
from your_app import summarise_email

with open("eval-set.json") as f:
    examples = json.load(f)

results = []
for example in examples:
    output = summarise_email(example["input"]["email_body"])
    results.append({
        "id": example["id"],
        "input": example["input"],
        "expected": example["expected"],
        "actual": output,
    })

with open("eval-results.json", "w") as f:
    json.dump(results, f, indent=2)

That's it. The runner is simple. Don't over-engineer it.

Step 3: Build the scorer

The scorer checks each result against the expected behaviour.

Three scoring layers:

Layer 1: Deterministic checks

Cheap, fast, exact. Run on every eval example.

def deterministic_score(actual, expected):
    score = {}
    for keyword in expected.get("must_include", []):
        score[f"includes_{keyword}"] = keyword.lower() in actual.lower()
    for keyword in expected.get("must_not_include", []):
        score[f"excludes_{keyword}"] = keyword.lower() not in actual.lower()
    if "max_length_words" in expected:
        score["within_length"] = len(actual.split()) <= expected["max_length_words"]
    return score

Layer 2: LLM-as-judge

For softer criteria. Tone, helpfulness, accuracy.

JUDGE_PROMPT = """
You are evaluating a customer-support email summary.

Original email:
{input}

Generated summary:
{actual}

Expected criteria:
- Tone should be: {tone}
- Summary should be helpful and concise.

Score on a 1-5 scale:
- 1: bad
- 5: excellent

Return JSON: {{"score": 1-5, "rationale": "..."}}
"""

def llm_judge_score(actual, expected, input_data):
    response = call_judge_model(
        prompt=JUDGE_PROMPT.format(
            input=input_data["email_body"],
            actual=actual,
            tone=expected.get("tone", "neutral"),
        )
    )
    return json.loads(response)

Layer 3: Human review

For the most important examples, or a random sample. A senior reviewer reads the input and output, scores 1-5.

Most teams don't need to score every eval example with a human. 10 to 20 percent sampling is enough to keep the LLM judge honest.

Step 4: Wire into CI

The harness runs on every PR that touches AI code, prompts, or models.

The CI workflow:

- name: Run LLM evals
  run: python eval/run.py
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

- name: Check eval thresholds
  run: python eval/check_thresholds.py
  # exits non-zero if score regresses past threshold

The threshold check is the gate. Typical thresholds:

Deterministic: 95+ percent pass rate.
LLM judge average: 4.0 or higher on a 1-5 scale.
No regression: average score must not drop by more than 0.2 from the main branch baseline.

Post the results as a PR comment so reviewers can see the change.

Step 5: Track in production

The eval harness runs offline. Production sampling watches the live feature.

Sample 1-5 percent of production calls. Run the same scorer over them. Track the score in your LLM observability dashboard. Compare offline eval score vs production sample score over time.

If they diverge, your eval set is missing real-world cases. Add them.

Cost considerations

Running 50 eval examples through GPT-4 or Claude on every PR costs roughly $0.50 to $5 depending on prompt size. Most teams budget $50 to $200 per month for eval infrastructure. Compared to the cost of a customer-impacting AI feature regression, this is rounding error.

If cost is a concern, run the cheap deterministic checks on every PR and run the LLM-judge layer only on merge to main.

Tools that help

Promptfoo: lightweight, CLI-first, runs eval sets and produces reports.
LangSmith Evaluations: integrated with LangSmith observability.
Braintrust: more polished UI, eval set management built in.
Custom Python with pytest: for teams that want to keep the harness in-repo with the test code.

We've used all four. Pick based on your team's preference for managed vs in-repo.

Pre-launch eval harness checklist

Common mistakes

No eval set at all. Single biggest reason AI features regress silently.
Eval set drawn from training data. Always uses fresh real-world examples.
LLM-as-judge with a noisy prompt. Test the judge separately to confirm it scores consistently.
Treating eval scores as absolute. Trend matters more than absolute number.
Ignoring eval failures because "the human looking at it thinks it's fine." The eval is calibrated. If it fails, investigate seriously before overriding.

How Hashorn builds eval harnesses

Hashorn's MLOps engagement always starts with the eval harness. We build the eval set, the runner, the scorer, and the CI integration. We pair this with AI software development so the team's prompts and AI features go through the harness as part of normal development. For teams shipping AI in regulated environments, we pair MLOps with security engineering for prompt injection defence and output safety.

Conclusion

The LLM evaluation harness in 2026 is the regression test for AI features. Without it, AI products degrade silently. With it, AI products stay reliable as prompts and models evolve. Build the small version first. 30 examples. Deterministic + LLM-judge scoring. CI integration. Then grow the eval set as you discover edge cases. Two weeks of investment for a year of reliability.

Frequently asked questions

TagsMLOps LLMOps Evaluation LLM Testing

Need help building AI-powered software, QA automation, or secure cloud systems?

Talk to Hashorn's engineering team. Dedicated senior engineers, QA, and security with same-week ramp.

Book a Demo MLOps