Applied GenAI

An evaluation harness you can ship on

You can't improve — or safely ship — what you can't measure. This is how enterprises turn 'the demo looked good' into a regression-gated eval suite that tells you, before deploy, whether a change made the product better or worse.

Python · CI · Any model

#enterprise #evals #reliability #patterns

flowchart LR
  DS[Golden set<br/>inputs + expectations] --> RUN[Run system<br/>on every case]
  RUN --> SC[Score each case<br/>assertion · model-graded · human]
  SC --> AGG[Aggregate<br/>pass rate by slice]
  AGG --> GATE{Above bar?}
  GATE -->|yes| SHIP[Merge / deploy]
  GATE -->|no| BLOCK[Block + show regressions]
  PROD[(Production traces)] -.harvest failures.-> DS
  classDef stop fill:#f5e7e3,stroke:#b1361e;
  class BLOCK stop;
An eval loop: a golden set runs through the system, each case is scored, results aggregate into pass rates, and a gate blocks or ships; production traces feed new cases back into the golden set.

TL;DR

The problem: shipping blind

A team builds an LLM feature. It demos well, so it ships. Then someone tweaks the prompt to fix one bug, and unknowingly breaks three other cases nobody re-checked. A model version updates and behavior subtly shifts. A new retrieval source adds noise. None of this is visible, because “quality” lives in a few people’s heads and gets spot-checked by hand, inconsistently, under deadline.

This is shipping blind, and it’s the default state of most AI products. The symptom is telling: the team is afraid to change the prompt because they can’t predict what it’ll do. That fear is the absence of evals. With software, you’d never refactor without tests; with AI, teams routinely deploy without the equivalent, then wonder why quality feels like whack-a-mole.

Offline and online: two loops, one system

Evaluation happens in two places, and you need both.

Offline evals run before deploy, in CI, against a curated golden set. They answer: “did this change make the system better or worse?” They’re your merge gate.

Online evals run after deploy, against real traffic (sampled). They answer: “is the system actually working in the wild, on inputs we never imagined?” They’re your monitoring.
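Sampling live traffic can be as simple as hashing a trace id into a bucket. The following is a minimal sketch, not a prescribed design: `in_sample`, `SAMPLE_RATE`, and the `trace-N` id format are all hypothetical names chosen for illustration.

```python
import hashlib

SAMPLE_RATE = 0.05   # assumed: score about 5% of live traffic


def in_sample(trace_id: str, rate: float = SAMPLE_RATE) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x1_0000_0000     # roughly uniform value in [0, 1)
    return bucket < rate


trace_ids = [f"trace-{i}" for i in range(10_000)]    # made-up ids for the demo
sampled = [t for t in trace_ids if in_sample(t)]
print(f"scoring {len(sampled)} of {len(trace_ids)} traces")
```

Hashing rather than calling a random number generator means the same trace always gets the same decision, so a re-run or a second service scores the same subset.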

flowchart TB
    subgraph Offline["Offline (before deploy)"]
      direction TB
      G[Golden set] --> E[Eval run in CI]
      E --> D{Gate}
    end
    subgraph Online["Online (after deploy)"]
      direction TB
      L[Live traffic] --> S[Sampled scoring]
      S --> A[Alerts + dashboards]
      A --> F[Feedback signals]
    end
    D -->|ship| L
    F -.new hard cases.-> G
Two loops: offline evals run a golden set through CI with a gate before deploy; online evals score sampled live traffic into alerts and feedback that flows back into the golden set.

The connection between them is the source of all durable improvement: online evals surface the hard cases you didn’t anticipate, and those cases flow back into the offline golden set, where they become permanent regression tests.

The golden set: your source of truth

Everything rests on a curated set of cases — inputs paired with what a good output looks like. Building it well is the highest-leverage work in the whole effort.

A few principles. Start small and real: 30–50 cases drawn from actual usage beat 1,000 synthetic ones. Cover the slices that matter: not just the happy path but the categories you care about — different user types, languages, edge cases, known failure modes. Make expectations checkable: an expected exact answer, a rubric, a “must contain / must not contain,” or a reference output to compare against. Version it: the golden set lives in the repo and evolves through pull requests like any other critical asset.
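One way to make these principles concrete is a small versioned file with one JSON object per case. The sketch below is illustrative only: the `GoldenCase` fields (`case_id`, `slice`, `must_contain`, `must_not_contain`) and the JSON Lines layout are assumptions, not a format this recipe prescribes.

```python
import json
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """One golden-set entry: an input plus checkable expectations (hypothetical schema)."""
    case_id: str
    slice: str                      # e.g. "billing", "non_english", "edge_case"
    input: str
    must_contain: list[str] = field(default_factory=list)
    must_not_contain: list[str] = field(default_factory=list)


def load_golden_set(jsonl_text: str) -> list[GoldenCase]:
    """Parse one JSON object per non-empty line and reject duplicate ids."""
    cases = [GoldenCase(**json.loads(line))
             for line in jsonl_text.splitlines() if line.strip()]
    ids = [c.case_id for c in cases]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate case_id in golden set")
    return cases


def slice_counts(cases: list[GoldenCase]) -> dict[str, int]:
    """Count cases per slice so thin coverage is visible in review."""
    counts: dict[str, int] = {}
    for c in cases:
        counts[c.slice] = counts.get(c.slice, 0) + 1
    return counts


# Made-up example cases; in practice this text would live in a file in the repo.
EXAMPLE = """
{"case_id": "refund-001", "slice": "billing", "input": "How do I get a refund?", "must_contain": ["refund policy"]}
{"case_id": "refund-002", "slice": "billing", "input": "Refund after 90 days?", "must_not_contain": ["guaranteed"]}
{"case_id": "es-001", "slice": "non_english", "input": "¿Cómo cancelo mi cuenta?", "must_contain": ["cancel"]}
"""

cases = load_golden_set(EXAMPLE)
print(slice_counts(cases))
```

Keeping the cases as plain text means a new case shows up as a one-line diff in a pull request, which is what makes review of the golden set practical.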

Three tiers of scoring

How do you score an output? There’s a hierarchy, and a good harness uses all three deliberately — most checks cheap and deterministic, fewer expensive and human.

flowchart TB
    H[Human eval<br/>few, expensive, gold standard] --- MG[Model-graded<br/>many, cheap, needs calibration]
    MG --- AS[Assertions<br/>most, instant, deterministic]
    classDef base fill:#f5e7e3,stroke:#b1361e;
    class AS base;
An eval pyramid: assertions form the cheap deterministic base, model-graded checks the scalable middle, and human evaluation the rare gold-standard top.

Assertions are deterministic, instant, and free: did the JSON parse? Is the required field present? Does it cite a real source? Is it under the length limit? Did it avoid a forbidden phrase? Push as much as possible down to this tier — anything you can check with code, check with code.
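As a rough illustration of how several of these checks could sit in one function, here is a sketch that collects failures instead of stopping at the first one. The output schema (`answer`, `citations`), the 600-character limit, and the forbidden phrase are assumptions made up for the example.

```python
import json

MAX_CHARS = 600                               # assumed length limit
REQUIRED_FIELDS = ("answer", "citations")     # assumed output schema
FORBIDDEN = ("as an ai language model",)      # assumed forbidden phrase (lowercase)


def assertion_failures(raw_output: str) -> list[str]:
    """Return human-readable failures; an empty list means every check passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["output is not a JSON object"]

    failures = []
    for name in REQUIRED_FIELDS:
        if name not in data:
            failures.append(f"missing required field: {name}")
    answer = str(data.get("answer", ""))
    if len(answer) > MAX_CHARS:
        failures.append(f"answer exceeds {MAX_CHARS} characters")
    lowered = answer.lower()
    for phrase in FORBIDDEN:
        if phrase in lowered:
            failures.append(f"forbidden phrase present: {phrase}")
    return failures


print(assertion_failures('{"answer": "Refunds take 5 days.", "citations": ["kb-12"]}'))
print(assertion_failures('{"answer": "As an AI language model, I cannot say."}'))
print(assertion_failures("not json"))
```

Returning a list rather than raising on the first problem lets the report show every broken expectation for a case in a single run.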

Model-graded evals use a model to judge open-ended quality against a rubric (“Is this summary faithful to the source? Score 1–5”). They scale to thousands of cases cheaply, but they must be calibrated against human judgment or you’re just measuring one model’s opinion of another. Spot-check the grader regularly.
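Calibration can start with something very plain: score the same cases with humans and with the grader, then compare. The sketch below assumes both use the 1–5 scale and that a score of 4 or higher counts as a pass; `calibration_report` and the sample scores are invented for illustration.

```python
def calibration_report(human: list[int], grader: list[int], pass_at: int = 4) -> dict[str, float]:
    """Compare grader scores with human scores on the same cases (both on a 1-5 scale)."""
    if len(human) != len(grader) or not human:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(human)
    pairs = list(zip(human, grader))
    exact = sum(h == g for h, g in pairs) / n
    mean_abs_error = sum(abs(h - g) for h, g in pairs) / n
    # Agreement on the pass/fail verdict, which is what a gate would act on.
    verdict = sum((h >= pass_at) == (g >= pass_at) for h, g in pairs) / n
    return {"exact_agreement": exact, "mean_abs_error": mean_abs_error, "verdict_agreement": verdict}


human_scores = [5, 4, 2, 3, 5, 1, 4, 2]     # made-up human labels
grader_scores = [5, 5, 2, 4, 4, 1, 4, 3]    # made-up grader labels for the same cases
report = calibration_report(human_scores, grader_scores)
print(report)
```

In this toy data the grader matches the human score exactly only half the time, yet agrees on the pass/fail verdict for 7 of 8 cases. Which of those numbers matters depends on whether you gate on the raw score or on the verdict.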

Human evals are the gold standard for subjective quality and the calibration anchor for the other two. They’re slow and expensive, so reserve them for a small, rotating sample and for settling cases where the cheaper tiers disagree.

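# Assertion tier: deterministic, runs in milliseconds
def check_grounded(output, case):
    # Every answer must cite something, and only sources this case allows.
    assert output.citations, "no citation provided"
    assert all(c.source_id in case.allowed_sources for c in output.citations), \
        "citation outside the case's allowed sources"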

# Model-graded tier — a rubric the grader model applies
FAITHFULNESS = """Rate 1-5: is every claim in the SUMMARY supported by the SOURCE?
Return JSON: {"score": int, "unsupported_claims": [str]}"""
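A grader's reply is itself model output, so it is worth validating before it feeds a score. Here is a minimal sketch of that step, assuming the grader returns the JSON shape the rubric asks for; `parse_grade` is a hypothetical helper, and the model call itself is deliberately left out.

```python
import json


def parse_grade(raw: str) -> dict:
    """Validate a grader reply shaped like {"score": int, "unsupported_claims": [str]}."""
    data = json.loads(raw)                      # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("grader reply must be a JSON object")
    score = data.get("score")
    claims = data.get("unsupported_claims", [])
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 5:
        raise ValueError(f"score must be an integer from 1 to 5, got {score!r}")
    if not isinstance(claims, list) or not all(isinstance(c, str) for c in claims):
        raise ValueError("unsupported_claims must be a list of strings")
    return {"score": score, "unsupported_claims": claims}


grade = parse_grade('{"score": 2, "unsupported_claims": ["Revenue grew 40%"]}')
print(grade["score"], grade["unsupported_claims"])
```

Treating an unparseable or out-of-range grade as an error, rather than silently scoring it as a pass or a fail, keeps grader breakage from being mistaken for a quality change.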

Running it as a gate in CI

The harness earns its keep when it’s wired into CI as a merge gate: a pull request that drops the pass rate below the bar (or regresses any protected slice) cannot merge. This is what turns evals from a report nobody reads into a force that actually protects quality.

The structure mirrors a test suite: load the golden set, run the system on every case, score each with the appropriate tier, aggregate by slice, and compare against the baseline.

import sys

# run_system, score, aggregate, golden_set, baseline, and BAR come from your harness
results = [score(run_system(case), case) for case in golden_set]
report = aggregate(results, by="slice")

if report.pass_rate < BAR or report.regressed_any_slice(baseline):
    print(report.diff(baseline))      # show exactly what broke
    sys.exit(1)                        # block the merge
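The gate snippet leaves `aggregate` and the baseline comparison undefined. One possible shape for them, as a self-contained sketch: the function names, the `TOLERANCE` value, and the convention of storing an `"overall"` entry alongside the slices are assumptions, not the recipe's actual harness.

```python
from collections import defaultdict

BAR = 0.90          # assumed overall pass-rate bar
TOLERANCE = 0.02    # assumed per-slice drop allowed before it counts as a regression


def pass_rates(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Turn (slice, passed) pairs into a pass rate per slice plus an "overall" entry."""
    totals, passed = defaultdict(int), defaultdict(int)
    for slice_name, ok in results:
        for key in (slice_name, "overall"):
            totals[key] += 1
            passed[key] += int(ok)
    return {key: passed[key] / totals[key] for key in totals}


def regressed_slices(current: dict[str, float], baseline: dict[str, float]) -> list[str]:
    """Names whose pass rate fell more than TOLERANCE below the baseline.

    A slice present in the baseline but missing from the current run counts as 0.0.
    """
    return sorted(name for name, old in baseline.items()
                  if current.get(name, 0.0) < old - TOLERANCE)


# Made-up results: 9/10 billing cases pass, 4/5 non-English cases pass.
results = ([("billing", True)] * 9 + [("billing", False)]
           + [("non_english", True)] * 4 + [("non_english", False)])
baseline = {"overall": 0.9, "billing": 0.9, "non_english": 1.0}

current = pass_rates(results)
broken = regressed_slices(current, baseline)
gate_ok = current["overall"] >= BAR and not broken
print(current, broken, gate_ok)
```

Here the billing slice holds steady, but the non-English slice drops from 1.0 to 0.8 and pulls the overall rate under the bar, so the gate fails and names the slice responsible.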

The flywheel: where quality compounds

Here’s the practice that separates teams that improve from teams that thrash. Every time the system fails in production, you don’t just patch it — you capture it.

flowchart LR
    P[Production failure] --> C[Capture trace]
    C --> M[Minimize to a case]
    M --> ADD[Add to golden set]
    ADD --> FIX[Fix + re-run evals]
    FIX --> REG[Now a regression test forever]
    REG --> P
The eval flywheel: a production failure is captured as a trace, minimized to a case, added to the golden set, fixed, re-run, and becomes a permanent regression test.

A production failure becomes a captured trace (this is exactly why the agent harness records everything). You minimize it to the smallest reproducing case, add it to the golden set with the correct expectation, then fix the system and re-run. From that moment, the failure is a permanent regression test: as long as the gate is enforced, the same mistake cannot ship again without failing CI. Do this consistently and your golden set becomes a precise map of your product’s real-world weaknesses, and your quality compounds while competitors play whack-a-mole.
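The capture-and-minimize step might look roughly like this. Everything specific here is hypothetical: the trace keys (`user_input`, `retrieved_ids`), the `prod-` id prefix, and the choice to keep the golden set as a list of JSON lines.

```python
import hashlib
import json


def trace_to_case(trace: dict, expectation: dict, slice_name: str) -> dict:
    """Reduce a production trace to the fields needed to reproduce the failure."""
    minimal = {
        "input": trace["user_input"],
        "retrieved_ids": sorted(trace.get("retrieved_ids", [])),
    }
    # A content hash gives a stable id, so capturing the same failure twice is a no-op.
    payload = json.dumps(minimal, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:12]
    return {"case_id": f"prod-{digest}", "slice": slice_name, **minimal, **expectation}


def add_case(golden_lines: list[str], case: dict) -> bool:
    """Append the case as a JSON line unless its id is already present."""
    existing = {json.loads(line)["case_id"] for line in golden_lines if line.strip()}
    if case["case_id"] in existing:
        return False
    golden_lines.append(json.dumps(case, sort_keys=True))
    return True


# A made-up trace: only the input and retrieval context survive minimization.
trace = {"user_input": "Cancel my plan", "retrieved_ids": ["kb-7", "kb-2"],
         "latency_ms": 840, "session": "abc"}
case = trace_to_case(trace, {"must_contain": ["cancellation"]}, slice_name="account")

golden_lines: list[str] = []
print(add_case(golden_lines, case), add_case(golden_lines, case), len(golden_lines))
```

The human still supplies the expectation; the code only handles the mechanical part of stripping the trace down and avoiding duplicates.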

Metrics that actually matter

Resist the urge to track everything. For most enterprise AI features, a small set carries the signal:

| Metric | What it tells you | Watch for |
| --- | --- | --- |
| Task success rate | did it accomplish the user’s goal | the headline number; gate on it by slice |
| Faithfulness / grounding | did it stay true to sources | the enterprise trust metric; hallucination shows here |
| Refusal / escalation rate | how often it punts to a human | too high = useless; too low = overconfident |
| Latency p50/p95 | is it fast enough to use | multi-step reasoning blows p95 |
| Cost per task | unit economics | the number that decides if the feature is viable |
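Most of these numbers fall out of the per-run records a harness already keeps. A small sketch, assuming each run is a dict with `success`, `escalated`, `latency_ms`, and `cost_usd` keys (all hypothetical) and using a nearest-rank percentile; faithfulness is omitted because it needs a grader rather than arithmetic.

```python
import math


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(runs: list[dict]) -> dict[str, float]:
    """Roll per-run records up into the headline metrics."""
    n = len(runs)
    latencies = [r["latency_ms"] for r in runs]
    return {
        "task_success_rate": sum(r["success"] for r in runs) / n,
        "escalation_rate": sum(r["escalated"] for r in runs) / n,
        "latency_p50_ms": percentile(latencies, 50),
        "latency_p95_ms": percentile(latencies, 95),
        "cost_per_task_usd": sum(r["cost_usd"] for r in runs) / n,
    }


# Made-up runs: eight quick successes, one escalation, one slow multi-step failure.
runs = [
    {"success": True, "escalated": False, "latency_ms": ms, "cost_usd": 0.02}
    for ms in (400, 420, 450, 480, 500, 520, 560, 600)
] + [
    {"success": False, "escalated": True, "latency_ms": 900, "cost_usd": 0.05},
    {"success": False, "escalated": False, "latency_ms": 3200, "cost_usd": 0.11},
]

summary = summarize(runs)
print(summary)
```

In this toy data the median latency is 500 ms while p95 is 3200 ms, because a single slow run dominates the tail. That gap is why the table tracks both.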

Trade-offs and honest limits

Evals cost real effort — curating the golden set, calibrating graders, maintaining the harness — and they’re never complete; a passing suite means “no known regressions,” not “correct.” Model-graded evals can be gamed and drift, so they need ongoing calibration. And an over-fit golden set can give false confidence if it stops reflecting real traffic. The mitigation for all of these is the flywheel: keep feeding real failures in, keep spot-checking the graders, and treat the suite as a living system, not a one-time project.

The alternative — shipping on vibes — is cheaper today and ruinously expensive the first time a silent regression reaches a customer. For anything enterprise-grade, evals aren’t optional; they’re the price of being trusted.

Pitfalls

How to adopt this

References

This recipe is the measurement backbone for the AI-native product architecture — it’s what lets you promote a capability up the maturity ladder with evidence instead of optimism. It consumes the traces produced by the agent harness and pairs with spec-driven development, whose executable acceptance criteria are evals by another name. Building blocks live in the cookbook.

Mohit Mittal
Writes Applied GenAI — practical recipes for building with generative AI. Code lives in the cookbook.