New models drop every few weeks. Each one might improve your outputs — or silently degrade them. Prompt best practices shift with every release; they’re not universal truths, just model-specific techniques that expire the moment the next model ships.
You won’t know which of your prompts broke unless you measure. And if you’re manually spot-checking outputs every time a model updates, you’re going to miss things.
I’m building an iOS app with my co-founder Suraya — a startup using multimodal LLMs for visual and creative tasks. The models we build on have already changed multiple times under us. I love testing. I’ve always preferred measuring over hoping for the best. So I stopped optimizing prompts and started building eval loops.
Prompt techniques are model-coupled. Eval methodology is model-agnostic. The chain-of-thought trick that works on one model might be unnecessary — or harmful — on the next. But the eval loop that told me whether chain-of-thought helped? That works on every model, forever.
I still invest in prompt quality — but I invest in eval infrastructure first, because that’s what compounds across model generations.
In this article:
- The Bakeoff — the minimum viable eval
- Two Axes — contract compliance vs. semantic quality
- The Evaluator-Generator Split — why the judge should be a different model
- Three Eval Patterns — search, content generation, and image editing
- The Exploration Benefit — evals as discovery tools
- Patterns That Survive Production — practical tips
The Bakeoff
A new model drops. I need to know in 20 minutes whether my prompts still work. So I freeze inputs and compare — what else would you do?
I call this a bakeoff:
- Freeze a set of inputs (your “golden set”)
- Run those inputs through two variants (old prompt vs. new prompt, old model vs. new model — or both at once)
- Compare outputs systematically using metrics, human labels, or an LLM judge
That’s it. Everything else builds on this foundation.
The golden set
Start with 10 carefully curated queries, not 100 random ones. Curation beats volume. Each query represents a different user intent — speed, dietary restrictions, ingredient-driven, social occasion, cuisine-specific:
{"query_id": "speed", "query": "quick weeknight chicken", "category": "speed", "mode": "search", "judgments": {...}}
{"query_id": "dietary", "query": "vegan meal prep", "category": "dietary", "mode": "search", "judgments": {...}}
{"query_id": "restriction", "query": "gluten-free birthday cake", "category": "restriction", "mode": "search", "judgments": {...}}
{"query_id": "ingredient", "query": "uses up leftover rice", "category": "ingredient", "mode": "search", "judgments": {...}}
{"query_id": "social", "query": "impressive dinner party starter", "category": "social", "mode": "search", "judgments": {...}}
{"query_id": "audience", "query": "kid-friendly lunch no nuts", "category": "audience", "mode": "search", "judgments": {...}}
{"query_id": "cuisine", "query": "authentic pad thai", "category": "cuisine", "mode": "search", "judgments": {...}}
{"query_id": "macro", "query": "high protein low carb snack", "category": "macro", "mode": "search", "judgments": {...}}
{"query_id": "exploratory", "query": "something with what I have", "category": "exploratory", "mode": "suggest", "judgments": {...}}
{"query_id": "method", "query": "30 minute one-pot pasta", "category": "method", "mode": "search", "judgments": {...}}
These 10 queries cover the intent space better than 1000 random ones would. Each one tests a different axis of search quality. When one degrades, you know exactly which capability broke. The format is JSONL — one case per line, easy to version, easy to extend.
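Loading the file is a few lines of Python, assuming field names that mirror the JSONL above (the `GoldenCase` shape and loader name here are illustrative):

```python
import json
from dataclasses import dataclass

@dataclass
class GoldenCase:
    query_id: str
    query: str
    category: str
    mode: str
    judgments: dict  # human relevance labels, keyed by result id

def load_golden_set(path: str) -> list[GoldenCase]:
    """One JSON object per line; blank lines are skipped so trailing newlines are harmless."""
    cases: list[GoldenCase] = []
    with open(path) as f:
        for line in f:
            if line.strip():
                cases.append(GoldenCase(**json.loads(line)))
    return cases
```

Because it’s JSONL, adding a case is appending a line, and the git diff stays readable.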
Scoring against human judgments
For search, the ground truth is pre-annotated human relevance labels — not an LLM judge. Each golden query comes with known-relevant results, scored by hand. The eval runs the search, computes ranking metrics against those labels, and compares to the baseline. No LLM in the loop. Purely algorithmic.
A quick aside on language choice: my app is Swift, but my eval harnesses are Python and TypeScript. This is deliberate. Eval code isn’t production code — it’s throwaway tooling that calls APIs, parses JSON, and writes reports. Scripting languages are faster to iterate on, have better LLM SDK support, and LLMs write better Python/TypeScript than Swift for this kind of glue code. I treat the eval harness as a separate tool, not part of the app.
# For each golden query: run the search, score the results against
# human-labeled relevance, and compare to the previous baseline.
def run_search_bakeoff(
    golden_set: list[GoldenCase],
    candidate_config: SearchConfig,
    baseline_scores: dict[str, float],
) -> list[SearchBakeoffResult]:
    results: list[SearchBakeoffResult] = []
    for case in golden_set:
        # Run the search with the candidate configuration
        hits = search(case.query, config=candidate_config)
        # Score against human relevance labels:
        #   NDCG@10 = how well are the top 10 results ordered?
        #   Precision@3 = of the top 3, how many are relevant?
        ndcg = compute_ndcg(hits, case.judgments, k=10)
        p3 = compute_precision(hits, case.judgments, k=3)
        result = SearchBakeoffResult(
            query=case.query,
            ndcg_at_10=ndcg,
            precision_at_3=p3,
            baseline_ndcg=baseline_scores[case.query_id],
            delta=ndcg - baseline_scores[case.query_id],  # positive = improvement
        )
        results.append(result)
        write_jsonl("search_bakeoff.jsonl", result)  # save incrementally
    return results
The simplest pattern: frozen inputs, deterministic metrics, human-labeled ground truth. No subjectivity, no LLM evaluator. When you need to judge subjective output quality — like whether a recipe adaptation doesn’t taste like garbage — that’s where LLM-as-judge comes in.
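For reference, the two ranking metrics the bakeoff computes fit in a dozen lines. A sketch, assuming `labels` maps result ids to graded relevance (higher means more relevant):

```python
import math

def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain: relevance discounted by log2 of rank."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))

def compute_ndcg(hits: list[str], labels: dict[str, float], k: int = 10) -> float:
    """NDCG@k: DCG of the returned order, normalized by the ideal order."""
    got = dcg([labels.get(h, 0.0) for h in hits[:k]])
    ideal = dcg(sorted(labels.values(), reverse=True)[:k])
    return got / ideal if ideal > 0 else 0.0

def compute_precision(hits: list[str], labels: dict[str, float], k: int = 3) -> float:
    """Precision@k: fraction of the top k with any positive relevance."""
    return sum(1 for h in hits[:k] if labels.get(h, 0.0) > 0) / k
```

A perfect ordering of the labeled results scores 1.0; relevant-but-buried results pull the score down.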
One practical detail: use template tokens to make variant creation systematic. Instead of copy-pasting prompts with one line changed, parameterize the variable parts:
# Prompt template with replaceable tokens (double-braces).
# The harness substitutes values at runtime, so you can test
# different reasoning strategies without rewriting the prompt.
ADAPTATION_TEMPLATE = """
You are a recipe adaptation assistant.
Dietary preference: {{dietary_preference}}
Taste profile: {{taste_profile}}
Original recipe: {{original_recipe}}
{{reasoning_style}}
Produce {{count}} adapted suggestions in the specified output format.
"""
# Each variant fills in the same template differently.
# The bakeoff runs all of them against the same golden set,
# so you know which reasoning approach actually helps.
variants = [
{"reasoning_style": ""}, # no reasoning
{"reasoning_style": HIDDEN_COT_BLOCK}, # hidden chain-of-thought
{"reasoning_style": CLASSIFY_FIRST_BLOCK}, # classify-then-adapt
]
This prevents the “I changed three things and don’t know which one helped” problem. When a new model drops, you test it against the same prompt variants with a config change instead of a rewrite.
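The substitution itself can be a one-function helper. A sketch, where failing loudly on a missing token keeps a typo from shipping a half-filled prompt:

```python
import re

def render_template(template: str, values: dict[str, str]) -> str:
    """Replace {{token}} markers with values; raise on tokens with no value."""
    def sub(match: re.Match) -> str:
        token = match.group(1)
        if token not in values:
            raise KeyError(f"No value for template token {{{{{token}}}}}")
        return values[token]
    return re.sub(r"\{\{(\w+)\}\}", sub, template)
```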
Model comparison is a one-liner — add another setup with a different model and the harness runs everything in one pass:
{
"setups": [
{ "id": "current", "model": "gemini-2.5-flash", "prompt": "templates/baseline-v3.txt" },
{ "id": "candidate", "model": "gemini-3.1-flash-lite", "prompt": "templates/baseline-v3.txt" }
]
}
Same prompt, different model. The harness handles the rest.
Dual-format output
Every bakeoff produces two artifacts:
- Markdown for humans — a summary you paste into a PR description. Readable at a glance.
- Structured JSON for automation — machine-readable scores that feed CI gates, trend tracking, or dashboards.
This costs almost nothing to implement and pays for itself immediately. The markdown tells you what happened. The JSON tells your pipeline whether to proceed.
| Query | Baseline NDCG | Candidate NDCG | Delta | Verdict |
|---|---|---|---|---|
| “quick weeknight chicken” | 0.72 | 0.81 | +0.09 | Improved |
| “vegan meal prep” | 0.68 | 0.75 | +0.07 | Improved |
| “gluten-free birthday cake” | 0.80 | 0.79 | -0.01 | Neutral |
| “uses up leftover rice” | 0.51 | 0.78 | +0.27 | Improved |
| “impressive dinner party starter” | 0.75 | 0.82 | +0.07 | Improved |
| “kid-friendly lunch no nuts” | 0.69 | 0.70 | +0.01 | Neutral |
| “authentic pad thai” | 0.48 | 0.76 | +0.28 | Improved |
| “high protein low carb snack” | 0.71 | 0.73 | +0.02 | Neutral |
| “something with what I have” | 0.60 | 0.74 | +0.14 | Improved |
| “30 minute one-pot pasta” | 0.78 | 0.80 | +0.02 | Neutral |
Average delta: +0.096. No regressions. Ship it.
A skimmer understands that table in 5 seconds. That’s the point.
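Producing both artifacts from the same result rows costs a dozen lines. A sketch with illustrative field names:

```python
import json

def write_reports(rows: list[dict], md_path: str, json_path: str) -> None:
    """One pass over the results, two artifacts: a markdown table for humans,
    JSON for CI gates and trend tracking."""
    lines = ["| Query | Baseline | Candidate | Delta |", "|---|---|---|---|"]
    for r in rows:
        lines.append(
            f'| "{r["query"]}" | {r["baseline"]:.2f} | {r["candidate"]:.2f} '
            f'| {r["candidate"] - r["baseline"]:+.2f} |'
        )
    with open(md_path, "w") as f:
        f.write("\n".join(lines) + "\n")
    with open(json_path, "w") as f:
        json.dump(rows, f, indent=2)
```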
Start here. A golden set plus one comparison run gives you more signal than months of vibes-based prompt tweaking.
Two Axes: Does It Follow the Rules? Is It Actually Good?
Early on I made a mistake that cost me a week: I treated eval as a single dimension. “Is this output good?” But “good” means two very different things.
Contract compliance
Did the output match the expected shape? This is structural, mechanical:
- Did the JSON parse?
- Are all required fields present and the correct types?
- Did it return the expected number of suggestions?
- Are titles within the character limit?
- Do ingredient-search terms stay short and relevant?
Contract compliance is binary and automatable. You don’t need an LLM to check whether JSON parses:
// Machine-checked contract validation -- no LLM needed.
// If any of these fail, there's no point asking an LLM to judge quality.
if (suggestions.length !== expectedCount) {
violations.push(`Expected ${expectedCount} suggestions but got ${suggestions.length}.`);
}
duplicateTitles.forEach((title) => {
violations.push(`Duplicate title: ${title}`);
});
if (titleWordCount > titleWordLimit) {
violations.push(`Title exceeds ${titleWordLimit} words.`);
}
if (ingredientQuery.length > maxCharacters) {
violations.push(`Ingredient query exceeds ${maxCharacters} characters.`);
}
Semantic quality
Is the content actually good? This is subjective, domain-specific:
- Does this adaptation actually make sense in a real kitchen?
- Does the substitution actually work in the recipe’s chemistry?
- Is the result something a normal person could follow?
- Would the adapted dish still taste like a recognizable version of the original?
Semantic quality is graded and requires judgment. This is where you need the evaluator model.
Why the distinction matters
I changed a generation prompt and ran it through the golden set. The outputs were perfectly structured — correct count, all fields present, no duplicates, titles within limits. 100% contract compliance. I almost called it good.
Then I looked at the semantic scores. The suggestions were completely ungrounded — recommending substitutions for ingredients that weren’t in the original recipe, pairing flavors that contradict each other. Valid structure. Nonsensical content.
The contract layer said “ship it.” The evaluator said “absolutely not.”
The reverse happens too: a brilliant adaptation buried in malformed output is a semantic pass and a contract fail. Different failure modes, different fixes.
If you only check one axis, you’re flying half-blind.
The two-layer approach
I run contract checks first — fast, cheap, deterministic. Only outputs that pass the contract layer reach the LLM evaluator.
Here’s the trick: the evaluator receives the machine analysis as context. If the contract validator flagged borderline issues (say, a substitution that technically parses but references an ingredient not in the original recipe), the evaluator sees that annotation. The machine layer is the floor; the LLM layer is the ceiling.
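The gating logic is simple enough to sketch. Here the checks and the judge are passed in as callables so the shape is clear; in my harness the real evaluator call sits behind what this sketch labels `llm_judge`:

```python
from typing import Callable

def two_layer_eval(
    output: str,
    contract_check: Callable[[str], list[str]],    # hard violations; empty list = pass
    borderline_notes: Callable[[str], list[str]],  # soft annotations the judge should see
    llm_judge: Callable[[str, list[str]], float],  # stand-in for the evaluator-model call
) -> dict:
    """Layer 1 is cheap and deterministic. Layer 2 only runs on outputs that pass,
    and it receives the machine analysis as context."""
    violations = contract_check(output)
    if violations:
        return {"passed": False, "violations": violations, "semantic": None}
    return {"passed": True, "violations": [],
            "semantic": llm_judge(output, borderline_notes(output))}
```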
This separation clarifies something important:
Your eval rubric is a product spec in disguise. If you can’t score it, you haven’t defined what “good” means.
Writing rubric dimensions forces you to articulate quality criteria that would otherwise live as unspoken assumptions.
The Evaluator-Generator Split
One structural decision changes everything: use a different model to judge than the one that generates.
The generation model runs in production, thousands of times, for every user. It needs to be fast and cheap. The evaluator model runs offline, tens of times, during development. You can afford a heavy one.
That mismatch is the whole trick. A lightweight model that generates recipe adaptations in 2 seconds can be judged by a heavyweight model that takes 30 seconds to evaluate. The cost works out: 10 golden cases, 3 setups, 3 runs each for averaging — that’s about 90 evaluator calls. In practice, most of my eval runs over the last few months cost under $5. Compare that to your hourly rate as an engineer and the math is obvious — this is practically free.
A model tends to be lenient about its own failure patterns. If the generator always suggests the same safe substitutions, the same model as evaluator may not notice — it shares the same biases. A different, stronger model catches those.
And when you switch generators frequently, you need a stable judge. Otherwise you’re comparing outputs scored by different evaluators. My evaluator stays fixed across generator experiments so that when scores shift, I know it’s the generator that changed, not the yardstick.
Three Eval Patterns
The bakeoff applies everywhere, but what you measure and how changes by surface. Start simple. Add complexity only when the simpler approach can’t capture what matters.
Search quality: “Did we find the right recipes?”
The search eval is the most deterministic of the three. Golden queries go in with pre-annotated relevance labels, ranked results come out, standard metrics do the rest. I track two numbers:
- NDCG@10 (Normalized Discounted Cumulative Gain): measures how well the top 10 results are ordered. 1.0 means perfect ranking. 0.5 means relevant results exist but they’re buried or misordered.
- Precision@3: of the top 3 results, what fraction are actually relevant? Users rarely scroll — the first few results are the product experience.
No LLM judge needed. I stratify scores by query category — cuisine, dietary restriction, method — so a regression in one category doesn’t hide behind improvements in another.
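The stratification is one group-by. A sketch over the result rows, assuming each row carries its query’s category:

```python
from collections import defaultdict
from statistics import mean

def scores_by_category(results: list[dict]) -> dict[str, float]:
    """Average NDCG per query category, so a regression in one slice
    can't hide behind improvements in another."""
    buckets: dict[str, list[float]] = defaultdict(list)
    for r in results:
        buckets[r["category"]].append(r["ndcg_at_10"])
    return {cat: mean(vals) for cat, vals in buckets.items()}
```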
This eval started as a safety net for model migration. I switched embedding models and needed to know if ranking quality held. The golden set told me in minutes. But the interesting part isn’t the migration — it’s what the eval discovered.
I ran a bakeoff comparing pure semantic search against keyword search. The assumption was that semantic search would win everywhere — it understands intent, handles synonyms, all the things you read about.
It didn’t.
| Query type | Semantic | Keyword | Hybrid |
|---|---|---|---|
| Vague, exploratory (“something with what I have”) | 0.89 | 0.52 | 0.87 |
| Specific, concrete (“uses up leftover rice”) | 0.61 | 0.93 | 0.91 |
Semantic search dominated on vague, exploratory queries. But for specific ingredient-driven queries, keyword search was more precise. The semantic model returned thematically related recipes that didn’t actually contain the requested ingredient.
The hybrid approach meant users found the right recipe in the top 3 results instead of scrolling past 10. I would never have discovered this without the eval.
The eval didn’t just tell me my change was bad. It revealed a better approach I hadn’t considered.
Content generation: “Is this adaptation actually useful?”
Recipe adaptation is the most complex LLM surface. Users ask “make this vegan” or “adapt for a nut allergy,” and the system generates a modified recipe with substitutions, adjusted proportions, and updated instructions. Each adaptation comes with a title, a detailed prompt, and ingredient-search terms for downstream sourcing.
This is where the evaluator-generator split pays off. The generation model runs per-user and needs to be fast. The evaluator runs offline and can decompose quality into roughly eight weighted dimensions:
| Dimension | Weight | What it measures | Example failure |
|---|---|---|---|
| Taste accuracy | 20% | Do substitutions preserve the dish’s flavor profile? | Swapping soy sauce for coconut aminos where fermented depth is critical |
| Technique correctness | 15% | Are cooking methods adjusted for new ingredients? | Same sear time after replacing chicken with tofu |
| Completeness | 15% | Did the adaptation address all constraints? | Removing eggs but not adjusting leavening in a cake |
| Ingredient grounding | 10% | Are all referenced ingredients in the source recipe? | Suggesting a swap for an ingredient the recipe doesn’t contain |
| Practicality | 10% | Can a normal person follow this? | High-acyl gellan gum as an egg replacement |
| Sourcing | 10% | Can you find these at a regular grocery store? | Three specialty-import ingredients for a “quick weeknight” dish |
| Safety | 10% | No allergen contamination, no dangerous shortcuts | “Nut-free” adaptation using almond extract |
| Presentation clarity | 10% | Are titles and descriptions easy to scan? | A title that reads like a sentence instead of a label |
The principle is universal: decompose “good” into specific, scorable sub-questions with explicit weights. Instead of asking “is this good?” you ask eight specific questions and combine the answers. Anyone can look at that table and immediately understand what the product values — and argue about the weights productively.
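Combining the dimensions is a weighted average; with weights summing to 1 and per-dimension scores on a 1-5 scale, the overall score stays on 1-5. A sketch:

```python
def weighted_score(dim_scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of per-dimension scores; refuse to score if a dimension is missing."""
    missing = set(weights) - set(dim_scores)
    if missing:
        raise ValueError(f"Unscored dimensions: {sorted(missing)}")
    total_w = sum(weights.values())
    return sum(dim_scores[d] * w for d, w in weights.items()) / total_w
```

Dividing by the weight total means a partial rubric (say, during rubric development) still yields a comparable 1-5 number.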
Asking an LLM “is this good?” and getting back “7/10” tells you nothing. You can’t debug a number without a reason.
So one crucial detail: require the evaluator to cite specific observations, not hand-wave.
# What the evaluator LLM sees for one rubric dimension.
# Forces structured output: a numeric score, evidence, and
# a failure classification. Makes scores debuggable --
# when you disagree, you can see exactly where the judge went wrong.
TASTE_ACCURACY_RUBRIC = """
Score this recipe adaptation on Taste Accuracy (1-5):
- Does each substitution preserve the flavor role of the original ingredient?
- Are ratios adjusted for the substitute's intensity?
- Would this adaptation produce a dish that tastes like a recognizable version of the original?
Cite the specific substitution(s) that informed your score.
Respond with: score, one-sentence justification, failure_mode (if < 3).
"""
Instead of accepting “4/5 — good adaptation,” this forces the evaluator to say why: “Taste accuracy 4/5 — correctly identified that eggs serve as a binder in this brownie recipe and chose flax eggs, which provide equivalent binding. Deducted one point because the ratio is slightly low for a dense brownie batter.”
A practical note on scale: use 1-5, not 1-10. LLMs can reliably distinguish between a 2 and a 4, but the difference between a 6 and a 7 on a 10-point scale is too subtle for them to score consistently. A coarser scale with clear anchors gives you more stable, more meaningful scores.
This makes scores reproducible and debuggable. When you disagree with a score, you see exactly where the evaluator’s reasoning went wrong.
LLM judges aren’t perfect — they bias toward verbose outputs and can be inconsistent. I’ve had an evaluator give a high taste-accuracy score to an adaptation that completely ignored the original dish’s flavor profile. The weighted rubric with citation requirements catches this: when the evaluator has to justify its score with specific evidence, false positives become obvious on review. If the justification doesn’t match the score, you know to discard it.
One thing I always do: run evals multiple times and average the scores. You can pin temperature to 0, use structured output, do everything right — and the same evaluator will still give slightly different scores across runs. That’s just how LLMs work. Running the eval 3-5 times and averaging gives you a stable signal. If the scores vary wildly across runs, that’s telling you the rubric needs tightening, not that you should pick the most optimistic number.
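The aggregation itself is a few lines; the useful part is flagging instability instead of silently averaging it away. A sketch (the 0.5 spread threshold is an illustrative default, not a universal constant):

```python
from statistics import mean, pstdev

def aggregate_runs(run_scores: list[float], spread_threshold: float = 0.5) -> dict:
    """Average repeated eval runs. A wide spread means the rubric needs
    tightening, not that you should pick the most optimistic run."""
    spread = max(run_scores) - min(run_scores)
    return {
        "mean": mean(run_scores),
        "stdev": pstdev(run_scores),
        "spread": spread,
        "stable": spread <= spread_threshold,
    }
```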
The evaluator also classifies how things fail, not just that they failed. A substitution that’s unsafe (allergen contamination) is a different failure mode than one that’s impractical (requires specialty-import ingredients). Categorizing failure modes tells you what to fix in the prompt.
One thing that surprised me: this rubric approach works for creative output too. You’d think creative tasks are too subjective to score, but “creative” doesn’t mean “unmeasurable.” You define softer criteria — instead of “did the JSON parse?” it’s “does this feel like a natural extension of the original?” The acceptable score range widens, but the structure stays the same. You’re not looking for a single right answer — you’re looking for whether the output stays within a quality band.
Hidden reasoning became my biggest lever. I already knew chain-of-thought could help — nothing new there. The surprise was that hidden chain-of-thought — reasoning the user never sees — produced the same quality lift while keeping the output clean.
Before generating the adaptation, the model first analyzes the recipe in a reasoning block — what role does each ingredient play? Which properties must the substitute match? That block gets stripped from the final output. Users see the same concise adapted recipe either way.
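Stripping is mechanical once the prompt asks for the reasoning in a delimited block. A sketch, assuming an illustrative `<analysis>` tag:

```python
import re

def strip_reasoning(raw_output: str, tag: str = "analysis") -> str:
    """Remove the hidden chain-of-thought block before anything reaches the user."""
    cleaned = re.sub(rf"<{tag}>.*?</{tag}>", "", raw_output, flags=re.DOTALL)
    return cleaned.strip()
```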
The hidden-reasoning variant scored meaningfully higher across the board, with the biggest gains on complex multi-substitution adaptations. The model made fewer nonsensical substitutions because it was forced to reason about why an ingredient exists before replacing it.
That’s a technique I discovered through eval. And here’s the thing about techniques: when the next model drops, I’ll know in minutes whether hidden reasoning still helps, or whether the new model reasons well enough on its own. The technique might become obsolete. The eval that measures it won’t.
Image editing: “Does this visual edit look right?”
The third pattern adds a dimension the first two don’t have: the LLM needs to understand spatial context in an image. The model sees a photo, the user says “remove that object” or “swap this element,” and the model produces an edited image. The eval judges whether the edit was faithful, whether untouched regions stayed intact, and whether the result has visual artifacts.
The eval dimensions are the same regardless of what you’re editing — a dish photo, a room scene, a garden snapshot:
- Removal/edit faithfulness: Did the edit actually do what was requested?
- Preservation: Did parts that shouldn’t change stay the same?
- Artifact quality: Ghost silhouettes, warped geometry, duplicate objects?
The evaluator is a multimodal model that sees both the original and the edited result alongside the prompt. It scores each dimension independently, cites specific evidence, and flags failure modes.
This is what sold me on the whole approach: a new image generation model launched, and I assumed it would be better. Newer version. Of course it’s better. Right?
Three rounds of a real bakeoff answered that.
Round 1: Which model is better? I compared the current model (Nanobanana 1, ~8 seconds per generation) against the new one (Nanobanana 2, ~20 seconds) across 4 prompt strategies on a set of benchmark images.
Quality scores were nearly identical on most setups — except one. On a specific prompt strategy, Nanobanana 1 catastrophically failed on half the images: scored zero on the primary removal dimension. Nanobanana 2 handled that strategy fine.
The obvious take: Nanobanana 2 is safer, use it everywhere. But the data told a different story. Nanobanana 1 matched or beat Nanobanana 2 on 3 of 4 strategies at 2.5-3x the speed. The catastrophic failure was isolated to one prompt-and-model combination. The new model wasn’t better. It was just slower and differently-bad.
Round 2: Does sending two images help? I’d been sending both the original photo and a marked-up reference to the generator. Maybe the model was confused by receiving two copies of the scene.
I reran Nanobanana 1 with a single image — just the marked-up reference, no original.
| Prompt strategy | Both images | Single marked |
|---|---|---|
| Strategy A | 9.70 / 100% | 9.93 / 100% |
| Strategy B | 9.33 / 75% | 9.85 / 100% |
| Strategy C | 6.73 / 50% | 9.74 / 100% |
| Strategy D | 9.89 / 100% | 9.63 / 75% |
Single image won on 3 of 4 strategies. The catastrophic failure from Round 1? Fixed entirely.
The model had been treating markup annotations as content to preserve rather than instructions to follow — but only when it also had the unmarked original to compare against.
I sat with this result for a day before accepting it. “More context is better” felt obviously true. The eval proved it wrong.
Round 3: Confirmation. Single-image on Nanobanana 2 to check whether the finding generalized. It didn’t speed Nanobanana 2 up — 17-23 seconds regardless. And Nanobanana 2’s quality was equivalent to Nanobanana 1’s single-image quality, at 2.5-3x the latency cost. No reason to prefer the newer model.
Final decision: Nanobanana 1, single marked image, two prompt strategies scoring above 9.8 with 100% pass rate. The eval collapsed a 2x2x4 decision space into a clear recommendation in three rounds, each under an hour.
That’s eval-as-workflow: each round answers one question, and the answer determines what to test next. Not a waterfall — a directed search where the data tells you where to look.
The Exploration Benefit
Most people build eval as a quality gate — a check that prevents regressions. That’s fine. But if that’s all you’re doing with it, you’re leaving the best part on the table.
The real power is exploration. New models drop every few weeks. Every release is a free experiment — but only if you can evaluate it quickly.
- A new model isn’t a threat when you have eval loops — it’s an opportunity. The image editing bakeoff? Three rounds, one afternoon, conclusive answer. Without the eval, that would have been weeks of manual spot-checking.
- Try a radically different prompt strategy and get signal, not vibes. The hidden chain-of-thought experiment? That was a hunch. The eval proved it in an afternoon.
- Explore cost/quality tradeoffs systematically. I found that for one surface, a model three times cheaper produced results within 2% of the expensive model. The eval gave me the confidence to make that switch.
The eval loop didn’t just tell me if I got worse. It gave me the confidence to try things that made things dramatically better.
What makes this compound is the flywheel: eval catches a problem, you hypothesize a fix, a bakeoff confirms it, you ship with confidence, the eval catches the next problem. Each cycle improves both the product and the eval.
A case where keyword search beat semantic search? That became a permanent regression test. A prompt variant that caused ungrounded suggestions? That input became a new golden case.
I run evals every time a new model drops — which lately means every few weeks. Each run takes under an hour. About two-thirds led to shipped improvements. The rest told me “nope, the baseline is still better” — equally valuable. The cadence isn’t fixed. When models are releasing fast, I’m evaluating fast. When things are stable, I leave it alone. Without the eval loop, most of those experiments would never have happened, because nobody wants to run an experiment they can’t measure.
Patterns That Survive Production
If you want to build this tomorrow:
Start with 10 golden cases, not 100. Curation beats volume. Pick cases that span your intent space. I’ve been running evals for months and my golden sets are still under 20 cases each.
Tag your golden cases for slicing. I tag benchmark inputs with descriptive labels — “simple weeknight,” “elaborate dinner party,” “pantry-clearing.” This lets you see whether a prompt change helps across the board or only on easy inputs.
Treat golden set curation as a discipline. A good golden set needs diversity (different user intents), adversarial cases (the edge cases that actually break things), and ongoing maintenance. Every time an eval catches a real bug, add the failing case to the golden set. A golden set that doesn’t grow with your product gives you a false sense of security.
Pin your baseline. Always compare against the last known-good configuration. “This prompt scores 7.2” is meaningless. “This prompt scores 7.2, which is 0.8 higher than the baseline” is actionable.
Make rubric weights explicit and debatable. When you write “Safety: 10%, Taste accuracy: 20%,” those numbers represent product values, not objective truth. Making them explicit turns a vibes argument into a concrete, adjustable parameter. Change the weight, rerun the eval, see how rankings shift.
Write results incrementally. Stream each result to disk as it completes. If your eval crashes halfway through (and it will — LLM APIs are unreliable), you don’t lose what you already have.
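In Python that’s an append per result (a sketch):

```python
import json

def append_result(path: str, result: dict) -> None:
    """Append one JSON object per line as each result completes.
    A mid-run crash costs you one case, not the whole run."""
    with open(path, "a") as f:
        f.write(json.dumps(result) + "\n")
```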
Know when to look with your own eyes. Automated scores are great for fast iteration, but I always manually review the first few runs of any new rubric — that’s how you calibrate whether the evaluator scores the way you would. And for big decisions (switching models, shipping a major prompt change), I review the full eval output myself before pulling the trigger. The automation tells you where to look. Your judgment tells you what to do about it.
Make it easy to run. If your eval takes more than one command to launch, it won’t get used. Mine is npm run start -- --config ./config.json. Artifacts go to a timestamped runs/ directory. When a new model drops, evaluating it should take less time than making coffee.
And equally important — what not to over-engineer. Don’t build a platform before you have a pattern. Start with a script. Don’t automate judgment calls prematurely — let a human review the first 50 eval runs before you trust the evaluator enough to gate CI. Don’t build a dashboard until you’ve been pasting markdown summaries into PRs for a month and know what you actually look at. The tool should follow the habit, not the other way around.
Putting It Together
All the patterns above compose into a single loop. Here’s the core of my eval harness, stripped to its essentials:
async function runBakeoff(config: EvalConfig) {
const results: BakeoffResult[] = [];
for (const testCase of config.goldenSet) {
for (const setup of config.setups) {
const runs: RunScore[] = [];
// Run multiple times to smooth out LLM variance
for (let i = 0; i < config.runsPerSetup; i++) {
const output = await generate(setup.model, setup.prompt, testCase);
// Contract check first -- fast, deterministic, no LLM needed
const contract = checkContract(output, testCase.expectations);
if (!contract.passed) {
runs.push({ contract, semantic: null });
continue;
}
// Semantic eval with a different, stronger model.
// The evaluator sees the contract analysis as context.
const semantic = await evaluate(config.evaluatorModel, {
input: testCase,
output,
contractAnalysis: contract,
rubric: config.rubric,
});
runs.push({ contract, semantic });
}
results.push({
caseId: testCase.id,
setupId: setup.id,
scores: averageScores(runs),
passRate: runs.filter(r => r.semantic?.passed).length / runs.length,
});
// Write incrementally -- if the API crashes mid-run, you keep what you have
await writeResultToDisk(results);
}
}
// Dual output: markdown for PRs, JSON for automation
await writeMarkdownReport(results, config.baseline);
await writeJSONReport(results, config.baseline);
}
And the config that drives it:
{
"goldenSet": "./golden-cases.jsonl",
"setups": [
{ "id": "baseline", "model": "gemini-2.5-flash", "prompt": "templates/baseline-v3.txt" },
{ "id": "candidate", "model": "gemini-2.5-flash", "prompt": "templates/hidden-cot.txt" },
{ "id": "new-model", "model": "gemini-3.0-flash", "prompt": "templates/baseline-v3.txt" }
],
"evaluatorModel": "gemini-2.5-pro",
"runsPerSetup": 3,
"rubric": {
"tasteAccuracy": {
"weight": 0.20,
"scale": 5,
"prompt": "Does each substitution preserve the flavor role of the original ingredient? Are ratios adjusted for intensity? Cite the specific substitution(s) that informed your score. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"techniqueCorrectness": {
"weight": 0.15,
"scale": 5,
"prompt": "Are cooking methods adjusted for the new ingredients? Would the technique still work? Cite the specific technique issue. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"completeness": {
"weight": 0.15,
"scale": 5,
"prompt": "Did the adaptation address all stated constraints? Cite any unaddressed constraint. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"practicality": {
"weight": 0.10,
"scale": 5,
"prompt": "Can a normal person follow this with ingredients from a regular grocery store? Cite any impractical ingredient or step. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"ingredientGrounding": {
"weight": 0.10,
"scale": 5,
"prompt": "Are all referenced ingredients actually in the source recipe? Cite any ungrounded reference. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"sourcing": {
"weight": 0.10,
"scale": 5,
"prompt": "Can you find these ingredients at a regular grocery store? Cite any hard-to-source ingredient. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"safety": {
"weight": 0.10,
"scale": 5,
"prompt": "Any allergen contamination, dangerous shortcuts, or contradictions with the stated dietary constraint? Cite the specific safety concern. Respond with: score, one-sentence justification, failure_mode (if < 3)."
},
"presentationClarity": {
"weight": 0.10,
"scale": 5,
"prompt": "Are titles and descriptions easy to scan and act on? Cite any unclear or overly verbose element. Respond with: score, one-sentence justification, failure_mode (if < 3)."
}
}
}
Each rubric dimension carries its own evaluator prompt. The harness assembles these into a single evaluation call — one question per dimension, each scored on a 1-5 scale with a required justification. The weights determine how individual scores combine into the final ranking.
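The assembly step can be sketched as a fold over the rubric config (the function name is illustrative; it assumes the `weight`, `scale`, and `prompt` keys shown above):

```python
def assemble_rubric_prompt(rubric: dict[str, dict]) -> str:
    """One question per dimension, each with its scale and weight spelled out."""
    sections = [
        f"{name} (weight {dim['weight']:.0%}, scale 1-{dim['scale']}):\n{dim['prompt']}"
        for name, dim in rubric.items()
    ]
    return "Score the output on each dimension below.\n\n" + "\n\n".join(sections)
```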
Three setups, one config. The harness tests every setup against every golden case, runs each one three times, averages the scores, and ranks the setups by weighted score. One command: npm run start -- --config bakeoff.json. Results land in a timestamped runs/ directory.
That’s the whole thing. A few hundred lines of TypeScript, a JSONL file with your golden cases, and a rubric that defines what “good” means. Nothing fancy. The value isn’t in the code — it’s in running it every time something changes.
Invest in the Thing That Lasts
The LLM feature you build on day one is never the best version. It can’t be — you haven’t learned enough yet about where the model struggles or which edge cases matter. And even if you had, the model will change next month.
The eval loop turns “built” into “systematically improving.” It’s not a big infrastructure investment. My entire eval setup is a few hundred lines of TypeScript, some golden set files, and a handful of rubrics.
The expensive part isn’t the code. It’s the habit: measuring before changing, every time, no exceptions.
I built this discipline before our first user, and I’m glad I did — the models have changed multiple times since I started. Each time, the eval told me within an hour what still worked and what didn’t.
The prompts I wrote at the beginning are mostly obsolete. The eval loops I built at the beginning still run with every release.
Everything in the LLM ecosystem is ephemeral — models, prompts, techniques. The eval loop is the only durable investment. It compounds across model generations, across prompt rewrites, across every change you make.
Start with one bakeoff. Pick 3 prompt variants, 10 test cases, one rubric. You’ll learn more from that single run than from months of manual spot-checking. Build the loop. Trust the signal over the vibes.