
Eval Flywheel: Production agents aren't fixed with prompts, they're fixed with cases


Key Takeaways

  • Capture: store real cases, good and bad.
  • Label: classify intent, failure layer, and expected outcome.
  • Dataset: turn them into a reusable collection.
  • Grade: evaluate with rules, humans, or judge models.

Decision

Decide what governance, ownership or cadence is missing before scaling AI.

Room

Executive committee, AI portfolio review, transformation steering.

Risk

Mistaking activity, pilots and tooling for real operating capability.

Agent prompt: map decision rights, KPIs, risks and the next operational move

Problem

When an agent fails, the usual reaction is to tweak the prompt. The team adds a sentence, tightens a rule, inserts an example, and reruns the case that just broke.

That may fix the immediate failure. It can also silently break other cases that nobody is watching.

Without evaluations, each improvement is a gamble.

Thesis

The Eval Flywheel is the mechanism that turns agent errors into accumulated learning.

A mature team does more than adjust prompts. It captures real cases, labels them, turns them into datasets, measures regressions, compares variants, and decides whether a change improves the whole system.

The difference between prompt engineering and AI operation is experimental memory.

Framework

The flywheel has six steps:

  • Capture: store real cases, good and bad.
  • Label: classify intent, failure layer, and expected outcome.
  • Dataset: turn them into a reusable collection.
  • Grade: evaluate with rules, humans, or judge models.
  • Compare: measure new prompt, model, tool, or retrieval against the baseline.
  • Promote: push to production only if it improves without breaking.

Mini-case: a sales agent qualifies leads. After a failure, the team creates five similar cases: one easy, two ambiguous, one with contradictory data, and one that must be escalated to a human. The new prompt is approved only if it improves results across the whole set, not merely because it looks good in a demo.
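The six steps and the mini-case above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation: the case texts, the labels (`qualified`, `nurture`, `escalate`), the toy keyword agents, and the promotion rule ("more passes overall and no previously passing case now fails") are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class EvalCase:
    case_id: str
    kind: str        # Label: "easy", "ambiguous", "contradictory", "escalate"
    lead_note: str   # Capture: the text the agent saw
    expected: str    # Label: the outcome a correct agent should return

# Dataset: five cases shaped like the mini-case (contents invented for illustration).
DATASET = [
    EvalCase("easy-1", "easy", "budget approved, 200 seats, wants a demo", "qualified"),
    EvalCase("amb-1", "ambiguous", "curious about pricing, no timeline", "nurture"),
    EvalCase("amb-2", "ambiguous", "student researching tools", "nurture"),
    EvalCase("contra-1", "contradictory",
             "budget approved, but says no budget this year", "escalate"),
    EvalCase("esc-1", "escalate", "asks for custom legal terms", "escalate"),
]

Agent = Callable[[str], str]

def grade(agent: Agent, dataset: list[EvalCase]) -> dict[str, bool]:
    """Grade: rule-based exact match; a human or judge model could replace it."""
    return {c.case_id: agent(c.lead_note) == c.expected for c in dataset}

def should_promote(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Compare + Promote: improve the whole set without breaking a passing case."""
    regressions = [cid for cid, ok in baseline.items() if ok and not candidate[cid]]
    return not regressions and sum(candidate.values()) > sum(baseline.values())

# Toy stand-ins for "the agent with prompt X" (hypothetical keyword rules).
def baseline_agent(note: str) -> str:
    if "legal" in note:
        return "escalate"
    return "qualified" if "budget approved" in note else "nurture"

def demo_tuned_agent(note: str) -> str:
    # Fixes the contradictory case that broke, but loses the legal escalation rule.
    if "no budget" in note:
        return "escalate"
    return "qualified" if "budget approved" in note else "nurture"

def candidate_agent(note: str) -> str:
    if "legal" in note or "no budget" in note:
        return "escalate"
    return "qualified" if "budget approved" in note else "nurture"

baseline_results = grade(baseline_agent, DATASET)
print(should_promote(baseline_results, grade(demo_tuned_agent, DATASET)))  # False
print(should_promote(baseline_results, grade(candidate_agent, DATASET)))   # True
```

The demo-tuned variant fixes the case that originally failed and would look fine in a demo, yet it is rejected because it breaks the escalation case. Only the variant that improves the set without regressions is promoted.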

Measurable signal: percentage of incidents turned into regression evals.
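One way to compute this signal, assuming each regression eval records the incident it came from. The function name, the ID format, and the choice to ignore evals linked to unknown incidents are illustrative assumptions.

```python
def regression_coverage(incident_ids: set[str], eval_source_incidents: set[str]) -> float:
    """Percentage of incidents that have at least one regression eval linked to them."""
    if not incident_ids:
        return 0.0
    covered = incident_ids & eval_source_incidents
    return 100.0 * len(covered) / len(incident_ids)

# Hypothetical incident log and the incident ids referenced by existing evals.
incidents = {"INC-101", "INC-102", "INC-103", "INC-104"}
linked = {"INC-101", "INC-103", "INC-104", "INC-999"}  # INC-999 is not a known incident

print(regression_coverage(incidents, linked))  # 75.0
```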

Posture: if a bug doesn’t enter the dataset, it will return in disguise.

Why it matters now

OpenAI documents agent evaluations built on datasets, graders, and trace evaluation. LangSmith supports workflows for datasets, experiments, and comparison. Anthropic also publishes evaluation tools and guides for testing Claude’s behavior.

The market is converging: agent quality is no longer governed solely by the prompt. It is governed by repeatable tests.

Anti-example

“We ran twenty manual examples before launch.”

Better than nothing, but insufficient. If those examples aren’t versioned, labeled, and comparable, the next improvement starts from scratch.
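A minimal sketch of what "versioned, labeled, and comparable" could mean in practice, assuming cases are stored as JSON Lines and the dataset version is a content hash. The field names and the 12-character hash prefix are choices made for this example, not a standard.

```python
import hashlib
import json

# Hypothetical labeled cases; every field name here is an assumption.
CASES = [
    {"id": "c1", "input": "refund for order 123", "layer": "tool", "expected": "refund_issued"},
    {"id": "c2", "input": "cancel my account", "layer": "permission", "expected": "escalate"},
]

def to_jsonl(cases: list[dict]) -> str:
    """One case per line, with stable key order so diffs stay readable."""
    return "\n".join(json.dumps(c, sort_keys=True) for c in cases)

def dataset_version(cases: list[dict]) -> str:
    """Content hash: the same cases in any order yield the same version id."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]

print(to_jsonl(CASES))
print("dataset version:", dataset_version(CASES))
```

Recording the version id next to every eval run makes two results comparable only when they were scored against the same cases.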

Protocol (3 steps)

  1. Create an initial dataset of 50 cases. Real, varied, and with expected outcomes.
  2. Label by layer. Model, context, tool, permission, criterion, format, or verification.
  3. Block critical regressions. A change does not pass if it breaks high‑impact cases.

Asset             Function              Owner
dataset           case memory           product
graders           repeatable criterion  AI/ops
trace evals       process quality       engineering
regression suite  avoid backsliding     QA
release note      explain changes       agent owner
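Steps 2 and 3 of the protocol can be sketched as a layer label plus a release gate. The `critical` flag is a hypothetical way to mark high-impact cases, and "breaks" is interpreted here as "passed on the baseline, fails on the candidate". Both are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    # The seven layers named in step 2.
    MODEL = "model"
    CONTEXT = "context"
    TOOL = "tool"
    PERMISSION = "permission"
    CRITERION = "criterion"
    FORMAT = "format"
    VERIFICATION = "verification"

@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    layer: Layer
    critical: bool  # hypothetical marker for a high-impact case

def blocking_regressions(
    cases: list[LabeledCase],
    baseline_pass: dict[str, bool],
    candidate_pass: dict[str, bool],
) -> list[str]:
    """Return ids of critical cases that passed before and fail with the change."""
    return [
        c.case_id
        for c in cases
        if c.critical and baseline_pass[c.case_id] and not candidate_pass[c.case_id]
    ]

cases = [
    LabeledCase("refund-over-limit", Layer.PERMISSION, critical=True),
    LabeledCase("date-format", Layer.FORMAT, critical=False),
]
baseline = {"refund-over-limit": True, "date-format": True}
candidate = {"refund-over-limit": False, "date-format": False}

blocked = blocking_regressions(cases, baseline, candidate)
print(blocked)  # ['refund-over-limit']
print("release allowed:", not blocked)  # release allowed: False
```

In this sketch the non-critical format regression is still visible in the results, but only the critical one blocks the release.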


Next step

Don’t change the next prompt directly. First capture ten cases you want to protect, define the correct outcome, and create the first mini regression suite.


Translated from the Spanish original with AI assistance and reviewed for accuracy. Read the original in Spanish.

Cite this article

Berthelius, V. (2026). “Eval Flywheel: Production agents aren't fixed with prompts, they're fixed with cases”. BRTHLS Magazine. https://www.brthls.com/magazine/eval-flywheel-fixing-agents-with-cases-en
