Eval Flywheel: Production agents aren't fixed with prompts, they're fixed with cases

Problem

When an agent fails, the usual reaction is to tweak the prompt. A sentence is added, a rule is hardened, an example is inserted, and the case that just broke is tried again.

That may fix the case at hand. It can also break other cases that nobody is looking at.

Without evaluations, each improvement is a gamble.

Thesis

The Eval Flywheel is the mechanism that turns agent errors into accumulated learning.

A mature team does more than adjust prompts. It captures real cases, labels them, turns them into datasets, measures regressions, compares variants, and decides whether a change improves the whole system.

The difference between prompt engineering and operating AI in production is experimental memory.

Framework

The flywheel has six steps:

Capture: store real cases, good and bad.
Label: classify intent, failure layer, and expected outcome.
Dataset: turn them into a reusable collection.
Grade: evaluate with rules, humans, or judge models.
Compare: measure new prompt, model, tool, or retrieval against the baseline.
Promote: push to production only if it improves without breaking.
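
The first four steps can be pictured as a small record type plus a rule-based grader. This is a minimal sketch, not a reference implementation: `EvalCase`, `to_dataset`, `grade_exact`, and every field name are hypothetical, and the grader could just as well be a human or a judge model.

```python
from dataclasses import dataclass, asdict
import json


@dataclass(frozen=True)
class EvalCase:
    """One captured production case (all field names are illustrative)."""
    case_id: str
    user_input: str
    intent: str            # Label: what the user wanted
    failure_layer: str     # Label: where it broke, or "none" for a good case
    expected_outcome: str  # Label: what the agent should have done


def to_dataset(cases):
    """Dataset: serialize cases as JSON Lines so they can be stored and versioned."""
    return "\n".join(json.dumps(asdict(c), sort_keys=True) for c in cases)


def grade_exact(case, actual_outcome):
    """Grade: the simplest rule-based grader, a normalized exact match."""
    return actual_outcome.strip().lower() == case.expected_outcome.strip().lower()


# Capture: one good case and one bad case from a lead-qualification agent.
cases = [
    EvalCase("c1", "We need 200 seats by Q3", "qualify_lead", "none", "qualified"),
    EvalCase("c2", "Just browsing, no budget", "qualify_lead", "criterion", "disqualified"),
]
dataset = to_dataset(cases)
```

Keeping good cases next to bad ones matters here: the good ones are what a later change is most likely to break without anyone noticing.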

Mini-case: a sales agent qualifies leads. After a failure, the team creates five similar cases: one easy, two ambiguous, one with contradictory data, and one that must be escalated to a human. The new prompt is approved only if it improves results across the whole set, not merely because it looks good in a demo.
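
A sketch of that approval rule, assuming each variant has already been graded into a pass/fail result per case. The case names, the results, and the functions `compare` and `approve` are invented for illustration, and "improves the whole set" is interpreted here as "more cases pass and none that passed before now fails", which is one reasonable reading among several.

```python
def compare(baseline, candidate):
    """Compare per-case pass/fail results (dicts of case_id -> bool)."""
    regressions = sorted(c for c in baseline if baseline[c] and not candidate[c])
    fixes = sorted(c for c in baseline if not baseline[c] and candidate[c])
    return {
        "baseline_passed": sum(baseline.values()),
        "candidate_passed": sum(candidate.values()),
        "regressions": regressions,
        "fixes": fixes,
    }


def approve(report):
    """Approve only if the set improves overall and nothing that worked breaks."""
    improved = report["candidate_passed"] > report["baseline_passed"]
    return improved and not report["regressions"]


# Hypothetical graded results for the five cases in the mini-case.
baseline = {"easy": True, "ambiguous_1": False, "ambiguous_2": True,
            "contradictory": False, "escalate": True}
demo_fix = {"easy": True, "ambiguous_1": True, "ambiguous_2": True,
            "contradictory": False, "escalate": False}  # fixes one, breaks one
real_fix = {"easy": True, "ambiguous_1": True, "ambiguous_2": True,
            "contradictory": False, "escalate": True}   # fixes one, breaks none

print(approve(compare(baseline, demo_fix)))  # False
print(approve(compare(baseline, real_fix)))  # True
```

The `demo_fix` variant repairs the case that triggered the incident, which is exactly why it would look good in a demo, yet it stops escalating to a human and is rejected.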

Measurable signal: percentage of incidents turned into regression evals.
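
The signal is simple arithmetic once incidents are linked to the eval cases they produced. The sketch below assumes a hypothetical incident record with an `eval_case_ids` list; the field name and the sample incidents are made up for illustration.

```python
def regression_coverage(incidents):
    """Percentage of incidents that have at least one linked regression eval."""
    if not incidents:
        return 0.0
    covered = sum(1 for incident in incidents if incident["eval_case_ids"])
    return 100.0 * covered / len(incidents)


incidents = [
    {"id": "INC-1", "eval_case_ids": ["c1", "c2"]},
    {"id": "INC-2", "eval_case_ids": []},
    {"id": "INC-3", "eval_case_ids": ["c7"]},
    {"id": "INC-4", "eval_case_ids": []},
]
print(regression_coverage(incidents))  # 50.0
```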

Posture: if a bug doesn’t enter the dataset, it will come back in disguise.

Why it matters now

OpenAI documents evaluations for agents with datasets, graders, and trace evaluation. LangSmith maintains workflows for datasets, experiments, and comparison. Anthropic also publishes tools and evaluation guides to test Claude’s behavior.

The market is converging: agent quality is no longer governed solely by the prompt. It is governed by repeatable tests.

Anti-example

“We ran twenty manual examples before launch.”

Better than nothing, but insufficient. If those examples aren’t versioned, labeled, and comparable, the next improvement starts from scratch.
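
One lightweight way to make manual examples versioned and comparable is to fingerprint the dataset and store that fingerprint with every run. This is only a sketch under assumptions: the case shape, `dataset_fingerprint`, and `record_run` are hypothetical, and a team could get the same effect by keeping the dataset file in version control.

```python
import hashlib
import json


def dataset_fingerprint(cases):
    """Stable short hash of the dataset contents, independent of case order."""
    canonical = json.dumps(sorted(cases, key=lambda c: c["case_id"]), sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def record_run(cases, variant, passed):
    """Attach the dataset fingerprint to a run so results stay comparable later."""
    return {
        "dataset": dataset_fingerprint(cases),
        "variant": variant,
        "passed": passed,
        "total": len(cases),
    }


cases = [
    {"case_id": "c1", "expected": "qualified"},
    {"case_id": "c2", "expected": "escalate"},
]
run = record_run(cases, "prompt_v2", passed=1)
```

Two runs are only worth comparing when their `dataset` values match; if a case was edited or added in between, the fingerprint changes and the comparison is flagged as apples to oranges.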

Protocol (3 steps)

Create an initial dataset of 50 cases: real, varied, and each with an expected outcome.
Label by layer: model, context, tool, permission, criterion, format, or verification.
Block critical regressions: a change does not pass if it breaks high‑impact cases.
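
The third step can be expressed as a release gate. The sketch below assumes each case carries a `layer` label from the list above and a hypothetical `impact` field; the function `gate` and the sample cases are illustrative, and what counts as "high impact" is a decision each team has to make for itself.

```python
from collections import Counter


def gate(cases, baseline, candidate):
    """Block the change if any high-impact case that passed before now fails."""
    blocked = [
        c["case_id"] for c in cases
        if c["impact"] == "high"
        and baseline[c["case_id"]]
        and not candidate[c["case_id"]]
    ]
    # Count the candidate's failures per layer to show where to look next.
    failures_by_layer = Counter(
        c["layer"] for c in cases if not candidate[c["case_id"]]
    )
    return {
        "promote": not blocked,
        "blocked_by": blocked,
        "failures_by_layer": dict(failures_by_layer),
    }


cases = [
    {"case_id": "refund_limit", "layer": "permission", "impact": "high"},
    {"case_id": "tone", "layer": "format", "impact": "low"},
    {"case_id": "stale_price", "layer": "context", "impact": "high"},
]
baseline = {"refund_limit": True, "tone": False, "stale_price": False}
candidate = {"refund_limit": False, "tone": True, "stale_price": False}

result = gate(cases, baseline, candidate)
print(result["promote"], result["blocked_by"])  # False ['refund_limit']
```

Note that `stale_price` fails under both variants and does not block the change: the gate stops backsliding on critical cases, while the per-layer count points at what still needs work.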

Asset	Function	Owner
dataset	case memory	product
graders	repeatable criterion	AI/ops
trace evals	process quality	engineering
regression suite	avoid backsliding	QA
release note	explain changes	agent owner

Next step

Don’t change the next prompt directly. First capture ten cases you want to protect, define the correct outcome, and create the first mini regression suite.

Translated from the Spanish original with AI assistance and reviewed for accuracy.
