Problem
When an agent fails, the usual reaction is to tweak the prompt. A sentence is added, a rule is hardened, an example is inserted, and the case that just broke is tried again.
That may fix the immediate failure. It can also silently break other cases that nobody is checking.
Without evaluations, each improvement is a gamble.
Thesis
The Eval Flywheel is the mechanism that turns agent errors into accumulated learning.
A mature team does more than adjust prompts. It captures real cases, labels them, turns them into datasets, measures regressions, compares variants, and decides whether a change improves the whole system.
What separates prompt engineering from AI operations is experimental memory: a record of what was tried and how it affected every known case.
Framework
The flywheel has six steps:
- Capture: store real cases, good and bad.
- Label: classify intent, failure layer, and expected outcome.
- Dataset: turn them into a reusable collection.
- Grade: evaluate with rules, humans, or judge models.
- Compare: measure the new prompt, model, tool, or retrieval setup against the baseline.
- Promote: push to production only if the change improves results without breaking existing cases.
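The first four steps (capture, label, dataset, grade) can be sketched with plain data structures. This is a minimal sketch, not a reference implementation: `EvalCase`, `grade_exact`, `run_dataset`, and `toy_agent` are hypothetical names introduced for illustration, the two cases are invented, and the rule-based grader (a match on the expected outcome) stands in for human review or a judge model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    """One captured case, labeled with intent, failure layer, and expected outcome."""
    case_id: str
    user_input: str
    intent: str
    failure_layer: str      # e.g. "model", "context", "tool"; "none" for good cases
    expected_outcome: str


def grade_exact(case: EvalCase, actual: str) -> bool:
    """Rule-based grader: pass when outcomes match, ignoring case and outer spaces."""
    return actual.strip().lower() == case.expected_outcome.strip().lower()


def run_dataset(dataset: list[EvalCase], agent: Callable[[str], str]) -> dict[str, bool]:
    """Run every case through the agent and grade the result."""
    return {c.case_id: grade_exact(c, agent(c.user_input)) for c in dataset}


# Hypothetical dataset: one good case and one captured failure.
DATASET = [
    EvalCase("c1", "We need 200 seats next quarter", "qualify_lead", "none", "qualified"),
    EvalCase("c2", "Just browsing, no budget", "qualify_lead", "criterion", "not_qualified"),
]


def toy_agent(text: str) -> str:
    """Stand-in for a real agent call (assumption: it returns an outcome label)."""
    return "not_qualified" if "no budget" in text.lower() else "qualified"


results = run_dataset(DATASET, toy_agent)
print(results)  # {'c1': True, 'c2': True}
```

Keeping the dataset as versioned data rather than ad hoc notes is what lets the same cases be rerun against every later variant.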
Mini-case: a sales agent qualifies leads. After a failure, the team creates five similar cases: one easy, two ambiguous, one with contradictory data, and one that must be escalated to a human. The new prompt is approved only if it improves the whole set, not merely because it looks good in a demo.
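The approval rule in this mini-case can be expressed as a comparison of pass rates over the same five cases. A sketch under stated assumptions: the case ids and pass/fail values below are invented for illustration, not measured data, and `pass_rate` and `approve` are hypothetical helpers.

```python
def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def approve(baseline: dict[str, bool], candidate: dict[str, bool]) -> bool:
    """Approve only if the candidate beats the baseline on the same case set."""
    if baseline.keys() != candidate.keys():
        raise ValueError("baseline and candidate must cover the same cases")
    return pass_rate(candidate) > pass_rate(baseline)


# Illustrative results for the five cases (made-up values).
baseline = {"easy": True, "ambiguous_1": False, "ambiguous_2": True,
            "contradictory": False, "escalate": True}

# Fixes the case that broke but stops escalating: good in a demo, flat overall.
demo_fix = {"easy": True, "ambiguous_1": True, "ambiguous_2": True,
            "contradictory": False, "escalate": False}

# Fixes the case that broke and keeps everything else intact.
real_fix = {**baseline, "ambiguous_1": True}

print(approve(baseline, demo_fix))  # False
print(approve(baseline, real_fix))  # True
```

A stricter team might also require that no previously passing case fails; this sketch checks only the aggregate.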
Measurable signal: percentage of incidents turned into regression evals.
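This signal is a simple ratio. A minimal sketch, assuming each incident record carries a hypothetical `has_regression_eval` flag; the sample records are invented.

```python
def eval_coverage(incidents: list[dict]) -> float:
    """Percentage of incidents that were turned into a regression eval."""
    if not incidents:
        return 0.0
    covered = sum(1 for incident in incidents if incident["has_regression_eval"])
    return 100.0 * covered / len(incidents)


incidents = [
    {"id": "INC-1", "has_regression_eval": True},
    {"id": "INC-2", "has_regression_eval": False},
    {"id": "INC-3", "has_regression_eval": True},
    {"id": "INC-4", "has_regression_eval": True},
]

print(f"{eval_coverage(incidents):.0f}%")  # 75%
```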
Posture: if a bug doesn’t enter the dataset, it will come back in disguise.
Why it matters now
OpenAI documents evaluations for agents with datasets, graders, and trace evaluation. LangSmith maintains workflows for datasets, experiments, and comparison. Anthropic also publishes tools and evaluation guides to test Claude’s behavior.
The market is converging: agent quality is no longer governed solely by the prompt. It is governed by repeatable tests.
Anti-example
“We ran twenty manual examples before launch.”
Better than nothing, but insufficient. If those examples aren’t versioned, labeled, and comparable, the next improvement starts from scratch.
Protocol (3 steps)
- Create an initial dataset of 50 cases. Make them real, varied, and paired with expected outcomes.
- Label by layer. Tag each failure as model, context, tool, permission, criterion, format, or verification.
- Block critical regressions. A change does not pass if it breaks high‑impact cases.
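Step 3 can be enforced as a gate in the release process. A minimal sketch, assuming each case carries a hypothetical `critical` flag and that pass/fail results already exist for both the baseline and the candidate. The layer names come from step 2; the case ids and results are illustrative.

```python
from dataclasses import dataclass

LAYERS = {"model", "context", "tool", "permission", "criterion", "format", "verification"}


@dataclass(frozen=True)
class LabeledCase:
    case_id: str
    layer: str       # failure layer from step 2
    critical: bool   # high-impact case that must never regress

    def __post_init__(self) -> None:
        # Reject labels outside the agreed layer vocabulary.
        if self.layer not in LAYERS:
            raise ValueError(f"unknown layer: {self.layer}")


def critical_regressions(cases, baseline, candidate):
    """Ids of critical cases that pass on the baseline but fail on the candidate."""
    return [c.case_id for c in cases
            if c.critical and baseline[c.case_id] and not candidate[c.case_id]]


cases = [
    LabeledCase("refund_over_limit", "permission", critical=True),
    LabeledCase("date_format", "format", critical=False),
]
baseline = {"refund_over_limit": True, "date_format": False}
candidate = {"refund_over_limit": False, "date_format": True}

blocked = critical_regressions(cases, baseline, candidate)
print("BLOCK" if blocked else "PASS", blocked)  # BLOCK ['refund_over_limit']
```

Here the candidate fixes a formatting case but breaks a permission case, so the gate blocks it even though the overall pass count is unchanged.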
| Asset | Function | Owner |
|---|---|---|
| dataset | case memory | product |
| graders | repeatable criterion | AI/ops |
| trace evals | process quality | engineering |
| regression suite | avoid backsliding | QA |
| release note | explain changes | agent owner |
Related
- The jagged frontier of AI: the failure map every team needs before automating
- AI Evaluation Stack 2026: measuring without theater
- Token-to-Outcome: the KPI that separates used AI from profitable AI
Next step
Don’t change the next prompt directly. First capture ten cases you want to protect, define the correct outcome, and create the first mini regression suite.
Translated from the Spanish original with AI assistance and reviewed for accuracy.