Problem
The conversation about AI usually treats capability as a continuous line: this model is better than the last one, this benchmark went up, this task can already be automated, that one cannot yet.
In practice it does not work that way. Two tasks that seem equally difficult to a person can fall on opposite sides of the model’s capability frontier. One is solved with surprising quality. The other produces a convincing but wrong answer.
That pattern is known as the jagged frontier. In 2026 it matters more than ever because agents do not only respond; they act.
Thesis
Before automating, every team should draw its jagged‑frontier map.
It is not enough to ask “can AI do this?” One must ask: in which variants does it perform well, in which does it appear to perform well but fail, what signals anticipate the failure, and where should it stop before executing.
The jagged frontier turns governance into concrete work. It is not a committee saying “be careful with AI”; it is a matrix of tasks, errors, tests, and operational limits.
Framework
Map each workflow into four zones:
- Green zone: repeatable, verifiable tasks with low error cost.
- Yellow zone: useful tasks that need light human review.
- Red zone: tasks where a plausible answer could cause operational harm.
- Gray zone: tasks where there is still insufficient evidence to decide.
Mini‑case: an agent can summarize tickets and extract entities reliably. It can also propose refund policies for exceptional cases. Both look like text tasks, but they do not belong to the same zone. The first can be validated against data. The second mixes judgment, exceptions, legal risk, and customer experience.
Measurable signal: the share of failures caught before execution, compared with the share discovered afterward by the client, user, or downstream team.
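The measurable signal above can be computed from a simple failure log. The following is a minimal sketch, not a standard metric definition: the labels `pre_execution_check`, `client`, `user`, and `downstream_team`, the function name, and the sample counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Iterable, Optional

# Hypothetical labels recording who noticed each failure.
BEFORE_EXECUTION = "pre_execution_check"
AFTER_EXECUTION = ("client", "user", "downstream_team")


def pre_execution_detection_rate(detected_by: Iterable[str]) -> Optional[float]:
    """Share of recorded failures that were caught before the agent acted."""
    counts = Counter(detected_by)
    caught = counts[BEFORE_EXECUTION]
    escaped = sum(counts[label] for label in AFTER_EXECUTION)
    total = caught + escaped
    if total == 0:
        return None  # no failures recorded, so the ratio is undefined
    return caught / total


# Made-up log: 18 failures stopped in time, 6 found by someone downstream.
failures = (
    ["pre_execution_check"] * 18
    + ["client"] * 3
    + ["user"] * 2
    + ["downstream_team"]
)
rate = pre_execution_detection_rate(failures)
print(f"{rate:.0%} of failures caught before execution")  # 18 of 24 -> 75%
```

A falling rate suggests the map is out of date: more failures are reaching people who were not supposed to be the safety net.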
Stance: automating without a frontier map is delegating not only work but also ignorance.
Why it matters now
The Harvard Business School and BCG study on the jagged frontier showed a paradox: inside the frontier, participants with AI completed more tasks, faster and with higher quality; outside it, they were less likely to produce correct solutions than those who did not use AI.
In March 2026, Harvard revisited the research after its formal publication in Organization Science. The lesson remains uncomfortable: AI does not simply help or fail. Sometimes it helps just enough for the failure to look professional.
Stanford HAI, in its AI Index 2026, also highlights governance, transparency, and hallucination problems. The context is not “AI is useless”; it is that systems need situated evaluation, not blind faith in the model.
Anti-example
“The benchmark says the model is good at reasoning, so it can make this decision.”
A general benchmark does not describe your local frontier. Your process has incomplete data, exceptions, internal policies, legacy systems, real customers, and consequences. You cannot read the jagged frontier off a spec sheet; you discover it in controlled execution.
Protocol (3 steps)
- List variants, not tasks. “Respond to tickets” is not a task; simple refund, potential fraud, and enterprise client are distinct variants.
- Force frontier cases. Test ambiguity, contradictory data, incomplete instructions, and rare exceptions.
- Define an automatic stop. If a red signal appears, the agent must escalate, not improvise.
| Zone | Example | Minimum control |
|---|---|---|
| green | repeatable ticket classification | sampling and metrics |
| yellow | drafting a sensitive response | light review |
| red | approving a financial exception | responsible human |
| gray | new process without history | closed pilot |
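The automatic stop and the table of minimum controls can be expressed as a small gate that runs before the agent acts. This is a sketch under stated assumptions rather than a reference implementation: the names `Zone`, `Decision`, `decide`, and `RED_SIGNALS`, the specific signal labels, and the three actions (`execute`, `hold`, `escalate`) are hypothetical. Only the zone names and their minimum controls come from the table.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    GREEN = "green"
    YELLOW = "yellow"
    RED = "red"
    GRAY = "gray"


# Minimum control per zone, as listed in the table.
MINIMUM_CONTROL = {
    Zone.GREEN: "sampling and metrics",
    Zone.YELLOW: "light review",
    Zone.RED: "responsible human",
    Zone.GRAY: "closed pilot",
}

# Hypothetical red signals, loosely based on the frontier cases in the protocol.
RED_SIGNALS = {"contradictory_data", "incomplete_instructions", "rare_exception"}


@dataclass
class Decision:
    action: str  # "execute", "hold", or "escalate"
    reason: str


def decide(zone: Zone, signals: set[str]) -> Decision:
    """Gate an agent action: a red signal always wins over the zone."""
    triggered = sorted(signals & RED_SIGNALS)
    if triggered:
        # Automatic stop: escalate instead of improvising.
        return Decision("escalate", "red signal: " + ", ".join(triggered))
    if zone is Zone.GREEN:
        # Only green work runs autonomously, still under its minimum control.
        return Decision("execute", MINIMUM_CONTROL[zone])
    # Every other zone waits for its minimum control before anything runs.
    return Decision("hold", MINIMUM_CONTROL[zone])


print(decide(Zone.GREEN, set()))
print(decide(Zone.GREEN, {"contradictory_data"}))
print(decide(Zone.RED, set()))
```

The ordering is the design choice worth noting: the signal check comes first, so even a task mapped as green stops when the input looks like a frontier case.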
Related
- Agent Reliability Score: how to know if an agent deserves autonomy
- AI Governance Backlog: turning risk into executable work
- Rollback Design for AI Workflows: how to shut down automations without breaking operations
Sources consulted
- Harvard D3: Navigating the Jagged Technological Frontier
- Harvard D3: Back to the Beginnings of AI at Work
- Stanford HAI: 2026 AI Index Report, Responsible AI
Next step
Choose a process you want to automate and do not run a generic test. Build twenty variants: ten easy, five ambiguous, three rare, and two dangerous. That’s where your real map starts.
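One way to keep that mix honest is to track the suite as data and check what is still missing. The sketch below assumes a hypothetical `Variant` record and `missing_variants` helper. The variant names are borrowed from the protocol, but which kind each one belongs to is an assumption made for illustration.

```python
from collections import Counter
from dataclasses import dataclass

# Mix suggested in the text: ten easy, five ambiguous, three rare, two dangerous.
TARGET_MIX = {"easy": 10, "ambiguous": 5, "rare": 3, "dangerous": 2}


@dataclass(frozen=True)
class Variant:
    name: str
    kind: str  # one of the keys in TARGET_MIX


def missing_variants(suite: list[Variant]) -> dict[str, int]:
    """Return how many variants of each kind are still needed."""
    unknown = {v.kind for v in suite} - TARGET_MIX.keys()
    if unknown:
        raise ValueError(f"unknown variant kinds: {sorted(unknown)}")
    counts = Counter(v.kind for v in suite)
    return {
        kind: target - counts[kind]
        for kind, target in TARGET_MIX.items()
        if counts[kind] < target
    }


# Illustrative start of a suite; the kind assigned to each name is assumed.
suite = [
    Variant("simple refund", "easy"),
    Variant("enterprise client", "ambiguous"),
    Variant("potential fraud", "dangerous"),
]
print(missing_variants(suite))
```

An empty result means the suite has reached the twenty-variant mix and is ready to run.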
Translated from the Spanish original with AI assistance and reviewed for accuracy.