Problem
Most teams measure AI with indicators that don’t explain business value: prompts sent, active users, tokens consumed, monthly cost, hours “saved,” or volume automated.
These metrics are useful for operations but poor for decision-making. One agent may consume few tokens and not move the needle. Another may consume many and close a job that previously blocked three people. Without a unit that connects computational cost to outcome, the debate becomes moral: one side asks to cut spending, the other asks to experiment more.
The problem isn’t the token. The problem is that nobody knows what outcome they’re buying.
Thesis
Token-to-Outcome should become the base KPI for any operation with agents.
It doesn’t measure whether AI is used. It measures how many tokens, calls, tools, and human reviews a system needs to produce an accepted result: an incident resolved, a migration validated, a report published, an opportunity qualified, a piece approved, or a decision recorded.
The company that only looks at cost per token optimizes an input. The one that looks at token-to-outcome optimizes the system.
Framework
A good token-to-outcome KPI needs four layers:
- Outcome unit: what counts as finished work.
- Computational cost: tokens, calls, tools, executions, and retries.
- Human cost: review, correction, waiting, escalation, and supervision.
- Verifiable quality: criteria that prevent counting cheap junk as success.
Mini-case: a support agent generates 10,000 responses at low cost. If only 20% of them resolve the issue without the customer contacting support again, the system is cheap but weak. Another agent consumes more tokens per case, checks three systems, verifies policies, and closes 65% of cases without escalation. The second may look expensive on the dashboard, yet be more profitable per outcome.
Measurable signal: total cost per accepted result, not cost per conversation or cost per token.
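As a minimal sketch of that signal, the snippet below computes total cost per accepted result for two workflows shaped like the mini-case. Only the 10,000 volume and the 20% and 65% rates come from the text. Everything else is an assumption made for illustration: the dollar figures, the idea that the second agent also handles 10,000 cases, treating "closed without escalation" as accepted, and the names `WorkflowStats`, `cost_per_case`, and `cost_per_accepted_result`.

```python
from dataclasses import dataclass


@dataclass
class WorkflowStats:
    """Aggregated figures for one workflow over one period (hypothetical)."""
    cases: int           # cases attempted
    accepted: int        # outcomes that met the quality criteria
    compute_cost: float  # tokens, calls, tools, retries, in dollars
    human_cost: float    # review, correction, escalation, in dollars


def cost_per_case(stats: WorkflowStats) -> float:
    """Compute-only cost per case: the figure a usage dashboard tends to show."""
    return stats.compute_cost / stats.cases


def cost_per_accepted_result(stats: WorkflowStats) -> float:
    """Compute plus human cost, divided by accepted outcomes."""
    if stats.accepted == 0:
        return float("inf")  # money spent, nothing accepted
    return (stats.compute_cost + stats.human_cost) / stats.accepted


# Assumed prices, not taken from the article.
HUMAN_COST_PER_UNRESOLVED_CASE = 4.00

cheap = WorkflowStats(
    cases=10_000,
    accepted=2_000,                 # 20% resolved without recontact
    compute_cost=10_000 * 0.02,     # assumed 0.02 per case
    human_cost=8_000 * HUMAN_COST_PER_UNRESOLVED_CASE,
)
thorough = WorkflowStats(
    cases=10_000,
    accepted=6_500,                 # 65% closed without escalation
    compute_cost=10_000 * 0.15,     # assumed 0.15 per case
    human_cost=3_500 * HUMAN_COST_PER_UNRESOLVED_CASE,
)

for name, stats in (("cheap", cheap), ("thorough", thorough)):
    print(
        f"{name}: {cost_per_case(stats):.2f} per case, "
        f"{cost_per_accepted_result(stats):.2f} per accepted result"
    )
```

Under these made-up numbers the thorough agent costs several times more per case in compute and several times less per accepted result. The ranking depends entirely on the assumed human cost, which is why that layer belongs in the KPI.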
Posture: by 2026, the mature team doesn’t brag about using AI. It brags about knowing how much each unit of work costs to resolve.
Why It Matters Now
Agentic systems are exposing an economy that was previously hidden. OpenAI documents per-token prices, usage dashboards, budgets, and spending limits. Anthropic has explained that multi-agent systems scale up token usage on tasks that exceed what a single agent can handle, and an April 2026 study on coding agents found that consumption can vary greatly between equivalent executions.
That doesn’t mean agents are too expensive. It means cost can no longer be analyzed like a flat SaaS bill. Each workflow has a different curve: some tasks deserve more computation because they buy coverage, parallelism, or verification; others just burn tokens to simulate progress.
The question changes from “how much do we spend on AI?” to “what outcomes do those tokens buy?”
Anti-Example
“We need to reduce tokens by 30%.”
That may be correct. It may also destroy margin if it cuts precisely the part that validated, cross-checked, or prevented rework. Reducing tokens without separating exploration, production, and verification tasks is like lowering factory costs by turning off quality control.
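A small worked example makes the risk concrete. It assumes a workflow where verification accounts for 30% of tokens, so removing it hits the target exactly, and where acceptance then drops from 90% to 60%. Every number here (token counts, price per million tokens, acceptance rates, rework cost) and the function name `cost_per_accepted` are hypothetical.

```python
def cost_per_accepted(
    tokens_per_attempt: int,
    price_per_million_tokens: float,
    acceptance_rate: float,
    rework_cost_per_rejection: float,
) -> float:
    """Expected cost of one accepted outcome, including rework on rejections."""
    if not 0 < acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be in (0, 1]")
    token_cost = tokens_per_attempt * price_per_million_tokens / 1_000_000
    expected_rework = (1 - acceptance_rate) * rework_cost_per_rejection
    # Cost of one attempt, spread over the fraction of attempts that are accepted.
    return (token_cost + expected_rework) / acceptance_rate


# Assumed baseline: 10,000 tokens per attempt, 3,000 of them spent on verification.
before = cost_per_accepted(10_000, 10.0, 0.90, 5.0)

# Assumed effect of the cut: 30% fewer tokens, but more results get rejected.
after = cost_per_accepted(7_000, 10.0, 0.60, 5.0)

print(f"before the cut: {before:.2f} per accepted outcome")
print(f"after the cut:  {after:.2f} per accepted outcome")
```

With these assumptions, token spend falls by 30% while the cost of an accepted outcome rises roughly fivefold, because rework dominates. If acceptance had not moved, the same cut would have been a pure saving, which is the point: the answer depends on what the removed tokens were buying.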
Protocol (3 steps)
- Define the atomic outcome. Don’t measure “AI usage”; measure a closed and accepted result.
- Separate spending by phase. Exploration, execution, verification, and rework don’t buy the same thing.
- Cross-reference cost with quality. A cheap outcome that comes back as an incident isn’t cheap; it’s debt.
| Old metric | Token-to-outcome metric | Decision it enables |
|---|---|---|
| tokens consumed | tokens per accepted result | knowing if the workflow scales |
| monthly cost | cost per unit of work | comparing AI to current process |
| responses generated | verified resolutions | avoiding activity without value |
| active users | outcomes per user | detecting false adoption |
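The three protocol steps can be sketched against a simple spend ledger. This is an illustrative structure, not a prescribed schema: the field names (`outcome_id`, `phase`, `cost`, `accepted`, `reopened`), the ticket IDs, the amounts, and both function names are assumptions.

```python
from collections import defaultdict

# Step 1: the atomic outcome is one ticket, identified by outcome_id.
# Step 2: every spend event is tagged with the phase it paid for.
ledger = [
    {"outcome_id": "T-1", "phase": "exploration", "cost": 0.04},
    {"outcome_id": "T-1", "phase": "execution", "cost": 0.10},
    {"outcome_id": "T-1", "phase": "verification", "cost": 0.06},
    {"outcome_id": "T-2", "phase": "execution", "cost": 0.08},
    {"outcome_id": "T-2", "phase": "rework", "cost": 0.12},
    {"outcome_id": "T-3", "phase": "exploration", "cost": 0.02},
    {"outcome_id": "T-3", "phase": "execution", "cost": 0.09},
    {"outcome_id": "T-3", "phase": "verification", "cost": 0.05},
]

# Step 3: quality record per outcome. T-2 was accepted but came back as an incident.
outcomes = {
    "T-1": {"accepted": True, "reopened": False},
    "T-2": {"accepted": True, "reopened": True},
    "T-3": {"accepted": True, "reopened": False},
}


def spend_by_phase(rows):
    """Sum cost per phase so each phase can be judged on what it buys."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["phase"]] += row["cost"]
    return dict(totals)


def cost_per_verified_outcome(rows, quality):
    """Total spend divided by outcomes that were accepted and stayed closed."""
    total = sum(row["cost"] for row in rows)
    verified = sum(
        1 for q in quality.values() if q["accepted"] and not q["reopened"]
    )
    return total / verified if verified else float("inf")


print(spend_by_phase(ledger))
print(f"{cost_per_verified_outcome(ledger, outcomes):.2f} per verified outcome")
```

Counting T-2 as a success would divide the same spend by three outcomes instead of two and make the workflow look cheaper than it is. In this toy data, the one ticket that skipped verification is also the one that needed rework.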
Related
- Zendesk Relate 2026: when the agent is paid per resolution, not per seat
- AI Evaluation Stack 2026: measuring without theater
- Proof-of-Value Theater: signs that your AI works but doesn’t move the business
Sources
- OpenAI Platform: Pricing
- OpenAI Platform: Rate limits
- Anthropic: How we built our multi-agent research system
- How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks
Next Step
Choose a workflow with visible cost and a clear outcome. Don’t optimize the prompt yet. First, measure how much one accepted outcome costs. That number will tell you whether you have a product, theater, or debt.
Translated from the Spanish original with AI assistance and reviewed for accuracy.