Our AI Agent Burned $4,200 in Tokens Overnight: How We Built a Budget Circuit Breaker

DevOS Platform Team · June 11, 2026 · 13 min read

I woke up to 47 Slack notifications and a credit card alert. The timestamp on the first alert was 2:14 AM. By 7:30 AM when I finally checked my phone, our Anthropic bill had jumped by $4,217.43.

One agent. One night. Four grand.

The Planner agent — the one that handles sprint planning and breaks down epics into tasks — had gotten stuck in a refinement loop. It would generate a plan, evaluate the plan, decide the plan wasn't detailed enough, generate a more detailed plan, evaluate that plan, decide it still needed more detail. Rinse, repeat, for five and a half hours.

Nobody was watching. Why would they be? It was 2 AM on a Tuesday. The agent was supposed to be processing a backlog grooming request that came in around midnight from a design partner in Singapore.

Here's the thing that makes this embarrassing: we'd talked about adding budget limits. Multiple times. It was in the backlog. Literally in the same backlog the runaway agent was trying to groom. We just... hadn't prioritized it yet.

That changed fast.

What Actually Went Wrong

The root cause was deceptively simple. Our Planner agent has a self-evaluation step where it checks whether the plan it generated is "sufficiently detailed for implementation." This is usually good — it catches plans that are too vague to hand off to the Developer agent.

But the prompt for that evaluation step had a bug. When the input epic was unusually large (this one was a 3,000-word feature spec our design partner had pasted in), the agent's self-evaluation would sometimes flag the output as "needs more granularity" even when it was already at the task level. The agent would then try to break tasks into subtasks, which would trigger another evaluation, which would flag for more detail, and around we go.

The loop wasn't infinite in theory. The plan would eventually get detailed enough to satisfy the evaluator. In practice, with a 3,000-word input, that convergence point was somewhere around iteration 47. Each iteration involved multiple Claude API calls: one for planning, one for evaluation, one for revision reasoning. The context window kept growing because the agent was including previous iterations as "reference material."

By iteration 30, each round was burning $80-90 in tokens just from context window size. The last 17 iterations — the ones that ran while I was sleeping — cost more than the first 30 combined.
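To make the shape of that curve concrete, here is a rough model of it, not our billing data. It assumes every iteration re-sends all earlier iterations as input, and the token count per iteration, calls per iteration, and price per 1K tokens are made-up illustrative values, so the dollar amounts will not match the incident.

```python
def iteration_costs(iterations, tokens_per_iteration, calls_per_iteration,
                    usd_per_1k_input_tokens):
    """Cost of each iteration when every call re-sends all prior iterations."""
    costs = []
    context_tokens = 0
    for _ in range(iterations):
        # The context grows by one iteration's worth of tokens every round.
        context_tokens += tokens_per_iteration
        cost = calls_per_iteration * (context_tokens / 1000) * usd_per_1k_input_tokens
        costs.append(cost)
    return costs


# Hypothetical inputs: 4,000 new tokens per round, 3 calls per round.
costs = iteration_costs(47, 4000, 3, 0.003)
first_30 = sum(costs[:30])
last_17 = sum(costs[30:])
print(f"first 30 rounds: ${first_30:.2f}, last 17 rounds: ${last_17:.2f}")
```

Under this model the per-round cost grows linearly, so the cumulative bill grows quadratically. That is enough for the final 17 rounds to outweigh the first 30, whatever the actual prices are.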

We'd built a very expensive infinite loop generator with extra steps.

Why the Obvious Fixes Don't Actually Fix It

The first thing everyone suggests is "just add a max iterations limit." Sure, fine, we did that. Capped at 5 iterations now. But that's a band-aid. It doesn't address the underlying cost visibility problem. (And honestly? I'm tired of band-aids. My entire career feels like band-aids on band-aids sometimes.)

What if a legitimate task genuinely needs 6 iterations? You've now blocked good work to prevent bad work. What if the expensive behavior isn't iterations but a single massive output? Max iterations doesn't catch that. What if it's not a loop but just a really expensive one-shot operation that shouldn't have been approved in the first place?

The second suggestion is usually "monitor your API costs." We were! We had a Grafana dashboard tracking daily spend. But daily spend dashboards don't help at 2 AM when nobody's looking at them. By the time the daily rollup showed a spike, the damage was done. Real-time observability matters — privacy-compliant analytics can help here.

The third idea is "use cheaper models." We do use multi-model routing — simpler tasks go to Haiku, complex ones to Sonnet or Opus. The Planner agent was already using Claude 3.5 Sonnet, not Opus. Cheaper than the most expensive option, still expensive enough to hurt when it loops 47 times.

None of these address what we actually needed: real-time cost enforcement with automatic shutdown when things go wrong.

Hot take: most "AI cost management" advice is written by people who haven't been woken up at 7 AM by their credit card company.

The Architecture That Actually Works: Budget Circuit Breakers

We spent two weeks after the incident building what we now call the cost control layer. Three components, all mandatory for any agent that touches an LLM API.

Component 1: Per-Task Budget Envelopes

Every task that gets assigned to an agent — whether it comes from the sprint board, an API call, or an internal handoff — gets a budget envelope attached.

task:
  id: "PLAN-1847"
  type: "epic-breakdown"
  assigned_agent: "planner"
  budget:
    soft_limit_usd: 2.00    # triggers alert
    hard_limit_usd: 8.00    # halts execution
    model_allowlist: ["claude-3-5-sonnet", "claude-3-haiku"]

The soft limit triggers a Slack alert and logs a warning but lets execution continue. The hard limit stops the agent immediately. The numbers come from historical data — we ran 200+ planning tasks during our design partner phase, measured the cost distribution, and set soft limits at 2x p95 and hard limits at 5x p95.

Different task types get different envelopes. A simple code review task might have a $0.50 hard limit. A multi-file refactoring might allow $15. The epic breakdown task that triggered our incident now caps at $8 — which would have stopped the loop around iteration 4 instead of iteration 47.
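Here is a minimal sketch of how an envelope like this might be enforced. The class and method names (`TaskBudget`, `record`, `BudgetExceeded`) are hypothetical, not our production code, and it assumes the caller already knows the dollar cost of each API call.

```python
class BudgetExceeded(Exception):
    """Raised when a task's cumulative spend reaches its hard limit."""


class TaskBudget:
    def __init__(self, task_id, soft_limit_usd, hard_limit_usd, on_soft_limit=print):
        self.task_id = task_id
        self.soft_limit_usd = soft_limit_usd
        self.hard_limit_usd = hard_limit_usd
        self.on_soft_limit = on_soft_limit  # e.g. a function that posts to Slack
        self.spent_usd = 0.0
        self.soft_alert_sent = False

    def record(self, cost_usd):
        """Add the cost of one completed API call and enforce both limits."""
        self.spent_usd += cost_usd
        if self.spent_usd >= self.hard_limit_usd:
            # Hard limit: stop the agent by raising; the caller must not retry.
            raise BudgetExceeded(
                f"{self.task_id}: spent ${self.spent_usd:.2f}, "
                f"hard limit ${self.hard_limit_usd:.2f}"
            )
        if self.spent_usd >= self.soft_limit_usd and not self.soft_alert_sent:
            # Soft limit: alert once, then let execution continue.
            self.soft_alert_sent = True
            self.on_soft_limit(
                f"{self.task_id}: spent ${self.spent_usd:.2f}, "
                f"soft limit ${self.soft_limit_usd:.2f}"
            )
```

Because the cost is recorded after each call returns, a sketch like this can overshoot the hard limit by the cost of one call. Estimating the cost before sending the request would tighten that, at the price of more bookkeeping.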

Component 2: Agent-Level Rolling Budgets

Per-task limits catch runaway individual operations. But what if an agent is processing lots of small tasks and each one is within budget, but the aggregate is insane?

We added rolling budget windows at the agent level:

agent:
  id: "planner"
  rolling_budgets:
    - window: "1h"
      limit_usd: 25.00
    - window: "24h"
      limit_usd: 150.00
    - window: "7d"
      limit_usd: 800.00

If the Planner agent burns through $25 in any single hour, it pauses and waits for human review. If it hits $150 in a day — even spread across hundreds of legitimate small tasks — same thing. The 7-day window catches sustained high-burn patterns that might slip through daily reviews.

When an agent hits a rolling budget limit, it doesn't just stop. It flags all pending tasks as "budget-paused," preserves its state, and notifies the team. We can resume after review, adjust budgets if the spend was legitimate, or investigate if something's wrong.
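A rolling window can be sketched with a queue of timestamped spend events. This is an illustration under assumptions: `RollingBudget` is a hypothetical name, timestamps are plain seconds passed in by the caller, and pausing the agent is left to whoever reads the return value.

```python
from collections import deque


class RollingBudget:
    def __init__(self, window_seconds, limit_usd):
        self.window_seconds = window_seconds
        self.limit_usd = limit_usd
        self.events = deque()  # (timestamp, cost_usd), oldest first
        self.total_usd = 0.0

    def record(self, cost_usd, now):
        """Add a spend event; return True while the window is within its limit."""
        self.events.append((now, cost_usd))
        self.total_usd += cost_usd
        # Drop events that have aged out of the window.
        cutoff = now - self.window_seconds
        while self.events and self.events[0][0] <= cutoff:
            _, old_cost = self.events.popleft()
            self.total_usd -= old_cost
        return self.total_usd <= self.limit_usd


# One tracker per configured window, e.g. the 1h / $25 window.
hourly = RollingBudget(window_seconds=3600, limit_usd=25.00)
```

An agent would hold one tracker per window and pause as soon as any of them returns False.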

Component 3: Global Kill Switch

The nuclear option. If total spending across all agents exceeds a global threshold, everything stops.

global:
  kill_switch:
    trigger_threshold_usd: 500.00
    window: "1h"
    action: "pause_all_agents"
    require_manual_restart: true

We've never hit this. Hope we never do. But if something goes catastrophically wrong — multiple agents in loops simultaneously, or an API cost spike we didn't anticipate — the system shuts itself down and waits for a human. Full stop.

This is the same pattern you'd use for infrastructure circuit breakers. The principle is identical: automated systems need automated safety limits, because humans can't monitor everything all the time. The VeloCards team uses similar patterns for fraud detection thresholds.
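The part of the kill switch that matters most is that it latches: once tripped, it stays tripped until a human resets it, even after the expensive hour has rolled out of the window. A minimal sketch of that behavior, with hypothetical names throughout:

```python
class GlobalKillSwitch:
    def __init__(self, threshold_usd, window_seconds):
        self.threshold_usd = threshold_usd
        self.window_seconds = window_seconds
        self.events = []  # (timestamp, agent_id, cost_usd) across all agents
        self.tripped = False

    def record(self, agent_id, cost_usd, now):
        """Record spend from any agent and trip if the window total exceeds the threshold."""
        self.events.append((now, agent_id, cost_usd))
        cutoff = now - self.window_seconds
        self.events = [e for e in self.events if e[0] > cutoff]
        if sum(e[2] for e in self.events) > self.threshold_usd:
            self.tripped = True  # latches; the passage of time never clears it

    def agents_may_run(self):
        return not self.tripped

    def manual_restart(self):
        """Called by a human after review (require_manual_restart: true)."""
        self.events.clear()
        self.tripped = False
```

Every agent would check `agents_may_run()` before each API call, so a trip pauses all of them at once.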

Measuring What "Normal" Looks Like

The hardest part of setting budget limits is knowing what numbers to use. Too tight and you block legitimate work. Too loose and you don't catch problems until they're expensive problems.

Here's the measurement process we use now:

Step 1: Baseline for 2-3 weeks. Run agents without hard limits (but with monitoring) and collect cost data per task type. You need enough volume to get meaningful distributions — at least 50 tasks per category if possible.

Step 2: Calculate percentiles. For each task type, compute p50 (median), p90, p95, and p99 costs. The median tells you what typical tasks look like. The p95 tells you what expensive-but-legitimate tasks look like. The p99 tells you where outliers start.

Step 3: Set thresholds based on percentiles.

  • Soft limit: 2x p95 — catches unusual tasks for review
  • Hard limit: 5x p95 — stops definite problems
  • For example: if p95 for epic breakdown is $1.80, the soft limit is $3.60 and the hard limit is $9.00
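Steps 2 and 3 fit in a few lines with Python's standard library. This is a sketch, not our analysis pipeline: the function names are hypothetical, and it assumes you already have a list of per-task costs in USD for one task type.

```python
import statistics


def limits_from_p95(p95_usd, soft_factor=2.0, hard_factor=5.0):
    """Return (soft_limit, hard_limit) as multiples of the p95 cost."""
    return p95_usd * soft_factor, p95_usd * hard_factor


def limits_from_costs(costs_usd):
    """Compute percentiles and limits from baseline costs (needs >= 2 data points)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(costs_usd, n=100, method="inclusive")
    p95 = cuts[94]
    soft, hard = limits_from_p95(p95)
    return {
        "p50": cuts[49],
        "p90": cuts[89],
        "p95": p95,
        "p99": cuts[98],
        "soft_limit_usd": soft,
        "hard_limit_usd": hard,
    }
```

Calling `limits_from_p95(1.80)` reproduces the $3.60 and $9.00 figures from the example above.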

Step 4: Adjust based on task complexity indicators. We found that input size (token count of the task description) correlates with cost. A 500-token task description costs less to process than a 3,000-token one. We now scale budget envelopes by input size:

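from dataclasses import dataclass

@dataclass(frozen=True)
class Budget:
    soft_limit: float  # USD; crossing it triggers an alert
    hard_limit: float  # USD; crossing it halts execution

# Base limits per task type, from baseline data (values here are illustrative)
TASK_TYPE_LIMITS = {"epic-breakdown": Budget(soft_limit=2.00, hard_limit=8.00)}

def calculate_budget(task_type: str, input_tokens: int) -> Budget:
    base_limits = TASK_TYPE_LIMITS[task_type]

    # Scale by input complexity: +50% budget for every 2,000 input tokens
    complexity_multiplier = 1 + (input_tokens / 2000) * 0.5

    return Budget(
        soft_limit=base_limits.soft_limit * complexity_multiplier,
        hard_limit=base_limits.hard_limit * complexity_multiplier,
    )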

That 3,000-word epic that triggered our incident would now get a higher budget envelope automatically, but still a bounded one.
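To get a feel for the scaling, here is the multiplier worked out at a few input sizes. The $8.00 base hard limit is borrowed from the envelope example earlier and is an assumption for illustration. The helper names are hypothetical.

```python
def complexity_multiplier(input_tokens):
    # Same formula as above: +50% for every 2,000 input tokens.
    return 1 + (input_tokens / 2000) * 0.5


def scaled_limit(base_limit_usd, input_tokens):
    return base_limit_usd * complexity_multiplier(input_tokens)


for tokens in (500, 2000, 3000):
    print(tokens, complexity_multiplier(tokens), scaled_limit(8.00, tokens))
# 500  -> multiplier 1.125, hard limit $9.00
# 2000 -> multiplier 1.5,   hard limit $12.00
# 3000 -> multiplier 1.75,  hard limit $14.00
```

The multiplier grows linearly with input size and never drops below 1, so small tasks keep the base limits and large inputs get proportionally more room.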

What Happens When Limits Are Hit

The worst thing you can do when an agent hits a budget limit is let it retry automatically. That's how contained problems become cascading failures.

Our shutdown sequence:

  1. Immediate halt. The agent stops mid-operation. No "let me just finish this one thing." Stop.

  2. State preservation. We dump the agent's current state — conversation history, working memory, pending tool calls — to storage. This is critical for debugging. You need to see exactly what the agent was doing when it got stopped.

  3. Detailed logging. Not just "budget exceeded" but: which task, which iteration, what was the last API call, what was the context window size, what was the cumulative spend at each step. We pipe this to JustAnalytics for aggregation.

  4. Alert with context. The Slack notification includes: task ID, agent ID, budget limit hit, current spend, a link to the state dump, and a "resume after review" button. The person reviewing doesn't have to dig through logs to understand what happened.

  5. Human review required. Agents don't auto-resume after budget failures. Someone has to look at what happened, decide if it was a bug or legitimate high-cost work, and explicitly restart with either the same budget (if we decided the limit was too tight) or a fix deployed (if something was wrong).
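The sequence above can be sketched as a single function that runs after the agent has been stopped. Everything here is illustrative: the function name, the payload fields, and the file layout are assumptions, the state is assumed to be JSON-serializable, and actually sending the alert to Slack is left out.

```python
import json
import time
from pathlib import Path


def halt_for_budget(task_id, agent_id, limit_usd, spent_usd, state, dump_dir):
    """Persist agent state and build an alert payload. Never retries or resumes."""
    dump_dir = Path(dump_dir)
    dump_dir.mkdir(parents=True, exist_ok=True)

    # State preservation: conversation history, working memory, pending tool calls.
    dump_path = dump_dir / f"{task_id}-{int(time.time())}.json"
    dump_path.write_text(json.dumps(state, indent=2))

    # Alert with context: enough for a reviewer to act without digging through logs.
    return {
        "task_id": task_id,
        "agent_id": agent_id,
        "limit_usd": limit_usd,
        "spent_usd": spent_usd,
        "state_dump": str(dump_path),
        "status": "budget-paused",
        "requires_human_review": True,
    }
```

The function deliberately has no resume path. Restarting is a separate, human-initiated action.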

The incident that kicked off this whole project would have ended very differently with this system. Around 2:17 AM — three minutes into the loop — the per-task hard limit would have fired. Total cost: maybe $12 instead of $4,200. I would have woken up to one alert instead of 47. The state dump would have shown the evaluation loop immediately. We would have fixed the prompt bug and moved on.

Instead, I'm writing a blog post about how we burned $4,200 learning an obvious lesson. The universe has a sense of humor.

Honestly, Still Not Perfect

Look, this system catches the problems we've encountered so far. It won't catch everything.

It doesn't prevent expensive-but-legitimate work from being blocked. If a task genuinely needs $50 worth of tokens to complete — maybe a massive codebase migration or a complex architectural analysis — the budget system will halt it. You'll need to manually approve a higher budget. That's friction. Usually worth it, sometimes annoying.

It doesn't catch semantic waste. If an agent is doing useless work cheaply — generating irrelevant output, going down dead ends, producing code that won't pass review — the budget system won't flag that. It only sees tokens, not value. You still need quality gates at the output level.

It adds latency. Every API call now goes through a budget check. Not much — maybe 5-10ms — but it's there. For high-frequency agent operations, that adds up. We've accepted this tradeoff. Gladly, actually.

And it requires maintenance. Baseline measurements drift as your agents evolve, as task types change, as models get updated. We re-run baseline analysis monthly. It's work.

But compared to waking up to a $4,200 bill? I'll take the maintenance.

The Broader Point About Agent Reliability

Here's my real takeaway from this incident — and I'll be blunt, because I've wasted enough money to earn the right to an opinion.

Agent reliability isn't about making agents smarter. It's about building systems that constrain failure modes. Budget circuit breakers, schema validation, dry-run gates, and signed runbooks for infrastructure are all constraints. They don't make the agent more intelligent. They make its failures smaller and more recoverable, similar to how ClickzProtect constrains ad fraud before it drains your budget.

The agents-as-employees model that DevOS is built around depends on this kind of reliability infrastructure. You can't assign a ticket to an agent and walk away if you're not confident the agent won't bankrupt you overnight. The sprint board becomes useless if agents can run without bounds.

We're pre-launch. The budget circuit breaker system I've described here is part of what we're building into the core platform. Every agent that runs through DevOS — the Planner, Developer, QA, DevOps agents, plus whatever custom agents come from the marketplace — will have this cost control layer built in. Per-task envelopes, rolling agent budgets, global kill switch.

Because I don't want to write this blog post again with a bigger number.

Frequently Asked Questions

Why do AI agents sometimes burn through tokens unexpectedly?

AI agents can enter infinite loops, retry failed operations indefinitely, or get stuck generating increasingly long outputs when they hit edge cases the prompt didn't anticipate. Without hard spending limits, a single stuck agent can consume thousands of dollars in API calls before anyone notices. The most common culprits are planning agents that keep refining plans, tool-use loops where the agent retries a failing tool, and context window stuffing where the agent keeps adding to its working memory without summarizing.

What's a budget circuit breaker for AI agents?

A budget circuit breaker monitors token consumption in real-time and automatically halts agent execution when spending exceeds predefined thresholds. It works in layers: per-task limits that stop individual runaway operations, per-agent daily/hourly budgets that prevent sustained cost spikes, and global kill switches that pause all agents when something goes seriously wrong. The circuit breaker doesn't just stop execution — it preserves agent state so you can debug what went wrong.

How do you set appropriate token budgets for AI agents?

Start by measuring baseline costs for typical tasks over 2-3 weeks. Calculate percentiles — your p50 (median) and p95 costs tell you what normal looks like. Set soft limits at 2x p95 (triggers alerts) and hard limits at 5x p95 (halts execution). Adjust by task complexity: a simple code review might cap at $0.50, while a multi-file refactoring task might allow $15. The key is making limits tight enough to catch problems early while loose enough that legitimate work doesn't get blocked.

What should happen when an agent hits its budget limit?

When an agent hits a budget limit, it should stop immediately, preserve its current state and conversation history, log detailed metrics about what it was doing, and alert the team. The agent shouldn't retry or try to continue with a smaller scope — that's how you turn a contained problem into a cascading failure. After hitting a limit, human review should be required before the agent can resume. Automatic restarts after budget failures are how small incidents become expensive incidents.

