SWE-bench Is Not Enough: What We Actually Need to Measure AI Coding Agents

DevOS Platform Team · May 30, 2026 · 13 min read

Last month, a well-funded AI agent startup hit 72% on SWE-bench Verified. The press release called it "human-level software engineering." Two weeks later, we watched the same agent fail to add a database column to a production Rails app — it created the migration, forgot to run it, then wondered why the column didn't exist. Forty-five minutes of flailing. A human finally intervened.

That disconnect isn't a bug. It's a measurement problem. (And honestly? We've made this exact mistake ourselves — getting excited about a benchmark number, then watching the agent faceplant on something a junior dev handles without thinking.)

SWE-bench is the best benchmark we have for AI coding agents. It's also wildly insufficient for evaluating what these agents need to do in production. And the gap between "scores well on SWE-bench" and "actually useful for real development work" is where a lot of teams are getting burned right now.

This isn't an attack on SWE-bench — the researchers at Princeton did excellent work, and the benchmark pushed the field forward. But we're now in a phase where vendors optimize for the benchmark, users assume the benchmark reflects their use case, and everyone's confused when the 70%+ scoring agent can't handle a moderately complex feature request.

So. What does a real evaluation framework for AI coding agents actually look like?

What SWE-bench Measures (And Measures Well)

Let's give credit where it's due.

SWE-bench tests agents on 2,294 real GitHub issues from 12 popular Python repositories: Django, Flask, scikit-learn, SymPy, and others. Real bugs that real humans filed and fixed. The agent reads the issue, understands the codebase, and writes a patch. The patch either passes the repository's test suite or it doesn't.

This is miles ahead of HumanEval, which just checks if agents can write isolated functions given a docstring. HumanEval is a coding interview. SWE-bench is closer to actual work.

SWE-bench Verified, a human-validated subset, filters to 500 issues where human annotators confirmed the tests are reliable. It's a cleaner signal. When an agent passes, you know it actually solved the problem, not that it got lucky with a flaky test.
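To make "passes" concrete: as we understand the public dataset, each task lists tests that should flip from failing to passing once the issue is fixed (FAIL_TO_PASS) and tests that must keep passing (PASS_TO_PASS). Here is a minimal sketch of that check. The function name and the test IDs are ours, purely for illustration, and this is not the official harness.

```python
def is_resolved(fail_to_pass, pass_to_pass, passed_after_patch):
    """Return True only if the patch fixes the target tests and breaks nothing.

    fail_to_pass:       tests that failed before the fix and should now pass
    pass_to_pass:       tests that passed before and must still pass
    passed_after_patch: tests that passed once the agent's patch was applied
    """
    passed = set(passed_after_patch)
    fixed = all(test in passed for test in fail_to_pass)
    unbroken = all(test in passed for test in pass_to_pass)
    return fixed and unbroken


# Hypothetical test IDs: the patch fixes the bug but breaks an existing test.
print(is_resolved(
    fail_to_pass=["test_new_column"],
    pass_to_pass=["test_existing_query"],
    passed_after_patch=["test_new_column"],
))  # False
```

Note what this criterion does not look at: how long the agent took, how many files it touched, or whether a human had to step in.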

For single-issue, single-repository, Python-only bug fixing? SWE-bench is legitimately useful. A 50% score means something. A 70% score means more. Progress is measurable. (If you want to track your own agent's performance over time, JustAnalytics can help you build those dashboards without the analytics overhead.)

The trouble is that single-issue bug fixing is maybe 15% of what production development actually involves.

The Five Dimensions SWE-bench Doesn't Touch

Dimension 1: Long-Horizon Feature Development

SWE-bench issues are solvable in a single session. Most take under an hour of wall-clock time for a capable agent. That's the point — you need a benchmark you can run thousands of times.

But real features take days. Sometimes weeks.

Building a payment integration isn't one task. It's: design the data model, create the API endpoints, build the webhook handler, add the frontend components, write tests, handle edge cases, get through code review, fix the review feedback, deploy to staging, find the bug you missed, fix that bug, deploy to production.

Each step depends on the previous. Decisions made on day one constrain what's possible on day three. And if the PM changes the requirements on day four? Good luck. The agent needs to remember context across sessions, maintain architectural coherence, and adapt when requirements shift mid-project.

SWE-bench measures none of this. An agent could score 80% on isolated bug fixes and completely fall apart when asked to build something that spans more than one PR. We've seen it happen. Multiple times.

Dimension 2: Multi-File Coordination and Refactoring

A typical SWE-bench issue touches 1-3 files. A typical production task touches 8-15.

Renaming a core function across a codebase. Migrating from one ORM pattern to another. Splitting a monolith service into two. These require coordinated changes across dozens of files, with careful attention to import chains, test updates, and dependency graphs.

Single-issue benchmarks actively select against this kind of work. Large-scale refactors rarely get filed as GitHub issues with clean acceptance criteria — they're tracked in project management tools, broken into multiple PRs, and completed over multiple sessions. (I'll admit: we underestimated this dimension when we started building DevOS. Thought single-file performance would transfer. It doesn't.)

An agent that can fix a Django ORM bug but can't safely rename a model class across 40 files isn't ready for production work. SWE-bench won't tell you which type you're getting.
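One rough way to check a rename like that is to scan the codebase for leftover references to the old name. The sketch below is ours, not part of any benchmark. The function name, the `.py` filter, and the example class name are assumptions for illustration.

```python
import re
from pathlib import Path


def find_stale_references(root, old_name, suffix=".py"):
    """Return (path, line_number, line) for each whole-word match of old_name."""
    # \b keeps "Customer" from matching inside "CustomerProfile".
    pattern = re.compile(rf"\b{re.escape(old_name)}\b")
    hits = []
    for path in sorted(Path(root).rglob(f"*{suffix}")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits


# Usage (hypothetical project layout and class name):
# for path, lineno, line in find_stale_references("myproject", "Customer"):
#     print(f"{path}:{lineno}: {line}")
```

A text scan like this is a floor, not a proof. It also flags the old name in comments and string literals, and it cannot follow names built dynamically. Running the full test suite after the rename is still the real check.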

Dimension 3: Infrastructure Operations

Here's a category SWE-bench doesn't touch at all: ops work.

Production development involves databases. Deployments. Environment variables. SSL certificates. DNS records. Queue configurations. Cron jobs. Log aggregation. Alert rules.

When we're evaluating agents for DevOS, we need to know: can this agent provision a Postgres database on Railway? Configure Redis correctly? Set up a deployment pipeline? Handle environment variable management across staging and production?

None of these are "coding" in the SWE-bench sense. All of them are necessary for shipping software. The agent that can't touch infrastructure can only do half the job — and it's often the easier half.

(For context: DevOS includes a built-in DevOps agent that handles database provisioning, Railway deployment, domain configuration, and env management. We built it because the coding-only benchmarks completely ignored this surface.)

Dimension 4: Failure Recovery and Backtracking

SWE-bench issues have a solution. The test suite verifies it. Pass or fail.

Real development is messier. You try an approach. It doesn't work. You backtrack. Try something else. Maybe the third approach works but creates a performance regression. Now you're optimizing.

Agents that score well on benchmarks often struggle badly with this loop. They commit to an approach early, generate code, hit a wall, and then... keep trying variations of the same broken approach. They don't know how to say "this design is wrong, let me start over."

We've watched agents spend 90 minutes trying to fix a bug caused by their own flawed architecture, when a human would've said "scrap it, different approach" after 15 minutes. It's genuinely frustrating to watch. You want to shake the agent and yell "STOP. Back up. Think differently." But you can't, because benchmarks never taught it that skill.

Recovery from failure isn't measured by any major benchmark. But it's half the job.
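A harness could at least flag the "variations of the same broken approach" pattern automatically. Here is a minimal sketch, assuming the harness keeps each attempted patch as a string. The function name, the 0.9 similarity threshold, and the window of three attempts are arbitrary choices of ours, not values from any benchmark.

```python
from difflib import SequenceMatcher


def looks_stuck(patches, threshold=0.9, window=3):
    """True if the last `window` patches are near-duplicates of one another."""
    if len(patches) < window:
        return False
    recent = patches[-window:]
    # Compare each recent patch with the one that followed it.
    return all(
        SequenceMatcher(None, earlier, later).ratio() >= threshold
        for earlier, later in zip(recent, recent[1:])
    )


# Usage: call after every failed attempt and escalate to a human,
# or force a re-plan, as soon as it returns True.
```

Text similarity is a crude proxy. Two patches can look different and still share the same flawed design, so treat this as a tripwire rather than a measurement of recovery.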

Dimension 5: Collaboration and Handoffs

Production development isn't solo work. Code review exists. Standups exist. Handoffs between team members exist.

Can the agent respond to PR feedback productively? Can it hand off work to another agent (or human) with sufficient context? Can it pick up a task someone else started?

This matters especially for multi-agent systems like DevOS, where a Planner agent might define the architecture, a Developer agent implements, a QA agent tests, and a DevOps agent deploys. The handoff quality between agents determines whether the system works or collapses.

SWE-bench is single-agent, single-task. No collaboration axis at all.

How Vendors Game the Numbers

I don't want to call anyone out specifically. But the incentives are obvious, and the gaming is rampant. Look — I get it. I'd probably do the same thing if my funding round depended on a benchmark number. Doesn't make it less of a problem.

Training on the test set (or close enough). SWE-bench's issues are public. If your training data includes GitHub issues from Django and Flask, you've seen similar problems before. Some vendors explicitly train on "SWE-bench style" issues. The benchmark score goes up; real-world performance doesn't. (This is why we're skeptical of vanity metrics in general — same problem we see in ad fraud detection where bots game the numbers.)

Cherry-picking attempts. "We achieved 72% on SWE-bench." But was that pass@1 (one attempt per issue) or pass@5 (five attempts per issue, counted as solved if any one of them passes)? The difference matters. In production, you don't get five tries.

Prompt engineering for the benchmark. Vendors tune their prompts specifically for SWE-bench's repositories and issue format. Great for the benchmark, useless for your codebase that looks nothing like Django.

Filtering to easier issues. "On issues we attempted, we achieved 78%!" Okay, which issues did you skip?

When you see a SWE-bench score, ask: pass@1 or pass@k? Full test set or filtered? Held-out evaluation or trained on similar issues? The headline number hides a lot.
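To see how much the pass@1 versus pass@k question can move a headline number, here is a short sketch of the standard pass@k estimator popularized by the HumanEval paper: given n sampled attempts of which c passed, it estimates the chance that at least one of k attempts would pass. The function name and the example figures are ours, for illustration only.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate P(at least one of k attempts passes), given c of n attempts passed."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # 1 minus the probability that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# An agent that solves an issue in 2 of 10 independent runs:
print(round(pass_at_k(10, 2, 1), 3))  # 0.2
print(round(pass_at_k(10, 2, 5), 3))  # 0.778
```

Same agent, same issue, and the number nearly quadruples depending on which k gets reported.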

What a Real Evaluation Framework Needs

Okay, criticism is easy. What should we actually measure?

Here's our working proposal — the evaluation axes we think matter for production AI agents:

1. Long-Horizon Completion Rate. Give the agent a feature spec that requires 3+ days of work across multiple PRs. Measure: did it finish? How many human interventions were needed? Did the final result match the spec?

Proposed benchmark: Curated set of 100 real product features from open-source projects (not just bug fixes), each requiring 5+ files and 2+ PRs. Score = percentage completed without human intervention.

2. Multi-File Refactor Accuracy. Large-scale rename, pattern migration, or module extraction tasks. Measure: did all tests pass? Were all references updated? Any broken imports or dead code left behind?

Proposed benchmark: 50 refactoring tasks across codebases of varying sizes (10K, 50K, 200K lines). Score = percentage where all tests pass and no regressions introduced.

3. Infrastructure Operations Success. Database provisioning, deployment configuration, environment management, DNS setup. Measure: does the service actually run? Are security best practices followed?

Proposed benchmark: 30 infrastructure tasks across major platforms (Railway, Vercel, AWS, GCP). Score = percentage where the deployed service functions correctly.

4. Recovery-After-Failure Rate. Give the agent a task where the obvious approach doesn't work. Measure: does it recognize the failure? Does it try a genuinely different approach? How many attempts before success (or giving up appropriately)?

Proposed benchmark: 50 tasks with intentional "trap" solutions that pass local tests but fail integration tests or have hidden bugs. Score = percentage where agent recovers within 3 attempts.

5. Handoff Quality Score. Multi-agent scenarios where work transfers between agents. Measure: does the receiving agent have sufficient context? Are there contradictions or gaps? Does the system complete the task end-to-end? (This mirrors how VeloCalls handles call routing and handoffs: context preservation is everything.)

Proposed benchmark: 25 multi-stage features where different agents own different phases. Score = percentage completed without coordination failures.
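The scoring rules above are simple enough to sketch. What follows is a rough illustration, not a finished harness: the `TaskResult` fields, the axis labels, and the choice to treat any human intervention as a failure on every axis are our assumptions. The three-attempt cap applies only to the recovery axis, as proposed above.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    axis: str                     # e.g. "long_horizon", "refactor", "infra", "recovery", "handoff"
    succeeded: bool               # final result met the task's acceptance checks
    human_interventions: int = 0  # times a human had to step in
    attempts: int = 1             # distinct approaches the agent tried


def counts_as_pass(result, max_recovery_attempts=3):
    """Apply the pass rule for one task (assumed rules, see lead-in)."""
    if not result.succeeded or result.human_interventions > 0:
        return False
    if result.axis == "recovery":
        return result.attempts <= max_recovery_attempts
    return True


def scorecard(results):
    """Return {axis: percentage of tasks passed}, one score per axis."""
    totals, passes = {}, {}
    for r in results:
        totals[r.axis] = totals.get(r.axis, 0) + 1
        passes[r.axis] = passes.get(r.axis, 0) + int(counts_as_pass(r))
    return {axis: 100.0 * passes[axis] / totals[axis] for axis in totals}


# Made-up results, for illustration only.
results = [
    TaskResult("long_horizon", succeeded=True),
    TaskResult("long_horizon", succeeded=True, human_interventions=1),
    TaskResult("recovery", succeeded=True, attempts=2),
    TaskResult("recovery", succeeded=True, attempts=4),
]
print(scorecard(results))  # {'long_horizon': 50.0, 'recovery': 50.0}
```

The sketch deliberately returns one score per axis and never averages them into a single headline number.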

Why This Matters for Choosing Tools

If you're evaluating AI coding agents for your team — and you should be, the technology is real — don't stop at SWE-bench.

Ask the vendor: what's your performance on multi-day features? How do you handle infrastructure? What happens when the first approach fails?

If they don't have answers, that's an answer.

We built DevOS specifically for the dimensions SWE-bench misses — multi-agent coordination, persistent memory across sessions, built-in infrastructure operations. The four agents (Planner, Developer, QA, DevOps) hand off work through a Super Orchestrator because single-agent architectures hit walls that benchmarks don't show.

Does DevOS score well on SWE-bench? Honestly, we haven't optimized for it. We don't think it's the right metric for what we're building. (Is that a cop-out? Maybe. But I'd rather admit we're not chasing a number than pretend the number captures what matters.) A deployed feature that touches 15 files, provisions a database, and survives code review matters more than a benchmark designed for single-issue bug fixes.

A Prediction (And a Request)

By mid-2027, SWE-bench alone won't be credible for marketing AI coding agents. The community will demand multi-axis evaluation. Someone — maybe us, maybe a research lab, maybe a coalition of vendors — will publish a production-focused benchmark suite that covers the five dimensions above.

When that happens, the agents that optimized purely for SWE-bench will look worse. The agents that actually work in production environments will look better. The rankings will shift.

Until then: don't trust any single number. Ask harder questions. Run your own evaluations on tasks that match your actual workflow.

And if you're a researcher working on this problem — reach out. We'd love to contribute. We're not smart enough to build the right benchmark alone. Nobody is. The benchmark that actually measures production readiness doesn't exist yet, and building it is going to take collaboration. (Even tracking email deliverability requires multi-axis metrics — single scores never tell the full story.)


Frequently Asked Questions

Why is SWE-bench considered the gold standard for AI coding agents?

SWE-bench tests agents on real GitHub issues from popular Python repositories — actual bugs that humans filed and fixed. Unlike synthetic benchmarks like HumanEval (which tests isolated function generation), SWE-bench requires understanding existing codebases, writing patches, and passing test suites. It's the closest thing we have to real-world coding work, which is why every agent vendor leads with their SWE-bench score.

What types of tasks does SWE-bench not measure?

SWE-bench focuses on single-issue bug fixes in Python repositories. It doesn't measure multi-file refactoring, long-horizon feature development, infrastructure operations (databases, deployments, env vars), recovery from failures mid-task, or coordination between multiple agents. A 70% SWE-bench score tells you nothing about whether an agent can ship a feature that touches 15 files over 3 days.

What benchmarks should complement SWE-bench for agent evaluation?

We need benchmarks for: (1) long-horizon tasks spanning multiple sessions, (2) multi-agent coordination where agents hand off work, (3) infrastructure operations like provisioning databases and configuring deployments, (4) failure recovery when an agent's approach doesn't work and it needs to backtrack, and (5) real production environments with CI/CD, review cycles, and merge conflicts. No single benchmark covers these — we need a multi-axis evaluation framework.

How do vendors game SWE-bench scores?

Common tactics include training on SWE-bench's exact issues (or similar ones), optimizing prompts specifically for the benchmark's repositories, cherry-picking which issues to attempt, and running multiple attempts per issue and reporting best results. Some vendors report "pass@k" where k>1, meaning they try multiple times. Always ask: is this pass@1 on a held-out test set, or something more favorable?

