Put an AI Agent on Flaky-Test Duty: Quarantine, Reproduce, Fix, Repeat

DevOS Platform Team · June 21, 2026 · 11 min read

Wednesday, 11:47 AM. CI fails. You click through to the logs. The test that failed? test_user_session_expires_correctly. You've seen this one before. It passed yesterday. It passed this morning. It failed 14 days ago. It passed 47 times in between.

Flaky.

You re-run the pipeline. It passes. You merge. You move on with your life, slightly irritated, having learned nothing about the actual bug. (I've done this maybe 200 times in the last year. Not proud of it.)

This is how flaky tests kill engineering velocity — not with a dramatic failure, but with a thousand paper cuts. Every retry wastes 3-8 minutes. Every "just re-run it" erodes trust in the test suite. Every ignored flake is a real bug hiding in plain sight, mocking you. I've watched teams accumulate 40+ flaky tests over six months, each one a small tax on every PR. We had 53 at one point. Fifty-three! Our CI was basically a slot machine. (If you're struggling with AI agent cost overruns while trying to fix this manually, this workflow solves that too.)

What if an AI agent just... owned this? Not "helped you debug when you got around to it" — actually worked through the flaky-test backlog like an employee with a standing assignment. Quarantine the test, reproduce the flake, patch the root cause, verify stability, re-enable. Repeat until the backlog is empty.

That's what we're building here. Fair warning: it won't fix everything — we'll get to why — but it handles the tedious 50-60% so you can focus on the genuinely weird stuff.

What We're Building

By the end of this, you'll have a workflow where:

  1. The agent monitors CI for flaky tests (tests that sometimes pass and sometimes fail on the same code)
  2. The agent auto-quarantines flagged tests so they stop blocking builds
  3. The agent picks up quarantined tests as tickets, one at a time
  4. For each ticket, the agent reproduces the flake locally, identifies the root cause, patches it, and runs the test 50+ times to confirm stability
  5. The agent re-enables the test and closes the ticket
  6. The flaky-test backlog trends toward zero instead of infinity

The agent doesn't need babysitting. It doesn't need you to context-switch into "flaky test debugging mode" at 4 PM on a Friday. It just works through the list.

Honestly? The first time I saw this working, I felt a little embarrassed we'd been manually re-running CI for years.

Prerequisites

Before this makes sense for your team:

  • A CI system with test history (GitHub Actions, CircleCI, GitLab CI — anything that tracks pass/fail per test per run)
  • A test framework with quarantine/skip support (Jest, pytest, Playwright, Vitest, Mocha — most do)
  • At least 2 weeks of CI history (the agent needs data to identify flakes)
  • A sprint board where the agent can create and close tickets (Jira, Linear, or similar)
  • Basic familiarity with your test framework's configuration
  • Analytics tracking for measuring flake rates over time (optional but recommended)

If your CI doesn't retain test-level history — just "build passed" or "build failed" — you'll need to add test result reporting first. Most CI systems support JUnit XML output; start there. (This takes maybe an afternoon to set up. Do it. You'll thank yourself.)
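
If you're starting from JUnit XML, here's a minimal sketch of pulling per-test outcomes out of one report with Python's standard library. The `parse_junit_xml` function and the `classname::name` ID format are my own choices for illustration, and the sample report is made up; real reports vary a little between frameworks, so treat this as a starting point.

```python
import xml.etree.ElementTree as ET


def parse_junit_xml(xml_text: str) -> dict[str, str]:
    """Map 'classname::name' to 'passed', 'failed', or 'skipped'."""
    root = ET.fromstring(xml_text)
    outcomes = {}
    # iter() finds <testcase> at any depth, so this handles both a
    # <testsuites> wrapper and a bare <testsuite> root.
    for case in root.iter("testcase"):
        test_id = f"{case.get('classname', '')}::{case.get('name', '')}"
        if case.find("failure") is not None or case.find("error") is not None:
            outcomes[test_id] = "failed"
        elif case.find("skipped") is not None:
            outcomes[test_id] = "skipped"
        else:
            outcomes[test_id] = "passed"
    return outcomes


# Made-up sample report, for illustration only.
SAMPLE = """<testsuite name="auth" tests="2" failures="1">
  <testcase classname="tests.test_session" name="test_login" time="0.12"/>
  <testcase classname="tests.test_session" name="test_user_session_expires_correctly" time="1.03">
    <failure message="assert None is not None">traceback elided</failure>
  </testcase>
</testsuite>"""

outcomes = parse_junit_xml(SAMPLE)
```

Store one of these dictionaries per CI run, keyed by commit, and you have the test-level history the rest of the workflow needs.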

Step 1: Identify Flaky Tests From CI History

The agent's first job: find the flakes. Not by guessing, not by you manually tagging them — by analyzing actual CI data.

A test is flaky if it produces different outcomes on the same code. The agent queries your CI's test history API (or parses JUnit XML artifacts) looking for:

  • Tests that both passed and failed at least once in the last 100 runs
  • Where the code under test didn't change between the pass and the fail
  • With a flake rate (failures divided by total runs) above a threshold (we use 5%; adjust based on your tolerance)

Example query logic (pseudocode):

flaky_tests = []
for test in all_tests:
    # Outcomes from runs where the code under test did not change
    outcomes = get_outcomes_last_100_runs(test)
    if outcomes.has_failures() and outcomes.has_passes():
        # Same code, different outcomes = flaky
        flake_rate = outcomes.failure_count / outcomes.total_count
        if flake_rate > 0.05:
            flaky_tests.append({
                'name': test.name,
                'file': test.file_path,
                'flake_rate': flake_rate,
                'last_failure': outcomes.last_failure_date,
                'failure_logs': outcomes.last_failure_logs
            })

# Rank worst offenders first
flaky_tests.sort(key=lambda t: t['flake_rate'], reverse=True)

The output: a ranked list of flaky tests, sorted by flake rate (worst offenders first) or by last failure (most recent first — your call).
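
The "same code" condition is the part that's easy to skip, so here's a runnable sketch of it. It assumes you record the commit SHA with every test result; `RunRecord` and `find_flaky_tests` are illustrative names, not part of any CI system's API. A test only counts as flaky if at least one commit shows both a pass and a fail.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    test_name: str
    commit_sha: str  # identifies the code version the run executed against
    passed: bool


def find_flaky_tests(runs, threshold=0.05):
    """Return (test_name, flake_rate) pairs, worst offenders first."""
    by_test = defaultdict(list)
    for run in runs:
        by_test[run.test_name].append(run)

    flaky = []
    for name, test_runs in by_test.items():
        outcomes_by_commit = defaultdict(set)
        for run in test_runs:
            outcomes_by_commit[run.commit_sha].add(run.passed)
        # Mixed outcomes on a single commit: same code, different results.
        mixed = any(len(seen) == 2 for seen in outcomes_by_commit.values())
        if not mixed:
            continue  # consistent per commit, so a genuine pass or failure
        # Simplification: overall failure rate across all recorded runs.
        flake_rate = sum(not r.passed for r in test_runs) / len(test_runs)
        if flake_rate > threshold:
            flaky.append((name, flake_rate))
    return sorted(flaky, key=lambda item: item[1], reverse=True)
```

The rate here counts every failure, including ones on commits that were genuinely broken. If your history has a lot of those, restrict the rate to commits with mixed outcomes.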

This runs on a schedule — nightly or weekly — and creates quarantine tickets for any new flakes that cross the threshold. No human required to notice the pattern. The agent just quietly does its homework while you sleep.

Step 2: Quarantine the Flaky Test

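Once identified, the agent quarantines immediately. Quarantine means the test's failure no longer blocks the build. Ideally the test keeps running somewhere non-blocking (you need the data), because a plain skip marker like the one below stops it from executing at all. In pytest, @pytest.mark.xfail(strict=False) is one way to keep the test running without failing the build.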

# Before
def test_user_session_expires_correctly():
    ...

# After — quarantined
@pytest.mark.skip(reason="Quarantined: flaky - ticket PROJ-4521")
def test_user_session_expires_correctly():
    ...

Jest uses test.skip, Playwright has test.fixme(). The agent commits, opens a PR, creates a ticket with test name, flake rate, and last failure logs. Takes about 90 seconds end-to-end.

Quarantine PRs should merge fast — don't let them sit in review for three days while the test keeps failing CI. Auto-merge after linter check. Seriously, if you add a 48-hour review cycle to quarantine PRs, you've missed the entire point. We covered CI guardrails for agent PRs in another post.

Step 3: Reproduce the Flake Locally

Here's where manual debugging fails. You run the test ten times — all pass. You close the ticket "couldn't reproduce." It fails again next week.

The agent takes a different approach: brute force. Run the test up to 100 times, stopping at the first failure.

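for i in {1..100}; do
  # Stop at the first failing run; its output is the reproduction
  if ! npm test -- --testNamePattern="user session expires"; then
    echo "Flake reproduced on run $i"; exit 0
  fi
done
echo "No failure in 100 runs; the flake may be CI-specific"; exit 1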

Most flaky tests fail within 20-50 runs when isolated. If the test passes 100 times locally but still fails in CI, the flake is probably environmental: CI-specific timing or resource contention. When a run does fail, the agent captures the failure output, stack traces, and timing data. Now you have an actual reproduction.
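
Why 100 runs and not 10? A quick back-of-the-envelope calculation helps. The sketch below assumes each run is independent and that the local flake rate matches the one measured in CI. Both are assumptions that often don't hold exactly, so read the numbers as rough guidance.

```python
import math


def reproduction_probability(flake_rate: float, runs: int) -> float:
    """Chance of seeing at least one failure in `runs` independent runs."""
    return 1.0 - (1.0 - flake_rate) ** runs


def runs_needed(flake_rate: float, confidence: float = 0.95) -> int:
    """Smallest run count that reproduces the flake with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - flake_rate))
```

Under those assumptions, a test that flakes 5% of the time needs about 59 runs for a 95% chance of one failure, and 100 runs gets you past 99%. Ten runs gives you roughly 40%, which is why "ran it ten times, couldn't reproduce" happens so often.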

This is grunt work. It's boring. It's perfect for an agent.

Step 4: Identify the Root Cause

Most flaky tests fall into a few categories: race conditions (missing await, concurrent writes), timing dependencies (hardcoded waits that work locally but fail on slow CI), shared state (globals not cleaned up between tests), or external dependencies (real HTTP calls that occasionally time out). Understanding these patterns matters more than any benchmark score; we wrote about why SWE-bench isn't enough for real-world agent evaluation in another post.

The agent reads the failing test, the code under test, and the failure logs. It outputs a hypothesis:

Root cause hypothesis: RACE_CONDITION
Evidence: Test asserts on `user.lastSeen` immediately after `updateUserSession()`.
`updateUserSession()` is async but not awaited. The test passes when the async
call completes first (~93% of runs) and fails when the assertion wins the race (~7%).
Proposed fix: Await `updateUserSession()` before asserting.

For common patterns, the agent's accuracy is around 50-60% on first hypothesis. For weird edge cases? Lower. Sometimes embarrassingly wrong. But even a wrong hypothesis narrows the search space — you learn what it isn't.

Step 5: Apply the Fix

The agent writes a minimal patch:

// Before (flaky) — updateUserSession not awaited
updateUserSession(testUser.id);
expect(testUser.lastSeen).toBeDefined();

// After (stable)
await updateUserSession(testUser.id);
expect(testUser.lastSeen).toBeDefined();

The agent runs the test 50 times to verify stability, then opens a PR with the fix, root cause explanation, and reproduction data. If the fix doesn't hold, the agent tries a second hypothesis or escalates. "Couldn't stabilize after 2 attempts" is valid — some flakes need human debugging. I'd rather the agent escalate than keep guessing forever.

The agent handles the easy 50-60%. That's not a failure. That's 50-60% of your flaky-test backlog gone without you touching it. (For solo founders, this kind of automation is how you ship like a team without hiring one.)

Step 6: Verify Stability Before Re-enabling

Non-negotiable. A "fixed" test that gets re-enabled without verification will just flake again.

The agent runs the test 50 times in isolation, then 20 times with the full suite (this catches shared-state issues). Any failure in either batch means the test is not stable: try again or escalate. Only after both batches pass does the agent remove the @pytest.mark.skip and re-enable the test.

Ticket closes with a summary: original flake rate, root cause, fix applied, verification passed. Clean.
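
The two-batch gate is simple enough to sketch with `subprocess`. The function names are mine, and the commands you pass in depend entirely on your test runner, so take this as one possible shape rather than the agent's actual implementation.

```python
import subprocess
import sys


def passes_every_run(command: list[str], runs: int) -> bool:
    """Run `command` up to `runs` times; stop at the first non-zero exit."""
    for attempt in range(1, runs + 1):
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            print(f"Failed on run {attempt}/{runs}", file=sys.stderr)
            return False
    return True


def verify_stability(isolated_cmd, suite_cmd, isolated_runs=50, suite_runs=20):
    """Both batches must pass in full before the test is re-enabled."""
    return (passes_every_run(isolated_cmd, isolated_runs)
            and passes_every_run(suite_cmd, suite_runs))
```

With pytest, `isolated_cmd` might look like `["pytest", "tests/test_session.py::test_user_session_expires_correctly"]` (the file path is hypothetical) and `suite_cmd` like `["pytest"]`. The isolated batch runs first because it's cheap; there's no point paying for 20 full-suite runs if the test still fails on its own.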

Common Errors and Fixes

These will happen. Don't panic.

"Flake reproduced but root cause unclear"

Escalate to human review. The agent adds all reproduction data to the ticket — failure logs, timing, stack traces. Sometimes the answer is obvious once you look; sometimes it's a genuine puzzle.

"Test passes 100 times locally but fails in CI"

Environmental flake. CI has different timing, resources, or parallelism settings. Run reproduction in Docker with resource limits matching CI runners. Or accept that some flakes only reproduce in CI and work from logs alone.

"Fix applied but test still flakes at lower rate"

Iterate. A 3% flake rate is still a flake. The agent tries a second fix or combines approaches (await + cleanup + timeout). If rate drops below 1% after 200 runs, that might be acceptable.
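
It's worth knowing what a clean batch actually proves. Assuming independent runs, this small sketch gives the largest flake rate that is still consistent with a streak of passes at a chosen confidence level. The function name is illustrative.

```python
def flake_rate_upper_bound(clean_runs: int, confidence: float = 0.95) -> float:
    """Largest flake rate still consistent with `clean_runs` straight passes.

    Solves (1 - p) ** clean_runs == 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / clean_runs)
```

At 95% confidence, 50 clean runs only rule out flake rates above roughly 5.8%, and 200 clean runs get that down to about 1.5%. To claim "under 1%" with the same confidence you need close to 300 clean runs, so treat a quiet 200-run batch as encouraging rather than conclusive.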

"Quarantine backlog keeps growing"

You're creating flakes faster than the agent can fix them. This is a codebase health problem, not an agent problem. Look at which areas produce the most flakes — probably missing async handling or shared-state antipatterns spreading through the codebase. Fix the source, not the symptoms. Consider whether your agents are hallucinating infra decisions that introduce instability.

Next Steps

Once the basic workflow is running:

Track time-to-fix. How long from quarantine to re-enable? 3 days, great. 3 weeks, the agent is stuck. Your CI/CD pipeline metrics should track this.

Set deletion policies. Tests quarantined 3-4 weeks with no progress? Delete them. Controversial opinion: a test nobody can fix probably isn't testing the right thing. Write a new one.
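
Both of these checks fall out of two dates per ticket. A minimal sketch, assuming you can export quarantine and re-enable dates from your board; `QuarantineTicket` is a made-up record type, and "no progress" is approximated here by age alone.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class QuarantineTicket:
    test_name: str
    quarantined_on: date
    reenabled_on: Optional[date] = None  # None while still quarantined


def time_to_fix(ticket: QuarantineTicket) -> Optional[int]:
    """Days from quarantine to re-enable, or None if the ticket is open."""
    if ticket.reenabled_on is None:
        return None
    return (ticket.reenabled_on - ticket.quarantined_on).days


def deletion_candidates(tickets, today: date, max_age=timedelta(weeks=4)):
    """Names of tests still quarantined after more than `max_age`."""
    return [t.test_name for t in tickets
            if t.reenabled_on is None and today - t.quarantined_on > max_age]
```

Run `deletion_candidates` in the same nightly job that finds new flakes, and have the agent post the list to the ticket instead of deleting anything itself. Deleting a test is a decision a human should sign off on.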

Expand to coverage gaps. Same agent pattern works for uncovered code. DevOS's QA agent handles both — flaky-test duty is one workflow, coverage expansion is another.

In DevOS's planned workflow, the agent owns a "Flaky Tests" label. Any ticket with that label goes to the agent automatically. Your sprint velocity stays predictable because flaky-test work isn't consuming developer time. JustAnalytics can graph flake rates over time — you should see a downward trend once the agent is active.

Frequently Asked Questions

How does an AI agent detect which tests are flaky?

The agent analyzes CI history for tests that pass and fail on the same code. Most CI systems track test outcomes per run — the agent queries this data, flags tests with inconsistent results over the last 50-100 runs, and creates quarantine tickets for anything above a 5% flake rate. No manual tagging required.

What does quarantining a flaky test actually mean?

Quarantine means marking the test so its failures don't block CI. Ideally the test still executes (you need data) in a separate quarantine suite, where a failure doesn't fail the build. A skip annotation is the simpler option, but it stops the test from running at all. The agent uses one or the other, depending on your framework. Jest uses test.skip, pytest uses pytest.mark.skip, and so on.

Can an AI agent actually fix flaky tests?

For common flake patterns — race conditions, timing dependencies, shared state, missing async/await — agents fix about 50-60% of cases autonomously. Complex flakes involving external services, database state, or browser timing often need human investigation. The agent triages and fixes what it can, escalates the rest.

How long should a test stay quarantined before it's deleted?

We recommend 2-3 sprint cycles (2-4 weeks). If the agent can't stabilize the test after multiple fix attempts, and no human has prioritized it, the test probably isn't worth keeping. Delete it and write a new, stable test for the same behavior — often easier than debugging a fundamentally broken test.


Join the DevOS Waitlist

DevOS puts AI agents inside your sprint as assignees — Planner, Developer, QA, DevOps agents that pick up tickets, open PRs, and hand off work. The QA agent owns flaky-test duty as one of its workflows, alongside coverage expansion and acceptance testing.
