Cutting E2E Flake Rate from 100% to 25% in Five Weeks
How we killed flaky end-to-end tests so coding agents could actually iterate, and a checklist to replicate the workflow.
Problem
When coding agents write nearly all your code, a flaky end-to-end (E2E) test suite lies to agents and breaks their feedback loop.
Constraints
- Data available: Playwright reports, CI pipeline history, per-test pass/fail data, application performance monitoring (APM) traces via W3C Trace Context response headers.
- Tools: Two to three independent coding agents (e.g. Claude Code, Codex, Google Gemini), a custom Playwright reporter, GitHub Actions, and owning-team code reviewers.
Approach
- Multi-agent consensus for test triage. Three agents independently categorize each test as `delete`, `combine`, `convert`, or `leave` with a confidence score. A second pair cross-checks and negotiates until they agree on every test.
- Human-in-the-loop review. Domain-owning teams make the final call on each change.
- Guardrails against regression. Update `AGENTS.md` with a pre-E2E checklist and the rule, “When in doubt, write an integration test.”
- Agent-readable failure reports. A custom Playwright reporter emits structured JSON: unified timeline, network and console events, base64 screenshots, and a first-class `traceId` parsed from the `traceparent` header so APM traces can provide backend context.
- Right agent for the job. Benchmark three agents on the same real failure before standardizing. For us, Codex consistently searched deeper and resorted to retries and timeout bumps least often.
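
The `traceparent` header follows the W3C Trace Context format (`version-traceId-parentId-flags`), so extracting a first-class `traceId` is a small parsing step. A minimal sketch (the function name is ours, not the reporter's API):

```typescript
// Parse a W3C Trace Context `traceparent` response header into its trace
// and parent span IDs, e.g.
//   "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
// Returns null for malformed or spec-invalid (all-zero) values.
function parseTraceparent(
  header: string,
): { traceId: string; parentId: string } | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return null;
  const [, version, traceId, parentId] = match;
  // Version 0xff and all-zero IDs are invalid per the spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentId)) {
    return null;
  }
  return { traceId, parentId };
}
```

With the `traceId` surfaced as its own JSON field, an agent can look up the backend APM trace directly instead of re-parsing headers from raw logs.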
Before → After
| Metric | Week 0 | Week 5 |
|---|---|---|
| PRs hitting a flaky E2E | 100% | 25% |
| E2E test count | 174 | 87 |
| Typical agent fix for a flake | Add retry or bump timeout | Trace-backed root cause spanning 5-9 files |
| GitHub Actions flakiness costs | $102,600/yr | $25,650/yr |
How to replicate
- Write a prompt that asks an agent to audit every E2E test and categorize each as `delete`, `combine`, `convert`, or `leave` with a confidence score and reason. Output structured JSON keyed by `file:testName`.
- Run that prompt against three agents that differ in both model and harness. Save each result.
- Normalize test names across the three files, randomize their order to obfuscate which agent reached which conclusion, and merge into one `research-combined.json`.
- Give two fresh agents the combined file. Each produces a `final-*.json` decision per test.
- Diff the two files and loop: feed each agent the other’s reasoning and have them negotiate until they reach consensus on each test. Stop when the diff is empty.
- Group final decisions by test file to limit merge conflicts. Ship one PR per file, code-reviewed by the owning team.
- Add a pre-E2E checklist to `AGENTS.md` to prevent regression.
- Use the custom Playwright reporter to emit structured JSON, including a parsed `traceparent` header.
- Use the `/flaky-test-debugger` skill to point agents at the reporter output and perform APM trace lookups.
- Benchmark agents on debugging tasks to choose the best one.
Pitfalls
- Handing agents only the stack trace and the repo. Nearly every agent adds a retry or bumps a timeout. Without a unified timeline and APM traces, there is nothing to root-cause against.
- Scraping Playwright’s HTML report. The report renders for humans, not machines. Parsing logic grows brittle and information-sparse, and any upstream format change breaks it. Build a reporter instead of parsing one.
- Letting agents set the final cut list. “These tests rarely catch logic regressions, but they prevent API-breaking changes” is a constraint agents cannot see from the test code alone.
Starter kit
- `@clipboard-health/playwright-reporter-llm`: The custom reporter.
- `/flaky-test-debugger` skill: What agents follow on every failure.
- Full narrative write-up: Context, examples, and the triage prompt.