## Invocation

## Workflow
### Detect test runner

Identify the project's test framework and runner command by scanning config files in the project root. Detection order:

`pytest.ini` / `pyproject.toml` / `conftest.py` → Jest / Vitest / Mocha (`package.json`) → Gradle / Maven (`build.gradle`, `pom.xml`) → `go.mod` → `Cargo.toml` → `*.csproj` → RSpec / Minitest (`Gemfile`)

The detected runner determines the exact command templates used in all subsequent experiments.

### Confirm flakiness

Run the target test 10 times in isolation, record pass/fail per run, and compute the fail rate.
| Fail count | Interpretation | Next step |
|---|---|---|
| 0 of 10 | Possibly not flaky in this environment | Increase to 20 runs; ask user for CI context |
| 1–3 of 10 | Flakiness confirmed | Proceed to isolation |
| 4–7 of 10 | Highly flaky | Proceed to isolation |
| 8–10 of 10 | Likely consistently broken | Inform user; still run isolation to verify |
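As a sketch, the confirmation step can be driven by a small loop. Here `run_test` is a hypothetical callable standing in for the detected runner command, and the thresholds mirror the table above:

```python
def measure_fail_rate(run_test, runs=10):
    """Run the test `runs` times and return (fail_count, fail_rate).

    `run_test` returns True on pass; in practice it would shell out
    to the detected runner command.
    """
    results = [run_test() for _ in range(runs)]
    fails = results.count(False)
    return fails, fails / runs


def interpret(fails):
    """Map a fail count (out of 10) onto the interpretation table above."""
    if fails == 0:
        return "possibly not flaky here; increase to 20 runs"
    if fails <= 3:
        return "flakiness confirmed; proceed to isolation"
    if fails <= 7:
        return "highly flaky; proceed to isolation"
    return "likely consistently broken; verify with isolation"
```

For example, feeding a recorded sequence of outcomes through `measure_fail_rate` via an iterator reproduces the fail rate deterministically.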
### Isolation test
Run the target test alone (5 runs), then with the full suite (5 runs), and compare pass rates.
| Isolated | In-suite | Interpretation | Next step |
|---|---|---|---|
| Always passes | Sometimes fails | Ordering-dependent | Proceed to ordering bisection |
| Sometimes fails | Sometimes fails | Not ordering-dependent | Skip ordering; proceed to timing |
| Sometimes fails | Always passes | Possible self-induced resource leak | Proceed to timing |
| Always fails | Always fails | Consistently broken — not flaky | Report as real bug |
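The comparison above reduces to a lookup over the two run outcomes. A minimal illustration (function and key names are illustrative, not part of any runner API):

```python
def classify_isolation(isolated_fails, suite_fails, runs=5):
    """Compare isolated vs in-suite fail counts (out of `runs` each)."""
    def outcome(fails):
        if fails == 0:
            return "always passes"
        if fails == runs:
            return "always fails"
        return "sometimes fails"

    # Mirrors the interpretation table above.
    table = {
        ("always passes", "sometimes fails"):
            "ordering-dependent: proceed to ordering bisection",
        ("sometimes fails", "sometimes fails"):
            "not ordering-dependent: skip ordering, proceed to timing",
        ("sometimes fails", "always passes"):
            "possible self-induced resource leak: proceed to timing",
        ("always fails", "always fails"):
            "consistently broken, not flaky: report as real bug",
    }
    return table.get((outcome(isolated_fails), outcome(suite_fails)),
                     "unmatched pattern: gather more runs")
```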
### Ordering analysis

If the test passes in isolation but fails in-suite, bisect the test suite to find the specific test(s) that interfere with it. Binary bisection is used — never one-by-one elimination:
- List all tests before the target in execution order
- Split into two halves; run each half + target (3 runs each)
- Recurse into the half where the target fails
- Repeat until a single interfering test is isolated
- Confirm by running `INTERFERER + TARGET` 5 times
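The bisection loop can be sketched as follows, assuming a hypothetical `fails_with(subset)` probe that runs the subset plus the target (3 runs in practice) and reports whether the target failed. It also assumes a single interferer; if interference only appears when several tests run together, repeated passes over the suite are needed:

```python
def bisect_interferer(prior_tests, fails_with):
    """Find a single interfering test via binary bisection.

    `prior_tests` lists the tests that run before the target, in execution
    order. `fails_with(subset)` runs subset + target and returns True if
    the target fails.
    """
    candidates = list(prior_tests)
    while len(candidates) > 1:
        half = len(candidates) // 2
        first = candidates[:half]
        # Recurse into whichever half still reproduces the failure.
        candidates = first if fails_with(first) else candidates[half:]
    # Confirm the lone survivor actually reproduces the failure.
    if candidates and fails_with(candidates):
        return candidates[0]
    return None
```

Each iteration halves the search space, so isolating one interferer among N prior tests takes about log2(N) probes instead of N.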
### Timing analysis
Add timing instrumentation to detect race conditions, slow setup/teardown, and timeout sensitivity.
- Run the test 5 times with verbose duration output enabled
- Record setup time, test body time, and teardown time per run
- Compare parallel vs serial execution (disable workers with `--runInBand`, `-p no:xdist`, or `-parallel 1`)
- If the test stops flaking in serial mode, the root cause is parallel execution (shared state or resource contention)
- For Go projects, run `go test -race` to engage the built-in race detector
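Once per-phase durations are recorded, the variance comparison can be sketched with the standard library (`timing_variance` is an illustrative helper, not part of any runner):

```python
from statistics import mean, stdev


def timing_variance(runs):
    """Summarize per-phase timing across runs.

    `runs` is a list of dicts with "setup", "body", and "teardown"
    durations in seconds. Returns {phase: (mean, stdev)} so the
    unstable phase stands out.
    """
    report = {}
    for phase in ("setup", "body", "teardown"):
        values = [r[phase] for r in runs]
        spread = stdev(values) if len(values) > 1 else 0.0
        report[phase] = (mean(values), spread)
    return report
```

A phase whose standard deviation rivals its mean, especially when failing runs sit at the slow end, points at a race or timeout sensitivity in that phase.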
### Environment analysis
Check for external dependencies, parallel execution configuration, resource leaks, and non-determinism sources.
| Factor | How to check |
|---|---|
| Parallelism | Read runner config for `workers`, `forks`, `parallel`, `--jobs` |
| CI vs local | Ask user whether flakiness differs between environments |
| Database | Check for test DB config, migrations, and per-test cleanup |
| Network | Grep test code for real HTTP URLs not routed to mocks |
| Filesystem | Grep for file operations and temp path usage |
| Time | Grep for time-dependent assertions or sleep calls |
### Read the test code

Read the failing test and its fixtures or setup functions, searching for known flakiness signals. Signals to search for:

- `sleep`, `setTimeout`, `time.Sleep`, `Thread.sleep` — time-based synchronization
- `static`, `global`, hard-coded ports — shared mutable state
- `Math.random`, `random.random`, `uuid` — non-deterministic input
- `os.listdir`, `readdir`, `glob` in assertions — non-deterministic ordering
- `http://`, `https://` in test files — real external calls
- `open(` without a context manager, `new FileInputStream` without try-with-resources — resource leaks
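A sketch of the signal scan; the pattern table below is illustrative and covers only a subset of the signals listed above:

```python
import re

# Illustrative signal patterns; a real scan would cover the full list above.
SIGNALS = {
    "time-based synchronization": r"\b(sleep|setTimeout|time\.Sleep|Thread\.sleep)\b",
    "non-deterministic input": r"\b(Math\.random|random\.random|uuid)\b",
    "non-deterministic ordering": r"\b(os\.listdir|readdir|glob)\b",
    "real external calls": r"https?://",
}


def scan_for_signals(source):
    """Return {signal_name: [line_numbers]} for lines matching known patterns."""
    hits = {}
    for lineno, line in enumerate(source.splitlines(), start=1):
        for name, pattern in SIGNALS.items():
            if re.search(pattern, line):
                hits.setdefault(name, []).append(lineno)
    return hits
```

The line numbers collected here feed directly into the report's code-analysis section, which must cite specific files and lines.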
### Classify root cause

Assign one of the 6 root cause categories based on all experiment results. See root cause categories below. Assign a confidence level:
- HIGH — at least 2 independent experiments confirm the category
- MEDIUM — 1 experiment confirms, consistent with code analysis signals
- LOW — experiments inconclusive; code analysis points to a probable cause
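The confidence rules above reduce to a small function (names are illustrative):

```python
def assign_confidence(confirming_experiments, code_signals_consistent):
    """Map confirming-experiment count and code-analysis agreement to a level."""
    if confirming_experiments >= 2:
        return "HIGH"
    if confirming_experiments == 1 and code_signals_consistent:
        return "MEDIUM"
    return "LOW"
```

Note that a single confirming experiment without supporting code signals still yields LOW: one experiment alone is not treated as sufficient evidence.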
### Generate diagnosis report

Output the structured report with all experiment data, the root cause verdict, and a specific fix recommendation. The report includes:
- Verdict: root cause category, confidence level, and fail rate
- Evidence summary: 2–3 sentences referencing specific experiment results
- Experiment results: raw pass/fail per run, isolation results, bisection trace (if applicable), timing variance table
- Code analysis: specific file and line references for flakiness signals found
- Recommended fix: before/after code snippet with exact file path and line numbers
- Verification command: exact shell command to confirm the fix works
## Supported test frameworks
### pytest

Detected via `pytest.ini`, `pyproject.toml` (`[tool.pytest.ini_options]`), `setup.cfg` (`[tool:pytest]`), or `conftest.py`. Supports `--tb=short`, `--setup-show`, `--durations=0`, and `pytest-randomly` for seed-based ordering.

### Jest / Vitest

Detected via `jest.config.*`, `vitest.config.*`, or a `jest` section in `package.json`. Supports `--runInBand` for serial execution, `--detectOpenHandles`, and test name pattern filtering.

### Gradle / Maven

Detected via `build.gradle` or `pom.xml`. Gradle supports `--no-build-cache --rerun-tasks`. Maven uses `-Dtest=CLASS#METHOD`. Both target fully qualified class and method names.

### go test

Detected via `go.mod`. Natively supports `-count=N` for repeated runs, `-race` for race detection, `-shuffle=on` for order randomization, and `-parallel` for concurrency control.

### cargo test

Detected via `Cargo.toml`. Supports `--exact` for single-test targeting, `--nocapture` for output, and `--test-threads=1` to disable parallelism.

### RSpec / dotnet

RSpec detected via `Gemfile` with `rspec`; supports `--order rand:SEED` and `FILE_PATH:LINE_NUMBER` targeting. `dotnet test` uses `--filter FullyQualifiedName~TEST_NAME`.

## Root cause categories
Every diagnosis is classified into exactly one of these 6 categories:

| Category | Signature | Common cause |
|---|---|---|
| ORDERING | Passes alone, fails in-suite; bisection finds a specific interferer | Prior test mutates shared state (DB rows, env vars, module globals, temp files) without cleanup |
| TIMING | Fails in isolation; failing runs take significantly longer | Race condition, sleep()-based synchronization, or timeout sensitivity under load |
| SHARED_STATE | Fails in isolation; passes when run serially | Multiple test threads access the same mutable resource (singleton, shared port, shared DB) |
| EXTERNAL_DEPENDENCY | Fail rate varies across environments (local vs CI) | Test calls a real external service instead of a mock; time or locale assertions |
| RESOURCE_LEAK | First N runs pass, then failures increase progressively | File handles, DB connections, or goroutines allocated but never released |
| NON_DETERMINISM | Fails in isolation at a consistent rate; no timing variance | Random values, hash map iteration order, filesystem readdir order used in assertions |
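One way to sketch the mapping from experiment signatures to categories. The decision order shown here is an assumption; the real classification weighs all experiment evidence together rather than short-circuiting:

```python
def classify_root_cause(passes_alone, interferer_found, timing_correlates,
                        serial_fixes, env_dependent, degrades_over_runs):
    """Map boolean experiment signatures to one of the 6 categories.

    Returns "INCONCLUSIVE" when no signature from the table matches.
    """
    if passes_alone and interferer_found:
        return "ORDERING"
    if env_dependent:
        return "EXTERNAL_DEPENDENCY"
    if degrades_over_runs:
        return "RESOURCE_LEAK"
    if serial_fixes:
        return "SHARED_STATE"
    if timing_correlates:
        return "TIMING"
    if not passes_alone and not timing_correlates:
        return "NON_DETERMINISM"
    return "INCONCLUSIVE"
```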
If no single root cause clearly emerges, the report outputs `INCONCLUSIVE` with the top two candidate categories and recommended additional experiments. A well-evidenced “inconclusive” is a valid and honest diagnosis.

## Self-review checklist

Before delivering the report, verify all of the following:

- Flakiness confirmed: the test failed at least once AND passed at least once across experiment runs
- Fail rate computed from a minimum of 10 runs (not fewer)
- Isolation vs in-suite comparison completed (both were run)
- Root cause category is one of the 6 defined categories
- Fix recommendation references specific lines in the test or fixture code
- Report includes raw run data (pass/fail per run number) as evidence
- If ordering-dependent: the interfering test is identified by name
- If timing-dependent: the specific race condition or timeout is identified
## Golden rules
### 1. Never guess the root cause
Every diagnosis must be supported by experiment data. If experiments are inconclusive, say “inconclusive” and recommend further experiments — never fabricate a cause.
### 2. Always run isolation before ordering
Run the test alone first. If it fails in isolation, ordering analysis is irrelevant — skip to timing and environment analysis.
### 3. Bisect, never brute-force
When searching for an interfering test, use binary bisection of the test suite, not one-by-one elimination. Cut the search space in half each iteration.
### 4. Capture exact commands
Every experiment must log the exact shell command run so the user can reproduce it. Never paraphrase a command — copy it verbatim into the report.
### 5. Minimum 10 runs for any statistical claim
Never say “always passes” or “always fails” with fewer than 10 runs. Flaky tests can have fail rates under 10%.
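The 10-run floor follows from simple binomial arithmetic: a test with true fail rate p passes all n independent runs with probability (1 - p)^n, so even 10 runs miss a 10%-flaky test about a third of the time:

```python
def prob_all_pass(fail_rate, runs):
    """Probability that a test with the given true fail rate passes every run."""
    return (1 - fail_rate) ** runs


# A 10%-flaky test looks like "always passes" across 10 runs
# with probability 0.9**10, roughly 0.349.
print(round(prob_all_pass(0.10, 10), 3))
```

With fewer runs the miss probability is higher still (0.9**5 is about 0.59), which is why 10 runs is a floor for any "always passes" or "always fails" claim, not a guarantee.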
### 6. Never modify test code during diagnosis
The goal is to find the cause, not fix it during experiments. All instrumentation must be temporary and reverted before the report is delivered.