Runs structured experiments to identify the root cause of a flaky test and produces a diagnosis report with a concrete fix recommendation. Use when a test is intermittent, non-deterministic, or randomly failing.

Invocation

/flaky-test-diagnoser [test name or identifier]
Examples:
/flaky-test-diagnoser tests/test_payments.py::test_charge_idempotency
/flaky-test-diagnoser "UserService > should update profile on concurrent requests"
/flaky-test-diagnoser com.example.OrderServiceTest#testProcessOrder

Workflow

Step 1: Detect test runner

Identify the project’s test framework and runner command by scanning config files in the project root. Detection order: pytest.ini / pyproject.toml / conftest.py → Jest / Vitest / Mocha (package.json) → Gradle / Maven (build.gradle, pom.xml) → go.mod → Cargo.toml → *.csproj → RSpec / Minitest (Gemfile). The detected runner determines the exact command templates used in all subsequent experiments.
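
The detection order above can be sketched as a small probe over the project root. This is a simplification: real detection would also inspect file contents (e.g. a "jest" key inside package.json, or whether pyproject.toml actually contains a pytest section), and the `detect_runner` helper name is illustrative, not part of the command.

```shell
# Sketch of the detection order: probe for config files in the given project
# root and print the first matching runner family. Presence of a file is
# treated as sufficient here; content inspection is omitted for brevity.
detect_runner() {
  local root="$1"
  if [ -f "$root/pytest.ini" ] || [ -f "$root/conftest.py" ] || [ -f "$root/pyproject.toml" ]; then
    echo "pytest"
  elif [ -f "$root/package.json" ]; then
    echo "jest/vitest/mocha"
  elif [ -f "$root/build.gradle" ] || [ -f "$root/pom.xml" ]; then
    echo "gradle/maven"
  elif [ -f "$root/go.mod" ]; then
    echo "go test"
  elif [ -f "$root/Cargo.toml" ]; then
    echo "cargo test"
  elif ls "$root"/*.csproj >/dev/null 2>&1; then
    echo "dotnet test"
  elif [ -f "$root/Gemfile" ]; then
    echo "rspec/minitest"
  else
    echo "unknown"
  fi
}
```
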
Step 2: Confirm flakiness

Run the target test 10 times in isolation, record pass/fail per run, and compute the fail rate.
# pytest example
for i in $(seq 1 10); do
  echo "--- Run $i ---"
  pytest "tests/test_payments.py::test_charge_idempotency" -v --tb=short 2>&1
  echo "EXIT: $?"
done
| Fail count | Interpretation | Next step |
|---|---|---|
| 0 of 10 | Possibly not flaky in this environment | Increase to 20 runs; ask user for CI context |
| 1–3 of 10 | Flakiness confirmed | Proceed to isolation |
| 4–7 of 10 | Highly flaky | Proceed to isolation |
| 8–10 of 10 | Likely consistently broken | Inform user; still run isolation to verify |
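
The repeat-run loop can be wrapped in a helper that tallies the fail rate directly. A sketch: `fail_rate` is a hypothetical helper, and the pytest invocation in the comment is a placeholder for whatever runner command Step 1 detected.

```shell
# fail_rate CMD N: run CMD N times, print PASS/FAIL per run, then the fail rate.
# CMD stands in for the detected runner invocation, e.g.:
#   fail_rate 'pytest "tests/test_payments.py::test_charge_idempotency" -q' 10
fail_rate() {
  local cmd="$1" n="$2" fails=0
  for i in $(seq 1 "$n"); do
    if eval "$cmd" >/dev/null 2>&1; then
      echo "run $i: PASS"
    else
      echo "run $i: FAIL"
      fails=$((fails + 1))
    fi
  done
  echo "fail rate: $fails/$n"
}
```
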
Step 3: Isolation test

Run the target test alone (5 runs), then with the full suite (5 runs), and compare pass rates.
| Isolated | In-suite | Interpretation | Next step |
|---|---|---|---|
| Always passes | Sometimes fails | Ordering-dependent | Proceed to ordering bisection |
| Sometimes fails | Sometimes fails | Not ordering-dependent | Skip ordering; proceed to timing |
| Sometimes fails | Always passes | Possible self-induced resource leak | Proceed to timing |
| Always fails | Always fails | Consistently broken — not flaky | Report as real bug |
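
The two pass rates can be measured with the same helper applied to both invocations. A sketch: `pass_rate` is a hypothetical helper, and the pytest commands in the comments are placeholders for the detected runner.

```shell
# pass_rate CMD N: run CMD N times and print "passes/N". Run it once for the
# isolated invocation and once for the full suite, then compare, e.g.:
#   pass_rate 'pytest "tests/test_payments.py::test_charge_idempotency" -q' 5
#   pass_rate 'pytest tests/ -q' 5
pass_rate() {
  local cmd="$1" n="$2" ok=0
  for i in $(seq 1 "$n"); do
    eval "$cmd" >/dev/null 2>&1 && ok=$((ok + 1))
  done
  echo "$ok/$n"
}
```
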
Step 4: Ordering analysis

If the test passes in isolation but fails in-suite, bisect the test suite to find the specific test(s) that interfere with it. Binary bisection is used — never one-by-one elimination:
  1. List all tests before the target in execution order
  2. Split into two halves; run each half + target (3 runs each)
  3. Recurse into the half where the target fails
  4. Repeat until a single interfering test is isolated
  5. Confirm by running INTERFERER + TARGET 5 times
Each bisection step runs 3 times to account for inherent flakiness in the result.
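
The bisection loop above can be sketched as a generic helper. Two simplifications: each `check` call here runs once, whereas the workflow runs each step 3 times; and `check` stands for "run this subset of tests followed by the target, and return non-zero if the target failed". The helper name and interface are illustrative, and the code uses bash arrays.

```shell
# bisect_interferer CHECK test1 test2 ...: binary-search the ordered test list
# for the single test whose presence makes CHECK fail. CHECK receives a subset
# of test names as arguments and exits non-zero when the target fails after
# running that subset. Requires bash (arrays).
bisect_interferer() {
  local check="$1"; shift
  local tests=("$@")
  while [ "${#tests[@]}" -gt 1 ]; do
    local mid=$(( ${#tests[@]} / 2 ))
    local first_half=("${tests[@]:0:mid}")
    if "$check" "${first_half[@]}"; then
      # Target passed after the first half: the interferer is in the second half.
      tests=("${tests[@]:mid}")
    else
      tests=("${first_half[@]}")
    fi
  done
  echo "${tests[0]}"
}
```

Once a single test remains, confirm it by running INTERFERER + TARGET 5 times, as described above.
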
Step 5: Timing analysis

Add timing instrumentation to detect race conditions, slow setup/teardown, and timeout sensitivity.
  • Run the test 5 times with verbose duration output enabled
  • Record setup time, test body time, and teardown time per run
  • Compare parallel vs serial execution (disable workers: --runInBand for Jest, -p no:xdist for pytest-xdist, or -parallel 1 for go test)
  • If the test stops flaking in serial mode, the root cause is parallel execution (shared state or resource contention)
  • For Go projects, run go test -race to engage the built-in race detector
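
Per-run wall-clock durations can be captured with a small helper. A sketch: `time_runs` is a hypothetical helper, it assumes GNU date (`%N` nanosecond support), and `CMD` is a placeholder for the detected runner command.

```shell
# time_runs CMD N: run CMD N times and print each run's wall-clock duration in
# milliseconds. Compare the spread between passing and failing runs; failing
# runs that take significantly longer point at the TIMING category.
time_runs() {
  local cmd="$1" n="$2"
  for i in $(seq 1 "$n"); do
    local start end
    start=$(date +%s%N)   # GNU date: seconds + nanoseconds
    eval "$cmd" >/dev/null 2>&1
    end=$(date +%s%N)
    echo "run $i: $(( (end - start) / 1000000 )) ms"
  done
}
```
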
Step 6: Environment analysis

Check for external dependencies, parallel execution configuration, resource leaks, and non-determinism sources.
| Factor | How to check |
|---|---|
| Parallelism | Read runner config for workers, forks, parallel, --jobs |
| CI vs local | Ask user whether flakiness differs between environments |
| Database | Check for test DB config, migrations, and per-test cleanup |
| Network | Grep test code for real HTTP URLs not routed to mocks |
| Filesystem | Grep for file operations and temp path usage |
| Time | Grep for time-dependent assertions or sleep calls |
Step 7: Read the test code

Read the failing test and its fixtures or setup functions, searching for known flakiness signals. Signals to search for:
  • sleep, setTimeout, time.Sleep, Thread.sleep — time-based synchronization
  • static, global, hard-coded ports — shared mutable state
  • Math.random, random.random, uuid — non-deterministic input
  • os.listdir, readdir, glob in assertions — non-deterministic ordering
  • http://, https:// in test files — real external calls
  • open( without context manager, new FileInputStream without try-with-resources — resource leaks
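
The signal search can be mechanized with grep over the test directory. A sketch: `scan_signals` is a hypothetical helper, and the patterns cover only a subset of the signals listed above; extend them per language and framework.

```shell
# scan_signals DIR: grep test sources for common flakiness signals, prefixing
# each hit with a category tag. Patterns are illustrative, not exhaustive.
scan_signals() {
  local dir="$1"
  grep -rnE 'sleep\(|setTimeout\(|Thread\.sleep|time\.Sleep' "$dir" 2>/dev/null | sed 's/^/TIMING: /'
  grep -rnE 'Math\.random|random\.random|uuid' "$dir" 2>/dev/null | sed 's/^/NON_DETERMINISM: /'
  grep -rnE 'https?://' "$dir" 2>/dev/null | sed 's/^/EXTERNAL: /'
  return 0
}
```

Each hit is reported as `CATEGORY: file:line:match`, which feeds directly into the file-and-line references required by the diagnosis report.
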
Step 8: Classify root cause

Assign one of the 6 root cause categories based on all experiment results (see root cause categories below). Assign a confidence level:
  • HIGH — at least 2 independent experiments confirm the category
  • MEDIUM — 1 experiment confirms, consistent with code analysis signals
  • LOW — experiments inconclusive; code analysis points to a probable cause
If no single category fits clearly, report the top two candidates with individual confidence levels and recommend additional experiments.
Step 9: Generate diagnosis report

Output the structured report with all experiment data, the root cause verdict, and a specific fix recommendation. The report includes:
  • Verdict: root cause category, confidence level, and fail rate
  • Evidence summary: 2–3 sentences referencing specific experiment results
  • Experiment results: raw pass/fail per run, isolation results, bisection trace (if applicable), timing variance table
  • Code analysis: specific file and line references for flakiness signals found
  • Recommended fix: before/after code snippet with exact file path and line numbers
  • Verification command: exact shell command to confirm the fix works

Supported test frameworks

pytest

Detected via pytest.ini, pyproject.toml [tool.pytest.ini_options], setup.cfg [tool:pytest], or conftest.py. Supports --tb=short, --setup-show, --durations=0, and pytest-randomly for seed-based ordering.

Jest / Vitest

Detected via jest.config.* / vitest.config.* files or a "jest" section in package.json. Supports --runInBand for serial execution, --detectOpenHandles, and test name pattern filtering.

Gradle / Maven

Detected via build.gradle or pom.xml. Gradle supports --no-build-cache --rerun-tasks. Maven uses -Dtest=CLASS#METHOD. Both target fully qualified class and method names.

go test

Detected via go.mod. Natively supports -count=N for repeated runs, -race for race detection, -shuffle=on for order randomization, and -parallel for concurrency control.

cargo test

Detected via Cargo.toml. Test-harness flags are passed after --: --exact for single-test targeting, --nocapture for output, and --test-threads=1 to disable parallelism.

RSpec / dotnet

RSpec detected via Gemfile with rspec; supports --order rand:SEED and FILE_PATH:LINE_NUMBER targeting. dotnet test uses --filter FullyQualifiedName~TEST_NAME.

Root cause categories

Every diagnosis is classified into exactly one of these 6 categories:
| Category | Signature | Common cause |
|---|---|---|
| ORDERING | Passes alone, fails in-suite; bisection finds a specific interferer | Prior test mutates shared state (DB rows, env vars, module globals, temp files) without cleanup |
| TIMING | Fails in isolation; failing runs take significantly longer | Race condition, sleep()-based synchronization, or timeout sensitivity under load |
| SHARED_STATE | Fails in isolation; passes when run serially | Multiple test threads access the same mutable resource (singleton, shared port, shared DB) |
| EXTERNAL_DEPENDENCY | Fail rate varies across environments (local vs CI) | Test calls a real external service instead of a mock; time or locale assertions |
| RESOURCE_LEAK | First N runs pass, then failures increase progressively | File handles, DB connections, or goroutines allocated but never released |
| NON_DETERMINISM | Fails in isolation at a consistent rate; no timing variance | Random values, hash map iteration order, filesystem readdir order used in assertions |
If no single root cause clearly emerges, the report outputs INCONCLUSIVE with the top two candidate categories and recommended additional experiments. A well-evidenced “inconclusive” is a valid and honest diagnosis.

Self-review checklist

Before delivering the report, verify all of the following:
  • Flakiness confirmed: the test failed at least once AND passed at least once across experiment runs
  • Fail rate computed from a minimum of 10 runs (not fewer)
  • Isolation vs in-suite comparison completed (both were run)
  • Root cause category is one of the 6 defined categories
  • Fix recommendation references specific lines in the test or fixture code
  • Report includes raw run data (pass/fail per run number) as evidence
  • If ordering-dependent: the interfering test is identified by name
  • If timing-dependent: the specific race condition or timeout is identified

Golden rules

Every diagnosis must be supported by experiment data. If experiments are inconclusive, say “inconclusive” and recommend further experiments — never fabricate a cause.
Run the test alone first. If it fails in isolation, ordering analysis is irrelevant — skip to timing and environment analysis.
When searching for an interfering test, use binary bisection of the test suite, not one-by-one elimination. Cut the search space in half each iteration.
Every experiment must log the exact shell command run so the user can reproduce it. Never paraphrase a command — copy it verbatim into the report.
Never say “always passes” or “always fails” with fewer than 10 runs. Flaky tests can have fail rates under 10%.
The goal is to find the cause, not fix it during experiments. All instrumentation must be temporary and reverted before the report is delivered.