
Flaky Test Debugging in CI/CD: A Forensic Method That Finds the Real Root Cause

Flaky tests are the silent killers of CI/CD pipelines — they erode team trust, slow deployments, and hide real bugs. This forensic guide shows you how to systematically identify, diagnose, and eliminate test flakiness for good.

10 min read

The first flaky test in a CI/CD pipeline is barely a nuisance. You re-run the pipeline, it passes, and you move on. The tenth flaky test breaks your trust in the entire suite. The fiftieth means your team has learned to ignore red builds. And the day someone ships a real regression because "it was probably just flaky" is the day flaky tests become a production incident.

Flaky tests are not just an inconvenience. They are a systemic trust failure that compounds over time. This guide treats flakiness as a forensic problem — something to be diagnosed, categorized, traced to root causes, and remediated permanently.


Defining the Problem: What Makes a Test Flaky?

A flaky test is a test that passes and fails non-deterministically without any changes to the code under test. The same commit, the same environment, the same test — different results.

The critical distinction: a flaky test is not a test that fails. It is a test that sometimes fails and sometimes passes. This makes them far more dangerous than consistently failing tests, because:

  1. They are hard to reproduce locally — CI environments differ from development machines
  2. They destroy signal — a developer re-running a flaky test to get a green build might be hiding a real failure
  3. They normalize "re-run culture" — teams stop investigating failures and just re-run
  4. They slow pipelines — retries add 20–100% latency to CI runs

The Taxonomy of Flakiness: Eight Root Causes

Effective remediation starts with correct diagnosis. Flakiness has eight primary root causes:

mindmap
  root((Flaky Tests))
    Timing & Async
      Hard-coded sleeps
      Race conditions
      Polling without timeout
    Resource Contention
      Shared test database
      Port conflicts
      File system races
    External Dependencies
      Third-party APIs
      Email/SMS services
      Clock/timezone
    Test Order Dependency
      Shared global state
      Database leftovers
      Auth state bleed
    Environment Differences
      OS path separators
      Locale settings
      Node version
    Network Issues
      CDN latency
      DNS resolution
      WebSocket drops
    Visually Non-Deterministic
      Animation timing
      Font rendering
      Dynamic content
    Browser Process
      Memory pressure
      GPU process crash
      Browser version mismatch

Forensic Techniques: Finding the Root Cause

Technique 1: Flakiness Rate Tracking

The first step is measurement. Without a flakiness rate, you are operating blind. Track test results over time:

// scripts/track-flakiness.ts
// Run after each CI build to accumulate flakiness data.
// Assumes two helpers exist in your codebase: parseJUnitXml (a JUnit XML
// parser) and db (a thin client over your results database, e.g. Postgres).

interface TestResult {
  testName: string;
  passed: boolean;
  commitSha: string;
  runId: string;
  timestamp: Date;
}

async function recordResults(junitXmlPath: string): Promise<void> {
  const results: TestResult[] = parseJUnitXml(junitXmlPath);
  await db.insert('test_runs', results);
}

async function getFlakinessReport() {
  const query = `
    SELECT 
      test_name,
      COUNT(*) as total_runs,
      SUM(CASE WHEN passed = false THEN 1 ELSE 0 END) as failures,
      ROUND(SUM(CASE WHEN passed = false THEN 1 ELSE 0 END)::decimal / COUNT(*) * 100, 1) as flakiness_rate
    FROM test_runs
    WHERE timestamp > NOW() - INTERVAL '14 days'
    GROUP BY test_name
    HAVING SUM(CASE WHEN passed = false THEN 1 ELSE 0 END) > 0
    ORDER BY flakiness_rate DESC
    LIMIT 20;
  `;
  return db.query(query);
}

A test is considered flaky if it fails more than 1% of runs where nothing changed. Prioritize remediation by flakiness rate.
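The same 1% rule can be checked in-process once results are loaded from the database. A minimal sketch — the RunRecord shape below is a trimmed-down stand-in for the TestResult records above, keeping only what the calculation needs:

```typescript
// Pass/fail records for a single test across runs of the same commit.
interface RunRecord {
  testName: string;
  passed: boolean;
}

// Failure percentage with one decimal place, matching the SQL report above.
function flakinessRate(runs: RunRecord[]): number {
  if (runs.length === 0) return 0;
  const failures = runs.filter((r) => !r.passed).length;
  return Math.round((failures / runs.length) * 1000) / 10;
}

// Flag a test once it fails on more than 1% of unchanged runs.
function isFlaky(runs: RunRecord[]): boolean {
  return flakinessRate(runs) > 1;
}
```

A test at exactly 1% sits on the threshold and is not flagged; one more failure tips it over.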

Technique 2: The Isolation Replay

Run the suspected flaky test in isolation, 20 times in a row. If it fails any of those runs, you have confirmed flakiness and can begin debugging:

# Run the specific test 20 times and report any failures
for i in $(seq 1 20); do
  npx playwright test --grep "test name" --reporter=dot 2>&1
  echo "Run $i complete"
done | grep -E "failed|passed|Run"

Playwright also has a built-in repeat mechanism:

npx playwright test --repeat-each=10 tests/flaky-test.spec.ts

Technique 3: Playwright Trace Analysis

Playwright's trace viewer is the single most powerful tool for diagnosing flaky test failures. When a test fails in CI, the trace file captures every action, snapshot, network request, and console log.

// playwright.config.ts — generate traces for all failures
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'on-first-retry', // capture a full trace when a test is retried
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  retries: process.env.CI ? 2 : 0, // retry only in CI, never locally
});

Open a failed run's trace with npx playwright show-trace <path-to-trace.zip>, then look for:

  • Timing gaps — long pauses before assertions that suggest async issues
  • Network 500s — backend errors that appear intermittently
  • Missing elements — elements that should be visible but were not found yet
  • Console errors — JavaScript errors that explain unexpected behavior

Technique 4: The Surgeon's Checklist — Hard Waits

The most common cause of flakiness in Playwright tests is using page.waitForTimeout() (hard sleep) instead of event-based waits. Search your codebase for these and replace every one:

// ❌ Flaky: fixed sleep that might not be long enough on slow CI
await page.waitForTimeout(2000);
await page.getByText('Success').click();

// ✅ Reliable: wait for the actual network event
await page.waitForResponse((res) => res.url().includes('/api/save'));
await page.getByText('Success').click();

// ✅ Reliable: wait for the element to actually be visible
await page.getByText('Success').waitFor({ state: 'visible' });
await page.getByText('Success').click();

// ✅ Reliable: for polling-based state changes
await expect(page.getByTestId('status')).toHaveText('Complete', { timeout: 10_000 });

The Eight Fixes: Matched to Root Causes

Root Cause → Fix

  • Hard-coded sleeps → replace with waitForResponse, waitFor, or expect assertions with a timeout
  • Shared test database → isolate each test with unique data or transaction rollback
  • Test order dependency → reset state in beforeEach; never rely on previous test side effects
  • External API calls → mock third-party services in test environments
  • Animation timing → use page.addInitScript to disable CSS animations in tests
  • Port conflicts → use random available ports; check before binding
  • Browser memory pressure → limit parallelism; reuse browser contexts (worker scope)
  • CI environment differences → use Docker containers with locked dependencies
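The "random available ports" fix is straightforward in Node: ask the kernel for an ephemeral port by listening on port 0. A hedged sketch using only node:net, no test framework assumed:

```typescript
import net from 'node:net';
import type { AddressInfo } from 'node:net';

// Bind to port 0 so the OS assigns a free ephemeral port,
// then release it immediately so the test server can claim it.
export function getFreePort(): Promise<number> {
  return new Promise((resolve, reject) => {
    const srv = net.createServer();
    srv.on('error', reject);
    srv.listen(0, () => {
      const { port } = srv.address() as AddressInfo;
      srv.close(() => resolve(port));
    });
  });
}
```

There is a small window between releasing the port and your server binding it, so when the server under test supports it, the cleanest option is to let it bind port 0 itself and read the assigned port back.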

Disabling CSS Animations: Quick Win

CSS animations cause visual timing issues, especially in screenshot and visual regression tests. Disable them globally in your test setup:

// tests/setup/disable-animations.ts
import type { Page } from '@playwright/test';

export async function disableAnimations(page: Page): Promise<void> {
  await page.addInitScript(() => {
    // Runs in the page before any app code, so the override applies everywhere
    const style = document.createElement('style');
    style.textContent = `
      *, *::before, *::after {
        animation-duration: 0s !important;
        animation-delay: 0s !important;
        transition-duration: 0s !important;
        transition-delay: 0s !important;
      }
    `;
    document.head.appendChild(style);
  });
}

Apply it in a fixture for all tests that do visual assertions.


Handling Test Isolation: The Most Impactful Fix

Tests that leave data in a shared database are a primary source of order-dependent flakiness. The gold standard is transaction rollback between tests:

// tests/fixtures/db.ts — wrap each test in a database transaction.
// Assumes a pg-style connection pool (`pool`) and a DatabaseClient type
// defined elsewhere in your test setup.
import { test as base } from '@playwright/test';

export const test = base.extend<{ db: DatabaseClient }>({
  db: async ({}, use) => {
    const client = await pool.connect();
    await client.query('BEGIN');

    await use(client);

    await client.query('ROLLBACK'); // undo all changes after each test
    client.release();
  },
});

If transaction rollback is not feasible (e.g., your tests run against a remote staging database), use unique identifiers for all test-created data and clean up in afterEach:

let createdProjectId: string;

test.beforeEach(async ({ request }) => {
  const res = await request.post('/api/projects', {
    data: { name: `Test-${crypto.randomUUID()}` },
  });
  createdProjectId = (await res.json()).id;
});

test.afterEach(async ({ request }) => {
  await request.delete(`/api/projects/${createdProjectId}`);
});

CI-Specific Flakiness: The Environmental Factor

Tests that pass locally but fail in CI are often caused by CI-specific conditions:

flowchart LR
    A[Passes locally\nFails in CI] --> B{Check these first}
    B --> C[Different Node version\nnvm use / .nvmrc]
    B --> D[Different OS\npath separator bugs]
    B --> E[Fewer CPU cores\nslower JS execution]
    B --> F[No GPU / headless\nrendering differences]
    B --> G[Network latency\nslower API calls]
    B --> H[Timezone UTC\nvs local time]

The most reliable fix for environment-related flakiness is containerizing your test runs. Using the same Docker image locally and in CI eliminates the "it works on my machine" class of failures entirely.

# docker/test.Dockerfile
FROM mcr.microsoft.com/playwright:v1.51.0-jammy
WORKDIR /app
COPY package*.json ./
RUN npm ci
COPY . .
CMD ["npx", "playwright", "test"]

Flaky Test Quarantine: A Pragmatic Tactic

While you are working on permanent fixes, quarantine known flaky tests to prevent them from blocking deployments:

// Add to flaky tests while a fix is in progress —
// test.fixme skips the test but keeps it visible in reports
test.fixme('this test is flaky - tracked in PROJ-1234', async ({ page }) => {
  // test body
});

Or in your CI, mark as allowed-to-fail while tracking separately:

# .github/workflows/test.yml
- name: Run tests
  run: npx playwright test
  continue-on-error: ${{ contains(github.event.head_commit.message, '[skip-flaky-check]') }}

Keep a flaky test backlog in your issue tracker and review it in every sprint. Let no flaky test go unaddressed for more than two sprints.
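Keeping that backlog honest is easier if you can audit it mechanically. A hypothetical helper that scans test sources for test.fixme titles and extracts the tracking-ticket IDs — the Jira-style PROJ-1234 pattern is an assumption; adjust it to your tracker's ID format:

```typescript
// Audit helper (hypothetical): collect ticket IDs referenced in
// test.fixme(...) titles, so the quarantine list can be diffed
// against the issue tracker each sprint.
export function findQuarantinedTickets(source: string): string[] {
  const tickets = new Set<string>();
  const fixmeTitle = /test\.fixme\(\s*['"`]([^'"`]*)/g;
  for (const match of source.matchAll(fixmeTitle)) {
    const id = match[1].match(/[A-Z][A-Z0-9]+-\d+/); // e.g. PROJ-1234 (assumed format)
    if (id) tickets.add(id[0]);
  }
  return [...tickets];
}
```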


Connecting to Production Reliability

Flaky tests are a signal about the health of both your test suite and your application. An application with a high proportion of timing-dependent test failures may also have timing-dependent production issues — race conditions, eventual consistency bugs, and state management problems.

Fixing flaky tests is therefore not just about CI stability — it is about understanding and improving the reliability of your actual application. This is why robust automated test infrastructure and production monitoring go hand in hand.

ScanlyApp's scan infrastructure is designed with flakiness resistance at its core — multiple network retries, animation-neutral screenshot capture, and intelligent assertion timing mean scan results reflect actual application state, not transient rendering conditions.

Know the difference between a flaky test and a real regression: Try ScanlyApp free and establish a stable production monitoring baseline for your most critical user flows.


The Anti-Flakiness Code Review Checklist

Add this to your PR review process to prevent new flakiness from entering the codebase:

  • No hard waitForTimeout calls without a documented justification
  • All test data is either uniquely named or cleaned up in afterEach
  • Async operations await waitForResponse or specific element states
  • No assertions on text that could change between renders (dynamic counts, dates)
  • Network mocks are explicitly cleared between tests if using a global interceptor
  • Visual screenshots disable animations
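The first checklist item can be partially mechanized in CI. A hypothetical lint-style check that flags waitForTimeout calls lacking a justification — the "flaky-ok:" marker is an invented convention; substitute whatever annotation your team agrees on:

```typescript
// Review helper (hypothetical): return 1-based line numbers of hard
// waits that carry no documented justification, suitable for posting
// as automated PR review comments.
export function findUnjustifiedHardWaits(source: string): number[] {
  const flagged: number[] = [];
  source.split('\n').forEach((line, idx) => {
    if (line.includes('waitForTimeout') && !line.includes('flaky-ok:')) {
      flagged.push(idx + 1);
    }
  });
  return flagged;
}
```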

Summary: The Flakiness Remediation Workflow

1. MEASURE   → Track flakiness rate per test over 14 days
2. TRIAGE    → Sort by flakiness rate; focus on top 10
3. ISOLATE   → Run each flaky test 20x in isolation to confirm
4. TRACE     → Use Playwright Trace Viewer to find exact failure moment
5. CATEGORIZE → Which of the 8 root causes applies?
6. FIX       → Apply the matched remediation
7. VERIFY    → Run 20x again; confirm fix
8. MONITOR   → Re-check flakiness dashboard in 2 weeks

Flaky tests are not random. They have causes, and causes have solutions. The teams with the most reliable CI/CD pipelines are not the ones who got lucky — they are the ones who treated flakiness as a first-class engineering problem worth solving systematically.

Further Reading

Related articles: Also see the complete playbook for identifying every category of flaky test, how parallel execution can both expose and eliminate flakiness, and implementing continuous testing that surfaces instability early.


Want to add production monitoring that filters out the noise of network blips? Set up a ScanlyApp scan with smart retry logic on your critical paths and know instantly when a real regression hits.

Related Posts

Playwright vs. Selenium vs. Cypress: The 2026 Showdown
Playwright & Automation
10 min read


Selenium, Cypress, and Playwright are the three titans of browser automation. In 2026, one has clearly pulled ahead — but the right choice still depends on your team, stack, and goals. Here's the definitive, opinionated comparison.