# Chaos Engineering: Break Your System on Purpose Before Your Users Do It for You
Netflix's Chaos Monkey became famous for randomly terminating production servers. The philosophy behind it was simple: if you don't test your resilience, your failures will be tests you didn't design. Better to discover the weakness yourself, on a weekday afternoon with your full team available, than at 2 AM on Black Friday.
Chaos engineering does not mean randomly breaking things. It means forming a hypothesis ("the system will continue to serve requests within normal latency bounds even if the database response time increases to 5 seconds"), designing a controlled experiment to test it, and learning from the result.
This guide covers practical chaos engineering for SaaS applications — without requiring a dedicated chaos platform to get started.
## The Chaos Engineering Process
```mermaid
flowchart LR
    A["Define steady state:<br/>what does healthy look like?"] --> B
    B["Form hypothesis:<br/>if X fails, Y should happen"] --> C
    C["Design experiment:<br/>minimal blast radius"] --> D
    D["Run experiment:<br/>inject failure"] --> E
    E{"Hypothesis<br/>confirmed?"}
    E -->|Yes| F["Document &<br/>expand scope"]
    E -->|No| G["Fix weakness,<br/>re-test"]
    G --> D
```
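The "define steady state" step is the one teams most often skip, yet every later step depends on it. A minimal sketch of a steady-state gate in bash — the endpoint URL and thresholds below are illustrative placeholders, not values from a real deployment:

```shell
#!/usr/bin/env bash
# Minimal steady-state gate: sample a health endpoint and compute the
# error rate. Run it before the experiment (baseline) and again during it.
check_steady_state() {
  local url="$1" samples="${2:-20}" max_error_pct="${3:-5}"
  local errors=0 code
  for _ in $(seq "$samples"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || code=000
    if [ "$code" -ge 500 ] || [ "$code" -eq 000 ]; then
      errors=$((errors + 1))
    fi
  done
  local pct=$((errors * 100 / samples))
  echo "error rate: ${pct}% (${errors}/${samples} samples failed)"
  [ "$pct" -le "$max_error_pct" ]  # exit status is the go/no-go signal
}

# Usage: check_steady_state https://app.example.com/api/health 20 5
```

The same function doubles as an abort check while the experiment runs: if the gate that passed at baseline starts failing, you have found the boundary of your hypothesis.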
## Starting Without a Chaos Platform
You do not need Chaos Mesh or LitmusChaos to start. Simple chaos experiments can be run with standard Linux tools:
### Network Latency with tc (Traffic Control)
```bash
# On the application server or within the container:
# Add 500ms latency to ALL outbound traffic on eth0
# (service-specific targeting is covered in the next section)
sudo tc qdisc add dev eth0 root netem delay 500ms

# Add jitter: 500ms ± 150ms, with 25% correlation between successive packets.
# Only one root netem qdisc can exist per device, so use `change`
# (not `add`) to modify an existing rule:
sudo tc qdisc change dev eth0 root netem delay 500ms 150ms 25%

# Add packet loss (1%)
sudo tc qdisc change dev eth0 root netem loss 1%

# Combined: 200ms delay + 50ms jitter (25% correlation) + 0.1% loss
sudo tc qdisc change dev eth0 root netem delay 200ms 50ms 25% loss 0.1%

# Remove all rules
sudo tc qdisc del dev eth0 root netem
```
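A forgotten netem rule keeps degrading traffic long after everyone has moved on, so it is worth bounding every injection with an automatic rollback. A sketch of such a wrapper — the interface name and netem arguments in the example are placeholders:

```shell
# Run a netem experiment with a hard time limit and guaranteed rollback.
# Usage: run_with_netem <dev> <duration_s> <netem args...>
run_with_netem() {
  local dev="$1" dur="$2"; shift 2
  (
    # Subshell trap removes the rule on normal exit or interruption
    trap "sudo tc qdisc del dev $dev root netem 2>/dev/null" EXIT INT TERM
    sudo tc qdisc add dev "$dev" root netem "$@"
    echo "netem active on $dev for ${dur}s (Ctrl-C to roll back early)"
    sleep "$dur"
  )
}

# Example: 500ms ± 150ms latency for 5 minutes, then auto-rollback
# run_with_netem eth0 300 delay 500ms 150ms 25%
```

The subshell keeps the trap scoped to the experiment, so the calling shell's own traps are untouched.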
### Targeting Specific Services
```bash
# Only inject latency on traffic to Redis (destination port 6379)
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 3000ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 6379 0xffff flowid 1:3
```
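Before trusting the filter, confirm the latency lands only where intended. A rough timing helper (the hostnames in the usage comments are placeholders; `%3N` requires GNU date):

```shell
# Wall-clock time of an arbitrary command, in milliseconds
measure_ms() {
  local start end
  start=$(date +%s%3N)
  "$@" > /dev/null 2>&1 || true
  end=$(date +%s%3N)
  echo $((end - start))
}

# measure_ms redis-cli -h cache.internal ping    # should jump to ~3000ms
# measure_ms curl -s https://api.internal/health # should stay at baseline
```

If both calls slow down, the filter is not matching and the netem delay is hitting the whole device.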
## Application-Level Chaos Testing with Playwright
For black-box chaos testing, use Playwright's route interception to inject failures at the network layer:
```typescript
// tests/chaos/latency.test.ts
import { test, expect } from '@playwright/test';

test('dashboard loads with acceptable degraded UX when API responds in 3s', async ({ page }) => {
  // Inject 3-second latency on API calls
  await page.route('**/api/**', async (route) => {
    await new Promise((resolve) => setTimeout(resolve, 3000));
    await route.continue();
  });

  await page.goto('/dashboard');

  // Should show loading states, not error pages
  await expect(page.locator('[data-testid="loading-skeleton"]')).toBeVisible();

  // Should eventually load (not time out or show an error)
  await expect(page.locator('[data-testid="dashboard-content"]')).toBeVisible({
    timeout: 15_000,
  });

  // Should not flash an error state
  await expect(page.locator('[data-testid="error-state"]')).not.toBeVisible();
});

test('graceful degradation when search API is completely down', async ({ page }) => {
  // Total failure of the search API
  await page.route('**/api/search**', (route) => route.abort('failed'));

  await page.goto('/dashboard');
  await page.fill('[data-testid="search-input"]', 'test query');
  await page.keyboard.press('Enter');

  // Should show a user-friendly error, not a broken UI
  await expect(page.locator('[data-testid="search-error"]')).toBeVisible();
  await expect(page.locator('[data-testid="search-error"]')).toContainText(/unavailable|try again/i);

  // Navigation and other features should still work
  await page.click('[data-testid="nav-settings"]');
  await expect(page).toHaveURL('/settings');
});

test('retry button works after transient payment API failure', async ({ page }) => {
  let callCount = 0;
  await page.route('**/api/billing/upgrade**', async (route) => {
    callCount++;
    if (callCount === 1) {
      // First call fails (simulates a transient provider outage)
      await route.fulfill({ status: 503, body: 'Service Unavailable' });
    } else {
      await route.continue();
    }
  });

  await page.goto('/billing/upgrade');
  await page.click('[data-testid="upgrade-btn"]');

  // Should show a retry option, not a permanent failure
  await expect(page.locator('[data-testid="retry-btn"]')).toBeVisible();

  // Retry should succeed
  await page.click('[data-testid="retry-btn"]');
  await expect(page.locator('[data-testid="upgrade-success"]')).toBeVisible();
});
```
## Chaos Mesh: Production-Grade Chaos Injection
For teams running Kubernetes, Chaos Mesh provides a declarative chaos injection API:
```yaml
# chaos/experiments/database-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-latency-experiment
  namespace: production
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: frontend
  delay:
    latency: '500ms'
    correlation: '25'
    jitter: '100ms'
  direction: to
  externalTargets:
    - 'postgres-service.production.svc.cluster.local'
  # Automatically stop after 5 minutes (blast radius control)
  duration: '5m'
```
```yaml
# chaos/experiments/redis-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: redis-pod-failure
  namespace: production
spec:
  # pod-kill terminates a pod once; pod-failure keeps it unavailable
  # for the full duration, which is what we want here
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: redis
  duration: '60s'  # Redis unavailable for 60 seconds; test reconnection
```
Run the experiment and verify through your observability stack:
```bash
# Apply the chaos experiment
kubectl apply -f chaos/experiments/database-latency.yaml

# Monitor the experiment's status
watch kubectl get networkchaos database-latency-experiment -n production

# Check application health during the experiment
curl https://app.scanlyapp.com/api/health | jq .

# After the experiment concludes, check the error rate in Grafana
open https://grafana.internal/dashboard/errors?duration=past15m

# Remove the chaos experiment (it also auto-removes after `duration`)
kubectl delete networkchaos database-latency-experiment -n production
```
## The Game Day Format
Chaos experiments are most valuable when run as structured "Game Days":
| Phase | Duration | Activity |
|---|---|---|
| Pre-experiment | 30 min | Define steady state metrics; baseline all SLIs |
| Experiment design | 30 min | Agree on hypothesis; set abort criteria |
| Injection (chaos) | 15–60 min | Run experiment; observe; document |
| Analysis | 45 min | Compare to hypothesis; identify failures |
| Action items | 30 min | Write remediation tasks; prioritize |
| Retrospective | 15 min | What did we learn? What did we fear? |
**Abort criteria** (stop the experiment immediately if any of these occurs):
- Error rate exceeds 5% for more than 2 minutes
- Any payment-path transaction fails during the experiment
- On-call receives a customer escalation
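Abort criteria only work if someone — or something — is actually watching them. A tiny guard that tears the experiment down the moment the error-rate criterion trips; the metric source and the chaos resource name are placeholders for your own stack:

```shell
# Tear down the chaos experiment when the error-rate criterion trips.
# The caller supplies the current error percentage (e.g. from a metrics API).
abort_if_breached() {
  local error_pct="$1" threshold="${2:-5}"
  if [ "$error_pct" -gt "$threshold" ]; then
    echo "ABORT: error rate ${error_pct}% > ${threshold}%"
    kubectl delete networkchaos database-latency-experiment -n production
    return 1
  fi
  return 0
}

# Example polling loop (error_rate_pct is a placeholder metric query):
# while abort_if_breached "$(error_rate_pct)"; do sleep 30; done
```

Running this loop alongside the experiment turns the abort criteria from a checklist into an enforced safety net.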
Related reading: the QA guide to chaos engineering principles and tooling, load testing the system before introducing chaos failures, and the observability needed to measure system behaviour during chaos experiments.
## Building Your Chaos Experiment Library
| Experiment | Hypothesis | Expected Outcome |
|---|---|---|
| Database +500ms latency | App serves requests with degraded performance | p95 < 3s, zero 500s |
| Redis cache unavailable | App falls back to DB, higher latency | p95 < 5s, cache-miss state visible |
| Worker pod killed | Scan queue pauses, resumes on restart | No scan data loss, queue drains |
| CDN edge node failure | Traffic reroutes to origin | Higher latency, no errors |
| External API (Stripe/Paddle) +10s | Checkout shows retry flow | No data corruption, user informed |
| DNS resolution failure | Dependent service unreachable | Graceful error messages |
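Most of the expected outcomes above reduce to checking a latency percentile against a target. A small nearest-rank p95 sketch over collected samples (in milliseconds):

```shell
# p95 of latency samples passed as arguments (nearest-rank method)
p95() {
  local sorted n idx
  sorted=($(printf '%s\n' "$@" | sort -n))
  n=${#sorted[@]}
  idx=$(( (95 * n + 99) / 100 - 1 ))  # ceil(0.95 * n), zero-indexed
  echo "${sorted[$idx]}"
}

# Example: fail the experiment if p95 exceeds the 3000ms target
# [ "$(p95 "${samples[@]}")" -le 3000 ] || echo "hypothesis rejected"
```

Computing the percentile from raw samples you collected during the experiment, rather than eyeballing a dashboard, makes the pass/fail verdict reproducible in the write-up.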
Chaos engineering is not about inducing chaos — it's about systematically eliminating it by discovering and remediating weaknesses before they cause production incidents.
Always know your system's health after any deployment or infrastructure change: Try ScanlyApp free and run automated smoke tests that verify functional correctness alongside your chaos experiments.
