# Chaos Engineering: Break Your System on Purpose Before Your Users Do It for You
Netflix's Chaos Monkey became famous for randomly terminating production servers. The philosophy behind it was simple: if you don't test your resilience, your failures will be tests you didn't design. Better to discover the weakness yourself, on a weekday afternoon with your full team available, than at 2 AM on Black Friday.
Chaos engineering does not mean randomly breaking things. It means forming a hypothesis ("the system will continue to serve requests within normal latency bounds even if the database response time increases to 5 seconds"), designing a controlled experiment to test it, and learning from the result.
This guide covers practical chaos engineering for SaaS applications — without requiring a dedicated chaos platform to get started.
## The Chaos Engineering Process
```mermaid
flowchart LR
    A["Define steady state:<br/>what does healthy look like?"] --> B
    B["Form hypothesis:<br/>if X fails, Y should happen"] --> C
    C["Design experiment:<br/>minimal blast radius"] --> D
    D["Run experiment:<br/>inject failure"] --> E
    E{"Hypothesis<br/>confirmed?"}
    E -->|Yes| F["Document &<br/>expand scope"]
    E -->|No| G["Fix weakness,<br/>re-test"]
    G --> D
```
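The "define steady state" step is the one teams most often skip, yet every later step depends on it. A minimal sketch of a steady-state gate in bash — the endpoint URL and thresholds below are illustrative placeholders, not values from a real deployment:

```shell
#!/usr/bin/env bash
# Minimal steady-state gate: sample a health endpoint and compute the
# error rate. Run it before the experiment (baseline) and again during it.
check_steady_state() {
  local url="$1" samples="${2:-20}" max_error_pct="${3:-5}"
  local errors=0 code
  for _ in $(seq "$samples"); do
    code=$(curl -s -o /dev/null -w '%{http_code}' "$url") || code=000
    if [ "$code" -ge 500 ] || [ "$code" -eq 000 ]; then
      errors=$((errors + 1))
    fi
  done
  local pct=$((errors * 100 / samples))
  echo "error rate: ${pct}% (${errors}/${samples} samples failed)"
  [ "$pct" -le "$max_error_pct" ]  # exit status is the go/no-go signal
}

# Usage: check_steady_state https://app.example.com/api/health 20 5
```

The same function doubles as an abort check while the experiment runs: if the gate that passed at baseline starts failing, you have found the boundary of your hypothesis.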
## Starting Without a Chaos Platform
You do not need Chaos Mesh or LitmusChaos to start. Simple chaos experiments can be run with standard Linux tools:
### Network Latency with tc (Traffic Control)
```bash
# On the application server or within the container:
# Add 500ms latency to ALL outbound traffic on eth0
# (service-specific targeting is covered in the next section)
sudo tc qdisc add dev eth0 root netem delay 500ms

# Add jitter: 500ms ± 150ms, with 25% correlation between successive packets.
# Only one root netem qdisc can exist per device, so use `change`
# (not `add`) to modify an existing rule:
sudo tc qdisc change dev eth0 root netem delay 500ms 150ms 25%

# Add packet loss (1%)
sudo tc qdisc change dev eth0 root netem loss 1%

# Combined: 200ms delay + 50ms jitter (25% correlation) + 0.1% loss
sudo tc qdisc change dev eth0 root netem delay 200ms 50ms 25% loss 0.1%

# Remove all rules
sudo tc qdisc del dev eth0 root netem
```
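A forgotten netem rule keeps degrading traffic long after everyone has moved on, so it is worth bounding every injection with an automatic rollback. A sketch of such a wrapper — the interface name and netem arguments in the example are placeholders:

```shell
# Run a netem experiment with a hard time limit and guaranteed rollback.
# Usage: run_with_netem <dev> <duration_s> <netem args...>
run_with_netem() {
  local dev="$1" dur="$2"; shift 2
  (
    # Subshell trap removes the rule on normal exit or interruption
    trap "sudo tc qdisc del dev $dev root netem 2>/dev/null" EXIT INT TERM
    sudo tc qdisc add dev "$dev" root netem "$@"
    echo "netem active on $dev for ${dur}s (Ctrl-C to roll back early)"
    sleep "$dur"
  )
}

# Example: 500ms ± 150ms latency for 5 minutes, then auto-rollback
# run_with_netem eth0 300 delay 500ms 150ms 25%
```

The subshell keeps the trap scoped to the experiment, so the calling shell's own traps are untouched.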
### Targeting Specific Services
```bash
# Only inject latency on traffic to Redis (destination port 6379)
sudo tc qdisc add dev eth0 root handle 1: prio
sudo tc qdisc add dev eth0 parent 1:3 handle 30: netem delay 3000ms
sudo tc filter add dev eth0 protocol ip parent 1:0 prio 3 u32 \
  match ip dport 6379 0xffff flowid 1:3
```
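Before trusting the filter, confirm the latency lands only where intended. A rough timing helper (the hostnames in the usage comments are placeholders; `%3N` requires GNU date):

```shell
# Wall-clock time of an arbitrary command, in milliseconds
measure_ms() {
  local start end
  start=$(date +%s%3N)
  "$@" > /dev/null 2>&1 || true
  end=$(date +%s%3N)
  echo $((end - start))
}

# measure_ms redis-cli -h cache.internal ping    # should jump to ~3000ms
# measure_ms curl -s https://api.internal/health # should stay at baseline
```

If both calls slow down, the filter is not matching and the netem delay is hitting the whole device.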
## Application-Level Chaos Testing with Playwright
For black-box chaos testing, use Playwright's route interception to inject failures at the network layer:
```typescript
// tests/chaos/latency.test.ts
import { test, expect } from '@playwright/test';

test('dashboard loads with acceptable degraded UX when API responds in 3s', async ({ page }) => {
  // Inject 3-second latency on API calls
  await page.route('**/api/**', async (route) => {
    await new Promise((resolve) => setTimeout(resolve, 3000));
    await route.continue();
  });

  await page.goto('/dashboard');

  // Should show loading states, not error pages
  await expect(page.locator('[data-testid="loading-skeleton"]')).toBeVisible();

  // Should eventually load (not time out or show an error)
  await expect(page.locator('[data-testid="dashboard-content"]')).toBeVisible({
    timeout: 15_000,
  });

  // Should not flash an error state
  await expect(page.locator('[data-testid="error-state"]')).not.toBeVisible();
});

test('graceful degradation when search API is completely down', async ({ page }) => {
  // Total failure of the search API
  await page.route('**/api/search**', (route) => route.abort('failed'));

  await page.goto('/dashboard');
  await page.fill('[data-testid="search-input"]', 'test query');
  await page.keyboard.press('Enter');

  // Should show a user-friendly error, not a broken UI
  await expect(page.locator('[data-testid="search-error"]')).toBeVisible();
  await expect(page.locator('[data-testid="search-error"]')).toContainText(/unavailable|try again/i);

  // Navigation and other features should still work
  await page.click('[data-testid="nav-settings"]');
  await expect(page).toHaveURL('/settings');
});

test('retry button works after transient payment API failure', async ({ page }) => {
  let callCount = 0;
  await page.route('**/api/billing/upgrade**', async (route) => {
    callCount++;
    if (callCount === 1) {
      // First call fails (simulates a transient provider outage)
      await route.fulfill({ status: 503, body: 'Service Unavailable' });
    } else {
      await route.continue();
    }
  });

  await page.goto('/billing/upgrade');
  await page.click('[data-testid="upgrade-btn"]');

  // Should show a retry option, not a permanent failure
  await expect(page.locator('[data-testid="retry-btn"]')).toBeVisible();

  // Retry should succeed
  await page.click('[data-testid="retry-btn"]');
  await expect(page.locator('[data-testid="upgrade-success"]')).toBeVisible();
});
```
## Chaos Mesh: Production-Grade Chaos Injection
For teams running Kubernetes, Chaos Mesh provides a declarative chaos injection API:
```yaml
# chaos/experiments/database-latency.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: database-latency-experiment
  namespace: production
spec:
  action: delay
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: frontend
  delay:
    latency: '500ms'
    correlation: '25'
    jitter: '100ms'
  direction: to
  externalTargets:
    - 'postgres-service.production.svc.cluster.local'
  # Automatically stop after 5 minutes (blast radius control)
  duration: '5m'
```
```yaml
# chaos/experiments/redis-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: redis-pod-failure
  namespace: production
spec:
  # pod-kill terminates a pod once; pod-failure keeps it unavailable
  # for the full duration, which is what we want here
  action: pod-failure
  mode: one
  selector:
    namespaces:
      - production
    labelSelectors:
      app: redis
  duration: '60s'  # Redis unavailable for 60 seconds; test reconnection
```
Run the experiment and verify through your observability stack:
```bash
# Apply the chaos experiment
kubectl apply -f chaos/experiments/database-latency.yaml

# Monitor the experiment's status
watch kubectl get networkchaos database-latency-experiment -n production

# Check application health during the experiment
curl https://app.scanlyapp.com/api/health | jq .

# After the experiment concludes, check the error rate in Grafana
open https://grafana.internal/dashboard/errors?duration=past15m

# Remove the chaos experiment (it also auto-removes after `duration`)
kubectl delete networkchaos database-latency-experiment -n production
```
## The Game Day Format
Chaos experiments are most valuable when run as structured "Game Days":
| Phase | Duration | Activity |
|---|---|---|
| Pre-experiment | 30 min | Define steady state metrics; baseline all SLIs |
| Experiment design | 30 min | Agree on hypothesis; set abort criteria |
| Injection (chaos) | 15–60 min | Run experiment; observe; document |
| Analysis | 45 min | Compare to hypothesis; identify failures |
| Action items | 30 min | Write remediation tasks; prioritize |
| Retrospective | 15 min | What did we learn? What did we fear? |
**Abort criteria** (stop the experiment immediately if any of these occurs):
- Error rate exceeds 5% for more than 2 minutes
- Any payment-path transaction fails during the experiment
- On-call receives a customer escalation
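Abort criteria only work if someone — or something — is actually watching them. A tiny guard that tears the experiment down the moment the error-rate criterion trips; the metric source and the chaos resource name are placeholders for your own stack:

```shell
# Tear down the chaos experiment when the error-rate criterion trips.
# The caller supplies the current error percentage (e.g. from a metrics API).
abort_if_breached() {
  local error_pct="$1" threshold="${2:-5}"
  if [ "$error_pct" -gt "$threshold" ]; then
    echo "ABORT: error rate ${error_pct}% > ${threshold}%"
    kubectl delete networkchaos database-latency-experiment -n production
    return 1
  fi
  return 0
}

# Example polling loop (error_rate_pct is a placeholder metric query):
# while abort_if_breached "$(error_rate_pct)"; do sleep 30; done
```

Running this loop alongside the experiment turns the abort criteria from a checklist into an enforced safety net.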
Related reading: the QA guide to chaos engineering principles and tooling, load testing the system before introducing chaos failures, and the observability needed to measure system behaviour during chaos experiments.
## Building Your Chaos Experiment Library
| Experiment | Hypothesis | Expected Outcome |
|---|---|---|
| Database +500ms latency | App serves requests with degraded performance | p95 < 3s, zero 500s |
| Redis cache unavailable | App falls back to DB, higher latency | p95 < 5s, cache-miss state visible |
| Worker pod killed | Scan queue pauses, resumes on restart | No scan data loss, queue drains |
| CDN edge node failure | Traffic reroutes to origin | Higher latency, no errors |
| External API (Stripe/Paddle) +10s | Checkout shows retry flow | No data corruption, user informed |
| DNS resolution failure | Dependent service unreachable | Graceful error messages |
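Most of the expected outcomes above reduce to checking a latency percentile against a target. A small nearest-rank p95 sketch over collected samples (in milliseconds):

```shell
# p95 of latency samples passed as arguments (nearest-rank method)
p95() {
  local sorted n idx
  sorted=($(printf '%s\n' "$@" | sort -n))
  n=${#sorted[@]}
  idx=$(( (95 * n + 99) / 100 - 1 ))  # ceil(0.95 * n), zero-indexed
  echo "${sorted[$idx]}"
}

# Example: fail the experiment if p95 exceeds the 3000ms target
# [ "$(p95 "${samples[@]}")" -le 3000 ] || echo "hypothesis rejected"
```

Computing the percentile from raw samples you collected during the experiment, rather than eyeballing a dashboard, makes the pass/fail verdict reproducible in the write-up.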
Chaos engineering is not about inducing chaos — it's about systematically eliminating it by discovering and remediating weaknesses before they cause production incidents.
Always know your system's health after any deployment or infrastructure change: Try ScanlyApp free and run automated smoke tests that verify functional correctness alongside your chaos experiments.
