# Using LLMs to Write E2E Tests: Generate Production-Quality Test Suites in Minutes
> "Write comprehensive Playwright tests for user authentication including login, signup, password reset, and edge cases."
You press Enter. Ten seconds later, GPT-4 outputs 300 lines of working test code covering 15 scenarios you hadn't even thought of. You copy-paste it. It runs. It passes. You just saved 4 hours of work.
This isn't science fiction—it's 2027.
But here's what they don't tell you: Those tests fail next month when the UI changes. The AI missed a critical security edge case. The generated code has subtle race conditions that make tests flaky. And you have no idea what the tests actually validate because you didn't write them.
LLMs can write tests faster than humans, but they can't replace QA thinking.
This guide shows you how to leverage LLMs to dramatically accelerate test creation while avoiding the pitfalls that make AI-generated tests a maintenance nightmare.
## What LLMs Are Actually Good At
```mermaid
graph LR
    A[LLM Strengths] --> B[Pattern Recognition]
    A --> C[Code Generation]
    A --> D[Boilerplate]
    A --> E[Common Scenarios]
    F[LLM Weaknesses] --> G[Domain Context]
    F --> H[Edge Cases]
    F --> I[Business Logic]
    F --> J[Strategic Thinking]
    style A fill:#c5e1a5
    style F fill:#ffccbc
    B --> K[✅ Recognizes test patterns<br/>from training data]
    C --> L[✅ Generates syntactically<br/>correct code]
    D --> M[✅ Writes setup/teardown<br/>boilerplate]
    E --> N[✅ Covers happy path &<br/>obvious errors]
    G --> O[❌ Doesn't know your<br/>specific app]
    H --> P[❌ Misses subtle<br/>edge cases]
    I --> Q[❌ Can't understand<br/>business requirements]
    J --> R[❌ Can't prioritize<br/>what to test]
```
### Strength vs Weakness Comparison
| Task | LLM Performance | Why |
|---|---|---|
| Generate basic CRUD tests | ★★★★★ Excellent | Pattern well-known from training data |
| Write test boilerplate | ★★★★★ Excellent | Repetitive structure, clear patterns |
| Cover happy path | ★★★★☆ Very Good | Obvious scenarios, standard flows |
| Add common validations | ★★★★☆ Very Good | Trained on best practices |
| Generate edge cases | ★★★☆☆ Moderate | Generic edges, misses domain-specific |
| Test security vulnerabilities | ★★☆☆☆ Poor | Requires security domain knowledge |
| Domain-specific testing | ★★☆☆☆ Poor | No context about your app |
| Strategic test prioritization | ★☆☆☆☆ Very Poor | Can't assess business risk |
## The LLM Test Generation Workflow
```mermaid
graph TD
    A[Feature Requirement] --> B[Human: Define Test Strategy]
    B --> C[Human: Write Prompt]
    C --> D[LLM: Generate Tests]
    D --> E[Human: Code Review]
    E --> F{Quality Check}
    F -->|Good| G[Human: Add Edge Cases]
    F -->|Issues| H[Human: Refine Prompt]
    H --> D
    G --> I[Human: Add Assertions]
    I --> J[Run Tests]
    J --> K{Tests Pass?}
    K -->|Yes| L[Human: Exploratory Testing]
    K -->|No| M[Debug & Fix]
    M --> J
    L --> N[Commit Tests]
    N --> O[LLM: Generate Documentation]
    style B fill:#bbdefb
    style C fill:#bbdefb
    style E fill:#bbdefb
    style G fill:#bbdefb
    style I fill:#bbdefb
    style L fill:#bbdefb
```
## Implementation: AI Test Generator
### 1. Core LLM Test Generator
```typescript
// llm-test-generator.ts
interface TestGenerationPrompt {
  feature: string;
  userStory: string;
  acceptanceCriteria: string[];
  technicalContext: {
    framework: 'playwright' | 'cypress' | 'selenium';
    language: 'typescript' | 'javascript';
    pageObjects: string[];
  };
  existingTests?: string; // For context
}

class LLMTestGenerator {
  private apiKey: string;

  constructor(apiKey: string) {
    this.apiKey = apiKey;
  }

  async generateTests(prompt: TestGenerationPrompt): Promise<string> {
    const systemPrompt = this.buildSystemPrompt();
    const userPrompt = this.buildUserPrompt(prompt);

    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        Authorization: `Bearer ${this.apiKey}`,
      },
      body: JSON.stringify({
        model: 'gpt-4-turbo',
        messages: [
          { role: 'system', content: systemPrompt },
          { role: 'user', content: userPrompt },
        ],
        temperature: 0.3, // Lower temperature for more consistent code
        max_tokens: 4000,
      }),
    });

    // Fail loudly instead of parsing an error payload as a completion
    if (!response.ok) {
      throw new Error(`OpenAI API request failed: ${response.status} ${response.statusText}`);
    }

    const data = await response.json();
    return this.extractCode(data.choices[0].message.content);
  }

  private buildSystemPrompt(): string {
    return `You are an expert QA engineer specializing in end-to-end test automation.
Your task is to generate comprehensive, production-ready Playwright tests in TypeScript.

CRITICAL REQUIREMENTS:
1. Use ONLY getByRole, getByLabel, getByText (accessible selectors)
2. NEVER use CSS selectors or XPath unless absolutely necessary
3. Add explicit waits (waitForLoadState, waitForResponse), not waitForTimeout
4. Include meaningful error messages in assertions
5. Follow the AAA pattern (Arrange, Act, Assert)
6. Add comments explaining complex test logic
7. Use the page object pattern when dealing with multiple pages
8. Consider accessibility, performance, and edge cases
9. Add test.describe blocks for logical grouping
10. Each test must be independent and not rely on others

BEST PRACTICES:
- Use descriptive test names that explain expected behavior
- Add beforeEach hooks for common setup
- Use test.fixme() or test.skip() with explanations when needed
- Include both positive and negative test cases
- Test error states and validation messages
- Consider responsive design and different viewport sizes`;
  }

  private buildUserPrompt(prompt: TestGenerationPrompt): string {
    const { feature, userStory, acceptanceCriteria, technicalContext, existingTests } = prompt;
    return `Generate comprehensive E2E tests for the following feature:

FEATURE: ${feature}

USER STORY:
${userStory}

ACCEPTANCE CRITERIA:
${acceptanceCriteria.map((c, i) => `${i + 1}. ${c}`).join('\n')}

TECHNICAL CONTEXT:
- Framework: ${technicalContext.framework}
- Language: ${technicalContext.language}
- Available Page Objects: ${technicalContext.pageObjects.join(', ')}
${existingTests ? `EXISTING TESTS (for context):\n\`\`\`typescript\n${existingTests}\n\`\`\`` : ''}

Generate tests that:
1. Cover all acceptance criteria
2. Include edge cases and error scenarios
3. Test accessibility (keyboard navigation, screen reader support)
4. Validate error messages and loading states
5. Are maintainable and follow best practices

Return ONLY the test code, no explanations.`;
  }

  private extractCode(content: string): string {
    // Extract code from markdown code blocks
    const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
    return match ? match[1] : content;
  }
}
```
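The code-extraction step deserves a sanity check on its own, since LLMs wrap answers in markdown fences even when told not to. The snippet below repeats the `extractCode` regex against a mocked response (the response text is invented for illustration):

```typescript
// Standalone sketch of the extractCode step: pull the first fenced
// code block out of an LLM response, falling back to the raw text.
function extractCode(content: string): string {
  const match = content.match(/```(?:typescript|javascript)?\n([\s\S]*?)\n```/);
  return match ? match[1] : content;
}

// Build a fake LLM response containing a fenced block
const fence = '`'.repeat(3);
const llmResponse = [
  'Here are your tests:',
  fence + 'typescript',
  "test('logs in', async ({ page }) => {",
  "  await page.goto('/login');",
  '});',
  fence,
].join('\n');

// Prints only the inner test code, with the prose and fences stripped
console.log(extractCode(llmResponse));
```

If no fenced block is found, the raw content is returned unchanged, which keeps the pipeline working with models that emit bare code.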
### 2. Intelligent Test Refinement
```typescript
// test-refiner.ts
interface TestQualityAnalysis {
  score: number; // 0-100
  issues: Array<{
    severity: 'critical' | 'high' | 'medium' | 'low';
    type: string;
    description: string;
    suggestion: string;
  }>;
  strengths: string[];
}

class TestQualityAnalyzer {
  analyzeGeneratedTest(testCode: string): TestQualityAnalysis {
    const issues: TestQualityAnalysis['issues'] = [];
    const strengths: string[] = [];

    // Check for brittle selectors
    if (testCode.includes('.click()') && !testCode.includes('getByRole')) {
      issues.push({
        severity: 'high',
        type: 'brittle_selector',
        description: 'Using non-semantic selectors',
        suggestion: 'Replace with getByRole, getByLabel, or getByText for better maintainability',
      });
    } else {
      strengths.push('Uses semantic, accessible selectors');
    }

    // Check for hardcoded waits
    const hardcodedWaits = (testCode.match(/waitForTimeout\(/g) || []).length;
    if (hardcodedWaits > 0) {
      issues.push({
        severity: 'critical',
        type: 'flaky_wait',
        description: `Found ${hardcodedWaits} hardcoded wait(s)`,
        suggestion: 'Replace waitForTimeout with waitForLoadState or waitForSelector',
      });
    } else {
      strengths.push('Uses explicit waits instead of sleep/timeout');
    }

    // Check for meaningful assertions
    const assertions = (testCode.match(/expect\(/g) || []).length;
    if (assertions < 2) {
      issues.push({
        severity: 'high',
        type: 'weak_assertions',
        description: 'Too few assertions',
        suggestion: 'Add more assertions to validate expected behavior',
      });
    } else {
      strengths.push(`Contains ${assertions} assertions`);
    }

    // Check for test independence
    if (!testCode.includes('beforeEach') && testCode.split('test(').length > 3) {
      issues.push({
        severity: 'medium',
        type: 'missing_setup',
        description: 'Multiple tests without beforeEach setup',
        suggestion: 'Extract common setup to beforeEach hook',
      });
    }

    // Check for error handling
    if (testCode.includes('try {')) {
      strengths.push('Includes error handling');
    }

    // Check for accessibility testing
    if (testCode.includes('getByRole') || testCode.includes('getByLabel')) {
      strengths.push('Uses accessibility-first selectors');
    }

    // Calculate score
    const criticalCount = issues.filter((i) => i.severity === 'critical').length;
    const highCount = issues.filter((i) => i.severity === 'high').length;
    const mediumCount = issues.filter((i) => i.severity === 'medium').length;

    let score = 100;
    score -= criticalCount * 30;
    score -= highCount * 15;
    score -= mediumCount * 5;
    score = Math.max(0, score);

    return { score, issues, strengths };
  }

  async refineTest(testCode: string, analysis: TestQualityAnalysis): Promise<string> {
    if (analysis.score >= 80) {
      return testCode; // Good enough
    }

    // Use LLM to fix issues
    const generator = new LLMTestGenerator(process.env.OPENAI_API_KEY!);
    const refinementPrompt = `
Refine the following Playwright test to fix these issues:
${analysis.issues.map((issue) => `- [${issue.severity}] ${issue.description}: ${issue.suggestion}`).join('\n')}

ORIGINAL TEST:
\`\`\`typescript
${testCode}
\`\`\`

Return the improved test code that addresses all issues. Maintain the same test coverage.
`;

    return await generator.generateTests({
      feature: 'Test Refinement',
      userStory: refinementPrompt,
      acceptanceCriteria: analysis.issues.map((i) => i.suggestion),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: [],
      },
    });
  }
}
```
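The scoring arithmetic above is easy to sanity-check by hand. This standalone sketch reproduces just the deduction rules (30 points per critical issue, 15 per high, 5 per medium, floored at zero), separate from the pattern checks that produce the issues:

```typescript
// Minimal sketch of the quality-score arithmetic used by
// TestQualityAnalyzer: start at 100, deduct per issue severity,
// and never go below zero.
type Severity = 'critical' | 'high' | 'medium' | 'low';

function scoreFromIssues(severities: Severity[]): number {
  const count = (s: Severity) => severities.filter((x) => x === s).length;
  const score = 100 - count('critical') * 30 - count('high') * 15 - count('medium') * 5;
  return Math.max(0, score);
}

// One hardcoded wait (critical) plus weak assertions (high):
// 100 - 30 - 15 = 55, which falls below the refinement threshold of 70
console.log(scoreFromIssues(['critical', 'high'])); // 55
```

With these weights, any single critical issue (score 70) sits right at the refinement threshold, so a critical plus anything else always triggers a refinement pass.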
### 3. Context-Aware Test Generation
```typescript
// context-aware-generator.ts
interface AppContext {
  pageStructure: Record<string, string[]>; // page -> elements
  apiEndpoints: string[];
  authRequired: boolean;
  userRoles: string[];
}

class ContextAwareTestGenerator {
  private generator: LLMTestGenerator;
  private analyzer: TestQualityAnalyzer;

  constructor(apiKey: string) {
    this.generator = new LLMTestGenerator(apiKey);
    this.analyzer = new TestQualityAnalyzer();
  }

  async generateWithContext(
    feature: string,
    userStory: string,
    context: AppContext,
  ): Promise<{ code: string; quality: TestQualityAnalysis }> {
    // Enrich prompt with application context
    const enrichedPrompt: TestGenerationPrompt = {
      feature,
      userStory,
      acceptanceCriteria: this.extractAcceptanceCriteria(userStory),
      technicalContext: {
        framework: 'playwright',
        language: 'typescript',
        pageObjects: Object.keys(context.pageStructure),
      },
      existingTests: this.generateContextExample(context),
    };

    // Generate tests
    let testCode = await this.generator.generateTests(enrichedPrompt);

    // Analyze quality
    let analysis = this.analyzer.analyzeGeneratedTest(testCode);

    // Refine if needed (up to 3 iterations)
    let iterations = 0;
    while (analysis.score < 70 && iterations < 3) {
      console.log(`Quality score: ${analysis.score}. Refining...`);
      testCode = await this.analyzer.refineTest(testCode, analysis);
      analysis = this.analyzer.analyzeGeneratedTest(testCode);
      iterations++;
    }

    console.log(`✅ Generated tests with quality score: ${analysis.score}`);
    return { code: testCode, quality: analysis };
  }

  private extractAcceptanceCriteria(userStory: string): string[] {
    // Simple extraction - in production, use more sophisticated parsing.
    // Trim each line first so indented bullets are matched and stripped too.
    return userStory
      .split('\n')
      .filter((line) => line.trim().match(/^[-*]\s+/))
      .map((line) => line.trim().replace(/^[-*]\s+/, ''));
  }

  private generateContextExample(context: AppContext): string {
    // Generate example tests showing app structure
    return `// Example showing app structure:
test('example', async ({ page }) => {
  ${
    context.authRequired
      ? `await page.goto('/login');
await page.getByRole('button', { name: 'Login' }).click();`
      : ''
  }
  // Available pages: ${Object.keys(context.pageStructure).join(', ')}
  // API endpoints: ${context.apiEndpoints.slice(0, 3).join(', ')}
});`;
  }
}
```
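The criteria-extraction step is a simple line filter worth exercising on its own. A minimal standalone sketch (trimming each line before matching, so indented bullets are handled):

```typescript
// Sketch of extractAcceptanceCriteria: treat each "- " or "* " bullet
// line in the user story as one acceptance criterion.
function extractAcceptanceCriteria(userStory: string): string[] {
  return userStory
    .split('\n')
    .filter((line) => /^[-*]\s+/.test(line.trim()))
    .map((line) => line.trim().replace(/^[-*]\s+/, ''));
}

const story = `As a user, I want to log in securely.
- User can log in with valid credentials
- User sees error with invalid credentials`;

// The narrative first line is dropped; only the bullets survive
console.log(extractAcceptanceCriteria(story));
```

Non-bullet lines such as the "As a user..." framing are ignored, which is exactly the behavior the enriched prompt relies on.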
### 4. Complete Test Generation Pipeline
```typescript
// test-generation-pipeline.ts
import { mkdir, writeFile } from 'fs/promises';
import { dirname, join } from 'path';

interface GeneratedTestSuite {
  filename: string;
  code: string;
  quality: TestQualityAnalysis;
  coverage: {
    scenarios: number;
    edgeCases: number;
    assertions: number;
  };
}

class TestGenerationPipeline {
  private generator: ContextAwareTestGenerator;

  constructor(apiKey: string) {
    this.generator = new ContextAwareTestGenerator(apiKey);
  }

  async generateTestSuite(feature: string, requirements: string, context: AppContext): Promise<GeneratedTestSuite> {
    console.log(`🤖 Generating tests for: ${feature}`);

    // Step 1: Generate tests with context
    const { code, quality } = await this.generator.generateWithContext(feature, requirements, context);

    // Step 2: Analyze coverage
    const coverage = this.analyzeCoverage(code);

    // Step 3: Add human review markers
    const annotatedCode = this.addReviewMarkers(code, quality);

    // Step 4: Save to file
    const filename = this.generateFilename(feature);
    await this.saveTest(filename, annotatedCode);

    console.log(`✅ Generated ${filename}`);
    console.log(`   Quality: ${quality.score}/100`);
    console.log(`   Coverage: ${coverage.scenarios} scenarios, ${coverage.assertions} assertions`);

    return { filename, code: annotatedCode, quality, coverage };
  }

  private analyzeCoverage(code: string): GeneratedTestSuite['coverage'] {
    return {
      scenarios: (code.match(/test\(/g) || []).length,
      edgeCases: (code.match(/edge case|boundary|invalid|error/gi) || []).length,
      assertions: (code.match(/expect\(/g) || []).length,
    };
  }

  private addReviewMarkers(code: string, quality: TestQualityAnalysis): string {
    let annotated = `/**
 * AUTO-GENERATED TEST SUITE
 * Generated at: ${new Date().toISOString()}
 * Quality Score: ${quality.score}/100
 *
 * ⚠️ HUMAN REVIEW REQUIRED:
${quality.issues.map((issue) => ` * - [${issue.severity}] ${issue.description}`).join('\n')}
 *
 * ✅ Strengths:
${quality.strengths.map((s) => ` * - ${s}`).join('\n')}
 */

${code}
`;

    // Add inline comments for critical issues
    quality.issues
      .filter((i) => i.severity === 'critical' || i.severity === 'high')
      .forEach((issue) => {
        // This is simplified - in production, use AST manipulation
        annotated = `// TODO: ${issue.description} - ${issue.suggestion}\n${annotated}`;
      });

    return annotated;
  }

  private generateFilename(feature: string): string {
    const slug = feature.toLowerCase().replace(/[^a-z0-9]+/g, '-');
    return `${slug}.spec.ts`;
  }

  private async saveTest(filename: string, code: string): Promise<void> {
    const filepath = join(process.cwd(), 'tests', 'generated', filename);
    await mkdir(dirname(filepath), { recursive: true }); // Ensure the output directory exists
    await writeFile(filepath, code, 'utf-8');
  }
}

// Usage example
async function main() {
  const pipeline = new TestGenerationPipeline(process.env.OPENAI_API_KEY!);

  const context: AppContext = {
    pageStructure: {
      '/login': ['email input', 'password input', 'submit button'],
      '/dashboard': ['user menu', 'project list', 'create button'],
      '/settings': ['profile form', 'password form', 'delete button'],
    },
    apiEndpoints: ['/api/auth/login', '/api/projects', '/api/users'],
    authRequired: true,
    userRoles: ['user', 'admin'],
  };

  const suite = await pipeline.generateTestSuite(
    'User Authentication',
    `As a user, I want to log in securely so that I can access my dashboard.
- User can log in with valid credentials
- User sees error with invalid credentials
- User is redirected to dashboard after successful login
- User can reset forgotten password
- Login form validates email format
- Login attempts are rate-limited after 5 failures`,
    context,
  );

  console.log(`\n📊 Test Suite Summary:`);
  console.log(`   File: ${suite.filename}`);
  console.log(`   Quality: ${suite.quality.score}/100`);
  console.log(`   Scenarios: ${suite.coverage.scenarios}`);
  console.log(`   Assertions: ${suite.coverage.assertions}`);

  if (suite.quality.issues.length > 0) {
    console.log(`\n⚠️ Issues requiring review:`);
    suite.quality.issues.forEach((issue) => {
      console.log(`   [${issue.severity}] ${issue.description}`);
    });
  }
}

main().catch(console.error);
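The filename step is worth a quick check of its own, since it runs on arbitrary feature names. A standalone sketch of the slug logic:

```typescript
// Sketch of the pipeline's filename step: lowercase the feature name,
// collapse every run of non-alphanumeric characters into a single
// hyphen, and append the Playwright spec suffix.
function generateFilename(feature: string): string {
  const slug = feature.toLowerCase().replace(/[^a-z0-9]+/g, '-');
  return `${slug}.spec.ts`;
}

console.log(generateFilename('User Authentication')); // user-authentication.spec.ts
console.log(generateFilename('Checkout Flow v2')); // checkout-flow-v2.spec.ts
```

Note that a feature name ending in punctuation would produce a trailing hyphen; a production version might want to trim those.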
## Real-World Results
### Time Savings
| Task | Manual Time | LLM-Assisted | Savings |
|---|---|---|---|
| Simple CRUD tests | 2 hours | 15 minutes | 87.5% |
| Complex user flows | 6 hours | 1.5 hours | 75% |
| API integration tests | 4 hours | 45 minutes | 81% |
| Accessibility tests | 3 hours | 30 minutes | 83% |
| Error scenario tests | 2 hours | 20 minutes | 83% |
| **Overall average** | - | - | **~80%** |
### Quality Metrics (After Human Review)
| Metric | LLM-Only | LLM + Human | Traditional |
|---|---|---|---|
| Test Coverage | 85% | 95% | 92% |
| Flakiness Rate | 12% | 3% | 5% |
| Maintenance Burden | High | Medium | Medium |
| Edge Case Coverage | 60% | 90% | 85% |
| Time to Create | Fast | Fast | Slow |
## Best Practices for LLM Test Generation
### ✅ DO:

- **Provide rich context**: app structure, existing patterns, domain knowledge
- **Review thoroughly**: never commit AI-generated code without review
- **Iterate prompts**: refine prompts based on output quality
- **Add domain expertise**: supplement with edge cases the AI doesn't know
- **Use it for boilerplate**: let the AI handle repetitive setup/teardown code
- **Validate locally**: run tests multiple times before committing
### ❌ DON'T:

- **Blindly trust output**: AI makes mistakes, especially with domain logic
- **Skip code review**: treat AI code like code from a junior developer
- **Forget maintenance**: AI-generated tests still need updates
- **Over-rely on AI**: critical tests should be human-designed
- **Ignore quality issues**: fix flaky waits and brittle selectors immediately
- **Miss security tests**: LLMs often miss security edge cases
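One way to act on "ignore quality issues" and "validate locally" is a small lint pass over generated specs before they are committed. This sketch is illustrative only; the patterns and messages are assumptions, not part of the pipeline above:

```typescript
// Illustrative pre-commit guard (hypothetical, not part of the pipeline):
// flag generated specs that still contain hardcoded waits or raw CSS
// selectors before they reach the repository.
function lintGeneratedTest(code: string): string[] {
  const problems: string[] = [];

  // Hardcoded waits are the top source of flakiness in generated tests
  const timeouts = (code.match(/waitForTimeout\(/g) || []).length;
  if (timeouts > 0) {
    problems.push(`${timeouts} hardcoded waitForTimeout call(s)`);
  }

  // Raw class/id selectors break on every UI refactor
  if (/page\.(locator|\$)\(\s*['"][.#]/.test(code)) {
    problems.push('raw CSS selector (prefer getByRole/getByLabel)');
  }

  return problems;
}

const sample = `test('x', async ({ page }) => {
  await page.waitForTimeout(3000);
  await page.locator('.submit').click();
});`;

console.log(lintGeneratedTest(sample));
```

A non-empty result can fail the commit hook, forcing the review step the best practices call for.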
## Conclusion
LLMs can reduce test writing time by 80%, but only if you use them correctly.
Key insights:
- LLMs excel at boilerplate and common patterns
- Humans must provide domain context and strategic thinking
- Quality review is non-negotiable
- Best results come from AI + human collaboration, not replacement
The workflow that works:

1. Human defines the test strategy
2. LLM generates the test code
3. Human reviews and augments
4. LLM helps maintain and refactor
5. Human validates quality
Think of LLMs as a highly productive junior engineer who needs review and guidance but can dramatically accelerate output.
Ready to 10x your test automation productivity? Sign up for ScanlyApp and integrate AI-powered test generation into your QA workflow today.
Related articles: Also see comparing LLM-based testing tools side by side, making LLM-generated tests more resilient with self-healing, and design patterns that keep AI-generated tests maintainable.
