
The Hidden Cost of AI-Generated Tests

February 20, 2024 · Josh Butler · Technical

"We achieved 100% code coverage!" The junior dev was ecstatic. AI had generated 1,247 tests overnight. Every single line of code was tested. The coverage report was a beautiful sea of green.

Then we pushed a breaking change to production. Not a single test failed.

The Test That Tests Nothing

Here's what AI was generating:

// The function
function calculateDiscount(price, customerType) {
  if (customerType === 'premium') {
    return price * 0.8;
  }
  return price;
}

// AI's "test"
describe('calculateDiscount', () => {
  it('should calculate discount', () => {
    const result = calculateDiscount(100, 'premium');
    expect(result).toBeDefined();
    expect(typeof result).toBe('number');
    expect(result).toBeLessThanOrEqual(100);
  });
});

Technically correct. Completely useless. The test would pass if the function returned 0, 80, or 99.99.
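
To see how little that proves, here's a sketch of a deliberately broken version of the function; the discount logic is gone entirely, yet the AI test above still passes:

// Deliberately broken: the premium discount is gone entirely
function calculateDiscount(price, customerType) {
  return price; // still defined, still a number, still <= 100, so the test stays green
}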

The Coverage Illusion

Our coverage report showed 100%. Here's what was actually tested:

  • ✅ Functions exist
  • ✅ Functions return something
  • ✅ No syntax errors
  • ❌ Business logic
  • ❌ Edge cases
  • ❌ Integration between components
  • ❌ Actual user scenarios

Real AI Test Disasters

The Mock Everything Approach

// Component that fetches user data
function UserProfile({ userId }) {
  const user = useUserData(userId);
  return <div>{user.name}</div>;
}

// AI test
test('renders UserProfile', () => {
  // AI mocked EVERYTHING
  jest.mock('./hooks/useUserData', () => ({
    useUserData: () => ({ name: 'Test User' })
  }));
  
  render(<UserProfile userId={1} />);
  expect(screen.getByText('Test User')).toBeInTheDocument();
});

The test passes even if useUserData is completely broken. We're testing our mock, not our code.
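
A less brittle option is to stub the boundary the hook talks to rather than the hook itself. A sketch, assuming useUserData fetches user data over HTTP, handles its own loading state, and the test environment provides a global fetch (the import path is illustrative):

import { render, screen } from '@testing-library/react';
import { UserProfile } from './UserProfile'; // path is illustrative

test('renders the name returned by the API', async () => {
  // Stub the network layer, not the hook under test
  jest.spyOn(global, 'fetch').mockResolvedValue({
    ok: true,
    json: async () => ({ name: 'Ada Lovelace' }),
  });

  render(<UserProfile userId={1} />);

  // findByText waits for the fetch to resolve and the component to re-render
  expect(await screen.findByText('Ada Lovelace')).toBeInTheDocument();

  global.fetch.mockRestore();
});

Now a regression inside useUserData actually fails the test.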

The Snapshot Spam

// AI's solution to testing complex components
test('matches snapshot', () => {
  const component = render(<ComplexDashboard />);
  expect(component).toMatchSnapshot();
});

test('matches snapshot with props', () => {
  const component = render(<ComplexDashboard showChart />);
  expect(component).toMatchSnapshot();
});

// 47 more snapshot tests...

Now every tiny CSS change breaks 50 tests. Developers just update snapshots without looking.
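
A more durable alternative is a handful of assertions about behavior that should not change, so only a real regression breaks them. A sketch, assuming the dashboard's chart renders with an accessible name:

test('shows the revenue chart when showChart is set', () => {
  render(<ComplexDashboard showChart />);
  expect(screen.getByRole('img', { name: /revenue chart/i })).toBeInTheDocument();
});

test('hides the chart by default', () => {
  render(<ComplexDashboard />);
  expect(screen.queryByRole('img', { name: /revenue chart/i })).not.toBeInTheDocument();
});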

The Tautological Test

// Function
function addNumbers(a, b) {
  return a + b;
}

// AI test
test('addNumbers adds numbers', () => {
  const mockAdd = jest.fn((a, b) => a + b);
  expect(mockAdd(2, 3)).toBe(5);
});

// It's... testing its own mock
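
The fix is to call the real function and assert on real values; a sketch:

test('addNumbers adds numbers', () => {
  expect(addNumbers(2, 3)).toBe(5);
  expect(addNumbers(-1, 1)).toBe(0);
  expect(addNumbers(0.1, 0.2)).toBeCloseTo(0.3); // floating-point sums need toBeCloseTo
});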

Why AI Tests Look Good But Aren't

1. AI Optimizes for Coverage, Not Quality

Ask for tests and the AI asks itself "How can I execute every line?" rather than "How can I verify this works correctly?"

2. No Understanding of Intent

// What the function does
function validateAge(age) {
  return age >= 18 && age <= 100;
}

// What AI tests
expect(validateAge(50)).toBe(true); // ✓ Happy path

// What AI misses
validateAge(-1)    // Should be false
validateAge(0)     // Edge case
validateAge(17)    // Boundary
validateAge(18)    // Boundary
validateAge(100)   // Boundary  
validateAge(101)   // Boundary
validateAge('18')  // Type coercion?
validateAge(null)  // Error handling?
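
That list of missed cases converts almost mechanically into a parameterized test; a sketch using Jest's test.each:

test.each([
  [-1, false],   // negative input
  [0, false],    // zero
  [17, false],   // just under the boundary
  [18, true],    // lower boundary
  [100, true],   // upper boundary
  [101, false],  // just over the boundary
])('validateAge(%p) returns %p', (age, expected) => {
  expect(validateAge(age)).toBe(expected);
});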

3. The Implementation Mirror

AI reads the code and writes tests that mirror the implementation. If the implementation is wrong, the tests "prove" it's right.
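
A sketch of how that plays out, assuming the requirement is "18 and over counts as an adult" but the code uses the wrong comparison:

// Buggy implementation: > instead of >=
function isAdult(age) {
  return age > 18;
}

// A test written by reading the code mirrors the bug and passes
test('isAdult returns false for 18', () => {
  expect(isAdult(18)).toBe(false); // green, yet the requirement says this should be true
});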

The Real Cost

  • False confidence: "We have tests!" (that test nothing)
  • Maintenance burden: 1000 bad tests to update
  • Slow CI/CD: Running useless tests takes time
  • Hidden bugs: Real issues slip through
  • Developer fatigue: Updating broken tests constantly
""

How to Generate Tests That Actually Work

1. Behavior-Driven Prompts

// Bad prompt
"Write tests for this function"

// Good prompt
"Write tests for calculateDiscount that verify:
- Premium customers get exactly 20% off
- Regular customers get no discount
- Invalid customer types throw errors
- Negative prices are handled
- Test actual business rules, not implementation"

2. The Test Template That Works

Generate tests following this pattern:

describe('[Function Name]', () => {
  describe('Happy Path', () => {
    // Test expected usage
  });
  
  describe('Edge Cases', () => {
    // Boundaries, empty values, nulls
  });
  
  describe('Error Cases', () => {
    // What should fail and how
  });
  
  describe('Business Rules', () => {
    // Specific requirements
  });
});
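
Filled in for the validateAge function from earlier, the skeleton looks roughly like this (a sketch; the parameterized boundary test above covers the same cases more compactly):

describe('validateAge', () => {
  describe('Happy Path', () => {
    test('accepts a typical adult age', () => {
      expect(validateAge(35)).toBe(true);
    });
  });

  describe('Edge Cases', () => {
    test('accepts both boundaries', () => {
      expect(validateAge(18)).toBe(true);
      expect(validateAge(100)).toBe(true);
    });
  });

  describe('Error Cases', () => {
    test('rejects out-of-range ages', () => {
      expect(validateAge(17)).toBe(false);
      expect(validateAge(101)).toBe(false);
    });
  });

  describe('Business Rules', () => {
    test('only 18 through 100 inclusive is valid', () => {
      expect(validateAge(0)).toBe(false);
      expect(validateAge(150)).toBe(false);
    });
  });
});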

3. Integration Over Unit

// Instead of testing every function in isolation
// Test user journeys

test('user can complete purchase flow', async () => {
  // Login
  await userLogin('[email protected]', 'password');
  
  // Add item
  await addToCart('PRODUCT-123');
  
  // Checkout
  await checkout({ 
    payment: 'card',
    shipping: 'standard'
  });
  
  // Verify
  expect(await getOrderStatus()).toBe('confirmed');
});

The Test Pyramid (AI Edition)

  • What AI generates: 94% unit tests, 5% integration tests, 1% UI tests. "Test everything!"
  • What you actually need: 20% unit tests, 70% integration tests, 10% UI tests. "Test what matters!"

Red Flags in AI Tests

  • Tests that never fail when you break the code
  • Excessive mocking (especially of the thing being tested)
  • Tests that just check types or existence
  • Copy-pasted tests with minor variations
  • No edge case testing
  • No error case testing

The Approach That Works

  1. AI generates test scenarios (not code)
  2. Human reviews scenarios for completeness
  3. AI writes initial test code
  4. Human verifies tests actually fail when code is broken (see the sketch below)
  5. Keep only valuable tests
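
Step 4 doesn't need tooling to start with: temporarily sabotage the rule, rerun the suite, and make sure something goes red. A sketch:

// Temporarily break the business rule before running the suite.
// If every test still passes, the suite isn't guarding this behavior.
function calculateDiscount(price, customerType) {
  if (customerType === 'premium') {
    return price * 0.9; // deliberately wrong: 10% off instead of 20%
  }
  return price;
}

Mutation-testing tools such as StrykerJS automate the same idea across a whole codebase.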

A Good AI-Generated Test

describe('calculateDiscount', () => {
  describe('Premium customers', () => {
    test('receive exactly 20% discount', () => {
      expect(calculateDiscount(100, 'premium')).toBe(80);
      expect(calculateDiscount(50.50, 'premium')).toBeCloseTo(40.40); // toBeCloseTo avoids floating-point flakiness
    });
  });
  
  describe('Regular customers', () => {
    test('receive no discount', () => {
      expect(calculateDiscount(100, 'regular')).toBe(100);
    });
  });
  
  describe('Edge cases', () => {
    test('handles zero price', () => {
      expect(calculateDiscount(0, 'premium')).toBe(0);
    });
    
    test('throws on negative price', () => {
      expect(() => calculateDiscount(-10, 'premium'))
        .toThrow('Price cannot be negative');
    });
    
    test('throws on invalid customer type', () => {
      expect(() => calculateDiscount(100, 'invalid'))
        .toThrow('Unknown customer type');
    });
  });
});
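
Note that these tests would fail against the original calculateDiscount, which never throws; that's the point, they surface the missing validation instead of rubber-stamping the code as written. A sketch of an implementation that satisfies them:

function calculateDiscount(price, customerType) {
  if (price < 0) {
    throw new Error('Price cannot be negative');
  }
  if (customerType === 'premium') {
    return price * 0.8;
  }
  if (customerType === 'regular') {
    return price;
  }
  throw new Error('Unknown customer type');
}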

AI-generated tests are like AI-generated art - impressive quantity, questionable quality. The goal isn't to have the most tests or the highest coverage. It's to have tests that actually catch bugs before your users do. Use AI to handle the boilerplate, but always verify the tests actually test something meaningful. A failing test that catches a real bug is worth a thousand passing tests that catch nothing.
