AI Regex: Now You Have Two Problems

March 5, 2024 Josh Butler Technical

"Some people, when confronted with a problem, think 'I know, I'll use regular expressions.' Now they have two problems." - Jamie Zawinski

When you ask AI to write regex, you get three problems. The third is explaining to your team why the email validator is 3,000 characters long and requires 2GB of RAM.

The Simple Request

Me: "I need a regex to validate email addresses"

AI: "Here's a comprehensive email validation regex!"

/(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")/

Me: "What have you done"

The Phone Number Catastrophe

// Request: "Validate US phone numbers"
// AI's response:

const phoneRegex = /^(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?$/;

// Matches:
// ✓ (555) 123-4567
// ✓ 555.123.4567
// ✓ +1-555-123-4567
// ✓ 555-GET-FOOD (wait, what?)
// ✓ 123456789012345 (that's... too many)
// ✓ My childhood trauma (somehow)
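
One way to sidestep a pattern like this is to strip the formatting first and then count digits. The following is a minimal sketch, assuming 10-digit US numbers with an optional leading country code 1 and no extensions; `normalizeUsPhone` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: returns the 10-digit number as a string, or null if
// the input doesn't look like a US phone number. Extensions are not handled.
function normalizeUsPhone(input) {
  // Reject anything that isn't a digit or common phone punctuation
  if (!/^[\d\s().+-]+$/.test(input)) return null;

  // Keep the digits only
  let digits = input.replace(/\D/g, '');

  // Drop an optional leading country code
  if (digits.length === 11 && digits.startsWith('1')) {
    digits = digits.slice(1);
  }

  // Exactly ten digits; area code and exchange can't start with 0 or 1
  return /^[2-9]\d{2}[2-9]\d{6}$/.test(digits) ? digits : null;
}

console.log(normalizeUsPhone('(555) 234-5678'));  // "5552345678"
console.log(normalizeUsPhone('+1-555-234-5678')); // "5552345678"
console.log(normalizeUsPhone('555-GET-FOOD'));    // null
console.log(normalizeUsPhone('123456789012345')); // null
```

Two short patterns and a length check are easier to debug than one long pattern, and the function returns a normalized value that can be stored directly.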

The URL Validator of Doom

// AI's URL regex (actual output, shortened for sanity)
/^(?:(?:(?:https?|ftp):)?\/\/)(?:\S+(?::\S*)?@)?(?:(?!(?:10|127)(?:\.\d{1,3}){3})(?!(?:169\.254|192\.168)(?:\.\d{1,3}){2})(?!172\.(?:1[6-9]|2\d|3[0-1])(?:\.\d{1,3}){2})(?:[1-9]\d?|1\d\d|2[01]\d|22[0-3])(?:\.(?:1?\d{1,2}|2[0-4]\d|25[0-5])){2}(?:\.(?:[1-9]\d?|1\d\d|2[0-4]\d|25[0-4]))|(?:(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)(?:\.(?:[a-z\u00a1-\uffff0-9]-*)*[a-z\u00a1-\uffff0-9]+)*(?:\.(?:[a-z\u00a1-\uffff]{2,})))(?::\d{2,5})?(?:[\/?#]\S*)?$/i

// Browser: *catches fire*
// CPU: "I need a vacation"
// Me: "Maybe just check for 'http' at the start?"
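
Slightly more than checking for 'http' at the start, and still no regex: the built-in `URL` constructor (available in browsers and Node.js) throws on strings it cannot parse. A minimal sketch, assuming only `http` and `https` links should be accepted; `isHttpUrl` is a hypothetical helper name.

```javascript
// Hypothetical helper: true if the string parses as an http(s) URL
function isHttpUrl(input) {
  try {
    const url = new URL(input);
    return url.protocol === 'http:' || url.protocol === 'https:';
  } catch {
    // new URL() throws a TypeError when the input is not a valid absolute URL
    return false;
  }
}

console.log(isHttpUrl('https://example.com/path?q=1')); // true
console.log(isHttpUrl('ftp://example.com'));            // false
console.log(isHttpUrl('not a url'));                    // false
```

This leaves the parsing rules to the platform. It does not filter private IP ranges the way the long pattern tries to; if that matters, it would be a separate check on `url.hostname`.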

Real AI Regex Disasters

The Password Validator That Validates Everything

// Requirements: 8+ chars, 1 uppercase, 1 lowercase, 1 number, 1 special
// AI's regex:
/^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[@$!%*?&])[A-Za-z\d@$!%*?&]{8,}$/

// Looks good until:
"Password1!" // ✓ Valid
"P@ssw0rd" // ✓ Valid
"Password1#" // ✗ Invalid ('#' isn't on the approved list)
"Correct-Horse-Battery-9" // ✗ Invalid (neither is '-')

// The bug: the final character class is an allowlist, so any character outside it fails the whole password
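
The five requirements can also be checked one at a time instead of in a single pattern. A minimal sketch, assuming "special" means any character that is not an ASCII letter or digit; `passwordRules` and `failedPasswordRules` are hypothetical names.

```javascript
// One small, named check per requirement
const passwordRules = [
  { name: 'at least 8 characters', test: (pw) => pw.length >= 8 },
  { name: 'a lowercase letter', test: (pw) => /[a-z]/.test(pw) },
  { name: 'an uppercase letter', test: (pw) => /[A-Z]/.test(pw) },
  { name: 'a digit', test: (pw) => /\d/.test(pw) },
  // Assumption: anything that isn't an ASCII letter or digit counts as special
  { name: 'a special character', test: (pw) => /[^A-Za-z0-9]/.test(pw) },
];

// Returns the names of the rules the password fails (empty array means valid)
function failedPasswordRules(pw) {
  return passwordRules.filter((rule) => !rule.test(pw)).map((rule) => rule.name);
}

console.log(failedPasswordRules('Password1!')); // []
console.log(failedPasswordRules('AAAAAAAA1!')); // ["a lowercase letter"]
```

Because each rule has a name, the result doubles as an error message for the user, which a single pass/fail pattern cannot provide.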

The Credit Card Validator

// AI regex for credit cards
/^(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11}|6(?:011|5[0-9]{2})[0-9]{12}|(?:2131|1800|35\d{3})\d{11})$/

// Validates card numbers!
// Also validates my social security number
// And my phone number with extra digits
// But not actual valid test cards
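
A prefix-and-length pattern like this cannot tell a real card number from a typo, because it never looks at the check digit. The Luhn checksum does, and it fits in a few lines. A minimal sketch; `passesLuhn` is a hypothetical helper name, and the 12 to 19 digit length range is an assumption rather than a rule from any particular card network.

```javascript
// Luhn checksum: starting from the rightmost digit, double every second
// digit, subtract 9 from any result above 9, and sum everything.
// The number passes if the total is divisible by 10.
function passesLuhn(cardNumber) {
  // Allow spaces and hyphens as separators
  const digits = cardNumber.replace(/[\s-]/g, '');

  // Assumption: plausible card numbers have 12 to 19 digits
  if (!/^\d{12,19}$/.test(digits)) return false;

  let sum = 0;
  for (let i = 0; i < digits.length; i++) {
    // i counts from the right-hand end of the number
    let d = Number(digits[digits.length - 1 - i]);
    if (i % 2 === 1) {
      d *= 2;
      if (d > 9) d -= 9;
    }
    sum += d;
  }
  return sum % 10 === 0;
}

console.log(passesLuhn('4111 1111 1111 1111')); // true
console.log(passesLuhn('4111 1111 1111 1112')); // false
```

Passing the checksum only means the number is well formed. Whether the card exists is still a question for the payment processor.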

The Catastrophic Backtracking Special

// AI's "simple" regex for parsing HTML (don't do this)
/<(\w+)(\s+\w+\s*=\s*("[^"]*"|'[^']*'|[^>]*))*\s*>/

// Test string: <div class="test" id="boom">
// Time: 0.1ms ✓

// Test string: <div class="test" id="boom" data-value="<nested>">
// Time: 47 seconds
// CPU: 100%
// Fans: Taking off

// Catastrophic backtracking has entered the chat
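
The mechanism is easier to see on a much smaller pattern. The sketch below uses the textbook nested-quantifier pattern `(a+)+` as a stand-in (it is not the HTML regex above). On a backtracking engine such as the one JavaScript runtimes use by default, each extra `a` before the failing character roughly doubles the work. `timeTest` is a hypothetical helper name.

```javascript
// Nested quantifier: many ways to split a run of 'a's between the inner and outer +
const nested = /^(a+)+$/;
// Matches the same strings with no nesting: only one way to consume the run
const flat = /^a+$/;

// Hypothetical helper: runs the test and reports elapsed milliseconds
function timeTest(regex, input) {
  const start = Date.now();
  const matched = regex.test(input);
  return { matched, ms: Date.now() - start };
}

// A run of 'a's followed by a character that forces the match to fail
const input = 'a'.repeat(20) + '!';

console.log(timeTest(flat, input));   // fails after a single pass
console.log(timeTest(nested, input)); // fails only after trying every split
// Raising 20 toward 30 makes the nested version dramatically slower;
// the flat version barely changes.
```

Both patterns give the same answer. The difference is only in how much work the engine does before giving up, which is why this class of bug tends to survive happy-path testing.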

The International Disaster

// Me: "Validate international names"
// AI: "I'll handle all Unicode!"

/^[\p{L}\p{M}\p{Zs}'-]{2,50}$/u

// Sounds good until:
"José" // ✓
"Māori" // ✓
"null" // ✓ (Actual name in some cultures)
"<script>alert('hi')</script>" // ✗ Good
"--" // ✓ That's... two hyphens
"  " // ✓ That's two spaces and no name at all

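A character class on its own only says which characters are allowed, not that a name is actually present. A minimal sketch of a slightly stricter check, assuming names must start with a letter and fit in 2 to 50 UTF-16 code units after trimming; `isPlausibleName` is a hypothetical helper name, and real name handling may well need looser rules than this.

```javascript
// Hypothetical helper: a loose sanity check for a personal name
function isPlausibleName(input) {
  // Normalize so "e" + combining accent and precomposed "é" are treated alike
  const name = input.normalize('NFC').trim();

  // Length is measured in UTF-16 code units, which is fine for a rough bound
  if (name.length < 2 || name.length > 50) return false;

  // Only letters, combining marks, spaces, apostrophes, and hyphens
  if (!/^[\p{L}\p{M} '-]+$/u.test(name)) return false;

  // Must start with a letter, which rules out punctuation-only input
  return /^\p{L}/u.test(name);
}

console.log(isPlausibleName('José'));    // true
console.log(isPlausibleName("O'Brien")); // true
console.log(isPlausibleName('--'));      // false
console.log(isPlausibleName('  '));      // false
```

Each step is one readable rule, so when a real name is rejected it is clear which line to relax.
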
The Date Parser From Hell

// AI's date validation regex
/^(?:(?:31(\/|-|\.)(?:0?[13578]|1[02]))\1|(?:(?:29|30)(\/|-|\.)(?:0?[13-9]|1[0-2])\2))(?:(?:1[6-9]|[2-9]\d)?\d{2})$|^(?:29(\/|-|\.)0?2\3(?:(?:(?:1[6-9]|[2-9]\d)?(?:[048]|[2468][048]|[13579][26])|(?:(?:16|[2468][048]|[3579][26])00))))$|^(?:0?[1-9]|1\d|2[0-8])(\/|-|\.)(?:(?:0?[1-9])|(?:1[0-2]))\4(?:(?:1[6-9]|[2-9]\d)?\d{2})$/

// Handles leap years!
// Also handles:
// 31/02/2024 (February 31st?)
// 99/99/9999 (The end times)
// My will to live (gone)
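
Calendar rules such as month lengths and leap years are already built into `Date`, so the regex only needs to pull out three numbers. A minimal sketch, assuming DD/MM/YYYY input with slashes and a four-digit year; `isValidDate` is a hypothetical helper name.

```javascript
// Hypothetical helper: validates a DD/MM/YYYY string by letting Date do the calendar math
function isValidDate(input) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(input);
  if (!match) return false;

  const day = Number(match[1]);
  const month = Number(match[2]);
  const year = Number(match[3]);

  // Date rolls impossible days over into the next month (31 Feb becomes
  // 2 or 3 Mar), so comparing the round-tripped parts catches invalid
  // dates, leap years included
  const date = new Date(Date.UTC(year, month - 1, day));
  return (
    date.getUTCFullYear() === year &&
    date.getUTCMonth() === month - 1 &&
    date.getUTCDate() === day
  );
}

console.log(isValidDate('29/02/2024')); // true
console.log(isValidDate('29/02/2023')); // false
console.log(isValidDate('31/02/2024')); // false
console.log(isValidDate('99/99/9999')); // false
```

Using UTC on both sides keeps the comparison independent of the machine's time zone.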

The Regex That Became Sentient

// AI tried to validate... everything
/^(?=.*[a-zA-Z])(?=.*\d)(?=.*[!@#$%^&*()_+\-=\[\]{};':'\\|,.<>\/?])(?=.*[^\w\s])(?!.*\s)(?!.*(.)\1{2,})(?!.*(012|123|234|345|456|567|678|789|890|098|987|876|765|654|543|432|321|210))(?!.*(abc|bcd|cde|def|efg|fgh|ghi|hij|ijk|jkl|klm|lmn|mno|nop|opq|pqr|qrs|rst|stu|tuv|uvw|vwx|wxy|xyz))(?!.*(password|123456|12345678|qwerty|abc123|monkey|1234567|letmein|trustno1|dragon|baseball|111111|iloveyou|master|sunshine|ashley|bailey|shadow|123123|654321|superman|michael)).*$/

// I asked for password validation
// AI delivered existential dread

How to Not Regex with AI

Option 1: Use a Library

// Instead of regex hell
import validator from 'validator';

if (validator.isEmail(email)) {
  // Done. No regex. No tears.
}

Option 2: Simple and Sufficient

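// Email: exactly one @, something before it, and a dot inside the domain
const isValidEmail = (email) => {
  const parts = email.split('@');
  if (parts.length !== 2) return false;
  const [local, domain] = parts;
  const dot = domain.lastIndexOf('.');
  // The last dot can't be the first or last character of the domain
  return local.length > 0 && dot > 0 && dot < domain.length - 1;
};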

// Covers 99% of cases, readable by humans

Option 3: Progressive Enhancement

// Start simple
let emailRegex = /@/;

// Add complexity only if needed
emailRegex = /.+@.+/;

// Still readable
emailRegex = /.+@.+\..+/;

// Stop here. Please.

AI Regex Prompt That Actually Works

"Create a SIMPLE regex for [purpose].
Requirements:
- Maximum 50 characters
- No lookaheads/lookbehinds unless essential
- Must be readable by humans
- Prefer false positives over complexity
- Include test cases
- Explain what each part does"

The Debugging Nightmare

// Coworker: "The regex isn't working"
// Me: "Which part?"
// The regex:
/^(?:(?:(?:(?:(?:(?:[a-zA-Z0-9])|(?:[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9]))\.)*(?:(?:[a-zA-Z0-9])|(?:[a-zA-Z0-9][a-zA-Z0-9-]{0,61}[a-zA-Z0-9])))|\[(?:(?:(?:(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5]))\.){3}(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5])))|(?:(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):){6}(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5]))\.){3}(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5])))))))|(?:::(?:(?:(?:[0-9a-fA-F]{1,4})):){5}(?:(?:(?:(?:(?:[0-9a-fA-F]{1,4})):(?:(?:[0-9a-fA-F]{1,4})))|(?:(?:(?:(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5]))\.){3}(?:(?:[0-9])|(?:[1-9][0-9])|(?:1[0-9]{2})|(?:2[0-4][0-9])|(?:25[0-5]))))))))\])$/

// Me: "Yes."

The Lessons Learned

  1. If you can't read it, you can't debug it
  2. AI doesn't understand "simple"
  3. Most validation doesn't need regex
  4. Test with edge cases, not happy paths
  5. Sometimes "good enough" is perfect

My Favorite AI Regex Moment

// Me: "I need to match a number"
// AI: "Here's a comprehensive number matcher!"

/^[+-]?(?:(?:(?:\d{1,3}(?:(?:,\d{3})|(?:\.\d{3}))*)|(?:\d+))(?:\.\d+)?|\.\d+)(?:[eE][+-]?\d+)?$/

// Me: "I meant like... \d+"
// AI: "But what about scientific notation?"
// Me: "It's for a ZIP code"
// AI: "...oh"
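
For the ZIP code in this exchange, a pattern well under the 50-character limit from the prompt is enough, and it is short enough to test exhaustively. A minimal sketch, assuming US five-digit ZIP and ZIP+4 formats only; `zipRegex` and `zipCases` are hypothetical names.

```javascript
// Five digits, optionally followed by a hyphen and four more (ZIP+4)
const zipRegex = /^\d{5}(-\d{4})?$/;

// Happy path first, then the edge cases that tend to slip through
const zipCases = [
  ['12345', true],
  ['12345-6789', true],
  ['1234', false],       // too short
  ['123456', false],     // too long
  ['12345-', false],     // dangling hyphen
  ['1e5', false],        // scientific notation, just in case
  [' 12345', false],     // leading whitespace
];

for (const [input, expected] of zipCases) {
  console.assert(zipRegex.test(input) === expected, `unexpected result for "${input}"`);
}
```

The whole pattern can be read aloud in one sentence, which is a reasonable test of whether a regex is still simple.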

AI writing regex is like using a Formula 1 car for your morning commute - technically it works, but you'll spend more time fixing problems than solving them. Regular expressions are already write-only code; adding AI to the mix creates write-never-read-never code. Sometimes the best regex is no regex. And if you must use regex, remember: the goal is to match patterns, not to prove the Riemann hypothesis. Keep it simple, keep it readable, and keep your sanity.
