How I Caught AI Copy-Pasting from Stack Overflow (And Built a Detection System)

2025-07-10 | Catalypt AI Team | ai-first

The pull request looked great. Complex sorting algorithm, clean implementation, even had comments. One problem: I'd seen this exact code before. Right down to the variable named tempVar and the comment "// TODO: optimize this later".

It was the top Stack Overflow answer. From 2019. Complete with the same typo in the comments.

The First Red Flag

// The PR submission
function quickSort(arr) {
  if (arr.length <= 1) return arr;
  
  const pivot = arr[0];
  const left = [];
  const right = [];
  
  // TODO: optimize this later
  for (let i = 1; i < arr.length; i++) {
    if (arr[i] < pivot) {
      left.push(arr[i]);
    } else {
      right.push(arr[i]);  // Fixed tpyo here
    }
  }
  
  const tempVar = [...quickSort(left), pivot, ...quickSort(right)];
  return tempVar;
}

Googled "quicksort javascript tempVar" - first result, exact match. Even the weird spacing in the comment.

// Stack Overflow answer from 2019 (2.7k upvotes)
// "Here's a simple quicksort implementation:"
function quickSort(arr) {
  if (arr.length <= 1) return arr;
  
  const pivot = arr[0];
  const left = [];
  const right = [];
  
  // TODO: optimize this later
  for (let i = 1; i < arr.length; i++) {
    if (arr[i] < pivot) {
      left.push(arr[i]);
    } else {
      right.push(arr[i]);  // Fixed tpyo here
    }
  }
  
  const tempVar = [...quickSort(left), pivot, ...quickSort(right)];
  return tempVar;
}

// EDIT: Fixed typo in comment
// EDIT 2: Thanks @user12345 for pointing out the edge case

The Investigation Deepens

I started digging deeper and found more copied code:

// AI's "original" debounce function
function debounce(func, wait, immediate) {
  var timeout;  // Why var in 2024?
  return function() {
    var context = this, args = arguments;
    var later = function() {
      timeout = null;
      if (!immediate) func.apply(context, args);
    };
    var callNow = immediate && !timeout;
    clearTimeout(timeout);
    timeout = setTimeout(later, wait);
    if (callNow) func.apply(context, args);
  };
}

// Found on SO: "Underscore.js debounce implementation"
// Posted by David Walsh in 2014

// AI's regex for email validation
const validateEmail = (email) => {
  // This regex courtesy of @regexguru on SO
  const re = /^(([^<>()\[\]\\.,;:\s@"]+(\.[^<>()\[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
  return re.test(String(email).toLowerCase());
};

// Googled the regex - 47,000 results, all from the same SO answer
// Comment "courtesy of @regexguru" was the giveaway

The Dead Giveaways

  • TODO comments that make no sense in context
  • Variable names from a different naming convention
  • Comments referencing "the OP" or "the asker"
  • Edit history in comments: "EDIT: Fixed typo"
  • Thanks to specific usernames
  • Links to JSFiddle that expired in 2018

// Example of obvious SO copy-paste
function deepClone(obj) {
  // Thanks to @javascriptNinja for this solution!
  // EDIT: Fixed issue with circular references (see comments)
  // EDIT 2: Added Date object support as suggested by OP
  
  if (obj === null || typeof obj !== "object") {
    return obj;
  }
  
  // Handle Date (as pointed out in the comments)
  if (obj instanceof Date) {
    return new Date(obj.getTime());
  }
  
  // Handle Array
  if (obj instanceof Array) {
    var clonedArr = [];
    for (var i = 0; i < obj.length; i++) {
      clonedArr[i] = deepClone(obj[i]);
    }
    return clonedArr;
  }
  
  // Handle Object
  var clonedObj = {};
  for (var key in obj) {
    if (obj.hasOwnProperty(key)) {
      clonedObj[key] = deepClone(obj[key]);
    }
  }
  
  return clonedObj;
}

Found the exact function on SO. Answer from 2017. Before structuredClone() existed.
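For comparison, here is a minimal sketch of what the built-in structuredClone() covers today, assuming a runtime that ships it (Node.js 17+ or a current browser). The `original` object is made up for illustration.

```javascript
// Hypothetical data: a Date, a nested array, and a circular reference.
const original = {
  created: new Date('2017-03-01T00:00:00Z'),
  tags: ['a', 'b'],
};
original.self = original; // circular reference

const copy = structuredClone(original);

console.log(copy !== original);                 // true: a new object
console.log(copy.created instanceof Date);      // true: Date preserved
console.log(copy.created !== original.created); // true: a separate Date instance
console.log(copy.tags !== original.tags);       // true: nested array copied
console.log(copy.self === copy);                // true: cycle preserved
```

The hand-rolled deepClone would recurse on `original.self` until the stack overflows, despite the "EDIT" comment claiming circular references were fixed. structuredClone() is not a drop-in replacement everywhere, though: it throws on functions and does not keep class prototypes.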

The License Trap

// AI submitted this "utility library"
export const utils = {
  // Smooth scroll implementation
  smoothScroll: function(target, duration) {
    var targetElement = document.querySelector(target);
    var targetPosition = targetElement.getBoundingClientRect().top;
    var startPosition = window.pageYOffset;
    var distance = targetPosition - startPosition;
    var startTime = null;
    
    function animation(currentTime) {
      if (startTime === null) startTime = currentTime;
      var timeElapsed = currentTime - startTime;
      var run = ease(timeElapsed, startPosition, distance, duration);
      window.scrollTo(0, run);
      if (timeElapsed < duration) requestAnimationFrame(animation);
    }
    
    function ease(t, b, c, d) {
      t /= d / 2;
      if (t < 1) return c / 2 * t * t + b;
      t--;
      return -c / 2 * (t * (t - 2) - 1) + b;
    }
    
    requestAnimationFrame(animation);
  }
};

Looks professional! One issue: It's from a popular MIT-licensed library, comment for comment. Without attribution.

The Frankenstein Code

The worst case - AI stitching together multiple SO answers:

// "Original" form validation
function validateForm(formData) {
  // Email validation from SO answer #1
  const emailRegex = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
  
  // Phone validation from SO answer #2 (different author)
  const phoneRegex = /^\(?([0-9]{3})\)?[-. ]?([0-9]{3})[-. ]?([0-9]{4})$/;
  
  // Password strength from SO answer #3 (notice style change)
  var checkPasswordStrength = function(pwd) {
    var strength = 0;
    if (pwd.length >= 8) strength++;
    if (pwd.match(/[a-z]+/)) strength++;
    if (pwd.match(/[A-Z]+/)) strength++;
    if (pwd.match(/[0-9]+/)) strength++;
    if (pwd.match(/[$@#&!]+/)) strength++;
    return strength;
  }
  
  // Date validation from SO answer #4 (ES6 suddenly appears)
  const isValidDate = (dateString) => {
    const regEx = /^\d{4}-\d{2}-\d{2}$/;
    if(!dateString.match(regEx)) return false;
    const d = new Date(dateString);
    const dNum = d.getTime();
    if(!dNum && dNum !== 0) return false;
    return d.toISOString().slice(0,10) === dateString;
  };
  
  // Mix of var, const, function styles = dead giveaway
}
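The style mix can be checked mechanically. The following is a rough sketch, not a parser: `declarationStyles` and `mixesOldAndNew` are hypothetical helper names, and because they rely on regexes they will also count keywords that appear inside strings or comments.

```javascript
// Count declaration and function styles in a source string.
// Regex-based, so keywords inside strings and comments are counted too.
function declarationStyles(code) {
  const count = (regex) => (code.match(regex) || []).length;
  return {
    var: count(/\bvar\s+\w/g),
    letConst: count(/\b(?:let|const)\s+\w/g),
    functionKeyword: count(/\bfunction\b/g),
    arrow: count(/=>/g),
  };
}

// Flag code that mixes pre-ES6 `var` with `let`/`const`.
function mixesOldAndNew(code) {
  const styles = declarationStyles(code);
  return styles.var > 0 && styles.letConst > 0;
}

// Condensed stand-in for the stitched-together validation function.
const sample = `
  const emailRegex = /x/;
  var checkPasswordStrength = function (pwd) { var strength = 0; return strength; };
  const isValidDate = (s) => s.length === 10;
`;

console.log(declarationStyles(sample));
console.log(mixesOldAndNew(sample)); // true
```

A mix on its own proves nothing (legacy files get edited all the time), so treat it as a prompt to look closer rather than a verdict.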

How to Detect Stack Overflow Copy-Paste

  1. Unusual variable names - tempVar, myVar, foo, bar
  2. Outdated patterns - var in 2024, deprecated APIs
  3. Inconsistent style - Different formatting mid-function
  4. Random TODOs - That will never be done
  5. Google unique comments - Often verbatim from SO
  6. Check for common SO bugs - The broken regex everyone copies

// My detection script
const detectSOCopyPaste = (code) => {
  const indicators = {
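    hasVar: /\bvar\s+\w+/.test(code),
    hasTempVar: /\b(tempVar|myVar|foo|bar)\b/.test(code),
    // Matches "EDIT:", "EDIT 2:", "UPDATE:" and "Fixed ..." comments
    hasEditComments: /\/\/\s*(EDIT|UPDATE)\s*\d*:|\/\/\s*Fixed\b/i.test(code),
    // Matches the keyword anywhere in a comment, e.g. "// This regex courtesy of ..."
    hasThanks: /\/\/.*\b(Thanks|Credit|Courtesy)\b/i.test(code),
    hasStackOverflowLinks: /stackoverflow\.com/.test(code),
    hasOldAPIs: /\b(escape|unescape|eval)\b/.test(code),
    hasTODOs: /\/\/\s*TODO:/.test(code),
    inconsistentQuotes: code.includes('"') && code.includes("'"),
    // Three or more spaces between tokens on one line (plain indentation is ignored)
    weirdSpacing: /\S {3,}\S/.test(code),
    oldDateCheck: /new Date\(\)\.getYear\(\)/.test(code)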
  };
  
  const score = Object.values(indicators).filter(Boolean).length;
  
  if (score >= 3) {
    console.warn('High probability of SO copy-paste:', indicators);
    return true;
  }
  
  return false;
};

The Real Problems

  • License issues - SO is CC BY-SA, not public domain
  • Quality issues - Old answers may be wrong now
  • Security issues - That SQL injection fix from 2010? Not great
  • Maintenance issues - Copying code you don't understand

// Common security disaster from old SO answers
// DO NOT USE - SQL injection vulnerable!
function getUserData(userId) {
  // This was acceptable in 2008...
  const query = "SELECT * FROM users WHERE id = '" + userId + "'";
  return db.execute(query);
}

// The 2024 version should be:
async function getUserData(userId) {
  const query = "SELECT * FROM users WHERE id = ?";
  return await db.execute(query, [userId]);
}
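To see why the concatenated version is dangerous, it helps to print the SQL text it produces. This sketch needs no database; the `userId` value is a made-up attacker input.

```javascript
// Hypothetical attacker-controlled input.
const userId = "1' OR '1'='1";

// String concatenation: the input becomes part of the SQL text itself.
const concatenated = "SELECT * FROM users WHERE id = '" + userId + "'";
console.log(concatenated);
// SELECT * FROM users WHERE id = '1' OR '1'='1'

// Parameterized form: the SQL text is fixed and the value travels separately,
// so the driver never interprets it as SQL.
const parameterized = { sql: "SELECT * FROM users WHERE id = ?", params: [userId] };
console.log(parameterized.sql);
// SELECT * FROM users WHERE id = ?
```

With concatenation the WHERE clause is now true for every row. With the placeholder, the same input is just an odd-looking id that matches nothing.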

Building a Better Detection System

// Advanced SO copy detection with fuzzy matching
class SOCopyDetector {
  constructor() {
    this.knownPatterns = new Map();
    this.loadCommonSOPatterns();
  }
  
  async checkCode(code) {
    const results = {
      exact_matches: [],
      fuzzy_matches: [],
      suspicious_patterns: [],
      license_issues: []
    };
    
    // Check for exact matches
    const codeHash = this.hashCode(this.normalizeCode(code));
    if (this.knownPatterns.has(codeHash)) {
      results.exact_matches.push({
        source: this.knownPatterns.get(codeHash),
        confidence: 1.0
      });
    }
    
    // Check for suspicious patterns
    const patterns = [
      { regex: /\/\/\s*Credit:/i, name: 'credit_comment' },
      { regex: /EDIT\s*\d*:/i, name: 'edit_history' },
      { regex: /@[\w]+\s*for/i, name: 'username_thanks' },
      { regex: /\[JSFiddle\]|\[demo\]/i, name: 'dead_links' },
      { regex: /var\s+\w+\s*=.*var\s+\w+/s, name: 'old_js_style' },
      { regex: /\barguments\.\w+/g, name: 'deprecated_patterns' }
    ];
    
    patterns.forEach(({ regex, name }) => {
      if (regex.test(code)) {
        results.suspicious_patterns.push(name);
      }
    });
    
    // Check common variable names
    const commonSOVars = ['foo', 'bar', 'baz', 'temp', 'tempVar', 'myVar', 'test'];
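    // Capture assigned names only; skip comparisons (==) and arrow functions (=>)
    const varMatches = [...code.matchAll(/\b(\w+)\s*=(?![=>])/g)].map(m => m[1]);
    // Exact match, so "latest" or "temperature" are not flagged
    const suspiciousVars = varMatches.filter(name => commonSOVars.includes(name));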
    
    if (suspiciousVars.length > 2) {
      results.suspicious_patterns.push('common_so_variables');
    }
    
    // Google search for unique strings
    const uniqueStrings = this.extractUniqueStrings(code);
    for (const str of uniqueStrings) {
      const searchResults = await this.searchGoogle(str);
      if (searchResults.includes('stackoverflow.com')) {
        results.fuzzy_matches.push({
          string: str,
          source: 'stackoverflow',
          confidence: 0.8
        });
      }
    }
    
    return results;
  }
  
  normalizeCode(code) {
    return code
      .replace(/\s+/g, ' ')
      .replace(/['"`]/g, '"')
      .trim();
  }
  
  hashCode(str) {
    let hash = 0;
    for (let i = 0; i < str.length; i++) {
      const char = str.charCodeAt(i);
      hash = ((hash << 5) - hash) + char;
      hash = hash & hash;
    }
    return hash;
  }
  
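  extractUniqueStrings(code) {
    const comments = code.match(/\/\/.*$/gm) || [];
    const strings = code.match(/["'`]([^"'`]{10,})["'`]/g) || [];
    return [...comments, ...strings].slice(0, 5);
  }

  // Placeholder: fill this.knownPatterns with hash -> source entries
  // from whatever corpus of known snippets you maintain.
  loadCommonSOPatterns() {}

  // Placeholder: wire this to a search API of your choice. It should
  // resolve to a string of result URLs or text; no real API is assumed here.
  async searchGoogle(query) {
    return '';
  }
}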
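The hash lookup above only catches copies that are identical after normalization. One way to get genuinely fuzzy matching is Jaccard similarity over token n-grams ("shingles"). This is a swapped-in technique, not part of the detector above, and the function names and sample strings are illustrative assumptions.

```javascript
// Split code into rough tokens: identifiers, numbers, single punctuation characters.
function tokenize(code) {
  return code.match(/[A-Za-z_$][\w$]*|\d+|[^\s\w]/g) || [];
}

// Build the set of overlapping n-token sequences ("shingles").
function shingles(tokens, n = 3) {
  const set = new Set();
  for (let i = 0; i + n <= tokens.length; i++) {
    set.add(tokens.slice(i, i + n).join(' '));
  }
  return set;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, from 0 (disjoint) to 1 (identical sets).
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 1;
  let shared = 0;
  for (const item of a) {
    if (b.has(item)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

function similarity(codeA, codeB, n = 3) {
  return jaccard(shingles(tokenize(codeA), n), shingles(tokenize(codeB), n));
}

const submitted = 'const tempVar = [...quickSort(left), pivot, ...quickSort(right)];';
const reformatted = 'const tempVar=[ ...quickSort(left),pivot,...quickSort(right) ];';
const unrelated = 'function add(a, b) { return a + b; }';

console.log(similarity(submitted, reformatted)); // 1: whitespace changes don't matter
console.log(similarity(submitted, unrelated));   // close to 0
```

Renamed variables lower the score without zeroing it, which is the point: a threshold somewhere below 1 catches lightly edited copies. Where to set it is something to tune against your own codebase.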

Prevention Strategies

Now I always:

  1. Google suspicious comments
  2. Check for outdated patterns
  3. Look for inconsistent naming
  4. Verify "clever" one-liners
  5. Test edge cases (SO code often misses these)
  6. Run automated detection scripts
  7. Check license compatibility
  8. Verify the code actually works
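Step 5 is cheap to automate. As a sketch, here is a table-driven edge-case check for the date validator from the stitched-together form validation earlier in the post. The validator is copied as-is, and the test inputs are my own choices.

```javascript
// Date validator as it appears in the stitched-together form validation.
const isValidDate = (dateString) => {
  const regEx = /^\d{4}-\d{2}-\d{2}$/;
  if (!dateString.match(regEx)) return false;
  const d = new Date(dateString);
  const dNum = d.getTime();
  if (!dNum && dNum !== 0) return false;
  return d.toISOString().slice(0, 10) === dateString;
};

// Inputs and expected results, including cases a quick
// happy-path check would never try.
const cases = [
  ['2024-02-29', true],   // leap day
  ['2023-02-29', false],  // not a leap year
  ['2023-13-01', false],  // month out of range
  ['2023-1-01', false],   // wrong shape
  ['', false],            // empty string
];

const failures = cases.filter(([input, expected]) => isValidDate(input) !== expected);
console.log(failures.length === 0 ? 'all edge cases pass' : failures);
```

This one holds up on strings, but pass it `undefined` and it throws on `.match` instead of returning false. That is exactly the kind of gap a table like this is meant to surface before the code ships.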

The Positive Side

When used properly, Stack Overflow is invaluable. The key is:

// Good approach: Learn from SO, don't copy
// 1. Understand the solution
// 2. Adapt to your needs
// 3. Modernize if needed
// 4. Add proper attribution
// 5. Test thoroughly

// Example: Inspired by SO but properly adapted
class Debouncer {
  constructor(delay = 300) {
    this.delay = delay;
    this.timeouts = new Map();
  }
  
  debounce(key, callback) {
    // Clear existing timeout
    if (this.timeouts.has(key)) {
      clearTimeout(this.timeouts.get(key));
    }
    
    // Set new timeout
    const timeoutId = setTimeout(() => {
      callback();
      this.timeouts.delete(key);
    }, this.delay);
    
    this.timeouts.set(key, timeoutId);
  }
  
  // Modern additions not in original
  cancel(key) {
    if (this.timeouts.has(key)) {
      clearTimeout(this.timeouts.get(key));
      this.timeouts.delete(key);
    }
  }
  
  cancelAll() {
    for (const timeout of this.timeouts.values()) {
      clearTimeout(timeout);
    }
    this.timeouts.clear();
  }
}

Lessons Learned

  1. AI doesn't understand licensing - Always verify code origin
  2. Old solutions create new problems - 2010 best practices ≠ 2024 best practices
  3. Attribution matters - Even for AI-generated code
  4. Test everything - SO code often handles happy path only
  5. Understand before using - If you can't explain it, don't ship it

The Detection Checklist

Before accepting any AI-generated code:

  • Google unique comments and variable names
  • Check for style inconsistencies
  • Look for outdated patterns (var, callbacks, old APIs)
  • Verify no license violations
  • Test edge cases not covered in examples
  • Ensure you understand every line
  • Add proper attribution if needed
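For the attribution item, a comment header is usually the simplest place to start. The sketch below uses a hypothetical `attributionComment` helper with placeholder values. What actually satisfies a given license is a legal question, so check the license text (and, for Stack Overflow, the CC BY-SA version shown on the post).

```javascript
// Build an attribution comment for adapted code.
// All values passed in below are placeholders, not real sources.
function attributionComment({ title, url, author, license, changes }) {
  return [
    `// Adapted from: ${title}`,
    `// Source: ${url}`,
    `// Author: ${author}`,
    `// License: ${license}`,
    `// Changes: ${changes}`,
  ].join('\n');
}

const header = attributionComment({
  title: '<answer or project title>',
  url: '<link to the original>',
  author: '<original author>',
  license: 'CC BY-SA 4.0',
  changes: 'rewritten with const/let, added input validation',
});

console.log(header);
```

Recording the changes matters as much as the credit: it tells the next reviewer which parts were copied and which parts someone on the team actually wrote.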

Remember: Stack Overflow is a learning resource, not a code repository. Use it to understand solutions, not to avoid thinking.

The irony? This blog post will probably end up on Stack Overflow, and some AI will copy it verbatim. If you're reading this in a code comment... hi there! 👋
