I Tried Running LLMs Locally So You Don't Have To

June 8, 2024 Josh Butler Technical

"We need to run everything locally for security." Famous last words. I spent $4,000 on hardware, burned through three weekends, and ended up with a system that makes GPT-3.5 look like Einstein. Here's my journey into local LLM hell.

The Hardware Arms Race

Started confident. "I have a decent gaming rig!" Then I read the requirements:

  • LLaMA 70B: 140GB of RAM minimum (unquantized FP16)
  • Mixtral 8x7B: 90GB RAM
  • Even "small" 13B models: 26GB minimum

My 16GB gaming rig: "Am I a joke to you?"
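
Where those numbers come from, if you want to check my math: at FP16 every parameter costs two bytes, before you even think about context. A quick back-of-envelope in Python (weights only, ignoring KV cache and runtime overhead):

      def fp16_weights_gb(params_billion):
          # 2 bytes per parameter at FP16; ignores KV cache, activations, framework overhead
          return params_billion * 2

      for name, params in [("LLaMA 2 70B", 70), ("Mixtral 8x7B", 46.7), ("13B model", 13)]:
          print(f"{name}: ~{fp16_weights_gb(params):.0f} GB just for the weights")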

The shopping list grew:

      RTX 4090 (24GB VRAM):         $1,600
      128GB DDR5 RAM:               $600
      2TB NVMe for model storage:   $200
      New PSU to power this beast:  $200
      Therapy for credit card:      Priceless

The Installation Nightmare

Day 1: "Just run ollama pull llama2"
Day 2: Compiling llama.cpp from source because the binary doesn't work
Day 3: Learning about quantization formats (Q4_K_M vs Q5_K_S anyone?)
Day 4: Why is Python using system RAM when I have VRAM?
Day 5: Discovering my GPU compute capability is 0.1 version too old
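
Day 4's mystery, for the record, was GPU offloading: llama-cpp-python keeps every layer on the CPU unless you tell it otherwise. A minimal sketch of the fix, assuming a GPU-enabled build and a local GGUF file (the path is hypothetical):

      from llama_cpp import Llama

      llm = Llama(
          model_path="./models/mistral-7b-instruct.Q4_K_M.gguf",  # hypothetical local file
          n_gpu_layers=-1,  # offload all layers to VRAM; the default (0) keeps them in system RAM
          n_ctx=2048,
      )
      out = llm("Write a Python function to sort a list", max_tokens=128)
      print(out["choices"][0]["text"])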

The Performance Reality Check

Finally got Mistral 7B running locally. The benchmarks:

      Task: "Write a Python function to sort a list"
      
      GPT-4:         0.8 seconds, perfect code
      GPT-3.5:       0.5 seconds, working code
      Claude:        0.6 seconds, elegant code
      Local Mistral: 47 seconds, forgot what a list was halfway through
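
If you want to reproduce that kind of number, tokens per second is the honest metric. A rough timing harness, again assuming llama-cpp-python and a local GGUF file (path hypothetical):

      import time
      from llama_cpp import Llama

      llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=2048)

      start = time.time()
      out = llm("Write a Python function to sort a list", max_tokens=256)
      elapsed = time.time() - start

      generated = out["usage"]["completion_tokens"]
      print(f"{generated} tokens in {elapsed:.1f}s -> {generated / elapsed:.1f} tokens/s")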

But wait, it gets better. The responses:

Me: "Write a function to validate email addresses"

Local LLM: "Here"s a function to validate email addresses:
def validate_email(email):
  return '@' in email

This checks if the email contains an @ symbol, which all valid emails must have."
''

Technically correct. Practically useless.
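
For contrast, here's roughly what a useful answer looks like: a structural check that at least rejects obvious garbage. The pattern below is illustrative, nowhere near RFC 5322, and a real system should verify by sending a confirmation email anyway.

      import re

      # Requires something@something.tld with no whitespace; deliberately simple
      EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

      def validate_email(email):
          return bool(EMAIL_RE.match(email))

      print(validate_email("user@example.com"))  # True
      print(validate_email("@"))                 # False -- the local model's version says True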

The Quantization Lottery

"Just quantize the model!" they said. Here's what they don't tell you:

  • Q8: Barely fits in RAM, 2 tokens/second
  • Q5: Fits better, 5 tokens/second, noticeably dumber
  • Q4: Fast! 15 tokens/second! Also can't count past 3
  • Q2: Lightning fast! Outputs random words!

It's like compressing a JPEG until it's fast to load but you can't tell if it's a cat or a toaster.
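
For the curious, the trade-off is roughly linear in bits per weight. Ballpark sizes for a 7B model (the bits-per-weight figures are approximate and vary by quant variant):

      # Approximate bits per weight for common llama.cpp GGUF quant levels
      BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q5_K_M": 5.7, "Q4_K_M": 4.8, "Q2_K": 2.6}

      params = 7.24e9  # Mistral 7B
      for quant, bpw in BITS_PER_WEIGHT.items():
          print(f"{quant}: ~{params * bpw / 8 / 1e9:.1f} GB in RAM")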

The Context Window Catastrophe

      OpenAI: "Here's 128k context'
      Local LLM: "Best I can do is 2k before OOM"
      
      Me: *Feeds 4k tokens*
      GPU: *Thermal throttling noises*
      Model: "CUDA out of memory"
      My electricity bill: *Grows exponentially*
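
The OOM isn't mysterious, either: the KV cache grows linearly with context length, on top of the weights. A rough worked example for a LLaMA-2-7B-style model with full multi-head attention (no grouped-query attention):

      # Per token: 2 tensors (K and V) x layers x heads x head_dim x 2 bytes (FP16)
      n_layers, n_heads, head_dim, fp16_bytes = 32, 32, 128, 2

      def kv_cache_gib(n_tokens):
          return 2 * n_layers * n_heads * head_dim * fp16_bytes * n_tokens / 2**30

      for ctx in (2_048, 4_096, 8_192):
          print(f"{ctx:>5} tokens -> ~{kv_cache_gib(ctx):.1f} GiB of KV cache")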

When Local LLMs Actually Made Sense

After all that pain, I found exactly three use cases where local worked:

1. Regex/Pattern Generation
Simple, deterministic tasks where even a dumb model works:

      "Generate a regex for phone numbers"
      Local LLM: \d{3}-\d{3}-\d{4}
      
      Good enough!
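
"Good enough" comes with fine print, of course: that pattern matches exactly one formatting convention. A quick check:

      import re

      PHONE_RE = re.compile(r"\d{3}-\d{3}-\d{4}")

      print(bool(PHONE_RE.search("555-867-5309")))     # True
      print(bool(PHONE_RE.search("(555) 867-5309")))   # False
      print(bool(PHONE_RE.search("+1 555 867 5309")))  # False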

2. Code Completion (Very Basic)
When you just need variable name suggestions or simple completions. Copilot is still 100x better though.

3. Sensitive Data Processing
When you absolutely cannot send data to external APIs. But then you realize the local model is too dumb to handle sensitive data properly anyway.

The Hidden Costs Nobody Mentions

  • Electricity: Running a 4090 at full tilt? That's 450W. For hours. (Quick math after this list.)
  • Cooling: My office became a sauna. AC bill doubled.
  • Noise: GPU fans at 100% sound like a jet engine
  • Maintenance: Models need updates. Frameworks break. Dependencies conflict.
  • Opportunity Cost: Time spent making it work vs. actually building things
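
A rough sense of scale for that electricity line, assuming $0.15/kWh (your rate will differ, and this ignores the rest of the box):

      # Back-of-envelope GPU electricity cost; every figure here is an assumption
      gpu_watts, hours_per_day, usd_per_kwh = 450, 8, 0.15

      daily_kwh = gpu_watts / 1000 * hours_per_day
      print(f"~{daily_kwh:.1f} kWh/day -> ${daily_kwh * usd_per_kwh:.2f}/day, "
            f"~${daily_kwh * usd_per_kwh * 30:.0f}/month before cooling")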

The Privacy Paradox

"But we need it for privacy!" Okay, let's think about this:

  • You downloaded the model from... the internet
  • You're running code from... random GitHub repos
  • The model was trained on... the entire internet
  • Your "private" code is probably on GitHub anyway

Plus, local models are so bad you'll end up using ChatGPT anyway when you need real work done.

The Comparison Nobody Asked For

      Task: "Refactor this React component"
      
      GPT-4: 
      - Time: 2 seconds
      - Quality: Production-ready
      - Cost: $0.03
      
      Local LLaMA 2 70B:
      - Time: 5 minutes
      - Quality: Suggested using jQuery
      - Cost: $0.47 in electricity
      - Bonus: Room temperature +5°C

What I Actually Use Now

After burning money and time, here's my setup:

  • 99% of tasks: GPT-4 / Claude API
  • Sensitive data: Anonymize first (see the sketch after this list), then use cloud APIs
  • Offline work: I just... write code myself
  • Local models: Gathering dust
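
On the "anonymize first" point, a minimal sketch of what I mean: strip the obvious identifiers with regexes before anything leaves your network, then send the redacted text to the API. The patterns and placeholders are illustrative; real PII scrubbing needs far more than three regexes (names, for a start, still leak).

      import re

      # Illustrative redaction patterns -- not exhaustive PII detection
      PATTERNS = {
          "[EMAIL]": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
          "[PHONE]": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
          "[SSN]":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
      }

      def redact(text):
          for placeholder, pattern in PATTERNS.items():
              text = pattern.sub(placeholder, text)
          return text

      print(redact("Contact Jane at jane@example.com or 555-867-5309."))
      # -> Contact Jane at [EMAIL] or [PHONE].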

When You Should Consider Local LLMs

  1. You have actual regulatory requirements (not just paranoia)
  2. You're processing data that legally cannot leave your network
  3. You have a specific, simple task that a small model can handle
  4. You enjoy pain and have money to burn
  5. You're doing AI research

That's it. That's the list.

The Brutal Truth

Local LLMs are like self-hosting email in 2024. Technically possible, practically painful, and usually not worth it unless you have very specific requirements.

The gap between local and cloud models isn't closing - it's widening. While you're struggling to run last year's model on your $5k rig, OpenAI just released something 10x better that runs instantly for pennies.

Save yourself the pain. Use cloud APIs. If you absolutely need local processing, prepare for a world of hurt. And maybe invest in a good air conditioner - you're going to need it.
