Office Hours — What makes vibe coding fail compared to structured approaches to LLM development?

Vibe coding—building LLM systems by intuition, iterative prompting, and “it works on my examples”—collapses the moment you need predictability. The fundamental problem isn’t that vibes are wrong; it’s that they don’t survive contact with real data, varying inputs, and production constraints.

The Core Issue: No Feedback Loop

Vibe coding works fine when you’re testing a single LLM call in ChatGPT with curated inputs. You tweak the prompt, it works, you ship it. But the moment you run that same system against 10,000 documents, or real user queries, or data that doesn’t match your test case, you hit a wall. You can’t debug what you never measured.

A structured approach starts with evals. Not vague “does this look good?” checks, but concrete benchmarks: what percentage of outputs meet your success criteria? If you’re extracting structured data from PDFs, you define what “correct” means upfront. If you’re summarizing, you specify whether brevity or completeness matters more. Without that definition, you’re flying blind.

Here’s the difference in practice:

Vibe approach: Prompt v1 → test on 5 examples → looks good → ship → get burned in production when edge cases appear.

Structured approach: Define success criteria → build eval harness → run evals on representative test set → iterate prompts/model/retrieval → measure improvement quantitatively → ship only when evals hit threshold.
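To make the structured loop concrete, here is a minimal sketch of an eval harness with a ship gate. Everything in it is an assumption for illustration: `pass_rate`, `ready_to_ship`, the toy system, and the 95% threshold are invented names and values, and the "system" is a plain function standing in for an LLM call.

```python
from typing import Callable


def pass_rate(
    system: Callable[[str], str],
    test_set: list[tuple[str, str]],
    is_correct: Callable[[str, str], bool],
) -> float:
    """Fraction of test cases whose output meets the success criterion."""
    if not test_set:
        raise ValueError("test_set must not be empty")
    passed = sum(is_correct(system(inp), expected) for inp, expected in test_set)
    return passed / len(test_set)


def ready_to_ship(rate: float, threshold: float) -> bool:
    """Ship gate: only pass when the measured rate reaches the threshold."""
    return rate >= threshold


# Toy stand-in for an LLM-backed system (hypothetical, for illustration only).
def toy_system(text: str) -> str:
    return text.strip().lower()


# Made-up test cases: (input, expected output).
toy_test_set = [
    ("  Hello ", "hello"),
    ("WORLD", "world"),
    ("Foo", "foo"),
    ("Bar ", "BAR"),  # deliberately fails the exact-match criterion
]

rate = pass_rate(toy_system, toy_test_set, lambda out, exp: out == exp)
print(f"pass rate: {rate:.0%}, ship: {ready_to_ship(rate, threshold=0.95)}")
```

The point of the gate is that "ship" becomes a function of a measured number rather than of how the last few outputs looked.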

Where Vibes Actually Fail

The June 19 Daily Signal noted that leading AI systems solve only 3% of realistic knowledge work tasks end-to-end. That gap exists because vibe coding optimizes for demo performance, not robustness. You’re measuring “did this work once?” instead of “does this work 85% of the time across all input distributions we’ll actually see?”

Vibe coding also fails silently on confidence calibration. A model can sound authoritative and wrong. With structured evals, you’d notice: “This model hallucinates on 12% of queries.” Without evals, you ship it and your users find out.
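One narrow way to put a number on this, sketched under assumptions: treat a "hallucination" as the model returning a value for a field whose ground truth is absent. The function name and the sample data below are made up, and real hallucination metrics are usually broader than this.

```python
from typing import Optional


def hallucination_rate(
    predictions: list[Optional[str]], gold: list[Optional[str]]
) -> float:
    """Among cases where the true value is absent (None), the share where
    the model still produced a value."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    # Keep only predictions for cases where there was nothing to extract.
    absent_cases = [p for p, g in zip(predictions, gold) if g is None]
    if not absent_cases:
        return 0.0
    invented = sum(p is not None for p in absent_cases)
    return invented / len(absent_cases)


# Made-up labels: None means the field is genuinely missing from the document.
gold_vendors = ["Acme", None, None, "Globex", None]
predicted_vendors = ["Acme", "Initech", None, "Globex", None]

print(f"hallucination rate: {hallucination_rate(predicted_vendors, gold_vendors):.0%}")
```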

Concrete Example: Document Extraction

Let’s say you’re extracting invoice fields (amount, date, vendor) from PDFs.

Vibe approach:

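# Test on 3 invoices you found
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# pdf_text: text you already extracted from the PDF
response = client.messages.create(
    model="claude-opus-4.8",
    max_tokens=1024,  # required by the Messages API
    messages=[{
        "role": "user",
        "content": f"Extract amount, date, vendor from: {pdf_text}"
    }]
)
# Works fine on your test cases. Ship it.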

This will fail on:

PDFs with images instead of text
Multiple invoices concatenated
Invoices in non-English languages
Missing fields (model hallucinates)
Ambiguous date formats (2/3/24 could be Feb 3 or Mar 2)

Structured approach:

# Define success criteria upfront
eval_criteria = {
    "amount": "Must match actual invoice total within $0.01",
    "date": "Must be ISO 8601 format, ±1 day tolerance",
    "vendor": "Must match original PDF vendor name exactly or fuzzy match > 90%"
}

# Build test set of 100 real invoices
# (load_representative_invoices is a placeholder for your own loader)
test_set = load_representative_invoices(
    formats=["pdf", "scanned_image", "multi_page"],
    languages=["en", "es", "fr"],
    edge_cases=["missing_fields", "ambiguous_dates"]
)

# Run evals (evaluate_extraction is a placeholder for your own harness;
# the numbers below are illustrative)
results = evaluate_extraction(model="claude-opus-4.8", criteria=eval_criteria, test_set=test_set)
# Amount accuracy: 94%
# Date accuracy: 87% (failure case: ambiguous formats)
# Vendor accuracy: 91%

# If thresholds require 95%+ accuracy, iterate: better prompt, structured output, or RAG with vendor list
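The success criteria above are written as prose, so a harness needs them as checks. Here is one possible way to turn them into code, as a sketch: the function names are invented, amounts are passed as strings and compared with `Decimal` to avoid float rounding at the one-cent boundary, and the fuzzy vendor match uses the standard library’s `difflib.SequenceMatcher` as a stand-in for whatever similarity measure you actually pick.

```python
from datetime import date
from decimal import Decimal
from difflib import SequenceMatcher


def amount_ok(predicted: str, actual: str) -> bool:
    """Amount must match the invoice total within $0.01."""
    return abs(Decimal(predicted) - Decimal(actual)) <= Decimal("0.01")


def date_ok(predicted: str, actual: str) -> bool:
    """Date must parse as ISO 8601 and fall within one day of the true date."""
    try:
        predicted_date = date.fromisoformat(predicted)
    except ValueError:
        return False  # not ISO 8601, e.g. "2/3/24"
    return abs((predicted_date - date.fromisoformat(actual)).days) <= 1


def vendor_ok(predicted: str, actual: str, min_ratio: float = 0.90) -> bool:
    """Vendor must match exactly, or exceed the similarity ratio after
    lowercasing and trimming whitespace."""
    a, b = predicted.strip().lower(), actual.strip().lower()
    return a == b or SequenceMatcher(None, a, b).ratio() > min_ratio


# Made-up example record, for illustration only.
print(amount_ok("1249.99", "1250.00"))
print(date_ok("2024-02-04", "2024-02-03"))
print(vendor_ok("ACME Corp ", "Acme Corp"))
```

Writing the criteria as functions also forces decisions the prose leaves open, such as whether vendor matching is case-sensitive.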

That structured eval immediately tells you: dates are your weak point. You can then fix it specifically: tell the prompt which date convention to assume, validate the output against an ISO 8601 pattern and flag anything that fails, or add context about typical date formats for this vendor.

Vibe coding would never surface this until production.
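For the date problem specifically, a small guard can catch both failure modes: output that isn’t ISO 8601, and source dates that are ambiguous to begin with. This is a sketch under assumptions: the helper names are invented, and it only considers slash-separated dates with two-digit years under month/day and day/month readings.

```python
import re
from datetime import datetime

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    """True only for a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
    except ValueError:
        return False  # right shape, impossible date (e.g. Feb 30)
    return True


def candidate_dates(raw: str) -> set[str]:
    """Every ISO date a slash-separated string could mean under
    month/day/year and day/month/year readings."""
    candidates = set()
    for fmt in ("%m/%d/%y", "%d/%m/%y"):
        try:
            candidates.add(datetime.strptime(raw, fmt).date().isoformat())
        except ValueError:
            pass  # this reading is not a valid date
    return candidates


for raw in ("2/3/24", "25/3/24"):
    options = candidate_dates(raw)
    status = "ambiguous" if len(options) > 1 else "unambiguous"
    print(raw, status, sorted(options))
```

Anything flagged as ambiguous is a candidate for human review or for a vendor-specific date convention, rather than a silent guess.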

Why Production Breaks Vibes

Production introduces three things vibes can’t handle:

Distribution shift. Your test cases were office invoices; production gets handwritten receipts. Your test set never covered that, so you have no measurement of how the system handles it.

Latency constraints. You optimized for quality; production needs responses in 500ms, not 5 seconds. Vibe coding doesn’t measure latency until it’s already broken.

Cost. Vibe coding often uses expensive models and long prompts because “just make it work.” Structured approaches measure token usage per task, forcing you to optimize early.
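Latency and cost are easy to measure alongside accuracy if the harness records them per call. A minimal sketch, with assumptions stated: `measure` is an invented helper, the callable returns `(output, tokens_used)`, and `fake_call` counts whitespace-separated words as a crude stand-in for real token usage reported by a provider.

```python
import math
import time
from typing import Callable


def measure(call: Callable[[str], tuple[str, int]], inputs: list[str]) -> dict:
    """Run `call` on each input and report p95 latency and average tokens."""
    if not inputs:
        raise ValueError("inputs must not be empty")
    latencies_ms = []
    total_tokens = 0
    for text in inputs:
        start = time.perf_counter()
        _, tokens_used = call(text)
        latencies_ms.append((time.perf_counter() - start) * 1000)
        total_tokens += tokens_used
    latencies_ms.sort()
    # Nearest-rank percentile: smallest value with at least 95% of samples at or below it.
    rank = math.ceil(0.95 * len(latencies_ms))
    return {
        "p95_latency_ms": latencies_ms[rank - 1],
        "avg_tokens_per_task": total_tokens / len(inputs),
    }


# Hypothetical stand-in for a model call; word count is NOT real tokenization.
def fake_call(text: str) -> tuple[str, int]:
    return text.upper(), len(text.split())


stats = measure(fake_call, ["invoice one text", "invoice two"])
print(stats)
```

With these two numbers in the eval report, a 500ms budget or a per-task cost ceiling becomes one more threshold to check before shipping.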

The Structured Tooling Layer

The June 22 Daily Signal flagged frameworks like Vercel’s file-routing pattern being applied to agents. That’s structured thinking winning: standardized abstractions that prevent rework. Similarly, tools for evals (LangSmith, Braintrust, custom harnesses) become non-negotiable in production because they’re your feedback loop.

Vibe coding skips this layer entirely. You can’t iterate effectively without visibility into what’s failing and why.
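Visibility into "what’s failing and why" often starts with something as simple as breaking accuracy down by input type. The sketch below assumes each eval result is a dict with a `slice` label and a `correct` flag; those field names and the sample records are made up for illustration and are not the schema of any particular eval tool.

```python
from collections import defaultdict


def accuracy_by_slice(results: list[dict]) -> dict[str, float]:
    """Group eval results by their 'slice' label and compute accuracy per group."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for record in results:
        totals[record["slice"]] += 1
        correct[record["slice"]] += int(record["correct"])
    return {name: correct[name] / totals[name] for name in totals}


# Made-up eval results.
sample_results = [
    {"slice": "typed_pdf", "correct": True},
    {"slice": "typed_pdf", "correct": True},
    {"slice": "typed_pdf", "correct": True},
    {"slice": "handwritten", "correct": True},
    {"slice": "handwritten", "correct": False},
]

for name, accuracy in accuracy_by_slice(sample_results).items():
    print(f"{name}: {accuracy:.0%}")
```

A single aggregate score would hide the weak slice; the per-slice view is what tells you where to spend the next iteration.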

When Vibes Actually Work

Vibes aren’t useless. They’re great for:

One-off analysis (a single document you need to understand)
Prototyping architectures (should we use RAG or fine-tuning?)
Brainstorming prompts before formalizing them

But the moment you need repeatability, scalability, or reliability, you have to graduate to structure. That means evals, reproducible test sets, quantified success criteria, and iteration against those criteria, not against your intuition.

Bottom line: Vibe coding fails in production because it measures intuition instead of outcomes. Switch to structured evals—quantified success criteria, representative test sets, iterative measurement—and you’ll catch failures before users do.

Question via Hacker News