Office Hours — How do you actually test LLM apps beyond vibe checks?

How do you actually test LLM apps beyond vibe checks?

You need three layers. First, deterministic unit tests on the non-LLM parts. Your prompt formatting, token counting, response parsing, all that stuff should be boring and testable with assertions. Don’t let the LLM hide bad infrastructure.
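To make that first layer concrete, here is a minimal sketch of what "boring and testable" can look like. The `build_prompt` and `parse_response` functions are hypothetical stand-ins for whatever your app does around the model call, not anything from a specific library. The point is only that none of these tests touch an LLM.

```python
import json


def build_prompt(question: str, context: list[str]) -> str:
    """Hypothetical prompt formatter: numbered context snippets, then the question."""
    numbered = "\n".join(f"[{i}] {c.strip()}" for i, c in enumerate(context, start=1))
    return f"Context:\n{numbered}\n\nQuestion: {question.strip()}"


def parse_response(raw: str) -> dict:
    """Hypothetical parser: decode the span from the first '{' to the last '}'."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in response")
    return json.loads(raw[start:end + 1])


def test_build_prompt_numbers_context():
    prompt = build_prompt("  Why? ", ["a ", "b"])
    assert prompt == "Context:\n[1] a\n[2] b\n\nQuestion: Why?"


def test_parse_response_ignores_surrounding_chatter():
    raw = 'Sure! Here you go:\n{"label": "billing", "confidence": 0.9}\nHope that helps.'
    assert parse_response(raw) == {"label": "billing", "confidence": 0.9}


def test_parse_response_rejects_missing_json():
    try:
        parse_response("I cannot answer that.")
    except ValueError:
        pass  # expected: the parser should fail loudly, not return garbage
    else:
        raise AssertionError("expected ValueError")
```

Tests like these run in milliseconds under any test runner, so they can gate every commit without costing a single token.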

Second, build an eval dataset with known good outputs. 50-200 examples where you’ve manually verified the answer. Use Claude Opus 4.7 or GPT-5.5 to grade your app’s outputs against these reference answers on the dimensions you care about (factuality, tone, completeness). Tools like Anthropic’s eval tooling or an eval platform like Braintrust make this straightforward. Run this eval before and after changes. It won’t catch everything, but it catches drift.

Setting Up a Practical Eval Loop

Here’s what this actually looks like. You maintain a JSONL file with your gold-standard outputs:

{"input": "What's the capital of France?", "expected": "Paris", "category": "geography"}
{"input": "Explain async/await in JavaScript", "expected": "...", "category": "technical"}

Before deployment, you run your app against this dataset and ask Claude Opus 4.7 or GPT-5.5 to score each output on a 1-5 scale against your expected answer. Log the scores and the deltas from your last baseline. A 10% drop in average score is a signal to investigate. A 2% drop might be noise, but if you see it consistently, you’re drifting.

The cost is real but manageable. Scoring 100 outputs with a frontier model runs roughly $3-8 depending on output length and model choice. GPT-5.5 costs more per token but often needs fewer tokens to grade accurately. Claude Opus 4.7 has better long-context reasoning if your test cases are verbose. Run this before every production deploy and you’ll catch regressions in minutes instead of days.

Third, monitor your app in production. Log 5-10% of real queries and outputs, pull a sample every week, and read them yourself to check whether they’re actually good. This is where you find the edge cases your eval dataset missed. It’s tedious but necessary.
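One way to pick which requests to log is to hash a request ID into a bucket, sketched below. This assumes every request already carries a stable ID. The `should_log` name and the 5% default are illustrative, not taken from any particular logging library.

```python
import hashlib


def should_log(request_id: str, sample_rate: float = 0.05) -> bool:
    """Decide deterministically whether to log this request for later review.

    The ID is hashed to a number in [0, 1), and the request is kept when that
    number falls below sample_rate. The same ID always gets the same answer.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate
```

Hashing instead of calling a random number generator means a retried request or a multi-service trace is either logged everywhere or nowhere. That keeps the weekly review set coherent.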

Why Evals Miss Things

The mistake people make is thinking evals replace judgment. They don’t. An eval catching a 2% regression is useful. But you still need humans reading actual outputs regularly.

Evals only cover what’s in your benchmark. If your benchmark doesn’t include ambiguous requests, or follow-ups that contradict the initial prompt, or requests in languages you didn’t test, the eval won’t catch failures there. Production always has a longer tail than your dataset.

Also, evals grade agreement with your expected answers, not robustness. Two different models might both score 4/5 on factuality while arriving at their answers through different reasoning. One might hold up better against adversarial input or domain-specific terminology. The eval won’t tell you which. Sampling real production queries weekly lets you catch the weird failure modes that didn’t make it into your benchmark.

Consider a concrete example: a customer support classifier that scores 4.8/5 on your eval dataset might systematically misclassify support tickets from a specific vertical because your benchmark didn’t include that vertical’s jargon. The eval passes. Without sampling, the first signal is production complaints two weeks later. With weekly sampling, you can catch it within the first week.
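A simple check for this kind of blind spot is to compare the categories in your benchmark against the categories showing up in your reviewed production samples. The sketch below assumes whoever does the weekly review tags each sampled query with a `category`, using the same field name as the JSONL file. The `coverage_gaps` helper and its `min_share` threshold are hypothetical.

```python
from collections import Counter


def coverage_gaps(
    benchmark_cases: list[dict],
    production_samples: list[dict],
    min_share: float = 0.05,  # assumed: ignore categories below this share of traffic
) -> dict[str, float]:
    """Return production categories missing from the benchmark, with their traffic share."""
    covered = {case["category"] for case in benchmark_cases}
    seen = Counter(sample["category"] for sample in production_samples)
    total = sum(seen.values())
    if total == 0:
        return {}
    return {
        category: count / total
        for category, count in seen.items()
        if category not in covered and count / total >= min_share
    }
```

Anything this returns is a candidate for new benchmark cases: real traffic is arriving in a category the eval never exercises.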

The Integration Cost and Tradeoff

This three-layer approach adds real engineering work. Your CI/CD pipeline needs to run evals, which takes 5-15 minutes depending on dataset size. Your production logging needs to be careful about PII and token costs. Your weekly manual sampling needs a person. Budget for it.

The tradeoff is worth it. A $50-100/month eval budget can stop you from shipping a regression that costs $5,000 in support escalations. More importantly, it gives you signal instead of guesswork. You know when you’ve regressed. You catch edge cases before they become user complaints. You have a paper trail for why a change shipped or why it didn’t.

The practical minimum is 50 test cases if you’re shipping fast and iterating. 150-200 if you’re in production with real users depending on stability. Much larger datasets (500+) have diminishing returns unless you have distinct subdomains that need separate eval tracks.

Bottom line: Build evals against a known-good dataset and run them before every deployment, then sample real production queries weekly. The combination catches drift you’d otherwise miss until your users complain.

Question via Hacker News