Office Hours — How do you actually test LLM apps beyond vibe checks?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
How do you actually test LLM apps beyond vibe checks?
You need three layers. First, deterministic unit tests on the non-LLM parts. Your prompt formatting, token counting, response parsing, all that stuff should be boring and testable with assertions. Don’t let the LLM hide bad infrastructure.
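The first layer needs no LLM at all. A minimal sketch of what that looks like, with hypothetical `build_prompt` and `parse_answer` helpers standing in for your app’s own plumbing:

```python
# Hypothetical helpers standing in for your app's non-LLM plumbing.
# The point: plain assertions cover all of this, no model call needed.

def build_prompt(question: str, context: str) -> str:
    """Deterministic prompt assembly."""
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"

def parse_answer(raw: str) -> str:
    """Strip a leading 'Answer:' label the model sometimes echoes back."""
    text = raw.strip()
    if text.lower().startswith("answer:"):
        text = text[len("answer:"):].strip()
    return text

def test_build_prompt_includes_context():
    prompt = build_prompt("What is 2+2?", "Basic arithmetic.")
    assert "Basic arithmetic." in prompt
    assert prompt.endswith("Answer:")

def test_parse_answer_strips_label():
    assert parse_answer("Answer: 4") == "4"
    assert parse_answer("  4  ") == "4"

test_build_prompt_includes_context()
test_parse_answer_strips_label()
```

These run in milliseconds on every commit, which is exactly the point: the infrastructure around the model stays boring.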
Second, build an eval dataset with known-good outputs: 50-200 examples where you’ve manually verified the answer. Use a strong frontier model to grade your app’s outputs against these benchmarks on the dimensions you care about (factuality, tone, completeness). Tools like OpenAI Evals (open source) or Braintrust make this straightforward. Run the eval before and after every change. It won’t catch everything, but it catches drift.
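The eval harness itself can be tiny. A sketch, with the grader injected as a function: in production that function would prompt a strong model to compare output against reference and return a score, but a trivial exact-match stand-in keeps this runnable (`EvalCase`, `run_eval`, and `exact_match_grade` are illustrative names, not any library’s API):

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class EvalCase:
    query: str
    reference: str  # manually verified known-good answer

def run_eval(cases: List[EvalCase],
             app: Callable[[str], str],
             grade: Callable[[str, str, str], float]) -> float:
    """Run the app on each case; `grade` scores output 0..1 vs. reference."""
    scores = [grade(c.query, app(c.query), c.reference) for c in cases]
    return sum(scores) / len(scores)

# Stand-in grader. A real one would call a strong model and score
# factuality, tone, and completeness against the reference.
def exact_match_grade(query: str, output: str, reference: str) -> float:
    return 1.0 if output.strip() == reference.strip() else 0.0

cases = [EvalCase("capital of France?", "Paris"),
         EvalCase("2+2?", "4")]
score = run_eval(cases,
                 app=lambda q: "Paris" if "France" in q else "4",
                 grade=exact_match_grade)
assert score == 1.0
```

Run this before and after a change and diff the scores; a drop is your drift signal.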
Third, shadow your app in production. Log 5-10% of real queries and outputs, sample them weekly, and manually check if they’re actually good. This is where you find the edge cases your eval dataset missed. It’s tedious but necessary.
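The sampling side of that is a few lines. A sketch, assuming you already log query/output pairs somewhere durable; `should_log` and `weekly_sample` are illustrative names:

```python
import random

def should_log(sample_rate: float = 0.05, rng=random.random) -> bool:
    """Per-request coin flip: capture this query/output pair or not."""
    return rng() < sample_rate

def weekly_sample(logged_pairs, n: int = 50, seed=None):
    """Pull a manageable, reproducible batch for human review."""
    rng = random.Random(seed)
    return rng.sample(logged_pairs, min(n, len(logged_pairs)))

# Illustrative data in place of a real log store.
pairs = [(f"query {i}", f"output {i}") for i in range(1000)]
batch = weekly_sample(pairs, n=25, seed=7)
assert len(batch) == 25
```

Seeding the weekly draw means two reviewers looking at the same week see the same batch, which makes disagreements about quality easier to resolve.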
The mistake people make is thinking evals replace judgment. They don’t. An eval catching a 2% regression is useful. But you still need humans reading actual outputs regularly to catch the weird failure modes that didn’t make it into your benchmark.
Bottom line: Build evals against a known-good dataset and run them before every deployment, then sample real production queries weekly. The combination catches drift you’d otherwise miss until your users complain.
Question via Hacker News