Office Hours — Should I A/B test my LLM prompts in production or is that overkill?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Should I A/B test my LLM prompts in production or is that overkill?
It depends on your stakes and how often you ship prompt changes. If you’re running a customer-facing feature where prompt quality directly affects conversion, retention, or support load, then yes, A/B test it. You’ll catch real behavior differences that eval sets miss, because production inputs are messier and more diverse than your test cases.
But there’s a cheaper path first. Run a few hundred examples from your actual production logs through both prompt versions offline using Claude Opus 4.6 or GPT-5.4, score them with a rubric or a model-based evaluator, and see if one clearly wins. This takes maybe an hour and costs $5-10. If the results are ambiguous, then go live with A/B testing.
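That offline pass is just a loop over logged examples with a scorer at the end. Here’s a minimal sketch of the harness; the `run_prompt` and `score` functions are stubs standing in for your actual model API and rubric evaluator (their names and the stub behavior are illustrative, not any particular SDK):

```python
# Hypothetical stand-ins: in practice these call your model API
# and a model-based evaluator with your rubric.
def run_prompt(prompt_version: str, example: str) -> str:
    return f"{prompt_version}:{example}"

def score(output: str) -> float:
    # A rubric evaluator would return a real score here; this stub
    # rates variant B higher just so the harness has something to compare.
    return 0.8 if output.startswith("B") else 0.6

def compare_offline(examples, prompt_a="A", prompt_b="B"):
    """Run both prompt versions over logged examples; return mean score per version."""
    totals = {"A": 0.0, "B": 0.0}
    for ex in examples:
        totals["A"] += score(run_prompt(prompt_a, ex))
        totals["B"] += score(run_prompt(prompt_b, ex))
    return {v: t / len(examples) for v, t in totals.items()}

results = compare_offline([f"logged input {i}" for i in range(200)])
```

If `results` shows a clear gap on a few hundred real examples, you may not need the production test at all.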
The actual overhead of A/B testing is smaller than it used to be. You route a percentage of traffic to variant B, log which version was used, and measure downstream metrics (latency, user corrections, escalations, whatever matters). Just make sure your logging is clean from day one, because retrofitting it is painful.
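A common way to do the routing is deterministic bucketing on a stable user ID, so the same user always sees the same variant and assignments are reproducible from logs. A sketch, assuming a 10% rollout to variant B (the percentage and log field names are placeholder choices):

```python
import hashlib

def assign_variant(user_id: str, rollout_pct: int = 10) -> str:
    """Deterministically bucket a user into 0-99; same ID, same variant every time."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "B" if bucket < rollout_pct else "A"

def log_request(user_id: str, variant: str, latency_ms: float, escalated: bool) -> dict:
    # Record which variant served each request from day one; you cannot
    # reconstruct assignments after the fact if this is missing.
    return {
        "user": user_id,
        "variant": variant,
        "latency_ms": latency_ms,
        "escalated": escalated,
    }  # in practice, write this to your metrics store

variant = assign_variant("user-42")
record = log_request("user-42", variant, latency_ms=850.0, escalated=False)
```

Hashing instead of random assignment means no session state to store, and a user never flips between variants mid-conversation.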
Where people screw up: they A/B test without defining success metrics beforehand, or they stop the test before it has enough samples to detect real differences. With LLMs, you usually need at least a few thousand samples per variant before differences separate from noise.
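To put a rough number on “a few thousand”: the standard two-proportion power calculation gives the per-arm sample size needed to detect a given absolute lift in a binary metric (say, escalation rate). This sketch hardcodes a two-sided 5% significance level and 80% power:

```python
import math

def min_samples_per_arm(p_base: float, mde: float) -> int:
    """Approximate per-arm sample size to detect an absolute lift `mde`
    over a baseline rate `p_base` (two-proportion z-test, alpha=0.05, power=0.8)."""
    z_alpha, z_beta = 1.96, 0.84          # two-sided 5% alpha, 80% power
    p_avg = p_base + mde / 2              # pooled rate under the alternative
    p_b = p_base + mde
    n = ((z_alpha * math.sqrt(2 * p_avg * (1 - p_avg))
          + z_beta * math.sqrt(p_base * (1 - p_base) + p_b * (1 - p_b))) ** 2) / mde ** 2
    return math.ceil(n)

n = min_samples_per_arm(p_base=0.10, mde=0.02)
```

Detecting a 2-point lift over a 10% baseline works out to roughly 3,800 samples per arm, which is where the “few thousand” rule of thumb comes from; smaller effects need quadratically more traffic.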
Bottom line: Skip A/B testing for low-stakes features. For anything that impacts user experience or business metrics, do quick offline evaluation first, then A/B test in production if that’s inconclusive.
Question via Hacker News