Office Hours — What's your strategy for testing and evaluating LLM outputs in production now that Promptfoo was acquired?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: one question from the trenches, one opinionated answer.

What’s your strategy for testing and evaluating LLM outputs in production now that Promptfoo was acquired?

Promptfoo’s acquisition doesn’t actually change your core strategy—it just removes one tool from the shelf. The real work was never about the framework; it was about deciding what “good” means for your specific outputs and measuring it consistently.

If you were leaning on Promptfoo for eval infrastructure, you have two paths. First, migrate to an alternative like Braintrust, or roll your own evals using Claude Opus 4.7 or GPT-5.5 as the evaluator (both are solid judges now). Second, and more importantly, focus on what Promptfoo couldn’t do anyway: build domain-specific signals that matter to your business.
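A roll-your-own judge is less code than it sounds. A minimal sketch, assuming the official openai Python client; the model ID, prompt template, and 1-5 scale are placeholders for your own setup:

from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """Score the output below against this rubric.
Rubric: {rubric}
Output: {output}
Reply with a single integer from 1 (fails) to 5 (excellent)."""

def judge(output: str, rubric: str, model: str = "gpt-5.5") -> int:
    # Model ID is a placeholder; swap in whichever evaluator wins your bake-off.
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(rubric=rubric, output=output)}],
        temperature=0,  # keep scoring as repeatable as possible
    )
    return int(response.choices[0].message.content.strip())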

The Two-Tier Eval Stack

The maturity move is building a two-tier eval stack. One tier is automated: check if the output is syntactically valid JSON, doesn’t contain PII (use OpenAI’s Privacy Filter if you’re paranoid about it), and passes structural tests. The second tier is sampled human review—pull 50 outputs a week and have someone who understands your use case actually grade them. This catches drift that benchmarks miss.
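The PII check doesn’t need a vendor to get started. A crude first pass, with the caveat that these regexes are illustrative and nowhere near exhaustive:

import re

# Illustrative patterns only; real PII detection deserves a dedicated library or service.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),        # US SSN shape
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),  # email addresses
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),       # card-number-like digit runs
]

def contains_pii(text: str) -> bool:
    return any(p.search(text) for p in PII_PATTERNS)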

Automated evals should run on every production output. A simple example for a code generation task:

def eval_output(llm_output):
    # is_valid_python and run_ruff are project helpers: wrap ast.parse in a
    # try/except and check ruff's exit code, respectively.
    code = llm_output.code
    checks = {
        "valid_syntax": is_valid_python(code),
        "has_docstring": "def " in code and '"""' in code,
        "no_hardcoded_secrets": not any(s in code for s in ("api_key", "password")),
        "passes_basic_lint": run_ruff(code),
    }
    return sum(checks.values()) / len(checks)

This runs instantly, scales to 100% of outputs, and catches the obviously broken stuff. Human review catches the subtler failures: code that’s syntactically correct but inefficient, outputs that technically satisfy the prompt but miss the intent, or edge cases your rubric didn’t anticipate.
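The sampling half of that second tier is the easy part. A sketch of the weekly pull, assuming your logged outputs are a list of dicts; the field names here are made up:

import csv
import random

def sample_for_review(outputs: list[dict], n: int = 50) -> None:
    # Dump a random sample into a CSV your reviewer can grade by hand.
    sample = random.sample(outputs, min(n, len(outputs)))
    with open("weekly_review.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "prompt", "output", "grade", "notes"])
        writer.writeheader()
        for row in sample:
            writer.writerow({"id": row["id"], "prompt": row["prompt"],
                             "output": row["output"], "grade": "", "notes": ""})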

Model Selection Testing Across Frontiers

For model selection testing specifically, you’re comparing outputs across frontier models (GPT-5.5, Claude Opus 4.7, Gemini 3.1 Pro). Don’t just run one prompt. Run 20-30 prompt variations, then pick the top 5 and evaluate those against a consistent rubric across every candidate. The model that wins on your eval set usually wins in production, but only if your eval rubric captures what your users actually care about.
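The comparison loop itself is short. A sketch reusing the judge scorer from earlier, plus a hypothetical call_model(model, prompt) wrapper around your provider SDKs; the model IDs are placeholders:

from statistics import mean

MODELS = ["gpt-5.5", "claude-opus-4.7", "gemini-3.1-pro"]  # placeholder IDs

def compare_models(prompts: list[str], rubric: str) -> dict[str, float]:
    # Average the judge's score for each model across all prompt variations.
    return {
        model: mean(judge(call_model(model, p), rubric) for p in prompts)
        for model in MODELS
    }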

Cost matters here. GPT-5.5 is more expensive than Claude Sonnet 4.6, and Claude Sonnet 4.6 is cheaper than Opus 4.7. If your eval shows that Sonnet performs 92% as well as Opus on your specific task, that’s often worth the cost difference in production. Run the math on actual throughput and error rates, not just benchmark scores.
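That math is one function once you frame it as cost per successful output instead of cost per token. A sketch with entirely made-up prices and success rates:

def cost_per_success(price_per_1k_tokens: float, avg_tokens: int, success_rate: float) -> float:
    # Amortize retries: what does one output that actually ships cost?
    return (price_per_1k_tokens * avg_tokens / 1000) / success_rate

# Hypothetical numbers: the cheaper model at 92% of the flagship's success rate.
flagship = cost_per_success(0.015, 800, 0.95)
cheaper = cost_per_success(0.003, 800, 0.92 * 0.95)
print(f"flagship: ${flagship:.4f}/success  cheaper: ${cheaper:.4f}/success")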

Implementation Details and Tradeoffs

Tool-wise, if you’re comfortable with code, write your evals in Python using Claude’s code execution or GPT-5.4’s native execution capabilities. If you want something lighter, check Braintrust or just use a structured spreadsheet plus manual scoring. Yes, manually. It scales better than you’d think, especially for teams under 500K monthly outputs.

One edge case: if your LLM outputs feed into autonomous agents (which are now common in production), your eval bar changes. A typo in a single LLM output might cascade into 10 failed agent steps. In this case, invest more in automated structural checks and less in “does it read well” evals. Agent workflows need determinism more than they need prose quality.
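“Structural checks” here means schema validation before the next agent step ever fires. A sketch using pydantic; the ToolCall fields are illustrative, not a real agent framework’s schema:

from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    # Illustrative schema for one agent step; your real fields will differ.
    tool_name: str
    arguments: dict

def validate_agent_step(raw_json: str) -> ToolCall | None:
    try:
        return ToolCall.model_validate_json(raw_json)
    except ValidationError:
        return None  # reject here, before it cascades into ten failed steps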

Another edge case: if you’re evaluating code generation, actually run the code in a sandbox. Claude Opus 4.7 and GPT-5.4 both generate syntactically correct code that fails on execution. A simple pytest run catches more issues than any LLM-as-judge ever will.
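A minimal version of “actually run it”, assuming you shell out to pytest in a scratch directory; this gives process isolation only, and untrusted code deserves a real sandbox (container, gVisor, etc.):

import subprocess
import tempfile
from pathlib import Path

def run_generated_code(code: str, test_code: str, timeout: int = 30) -> bool:
    # Write the generated module plus its tests to a temp dir and run pytest.
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "generated.py").write_text(code)
        Path(tmp, "test_generated.py").write_text(test_code)
        try:
            result = subprocess.run(["pytest", "test_generated.py", "-q"],
                                    cwd=tmp, capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            return False  # hung code counts as a failure
        return result.returncode == 0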

Bottom line: Build eval infrastructure around your specific output requirements, not around a particular tool. Promptfoo’s acquisition is a signal to migrate, not to rethink your evaluation philosophy. Start with automated structural checks, layer in human sampling, and pick models based on your actual tradeoffs, not vendor lock-in.

Question via Hacker News