Office Hours — What tools are you using for AI evals, and why does everything feel half-baked?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
What tools are you using for AI evals, and why does everything feel half-baked?
Because the industry is still figuring out what “evaluation” actually means. You’ve got three separate problems people keep conflating.
First, benchmark evals. Datasets like MMLU or custom test sets tell you if a model learned the task, but they don’t tell you if it’ll work in production. A 92% accuracy score on your eval set doesn’t prevent the model from silently breaking when real users feed it edge cases you didn’t anticipate. Most teams are still running evals offline and shipping based on lab numbers, which is backwards.
The gap between lab and production is real. You might eval Claude Opus 4.7 on a response generation task against a clean 500-example validation set and get strong numbers. Ship it, and the model encounters production data with malformed JSON, missing fields, or inputs whose context lengths blow past anything in your test distribution. Benchmark tools like OpenAI Evals and Braintrust catch the first category, not the second.
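For reference, the offline half is not complicated. Here is a minimal sketch of that kind of validation run, assuming a JSONL file of prompt/expected pairs and exact-match grading; both are assumptions for illustration, not something any particular tool requires:
# Minimal offline eval loop: accuracy against a static validation set
import anthropic
import json

client = anthropic.Anthropic()

def run_offline_eval(path="validation.jsonl", model="claude-opus-4-7"):
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            example = json.loads(line)  # expects {"prompt": ..., "expected": ...}
            response = client.messages.create(
                model=model,
                max_tokens=1024,
                messages=[{"role": "user", "content": example["prompt"]}],
            )
            answer = response.content[0].text.strip()
            correct += int(answer == example["expected"].strip())  # exact match; swap in your own grader
            total += 1
    return correct / total if total else 0.0
The number that loop spits out is the lab metric. Nothing in it touches malformed inputs, drift, or agent behavior.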
Second, production monitoring. You need to catch when models drift or fail silently after deployment. Most teams lack the instrumentation to detect degradation before users notice. You need human-in-the-loop feedback loops, error categorization, and real metrics tied to business outcomes, not benchmark scores.
Setting Up Real Feedback Loops
A concrete pattern: log model outputs to a lightweight queue (SQS, Kafka), sample 5-10% for human review, and tag errors by root cause (hallucination, wrong tool call, timeout, out-of-distribution input). After two weeks you’ll have actual signal about what’s breaking. Then feed that signal back into your offline evals so you’re not benchmarking against synthetic data that doesn’t match your failure modes.
For a coding agent running on Claude Opus 4.7 across 10,000 daily requests, 500-1,000 samples per week is enough to catch systematic failures. Tag each error: "agent retried same hallucinated function call", "missing permission for file write", "context window exceeded on large repo", "test suite never reached terminal state". After three weeks you might find that 12% of failures stem from agent architecture (a broken loop condition) versus 3% from model hallucinations. That ratio completely changes how you allocate engineering effort.
Here is what the sampling infrastructure actually looks like in practice:
# Anthropic SDK hook for output logging
import anthropic
import json, boto3, random, time

client = anthropic.Anthropic()
sqs = boto3.client("sqs")

def completion_with_logging(messages, model="claude-opus-4-7", sample_rate=0.08):
    response = client.messages.create(
        model=model,
        max_tokens=4096,
        messages=messages,
    )
    # Sample roughly 8% of traffic into the human-review queue
    if random.random() < sample_rate:
        sqs.send_message(
            QueueUrl="https://sqs.us-east-1.amazonaws.com/your-queue",
            MessageBody=json.dumps({
                "input": messages,
                "output": response.content[0].text,
                "model": model,
                "usage": response.usage.model_dump(),  # input/output token counts
                "timestamp": time.time(),              # wall-clock time, not the message id
            }),
        )
    return response
The queue feeds a Postgres table. A two-column tagging UI built in React takes about two days. Reviewers label each sample with an error category and a severity. After 30 days you have a labeled dataset that reflects actual production failures, not anything you invented in advance. That dataset becomes your next eval suite.
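The tagging schema itself can stay small. Here is a sketch of the reviewed record and the breakdown calculation, with error categories lifted from the coding-agent example above; the names are illustrative, not a standard taxonomy:
# Illustrative tagging schema for reviewed samples
from dataclasses import dataclass
from enum import Enum
from collections import Counter

class ErrorCategory(Enum):
    NONE = "no_error"
    HALLUCINATED_TOOL_CALL = "hallucinated_tool_call"
    BROKEN_LOOP = "agent_loop_never_terminated"
    PERMISSION = "missing_permission"
    CONTEXT_OVERFLOW = "context_window_exceeded"
    OUT_OF_DISTRIBUTION = "out_of_distribution_input"

@dataclass
class ReviewedSample:
    sample_id: str
    category: ErrorCategory
    severity: int  # 1 = cosmetic, 3 = user-visible failure

def failure_breakdown(samples: list[ReviewedSample]) -> dict[str, float]:
    """Share of reviewed samples per error category: the ratio that drives prioritization."""
    counts = Counter(s.category.value for s in samples)
    total = len(samples) or 1
    return {category: n / total for category, n in counts.items()}
That breakdown is the number worth reporting: the split between harness failures and model failures, not another aggregate accuracy score.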
Here is what the sampling infrastructure costs. A basic setup includes a logging layer (Anthropic SDK hooks plus custom middleware), a sampling queue with a UI for tagging (lightweight, you can build this in 2-3 days with a Postgres table and a React component), and a metrics pipeline feeding back to your eval runner. Total engineering lift: 3-4 weeks. Most teams skip this because it doesn’t produce benchmark numbers for stakeholders.
Third, agentic system evals. The architecture matters more than the model. A recent analysis showed that ReAct agents waste significant throughput on hallucinated tool calls, and the culprit is usually a broken harness, not model error. You can't eval your way out of a fundamentally flawed agent design with a better prompt. Claude Opus 4.7 won't save you from a tool schema that doesn't match your API, a retrieval step that returns irrelevant context, or a loop that never terminates cleanly.
When Benchmarks Mislead
A 95% benchmark score on GPT-5.5 for multi-step coding tasks tells you the model can solve problems in isolation. It doesn’t tell you whether your agent’s retry logic creates infinite loops on ambiguous errors, or whether your tool definitions cause the model to call functions that don’t exist, or whether your context window strategy loses critical information across tool calls. You can only find this out by running the full agentic system on representative production scenarios and watching it fail.
This is where most teams underestimate the cost of misalignment. A coding agent might nail your synthetic benchmarks but systematically fail on repos with unusual directory structures, missing dependencies, or non-standard build configurations. Those failure modes never appear in your curated test set because the set was built around happy paths.
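One way to surface those failure modes is a scenario-level check that runs the whole agent and asserts on the trajectory rather than the final answer. A sketch, assuming your harness can hand back the sequence of tool calls and whether the run reached a terminal state; the trajectory format here is an assumption, not a library API:
# Scenario-level agent eval: judge the trajectory, not just the final answer
MAX_STEPS = 25
KNOWN_TOOLS = {"read_file", "write_file", "run_tests", "search_repo"}  # your real tool schema goes here

def evaluate_trajectory(trajectory: dict) -> list[str]:
    """Return failure tags for one agent run; an empty list means the run passed."""
    failures = []
    tool_calls = trajectory["tool_calls"]  # e.g. ["read_file", "run_tests", ...]
    if len(tool_calls) > MAX_STEPS:
        failures.append("loop_never_terminated")
    if any(name not in KNOWN_TOOLS for name in tool_calls):
        failures.append("hallucinated_tool_call")
    if not trajectory["reached_terminal_state"]:
        failures.append("no_terminal_state")
    # Crude heuristic: the same tool called back-to-back often means a stuck retry loop
    if any(a == b for a, b in zip(tool_calls, tool_calls[1:])):
        failures.append("repeated_identical_call")
    return failures
Run it over a few dozen deliberately ugly repos: unusual directory structures, missing dependencies, non-standard build configurations.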
Most tools focus on benchmark metrics (OpenAI Evals, Langfuse, Braintrust) because they’re easy to measure. Production monitoring is fragmented across observability platforms that weren’t built for LLMs. And agentic evals don’t really exist yet as a standardized practice. You’re likely building custom harnesses per agent, which means the eval methodology doesn’t transfer to the next project.
Tradeoffs in Practice
Standardized benchmarks are reproducible but miss production failure modes. Custom production monitoring is expensive but actually tells you what matters. A typical setup costs 4-6 weeks of engineering time upfront. Most teams underfund this because it doesn’t produce neat numbers for stakeholders. The alternative is shipping based on lab metrics, then discovering regressions in production when users hit unexpected inputs.
One more edge case: multi-model setups like GitHub Copilot (which now routes across GPT-5.4, Claude Sonnet 4.6, and Gemini 3.1 Pro) create a new eval problem. You can’t benchmark the system without benchmarking the routing logic. If your agent picks the wrong model for a task type, no amount of prompt optimization fixes it. You need evals that test model selection accuracy, not just individual model performance.
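Concretely, that means scoring the router against tasks where you already know which model should win. A sketch, assuming a hypothetical router(task) callable and a hand-labeled task list; the labels are your own judgment calls, not ground truth from any vendor:
# Routing-accuracy eval: does the router pick the model a human would have picked?
def routing_accuracy(router, labeled_tasks):
    """labeled_tasks is a list of (task_description, expected_model) pairs."""
    hits = sum(1 for task, expected in labeled_tasks if router(task) == expected)
    return hits / len(labeled_tasks)

# Hypothetical usage; `my_router` is whatever callable your system uses to pick a model:
# labeled = [("refactor a 2,000-line class across files", "claude-sonnet-4-6"),
#            ("explain what this regex matches", "gpt-5-4")]
# print(routing_accuracy(my_router, labeled))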
Bottom line: Start with production monitoring before you obsess over benchmark evals. Build error classification and human feedback loops so you can catch real failures fast, then use that signal to drive what you actually eval offline. A 95% benchmark score means nothing if your top user sees hallucinations on day two.
Question via Hacker News