Office Hours — How do I actually know if my LLM is hallucinating in production?

You don’t, not really—not until a user complains or your monitoring catches it. But you can reduce the damage.

First, distinguish between types. Factual hallucinations (making up data) are different from reasoning hallucinations (bad logic). For factual stuff, you need grounding: RAG with a retrieval quality check, or constraint-based generation (limiting outputs to predefined options). For reasoning, temperature helps less than you’d think—lower temps reduce variance, not hallucinations.
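
To make the constraint-based option concrete, here is a minimal sketch of limiting outputs to predefined options. The label set and the fallback value are hypothetical, and this shows only post-hoc validation; it is not a substitute for whatever structured-output feature your provider offers.

```python
# Hypothetical label set; replace with the options your product actually supports.
ALLOWED_ANSWERS = {"refund_approved", "refund_denied", "needs_human_review"}


def constrain_output(raw_output: str, fallback: str = "needs_human_review") -> str:
    """Return the model output only if it is exactly one of the allowed options."""
    candidate = raw_output.strip().lower()
    # Anything outside the allowed set, including free-form prose, becomes the fallback.
    return candidate if candidate in ALLOWED_ANSWERS else fallback
```

With this in place, an invented detail can’t leak through: the output is either a known option or the fallback.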

Instrumentation and Detection

In production, instrument three things: (1) confidence signals from your LLM (e.g., logprobs to flag low-confidence tokens), (2) disagreement detection (run the same query twice—if answers differ significantly, you have a problem), and (3) user feedback loops (make it trivial to flag bad outputs).

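Disagreement detection is practical and relatively cheap. Call your model twice in parallel with identical input and the same nonzero temperature (at temperature 0 the two runs will mostly agree and tell you nothing). Parse both responses into a comparable format (a JSON schema works), then compute Levenshtein distance or embedding-based semantic similarity. If they diverge beyond a threshold, queue the query for human review. Expect roughly 30-50% added latency, since you wait on the slower of the two calls plus the comparison, and double the inference cost on every query you run it on, because you can’t know in advance which ones will diverge. If that is too expensive across the board, restrict the check to query types you already consider suspicious. In exchange, it catches inconsistent hallucinations before they reach users.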

Here’s a concrete implementation: store both responses and their embeddings. Use cosine similarity on embeddings as your primary gate (threshold 0.85), then fall back to token-level diff for short factual responses. In practice, on a question like “What is the current price of X?” where both calls hit the same RAG source, you get high agreement on the retrieval step but low agreement on invented details. Flag anything below your threshold and route to a deterministic verification step (API call, database lookup) before returning to the user.

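# Sketch of disagreement detection. model, embedding_model, and
# send_to_review_queue are placeholders for your own clients.
import asyncio

import numpy as np

THRESHOLD = 0.85  # cosine-similarity gate; tune on your own data


def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


async def answer_with_disagreement_check(query):
    # Issue both calls concurrently so latency is close to one call, not two
    response_1, response_2 = await asyncio.gather(
        model.generate(query, temperature=0.7),
        model.generate(query, temperature=0.7),
    )
    embed_1 = embedding_model.embed(response_1)
    embed_2 = embedding_model.embed(response_2)
    similarity = cosine_similarity(embed_1, embed_2)

    if similarity < THRESHOLD:
        # Route to verification or human review; nothing goes to the user yet
        await send_to_review_queue(query, response_1, response_2, similarity)
        return None
    return response_1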
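
The token-level fallback for short factual responses can be sketched with the standard library. The reason to have it: two short answers that differ only in one number can still look very similar to an embedding model, while a token diff sees the mismatch directly. The length cutoff and match threshold below are assumptions to tune, not recommended values.

```python
import difflib

SHORT_ANSWER_MAX_TOKENS = 12  # assumption: what counts as a "short factual response"
TOKEN_MATCH_THRESHOLD = 0.9   # assumption: tune on your own labeled examples


def is_short(text: str) -> bool:
    """Whitespace-token length check used to decide when the fallback applies."""
    return len(text.split()) <= SHORT_ANSWER_MAX_TOKENS


def token_agreement(a: str, b: str) -> float:
    """Ratio in [0, 1] of matching whitespace tokens between two responses."""
    matcher = difflib.SequenceMatcher(None, a.lower().split(), b.lower().split())
    return matcher.ratio()


def short_answers_agree(a: str, b: str) -> bool:
    """True when two short responses match token-for-token closely enough."""
    return token_agreement(a, b) >= TOKEN_MATCH_THRESHOLD
```

For example, "The price is $41.20" and "The price is $47.20" share three of four tokens, so `token_agreement` returns 0.75 and the pair is treated as a disagreement.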

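At 10k queries per day with a 5% hallucination rate, disagreement detection flags roughly 500 queries daily, assuming it catches all of them. Budget carefully, though: the second call is what finds those 500, so it has to run on every query you check, not only on the ones that end up flagged. The $8-12 per day figure (assuming GPT-5.4 or Claude Opus 4.7 pricing) covers two inference calls on about 500 queries, which is realistic only if a cheaper pre-filter has already narrowed 10k queries down to a suspicious subset of that size. Doubling calls across all 10k costs proportionally more, so compare that number, not the smaller one, against the cost of a hallucination reaching production.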
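
To price this for your own traffic, a back-of-envelope calculation is enough. The sketch below is illustrative only: the token counts and per-million-token prices are placeholders, not real rates for any model, so substitute your provider’s current pricing and your measured token usage.

```python
def daily_extra_cost(
    extra_calls: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_m_input: float,
    usd_per_m_output: float,
) -> float:
    """Added daily spend in USD for a given number of extra inference calls."""
    per_call = (input_tokens * usd_per_m_input + output_tokens * usd_per_m_output) / 1_000_000
    return extra_calls * per_call


QUERIES_PER_DAY = 10_000
HALLUCINATION_RATE = 0.05
flagged_per_day = round(QUERIES_PER_DAY * HALLUCINATION_RATE)  # 500

# Placeholder usage and prices (assumptions): 1,500 input and 500 output tokens
# per call, at $5 and $25 per million tokens respectively.
cost_subset = daily_extra_cost(flagged_per_day, 1_500, 500, 5.0, 25.0)
cost_all = daily_extra_cost(QUERIES_PER_DAY, 1_500, 500, 5.0, 25.0)
```

The gap between `cost_subset` and `cost_all` is a factor of 20 at these volumes, which is why it matters whether the duplicate call runs on a pre-filtered subset or on everything.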

Logprobs are harder to interpret than they seem. High logprobs on individual tokens don’t guarantee factual accuracy; they only tell you the model was confident in its next-token prediction. A hallucinated date can still have high logprobs. Use logprobs as a canary, not a blocker. Flag requests where the mean logprob over the tokens of a factual claim (dates, names, numbers) falls below -2.0, and send them to a secondary verification step.
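
As a rough sketch of that canary, reading the -2.0 threshold as a mean logprob per token: the input shape (a list of token/logprob pairs) is an assumption, since each API returns logprobs in its own structure, and "contains a digit" is a deliberately crude stand-in for real factual-span detection such as an NER pass.

```python
import re
from typing import List, Optional, Tuple

FACTUAL_TOKEN = re.compile(r"\d")  # crude assumption: tokens with digits carry facts
MEAN_LOGPROB_FLOOR = -2.0          # threshold from the text above


def mean_factual_logprob(tokens: List[Tuple[str, float]]) -> Optional[float]:
    """Average logprob over tokens that look factual, or None if there are none."""
    values = [logprob for token, logprob in tokens if FACTUAL_TOKEN.search(token)]
    return sum(values) / len(values) if values else None


def needs_secondary_check(tokens: List[Tuple[str, float]]) -> bool:
    """Flag (never block) responses whose factual tokens were low-confidence."""
    mean = mean_factual_logprob(tokens)
    return mean is not None and mean < MEAN_LOGPROB_FLOOR
```

Note that low confidence on non-factual tokens is ignored on purpose: a hesitant word choice is not the failure you are hunting for here.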

Design for Containment, Not Prevention

The core issue: you can’t fully prevent hallucinations with frontier models today. GPT-5.5 and Claude Opus 4.7 still make things up. Frontier models trade recall for fluency—they’d rather invent a plausible answer than admit uncertainty.

So design your system to contain hallucinations. Never let an LLM output directly into critical paths (payments, medical decisions, legal docs). Always have a human review step or deterministic verification layer. If you’re using RAG, verify the retrieval itself. Check that retrieved chunks actually support the claim before surfacing it.

For factual grounding, RAG with quality gates beats temperature tuning every time. Pair your retrieval with a fact-checking step: after the LLM generates an answer, extract claims and validate them against your source documents. Claude Opus 4.7 is strong at structured extraction for this. Feed the raw retrieved text plus the model’s claim into a separate call asking “does this text support this claim?” (binary yes/no). It’s one extra inference call per response. It is still a model call, so it is not truly deterministic, but a narrow yes/no judgment over supplied text is far more constrained than open-ended generation. Cost-wise, on a query volume of 10k/day, this adds roughly $15-25 daily for the verification calls, which is negligible next to the liability cost of a hallucination in production.
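
A minimal sketch of that support check follows. `ask_model` is a hypothetical callable (prompt in, raw text out) standing in for whichever client you use, and the prompt wording is only an illustration. The one deliberate design choice is failing closed: any reply other than a clear "yes" is treated as unsupported.

```python
from typing import Callable

VERIFY_PROMPT = (
    "Source text:\n{source}\n\n"
    "Claim:\n{claim}\n\n"
    "Does the source text support the claim? Answer with exactly one word: yes or no."
)


def claim_is_supported(claim: str, source: str, ask_model: Callable[[str], str]) -> bool:
    """Ask a separate model call whether the retrieved text supports the claim."""
    reply = ask_model(VERIFY_PROMPT.format(source=source, claim=claim))
    # Fail closed: hedged or malformed replies count as "not supported".
    return reply.strip().lower().rstrip(".") == "yes"
```

Claims that come back unsupported can be dropped from the answer or routed to the same review queue used for disagreement detection.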

Edge Cases and Tradeoffs

Disagreement detection fails when hallucinations are consistent. If your model is systematically wrong about something (e.g., a common misconception baked into training data), both inference runs will produce the same wrong answer. You need external verification for that—RAG, APIs, or human review.

Temperature is a false lever here. Lowering temperature from 1.0 to 0.3 reduces output variance but doesn’t improve accuracy on factual questions. It just makes the hallucination more consistent, which undermines disagreement detection: two low-temperature samples tend to agree whether or not the answer is right. If you’re going to tune anything, tune your RAG retrieval ranking and chunk size instead. A model with lower temperature and bad retrieval will confidently return the wrong answer twice.

Another edge case: models that hedge correctly. Claude Opus 4.7 often says “I’m not certain” or “I don’t have access to current data for this.” Disagreement detection can misinterpret this as hallucination when the model is actually being honest. Parse for explicit uncertainty language before flagging a query as suspicious. A simple regex check for phrases like “I don’t know,” “unclear,” or “I don’t have access” will reduce false positives by 40-50% in typical deployments.
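
The regex check might look like the sketch below. The phrase list is just the examples above plus “not sure”, so treat it as a starting point and extend it from your own logs. It accepts both straight and curly apostrophes, since model output can contain either.

```python
import re

# Phrases taken from the examples in the text; extend from your own logs.
UNCERTAINTY = re.compile(
    r"\b(i\s+don[’']t\s+know"
    r"|i[’']m\s+not\s+(certain|sure)"
    r"|unclear"
    r"|i\s+don[’']t\s+have\s+access)\b",
    re.IGNORECASE,
)


def is_explicit_hedge(response: str) -> bool:
    """True when the response openly admits uncertainty or missing access."""
    return UNCERTAINTY.search(response) is not None
```

Run it on both responses before the similarity gate; how you treat a pair where only one response hedges is a policy decision for your application.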

Bottom line: Stop trying to prevent hallucinations and start designing systems where they can’t cause harm. Use RAG with verification gates for facts, keep humans in the loop for decisions that matter, and implement disagreement detection as a backstop for inconsistent answers, knowing it won’t catch errors the model repeats consistently. One extra inference pass is cheap insurance.