Office Hours — What deterministic guarantees do you need from an LLM system for compliance-heavy enterprise use cases?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
The short answer: you can’t get them from LLMs alone. You need deterministic harnesses around non-deterministic models, and that distinction matters more than anything the model itself promises.
Why “Deterministic LLM” Is an Oxymoron
LLMs are sampling machines. Even with temperature locked at 0, you can get different outputs across inference runs: floating-point arithmetic on GPUs is not associative, and batching and kernel scheduling change the order of operations, so near-tied token probabilities can resolve differently from one run to the next. A German court just ruled (June 9) that Google is liable for AI Overview falsehoods, which suggests that “the model sometimes hallucinates” will not hold up as a defense. Compliance teams expect you to know what your system will output before it outputs it.
That’s impossible with raw LLMs. The model itself can’t guarantee anything. What you can do is build deterministic systems around the model.
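A minimal sketch of how run-to-run variation can arise even with greedy decoding: floating-point addition is not associative, so the same logit accumulated in a different order can land on a slightly different value, and an argmax over near-tied logits can flip. The `greedy_pick` helper and the two "runs" are illustrative only; real inference stacks are far more complex.

```python
# Floating-point addition is not associative: grouping changes the result.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)  # False


def greedy_pick(logits):
    """Temperature-0 decoding: return the index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Two candidate tokens whose logits are mathematically equal, but whose
# partial sums were accumulated in a different order on each run.
run_1 = [a, b]
run_2 = [b, a]

print(greedy_pick(run_1), greedy_pick(run_2))  # the winning token differs
```

The difference is in the last bit of the float, which is exactly why it only matters when two tokens are nearly tied. You can't rule those ties out, so you plan for them.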
The Three Guarantees That Actually Matter
Auditability. Every input, output, and decision path needs to be logged in enough detail that you can reconstruct the run afterward. This isn’t just for compliance; it’s the only way to debug why the system made a specific decision. When you’re liable for the output, you need to be able to explain it to a regulator or lawyer. Set up comprehensive logging from the start, including the full prompt, model name and version, temperature, seed (if you’re using one), timestamp, and the exact tokens returned. Store this in write-once storage that can’t be retroactively modified.
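One way to make retroactive edits detectable is to hash-chain the log, so each record commits to the one before it. The sketch below is a hedged illustration, not a storage recommendation: the in-memory list stands in for real write-once storage, and `append_record`, `verify_chain`, and the payload fields are hypothetical names. A hash chain makes tampering evident; it does not prevent it.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first record


def _digest(body: dict) -> str:
    """Hash a record body using a stable JSON serialization."""
    serialized = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest()


def append_record(log: list, payload: dict) -> dict:
    """Append a record that commits to the hash of the previous record."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    record = {**body, "hash": _digest(body)}
    log.append(record)
    return record


def verify_chain(log: list) -> bool:
    """Recompute every hash; any edited or reordered record breaks the chain."""
    prev_hash = GENESIS_HASH
    for record in log:
        body = {k: v for k, v in record.items() if k != "hash"}
        if record["prev_hash"] != prev_hash or record["hash"] != _digest(body):
            return False
        prev_hash = record["hash"]
    return True


audit_log = []
append_record(audit_log, {"prompt": "...", "temperature": 0, "seed": None, "output": "..."})
print(verify_chain(audit_log))
```

In production you would still want the records to live somewhere with retention controls, and you would publish or escrow the latest hash so the whole chain can't be quietly rebuilt.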
Retrieval verifiability. If your system pulls context from a document and makes a claim based on it, someone needs to be able to follow the citation back to source. This is where RAG either works for you or becomes a liability. If your system says “policy XYZ states that claims must be filed within 30 days,” you need to be able to point to exactly which document, page, and paragraph it pulled that from. Naive RAG often can’t do this cleanly. You need a pipeline that preserves metadata, chunk provenance, and relevance scores so you can reconstruct why a particular document was selected.
    # Example: auditable RAG with provenance
    # retrieve_documents() and write_to_audit_log() are your own helpers, defined elsewhere.
    from datetime import datetime, timezone

    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    MODEL = "claude-opus-4.8"

    def query_with_audit(question: str) -> str:
        results = retrieve_documents(question)

        # Record exactly which chunks were retrieved, and why
        audit_trail = {
            "question": question,
            "retrieved_docs": [
                {
                    "doc_id": r.doc_id,
                    "chunk_id": r.chunk_id,
                    "page_number": r.metadata["page"],
                    "relevance_score": r.score,
                    "text": r.text,
                }
                for r in results
            ],
            # Timezone-aware ISO 8601 string, so the record is JSON-serializable
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "retrieval_model": "embedding_v3",
        }

        # Build prompt with explicit attribution
        context_text = "\n".join(
            f"[SOURCE: {d['doc_id']}:page_{d['page_number']}] {d['text']}"
            for d in audit_trail["retrieved_docs"]
        )

        response = client.messages.create(
            model=MODEL,
            max_tokens=1024,  # required by the Messages API
            temperature=0,
            messages=[{
                "role": "user",
                "content": f"Answer based on these sources:\n{context_text}\n\nQuestion: {question}",
            }],
        )

        answer = response.content[0].text
        audit_trail["response"] = answer
        audit_trail["model"] = MODEL
        audit_trail["temperature"] = 0

        # Write to immutable log
        write_to_audit_log(audit_trail)
        return answer
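Logging provenance is half the job; the other half is checking that whatever the model cites was actually retrieved. The sketch below assumes you have instructed the model to repeat the `[SOURCE: doc_id:page_N]` tags in its answer (the prompt above does not do that yet) and that document IDs contain no `:` or `]`. `check_citations` is a hypothetical helper.

```python
import re

# Matches tags of the form [SOURCE: doc_id:page_N]
SOURCE_TAG = re.compile(r"\[SOURCE: ([^\]:]+):page_(\d+)\]")


def check_citations(response_text: str, retrieved_docs: list) -> dict:
    """Compare the sources cited in a response with the sources retrieved."""
    allowed = {(d["doc_id"], int(d["page_number"])) for d in retrieved_docs}
    cited = {
        (doc_id, int(page))
        for doc_id, page in SOURCE_TAG.findall(response_text)
    }
    return {
        "cited": sorted(cited),
        # Citations that point at something the retriever never returned
        "unknown": sorted(cited - allowed),
        "has_citation": bool(cited),
    }


docs = [{"doc_id": "policy_xyz", "page_number": 4}]
answer = "Claims must be filed within 30 days [SOURCE: policy_xyz:page_4]."
print(check_citations(answer, docs))
```

An answer with no citation, or with a citation in `unknown`, is a candidate for rejection or human review. This only verifies that the pointer is real, not that the cited text supports the claim.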
Output validation against schema. If you’re extracting structured data (claim decisions, contract terms, compliance assessments), validate that outputs conform to a schema before they enter the system. Use the structured-output features in Claude’s or OpenAI’s APIs, which constrain generation to a JSON schema you supply. But don’t stop there: a schema-valid response can still be wrong, so check that required fields are present and that their values are sensible. If a system outputs a policy decision, it should also output the reasoning and a confidence score. If confidence is low, flag it for human review rather than letting it pass through silently.
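Here is a rough sketch of that second layer of validation using only the standard library. The field names (`decision`, `reasoning`, `confidence`), the allowed decisions, and the 0.8 threshold are all assumptions for illustration; substitute your own schema and tune the threshold against cases your reviewers have already judged.

```python
import json

ALLOWED_DECISIONS = {"approve", "deny", "escalate"}  # hypothetical decision set
REVIEW_THRESHOLD = 0.8  # hypothetical cutoff; tune against reviewed cases


def validate_output(raw: str) -> list:
    """Return a list of validation errors; an empty list means usable output."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    errors = []
    decision = data.get("decision")
    if not isinstance(decision, str) or decision not in ALLOWED_DECISIONS:
        errors.append("decision must be one of: " + ", ".join(sorted(ALLOWED_DECISIONS)))
    reasoning = data.get("reasoning")
    if not isinstance(reasoning, str) or not reasoning.strip():
        errors.append("reasoning must be a non-empty string")
    confidence = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        errors.append("confidence must be a number")
    elif not 0.0 <= confidence <= 1.0:
        errors.append("confidence must be between 0 and 1")
    return errors


def route(raw: str) -> str:
    """Reject invalid output, send low-confidence output to a human."""
    if validate_output(raw):
        return "reject"
    confidence = json.loads(raw)["confidence"]
    return "proceed" if confidence >= REVIEW_THRESHOLD else "human_review"


print(route('{"decision": "deny", "reasoning": "Late filing.", "confidence": 0.41}'))
```

A model-reported confidence value is only a rough signal, so treat the threshold as a triage knob rather than a measure of correctness.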
What You Can’t Guarantee (And Need to Accept)
You cannot guarantee that the model won’t hallucinate, even with temperature at 0. You also can’t guarantee that it will fail loudly: Simon Willison documented recently that Claude Fable 5 sometimes returns “silent refusals,” where it stops helping without explanation, which makes debugging a nightmare for production systems. You need human review loops for high-stakes decisions.
For truly sensitive decisions (denying a claim, flagging fraud, making hiring recommendations), the model output should not be the decision. It should be input to a decision. A human must sign off, and the system should be designed to make the human’s job faster and more informed, not to automate judgment away.
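In code, "input to a decision" can be enforced as a hard gate: the model produces a recommendation, and nothing high-stakes is finalized without a named reviewer. This is a minimal sketch; the action names, the `Recommendation` type, and `finalize` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical set of actions that always require a human sign-off
HIGH_STAKES = {"deny_claim", "flag_fraud", "hiring_recommendation"}


@dataclass
class Recommendation:
    action: str
    reasoning: str
    approved_by: Optional[str] = None  # reviewer ID, set only by a human


def finalize(rec: Recommendation) -> str:
    """Turn a recommendation into a decision, refusing unsigned high-stakes ones."""
    if rec.action in HIGH_STAKES and not rec.approved_by:
        raise PermissionError(f"{rec.action} requires human sign-off")
    return rec.action


rec = Recommendation(action="deny_claim", reasoning="Filed after the 30-day window.")
rec.approved_by = "analyst_17"  # recorded when the analyst approves
print(finalize(rec))
```

Raising an exception instead of returning a flag is deliberate: a caller that forgets to check can't accidentally ship the decision.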
The Compliance-Specific Architecture
Compliance-heavy domains need a different system shape. Instead of “model makes decision,” you’re building “model assists analyst who makes decision.”
Use Claude Opus 4.8 or GPT-5.5 for the reasoning (they’re more stable and less likely to fabricate), not cheaper or faster models. Use temperature 0. Have the model output reasoning explicitly, not just conclusions. Implement multi-step checks: retrieval validation (is the document real?), output schema validation (does the output match what we expect?), and consistency validation (does this decision contradict earlier decisions on similar cases?).
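Of those checks, consistency validation is the one teams most often skip, so here is a hedged sketch. It assumes "similar cases" means an exact match on a few fields you choose; a real system might use fuzzier matching. `find_conflicts` and the case fields are hypothetical.

```python
def find_conflicts(new_case: dict, prior_cases: list, key_fields: tuple) -> list:
    """Return prior cases that match on key_fields but were decided differently."""
    signature = tuple(new_case[f] for f in key_fields)
    return [
        prior
        for prior in prior_cases
        if tuple(prior[f] for f in key_fields) == signature
        and prior["decision"] != new_case["decision"]
    ]


prior_cases = [
    {"case_id": "c1", "policy_id": "XYZ", "days_late": 0, "decision": "approve"},
    {"case_id": "c2", "policy_id": "XYZ", "days_late": 12, "decision": "deny"},
]
new_case = {"case_id": "c3", "policy_id": "XYZ", "days_late": 0, "decision": "deny"}

conflicts = find_conflicts(new_case, prior_cases, ("policy_id", "days_late"))
print([c["case_id"] for c in conflicts])
```

A non-empty result doesn't mean the new decision is wrong. It means an analyst should see both cases side by side before anything is finalized.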
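Anthropic’s prompt caching (mentioned in the June 10 Daily Signal) helps here. If you’re repeatedly analyzing documents against the same policies, caching the policy text cuts cost and latency. It won’t make outputs deterministic, but because a cache hit requires an identical prefix, it pushes you toward a fixed, versioned policy prompt, which is what you want for reproducibility anyway.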
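A sketch of what that looks like as a request, assuming the Anthropic Messages API with a `cache_control` marker on the system block. `build_request` is a hypothetical helper that only builds the keyword arguments, so you can log them before sending; the model name is the one used elsewhere in this post, and very short policy texts may fall below the minimum cacheable length.

```python
def build_request(policy_text: str, document_text: str) -> dict:
    """Build Messages API kwargs with the policy text marked as cacheable."""
    return {
        "model": "claude-opus-4.8",  # model name taken from this post
        "max_tokens": 1024,
        "temperature": 0,
        # The stable policy text goes first and is marked for caching
        "system": [
            {
                "type": "text",
                "text": policy_text,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the per-request document changes between calls
        "messages": [
            {
                "role": "user",
                "content": f"Assess this document against the policy:\n\n{document_text}",
            }
        ],
    }


request = build_request("POLICY XYZ: claims must be filed within 30 days.", "Claim filed on day 12.")
print(request["system"][0]["cache_control"])
# To send it: anthropic.Anthropic().messages.create(**request)
```

Keeping the policy in its own block means the cacheable prefix stays byte-identical across requests, and the exact policy version you sent is easy to record in the audit log.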
The Hard Truth About Regulations
The German court ruling means liability is shifting. It’s not “the model said it, so it’s not our fault.” It’s “you deployed a system that made a claim, so you’re responsible for it being accurate.” That changes the risk calculus completely. Build assuming a lawyer will read your system’s logic.
Bottom line: Stop asking what guarantees the LLM gives you. Ask what guarantees your harness around the LLM provides. Determinism comes from auditable retrieval, schema validation, explicit reasoning, human review gates, and write-once logging—not from the model itself.
Question via Hacker News