LLM Observability in Production
Technical overview of LLM observability: monitoring hallucinations, token costs, latency, and output drift across multi-step agent workflows, with platform comparisons.
Every traditional APM tool — Datadog, New Relic, Grafana — can tell you an HTTP request took 2.3 seconds and returned a 200. None of them can tell you the response was a hallucinated JSON blob that passed validation but contained fabricated data. LLM observability is a distinct discipline because the failure modes are distinct: outputs can be structurally correct and semantically wrong, costs scale with content rather than compute, and latency depends on how much the model “thinks” rather than how busy the server is.
This post covers the instrumentation, metrics, platforms, and architectural patterns required to operate LLM applications with the same rigor applied to traditional production systems.
Table of Contents
- Why Traditional Observability Breaks
- The Five Pillars of LLM Observability
- Tracing Multi-Step Agent Workflows
- Hallucination Monitoring
- Output Drift Detection
- Token Cost Tracking
- Latency Profiling
- Platform Comparison
- Instrumentation Patterns
- Building a Custom Observability Layer
- Alerting Strategies
- Summary
- Further Reading
Why Traditional Observability Breaks
Traditional observability rests on three pillars: metrics, logs, and traces. LLM applications break assumptions embedded in all three.
Metrics: Request count, error rate, and p99 latency don’t capture the dominant failure mode — a 200 OK response containing wrong information. A hallucinated answer is not an error in any HTTP sense.
Logs: A single LLM call can generate 4,000+ tokens of output. Logging raw completions at scale produces terabytes of unstructured text that’s expensive to store and nearly impossible to query meaningfully without semantic analysis.
Traces: A multi-step agent workflow might make 3–15 LLM calls, interleaved with tool calls, retrieval steps, and branching logic. Traditional distributed tracing assumes a request-response tree. Agent traces are often graphs with cycles (retry loops, self-correction steps).
Traditional APM tools miss the semantic failure modes that dominate LLM application debugging.
The Five Pillars of LLM Observability
LLM observability requires five distinct measurement categories, each with its own instrumentation approach:
| Pillar | What It Measures | Why It’s Hard |
|---|---|---|
| Trace completeness | Full execution path across LLM calls, tool use, retrieval | Agent workflows branch, loop, and self-correct |
| Output quality | Hallucination rate, factual accuracy, format compliance | Requires semantic evaluation, not just schema validation |
| Output drift | Changes in model behavior over time | Provider-side model updates are silent and continuous |
| Cost attribution | Token spend per user, feature, workflow | Input/output token ratios vary wildly; cached vs uncached pricing differs |
| Latency decomposition | Time spent in each phase: routing, queueing, TTFT, generation | Time-to-first-token and total generation time measure different things |
The five pillars of LLM observability and their interdependencies.
Tracing Multi-Step Agent Workflows
A single user request to an agent system might trigger this sequence: parse intent → retrieve documents → generate plan → execute tool calls → validate output → summarize response. Each step involves different models, different latencies, and different failure modes.
Trace Structure
LLM traces need a hierarchical structure that captures:
- Sessions: A conversation or user interaction (may span multiple requests)
- Traces: A single end-to-end request processing
- Spans: Individual operations within a trace (LLM calls, retrieval, tool execution)
- Generations: The specific LLM input/output pairs within a span
Trace hierarchy for LLM applications — sessions contain traces, traces contain typed spans.
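The hierarchy above can be sketched as plain data classes. This is a minimal illustration, not the schema of any particular platform; the class and field names (`kind`, `children`, `total_tokens`) are assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One LLM input/output pair."""
    model: str
    input_tokens: int
    output_tokens: int


@dataclass
class Span:
    """One operation in a trace; spans nest through `children`."""
    name: str
    kind: str  # illustrative labels such as "llm", "retrieval", "tool"
    generations: list[Generation] = field(default_factory=list)
    children: list["Span"] = field(default_factory=list)

    def total_tokens(self) -> int:
        own = sum(g.input_tokens + g.output_tokens for g in self.generations)
        return own + sum(child.total_tokens() for child in self.children)


@dataclass
class Trace:
    """One end-to-end request."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(span.total_tokens() for span in self.spans)


@dataclass
class Session:
    """A conversation that may span several traces."""
    session_id: str
    traces: list[Trace] = field(default_factory=list)


retrieval = Span("retrieve_documents", "retrieval")
llm_span = Span("generate_plan", "llm", [Generation("claude-sonnet-4.6", 2847, 512)])
session = Session("sess-1", [Trace("trace-789", [retrieval, llm_span])])
```

Keeping generations as children of spans, rather than as a flat list on the trace, is what lets token counts and costs roll up per step as well as per request.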
What to Capture Per LLM Span
Every LLM call should record:
{
  "span_id": "abc-123",
  "trace_id": "trace-789",
  "model": "claude-sonnet-4.6",
  "provider": "anthropic",
  "input_tokens": 2847,
  "output_tokens": 512,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "total_cost_usd": 0.012981,
  "latency_ms": 1843,
  "ttft_ms": 312,
  "temperature": 0.7,
  "top_p": 1.0,
  "stop_reason": "end_turn",
  "input_messages": [...],
  "output_content": "...",
  "metadata": {
    "user_id": "user-456",
    "feature": "code-review",
    "environment": "production"
  }
}
The cache_read_tokens and cache_write_tokens fields matter for cost accuracy. Anthropic’s prompt caching, for example, charges differently for cache writes vs cache reads vs uncached input. Ignoring this distinction can make cost estimates off by 50-90% for cache-heavy workloads.
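A small worked example shows how large the error can get. The prices below are illustrative per-million-token figures, and the token counts are made up; treat both as assumptions and substitute your provider's current numbers.

```python
# Illustrative prices in USD per million tokens (assumed; check the pricing page).
INPUT_PER_M = 3.00
CACHE_READ_PER_M = 0.30


def input_cost(total_input_tokens: int, cache_read_tokens: int = 0) -> float:
    """Input-side cost in USD, billing cache reads at the cache-read rate."""
    uncached = total_input_tokens - cache_read_tokens
    return (uncached * INPUT_PER_M + cache_read_tokens * CACHE_READ_PER_M) / 1_000_000


naive = input_cost(10_000)                             # cache fields ignored
actual = input_cost(10_000, cache_read_tokens=9_000)   # 90% served from cache
overestimate = (naive - actual) / naive                # share of the naive figure never billed
```

With these assumed numbers, 81% of the naive estimate is spend that never happened, which is why the cache fields belong on every span.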
Handling Agent Loops
Agent workflows often involve retry loops and self-correction. A naive trace implementation records these as a flat list of spans, losing the causal relationship. The trace should capture parent-child relationships between correction steps:
# Pseudocode for traced agent loop
with tracer.start_trace("user_request") as trace:
    plan = await generate_plan(trace, user_input)
    for attempt in range(max_retries):
        with trace.start_span(f"execution_attempt_{attempt}") as span:
            result = await execute_plan(span, plan)
            validation = await validate_result(span, result)
            span.set_attribute("validation.passed", validation.passed)
            span.set_attribute("validation.issues", validation.issues)
            if validation.passed:
                break
            # Self-correction: the correction span is a child of this attempt
            with span.start_span("self_correction") as correction:
                plan = await revise_plan(correction, plan, validation.issues)
This structure makes it possible to answer: “How many attempts did this request take?” and “What validation failures triggered corrections?”
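Once spans carry those attributes, both questions reduce to a scan over the exported spans. The sketch below assumes spans have been exported as dictionaries; the record layout (`id`, `parent`, `attributes`) is hypothetical and will differ per platform.

```python
from collections import Counter

# Hypothetical exported span records; field names are assumptions for illustration.
spans = [
    {"id": "s1", "parent": None, "name": "execution_attempt_0",
     "attributes": {"validation.passed": False,
                    "validation.issues": ["missing_field"]}},
    {"id": "s2", "parent": "s1", "name": "self_correction", "attributes": {}},
    {"id": "s3", "parent": None, "name": "execution_attempt_1",
     "attributes": {"validation.passed": True, "validation.issues": []}},
]


def summarize_attempts(spans: list[dict]) -> dict:
    """Count attempts and corrections, and tally the validation issues seen."""
    attempts = [s for s in spans if s["name"].startswith("execution_attempt_")]
    issues = Counter(
        issue
        for s in attempts
        for issue in s["attributes"].get("validation.issues", [])
    )
    corrections = sum(1 for s in spans if s["name"] == "self_correction")
    return {
        "attempts": len(attempts),
        "corrections": corrections,
        "issues": dict(issues),
    }


summary = summarize_attempts(spans)
```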
Hallucination Monitoring
Hallucination detection in production is harder than in evaluation benchmarks because there’s no ground truth to compare against. Several approaches work at different accuracy-cost tradeoffs.
Detection Methods
Reference-based checking: When the LLM’s output is grounded in retrieved documents (RAG), compare claims in the output against the source material. This catches unsupported claims but misses cases where the retrieved documents themselves are wrong.
Self-consistency checking: Run the same prompt multiple times (or ask the model to verify its own output in a separate call). Inconsistencies suggest hallucination. This doubles or triples cost per request, so it’s typically applied only to high-stakes outputs.
Entailment scoring: Use a smaller, specialized model (a natural language inference classifier) to score whether the output is entailed by the input context. Models like vectara/hallucination_evaluation_model or similar cross-encoder classifiers can run inference in <50ms and provide a 0-1 confidence score.
LLM-as-judge: Use a separate LLM call to evaluate the output for factual accuracy, relevance, and grounding. This is the most flexible approach but adds latency and cost. Typically done asynchronously, not inline.
Four hallucination detection methods, each suited to different cost/accuracy tradeoffs.
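Self-consistency checking can be sketched without any model dependency. The example below scores agreement between already-collected samples using token overlap (Jaccard similarity), which is a crude stand-in chosen to keep the sketch self-contained; a production check would more likely compare samples with embeddings or an NLI model. The function names are illustrative.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_consistency(samples: list[str]) -> float:
    """Mean pairwise similarity across samples (1.0 = identical wording)."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        raise ValueError("need at least two samples")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


consistent = self_consistency(["the invoice total is 42 USD"] * 3)
divergent = self_consistency([
    "the invoice total is 42 USD",
    "the invoice total is 17 USD",
    "no invoice was found",
])
```

A low score does not prove hallucination; it only says the model does not give a stable answer, which is a reasonable trigger for a more expensive check.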
Sampling Strategy
Running hallucination detection on every request is expensive. A practical approach:
| Traffic Volume | Strategy |
|---|---|
| <1,000 req/day | Evaluate 100% with entailment scoring |
| 1,000–50,000 req/day | Evaluate 100% with entailment, 10% with LLM-as-judge |
| 50,000+ req/day | Evaluate 5–10% sample with entailment, 1% with LLM-as-judge |
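Store the hallucination scores alongside traces. Alert when the rolling average exceeds a baseline threshold: a jump from a 4% to a 12% hallucination rate over 24 hours, with no deployment on your side, often points to a provider-side model change. Rule out regressions in your own prompts and retrieval pipeline before drawing that conclusion.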
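The sampling and rolling-rate logic can be sketched as follows. Hash-based sampling is one reasonable choice because the same trace always gets the same decision; the class name, window size, and the "2x baseline" alert rule are assumptions for illustration.

```python
import hashlib
from collections import deque


def should_evaluate(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace_id always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


class RollingRate:
    """Fraction of the most recent evaluations that were flagged."""

    def __init__(self, window: int = 1000):
        self.flags: deque[bool] = deque(maxlen=window)

    def add(self, flagged: bool) -> None:
        self.flags.append(flagged)

    def rate(self) -> float:
        return sum(self.flags) / len(self.flags) if self.flags else 0.0


monitor = RollingRate(window=100)
for i in range(100):
    monitor.add(i % 8 == 0)  # synthetic data: 13 of 100 evaluations flagged

baseline = 0.04
alert = monitor.rate() > 2 * baseline
```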
Implementation Example
from sentence_transformers import CrossEncoder

# Load NLI model for entailment scoring
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large", max_length=512)

# Index of the entailment class. The cross-encoder/nli-deberta-v3-* model cards
# list the label order as [contradiction, entailment, neutral]; verify this
# against the model config if you swap in a different NLI model.
ENTAILMENT_INDEX = 1


def score_hallucination(context: str, output: str) -> float:
    """
    Returns hallucination probability (0 = grounded, 1 = hallucinated).
    Uses NLI entailment: if output is NOT entailed by context,
    it's likely hallucinated.
    """
    # predict() returns raw logits by default; apply_softmax=True converts
    # them into class probabilities so the score stays in [0, 1]
    probs = nli_model.predict([(context, output)], apply_softmax=True)
    entailment_prob = float(probs[0][ENTAILMENT_INDEX])
    # Higher entailment = lower hallucination risk
    return 1.0 - entailment_prob


def monitor_hallucination(trace_id: str, context: str, output: str):
    score = score_hallucination(context, output)
    # Log to observability platform (`observability` is a placeholder client)
    observability.log_score(
        trace_id=trace_id,
        name="hallucination_probability",
        value=score,
        threshold=0.5,
    )
    if score > 0.7:
        observability.alert(
            severity="warning",
            message=f"High hallucination score ({score:.2f}) for trace {trace_id}",
        )
Output Drift Detection
Model behavior can change without any change on your side. Aliases that are not pinned to a dated snapshot get repointed to newer models, and providers can adjust the serving stack or safety systems around a model without any visible change to the identifier you call. These silent changes can alter output characteristics in ways that break downstream systems.
What Drifts
- Format compliance: A model that reliably produced valid JSON might start adding markdown formatting or explanatory text around the JSON
- Verbosity: Average output length shifts, affecting both cost and user experience
- Tone and style: Subtle changes in formality, hedging language, or response structure
- Tool call patterns: Changes in how models structure function calls or choose between available tools
- Refusal rates: Increased safety filtering can cause previously-working prompts to be refused
Measuring Drift
Track statistical distributions of output properties over time windows (hourly, daily, weekly):
from dataclasses import dataclass
from collections import deque
import statistics


@dataclass
class OutputMetrics:
    output_length: int
    json_valid: bool
    tool_calls_count: int
    refusal: bool
    latency_ms: float
    hallucination_score: float


class DriftDetector:
    def __init__(self, window_size: int = 1000):
        # Append OutputMetrics to `baseline` during a known-good period,
        # then to `current` for the window under test
        self.baseline: deque[OutputMetrics] = deque(maxlen=window_size)
        self.current: deque[OutputMetrics] = deque(maxlen=window_size)

    def check_drift(self, metric_name: str) -> dict:
        baseline_vals = [getattr(m, metric_name) for m in self.baseline]
        current_vals = [getattr(m, metric_name) for m in self.current]
        if len(baseline_vals) < 100 or len(current_vals) < 100:
            return {"status": "insufficient_data"}
        baseline_mean = statistics.mean(baseline_vals)
        current_mean = statistics.mean(current_vals)
        baseline_std = statistics.stdev(baseline_vals)
        if baseline_std == 0:
            return {"status": "no_variance"}
        # Shift of the current mean, in units of the baseline standard deviation.
        # This is an effect size, not a standard-error z-test, so it does not
        # become more sensitive as the window grows.
        z_score = (current_mean - baseline_mean) / baseline_std
        return {
            "metric": metric_name,
            "baseline_mean": baseline_mean,
            "current_mean": current_mean,
            "z_score": z_score,
            "drifted": abs(z_score) > 2.0,
        }
A practical rule of thumb: re-establish baselines whenever you intentionally change prompts, switch model versions, or update retrieval pipelines. Unexpected drift against a stable baseline is the signal worth alerting on.
Drift detection pipeline: extract output metrics, compare rolling windows against a baseline, alert on statistical shifts.
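The detector above scales the shift by the baseline standard deviation, which flags only large shifts regardless of sample size. When windows are large, an alternative is to scale by the standard error of the difference between the two means, which picks up smaller but statistically reliable shifts. The sketch below is one such variant (a Welch-style statistic); the function name and synthetic data are assumptions.

```python
import math
import statistics


def mean_shift_z(baseline: list[float], current: list[float]) -> float:
    """Welch-style z statistic for the difference between two window means."""
    se = math.sqrt(
        statistics.variance(baseline) / len(baseline)
        + statistics.variance(current) / len(current)
    )
    if se == 0:
        raise ValueError("both windows have zero variance")
    return (statistics.mean(current) - statistics.mean(baseline)) / se


# Synthetic output lengths: the current window runs 10 tokens longer on average.
baseline_lengths = [400 + (i % 21) - 10 for i in range(500)]
current_lengths = [410 + (i % 21) - 10 for i in range(500)]
z = mean_shift_z(baseline_lengths, current_lengths)
```

The trade-off is noise: with thousands of samples, a standard-error test will flag shifts too small to matter, so pair it with a minimum effect size before alerting.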
Token Cost Tracking
Token costs in LLM applications follow patterns unlike any other cloud resource. A single poorly-constructed prompt can cost 100x more than an optimized one. Cost observability requires tracking at multiple granularities.
Cost Attribution Dimensions
Every LLM call’s cost should be attributed across:
- User: Which user or account triggered the spend
- Feature: Which product feature (code review, summarization, chat)
- Model: Which model was used (including fallback routing)
- Cache status: Cache hit, cache write, or uncached
- Workflow step: Which step in a multi-step agent pipeline
def calculate_call_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0,
) -> float:
    """
    Calculate cost for a single LLM call.
    Prices in USD per million tokens; update these from provider pricing pages.

    Convention: input_tokens is the total prompt size, including cached tokens.
    Some provider APIs report cached tokens separately from input tokens; in
    that case pass the sum here, or the subtraction below will undercount.
    """
    # Example pricing structure (verify against current provider pricing)
    pricing = {
        "claude-sonnet-4.6": {
            "input": 3.00,
            "output": 15.00,
            "cache_read": 0.30,
            "cache_write": 3.75,
        },
        "claude-haiku-4.5": {
            "input": 1.00,
            "output": 5.00,
            "cache_read": 0.10,
            "cache_write": 1.25,
        },
        "gpt-4.1-nano": {
            "input": 0.10,
            "output": 0.40,
            "cache_read": 0.025,
            "cache_write": 0.10,
        },
    }
    p = pricing.get(model)
    if not p:
        # Unknown model: returns 0.0, which hides spend. Log or raise here in
        # production so missing price entries are noticed.
        return 0.0
    # Clamp at zero in case the caller passed counts under the other convention
    uncached_input = max(input_tokens - cache_read_tokens - cache_write_tokens, 0)
    cost = (
        (uncached_input / 1_000_000) * p["input"]
        + (output_tokens / 1_000_000) * p["output"]
        + (cache_read_tokens / 1_000_000) * p["cache_read"]
        + (cache_write_tokens / 1_000_000) * p["cache_write"]
    )
    return round(cost, 6)
Cost Anomaly Detection
Token costs follow predictable distributions for a given feature. A code review feature might average $0.03 per invocation with a standard deviation of $0.01. A single call costing $0.50 indicates either a prompt injection (inflating context), a retrieval bug (pulling too many documents), or an agent loop that didn’t terminate.
Track per-feature cost distributions and alert when individual calls exceed 3 standard deviations above the mean, or when hourly/daily aggregate costs exceed budgets.
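A per-feature three-sigma check can be sketched in a few lines. The class below is illustrative: its name, the `min_samples` warm-up, and the decision to keep flagged calls out of the baseline are assumptions, and a real deployment would bound the history (for example with a rolling window).

```python
import statistics
from collections import defaultdict


class CostAnomalyDetector:
    def __init__(self, sigmas: float = 3.0, min_samples: int = 30):
        self.sigmas = sigmas
        self.min_samples = min_samples
        self.history: dict[str, list[float]] = defaultdict(list)

    def observe(self, feature: str, cost_usd: float) -> bool:
        """Record a call's cost; return True if it is anomalously expensive."""
        costs = self.history[feature]
        anomalous = False
        if len(costs) >= self.min_samples:
            mean = statistics.mean(costs)
            std = statistics.stdev(costs)
            anomalous = cost_usd > mean + self.sigmas * std
        if not anomalous:
            costs.append(cost_usd)  # keep outliers out of the baseline
        return anomalous


detector = CostAnomalyDetector()
for i in range(100):
    # Synthetic costs cycling between $0.02 and $0.04 around a $0.03 mean
    detector.observe("code-review", 0.03 + 0.01 * ((i % 5) - 2) / 2)

normal = detector.observe("code-review", 0.035)
spike = detector.observe("code-review", 0.50)
```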
The Hidden Costs
Several cost sources are easy to miss:
| Hidden Cost | Why It’s Missed |
|---|---|
| Retry tokens | Failed calls that consumed tokens before erroring |
| Validation re-runs | Output failed validation, entire generation repeated |
| Embedding calls | Retrieval-time embeddings billed separately |
| Judge/eval calls | Async quality evaluation uses real tokens |
| Prompt caching writes | First call to populate cache costs more than subsequent reads |
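One way to make these visible is to tag every cost event with a kind and roll totals up by that tag. The event layout and `kind` labels below are hypothetical and mirror the rows of the table; the amounts are made up.

```python
from collections import defaultdict

# Hypothetical cost events; `kind` labels mirror the table above.
events = [
    {"feature": "code-review", "kind": "generation", "cost_usd": 0.0300},
    {"feature": "code-review", "kind": "retry", "cost_usd": 0.0120},
    {"feature": "code-review", "kind": "validation_rerun", "cost_usd": 0.0300},
    {"feature": "code-review", "kind": "embedding", "cost_usd": 0.0004},
    {"feature": "code-review", "kind": "judge", "cost_usd": 0.0050},
]


def cost_by_kind(events: list[dict]) -> dict[str, float]:
    """Sum cost per event kind."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["kind"]] += event["cost_usd"]
    return dict(totals)


totals = cost_by_kind(events)
overhead = sum(cost for kind, cost in totals.items() if kind != "generation")
overhead_share = overhead / sum(totals.values())
```

In this made-up example, more than half of the spend sits outside the primary generation call, which a dashboard keyed only on "LLM calls per request" would not show.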
Latency Profiling
LLM latency is not a single number. A streaming response has at least three distinct latency measurements, and each one matters for different reasons.
Latency Components
| Metric | Definition | Typical Range | Why It Matters |
|---|---|---|---|
| Time to First Token (TTFT) | Time from request sent to first token received | 200ms–3s | Determines perceived responsiveness in streaming UIs |
| Inter-Token Latency (ITL) | Average time between consecutive tokens | 10ms–50ms | Affects streaming smoothness |
| Total Generation Time | Time from request to final token | 1s–60s+ | End-to-end wall clock time |
| Queue Time | Time spent waiting before inference begins | 0ms–10s+ | Spikes during provider congestion |
| Tool Call Overhead | Time spent executing tool calls mid-generation | Variable | Can dominate total latency in agent workflows |
Latency decomposition for a single LLM request with tool calls.
Measuring TTFT Accurately
TTFT measurement requires care. If using an HTTP client that buffers responses, the measured TTFT includes buffering delay. For accurate measurement:
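import json
import time

import httpx


async def measure_ttft(client: httpx.AsyncClient, request_body: dict) -> dict:
    # Assumes `client` already sends the x-api-key header and that
    # request_body sets "stream": true; otherwise no SSE events arrive.
    start = time.monotonic()
    first_token_time = None
    content_chunks = 0
    async with client.stream(
        "POST",
        "https://api.anthropic.com/v1/messages",
        json=request_body,
        headers={"anthropic-version": "2023-06-01"},
    ) as response:
        response.raise_for_status()
        async for line in response.aiter_lines():
            if not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            # Only content deltas carry generated text. Events such as
            # message_start and ping arrive earlier and would understate TTFT.
            if event.get("type") != "content_block_delta":
                continue
            if first_token_time is None:
                first_token_time = time.monotonic()
            content_chunks += 1
    end = time.monotonic()
    return {
        "ttft_ms": (first_token_time - start) * 1000 if first_token_time else None,
        "total_ms": (end - start) * 1000,
        # A chunk can carry more than one token, so this counts chunks
        "chunks": content_chunks,
        # Average gap between chunks; approximates inter-token latency
        "itl_ms": ((end - first_token_time) * 1000 / max(content_chunks - 1, 1))
        if first_token_time else None,
    }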
Latency Budgets
For agent workflows with multiple LLM calls, establish per-step latency budgets:
# Example latency budget for a RAG agent
total_budget_ms: 8000
steps:
  query_understanding:
    model: gpt-4.1-nano
    budget_ms: 800
  retrieval:
    budget_ms: 200
  reranking:
    budget_ms: 300
  generation:
    model: claude-sonnet-4.6
    budget_ms: 5000
  validation:
    model: gpt-4.1-nano
    budget_ms: 700
buffer_ms: 1000
Track what percentage of requests exceed each step’s budget. A step consistently exceeding its budget indicates either a model performance regression, a prompt that’s too complex, or provider-side capacity issues.
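Computing the per-step violation rate is a small aggregation. The sketch below hard-codes the budgets from the example above and assumes each request is recorded as a mapping from step name to elapsed milliseconds; that record shape is an assumption.

```python
# Per-step budgets in milliseconds, mirroring the example budget above.
BUDGETS_MS = {
    "query_understanding": 800,
    "retrieval": 200,
    "reranking": 300,
    "generation": 5000,
    "validation": 700,
}


def budget_violation_rates(requests: list[dict[str, float]]) -> dict[str, float]:
    """Fraction of requests in which each step exceeded its budget."""
    rates = {}
    for step, budget in BUDGETS_MS.items():
        timings = [r[step] for r in requests if step in r]
        over = sum(1 for t in timings if t > budget)
        rates[step] = over / len(timings) if timings else 0.0
    return rates


# Synthetic per-step timings (ms) for four requests.
requests = [
    {"query_understanding": 600, "retrieval": 150, "generation": 4200},
    {"query_understanding": 900, "retrieval": 180, "generation": 5600},
    {"query_understanding": 700, "retrieval": 250, "generation": 4800},
    {"query_understanding": 750, "retrieval": 120, "generation": 6100},
]
rates = budget_violation_rates(requests)
```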
Platform Comparison
Three platforms dominate LLM observability as of mid-2026: Langfuse, Helicone, and Braintrust. Each approaches the problem differently.
Langfuse
Architecture: Open-source, self-hostable. Recent self-hosted versions store trace and score data in ClickHouse and keep PostgreSQL for transactional data; check the deployment docs for the version you run. SDK-based instrumentation with OpenTelemetry compatibility.
Strengths:
- Full trace hierarchy (sessions → traces → spans → generations)
- Built-in evaluation framework with custom scoring functions
- Dataset management for regression testing
- Self-hosting option (important for compliance-sensitive environments)
- Native integrations with LangChain, LlamaIndex, and Vercel AI SDK
Weaknesses:
- Self-hosted deployment requires operating the storage backends (PostgreSQL, plus ClickHouse in recent versions) at scale
- Dashboard UI is functional but less polished than commercial alternatives
- Real-time alerting requires additional infrastructure
Pricing: Open-source (self-hosted), or cloud-hosted with a free tier and usage-based pricing.
Helicone
Architecture: Proxy-based. Sits between the application and LLM providers as an HTTP proxy, capturing all request/response data without SDK changes.
Strengths:
- Zero-code instrumentation: change the base URL, get observability
- Strong cost tracking and analytics out of the box
- Request caching at the proxy layer
- Rate limiting and request queuing built in
- Low integration effort
Weaknesses:
- Proxy architecture adds a network hop (typically 1-5ms)
- Less flexible trace hierarchy than SDK-based approaches
- Custom evaluation scoring is more limited
- Multi-step agent traces require additional annotation
Pricing: Free tier with usage-based scaling.
Braintrust
Architecture: SDK-based with a focus on evaluation and experimentation. Positions itself as an “AI product development platform” rather than pure observability.
Strengths:
- Strong evaluation framework with built-in scoring functions (factuality, relevance, toxicity)
- Experiment tracking: compare prompt versions, model changes side by side
- Dataset management with human annotation workflows
- Logging doubles as evaluation data collection
Weaknesses:
- More opinionated about workflow than Langfuse or Helicone
- Tighter coupling to the platform’s evaluation philosophy
- Self-hosting not available
Pricing: Free tier, usage-based scaling.
Comparison Table
| Feature | Langfuse | Helicone | Braintrust |
|---|---|---|---|
| Integration method | SDK | Proxy | SDK |
| Self-hostable | Yes (open-source) | No | No |
| Trace hierarchy | Full (session/trace/span) | Flat + annotations | Experiment/log based |
| Cost tracking | Yes | Yes (automatic) | Yes |
| Evaluation framework | Custom scores | Basic | Built-in scorers |
| Prompt management | Yes | No | Yes |
| Dataset management | Yes | No | Yes |
| Real-time alerting | Via webhooks | Built-in | Via integrations |
| OpenTelemetry support | Yes | No | Partial |
Three integration approaches: Langfuse and Braintrust use SDKs, Helicone uses a proxy layer.
When to Use Which
Langfuse fits teams that want full control, need self-hosting, or are building complex agent workflows with deep trace hierarchies. The open-source nature makes it the default choice for compliance-constrained environments.
Helicone fits teams that want observability with minimal code changes, especially for simpler request/response LLM applications. The proxy approach means you get cost and latency tracking immediately.
Braintrust fits teams whose primary concern is evaluation and experimentation — comparing prompt variants, running regression tests, and managing human evaluation workflows. The observability is a byproduct of the evaluation infrastructure.
Many teams use more than one. Helicone as a proxy for immediate cost visibility, plus Langfuse or Braintrust for deeper evaluation workflows, is a common combination.
Instrumentation Patterns
Pattern 1: Decorator-Based Tracing
Wrap LLM calls with decorators that automatically capture inputs, outputs, and metadata:
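import time
from functools import wraps

from anthropic import AsyncAnthropic
from langfuse import Langfuse

# Written against the trace.generation() interface of the Langfuse Python SDK v2.
# Newer SDK versions may expose a different API, so check the docs for your version.
langfuse = Langfuse()
anthropic_client = AsyncAnthropic()


def traced_llm_call(name: str, model: str):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, trace=None, **kwargs):
            if trace is None:
                # No active trace: run the call untraced instead of failing
                return await func(*args, **kwargs)
            generation = trace.generation(
                name=name,
                model=model,
                input=kwargs.get("messages", args[0] if args else None),
            )
            start = time.monotonic()
            try:
                result = await func(*args, **kwargs)
                generation.end(
                    output=result.content,
                    usage={
                        "input": result.usage.input_tokens,
                        "output": result.usage.output_tokens,
                    },
                    metadata={"latency_ms": (time.monotonic() - start) * 1000},
                )
                return result
            except Exception as e:
                # Record the failure on the generation, then re-raise
                generation.end(status_message=str(e), level="ERROR")
                raise
        return wrapper
    return decorator


@traced_llm_call(name="summarize", model="claude-sonnet-4.6")
async def summarize(messages: list, max_tokens: int = 1024, **kwargs):
    # max_tokens is required by the Anthropic Messages API
    return await anthropic_client.messages.create(
        model="claude-sonnet-4.6",
        messages=messages,
        max_tokens=max_tokens,
        **kwargs,
    )

# Usage: trace = langfuse.trace(name="summarize-request")
#        result = await summarize(messages, trace=trace)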
Pattern 2: Middleware-Based Capture
For proxy-style observability, intercept all LLM calls at the HTTP client level:
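import json
import time

import httpx


class LLMObservabilityMiddleware:
    def __init__(self, tracker):
        self.tracker = tracker

    async def intercept(self, request: httpx.Request) -> httpx.Request:
        request.extensions["obs_start_time"] = time.monotonic()
        request.extensions["obs_request_body"] = request.content
        return request

    async def handle_response(self, response: httpx.Response) -> httpx.Response:
        start = response.request.extensions.get("obs_start_time")
        latency = (time.monotonic() - start) * 1000 if start is not None else None
        # Reads the whole body, so this suits non-streaming calls
        body = await response.aread()
        # Parse provider-specific usage from response
        usage = self._extract_usage(body, response.request.url)
        self.tracker.record(
            url=str(response.request.url),
            status=response.status_code,
            latency_ms=latency,
            input_tokens=usage.get("input_tokens"),
            output_tokens=usage.get("output_tokens"),
        )
        return response

    def _extract_usage(self, body: bytes, url: httpx.URL) -> dict:
        # Minimal sketch: assumes a JSON body with a top-level "usage" object.
        # Providers name the token fields differently, so branch on `url` as needed.
        try:
            return json.loads(body).get("usage") or {}
        except (ValueError, AttributeError):
            return {}


# Register the methods as httpx event hooks:
# mw = LLMObservabilityMiddleware(tracker)
# client = httpx.AsyncClient(
#     event_hooks={"request": [mw.intercept], "response": [mw.handle_response]}
# )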
Pattern 3: OpenTelemetry Integration
For teams already using OpenTelemetry, extend the tracing with LLM-specific semantic conventions:
from anthropic import AsyncAnthropic
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
client = AsyncAnthropic()


async def call_llm(model: str, messages: list, max_tokens: int = 1024):
    temperature = 0.7
    with tracer.start_as_current_span("llm.generate") as span:
        # Attribute names follow the gen_ai.* semantic conventions, which are
        # still changing; check the current spec before relying on them
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        result = await client.messages.create(
            model=model,
            messages=messages,
            max_tokens=max_tokens,  # required by the Anthropic Messages API
            temperature=temperature,  # send the value that the span records
        )
        span.set_attribute("gen_ai.response.model", result.model)
        span.set_attribute("gen_ai.usage.input_tokens", result.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", result.usage.output_tokens)
        # The convention defines finish_reasons as an array of strings
        span.set_attribute("gen_ai.response.finish_reasons", [result.stop_reason])
        return result
The OpenTelemetry Semantic Conventions for Generative AI are still evolving (the gen_ai.* namespace), but they’re converging toward a standard that Langfuse, Traceloop, and other platforms already support.
Building a Custom Observability Layer
When off-the-shelf platforms don’t fit (air-gapped environments, specific compliance requirements, extreme scale), building a custom layer is straightforward. The core architecture has three components.
Custom observability architecture: async event collection, analytical storage, and visualization/alerting.
Storage Considerations
LLM observability data has specific storage characteristics:
- High cardinality: Every trace has unique prompt/completion text
- Mixed types: Numeric metrics alongside large text blobs
- Time-series access patterns: Most queries filter by time range + dimensions
- Write-heavy: Production apps generate thousands of events per minute
ClickHouse handles this well: columnar storage compresses text efficiently, and materialized views can pre-aggregate metrics. Store full prompts/completions in a separate table with trace_id as the join key, keeping the metrics table lean.
PostgreSQL works for smaller deployments (<100k events/day). Use JSONB columns for flexible metadata and create materialized views for common aggregations.
-- ClickHouse schema for LLM events
CREATE TABLE llm_events (
    trace_id String,
    span_id String,
    timestamp DateTime64(3),
    model LowCardinality(String),
    provider LowCardinality(String),
    feature LowCardinality(String),
    user_id String,
    input_tokens UInt32,
    output_tokens UInt32,
    cache_read_tokens UInt32,
    cache_write_tokens UInt32,
    cost_usd Float64,
    latency_ms Float64,
    ttft_ms Float64,
    hallucination_score Float32,
    output_length UInt32,
    stop_reason LowCardinality(String),
    error Boolean DEFAULT false
) ENGINE = MergeTree()
ORDER BY (feature, timestamp, trace_id);

-- Separate table for full text (expensive to store, rarely queried)
CREATE TABLE llm_event_content (
    trace_id String,
    span_id String,
    input_text String CODEC(ZSTD(3)),
    output_text String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY (trace_id, span_id);
Async Event Emission
Never block the request path to record observability data. Use an async queue:
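import asyncio
import logging
from collections import deque

logger = logging.getLogger(__name__)


class ObservabilityEmitter:
    def __init__(self, flush_interval: float = 5.0, batch_size: int = 100,
                 max_buffer: int = 10_000):
        # Bounded buffer: if the sink is down, the oldest events are dropped
        # instead of growing memory without limit
        self._buffer: deque = deque(maxlen=max_buffer)
        self._flush_interval = flush_interval
        self._batch_size = batch_size
        self._task: asyncio.Task | None = None
        self._pending: set[asyncio.Task] = set()

    def start(self):
        self._task = asyncio.create_task(self._flush_loop())

    def emit(self, event: dict):
        """Non-blocking event emission. Call from within a running event loop."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            # Hold a reference so the task is not garbage-collected mid-flight
            task = asyncio.create_task(self._flush())
            self._pending.add(task)
            task.add_done_callback(self._pending.discard)

    async def _flush_loop(self):
        while True:
            await asyncio.sleep(self._flush_interval)
            await self._flush()

    async def _flush(self):
        if not self._buffer:
            return
        batch = []
        while self._buffer and len(batch) < self._batch_size:
            batch.append(self._buffer.popleft())
        try:
            await self._write_batch(batch)
        except Exception:
            # A failed write must not kill the flush loop
            logger.exception("failed to write %d observability events", len(batch))

    async def _write_batch(self, batch: list[dict]):
        # Write to ClickHouse, Postgres, or observability platform
        raise NotImplementedError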
Alerting Strategies
LLM alerts require different thresholds and logic than traditional application alerts. A 500-error spike is unambiguous. A gradual increase in hallucination rate is not.
Alert Categories
| Alert Type | Condition | Urgency | Example |
|---|---|---|---|
| Cost spike | Hourly spend > 3x rolling average | High | Agent loop running unbounded |
| Latency degradation | p95 TTFT > 5s for 10 min | Medium | Provider capacity issues |
| Hallucination spike | Rolling hallucination rate > 2x baseline | High | Silent model update |
| Error rate | >5% of LLM calls returning errors | High | Rate limiting, auth issues |
| Output drift | Z-score > 2.0 on any tracked metric | Medium | Model behavior change |
| Refusal rate | >2% of requests refused | Medium | Safety filter changes |
| Cache hit drop | Cache hit rate drops >20pp in 1 hour | Low | Prompt template change invalidated cache |
Composite Alerts
Single-metric alerts generate noise. Composite alerts — requiring multiple conditions — are more actionable:
# Alert: Probable silent model update
alert: model_behavior_change
conditions:
  - output_length_z_score > 2.0
  - hallucination_rate_change > 0.05  # 5pp increase
  - time_window: 6h
  - min_sample_size: 500
severity: high
action: page_on_call
message: |
  Probable model behavior change detected for {model}.
  Output length shifted {z_score} std devs from baseline.
  Hallucination rate increased from {baseline_rate} to {current_rate}.
Alert pipeline: multiple metric streams feed composite rules that route to appropriate channels.
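Evaluating a composite rule like the one above comes down to checking every condition against one window of aggregates. The sketch below is a hand-rolled evaluator, not the rule engine of any alerting product; the `WindowStats` fields are assumptions, and it treats a length shift in either direction as a signal by taking the absolute z-score.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates over one evaluation window (for example, the last 6 hours)."""
    sample_size: int
    output_length_z: float
    baseline_hallucination_rate: float
    current_hallucination_rate: float


def model_behavior_change(stats: WindowStats, min_samples: int = 500) -> bool:
    """Fire only when every condition of the composite rule holds."""
    if stats.sample_size < min_samples:
        return False
    length_shifted = abs(stats.output_length_z) > 2.0
    rate_increase = (
        stats.current_hallucination_rate - stats.baseline_hallucination_rate
    )
    return length_shifted and rate_increase > 0.05


quiet = WindowStats(800, 2.6, 0.04, 0.05)       # length shifted, quality stable
firing = WindowStats(800, 2.6, 0.04, 0.12)      # both signals moved
too_small = WindowStats(120, 3.1, 0.04, 0.15)   # not enough data to trust
```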
Avoiding Alert Fatigue
LLM observability alerts are particularly prone to noise because:
- Model behavior is inherently variable: Temperature >0 means outputs vary naturally. Set baselines using statistical distributions, not fixed thresholds.
- Provider issues are transient: A 30-second latency spike during provider scaling doesn’t warrant a page. Use sustained-duration conditions (e.g., “p95 latency > 3s for 5+ consecutive minutes”).
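- Evaluation is probabilistic: A hallucination detector that is right 80% of the time still mislabels roughly one output in five, and when real hallucinations are rare, most of its flags can be false positives. Require multiple signals before alerting.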
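The effect of detector error on alert quality follows from Bayes' rule. The numbers below are assumptions for illustration: a 4% true hallucination rate and a detector that is 80% sensitive and 80% specific.

```python
def flag_precision(prevalence: float, sensitivity: float, specificity: float) -> float:
    """P(output is truly hallucinated | detector flagged it), by Bayes' rule."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)


# Assumed: 4% of outputs hallucinated; detector 80% sensitive, 80% specific.
precision = flag_precision(prevalence=0.04, sensitivity=0.80, specificity=0.80)
```

Under these assumptions only about one flag in seven corresponds to a real hallucination, which is why a per-request flag is a poor paging signal and aggregate rates are a better one.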
Summary
LLM observability requires purpose-built instrumentation because the failure modes — hallucinated outputs, silent model updates, token cost spikes, and semantic drift — are invisible to traditional APM tools.
The core requirements:
- Structured traces that capture session → trace → span → generation hierarchies, including agent retry loops and tool calls
- Hallucination monitoring using entailment scoring for breadth and LLM-as-judge for depth, with sampling strategies matched to traffic volume
- Drift detection against statistical baselines, with baselines re-established after intentional changes to prompts, models, or retrieval
- Token-granular cost tracking that accounts for cache status, retry waste, and evaluation overhead
- Latency decomposition into TTFT, inter-token latency, queue time, and tool call overhead — not just total request duration
Platform choice depends on constraints: Langfuse for self-hosting and complex agent traces, Helicone for zero-code proxy-based capture, Braintrust for evaluation-centric workflows. Many production deployments use more than one.
Alerting on LLM metrics requires composite conditions and statistical thresholds rather than fixed limits. Single-metric alerts on inherently variable systems produce noise. Multiple converging signals — cost spike plus hallucination increase plus output length shift — produce actionable alerts.
The tooling is maturing rapidly. The OpenTelemetry semantic conventions for generative AI are stabilizing, which will probably make cross-platform instrumentation more portable within the next year. Until then, SDK-based instrumentation with one of the purpose-built platforms remains the most practical approach.
Further Reading
- Langfuse GitHub repository — Open-source LLM observability platform with tracing, evaluation, and prompt management
- Helicone documentation — Proxy-based LLM observability with automatic cost tracking and caching
- Braintrust documentation — AI product development platform with evaluation framework and experiment tracking
- OpenTelemetry Semantic Conventions for GenAI — Emerging standard for LLM-specific span attributes in OpenTelemetry
- OpenLLMetry by Traceloop — OpenTelemetry-native instrumentation for LLM applications, supports multiple providers
- Vectara hallucination evaluation model — Cross-encoder model for scoring hallucination probability in grounded generation
- ClickHouse documentation — Column-oriented database well-suited for high-cardinality LLM observability data
- Arize Phoenix — Open-source LLM observability with trace visualization, evaluation, and dataset management
- OTEL GenAI Instrumentation for Python — Official OpenTelemetry Python instrumentation for generative AI calls