Technical overview of LLM observability: monitoring hallucinations, token costs, latency, and output drift across multi-step agent workflows, with platform comparisons.

LLM Observability in Production

Every traditional APM tool — Datadog, New Relic, Grafana — can tell you an HTTP request took 2.3 seconds and returned a 200. None of them can tell you the response was a hallucinated JSON blob that passed validation but contained fabricated data. LLM observability is a distinct discipline because the failure modes are distinct: outputs can be structurally correct and semantically wrong, costs scale with content rather than compute, and latency depends on how much the model “thinks” rather than how busy the server is.

This post covers the instrumentation, metrics, platforms, and architectural patterns required to operate LLM applications with the same rigor applied to traditional production systems.

Why Traditional Observability Breaks
The Five Pillars of LLM Observability
Tracing Multi-Step Agent Workflows
Hallucination Monitoring
Output Drift Detection
Token Cost Tracking
Latency Profiling
Platform Comparison
Instrumentation Patterns
Building a Custom Observability Layer
Alerting Strategies
Summary
Further Reading

Why Traditional Observability Breaks

Traditional observability rests on three pillars: metrics, logs, and traces. LLM applications break assumptions embedded in all three.

Metrics: Request count, error rate, and p99 latency don’t capture the dominant failure mode — a 200 OK response containing wrong information. A hallucinated answer is not an error in any HTTP sense.

Logs: A single LLM call can generate 4,000+ tokens of output. Logging raw completions at scale produces terabytes of unstructured text that’s expensive to store and nearly impossible to query meaningfully without semantic analysis.

Traces: A multi-step agent workflow might make 3–15 LLM calls, interleaved with tool calls, retrieval steps, and branching logic. Traditional distributed tracing assumes a request-response tree. Agent traces are often graphs with cycles (retry loops, self-correction steps).

Traditional APM tools miss the semantic failure modes that dominate LLM application debugging.

The Five Pillars of LLM Observability

LLM observability requires five distinct measurement categories, each with its own instrumentation approach:

Pillar	What It Measures	Why It’s Hard
Trace completeness	Full execution path across LLM calls, tool use, retrieval	Agent workflows branch, loop, and self-correct
Output quality	Hallucination rate, factual accuracy, format compliance	Requires semantic evaluation, not just schema validation
Output drift	Changes in model behavior over time	Provider-side model updates are silent and continuous
Cost attribution	Token spend per user, feature, workflow	Input/output token ratios vary wildly; cached vs uncached pricing differs
Latency decomposition	Time spent in each phase: routing, queueing, TTFT, generation	Time-to-first-token and total generation time measure different things

The five pillars of LLM observability and their interdependencies.

Tracing Multi-Step Agent Workflows

A single user request to an agent system might trigger this sequence: parse intent → retrieve documents → generate plan → execute tool calls → validate output → summarize response. Each step involves different models, different latencies, and different failure modes.

Trace Structure

LLM traces need a hierarchical structure that captures:

Sessions: A conversation or user interaction (may span multiple requests)
Traces: A single end-to-end request processing
Spans: Individual operations within a trace (LLM calls, retrieval, tool execution)
Generations: The specific LLM input/output pairs within a span

Trace hierarchy for LLM applications — sessions contain traces, traces contain typed spans.
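The hierarchy above can be sketched as plain data classes. This is a minimal illustration, not any platform's schema: the class and field names are assumptions chosen to mirror the four levels listed.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One LLM input/output pair."""
    model: str
    input_messages: list
    output_content: str


@dataclass
class Span:
    """One operation in a trace; kind is e.g. 'llm', 'retrieval', or 'tool'."""
    name: str
    kind: str
    children: list["Span"] = field(default_factory=list)
    generations: list[Generation] = field(default_factory=list)


@dataclass
class Trace:
    """One end-to-end request."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)


@dataclass
class Session:
    """A conversation or interaction; may contain several traces."""
    session_id: str
    traces: list[Trace] = field(default_factory=list)


def count_generations(span: Span) -> int:
    """Count LLM generations in a span and all of its descendants."""
    return len(span.generations) + sum(count_generations(c) for c in span.children)
```

Letting spans nest (the `children` field) is what later allows retry and self-correction steps to be recorded under the attempt that triggered them.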

What to Capture Per LLM Span

Every LLM call should record:

{
  "span_id": "abc-123",
  "trace_id": "trace-789",
  "model": "claude-sonnet-4.6",
  "provider": "anthropic",
  "input_tokens": 2847,
  "output_tokens": 512,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "total_cost_usd": 0.0089,
  "latency_ms": 1843,
  "ttft_ms": 312,
  "temperature": 0.7,
  "top_p": 1.0,
  "stop_reason": "end_turn",
  "input_messages": [...],
  "output_content": "...",
  "metadata": {
    "user_id": "user-456",
    "feature": "code-review",
    "environment": "production"
  }
}

The cache_read_tokens and cache_write_tokens fields matter for cost accuracy. Anthropic’s prompt caching, for example, charges differently for cache writes vs cache reads vs uncached input. Ignoring this distinction can make cost estimates off by 50-90% for cache-heavy workloads.
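A small worked example shows how large the gap can get. It assumes the illustrative per-million-token prices below and a provider that counts cached tokens inside the reported input total; both are assumptions, so check your provider's usage fields and pricing page before reusing the numbers.

```python
# Illustrative per-million-token prices (assumed, not authoritative)
INPUT_PRICE = 3.00
CACHE_READ_PRICE = 0.30


def input_cost(total_input_tokens: int, cache_read_tokens: int) -> tuple[float, float]:
    """Return (naive_cost, cache_aware_cost) in USD for the input side of one call."""
    # Naive estimate: every input token billed at the full input price
    naive = total_input_tokens / 1_000_000 * INPUT_PRICE
    # Cache-aware estimate: cached tokens billed at the cheaper read price
    uncached = total_input_tokens - cache_read_tokens
    aware = (
        uncached / 1_000_000 * INPUT_PRICE
        + cache_read_tokens / 1_000_000 * CACHE_READ_PRICE
    )
    return naive, aware


# 10,000 input tokens, of which 9,000 were served from cache
naive, aware = input_cost(total_input_tokens=10_000, cache_read_tokens=9_000)
overestimate = (naive - aware) / naive  # share of the naive estimate that is not real spend
```

With these inputs the naive estimate is $0.03 and the cache-aware one is $0.0057, so about 81% of the naive figure is spend that never happened.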

Handling Agent Loops

Agent workflows often involve retry loops and self-correction. A naive trace implementation records these as a flat list of spans, losing the causal relationship. The trace should capture parent-child relationships between correction steps:

# Pseudocode for traced agent loop
with tracer.start_trace("user_request") as trace:
    plan = await generate_plan(trace, user_input)
    
    for attempt in range(max_retries):
        with trace.start_span(f"execution_attempt_{attempt}") as span:
            result = await execute_plan(span, plan)
            validation = await validate_result(span, result)
            
            span.set_attribute("validation.passed", validation.passed)
            span.set_attribute("validation.issues", validation.issues)
            
            if validation.passed:
                break
            
            # Self-correction: the correction span is a child of this attempt
            with span.start_span("self_correction") as correction:
                plan = await revise_plan(correction, plan, validation.issues)

This structure makes it possible to answer: “How many attempts did this request take?” and “What validation failures triggered corrections?”
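A runnable version of the same idea, using a tiny in-memory span type instead of a real tracing SDK. Everything here (the `Span` class, `run_with_retries`, and the stub coroutines) is hypothetical scaffolding for illustration; a production setup would use a platform SDK or OpenTelemetry.

```python
import asyncio
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    attributes: dict = field(default_factory=dict)
    children: list["Span"] = field(default_factory=list)

    @contextmanager
    def start_span(self, name: str):
        # Child spans are attached to their parent, preserving causality
        child = Span(name)
        self.children.append(child)
        yield child

    def set_attribute(self, key: str, value) -> None:
        self.attributes[key] = value


async def run_with_retries(execute, validate, revise, plan, max_retries: int = 3) -> Span:
    root = Span("user_request")
    for attempt in range(max_retries):
        with root.start_span(f"execution_attempt_{attempt}") as span:
            result = await execute(plan)
            issues = await validate(result)
            span.set_attribute("validation.passed", not issues)
            span.set_attribute("validation.issues", issues)
            if not issues:
                break
            # The correction is recorded as a child of the failed attempt
            with span.start_span("self_correction"):
                plan = await revise(plan, issues)
    return root


async def _demo() -> Span:
    async def execute(plan):
        return plan

    async def validate(result):
        return [] if result == "v2" else ["stale plan"]

    async def revise(plan, issues):
        return "v2"

    return await run_with_retries(execute, validate, revise, plan="v1")


trace = asyncio.run(_demo())
attempts = len(trace.children)
```

In the demo the first attempt fails validation and is revised, and the second passes, so the trace holds two attempt spans with one `self_correction` child under the first.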

Hallucination Monitoring

Hallucination detection in production is harder than in evaluation benchmarks because there’s no ground truth to compare against. Several approaches work at different accuracy-cost tradeoffs.

Detection Methods

Reference-based checking: When the LLM’s output is grounded in retrieved documents (RAG), compare claims in the output against the source material. This catches unsupported claims but misses cases where the retrieved documents themselves are wrong.

Self-consistency checking: Run the same prompt multiple times (or ask the model to verify its own output in a separate call). Inconsistencies suggest hallucination. This doubles or triples cost per request, so it’s typically applied only to high-stakes outputs.

Entailment scoring: Use a smaller, specialized model (a natural language inference classifier) to score whether the output is entailed by the input context. Models like vectara/hallucination_evaluation_model or similar cross-encoder classifiers can run inference in <50ms and provide a 0-1 confidence score.

LLM-as-judge: Use a separate LLM call to evaluate the output for factual accuracy, relevance, and grounding. This is the most flexible approach but adds latency and cost. Typically done asynchronously, not inline.

Four hallucination detection methods, each suited to different cost/accuracy tradeoffs.

Sampling Strategy

Running hallucination detection on every request is expensive. A practical approach:

Traffic Volume	Strategy
<1,000 req/day	Evaluate 100% with entailment scoring
1,000–50,000 req/day	Evaluate 100% with entailment, 10% with LLM-as-judge
50,000+ req/day	Evaluate 5–10% sample with entailment, 1% with LLM-as-judge

Store the hallucination scores alongside traces. Alert when the rolling average exceeds a baseline threshold — a spike from 4% to 12% hallucination rate over 24 hours probably indicates a provider-side model update.
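One way to implement the sampling tiers is to hash the trace ID, so the decision is deterministic and reproducible across services. This is a sketch under assumptions: the function names and evaluator labels are hypothetical, and the rates would come from the table above.

```python
import hashlib


def sample_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Keep 53 bits so the result is an exact float strictly below 1.0
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def evaluators_for(trace_id: str, entailment_rate: float, judge_rate: float) -> list[str]:
    """Pick which evaluators run for this trace, given per-evaluator sample rates."""
    bucket = sample_bucket(trace_id)
    selected = []
    if bucket < entailment_rate:
        selected.append("entailment")
    if bucket < judge_rate:
        selected.append("llm_judge")
    return selected
```

Because both checks use the same bucket, any trace sent to the LLM judge is also scored by the entailment model whenever the judge rate is the smaller of the two, which makes the two scores directly comparable on the overlapping sample.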

Implementation Example

from sentence_transformers import CrossEncoder

# Load NLI model for entailment scoring
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large", max_length=512)

def score_hallucination(context: str, output: str) -> float:
    """
    Returns hallucination probability (0 = grounded, 1 = hallucinated).
    Uses NLI entailment: if output is NOT entailed by context, 
    it's likely hallucinated.
    """
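    # predict() returns raw logits by default; apply_softmax=True yields probabilities
    scores = nli_model.predict([(context, output)], apply_softmax=True)
    # Label order for this checkpoint, per its model card:
    # [contradiction, entailment, neutral]. Verify the order for any other NLI model.
    # Higher entailment = lower hallucination risk
    entailment_score = float(scores[0][1])  # entailment class
    return 1.0 - entailment_score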

def monitor_hallucination(trace_id: str, context: str, output: str):
    score = score_hallucination(context, output)
    
    # Log to observability platform (`observability` is a placeholder for your client)
    observability.log_score(
        trace_id=trace_id,
        name="hallucination_probability",
        value=score,
        threshold=0.5
    )
    
    if score > 0.7:
        observability.alert(
            severity="warning",
            message=f"High hallucination score ({score:.2f}) for trace {trace_id}"
        )

Output Drift Detection

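Model behavior can change without any change on your side. Floating aliases get repointed to newer snapshots, and providers can adjust the serving stack or safety systems around a model without a new version identifier. Providers generally document dated snapshot identifiers as fixed, so pin those where you can, but treat that stability as something to verify rather than assume. Either way, silent changes can shift output characteristics in ways that break downstream systems.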

What Drifts

Format compliance: A model that reliably produced valid JSON might start adding markdown formatting or explanatory text around the JSON
Verbosity: Average output length shifts, affecting both cost and user experience
Tone and style: Subtle changes in formality, hedging language, or response structure
Tool call patterns: Changes in how models structure function calls or choose between available tools
Refusal rates: Increased safety filtering can cause previously-working prompts to be refused
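Format compliance is the easiest of these to measure mechanically. The sketch below classifies an output as clean JSON, JSON wrapped in a markdown fence, or neither; the function name and the three category labels are assumptions for illustration.

```python
import json
import re

FENCE = "`" * 3  # built indirectly so this example can itself sit inside a fence
FENCE_RE = re.compile(rf"^{FENCE}(?:json)?\s*\n(.*?)\n{FENCE}\s*$", re.DOTALL)


def classify_json_output(text: str) -> str:
    """Return 'valid', 'fenced' (valid JSON inside a markdown fence), or 'invalid'."""
    stripped = text.strip()
    try:
        json.loads(stripped)
        return "valid"
    except json.JSONDecodeError:
        pass
    match = FENCE_RE.match(stripped)
    if match:
        try:
            json.loads(match.group(1))
            return "fenced"
        except json.JSONDecodeError:
            return "invalid"
    return "invalid"
```

Tracking the share of `fenced` outputs separately from `invalid` ones distinguishes a model that started decorating its JSON from one that stopped producing JSON at all.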

Measuring Drift

Track statistical distributions of output properties over time windows (hourly, daily, weekly):

from dataclasses import dataclass
from collections import deque
import statistics

@dataclass
class OutputMetrics:
    output_length: int
    json_valid: bool
    tool_calls_count: int
    refusal: bool
    latency_ms: float
    hallucination_score: float

class DriftDetector:
    def __init__(self, window_size: int = 1000):
        self.baseline: deque[OutputMetrics] = deque(maxlen=window_size)
        self.current: deque[OutputMetrics] = deque(maxlen=window_size)
    
    def check_drift(self, metric_name: str) -> dict:
        baseline_vals = [getattr(m, metric_name) for m in self.baseline]
        current_vals = [getattr(m, metric_name) for m in self.current]
        
        if len(baseline_vals) < 100 or len(current_vals) < 100:
            return {"status": "insufficient_data"}
        
        baseline_mean = statistics.mean(baseline_vals)
        current_mean = statistics.mean(current_vals)
        baseline_std = statistics.stdev(baseline_vals)
        
        if baseline_std == 0:
            return {"status": "no_variance"}
        
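        # Standardized shift: how many baseline standard deviations the mean moved.
        # This is an effect size rather than a standard-error z-test, so it does
        # not become more sensitive as the window grows.
        z_score = (current_mean - baseline_mean) / baseline_std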
        
        return {
            "metric": metric_name,
            "baseline_mean": baseline_mean,
            "current_mean": current_mean,
            "z_score": z_score,
            "drifted": abs(z_score) > 2.0
        }

A practical rule of thumb: re-establish baselines whenever you intentionally change prompts, switch model versions, or update retrieval pipelines. Unexpected drift against a stable baseline is the signal worth alerting on.

Drift detection pipeline: extract output metrics, compare rolling windows against a baseline, alert on statistical shifts.
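For rate-style metrics such as refusals or JSON validity, a standardized shift in per-request units is insensitive: a 2% refusal rate has a per-request standard deviation of about 0.14, so even a jump to 4.5% moves the mean by only about 0.18 standard deviations. A two-proportion z-test is one common alternative. The sketch below is a standard textbook formulation, offered as an option rather than the method used above.

```python
import math


def two_proportion_z(
    baseline_hits: int, baseline_n: int, current_hits: int, current_n: int
) -> float:
    """z statistic for the difference between two observed rates."""
    p_base = baseline_hits / baseline_n
    p_cur = current_hits / current_n
    # Pooled rate under the null hypothesis that both windows share one rate
    pooled = (baseline_hits + current_hits) / (baseline_n + current_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / baseline_n + 1 / current_n))
    if se == 0:
        return 0.0
    return (p_cur - p_base) / se


# Refusals went from 20 of 1,000 requests to 45 of 1,000
z = two_proportion_z(20, 1000, 45, 1000)
```

Here `z` comes out at roughly 3.15, which clears a 2.0 threshold. Unlike the effect-size check, this test gets more sensitive as windows grow, so very large windows will flag shifts too small to matter in practice.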

Token Cost Tracking

Token costs in LLM applications follow patterns unlike any other cloud resource. A single poorly-constructed prompt can cost 100x more than an optimized one. Cost observability requires tracking at multiple granularities.

Cost Attribution Dimensions

Every LLM call’s cost should be attributed across:

User: Which user or account triggered the spend
Feature: Which product feature (code review, summarization, chat)
Model: Which model was used (including fallback routing)
Cache status: Cache hit, cache write, or uncached
Workflow step: Which step in a multi-step agent pipeline

def calculate_call_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0
) -> float:
    """
    Calculate cost for a single LLM call.
    Prices in USD per million tokens — update these from provider pricing pages.
    """
    # Example pricing structure (verify against current provider pricing)
    pricing = {
        "claude-sonnet-4.6": {
            "input": 3.00,
            "output": 15.00,
            "cache_read": 0.30,
            "cache_write": 3.75,
        },
        "claude-haiku-4.5": {
            "input": 0.80,
            "output": 4.00,
            "cache_read": 0.08,
            "cache_write": 1.00,
        },
        "gpt-4.1-nano": {
            "input": 0.10,
            "output": 0.40,
            "cache_read": 0.025,
            "cache_write": 0.10,
        },
    }
    
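    p = pricing.get(model)
    if not p:
        # Fail loudly: silently returning 0.0 would hide spend on unpriced models
        raise KeyError(f"No pricing configured for model {model!r}")
    
    # Assumes input_tokens is the total prompt size, cached tokens included.
    # Some providers (Anthropic's usage object, for example) report cached
    # tokens separately from input_tokens; in that case skip the subtraction.
    uncached_input = max(input_tokens - cache_read_tokens - cache_write_tokens, 0)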
    
    cost = (
        (uncached_input / 1_000_000) * p["input"]
        + (output_tokens / 1_000_000) * p["output"]
        + (cache_read_tokens / 1_000_000) * p["cache_read"]
        + (cache_write_tokens / 1_000_000) * p["cache_write"]
    )
    
    return round(cost, 6)

Cost Anomaly Detection

Token costs follow predictable distributions for a given feature. A code review feature might average $0.03 per invocation with a standard deviation of $0.01. A single call costing $0.50 indicates either a prompt injection (inflating context), a retrieval bug (pulling too many documents), or an agent loop that didn’t terminate.

Track per-feature cost distributions and alert when individual calls exceed 3 standard deviations above the mean, or when hourly/daily aggregate costs exceed budgets.
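A minimal sketch of the per-call check, keeping a running mean and variance per feature with Welford's algorithm so no history needs to be stored. The class names, the 30-sample warm-up, and the choice to exclude anomalous calls from the baseline are assumptions, not requirements.

```python
import math
from collections import defaultdict


class FeatureCostStats:
    """Running mean and sample variance of per-call cost (Welford's algorithm)."""

    def __init__(self) -> None:
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, cost: float) -> None:
        self.n += 1
        delta = cost - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (cost - self.mean)

    @property
    def stdev(self) -> float:
        return math.sqrt(self._m2 / (self.n - 1)) if self.n > 1 else 0.0


class CostAnomalyDetector:
    def __init__(self, sigma: float = 3.0, min_samples: int = 30) -> None:
        self.sigma = sigma
        self.min_samples = min_samples
        self._stats = defaultdict(FeatureCostStats)

    def observe(self, feature: str, cost_usd: float) -> bool:
        """Return True if this call's cost is anomalous for its feature."""
        stats = self._stats[feature]
        anomalous = (
            stats.n >= self.min_samples
            and stats.stdev > 0
            and cost_usd > stats.mean + self.sigma * stats.stdev
        )
        # Anomalous calls are kept out of the baseline so one runaway
        # request does not widen the threshold for the next one
        if not anomalous:
            stats.update(cost_usd)
        return anomalous
```

Per-call costs are usually right-skewed, so a mean-plus-three-sigma rule will over-alert on some features; a percentile threshold is a reasonable substitute if that happens.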

The Hidden Costs

Several cost sources are easy to miss:

Hidden Cost	Why It’s Missed
Retry tokens	Failed calls that consumed tokens before erroring
Validation re-runs	Output failed validation, entire generation repeated
Embedding calls	Retrieval-time embeddings billed separately
Judge/eval calls	Async quality evaluation uses real tokens
Prompt caching writes	First call to populate cache costs more than subsequent reads

Latency Profiling

LLM latency is not a single number. A streaming response has at least three distinct latency measurements, and each one matters for different reasons.

Latency Components

Metric	Definition	Typical Range	Why It Matters
Time to First Token (TTFT)	Time from request sent to first token received	200ms–3s	Determines perceived responsiveness in streaming UIs
Inter-Token Latency (ITL)	Average time between consecutive tokens	10ms–50ms	Affects streaming smoothness
Total Generation Time	Time from request to final token	1s–60s+	End-to-end wall clock time
Queue Time	Time spent waiting before inference begins	0ms–10s+	Spikes during provider congestion
Tool Call Overhead	Time spent executing tool calls mid-generation	Variable	Can dominate total latency in agent workflows

Latency decomposition for a single LLM request with tool calls.

Measuring TTFT Accurately

TTFT measurement requires care. If using an HTTP client that buffers responses, the measured TTFT includes buffering delay. For accurate measurement:

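import json
import time
import httpx

async def measure_ttft(client: httpx.AsyncClient, request_body: dict) -> dict:
    """
    Measure streaming latency for one Messages API call.
    Assumes the client is already configured with the API key header.
    """
    start = time.monotonic()
    first_token_time = None
    last_token_time = None
    total_tokens = 0
    
    async with client.stream(
        "POST",
        "https://api.anthropic.com/v1/messages",
        json={**request_body, "stream": True},  # streaming must be enabled
        headers={"anthropic-version": "2023-06-01"}
    ) as response:
        response.raise_for_status()
        async for line in response.aiter_lines():
            if not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            # Only content deltas carry generated output. message_start, ping,
            # and similar events arrive earlier and would understate TTFT.
            if event.get("type") != "content_block_delta":
                continue
            last_token_time = time.monotonic()
            if first_token_time is None:
                first_token_time = last_token_time
            total_tokens += 1
    
    end = time.monotonic()
    
    return {
        "ttft_ms": (first_token_time - start) * 1000 if first_token_time is not None else None,
        "total_ms": (end - start) * 1000,
        # Counts streamed deltas; one delta can carry more than one token,
        # so treat this and itl_ms as approximations
        "tokens": total_tokens,
        "itl_ms": ((last_token_time - first_token_time) * 1000 / max(total_tokens - 1, 1))
                  if first_token_time is not None else None
    }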

Latency Budgets

For agent workflows with multiple LLM calls, establish per-step latency budgets:

# Example latency budget for a RAG agent
total_budget_ms: 8000
steps:
  query_understanding:
    model: gpt-4.1-nano
    budget_ms: 800
  retrieval:
    budget_ms: 200
  reranking:
    budget_ms: 300
  generation:
    model: claude-sonnet-4.6
    budget_ms: 5000
  validation:
    model: gpt-4.1-nano
    budget_ms: 700
  buffer_ms: 1000

Track what percentage of requests exceed each step’s budget. A step consistently exceeding its budget indicates either a model performance regression, a prompt that’s too complex, or provider-side capacity issues.
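Computing those percentages is a small aggregation over per-request step timings. The sketch below assumes timings arrive as one dict per request, keyed by step name; that shape and the function name are hypothetical.

```python
from collections import defaultdict

# Per-step budgets in milliseconds, taken from the example budget above
BUDGETS_MS = {
    "query_understanding": 800,
    "retrieval": 200,
    "reranking": 300,
    "generation": 5000,
    "validation": 700,
}


def budget_violation_rates(step_timings: list, budgets: dict) -> dict:
    """Return, per step, the fraction of requests that exceeded the step's budget."""
    over = defaultdict(int)
    seen = defaultdict(int)
    for timings in step_timings:
        for step, elapsed_ms in timings.items():
            if step not in budgets:
                continue  # ignore steps without a configured budget
            seen[step] += 1
            if elapsed_ms > budgets[step]:
                over[step] += 1
    return {step: over[step] / seen[step] for step in seen}


rates = budget_violation_rates(
    [
        {"retrieval": 150, "generation": 6000},
        {"retrieval": 250, "generation": 4000},
    ],
    BUDGETS_MS,
)
```

In this two-request sample, retrieval and generation each exceed their budget once, so both report a violation rate of 0.5.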

Platform Comparison

Three platforms dominate LLM observability as of mid-2026: Langfuse, Helicone, and Braintrust. Each approaches the problem differently.

Langfuse

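Architecture: Open-source, self-hostable. Since v3, traces, observations, and scores are stored in ClickHouse, with PostgreSQL holding transactional data; self-hosting also requires Redis and S3-compatible blob storage. SDK-based instrumentation with OpenTelemetry compatibility.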

Strengths:

Full trace hierarchy (sessions → traces → spans → generations)
Built-in evaluation framework with custom scoring functions
Dataset management for regression testing
Self-hosting option (important for compliance-sensitive environments)
Native integrations with LangChain, LlamaIndex, and Vercel AI SDK

Weaknesses:

Self-hosted deployment means operating several stateful services (PostgreSQL, ClickHouse, Redis, blob storage) at scale
Dashboard UI is functional but less polished than commercial alternatives
Real-time alerting requires additional infrastructure

Pricing: Open-source (self-hosted), or cloud-hosted with a free tier and usage-based pricing.

Helicone

Architecture: Proxy-based. Sits between the application and LLM providers as an HTTP proxy, capturing all request/response data without SDK changes.

Strengths:

Zero-code instrumentation: change the base URL, get observability
Strong cost tracking and analytics out of the box
Request caching at the proxy layer
Rate limiting and request queuing built in
Low integration effort

Weaknesses:

Proxy architecture adds a network hop (typically 1-5ms)
Less flexible trace hierarchy than SDK-based approaches
Custom evaluation scoring is more limited
Multi-step agent traces require additional annotation

Pricing: Free tier with usage-based scaling.

Braintrust

Architecture: SDK-based with a focus on evaluation and experimentation. Positions itself as an “AI product development platform” rather than pure observability.

Strengths:

Strong evaluation framework with built-in scoring functions (factuality, relevance, toxicity)
Experiment tracking: compare prompt versions, model changes side by side
Dataset management with human annotation workflows
Logging doubles as evaluation data collection

Weaknesses:

More opinionated about workflow than Langfuse or Helicone
Tighter coupling to the platform’s evaluation philosophy
Self-hosting not available

Pricing: Free tier, usage-based scaling.

Comparison Table

Feature	Langfuse	Helicone	Braintrust
Integration method	SDK	Proxy	SDK
Self-hostable	Yes (open-source)	No	No
Trace hierarchy	Full (session/trace/span)	Flat + annotations	Experiment/log based
Cost tracking	Yes	Yes (automatic)	Yes
Evaluation framework	Custom scores	Basic	Built-in scorers
Prompt management	Yes	No	Yes
Dataset management	Yes	No	Yes
Real-time alerting	Via webhooks	Built-in	Via integrations
OpenTelemetry support	Yes	No	Partial

Three integration approaches: Langfuse and Braintrust use SDKs, Helicone uses a proxy layer.

When to Use Which

Langfuse fits teams that want full control, need self-hosting, or are building complex agent workflows with deep trace hierarchies. The open-source nature makes it the default choice for compliance-constrained environments.

Helicone fits teams that want observability with minimal code changes, especially for simpler request/response LLM applications. The proxy approach means you get cost and latency tracking immediately.

Braintrust fits teams whose primary concern is evaluation and experimentation — comparing prompt variants, running regression tests, and managing human evaluation workflows. The observability is a byproduct of the evaluation infrastructure.

Many teams use more than one. Helicone as a proxy for immediate cost visibility, plus Langfuse or Braintrust for deeper evaluation workflows, is a common combination.

Instrumentation Patterns

Pattern 1: Decorator-Based Tracing

Wrap LLM calls with decorators that automatically capture inputs, outputs, and metadata:

from functools import wraps
import time
from langfuse import Langfuse

langfuse = Langfuse()

def traced_llm_call(name: str, model: str):
    def decorator(func):
        @wraps(func)
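        async def wrapper(*args, trace=None, **kwargs):
            # trace is a Langfuse trace handle (v2-style SDK interface).
            # Without one there is nothing to attach to, so run untraced.
            if trace is None:
                return await func(*args, **kwargs)
            generation = trace.generation(
                name=name,
                model=model,
                input=kwargs.get("messages", args[0] if args else None),
            )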
            
            start = time.monotonic()
            try:
                result = await func(*args, **kwargs)
                generation.end(
                    output=result.content,
                    usage={
                        "input": result.usage.input_tokens,
                        "output": result.usage.output_tokens,
                    },
                    metadata={"latency_ms": (time.monotonic() - start) * 1000}
                )
                return result
            except Exception as e:
                generation.end(
                    status_message=str(e),
                    level="ERROR"
                )
                raise
        return wrapper
    return decorator

@traced_llm_call(name="summarize", model="claude-sonnet-4.6")
async def summarize(messages: list, **kwargs):
    return await anthropic_client.messages.create(
        model="claude-sonnet-4.6",
        messages=messages,
        **kwargs
    )

Pattern 2: Middleware-Based Capture

For proxy-style observability, intercept all LLM calls at the HTTP client level:

import time

import httpx

class LLMObservabilityMiddleware:
    def __init__(self, tracker):
        self.tracker = tracker
    
    async def intercept(
        self, request: httpx.Request
    ) -> httpx.Request:
        request.extensions["obs_start_time"] = time.monotonic()
        request.extensions["obs_request_body"] = request.content
        return request
    
    async def handle_response(
        self, response: httpx.Response
    ) -> httpx.Response:
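        start = response.request.extensions.get("obs_start_time")
        # If the request hook never ran, report no latency rather than a bogus one
        latency = (time.monotonic() - start) * 1000 if start is not None else None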
        
        body = await response.aread()
        
        # Parse provider-specific usage from response
        usage = self._extract_usage(body, response.request.url)
        
        self.tracker.record(
            url=str(response.request.url),
            status=response.status_code,
            latency_ms=latency,
            input_tokens=usage.get("input_tokens"),
            output_tokens=usage.get("output_tokens"),
        )
        
        return response
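The `_extract_usage` helper called above is not defined in the snippet. A minimal sketch for non-streaming JSON responses might look like the following. It relies on the usage field names of Anthropic's Messages API and OpenAI's chat completions API; the URL-based provider check and the fallback behavior are assumptions, and the caller would pass `str(response.request.url)`.

```python
import json


def extract_usage(body: bytes, url: str) -> dict:
    """Pull token counts out of a non-streaming provider response body."""
    try:
        payload = json.loads(body)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return {}  # not JSON (error page, truncated body, streamed response)
    usage = payload.get("usage") if isinstance(payload, dict) else None
    if not isinstance(usage, dict):
        return {}
    if "anthropic.com" in url:
        return {
            "input_tokens": usage.get("input_tokens"),
            "output_tokens": usage.get("output_tokens"),
        }
    # OpenAI-style chat completions name the same counts differently
    return {
        "input_tokens": usage.get("prompt_tokens"),
        "output_tokens": usage.get("completion_tokens"),
    }
```

Streaming responses need different handling, since usage arrives in stream events rather than in a single JSON body, and reading the whole body in a response hook would also defeat streaming for the caller.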

Pattern 3: OpenTelemetry Integration

For teams already using OpenTelemetry, extend the tracing with LLM-specific semantic conventions:

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")

async def call_llm(model: str, messages: list):
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", 0.7)
        
        result = await client.messages.create(
            model=model, messages=messages
        )
        
        span.set_attribute("gen_ai.response.model", result.model)
        span.set_attribute("gen_ai.usage.input_tokens", result.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", result.usage.output_tokens)
        # The convention defines finish_reasons (plural) as an array of strings
        span.set_attribute("gen_ai.response.finish_reasons", [result.stop_reason])
        
        return result

The OpenTelemetry Semantic Conventions for Generative AI are still evolving (the gen_ai.* namespace), but they’re converging toward a standard that Langfuse, Traceloop, and other platforms already support.

Building a Custom Observability Layer

When off-the-shelf platforms don’t fit (air-gapped environments, specific compliance requirements, extreme scale), building a custom layer is straightforward. The core architecture has three components: asynchronous event collection, analytical storage, and visualization with alerting.

Custom observability architecture: async event collection, analytical storage, and visualization/alerting.

Storage Considerations

LLM observability data has specific storage characteristics:

High cardinality: Every trace has unique prompt/completion text
Mixed types: Numeric metrics alongside large text blobs
Time-series access patterns: Most queries filter by time range + dimensions
Write-heavy: Production apps generate thousands of events per minute

ClickHouse handles this well: columnar storage compresses text efficiently, and materialized views can pre-aggregate metrics. Store full prompts/completions in a separate table with trace_id as the join key, keeping the metrics table lean.

PostgreSQL works for smaller deployments (<100k events/day). Use JSONB columns for flexible metadata and create materialized views for common aggregations.

-- ClickHouse schema for LLM events
CREATE TABLE llm_events (
    trace_id String,
    span_id String,
    timestamp DateTime64(3),
    model LowCardinality(String),
    provider LowCardinality(String),
    feature LowCardinality(String),
    user_id String,
    input_tokens UInt32,
    output_tokens UInt32,
    cache_read_tokens UInt32,
    cost_usd Float64,
    latency_ms Float64,
    ttft_ms Float64,
    hallucination_score Float32,
    output_length UInt32,
    stop_reason LowCardinality(String),
    error Boolean DEFAULT false
) ENGINE = MergeTree()
ORDER BY (feature, timestamp, trace_id);

-- Separate table for full text (expensive to store, rarely queried)
CREATE TABLE llm_event_content (
    trace_id String,
    span_id String,
    input_text String CODEC(ZSTD(3)),
    output_text String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY (trace_id, span_id);

Async Event Emission

Never block the request path to record observability data. Use an async queue:

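import asyncio
import logging
from collections import deque

logger = logging.getLogger(__name__)

class ObservabilityEmitter:
    def __init__(self, flush_interval: float = 5.0, batch_size: int = 100,
                 max_buffer: int = 10_000):
        # Bounded buffer: if the sink is down, the oldest events are dropped
        # instead of letting memory grow without limit
        self._buffer: deque = deque(maxlen=max_buffer)
        self._flush_interval = flush_interval
        self._batch_size = batch_size
        self._task: asyncio.Task | None = None
        # Hold references so in-flight flush tasks are not garbage collected
        self._pending: set[asyncio.Task] = set()
    
    def start(self):
        self._task = asyncio.create_task(self._flush_loop())
    
    def emit(self, event: dict):
        """Non-blocking event emission. Call from the event loop thread."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            task = asyncio.create_task(self._flush())
            self._pending.add(task)
            task.add_done_callback(self._pending.discard)
    
    async def _flush_loop(self):
        while True:
            await asyncio.sleep(self._flush_interval)
            await self._flush()
    
    async def _flush(self):
        if not self._buffer:
            return
        
        batch = []
        while self._buffer and len(batch) < self._batch_size:
            batch.append(self._buffer.popleft())
        
        try:
            # Write to ClickHouse, Postgres, or observability platform
            await self._write_batch(batch)
        except Exception:
            # A failing sink must not kill the flush loop; the batch is lost
            logger.exception("Dropped %d observability events", len(batch))
    
    async def _write_batch(self, batch: list[dict]):
        """Sink-specific write; implement per storage backend."""
        raise NotImplementedError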

Alerting Strategies

LLM alerts require different thresholds and logic than traditional application alerts. A 500-error spike is unambiguous. A gradual increase in hallucination rate is not.

Alert Categories

Alert Type	Condition	Urgency	Example
Cost spike	Hourly spend > 3x rolling average	High	Agent loop running unbounded
Latency degradation	p95 TTFT > 5s for 10 min	Medium	Provider capacity issues
Hallucination spike	Rolling hallucination rate > 2x baseline	High	Silent model update
Error rate	>5% of LLM calls returning errors	High	Rate limiting, auth issues
Output drift	Z-score > 2.0 on any tracked metric	Medium	Model behavior change
Refusal rate	>2% of requests refused	Medium	Safety filter changes
Cache hit drop	Cache hit rate drops >20pp in 1 hour	Low	Prompt template change invalidated cache

Composite Alerts

Single-metric alerts generate noise. Composite alerts — requiring multiple conditions — are more actionable:

# Alert: Probable silent model update
alert: model_behavior_change
conditions:
  - output_length_z_score > 2.0
  - hallucination_rate_change > 0.05  # 5pp increase
  - time_window: 6h
  - min_sample_size: 500
severity: high
action: page_on_call
message: |
  Probable model behavior change detected for {model}.
  Output length shifted {z_score} std devs from baseline.
  Hallucination rate increased from {baseline_rate} to {current_rate}.

Alert pipeline: multiple metric streams feed composite rules that route to appropriate channels.
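The composite rule above reduces to a conjunction of conditions over one time window. A sketch of the evaluation logic follows; the `WindowStats` type and function name are hypothetical, and it treats the length condition as two-sided (absolute z-score), which is an assumption the rule as written does not state.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one model over the alert's time window (e.g. 6h)."""
    sample_size: int
    output_length_z_score: float
    baseline_hallucination_rate: float
    current_hallucination_rate: float


def model_behavior_change(
    stats: WindowStats,
    *,
    z_threshold: float = 2.0,
    rate_increase: float = 0.05,  # 5 percentage points
    min_sample_size: int = 500,
) -> bool:
    """Fire only when every condition of the composite rule holds."""
    if stats.sample_size < min_sample_size:
        return False  # too little data to trust either signal
    length_shifted = abs(stats.output_length_z_score) > z_threshold
    rate_up = (
        stats.current_hallucination_rate - stats.baseline_hallucination_rate
        > rate_increase
    )
    return length_shifted and rate_up
```

Requiring both signals means a length shift alone (for example, after a deliberate prompt change) or a hallucination blip alone does not page anyone.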

Avoiding Alert Fatigue

LLM observability alerts are particularly prone to noise because:

Model behavior is inherently variable: Temperature >0 means outputs vary naturally. Set baselines using statistical distributions, not fixed thresholds.
Provider issues are transient: A 30-second latency spike during provider scaling doesn’t warrant a page. Use sustained-duration conditions (e.g., “p95 latency > 3s for 5+ consecutive minutes”).
Evaluation is probabilistic: A hallucination detector with 80% accuracy misclassifies about one output in five, and when genuine hallucinations are rare, most of what it flags will be false positives. Require multiple signals before alerting.
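The effect of the base rate on detector flags can be worked out with Bayes' rule. The numbers below are illustrative assumptions: a 4% true hallucination rate and a detector that is right 80% of the time on both hallucinated and grounded outputs.

```python
def flag_precision(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Fraction of flagged outputs that are genuinely hallucinated."""
    true_pos = base_rate * sensitivity               # hallucinated and flagged
    false_pos = (1 - base_rate) * (1 - specificity)  # grounded but flagged
    flagged = true_pos + false_pos
    return true_pos / flagged if flagged else 0.0


precision = flag_precision(base_rate=0.04, sensitivity=0.80, specificity=0.80)
```

Under these assumptions `precision` is 1/7, about 14%: roughly six of every seven flags are false alarms. That is why a single flagged output is weak evidence, while a sustained change in the flag rate is a much stronger signal.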

Summary

LLM observability requires purpose-built instrumentation because the failure modes — hallucinated outputs, silent model updates, token cost spikes, and semantic drift — are invisible to traditional APM tools.

The core requirements:

Structured traces that capture session → trace → span → generation hierarchies, including agent retry loops and tool calls
Hallucination monitoring using entailment scoring for breadth and LLM-as-judge for depth, with sampling strategies matched to traffic volume
Drift detection against statistical baselines, with re-baselining after intentional changes to prompts, model versions, or retrieval pipelines
Token-granular cost tracking that accounts for cache status, retry waste, and evaluation overhead
Latency decomposition into TTFT, inter-token latency, queue time, and tool call overhead — not just total request duration

Platform choice depends on constraints: Langfuse for self-hosting and complex agent traces, Helicone for zero-code proxy-based capture, Braintrust for evaluation-centric workflows. Many production deployments use more than one.

Alerting on LLM metrics requires composite conditions and statistical thresholds rather than fixed limits. Single-metric alerts on inherently variable systems produce noise. Multiple converging signals — cost spike plus hallucination increase plus output length shift — produce actionable alerts.

The tooling is maturing rapidly. The OpenTelemetry semantic conventions for generative AI are stabilizing, which will probably make cross-platform instrumentation more portable within the next year. Until then, SDK-based instrumentation with one of the purpose-built platforms remains the most practical approach.
