Published 2026-05-19 · Deep Dives · Tags: deep-dive, reference, architecture

LLM Observability in Production

Technical overview of LLM observability: monitoring hallucinations, token costs, latency, and output drift across multi-step agent workflows, with platform comparisons.

Every traditional APM tool — Datadog, New Relic, Grafana — can tell you an HTTP request took 2.3 seconds and returned a 200. None of them can tell you the response was a hallucinated JSON blob that passed validation but contained fabricated data. LLM observability is a distinct discipline because the failure modes are distinct: outputs can be structurally correct and semantically wrong, costs scale with content rather than compute, and latency depends on how much the model “thinks” rather than how busy the server is.

This post covers the instrumentation, metrics, platforms, and architectural patterns required to operate LLM applications with the same rigor applied to traditional production systems.

Why Traditional Observability Breaks

Traditional observability rests on three pillars: metrics, logs, and traces. LLM applications break assumptions embedded in all three.

Metrics: Request count, error rate, and p99 latency don’t capture the dominant failure mode — a 200 OK response containing wrong information. A hallucinated answer is not an error in any HTTP sense.

Logs: A single LLM call can generate 4,000+ tokens of output. Logging raw completions at scale produces terabytes of unstructured text that’s expensive to store and nearly impossible to query meaningfully without semantic analysis.

Traces: A multi-step agent workflow might make 3–15 LLM calls, interleaved with tool calls, retrieval steps, and branching logic. Traditional distributed tracing assumes a request-response tree. Agent traces are often graphs with cycles (retry loops, self-correction steps).

Figure: Traditional APM tools miss the semantic failure modes that dominate LLM application debugging.

The Five Pillars of LLM Observability

LLM observability requires five distinct measurement categories, each with its own instrumentation approach:

| Pillar | What It Measures | Why It’s Hard |
|---|---|---|
| Trace completeness | Full execution path across LLM calls, tool use, retrieval | Agent workflows branch, loop, and self-correct |
| Output quality | Hallucination rate, factual accuracy, format compliance | Requires semantic evaluation, not just schema validation |
| Output drift | Changes in model behavior over time | Provider-side model updates are silent and continuous |
| Cost attribution | Token spend per user, feature, workflow | Input/output token ratios vary wildly; cached vs uncached pricing differs |
| Latency decomposition | Time spent in each phase: routing, queueing, TTFT, generation | Time-to-first-token and total generation time measure different things |

Figure: The five pillars of LLM observability and their interdependencies.

Tracing Multi-Step Agent Workflows

A single user request to an agent system might trigger this sequence: parse intent → retrieve documents → generate plan → execute tool calls → validate output → summarize response. Each step involves different models, different latencies, and different failure modes.

Trace Structure

LLM traces need a hierarchical structure that captures:

  • Sessions: A conversation or user interaction (may span multiple requests)
  • Traces: A single end-to-end request processing
  • Spans: Individual operations within a trace (LLM calls, retrieval, tool execution)
  • Generations: The specific LLM input/output pairs within a span

Figure: Trace hierarchy for LLM applications. Sessions contain traces, and traces contain typed spans.
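The hierarchy above can be sketched as plain data classes. This is a minimal illustration, not any platform's schema: the class names mirror the list above, while the field names, the `kind` labels, and the `walk` and `total_tokens` helpers are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Generation:
    """One LLM input/output pair."""
    model: str
    input_tokens: int
    output_tokens: int


@dataclass
class Span:
    """One operation in a trace: an LLM call, a retrieval step, or a tool call."""
    name: str
    kind: str  # illustrative labels such as "llm", "retrieval", "tool"
    generations: list[Generation] = field(default_factory=list)
    children: list["Span"] = field(default_factory=list)

    def walk(self):
        """Yield this span and every descendant, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """One end-to-end request."""
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def total_tokens(self) -> int:
        """Sum input and output tokens over every generation in the trace."""
        return sum(
            g.input_tokens + g.output_tokens
            for root in self.spans
            for span in root.walk()
            for g in span.generations
        )


@dataclass
class Session:
    """A conversation or user interaction; may hold several traces."""
    session_id: str
    traces: list[Trace] = field(default_factory=list)
```

Nesting spans under spans (rather than keeping a flat list) is what lets a correction step stay attached to the attempt that triggered it.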

What to Capture Per LLM Span

Every LLM call should record:

{
  "span_id": "abc-123",
  "trace_id": "trace-789",
  "model": "claude-sonnet-4.6",
  "provider": "anthropic",
  "input_tokens": 2847,
  "output_tokens": 512,
  "cache_read_tokens": 1200,
  "cache_write_tokens": 0,
  "total_cost_usd": 0.0089,
  "latency_ms": 1843,
  "ttft_ms": 312,
  "temperature": 0.7,
  "top_p": 1.0,
  "stop_reason": "end_turn",
  "input_messages": [...],
  "output_content": "...",
  "metadata": {
    "user_id": "user-456",
    "feature": "code-review",
    "environment": "production"
  }
}

The cache_read_tokens and cache_write_tokens fields matter for cost accuracy. Anthropic’s prompt caching, for example, charges differently for cache writes vs cache reads vs uncached input. Ignoring this distinction can make cost estimates off by 50-90% for cache-heavy workloads.
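A short worked example shows how far a cache-blind estimate can drift. The prices below are illustrative per-million-token rates (a full input rate and a cache-read rate at one tenth of it), and the function names are made up for this sketch; check the provider's pricing page for real numbers.

```python
# Illustrative prices in USD per million tokens; verify against current provider pricing.
INPUT_PER_MTOK = 3.00
CACHE_READ_PER_MTOK = 0.30


def input_cost_naive(total_input_tokens: int) -> float:
    """Bill every input token at the uncached rate."""
    return total_input_tokens / 1_000_000 * INPUT_PER_MTOK


def input_cost_cache_aware(total_input_tokens: int, cache_read_tokens: int) -> float:
    """Bill cache reads at the discounted rate and the remainder at the full rate."""
    uncached = total_input_tokens - cache_read_tokens
    return (
        uncached / 1_000_000 * INPUT_PER_MTOK
        + cache_read_tokens / 1_000_000 * CACHE_READ_PER_MTOK
    )


def overestimate_fraction(total_input_tokens: int, cache_read_tokens: int) -> float:
    """Share of the naive estimate that is never actually billed."""
    naive = input_cost_naive(total_input_tokens)
    actual = input_cost_cache_aware(total_input_tokens, cache_read_tokens)
    return (naive - actual) / naive
```

With 9,000 of 10,000 input tokens served from cache, `overestimate_fraction(10_000, 9_000)` comes out at about 0.81: the naive figure is roughly 81% too high under these assumed prices.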

Handling Agent Loops

Agent workflows often involve retry loops and self-correction. A naive trace implementation records these as a flat list of spans, losing the causal relationship. The trace should capture parent-child relationships between correction steps:

# Pseudocode for a traced agent loop; `tracer` and the helper coroutines are placeholders
async def handle_request(tracer, user_input, max_retries: int = 3):
    with tracer.start_trace("user_request") as trace:
        plan = await generate_plan(trace, user_input)
        result = None

        for attempt in range(max_retries):
            with trace.start_span(f"execution_attempt_{attempt}") as span:
                result = await execute_plan(span, plan)
                validation = await validate_result(span, result)

                span.set_attribute("validation.passed", validation.passed)
                span.set_attribute("validation.issues", validation.issues)

                if validation.passed:
                    break

                # Skip the revision when no attempt is left to run the revised plan
                if attempt == max_retries - 1:
                    break

                # Self-correction: the correction span is a child of this attempt
                with span.start_span("self_correction") as correction:
                    plan = await revise_plan(correction, plan, validation.issues)

        return result

This structure makes it possible to answer: “How many attempts did this request take?” and “What validation failures triggered corrections?”
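As a rough illustration of how those two questions get answered, the sketch below summarizes a list of exported spans. It assumes spans arrive as dictionaries with `name` and `attributes` keys and follow the `execution_attempt_N` naming from the loop above; real platforms expose their own export formats.

```python
def summarize_attempts(spans: list[dict]) -> dict:
    """Count execution attempts and collect the issues behind each failed one."""
    attempts = [s for s in spans if s["name"].startswith("execution_attempt_")]

    issues: list[str] = []
    succeeded = False
    for span in attempts:
        attrs = span.get("attributes", {})
        if attrs.get("validation.passed"):
            succeeded = True
        else:
            issues.extend(attrs.get("validation.issues", []))

    return {
        "attempts": len(attempts),
        "succeeded": succeeded,
        "issues_behind_corrections": issues,
    }
```

Because the attempt spans carry the validation attributes, the summary needs no access to prompt or completion text.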

Hallucination Monitoring

Hallucination detection in production is harder than in evaluation benchmarks because there’s no ground truth to compare against. Several approaches work at different accuracy-cost tradeoffs.

Detection Methods

Reference-based checking: When the LLM’s output is grounded in retrieved documents (RAG), compare claims in the output against the source material. This catches unsupported claims but misses cases where the retrieved documents themselves are wrong.

Self-consistency checking: Run the same prompt multiple times (or ask the model to verify its own output in a separate call). Inconsistencies suggest hallucination. This doubles or triples cost per request, so it’s typically applied only to high-stakes outputs.

Entailment scoring: Use a smaller, specialized model (a natural language inference classifier) to score whether the output is entailed by the input context. Models like vectara/hallucination_evaluation_model or similar cross-encoder classifiers can run inference in <50ms and provide a 0-1 confidence score.

LLM-as-judge: Use a separate LLM call to evaluate the output for factual accuracy, relevance, and grounding. This is the most flexible approach but adds latency and cost. Typically done asynchronously, not inline.

Figure: Four hallucination detection methods, each suited to different cost/accuracy tradeoffs.
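Self-consistency checking can be sketched with nothing more than pairwise comparison of sampled outputs. The version below uses word-set overlap (Jaccard similarity) as a deliberately crude stand-in for semantic agreement; a production check would more likely compare extracted claims or embeddings. All function names are illustrative.

```python
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; a crude proxy for content."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two strings, from 0.0 to 1.0."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def self_consistency(samples: list[str]) -> float:
    """Mean pairwise overlap across samples; low values suggest unstable answers."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to compare")
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

A low score does not prove hallucination, and a high score does not rule it out (a model can be consistently wrong), so this works best as one signal among several.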

Sampling Strategy

Running hallucination detection on every request is expensive. A practical approach:

| Traffic Volume | Strategy |
|---|---|
| <1,000 req/day | Evaluate 100% with entailment scoring |
| 1,000–50,000 req/day | Evaluate 100% with entailment, 10% with LLM-as-judge |
| 50,000+ req/day | Evaluate 5–10% sample with entailment, 1% with LLM-as-judge |
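One way to apply these tiers is deterministic sampling keyed on the trace ID, so the same trace always gets the same decision and the LLM-as-judge sample is a subset of the entailment sample. The sketch below hard-codes the rates from the table (taking 10% as the top-tier entailment rate); the function names are illustrative.

```python
import hashlib


def sample_bucket(trace_id: str) -> float:
    """Map a trace ID to a stable value in [0, 1)."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") / 2**32


def evaluation_plan(trace_id: str, requests_per_day: int) -> dict:
    """Decide which evaluators run for this trace, based on traffic volume."""
    if requests_per_day < 1_000:
        entail_rate, judge_rate = 1.0, 0.0
    elif requests_per_day < 50_000:
        entail_rate, judge_rate = 1.0, 0.10
    else:
        entail_rate, judge_rate = 0.10, 0.01

    bucket = sample_bucket(trace_id)
    return {
        "entailment": bucket < entail_rate,
        "llm_judge": bucket < judge_rate,
    }
```

Hashing instead of calling a random number generator means a replayed or re-processed trace is never evaluated twice by accident, and the sampled population can be reconstructed later from trace IDs alone.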

Store the hallucination scores alongside traces. Alert when the rolling average exceeds a baseline threshold: a jump from 4% to 12% hallucination rate over 24 hours, with no prompt or retrieval change on your side, probably indicates a provider-side model update.
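A rolling-rate alert of this kind fits in a few lines. The sketch below is an assumption-laden illustration: it counts a response as hallucinated when its score exceeds 0.5, keeps a fixed-size window, and fires when the windowed rate is more than `ratio` times the baseline. The class name and defaults are invented for the example.

```python
from collections import deque


class RollingRateAlert:
    """Tracks the share of recent responses flagged as hallucinated."""

    def __init__(self, baseline_rate: float, ratio: float = 2.0,
                 window: int = 1000, min_samples: int = 100):
        self.baseline_rate = baseline_rate
        self.ratio = ratio
        self.min_samples = min_samples
        self._flags: deque[bool] = deque(maxlen=window)

    def rate(self) -> float:
        """Fraction of responses in the window that were flagged."""
        return sum(self._flags) / len(self._flags) if self._flags else 0.0

    def observe(self, score: float, flag_threshold: float = 0.5) -> bool:
        """Record one score; return True when the rolling rate warrants an alert."""
        self._flags.append(score > flag_threshold)
        if len(self._flags) < self.min_samples:
            return False  # too little data to say anything
        return self.rate() > self.baseline_rate * self.ratio
```

The `min_samples` guard matters more than it looks: without it, the first flagged response after a restart produces a 100% rate and an immediate page.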

Implementation Example

from sentence_transformers import CrossEncoder

# Load NLI model for entailment scoring.
# Inputs beyond max_length tokens are truncated, so long contexts should be chunked.
nli_model = CrossEncoder("cross-encoder/nli-deberta-v3-large", max_length=512)

# Label order from this model's card: ["contradiction", "entailment", "neutral"].
# Other NLI models order their classes differently, so check before reusing the index.
ENTAILMENT_INDEX = 1

def score_hallucination(context: str, output: str) -> float:
    """
    Returns hallucination probability (0 = grounded, 1 = hallucinated).
    Uses NLI entailment: if output is NOT entailed by context, 
    it's likely hallucinated.
    """
    # predict() returns raw logits by default; apply_softmax=True yields probabilities
    probs = nli_model.predict([(context, output)], apply_softmax=True)
    # Higher entailment = lower hallucination risk
    entailment_prob = float(probs[0][ENTAILMENT_INDEX])
    return 1.0 - entailment_prob

def monitor_hallucination(trace_id: str, context: str, output: str):
    score = score_hallucination(context, output)
    
    # Log to observability platform
    # (`observability` stands in for your platform's client; it is not a real package)
    observability.log_score(
        trace_id=trace_id,
        name="hallucination_probability",
        value=score,
        threshold=0.5
    )
    
    if score > 0.7:
        observability.alert(
            severity="warning",
            message=f"High hallucination score ({score:.2f}) for trace {trace_id}"
        )

Output Drift Detection

Model behavior can shift without any change on your side. Providers repoint floating model aliases to newer snapshots, and serving-side changes (safety filters, inference optimizations) may alter outputs even when the model identifier in your requests stays the same. These silent updates can change output characteristics in ways that break downstream systems.

What Drifts

  • Format compliance: A model that reliably produced valid JSON might start adding markdown formatting or explanatory text around the JSON
  • Verbosity: Average output length shifts, affecting both cost and user experience
  • Tone and style: Subtle changes in formality, hedging language, or response structure
  • Tool call patterns: Changes in how models structure function calls or choose between available tools
  • Refusal rates: Increased safety filtering can cause previously-working prompts to be refused
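Most of these properties can be extracted from the raw output with cheap, deterministic checks before any statistics are applied. The sketch below is one possible extractor; the refusal phrases are an illustrative and far from exhaustive list, and the field names are assumptions chosen for the example.

```python
import json

# Illustrative phrases only; real refusal detection needs a broader list or a classifier.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to")

# Three backticks, built programmatically so this snippet can sit inside a fenced block.
CODE_FENCE = "`" * 3


def extract_output_properties(output: str) -> dict:
    """Cheap per-response properties to feed into drift tracking."""
    stripped = output.strip()

    try:
        json.loads(stripped)
        json_valid = True
    except ValueError:
        json_valid = False

    lowered = stripped.lower()
    return {
        "output_length": len(stripped),
        "json_valid": json_valid,
        # A model that starts wrapping JSON in markdown fails strict parsing
        "wrapped_in_code_fence": stripped.startswith(CODE_FENCE),
        "refusal": any(marker in lowered for marker in REFUSAL_MARKERS),
    }
```

Tracking `wrapped_in_code_fence` separately from `json_valid` makes the most common format regression visible as its own trend instead of hiding it inside a generic parse-failure rate.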

Measuring Drift

Track statistical distributions of output properties over time windows (hourly, daily, weekly):

from dataclasses import dataclass
from collections import deque
import statistics

@dataclass
class OutputMetrics:
    output_length: int
    json_valid: bool
    tool_calls_count: int
    refusal: bool
    latency_ms: float
    hallucination_score: float

class DriftDetector:
    def __init__(self, window_size: int = 1000):
        self.baseline: deque[OutputMetrics] = deque(maxlen=window_size)
        self.current: deque[OutputMetrics] = deque(maxlen=window_size)
    
    def record(self, metrics: OutputMetrics, baseline: bool = False) -> None:
        """Add an observation to the baseline window or the current window."""
        (self.baseline if baseline else self.current).append(metrics)
    
    def check_drift(self, metric_name: str) -> dict:
        # Boolean fields (json_valid, refusal) average to a rate
        baseline_vals = [getattr(m, metric_name) for m in self.baseline]
        current_vals = [getattr(m, metric_name) for m in self.current]
        
        if len(baseline_vals) < 100 or len(current_vals) < 100:
            return {"status": "insufficient_data"}
        
        baseline_mean = statistics.mean(baseline_vals)
        current_mean = statistics.mean(current_vals)
        baseline_std = statistics.stdev(baseline_vals)
        
        if baseline_std == 0:
            return {"status": "no_variance"}
        
        # Standardized shift: how far the current mean moved, in units of
        # baseline standard deviation (an effect size, not a significance test)
        z_score = (current_mean - baseline_mean) / baseline_std
        
        return {
            "metric": metric_name,
            "baseline_mean": baseline_mean,
            "current_mean": current_mean,
            "z_score": z_score,
            "drifted": abs(z_score) > 2.0
        }

A practical rule of thumb: re-establish baselines whenever you intentionally change prompts, switch model versions, or update retrieval pipelines. Unexpected drift against a stable baseline is the signal worth alerting on.

Figure: Drift detection pipeline: extract output metrics, compare rolling windows against a baseline, alert on statistical shifts.

Token Cost Tracking

Token costs in LLM applications follow patterns unlike any other cloud resource. A single poorly-constructed prompt can cost 100x more than an optimized one. Cost observability requires tracking at multiple granularities.

Cost Attribution Dimensions

Every LLM call’s cost should be attributed across:

  • User: Which user or account triggered the spend
  • Feature: Which product feature (code review, summarization, chat)
  • Model: Which model was used (including fallback routing)
  • Cache status: Cache hit, cache write, or uncached
  • Workflow step: Which step in a multi-step agent pipeline

def calculate_call_cost(
    model: str,
    input_tokens: int,
    output_tokens: int,
    cache_read_tokens: int = 0,
    cache_write_tokens: int = 0
) -> float:
    """
    Calculate cost for a single LLM call.
    Prices in USD per million tokens; update these from provider pricing pages.

    Assumes input_tokens is the total prompt size, including cached tokens.
    Some APIs (Anthropic's Messages API, for example) report cache reads and
    writes separately from input_tokens; add them together before calling.
    """
    # Example pricing structure (verify against current provider pricing)
    pricing = {
        "claude-sonnet-4.6": {
            "input": 3.00,
            "output": 15.00,
            "cache_read": 0.30,
            "cache_write": 3.75,
        },
        "claude-haiku-4.5": {
            "input": 1.00,
            "output": 5.00,
            "cache_read": 0.10,
            "cache_write": 1.25,
        },
        "gpt-4.1-nano": {
            "input": 0.10,
            "output": 0.40,
            "cache_read": 0.025,
            "cache_write": 0.10,
        },
    }
    
    p = pricing.get(model)
    if not p:
        # Returning 0.0 here would silently hide spend on unpriced models
        raise ValueError(f"No pricing configured for model {model!r}")
    
    # Clamp at zero in case the caller passed an uncached-only input count
    uncached_input = max(input_tokens - cache_read_tokens - cache_write_tokens, 0)
    
    cost = (
        (uncached_input / 1_000_000) * p["input"]
        + (output_tokens / 1_000_000) * p["output"]
        + (cache_read_tokens / 1_000_000) * p["cache_read"]
        + (cache_write_tokens / 1_000_000) * p["cache_write"]
    )
    
    return round(cost, 6)
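Once each call has a cost, attribution is a group-by over whichever dimensions matter for the question at hand. The sketch below assumes events are dictionaries carrying a `cost_usd` field plus dimension fields such as `user_id` and `feature`; those names follow the span example earlier in the post, but the function itself is illustrative.

```python
from collections import defaultdict


def attribute_costs(events: list[dict], dimensions: tuple[str, ...]) -> dict[tuple, float]:
    """Total cost per combination of the requested dimensions."""
    totals: dict[tuple, float] = defaultdict(float)
    for event in events:
        # Events missing a dimension are grouped under "unknown" rather than dropped
        key = tuple(event.get(dim, "unknown") for dim in dimensions)
        totals[key] += event["cost_usd"]
    return dict(totals)
```

Calling it with `("feature",)` answers "what does code review cost us?", while `("user_id", "feature")` answers "which accounts drive that cost?". Keeping unattributed spend visible under "unknown" is usually the fastest way to find calls that were never instrumented.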

Cost Anomaly Detection

Token costs follow predictable distributions for a given feature. A code review feature might average $0.03 per invocation with a standard deviation of $0.01. A single call costing $0.50 indicates either a prompt injection (inflating context), a retrieval bug (pulling too many documents), or an agent loop that didn’t terminate.

Track per-feature cost distributions and alert when individual calls exceed 3 standard deviations above the mean, or when hourly/daily aggregate costs exceed budgets.
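The per-call check reduces to a mean-plus-k-sigma threshold over recent costs for the same feature. This is a minimal sketch with invented names and defaults; it also assumes costs are roughly symmetric around the mean, which real cost distributions (often long-tailed) may not be, so a percentile-based threshold is a reasonable alternative.

```python
import statistics


def is_cost_anomaly(history: list[float], cost: float,
                    sigmas: float = 3.0, min_history: int = 30) -> bool:
    """True when `cost` sits more than `sigmas` standard deviations above the mean."""
    if len(history) < min_history:
        return False  # not enough data for a stable estimate

    mean = statistics.mean(history)
    std = statistics.stdev(history)
    return cost > mean + sigmas * std
```

Only the upper tail is checked on purpose: an unusually cheap call is rarely an incident, while an unusually expensive one often is.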

The Hidden Costs

Several cost sources are easy to miss:

| Hidden Cost | Why It’s Missed |
|---|---|
| Retry tokens | Failed calls that consumed tokens before erroring |
| Validation re-runs | Output failed validation, entire generation repeated |
| Embedding calls | Retrieval-time embeddings billed separately |
| Judge/eval calls | Async quality evaluation uses real tokens |
| Prompt caching writes | First call to populate cache costs more than subsequent reads |

Latency Profiling

LLM latency is not a single number. A streaming response has at least three distinct latency measurements, and each one matters for different reasons.

Latency Components

| Metric | Definition | Typical Range | Why It Matters |
|---|---|---|---|
| Time to First Token (TTFT) | Time from request sent to first token received | 200ms–3s | Determines perceived responsiveness in streaming UIs |
| Inter-Token Latency (ITL) | Average time between consecutive tokens | 10ms–50ms | Affects streaming smoothness |
| Total Generation Time | Time from request to final token | 1s–60s+ | End-to-end wall clock time |
| Queue Time | Time spent waiting before inference begins | 0ms–10s+ | Spikes during provider congestion |
| Tool Call Overhead | Time spent executing tool calls mid-generation | Variable | Can dominate total latency in agent workflows |

Figure: Latency decomposition for a single LLM request with tool calls.
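To see which component dominates a given request, it helps to turn phase timings into shares of the total. The sketch below assumes each phase is recorded as a `(phase, start_seconds, end_seconds)` tuple and that phases do not overlap; both the record shape and the phase names are illustrative.

```python
def phase_shares(phases: list[tuple[str, float, float]]) -> dict[str, float]:
    """Fraction of total recorded time spent in each phase."""
    durations: dict[str, float] = {}
    for name, start, end in phases:
        # A phase can occur more than once (generation resumes after a tool call)
        durations[name] = durations.get(name, 0.0) + (end - start)

    total = sum(durations.values())
    if total <= 0:
        return {}
    return {name: duration / total for name, duration in durations.items()}
```

For a request that queues for 0.5s, generates for 1s, runs a tool for 2s, then generates for another 0.5s, the tool call accounts for half of the wall-clock time even though no model was running during it.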

Measuring TTFT Accurately

TTFT measurement requires care. If using an HTTP client that buffers responses, the measured TTFT includes buffering delay. For accurate measurement:

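import json
import time
import httpx

async def measure_ttft(client: httpx.AsyncClient, request_body: dict) -> dict:
    """
    Measure streaming latency for one Messages API call.
    Assumes `client` already sends the x-api-key header.
    """
    start = time.monotonic()
    first_token_time = None
    delta_events = 0
    
    async with client.stream(
        "POST",
        "https://api.anthropic.com/v1/messages",
        json={**request_body, "stream": True},
        headers={"anthropic-version": "2023-06-01"}
    ) as response:
        response.raise_for_status()
        async for line in response.aiter_lines():
            if not line.startswith("data: "):
                continue
            event = json.loads(line[len("data: "):])
            # message_start, ping, and content_block_start arrive before any
            # generated text, so only content deltas count toward TTFT
            if event.get("type") != "content_block_delta":
                continue
            if first_token_time is None:
                first_token_time = time.monotonic()
            delta_events += 1
    
    end = time.monotonic()
    
    return {
        "ttft_ms": (first_token_time - start) * 1000 if first_token_time else None,
        "total_ms": (end - start) * 1000,
        # A delta can carry more than one token, so this is a chunk count;
        # take exact token counts from the usage fields in the stream
        "delta_events": delta_events,
        "itl_ms": ((end - first_token_time) * 1000 / max(delta_events - 1, 1)) 
                  if first_token_time else None
    }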

Latency Budgets

For agent workflows with multiple LLM calls, establish per-step latency budgets:

# Example latency budget for a RAG agent
total_budget_ms: 8000
steps:
  query_understanding:
    model: gpt-4.1-nano
    budget_ms: 800
  retrieval:
    budget_ms: 200
  reranking:
    budget_ms: 300
  generation:
    model: claude-sonnet-4.6
    budget_ms: 5000
  validation:
    model: gpt-4.1-nano
    budget_ms: 700
buffer_ms: 1000  # slack not assigned to any step

Track what percentage of requests exceed each step’s budget. A step consistently exceeding its budget indicates either a model performance regression, a prompt that’s too complex, or provider-side capacity issues.
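The tracking itself is a small aggregation over per-request step timings. The sketch below copies the budgets from the example above into a dictionary and assumes each request is recorded as a mapping from step name to observed milliseconds; the function name and record shape are illustrative.

```python
# Budgets from the example above, in milliseconds
BUDGETS_MS = {
    "query_understanding": 800,
    "retrieval": 200,
    "reranking": 300,
    "generation": 5000,
    "validation": 700,
}


def budget_violation_rates(
    requests: list[dict[str, float]],
    budgets: dict[str, float] = BUDGETS_MS,
) -> dict[str, float]:
    """Fraction of requests in which each step exceeded its budget."""
    rates: dict[str, float] = {}
    for step, budget in budgets.items():
        # Not every request runs every step, so skip missing timings
        observed = [timings[step] for timings in requests if step in timings]
        if observed:
            rates[step] = sum(t > budget for t in observed) / len(observed)
    return rates
```

Reporting a rate per step, rather than a single "request over budget" flag, points directly at the step that needs a faster model, a shorter prompt, or a timeout.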

Platform Comparison

This section compares three widely used LLM observability platforms as of mid-2026: Langfuse, Helicone, and Braintrust. Each approaches the problem differently.

Langfuse

Architecture: Open-source, self-hostable. Traces, scores, and datasets stored in PostgreSQL (recent self-hosted versions also rely on ClickHouse for trace data). SDK-based instrumentation with OpenTelemetry compatibility.

Strengths:

  • Full trace hierarchy (sessions → traces → spans → generations)
  • Built-in evaluation framework with custom scoring functions
  • Dataset management for regression testing
  • Self-hosting option (important for compliance-sensitive environments)
  • Native integrations with LangChain, LlamaIndex, and Vercel AI SDK

Weaknesses:

  • Self-hosted deployment requires managing PostgreSQL at scale
  • Dashboard UI is functional but less polished than commercial alternatives
  • Real-time alerting requires additional infrastructure

Pricing: Open-source (self-hosted), or cloud-hosted with a free tier and usage-based pricing.

Helicone

Architecture: Proxy-based. Sits between the application and LLM providers as an HTTP proxy, capturing all request/response data without SDK changes.

Strengths:

  • Zero-code instrumentation: change the base URL, get observability
  • Strong cost tracking and analytics out of the box
  • Request caching at the proxy layer
  • Rate limiting and request queuing built in
  • Low integration effort

Weaknesses:

  • Proxy architecture adds a network hop (typically 1-5ms)
  • Less flexible trace hierarchy than SDK-based approaches
  • Custom evaluation scoring is more limited
  • Multi-step agent traces require additional annotation

Pricing: Free tier with usage-based scaling.

Braintrust

Architecture: SDK-based with a focus on evaluation and experimentation. Positions itself as an “AI product development platform” rather than pure observability.

Strengths:

  • Strong evaluation framework with built-in scoring functions (factuality, relevance, toxicity)
  • Experiment tracking: compare prompt versions, model changes side by side
  • Dataset management with human annotation workflows
  • Logging doubles as evaluation data collection

Weaknesses:

  • More opinionated about workflow than Langfuse or Helicone
  • Tighter coupling to the platform’s evaluation philosophy
  • Self-hosting not available

Pricing: Free tier, usage-based scaling.

Comparison Table

| Feature | Langfuse | Helicone | Braintrust |
|---|---|---|---|
| Integration method | SDK | Proxy | SDK |
| Self-hostable | Yes (open-source) | No | No |
| Trace hierarchy | Full (session/trace/span) | Flat + annotations | Experiment/log based |
| Cost tracking | Yes | Yes (automatic) | Yes |
| Evaluation framework | Custom scores | Basic | Built-in scorers |
| Prompt management | Yes | No | Yes |
| Dataset management | Yes | No | Yes |
| Real-time alerting | Via webhooks | Built-in | Via integrations |
| OpenTelemetry support | Yes | No | Partial |

Figure: Three integration approaches: Langfuse and Braintrust use SDKs, Helicone uses a proxy layer.

When to Use Which

Langfuse fits teams that want full control, need self-hosting, or are building complex agent workflows with deep trace hierarchies. The open-source nature makes it the default choice for compliance-constrained environments.

Helicone fits teams that want observability with minimal code changes, especially for simpler request/response LLM applications. The proxy approach means you get cost and latency tracking immediately.

Braintrust fits teams whose primary concern is evaluation and experimentation — comparing prompt variants, running regression tests, and managing human evaluation workflows. The observability is a byproduct of the evaluation infrastructure.

Many teams use more than one. Helicone as a proxy for immediate cost visibility, plus Langfuse or Braintrust for deeper evaluation workflows, is a common combination.

Instrumentation Patterns

Pattern 1: Decorator-Based Tracing

Wrap LLM calls with decorators that automatically capture inputs, outputs, and metadata:

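from functools import wraps
import time

from anthropic import AsyncAnthropic
from langfuse import Langfuse

# Written against the trace/generation interface of the Langfuse Python SDK v2;
# later SDK versions expose a different interface.
langfuse = Langfuse()
anthropic_client = AsyncAnthropic()

def traced_llm_call(name: str, model: str):
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, trace=None, **kwargs):
            # Start a new trace when the caller did not pass one in
            if trace is None:
                trace = langfuse.trace(name=name)
            
            generation = trace.generation(
                name=name,
                model=model,
                input=kwargs.get("messages", args[0] if args else None),
            )
            
            start = time.monotonic()
            try:
                result = await func(*args, **kwargs)
                generation.end(
                    output=result.content,
                    usage={
                        "input": result.usage.input_tokens,
                        "output": result.usage.output_tokens,
                    },
                    metadata={"latency_ms": (time.monotonic() - start) * 1000}
                )
                return result
            except Exception as e:
                # Close the generation so the failure shows up in the trace
                generation.end(
                    status_message=str(e),
                    level="ERROR"
                )
                raise
        return wrapper
    return decorator

@traced_llm_call(name="summarize", model="claude-sonnet-4.6")
async def summarize(messages: list, max_tokens: int = 1024, **kwargs):
    # max_tokens is a required parameter of the Anthropic Messages API
    return await anthropic_client.messages.create(
        model="claude-sonnet-4.6",
        max_tokens=max_tokens,
        messages=messages,
        **kwargs
    )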

Pattern 2: Middleware-Based Capture

For proxy-style observability, intercept all LLM calls at the HTTP client level:

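import json
import time

import httpx

class LLMObservabilityMiddleware:
    """Request/response hooks for httpx; httpx ignores their return values."""
    
    def __init__(self, tracker):
        self.tracker = tracker
    
    async def intercept(self, request: httpx.Request) -> None:
        request.extensions["obs_start_time"] = time.monotonic()
    
    async def handle_response(self, response: httpx.Response) -> None:
        start = response.request.extensions.get("obs_start_time")
        latency = (time.monotonic() - start) * 1000 if start is not None else None
        
        # Reads the whole body; for streaming calls this buffers the stream
        body = await response.aread()
        
        # Parse provider-specific usage from response
        usage = self._extract_usage(body)
        
        self.tracker.record(
            url=str(response.request.url),
            status=response.status_code,
            latency_ms=latency,
            input_tokens=usage.get("input_tokens"),
            output_tokens=usage.get("output_tokens"),
        )
    
    @staticmethod
    def _extract_usage(body: bytes) -> dict:
        """Pull token counts from a JSON body; returns {} when none are present."""
        try:
            payload = json.loads(body)
        except ValueError:
            return {}
        usage = payload.get("usage") if isinstance(payload, dict) else None
        if not isinstance(usage, dict):
            return {}
        # Anthropic-style field names first, OpenAI-style names as fallback
        return {
            "input_tokens": usage.get("input_tokens", usage.get("prompt_tokens")),
            "output_tokens": usage.get("output_tokens", usage.get("completion_tokens")),
        }

# Wiring (`tracker` is your own recorder object):
# mw = LLMObservabilityMiddleware(tracker)
# client = httpx.AsyncClient(
#     event_hooks={"request": [mw.intercept], "response": [mw.handle_response]}
# )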

Pattern 3: OpenTelemetry Integration

For teams already using OpenTelemetry, extend the tracing with LLM-specific semantic conventions:

from anthropic import AsyncAnthropic
from opentelemetry import trace

tracer = trace.get_tracer("llm-app")
client = AsyncAnthropic()

async def call_llm(
    model: str, messages: list, temperature: float = 0.7, max_tokens: int = 1024
):
    with tracer.start_as_current_span("llm.generate") as span:
        # Attribute names follow the gen_ai.* conventions, which are still changing;
        # check the current spec before relying on any of them
        span.set_attribute("gen_ai.system", "anthropic")
        span.set_attribute("gen_ai.request.model", model)
        span.set_attribute("gen_ai.request.temperature", temperature)
        
        result = await client.messages.create(
            model=model,
            messages=messages,
            temperature=temperature,
            max_tokens=max_tokens,
        )
        
        span.set_attribute("gen_ai.response.model", result.model)
        span.set_attribute("gen_ai.usage.input_tokens", result.usage.input_tokens)
        span.set_attribute("gen_ai.usage.output_tokens", result.usage.output_tokens)
        # The convention defines finish reasons as an array of strings
        span.set_attribute("gen_ai.response.finish_reasons", [result.stop_reason])
        
        return result

The OpenTelemetry Semantic Conventions for Generative AI are still evolving (the gen_ai.* namespace), but they’re converging toward a standard that Langfuse, Traceloop, and other platforms already support.

Building a Custom Observability Layer

When off-the-shelf platforms don’t fit (air-gapped environments, specific compliance requirements, extreme scale), building a custom layer is straightforward. The core architecture has three components.

Figure: Custom observability architecture: async event collection, analytical storage, and visualization/alerting.

Storage Considerations

LLM observability data has specific storage characteristics:

  • High cardinality: Every trace has unique prompt/completion text
  • Mixed types: Numeric metrics alongside large text blobs
  • Time-series access patterns: Most queries filter by time range + dimensions
  • Write-heavy: Production apps generate thousands of events per minute

ClickHouse handles this well: columnar storage compresses text efficiently, and materialized views can pre-aggregate metrics. Store full prompts/completions in a separate table with trace_id as the join key, keeping the metrics table lean.

PostgreSQL works for smaller deployments (<100k events/day). Use JSONB columns for flexible metadata and create materialized views for common aggregations.

-- ClickHouse schema for LLM events
CREATE TABLE llm_events (
    trace_id String,
    span_id String,
    timestamp DateTime64(3),
    model LowCardinality(String),
    provider LowCardinality(String),
    feature LowCardinality(String),
    user_id String,
    input_tokens UInt32,
    output_tokens UInt32,
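    cache_read_tokens UInt32,
    cache_write_tokens UInt32,
    cost_usd Float64,
    latency_ms Float64,
    ttft_ms Nullable(Float64),              -- NULL for non-streaming calls
    hallucination_score Nullable(Float32),  -- NULL when the call was not sampled for evaluation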
    output_length UInt32,
    stop_reason LowCardinality(String),
    error Boolean DEFAULT false
) ENGINE = MergeTree()
ORDER BY (feature, timestamp, trace_id);

-- Separate table for full text (expensive to store, rarely queried)
CREATE TABLE llm_event_content (
    trace_id String,
    span_id String,
    input_text String CODEC(ZSTD(3)),
    output_text String CODEC(ZSTD(3))
) ENGINE = MergeTree()
ORDER BY (trace_id, span_id);

Async Event Emission

Never block the request path to record observability data. Use an async queue:

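import asyncio
import logging
from collections import deque

logger = logging.getLogger(__name__)

class ObservabilityEmitter:
    def __init__(self, flush_interval: float = 5.0, batch_size: int = 100,
                 max_buffer: int = 10_000):
        # Bounded buffer: if the sink is down, the oldest events are dropped
        # instead of growing memory without limit
        self._buffer: deque = deque(maxlen=max_buffer)
        self._flush_interval = flush_interval
        self._batch_size = batch_size
        self._task: asyncio.Task | None = None
        self._wakeup: asyncio.Event | None = None
    
    def start(self):
        """Call from inside a running event loop."""
        self._wakeup = asyncio.Event()
        self._task = asyncio.create_task(self._flush_loop())
    
    def emit(self, event: dict):
        """Non-blocking event emission."""
        self._buffer.append(event)
        # Wake the flush loop early instead of spawning an untracked task per event
        if self._wakeup is not None and len(self._buffer) >= self._batch_size:
            self._wakeup.set()
    
    async def stop(self):
        """Cancel the loop and drain what is left; call during shutdown."""
        if self._task is not None:
            self._task.cancel()
        await self._flush()
    
    async def _flush_loop(self):
        while True:
            try:
                await asyncio.wait_for(self._wakeup.wait(), timeout=self._flush_interval)
            except asyncio.TimeoutError:
                pass  # interval elapsed; flush whatever has accumulated
            self._wakeup.clear()
            try:
                await self._flush()
            except Exception:
                # A failed write must not kill the flush loop
                logger.exception("observability flush failed")
    
    async def _flush(self):
        while self._buffer:
            batch = []
            while self._buffer and len(batch) < self._batch_size:
                batch.append(self._buffer.popleft())
            await self._write_batch(batch)
    
    async def _write_batch(self, batch: list[dict]):
        """Write to ClickHouse, Postgres, or observability platform (implement per sink)."""
        raise NotImplementedError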

Alerting Strategies

LLM alerts require different thresholds and logic than traditional application alerts. A 500-error spike is unambiguous. A gradual increase in hallucination rate is not.

Alert Categories

| Alert Type | Condition | Urgency | Example |
|---|---|---|---|
| Cost spike | Hourly spend > 3x rolling average | High | Agent loop running unbounded |
| Latency degradation | p95 TTFT > 5s for 10 min | Medium | Provider capacity issues |
| Hallucination spike | Rolling hallucination rate > 2x baseline | High | Silent model update |
| Error rate | >5% of LLM calls returning errors | High | Rate limiting, auth issues |
| Output drift | Z-score > 2.0 on any tracked metric | Medium | Model behavior change |
| Refusal rate | >2% of requests refused | Medium | Safety filter changes |
| Cache hit drop | Cache hit rate drops >20pp in 1 hour | Low | Prompt template change invalidated cache |

Composite Alerts

Single-metric alerts generate noise. Composite alerts — requiring multiple conditions — are more actionable:

# Alert: Probable silent model update
alert: model_behavior_change
conditions:
  - output_length_z_score > 2.0
  - hallucination_rate_change > 0.05  # 5pp increase
  - time_window: 6h
  - min_sample_size: 500
severity: high
action: page_on_call
message: |
  Probable model behavior change detected for {model}.
  Output length shifted {z_score} std devs from baseline.
  Hallucination rate increased from {baseline_rate} to {current_rate}.

Figure: Alert pipeline: multiple metric streams feed composite rules that route to appropriate channels.
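In code, a composite rule like the one above is a conjunction of simple checks guarded by a sample-size floor. The sketch below mirrors the thresholds from the YAML; the `WindowStats` container and its field names are assumptions for the example, standing in for whatever the metrics store returns for the six-hour window.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Aggregates for one evaluation window (field names are illustrative)."""
    output_length_z_score: float
    baseline_hallucination_rate: float
    current_hallucination_rate: float
    sample_size: int


def model_behavior_change(
    stats: WindowStats,
    z_threshold: float = 2.0,
    rate_increase: float = 0.05,   # 5 percentage points
    min_sample_size: int = 500,
) -> bool:
    """Fire only when output length and hallucination rate shift together."""
    if stats.sample_size < min_sample_size:
        return False  # too few samples for either signal to be trustworthy

    length_shifted = abs(stats.output_length_z_score) > z_threshold
    rate_up = (
        stats.current_hallucination_rate - stats.baseline_hallucination_rate
    ) > rate_increase
    return length_shifted and rate_up
```

Requiring both conditions trades a little detection speed for far fewer pages: either signal alone moves for mundane reasons, but the two moving together in the same window is much harder to explain away.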

Avoiding Alert Fatigue

LLM observability alerts are particularly prone to noise because:

  1. Model behavior is inherently variable: Temperature >0 means outputs vary naturally. Set baselines using statistical distributions, not fixed thresholds.
  2. Provider issues are transient: A 30-second latency spike during provider scaling doesn’t warrant a page. Use sustained-duration conditions (e.g., “p95 latency > 3s for 5+ consecutive minutes”).
  3. Evaluation is probabilistic: A hallucination detector with 80% accuracy is wrong about one judgment in five, and when real hallucinations are rare, most of its flags will be false positives. Require multiple signals before alerting.
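The third point is easy to underestimate, so a quick base-rate calculation helps. The sketch below computes what share of a detector's flags are real hallucinations, given the true hallucination rate and the detector's sensitivity and specificity. Treating "80% accuracy" as 80% sensitivity and 80% specificity is an assumption made for the illustration.

```python
def flag_precision(base_rate: float, sensitivity: float, specificity: float) -> float:
    """Share of flagged responses that are truly hallucinated."""
    true_positives = base_rate * sensitivity
    false_positives = (1.0 - base_rate) * (1.0 - specificity)
    flagged = true_positives + false_positives
    return true_positives / flagged if flagged else 0.0
```

At a 4% true hallucination rate, `flag_precision(0.04, 0.8, 0.8)` is about 0.14: roughly six out of every seven flags are false alarms. That is why a single flagged response should feed a rolling rate or a composite rule rather than trigger an alert on its own.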

Summary

LLM observability requires purpose-built instrumentation because the failure modes — hallucinated outputs, silent model updates, token cost spikes, and semantic drift — are invisible to traditional APM tools.

The core requirements:

  • Structured traces that capture session → trace → span → generation hierarchies, including agent retry loops and tool calls
  • Hallucination monitoring using entailment scoring for breadth and LLM-as-judge for depth, with sampling strategies matched to traffic volume
  • Drift detection against statistical baselines, with re-baselining after intentional changes
  • Token-granular cost tracking that accounts for cache status, retry waste, and evaluation overhead
  • Latency decomposition into TTFT, inter-token latency, queue time, and tool call overhead — not just total request duration

Platform choice depends on constraints: Langfuse for self-hosting and complex agent traces, Helicone for zero-code proxy-based capture, Braintrust for evaluation-centric workflows. Many production deployments use more than one.

Alerting on LLM metrics requires composite conditions and statistical thresholds rather than fixed limits. Single-metric alerts on inherently variable systems produce noise. Multiple converging signals — cost spike plus hallucination increase plus output length shift — produce actionable alerts.

The tooling is maturing rapidly. The OpenTelemetry semantic conventions for generative AI are stabilizing, which will probably make cross-platform instrumentation more portable within the next year. Until then, SDK-based instrumentation with one of the purpose-built platforms remains the most practical approach.
