Office Hours — What's the best approach to tracing and debugging LLM calls in production?

What’s the best approach to tracing and debugging LLM calls in production?

You’re dealing with a black box that costs money, latency-sensitive, and can fail in ways that aren’t obvious in logs. The standard approach—adding print statements and hoping—doesn’t scale past five calls. You need observability built in from the start.

Instrument at the API layer, not the model

The first thing to get right is capturing what’s actually going to the LLM and what comes back. This means wrapping every API call with structured logging before it leaves your process. Use OpenAI’s built-in logging where available, but more importantly, log the request body, response body, tokens used, cost, and latency in a centralized format.

Here’s the pattern:

import json
import time
from datetime import datetime

def call_llm_with_tracing(client, model, messages, **kwargs):
    """Wrapper that logs every LLM call with full context."""
    request_id = str(uuid.uuid4())
    start_time = time.time()
    
    trace_data = {
        "request_id": request_id,
        "timestamp": datetime.utcnow().isoformat(),
        "model": model,
        "messages": messages,
        "kwargs": kwargs,
        "status": None,
        "latency_ms": None,
        "input_tokens": None,
        "output_tokens": None,
        "cost_usd": None,
        "error": None
    }
    
    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            **kwargs
        )
        trace_data["status"] = "success"
        trace_data["output"] = response.choices[0].message.content
        trace_data["input_tokens"] = response.usage.prompt_tokens
        trace_data["output_tokens"] = response.usage.completion_tokens
        # Hardcode your pricing or fetch from a config
        trace_data["cost_usd"] = calculate_cost(model, response.usage)
        
    except Exception as e:
        trace_data["status"] = "error"
        trace_data["error"] = {"type": type(e).__name__, "message": str(e)}
        raise
    finally:
        trace_data["latency_ms"] = (time.time() - start_time) * 1000
        # Send to your observability backend (see below)
        log_trace(trace_data)
    
    return response

The key insight: you’re not logging for debugging later, you’re building a queryable record right now. Every call gets a request ID. Every error gets captured. Every cost gets tracked.

Route traces to a backend, not stdout

Logging to console is useless in production. You need three things:

A time-series database or structured logging service that lets you query traces by request_id, model, error type, latency bucket, or cost range.
Sample at 100% in production until you have traffic stability, then drop to 5-10% for cost-heavy traces.
Retention of at least 30 days so you can debug incidents that happen a week later.

Use what you already have. If you’re on AWS, CloudWatch is fine but expensive for high volume. Datadog, New Relic, and Honeycomb all work. Even Postgres with a JSON column and an index works at moderate scale if you’re bootstrapping. The tool matters less than having something queryable.

A production query looks like:

SELECT * FROM llm_traces 
WHERE status = 'error' 
  AND model = 'gpt-5.5' 
  AND timestamp > now() - interval '6 hours'
ORDER BY latency_ms DESC;

Chain tracing across agent steps

If you’re running agents, a single trace per LLM call isn’t enough. You need a parent trace ID that connects the entire agent run.

import contextvars

trace_context = contextvars.ContextVar('trace_id')

def run_agent_with_tracing(agent_fn, initial_state):
    """Run agent, linking all LLM calls to a single trace."""
    trace_id = str(uuid.uuid4())
    trace_context.set(trace_id)
    
    agent_trace = {
        "trace_id": trace_id,
        "agent_run_start": datetime.utcnow().isoformat(),
        "steps": [],
        "final_state": None,
        "error": None
    }
    
    try:
        result = agent_fn(initial_state)
        agent_trace["final_state"] = result
    except Exception as e:
        agent_trace["error"] = str(e)
        raise
    finally:
        log_trace(agent_trace)
    
    return result

# Inside your agent, every LLM call picks up the trace_id automatically:
def call_llm_with_tracing(...):
    trace_data["parent_trace_id"] = trace_context.get()
    # ... rest of logic

Now you can query all LLM calls from a single agent run, see which step failed, how many tokens were used across the whole execution, and the wall-clock time spent waiting on models.

Add sampling-based debugging for high-volume calls

In production, you can’t inspect every call. But you can inspect representative ones. Set a sampling rule: capture 100% of errors, 1% of slow calls (>3s), and 0.1% of successful calls. This gives you a statistically sound view of production behavior without log explosion.

def should_log_fully(response_time_ms, status, random_value):
    """Determine if this trace should be fully logged."""
    if status == "error":
        return True
    if response_time_ms > 3000:
        return True
    if status == "success" and random_value < 0.001:
        return True
    return False

Correlate traces with user context

Tag every trace with the user ID, request ID from your API, and feature flag state. This lets you debug “Claude is behaving weird for this one customer” without reproducing their exact setup.

trace_data["user_id"] = current_user()
trace_data["request_id"] = get_request_id()
trace_data["feature_flags"] = get_active_flags()

Use trace data for alerting and dashboards

Once traces are flowing, you can measure real production behavior: p95 latency by model, error rate by error type, cost per transaction. Set up a dashboard that shows:

Requests per minute by model
Error rate (%) over the last hour
Average cost per request
p95 latency

Alert if error rate spikes above 5%, latency exceeds your SLO, or daily spend goes 2x your forecast.

Bottom line: Build structured logging into every LLM call wrapper before you ship to production, route traces to a queryable backend with parent/child relationships, and use the data to instrument dashboards and alerts. This is the difference between debugging blindly and actually understanding what’s happening at scale.

Question via Hacker News