Library of the Week — Weights & Biases Weave
A weekly teardown of one open-source AI/ML library: what it does, why it stands out, and when to use it.
Weights & Biases Weave — a lightweight tracing and evaluation framework for LLM applications
GitHub · Language: Python · License: Apache 2.0
What it does
Weave is W&B’s standalone library for tracing, logging, and evaluating LLM pipelines — separate from the broader W&B experiment tracking platform. It solves the observability gap that opens up once you move past a single model call into chains, agents, and retrieval pipelines. It’s aimed at developers who want structured visibility into what their app is actually doing at runtime.
Why it stands out
- Decorator-based tracing with zero restructuring — wrap any function with @weave.op() and get automatic input/output logging, latency tracking, and nested call trees without rewriting your code
- First-class evaluation primitives — weave.Evaluation lets you define a dataset, a scoring function, and a model in ~10 lines, then runs evals and stores results in a queryable UI (see the sketch after this list)
- Model versioning built in — weave.Model subclasses automatically version their configuration alongside their outputs, so you can compare runs where you changed a prompt or temperature
- Lighter than LangSmith for non-LangChain stacks — if you’re not using LangChain, LangSmith feels like overkill; Weave works with any Python code and any model provider
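Here is a minimal sketch of how the evaluation and model-versioning pieces fit together. The QAModel class, the substring_match scorer, and the two-row dataset are illustrative stand-ins, and the name of the scorer parameter that receives the model output (output vs. model_output) has varied across Weave versions, so check the version you install.

import asyncio
import weave

weave.init("my-project")

# A weave.Model subclass: its configuration fields are versioned alongside
# every prediction, so runs with a different prompt or temperature stay comparable.
class QAModel(weave.Model):
    system_prompt: str
    temperature: float

    @weave.op()
    def predict(self, question: str) -> str:
        # Stand-in for a real provider call; a canned answer keeps the sketch runnable.
        return "Retrieval-augmented generation grounds answers in retrieved documents."

# A scorer is just another op; its parameters are matched to dataset columns by name,
# and the model's result arrives as `output` in recent Weave releases.
@weave.op()
def substring_match(expected: str, output: str) -> dict:
    return {"correct": expected.lower() in output.lower()}

dataset = [
    {"question": "What does RAG stand for?", "expected": "retrieval-augmented generation"},
    {"question": "Where do RAG answers come from?", "expected": "retrieved documents"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[substring_match])
model = QAModel(system_prompt="Answer briefly.", temperature=0.0)
asyncio.run(evaluation.evaluate(model))  # per-example scores land in the Weave UI

Swapping in a second QAModel with a different system_prompt and re-running the same Evaluation is how you get side-by-side comparisons of prompt variants in the UI.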
Quick start
import weave
from openai import OpenAI

# Initialize Weave once per process; all traces go to this project.
weave.init("my-project")

client = OpenAI()

# Any function wrapped in @weave.op() is traced: inputs, outputs, latency, and errors.
@weave.op()
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

result = answer_question("What is retrieval-augmented generation?")
Every call is now traced and visible in the Weave UI with full input/output capture.
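The nested call trees mentioned above come from simply nesting ops: a traced function that calls another traced function shows up as a parent/child pair in the trace view. A minimal sketch reusing the quick-start setup; retrieve here is a hypothetical stand-in for a real retrieval step.

import weave
from openai import OpenAI

weave.init("my-project")
client = OpenAI()

@weave.op()
def retrieve(question: str) -> list[str]:
    # Stand-in for a vector-store lookup; any traced function works here.
    return ["RAG combines retrieval from a document store with generation."]

@weave.op()
def answer_with_context(question: str) -> str:
    context = retrieve(question)  # appears as a nested child call in the trace
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

answer_with_context("What is retrieval-augmented generation?")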
When to use it
- You’re building an agent or multi-step pipeline and need to debug which step is producing bad outputs
- You want a lightweight eval harness that stores results persistently without standing up your own database
- Your stack is provider-agnostic (mixing Claude Haiku 4.5, GPT-4.1 Nano, or local Llama 4 calls) and you want unified tracing across all of them
When to skip it
- If your team is already deep in the W&B ecosystem and wants everything in one place, adding Weave separately may be redundant with the tracing and evaluation features you already have configured in the full platform
- For production-scale observability with alerting, SLAs, and team access controls, a dedicated platform like Braintrust or Langfuse will serve you better than Weave’s current feature set
The verdict
Weave hits a sweet spot: it’s genuinely low-friction to add to an existing codebase, and the evaluation primitives are well-designed enough to replace a lot of ad-hoc eval scripts. It’s not a full observability platform, but for individual developers or small teams iterating on LLM apps, the tracing-plus-evals combo in a single pip install weave is hard to beat.