Library of the Week — Weights & Biases Weave
A weekly teardown of one open-source AI/ML library: what it does, why it stands out, and when to use it.
Weights & Biases Weave — a lightweight tracing and evaluation framework for LLM applications
GitHub · Language: Python · License: Apache 2.0
What it does
Weave is W&B’s standalone library for tracing, logging, and evaluating LLM pipelines — separate from the broader W&B experiment tracking platform. It solves the observability gap that opens up once you move past a single model call into chains, agents, and retrieval pipelines. It’s aimed at developers who want structured visibility into what their app is actually doing at runtime.
Why it stands out
- Decorator-based tracing with zero restructuring — wrap any function with @weave.op() and get automatic input/output logging, latency tracking, and nested call trees without rewriting your code
- First-class evaluation primitives — weave.Evaluation lets you define a dataset, a scoring function, and a model in ~10 lines, then runs evals and stores results in a queryable UI (see the sketch after this list)
- Model versioning built in — weave.Model subclasses automatically version their configuration alongside their outputs, so you can compare runs where you changed a prompt or temperature
- Lighter than LangSmith for non-LangChain stacks — if you’re not using LangChain, LangSmith feels like overkill; Weave works with any Python code and any model provider
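Here is a minimal sketch of how the evaluation and model-versioning pieces fit together. The QAModel class, the substring_match scorer, and the two-row dataset are illustrative stand-ins, and the name of the scorer parameter that receives the model output (output vs. model_output) has varied across Weave versions, so check the version you install.

import asyncio
import weave

weave.init("my-project")

# A weave.Model subclass: its configuration fields are versioned alongside
# every prediction, so runs with a different prompt or temperature stay comparable.
class QAModel(weave.Model):
    system_prompt: str
    temperature: float

    @weave.op()
    def predict(self, question: str) -> str:
        # Stand-in for a real provider call; a canned answer keeps the sketch runnable.
        return "Retrieval-augmented generation grounds answers in retrieved documents."

# A scorer is just another op; its parameters are matched to dataset columns by name,
# and the model's result arrives as `output` in recent Weave releases.
@weave.op()
def substring_match(expected: str, output: str) -> dict:
    return {"correct": expected.lower() in output.lower()}

dataset = [
    {"question": "What does RAG stand for?", "expected": "retrieval-augmented generation"},
    {"question": "Where do RAG answers come from?", "expected": "retrieved documents"},
]

evaluation = weave.Evaluation(dataset=dataset, scorers=[substring_match])
model = QAModel(system_prompt="Answer briefly.", temperature=0.0)
asyncio.run(evaluation.evaluate(model))  # per-example scores land in the Weave UI

Swapping in a second QAModel with a different system_prompt and re-running the same Evaluation is how you get side-by-side comparisons of prompt variants in the UI.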
Quick start
import weave
from openai import OpenAI

# Initialize Weave once per process; all traces go to this project.
weave.init("my-project")

client = OpenAI()

# Any function wrapped in @weave.op() is traced: inputs, outputs, latency, and errors.
@weave.op()
def answer_question(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

result = answer_question("What is retrieval-augmented generation?")
Every call is now traced and visible in the Weave UI with full input/output capture.
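The nested call trees mentioned above come from simply nesting ops: a traced function that calls another traced function shows up as a parent/child pair in the trace view. A minimal sketch reusing the quick-start setup; retrieve here is a hypothetical stand-in for a real retrieval step.

import weave
from openai import OpenAI

weave.init("my-project")
client = OpenAI()

@weave.op()
def retrieve(question: str) -> list[str]:
    # Stand-in for a vector-store lookup; any traced function works here.
    return ["RAG combines retrieval from a document store with generation."]

@weave.op()
def answer_with_context(question: str) -> str:
    context = retrieve(question)  # appears as a nested child call in the trace
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[{"role": "user", "content": f"{context}\n\n{question}"}],
    )
    return response.choices[0].message.content

answer_with_context("What is retrieval-augmented generation?")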
When to use it
- You’re building an agent or multi-step pipeline and need to debug which step is producing bad outputs
- You want a lightweight eval harness that stores results persistently without standing up your own database
- Your stack is provider-agnostic (mixing Claude Haiku 4.5, GPT-4.1 Nano, or local Llama 4 calls) and you want unified tracing across all of them
When to skip it
- If your team is already deep in the W&B ecosystem and wants everything in one place, adding Weave separately may be redundant with the tracing and evaluation features you already have configured in the full platform
- For production-scale observability with alerting, SLAs, and team access controls, a dedicated platform like Braintrust or Langfuse will serve you better than Weave’s current feature set
The verdict
Weave hits a sweet spot: it’s genuinely low-friction to add to an existing codebase, and the evaluation primitives are well-designed enough to replace a lot of ad-hoc eval scripts. It’s not a full observability platform, but for individual developers or small teams iterating on LLM apps, the tracing-plus-evals combo in a single pip install weave is hard to beat.