Library of the Week — DSPy
DSPy is Stanford's framework for programming language models with typed signatures and optimizers that compile prompts against a metric.
DSPy — a framework for programming, not prompting, language models
GitHub · Language: Python · License: MIT
What it does
DSPy lets you write LM-powered programs as composable Python modules with typed input/output signatures. It then compiles those programs against a metric, searching over prompts, few-shot examples, and (optionally) model weights to optimize for your task. The shift is treating prompts as a learned parameter instead of a hand-tuned string.
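To make "compiling against a metric" concrete, here is a toy sketch of the idea in plain Python. It is not DSPy's search algorithm: `TRAINSET`, `CANDIDATES`, `fake_lm`, and `compile_prompt` are hypothetical names invented for illustration, and the "model" is a stub that only answers tersely when the instruction asks for it. The point is only that the instruction is chosen by scoring it on data rather than by hand.

```python
# Toy illustration of metric-driven prompt selection (not DSPy's optimizer).

# Hypothetical training set: (question, expected answer) pairs.
TRAINSET = [("2 + 2", "4"), ("3 + 5", "8"), ("10 - 7", "3")]
ANSWERS = dict(TRAINSET)

# Hypothetical candidate instructions the search chooses between.
CANDIDATES = [
    "Answer the question.",
    "Reply with only the final number.",
]


def fake_lm(instruction: str, question: str) -> str:
    """Stand-in for a model call: it is terse only when told to be."""
    answer = ANSWERS[question]
    if "only" in instruction:
        return answer
    return f"The answer is {answer}."


def exact_match(expected: str, predicted: str) -> float:
    """Metric: 1.0 when the strings match after trimming whitespace."""
    return float(expected.strip() == predicted.strip())


def score(instruction: str) -> float:
    """Average metric of one instruction over the whole training set."""
    results = [exact_match(gold, fake_lm(instruction, q)) for q, gold in TRAINSET]
    return sum(results) / len(results)


def compile_prompt(candidates: list[str]) -> str:
    """Pick the candidate instruction with the highest average score."""
    return max(candidates, key=score)


best = compile_prompt(CANDIDATES)
print(best, score(best))
```

Real optimizers search a far larger space (generated instructions, bootstrapped few-shot examples), but the contract is the same: a program, a training set, and a metric go in, and a better-scoring program comes out.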
Why it stands out
- Signatures replace prompt strings. Declare "question -> answer" (or a typed dspy.Signature class) and DSPy generates the underlying prompt, instead of you babysitting templates.
- Modules are real building blocks. dspy.Predict, dspy.ChainOfThought, dspy.ReAct, and dspy.ProgramOfThought compose like PyTorch layers; swap one out and the rest of the pipeline keeps working.
- Optimizers are the headline feature. MIPROv2, BootstrapFewShot, and GEPA take a training set plus a metric and search prompt space for you. Compiled programs often beat hand-written prompts by a wide margin on the metrics they’re optimized against.
- Model-agnostic. The same program runs on any frontier provider via dspy.LM(...), or on a local model via vLLM or Ollama, with no rewrite.
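As a rough picture of what "a signature generates the prompt" means, the sketch below parses the string form of a signature and renders a prompt from it. This is an assumption-heavy toy: `parse_signature` and `render_prompt` are hypothetical helpers, and the prompt layout is invented here; DSPy's own adapters produce a different, richer format.

```python
# Toy sketch: turn a "inputs -> outputs" signature string into a prompt.
# The helper names and the prompt layout are illustrative, not DSPy's.


def parse_signature(spec: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into lists of input and output field names."""
    inputs, arrow, outputs = spec.partition("->")
    if not arrow:
        raise ValueError(f"missing '->' in signature: {spec!r}")

    def names(part: str) -> list[str]:
        return [name.strip() for name in part.split(",") if name.strip()]

    return names(inputs), names(outputs)


def render_prompt(spec: str, **values: str) -> str:
    """Build a plain-text prompt that asks for each output field."""
    inputs, outputs = parse_signature(spec)
    missing = [name for name in inputs if name not in values]
    if missing:
        raise ValueError(f"missing input values: {missing}")
    lines = [f"Given {', '.join(inputs)}, produce {', '.join(outputs)}.", ""]
    lines += [f"{name.capitalize()}: {values[name]}" for name in inputs]
    lines += [f"{name.capitalize()}:" for name in outputs]
    return "\n".join(lines)


print(render_prompt("question -> answer", question="Why compile prompts?"))
```

Because the field names live in one declaration, changing the task means editing the signature and leaving the template logic alone.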
Quick start
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4.1-nano"))

class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(QA)
print(program(question="Why does DSPy compile prompts?").answer)

# Later: compile against a metric to optimize prompt + few-shots automatically
# optimizer = dspy.MIPROv2(metric=exact_match)
# optimized = optimizer.compile(program, trainset=examples)
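The commented-out lines refer to `exact_match` and `examples` without defining them. A minimal sketch of what they might look like follows. DSPy metrics are plain callables that receive an example, a prediction, and an optional trace; to keep this runnable without an LM or the library installed, `SimpleNamespace` objects stand in for `dspy.Example` and the module's prediction, and the sample data is made up.

```python
from types import SimpleNamespace


def exact_match(example, pred, trace=None) -> bool:
    """True when the predicted answer equals the gold answer,
    ignoring case and surrounding whitespace."""
    return example.answer.strip().lower() == pred.answer.strip().lower()


# Stand-ins for dspy.Example objects (hypothetical data).
examples = [
    SimpleNamespace(question="What is the capital of France?", answer="Paris"),
    SimpleNamespace(question="What is 2 + 2?", answer="4"),
]

# Stand-in for what a program would return for the first example.
pred = SimpleNamespace(answer=" paris ")
print(exact_match(examples[0], pred))
```

With the real library, each training item would instead be built as `dspy.Example(question=..., answer=...).with_inputs("question")` so the optimizer knows which fields are inputs.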
When to use it
- You have a task with a measurable metric (exact match, F1, judge score) and want to stop guessing at prompt wording. The optimizer searches for you.
- You’re building a multi-step pipeline (retrieval, rerank, reason, answer) and want to swap models or modules without rewriting glue code.
- You need the same program to run reliably across providers; switching gpt-4.1-nano for claude-haiku-4-5 should be a one-line config change.
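For free-form answers, exact match is often too strict, and token-level F1 is a common softer alternative. The sketch below is one conventional way to compute it in plain Python; `token_f1` is a hypothetical helper name, and the whitespace tokenization and lowercasing are simplifying assumptions.

```python
from collections import Counter


def token_f1(expected: str, predicted: str) -> float:
    """Token-level F1 between a gold answer and a predicted answer."""
    gold = expected.lower().split()
    guess = predicted.lower().split()
    if not gold or not guess:
        # Both empty counts as a match; one empty counts as a miss.
        return float(gold == guess)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


print(token_f1("the eiffel tower", "eiffel tower"))
```

A function like this can be wrapped in the `(example, pred, trace=None)` shape that DSPy metrics use, returning the score for the relevant output field.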
When to skip it
- You’re doing one-shot, untyped LM calls where a hand-crafted prompt is genuinely shorter than the equivalent signature plus module.
- You don’t have an evaluation set or a metric. Optimizers need ground truth to compile against; without it you’re back to vibes-based prompting.
- Your bottleneck is latency on a single call. DSPy programs built on modules such as dspy.ChainOfThought generate a reasoning step before the answer, which is the right trade-off for quality but the wrong one for sub-100ms paths.
The verdict
DSPy is the most credible answer to “what comes after prompt engineering.” Moving from writing strings to writing typed programs and optimizing them against a metric is the same shift that took deep learning from hand-crafted features to learned representations. It is not the right tool for every script, but if you have an LM workflow with a metric, the compiled version is almost always better than the version you would have written by hand.