Library of the Week — DSPy — Stochastic Sandbox

DSPy — a framework for programming, not prompting, language models

GitHub · Language: Python · License: MIT

What it does

DSPy lets you write LM-powered programs as composable Python modules with typed input/output signatures. It then compiles those programs against a metric, searching over prompts, few-shot examples, and (optionally) model weights to optimize for your task. The shift is treating prompts as a learned parameter instead of a hand-tuned string.
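To make "compiling against a metric" concrete, here is a toy sketch in plain Python. It is not DSPy's algorithm or API: fake_lm, CANDIDATE_INSTRUCTIONS, and compile_program are hypothetical names, and the "model" is a stub so the example runs offline. It only shows the shape of the idea: try candidate prompts, score each one on a small training set with a metric, and keep the best.

```python
# Hypothetical training set: (question, expected answer) pairs.
TRAINSET = [("2+2", "4"), ("3+5", "8"), ("10-7", "3")]
ANSWERS = dict(TRAINSET)

# Hypothetical candidate instructions the search chooses between.
CANDIDATE_INSTRUCTIONS = [
    "Answer the question.",
    "Answer with the number only.",
]


def fake_lm(instruction: str, question: str) -> str:
    """Stub standing in for a real model call (an assumption, not a DSPy API).

    It pretends the model is verbose unless told to return only a number.
    """
    result = ANSWERS[question]
    if "number only" in instruction:
        return result
    return f"The answer is {result}."


def exact_match(expected: str, predicted: str) -> bool:
    """Metric: the prediction must equal the expected answer exactly."""
    return expected.strip() == predicted.strip()


def score(instruction: str, lm, trainset) -> float:
    """Fraction of training examples the instruction gets right."""
    hits = sum(exact_match(gold, lm(instruction, q)) for q, gold in trainset)
    return hits / len(trainset)


def compile_program(candidates, lm, trainset) -> tuple[str, float]:
    """Pick the candidate instruction with the highest metric score."""
    scores = {c: score(c, lm, trainset) for c in candidates}
    best = max(scores, key=scores.get)
    return best, scores[best]


best_instruction, best_score = compile_program(
    CANDIDATE_INSTRUCTIONS, fake_lm, TRAINSET
)
print(best_instruction, best_score)
```

Real optimizers search a far larger space (instructions plus few-shot examples) and propose candidates automatically, but the contract is the same: a program, a training set, and a metric go in, and a better-scoring program comes out.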

Why it stands out

Signatures replace prompt strings. Declare "question -> answer" (or a typed dspy.Signature class) and DSPy generates the underlying prompt, instead of you babysitting templates.
Modules are real building blocks. dspy.Predict, dspy.ChainOfThought, dspy.ReAct, and dspy.ProgramOfThought compose like PyTorch layers; swap one out and the rest of the pipeline keeps working.
Optimizers are the headline feature. MIPROv2, BootstrapFewShot, and GEPA take a training set plus a metric and search prompt space for you. Compiled programs can substantially outperform hand-written prompts on the metrics they’re optimized against, though the size of the gain depends on the task, the model, and the quality of the training set.
Model-agnostic. The same program runs on any frontier provider via dspy.LM(...), or on a local model via vLLM or Ollama, with no rewrite.
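The signature idea can be illustrated without DSPy at all. The sketch below is a hypothetical, much-simplified stand-in: parse_signature and render_prompt are invented names, and DSPy's real prompt format differs. It parses an inline signature string such as "question, context -> answer" into input and output field names, then renders a prompt from them, which is roughly the work a signature saves you from doing by hand.

```python
def parse_signature(sig: str) -> tuple[list[str], list[str]]:
    """Split 'a, b -> c' into (['a', 'b'], ['c'])."""
    inputs, arrow, outputs = sig.partition("->")
    if not arrow:
        raise ValueError(f"signature needs '->': {sig!r}")

    def names(part: str) -> list[str]:
        return [name.strip() for name in part.split(",") if name.strip()]

    return names(inputs), names(outputs)


def render_prompt(sig: str, **values: str) -> str:
    """Build a prompt from the signature's fields (illustrative format only)."""
    inputs, outputs = parse_signature(sig)
    missing = [name for name in inputs if name not in values]
    if missing:
        raise ValueError(f"missing inputs: {missing}")
    lines = [
        f"Given the fields {', '.join(inputs)}, "
        f"produce the fields {', '.join(outputs)}.",
        "",
    ]
    # Filled-in input fields, then empty output fields for the model to complete.
    lines += [f"{name.capitalize()}: {values[name]}" for name in inputs]
    lines += [f"{name.capitalize()}:" for name in outputs]
    return "\n".join(lines)


prompt = render_prompt(
    "question -> answer", question="Why does DSPy compile prompts?"
)
print(prompt)
```

Because the prompt is derived from the field names, changing the signature changes the prompt everywhere it is used, with no template strings to keep in sync.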

Quick start

import dspy

dspy.configure(lm=dspy.LM("openai/gpt-4.1-nano"))

class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

program = dspy.ChainOfThought(QA)
print(program(question="Why does DSPy compile prompts?").answer)

# Later: compile against a metric to optimize prompt + few-shots automatically.
# exact_match (a metric function) and examples (a list of dspy.Example) are yours to define.
# optimizer = dspy.MIPROv2(metric=exact_match)
# optimized = optimizer.compile(program, trainset=examples)

When to use it

You have a task with a measurable metric (exact match, F1, judge score) and want to stop guessing at prompt wording. The optimizer searches for you.
You’re building a multi-step pipeline (retrieval, rerank, reason, answer) and want to swap models or modules without rewriting glue code.
You need the same program to run reliably across providers; changing the model string passed to dspy.LM, say from openai/gpt-4.1-nano to an Anthropic model such as claude-haiku-4-5, should be a one-line config change.
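A metric is an ordinary Python function. The sketch below assumes the common DSPy convention of a metric taking (example, pred, trace=None) and reading fields as attributes; the answer field matches the QA signature in the quick start. The normalize helper and the use of SimpleNamespace as a stand-in for DSPy's example and prediction objects are illustrative assumptions, so the block runs without a model.

```python
from collections import Counter
from types import SimpleNamespace

PUNCTUATION = ".,!?;:\"'"


def normalize(text: str) -> list[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = (tok.strip(PUNCTUATION) for tok in text.lower().split())
    return [tok for tok in tokens if tok]


def exact_match(example, pred, trace=None) -> bool:
    """True when gold and predicted answers match after normalization."""
    return normalize(example.answer) == normalize(pred.answer)


def token_f1(example, pred, trace=None) -> float:
    """Token-level F1 between gold and predicted answers."""
    gold, guess = normalize(example.answer), normalize(pred.answer)
    if not gold or not guess:
        # Both empty counts as a match; one empty counts as a miss.
        return float(gold == guess)
    overlap = sum((Counter(gold) & Counter(guess)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(guess)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


# Stand-ins for a labeled example and a model prediction (assumption).
gold = SimpleNamespace(answer="Paris, France")
pred = SimpleNamespace(answer="paris")
print(exact_match(gold, pred), token_f1(gold, pred))
```

Exact match is strict and easy to reason about; token F1 gives partial credit, which tends to give an optimizer a smoother signal when answers are free-form.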

When to skip it

You’re making one-off, untyped LM calls where a hand-crafted prompt is genuinely shorter than the equivalent signature plus module.
You don’t have an evaluation set or a metric. Optimizers need ground truth to compile against; without it you’re back to vibes-based prompting.
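Your bottleneck is latency on a single call. Modules such as dspy.ChainOfThought generate reasoning before the answer, and compiled prompts typically carry extra few-shot examples. Both add tokens, which is the right trade-off for quality but the wrong one for sub-100ms paths.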
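Whether a given program fits a latency budget is easy to check empirically. The helper below is a generic timing sketch, not part of DSPy: measure_latency_ms is a hypothetical name, and slow_stub stands in for a call to your program. It reports median and worst-case wall-clock latency so you can compare against a budget such as the 100 ms mentioned above.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(call: Callable[[], object], runs: int = 20) -> dict[str, float]:
    """Time repeated calls and report median and max latency in milliseconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(samples), "max_ms": max(samples)}


def slow_stub() -> str:
    """Placeholder for a real program call (assumption): sleeps briefly."""
    time.sleep(0.005)
    return "answer"


BUDGET_MS = 100.0  # the budget from the text; adjust to your own path
stats = measure_latency_ms(slow_stub, runs=5)
print(stats, "within budget:", stats["median_ms"] < BUDGET_MS)
```

Swap slow_stub for a lambda that calls your program, and run it against both the plain and the compiled version before deciding.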

The verdict

DSPy is the most credible answer to “what comes after prompt engineering.” Moving from writing strings to writing typed programs and optimizing them against a metric mirrors the shift that took deep learning from hand-crafted features to learned representations. It is not the right tool for every script, but if you have an LM workflow with a metric and a representative evaluation set, the compiled version will usually score better on that metric than the version you would have written by hand.