Builders Spotlight — Promptfoo

Promptfoo

Ian Webster (formerly leading LLM engineering at Discord) and Michael D’Angelo (formerly Head of AI at Smile Identity) built Promptfoo as a CLI for systematic testing and red-teaming of LLM prompts — a way to catch regressions and adversarial failures before they reach production. On March 9, 2026, Promptfoo announced it was joining OpenAI. The open-source project continues under the MIT license, and the team’s stated focus inside OpenAI is integrating evaluation and red-teaming “directly into the model and infrastructure layers” rather than leaving them as external tools.

The problem it set out to solve

LLM applications are notoriously brittle — a small tweak to a prompt can silently break behavior on edge cases you never thought to test. Teams were shipping prompts to production with no visibility into how they’d perform across different inputs, and when failures happened, they had no structured way to debug or compare alternatives. The process was more art than engineering.

The key insight

Prompts should be tested like code. Promptfoo treats each prompt as a testable artifact: you define test cases and expected outputs (or output criteria), then run them systematically across different models and prompt variations. That makes regression detection and comparative analysis tractable: you can A/B test prompt versions, see which one wins across your test suite, and catch breakage before deployment. The concept is straightforward, but it changes how teams approach prompt work, moving it from trial-and-error iteration to measurable optimization.

How it works (in plain terms)

You create a YAML or JSON config that defines your prompts, test cases, and assertions. Each test case includes an input and either an expected output or a grading function (pass/fail criteria). Promptfoo runs your prompts against one or more LLM providers, collects outputs, grades them, and generates a matrix showing which prompt-model combination performs best. The tool supports multiple grading strategies: exact match, similarity thresholds, LLM-as-judge scoring, or custom code. You can compare different prompt versions side-by-side and spot regressions immediately. The trade-off is upfront effort — you have to think through test cases — but you get reproducibility and measurable progress.
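
The four grading strategies map onto assertion types attached to a test case. The fragment below is a sketch rather than a recommended setup: the reference sentence, the rubric wording, and the 0.8 threshold are illustrative assumptions, and a real suite would usually pick one or two of these checks per test rather than stacking all four.

```yaml
tests:
  - vars:
      text: "This product is amazing!"
    assert:
      # Exact match: the output must equal this string exactly
      - type: equals
        value: "positive"
      # Similarity threshold: embedding similarity to a reference answer
      - type: similar
        value: "The review expresses a positive opinion"
        threshold: 0.8
      # LLM-as-judge: a grader model scores the output against a rubric
      - type: llm-rubric
        value: "Labels the review as positive and does not hedge"
      # Custom code: a JavaScript expression evaluated against `output`
      - type: javascript
        value: "output.toLowerCase().includes('positive')"
```

The similar and llm-rubric checks call an embedding model and a grader model respectively, so they add cost and some nondeterminism. The equals and javascript checks run locally against the returned text.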

What it looks like in practice

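# Two prompt variants to compare against each other
prompts:
  - "Classify this review as positive or negative: {{text}}"
  - "You are a sentiment expert. Rate: {{text}}\nResponse: [positive|negative]"

# Each test supplies template variables plus assertions that grade the output
tests:
  - vars:
      text: "This product is amazing!"
    assert:
      # Case-insensitive substring check on the model output
      - type: icontains
        value: "positive"
  - vars:
      text: "Complete waste of money."
    assert:
      - type: icontains
        value: "negative"

# Every prompt runs against every provider listed here
providers:
  - id: openai:gpt-5.5
  - id: openai:gpt-4.1-nano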

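Run it with promptfoo eval (or npx promptfoo@latest eval if the CLI is not installed), and you get a comparison table showing the pass rate for each prompt-model combination. Running promptfoo view opens the same results in a browser.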

Why it matters

Catches silent failures: Regressions in prompt behavior are detected before they hit users, not after incident reports arrive.
Enables data-driven prompt development: Teams can objectively measure which prompt works best instead of guessing, reducing iteration cycles and building institutional knowledge.
Makes hallucination risk measurable: Defining explicit test cases and grading criteria forces you to think about failure modes upfront and lets you measure how often they occur.
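
Catching regressions before they reach users usually means running the suite on every change. The GitHub Actions workflow below is one hedged way to do that. The workflow file name, the prompts/ directory, and the secret name are hypothetical, and it assumes the eval command exits non-zero when an assertion fails, which is worth confirming for the version you install.

```yaml
# .github/workflows/prompt-eval.yml (hypothetical file name)
name: prompt-eval

on:
  pull_request:
    paths:
      - "promptfooconfig.yaml"
      - "prompts/**" # hypothetical directory holding prompt files

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 22
      # Runs every prompt x provider x test combination in the config;
      # a failing assertion is assumed to fail this step and block the merge
      - run: npx promptfoo@latest eval -c promptfooconfig.yaml
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
```

Scoping the trigger to prompt and config paths keeps model API spend tied to changes that can actually alter prompt behavior.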

The OpenAI chapter

The acquisition is interesting on two axes. The validation read: OpenAI buying a prompt-testing and red-teaming tool says that systematic evaluation has graduated from “nice to have” to “the kind of capability frontier labs want owned in-house.” That tracks with the broader industry shift toward eval-driven development.

The open question: Promptfoo’s whole pitch was provider-agnosticism. Your eval suite shouldn’t care whether you’re running OpenAI, Anthropic, Google, or a local model. The founders explicitly committed to keeping that multi-provider stance post-acquisition, and the public roadmap continues to ship support for non-OpenAI models. Still, it’s a tension worth naming: the maintainer of a cross-provider testing tool now reports to one of those providers. For now, the things to watch are structural rather than statements about intent: the open-source license, the multi-provider commitments, and the project’s existing community ownership.

If you’re standardizing on Promptfoo, the practical upside is more resources and tighter coupling with OpenAI’s own eval tooling. The practical thing to track is whether feature velocity stays balanced across providers, or whether the OpenAI integration starts to outpace the others in ways that matter for your stack.

Where to go next

Promptfoo GitHub — the CLI and core testing engine
Promptfoo Docs — comprehensive guides on test design and grading strategies