Structured Output from LLMs

The post you bookmark. One topic, covered end to end.

Structured output from LLMs: JSON mode, function calling, constrained decoding, and grammar-based generation compared across every major provider with implementation patterns.

Every non-trivial LLM application needs structured output. Chat interfaces can tolerate freeform text; everything else — API backends, data pipelines, agent tool calls, form extraction, code generation — requires the model to produce output that parses cleanly into a known schema. The gap between “the model usually returns valid JSON” and “the model always returns valid JSON” is where production systems live or die.

The ecosystem has converged on several distinct approaches to this problem, each with different reliability guarantees, latency characteristics, and provider support. Some enforce structure at the decoding level (making invalid output literally impossible), while others rely on instruction-following and post-hoc validation. The differences matter more than most teams realize until they’re debugging a 2 AM incident caused by a trailing comma.

The Core Problem

LLMs generate tokens autoregressively — each token is sampled from a probability distribution conditioned on all previous tokens. Nothing in this process inherently respects JSON syntax, schema constraints, or type systems. A model trained on millions of JSON examples will usually produce valid JSON, but “usually” is a function of schema complexity, prompt length, and model capability.

[Diagram: The baseline pipeline. The prompt describes the schema, the model generates text, and the application tries to parse it. Failures happen at the parse step.]

Empirically, GPT-5.4 produces valid JSON from prompt-only instructions roughly 95-98% of the time for simple schemas (3-5 fields, no nesting). For complex schemas with nested arrays, enums, and optional fields, that drops to 85-92%. Claude Opus 4.7 and Gemini 3.1 Pro show similar patterns. A 5% failure rate at 1,000 requests/hour means ~50 failures per hour. That’s not acceptable for production.
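
As code, the baseline pipeline looks something like this minimal sketch (the prompt and schema are illustrative; the fragile step is the json.loads call):

import json

from openai import OpenAI

client = OpenAI()

def extract_naive(text: str) -> dict | None:
    """Baseline pipeline: describe the schema in the prompt, then hope the reply parses."""
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": 'Return ONLY valid JSON: {"name": string, "age": integer}'},
            {"role": "user", "content": text},
        ],
    )
    try:
        return json.loads(response.choices[0].message.content or "")
    except json.JSONDecodeError:
        return None  # the few-percent failure path discussed above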

Approach 1: Prompt Engineering Alone

The simplest approach: describe the desired JSON structure in the system prompt or user message, include an example, and hope for the best.

system_prompt = """Extract the following fields from the user's message.
Return ONLY valid JSON matching this schema:
{
  "name": string,
  "age": integer,
  "interests": string[],
  "employment": {"company": string, "role": string} | null
}
No markdown fences. No explanation. Just JSON."""

This works surprisingly often with current frontier models. The failure modes:

  • Markdown wrapping: Models frequently wrap JSON in ```json ``` fences despite explicit instructions not to. Stripping these is trivial but annoying.
  • Preamble text: “Here is the JSON:” followed by the actual JSON. A regex or find-first-brace heuristic handles most cases (see the sketch after this list).
  • Schema drift: Optional fields omitted entirely (not set to null), extra fields added, enum values invented.
  • Type coercion: Ages returned as "25" instead of 25. Booleans as "true".
  • Truncation: Long outputs hit token limits mid-JSON, producing invalid syntax.
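
The first two failure modes are mechanical enough to patch with a lenient extraction step before parsing. A minimal sketch (heuristics only; it does nothing for schema drift, type coercion, or truncation):

import json
import re

def extract_json(raw: str) -> dict:
    """Best-effort recovery of a JSON object from freeform model output."""
    # Strip markdown fences, e.g. ```json ... ```
    fenced = re.search(r"```(?:json)?\s*(.*?)```", raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Find-first-brace heuristic: drop preamble like "Here is the JSON:"
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(raw[start : end + 1])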

Prompt-only is appropriate for prototyping, internal tools with human oversight, and cases where the schema is simple enough that retries handle the rare failures.

Approach 2: JSON Mode

JSON mode is a provider-level feature that constrains the model’s output to be syntactically valid JSON. It does not enforce a specific schema — it only guarantees that the output parses as JSON.

[Diagram: JSON mode guarantees syntactic validity but not schema conformance. The output will parse, but fields may be missing or wrong-typed.]

How it works internally

The implementation varies by provider, but the general mechanism biases or masks token logits during sampling to prevent the model from generating tokens that would create invalid JSON. At each decoding step, the sampler tracks the current parse state (are we inside a string? after a colon? expecting a comma?) and zeros out probabilities for tokens that would violate JSON grammar.

This means the model physically cannot produce {"name": "Alice",} (trailing comma) or {"age": twenty} (unquoted value). It also means the model physically cannot produce non-JSON output like “Sure, here’s the data:” before the JSON.

Provider support

| Provider | Parameter | Notes |
|---|---|---|
| OpenAI | response_format: {"type": "json_object"} | Available since late 2023. Must mention “JSON” in the prompt. |
| Anthropic | Not a separate mode; uses tool_use or prompt | Claude doesn’t have a standalone JSON mode toggle. |
| Google | response_mime_type: "application/json" | Gemini 3.1 Pro and Flash support this. |
| Open models (vLLM) | guided_json parameter | Schema-level, not just syntax-level. |

The OpenAI requirement to mention “JSON” in the prompt is a footgun. If the word “JSON” doesn’t appear in the system or user message, the API returns an error. This is presumably a guardrail against accidentally enabling JSON mode for conversational use cases.
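
Minimal usage, with the word “JSON” present in the system message to satisfy that guardrail:

import json

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5.4",
    # "JSON" must appear somewhere in the messages, or the API rejects the request
    messages=[
        {"role": "system", "content": "Reply with a JSON object containing name and age."},
        {"role": "user", "content": "Alice is 29."},
    ],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)  # guaranteed to parse
# ...but not guaranteed to contain "name" or "age"; validate separately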

Limitations

JSON mode solves syntax but not semantics. The model can return {}, {"completely_wrong_field": 42}, or a valid JSON array when an object was expected. For anything beyond trivial schemas, JSON mode alone is insufficient.

Approach 3: Function Calling / Tool Use

Function calling (OpenAI’s term) or tool use (Anthropic’s term) extends JSON mode with schema enforcement. The application defines one or more “functions” with JSON Schema parameter definitions, and the model generates arguments that conform to those schemas.

[Diagram: Function calling. The model produces structured tool calls with arguments matching a declared JSON Schema; validation can happen client-side or server-side.]

OpenAI implementation

import json

from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "extract_person",
        "description": "Extract person info from text",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer", "minimum": 0, "maximum": 150},
                "interests": {
                    "type": "array",
                    "items": {"type": "string"}
                },
                "employment": {
                    "type": ["object", "null"],
                    "properties": {
                        "company": {"type": "string"},
                        "role": {"type": "string"}
                    },
                    "required": ["company", "role"]
                }
            },
            "required": ["name", "age", "interests"]
        }
    }
}]

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Alice is 29, works at Stripe as a PM, likes hiking and chess."}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "extract_person"}}
)

args = json.loads(response.choices[0].message.tool_calls[0].function.arguments)
# {"name": "Alice", "age": 29, "interests": ["hiking", "chess"], "employment": {"company": "Stripe", "role": "PM"}}

Setting tool_choice to a specific function name forces the model to call that function, effectively turning function calling into a structured extraction mechanism. Without this, the model might decide the function isn’t relevant and return a text response instead.

Anthropic implementation

import anthropic
client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4.7",
    max_tokens=1024,
    tools=[{
        "name": "extract_person",
        "description": "Extract person info from text",
        "input_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "interests": {
                    "type": "array",
                    "items": {"type": "string"}
                }
            },
            "required": ["name", "age", "interests"]
        }
    }],
    tool_choice={"type": "tool", "name": "extract_person"},
    messages=[{"role": "user", "content": "Alice is 29, likes hiking and chess."}]
)

# response.content[0].input contains the structured data

Anthropic’s tool_use is the canonical way to get structured output from Claude. The tool_choice parameter with type: "tool" and a specific name forces the model to use that tool, analogous to OpenAI’s function-specific tool_choice.

OpenAI Structured Outputs

OpenAI introduced a stricter variant called Structured Outputs, where strict: true in the function definition enables server-side constrained decoding against the JSON Schema. This guarantees schema conformance, not just syntactic validity.

tools = [{
    "type": "function",
    "function": {
        "name": "extract_person",
        "strict": True,  # enables constrained decoding
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"}
            },
            "required": ["name", "age"],
            "additionalProperties": False  # required for strict mode
        }
    }
}]

With strict: True, OpenAI compiles the JSON Schema into a context-free grammar and uses it to mask logits during decoding. The schema must satisfy certain constraints: additionalProperties: false at every object level, all fields listed in required, and a supported subset of JSON Schema (no $ref cycles, limited oneOf support).

The same mechanism is available without function calling via response_format:

response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[...],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "age": {"type": "integer"}
                },
                "required": ["name", "age"],
                "additionalProperties": False
            }
        }
    }
)

This is arguably the cleanest structured output interface available from any provider as of May 2026.

[Diagram: OpenAI’s Structured Outputs. The schema is compiled into a grammar that constrains decoding at the token level; schema conformance is guaranteed server-side.]

Approach 4: Schema-Constrained Decoding

Schema-constrained decoding is the underlying technique behind OpenAI’s Structured Outputs and similar features in open-source serving frameworks. The idea: convert a JSON Schema into a finite state machine (FSM) or context-free grammar (CFG), then use that automaton to mask invalid tokens at each decoding step.

The mechanism

  1. Schema → Grammar: The JSON Schema is translated into a formal grammar. {"type": "object", "properties": {"age": {"type": "integer"}}} becomes rules like: root → '{' ws '"age"' ws ':' ws integer ws '}', where integer → [0-9]+.

  2. Grammar → FSM/PDA: The grammar is compiled into a pushdown automaton (for context-free grammars) or approximated as a finite state machine (for regular subsets). Each state represents a position in the grammar.

  3. Token masking: At each decoding step, the current FSM state determines which tokens are valid continuations. Invalid tokens get their logits set to negative infinity before softmax. The model can only sample from tokens that keep the output on a valid path through the grammar (a toy sketch follows this list).

  4. Guaranteed termination: The grammar includes an end state, and the decoding process is constrained to eventually reach it (typically by biasing toward closure tokens as the max token limit approaches).
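
A toy illustration of the masking step, with a hand-written state table standing in for a compiled FSM (real implementations precompute transitions over the tokenizer's full vocabulary):

import math

VOCAB = ["{", "}", '"age"', ":", "0", "1", " "]

# Hypothetical per-state whitelists standing in for compiled FSM transitions.
ALLOWED = {
    "start":      {"{"},
    "need_key":   {'"age"'},
    "need_colon": {":"},
    "need_digit": {"0", "1"},
    "need_close": {"0", "1", "}"},  # more digits, or close the object
}

def mask_logits(logits: list[float], state: str) -> list[float]:
    """Set disallowed tokens to -inf so softmax assigns them zero probability."""
    allowed = ALLOWED[state]
    return [l if tok in allowed else -math.inf for tok, l in zip(VOCAB, logits)]

raw_logits = [0.1, 2.5, 0.3, 0.2, 1.0, 0.9, 0.0]
# The model "prefers" "}" (logit 2.5), but in state need_digit only digits
# survive the mask, so sampling stays on a grammar-valid path.
print(mask_logits(raw_logits, "need_digit"))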

[Diagram: At each decoding step, the FSM state determines valid next tokens. Raw logits are masked before sampling, making schema violations impossible.]

Performance characteristics

The FSM compilation is a one-time cost per schema. For simple schemas, this takes 10-50ms. Complex schemas with deep nesting or large enums can take 200-500ms. Most implementations cache the compiled FSM, so the cost is amortized across requests.

Per-token overhead is minimal — computing the valid token set from the current FSM state is O(vocabulary_size) in the worst case but typically much faster with precomputed transition tables. The real cost is that constraining the token space can reduce output quality in subtle ways. When the model “wants” to generate a token that’s masked, it’s forced to pick its second (or third, or tenth) choice, which can cascade into lower-quality completions.

Open-source implementations

Outlines (by .txt) is the most mature open-source library for constrained decoding. It supports JSON Schema, regex patterns, and arbitrary CFGs.

import outlines
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
    interests: list[str]

model = outlines.models.transformers("Qwen/Qwen3-8B")
generator = outlines.generate.json(model, Person)

result = generator("Extract: Alice is 29, likes hiking and chess.")
# Person(name='Alice', age=29, interests=['hiking', 'chess'])

vLLM integrates Outlines and supports guided_json, guided_regex, and guided_grammar parameters:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1")

response = client.chat.completions.create(
    model="Qwen/Qwen3-8B",
    messages=[{"role": "user", "content": "Extract: Alice is 29, likes hiking"}],
    extra_body={
        "guided_json": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
                "interests": {"type": "array", "items": {"type": "string"}}
            },
            "required": ["name", "age", "interests"]
        }
    }
)

llama.cpp supports GBNF grammars (a BNF variant) for constrained decoding:

root   ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws integer "," ws "\"interests\":" ws array ws "}"
string ::= "\"" [^"\\]* "\""
integer ::= [0-9]+
array  ::= "[" ws string ("," ws string)* ws "]"
ws     ::= [ \t\n]*

This is lower-level than JSON Schema but gives fine-grained control over the output format.
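
One way to run a grammar like this from Python is the llama-cpp-python binding; a sketch, assuming a local GGUF model (the model path is a placeholder):

from llama_cpp import Llama, LlamaGrammar

person_gbnf = r'''
root    ::= "{" ws "\"name\":" ws string "," ws "\"age\":" ws integer "}"
string  ::= "\"" [^"\\]* "\""
integer ::= [0-9]+
ws      ::= [ \t\n]*
'''

llm = Llama(model_path="./models/qwen3-8b-q4.gguf")  # hypothetical local model
grammar = LlamaGrammar.from_string(person_gbnf)

out = llm(
    "Extract as JSON: Alice is 29.",
    grammar=grammar,  # decoding is constrained to the grammar above
    max_tokens=128,
)
print(out["choices"][0]["text"])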

Approach 5: Grammar-Based Generation

Grammar-based generation generalizes schema-constrained decoding beyond JSON. Any output format expressible as a context-free grammar can be enforced: XML, YAML, SQL, custom DSLs, even programming languages.

[Diagram: Grammar-based generation works for any format with a formal grammar, not just JSON.]

Use cases beyond JSON

SQL generation with format constraints keeps queries inside a known shape. A regex handles simple cases (the pattern below only admits a narrow SELECT form); a full CFG can capture real SQL syntax:

import outlines

sql_regex = r"SELECT .+ FROM \w+( WHERE .+)?( ORDER BY \w+( ASC| DESC)?)?( LIMIT \d+)?;"
generator = outlines.generate.regex(model, sql_regex)

Enum/classification constrains output to exact label values:

generator = outlines.generate.choice(model, ["positive", "negative", "neutral"])
sentiment = generator("Review: The food was okay but the service was terrible.")
# "negative" — guaranteed to be one of the three options

Regex-constrained generation for structured strings:

# ISO date
date_gen = outlines.generate.regex(model, r"\d{4}-\d{2}-\d{2}")

# Email address (simplified)
email_gen = outlines.generate.regex(model, r"[a-z.]+@[a-z]+\.[a-z]{2,4}")

# Phone number
phone_gen = outlines.generate.regex(model, r"\+1-\d{3}-\d{3}-\d{4}")

The constraint is absolute: the model cannot produce output that doesn’t match the regex. This is more powerful than post-hoc validation because it eliminates retry loops entirely.

Provider Comparison

| Feature | OpenAI | Anthropic | Google | vLLM/Outlines | llama.cpp |
|---|---|---|---|---|---|
| JSON mode (syntax only) | json_object | ❌ (use tool_use) | response_mime_type | ✅ | ✅ |
| Schema-constrained JSON | strict: true | ❌ (schema validation, not constrained decoding) | response_schema | guided_json | ✅ (via GBNF) |
| Function calling | ✅ tools | ✅ tool_use | ✅ function_calling | ✅ (OpenAI-compatible) | Partial |
| Forced function selection | tool_choice | tool_choice | tool_config | ✅ (OpenAI-compatible) | ❌ (use grammar) |
| Regex constraints | ❌ | ❌ | ❌ | guided_regex | ✅ (via GBNF) |
| Arbitrary grammar | ❌ | ❌ | ❌ | guided_grammar | ✅ GBNF |
| Streaming + structure | ✅ (partial JSON) | ✅ (tool_use streaming) | ✅ | ✅ | ✅ |
| Nested object support | ✅ | ✅ | ✅ | ✅ | ✅ |
| Union types / oneOf | Limited | Limited | Limited | ✅ | Manual |
| First-token latency impact | ~10-20ms | None (prompt-based) | ~10-20ms | 50-200ms (compilation) | 10-50ms |

Key distinctions

OpenAI has the most complete structured output story among API providers. strict: true provides genuine constrained decoding with a schema guarantee. The limitation is the JSON Schema subset: no $ref cycles, additionalProperties: false required everywhere, and some oneOf/anyOf patterns unsupported.

Anthropic takes a different approach. Claude’s tool_use is schema-aware and produces well-typed output, but there’s no public documentation confirming that it uses constrained decoding internally (as opposed to schema-trained instruction following with server-side validation). In practice, Claude Opus 4.7’s schema adherence with tool_use is extremely reliable — low single-digit failure rates even on complex schemas — but the guarantee is probabilistic rather than absolute.

Google supports response_schema in the Gemini API, which provides schema-level constraints. The implementation appears to use constrained decoding for Gemini 3.1 Pro and Flash models.

Open-source (vLLM + Outlines) offers the most flexibility: JSON Schema, regex, and arbitrary grammars. The tradeoff is operational complexity and the compilation overhead for complex schemas.

[Diagram: API providers offer schema-constrained JSON; open-source frameworks additionally support arbitrary grammars and regex constraints.]

Implementation Patterns

Pattern 1: Pydantic + Structured Outputs (OpenAI)

The tightest integration available. Define a Pydantic model, pass it directly.

from pydantic import BaseModel
from openai import OpenAI

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float

class Invoice(BaseModel):
    vendor: str
    date: str
    items: list[LineItem]
    total: float

client = OpenAI()

response = client.beta.chat.completions.parse(
    model="gpt-5.4",
    messages=[
        {"role": "system", "content": "Extract invoice data from the provided text."},
        {"role": "user", "content": invoice_text}
    ],
    response_format=Invoice,
)

invoice = response.choices[0].message.parsed  # typed Invoice object

The parse method handles JSON Schema generation from the Pydantic model, strict: true configuration, and response deserialization. If the model refuses (content filter), response.choices[0].message.refusal contains the reason.
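
A short guard for that refusal path:

msg = response.choices[0].message
if msg.refusal:
    # The model declined to answer; there is no parsed payload.
    raise RuntimeError(f"Extraction refused: {msg.refusal}")
invoice = msg.parsed  # typed Invoice object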

Pattern 2: Tool Use as Structured Extraction (Anthropic)

import anthropic
from pydantic import BaseModel
import json

class Invoice(BaseModel):
    vendor: str
    date: str
    items: list[dict]  # simplified for example
    total: float

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4.7",
    max_tokens=2048,
    tools=[{
        "name": "extract_invoice",
        "description": "Extract structured invoice data",
        "input_schema": Invoice.model_json_schema()
    }],
    tool_choice={"type": "tool", "name": "extract_invoice"},
    messages=[{"role": "user", "content": f"Extract invoice data:\n{invoice_text}"}]
)

for block in response.content:
    if block.type == "tool_use":
        invoice = Invoice(**block.input)

Pattern 3: Retry with Validation

For cases where constrained decoding isn’t available or the schema is too complex:

import json

from openai import OpenAI
from pydantic import BaseModel, ValidationError
from tenacity import retry, stop_after_attempt, retry_if_exception_type

client = OpenAI()

class ExtractedData(BaseModel):
    name: str
    age: int
    email: str

@retry(
    stop=stop_after_attempt(3),
    retry=retry_if_exception_type((json.JSONDecodeError, ValidationError))
)
def extract_with_retry(text: str) -> ExtractedData:
    response = client.chat.completions.create(
        model="gpt-5.4",
        messages=[
            {"role": "system", "content": f"Extract data as JSON: {ExtractedData.model_json_schema()}"},
            {"role": "user", "content": text}
        ],
        response_format={"type": "json_object"}
    )
    raw = json.loads(response.choices[0].message.content)
    return ExtractedData(**raw)  # Pydantic validates types and required fields

This combines JSON mode (syntax guarantee) with Pydantic validation (schema guarantee) and retries (reliability guarantee). The cost is 1-3x token usage on failures, but with current models the retry path is hit <5% of the time for reasonable schemas.

Pattern 4: Streaming Structured Output

Structured output and streaming aren’t mutually exclusive. OpenAI streams partial JSON during constrained decoding, and Anthropic streams tool_use input progressively.

# OpenAI streaming with structured output
stream = client.chat.completions.create(
    model="gpt-5.4",
    messages=[...],
    response_format={"type": "json_schema", "json_schema": {...}},
    stream=True
)

partial_json = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    partial_json += delta
    # partial_json is a valid JSON prefix at each step
    # Can use incremental JSON parsers (ijson, json-stream) for progressive rendering

[Diagram: Streaming structured output enables progressive rendering — fields appear in the UI as they’re generated, even before the full JSON is complete.]

For progressive UI rendering, libraries like ijson (Python) or @streamparser/json (JavaScript) parse JSON incrementally, emitting complete key-value pairs as they become available.
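
If pulling in a streaming parser feels heavyweight, the same effect can be approximated with the standard library: track open strings and brackets in the accumulated prefix, provisionally close them, and attempt a parse. A sketch:

import json

def parse_json_prefix(prefix: str):
    """Best-effort parse of a streamed JSON prefix for progressive rendering."""
    stack = []          # currently open "{" / "[" characters, innermost last
    in_string = False
    escaped = False
    for ch in prefix:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(ch)
        elif ch in "}]":
            if stack:
                stack.pop()
    # Provisionally close the open string and brackets, then try to parse.
    completed = prefix + ('"' if in_string else "")
    completed += "".join("}" if c == "{" else "]" for c in reversed(stack))
    try:
        return json.loads(completed)
    except json.JSONDecodeError:
        return None  # e.g. prefix ends right after a comma or colon

parse_json_prefix('{"name": "Ali')                # {'name': 'Ali'}
parse_json_prefix('{"interests": ["hiking", ')    # None until the next value arrives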

Pattern 5: Multi-Schema Dispatch

When the model needs to choose between multiple output types:

tools = [
    {
        "type": "function",
        "function": {
            "name": "create_task",
            "parameters": {
                "type": "object",
                "properties": {
                    "title": {"type": "string"},
                    "priority": {"type": "string", "enum": ["low", "medium", "high"]},
                    "due_date": {"type": "string"}
                },
                "required": ["title", "priority"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_tasks",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "status": {"type": "string", "enum": ["open", "closed", "all"]}
                },
                "required": ["query"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "update_task",
            "parameters": {
                "type": "object",
                "properties": {
                    "task_id": {"type": "string"},
                    "status": {"type": "string", "enum": ["open", "closed"]}
                },
                "required": ["task_id", "status"]
            }
        }
    }
]

# tool_choice="auto" lets the model pick the right function
response = client.chat.completions.create(
    model="gpt-5.4",
    messages=[{"role": "user", "content": "Mark task-123 as done"}],
    tools=tools,
    tool_choice="auto"
)

The model selects the appropriate function and fills in the arguments. This is the foundation of agent tool use, but it’s equally useful for routing user intent to typed handlers.
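
The receiving side is then a dispatch table from tool name to a typed handler; a sketch (handler bodies are illustrative):

import json

def create_task(title: str, priority: str, due_date: str | None = None): ...
def search_tasks(query: str, status: str = "all"): ...
def update_task(task_id: str, status: str): ...

HANDLERS = {
    "create_task": create_task,
    "search_tasks": search_tasks,
    "update_task": update_task,
}

message = response.choices[0].message
if message.tool_calls:  # None when the model answered in plain text
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        HANDLERS[call.function.name](**args)  # arguments match the chosen schema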

Failure Modes and Edge Cases

Hallucinated field values

Constrained decoding guarantees the output matches the schema structurally, but it says nothing about semantic correctness. A model can produce {"age": 250} and satisfy {"type": "integer"}. Pydantic validators help:

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(min_length=1, max_length=200)
    age: int = Field(ge=0, le=130)
    email: str = Field(pattern=r'^[^@]+@[^@]+\.[^@]+$')

Even with strict: true, OpenAI’s constrained decoding doesn’t enforce minimum, maximum, minLength, maxLength, or pattern from JSON Schema. These are validated client-side by Pydantic, not server-side by the decoder. This is a common source of confusion.

[Diagram: Three layers of validation. Server-side constrained decoding handles structure, client-side Pydantic handles constraints, application logic handles semantics.]

The empty/minimal output problem

When constrained to produce JSON, models sometimes take the path of least resistance:

{"name": "", "age": 0, "interests": []}

This is schema-valid, satisfies all structural constraints, and is completely useless. It happens most often when the model is uncertain about the correct values or when the input doesn’t contain the requested information. Mitigation: include clear instructions about what to do when information is missing (use null vs. best guess vs. explicitly state uncertainty in a separate field).
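
Schema design can reinforce those instructions by making “not found” representable, giving the model a valid alternative to filler values. A sketch (field names illustrative):

from pydantic import BaseModel

class PersonExtraction(BaseModel):
    # Explicit escape hatch: the model can report that the input lacked the
    # information instead of emitting schema-valid filler like "" or 0.
    found: bool
    name: str | None = None
    age: int | None = None
    interests: list[str] = []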

Enum hallucination with non-strict modes

Without constrained decoding, models occasionally invent enum values:

# Schema says: "enum": ["low", "medium", "high"]
# Model returns: "priority": "urgent"  # not in enum

With OpenAI’s strict: true, this is impossible — the decoder only allows tokens that form valid enum values. With Anthropic’s tool_use, it happens occasionally (estimated <1% with Claude Opus 4.7 based on typical usage reports, though Anthropic doesn’t publish official rates).

Nested schema depth limits

Deeply nested schemas increase compilation time and can degrade output quality. A schema with 5+ levels of nesting creates a large FSM with many states, and the constrained decoding may produce more “mechanical” output as the model has fewer degrees of freedom.

Practical guideline: flatten schemas where possible. Instead of:

{"order": {"customer": {"address": {"street": {"line1": "..."}}}}}

Use:

{"customer_street_line1": "...", "customer_city": "..."}

Token limit truncation

If the model runs out of tokens mid-JSON, constrained decoding implementations typically force-close all open brackets/braces. The result is schema-valid but semantically truncated — arrays may be shorter than expected, optional fields may be missing. OpenAI returns a finish_reason: "length" when this happens, which should be checked.

if response.choices[0].finish_reason == "length":
    # Output was truncated — retry with higher max_tokens or simpler prompt
    pass

Performance and Latency Impact

Constrained decoding adds overhead at two points:

  1. Schema compilation (one-time): 10-500ms depending on schema complexity. Cached by most implementations.
  2. Per-token masking (every token): 0.1-2ms per token. Negligible relative to model forward pass time on GPUs.

For API providers (OpenAI, Google), the latency impact of structured outputs is barely measurable — estimated at 5-15ms for the first request with a new schema (compilation), and sub-millisecond per token after that. The schema caching means subsequent requests with the same schema see no compilation overhead.

For self-hosted inference with vLLM + Outlines, the picture is slightly different. Complex schemas can add 100-300ms compilation time, and per-token overhead is measurable (1-3ms) on smaller GPUs. The Outlines team has been optimizing this; recent versions use precompiled FSM transition tables that reduce per-token cost substantially.

| Setup | Schema Compilation | Per-Token Overhead | Overall Impact |
|---|---|---|---|
| OpenAI strict: true | ~10-20ms (cached) | ~0.1ms | Negligible |
| Gemini response_schema | ~10-20ms (cached) | ~0.1ms | Negligible |
| vLLM + Outlines | 50-500ms (cached) | 1-3ms | 5-15% on small models |
| llama.cpp + GBNF | 10-100ms (cached) | 0.5-2ms | 3-10% |

Impact on output quality

This is the less-discussed tradeoff. Constrained decoding can degrade output quality because it restricts the model’s token space. When the model’s preferred next token is masked, it falls back to lower-probability alternatives. This effect is most noticeable with:

  • Small models (< 8B parameters): More sensitive to token masking because their probability distributions are less peaked.
  • Complex schemas: More constraints = more masking = more forced detours from the model’s preferred path.
  • Highly specific enums: If the model’s top choice is “United States” but the enum only contains “US”, constrained decoding forces correct format at the cost of the model’s “natural” reasoning path.

With frontier models (GPT-5.4, Claude Opus 4.7, Gemini 3.1 Pro), quality degradation from structured output constraints is minimal in practice. The models are good enough at instruction following that the constrained tokens usually align with what the model would have generated anyway.

Choosing the Right Approach

[Diagram: Decision flow. Schema complexity determines the appropriate structured output approach.]

Decision framework

Use prompt engineering + JSON mode when:

  • Schema has < 5 flat fields
  • Occasional failures are tolerable (internal tools, batch processing with retry)
  • Provider doesn’t support schema-constrained decoding
  • Minimizing API complexity matters more than reliability

Use function calling / tool use when:

  • The model needs to choose between multiple schemas (multi-tool scenarios)
  • Building an agent that calls external APIs
  • Schema is moderately complex (nested objects, arrays)
  • Working with Anthropic (tool_use is the canonical structured output path)

Use strict/schema-constrained decoding when:

  • Zero tolerance for schema violations (payment processing, medical records, legal documents)
  • Schema is complex with nested objects, enums, and optional fields
  • High throughput where even a 2% retry rate is expensive
  • Using OpenAI or self-hosted inference (vLLM)

Use grammar-based generation when:

  • Output format isn’t JSON (SQL, YAML, custom DSL)
  • Need regex-level constraints on individual field values
  • Self-hosting with llama.cpp or vLLM
  • Classification tasks (constrain output to exact label set)

The Pydantic bridge

Regardless of which approach is used at the generation level, Pydantic (or equivalent typed validation) should sit at the application boundary:

from pydantic import BaseModel, ValidationError

class StructuredResponse(BaseModel):
    # ... schema definition ...
    pass

def get_structured_output(prompt: str) -> StructuredResponse:
    """Single function that abstracts the structured output strategy."""
    raw_json = call_llm_with_structured_output(prompt)  # any approach
    try:
        return StructuredResponse(**raw_json)
    except ValidationError as e:
        # Log, retry, or raise
        raise

This decouples the validation layer from the generation strategy. If moving from OpenAI to Anthropic (or from API to self-hosted), only the call_llm_with_structured_output implementation changes. The Pydantic model stays the same.

[Diagram: Pydantic as the universal validation layer. Regardless of provider or structured output method, all output passes through typed validation before entering application code.]

Summary

Structured output from LLMs exists on a spectrum from best-effort (prompt engineering) to guaranteed (constrained decoding). The right choice depends on schema complexity, reliability requirements, and provider constraints.

  • Prompt + JSON mode handles simple schemas adequately. Expect 2-5% failure rates on moderately complex schemas, mitigated by retries.
  • Function calling / tool use is the standard approach across providers. OpenAI’s tools and Anthropic’s tool_use both support schema-aware generation with forced function selection.
  • OpenAI’s strict: true Structured Outputs provides genuine constrained decoding with schema guarantees. The JSON Schema subset is limited but covers most practical use cases.
  • Anthropic’s tool_use achieves high schema adherence through training and (probably) server-side validation, but without documented constrained decoding guarantees.
  • Open-source constrained decoding (Outlines, vLLM guided_json, llama.cpp GBNF) offers the most flexibility, including regex and arbitrary grammar support, at the cost of operational complexity.
  • Constrained decoding doesn’t validate semantics. A structurally perfect JSON object can still contain hallucinated values. Client-side validation (Pydantic with field constraints) remains essential.
  • Streaming and structured output coexist. All major providers support streaming structured output, enabling progressive UI rendering with incremental JSON parsers.

Every production system should have Pydantic (or equivalent) at the application boundary regardless of what the LLM provider guarantees. Defense in depth: server-side constraints prevent syntax errors, client-side validation catches semantic errors, and business logic catches everything else.

Further Reading

  • Outlines — The primary open-source library for structured generation with LLMs, supporting JSON Schema, regex, and CFG constraints.
  • OpenAI Structured Outputs Guide — Official documentation on strict: true JSON Schema mode and function calling with structured outputs.
  • Anthropic Tool Use Documentation — Claude’s tool_use API for structured extraction and agent tool calls.
  • vLLM Guided Decoding — vLLM’s integration with Outlines for guided_json, guided_regex, and guided_grammar parameters.
  • llama.cpp GBNF Grammars — Grammar-based constrained decoding in llama.cpp using GBNF format.
  • Pydantic Documentation — Typed validation library used throughout the Python LLM ecosystem for schema definition and output validation.
  • Willison, “Structured Output from LLMs” — Simon Willison’s collected writings on structured extraction patterns and provider comparisons.
  • Guidance — Microsoft’s library for constrained generation that interleaves template structure with LLM generation.
  • Instructor — Pydantic-based structured output library with retry logic, supporting OpenAI, Anthropic, and other providers through a unified interface.