
Office Hours — Is anyone using function calling with LLMs in production?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: One question from the trenches, one opinionated answer.

Is anyone using function calling with LLMs in production?

Yeah, heavily. Function calling (or tool use) is no longer experimental—it’s foundational infrastructure for anything beyond chat. OpenAI’s and Anthropic’s APIs have it baked in, and the pattern is consistent: declare your tools, let the model decide when to invoke them, handle the result, loop back.
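Here's a minimal sketch of that loop using the OpenAI Python SDK. The get_order_status tool, its implementation, and the model name are placeholders for illustration, not anything from a real deployment:

import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool, for illustration only; your real tools replace this.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the current status of an order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

def run_tool(name: str, args: dict) -> dict:
    # Dispatch to your real implementation here.
    if name == "get_order_status":
        return {"order_id": args["order_id"], "status": "shipped"}
    return {"error": f"unknown tool: {name}"}

def agent_turn(user_message: str, model: str = "gpt-4o") -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        resp = client.chat.completions.create(model=model, messages=messages, tools=TOOLS)
        msg = resp.choices[0].message
        if not msg.tool_calls:                      # model answered directly: done
            return msg.content
        messages.append(msg)                        # keep the assistant's tool-call turn
        for call in msg.tool_calls:
            args = json.loads(call.function.arguments)
            result = run_tool(call.function.name, args)
            messages.append({                       # feed the result back, then loop
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            })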

The reality is messier than tutorials suggest. In production, you’re managing tool schemas carefully (models hallucinate parameters), handling timeouts when external services lag, and dealing with models that sometimes refuse to call tools when they should. You need explicit fallback logic for when tool invocation fails or returns garbage.
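One way to keep lagging services and garbage results from poisoning the loop: wrap every tool execution in a timeout and return structured errors to the model instead of raising. A sketch, with the timeout value and error shape as assumptions:

import json
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=8)

def run_tool_safely(fn, args: dict, timeout_s: float = 10.0) -> str:
    """Run a tool, converting timeouts and exceptions into structured errors
    the model can see and recover from, instead of crashing the agent loop."""
    future = _executor.submit(fn, **args)
    try:
        result = future.result(timeout=timeout_s)
        return json.dumps({"ok": True, "result": result})
    except FutureTimeout:
        return json.dumps({"ok": False, "error": f"tool timed out after {timeout_s}s"})
    except Exception as exc:                 # hallucinated parameters often surface here
        return json.dumps({"ok": False, "error": f"{type(exc).__name__}: {exc}"})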

Where It Works Well

Real wins come from specific domains: customer support agents routing to appropriate queues, code execution sandboxes, autonomous coding agents with git and test runners. Function calling works best when success is verifiable. A test passes or fails. An API call returns a clear response. A database query executes. The model sees the outcome and adjusts.

Coding agents are the clearest example. Claude Opus 4.7 and GitHub Copilot can now handle genuine multi-step tasks: clone a repo, run tests, parse failure output, modify code, re-run, and push—all without human intervention between steps. The feedback loop is tight and objective. GPT-5.4 has native computer use capability in the API, allowing agents to interact with desktop applications directly through visual feedback. The frontier models have gotten good enough that the bottleneck shifts from capability to infrastructure: can you observe what the agent is doing, and can you roll back if it breaks something?

Where It Breaks Down

Function calling starts breaking down in ambiguous territory where the model needs to reason about whether a tool even makes sense. Should we call the payment API or retry with fallback pricing? Is this customer request something we should escalate, or handle in-band? These aren’t binary tool invocations; they’re judgment calls.

Long tool chains also degrade under pressure. Each step introduces noise. A model calling ten tools in sequence accumulates error—hallucinated parameters in step three, a malformed API call in step seven. The longer the chain, the higher the chance of drift, especially if intermediate signals are weak or noisy. Agents excel when each step has a clean success signal and the decision tree is bounded; they fail at agentic RAG across heterogeneous data sources and at anything requiring subjective judgment about safety or correctness.
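A cheap guard against drift is to cap the chain explicitly and hand off when the budget runs out. A sketch, assuming your loop is factored into a single-step function like the earlier example:

MAX_TOOL_STEPS = 10  # assumed budget; tune per workflow

def bounded_agent_turn(messages, step_fn):
    """Run the tool loop, but stop after MAX_TOOL_STEPS iterations
    instead of letting the chain drift indefinitely."""
    for _ in range(MAX_TOOL_STEPS):
        done, messages = step_fn(messages)   # one model call plus tool executions
        if done:
            return messages
    messages.append({
        "role": "user",
        "content": "Tool budget exhausted. Summarize progress and escalate to a human.",
    })
    return messages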

Observability and Cost

The tooling is maturing. OpenAI’s API includes native sandbox execution. Anthropic’s Claude Opus 4.7 is reliable at complex multi-step tool sequences. But you still need aggressive observability. When a function call chain fails, you need to know which step broke and why. Log the schema sent to the model, the tool calls it generated, the results it received, and the final decision. Most teams aren’t instrumented at this level.
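A sketch of per-step trace logging that captures exactly that; the JSON-lines format and trace ID scheme are assumptions:

import json, time, uuid, logging

logger = logging.getLogger("agent.trace")

def log_step(trace_id: str, step: int, tools_schema, tool_calls, tool_results, decision):
    """Record one loop iteration: what the model was offered, what it called,
    what came back, and what it decided. One JSON line per step, for easy replay."""
    logger.info(json.dumps({
        "trace_id": trace_id,
        "step": step,
        "ts": time.time(),
        "tools_schema": tools_schema,     # schema sent to the model
        "tool_calls": tool_calls,         # calls the model generated
        "tool_results": tool_results,     # results it received
        "decision": decision,             # final text or next action
    }, default=str))

# Usage: trace_id = str(uuid.uuid4()) once per request, then log_step(...) each iteration.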

Cost scales quickly with complex agents. Each tool call is an API round trip. A customer support agent that chains five tools per query and runs 100k queries monthly adds up fast. At typical frontier model pricing, that’s roughly $2k-$5k monthly in tool invocation overhead alone, before tracing infrastructure. Budget for observability infrastructure too—logging and replay logic to debug failures will often dwarf the compute cost. Consider sampling strategy early: you cannot log every step of every agent run at scale without bleeding budget.
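The back-of-the-envelope math, with the per-round-trip cost as an assumed range standing in for your actual model and prompt sizes:

queries_per_month = 100_000
tool_calls_per_query = 5
cost_per_round_trip_usd = (0.004, 0.010)   # assumed range per model call, tokens included

round_trips = queries_per_month * tool_calls_per_query   # 500,000 per month
low, high = (round_trips * c for c in cost_per_round_trip_usd)
print(f"{round_trips:,} round trips -> ${low:,.0f}-${high:,.0f} per month")
# 500,000 round trips -> $2,000-$5,000 per month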

Schema Design and Failure Modes

Tool schemas are where theoretical clarity meets practical chaos. A schema that’s too loose gives the model room to hallucinate; too tight and it refuses valid requests. GPT-5.4 and Claude Opus 4.7 are both better at respecting constraints than earlier models, but neither is perfect. Use enums aggressively for categorical parameters. Make required fields explicit. Include examples in the description.

{
  "name": "transfer_funds",
  "description": "Transfer money between accounts. Always verify the destination before executing.",
  "parameters": {
    "type": "object",
    "properties": {
      "from_account_id": {
        "type": "string",
        "pattern": "^ACC-[0-9]{8}$",
        "description": "Source account ID (format: ACC-XXXXXXXX)"
      },
      "to_account_id": {
        "type": "string",
        "pattern": "^ACC-[0-9]{8}$"
      },
      "amount_cents": {
        "type": "integer",
        "minimum": 1,
        "description": "Amount in cents. Must be positive."
      },
      "transfer_type": {
        "type": "string",
        "enum": ["domestic", "international", "wire"],
        "description": "Transfer method affects fees and processing time"
      }
    },
    "required": ["from_account_id", "to_account_id", "amount_cents", "transfer_type"]
  }
}

When the model makes a malformed call, don’t silently fail—return a clear error message describing what went wrong and let it retry with context. “transfer_type must be one of: domestic, international, wire. You provided ‘DOMESTIC’.” beats “invalid parameter” every time.
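A sketch of that feedback loop using the jsonschema package to validate arguments before execution and return the validator's specific complaint to the model; the helper names are placeholders:

import json
from jsonschema import validate, ValidationError

def execute_tool_call(schema: dict, args: dict, impl):
    """Validate model-supplied arguments against the tool's JSON Schema.
    On failure, return the specific violation so the model can retry with context."""
    try:
        validate(instance=args, schema=schema["parameters"])
    except ValidationError as err:
        # e.g. "'DOMESTIC' is not one of ['domestic', 'international', 'wire']"
        return json.dumps({"ok": False, "error": err.message})
    return json.dumps({"ok": True, "result": impl(**args)})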

The Fallback Question

Build fallback paths explicitly. If your tool invocation fails, what does the agent do? Retry the same tool? Call a different tool? Escalate to a human? Hallucinating a response is worse than admitting failure. A customer support agent that makes up a tracking number is a liability. One that says “I couldn’t retrieve that, connecting you to a specialist” is honest.
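A sketch of an explicit fallback path: retry once, try a secondary tool, then escalate. The function names are placeholders; the point is that fabricating a result is never one of the branches:

def call_with_fallback(primary, fallback, args: dict, escalate):
    """Try the primary tool, retry once, fall back to a secondary tool,
    and finally escalate to a human rather than invent a result."""
    for _ in range(2):                            # primary, then one retry
        try:
            return primary(**args)
        except Exception:
            continue
    try:
        return fallback(**args)                   # e.g. a cached or read-only lookup
    except Exception:
        return escalate(args)                     # honest handoff to a specialist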

Bottom line: Function calling is production-ready and widely deployed, but treat it as controlled orchestration, not magic autonomy. Build it where success is verifiable and keep humans in the loop when failure has teeth. Instrument aggressively or you’ll debug in production.

Question via Hacker News