
Office Hours — Why are so many companies rolling out their own AI/LLM agent sandboxing solution?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: one question from the trenches, one opinionated answer.

Why are so many companies rolling out their own AI/LLM agent sandboxing solution?

The framing of this question shifted this week. As of May 2026, providers have moved into the managed agent space. Anthropic announced agent hosting directly. OpenAI’s Agents SDK includes sandboxed code execution. Google has Vertex AI Agent Builder. The premise “providers offer nothing here” is no longer accurate.

What they offer are managed primitives: hosted execution environments, some spend controls, sandboxed tool invocations. What they don’t offer is integration with your infrastructure. Your VPC, your databases, your audit logging requirements, your compliance boundary. That gap is still yours to close.

The Unit Economics Problem

The companies building in-house aren’t doing it because the providers haven’t shipped anything. They’re doing it because the economics and integration surface don’t match production requirements. Anthropic’s agent hosting was reported this week to cost significantly more than the $0.08/hr headline once you account for actual model invocations and tool calls. A rough calculation on a typical autonomous coding task: Claude Opus 4.7 charges $15 per million input tokens and $75 per million output tokens. A single complex refactoring loop might consume 200k input tokens and 300k output tokens across 3-5 agent steps. That’s roughly $25-30 per task. At scale across a team, managed provider infrastructure at that per-task cost becomes harder to justify than a self-hosted orchestration layer running on existing container infrastructure.
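To make that concrete, here is the per-task arithmetic as a quick Python sketch, using the rates and token counts quoted above. It's a rough estimate, not a billing model, and the numbers will vary with the actual task.

# Back-of-the-envelope per-task cost from the rates quoted above.
INPUT_PER_MTOK = 15.00   # $ per 1M input tokens
OUTPUT_PER_MTOK = 75.00  # $ per 1M output tokens

def task_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

# A complex refactoring loop: ~200k input, ~300k output across 3-5 agent steps.
print(task_cost(200_000, 300_000))  # -> 25.5, in line with the $25-30 estimate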

Opaque pricing on managed agent infrastructure makes it hard to budget for anything with real workload volume. You don’t know upfront whether tool invocations, retry loops, or context window refreshes will push you into a higher tier or incur additional charges.

Here’s what that looks like in practice. A team running 50 autonomous coding agents per day (conservative for a medium-size engineering org) hits roughly $37,500/month on managed hosting at ~$25 per task. The same workload on a self-hosted Modal or Temporal cluster making Claude Opus 4.7 API calls directly costs roughly $18,000/month for model inference plus $4,000/month for compute. That’s a roughly 40% cost reduction before accounting for the fact that you can optimize retry logic, cache responses, or batch database queries in your own orchestration layer. A managed agent that makes inefficient tool calls charges you for every one.
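The monthly comparison follows the same way. A rough sketch, assuming ~$25 per managed task and the self-hosted estimates above:

# Rough monthly comparison at 50 agent tasks/day over 30 days.
tasks_per_month = 50 * 30                   # 1,500 tasks
managed_monthly = tasks_per_month * 25.0    # ~$37,500 on managed hosting
self_hosted_monthly = 18_000 + 4_000        # ~$22,000: direct inference + compute
reduction = 1 - self_hosted_monthly / managed_monthly
print(f"{reduction:.0%}")                   # ~41%, before any caching or batching wins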

What Self-Hosted Orchestration Actually Looks Like

The practical delta shows up in how you instrument model calls. A self-hosted Temporal workflow wrapping Claude Opus 4.7 gives you something like this:

from temporalio import activity
from anthropic import AsyncAnthropic
import structlog

log = structlog.get_logger()
anthropic_client = AsyncAnthropic()  # reads ANTHROPIC_API_KEY from the environment

@activity.defn
async def run_agent_step(ctx: AgentContext) -> StepResult:
    # One agent step: a single model invocation with the current
    # conversation state and tool definitions.
    response = await anthropic_client.messages.create(
        model="claude-opus-4-7",
        max_tokens=4096,
        messages=ctx.messages,
        tools=ctx.tools,
    )
    # Attribute this step's token usage to the task so cost can be
    # traced back to a specific workflow run.
    ctx.token_counter.record(response.usage)
    log.info("step_tokens",
             input=response.usage.input_tokens,
             output=response.usage.output_tokens,
             task_id=ctx.task_id)
    return StepResult(content=response.content, usage=response.usage)

That’s granular per-step token accounting, attached to a specific task ID, logged to whatever sink your compliance team requires. Managed agent hosting gives you aggregate usage numbers. That’s not the same thing.

When Managed Hosting Works

Provider-managed agent hosting is genuinely useful for teams without strict cost or compliance boundaries. Offloading observability, scaling, and error handling to the provider removes operational friction. For teams running GitHub Copilot across an organization or using Claude Code for internal automation, the convenience factor is real.

But production workloads have hard constraints. Temporal, Prefect, and Modal’s container primitives remain solid options for teams that need explicit observability, cost controls, and integration into existing deployment pipelines. These tools let you run frontier models as agents without sacrificing visibility into token consumption, retry behavior, or infrastructure costs. You own the execution layer; the provider owns the model inference. That split is cleaner for compliance and forecasting.

The Compliance and Integration Gap

Provider agent hosting also doesn’t solve the hard integration problem. Your database access patterns, your audit logging requirements, your VPC isolation boundary, your rate limiting strategy. These are orthogonal to whether you use Claude Code or build an in-house orchestrator. But they’re not orthogonal to cost. A managed agent that makes five unnecessary database queries bills you for five extra tool-call round trips. A self-hosted agent running on your infrastructure can batch those calls, cache responses, or fail fast. The economics compound.

There’s also the observability gap. Managed agent sandboxes give you high-level logs. They don’t give you the granular token accounting that matters for production cost forecasting, or the ability to see exactly which tool invocation caused a token spike. Self-hosted orchestration with direct API access lets you instrument every model call, every retry, every context refresh. For teams with compliance or regulatory reporting requirements, that transparency is non-negotiable.

Hidden Costs of Managed Infrastructure

One often-overlooked edge case: retry behavior. A managed agent that hits rate limits or temporary API failures will retry transparently, charging you for every attempt. Self-hosted orchestration gives you control over exponential backoff, circuit breakers, and which failures warrant a retry versus immediate fallback. On a large volume of agent tasks, that difference compounds into real money.
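With Temporal, that control is an explicit retry policy on the activity that wraps the model call. A minimal sketch, assuming the run_agent_step activity from earlier; the workflow class name and the non-retryable error type are illustrative, not fixed APIs:

from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

@workflow.defn
class AgentTaskWorkflow:
    @workflow.run
    async def run(self, ctx: AgentContext) -> StepResult:
        # Capped exponential backoff, a hard attempt limit, and no retries
        # for failures a retry cannot fix (hypothetical error type below).
        return await workflow.execute_activity(
            run_agent_step,
            ctx,
            start_to_close_timeout=timedelta(minutes=5),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=2),
                backoff_coefficient=2.0,
                maximum_interval=timedelta(minutes=1),
                maximum_attempts=3,
                non_retryable_error_types=["InvalidToolInputError"],
            ),
        )

Every retry is a decision you made, with a backoff curve you chose, rather than a charge you discover on the invoice.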

Caching is another lever managed providers don’t expose. If your agent repeatedly reads the same configuration file or queries the same schema metadata across multiple runs, a self-hosted setup can cache those responses at the prompt level. Anthropic’s prompt caching cuts input token costs by up to 90% on repeated context. Managed infrastructure may or may not apply that automatically, and you have no visibility into whether it did.
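On the self-hosted side, that caching is something you opt into explicitly. A minimal sketch using the Anthropic SDK's prompt caching, marking a large, stable block of schema context as cacheable; the variable names are illustrative:

response = await anthropic_client.messages.create(
    model="claude-opus-4-7",
    max_tokens=4096,
    system=[
        {
            "type": "text",
            "text": schema_metadata,                 # large, stable context reused across runs
            "cache_control": {"type": "ephemeral"},  # cache this prefix for later calls
        }
    ],
    messages=ctx.messages,
)
# cache_read_input_tokens shows how much of the prompt was served from
# cache at the reduced input rate.
log.info("cache_hit_tokens", cached=response.usage.cache_read_input_tokens)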

Bottom line: Evaluate managed agent hosting against your actual cost model and compliance requirements before committing. The providers have moved into this space, but the unit economics are still opaque and the integration surface often favors self-hosted orchestration for anything running at production volume. The managed path is faster to start. The self-hosted path is cheaper and more auditable to operate.

Question via Hacker News