Office Hours — How are you handling LLM API costs in production without sacrificing quality?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
How are you handling LLM API costs in production without sacrificing quality?
The honest answer is you’re making explicit tradeoffs, not magically avoiding them. Most teams I talk to are doing three things in parallel.
First, routing by task complexity. GPT-5.5 is expensive but necessary for reasoning-heavy work. Claude Haiku 4.5 handles straightforward classification, summarization, and templated generation at a fraction of the cost. You need to actually measure where your tokens are going before you can be smart about this. Set up logging that tags requests by task type and model, then look at cost per task category.
Here’s what that looks like in practice. A content platform I know routes like this:
- Taxonomy classification (news articles into 5 predefined categories): Claude Haiku 4.5
- Fact-checking complex claims: GPT-5.5
- Bulk summarization of user-submitted content: Gemini 3.1 Flash (cheaper per token, sufficient quality)
- Real-time content moderation: Claude Haiku 4.5 with cached system prompts
- Novel reasoning tasks (e.g., “does this claim contradict our existing knowledge base?”): GPT-5.5
They measure token spend by category weekly. After routing optimization alone, they cut API costs by 35% without touching caching or batching. Most teams skip this step and go straight to self-hosting, which is backwards.
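A routing table like the one above can be sketched as a plain lookup. The task-type keys and model identifier strings here are illustrative assumptions, not official API model names, and falling back to the most capable model for unknown tasks is a design choice of this sketch rather than something the platform is known to do.

```python
# Hypothetical task-type keys and model identifiers; substitute the exact
# model names your providers expose.
ROUTES = {
    "taxonomy_classification": "claude-haiku-4.5",
    "fact_check": "gpt-5.5",
    "bulk_summarization": "gemini-3.1-flash",
    "content_moderation": "claude-haiku-4.5",
    "novel_reasoning": "gpt-5.5",
}

# Unknown task types fall back to the most capable (and most expensive) model,
# so a gap in the table costs money rather than quality.
DEFAULT_MODEL = "gpt-5.5"


def pick_model(task_type: str) -> str:
    """Return the model assigned to a task type, or the default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)
```

Keeping the table in one place also gives the logging a single source of truth for which model each task type is supposed to use.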
Routing in Practice
To instrument this, log every API call with task metadata. A minimal example using OpenAI’s API:
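from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# `prompt` is the text for this request, built elsewhere in your code.
response = client.chat.completions.create(
    model="gpt-5.5",
    messages=[{"role": "user", "content": prompt}],
    store=True,  # metadata is attached to stored completions; check the current API reference
    metadata={"task_type": "fact_check", "user_id": "123"},
)

# log_event and calculate_cost are your own helpers, not part of the SDK.
log_event({
    "task_type": "fact_check",
    "model": "gpt-5.5",
    "input_tokens": response.usage.prompt_tokens,
    "output_tokens": response.usage.completion_tokens,
    "cost": calculate_cost(response.usage, "gpt-5.5"),
})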
Run this for two weeks before optimizing. You’ll often find that 60-80% of your spend goes to tasks that don’t need your most expensive model.
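To turn those logs into cost per task category, something like the following would do. This is a minimal sketch: the price table holds placeholder numbers, not real provider prices, and `cost_by_task` is a hypothetical helper name. `calculate_cost` is written to accept any object with `prompt_tokens` and `completion_tokens` attributes, which is the shape of `response.usage` in the logging example.

```python
from collections import defaultdict

# Placeholder prices in USD per million tokens as (input, output).
# These are NOT real prices; fill in your providers' current rates.
PRICES_PER_MILLION = {
    "gpt-5.5": (10.00, 30.00),
    "claude-haiku-4.5": (1.00, 5.00),
}


def calculate_cost(usage, model: str) -> float:
    """Cost in USD of one call, from its input and output token counts."""
    input_price, output_price = PRICES_PER_MILLION[model]
    total = (usage.prompt_tokens * input_price
             + usage.completion_tokens * output_price)
    return total / 1_000_000


def cost_by_task(events: list[dict]) -> dict[str, float]:
    """Sum the logged "cost" field per task_type."""
    totals: dict[str, float] = defaultdict(float)
    for event in events:
        totals[event["task_type"]] += event["cost"]
    return dict(totals)
```

Sorting the result of `cost_by_task` by value shows at a glance which task types dominate spend and are worth routing to a cheaper model first.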
Second, caching and batch processing. If you’re making the same API calls repeatedly, you’re leaving money on the table. OpenAI applies prompt caching automatically to long prompts, and Anthropic and Google’s Gemini models offer caching too. For non-latency-sensitive work, batch APIs are 50% cheaper than real-time calls. Build the latency tolerance into your product architecture upfront.
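Caching is particularly underutilized. If your system prompt or few-shot examples are over 1024 tokens (and they usually are), you’re paying for the same tokens on every request. With OpenAI’s caching, those repeated tokens are billed at a steep discount after the first call, up to 90% less depending on the model. Gemini’s caching is similar. The operational friction is minimal: caches match on the prompt prefix, so put the static system prompt and examples first and the per-request content last, then mostly forget about it.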
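The savings are easy to estimate before changing anything. The sketch below is a best-case model under stated assumptions: the first call pays full price, every later call hits the cache, and there is no cache-write surcharge or expiry. The function name and parameters are illustrative, and the discount you pass in should come from your provider's current pricing page.

```python
def input_cost(requests: int, prefix_tokens: int, dynamic_tokens: int,
               price_per_million: float, cached_discount: float = 0.0) -> float:
    """Input-token cost in USD for `requests` calls sharing one static prefix.

    Best case: the first call pays full price and every later call reads the
    prefix from cache at (1 - cached_discount) of the normal price.
    """
    if requests <= 0:
        return 0.0
    per_token = price_per_million / 1_000_000
    first_call = (prefix_tokens + dynamic_tokens) * per_token
    later_call = (prefix_tokens * (1 - cached_discount) + dynamic_tokens) * per_token
    return first_call + (requests - 1) * later_call
```

Calling it twice, once with `cached_discount=0.0` and once with your provider's discount, gives the upper bound on what a stable prompt prefix can save for a given workload.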
Batching is trickier because it requires architectural changes. You can’t batch a real-time user request, but you can batch overnight summarization, metadata extraction, or analytics jobs. If you have 100k documents to classify and can wait 24 hours, batch processing saves real money. At the 50% batch discount, a job billed at $3 per million input tokens on demand (the GPT-5.4 price assumed here) costs $1.50 in batch. The absolute savings grow with the per-token price, so the gap widens for frontier models.
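For the document-classification case, preparing a batch mostly means writing one JSON request per line. The sketch below builds lines in the JSONL shape OpenAI's Batch API expects for chat completions (`custom_id`, `method`, `url`, `body`); the function name, the instruction text, and the document IDs are illustrative assumptions.

```python
import json


def build_batch_lines(documents: dict[str, str], model: str,
                      instruction: str) -> list[str]:
    """Return one JSONL line per document, keyed by its ID as custom_id."""
    lines = []
    for doc_id, text in documents.items():
        request = {
            "custom_id": doc_id,  # used to match results back to documents
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [
                    {"role": "system", "content": instruction},
                    {"role": "user", "content": text},
                ],
            },
        }
        lines.append(json.dumps(request))
    return lines
```

Write the lines to a `.jsonl` file, upload it with `client.files.create(..., purpose="batch")`, and submit it with `client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h")`. Results come back in a separate output file and are not guaranteed to be in input order, which is why each request carries its own `custom_id`.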
The Self-Hosting Threshold
Third, consider self-hosted models for high-volume, lower-risk tasks. Llama 4 is solid for classification and extraction at scale. DeepSeek-V3 handles many workloads cheaper than frontier models. The operational overhead is real, but once you’re spending around $10k or more monthly on API calls for commodity tasks, self-hosting them becomes worth it.
The math: a Llama 4 deployment (modest GPU cluster, inference serving, monitoring) costs roughly $2-5k/month to operate. If you’re spending $10k+/month on Claude Haiku for straightforward tasks, moving them in-house saves $5-8k/month, so the upfront engineering work pays for itself within months. If you’re spending $2k/month, there is little or nothing left to save once the cluster is paid for, and the operational burden doesn’t justify it. Know your actual spend first.
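The break-even reasoning can be written down as a one-line formula. This is a rough sketch: the one-time setup cost (engineering time, validation, migration) is an assumption you have to estimate yourself, since the figures above only cover monthly operating cost.

```python
import math


def months_to_break_even(api_spend: float, hosting_cost: float,
                         setup_cost: float) -> float:
    """Months until monthly savings repay the one-time setup cost.

    api_spend and hosting_cost are monthly USD figures. Returns math.inf
    when self-hosting does not save money each month.
    """
    monthly_savings = api_spend - hosting_cost
    if monthly_savings <= 0:
        return math.inf
    return setup_cost / monthly_savings
```

With a hypothetical $20k setup cost, `months_to_break_even(10_000, 5_000, 20_000)` comes out to four months, while `months_to_break_even(2_000, 2_000, 20_000)` never breaks even.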
Self-hosting also introduces real constraints. Model quality plateaus below frontier capability. Llama 4 is good but won’t match GPT-5.5 on reasoning tasks. Your engineering team now maintains another service. Upgrading to new open-source models requires revalidation. The right use case is high-volume, low-ambiguity tasks where your quality bar is well-defined and you’ve validated the model meets it.
One edge case: if you have spiky traffic, self-hosting becomes less attractive. Scaling a GPU cluster up and down is slower than scaling API calls. If your usage varies 10x week-to-week, the API cost during peaks is still cheaper than provisioning hardware for them.
Bottom line: Measure your actual token spend by task type first, then match model tier to task complexity. Add caching everywhere it fits, batch what you can, and only self-host if you have engineering bandwidth and the math actually works.
Question via Hacker News