Office Hours — How are you handling LLM API costs in production without sacrificing quality?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.


How are you handling LLM API costs in production without sacrificing quality?

The honest answer is that you're making explicit tradeoffs, not magically avoiding them. Most teams I talk to are doing three things in parallel.

First, routing by task complexity. GPT-5.4 is expensive but necessary for reasoning-heavy work. Claude Haiku 4.5 handles straightforward classification, summarization, and templated generation at a fraction of the cost. You need to actually measure where your tokens are going before you can be smart about this. Set up logging that tags requests by task type and model, then look at cost per task category.
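That logging-then-routing loop can be sketched in a few lines. This is a minimal illustration, not a production router: the model names, tiers, and per-token prices below are made-up placeholders, and the routing table is an assumption about how you might map task types to tiers.

```python
# Minimal sketch: route requests by task complexity and tag spend by task type.
# Model names and prices are illustrative placeholders, not real price sheets.
from collections import defaultdict

ROUTES = {
    "reasoning":      {"model": "frontier-large", "usd_per_1k_tokens": 0.015},
    "classification": {"model": "small-fast",     "usd_per_1k_tokens": 0.001},
    "summarization":  {"model": "small-fast",     "usd_per_1k_tokens": 0.001},
}
DEFAULT = "reasoning"  # unknown tasks fall back to the capable (expensive) tier

cost_by_task = defaultdict(float)

def route(task_type: str) -> str:
    """Pick a model tier based on task complexity."""
    return ROUTES.get(task_type, ROUTES[DEFAULT])["model"]

def record_usage(task_type: str, tokens: int) -> None:
    """Tag each request by task type so spend can be analyzed per category."""
    price = ROUTES.get(task_type, ROUTES[DEFAULT])["usd_per_1k_tokens"]
    cost_by_task[task_type] += tokens / 1000 * price

record_usage("classification", 2000)  # 2k tokens on the cheap tier
record_usage("reasoning", 2000)       # same token count, 15x the cost
```

Once `cost_by_task` is populated from real traffic, the categories that dominate spend are the ones worth moving to a cheaper tier first.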

Second, caching and batch processing. If you’re making the same API calls repeatedly, you’re leaving money on the table. OpenAI’s prompt caching works. Google’s Gemini models have native caching too. For non-latency-sensitive work, batch APIs are 50% cheaper than real-time calls. Build the latency tolerance into your product architecture upfront.
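Beyond provider-side prompt caching, a local response cache for repeated deterministic calls is cheap to add. A minimal sketch, where `call_model` is a stand-in for whatever API client you actually use:

```python
# Minimal sketch: cache responses for identical deterministic requests.
# `call_model` is a placeholder for your real API client function.
import hashlib
import json

_cache: dict = {}

def cache_key(model: str, prompt: str, temperature: float) -> str:
    """Stable key over everything that determines the response."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "temperature": temperature},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode()).hexdigest()

def cached_call(model, prompt, call_model, temperature=0.0):
    # Only cache temperature-0 style calls; sampled outputs
    # shouldn't be reused as if they were deterministic.
    key = cache_key(model, prompt, temperature)
    if key not in _cache:
        _cache[key] = call_model(model, prompt, temperature)
    return _cache[key]

calls = []
def fake_model(model, prompt, temperature):
    calls.append(prompt)  # count real API hits
    return f"echo:{prompt}"

cached_call("small-fast", "classify: hello", fake_model)
cached_call("small-fast", "classify: hello", fake_model)
# The second call is served from the cache; fake_model ran only once.
```

The same key function works for deduplicating a batch job before submission, so you never pay twice for identical rows.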

Third, consider self-hosted models for high-volume, lower-risk tasks. Llama 4 is solid for classification and extraction at scale. DeepSeek-V3 is cheaper to run than frontier models for many workloads. The operational overhead is real, but if you’re spending thousands monthly on API calls, self-hosting the commodity tasks becomes worth it.
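"The math actually works" is a simple break-even check: API spend at your real volume versus fixed GPU plus operations cost. A back-of-the-envelope sketch, where every number below is an illustrative assumption rather than a real price:

```python
# Back-of-the-envelope break-even check for self-hosting.
# All figures are illustrative assumptions, not quoted prices.

def monthly_api_cost(tokens_per_month: int, usd_per_1k: float) -> float:
    """What you'd pay a hosted API at a given per-1k-token rate."""
    return tokens_per_month / 1000 * usd_per_1k

def self_hosting_breaks_even(tokens_per_month: int, api_usd_per_1k: float,
                             gpu_usd_per_month: float,
                             ops_usd_per_month: float) -> bool:
    """Self-hosting wins only if API spend exceeds fixed infra + ops cost."""
    fixed = gpu_usd_per_month + ops_usd_per_month
    return monthly_api_cost(tokens_per_month, api_usd_per_1k) > fixed

# Example: 500M tokens/month at $0.01/1k = $5,000 API spend,
# versus $2,000/month in GPUs plus $1,500/month in engineering time.
print(self_hosting_breaks_even(500_000_000, 0.01, 2000, 1500))  # True
```

The ops term is the one teams most often underestimate; price your engineers' time honestly before trusting the comparison.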

Bottom line: Measure your actual token spend by task type first, then match model tier to task complexity. Add caching everywhere it fits, batch what you can, and only self-host if you have engineering bandwidth and the math actually works.

Question via Hacker News