Office Hours — I'm using Claude Opus 4.6 for a customer-facing summarization task. Should I batch requests during off-peak hours to save money, or just call the API in real-time?

Real answer: it depends on your latency tolerance, but batching is almost always worth it if you can afford the delay.

Claude’s batch API gives you a 50% discount compared to standard API pricing. If you’re summarizing support tickets, emails, or documents where a 24-hour turnaround is acceptable, you’re leaving money on the table by not batching. We’ve seen teams cut their summarization costs in half by moving to batch.

The math is concrete. A support team processing 1,000 ticket summaries per day at standard rates might spend $50-80 a day, depending on summary length and model. The same work submitted as batch jobs costs $25-40 a day. Over a 30-day month, that’s roughly $750-1,200 in savings with no quality loss. Off-peak timing doesn’t matter for batch pricing, by the way: the 50% discount applies whenever you submit the job.

When Batching Works (and Doesn’t)

The tradeoff is straightforward: real-time API calls have no delay but cost full price. Batch processing takes up to 24 hours and costs half as much. If your summarization task is user-initiated and they’re waiting for the result, you need standard API calls. But if it’s background processing, scheduled jobs, or bulk operations, batch wins.

One thing to watch: batch doesn’t work for interactive chat or streaming, so anything a user is actively waiting on stays on standard pricing. Internal workflows, daily customer dashboards, and asynchronous processing pipelines are all fair game for batch.

Practically, most production setups use a hybrid: real-time for high-priority or urgent summaries, batch for the routine stuff. You could tier by urgency. If a customer requests a summary during a support call, mark it as “VIP” and run it through the standard API. Everything else queues for the next batch window. Your infrastructure picks the right path based on urgency flags in the request.
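A minimal sketch of that routing decision, assuming a hypothetical SummaryRequest record with an urgent flag (neither name comes from any SDK):

```python
from dataclasses import dataclass


@dataclass
class SummaryRequest:
    ticket_id: str
    text: str
    urgent: bool = False  # hypothetical flag, e.g. set during a live support call


def choose_path(request: SummaryRequest) -> str:
    """Return "realtime" for urgent requests and "batch" for everything else."""
    return "realtime" if request.urgent else "batch"


requests = [
    SummaryRequest("ticket-4521", "Customer cannot log in.", urgent=True),
    SummaryRequest("ticket-4522", "Question about the export feature."),
]
routes = {r.ticket_id: choose_path(r) for r in requests}
```

Keeping the decision in one small function makes it easy to add more tiers later without touching the code that calls the API.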

Practical Implementation

You’ll want a simple queue structure. Store summaries needing processing in a database with a priority field and processed_at timestamp. A scheduled job runs once daily (usually off-peak, though timing is decoupled from pricing), collects all non-urgent items, and formats them as batch requests.
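Here is one way that queue could look, sketched with SQLite from the Python standard library. The table name, column names, and priority values are assumptions for illustration; any database with the same two fields would do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE summary_queue (
        ticket_id    TEXT PRIMARY KEY,
        body         TEXT NOT NULL,
        priority     TEXT NOT NULL DEFAULT 'normal',  -- 'normal' or 'urgent'
        processed_at TEXT                             -- NULL until summarized
    )
    """
)
conn.executemany(
    "INSERT INTO summary_queue (ticket_id, body, priority) VALUES (?, ?, ?)",
    [
        ("ticket-4521", "Customer cannot log in.", "normal"),
        ("ticket-4522", "Refund requested twice.", "urgent"),
        ("ticket-4523", "Export feature question.", "normal"),
    ],
)


def collect_for_batch(conn):
    """Return (ticket_id, body) rows that are unprocessed and not urgent."""
    return conn.execute(
        "SELECT ticket_id, body FROM summary_queue "
        "WHERE processed_at IS NULL AND priority != 'urgent' "
        "ORDER BY ticket_id"
    ).fetchall()


pending = collect_for_batch(conn)
```

The daily job would call collect_for_batch, submit the rows, and set processed_at once results are stored.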

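Each batch request to Claude Opus 4.6 pairs a custom_id with the params of an ordinary Messages call. The requests are submitted together as a list in a single batch (results come back as JSONL). Two requests, shown one per line: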

{"custom_id": "ticket-4521", "params": {"model": "claude-opus-4-6", "max_tokens": 1024, "messages": [{"role": "user", "content": "Summarize this support ticket: ..."}]}}
{"custom_id": "ticket-4522", "params": {"model": "claude-opus-4-6", "max_tokens": 1024, "messages": [{"role": "user", "content": "Summarize this support ticket: ..."}]}}

Include request IDs (the custom_id field) so you can match responses back to your database records. Results are not guaranteed to come back in the order you submitted them, so correlate on custom_id rather than on position. While waiting for results, serve cached summaries or a “will be ready tomorrow” message to customers. When the batch completes, backfill the results.
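A rough sketch of building those request objects and matching results back by custom_id. The helper names and the simplified result lines are assumptions; real results carry a full message object, and the actual submission call (for example through the official SDK) is left out so the snippet runs on its own.

```python
import json

MODEL = "claude-opus-4-6"  # model ID as written in the request examples above


def build_batch_requests(tickets):
    """Turn (ticket_id, body) pairs into batch request dicts."""
    return [
        {
            "custom_id": ticket_id,
            "params": {
                "model": MODEL,
                "max_tokens": 1024,
                "messages": [
                    {"role": "user", "content": f"Summarize this support ticket: {body}"}
                ],
            },
        }
        for ticket_id, body in tickets
    ]


def index_results(result_lines):
    """Map custom_id -> parsed result, whatever order the lines arrive in."""
    return {r["custom_id"]: r for r in map(json.loads, result_lines)}


batch = build_batch_requests(
    [("ticket-4521", "Cannot log in."), ("ticket-4522", "Refund requested.")]
)

# Simplified, hypothetical result lines, deliberately out of order.
result_lines = [
    '{"custom_id": "ticket-4522", "result": {"type": "succeeded"}}',
    '{"custom_id": "ticket-4521", "result": {"type": "succeeded"}}',
]
by_id = index_results(result_lines)
```

Looking results up through by_id means the backfill step never has to care about ordering.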

Cost Example: Before and After

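Say you’re summarizing 200 support tickets daily at 800 tokens per summary (average input + output). Using an illustrative blended rate of $3/MTok (input and output tokens are priced differently, so check current Claude Opus 4.6 pricing for real figures), that’s 200 × 800 × 30 = 4.8M tokens per month, or roughly:

Real-time only: 4.8 MTok × $3/MTok ≈ $14.40/month
Batch only: 4.8 MTok × $1.50/MTok ≈ $7.20/month
Savings: $7.20/month or about $86/year

Add a hybrid model (20% real-time VIP, 80% batch) and you’re at roughly $8.64/month. Still 40% cheaper than pure real-time, with better customer experience for urgent cases. The dollar amounts are small at this volume, but the percentages hold as ticket counts and summary lengths grow.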
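If you want to rerun this kind of comparison with your own volumes, the arithmetic fits in a few lines. This is a sketch under stated assumptions: the function name, the 30-day month, and the single blended per-token rate are all illustrative, and only the 50% batch discount comes from the discussion above.

```python
def monthly_cost(tickets_per_day, tokens_per_ticket, usd_per_mtok,
                 realtime_share=1.0, batch_discount=0.5, days=30):
    """Estimate monthly spend when realtime_share of requests pay full price."""
    monthly_mtok = tickets_per_day * tokens_per_ticket * days / 1_000_000
    full_price = monthly_mtok * usd_per_mtok
    # Real-time requests pay full price; batched ones pay the discounted rate.
    blended = realtime_share + (1 - realtime_share) * (1 - batch_discount)
    return full_price * blended


realtime_only = monthly_cost(200, 800, 3.0)
batch_only = monthly_cost(200, 800, 3.0, realtime_share=0.0)
hybrid = monthly_cost(200, 800, 3.0, realtime_share=0.2)
hybrid_saving = 1 - hybrid / realtime_only
```

For real numbers, replace the blended rate with separate input and output rates from the current pricing page.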

Edge Cases and Fallbacks

One edge case worth handling: if a user marks a summary as urgent after it’s queued for batch, have a fallback to real-time processing. It costs more per request, but some customers justify it. The opposite can happen too. If batch processing is already queued and a user requests the same summary in real-time before the batch completes, check your cache first and avoid duplicate work.
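Both cases can go through one code path: check the cache, and only then pay for a real-time call. A minimal sketch, where resolve_summary, the dict cache, and the summarize_now callback are hypothetical stand-ins for your own storage and API wrapper:

```python
def resolve_summary(ticket_id, cache, queued_for_batch, summarize_now):
    """Serve from cache if possible; otherwise summarize now and dequeue."""
    if ticket_id in cache:
        return cache[ticket_id]          # already done: no duplicate API call
    summary = summarize_now(ticket_id)   # real-time fallback at full price
    cache[ticket_id] = summary
    queued_for_batch.discard(ticket_id)  # skip it when the next batch is built
    return summary


calls = []


def fake_summarize(ticket_id):
    """Stand-in for a real-time API call; records that it was invoked."""
    calls.append(ticket_id)
    return f"summary of {ticket_id}"


cache = {"ticket-4521": "cached summary"}
queued = {"ticket-4521", "ticket-4522"}

first = resolve_summary("ticket-4521", cache, queued, fake_summarize)
second = resolve_summary("ticket-4522", cache, queued, fake_summarize)
```

The dequeue step only helps for items not yet submitted. If the ticket is already inside a submitted batch, the simplest option is to let the batch result arrive and ignore it during backfill.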

Batch requests can fail, and requests that aren’t processed within the 24-hour window expire instead of running late, though that’s rare. Store your batch job IDs and build retry logic. When a batch ends with errored or expired requests, resubmit them or fall back to standard API for those requests.

Monitoring matters here. Track batch completion time, failure rates, and cache hit rates separately. If you see batch jobs regularly running close to the 24-hour limit or expiring, real-time API becomes more attractive despite the cost. Most teams don’t hit that ceiling unless submitting tens of thousands of requests per batch.
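The retry and monitoring pieces can share one pass over the results. In this sketch the result dicts are simplified and the summary key is a hypothetical stand-in for the text you would pull out of a real response; anything that did not succeed goes on the retry list.

```python
def partition_results(results):
    """Split results into summaries to store and custom_ids to retry."""
    done, retry = {}, []
    for item in results:
        outcome = item["result"]
        if outcome["type"] == "succeeded":
            done[item["custom_id"]] = outcome["summary"]
        else:  # e.g. errored or expired: resubmit or fall back to real-time
            retry.append(item["custom_id"])
    return done, retry


results = [
    {"custom_id": "ticket-4521",
     "result": {"type": "succeeded", "summary": "Login issue resolved."}},
    {"custom_id": "ticket-4522", "result": {"type": "expired"}},
    {"custom_id": "ticket-4523", "result": {"type": "errored"}},
]

done, retry = partition_results(results)
failure_rate = len(retry) / len(results)  # one of the metrics worth tracking
```

Logging failure_rate per batch gives you the trend line for deciding when batch stops being worth it.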

Bottom line: If summaries don’t need to be instant, use Claude’s batch API and cut costs by 50%. Real-time only for cases where latency actually matters. Hybrid setups pay for themselves quickly at scale, and the complexity is minimal if you structure the queue upfront.