Office Hours — Is it better to improve the harness around the LLM or wait for a better model?

Is it better to improve the harness around the LLM or wait for a better model?

Improve the harness. Waiting is a trap that kills shipped products.

Here’s why: model upgrades happen unpredictably, take months to integrate into production safely, and you can’t control them. Your harness, though? That’s leverage you own right now. Better retrieval, smarter prompting, structured outputs, retry logic with fallbacks, grounding against your actual data, prompt caching to reduce latency and cost, tool use design that actually fits your domain. These compound.

I’ve watched teams sit waiting for the next frontier model while their competitors shipped Claude Opus 4.7 with five layers of careful prompt engineering and won. The Opus team didn’t need a better base model. They needed better plumbing.

The Cost Math of Waiting vs. Shipping

Let’s ground this in actual numbers. Say you’re building a code review agent. Claude Opus 4.7 costs $15 per million input tokens, $75 per million output tokens. GPT-5.4 runs $3/$12. If your harness is sloppy—unnecessary context in prompts, no caching, redundant API calls—you’ll burn through tokens fast enough that model choice matters less than you think.

A team I know was paying $8k/month on Opus calls for a document extraction system. Three weeks of harness work: better chunking, fewer redundant passes, prompt caching on repeated schemas, structured output that eliminated post-processing retries. Cost dropped to $1.2k/month on the same model. No model upgrade needed. That’s $82k/year in real margin.

Here’s what that looked like in their prompt abstraction layer:

# Before: unstructured, repetitive calls
def extract_invoice(doc_text):
    response = client.messages.create(
        model="claude-opus-4-7",
        messages=[{"role": "user", "content": full_doc + extraction_prompt}]
    )
    return parse_text_response(response)

# After: cached schema, structured output
def extract_invoice(doc_text):
    response = client.messages.create(
        model="claude-opus-4-7",
        system=[
            {"type": "text", "text": EXTRACTION_SYSTEM},
            {"type": "text", "text": INVOICE_SCHEMA, "cache_control": {"type": "ephemeral"}}
        ],
        messages=[{"role": "user", "content": doc_text}],
        structured_output={
            "type": "json_schema",
            "json_schema": INVOICE_JSON_SCHEMA
        }
    )
    return response.content[0].parsed

The second version caches the schema across requests, uses structured output to eliminate parsing logic, and cuts token spend per document by 60%. Same model. Different leverage.

When you eventually do upgrade models, good harness design is portable. A system built around structured outputs and explicit fallback chains works identically on GPT-5.5, Claude Opus 4.7, or Gemini 3.1 Pro. Your competitive edge doesn’t vanish the moment a new model lands.

When Model Releases Actually Matter

That said, don’t ignore releases entirely. When GPT-5.5 or Claude Opus 4.8 drops, run a one-week eval against your harness. Spend a day testing. But don’t pause shipping to rewrite everything.

Watch for specific capability jumps that map to your actual constraints: vision resolution improvements, context window increases, or performance gains on your exact task. Claude Opus 4.7 hit 64.3% on SWE-bench Pro, which matters if you’re building autonomous coding agents where that 2-3% difference compounds across thousands of tasks. It doesn’t matter if you’re doing simple classification.

The trap is confusing “better model” with “better for my product.” Frontier models improve on public benchmarks continuously. Your product improves when those improvements unlock new features or close specific failures you’re already seeing in production. Know the difference.

Build for Model Agnosticism

The real play is this: build your harness so it’s model-agnostic. Swap inference engines with a flag. Inject the model name as a variable, not hardcoded into prompts and retry logic. Then you get both worlds. Better prompts and tools work on any frontier model. When a new one lands, you gain the improvement for free without rearchitecting.

Use abstraction layers for provider calls. A simple wrapper that routes to OpenAI, Anthropic, or Google behind a config flag means your test suite runs against multiple models in parallel. You’ll discover compatibility issues before production. This also lets you A/B test Claude Opus 4.7 against GPT-5.4 without reshuffling your inference code.

The Edge Case: Deprecated Models

One scenario where you do need to move: when your current model hits end-of-life. GPT-4 and GPT-4o were retired in February 2026. If you were running on GPT-4o in January 2026, you had months to migrate, but migration was mandatory, not optional. Plan for this. Don’t assume any current model ships forever. But “eventual migration” is not the same as “wait for better models before shipping.”

Build deprecation warnings into your config layer. Log which models are in use. Set calendar reminders for announced end-of-life dates. Treat model migrations as infrastructure maintenance, not as pauses waiting for the next big release.

Bottom line: Ship a solid harness today over a better model tomorrow. Once you’re live, model upgrades become force multipliers instead of blockers. Your harness compounds; waiting doesn’t.

Question via Hacker News