Office Hours — How do I know when to stop prompt engineering and just upgrade my model?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
This one burns people constantly. You’ll tweak a prompt for weeks, add chain-of-thought reasoning, throw in examples, restructure your instructions, and you’ll squeeze out maybe 3-5% improvement. Then GPT-5.4 comes out and you get 15% better performance immediately with the same prompt.
The move is to establish a performance ceiling on your current model first. Build a small test set if you don't already have one, run it, and measure the actual gap between what you need and what you're getting. Then spend maybe one solid week on prompt work. If you're hitting diminishing returns (each new prompt iteration adds less than 1-2%), stop and consider upgrading.
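That stopping rule is easy to make mechanical. A minimal sketch, assuming you score each prompt version against the same fixed test set (the function name, threshold, and scores below are illustrative, not from any real eval harness):

```python
# Hypothetical stopping rule: halt prompt iteration once the last few
# versions each improved accuracy by less than a small threshold.
# `scores` are test-set accuracies from successive prompt versions.

def should_stop(scores, min_gain=0.02, window=2):
    """True when the last `window` iterations each gained less than
    `min_gain` (e.g. under 2 percentage points) over the previous one."""
    if len(scores) <= window:
        return False  # not enough iterations to judge a trend
    deltas = [scores[i] - scores[i - 1] for i in range(1, len(scores))]
    return all(d < min_gain for d in deltas[-window:])

print(should_stop([0.62, 0.68, 0.71]))                 # → False, still gaining
print(should_stop([0.62, 0.68, 0.71, 0.715, 0.717]))   # → True, plateaued
```

The exact threshold matters less than having one written down before you start, so "one more tweak" has a defined end.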
Also check whether your problem is actually the model or your setup. Bad retrieval, noisy input data, or a fundamentally unclear task description will waste both prompt engineering time and money on a better model. Use Claude Opus 4.6 or GPT-5.4 just once on your hardest test cases to see what’s theoretically possible. If those models still struggle, your issue isn’t prompting.
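The ceiling check itself is just "run the hardest cases through the strongest model and count." A sketch under assumptions: `call_model` and `grade` are stand-ins for whatever client and grading logic you actually use, and the example data is made up:

```python
# Hypothetical ceiling check: fraction of your hardest cases that the
# strongest available model gets right. A low score here suggests the
# problem is upstream (retrieval, data, task definition), not prompting.

def ceiling_check(hard_cases, call_model, grade):
    passed = sum(1 for case in hard_cases if grade(case, call_model(case["input"])))
    return passed / len(hard_cases)

# Fake stand-ins, purely to show the shape of the call:
cases = [{"input": "q1", "expected": "a1"},
         {"input": "q2", "expected": "a2"}]
fake_model = lambda q: {"q1": "a1", "q2": "wrong"}[q]
grade = lambda case, out: out == case["expected"]

print(ceiling_check(cases, fake_model, grade))  # → 0.5
```

If even the strongest model scores near your current one on these cases, spend your week on the pipeline, not the prompt.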
The financial case matters too. Upgrading from GPT-4.1 Nano to GPT-5.4 costs maybe 3-5x more per request. If you’re doing millions of inferences, that’s real money. But if prompt engineering takes your team three weeks for tiny gains while you’re leaving money on the table with quality issues, that’s also expensive.
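The back-of-envelope math is worth actually doing. A sketch with purely illustrative numbers (none of these are real published rates or real salary figures):

```python
# Illustrative cost comparison: upgrade premium vs. prompt-engineering time.
# All figures below are assumptions for the sake of the arithmetic.

cheap_cost_per_req = 0.0004   # hypothetical: small-model cost per request
big_cost_per_req   = 0.0016   # hypothetical: ~4x more, inside the 3-5x range
requests_per_month = 5_000_000

premium = (big_cost_per_req - cheap_cost_per_req) * requests_per_month
print(f"Upgrade premium: ${premium:,.0f}/month")  # → $6,000/month

weeks_of_prompt_work = 3
team_cost_per_week = 12_000   # hypothetical fully loaded team cost
print(f"Prompt-work cost: ${weeks_of_prompt_work * team_cost_per_week:,}")  # → $36,000
```

Run this with your real traffic and rates; the answer flips depending on volume, which is exactly why it's a per-project decision and not a blanket rule.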
Bottom line: Benchmark your current model first, spend one focused week on prompts, then make the call. Don’t drift into endless prompt tweaking hoping to avoid the upgrade cost.