Office Hours — We're getting inconsistent outputs from the same prompt with GPT-5.4. Temperature is locked at 0. What's actually going on?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
We’re getting inconsistent outputs from the same prompt with GPT-5.4. Temperature is locked at 0. What’s actually going on?
Temperature 0 gets you close to deterministic outputs, but it isn’t a guarantee, and you’re probably hitting one of three things. First, inference itself has some variance even at temperature 0. Greedy decoding always takes the highest-scoring token, but those scores come out of floating-point math on GPUs, where batching and kernel choices can change the order of operations and nudge the results slightly. OpenAI’s docs acknowledge this now, though it’s rare. Second, you might have different system prompts, different conversation histories, or slightly different input formatting between calls that you’re not noticing. Check your logs carefully, including whitespace and special characters.
Third, and most likely, you’re bumping into context window boundaries or the model is legitimately uncertain about the right answer and picking from a very tight distribution. At temperature 0 it still picks the highest-probability token, but if the top few tokens have nearly identical scores, a tiny difference in the input or in the inference arithmetic can shift which one wins.
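For the "slightly different input formatting" case, a small helper that names the first differing character between two logged prompts can save a lot of squinting. This is a minimal sketch using only the standard library; `first_difference` and `describe` are illustrative names, not part of any SDK.

```python
import unicodedata


def describe(ch):
    """Render a character as its code point plus Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}"


def first_difference(left, right):
    """Return (index, left_char, right_char) for the first mismatch, or None."""
    for index, (a, b) in enumerate(zip(left, right)):
        if a != b:
            return index, describe(a), describe(b)
    if len(left) != len(right):
        # One string is a prefix of the other; report where the shorter one ends.
        index = min(len(left), len(right))
        tail_left = describe(left[index]) if index < len(left) else "<end>"
        tail_right = describe(right[index]) if index < len(right) else "<end>"
        return index, tail_left, tail_right
    return None


# A regular space versus a non-breaking space looks identical in most log viewers.
print(first_difference("a b", "a\u00a0b"))
```

Run it on the raw prompt strings from two calls that disagreed. A result of `None` means the prompts really are character-for-character identical.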
Debugging in Practice
Start by logging everything at the point of the API call. This means the exact request body, headers, and response. Here’s what to capture:
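import hashlib
import json

# Placeholder: use the exact prompt string your application sends
prompt = "your prompt text here"

payload = {
    "model": "gpt-5.4",
    "messages": [{"role": "user", "content": prompt}],
    "temperature": 0,
    "max_tokens": 500,
}

# Hash the payload to detect even whitespace differences.
# sort_keys keeps the hash stable if the dict is built in a different key order.
serialized = json.dumps(payload, separators=(",", ":"), sort_keys=True)
payload_hash = hashlib.sha256(serialized.encode("utf-8")).hexdigest()
print(f"Payload hash: {payload_hash}")
print(f"Full payload:\n{json.dumps(payload, indent=2)}")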
If outputs vary across multiple calls with identical hashes, you’ve confirmed the variance is real. If the hashes differ, you’ve found your culprit, and the usual cause is conversation history leaking in from previous requests because you’re reusing a client instance or a message list. Some differences never show up in a payload hash, so log them separately: API keys with different rate limits or org settings (requests may be routed differently depending on the account) and different API endpoint regions.
Context and Confidence
Even at temperature 0, GPT-5.4 expresses uncertainty through token probability distributions. If the model sees an ambiguous input and the top two tokens have probabilities like 0.34 and 0.33, a tiny perturbation in floating-point math during inference can flip which one gets selected. This isn’t a bug, it’s the model doing its job.
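To make the floating-point point concrete, here is a toy illustration in plain Python. It is not how the model computes scores. It only shows that adding the same numbers in a different order can change the result by one bit, and that one bit is enough to flip a greedy pick when two candidates are nearly tied. The names `accumulate` and `pick_token` and all the numbers are made up for the demo.

```python
def accumulate(terms):
    """Sum floats strictly left to right, so the order of operations is explicit."""
    total = 0.0
    for term in terms:
        total += term
    return total


def pick_token(score_a, score_b):
    """Greedy choice between two candidates: the strictly higher score wins."""
    return "A" if score_a > score_b else "B"


# Token A's score is built from the same three contributions in two different orders.
contributions = [0.1, 0.2, 0.3]
score_b = 0.6  # token B's score, held fixed

forward = accumulate(contributions)
backward = accumulate(reversed(contributions))

print(forward == backward)             # the two sums differ in the last bit
print(pick_token(forward, score_b))    # one order puts A ahead
print(pick_token(backward, score_b))   # the other order does not
```

On real hardware the reordering comes from things like batch composition and parallel reductions rather than a reversed list, but the effect on a near-tie is the same.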
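You can detect this by checking the logprobs field in the response. It isn’t returned by default, so you have to request it (in the Chat Completions API that’s the logprobs and top_logprobs parameters, assuming they’re available for your model). If the top token’s log probability is close to the runner-up (within 0.5 nats), you’re in a high-uncertainty zone. For scale, the 0.34 versus 0.33 case above is a gap of only about 0.03 nats. The fix isn’t to adjust temperature (it’s already at 0), it’s to make your prompt more specific or constrain the output space.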
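The check can be sketched as a small function over the per-token logprob data. The input shape below (a list of entries, each with a `token` and a `top_logprobs` list of alternatives) mirrors what the Chat Completions API returns when logprobs are requested, but treat that as an assumption and adapt it to whatever your SDK hands back. `near_tie_positions` is a hypothetical helper name, and the sample data is invented.

```python
import math


def near_tie_positions(token_entries, margin_nats=0.5):
    """Flag positions where the best and second-best tokens are within margin_nats."""
    flagged = []
    for index, entry in enumerate(token_entries):
        ranked = sorted((alt["logprob"] for alt in entry["top_logprobs"]), reverse=True)
        if len(ranked) < 2:
            continue  # nothing to compare against
        gap = ranked[0] - ranked[1]
        if gap < margin_nats:
            flagged.append((index, entry["token"], gap))
    return flagged


# Invented sample: position 0 is a near-tie, position 1 is a confident pick.
sample = [
    {"token": "Yes", "top_logprobs": [
        {"token": "Yes", "logprob": math.log(0.34)},
        {"token": "No", "logprob": math.log(0.33)},
    ]},
    {"token": ".", "top_logprobs": [
        {"token": ".", "logprob": math.log(0.98)},
        {"token": "!", "logprob": math.log(0.01)},
    ]},
]

for index, token, gap in near_tie_positions(sample):
    print(f"position {index}: {token!r} leads by only {gap:.3f} nats")
```

Positions that get flagged early in the output matter most, because a flip there changes everything the model generates afterward.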
Running the Controlled Test
Make 50 identical API calls with the exact same input and log everything. If you’re still getting variance, file it with OpenAI support. If it disappears, your inputs weren’t as identical as you thought.
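A minimal harness for that test might look like the following. `call_model` is a hypothetical zero-argument function that you would write to send your fixed payload and return the response text. The stub at the bottom only stands in for it so the sketch runs without network access.

```python
from collections import Counter


def run_controlled_test(call_model, n_calls=50):
    """Call the model n_calls times and count each distinct output."""
    outputs = Counter()
    for _ in range(n_calls):
        outputs[call_model()] += 1
    return outputs


def report(outputs):
    """Print one line per distinct output, most common first."""
    total = sum(outputs.values())
    print(f"{len(outputs)} distinct output(s) across {total} calls")
    for text, count in outputs.most_common():
        print(f"{count:>4}  {text!r}")


# Stub standing in for the real API call: 49 identical answers and one outlier.
canned = iter(["yes"] * 49 + ["Yes"])
counts = run_controlled_test(lambda: next(canned), n_calls=50)
report(counts)
```

Store the payload hash next to each output so that any outlier can be tied back to the exact request that produced it.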
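If variance persists across truly identical requests, consider whether you need determinism at all. Many production systems that think they need bit-identical output at temperature 0 actually need consistency within a reasonable error band. That is usually cheaper to get from a deterministic post-processing step (normalize whitespace and case, parse into a fixed schema, map onto a closed set of answers) than from chasing identical tokens, and once that step is in place, a temperature of 0.1 or 0.2 behaves about as well in practice.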
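As an example of what a deterministic post-processing step can look like, here is a sketch that maps free-form model output onto a closed label set. The labels and the `normalize_label` name are hypothetical. The idea is that "Positive." and "positive" count as the same answer downstream, and anything ambiguous is rejected instead of guessed.

```python
import re

# Hypothetical closed set of answers the application accepts.
ALLOWED_LABELS = ("positive", "negative", "neutral")


def normalize_label(raw_output):
    """Return the single allowed label found in the output, or None if unclear."""
    # Lowercase, then turn every run of non-letters into a space before splitting.
    words = re.sub(r"[^a-z]+", " ", raw_output.lower()).split()
    matches = [label for label in ALLOWED_LABELS if label in words]
    return matches[0] if len(matches) == 1 else None


print(normalize_label("Positive."))
print(normalize_label(" The sentiment is NEGATIVE\n"))
print(normalize_label("positive or negative"))  # ambiguous, so rejected
```

Returning `None` for ambiguous output gives you a clean hook for a retry or a fallback, which is often all the "determinism" a pipeline really needs.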
Bottom line: Lock down your inputs first (logs, whitespace, everything, including API context), run a controlled test to confirm the variance is real, then escalate if it persists. Check logprobs to see if you’re hitting a high-uncertainty zone. If you are, the answer is a better prompt, not a model bug.