Office Hours — How are you structuring context and prompts for AI coding agents to get reliable results?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

2026-05-09

Daily: one question from the trenches, one opinionated answer.

How are you structuring context and prompts for AI coding agents to get reliable results?

The gap between “agent works in my notebook” and “agent ships production code” is almost entirely a context and prompt problem. Most failures aren’t the model’s fault—they’re poor signal, missing constraints, or ambiguous success criteria. Here’s what actually matters.

Start with a Clear Success Definition, Not Just a Task Description

The single biggest mistake is treating the agent’s goal like you’d describe a feature to a colleague over Slack. “Build a login endpoint” is vague. An agent needs an explicit, verifiable success condition.

Bad: “Write a login endpoint that handles authentication.”

Good: “Write a POST /auth/login endpoint that (1) accepts JSON with email and password, (2) validates against bcrypt hashes in the users table, (3) returns a JWT token on success, (4) returns 401 with message ‘Invalid credentials’ on failure, (5) includes all necessary imports, (6) has no console.log statements, (7) passes the test file at tests/auth.test.ts.”

The second version gives Claude Code or GPT-5.5 something to measure itself against. Agents iterate well when they have a binary pass/fail signal. Without it, they guess.
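The "binary pass/fail signal" can even be mechanized. A minimal sketch of contract checks derived from the criteria above (the response shapes here are my assumption from the prompt, not a real API):

```typescript
// Hypothetical response shape for POST /auth/login, inferred from the criteria
interface LoginResult {
  status: number;
  body: { token?: string; message?: string };
}

// Binary pass/fail: does a success response carry a token per criterion (3)?
function meetsSuccessContract(res: LoginResult): boolean {
  return (
    res.status === 200 &&
    typeof res.body.token === "string" &&
    res.body.token.length > 0
  );
}

// Binary pass/fail: does a failure response match criterion (4) exactly?
function meetsFailureContract(res: LoginResult): boolean {
  return res.status === 401 && res.body.message === "Invalid credentials";
}
```

An agent that can run checks like these against its own output stops guessing and starts converging.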

Context Architecture: Separate by Concern

Agents work best when context is layered, not dumped. Structure your system prompt like this:

Layer 1: Role and constraints (what the agent is, what it can’t do)

Layer 2: Technical constraints (language, frameworks, patterns you care about)

Layer 3: Codebase knowledge (file structure, key modules, naming conventions)

Layer 4: The specific task (what you want right now)

Layer 5: Examples (one or two concrete examples of code you’d accept)

Here’s a real template structure for Claude Code:

You are a software engineer with read-only access to a codebase.
Your job is to write new code that fits the existing patterns.

CONSTRAINTS:
- Do not modify existing files
- Do not use console.log or debugger
- Do not introduce new npm packages without asking
- All code must pass linting with eslint
- All code must compile under TypeScript strict mode

CODEBASE PATTERNS:
- Use async/await, never callbacks
- All database queries use the QueryBuilder in lib/db.ts
- All endpoints return { data: T } or { error: string } (never null)
- Services in /services, controllers in /api/routes

STRUCTURE:
/src
  /services (business logic)
  /api/routes (endpoints)
  /lib (shared utilities)
  /types (TypeScript interfaces)

THE TASK:
Create a service function at src/services/userService.ts that:
1. Exports async getUserById(id: string): Promise<User>
2. Uses QueryBuilder to fetch from users table
3. Returns null if user not found (not undefined)
4. Throws if database connection fails
5. Logs query time to a metrics collector

EXAMPLE OF CODE STYLE YOU LIKE:
[paste 20 lines of a file that represents your style]

ACCEPTANCE CRITERIA:
- File exists at correct path
- Function signature matches exactly
- No type errors with tsc --noEmit
- Tests in tests/userService.test.ts pass

That’s a few hundred tokens, but it saves hours of iteration because the agent has a map, not fog.
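If you assemble prompts programmatically rather than by hand, the five layers reduce to labeled sections joined in order. A minimal sketch (the section names mirror the template above; all contents are placeholders):

```typescript
// Layered prompt assembly: each layer is a labeled section, joined in order.
interface PromptLayers {
  role: string;        // Layer 1: role and constraints
  constraints: string; // Layer 2: technical constraints
  codebase: string;    // Layer 3: codebase knowledge
  task: string;        // Layer 4: the specific task
  examples: string;    // Layer 5: concrete examples
}

function buildPrompt(l: PromptLayers): string {
  return [
    l.role,
    "CONSTRAINTS:\n" + l.constraints,
    "CODEBASE PATTERNS:\n" + l.codebase,
    "THE TASK:\n" + l.task,
    "EXAMPLE OF CODE STYLE YOU LIKE:\n" + l.examples,
  ].join("\n\n");
}
```

Keeping the layers as separate fields also means you can swap the task while reusing the stable layers across calls.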

Handle Ambiguity with Explicit Branching, Not Open-Ended Thinking

When Claude Thinking mode is engaged, agents sometimes spend reasoning tokens on questions you could have answered upfront. Pre-emptively answer the hard questions:

Instead of: “Write code to handle user authentication.”

Ask: “Write code to handle user authentication. Assume: (1) passwords are bcrypt hashes stored in the users table, (2) tokens should expire in 24 hours, (3) refresh tokens go in a separate table with cascade delete, (4) we’re NOT implementing OAuth or 2FA, (5) failed login attempts don’t rate-limit (we’ll add that later).”

The second version cuts the agent’s decision tree from 15 branches to 1.

Provide Negative Examples, Not Just Positive Ones

One example of what you want is good. One example of what you don’t want is better.

ACCEPTABLE PATTERN:
async function getUserById(id: string) {
  const user = await db.select('*').from('users').where({ id });
  return user || null;
}

NOT ACCEPTABLE (even though it works):
async function getUserById(id: string) {
  try {
    const user = await db.query(`SELECT * FROM users WHERE id = ?`, [id]);
    if (!user) return null;
    return user;
  } catch (e) {
    console.error(e);
    throw e;
  }
}
// Don't: use raw SQL strings, add console.log, re-throw without transformation

Agents learn faster from contrasts than from abstract rules.

Give the Agent Access to Test Output, Not Just Task Description

This is the key reliability multiplier. Structure your prompt so the agent sees its own test failures:

TASK: Fix src/services/authService.ts

TEST OUTPUT:
✓ loginUser returns token on valid credentials
✗ loginUser returns 401 on invalid password
  Expected: { error: "Invalid credentials" }
  Received: { error: "Unauthorized" }
✗ loginUser throws on database connection failure
  Expected: Error to be thrown
  Received: { error: "Database error" } (returned, not thrown)

FIX THESE TWO FAILURES.

When the agent sees concrete test failures, it doesn’t guess—it debugs. This is vastly more reliable than “write a login function.” In my experience, test-driven agents are 3–5x more consistent than task-driven ones.
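The loop that makes this work can be as small as: run tests, paste failures, repeat. A sketch with hypothetical `runTests` and `callAgent` stubs (neither is a real API; wire them to your test runner and agent of choice):

```typescript
interface TestResult {
  passed: boolean;
  output: string; // raw test-runner output, failures included
}

// Feed the agent its own test failures until green or out of attempts.
async function fixUntilGreen(
  task: string,
  runTests: () => Promise<TestResult>,
  callAgent: (prompt: string) => Promise<void>,
  maxAttempts = 3,
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const result = await runTests();
    if (result.passed) return true;
    // The agent sees concrete failures, not just a task description.
    await callAgent(
      `TASK: ${task}\n\nTEST OUTPUT:\n${result.output}\n\nFIX THESE FAILURES.`,
    );
  }
  return (await runTests()).passed;
}
```

Note the attempt cap: it doubles as the effort cap discussed below, so a stuck agent surfaces instead of spinning.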

Size the Context Window for One Iteration, Not the Whole Codebase

Agents fail spectacularly when you dump 50KB of code context hoping they’ll “understand the whole system.” They don’t integrate that data well; they pattern-match locally and miss systemic patterns.

Instead, give them: (1) the file they’re editing, (2) interfaces/types they need, (3) one reference file showing the pattern, (4) tests, (5) the specific failure. That’s typically 5–15KB and laser-focused.
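A quick way to enforce that size discipline is to assemble the bundle and check its byte count before sending anything. A sketch (the file map, header format, and 15KB limit are illustrative):

```typescript
// Build a focused context bundle from a handful of files and cap its size.
function buildContext(
  files: Record<string, string>, // path -> file contents
  maxBytes = 15_000,             // roughly the 5–15KB target above
): string {
  const bundle = Object.entries(files)
    .map(([path, content]) => `// FILE: ${path}\n${content}`)
    .join("\n\n");
  const size = new TextEncoder().encode(bundle).length;
  if (size > maxBytes) {
    throw new Error(`Context is ${size} bytes; trim reference files before sending.`);
  }
  return bundle;
}
```

Failing loudly here is deliberate: an oversized bundle is a bug in your context selection, not something to silently truncate.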

For multi-file tasks (e.g., “create a feature across three files”), break it into sequential agent calls with explicit handoffs. Agents are better at “write this file given these three fixed files” than “orchestrate changes across four files simultaneously.”
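In practice a sequential handoff means each call receives the earlier outputs as fixed inputs. A sketch with a hypothetical `callAgent` stub (paths, prompts, and the `Report` feature are illustrative):

```typescript
// Three sequential agent calls; each later call treats earlier output as fixed.
async function createFeature(
  callAgent: (prompt: string) => Promise<string>,
): Promise<{ types: string; service: string; route: string }> {
  const types = await callAgent(
    "Write src/types/report.ts defining the Report interface.",
  );
  const service = await callAgent(
    `Given these FIXED types:\n${types}\n\nWrite src/services/reportService.ts.`,
  );
  const route = await callAgent(
    `Given these FIXED files:\n${types}\n\n${service}\n\nWrite src/api/routes/reports.ts.`,
  );
  return { types, service, route };
}
```

Each step is a "write this file given these fixed files" problem, which is exactly the shape agents handle well.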

Set Token Budgets and Failure Modes Upfront

Tell the agent its limits:

TOKEN BUDGET: 30,000 tokens total for this task.
EFFORT CAP: If you're iterating more than 3 times, stop and ask for clarification.
FAILURE MODE: If you can't pass tests in 3 iterations, output the most likely blocker.

This prevents agents from spinning. Claude Opus 4.6 respects hard constraints; GPT-5.5 Thinking sometimes burns thinking tokens on recursive debugging. Being explicit cuts waste.

Bottom line:

Structure context by separating role, constraints, codebase patterns, and specific task; always include concrete acceptance criteria and test output; give negative examples alongside positive ones; and keep the context window focused on one iteration rather than the whole codebase. Agents are reliable when success is verifiable and ambiguity is closed before the model starts thinking.

Question via Hacker News