Office Hours — What workflow and practices actually worked when building a large-scale AI-assisted application (126K lines of code)?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
Building 126K lines of production code with AI assistance is a different beast from toy examples. The question isn’t whether AI can write code; it’s how to maintain velocity, catch errors before they cascade, and keep human judgment in the loop when it matters. Here’s what actually worked.
Separate the AI-write zones from the human-write zones
The single biggest mistake we made early was treating the codebase as uniformly suitable for AI generation. We ended up with AI writing the boring parts (data mappers, boilerplate handler chains) and humans writing the decision-bearing parts (business logic, architectural boundaries, error recovery). This split happened naturally after the first month of refactoring hell.
For the data layer, we had AI generate 8K lines of type-safe database adapters from schema definitions. We reviewed the first ten functions obsessively, established patterns, then let it run. For the API handler layer, AI wrote the routing stubs and middleware chains. For the core business logic (pricing calculations, state transitions, permission checks), we kept humans in the loop and used AI as an accelerator, not the author.
The key insight: AI writes well when there’s a clear template and low consequence for minor inconsistencies. It struggles when the code needs to embody judgment calls or architectural philosophy.
Use deterministic tests as guardrails, not security theater
We set up a CI pipeline where every AI-generated PR had to pass a battery of checks before it reached human review. Static analysis (ESLint, type checking), unit tests, and integration tests ran automatically. The 126K-line codebase had about 8K lines of tests, heavily weighted toward the AI-write zones. Strict requirements here (100% coverage in adapters, all critical paths exercised) meant that AI-generated code which compiled cleanly but was subtly wrong rarely made it past CI.
```typescript
// Test example: we enforced this pattern for all database adapters.
// `fetchRawData` and `userAdapter` are defined elsewhere in the test module.
test('adapter converts API response to domain model', () => {
  const raw = fetchRawData();
  const adapted = userAdapter.toDomain(raw);
  // Assert on every field, not just the happy path
  expect(adapted).toHaveProperty('id');
  expect(adapted).toHaveProperty('email');
  expect(adapted.createdAt).toBeInstanceOf(Date);
  // AI-generated adapters consistently respected this structure
});
```
When AI-generated code fails a test, the AI usually fixes the code without human intervention. The dangerous case is a test that passes while the code is still wrong (a gap in your test suite): you only discover that during code review or in production.
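The strict coverage requirement for adapters can be enforced by the test runner itself rather than by convention. A minimal sketch, assuming Jest is the runner (the test above uses Jest-style `test` and `expect`) and that adapters live under a hypothetical `src/adapters/` directory:

```typescript
// jest.config.ts (sketch; the directory name is an assumption).
// The object can be annotated with Jest's exported `Config` type if desired.
const config = {
  collectCoverage: true,
  coverageThreshold: {
    // Hypothetical adapter directory: the run fails if any
    // line, branch, or function in it is left uncovered.
    './src/adapters/': {
      branches: 100,
      functions: 100,
      lines: 100,
      statements: 100,
    },
  },
};

export default config;
```

With a per-path threshold like this, a generated adapter that ships without matching tests fails the run before any reviewer sees it.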
Establish a code review rhythm that scales
With the volume of AI commits, traditional line-by-line review dies immediately. We moved to a sampling plus risk-based approach. Every change went through automated checks. On top of that, every change touching a high-risk zone got human review: business logic, permission checks, database migrations, and anything that touches user data or financial transactions. Changes in pure utility zones (formatters, validators, adapters after the first few) got spot-checked weekly.
The breakthrough was using code ownership files and routing PRs to the right people. When AI touched a zone owned by someone, that person got a notification. They could either approve quickly (because patterns were established) or dig deeper if something felt off.
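The routing rule can be sketched as a small function that maps changed files to a review tier and a set of owners to notify. This is only an illustration: the path prefixes, owner handles, and the choice to treat unknown paths as high-risk are all assumptions, and a real setup would read zones from the repository's ownership file instead of a hard-coded list.

```typescript
type ReviewTier = 'human-review' | 'spot-check';

interface Zone {
  prefix: string; // hypothetical path prefix
  owner: string;  // hypothetical owner handle
  tier: ReviewTier;
}

// Hypothetical zone map, mirroring the high-risk and utility zones in the text.
const zones: Zone[] = [
  { prefix: 'src/billing/', owner: '@payments-team', tier: 'human-review' },
  { prefix: 'src/auth/', owner: '@security-team', tier: 'human-review' },
  { prefix: 'db/migrations/', owner: '@data-team', tier: 'human-review' },
  { prefix: 'src/adapters/', owner: '@data-team', tier: 'spot-check' },
  { prefix: 'src/utils/', owner: '@platform-team', tier: 'spot-check' },
];

interface ReviewPlan {
  tier: ReviewTier;
  notify: string[];
}

// A PR takes the strictest tier among the files it touches.
// Files outside every known zone are escalated (an assumption, chosen to fail safe).
function planReview(changedFiles: string[]): ReviewPlan {
  const notify = new Set<string>();
  let tier: ReviewTier = 'spot-check';
  for (const file of changedFiles) {
    const zone = zones.find((z) => file.startsWith(z.prefix));
    if (zone === undefined) {
      tier = 'human-review';
      continue;
    }
    notify.add(zone.owner);
    if (zone.tier === 'human-review') {
      tier = 'human-review';
    }
  }
  return { tier, notify: Array.from(notify).sort() };
}
```

A PR that touches both an adapter and a billing file is escalated to full human review and notifies both owners; a PR that only touches utilities stays in the weekly spot-check pool.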
Keep a prompt registry, not a prompt notebook
We stopped pasting prompts into Slack and started versioning them. Every major function type (database adapter, API handler, test generator) had a single, refined prompt stored in a prompts/ directory alongside the code it generated. When we upgraded models (we moved from GPT-5.3 Codex to GPT-5.5 in March), we re-ran the entire suite of prompts and compared output quality. Roughly 30% of our prompts needed tweaking for the new model.
The registry approach meant onboarding new team members was concrete: “Here’s the prompt that generates adapters. This is what good output looks like. This is what we’ve seen go wrong.” It also forced us to document what we were actually asking for.
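One way to picture a registry entry is as a small typed record: the template, a version that must increase on every change, and notes on known failure modes. The sketch below keeps entries in memory for brevity; the field names, the `{{placeholder}}` syntax, and the sample prompt text are all hypothetical, and in practice the entries would be loaded from the files in prompts/.

```typescript
interface PromptEntry {
  id: string;              // e.g. 'database-adapter'
  version: number;         // bumped whenever the template changes
  template: string;        // uses {{placeholder}} slots (assumed syntax)
  knownFailures: string[]; // what reviewers have seen go wrong
}

const registry = new Map<string, PromptEntry>();

// Registering an entry with a stale or repeated version is rejected,
// so every template change is visible as a version bump.
function register(entry: PromptEntry): void {
  const existing = registry.get(entry.id);
  if (existing !== undefined && entry.version <= existing.version) {
    throw new Error(`${entry.id}: version must increase (have v${existing.version})`);
  }
  registry.set(entry.id, entry);
}

// Fill the template; a missing variable is an error, not a silent blank.
function render(id: string, vars: Record<string, string>): string {
  const entry = registry.get(id);
  if (entry === undefined) {
    throw new Error(`unknown prompt: ${id}`);
  }
  return entry.template.replace(/\{\{(\w+)\}\}/g, (_match: string, name: string) => {
    const value = vars[name];
    if (value === undefined) {
      throw new Error(`${id} v${entry.version}: missing variable "${name}"`);
    }
    return value;
  });
}

// Hypothetical entry for illustration only.
register({
  id: 'database-adapter',
  version: 1,
  template: 'Generate a type-safe adapter for the table {{table}}.',
  knownFailures: ['maps nullable columns to non-nullable fields'],
});
```

Keeping `knownFailures` next to the template is what makes the onboarding conversation concrete: the prompt, the expected output, and the failure history travel together.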
Use caching + selective model upgrading
We run GPT-5.5 for the creative work (new types of components, tricky refactoring) and GPT-5.4 Codex for repetitive generation (test stubs, migrations). The cost difference is meaningful at 126K lines. We also implemented a request caching layer, so when a request for the exact same boilerplate comes up twice (which happens), we hit the cache instead of re-querying the model.
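A caching layer of this kind can be sketched as a wrapper around whatever function calls the model. The sketch assumes the cache key is the exact combination of model, prompt, prompt version, and input; the request shape and the `withCache` name are hypothetical, not the team's actual implementation.

```typescript
interface GenerationRequest {
  model: string;
  promptId: string;
  promptVersion: number;
  input: string; // e.g. the schema definition being adapted
}

type Generate = (req: GenerationRequest) => Promise<string>;

// Wraps a generator so identical requests share one model call.
// Caching the promise (not the resolved value) also deduplicates
// requests that arrive while the first call is still in flight.
function withCache(generate: Generate): Generate {
  const cache = new Map<string, Promise<string>>();
  return (req: GenerationRequest): Promise<string> => {
    const key = JSON.stringify([req.model, req.promptId, req.promptVersion, req.input]);
    const hit = cache.get(key);
    if (hit !== undefined) {
      return hit;
    }
    const pending = generate(req).catch((err: unknown) => {
      cache.delete(key); // never cache a failed call
      throw err;
    });
    cache.set(key, pending);
    return pending;
  };
}
```

Including the prompt version in the key matters: bumping a prompt invalidates its cached output automatically, so a refined prompt never serves stale boilerplate.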
More broadly, we didn’t upgrade the model every time a new version dropped. We A/B tested GPT-5.5 against GPT-5.4 on a batch of work (50 database adapters written with each), scored the output quality, and decided whether the 3-4x cost increase was worth it. For most adapter generation, it wasn’t. For refactoring legacy code, GPT-5.5 Thinking’s extra reasoning time paid for itself in fewer bugs.
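The upgrade decision reduces to simple arithmetic once rejected outputs are given a cost. The sketch below is one possible scoring scheme, not the one the team used: every number is an invented placeholder (only the roughly 3.5x price ratio echoes the text), and `reworkCostPerReject` is a hypothetical stand-in for the engineering time a bad output burns.

```typescript
interface BatchResult {
  model: string;
  attempts: number;  // e.g. 50 adapters per model
  accepted: number;  // outputs that passed tests and review
  totalCost: number; // model spend for the batch, in any consistent unit
}

// Model spend plus the cost of fixing or redoing every rejected output.
function effectiveCost(r: BatchResult, reworkCostPerReject: number): number {
  const rejected = r.attempts - r.accepted;
  return r.totalCost + rejected * reworkCostPerReject;
}

// Picks the cheaper model for a task type; ties go to the first argument.
function pickModel(a: BatchResult, b: BatchResult, reworkCostPerReject: number): BatchResult {
  return effectiveCost(a, reworkCostPerReject) <= effectiveCost(b, reworkCostPerReject) ? a : b;
}

// Placeholder numbers for illustration only.
const baseline: BatchResult = { model: 'baseline', attempts: 50, accepted: 45, totalCost: 10 };
const candidate: BatchResult = { model: 'candidate', attempts: 50, accepted: 48, totalCost: 35 };

// Cheap rework (2 units per reject): 10 + 5*2 = 20 vs 35 + 2*2 = 39.
const forAdapters = pickModel(baseline, candidate, 2);
// Expensive rework (20 units per reject): 10 + 5*20 = 110 vs 35 + 2*20 = 75.
const forRefactoring = pickModel(baseline, candidate, 20);
```

With these placeholder figures the cheaper model wins when a bad output is easy to fix, and the pricier model wins once each failure is expensive, which matches the split between adapter generation and legacy refactoring described above.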
Establish hard boundaries on what the AI can commit
We use GitHub Copilot multi-model (GPT-5.4, Claude Sonnet 4.6) for developer IDE suggestions. That’s real-time, low-stakes. But for the actual commit pipeline (where code gets written to main), we route through a purpose-built agent that runs on Claude Opus 4.8. The IDE is fast and loose. The committer is conservative.
This separation matters because an IDE suggestion that’s 70% right is useful. A commit that’s 70% right introduces technical debt. We found that forcing explicit staging (write with the IDE, review, stage the change, run the agent-committer separately) introduced just enough friction that people caught about 40% more issues before they landed in main.
Monitor hallucination clusters by type
Not all AI mistakes are created equal. We logged every instance where AI-generated code passed tests but we caught an issue in review, or code shipped and caused a bug. We categorized them: missing edge cases, wrong data type, off-by-one error, misunderstood requirement. Over 3 months, we noticed the distribution shifted. Early on, most issues were “wrong data type” (AI flipped a field, used the wrong API). Later, most were “missing edge case” (the code was correct for the happy path but didn’t handle nulls or empty collections).
This meant we could refine prompts and tests to focus on the failure modes we were actually seeing. We added explicit guards for “what happens when this field is null?” We added property-based tests for collections. The bug rate dropped measurably after each iteration.
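Spotting a shift like this only requires a structured log and a tally per period. A minimal sketch, assuming each incident is recorded with a period label, one of the four categories from the text, and where it was caught; the record shape and function names are hypothetical.

```typescript
type FailureKind =
  | 'missing-edge-case'
  | 'wrong-data-type'
  | 'off-by-one'
  | 'misunderstood-requirement';

interface FailureRecord {
  period: string; // any label that groups incidents, e.g. a month
  kind: FailureKind;
  caughtIn: 'review' | 'production';
}

// Counts incidents per category within each period.
function distributionByPeriod(records: FailureRecord[]): Map<string, Map<FailureKind, number>> {
  const out = new Map<string, Map<FailureKind, number>>();
  for (const r of records) {
    const counts = out.get(r.period) ?? new Map<FailureKind, number>();
    counts.set(r.kind, (counts.get(r.kind) ?? 0) + 1);
    out.set(r.period, counts);
  }
  return out;
}

// Returns the most frequent category, or undefined for an empty tally.
// On a tie, the category logged first wins.
function dominantKind(counts: Map<FailureKind, number>): FailureKind | undefined {
  let best: FailureKind | undefined = undefined;
  let bestCount = 0;
  counts.forEach((count, kind) => {
    if (count > bestCount) {
      best = kind;
      bestCount = count;
    }
  });
  return best;
}
```

Comparing `dominantKind` across periods is what tells you where the next round of prompt and test refinement should go.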
Human review is still the limiting reagent
We have 6 engineers. At peak, the AI was generating 2-3K lines per day. We could only review maybe 1K of that deeply. The solution was not to hire more reviewers, but to be ruthless about what actually needed deep review. We accepted that 50% of the code would get spot-checked or only linted. But the 10% that touches business logic, the 10% that touches data contracts, the 5% that touches security—that all got real human attention.
This worked because the AI generated code predictably. Once we saw a pattern work for database adapters 10 times in a row, we trusted it to keep working (with sampling). We built up that trust gradually, not all at once.
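The trust-building rule can be written down as a streak counter per pattern. In the sketch below, the threshold of ten comes from the text, but the 1-in-5 sampling rate, the reset-on-any-issue policy, and all names are assumptions made for illustration.

```typescript
interface PatternTrust {
  cleanStreak: number; // consecutive reviewed changes with no issue found
}

const TRUST_THRESHOLD = 10; // from the text: ten clean results in a row
const SAMPLE_EVERY = 5;     // assumption: review 1 in 5 once trusted

// Any issue found in review resets the streak, so trust is rebuilt from zero.
function recordReview(state: PatternTrust, foundIssue: boolean): PatternTrust {
  return { cleanStreak: foundIssue ? 0 : state.cleanStreak + 1 };
}

// Until the pattern has earned trust, every change is reviewed.
// After that, changes are sampled deterministically by their sequence number.
function needsReview(state: PatternTrust, changeIndex: number): boolean {
  if (state.cleanStreak < TRUST_THRESHOLD) {
    return true;
  }
  return changeIndex % SAMPLE_EVERY === 0;
}
```

Deterministic sampling keeps the example easy to follow; a random sample at the same rate would serve equally well in practice.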
Bottom line: AI works best as a code velocity multiplier in zones with clear templates and strong tests, not as a replacement for architectural judgment. Invest in rigorous tests, establish ownership and routing, and keep humans reviewing the decisions that matter. The 126K codebase didn’t succeed because AI is perfect—it succeeded because we treated AI as a constrained tool with specific, high-value use cases, not a general-purpose engineer.
Question via Hacker News