The Stack — Harvey AI
A technical teardown of Harvey AI: the models, infrastructure, and engineering decisions behind the product.
Harvey AI is a legal-domain AI platform built for professional legal work at enterprise scale
What It Is
Harvey is an AI platform used by law firms, corporate legal departments, and professional services firms to automate and augment legal work — drafting, research, due diligence, contract analysis, and regulatory compliance. It serves major firms including PwC, A&O Shearman, and the U.S. Air Force JAG Corps. Unlike general-purpose AI assistants, Harvey is purpose-built for the epistemic demands of legal practice: precision over fluency, citations over confidence.
The Architecture
Harvey has publicly disclosed a deep co-development partnership with OpenAI, and their models are confirmed to be built on top of OpenAI’s infrastructure — not a generic API call, but a custom fine-tuned layer on top of frontier models. Based on engineering blog posts and conference talks, Harvey runs custom fine-tuned versions of OpenAI’s models trained specifically on legal corpora. The specific model versions in active use are not publicly disclosed, but given the timeline and Harvey’s stated needs for long-context reasoning, their current base is likely in the GPT-5.x family.
Harvey’s core technical bet is on domain-specific fine-tuning rather than prompt engineering alone. This is confirmed directionally by their engineering team’s public statements: they collected proprietary legal datasets, including case outcomes, contract structures, and jurisdiction-specific regulatory text, and used these to shift model behavior toward legal reasoning patterns. This isn’t cosmetic tuning — it’s an attempt to reduce hallucination rates in high-stakes outputs where a misattributed citation isn’t a UX problem, it’s a malpractice exposure.
For retrieval, Harvey almost certainly employs a multi-stage RAG pipeline. Legal work requires grounding — you can’t generate a contract clause referencing EU AI Act Article 9 without actually retrieving the current text. Their infrastructure likely maintains jurisdiction-specific vector stores that are versioned against regulatory updates, since statutory and case law changes continuously. The retrieval layer appears to sit upstream of generation, with the model operating on retrieved context rather than parametric memory for any claims that could be verified.
On inference, Harvey operates as a managed service on top of OpenAI’s infrastructure rather than self-hosting models. This trades some infrastructure control for significant operational leverage — they’re not running GPU clusters, they’re building legal intelligence on top of a model provider relationship that includes priority API access and model co-development rights. Cost management at their scale likely involves aggressive prompt caching for repeated document patterns (standard contract boilerplate, frequently cited statutes), which reduces token throughput for common legal document types significantly.
The Smart Decision
Harvey’s most defensible architectural decision is building matter-specific context management. Legal work isn’t stateless — a lawyer working a deal for six months has accumulated enormous context: prior drafts, negotiation history, client preferences, opposing counsel patterns. Harvey appears to have built an internal memory layer that persists matter-level context across sessions rather than treating each interaction as a cold start.
This is architecturally significant because it changes what the model is doing. Instead of retrieval-augmented generation over a generic corpus, Harvey is doing retrieval over a structured workspace that belongs to the engagement. The practical effect is that the model’s outputs become more consistent, more calibrated to client-specific preferences, and harder for a generic competitor to replicate without the same persistence layer. A law firm’s internal Harvey deployment becomes more valuable the longer they use it — that’s a compounding moat that has nothing to do with model capability.
The deeper cleverness is that this architecture also aligns with legal ethics rules around client confidentiality. By keeping matter context siloed per client rather than in a shared learned representation, Harvey avoids the conflict-of-interest problems that would arise if knowledge from one client’s deal bled into outputs for an adverse party. That’s not just a nice-to-have — it’s a regulatory requirement, and building the architecture around it rather than bolting on guardrails afterward is the right call.
The Tradeoff
Harvey’s commitment to fine-tuning on top of OpenAI’s infrastructure means they move at the speed of that partnership, not at the speed of the open model ecosystem. When Cohere released Command A+ under Apache 2.0, or when Llama 4 Maverick shipped with 1M context, Harvey couldn’t simply swap in a cheaper, self-hostable alternative. Their fine-tuned weights are tied to a specific model family. The result is that Harvey is structurally dependent on a single upstream vendor in a way that affects both cost leverage and architectural flexibility.
This surfaces most acutely in enterprise security conversations. Large law firms and financial institutions increasingly want on-premise or private-cloud deployments with no data leaving their environment. A fine-tuned dependency on OpenAI’s hosted inference makes that conversation complicated. Competitors who bet on open-weight models — or who build adapters that can run on Llama 4 or Mistral Large 3 in a private VPC — can offer a deployment model Harvey currently cannot match without significant re-architecture.
The tradeoff was probably correct in 2023 when the capability gap between OpenAI models and everything else was enormous. In mid-2026, with strong open-weight models available and enterprise demand for data sovereignty rising, that bet is getting more expensive to hold.
What You Can Steal
- Scope your fine-tuning to failure modes, not features. Harvey doesn’t fine-tune to add capabilities — they fine-tune to reduce hallucination in the specific domain patterns where base models fail (legal citations, jurisdiction-specific rules). Fine-tune defensively.
- Build matter/project-level memory as a first-class persistence layer. Don’t treat every session as stateless. Users working on long-horizon tasks accumulate context that makes outputs better — store it structured, retrieve it explicitly, keep it client-siloed.
- Design your retrieval architecture around update frequency. Legal statutes change; so does your domain’s ground truth. Build versioned vector stores where documents have TTLs and refresh pipelines, not a one-time index.
- Let compliance constraints drive architecture, not constrain it. Harvey’s client data silo is both an ethics requirement and a product moat. If your domain has compliance rules about data handling, build the architecture around those rules — they often create useful constraints that differentiate you.
- Prompt caching pays disproportionately in document-heavy domains. If your users are working with long, repeated document structures (contracts, filings, reports), invest early in caching common context windows. The cost reduction at scale is substantial.