
Office Hours — Best LLM stack for Q&A over internal PDFs?

A daily developer question about AI/LLMs, answered with a direct, opinionated take.

Daily: one question from the trenches, one opinionated answer.

Best LLM stack for Q&A over internal PDFs?

You probably want Claude Opus 4.7 or GPT-5.5 paired with a vector store (Pinecone, Weaviate, or just SQLite with embeddings if you’re small) and a chunking strategy that doesn’t butcher context. The hard part isn’t the LLM—it’s handling PDFs that are structurally broken. Many internal PDFs have weird formatting, scanned images masquerading as text, or tables that get mangled by extraction tools.

For the pipeline: use a solid PDF parser (pypdf, the maintained successor to PyPDF2, or pdfplumber, not just plain text extraction), chunk by semantic boundaries rather than arbitrary token counts, embed with a current embedding model (OpenAI’s text-embedding-3-large, or Voyage embeddings if you’re in the Anthropic ecosystem), and rerank results before sending them to the LLM. This dramatically reduces hallucination on information that is actually in your PDFs but buried in bad formatting.
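
A minimal sketch of the embed step, assuming the OpenAI Python SDK (the helper name and batching are illustrative, not part of the stack above):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed_texts(texts: list[str]) -> list[list[float]]:
    # One batched request; text-embedding-3-large returns 3072-dimensional vectors
    response = client.embeddings.create(model="text-embedding-3-large", input=texts)
    return [item.embedding for item in response.data]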

PDF Parsing and Chunking

The parsing step deserves 80% of your effort. A naive approach extracts text sequentially and splits on tokens, which destroys spatial relationships in PDFs with columns, sidebars, or footnotes. The better approach: parse layout first, then chunk by logical sections. pypdf handles basic extraction; pdfplumber preserves table structure and coordinates. For scanned PDFs you’re OCR-ing anyway, so pipe them through a vision model’s text extraction before chunking.
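
A layout-aware first pass with pdfplumber might look like this; a sketch only, and real corpora will need per-layout tuning:

import pdfplumber

def extract_pages(path: str):
    # Yields (page_number, text, tables) so table rows are not flattened into prose
    with pdfplumber.open(path) as pdf:
        for number, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""
            tables = page.extract_tables()  # list of row lists per detected table
            yield number, text, tables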

Chunking strategy matters more than chunk size. A 1000-token chunk that cuts a table in half is worse than a 500-token chunk that respects section boundaries. Use a library like LlamaIndex or LangChain that chunks by headers and paragraphs first, then splits only when a logical block exceeds your token limit. Include overlap (50-100 tokens) so retrieval doesn’t drop context at boundaries.

Here’s a practical chunking config using LangChain:

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=["\n\n", "\n", ". ", " "],  # try paragraph, then line, then sentence breaks
    length_function=len,  # sizes here are characters; pass a tokenizer-based function to count tokens
)
# docs is the list of Document objects produced by your PDF loader
chunks = splitter.split_documents(docs)

This respects paragraph boundaries before token limits, preserving semantic cohesion.

Inference Caching and Economics

Inference caching is live now and changes the economics significantly. If your PDFs are stable and you’re running repeated queries over the same document set, cached context windows save real money on frontier models. For example, if you embed a 50-page manual (roughly 40,000 tokens) and field 100 queries against it per day, caching that manual in the prompt saves reprocessing costs on 99 out of 100 requests, as long as the queries land within the cache’s time-to-live.

Claude Opus 4.7 and GPT-5.5 both support prompt caching. The math on Anthropic’s pricing: cache writes cost roughly 25% more than regular input tokens, and cache reads cost about 10% of the base input rate. On high-volume Q&A that works out to a 3-5x saving versus resending the full context uncached. You pay a small premium once to cache the document chunks in your system prompt or a pinned context block, then every subsequent query reuses that cache at roughly 90% less per cached input token.

Set cache_control on your static document context:

# Anthropic-style system content block; "ephemeral" marks it for prompt caching
system_blocks = [
    {
        "type": "text",
        "text": full_document_text,
        "cache_control": {"type": "ephemeral"},
    }
]
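
To actually hit that cache, pass the block list as the system parameter on every request. A sketch using the Anthropic Python SDK; the model id here is a placeholder, not a real identifier:

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-latest",  # placeholder; substitute whichever frontier model you standardize on
    max_tokens=1024,
    system=system_blocks,  # the cached document context defined above
    messages=[{"role": "user", "content": "What does the manual say about data retention?"}],
)
print(response.content[0].text)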

For a PDF corpus you query thousands of times per week, this shifts the unit economics from “expensive per query” to “expensive once, cheap forever.”

Privacy and On-Premise Alternatives

If your PDF corpus is truly internal and sensitive, consider running embeddings locally with an open embedding model (or an on-prem stack built around open weights like Llama 4 or Mistral Large 3), keeping vectors on-prem, and using a frontier model only for the final generation step. This splits the privacy concern: embeddings never leave your infrastructure; only the retrieved chunks and the query go to the API. You lose some retrieval quality versus frontier embeddings, but you gain compliance clarity.
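
One way to keep the embedding step on-prem is a local sentence-transformers model. A sketch under that assumption; the bge-large checkpoint is an example choice, not something this stack requires:

from sentence_transformers import SentenceTransformer

# Runs entirely on your own hardware; raw document text never leaves your infrastructure
model = SentenceTransformer("BAAI/bge-large-en-v1.5")
vectors = model.encode(
    [chunk.page_content for chunk in chunks],
    normalize_embeddings=True,  # normalized vectors make cosine similarity a plain dot product
)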

The tradeoff is real. Frontier embedding models (text-embedding-3-large) outperform local alternatives by 3-8% on semantic retrieval tasks. Whether that matters depends on your domain. For internal docs with clear language and good structure, the gap narrows. For ambiguous or domain-specific corpora, it’s significant.

Retrieval and Ranking

The stack that’s actually shipping well right now is Claude Opus 4.7 for generation, text-embedding-3-large for vectors, and either native structured outputs or function calling to force the model to cite which document chunk it’s answering from. That last bit matters more than people think: it lets you validate accuracy and debug retrieval failures.
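
One way to force those citations with function calling is a tool whose schema requires chunk IDs alongside the answer. A sketch of the pattern; the tool name and fields are illustrative, not a standard:

answer_tool = {
    "name": "answer_with_citations",
    "description": "Answer the user's question and cite the supporting chunks",
    "input_schema": {
        "type": "object",
        "properties": {
            "answer": {"type": "string"},
            "source_chunk_ids": {"type": "array", "items": {"type": "string"}},
        },
        "required": ["answer", "source_chunk_ids"],
    },
}
# Pass tools=[answer_tool] and tool_choice={"type": "tool", "name": "answer_with_citations"}
# so the model cannot reply without naming which chunks it relied on.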

Add a reranking step using a cross-encoder model (BAAI/bge-reranker-v2-m3 works locally) between retrieval and generation to eliminate junk matches before they reach the LLM. Retrieve top-50 results from your vector store, rerank to top-5, send to Claude. This costs almost nothing (reranking is fast) and catches the frequent case where semantic similarity ranked a structurally similar but contextually irrelevant chunk first.
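
A minimal reranking pass, assuming sentence-transformers for loading the cross-encoder:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score every (query, chunk) pair jointly, then keep only the strongest matches
    scores = reranker.predict([(query, text) for text in candidates])
    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:top_k]]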

Common Failure Modes

One concrete gotcha: many PDF Q&A implementations fail on temporal queries (“what changed in Q4?”) or cross-document questions (“which manual contradicts the other?”). Single-document retrieval doesn’t see the contradiction. Tag chunks with metadata during chunking (document name, date, section), store that metadata in your vector store, and filter by date range or document type before the embedding similarity search so you never mix incompatible sources.
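
The mechanics are simple once every chunk carries metadata. A sketch of the pre-filter with illustrative field names; most vector stores expose an equivalent native filter:

def filter_by_metadata(chunks, doc_type=None, date_from=None, date_to=None):
    # Narrow the candidate pool by metadata first, then run vector similarity
    # only over what survives, so a Q3 manual never competes with its Q4 revision.
    kept = []
    for chunk in chunks:
        meta = chunk.metadata  # e.g. {"source": "hr_manual.pdf", "date": "2026-01-15", "doc_type": "manual"}
        if doc_type and meta.get("doc_type") != doc_type:
            continue
        if date_from and meta.get("date", "") < date_from:
            continue
        if date_to and meta.get("date", "") > date_to:
            continue
        kept.append(chunk)
    return kept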

Another failure mode: questions that require synthesis across multiple chunks. “Summarize the security audit findings” might need 10-15 chunks. Your retrieval strategy should return enough context. Aim for 3000-5000 tokens of retrieved context before sending to the LLM, not 1000-1500. A single chunk is rarely enough.
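
A simple way to hit that budget is to keep appending reranked chunks until it’s spent. A sketch using a rough characters-per-token estimate:

def build_context(ranked_chunks: list[str], budget_tokens: int = 4000) -> str:
    # Append highest-ranked chunks until the rough token budget is exhausted
    parts, used = [], 0
    for text in ranked_chunks:
        estimated = len(text) // 4  # rough heuristic: ~4 characters per token in English
        if used + estimated > budget_tokens:
            break
        parts.append(text)
        used += estimated
    return "\n\n".join(parts)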

Bottom line: Start with Claude Opus 4.7 plus a vector store, focus 80% of effort on PDF parsing and chunking quality (that’s where your real problems live), add reranking before you optimize the LLM itself, use inference caching if your document set is stable, and tag chunks with metadata to handle temporal and cross-document queries correctly.

Question via Hacker News