Office Hours — What's the best approach for handling OCR and document parsing in a RAG pipeline?

The core problem

OCR and document parsing sit at the most critical bottleneck in RAG pipelines, and they’re almost always the weak link. You can have the best retrieval model and the smartest reranking logic in the world, but if your documents are corrupted, misaligned, or structurally broken coming out of parsing, everything downstream fails silently. A customer asking about their invoice gets answered with text from a completely different document because the OCR layer concatenated pages wrong. This isn’t a model problem. It’s a data plumbing problem.

The tricky part: most practitioners treat OCR as a one-time problem (run Tesseract or Claude’s vision, store the text, done), then blame the LLM when retrieval sucks. In reality, OCR quality directly determines what your embedding model can encode, what your retriever can find, and whether your final answer is grounded or hallucinated.

Separate OCR from document parsing

These are two different tasks with different failure modes, and conflating them breaks diagnosis.

OCR extracts text from images. Document parsing takes text (or structured formats) and organizes it into retrievable chunks. You can have excellent OCR with terrible parsing, or decent OCR with smart parsing that recovers meaning from the noise.

For OCR specifically, use Claude Opus 4.7’s vision capabilities if you need high accuracy and can tolerate API latency and cost. For volume, Tesseract 5 plus local preprocessing (contrast enhancement, deskewing) handles most PDFs and scanned documents reasonably well. The real win is preprocessing: a quick deskew and contrast pass before OCR removes many of the errors that skewed or faded scans would otherwise cause. If you’re doing this at scale, run test batches through multiple OCR engines and compare outputs rather than picking one and hoping.
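
One way to compare engines on a test batch, sketched with only the standard library: it assumes you have already collected each engine’s text per page into a dict, and the engine names, the sample pages, and the 0.9 agreement threshold are all placeholders rather than recommendations. Pages where engines disagree are the ones worth inspecting by hand.

```python
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so layout differences don't count as errors."""
    return " ".join(text.lower().split())


def pairwise_agreement(outputs: dict) -> dict:
    """Similarity ratio (0.0 to 1.0) for every pair of engine outputs on one page."""
    cleaned = {name: normalize(text) for name, text in outputs.items()}
    return {
        (a, b): SequenceMatcher(None, cleaned[a], cleaned[b]).ratio()
        for a, b in combinations(sorted(cleaned), 2)
    }


def pages_needing_review(batch: dict, min_agreement: float = 0.9) -> list:
    """Return page ids where any two engines agree less than the threshold."""
    flagged = []
    for page_id, outputs in batch.items():
        scores = pairwise_agreement(outputs)
        if scores and min(scores.values()) < min_agreement:
            flagged.append(page_id)
    return flagged


# Hypothetical batch: page id -> {engine name -> extracted text}
batch = {
    "invoice-p1": {"engine_a": "Total due: $1,250.00", "engine_b": "Total due:  $1,250.00"},
    "invoice-p2": {"engine_a": "Payment terms: net 30", "engine_b": "Paym3nt t#rms; n t 3O"},
}
flagged = pages_needing_review(batch)
print(flagged)
```

Agreement between engines is only a proxy. Two engines can make the same mistake, so a small hand-labeled sample is still worth having.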

Document parsing is where most teams actually lose the game. A PDF that looks clean to humans (proper fonts, clear layout) often parses terribly because it’s missing structural metadata. Use pymupdf or pdfplumber to extract both text and layout information (tables, column positions, reading order), then reconstruct the document’s actual structure. Don’t flatten everything into a single string. Preserve section boundaries, table markup, and spatial relationships. This matters enormously for retrieval accuracy.
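
As a rough sketch of what “reconstruct reading order” can mean, here is a two-column ordering pass over positioned text blocks. It assumes blocks carry a bounding box and text, similar in shape to the leading fields of the tuples pymupdf’s page.get_text("blocks") returns. The Block type, the 0.6 width ratio, and the two-column assumption are illustrative choices, not part of any library.

```python
from typing import NamedTuple


class Block(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def reading_order(blocks, page_width, full_width_ratio=0.6):
    """Order blocks for a two-column page: left column top to bottom, then right.

    Blocks wider than full_width_ratio of the page (titles, wide tables) act as
    separators, so columns above and below them are not interleaved.
    """
    ordered, left, right = [], [], []

    def flush():
        ordered.extend(sorted(left, key=lambda b: b.y0))
        ordered.extend(sorted(right, key=lambda b: b.y0))
        left.clear()
        right.clear()

    for block in sorted(blocks, key=lambda b: b.y0):
        if (block.x1 - block.x0) > full_width_ratio * page_width:
            flush()
            ordered.append(block)
        elif (block.x0 + block.x1) / 2 < page_width / 2:
            left.append(block)
        else:
            right.append(block)
    flush()
    return ordered


# Hypothetical page, 600 units wide, with blocks in extraction order
blocks = [
    Block(320, 100, 560, 180, "Right column, first paragraph"),
    Block(40, 100, 280, 180, "Left column, first paragraph"),
    Block(40, 40, 560, 80, "Report title"),
    Block(40, 200, 280, 280, "Left column, second paragraph"),
]
ordered = reading_order(blocks, page_width=600)
print([b.text for b in ordered])
```

A naive top-to-bottom sort would interleave the two columns here. Grouping by column first is what keeps each paragraph’s neighbors intact.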

Preserve document structure during chunking

The single biggest mistake: chunking by token count alone, which destroys the document hierarchy that makes information retrievable.

A thousand-token chunk starting mid-paragraph in a legal document is useless. It has no context about which section it belongs to, what document it came from, or what came before. Your embedding model has to guess, and it will often guess wrong.

Instead, chunk hierarchically. Keep sections together. Include metadata (document title, section heading, page number) in the text you embed. Treat tables and other structured content as atomic units rather than breaking them up. A dense table should stay together even if it’s 500 tokens, not get split across five chunks.

Here’s a concrete pattern that works:

# Preserve hierarchy during chunking
class HierarchicalChunker:
    def __init__(self, chunk_size=512, overlap=100):
        # Both limits are measured in characters, not tokens
        self.chunk_size = chunk_size
        self.overlap = overlap

    def _new_chunk(self, metadata, text=""):
        # Copy metadata so later section/page updates don't mutate stored chunks
        return {"text": text, "metadata": {**metadata, "chunk_type": "text"}}

    def chunk_document(self, document):
        chunks = []
        current_chunk = self._new_chunk({
            "doc_title": document.get("title"),
            "section": None,
            "page": None,
        })

        for block in document.get("blocks", []):
            block_type = block.get("type")  # 'text', 'heading', 'table', 'image'
            content = block.get("content", "")

            # Atomic blocks (tables, images) stay together
            if block_type in ("table", "image"):
                if current_chunk["text"].strip():
                    chunks.append(current_chunk)
                    current_chunk = self._new_chunk(current_chunk["metadata"])
                chunks.append({
                    "text": content,
                    "metadata": {
                        **current_chunk["metadata"],
                        "chunk_type": block_type,
                        "page": block.get("page"),
                    },
                })
                continue

            # Track section changes: a heading always starts a new chunk
            if block_type == "heading":
                if current_chunk["text"].strip():
                    chunks.append(current_chunk)
                current_chunk = self._new_chunk(
                    {**current_chunk["metadata"], "section": content, "page": block.get("page")},
                    text=f"# {content}\n",
                )
                continue

            # Regular text: chunk by size but respect block boundaries
            page = block.get("page", current_chunk["metadata"]["page"])
            has_text = bool(current_chunk["text"].strip())
            if has_text and len(current_chunk["text"]) + len(content) > self.chunk_size:
                chunks.append(current_chunk)
                # Carry the tail of the previous chunk forward as overlap
                tail = current_chunk["text"][-self.overlap:] if self.overlap > 0 else ""
                current_chunk = self._new_chunk({**current_chunk["metadata"], "page": page}, text=tail)
            elif not has_text:
                # First text in this chunk: record the page it starts on
                current_chunk["metadata"]["page"] = page
            current_chunk["text"] += content + "\n"

        if current_chunk["text"].strip():
            chunks.append(current_chunk)

        return chunks

This keeps section context, preserves table atomicity, and includes metadata that your retriever can actually use.
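
To make that metadata part of what gets embedded, one option is to prepend a short header to each chunk’s text before sending it to the embedding model. The sketch below assumes chunks shaped like the dicts above. The embedding_text helper and its header format are hypothetical, and whether a header helps depends on your embedding model, so test it against real queries.

```python
def embedding_text(chunk: dict) -> str:
    """Prefix chunk text with a compact metadata header before embedding."""
    meta = chunk.get("metadata", {})
    header_parts = []
    if meta.get("doc_title"):
        header_parts.append(f"Document: {meta['doc_title']}")
    if meta.get("section"):
        header_parts.append(f"Section: {meta['section']}")
    if meta.get("page") is not None:
        header_parts.append(f"Page: {meta['page']}")
    # Only call out the type for non-text chunks such as tables
    if meta.get("chunk_type") and meta["chunk_type"] != "text":
        header_parts.append(f"Type: {meta['chunk_type']}")
    header = " | ".join(header_parts)
    body = chunk.get("text", "").strip()
    return f"{header}\n{body}" if header else body


# Hypothetical chunk in the same shape the chunker produces
chunk = {
    "text": "Either party may terminate with 30 days written notice.\n",
    "metadata": {
        "doc_title": "Master Services Agreement",
        "section": "Termination",
        "page": 7,
        "chunk_type": "text",
    },
}
prepared = embedding_text(chunk)
print(prepared)
```

Keep the original chunk text stored separately, so the header influences retrieval without being shown to users or duplicated in the prompt.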

Validate OCR quality before embedding

Don’t embed garbage and hope for the best. Run quality checks.

For each page or document, measure OCR confidence scores if available, check for suspiciously low word counts (sign of failed OCR), look for nonsensical character sequences, and compare against known patterns in your domain. If you’re processing financial documents, a page with zero numbers is a red flag. If you’re parsing contracts, a page with no legal keywords is suspect.
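
These checks are cheap to automate. The following is a minimal sketch using only the standard library. The function name, flag names, and default thresholds are assumptions chosen for illustration and should be tuned on your own documents.

```python
import re


def ocr_quality_flags(text, min_words=20, max_symbol_ratio=0.3, required_pattern=None):
    """Return reasons a page's OCR output looks suspect (empty list means it passes)."""
    flags = []
    if len(text.split()) < min_words:
        flags.append("low_word_count")

    # Share of non-whitespace characters that are neither letters nor digits
    visible = [c for c in text if not c.isspace()]
    if visible:
        symbol_ratio = sum(not c.isalnum() for c in visible) / len(visible)
        if symbol_ratio > max_symbol_ratio:
            flags.append("high_symbol_ratio")

    # Domain check, e.g. a financial page should contain at least one digit
    if required_pattern and not re.search(required_pattern, text):
        flags.append("missing_domain_pattern")
    return flags


good = "Invoice 1042 total amount due 1,250.00 USD payable within 30 days of receipt"
noisy = "~~ |||| ## %% @@ ~~"
print(ocr_quality_flags(good, min_words=5, required_pattern=r"\d"))
print(ocr_quality_flags(noisy, min_words=5, required_pattern=r"\d"))
```

For contracts, the same required_pattern hook could take a regex of expected legal terms instead of digits.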

When OCR confidence drops below a threshold (typically 60-70% depending on source), flag it for manual review or re-process with a different engine. This saves you from embedding degraded text that will confuse your retriever.
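
A sketch of that threshold check, assuming your OCR engine gives you per-word confidences on a 0 to 100 scale. Tesseract’s word-level output, for example, includes a confidence column and uses -1 for boxes that are not words. The function names and the 65.0 default are placeholders.

```python
from statistics import mean


def page_confidence(word_confidences):
    """Mean confidence over recognized words; negative values mark non-text boxes."""
    valid = [c for c in word_confidences if c >= 0]
    return mean(valid) if valid else 0.0


def route_page(word_confidences, threshold=65.0):
    """Decide whether a page goes on to embedding or to review / re-processing."""
    return "embed" if page_confidence(word_confidences) >= threshold else "review"


print(route_page([96, 91, -1, 88]))  # clean scan
print(route_page([42, 55, -1, 38]))  # degraded scan
print(route_page([]))                # nothing recognized at all
```

A page with no recognized words is treated as zero confidence here, so blank or failed pages land in review rather than slipping through.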

If your documents have images, diagrams, or complex layouts, text-only embeddings lose information. Claude Opus 4.7 can process both text and images, and some embedding models accept images as well as text (CLIP-style models, for instance). For retrieval over technical documentation or architectural diagrams, this is not optional.

Cost and speed tradeoffs

Claude Opus 4.7 vision is expensive for high-volume OCR. Local Tesseract + preprocessing handles 80% of cases at zero API cost. Run local first, fall back to Claude only for documents where Tesseract fails (low contrast, unusual fonts, rotated text, handwriting).
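
The local-first, fallback-second routing can be sketched as a small wrapper that takes both engines as plain callables. Everything named here (OcrResult, ocr_with_fallback, the stub engines, the thresholds) is hypothetical. In a real pipeline the two callables would wrap Tesseract and a vision API call respectively.

```python
from typing import Callable, NamedTuple, Tuple

# An engine takes a page and returns (text, confidence on a 0-100 scale)
Engine = Callable[[dict], Tuple[str, float]]


class OcrResult(NamedTuple):
    text: str
    confidence: float
    engine: str


def ocr_with_fallback(page: dict, local_ocr: Engine, fallback_ocr: Engine,
                      min_confidence: float = 65.0, min_words: int = 5) -> OcrResult:
    """Try the cheap local engine first; pay for the fallback only on weak results."""
    text, confidence = local_ocr(page)
    if confidence >= min_confidence and len(text.split()) >= min_words:
        return OcrResult(text, confidence, "local")
    text, confidence = fallback_ocr(page)
    return OcrResult(text, confidence, "fallback")


# Stub engines for illustration only; they read canned values from the page dict
def stub_local(page):
    return page["local_text"], page["local_conf"]


def stub_fallback(page):
    return page["true_text"], 95.0


clean = {"local_text": "Invoice 1042 total due 1,250.00 USD", "local_conf": 91.0,
         "true_text": "Invoice 1042 total due 1,250.00 USD"}
faded = {"local_text": "lnv 1O4", "local_conf": 38.0,
         "true_text": "Invoice 1042 total due 1,250.00 USD"}
results = [ocr_with_fallback(p, stub_local, stub_fallback) for p in (clean, faded)]
print([r.engine for r in results])
```

Recording which engine produced each page’s text also gives you the fallback rate, which is the number that drives API cost.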

For parsing, pymupdf is fast and runs locally. For complex layouts or table extraction, you might call an LLM, but do this selectively, not for every document.

Bottom line: OCR and parsing are separate problems requiring different tools. Preserve document structure during chunking, validate OCR quality before embedding, and measure end-to-end retrieval accuracy with real queries to catch parsing failures before they hit production.
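
For the end-to-end measurement, a hit rate over a small set of real queries is often enough to catch parsing regressions. This is a minimal sketch: hit_rate_at_k, the toy keyword retriever, and the three-document corpus are stand-ins for your actual retriever and evaluation set.

```python
def hit_rate_at_k(eval_set, retrieve, k=5):
    """Fraction of queries whose expected chunk id appears in the top-k results."""
    if not eval_set:
        return 0.0
    hits = 0
    for query, expected_id in eval_set:
        top_ids = retrieve(query)[:k]
        hits += expected_id in top_ids
    return hits / len(eval_set)


# Toy corpus and retriever, for illustration only
corpus = {
    "inv-1042-p1": "invoice 1042 total due 1250 usd",
    "inv-1042-p2": "payment terms net 30",
    "msa-p7": "either party may terminate with 30 days notice",
}


def keyword_retrieve(query):
    """Rank chunk ids by word overlap with the query, ties broken by id."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(text.split())), cid) for cid, text in corpus.items()]
    return [cid for score, cid in sorted(scored, key=lambda s: (-s[0], s[1]))]


eval_set = [
    ("what is the total due on invoice 1042", "inv-1042-p1"),
    ("how can a party terminate", "msa-p7"),
    ("when is payment due", "inv-1042-p2"),
]
print(hit_rate_at_k(eval_set, keyword_retrieve, k=1))
print(hit_rate_at_k(eval_set, keyword_retrieve, k=2))
```

Re-run the same evaluation set whenever you change the OCR engine, the parser, or the chunker. A drop in hit rate after a parsing change points at the data rather than the model.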

Question via Hacker News