Office Hours — We're using an LLM to extract structured data from messy PDFs. Sometimes it works perfectly, sometimes it misses fields or invents data. How do I know if the problem is the model, my prompt, or the PDF quality itself?
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
We’re using an LLM to extract structured data from messy PDFs. Sometimes it works perfectly, sometimes it misses fields or invents data. How do I know if the problem is the model, my prompt, or the PDF quality itself?
This is the classic “what’s actually breaking” problem. Start by instrumenting three things separately: log the raw extracted JSON, log your prompt (including the actual PDF text chunk), and log whether you’re catching validation errors on the output schema.
Run 50-100 samples through your pipeline and bucket them by failure mode. Are missing fields consistently the same field across different PDFs? That’s usually a prompt issue, not a model issue. Claude Opus 4.7 is genuinely good at this task, but it needs you to be explicit about what “missing” means versus “field genuinely absent from document.”
Setting Up Observability
Before you start swapping models around, instrument your extraction pipeline properly. Log at least these three artifacts for every extraction:
- The raw text extracted from the PDF (after OCR or text parsing).
- The full prompt you sent to the model, with no templating hidden.
- The structured output, plus any validation errors if your schema rejected it.
Use structured logging (JSON lines, not printf) so you can grep failures by error type later. Something like:
{
  "sample_id": "inv_123",
  "pdf_source": "acme_2024.pdf",
  "failure_mode": "missing_field",
  "field_name": "invoice_date",
  "model": "claude-opus-4.7",
  "prompt_version": "v3",
  "cost_usd": 0.0012,
  "latency_ms": 1240
}
This takes 20 minutes to add and saves you hours of guessing. Include cost and latency so you can spot patterns later (sometimes a model is slower or more expensive on specific document types, which tells you something about what it’s struggling with).
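Here is a minimal sketch of that logging step, using only the standard library. The function names (`log_extraction`, `load_failures`) and the record keys beyond those in the sample above are assumptions for illustration, not part of any particular framework.

```python
import io
import json
from typing import Optional, TextIO


def log_extraction(
    stream: TextIO,
    sample_id: str,
    pdf_source: str,
    raw_text: str,
    prompt: str,
    output: Optional[dict],
    validation_errors: list[str],
    **extra: object,
) -> dict:
    """Append one extraction attempt to a JSON Lines stream and return the record."""
    record = {
        "sample_id": sample_id,
        "pdf_source": pdf_source,
        "raw_text": raw_text,              # text after OCR or parsing
        "prompt": prompt,                  # the fully rendered prompt, no templating
        "output": output,                  # parsed model output, or None
        "validation_errors": validation_errors,
        **extra,                           # e.g. model, prompt_version, cost_usd
    }
    stream.write(json.dumps(record, ensure_ascii=False) + "\n")
    return record


def load_failures(stream: TextIO) -> list[dict]:
    """Read the log back and keep only records that failed schema validation."""
    records = [json.loads(line) for line in stream if line.strip()]
    return [r for r in records if r["validation_errors"]]


# Usage with an in-memory stream; in a real pipeline this would be a file.
log = io.StringIO()
log_extraction(log, "inv_123", "acme_2024.pdf", "Invoice ...", "Extract ...",
               {"invoice_date": None}, ["invoice_date: required"],
               model="claude-opus-4.7", prompt_version="v3")
log.seek(0)
print(load_failures(log)[0]["sample_id"])
```

One record per line means you can filter with ordinary text tools or load the whole file in a few lines of Python.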
Distinguishing Prompt from Model Issues
A field that goes missing consistently usually points to a prompt problem. If the same field is blank across 8 different PDFs from different vendors, your prompt isn’t being explicit enough about what that field is or where to find it.
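To check whether one field keeps going missing across unrelated documents, you can group your failure records by field and count distinct sources. This sketch assumes records shaped like the sample log entry above (`failure_mode`, `field_name`, `pdf_source`); the function name and the `min_sources` cutoff are hypothetical.

```python
from collections import defaultdict


def fields_missing_across_sources(records, min_sources=3):
    """Map each missing field to the distinct PDF sources it was missing from.

    Only fields missing in at least `min_sources` different sources are kept,
    since those are the ones that look systematic rather than document-specific.
    """
    sources_by_field = defaultdict(set)
    for rec in records:
        if rec.get("failure_mode") == "missing_field":
            sources_by_field[rec["field_name"]].add(rec["pdf_source"])
    return {
        field: sorted(sources)
        for field, sources in sources_by_field.items()
        if len(sources) >= min_sources
    }


records = [
    {"failure_mode": "missing_field", "field_name": "invoice_date", "pdf_source": "acme.pdf"},
    {"failure_mode": "missing_field", "field_name": "invoice_date", "pdf_source": "globex.pdf"},
    {"failure_mode": "missing_field", "field_name": "invoice_date", "pdf_source": "initech.pdf"},
    {"failure_mode": "missing_field", "field_name": "po_number", "pdf_source": "acme.pdf"},
    {"failure_mode": "invented_data", "field_name": "total", "pdf_source": "acme.pdf"},
]
print(fields_missing_across_sources(records))
```

A field that shows up here is a candidate for a more explicit prompt description; a field missing from only one source is more likely a document or extraction quirk.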
Test this quickly: add a line like “Only extract fields that appear explicitly in the document text. If a field is not present, return null.” Run the same 10 failing samples again. If that fixes most of it, the model was trying to infer or fill gaps based on context. If it doesn’t, the model genuinely isn’t seeing the field in the text, which means either your PDF preprocessing is stripping it or the model is missing it (unlikely with Claude Opus 4.7 or GPT-5.5).
Are you getting invented data? That’s hallucination, and it matters which model you’re using. Test the same PDF batch against both Claude Opus 4.7 and GPT-5.5. Different models hallucinate differently. Claude tends to be more conservative; GPT-5.5 is more aggressive at inferring context.
Also check whether your PDF text extraction is eating important context. PDFs are notoriously bad sources of clean text. Extraction tools often strip formatting, merge columns incorrectly, or lose context around fields. Run the same PDF through two different extraction libraries (pdfplumber vs. pypdf, the maintained successor to PyPDF2, for example) and compare the text output. If one version gives you the field and the other doesn’t, your preprocessing is the bottleneck, not the model.
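The two-library comparison can be sketched as a simple presence check. This version deliberately takes already-extracted text as plain strings, so it makes no assumptions about either library's API; the function names and the tool labels are hypothetical, and the expected values are ones you have read off the PDF by hand.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so layout differences don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def compare_extractions(
    expected_values: dict[str, str],
    texts_by_tool: dict[str, str],
) -> dict[str, dict[str, bool]]:
    """For each field, report which tool's text output contains the expected value."""
    normalized = {tool: normalize(text) for tool, text in texts_by_tool.items()}
    return {
        field: {tool: normalize(value) in text for tool, text in normalized.items()}
        for field, value in expected_values.items()
    }


# Text as two hypothetical extractors might return it for the same page.
texts = {
    "tool_a": "Invoice Date:\n2024-03-15\nTotal 100.00",
    "tool_b": "Invoice Total 100.00 Date:",  # the date was dropped in extraction
}
expected = {"invoice_date": "2024-03-15", "total": "100.00"}
print(compare_extractions(expected, texts))
```

If a value is present in one tool's output and absent from the other's, no prompt or model change will recover it from the lossy version.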
PDF Preprocessing as a Hidden Variable
This is where most extraction pipelines fail silently. A PDF rendered perfectly in Adobe Reader can produce garbled or truncated text via automated extraction. Tables become word salad. Multi-column layouts collapse into gibberish. Handwritten annotations vanish entirely.
Before blaming the model, diff the extracted text against the actual PDF visually. Zoom in on a problem section, copy the text manually, and compare it to what your extraction tool produced. If invoice dates are consistently on the page but missing from extracted text, you have a PDF extraction problem, not a model problem. Switching from Claude Opus 4.7 to GPT-5.5 won’t help.
Some PDFs are image-scanned rather than text-based, which means you need OCR preprocessing. Others are hybrid. Your extraction pipeline needs to detect this and route accordingly. Missing this routing is a common silent failure.
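One rough way to do that routing is to look at how much text each page's text layer yields. This is a heuristic sketch, not a definitive detector: the 50-character threshold and the function names are assumptions, and a legitimately near-blank page will be flagged for OCR too.

```python
def classify_pdf(page_texts: list[str], min_chars: int = 50) -> str:
    """Classify a PDF as 'text', 'scanned', or 'hybrid' from per-page text-layer output.

    `page_texts` holds whatever your text extractor returned for each page;
    pages with fewer than `min_chars` non-whitespace-trimmed characters are
    treated as image-only.
    """
    if not page_texts:
        raise ValueError("no pages to classify")
    has_text = [len(text.strip()) >= min_chars for text in page_texts]
    if all(has_text):
        return "text"
    if not any(has_text):
        return "scanned"
    return "hybrid"


def pages_needing_ocr(page_texts: list[str], min_chars: int = 50) -> list[int]:
    """Return zero-based indexes of pages that should be sent through OCR."""
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < min_chars]


pages = ["Invoice 123 " * 20, "", "   \n"]
print(classify_pdf(pages), pages_needing_ocr(pages))
```

Log the classification alongside each extraction record so you can see whether failures cluster on scanned or hybrid documents.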
The Quick Diagnostic Test
If switching from Claude Opus 4.7 to GPT-5.5 drops your error rate by more than 15% on the same samples and prompts, it’s a model capability gap. If tweaking your prompt to be more restrictive (adding “do not infer” or “only extract explicit values”) fixes it, your prompt was under-specified. If neither of those moves the needle, your PDF text extraction is lossy.
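The decision rule above fits in a few lines. This sketch assumes the 15% figure means a relative drop in error rate and applies the same cutoff to the prompt change; both are interpretations, so adjust them to however you measure. The function names are hypothetical.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction by which the error rate fell, relative to the baseline."""
    if before <= 0:
        return 0.0
    return (before - after) / before


def diagnose(
    baseline: float,
    with_prompt_change: float,
    with_model_change: float,
    threshold: float = 0.15,
) -> list[str]:
    """Name the likely culprits given error rates measured on the same samples."""
    causes = []
    if relative_drop(baseline, with_model_change) > threshold:
        causes.append("model")
    if relative_drop(baseline, with_prompt_change) > threshold:
        causes.append("prompt")
    # Neither change helped: the input text itself is the remaining suspect.
    return causes or ["pdf_extraction"]


print(diagnose(baseline=0.40, with_prompt_change=0.10, with_model_change=0.39))
```

Both causes can fire at once, which is why the function returns a list rather than a single label.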
One more check: if the model is hallucinating values, add a validation step that requires it to cite the exact text span each value was extracted from. Something like:
{
  "field": "invoice_date",
  "value": "2024-03-15",
  "source_text": "Invoice Date: 2024-03-15",
  "confidence": 0.98
}
This forces the model to either ground its answer or fail validation, as long as your validator actually checks that the cited span appears in the extracted PDF text. It costs a few extra tokens but catches hallucination before it enters your database. Requiring a citation also tends to make the extraction itself more accurate; it’s not just a logging benefit.
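A minimal sketch of that validation step, assuming each extracted item has the `value` and `source_text` keys from the example above. The function name is hypothetical, and the check is deliberately strict: if you ask the model to reformat values (say, "March 15, 2024" into "2024-03-15"), the second check will need loosening.

```python
import re


def _norm(text: str) -> str:
    """Collapse whitespace so line breaks in the PDF text don't cause false misses."""
    return re.sub(r"\s+", " ", text).strip()


def check_citation(item: dict, document_text: str) -> list[str]:
    """Return a list of grounding errors for one extracted field (empty if it passes)."""
    value = item.get("value")
    source = item.get("source_text")
    if value is None:
        return []  # a null field makes no claim, so there is nothing to ground
    if not source:
        return ["missing source_text"]
    errors = []
    if _norm(source) not in _norm(document_text):
        errors.append("source_text not found in document")
    if _norm(str(value)) not in _norm(source):
        errors.append("value not found in source_text")
    return errors


doc = "ACME Corp\nInvoice Date: 2024-03-15\nTotal: 100.00"
grounded = {"field": "invoice_date", "value": "2024-03-15",
            "source_text": "Invoice Date: 2024-03-15"}
invented = {"field": "due_date", "value": "2024-04-01",
            "source_text": "Due Date: 2024-04-01"}
print(check_citation(grounded, doc), check_citation(invented, doc))
```

Anything that returns errors gets rejected or routed to review instead of written to your database.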
Bottom line: Don’t guess. Split your failures into categories (missing fields, invented data, wrong values), instrument your pipeline to log all three signals, and test each against a prompt change, a model change, and a PDF extraction change. You’ll know in an afternoon what’s actually wrong, and you’ll have the data to back up whatever fix you choose.