Office Hours
A daily developer question about AI/LLMs, answered with a direct, opinionated take.
We’re using an LLM to extract structured data from messy PDFs. Sometimes it works perfectly, sometimes it misses fields or invents data. How do I know if the problem is the model, my prompt, or the PDF quality itself?
This is the classic “what’s actually breaking” problem. Start by instrumenting three things separately: log the raw extracted JSON, log your prompt (including the actual PDF text chunk), and log whether you’re catching validation errors on the output schema.
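A minimal sketch of that instrumentation, assuming a hypothetical `call_llm` function and an invoice-style schema (`REQUIRED_FIELDS` here is illustrative, not your actual schema):

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("extraction")

REQUIRED_FIELDS = {"invoice_number", "total", "date"}  # hypothetical schema

def validate(record: dict) -> list[str]:
    """Return a list of schema validation errors for one extracted record."""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS if f not in record]

def run_extraction(pdf_text: str, prompt_template: str, call_llm) -> dict:
    prompt = prompt_template.format(document=pdf_text)
    raw = call_llm(prompt)                  # raw model output, a JSON string
    log.info("PROMPT: %s", prompt)          # 1) the exact prompt sent
    log.info("RAW OUTPUT: %s", raw)         # 2) the raw extracted JSON
    try:
        record = json.loads(raw)
    except json.JSONDecodeError as e:
        log.info("VALIDATION: unparseable JSON (%s)", e)
        return {}
    for err in validate(record):            # 3) schema validation errors
        log.info("VALIDATION: %s", err)
    return record
```

Three separate log lines per sample is the whole point: you can later grep each failure class independently instead of staring at one blob.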
Run 50–100 samples through your pipeline and bucket them by failure mode. Are missing fields consistently the same field across different PDFs? That’s usually a prompt issue, not a model issue. Claude Opus 4.6 is genuinely good at this task, but it needs you to be explicit about what “model failed to extract” means versus “field genuinely absent from the document.”
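Bucketing the missing-field failures is a one-liner with `Counter`. A sketch, assuming your batch of parsed outputs is a list of dicts:

```python
from collections import Counter

def missing_field_histogram(records: list[dict], required: set[str]) -> Counter:
    """Count how often each required field is absent (or empty) across a batch.
    One field dominating the counts points at the prompt, not the model."""
    counts = Counter()
    for rec in records:
        for field in required:
            if field not in rec or rec[field] in (None, ""):
                counts[field] += 1
    return counts
```

If one field is missing in 40 of 50 samples while the rest are clean, rewrite how the prompt describes that field before touching anything else.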
Are you getting invented data? That’s hallucination, and you need to test whether it’s happening with GPT-5.4 or Opus 4.6 on the same PDFs. Different models hallucinate differently. Also check if your PDF preprocessing is stripping important context. PDFs are garbage; bad text extraction will make any model look incompetent.
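A cheap first-pass hallucination check: flag any extracted value that never appears in the source text the model was given. This is a crude substring heuristic (a real pipeline would normalize dates and numbers first), but it separates "invented" from "wrong" surprisingly well:

```python
def possibly_invented(record: dict, source_text: str) -> list[str]:
    """Flag fields whose values never appear in the source text.
    Crude substring check; normalize dates/currency before trusting it."""
    flagged = []
    lowered = source_text.lower()
    for field, value in record.items():
        if value is None:
            continue
        if str(value).lower() not in lowered:
            flagged.append(field)
    return flagged
```

Run it over outputs from both models on the same PDFs; if one model's flag rate is much higher, you've isolated the hallucinator.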
Quick validation: if switching from Claude Opus 4.6 to GPT-5.4 drops your error rate by more than 15%, it’s the model. If tweaking your prompt to include “only extract fields that are explicitly stated” fixes it, you were under-specifying. If it’s still broken, your PDF text extraction is eating the signal.
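The model-swap test reduces to one comparison. A sketch using the 15-point threshold from above, with both models run on the identical PDF set:

```python
def error_rate_delta(errors_a: int, errors_b: int, total: int) -> float:
    """Absolute error-rate difference between two models on the same PDFs."""
    return abs(errors_a - errors_b) / total

def likely_model_problem(errors_a: int, errors_b: int, total: int) -> bool:
    """Heuristic: a swing of more than 15 points between models on
    identical inputs suggests the model, not the prompt or the PDFs."""
    return error_rate_delta(errors_a, errors_b, total) > 0.15
```

If the delta is small and the prompt tweak doesn't move it either, stop blaming the model and go look at your text extraction output by hand.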
Bottom line: Don’t guess. Split your failures into categories (missing fields, invented data, wrong values) and test each against both a prompt change and a model change. You’ll know in an afternoon what’s actually wrong.
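If you have even a handful of hand-labeled ground-truth records, the three-way split can be automated. A sketch, reusing the substring heuristic to separate "invented" from merely "wrong":

```python
def categorize_failures(record: dict, truth: dict, source_text: str) -> dict:
    """Bucket one sample's failures: missing fields, invented data, wrong values."""
    buckets = {"missing": [], "invented": [], "wrong": []}
    lowered = source_text.lower()
    for field, expected in truth.items():
        got = record.get(field)
        if got is None:
            buckets["missing"].append(field)
        elif got != expected:
            # wrong value; if it never appears in the source, call it invented
            kind = "invented" if str(got).lower() not in lowered else "wrong"
            buckets[kind].append(field)
    return buckets
```

Twenty labeled samples through this function gives you the failure breakdown that tells you which experiment to run first.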