Why we test extraction on long messy PDFs before promising scale
Timur here — founder of Grizzz.ai.
When we run extraction quality tests, we do not use clean pitch decks.
We use the messiest materials in the actual submission queue.
Long data room PDFs with inconsistent formatting. Investor updates that reference prior rounds without naming numbers. Technical appendices with tables that break mid-page. Founder Q&A answers that contradict the deck. Attachments with scanned text, unusual layouts, or sections written in two languages.
That choice is deliberate.
If you only demo extraction on clean materials, you are not testing the workflow. You are testing the best case. And the best case is not what a fund encounters when it is processing real deal flow.
What breaks under real conditions
A short, well-formatted pitch deck is the easiest kind of startup material to work with.
The information is dense and intentional. The structure is consistent. The founder has spent time making it readable. A lot of things that could go wrong do not get a chance to.
Long PDFs are different.
A forty-page investor data room has formatting inconsistencies, repeated headers, nested footnotes, and content that assumes the reader already remembers context from ten pages back. A market analysis appended to a deck may use different terminology for the same concepts. A financial model embedded as a scanned image cannot be extracted the same way a spreadsheet can: the numbers have to be recovered by OCR before anything downstream can use them.
These are not edge cases. They are the standard operating condition once a fund is doing serious diligence on a shortlisted company.
If the extraction layer only works well on clean submissions, it is not fund-grade. It is a demo that happens to look good when the inputs are cooperative.
The test that matters is not “did this look right on the one clean deck?” It is “did this hold up when the source was dense, inconsistent, and ambiguous?”
Why extraction difficulty is a product discipline question
There is a tempting response when extraction fails on a hard PDF.
Blame the model.
The model missed that section. The model got confused by the table format. The model did not handle the two-language content well.
Some of that is true.
But extraction quality on difficult materials is not mainly a model problem. It is a product discipline problem.
The question is not “which model extracts messy PDFs best?” The question is: what has the product been deliberately engineered to handle, at what level of confidence, and what happens to output when it falls below that level?
That distinction matters because a model will almost always produce something. Unless the product is built to make it, it will rarely say “I cannot parse this PDF.” It will produce output that looks coherent, even when the underlying extraction is weak.
If the product has no explicit handling for low-confidence extraction — no confidence-aware output, no visible gaps, no fallback behavior — then the output will look fine when it is not. And the person reviewing it will not know.
That is the failure mode that scales badly.
On one deal, a misread PDF might mean one question gets skipped. On a hundred deals processed in the same system, a systematic extraction weakness that was never named gets multiplied across every case where that PDF type appeared.
This is why testing deliberately on difficult materials is a design decision, not just a QA check.
You cannot engineer for conditions you have never deliberately pushed the system through.
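To make "explicit handling for low-confidence extraction" concrete, here is a minimal sketch in Python. It assumes the extraction step returns a confidence score between 0 and 1 for each field, which is an assumption for illustration and not a description of any particular model or of the Grizzz pipeline. The names `Candidate`, `gate`, and `REVIEW_THRESHOLD` are hypothetical, and the 0.7 cutoff is a placeholder.

```python
from typing import NamedTuple, Optional

REVIEW_THRESHOLD = 0.7  # hypothetical cutoff; would need tuning on real documents


class Candidate(NamedTuple):
    name: str              # field name, e.g. "revenue"
    value: Optional[str]   # None when nothing was found in the source
    confidence: float      # assumed score in [0, 1] from the extraction step


def gate(candidate: Candidate, threshold: float = REVIEW_THRESHOLD) -> dict:
    """Pass a value through only when its confidence clears the threshold.

    Below the threshold the value is withheld and the field is marked for
    review, so a weak extraction cannot pass as a confident one.
    """
    if candidate.value is None:
        # Nothing was found: keep the gap visible instead of filling it.
        return {"field": candidate.name, "value": None, "status": "not_found"}
    if candidate.confidence < threshold:
        return {
            "field": candidate.name,
            "value": None,
            "status": "needs_review",
            "confidence": candidate.confidence,
        }
    return {
        "field": candidate.name,
        "value": candidate.value,
        "status": "ok",
        "confidence": candidate.confidence,
    }


confident = gate(Candidate("revenue", "3.1M", 0.92))
weak = gate(Candidate("revenue", "3.1M", 0.40))
absent = gate(Candidate("burn_rate", None, 0.0))
```

Withholding the weak value is one design choice among several. Showing it with a visible warning would also work. What matters is that the status travels with the field, so the reviewer never has to guess.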
What honest extraction looks like under pressure
I think there are four things that separate extraction that works under real conditions from extraction that only works on demo inputs.
First: the system distinguishes between extracted facts and inferred claims. If the PDF said something, the extraction records it. If the system inferred something because the surrounding context suggested it, that inference is marked differently. The two are not blended into the same structured field.
Second: gaps are preserved as gaps. If a section of the PDF was poorly scanned, ambiguously formatted, or simply absent, the output does not replace that gap with a plausible alternative. The gap stays visible. A reviewer sees what was found and what was not.
Third: confidence is linked to evidence, not to output length. A long structured summary is not evidence of high extraction quality. The confidence in a specific claim should trace back to whether the source material actually supported it, not to whether the output reads fluently.
Fourth: difficult source types get flagged, not silently downgraded. If the system processes a long, messy PDF that was harder to extract than a clean deck, a reviewer should know that. The downstream judgment on that company is based on material that was harder to parse, and that context belongs in the output.
None of those four things is about the model being smarter.
They are about the product being honest about what the extraction actually found.
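Those four properties can be sketched as a small output schema. This is an illustration in Python, not the actual Grizzz data model: the names `Provenance`, `FieldResult`, and `ExtractionReport` are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Provenance(Enum):
    EXTRACTED = "extracted"  # stated in the source document
    INFERRED = "inferred"    # derived from surrounding context
    MISSING = "missing"      # not found; kept as a visible gap


@dataclass
class FieldResult:
    name: str
    value: Optional[str]
    provenance: Provenance
    evidence: List[str] = field(default_factory=list)  # quoted source passages

    def is_supported(self) -> bool:
        # Confidence follows evidence: a claim counts as supported only if it
        # was read from the source and at least one passage backs it.
        return self.provenance is Provenance.EXTRACTED and bool(self.evidence)


@dataclass
class ExtractionReport:
    source_flags: List[str]  # difficulty markers, e.g. "scanned"
    fields: List[FieldResult]

    def gaps(self) -> List[str]:
        # Fields the source did not answer, reported instead of filled in.
        return [f.name for f in self.fields if f.provenance is Provenance.MISSING]

    def needs_review(self) -> List[str]:
        # Everything that is not a directly supported extracted fact.
        return [f.name for f in self.fields if not f.is_supported()]


# Made-up example values, for illustration only.
report = ExtractionReport(
    source_flags=["scanned", "two_languages"],
    fields=[
        FieldResult("arr", "1.2M USD", Provenance.EXTRACTED, ["p. 14: ARR of $1.2M"]),
        FieldResult("prior_round_size", "2M USD", Provenance.INFERRED),
        FieldResult("burn_rate", None, Provenance.MISSING),
    ],
)
```

Here `report.gaps()` returns only `burn_rate`, while `report.needs_review()` returns both the inferred field and the missing one. The `source_flags` list carries the "this document was hard to parse" context alongside the fields, so it is not dropped on the way to the reviewer.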
What scale pressure actually reveals
Scale is where extraction discipline gets tested.
A single-deal demo can hide almost everything. One clean PDF, one good output, one impressive summary — that sequence tells you almost nothing about whether the workflow survives real fund use.
The signal is in what happens after the first hundred deals.
Which PDF types consistently produce weak extraction? Which document structures cause the system to miss the most relevant content? Which confidence thresholds are being set too low to catch genuine failures before the output reaches a reviewer?
Those patterns only become visible if you have been deliberately testing difficult material all along.
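One way to surface those patterns is to group field-level confidence by document type across everything processed so far. The sketch below is in Python and assumes each extracted field has been logged as a `(doc_type, confidence)` pair. The function name, the document-type labels, the sample numbers, and the 0.7 threshold are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def weak_rate_by_type(
    records: Iterable[Tuple[str, float]], threshold: float = 0.7
) -> Dict[str, float]:
    """Share of fields below `threshold`, grouped by document type.

    `records` holds one (doc_type, confidence) pair per extracted field.
    """
    totals = defaultdict(int)
    weak = defaultdict(int)
    for doc_type, confidence in records:
        totals[doc_type] += 1
        if confidence < threshold:
            weak[doc_type] += 1
    return {doc_type: weak[doc_type] / totals[doc_type] for doc_type in totals}


# Made-up log entries, for illustration only.
records = [
    ("clean_deck", 0.95), ("clean_deck", 0.90),
    ("scanned_model", 0.40), ("scanned_model", 0.85),
    ("scanned_model", 0.30), ("scanned_model", 0.55),
]
rates = weak_rate_by_type(records)
```

With these sample records, the clean decks show no weak fields while three of the four scanned-model fields fall below the threshold. A gap like that is invisible on any single deal and obvious in aggregate.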
If the extraction layer was only ever tested on clean inputs, scale makes the weakness visible all at once, usually at the worst possible time: when a fund is trying to use the system on a real shortlisted deal, not during a demo.
That is why we do not start with clean decks in quality testing.
Starting with clean decks delays the honest answer. It tells you the system works when inputs cooperate. It does not tell you whether the workflow survives the standard operating condition of a fund processing real deal flow.
The hard PDFs are where product seriousness gets decided.
Not because handling them perfectly is the goal.
Because how the system responds when extraction gets difficult — whether it flags uncertainty, preserves gaps, and gives a reviewer enough signal to know when to trust the output — is the test that determines whether the workflow can actually be used in production, or whether it only looks like it can.
On shortlisted deals, Grizzz turns raw startup materials into risks, next questions, and an evidence-linked full report before partner time.
Grizzz is diligence infrastructure that compounds as more deals move through the same workflow.

