We are working on PDF data extraction and doing a POC with a few tools like Claude, Copilot, Power Automate, Textract, etc. While machine-printed PDFs are fairly trivial and LLM extraction is accurate, we are running into a challenge with handwritten PDFs, as the image clarity is often very poor. Despite trying OCR first and then parsing, it is not working well. Does anyone have a better suggestion or solution for this need?

Product Associate · 6 days ago

For handwritten PDFs, standard OCR + LLM parsing breaks down primarily because poor image resolution, skew, noise, and handwriting variability destroy the signal before intelligence is applied. In those cases, even the best LLMs are downstream consumers of already-degraded text.

A few approaches that have worked better in practice:

Aggressive image pre-processing before OCR
Apply de-skewing, contrast normalization, denoising, binarization, and super-resolution (ESRGAN-style models) before OCR. Improving the pixels often yields more lift than changing the LLM.
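As a rough illustration of "improving the pixels first", contrast stretching plus Otsu binarization can be sketched in plain NumPy. This is a toy stand-in for what OpenCV or scikit-image do more robustly (with de-skewing and denoising on top); the function names here are ours:

```python
import numpy as np

def otsu_threshold(gray):
    """Find the binarization threshold that maximizes between-class
    variance (Otsu's method). gray: 2-D uint8 array."""
    probs = np.bincount(gray.ravel(), minlength=256) / gray.size
    omega = np.cumsum(probs)                      # cumulative class probability
    mu = np.cumsum(probs * np.arange(256))        # cumulative class mean
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    return int(np.argmax(np.nan_to_num(sigma_b)))

def preprocess(gray):
    """Stretch contrast to the full 0-255 range, then binarize."""
    lo, hi = gray.min(), gray.max()
    stretched = ((gray - lo) * (255.0 / max(hi - lo, 1))).astype(np.uint8)
    t = otsu_threshold(stretched)
    return (stretched > t).astype(np.uint8) * 255   # ink -> 0, paper -> 255
```

On faded low-contrast scans this kind of normalization alone can turn an unreadable page into something an OCR engine can work with.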

Handwriting-specific OCR models (not general OCR)
Use handwriting-trained engines (e.g., AWS Textract Handwriting, Google Vision Handwriting, Azure Read Handwritten Text) rather than generic OCR. They handle stroke variability and spacing far better.
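These engines also return per-word confidence and a printed-vs-handwriting flag you can act on downstream. A minimal sketch of filtering an AWS Textract `detect_document_text` response (the `Blocks` shape follows Textract's documented response format; the 80% threshold is an arbitrary choice for illustration):

```python
def handwriting_words(response, min_conf=80.0):
    """Split handwriting WORD blocks from a Textract response into
    confidently recognized words and words that need human review."""
    confident, review = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        if block.get("TextType") != "HANDWRITING":
            continue  # skip machine-printed words
        target = confident if block.get("Confidence", 0.0) >= min_conf else review
        target.append(block["Text"])
    return confident, review
```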

Field-level extraction instead of full-text OCR
If the documents are semi-structured, train models to detect and extract specific fields (names, dates, amounts) rather than attempting full transcription.
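For semi-structured documents, even crude post-OCR field spotting can beat chasing a perfect full transcription. A toy sketch of the idea (the patterns and field names here are illustrative only; production systems use trained field detectors rather than regexes):

```python
import re

# Illustrative patterns only -- real systems learn field locations
FIELD_PATTERNS = {
    "date": re.compile(r"\b\d{1,2}[/-]\d{1,2}[/-]\d{2,4}\b"),
    "amount": re.compile(r"\$\s?\d{1,3}(?:,\d{3})*(?:\.\d{2})?\b"),
}

def extract_fields(ocr_text):
    """Pull candidate field values out of noisy OCR text."""
    return {name: pat.findall(ocr_text) for name, pat in FIELD_PATTERNS.items()}
```

The point is that an OCR error in the surrounding prose costs nothing if the date and amount still match.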

Vision-language models instead of OCR → text → LLM
Use multimodal models that reason directly over images. They often outperform OCR pipelines when text quality is poor because they infer context rather than relying on perfect character recognition.
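As a sketch, sending the page image straight to a multimodal model looks roughly like this. The payload shape follows Anthropic's Messages API image format; the model id and prompt are placeholders, and only the request construction is shown:

```python
import base64

def build_vision_request(image_bytes, prompt, model="claude-sonnet-4-5"):
    """Build a Messages-API-style payload that sends the raw page image
    alongside an extraction prompt -- no OCR step in between."""
    return {
        "model": model,          # placeholder model id
        "max_tokens": 1024,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image",
                 "source": {"type": "base64",
                            "media_type": "image/png",
                            "data": base64.b64encode(image_bytes).decode("ascii")}},
                {"type": "text", "text": prompt},
            ],
        }],
    }
```

Because the model sees the whole page, it can use layout and surrounding context to disambiguate a smudged word that a character-level OCR engine would simply get wrong.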

Human-in-the-loop for low-confidence zones
Route only low-confidence regions for manual validation. This hybrid approach usually delivers the best cost-accuracy tradeoff for handwritten data.
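The routing itself is simple; the real work is picking the threshold. A minimal sketch (names are ours, and the 0.85 default is only an example):

```python
def route_for_review(zones, threshold=0.85):
    """Split extracted zones into auto-accept and manual-review queues
    and report the share of work going to humans.
    zones: list of (zone_id, text, confidence) tuples."""
    auto = [z for z in zones if z[2] >= threshold]
    manual = [z for z in zones if z[2] < threshold]
    review_rate = len(manual) / len(zones) if zones else 0.0
    return auto, manual, review_rate
```

Tuning `threshold` against a labeled sample lets you trade review cost against residual error rate explicitly instead of guessing.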

Set realistic accuracy thresholds
For low-quality handwritten scans, 100% automation is rarely achievable. Designing for assisted automation rather than full automation avoids diminishing returns.

In short, LLMs amplify what they’re given. For handwritten PDFs, success depends far more on image quality, handwriting-aware models, and selective human review than on model selection alone.

Chief Information Officer · a month ago

We got this challenge from one of our customers and solved it with a custom solution. If you share your use case/scenario, I am happy to see if we can support you with any insights, because we've spent some good time there.

Manager, Data Science · 5 months ago

Take a screenshot and feed it to the LLMs directly; they do a better job that way. Also try Google's Document AI — I had a good experience with it.

IT Manager · 5 months ago

We had a good experience using the Docling library (https://docling-project.github.io/docling/). It was able to extract accurate data even in low-quality scenarios.

Program Director, Intelligent Automation + Entrepreneur in Healthcare and Biotech · 5 months ago

We have found quite a bit of success using Microsoft AI Document Intelligence. It's worth checking out.

no title · 5 months ago

*With human in the loop for validation
