We are working on a PDF data extraction POC with a few tools like Claude, Copilot, Power Automate, Textract, etc. While machine-printed PDFs are fairly trivial for LLMs to extract accurately, we are running into a challenge with handwritten PDFs, as the image clarity is often very poor. Even OCR-ing first and then parsing the text is not working well. Has anyone found a better approach or solution for this?
We had a good experience using the Docling library (https://docling-project.github.io/docling/). It was able to extract accurate data even in low-quality scenarios.
We have found quite a bit of success using Azure AI Document Intelligence. It's worth checking out.
*With human in the loop for validation
Handwritten PDFs can be tricky, especially when the scans are low quality.
- Clean up the images first (contrast, noise removal, deskewing)
- Use OCR that's designed for handwriting (like Azure Read or TrOCR)
- Or even let a vision-capable LLM look at the image directly to extract info
- For tricky parts, a quick human check can save a lot of headaches
This approach usually works much better than just running standard OCR.
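To make the first step concrete, here's a minimal dependency-free sketch of two of the cleanup operations (contrast stretching and median denoising). A real pipeline would use OpenCV or Pillow, and deskewing needs a rotation step that isn't shown here; the 2D list-of-ints "image" is just a stand-in for real pixel data.

```python
from statistics import median

def contrast_stretch(img):
    """Linearly rescale grayscale pixel values to the full 0-255 range."""
    lo = min(min(row) for row in img)
    hi = max(max(row) for row in img)
    if hi == lo:  # flat image, nothing to stretch
        return [row[:] for row in img]
    return [[(p - lo) * 255 // (hi - lo) for p in row] for row in img]

def median_denoise(img):
    """3x3 median filter to suppress speckle noise (borders left as-is)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = int(median(
                img[y + dy][x + dx] for dy in (-1, 0, 1) for dx in (-1, 0, 1)
            ))
    return out

# Usage: a mostly-uniform patch with one bright speckle.
noisy = [[50] * 5 for _ in range(5)]
noisy[2][2] = 120  # speckle
cleaned = median_denoise(contrast_stretch(noisy))  # speckle is filtered out
```

With OpenCV the same ideas map to `cv2.normalize`, `cv2.medianBlur`, and `cv2.minAreaRect` plus `cv2.warpAffine` for the deskew.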
It was a while ago, but due to the nature of the information/content, we were experiencing errors in OCR text recognition that were detrimental to the overall project. We ended up combining OCR with hand-keying the portions of documents deemed high-value, high-impact content. The split was roughly 70/30: 70% accurate output from OCR and 30% manual intervention in qualifying areas.
Take a screenshot and feed it to the LLMs directly; they do a better job. Also try Google's Document AI — we had a good experience with it.
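For the screenshot-to-LLM route, getting the request shape right is most of the work. A hedged sketch below, assuming the Anthropic Messages API image-block format and that pages have already been rendered to PNG bytes (e.g. with pdf2image or pymupdf — both assumptions, not part of this thread):

```python
import base64

def image_block(png_bytes):
    """Build one image content block for a vision-LLM chat message.
    (Shape follows the Anthropic Messages API; adjust for other providers.)"""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": "image/png",
            "data": base64.b64encode(png_bytes).decode("ascii"),
        },
    }

# Placeholder page bytes; in practice these come from rendering the PDF.
pages = [b"\x89PNG...page1...", b"\x89PNG...page2..."]
message = {
    "role": "user",
    "content": [image_block(p) for p in pages] + [
        {"type": "text", "text": "Extract the handwritten fields as JSON."}
    ],
}
```

Sending all pages plus one instruction in a single message keeps the extraction prompt next to the images; the actual API call and model choice are left out since they depend on your provider.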