Data extraction wizard - OCR or digital PDF analysis for extracting data from PDF files?

4 years ago
August 17, 2020
3 replies
79 views

limxx491

Hi all,

I am creating a data extraction wizard that can consolidate data from PDF, XML, webpage and paste the output into an Excel file.

For analyzing the PDF pages, which command is better? The OCR command or digital PDF analysis? Some of the PDF files are digital and some are hand-written. Is it possible to use both?

Thank you!

Marlon_Gobitz
2 years ago
August 29, 2022

Hello.

The PDF Advanced command will work for digital PDFs.

Handwriting is trickier. You can try the Advanced OCR command, which requires a license from Microsoft (https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text). It works on images, but I am not sure about PDFs.

You can of course combine the different Advanced Commands: for example, if one fails, continue with the other.
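The "if one fails, continue with the other" idea can be sketched outside of Kryon in a few lines of Python. This is only an illustration under assumptions: `extract_with_fallback`, `digital_pdf_text`, `ocr_text` and the `min_chars` threshold are hypothetical names, and the two extractors are stand-ins for whatever commands or libraries actually read the file.

```python
from typing import Callable, Iterable, Optional, Tuple

# An extractor takes a file path and returns the text it found.
Extractor = Callable[[str], str]


def extract_with_fallback(
    path: str,
    extractors: Iterable[Tuple[str, Extractor]],
    min_chars: int = 20,  # assumed threshold for "this method found real text"
) -> Tuple[Optional[str], str]:
    """Try each extractor in order; return (name, text) of the first usable result."""
    for name, extractor in extractors:
        try:
            text = extractor(path)
        except Exception:
            # Treat any failure as "this method did not work" and move on.
            continue
        if len(text.strip()) >= min_chars:
            return name, text
    # Nothing produced enough text.
    return None, ""


# Hypothetical stand-ins: a scanned PDF has no text layer, so the
# digital reader returns nothing and the OCR step takes over.
def digital_pdf_text(path: str) -> str:
    return ""


def ocr_text(path: str) -> str:
    return "Invoice 1042 Total Amount $ 310.00"


method, text = extract_with_fallback(
    "sample.pdf", [("digital", digital_pdf_text), ("ocr", ocr_text)]
)
```

Putting the cheaper digital read first means OCR only runs for files that need it; the threshold would have to be tuned on real samples.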

Good luck,

Marlon

Theo_Schmidt
2 years ago
August 29, 2022

FYI: It also works for PDF files.

Generally, the MS Cognitive Services (referenced above) work well for hand-written content as well as printed (or digital) PDF files. That said, handwriting recognition depends very much on the actual document in question (and the person writing 😉), so your mileage may vary.

For machine-printed (and scanned) files, you can start by testing the "Tesseract" (OCR) Advanced Command in Kryon Studio on a few samples and see how the results look. In my experience, the results are pretty good more often than not.
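One way to make "see how the results are, based on some samples" a bit more concrete is to compare each OCR output against a hand-typed reference and look at a similarity score. The sketch below uses only Python's standard `difflib`; the function names and the sample data are made up for illustration, and the OCR text is assumed to have been exported from the tool beforehand.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def _normalize(s: str) -> str:
    # Collapse runs of whitespace and ignore case so layout noise is not penalized.
    return " ".join(s.split()).lower()


def similarity(expected: str, actual: str) -> float:
    """Return a 0..1 ratio of how closely the OCR text matches the reference."""
    return SequenceMatcher(None, _normalize(expected), _normalize(actual)).ratio()


def score_samples(samples: Dict[str, Tuple[str, str]]) -> Dict[str, float]:
    """Map each sample name to its similarity score, rounded to 3 decimals."""
    return {name: round(similarity(exp, act), 3) for name, (exp, act) in samples.items()}


# Hypothetical samples: (hand-typed reference, text returned by OCR).
scores = score_samples({
    "printed_invoice": ("Total Amount $ 310.00", "Total  Amount $ 310.00"),
    "handwritten_note": ("Deliver on Monday", "Del1ver on Mondav"),
})
```

Low scores on a whole category of documents (for example the hand-written ones) would suggest trying a different OCR command for that category.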

Theo

alvaro_correia
2 years ago
August 29, 2022

PDF analysis returns text in an order that does not match its position on the page, which forces me to chase strange parsing rules. For example, I would bet that the total of a payment document comes right after the “Total Amount” label I see in the PDF just to the left of the $ figure, but that is not the case. After saving the page to a text file we can catch it, but it is really not intuitive.
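If the extraction step can return each word together with its page coordinates, a possible workaround is to rebuild the visual reading order before parsing. The following is a minimal sketch, assuming such word positions are available; `Word`, `group_into_lines`, `value_right_of` and the `y_tol` tolerance are hypothetical names, not part of any Kryon command.

```python
from typing import List, NamedTuple, Optional


class Word(NamedTuple):
    text: str
    x: float  # left edge of the word on the page
    y: float  # vertical position of the word on the page


def group_into_lines(words: List[Word], y_tol: float = 3.0) -> List[List[Word]]:
    """Group words whose y positions are within y_tol, then order each line left to right."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: w.y):
        if lines and abs(lines[-1][0].y - w.y) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    return [sorted(line, key=lambda w: w.x) for line in lines]


def value_right_of(words: List[Word], label: str, y_tol: float = 3.0) -> Optional[str]:
    """Return the text that follows `label` on the same visual line, if any."""
    label_tokens = label.split()
    n = len(label_tokens)
    for line in group_into_lines(words, y_tol):
        texts = [w.text for w in line]
        for i in range(len(texts) - n + 1):
            if texts[i:i + n] == label_tokens:
                rest = texts[i + n:]
                return " ".join(rest) if rest else None
    return None


# Words arrive in a scrambled order, as they might from a PDF's internal stream.
words = [
    Word("$", 400, 700), Word("Date", 50, 100), Word("Total", 50, 700.5),
    Word("310.00", 420, 701), Word("Amount", 95, 699.8), Word("2020-08-17", 120, 100),
]
total = value_right_of(words, "Total Amount")
```

Grouping by a vertical tolerance is a simplification; multi-column pages or skewed scans would need something more careful.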
