Data extraction wizard - OCR or digital PDF analysis for extracting data from PDF files?

  • 17 August 2020
  • 3 replies

Hi all,

I am creating a data extraction wizard that consolidates data from PDF files, XML files, and web pages and pastes the output into an Excel file.

For analyzing the PDF pages, which command is better: the OCR command or digital PDF analysis? Some of the PDF files are digital and some are handwritten. Is it possible to use both?

Thank you!


3 replies

Hello.

The PDF Advanced Command will work for digital PDFs.

Handwriting is trickier. You can try the Advanced OCR command (you will need a license from Microsoft; see https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/concept-recognizing-text). It works on images, but I am not sure about PDFs.

Of course, you can combine the different Advanced Commands: if one fails, continue with the next one.
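
To illustrate the "if one fails, continue with the other one" idea outside of Kryon Studio, here is a minimal Python sketch. The function names (extract_with_fallback, digital_pdf_text, ocr_text) are hypothetical stand-ins, not Kryon commands or any library's API; in a real wizard each one would wrap the actual PDF or OCR step.

```python
from typing import Callable, Optional, Sequence

# An extractor takes a file path and returns text, or None if it found nothing.
Extractor = Callable[[str], Optional[str]]


def extract_with_fallback(path: str, extractors: Sequence[Extractor]) -> Optional[str]:
    """Try each extractor in order and return the first non-empty result."""
    for extract in extractors:
        try:
            text = extract(path)
        except Exception:
            # Treat any error as "this extractor failed" and try the next one.
            continue
        if text and text.strip():
            return text
    return None


# Hypothetical stand-ins for the real commands (assumptions for illustration).
def digital_pdf_text(path: str) -> Optional[str]:
    # Pretend only files named *_digital.pdf carry a text layer.
    return "digital text" if path.endswith("_digital.pdf") else None


def ocr_text(path: str) -> Optional[str]:
    # Pretend OCR always returns something.
    return "ocr text"
```

Calling extract_with_fallback("scan.pdf", [digital_pdf_text, ocr_text]) tries the digital path first and only falls back to OCR when that path returns nothing or raises an error.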

Good luck,

Marlon

FYI: The Microsoft OCR service mentioned above also works for PDF files.

Generally, the MS cognitive services (referenced above) will work well for handwritten content, as well as printed (or digital) PDF files. That said, handwriting recognition depends very much on the actual document in question (and the person writing 😉), so your mileage may vary.

For machine-printed (and scanned) files: you can start by testing the "Tesseract" (OCR) Advanced Command in Kryon Studio on a few samples and seeing how the results look. In my own experience, the results are pretty good more often than not.

Theo

PDF analysis returns text in an order that does not match the visual layout of the page, which forces me to write odd parsing rules. Example: I would bet that the total amount of a payment document comes right after the “Total Amount” label, which I see in the PDF just to the left of the $ figure, but that is not the case. After saving the page to a text file we can catch it, but it is really not intuitive.
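
One way to cope with that ordering problem, once the page is saved as text, is to stop assuming the amount follows the label and instead pick the dollar figure closest to it in either direction. The Python sketch below is only a heuristic under that assumption; find_total and the AMOUNT pattern are hypothetical names for illustration, and real documents may need stricter rules.

```python
import re
from typing import Optional

# A dollar sign followed by an amount, with or without thousands separators.
AMOUNT = re.compile(r"\$\s*(\d{1,3}(?:,\d{3})+(?:\.\d{2})?|\d+(?:\.\d{2})?)")


def find_total(text: str, label: str = "Total Amount") -> Optional[str]:
    """Return the dollar amount nearest to the label, before or after it."""
    label_pos = text.lower().find(label.lower())
    if label_pos == -1:
        return None
    amounts = list(AMOUNT.finditer(text))
    if not amounts:
        return None
    # Nearest by character distance in the extracted text (a heuristic).
    nearest = min(amounts, key=lambda m: abs(m.start() - label_pos))
    return nearest.group(1)


# Made-up sample where the figure appears before its label in the text.
sample = "Shipping $40.00\nPayment document\n$1,250.00\nTotal Amount\n"
print(find_total(sample))
```

Character distance is a rough proxy for position on the page, so it is worth checking against a few real samples before relying on it.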
