Analyze PDF - Get PDF Page Text (Formatted)

  • 5 November 2021
  • 3 replies

I am curious whether anyone has any insight into how the PDF text extraction works behind the scenes. The results produced by the command are different from those of any other method I have used in the past to extract text from PDF files in code. In some cases that's a good thing, in other cases not so much. Just curious whether anyone has any ideas.


@andy Brommel you want us to share our top secrets?

@Ivgeni Rapoport​ what can we share on this?


Kryon's "read PDF" command works on searchable PDFs, meaning PDFs that contain a text layer. The command extracts this layer into a string variable. The "formatted" option adds tabs and newlines to the string.
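
To make the "formatted" idea concrete, here is a minimal Python sketch of one way tabs and newlines could be derived from a text layer. It is not Kryon's implementation, which this thread does not describe. The `Fragment` structure, the `format_fragments` function, and the `line_tolerance` and `tab_gap` thresholds are all hypothetical, and the sketch assumes the text layer has already been read into positioned fragments.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """One piece of text from a PDF text layer (hypothetical structure)."""
    text: str
    x: float      # left edge of the fragment
    y: float      # vertical position; larger y is lower on the page here
    width: float  # horizontal extent of the fragment


def format_fragments(fragments, line_tolerance=2.0, tab_gap=20.0):
    """Join fragments into one string, adding a newline between rows and a
    tab where the horizontal gap between neighbours exceeds tab_gap.
    Both thresholds are assumed values chosen for illustration."""
    # Order fragments top to bottom, then left to right.
    ordered = sorted(fragments, key=lambda f: (f.y, f.x))

    # Group fragments into rows by comparing vertical positions.
    rows = []
    current = []
    for frag in ordered:
        if current and abs(frag.y - current[0].y) > line_tolerance:
            rows.append(current)
            current = []
        current.append(frag)
    if current:
        rows.append(current)

    # Within each row, choose a tab or a space based on the gap width.
    out_lines = []
    for row in rows:
        row.sort(key=lambda f: f.x)
        parts = [row[0].text]
        for prev, frag in zip(row, row[1:]):
            gap = frag.x - (prev.x + prev.width)
            parts.append("\t" if gap > tab_gap else " ")
            parts.append(frag.text)
        out_lines.append("".join(parts))
    return "\n".join(out_lines)
```

Under these assumptions, two fragments that sit close together on the same row are joined with a space, while a wide gap (such as a label and an amount in separate columns) becomes a tab. Differences in how a tool chooses such thresholds are one plausible reason its output differs from other extraction methods.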

Thanks for the insight! After I started digging "under the hood" (through the application folders) I believe I was able to gain a better understanding of how it works. I really appreciate the assistance.