Create a sandwich PDF from a TIFF image using OCR

August 12, 2016

mlauer

A sandwich PDF is a scanned document that contains an invisible text layer positioned exactly over the text in the image. The text layer is produced by OCR (optical character recognition) using the open source software tesseract. The PDF is therefore searchable, and its text can be copied and pasted into another document.

prerequisites

PowerShell workflow action (I use dataone PowerActivity; other actions such as NTX PowerShell Action - Stable Release from Aaron Labiosa should also work)
tesseract (GitHub: tesseract-ocr/tesseract, the main repository of the Tesseract Open Source OCR Engine; Windows installer from Mannheim University Library)
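
Before setting up the workflow, it can help to confirm that tesseract is reachable from PowerShell on the machine that will run the script. The following is only a sketch: Test-CommandAvailable is a hypothetical helper name, and it assumes the installer put tesseract on the PATH of the account that runs the script.

```powershell
# Hypothetical helper (not part of the original post): reports whether a
# command with the given name can be resolved in the current session.
function Test-CommandAvailable {
    param([Parameter(Mandatory)][string]$Name)
    return [bool](Get-Command -Name $Name -ErrorAction SilentlyContinue)
}

if (Test-CommandAvailable -Name 'tesseract') {
    tesseract --version      # shows the installed version
    tesseract --list-langs   # lists the installed language data, e.g. eng, deu
} else {
    Write-Warning 'tesseract was not found on the PATH of this account.'
}
```

If the warning appears, either add the tesseract installation folder to the PATH or call tesseract.exe by its full path in the script.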

SharePoint lists

document library, named e.g. tesseract, for uploading (scanned) images in TIFF format
list named tesseract-lang, used to select the text language in a list lookup control on the start form
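
The post does not show the columns of the tesseract-lang list, so the following is a sketch under assumptions: each list item is taken to pair a display name with the language code that tesseract expects after -l. The names $TesseractLanguages and Get-TesseractLanguageCode and the default of eng are hypothetical; eng, deu and fra are names of tesseract language data files.

```powershell
# Assumed content of the tesseract-lang list: display name -> tesseract code.
$TesseractLanguages = @{
    'English' = 'eng'
    'German'  = 'deu'
    'French'  = 'fra'
}

# Hypothetical helper: returns the code for a display name, or a default
# when the name is not in the table.
function Get-TesseractLanguageCode {
    param(
        [Parameter(Mandatory)][string]$DisplayName,
        [string]$Default = 'eng'   # assumed fallback
    )
    if ($TesseractLanguages.ContainsKey($DisplayName)) {
        return $TesseractLanguages[$DisplayName]
    }
    return $Default
}

Get-TesseractLanguageCode -DisplayName 'German'   # deu
```

In the workflow itself the lookup control does this job; the selected code ends up in the workflow variable that the script passes to tesseract.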

list workflow tif2pdf

startform

workflow

Upload an image in TIFF format and start the item workflow for it. The PowerShell script has tesseract recognize the text in the image and convert it to a searchable PDF with an image layer and a text layer, a so-called sandwich PDF. Text can also be selected and copied from the new PDF.
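
Outside of SharePoint, the OCR step comes down to a single tesseract call. The sketch below builds its arguments; Get-SandwichPdfArguments, the sample path and the default language are hypothetical. It relies on two facts about the tesseract command line: the second argument is an output base name without extension, and the trailing pdf config makes tesseract write a PDF with that base name.

```powershell
# Hypothetical helper: builds the tesseract arguments for one TIFF file.
function Get-SandwichPdfArguments {
    param(
        [Parameter(Mandatory)][string]$TiffPath,
        [string]$Language = 'eng'   # assumed default
    )
    # tesseract appends ".pdf" itself, so the output base carries no extension.
    $outputBase = $TiffPath -replace '\.[^.\\/]+$', ''
    return @($TiffPath, $outputBase, '-l', $Language, 'pdf')
}

$arguments = Get-SandwichPdfArguments -TiffPath 'C:\temp\scan.tif' -Language 'deu'
$arguments -join ' '   # C:\temp\scan.tif C:\temp\scan -l deu pdf

# To run the OCR (requires tesseract on the PATH and an existing TIFF file):
# & tesseract @arguments   # would write C:\temp\scan.pdf
```

This is why a workflow script for this task passes the PDF name without the .pdf suffix to tesseract and adds the suffix only when it refers to the finished file.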

Here is the PowerShell script for the PowerActivity action:

$VerbosePreference = "Continue" # no output: "SilentlyContinue"

##############################################################################

Add-PsSnapin Microsoft.SharePoint.PowerShell

##############################################################################

$SiteURL = "{Common:WebUrl}"

$ItemURL = "{Common:ItemUrl}"

$FromDoc = ($ItemURL).Replace($SiteURL + "/", "")

$DocLib = [System.IO.Path]::GetDirectoryName($FromDoc) # tesseract

$PathName = "c:\temp\"

$ToPDF = [System.IO.Path]::GetFileNameWithoutExtension($ItemURL)

$ToPDFFile = $PathName + $ToPDF

$PDFFile = $ToPDFFile + '.pdf'

$TiffFile = $PathName + [System.IO.Path]::GetFileName($ItemURL)

$web = Get-SPWeb $SiteURL

$file = $web.GetFile($FromDoc)

$filebytes = $file.OpenBinary()

$filestream = New-Object System.IO.FileStream($TiffFile, "Create")

$binarywriter = New-Object System.IO.BinaryWriter($filestream)

$binarywriter.write($filebytes)

$binarywriter.Close()

##############################################################################

$ErrorActionPreference="Continue"

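# -l takes the language code (Sprachkuerzel); the trailing "pdf" makes tesseract write "$ToPDFFile.pdf"
tesseract -l {WorkflowVariable:Sprachkuerzel} "$TiffFile" "$ToPDFFile" pdf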

##############################################################################

$FilePath = $ToPDFFile + '.pdf'

$overWriteExisting = $True #or add new version if versioning enabled

$Web = Get-SPWeb $SiteURL

$List = $Web.GetFolder($DocLib)

$Files = $List.Files

$File = Get-ChildItem $FilePath

$metadata = @{}

$stream = $File.OpenRead()

$Files.Add($DocLib + "/" + $ToPDF + ".pdf", $stream, $metadata, $overWriteExisting)

$stream.Close() # release the file handle so the temporary PDF can be deleted

$web.Dispose()

del $PDFFile

del $TiffFile
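
The script above leaves its temporary files behind if a step fails part-way, and it does not check whether tesseract succeeded. One possible hardening, as a sketch only: Invoke-WithCleanup and Assert-ExitCode are hypothetical helpers, not part of PowerActivity or tesseract, and the sketch assumes that tesseract signals failure with a non-zero exit code.

```powershell
# Hypothetical helper: runs an action and always deletes the listed
# temporary files afterwards, even if the action throws.
function Invoke-WithCleanup {
    param(
        [Parameter(Mandatory)][scriptblock]$Action,
        [string[]]$TempFiles = @()
    )
    try {
        & $Action
    }
    finally {
        foreach ($path in $TempFiles) {
            if (Test-Path -LiteralPath $path) {
                Remove-Item -LiteralPath $path -Force
            }
        }
    }
}

# Hypothetical helper: turns a non-zero exit code of a native program
# into a terminating error.
function Assert-ExitCode {
    param([int]$ExitCode, [string]$Program)
    if ($ExitCode -ne 0) {
        throw "$Program failed with exit code $ExitCode"
    }
}

# Usage inside the workflow script (commented out because it needs tesseract
# and the variables defined in the script above):
# Invoke-WithCleanup -TempFiles $TiffFile, $PDFFile -Action {
#     tesseract -l deu "$TiffFile" "$ToPDFFile" pdf
#     Assert-ExitCode -ExitCode $LASTEXITCODE -Program 'tesseract'
#     # ... upload $PDFFile to the document library here ...
# }
```

With this shape the two del calls at the end of the script become unnecessary, because the finally block removes both files in every case.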

alexuh
July 29, 2018

Hello - do you install the Windows installer on the server or on the PC? And if I have PDF files that I want to OCR, do I take the same approach?


mlauer
Author
July 31, 2018

Hello Alexander,

The PowerShell script is executed by the dataone action on the SharePoint server, so you have to install tesseract there.

PDF files first have to be converted to image files, e.g. with convert from ImageMagick.
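
A sketch of that conversion step, under assumptions: Get-PdfToTiffArguments is a hypothetical helper, 300 dpi is an assumed resolution, and ImageMagick needs Ghostscript installed to read PDF files. The options used (-density, -background, -alpha) are ImageMagick command-line options.

```powershell
# Hypothetical helper: builds ImageMagick arguments that rasterize a PDF
# into a TIFF file next to it.
function Get-PdfToTiffArguments {
    param(
        [Parameter(Mandatory)][string]$PdfPath,
        [int]$Dpi = 300   # assumed resolution; higher values give larger files
    )
    $tiffPath = $PdfPath -replace '\.pdf$', '.tif'
    # -density must come before the input file to set the rasterization
    # resolution; "-background white -alpha remove" flattens transparency.
    return @('-density', "$Dpi", $PdfPath,
             '-background', 'white', '-alpha', 'remove', $tiffPath)
}

$arguments = Get-PdfToTiffArguments -PdfPath 'C:\temp\scan.pdf'
$arguments -join ' '

# To run the conversion (requires ImageMagick and Ghostscript):
# & magick @arguments    # ImageMagick 7
# & convert @arguments   # ImageMagick 6; on Windows make sure this does not
#                        # resolve to the unrelated system tool convert.exe
```

The resulting TIFF can then be uploaded to the library and processed by the workflow like any scanned image.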
