Create a sandwich PDF from a TIFF image using OCR

Blog Post created by mlauer Champion on Aug 12, 2016

A sandwich PDF is a scanned document that contains an invisible text layer placed exactly over the text in the image. The text layer is produced by OCR (optical character recognition) using the open source software tesseract. This makes the PDF searchable, and its text can be copied and pasted into another document.

 

prerequisites

 

SharePoint lists

  • a document library, named e.g. tesseract, for uploading (scanned) images in TIFF format
  • a list named tesseract-lang, used by a list lookup control on the start form to select the text language

list workflow tif2pdf

  • start form

  • workflow

Upload an image in TIFF format and start the item workflow for it. The PowerShell script uses tesseract to recognize the text in the image and converts it into a searchable PDF with an image layer and a text layer, a so-called sandwich PDF. Text in the new PDF can also be selected and copied.
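The OCR step described above can be tried on its own before wiring it into the workflow. The following is a minimal sketch, not part of the original workflow: the function name Get-TesseractPdfArgs, the sample path C:\temp\scan01.tif and the language code deu are assumptions for illustration. It only builds the argument list and calls tesseract if the command is found on the PATH.

```powershell
# Hypothetical helper: builds the argument list for a tesseract call
# that writes a searchable PDF next to the input image.
function Get-TesseractPdfArgs {
    param(
        [string]$TiffFile,
        [string]$Language = "eng"   # tesseract language code, e.g. eng or deu
    )
    # Output base is the input path without its extension;
    # tesseract appends ".pdf" to it by itself.
    $outputBase = [System.IO.Path]::Combine(
        [System.IO.Path]::GetDirectoryName($TiffFile),
        [System.IO.Path]::GetFileNameWithoutExtension($TiffFile))
    # The trailing "pdf" is the tesseract config that requests PDF output.
    return @($TiffFile, $outputBase, "-l", $Language, "pdf")
}

# Sample values (assumptions, adjust to your environment)
$tessArgs = Get-TesseractPdfArgs -TiffFile "C:\temp\scan01.tif" -Language "deu"

# Only run OCR if tesseract is installed and on the PATH
if (Get-Command tesseract -ErrorAction SilentlyContinue) {
    tesseract --list-langs          # shows the installed language packs
    & tesseract @tessArgs
    if ($LASTEXITCODE -ne 0) {
        Write-Warning "tesseract exited with code $LASTEXITCODE"
    }
}
```

Checking $LASTEXITCODE after the call is one way to notice a failed OCR run before anything is uploaded back to SharePoint.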

 

Here is the PowerShell script for the PowerActivity action:

$VerbosePreference = "Continue"   # no output: "SilentlyContinue"

##############################################################################

Add-PsSnapin Microsoft.SharePoint.PowerShell

##############################################################################

$SiteURL  =  "{Common:WebUrl}"

$ItemURL =  "{Common:ItemUrl}"

 

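# Site-relative path of the uploaded item, e.g. tesseract/<file>.tif

$FromDoc = $ItemURL.Replace($SiteURL + "/", "")

$DocLib = [System.IO.Path]::GetDirectoryName($FromDoc)  # tesseract

# Local working folder; it must already exist on the server

$PathName = "c:\temp\"

 

# Base name without extension; tesseract appends ".pdf" to it

$ToPDF = [System.IO.Path]::GetFileNameWithoutExtension($ItemURL)

$ToPDFFile = $PathName + $ToPDF

$PDFFile = $ToPDFFile + '.pdf'

$TiffFile = $PathName + [System.IO.Path]::GetFileName($ItemURL)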

 

# Download the TIFF from the document library to the local working folder

$web = Get-SPWeb $SiteURL

$file = $web.GetFile($FromDoc)

$filebytes = $file.OpenBinary()

$filestream = New-Object System.IO.FileStream($TiffFile, "Create")

$binarywriter = New-Object System.IO.BinaryWriter($filestream)

$binarywriter.write($filebytes)

$binarywriter.Close()

##############################################################################

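# Keep the script running even if tesseract reports a problem

$ErrorActionPreference="Continue"

# Sprachkuerzel holds the tesseract language code selected on the start form (for example deu or eng); the trailing "pdf" requests PDF output

tesseract -l {WorkflowVariable:Sprachkuerzel} "$TiffFile" "$ToPDFFile" pdf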

##############################################################################

$overWriteExisting = $True   # overwrite, or add a new version if versioning is enabled

 

# Reuse the SPWeb object opened above instead of opening a second one

$List = $web.GetFolder($DocLib)

$Files = $List.Files

 

$metadata = @{}

$stream = [System.IO.File]::OpenRead($PDFFile)

$Files.Add($DocLib + "/" + $ToPDF + ".pdf", $stream, $metadata, $overWriteExisting)

# Close the stream so the temporary PDF can be deleted afterwards

$stream.Close()

 

$web.Dispose()

 

Remove-Item $PDFFile

Remove-Item $TiffFile
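
To see what the path handling at the top of the script produces, here is a small sketch that runs the same string operations on sample values. The two URLs are hypothetical; in the workflow they are filled in from {Common:WebUrl} and {Common:ItemUrl}.

```powershell
# Hypothetical sample values standing in for the workflow references
$SiteURL = "http://sharepoint.example/sites/demo"
$ItemURL = "http://sharepoint.example/sites/demo/tesseract/scan01.tif"

# Same derivations as in the script
$FromDoc   = $ItemURL.Replace($SiteURL + "/", "")                      # tesseract/scan01.tif
$DocLib    = [System.IO.Path]::GetDirectoryName($FromDoc)              # tesseract
$PathName  = "c:\temp\"
$ToPDF     = [System.IO.Path]::GetFileNameWithoutExtension($ItemURL)   # scan01
$ToPDFFile = $PathName + $ToPDF                                        # c:\temp\scan01
$PDFFile   = $ToPDFFile + '.pdf'                                       # c:\temp\scan01.pdf
$TiffFile  = $PathName + [System.IO.Path]::GetFileName($ItemURL)       # c:\temp\scan01.tif

# URL under which the PDF is added to the library
$TargetUrl = $DocLib + "/" + $ToPDF + ".pdf"                           # tesseract/scan01.pdf

Write-Output $FromDoc, $DocLib, $PDFFile, $TiffFile, $TargetUrl
```

The PDF therefore ends up in the same library as the uploaded TIFF, under the same base name.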
