What is the best way to evaluate a large number of PDF files and see which have already undergone OCR?
I use Hazel to process PDF files and OCR them, but I do not want to run them through OCR twice.
Thank you.
SimonBrowneNZ wrote: I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", so if a PDF has been OCR'd, a Finder contents search will find an "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. But using Finder, I search for Kind is PDF first, then click + to add a Contents search for "a".
I have a Hazel folder, called ScanThis, that scans files using PDFpen and colours them green and purple, so I know which files have a text layer. Katie Floyd has a script for this.
pdftotext file.pdf file.txt
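A rough sketch of how the pdftotext idea could be turned into a yes/no check, assuming pdftotext (from poppler or xpdf) is installed. The function names has_text and pdf_has_text_layer are made up for illustration. Passing "-" as the output file makes pdftotext write to stdout instead of a .txt file; a scanned PDF with no text layer should produce only whitespace.

```shell
#!/bin/bash
# Sketch only: decide whether a PDF already has a text layer by looking at
# what pdftotext extracts from it.

# has_text (hypothetical helper): reads text on stdin and succeeds
# if it contains at least one letter or digit.
has_text() {
    grep -q '[[:alnum:]]'
}

# pdf_has_text_layer (hypothetical helper): succeeds if pdftotext
# extracts any letters or digits from the PDF given as $1.
pdf_has_text_layer() {
    pdftotext "$1" - 2>/dev/null | has_text
}

# Example use:
#   pdf_has_text_layer scan.pdf && echo "already OCR'd" || echo "needs OCR"
```

If pdftotext is missing or fails on a file, the check reports "no text", so such files would be sent to OCR rather than skipped.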
If all of the following conditions are met:
    Kind is PDF
    Contents do not contain match "a"
Do the following to the matched file or folder:
    Run AppleScript embedded script
    Toggle extension visible
    Rename with pattern: name extension
    Continue matching rules
tell application "PDFpenPro"
    open theFile as alias
    tell document 1
        ocr
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
    quit
end tell
gacorrea wrote: SimonBrowneNZ wrote: I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", so if a PDF has been OCR'd, a Finder contents search will find an "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. But using Finder, I search for Kind is PDF first, then click + to add a Contents search for "a".
I have a Hazel folder, called ScanThis, that scans files using PDFpen and colours them green and purple, so I know which files have a text layer. Katie Floyd has a script for this.
Could you please post the code for how you perform the above check? Thanks!
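#!/bin/bash
# Exit 0 when the PDF contains no "Font" string, i.e. it probably
# has no text layer yet; exit 1 otherwise.
# -q stops grep from printing anything; only the exit status is used.
if ! grep -q Font "$1"
then
    exit 0
else
    exit 1
fi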
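For checking a whole folder at once rather than one file per Hazel run, something like the following might work. It is only a sketch that reuses the same "no Font string" heuristic as the script above; the function names needs_ocr and list_unocred are invented here, and the heuristic itself can be fooled (for example by PDFs whose objects are compressed), so treat the output as a list of candidates.

```shell
#!/bin/bash
# Sketch only: list the PDFs under a folder that appear to lack a text layer.

# needs_ocr (hypothetical helper): succeeds when the file given as $1
# contains no "Font" string.
needs_ocr() {
    ! grep -q Font "$1"
}

# list_unocred (hypothetical helper): print every PDF under the folder $1
# that needs_ocr flags. File names containing newlines are not handled.
list_unocred() {
    find "$1" -type f -iname '*.pdf' |
        while IFS= read -r pdf; do
            if needs_ocr "$pdf"; then
                printf '%s\n' "$pdf"
            fi
        done
}

# Example use:
#   list_unocred ~/Documents/Scans
```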