Determine if a PDF needs to be OCR'd & Automate FineReader

I recently discovered a way to determine whether a PDF file contains an Optical Character Recognition (OCR) text layer, or even a native text layer, and thought this board might benefit.
Basically, if the raw contents of a PDF file contain the word "Font", the file has likely either been printed natively to PDF or already been run through an OCR program, such as Adobe Acrobat, ABBYY FineReader, or the OCR capabilities of PDFpen(Pro). By "raw contents", I mean the binary-level code of the PDF, not the text you see when reading it.
To prove this to yourself, open a freshly scanned (non-OCR'd) PDF in TextMate, or your favorite text editor. Search for the word "Font". Then run the PDF through your favorite OCR program, open it in TextMate again, and repeat the search. This time, you should find words like "FontName", "BaseFont", and/or "Font".
I automated this check with the "grep" command. Here's a shell script example:
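#!/bin/bash
# grep -q prints nothing and only sets the exit status;
# -a makes grep treat the binary PDF as text.
if ! grep -qa Font "$1"
then
    echo "This file needs to be OCR'd"
else
    echo "This file does not need to be OCR'd"
fi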
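To check a whole folder at once, the same grep test can be wrapped in a function and looped over every PDF. This is only a sketch: the function names `needs_ocr` and `report_folder` are made up for illustration, and it inherits the limits of the heuristic (a PDF that mentions "Font" for some other reason will be reported as already having text).

```shell
#!/bin/bash

# Return 0 (true) if the PDF's raw bytes never mention "Font",
# i.e. it probably has no text layer yet.
# -q: print nothing; -a: treat the binary PDF as text.
needs_ocr() {
    ! grep -qa Font "$1"
}

# Print one status line per PDF in the given folder (hypothetical helper).
report_folder() {
    local dir="$1" pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # the folder contains no PDFs
        if needs_ocr "$pdf"; then
            echo "needs OCR: $pdf"
        else
            echo "has text:  $pdf"
        fi
    done
}
```

After sourcing this, something like `report_folder ~/Documents/Scans` would list each PDF in that folder along with its status.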
Even better, the first rule for one of my watched folders contains a shell script that automatically tests whether a PDF needs to be OCR'd, then OCRs it if necessary. Since my version of ABBYY FineReader came bundled with my ScanSnap, this code assumes your freshly scanned PDF originated from a Fujitsu ScanSnap scanner. (Question: Why don't I just let the ScanSnap software run the PDF through FineReader automatically? Answer: I don't feel like waiting for it.) If your PDF came from somewhere else and you still want to use the version of ABBYY FineReader that came with your ScanSnap, you'll need to use PDFInfo, PDFAuxInfo, or pdftk to set the PDF's Creator to "ScanSnap Manager". I'm sure I have a script that can automate that if you'd like.
#!/bin/bash
# -q keeps grep quiet; -a makes it treat the binary PDF as text
if ! grep -qa Font "$1"
then
    # Open the file in ABBYY's FineReader
    open -a 'Scan to Searchable PDF.app' "$1"
    # Wait until the file is done being processed by polling for a file named "yourfile processed by FineReader.pdf"
    while [ ! -e "${1%.pdf} processed by FineReader.pdf" ]; do
        sleep 5
    done
    # Wait a few more seconds before touching the new file
    sleep 5
    # Overwrite the original, un-OCR'd file with the new, OCR'd file
    mv -f "${1%.pdf} processed by FineReader.pdf" "$1"
    # Use AppleScript to tell the system to hide ABBYY's FineReader
    osascript -e 'tell application "System Events" to set visible of process "Scan to Searchable PDF" to false'
fi
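For PDFs that did not come from a ScanSnap, the Creator step could look roughly like this with pdftk. Treat it as a sketch under assumptions: it assumes pdftk is installed and on your PATH, it uses pdftk's `update_info` operation, the function names are made up for illustration, and it is untested against FineReader itself.

```shell
#!/bin/bash

# Write a pdftk metadata file that sets only the Creator field.
# Newer pdftk releases expect the leading "InfoBegin" line.
write_creator_info() {
    printf 'InfoBegin\nInfoKey: Creator\nInfoValue: %s\n' "$1" > "$2"
}

# Stamp a PDF with Creator "ScanSnap Manager", replacing it in place.
# Hypothetical helper; the temporary output name is arbitrary.
set_scansnap_creator() {
    local pdf="$1" info out="${1%.pdf} creator-tmp.pdf" status=0
    info=$(mktemp) || return 1
    write_creator_info "ScanSnap Manager" "$info"
    # Only replace the original if pdftk succeeded.
    pdftk "$pdf" update_info "$info" output "$out" && mv -f "$out" "$pdf" || status=1
    rm -f "$info"
    return "$status"
}
```

In the script above, `set_scansnap_creator "$1"` would go just before the `open -a` line.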
If you'd like to run a Quartz Filter on your PDF to help reduce the file size, it's a touch trickier. You'll need an Apple utility called "quartzfilter", which you can grab from a system running Mac OS X 10.5, where it lives under /System/Library/Printers/Libraries/. If you're running 10.6 or later, I'll assume you copied that file into the same directory on your current computer. Next, add the following code to the script above after the "osascript" line, but before the last line reading "fi":
# Run the magical quartz filter to reduce the file size - the same one you can access using Preview.app
/System/Library/Printers/Libraries/quartzfilter "$1" /System/Library/Filters/Reduce\ File\ Size.qfilter "${1%.pdf} Reduced.pdf"
# Sometimes, strangely, the resulting file is larger. If it is, toss it, otherwise keep the new reduced-size version
if [ "$(du -ks "$1" | awk '{print $1}')" -gt "$(du -ks "${1%.pdf} Reduced.pdf" | awk '{print $1}')" ]
then
    mv -f "${1%.pdf} Reduced.pdf" "$1"
else
    rm "${1%.pdf} Reduced.pdf"
fi
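Because `du -ks` reports disk usage in whole kilobytes, two files that differ by only a few hundred bytes can look identical to the test above. If that matters, the comparison can be done on exact byte counts instead. A minimal sketch, with `byte_size` and `keep_smaller` as made-up helper names:

```shell
#!/bin/bash

# Exact size in bytes. Reading from stdin keeps the filename out of
# wc's output; tr strips the padding some wc versions add.
byte_size() {
    wc -c < "$1" | tr -d ' '
}

# keep_smaller ORIGINAL CANDIDATE
# Replace ORIGINAL with CANDIDATE only if CANDIDATE is strictly smaller;
# otherwise delete CANDIDATE.
keep_smaller() {
    local original="$1" candidate="$2"
    if [ "$(byte_size "$candidate")" -lt "$(byte_size "$original")" ]; then
        mv -f "$candidate" "$original"
    else
        rm -f "$candidate"
    fi
}
```

In the script, the size check would then shrink to `keep_smaller "$1" "${1%.pdf} Reduced.pdf"`.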
Of course, you can probably run this last step using Hazel with an Automator reference. In fact, it's probably easier to do it that way; you'd just miss out on the size check portion of the code.
Lastly, PDFpen(Pro) recently added some OCR lingo to its AppleScript dictionary, so some enterprising person can likely automate the OCR capabilities of PDFpen(Pro) via Hazel. That code may look something like this, but I haven't tested it:
-- _file is a placeholder for the PDF being processed
tell application "PDFpenPro"
    open _file as alias
    tell document 1
        ocr
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
end tell
Good luck!