Determine if a PDF needs to be OCR'd & Automate FineReader

I recently discovered a way to determine whether a PDF file contains an Optical Character Recognition (OCR) text layer, or even a native text layer, and thought this board might benefit.

Basically, if the contents of a PDF file contain the word "Font", it's likely that the file was either printed natively to PDF or has already been run through an OCR program, such as Adobe Acrobat, ABBYY FineReader, or the OCR capabilities of PDFpen(Pro). By "contents", I mean the raw, binary-level code of the PDF file, not the text you see when you view the PDF.

To prove this to yourself, open a freshly scanned (non-OCR'd) PDF in TextMate or your favorite text editor and search for the word "Font". You shouldn't find it. Then run the PDF through your favorite OCR program, open it in TextMate again, and repeat the search. This time, you should find words like "FontName", "BaseFont", and/or "Font".
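The same check can be done from Terminal without opening an editor. Here is a minimal sketch, assuming a grep that supports the -a and -o options; the function name list_font_keys is made up for illustration and is not part of the original workflow:

```shell
#!/bin/bash
# list_font_keys FILE
# Print each distinct PDF name containing "Font" (e.g. /Font, /BaseFont,
# /FontName) found in the raw bytes of FILE, with a count of occurrences.
# Prints nothing if no such name is found.
list_font_keys() {
    # -a: treat the binary PDF as text; -o: print only the matched part
    grep -a -o '/[A-Za-z]*Font[A-Za-z]*' "$1" | sort | uniq -c
}
```

Running `list_font_keys scan.pdf` before and after OCR should show the difference described above. Font entries stored inside compressed parts of a PDF would not show up in a raw search like this, so an empty result is a strong hint rather than proof.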

I automated this capability by using the command "grep". Here's a shell script example:

Code:
#!/bin/bash
# -q: only set the exit status; don't print the matching (binary) lines
if ! grep -q Font "$1"
then
    echo "This file needs to be OCRd"
else
    echo "This file does not need to be OCRd"
fi

Even better, the first rule for one of my watched folders contains a shell script that automatically tests whether a PDF needs to be OCR'd and then OCRs it if necessary. (Question: Why don't I just let the ScanSnap software run the PDF through FineReader automatically? Answer: I don't feel like waiting for it.) Since my version of ABBYY FineReader came bundled with my ScanSnap, this code assumes your freshly scanned PDF originated from a Fujitsu ScanSnap scanner. If it didn't, and you still want to use the version of ABBYY FineReader that came with your ScanSnap, you'll need to use PDFInfo, PDFAuxInfo, or pdftk to set the PDF's Creator to "ScanSnap Manager". I'm sure I have a script that can automate that if you'd like.

Code:
#!/bin/bash
if ! grep -q Font "$1"
then
    # Open the file in ABBYY's FineReader
    open -a 'Scan to Searchable PDF.app' "$1"

    # Wait around until the file is done being processed by testing to see
    # if a file named "yourfile processed by FineReader" exists.
    while [ ! -e "${1%.pdf} processed by FineReader.pdf" ]; do
        sleep 5
    done
    sleep 5

    # Overwrite the original, un-OCR'd file with the new, OCR'd file
    mv -f "${1%.pdf} processed by FineReader.pdf" "$1"

    # Use AppleScript to tell the system to hide ABBYY's FineReader
    osascript -e 'tell application "System Events" to set visible of process "Scan to Searchable PDF" to false'
fi
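The wait loop in this script keeps polling forever if FineReader never produces its output file, for example if OCR fails or the app is closed. Below is a sketch of a bounded wait. The function name wait_for_file and the default timeout and polling values are assumptions, not part of the original script:

```shell
#!/bin/bash
# wait_for_file PATH [TIMEOUT_SECONDS] [POLL_SECONDS]
# Returns 0 as soon as PATH exists, or 1 if it has not appeared
# after roughly TIMEOUT_SECONDS.
wait_for_file() {
    local path=$1
    local timeout=${2:-600}   # assumed default: give up after 10 minutes
    local poll=${3:-5}        # assumed default: check every 5 seconds
    local waited=0

    while [ ! -e "$path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$poll"
        waited=$((waited + poll))
    done
    return 0
}
```

In the script, the `while` loop could then become `wait_for_file "${1%.pdf} processed by FineReader.pdf" 600 5 || exit 1`, so that a stuck OCR run ends with a non-zero exit status instead of hanging.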

If you'd like to run a Quartz Filter on your PDF to help reduce the file size, it's a touch trickier to do. You'll need to find a system running 10.5 and grab an Apple utility called "quartzfilter". If you're running 10.6 or greater, I'll assume you found that file on a computer running 10.5 under /System/Library/Printers/Libraries/ and placed it under the same directory on your current computer. Next, add the following code to the script above after the "osascript" line, but before the last line reading "fi":

Code:
# Run the magical quartz filter to reduce the file size - the same one you can access using Preview.app
/System/Library/Printers/Libraries/quartzfilter "$1" /System/Library/Filters/Reduce\ File\ Size.qfilter "${1%.pdf} Reduced.pdf"

# Sometimes, strangely, the resulting file is larger. If it is, toss it, otherwise keep the new reduced-size version
if [ "$(du -ks "$1" | awk '{print $1}')" -gt "$(du -ks "${1%.pdf} Reduced.pdf" | awk '{print $1}')" ]
then
    mv -f "${1%.pdf} Reduced.pdf" "$1"
else
    rm "${1%.pdf} Reduced.pdf"
fi
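One thing to be aware of: `du -ks` reports disk usage in whole kilobytes, so two files that differ by less than that compare as equal. Here is a sketch of the same keep-the-smaller logic using exact byte counts from `wc -c`; the function name keep_smaller is hypothetical:

```shell
#!/bin/bash
# keep_smaller ORIGINAL CANDIDATE
# If CANDIDATE is strictly smaller than ORIGINAL (in bytes), move it over
# ORIGINAL; otherwise delete CANDIDATE and leave ORIGINAL untouched.
keep_smaller() {
    local original=$1 candidate=$2
    local original_bytes candidate_bytes

    # wc -c < file prints only the byte count (possibly padded with spaces)
    original_bytes=$(wc -c < "$original") || return 1
    candidate_bytes=$(wc -c < "$candidate") || return 1

    if (( candidate_bytes < original_bytes )); then
        mv -f "$candidate" "$original"
    else
        rm -f "$candidate"
    fi
}
```

It would be called as `keep_smaller "$1" "${1%.pdf} Reduced.pdf"` right after the quartzfilter line.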

Of course, you can probably run this last step using Hazel with an Automator reference. In fact, it's probably easier to do it that way; you'd just miss out on the size-check portion of the code.

Lastly, PDFpen(Pro) recently added some OCR lingo to its AppleScript dictionary, so some enterprising person can likely automate the OCR capabilities of PDFpen(Pro) via Hazel. That code may look something like this, but I haven't tested it:

Code:
-- _file is assumed to hold the path of the PDF to process
tell application "PDFpenPro"
    open _file as alias
    tell document 1
        ocr
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
end tell

Good luck!

Pretty nifty searching the file for the font directives. It's too bad there's no good metadata to differentiate between bitmap PDFs and vector ones, but this should work for the time being. Thanks for the tip.

Hi. Thanks for the workflow.

Your script works great unless the document is multipage. Hazel has already run the rule before I can finish scanning the second page.

Is there any way of getting Hazel to wait until the scan is complete?

That's probably an issue with your scanning software not locking the file properly. You should add a condition to wait some amount of time based on the date modified like "Date Modified is not in the last 5 minutes".
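In a Hazel version that supports shell script conditions, the same "has not changed recently" idea could be approximated with find's -mmin test. This is only a sketch; the function name is_settled and the 5-minute default are illustrative:

```shell
#!/bin/bash
# is_settled FILE [MINUTES]
# Succeeds (exit 0) if FILE was last modified more than MINUTES minutes ago,
# i.e. the scanner has probably finished writing to it.
is_settled() {
    local file=$1
    local minutes=${2:-5}   # assumed quiet period, mirroring "last 5 minutes"
    # find prints the path only when the -mmin test matches
    [ -n "$(find "$file" -maxdepth 0 -mmin +"$minutes" 2>/dev/null)" ]
}
```

A condition script would end with `is_settled "$1" 5`, so that the rule only matches once the file has been left alone for that long. The built-in "Date Modified" condition does the same job without any scripting.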

Thanks Mr Noodle

That's a great idea. The software is ScanSnap. Is there an obvious way of getting it to lock the file that I've missed?

Thanks for your response!

That's up to the programmer to do it properly. Not something you can do yourself.

Hm.. interesting. How can this be done in Hazel 3 now that we have "Passes shell script"-type conditions? What would the code be for the condition so that it exits with 0 when no text layer is found? I assume this is really easy to answer, but unfortunately I don't know any shell scripting.

Never mind, I just found out how:

Code:
#!/bin/bash
if ! grep -q Font "$1"
then
    exit 0
else
    exit 1
fi

@AppleSuperlatives

I am trying to use your script on 10.8.2. I notice you say that quartz filter needs to be installed for the portion of the script that reduces the file size. I have a quartzfilter located in /users/mydirectory/Library/PDF Services called "Reduce File Size (50%)". I tried substituting this in the line of your script where you have:

/System/Library/Filters/Reduce\ File\ Size.qfilter "${1%.pdf} Reduced.pdf"

The problem is that I am not sure what the first part of this line should be in my case since I am not dealing with the copied file "quartzfilter" which you have indicated should be in the directory /System/Library/Printers/Libraries/quartzfilter

This quartzfilter works just fine when I invoke it from Preview using PDF button so I am confident that it should work, just not sure how to get the script to call it.

Any help would be greatly appreciated.

@taxman1

You need three things to reduce the size of your PDF using my methodology:
1) An Apple-created executable program called "quartzfilter" (one word) that Apple last included with Mac OS X 10.5. This is the program that uses a "Quartz filter" (two words) to filter your PDF. You can copy this executable program anywhere you want. I suggested copying it to /System/Library/Printers/Libraries/ because that is where Apple placed it in Mac OS X 10.5. If you don't have this program, perhaps some nice internet denizen can help you get it.

2) A "Quartz filter" (two words) set to your specifications. In Mac OS X 10.8, Apple stores a bunch of these Quartz filters under /System/Library/Filters. Its file name extension is .qfilter. This is the file that tells the "quartzfilter" program how to reduce your PDF. For example, one of the Apple-provided filters is called "Lightness Decrease.qfilter". A Quartz filter can fairly easily be created and edited using the Apple-supplied ColorSync Utility.app located in /Applications/Utilities.

3) A PDF to be reduced.

So, in your case, the command you'd run is:

Code:
/System/Library/Printers/Libraries/quartzfilter "$1" /users/mydirectory/Library/PDF\ Services/Reduce\ File\ Size\ \(50%\).qfilter "${1%.pdf} Reduced.pdf"

where:
1) "/System/Library/Printers/Libraries/quartzfilter" - the path to the "quartzfilter" executable program.
2) "$1" - the script-supplied path to the current file being processed. Don't change this.
3) "/users/mydirectory/Library/PDF\ Services/Reduce\ File\ Size\ \(50%\).qfilter" - the properly-escaped path to your magical Quartz filter. Spaces and parentheses each need a backslash in front of them.
4) "${1%.pdf} Reduced.pdf" - The script-supplied path and filename to the PDF the quartzfilter will write. Don't change this.

(Forgive me if I didn't properly escape all of the appropriate characters in the above command.)
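An alternative to backslash-escaping every special character is to keep the paths in double-quoted variables. This sketch reuses the paths from the post. The variable names and the reduce_pdf wrapper are made up here, and the quartzfilter argument order (input, filter, output) is simply taken from the command above:

```shell
#!/bin/bash
# Paths copied from the post; adjust to your own system.
QUARTZFILTER="/System/Library/Printers/Libraries/quartzfilter"
QFILTER="/users/mydirectory/Library/PDF Services/Reduce File Size (50%).qfilter"

# reduce_pdf INPUT_PDF
# Writes "<name> Reduced.pdf" next to INPUT_PDF using the filter above.
reduce_pdf() {
    local input=$1
    local output="${input%.pdf} Reduced.pdf"
    # Double quotes keep spaces, parentheses, and % intact; no backslashes needed.
    "$QUARTZFILTER" "$input" "$QFILTER" "$output"
}
```

Inside the Hazel script this would be called as `reduce_pdf "$1"`.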

Lastly, I guarantee that there are other ways to programmatically reduce the size of a PDF. This is the way I discovered and it still works beautifully.

Hi AppleSuperlatives,

Thank you so much, you've given me hope that there are solutions to getting paperless. I'm relatively new to Hazel and AppleScript but have been able to get your code to work with our ScanSnap. Could you help me figure out how to get it working with another scanner? I do a lot of my scanning when I'm at work and the high-speed network scanner isn't busy.

You mentioned PDFInfo, PDFAuxInfo, and pdftk. Are those elements of AppleScript or something else?

Your help is greatly appreciated,

~Dietrich

Ok, I've found that Automator can set the Content Creator to ScanSnap Manager, but how do I set the input and output of the workflow so that I can use it in Hazel?

Dietrich - I'm not entirely certain how you're setting things up...and I don't know how to use Automator to work with Hazel. You'll have to search around the forums. However, to give my best guess, if I were to use Automator to create a workflow, it might be configured as follows:

Step 1: "Set PDF Data". Use it to set the Content Creator to "ScanSnap Manager". Leave the other fields blank.
Step 2: "Run Shell Script", shell: </bin/bash>, and Pass input: <as arguments>. Then fill in the remainder of the box with the shell script as described above.

After you save this as an Automator workflow, there's an option in Hazel to apply an Automator workflow to files matching your criteria.

Lastly:
1) PDFInfo is a freeware app that can be used to manually change metadata of a PDF using a GUI. I'm not sure if it's even still available.
2) PDFAuxInfo is/was a freeware, third-party action for Automator that would allow you to edit the metadata of a PDF. It stopped working for me a long time ago. Apple's "Set PDF Data" Automator action does basically everything PDFAuxInfo did, so it's not a big loss.
3) pdftk is an amazing, freeware, command-line tool that can manipulate PDF files in all sorts of ways. It's unnecessarily difficult for mere mortals to install, but it's powerful. Google it.

I would use one of those three programs to set the Content Creator to "ScanSnap Manager". Since you're using Apple's Automator action to set the Content Creator, you don't have to worry about them.
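For anyone taking the pdftk route instead, here is a sketch of setting the Creator from the shell. It assumes pdftk is installed and on the PATH and relies on its update_info operation. The function names are invented for this example, and the exact metadata layout is worth checking against the output of `pdftk yourfile.pdf dump_data` on your version:

```shell
#!/bin/bash
# write_creator_info INFO_FILE CREATOR
# Writes a pdftk metadata file that sets the Info dictionary's Creator key.
write_creator_info() {
    printf 'InfoBegin\nInfoKey: Creator\nInfoValue: %s\n' "$2" > "$1"
}

# set_pdf_creator PDF [CREATOR]
# Rewrites PDF in place with its Creator set (default: "ScanSnap Manager").
set_pdf_creator() {
    local pdf=$1 creator=${2:-ScanSnap Manager}
    local info tmp_pdf
    info=$(mktemp) || return 1
    tmp_pdf="${pdf%.pdf} creator-tmp.pdf"

    write_creator_info "$info" "$creator"
    # Write to a temporary file first rather than onto the input file
    if pdftk "$pdf" update_info "$info" output "$tmp_pdf"; then
        mv -f "$tmp_pdf" "$pdf"
    else
        rm -f "$tmp_pdf" "$info"
        return 1
    fi
    rm -f "$info"
}
```

Calling `set_pdf_creator "$1"` before the OCR step would play the same role as the "Set PDF Data" Automator action.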

Thanks for the help!

I got it working this afternoon. This was my first time working in Automator, so it took a while for me to figure out how to have the file be the input. I found another example of someone using Automator within Hazel, and they had used Get Selected Finder Items, which must automatically use the file being passed from Hazel. From there it was pretty simple using Set PDF Metadata once I realized I didn't need any sort of save/save as object to make the changes.

Then my Hazel rule runs the Automator action to set the creator to the ScanSnap Manager and then runs the shell script exactly as provided by AppleSuperlatives.

Finally, since everything coming into this folder is a PDF directly from a network scanner and I know it needs to be OCR'd, I have it running on every PDF in the folder, requiring that I move the file to another folder once finished.

Working flawlessly!

Ok, I almost have it figured out. I set up an Automator Folder Action to change the Creator to "ScanSnap Manager". Then I have a Hazel rule that runs on every file added to the folder. The rule basically just runs the shell script and moves the file to another folder.

I get this error (FYI: the file I was using is Management Agreement - Dubay.pdf):

2013-09-09 21:01:49.722 hazelworker[6529] Processing folder Auto OCR Files (forced)
2013-09-09 21:01:51.728 hazelworker[6529] Management Agreement - Dubay.pdf: Rule OCR any Non-OCR File matched.
2013-09-09 21:04:38.543 hazelworker[6529] [Error] Shell script failed: Error processing shell script on file /Users/phantom27/Google Drive/Action /Auto OCR Files/Management Agreement - Dubay.pdf.
2013-09-09 21:04:38.543 hazelworker[6529] Shellscript exited with non-successful status code: 1
2013-09-09 21:04:40.568 hazelworker[6529] Management Agreement - Dubay.pdf: Rule OCR any Non-OCR File matched.
2013-09-09 21:04:41.155 hazelworker[6529] [File Event] File moved: Management Agreement - Dubay.pdf moved from folder /Users/phantom27/Google Drive/Action /Auto OCR Files to folder /Users/phantom27/Google Drive/Action .
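
The log shows only that the script exited with status 1, not which command inside it failed. One way to narrow that down is to have the script trace its own commands to a file. A sketch, where the function name enable_trace and the log location are assumptions for illustration:

```shell
#!/bin/bash
# enable_trace LOG_FILE
# From this point on, append stderr and a trace of every command the
# script runs (bash's set -x) to LOG_FILE.
enable_trace() {
    exec 2>>"$1"
    set -x
}
```

Calling `enable_trace "$HOME/hazel-ocr-trace.log"` as the first line of the rule's script would record each step (grep, open, mv, osascript) as it runs, along with any error messages, so the last lines of that file show where things went wrong.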