Newbie - Getting Hazel to automate OCR'ing


Moderator: Mr_Noodle

Newbie - Getting Hazel to automate OCR'ing Wed Mar 06, 2013 5:04 pm • by Cassady
Hello all,

This is my first post - and I'm not sure if it's customary to introduce myself [it is in these parts :) ]?
If so - well, "Hello from sunny South Africa"

I have what is possibly a very straightforward question - hoping someone can point me in the right direction.
I did a search for both "OCR" and "PDFpen", but couldn't find anything similar to what I want to confirm.

I only recently jumped into the Mac world. So whereas I have managed basic Automator scripts/workflows/folder actions - Hazel is requiring me to put a serious thinking cap on. And since I bought Hazel to save time - I figured I would give the forum a bash, in the hope that you can save me hours(?) of figuring out the required steps!

I have thousands of PDFs, as part of a research library for my PhD. Some of them have searchable content. Most don't.

They are contained in 4 folders, based on jurisdiction [USA; Aus; UK; RSA]. All the PDFs have filenames starting with the relevant jurisdiction, e.g. "USA_Author_Article Name_Year published.pdf".

I recently purchased PDFpen, in order to OCR the various PDFs.

I am hoping Hazel can make what could be a very tedious job much simpler.

What I want to do:

Have Hazel automatically select PDF files that I label "purple", send them over to PDFpen to convert/OCR them, and then return each file, OCR'd, to its original folder, now "un-labeled"... Is this possible - or am I aiming a bit high?

I figure it would be relatively straightforward to move the Purple-labelled file to a different folder;
I presume some sort of script would then be necessary to invoke PDFpen's OCR action [the part where I'm hoping someone here can wave their magic wand in my direction]?;
Then have PDFpen save the file automatically [another magic wand] into a temporary holding folder;
To then be moved back "magically" into its original folder.

I presume the last action above would probably only be possible by shifting the OCR'd files into a completely new set of folders, sorted according to file name? That's assuming all the magic beforehand is even possible...

If someone could kindly advise whether the above is possible, firstly - and secondly, whether it would be within the realm of possibility for a newbie - together with a few pointers, I would be ever so grateful!

Thanks in advance! :wink:
Cassady
 
Posts: 47
Joined: Wed Mar 06, 2013 4:34 pm

Welcome to the forums! Yes, Hazel can do all of this. But why not go a step further? Have Hazel detect if PDFs need to be OCR'd first!

Check out this shell script hint which illustrates a clever method to determine if a PDF needs to be OCR'd.

Then check out this AppleScript for causing PDFpen to OCR a file. You'll have to modify it to something like this:

Code:
tell application "PDFpen"
    -- theFile is the file Hazel is currently processing
    open theFile as alias
    tell document 1
        ocr
        -- wait until PDFpen reports that OCR has finished
        repeat while performing ocr
            delay 1
        end repeat
        delay 1
        close with saving
    end tell
end tell


This is what I would do (hold the Option key while clicking the + button to get nested conditions):



Rule 1: Dive into PhD Subfolders
This rule should be set on the parent folder of your various jurisdiction subfolders. Doing this allows you to create one rule that will run through all of your subfolders (but only to the first level of children):
Code:
if (all) of the following conditions are met for (the file or folder being processed):
     kind is folder
     subfolder depth is 0

Do the following to the matched file or folder:
     run rules on folder contents


Rule 2: OCR Purple PDFs or New PDFs that Need It
This rule should be set on the parent folder as well. The red label marks files that are in the process of being converted, so that they are not picked up a second time.

Code:
if (any) of the following conditions are met for (the file or folder being processed):
     if (all) of the following conditions are met for (the file or folder being processed):
          subfolder depth is 1
          kind is PDF
          color label is purple
     if (all) of the following conditions are met for (the file or folder being processed):
          subfolder depth is 1
          color label is not red
          kind is PDF
          date added is within the last 10 minutes
          passes shell script (embedded script)

Do the following to the matched file or folder:
     set color label red
     run AppleScript
     set color label none



The shell script is the one I linked above, and the AppleScript is the script I edited above.
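In case the linked hint is unavailable, here is a minimal sketch of what such a "passes shell script" check might look like. It is not the linked script itself, just one common heuristic: a PDF whose raw bytes contain no /Font entry usually holds only scanned page images. The function name needs_ocr is made up for illustration, and the heuristic is an assumption that can misfire (for example, on PDFs that store their font dictionaries inside compressed object streams). Hazel passes the file's path to the script as $1 and treats an exit status of 0 as a match.

```shell
#!/bin/sh
# Heuristic sketch (an assumption, not the linked hint): a PDF with no /Font
# entry in its raw bytes is probably image-only and therefore needs OCR.
# needs_ocr is a hypothetical helper name, not part of Hazel or PDFpen.
needs_ocr() {
    # Bail out with status 2 if the path is not a regular file.
    [ -f "$1" ] || return 2
    # LC_ALL=C makes grep work byte-wise; -a searches binary data as text;
    # -q suppresses output so only the exit status matters.
    if LC_ALL=C grep -a -q "/Font" "$1"; then
        return 1   # font resources found: probably already searchable
    fi
    return 0       # no fonts found: probably a scan, so OCR it
}

# Hazel supplies the file being processed as $1; exit status 0 means the
# "passes shell script" condition matches.
if [ -n "${1:-}" ]; then
    needs_ocr "$1"
    exit $?
fi
```

Paste it into the embedded script editor of the "passes shell script" condition and try it on a few known scanned and known searchable PDFs before trusting it with the whole library.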

NOTE: I do not have PDFpen installed to test, but I believe that the above will function for you.
a_freyer
 
Posts: 631
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado

What a relief!

Good to know it can be done - and thank you for the comprehensive feedback. Apologies if I disappear for several hours, but timezone differences make things tricky! :wink:

I'll start playing around this side, and see what I can sort out from your suggestions. Think I might need to do some work on understanding Hazel basics before jumping in properly - but I will post any issues I run into along the way over here.

Thanks again for the help!
Cassady
 

