Getting Hazel to run a TextEdit search automatically

Hello all,

I was tempted to latch on to the forum topic as below, but my query is a bit more basic! (I think! ) :oops:

viewtopic.php?f=4&t=2021&p=8434&hilit=using+mdls#p8434

I have trawled the Support forum looking for information about getting into the TEXT-file information of pdf files. There are a few topics on this, but they are unfortunately aimed at people who presumably have plenty more coding knowledge in Terminal than me.

I'm hoping someone can dumb this down for me.

I have many pdf files collected from a variety of places over the past few years.
Obviously, some of these are OCR'ed already, but many are not.
And as far as I can tell, the only way to know offhand whether a file has been OCR'ed is to open it and perform a search.
UNLESS you tell Hazel to do that for you, by differentiating between files that have already been OCR'ed and files that still have to be.

I have managed the following so far, using various useful threads listed throughout the Support forums:

a.) Sort new and old PDF's as I need them, including [sort into subfolders];
b.) ...
c.) Have Hazel send a file (that needs to be OCR'ed) to PDFpen, and have the magic happen.
d.) Have Hazel sort and move the recently OCR'ed files back to where they came from.

As is hopefully apparent from the above - I'm stuck at b.)

I need to get Hazel to do the following:

I've managed to work out that if the TEXT information of a PDF file contains the following terms:
"encoding" (and/or) "decodeparms" - then there is a very high possibility that the PDF in question has not been OCR'ed.
What I mean by the above is: if I select a PDF, open it with [TextEdit], and then simply search for either "encoding" or "decodeparms", and the terms are not present in the text, then the file has (probably) not been OCR'ed.
[As an aside, I had a look at the metadata through running 'mdls' in Terminal, as many have suggested - but cannot discern a pattern emerging between files that have, and have not, been OCR'ed]

This is then step b.) - I want to get Hazel to call up Textedit on certain files moved into a watched folder, run the search to confirm if either "encoding" or "decodeparms" are present - AND IF SO, initiate c.) - d.).

I had a look at the following topic: viewtopic.php?p=3593#p3593

Unfortunately, it went straight over my head.
Specifically the following little bit of code:

Code:
    #!/bin/bash
    if ! grep Font "$1"
    then
        echo "This file needs to be OCRed"
    else
        echo "This file does not need to be OCRed"
    fi

In that other topic, "Applesuperlatives" walked the same path I am trying to walk, and concluded that in his case "Font" was the keyword. For whatever reason (and I checked), that doesn't work this side - as mentioned, mine are "encoding" and/or "decodeparms".

So presumably, I could simply replace (Font) in the above, with my two search-terms...
But I don't even vaguely know where to begin.

Do I simply enter the script as above into Hazel?
Do I need to expand on some of the steps - since I presume code-words have been used to simplify things?

I've googled quite a bit so far - and realised I am either going to spend hours sharpening my Terminal skills (that are virtually non-existent), in order to understand the above - or I can try my luck here...

Here's hoping someone can provide me with some tips...

It sounds like you're trying to find out which files need to be OCR'd and which files are ok to be passed somewhere else. You ran a forum search and found the relevant posts that explain clever ways of making this distinction. I'll explain those in detail now:

First, the rule:

Name: OCR my Flattened PDFs

Code:
    if (all) of the following conditions are met for (the file or folder being processed):
        kind is pdf
        passes shell script (embedded script)
        color label is not yellow
    Do the following to the matched file or folder:
        set color label yellow
        run applescript (embedded script)
        ...

The 'passes shell script' script looks like this:

Code:
    if [ $(grep -ci "Font" "$1") -gt 0 ]; then exit 1; else exit 0; fi

I'll explain this script line by line.

if [ $(grep -ci "Font" "$1") -gt 0 ] - runs grep on the file ("$1") passed to the script by Hazel. Grep reads the PDF as plain text and counts (-c) the lines that contain a case-insensitive (-i) match for the word "font". If that count is greater than (-gt) zero, then the if statement is TRUE and we exit 1 to tell Hazel that this file does not need to be OCR'd. Hazel treats an exit status of 0 as "condition met" and anything else as "not met".

But why do I care if font is there?

We are looking for the word FONT because the PDF spec requires font information for each OCR'd page so that PDF readers can display the font correctly. It would be analogous to searching an HTML page for a <body> tag. We know that each HTML page will have a <body> just like we know every OCR'd pdf will contain the plaintext of the word "FONT." This will work even for PDFs that don't include the word 'font' in their displayed text. We're not looking at the content of the PDF document, we're looking at the structure of the PDF document.

Now, a flat PDF is just a series of images. It is not necessary to store font information, so the word 'font' will not appear in the plaintext of the pdf file.

That's why we're looking for 'font' - it's a very clever way to distinguish between flat and OCR'd pdfs.
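
For anyone who wants to try the idea in Terminal before pasting it into Hazel, here is a minimal sketch of the same check written as a function. The name needs_ocr_by_font is made up for illustration, and the sketch assumes the convention described above: status 0 means "the rule matches, go ahead and OCR", anything else means "leave the file alone".

```shell
#!/bin/bash
# Hypothetical helper (the name is illustrative, not part of Hazel):
# succeeds (status 0) when FILE contains no line mentioning "font".
needs_ocr_by_font() {
    local count
    # -c counts matching lines, -i ignores case; errors are silenced
    count=$(grep -ci "font" "$1" 2>/dev/null)
    # No count at all (for example an unreadable file) is treated as "no match"
    [ -n "$count" ] && [ "$count" -eq 0 ]
}
```

In an embedded Hazel script the last line would be something like needs_ocr_by_font "$1", so that the function's status becomes the script's exit status.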

You are a lifesaver! Thank you for explaining the above in such a way that even I was able to get my head around it.

I will give this a try during the course of the evening, and pop something up again later.

On the use of "Font" - for some reason, that's not working my side - i.e. both pdf's that have been OCR'ed, and those that haven't, have yielded the word "Font" somewhere in the textedit information - which was why I eventually narrowed it down to my two words of "encoding" and "decodeparms"...

Having said that, since you explained how the script works/what it means, I can now see how to use it. I presume simply replacing "Font" in the script with either of my two terms will do the job.

[Of course - you've given me ample reason to go back and check "Font" again... Too much playing around with scripts/codes/Hazel very late at night, and mistakes happen! :wink: ]

Yes, replacing 'font' will allow you to search for different words. However, note that you will not be searching within the content of the PDF (generally)... you are searching within the structure.
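
As a rough sketch of what searching for two terms at once could look like: grep's -E option allows alternation with |, so both "encoding" and "decodeparms" can be counted in one pass. The function name count_marker_lines is only illustrative, and whether these two terms are a reliable signal is an assumption carried over from the files described in this thread.

```shell
#!/bin/bash
# Hypothetical helper: prints how many lines of FILE mention either term.
count_marker_lines() {
    # -E enables alternation with |, -c counts matching lines, -i ignores case
    grep -Eci "encoding|decodeparms" "$1"
}
```

Note that -c counts matching lines, not individual occurrences, so a line containing both terms is counted once.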

Thanks. The two terms are what I pulled out of the 'structure' of the pdf's, after having opened the pdf's in textedit. It's these two terms that are present 95% of the time (in the files on my computer, in any event), in pdf's that have not been OCR'ed... I have no idea why "font" shows up in both OCR'ed and to-be-OCR'ed files, but it does - subject to me checking that again...

It seems like you have the tools to get what you're after. Let me know if there are any other issues.

Hmm...

Still battling. :roll:

Here's a screengrab of the file "Scan.pdf" opened by Textedit.
http://dl.dropbox.com/u/48750317/Screen ... 8%20PM.png

Note how I searched "encoding", and 0 hits are made.
This means the file must still be OCR'ed...

Next, please see a screengrab of my attempt at the script:
http://dl.dropbox.com/u/48750317/Screen ... 3%20PM.png

I think I'm doing things back to front.

Unless I'm mistaken (and there's a huge possibility that I am), the previous suggestions were catering for the situation that if the term "Font" was present - "then the if statement is TRUE and we exit 1 to tell Hazel that this file does not need to be OCR'd."

I need to have the script do the opposite - i.e. if the term "encoding" IS NOT present, then Hazel needs to act on it and proceed further - because then it needs to be OCR'ed.

Presently, I'm getting this error in the log. My pdf files are being labeled purple - my action in the test phase - regardless of the contents of the textedit information.
http://dl.dropbox.com/u/48750317/Screen ... 7%20PM.png

I'll be trying in the meanwhile to work out the opposite script...
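
The inverse described above (act on the file only when "encoding" is NOT present) can be sketched roughly as follows. The function name lacks_encoding is invented for illustration; the sketch assumes that status 0 is what makes the Hazel condition pass, and it deliberately fails for files it cannot read rather than sending them off to be OCR'ed.

```shell
#!/bin/bash
# Hypothetical helper: succeeds (status 0) only when FILE is readable
# and contains no case-insensitive match for "encoding".
lacks_encoding() {
    # -q prints nothing and only sets the exit status; ! inverts that status
    [ -r "$1" ] && ! grep -qi "encoding" "$1"
}
```

Used as the last command of an embedded script (lacks_encoding "$1"), a file without the term would match the rule and a file with the term would not.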

I apologise for keeping this thread alive for selfish purposes - :roll:

- but would really appreciate any help with the above...

This is the final piece in my puzzle, and I just cannot get my head around it. If I can get this to work properly, then I am looking at saving hundreds of hours.

Any applescript gurus who can explain things to a newbie? As is clear from the above - a_freyer has taken me this far, but I just need help getting the inverse working!

Hello all,

I've now tried the following script - I'm not sure if the >3 makes sense - but regardless, I'm still getting the same error message. Any suggestions?

Code:
    if [ $(grep -ci "encoding" "$1") -gt >3 ]; then echo 1; else echo 0; fi

This is shellscripting, not AppleScript. You can search around for sh or bash scripting. The > there doesn't make sense so I'd fix that.
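
Some background on that: inside [ ], numeric comparisons are spelled as words (-gt, -ge, -lt, -le, -eq, -ne). A bare > is read by the shell as output redirection, here to a file named 3, which leaves -gt without a number to compare against. A minimal sketch of a "more than N matching lines" test, with the function name and its three parameters made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: succeeds when FILE has more than LIMIT lines
# that contain a case-insensitive match for TERM.
has_more_than() {
    local term=$1 limit=$2 file=$3 count
    count=$(grep -ci "$term" "$file" 2>/dev/null)
    # -gt is the numeric "greater than"; an empty count is treated as 0
    [ "${count:-0}" -gt "$limit" ]
}
```

For example, has_more_than "encoding" 3 "$1" would succeed only for a file with at least four matching lines.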

Thanks for the tip.

I only realised that there was a difference between Applescript / and the rest - a few days ago... :oops:

Been doing a fair bit of reading up, and slowly starting to find my way. Also registered over at MacScripter.net, so will pop up some queries over there.

I am one very happy camper.

Popped up a thread over at Macscripter.net, and was helped out by member McUsrll, in finishing the script I needed.
http://macscripter.net/viewtopic.php?pid=161712#p161712

Here be the script I would embed in Hazel, to have it run a text search of the plaintext pdf contents, to check the presence of my keyword "encoding".

Code:
    a=$(grep -ci "encoding" "$1")
    if [ x$a = x ]; then exit 1; fi
    if [ $a -lt 2 ]; then
        exit 0
    else
        exit 1
    fi

When Hazel finds the keyword on fewer than two lines, it presumes that the file must still be OCR'd, and I can then have Hazel proceed with the magic!

Thanks to all those who gave input in this thread - I will pop up a new topic with my completed workflow, on the off-chance that other Hazel users might be in the same boat as me.