Detecting PDFs with text with pdfgrep

I have been using the excellent idea from https://www.noodlesoft.com/forums/viewtopic.php?f=3&t=905 to separate PDF files with text from those without text to determine if they need to be OCRed.

But now I have found a file that clearly contains text, yet the above method doesn't recognize it.

I have found pdfgrep, which has an option to recognize documents without text, and I want to use it to make the detection more reliable. You can find install instructions at http://macappstore.org/pdfgrep/ (the installation took a while to finish for me).

I have fiddled with it a bit, and in a .sh file it works:

Code:
#! /bin/bash
if ! (pdfgrep --warn-empty "^a" "$1" 2>&1 | grep "pdfgrep: File does not contain text")
then
    echo "contains text"
    # exit 0
else
    echo "doesn't contain text"
    # exit 1
fi

The 2>&1 redirects the stderr to stdout so I can actually grep it
The ^a as a pattern is my attempt to make it process as quickly as possible
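
To illustrate the redirection point, here is a small self-contained sketch. The function `warn_only` is hypothetical and merely stands in for pdfgrep printing its warning to stderr, so no pdfgrep is needed to run it.

```shell
#!/bin/bash
# Hypothetical stand-in for pdfgrep: prints its warning to stderr only.
warn_only() {
  echo "pdfgrep: File does not contain text" >&2
}

# Without 2>&1 the pipe carries only stdout, so grep sees nothing.
# (stderr is discarded here only to keep the demo quiet; either way
# it never reaches the pipe.)
if warn_only 2>/dev/null | grep -q "does not contain text"; then
  without_redirect="matched"
else
  without_redirect="not matched"
fi

# With 2>&1 stderr is merged into stdout before the pipe, so grep can match.
if warn_only 2>&1 | grep -q "does not contain text"; then
  with_redirect="matched"
else
  with_redirect="not matched"
fi

echo "without 2>&1: $without_redirect"
echo "with 2>&1:    $with_redirect"
```

Using `grep -q` here keeps grep silent and relies only on its exit status, which is all the `if` needs.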

When I put that code into a Hazel condition (with the exit codes uncommented and the echo lines commented out, of course), I still don't get the desired result though. It always ends up in the "no text" branch.

Any ideas? Does the redirection of the stderr to stdout not work within the context of Hazel?

Then it wouldn't find the error message

Thanks!

Note that the environment variables will be different when the script is executed by Hazel. Make sure to set any that might be needed, especially those needed by pdfgrep. You may also want to try piping the output to a file to see if it's working correctly.
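
A minimal sketch of that debugging approach, assuming a log location of `/tmp/hazel-pdfgrep.log` (a hypothetical path; any writable file works). It records the PATH the script actually sees and whether pdfgrep can be found, so the result can be compared against the same lines produced from a terminal.

```shell
#!/bin/bash
# Hypothetical log location; change to any writable file.
LOG="${LOG:-/tmp/hazel-pdfgrep.log}"

log_environment() {
  {
    echo "--- $(date) ---"
    echo "PATH=$PATH"
    # command -v prints the resolved path, or nothing if not found.
    echo "pdfgrep: $(command -v pdfgrep || echo 'NOT FOUND')"
  } >> "$LOG" 2>&1
}

log_environment
```

Run it once from a terminal and once from the Hazel rule, then compare the two entries in the log file.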

Aaah... The pdfgrep command might not even be available!

Will try and update. Thanks!

Using the full path of pdfgrep in the script did the trick!

/usr/local/bin/pdfgrep instead of pdfgrep
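
Putting the pieces together, a Hazel-ready version might look roughly like this. It assumes pdfgrep lives at `/usr/local/bin/pdfgrep` as above (the location can differ, for example `/opt/homebrew/bin` with Homebrew on Apple Silicon) and that the warning text matches the one quoted earlier. The `PDFGREP` variable, the `contains_text` function and the return code 2 are names and choices introduced here for illustration.

```shell
#!/bin/bash
# Full path, because Hazel does not load your interactive shell's PATH.
# Assumption: adjust if pdfgrep is installed elsewhere.
PDFGREP="${PDFGREP:-/usr/local/bin/pdfgrep}"

# Returns 0 if the PDF contains text, 1 if pdfgrep warns that it is
# empty, and 2 if pdfgrep itself cannot be executed.
contains_text() {
  if [ ! -x "$PDFGREP" ]; then
    echo "pdfgrep not found at $PDFGREP" >&2
    return 2
  fi
  if "$PDFGREP" --warn-empty "^a" "$1" 2>&1 \
      | grep -q "File does not contain text"; then
    return 1
  fi
  return 0
}

# Hazel passes the file as $1; the exit status decides whether the
# condition passes (0) or not (anything else).
if [ "$#" -gt 0 ]; then
  contains_text "$1"
  exit $?
fi
```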

Hi Sandro,
thank you for the script and the lead to pdfgrep. I am a total noob when it comes to shell scripts, which is why I can't get the script working for me. Possibly I am just missing a fundamental point about shell scripts and Hazel. I am eager to learn!
So I set up "if all conditions are met: extension is .pdf", then "Run your (shell script)".
That should output either "contains text" or "doesn't contain text" for the PDF Hazel is detecting and processing, am I right? ... That does not work for me.
In addition: if "doesn't contain text" is detected, is it possible to tell Hazel to move the file to a certain folder? How would I do that?
I would appreciate your help, even if my question most probably misses the fundamentals of Hazel and shell scripting....

I am far from a pro :-)

As there is some info missing about your setup in your question, I can only guess...

Have you installed pdfgrep?

Sandro wrote: Have you installed pdfgrep?

Thanks for your reply – Yes I have pdfgrep installed!
And I actually managed to use another script from this post https://www.noodlesoft.com/forums/viewtopic.php?f=3&t=905&start=30 to build a workflow that works for me. It uses pdffonts, which was installed along with pdfgrep (I think; pdffonts itself belongs to the Poppler/Xpdf utilities).

To Determine if a PDF needs to be OCRed I have one rule on a Folder which says:
1. If all of the following conditions are met:
a. Extension is .pdf
b. Passes shell script:

Code:
#! /bin/bash
if [ `pdffonts "$1" | grep Type | sed -n '$='` ]
then
    exit 1
else
    exit 0
fi

Then move to folder "without OCR"
and add Tag "NeedsOCR"

After that I have another rule on the Folder, which says:
2. If extension is .pdf
Move to folder "OCRed"

Every PDF that gets past the first rule without matching must already be OCRed, so when the second rule hits, the PDF can be moved to the OCRed folder.
Now I was wondering if I could make it more elegant: use the script in the action of the first rule and assign exit 1 and exit 0 to destinations (folders), sorting the PDFs into "OCRed" or "not OCRed". That's what I was trying to do with your script, but I couldn't manage it...

I have duplicated that rule, but swapped the if-else-paths

That's it

This way I am order independent
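
As a sketch of that setup, the two rules could share the same check and differ only in which outcome counts as a pass. The names `has_fonts`, `needs_ocr_condition`, `already_ocred_condition` and the `PDFFONTS` variable are hypothetical, and the full path is an assumption carried over from the pdfgrep fix above.

```shell
#!/bin/bash
# Assumption: pdffonts sits next to pdfgrep; adjust the path for your install.
PDFFONTS="${PDFFONTS:-/usr/local/bin/pdffonts}"

# Returns 0 if pdffonts lists at least one line containing "Type"
# (the same filter the pdffonts script in this thread uses).
has_fonts() {
  "$PDFFONTS" "$1" 2>/dev/null | grep -q "Type"
}

# Rule "needs OCR": passes (status 0) when no fonts are found.
needs_ocr_condition() {
  if has_fonts "$1"; then return 1; else return 0; fi
}

# Rule "already OCRed": the same test with the branches swapped.
already_ocred_condition() {
  if has_fonts "$1"; then return 0; else return 1; fi
}
```

Each Hazel rule would then call one of the two functions on "$1" and exit with its status. Exactly one of them passes for any given PDF, which is why the order of the rules no longer matters.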

Of course! :oops:

Thank You!

Robert wrote:
Sandro wrote:
To Determine if a PDF needs to be OCRed I have one rule on a Folder which says:
1. If all of the following conditions are met:
a. Extension is .pdf
b. Passes shell script:
Code:
#! /bin/bash
if [ `pdffonts "$1" | grep Type | sed -n '$='` ]
then
    exit 1
else
    exit 0
fi

Then move to folder "without OCR"
and add Tag "NeedsOCR"

This script is not working for me.

What are the symptoms?
Does it work outside of Hazel?

For me the issue was that the context in which the script was executed (from Hazel) was a different one, and thus the command wasn't available. Using the full path helped.

Yes, I tried it using BBEdit.

All kinds of files (with or without OCR) are coming out as "need OCR".

Hmm... I haven't personally used that script, so I can't help you there

If it works in BBEdit (under your user profile) and not in Hazel, it must be a profile issue though, right?
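
One thing that would fit this symptom, offered as a guess rather than a diagnosis: if pdffonts cannot be found in Hazel's context, the backtick expression produces no output, `[ ]` with nothing inside evaluates to false, and the script falls through to `exit 0` for every file. The sketch below reproduces that with a deliberately nonexistent command (`no_such_pdffonts_binary` is a made-up name) and adds a hypothetical helper, `tool_available`, to check for the tool first.

```shell
#!/bin/bash
# Mimics the original test, but with a command that cannot be found.
original_style_check() {
  if [ `no_such_pdffonts_binary "$1" 2>/dev/null | grep Type | sed -n '$='` ]
  then
    return 1   # fonts found: file has text
  else
    return 0   # nothing counted: treated as "needs OCR"
  fi
}

# Hypothetical guard: succeeds only if the given command can be resolved.
tool_available() {
  command -v "$1" >/dev/null 2>&1
}
```

In a Hazel script, something like `tool_available /usr/local/bin/pdffonts || echo "pdffonts missing" >> some-log-file` before the actual check would make this case visible instead of silently passing.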

Any way to raise the log level in Hazel to see what is actually being executed?