Detecting PDFs with text with pdfgrep

Detecting PDFs with text with pdfgrep Fri Apr 27, 2018 4:59 am • by Sandro
I have been using the excellent idea from https://www.noodlesoft.com/forums/viewtopic.php?f=3&t=905 to separate PDF files with text from those without text to determine if they need to be OCRed.

But now I have found a file that clearly contains text, yet the above method doesn't recognize it.

I have found pdfgrep, which has an option (--warn-empty) to recognize documents without text, and I want to use it to make the detection more reliable. You can find install instructions at http://macappstore.org/pdfgrep/ (the installation took a while to finish for me).

I have fiddled with it a bit, and in a .sh file it works:

Code:
#!/bin/bash
# The --warn-empty warning goes to stderr, so merge it into stdout for grep.
if ! pdfgrep --warn-empty "^a" "$1" 2>&1 | grep -q "pdfgrep: File does not contain text"
then
   echo "contains text"
   # exit 0
else
   echo "doesn't contain text"
   # exit 1
fi


The 2>&1 redirects stderr to stdout so that I can actually grep it.
The ^a pattern is my attempt to make it process as quickly as possible.
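
To illustrate the redirection point, here is a minimal sketch. The function emit_warning is a hypothetical stand-in that writes the message to stderr only; it is not pdfgrep itself, and the message text is simply the one the script above greps for.

```shell
#!/bin/bash
# Hypothetical stand-in for pdfgrep: writes the warning to stderr only.
emit_warning() {
   echo "pdfgrep: File does not contain text" >&2
}

# Without 2>&1 the pipe carries stdout only, so grep never sees the warning.
# (stderr is discarded here just to keep the demo quiet.)
if emit_warning 2>/dev/null | grep -q "does not contain text"; then
   without_redirect="matched"
else
   without_redirect="not matched"
fi

# With 2>&1 stderr is merged into stdout before the pipe, so grep can match.
if emit_warning 2>&1 | grep -q "does not contain text"; then
   with_redirect="matched"
else
   with_redirect="not matched"
fi

echo "without 2>&1: $without_redirect"
echo "with 2>&1: $with_redirect"
```

The redirection is handled by the shell rather than by the calling application, so it should behave the same way wherever the script runs, provided the command itself can be found.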

When I put that code into a Hazel condition (with the exit codes uncommented and the echo lines commented out, of course), I still don't get the desired result. It always ends up in the "no text" branch.

Any ideas? Does the redirection of stderr to stdout not work within the context of Hazel? If so, the script wouldn't find the error message.

Thanks!
Sandro
 
Posts: 20
Joined: Mon Jul 31, 2017 7:08 am

Re: Detecting PDFs with text with pdfgrep Fri Apr 27, 2018 11:31 am • by Mr_Noodle
Note that the environment variables will be different when the script is executed by Hazel. Make sure to set any that might be needed, especially those needed by pdfgrep. You may also want to try piping the output to a file to see whether it's working correctly.
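
A minimal sketch of the "pipe things to a file" suggestion, assuming a log path of my own choosing (LOG) and two hypothetical helper names (log_environment, log_run). It records the PATH the script actually sees and whether pdfgrep can be resolved, which are the two things most likely to differ from an interactive shell.

```shell
#!/bin/bash
# Hypothetical log location; any writable path would do.
LOG="${TMPDIR:-/tmp}/hazel-pdfgrep-debug.log"

# Append the current PATH and the resolved location of pdfgrep to the log.
log_environment() {
   {
      echo "--- $(date) ---"
      echo "PATH=$PATH"
      # command -v prints the resolved path, or nothing if it is not found.
      echo "pdfgrep=$(command -v pdfgrep || echo 'not found')"
   } >> "$LOG" 2>&1
}

# Run any command, appending its combined output and exit status to the log.
log_run() {
   "$@" >> "$LOG" 2>&1
   local status=$?
   echo "exit status of '$*': $status" >> "$LOG"
   return "$status"
}

log_environment
```

Inside the rule's script, the check could then be wrapped as log_run pdfgrep --warn-empty "^a" "$1", and the log read afterwards to see what happened.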
Mr_Noodle
Site Admin
 
Posts: 11195
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Detecting PDFs with text with pdfgrep Fri Apr 27, 2018 11:34 am • by Sandro
Aaah... The pdfgrep command might not even be available!

Will try and update. Thanks!

Re: Detecting PDFs with text with pdfgrep Sun May 06, 2018 4:52 am • by Sandro
Using the full path of pdfgrep in the script did the trick!

/usr/local/bin/pdfgrep instead of pdfgrep
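
For anyone whose install ended up somewhere else, here is a rough sketch that looks the tool up in a few fixed directories instead of relying on PATH. The function name find_tool and the directory list are my assumptions: /usr/local/bin is the location that worked above, and /opt/homebrew/bin is where Homebrew installs on Apple Silicon Macs.

```shell
#!/bin/bash
# Print the first executable named $1 found in the candidate directories.
# The directory list is an assumption; adjust it to match the actual install.
find_tool() {
   local name="$1"
   local dir
   for dir in /usr/local/bin /opt/homebrew/bin /usr/bin /bin; do
      if [ -x "$dir/$name" ]; then
         echo "$dir/$name"
         return 0
      fi
   done
   return 1
}

# Resolve pdfgrep once; complain on stderr if it cannot be found.
PDFGREP="$(find_tool pdfgrep)" || echo "pdfgrep not found in the usual locations" >&2
```

The rest of the script would then call "$PDFGREP" wherever it previously called pdfgrep.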

Re: Detecting PDFs with text with pdfgrep Sun Dec 16, 2018 8:42 am • by Robert
Hi Sandro,
Thank you for the script and the lead to pdfgrep. I am a total noob at shell scripts, which is why I can't get the script to work. Possibly I am just missing a fundamental point about how shell scripts and Hazel work together. I am eager to learn!
So I put "if all conditions are met: extension is .pdf" then "Run your (shell script)".
That should produce text output saying whether the PDF that Hazel is processing "contains text" or "doesn't contain text", am I right? ... That does not work for me.
In addition: if "doesn't contain text" is detected, is it possible to tell Hazel to move the file to a certain folder? How would I do that?
I would appreciate your help, even if my question most probably misses the fundamentals of Hazel and shell scripting....
"Behind all the inhuman aspects of automation (...) its real possibilities appear: the genesis of a technological world in which man can finally withdraw from (...) the apparatus of his labor – in order to experiment freely with it." /Marcuse
Robert
 
Posts: 52
Joined: Sun Dec 16, 2018 8:05 am

Re: Detecting PDFs with text with pdfgrep Mon Dec 17, 2018 5:53 pm • by Sandro
I am far from a pro :-)

As there is some info missing about your setup in your question, I can only guess...

Have you installed pdfgrep?

Re: Detecting PDFs with text with pdfgrep Thu Dec 20, 2018 11:34 am • by Robert
Sandro wrote: Have you installed pdfgrep?

Thanks for your reply – yes, I have pdfgrep installed!
And I actually managed to use another script from this post (https://www.noodlesoft.com/forums/viewtopic.php?f=3&t=905&start=30) to build a workflow that works for me. It uses pdffonts, which is part of Poppler, a dependency of pdfgrep (I think).

To determine if a PDF needs to be OCRed, I have one rule on a folder, which says:
1. If all of the following conditions are met:
a. Extension is .pdf
b. Passes shell script:
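Code:
#!/bin/bash
# pdffonts lists the fonts used in the PDF; a line containing "Type" is
# treated as a sign that the file already has text.
if pdffonts "$1" | grep -q Type
then
   exit 1   # fonts found: the rule does not match
else
   exit 0   # no fonts found: the rule matches and the file needs OCR
fi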

Then move to folder "without OCR"
and add Tag "NeedsOCR"

After that I have another rule on the Folder, which says:
2. If extension is .pdf
Move to folder "OCRed"


Every PDF that falls through the first rule without matching must already be OCRed, so when the second rule hits, the PDF can be moved to the "OCRed" folder.
Now I was wondering whether I could make it more elegant: use the script in the action part of the first rule and assign exit 1 and exit 0 to destination folders, to sort the PDFs into "OCRed" or "not OCRed". That's what I was trying to do with your script, but I couldn't manage it...

Re: Detecting PDFs with text with pdfgrep Tue Dec 25, 2018 12:04 pm • by Sandro
I have duplicated that rule but swapped the if/else branches.

That's it.

This way the result does not depend on the order of the rules.
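
A minimal sketch of that two-rule idea, assuming the pdffonts check from earlier in this thread. The names PDFFONTS, has_text_layer and needs_ocr are hypothetical; PDFFONTS is only there so the full path can be set in one place.

```shell
#!/bin/bash
# Hypothetical override so the full path to pdffonts can be set in one place.
PDFFONTS="${PDFFONTS:-pdffonts}"

# Status 0 when the font listing contains at least one font line.
has_text_layer() {
   "$PDFFONTS" "$1" 2>/dev/null | grep -q Type
}

# The swapped branch: status 0 when no font line is found.
needs_ocr() {
   ! has_text_layer "$1"
}

# Rule "without OCR": make  needs_ocr "$1"       the last line of the script.
# Rule "OCRed":       make  has_text_layer "$1"  the last line of the script.
```

A script exits with the status of its last command, so each rule's condition ends with the one function call it needs and no explicit exit is required. Because the two conditions are exact opposites, it no longer matters which rule runs first.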

Re: Detecting PDFs with text with pdfgrep Wed Dec 26, 2018 6:36 am • by Robert
Of course! :oops: Thank You!

Re: Detecting PDFs with text with pdfgrep Thu Mar 14, 2019 7:11 am • by Dellu
Robert wrote:
To determine if a PDF needs to be OCRed, I have one rule on a folder, which says:
1. If all of the following conditions are met:
a. Extension is .pdf
b. Passes shell script:
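Code:
#!/bin/bash
# pdffonts lists the fonts used in the PDF; a line containing "Type" is
# treated as a sign that the file already has text.
if pdffonts "$1" | grep -q Type
then
   exit 1   # fonts found: the rule does not match
else
   exit 0   # no fonts found: the rule matches and the file needs OCR
fi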

Then move to folder "without OCR"
and add Tag "NeedsOCR"



This script is not working for me.
Dellu
 
Posts: 17
Joined: Thu Dec 12, 2013 6:26 am

Re: Detecting PDFs with text with pdfgrep Thu Mar 14, 2019 7:32 am • by Sandro
What are the symptoms?
Does it work outside of Hazel?

For me the issue was that the context in which the script was executed (from Hazel) was a different one, and thus the command wasn't available. Using the full path helped.

Re: Detecting PDFs with text with pdfgrep Thu Mar 14, 2019 6:10 pm • by Dellu
Sandro wrote: What are the symptoms?
Does it work outside of Hazel?

For me the issue was that the context in which the script was executed (from Hazel) was a different one, and thus the command wasn't available. Using the full path helped.



Yes, I tried it using BBEdit.

All kinds of files (with or without OCR) are coming out as "need OCR".

Re: Detecting PDFs with text with pdfgrep Mon Mar 18, 2019 5:01 am • by Sandro
Hmm... I haven't personally used that script, so I can't help you there

If it works in BBEdit (under your user profile) and not in Hazel, it must be a profile issue though, right?

Any way to raise the log level in Hazel to see what is actually being executed?
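
One possible explanation, offered only as a guess: if pdffonts cannot be run at all, the pipeline in that script produces no output, so the check falls into the "no fonts" branch for every file. Here is a rough sketch that keeps "the tool is missing or failed" separate from "no fonts found". The function classify_pdf and the PDFFONTS override are hypothetical names of mine.

```shell
#!/bin/bash
# Print "text", "no-text", or "error" for the given file, so that a missing
# or failing tool is not silently reported as "no text".
classify_pdf() {
   local tool="${PDFFONTS:-pdffonts}"   # hypothetical override for the full path
   local listing
   if ! command -v "$tool" >/dev/null 2>&1; then
      echo "error: $tool not found" >&2
      echo "error"
      return 2
   fi
   # Capture the listing; a non-zero status from the tool counts as an error.
   if ! listing="$("$tool" "$1" 2>&1)"; then
      echo "error: $tool failed: $listing" >&2
      echo "error"
      return 2
   fi
   if printf '%s\n' "$listing" | grep -q Type; then
      echo "text"
   else
      echo "no-text"
   fi
}
```

Running classify_pdf on a known text PDF in the same context that misbehaves should show quickly whether the problem is the lookup of the command or the check itself.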