Detecting PDFs with text with pdfgrep
Posted: Fri Apr 27, 2018 4:59 am
I have been using the excellent idea from https://www.noodlesoft.com/forums/viewtopic.php?f=3&t=905 to separate PDF files with text from those without text to determine if they need to be OCRed.
But now I have found a file that clearly contains text, yet the above method doesn't recognize it.
I have found pdfgrep, which has an option to recognize documents without text, and I want to use it to make the detection more reliable. You can find install instructions at http://macappstore.org/pdfgrep/ (the installation took a while to finish for me).
I have fiddled with it a bit, and in a .sh file it works:
Code:
#!/bin/bash
# pdfgrep prints its --warn-empty warning on stderr; 2>&1 merges it into the pipe for grep.
if ! pdfgrep --warn-empty "^a" "$1" 2>&1 | grep -q "pdfgrep: File does not contain text"
then
    echo "contains text"
    # exit 0
else
    echo "doesn't contain text"
    # exit 1
fi
The 2>&1 redirects stderr to stdout so that grep can actually see the warning.
The ^a pattern is my attempt to make pdfgrep finish as quickly as possible.
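To illustrate the redirection point, here is a minimal sketch that does not need pdfgrep at all. The function fake_pdfgrep is a hypothetical stand-in that only writes the warning to stderr, which is how I assume the real tool behaves with --warn-empty.

```shell
#!/bin/bash
# Hypothetical stand-in for pdfgrep: writes the warning to stderr only.
fake_pdfgrep() {
    echo "pdfgrep: File does not contain text" >&2
}

# Without 2>&1 the warning bypasses the pipe (it goes straight to the
# terminal), so grep receives no input and does not match.
if fake_pdfgrep | grep -q "File does not contain text"; then
    without_redirect="matched"
else
    without_redirect="not matched"
fi

# With 2>&1 stderr is merged into stdout before the pipe, so grep sees it.
if fake_pdfgrep 2>&1 | grep -q "File does not contain text"; then
    with_redirect="matched"
else
    with_redirect="not matched"
fi

echo "without 2>&1: $without_redirect"
echo "with 2>&1: $with_redirect"
```

The if statement tests the exit status of the last command in the pipeline, which is grep here, so the branch taken depends only on whether the warning reached grep.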
When I put that code into a Hazel condition (with the exit codes uncommented and the echo lines commented out, of course), I still don't get the desired result, though. It always ends up in the "no text" branch.
Any ideas? Does redirecting stderr to stdout not work within the context of Hazel?
In that case, grep wouldn't find the error message.
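To narrow this down, one option would be to log what the script actually sees when it runs outside my terminal. The sketch below is only a diagnostic aid under my own assumptions: tool_status is a hypothetical helper, the log file name is made up, and I have not confirmed that a different PATH is the cause. Apps started from the macOS GUI often get a shorter PATH than a login shell, so it seems worth ruling out.

```shell
#!/bin/bash
# Hypothetical helper: report where (or whether) a command is found on PATH.
tool_status() {
    local tool="$1" location
    if location=$(command -v "$tool"); then
        echo "$tool found at $location"
    else
        echo "$tool not found on PATH"
        return 1
    fi
}

# Hypothetical log location; any writable path would do.
log="${TMPDIR:-/tmp}/hazel-pdfgrep-debug.log"

# Append the environment the script sees. The "|| true" keeps the
# diagnostic itself from changing the script's exit status.
{
    echo "PATH=$PATH"
    tool_status pdfgrep || true
} >> "$log" 2>&1
```

Comparing the log written from a terminal run with the one written from the Hazel run should show whether both environments find the same pdfgrep.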
Thanks!