Shell script for ocrmypdf stopped working

Moderator: Mr_Noodle

Shell script for ocrmypdf stopped working Sun Jan 07, 2024 9:24 pm • by needle
I have a Hazel rule that has recently stopped working. This may coincide with a recent Hazel update, or it may be related to an update to a dependent command line tool. The rule watches a folder, and when a new file is added, runs an embedded shell script that:
1. Runs the cli program "ocrmypdf" on the file
2. Outputs the resulting OCR'd file to a different directory
3. Tags the resulting file with another cli program, "tag"

Here is the embedded script (run from ZSH):

Code:
PATH=$PATH:/opt/homebrew/bin
export PATH

filename=$(basename "$1")
filename=${filename%.*}
converting_directory=~/Documents/processed/
converting_filename="$filename.pdf"

if ocrmypdf --clean --rotate-pages "$1" "$converting_directory""$converting_filename"
then
   tag -a "pdf/a,scan" "$converting_directory""$converting_filename"
   rm "$1"
   echo "OCR succeeded"
else
   echo "OCR failed"
fi


When I run this script on a PDF outside of Hazel, it works. However, now when triggered via Hazel it looks like it's failing on the "ocrmypdf" command. The logs show that matching happens as expected, and the script definitely gets called but the OCR step fails.

This is the relevant bit from the logs:
Code:
2024-01-07 17:01:24.037 hazelworker[49612] DEBUG: About to process directory /Users/sam/Documents/inbox/hazel-scan-ocr
2024-01-07 17:01:24.042 hazelworker[49612] 20240107_151957.pdf: Rule Process Scans matched.
2024-01-07 17:01:24.042 hazelworker[49612] DEBUG: New rule signature. Executing actions.
Old signatures: (
)
New Signature:{dateAdded >[cd] dateMatched}:{(shellscript:/opt/homebrew/bin/zsh:0012fb12ac317454ed0977ff22fc908e,{
})}
2024-01-07 17:01:24.426 hazelworker[49612] DEBUG: == script output ==

    1 [tesseract] Leptonica Error in fopenReadStream: file not found: 000001_rasterize_preview.jpg
    1 [tesseract] Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.mvzd3qav/000001_rasterize_preview.jpg
    1 [tesseract] Leptonica Error in fopenReadStream: file not found:
    1 [tesseract] Leptonica Error in pixRead: image file not found:
    1 [tesseract] Image file  cannot be read!
    1 [tesseract] Error during processing.

SubprocessOutputError
OCR failed

== End script output ==


It looks like the ocrmypdf program isn't finding the source file, but I'm not sure what the problem is or why it was working previously but stopped. Hoping someone with more scripting knowledge can point me in the right direction.

OS X Version 13.6.3 (Build 22G436)
Hazel Version 5.3.1 (Build 2371)
needle
 
Posts: 5
Joined: Sun Jan 07, 2024 8:13 pm

Re: Shell script for ocrmypdf stopped working Mon Jan 08, 2024 10:46 am • by Mr_Noodle
You should try printing out different things (like the arguments passed in) to make sure everything is as you expect. It should show up in the logs.
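
A minimal sketch of that kind of check, assuming a POSIX-style shell. `debug_env` is a hypothetical helper name, not something Hazel provides. It prints the arguments the script received plus a few environment values that often differ between Terminal and a background process:

```shell
# Hypothetical helper: print what the script actually received.
debug_env() {
    echo "argument count: $#"
    for arg in "$@"; do
        # Brackets make leading/trailing whitespace in a path visible.
        echo "  arg: [$arg]"
    done
    echo "cwd:    $(pwd)"
    echo "PATH:   $PATH"
    echo "TMPDIR: ${TMPDIR:-<unset>}"
}

# Call it at the top of the embedded script, before any real work.
debug_env "$@"
```

Anything echoed this way should appear between the "== script output ==" markers in the Hazel log.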
Mr_Noodle
Site Admin
 
Posts: 11255
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Shell script for ocrmypdf stopped working Mon Jan 08, 2024 4:59 pm • by needle
When I print the arguments I don't see anything unexpected. I've tried reducing the script to a simpler form:

Code:
PATH=$PATH:/opt/homebrew/bin
export PATH

filename=$(basename "$1")
converting_directory=~/Documents/processed/
echo $converting_directory # /Users/sam/Documents/processed/
echo $filename # test.pdf
echo "$1" # /Users/sam/Documents/inbox/hazel-scan-ocr/test.pdf

ocrmypdf --no-progress-bar "$1" "$converting_directory""$filename"


I also tried recreating the script in an external file and calling it from Hazel, but both attempts result in the same error.

Code:
New Signature:{dateAdded >[cd] dateMatched}:{(shellscript:/opt/homebrew/bin/zsh:cba2c281810ecaa9d147a8c2dd7d34ad,{
})}
2024-01-08 12:52:32.230 hazelworker[60993] DEBUG: == script output ==
/Users/sam/Documents/processed/
test.pdf
/Users/sam/Documents/inbox/hazel-scan-ocr/test.pdf

    1 [tesseract] Leptonica Error in fopenReadStream: file not found: 000001_ocr.png
    1 [tesseract] Leptonica Error in findFileFormat: image file not found: /tmp/ocrmypdf.io.u1oblfdh/000001_ocr.png
    1 [tesseract] Leptonica Error in fopenReadStream: file not found: PNG
    1 [tesseract] Leptonica Error in pixRead: image file not found: PNG
    1 [tesseract] Image file PNG cannot be read!
    1 [tesseract] Error during processing.

SubprocessOutputError

== End script output ==


The fact that I can call the script directly from the terminal and it runs without error makes me think there's a pathing issue, but I'm not sure what else to try. In Hazel I have the shell set to /opt/homebrew/bin/zsh which is the output of $ which zsh.
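
One way to test the pathing theory is to ask the shell where each tool resolves from inside the Hazel script. This is a rough sketch, assuming the relevant tools are ocrmypdf, tesseract, tag, and gs (Ghostscript, which ocrmypdf also calls); adjust the list if your setup differs:

```shell
PATH=$PATH:/opt/homebrew/bin
export PATH

missing=0
for tool in ocrmypdf tesseract gs tag; do
    # command -v prints the resolved location and fails if the tool is not on PATH.
    if location=$(command -v "$tool"); then
        echo "$tool -> $location"
    else
        echo "$tool -> NOT FOUND on PATH"
        missing=$((missing + 1))
    fi
done
echo "missing tools: $missing"
```

If every tool resolves to the same location in both Terminal and Hazel, PATH is probably not the culprit and some other part of the environment is worth comparing.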
needle
 

What is 000001_ocr.png? That is the file the program is complaining about.

Note that running things in Terminal is a different environment from running them in Hazel. You cannot assume any env vars are set and need to set those in your script.
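
To get a feel for the difference from within Terminal, you can start a child shell with an empty environment. This is only an approximation and an assumption on my part: the thread does not document exactly which variables Hazel passes to scripts. `/bin/sh` is used here so the sketch runs anywhere; substitute the shell configured in Hazel if you prefer.

```shell
# Run a child shell with every inherited variable stripped (env -i),
# then report what TMPDIR looks like from inside it.
bare_tmpdir=$(env -i /bin/sh -c 'echo "${TMPDIR:-<unset>}"')

echo "TMPDIR in a bare environment: $bare_tmpdir"
echo "TMPDIR in this shell:         ${TMPDIR:-<unset>}"
```

Running the failing command under `env -i` in the same way is a quick test of whether it depends on a variable that only your login shell provides.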
Mr_Noodle

Re: Shell script for ocrmypdf stopped working Tue Jan 09, 2024 6:32 pm • by needle
000001_ocr.png is a temporary file created by Ghostscript as part of the OCR process. When I ran this issue by the OCRmyPDF creator, their response was:

My guess would be that macOS or Hazel is virtualizing /tmp in some way.
The file that is missing is supposed to be created by Ghostscript. (The gs -dQUIET command above tesseract.) That command completes without error, but the expected file is not created, so it's putting it somewhere else.

You could try setting an environment variable TMPDIR to redirect it to somewhere like /Users/you/tmp since I think that is less likely to be defeated.


After explicitly setting the TMPDIR environment variable as suggested, the script is working again. Here is the working version:

Code:
PATH=$PATH:/opt/homebrew/bin
export PATH
export TMPDIR="/Users/sam/tmp/"

filename=$(basename "$1")
converting_directory=~/Documents/processed/

if ocrmypdf --clean --rotate-pages --no-progress-bar -v 1 "$1" "$converting_directory""$filename"
then
   tag -a "pdf/a,scan" "$converting_directory""$filename"
   rm "$1"
   echo "OCR succeeded"
else
   echo "OCR failed"
fi


So at this point I guess my remaining questions are
1. Is this an appropriate solution for when Hazel needs to know about TMPDIR, and
2. What changed to make this solution necessary? How was Hazel able to access TMPDIR in the past?
needle
 

That would be the way to work around it. Is TMPDIR set when in Terminal?
Mr_Noodle

Re: Shell script for ocrmypdf stopped working Wed Jan 10, 2024 7:03 pm • by needle
Yes, when I echo $TMPDIR in Terminal it returns "/var/folders/blah/blah", but when I echo $TMPDIR in the Hazel embedded script it returns nothing.
needle
 

So that would be a case of something you have to explicitly define in Hazel's environment.
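
A slightly more defensive variant of the fix, as a sketch: keep TMPDIR if the environment already provides one, otherwise fall back to a directory under the home folder and make sure it exists. The `$HOME/tmp` location is an assumption modeled on the `/Users/sam/tmp/` path used earlier in the thread; any writable directory should do.

```shell
# Keep an inherited TMPDIR; otherwise fall back to $HOME/tmp (assumed location).
: "${TMPDIR:=$HOME/tmp}"

# Create the directory if it is missing so tools can write temp files right away.
mkdir -p "$TMPDIR"
export TMPDIR

echo "Using TMPDIR=$TMPDIR"
```

Placed near the top of the embedded script, next to the PATH export, this lets the same script run unchanged from both Terminal and Hazel.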
Mr_Noodle

Re: Shell script for ocrmypdf stopped working Thu Jan 11, 2024 2:24 pm • by needle
Makes sense! Not sure how it was working before, but happy to have a solution. Thank you for the guidance.
needle
 

