Buggy PDF content checking?

Hi,

I think there is a sneaky bug in the way Hazel inspects PDF content: it sometimes fails to match content when it should.
Regularly, I put OCR'd PDFs into my Action folder on Dropbox, to be processed and filed by Hazel.
Sometimes the rules do not recognize the files' content even though they should. Deactivating the rule, moving the file away, putting it back and reactivating the rule, or even modifying the file in an unrelated way sometimes does the trick.

For instance, here are two cases that just happened to me:
A)
- Got a fresh OCR'd file in my Action folder. Hazel does not notice it based on criteria that it should fulfill.
- I open Hazel and use the rule preview function to check whether the PDF file is recognized. Of the four criteria, only one matches, so the rule does not fire (even though it did on similar files a moment before). So far, I get it.
- Okay, I'm thinking, maybe I have special characters in my OCR'd PDF file that prevent recognition by Hazel, so I go into my PDF, copy the exact string I'm looking for, and paste it into the corresponding field in Hazel. Boom, suddenly, not only does that criterion match, but so do the other two that I have not touched and that were NOT matching a second before!

B) Same kind of situation:
- Got a fresh OCR'd file in my Action folder. Hazel does not notice it based on criteria that it should fulfill.
- So I open the PDF in macOS' Preview, remove a blank page in the PDF (that has nothing to do with the rule's criteria).
- Hazel suddenly notices the file it was not noticing a moment before and acts on it as it should.

Am I doing something wrong here? It seems to me there is something weird going on with file or rule refreshes, but I obviously don't know! If it's a bug, just wanted to report it, and otherwise, can someone tell me what I'm doing wrong...?

I think I'll need to see more specifics. Screenshots of the rule/preview would help so I know specifically what you are matching on.

Here you go (the blanked field is my social security number, a number sequence).

When files aren't matched, only the last criterion (date match) is reported as matching. This shows up clearly when I test it in the rule preview.
When I alter the file or the rule (remove a blank page / add another criterion), it strangely wakes Hazel up, all criteria match, and the rule fires as it should.

Ok, the issue here is that you are using "Contents contain", which relies on Spotlight. I'm guessing that Spotlight hasn't indexed the file yet, and when you modify it, that gives it a little kick so it wakes up and indexes the file. What program is saving the file in the first place?
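If you want to test the Spotlight theory outside of Hazel, here is a rough Python sketch that asks Spotlight directly, through the macOS `mdfind` command line tool, whether it finds a phrase in a given file. The function names are made up for illustration, and the query on the `kMDItemTextContent` attribute is my assumption about how to search indexed text. It is not necessarily what Hazel does internally.

```python
import subprocess
from pathlib import Path


def build_mdfind_command(pdf_path, phrase):
    """Build an mdfind call that searches only the file's folder for the phrase."""
    if '"' in phrase:
        # Quote escaping in Spotlight queries is not handled in this sketch.
        raise ValueError("phrases containing double quotes are not supported")
    folder = str(Path(pdf_path).parent)
    # Asterisks allow a substring match; the trailing 'c' ignores case.
    # Assumption: indexed text is searchable through kMDItemTextContent.
    query = f'kMDItemTextContent == "*{phrase}*"c'
    return ["mdfind", "-onlyin", folder, query]


def spotlight_sees_phrase(pdf_path, phrase, run=subprocess.run):
    """Return True if Spotlight lists pdf_path among the files containing phrase.

    pdf_path should be an absolute path, because mdfind prints absolute paths.
    The 'run' parameter exists only so the call can be replaced in tests.
    """
    result = run(build_mdfind_command(pdf_path, phrase),
                 capture_output=True, text=True, check=False)
    hits = {line.strip() for line in result.stdout.splitlines() if line.strip()}
    return str(Path(pdf_path)) in hits
```

If this returns False for a freshly OCR'd file and True a little later (or right after you modify the file), that would be consistent with Spotlight simply not having indexed the file yet.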

An alternative is to use "Contents contain match" for all cases as that does a direct scan of the file and is therefore more reliable. It's also more resource intensive though. I'd suggest adding conditions above them to filter out certain files before they get to those conditions.
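To illustrate why the order of conditions matters, here is a small Python sketch of the same idea: cheap checks on the file name run first, and the expensive content scan only happens for files that pass them. This is only an analogy for how you might arrange the conditions, not Hazel's actual code. The `scan_` prefix and the `extract_text` callable are hypothetical placeholders for your own naming scheme and for whatever does the direct text scan.

```python
from pathlib import Path


def matches_rule(path, phrases, extract_text):
    """Apply cheap filters first, and only scan the contents if they all pass."""
    p = Path(path)
    # Cheap checks: these look at the name only and never read the file.
    if p.suffix.lower() != ".pdf":
        return False
    if not p.name.startswith("scan_"):  # hypothetical naming scheme
        return False
    # Expensive step: extract_text is a caller-supplied (hypothetical) function
    # that returns the document's text, for example via a PDF library.
    text = extract_text(p).lower()
    return all(phrase.lower() in text for phrase in phrases)
```

With this ordering, a folder full of unrelated files costs almost nothing, because the text extraction is only reached for the few files that pass the name checks.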

Aaaah okay, thank you for that insight, that totally makes sense now!

The complete workflow is thus:
- Automatic scanning via Epson's Event Manager (feeder scanner GTS55) which creates a PDF file as image
- Hazel watches for the new scans based on a file naming scheme. When it finds one, it launches OCR via Automator, which calls ABBYY FineReader (which then creates the PDF, to answer your question precisely)
- I then have Hazel watch the same folder for the OCR'd file contents.

It totally makes sense that it would take some time to kick in if it needs to wait for Spotlight. I can wait, then.

FWIW, I had done tests with PDFpen for OCR and I don't recall having encountered that issue. But I find PDFpen messes up the OCR much more than ABBYY FineReader, so that's why I'm using the latter.