Method for more complex evaluation of PDF text contents

Hi, new user here.

I'm trying to set up some rules to process PDF files that are generated by my scanner. During the scan process, I have checked the box to OCR the scan and include the OCR text in the saved PDF. So far everything is working as expected, and I can successfully set up rules that check whether the "Contents" "contain" various text.

But the text that is generated during the OCR process sometimes has issues. For example, one particular type of invoice has a name printed on it in all caps: "JOHN SMITH". The OCR will often convert this to "J O H N S M I T H" with extra spaces between the letters. Most times there is just one space, sometimes there are multiple spaces. I'd rather not have to explicitly specify each exact permutation of acceptable matches that has been generated.

So if there were a way to pass the Contents of this file to a shell script, I could just run grep in the script to evaluate the Contents against a regular expression that allows any number (0+) of optional space characters between the desired search letters.
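To illustrate the kind of regular expression meant here, this is a minimal sketch in Python rather than grep. The helper name `spaced_pattern` is made up for this example, and it assumes the OCR text is already available as a string; it does not show how to get the text out of Hazel or the PDF.

```python
import re

def spaced_pattern(name):
    """Build a regex that matches `name` with any amount of whitespace
    (including none) between consecutive letters. Hypothetical helper."""
    # Drop the spaces in the name itself, escape each remaining character,
    # then allow zero or more whitespace characters between them.
    chars = [re.escape(c) for c in name if not c.isspace()]
    return re.compile(r"\s*".join(chars), re.IGNORECASE)

pattern = spaced_pattern("JOHN SMITH")

# Try the pattern against a few spacing variants.
for sample in ["JOHN SMITH", "J O H N S M I T H", "J  O H N   S M I T H"]:
    print(repr(sample), bool(pattern.search(sample)))
```

With grep the same idea would be something like `grep -E 'J *O *H *N *S *M *I *T *H'`. Note that this only tolerates extra spaces; it will not match text where the OCR has dropped or substituted a character.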

However, it seems like the "passed shell script" option just passes in the file path, NOT the contents. This would mean that I would need to figure out some other tool to open the PDF and either OCR the text again or extract the existing text contents. This seems wasteful if the contents are already available in the file, but I'm not sure how to get at them without using a third-party tool of some kind.

In general, the matching mechanisms in Hazel seem to fall quite short of what can be done with regular expressions. Or maybe I just don't understand them yet.

Any ideas how I could accomplish this? Maybe I am overcomplicating it.

Have you tried using Hazel's OCR to see if you get better results?

That's a good idea. I hadn't noticed this option, so I just tried it. While it seems to have done better with the specific example of extra spaces, it is actually doing significantly worse at getting the actual text correct.

Using the scanner software I get things like (for example):
J O H N T S M I T H
Ticking the box to force Hazel to run OCR even if the file has text data, I get:
OHN7SMITH

So yeah, better but also worse at the same time.
Maybe I could set something up using cascading folder rules that would do multiple comparisons using different OCR methods.
Or maybe the more direct solution is to have an OR condition with multiple comparisons, one for each permutation of the value "seen in the wild" after scanning multiple versions of this type of document (a statement from a particular financial institution).
Thanks for your thoughts on this.

At the moment, it might be better to have different patterns to try to compensate. You could try a list/table containing all the different variations of a name.
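As a rough sketch of the "list of variations" idea outside of Hazel, the check could look something like the following in Python. The names `KNOWN_VARIANTS` and `contains_known_variant` are invented for this example, the variants are just the ones mentioned in this thread, and the text is assumed to be available as a plain string.

```python
# Variants of the name seen so far in OCR output (taken from this thread).
KNOWN_VARIANTS = [
    "JOHN SMITH",
    "J O H N S M I T H",
    "J O H N T S M I T H",
    "OHN7SMITH",
]

def contains_known_variant(text, variants=KNOWN_VARIANTS):
    """Return True if any listed variant appears in the text.
    Hypothetical helper, not part of Hazel."""
    # Collapse runs of whitespace to single spaces and uppercase the text,
    # so one list entry covers any number of spaces between letters.
    normalized = " ".join(text.split()).upper()
    return any(variant in normalized for variant in variants)

print(contains_known_variant("Statement for J  O H N   S M I T H"))
print(contains_known_variant("Statement for JANE DOE"))
```

Collapsing whitespace first keeps the list short, but each new OCR misreading still has to be added by hand once it shows up.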