Hi, new user here.
Trying to set up some rules to process PDF files that are generated by my scanner. During the scan process, I have checked the box to OCR the scan and include the OCR text in the PDF that is saved. So far everything is working as expected and I can successfully set up rules where I can check that the "Contents" "contain" various text.
But the text generated during the OCR process sometimes has issues. For example, one particular type of invoice has a name printed on it in all caps: "JOHN SMITH". The OCR will often convert this to "J O H N S M I T H" with extra spaces between the letters. Most times there is just one space, sometimes there are multiple spaces. I'd rather not have to explicitly specify every exact permutation that has been generated so far.
So if there was a way to pass the Contents of this file to a shell script, I could just run grep in the script to evaluate the Contents against a regular expression that allows zero or more space characters between the letters I'm searching for.
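To illustrate the kind of match I mean, here is a rough sketch in Python rather than grep. The function name `spaced_pattern` is just something I made up, and it assumes the OCR text is already available as a string, which is exactly the part I haven't solved.

```python
import re

def spaced_pattern(phrase):
    """Build a regex that matches `phrase` with any amount of
    whitespace (including none) between its characters."""
    # Drop the spaces already in the phrase, escape each remaining
    # character, and allow optional whitespace between them.
    letters = [re.escape(ch) for ch in phrase if not ch.isspace()]
    return re.compile(r"\s*".join(letters))

pattern = spaced_pattern("JOHN SMITH")

# Sample strings standing in for the OCR text (made up for illustration).
for sample in ["JOHN SMITH", "J O H N S M I T H", "J  O H N   S M I T H"]:
    print(sample, "->", bool(pattern.search(sample)))
```

The grep equivalent would be something like `grep -E 'J *O *H *N *S *M *I *T *H'`. Note that either version also matches "JOHNSMITH" with no spaces at all, which is fine for my purposes.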
However, it seems like the "passed shell script" option just passes in the file path, NOT the contents. This would mean that I would need to figure out some other tool to open the PDF and either OCR the text again or extract the existing text contents. This seems wasteful if the contents are already available in the file, but I'm not sure how to get at them without using a third-party tool of some kind.
In general, the matching mechanisms in Hazel seem to fall quite short of what can be done with regular expressions. Or maybe I just don't understand them yet.
Any ideas how I could accomplish this? Maybe I am overcomplicating it.