
Contain Match no hits, Contain does

Posted: Tue Jul 19, 2022 8:26 am
by etienne74
Hello,

I have some PDF invoices from Pinterest.
If I use contents - contain, it does find e.g. 'Date:',
but when I use contents - contain match, it can't find 'Date:'.

When I preview the file, all the text seems to be in inverse order, so 'Date' becomes 'etaD'.

Also, the invoice date of 06/30/2022 is shown as 2202/03/60 in the preview.
I can extract the individual numbers, but I don't think I can turn them back into the correct date and then determine which quarter it falls in. Right?
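For what it's worth, the reversal itself is mechanical to undo once the matched text reaches a script. The sketch below is only an illustration in Python, not something Hazel does on its own: `unreverse` and `quarter_of` are hypothetical helper names, and it assumes each line is reversed character by character and that the date is in US month/day/year form, as in the example above.

```python
from datetime import datetime


def unreverse(text: str) -> str:
    """Reverse the characters of each line independently (assumed damage pattern)."""
    return "\n".join(line[::-1] for line in text.split("\n"))


def quarter_of(date_text: str, fmt: str = "%m/%d/%Y") -> int:
    """Return the calendar quarter (1-4) of a date string in the given format."""
    parsed = datetime.strptime(date_text, fmt)
    return (parsed.month - 1) // 3 + 1


# The value as it appears in the preview, per the example above.
fixed = unreverse("2202/03/60")
print(fixed, "is in quarter", quarter_of(fixed))
```

Whether the corrected value can be fed back into a rule depends on how the script is wired into Hazel, which this sketch does not cover.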

So in short: the preview shows all characters in reverse, the contain function still matches the correct words, but contain match does not.

Any ideas are highly appreciated.

Re: Contain Match no hits, Contain does

Posted: Tue Jul 19, 2022 9:06 am
by Mr_Noodle
Not sure if much can be done on Hazel's end here. There are some screwed up PDFs that do that. Not sure how possible this is, but you may want to look into getting this fixed at the source (whoever is generating the PDF).

Re: Contain Match no hits, Contain does

Posted: Sat Jul 23, 2022 8:49 am
by etienne74
Mr_Noodle wrote:Not sure if much can be done on Hazel's end here. There are some screwed up PDFs that do that. Not sure how possible this is, but you may want to look into getting this fixed at the source (whoever is generating the PDF).


Thanks for the reply.
So does Contain use a different method than Contain Match when looking for hits?

Pinterest generates these PDFs. I'll ask them, but we probably already know the answer ;-)

Re: Contain Match no hits, Contain does

Posted: Mon Jul 25, 2022 9:40 am
by Mr_Noodle
You can try opening them up in Preview, then doing Print->Save as PDF and see if that fixes it.

Re: Contain Match no hits, Contain does

Posted: Thu Nov 24, 2022 4:58 am
by harricool
Did you come to a resolution on this? I have the same issue with certain PDF invoices.

Re: Contain Match no hits, Contain does

Posted: Fri Nov 25, 2022 12:21 pm
by Mr_Noodle
Did you try the suggestions above, and if so, what were the results?

Re: Contain Match no hits, Contain does

Posted: Thu Dec 01, 2022 2:26 am
by harricool
No success with re-saving as PDF or printing to PDF. Same issue as OP: the 'contain' condition successfully finds the data as if it were not reversed, but the 'contain match' condition shows reversed character order within a line.

I'm going to experiment with a workflow using a script and ocrmypdf (https://github.com/ocrmypdf/OCRmyPDF):
- force conversion to image at [x] DPI
- OCR image
- save result as PDF
- run through Hazel for filing as normal
CMIIW, but I don't see a way around needing two rules to handle this: one for the pre-processing OCR and one for the extraction/filing.
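The pre-processing steps above might look roughly like this as a small Python wrapper around the ocrmypdf command line. This is an untested sketch under assumptions: ocrmypdf is installed and on the PATH, `build_ocr_command` and `ocr_preprocess` are hypothetical names, and 300 is just a placeholder for the [x] DPI. As I read the ocrmypdf docs, `--force-ocr` rasterizes every page and discards the existing (reversed) text layer before OCR, and `--oversample` sets a minimum DPI for that step.

```python
import subprocess
from pathlib import Path


def build_ocr_command(src: Path, dst: Path, dpi: int = 300) -> list[str]:
    """Build the ocrmypdf call: rasterize pages, OCR them, write a new PDF."""
    return [
        "ocrmypdf",
        "--force-ocr",              # ignore the existing text layer, OCR from the image
        "--oversample", str(dpi),   # placeholder DPI, stands in for [x] above
        str(src),
        str(dst),
    ]


def ocr_preprocess(src: Path, dst: Path, dpi: int = 300) -> None:
    """Run ocrmypdf; raises CalledProcessError if it exits with an error."""
    subprocess.run(build_ocr_command(src, dst, dpi), check=True)
```

The first rule would call something like `ocr_preprocess(Path("invoice.pdf"), Path("invoice-ocr.pdf"))` from a script, and the second rule would then match on the re-OCRed file.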