
Why is this content contains match not matching

Posted: Thu Jun 20, 2013 8:47 pm
by tamarax
I have a bill in a searchable PDF that I am trying to process. I made a rule that worked for a few of them, but not others from the same company. The exact format of the bill did change at one point, but as far as I can tell it should still match the rule.

I have a content contains match rule of "Invoice Date(anything token)(Invoice Date Token)"

The invoice date token is "12/31/1999".

The following is a paste of the relevant section of the searchable text in the PDF:

Invoice Date Account Number
Invoice Number Due Date
Total Due
Summary of Charges
Balance Information Previous Balance Payments Received
Adjustments
Past Due Balance
New Charges
New Usage Charges Non Recurring Charges Monthly Charges Finance Charges
Taxes and Surcharges Discounts
Total New Charges
TOTAL AMOUNT DUE
Your Total Number of Referrals:
01/07/2008


It has "invoice date" with a bunch of stuff between, and then a date in the format I specified. Why wouldn't that match?

I tried deleting the "contents contain match" condition from my rule and all the other conditions match fine, so it should be that condition that is the problem.

Re: Why is this content contains match not matching

Posted: Fri Jun 21, 2013 10:36 am
by a_freyer
Content matching happens only after Spotlight has indexed your file's content, so if you download a file and attempt to process it on content immediately, it is possible that matching will fail.

To test whether this is the case, find a PDF that is not currently matching on content, then run this command in Terminal to force an import of its metadata (and content) into Spotlight:

Code:
mdimport /path/to/your/file.pdf


If Hazel cannot recognize the file after this import, then there is something else going on. In this case, I would have a few more questions for you starting with: Do these PDFs include a text layer, or have they been OCR'd?

Re: Why is this content contains match not matching

Posted: Fri Jun 21, 2013 11:05 am
by Mr_Noodle
Actually, the content matching goes directly to the files, ignoring Spotlight. One of the reasons I hesitated to add this feature was the performance impact of having to read in the whole file, but as a result of some design decisions and some optimizations I put in, hopefully this impact is minimal. I also have other optimizations in mind should performance still be an issue.

That said, it probably won't work in this instance because it won't allow a pattern to go across lines. The main reason for this is that it limits the damage if someone creates a custom token like (...) which would grab the whole text of the file. Secondly, with PDFs, it might not be clear what order the lines are in, so you may get very unexpected results. Lastly, allowing patterns to match across lines would be much slower in some cases.

In your example, I only see one date so I'd try just matching the bare date without any contextual text around it and see if that works.
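
To illustrate the line-by-line limitation described above, here is a minimal sketch that uses grep as a stand-in for a matcher that never looks past the end of a line. It is not Hazel's code, the sample text is a shortened copy of the paste in the first post, and the function names are made up for the example.

```shell
#!/bin/sh
# Illustration only: grep stands in for a line-by-line matcher.
# This is NOT Hazel's implementation.

# Shortened copy of the OCR text from the first post.
sample_text() {
  printf '%s\n' \
    'Invoice Date Account Number' \
    'Invoice Number Due Date' \
    'Total Due' \
    'Your Total Number of Referrals:' \
    '01/07/2008'
}

# "Invoice Date", then anything, then a date shaped like 12/31/1999.
pattern='Invoice Date.*[0-9]{2}/[0-9]{2}/[0-9]{4}'

# Line by line: no single line holds both the label and the date.
matches_per_line() {
  if sample_text | grep -Eq "$pattern"; then echo yes; else echo no; fi
}

# Joined into one long line: the same pattern now finds a match.
matches_joined() {
  if sample_text | tr '\n' ' ' | grep -Eq "$pattern"; then echo yes; else echo no; fi
}

echo "per line: $(matches_per_line)"
echo "joined:   $(matches_joined)"
```

The label and the date sit on different lines of the extracted text, so a pattern that has to fit inside one line never sees both.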

Re: Why is this content contains match not matching

Posted: Wed Jun 26, 2013 2:37 pm
by tamarax
Hmm. Unfortunately, there are multiple dates in the file, so I'm not sure that matching on just the date would work. I only pasted the beginning of the file because, well, it's a lot of text, and since it's a bill, some would be personal. It is OCR'd, so it's unfortunate that the OCR interpreted the layout of the document as it did.

I don't suppose you know of any utilities that could manipulate/alter the text within the PDF?

Re: Why is this content contains match not matching

Posted: Wed Jun 26, 2013 2:47 pm
by Mr_Noodle
Don't know of any utilities off-hand, but I am planning to add a feature where you can specify which match in a file (1st, 2nd, 3rd, etc.) to use, which should help with cases like this.
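
Until something like that exists, the "pick the Nth match" idea can be approximated outside Hazel. The sketch below is only an approximation and says nothing about how the planned feature will behave; `nth_date` is a hypothetical helper and the two input lines are invented sample data.

```shell
#!/bin/sh
# nth_date N: print the Nth date-shaped string (like 12/31/1999) found on stdin.
# Hypothetical helper for illustration; not part of Hazel.
nth_date() {
  # grep -o prints each match on its own line; sed then keeps only line N.
  grep -Eo '[0-9]{2}/[0-9]{2}/[0-9]{4}' | sed -n "${1}p"
}

# Invented sample input with two dates; ask for the second one.
printf '%s\n' 'Invoice Date 01/07/2008' 'Due Date 02/01/2008' | nth_date 2
```

If there are fewer than N dates in the input, the function prints nothing.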

Re: Why is this content contains match not matching

Posted: Wed Jun 26, 2013 3:09 pm
by a_freyer
I'd run a shell script here.
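
A rough sketch of what such a script might look like, assuming the PDF text has already been extracted to plain text by some other tool (for example `pdftotext` from Poppler, which is not installed by default). `first_date_after_label` is a made-up name, and how the result gets wired back into a Hazel rule is left open.

```shell
#!/bin/sh
# first_date_after_label: read plain text on stdin and print the first
# date-shaped string (like 12/31/1999) found on or after the first line
# that contains "Invoice Date". Exits 0 if a date was found, 1 otherwise.
# Hypothetical helper for illustration only.
first_date_after_label() {
  awk '
    /Invoice Date/ { seen = 1 }
    seen && match($0, /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/) {
      print substr($0, RSTART, RLENGTH)
      found = 1
      exit
    }
    END { exit(found ? 0 : 1) }
  '
}

# Possible use, assuming pdftotext is installed and $1 is the PDF path:
#   pdftotext "$1" - | first_date_after_label
```

Because awk carries the `seen` flag from one line to the next, it does not matter that the label and the date are on different lines. The script still depends on the order in which the OCR laid out the text, so it picks whichever date comes first after the label, which may or may not be the invoice date.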