Why is this content contains match not matching

Moderator: Mr_Noodle

I have a bill in a searchable PDF that I am trying to process. I made a rule that worked for a few of them, but not others from the same company. The exact format of the bill did change at one point, but as far as I can tell it should still match the rule.

I have a content contains match rule of "Invoice Date(anything token)(Invoice Date Token)"

The invoice date token is "12/31/1999".

The following is a paste of the relevant section of the searchable text in the PDF:

Invoice Date Account Number
Invoice Number Due Date
Total Due
Summary of Charges
Balance Information Previous Balance Payments Received
Adjustments
Past Due Balance
New Charges
New Usage Charges Non Recurring Charges Monthly Charges Finance Charges
Taxes and Surcharges Discounts
Total New Charges
TOTAL AMOUNT DUE
Your Total Number of Referrals:
01/07/2008


It has "invoice date" with a bunch of stuff between, and then a date in the format I specified. Why wouldn't that match?

I tried deleting the "contents contain match" condition from my rule and all the other conditions match fine, so it should be that condition that is the problem.
tamarax
 
Posts: 13
Joined: Thu Mar 01, 2012 11:46 pm

Content matching happens only after Spotlight has indexed your file's content, so if you download a file and try to match on its content immediately, the match may fail.

To test whether this is the case, find a PDF that is not currently matching on content and run this command in Terminal to force an import of its metadata (and content) into Spotlight:

mdimport /path/to/your/file.pdf


If Hazel cannot recognize the file after this import, then there is something else going on. In this case, I would have a few more questions for you starting with: Do these PDFs include a text layer, or have they been OCR'd?
a_freyer
 
Posts: 631
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado

Actually, the content matching goes directly to the files, ignoring Spotlight. One of the reasons I hesitated to add this feature was the performance impact of having to read in the whole file, but as a result of some design decisions and some optimizations I put in, this impact is hopefully minimal. I also have other optimizations in mind should performance still be an issue.

That said, it probably won't work in this instance because it won't allow a pattern to go across lines. The main reason for this is that it limits the damage if someone creates a custom token like (...) which would grab the whole text of the file. Secondly, with PDFs, it might not be clear what order the lines are in, so you may get very unexpected results. Lastly, allowing patterns to match across lines would be much slower in some cases.

In your example, I only see one date so I'd try just matching the bare date without any contextual text around it and see if that works.
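
To illustrate the line limit, here is a rough sketch using grep, which also matches one line at a time by default. It is only an analogy: Hazel's matcher is not grep, and the regular expression standing in for the 12/31/1999 date token is an assumption. The sample text is an abbreviated copy of the excerpt from the first post.

```shell
#!/bin/sh
# Abbreviated sample of the extracted PDF text from the first post.
sample='Invoice Date Account Number
Invoice Number Due Date
Total Due
Your Total Number of Referrals:
01/07/2008'

# Stand-in for the 12/31/1999 date token (an assumption, not Hazel syntax).
date_re='[0-9]{2}/[0-9]{2}/[0-9]{4}'

# "Invoice Date", anything, then a date, all on ONE line: finds nothing,
# because the label and the date sit on different lines.
same_line=$(printf '%s\n' "$sample" | grep -E "Invoice Date.*$date_re" || true)

# The bare date alone matches, since it fits on a single line.
bare_date=$(printf '%s\n' "$sample" | grep -Eo "$date_re" | head -n 1)

echo "same-line match: [$same_line]"
echo "bare date match: [$bare_date]"
```

The first pattern comes back empty while the bare date is found, which is the same behavior described for the rule in question.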
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Hmm. Unfortunately, there are multiple dates in the file, so I'm not sure that matching on just the date would work. I only pasted the beginning of the file because, well, it's a lot of text, and since it's a bill, some would be personal. It is OCR'd, so it's unfortunate that the OCR interpreted the layout of the document as it did.

I don't suppose you know of any utilities that could manipulate/alter the text within the PDF?
tamarax
 

Don't know of any utilities off-hand, but I am planning to add a feature where you can specify which match in a file (1st, 2nd, 3rd, etc.) to use, which should help with cases like this.
Mr_Noodle

I'd run a shell script here.
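
A minimal sketch of what such a script might look like, under several assumptions: that pdftotext (from Poppler) is installed, that the calling rule passes the file path as `$1` and treats exit status 0 as a match, and that the first date after the "Invoice Date" label is the one wanted. The function name `first_invoice_date` is made up for this example, and the text order pdftotext produces may differ from what was pasted above.

```shell
#!/bin/sh
# Print the first MM/DD/YYYY date found on or after the line that
# contains "Invoice Date". Reads plain text on stdin.
# Exits 0 if a date was found, 1 otherwise.
first_invoice_date() {
  awk '
    /Invoice Date/ { seen = 1 }
    seen && match($0, /[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9][0-9][0-9]/) {
      print substr($0, RSTART, RLENGTH)
      found = 1
      exit
    }
    END { exit !found }
  '
}

# When run with a PDF path as the first argument (an assumption about how
# the rule passes the file), extract its text and search it.
if [ -n "${1:-}" ]; then
  pdftotext "$1" - | first_invoice_date
fi
```

Unlike a single-line pattern, this remembers that the label has been seen and keeps scanning later lines, so the label and the date no longer need to share a line.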
a_freyer
 

