Date matching fails

Date matching fails Mon Sep 16, 2019 2:33 am • by zoobert
Certain date formats do not appear to work. In this case, I have a quarterly report with a date in the format "January 1, 2019 - March 31, 2019". I've created a rule to extract the date, but nothing I do seems to work. In the image below, you can see I have extracted the "date" in both text and date form, but only one appears to work. I've tried both letting Hazel detect the date format automatically and specifying the date format myself, but neither works. Can you provide me with some suggestions?

https://drive.google.com/file/d/1F2qdN5WHkG74t6_xWzFbZwpmGIPBVrzI/view?usp=sharing
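For reference, the date text itself is easy to parse outside Hazel, which suggests the format is not the real obstacle. The following is only a sketch in Swift using Foundation's DateFormatter, not Hazel's own matching logic. The helper name parseQuarterRange and the " - " separator handling are assumptions made for illustration.

```swift
import Foundation

// Hypothetical helper: splits a "start - end" range and parses each side
// with the pattern "MMMM d, yyyy" (full month name, day, four-digit year).
func parseQuarterRange(_ text: String) -> (start: Date, end: Date)? {
    let formatter = DateFormatter()
    // Fixed locale and time zone so English month names parse the same everywhere
    formatter.locale = Locale(identifier: "en_US_POSIX")
    formatter.timeZone = TimeZone(identifier: "UTC")
    formatter.dateFormat = "MMMM d, yyyy"

    let parts = text.components(separatedBy: " - ")
    guard parts.count == 2,
          let start = formatter.date(from: parts[0]),
          let end = formatter.date(from: parts[1]) else { return nil }
    return (start, end)
}

if let range = parseQuarterRange("January 1, 2019 - March 31, 2019") {
    print(range.start, range.end)
} else {
    print("Could not parse the date range")
}
```

If a sketch like this parses the text but the Hazel rule still fails, the text Hazel sees is probably not what is shown on the page.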

Thanks,
zoobert
 
Posts: 6
Joined: Sun Sep 15, 2019 9:45 pm

Re: Date matching fails Mon Sep 16, 2019 10:08 am • by Mr_Noodle
Can you post the date formats you are using? Also, click on the red X button as that will give you more info.
Mr_Noodle
Site Admin
 
Posts: 8363
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Date matching fails Tue Sep 17, 2019 1:50 am • by zoobert
I tried several date formats, but none of them work. This shows the date format in "December 31, 1999": https://www.dropbox.com/s/grhuqob7ose0i ... M.png?dl=0

I think the problem may be related to the data stored in the PDF. When I try to view the contents of a valid match or invalid match, I do not see anything:
* Valid match : https://www.dropbox.com/s/zi4vrh9w7cqtf ... M.png?dl=0
* Invalid match: https://www.dropbox.com/s/p1rug5rb3ibez ... M.png?dl=0

What I find odd here is that the "contains" condition matches OK, but the "contains match" condition does not. I've tried a single alphabetic character for a "contains match" and that fails too. It is almost as if the PDF has no contents that it recognizes for a "contains match". I've output the text contained within the PDF using "pdftotext" and it does contain valid data, so I'm stumped. The files which are giving me problems are more recent statements from Fidelity Investments. Older statements (two years old or more) seem to be processed OK.

Do you know of anything I should look for, or any post-processing of these PDF files that I can do to "normalize" them into a state that Hazel can recognize?

Re: Date matching fails Tue Sep 17, 2019 10:59 am • by Mr_Noodle
Did you click on the button with three dots? That will expand to show everything.

Also note that "Contents contain" uses Spotlight whereas "Contents contain match" has Hazel extract the text itself.

Re: Date matching fails Tue Sep 17, 2019 10:38 pm • by zoobert
When I hit the button with the three dots it does not show anything. Strange, eh?

This is not the case with all PDFs, just a few of them. They have content that can be seen with pdftotext, but I'm not sure how they are different.

Re: Date matching fails Wed Sep 18, 2019 9:58 am • by Mr_Noodle
Text in PDF documents can be troublesome. I'm not sure what exactly is going on here but if you open the file in Preview, are you able to search and select text there?

Re: Date matching fails Thu Sep 19, 2019 10:06 am • by zoobert
Yes, but I did not record a screencast to prove this. What are you using to extract the text data for "contents contains match"? I was planning to explore this further over the weekend by writing a small program to do the same thing. If you can't share the details (e.g., language, libraries, etc.), I'll probably just try Python and PDFMiner or something of that flavor.

Re: Date matching fails Thu Sep 19, 2019 10:34 am • by Mr_Noodle
If you're a developer, I'm using Apple's PDFDocument class. I assume that's what Preview uses but I could be wrong there.

Re: Date matching fails Sun Sep 22, 2019 11:03 am • by zoobert
I've fiddled around with this some and confirmed that I cannot see the page text in the document in question, while I can in others, using the following simple Swift code (note: I'm a Java developer and not familiar with Swift):
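Code:
import Foundation
import PDFKit

// Expand "~" and build a file URL for the PDF under test
let fileName = NSString(string: "~/Downloads/<some file>.pdf").expandingTildeInPath
let fileUrl = URL(fileURLWithPath: fileName)
let pdfDocument = PDFDocument(url: fileUrl)
// Page indices are zero-based, so index 0 is the first page
let page1 = pdfDocument?.page(at: 0)

// Print the page text, or a placeholder when PDFKit returns none
print(page1?.string ?? "<no text>")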


The interesting thing I discovered is that parts of the document in question are missing in Preview, but can be seen in Adobe Reader. This particular problem has been encountered by others: https://discussions.apple.com/thread/3304322

I'm not sure of a way around this, nor can I explain why NONE of the text is available through PDFKit while some of it can be seen in Preview. I have confirmed that the text that appears "hidden" in Preview cannot be found with a "contains" rule, but text that is not "hidden" can be found.
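To check whether the problem affects every page or only some, the snippet above could be extended to report how much text PDFKit returns per page. This is a rough sketch only. The helper name extractedTextLengths and the file name statement.pdf are placeholders, not anything from Hazel or from the actual documents.

```swift
import Foundation
import PDFKit

// Hypothetical helper: number of characters PDFKit extracts from each page.
// A page with no extractable text is reported as 0.
func extractedTextLengths(of document: PDFDocument) -> [Int] {
    (0..<document.pageCount).map { index in
        document.page(at: index)?.string?.count ?? 0
    }
}

// Placeholder path; substitute a real file
let path = NSString(string: "~/Downloads/statement.pdf").expandingTildeInPath
if let document = PDFDocument(url: URL(fileURLWithPath: path)) {
    for (index, length) in extractedTextLengths(of: document).enumerated() {
        print("page \(index + 1): \(length) characters")
    }
} else {
    print("Could not open \(path)")
}
```

A document that reports 0 for every page while pdftotext still produces output would match the behavior described here.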

I'm not sure that helps. I wish I could share the document causing the problem, but it has too much PII. I'll continue to look for a solution on my end, but until then these documents cannot take advantage of Hazel goodness.

Re: Date matching fails Mon Sep 23, 2019 10:03 am • by Mr_Noodle
Thanks for the info. Maybe there's some function in Acrobat that can convert the PDF into something Preview/Hazel can use?

Re: Date matching fails Tue Sep 24, 2019 2:28 am • by zoobert
I believe I've discovered the reason for the problem and the solution.

First, the thing that is unique about the documents is that they contain form fields. Although it is not apparent, these are financial statements generated by a financial advisor from a financial institution, so each one is sort of like a form letter with personalized information. With these fields present, Preview shows only the non-field data, and the Swift code above is not able to extract ANYTHING.
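One way to check for form fields from code is to look at each page's annotations, since PDFKit represents form fields as "Widget" annotations. The following is only a sketch: the helper name formFieldValues is made up, and I have not verified that PDFKit exposes the fields in these particular statements.

```swift
import Foundation
import PDFKit

// Hypothetical helper: collects (field name, value) pairs from form-field
// annotations on every page of the document.
func formFieldValues(in document: PDFDocument) -> [(name: String, value: String)] {
    var fields: [(name: String, value: String)] = []
    for index in 0..<document.pageCount {
        guard let page = document.page(at: index) else { continue }
        for annotation in page.annotations {
            // Accept the subtype with or without a leading slash, as an assumption
            let type = annotation.type ?? ""
            guard type == "Widget" || type == "/Widget" else { continue }
            fields.append((name: annotation.fieldName ?? "<unnamed>",
                           value: annotation.widgetStringValue ?? ""))
        }
    }
    return fields
}

// Placeholder path; substitute a real file
let formPath = NSString(string: "~/Downloads/statement.pdf").expandingTildeInPath
if let document = PDFDocument(url: URL(fileURLWithPath: formPath)) {
    for field in formFieldValues(in: document) {
        print("\(field.name): \(field.value)")
    }
}
```

A non-empty result would at least confirm that the visible "text" lives in fields rather than in the page content.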

Second, I was able to get Hazel to recognize the data from the fields by flattening the document, a process which basically turns the form field data into static text. In Adobe's tools this is only available in the paid versions of Acrobat, which I was unwilling to pay for ($180/yr), but you can accomplish the same thing with pdf2ps and ps2pdf (see https://unix.stackexchange.com/questions/162922/is-there-a-way-to-flatten-a-pdf-image-from-the-command-line).
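For anyone who wants to script the flattening step, here is a minimal sketch in Swift that shells out to the same two tools. It assumes pdf2ps and ps2pdf (both part of Ghostscript) are installed and on the PATH. The function names and the intermediate ".ps" file name are my own choices for illustration.

```swift
import Foundation

// Builds the two command lines. The tool names come from the post;
// the intermediate PostScript file name is an assumption.
func flattenCommands(input: String, output: String) -> [[String]] {
    let intermediate = output + ".ps"
    return [["pdf2ps", input, intermediate],
            ["ps2pdf", intermediate, output]]
}

// Runs one command through /usr/bin/env so the tool is looked up on PATH,
// and returns its exit status.
func run(_ arguments: [String]) throws -> Int32 {
    let process = Process()
    process.executableURL = URL(fileURLWithPath: "/usr/bin/env")
    process.arguments = arguments
    try process.run()
    process.waitUntilExit()
    return process.terminationStatus
}

// Converts PDF -> PostScript -> PDF; returns false if either step fails.
func flatten(input: String, output: String) throws -> Bool {
    for command in flattenCommands(input: input, output: output) {
        let status = try run(command)
        if status != 0 { return false }
    }
    return true
}
```

Calling flatten(input: "statement.pdf", output: "statement-flat.pdf") would run both steps in order. The sketch leaves the intermediate .ps file behind, so delete it afterwards if you don't want it.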

I believe you can replicate the problem with any kind of PDF form. I don't know of any alternatives to PDFKit, but I will have to flatten the files regardless in order for them to display properly in Evernote. Thanks for your help.

Re: Date matching fails Tue Sep 24, 2019 10:01 am • by Mr_Noodle
Thanks for the update and info and glad you figured it out. Not sure if this is something Apple will update in PDFKit but one can hope.

