Content Matching: Multiple words as token?


Moderator: Mr_Noodle

Content Matching: Multiple words as token? Tue Jul 02, 2013 10:20 pm • by sd001632
Hi,

I have set up a query with content matching in which Hazel reads an invoice PDF and successfully recognizes the invoice number, but I can't figure out how to recognize the client's name because names are made up of a variable number of words. For example, some clients are called 'Baycrest Hospital' (2 words), others 'Senior Center for the Disabled' (5 words), and I don't know how to set up the token.

I need to find a way to recognize that name, regardless of how many words it contains, because Hazel then sorts invoices into subfolders according to that token.

Everything works except this element.

Thanks for your help,

Regards,
David
www.solpak.ca
P.S. I am a noob here, but I did try to find this issue in the forum. Thanks!
sd001632
 
Posts: 4
Joined: Mon Jul 01, 2013 9:45 am

Can you post an actual example? In particular, is there any context around this that you can leverage, like something saying "Client name: " followed by the client name itself?
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Hi,

here is the original invoice we scan: https://dl.dropboxusercontent.com/u/241 ... 20copy.pdf

Here is the rule:
https://dl.dropboxusercontent.com/u/241 ... 3%20PM.png

Here is the token:
https://dl.dropboxusercontent.com/u/241 ... 2%20PM.png

I am able to tell it to look after "Sold to:" in the OCR'd text, but I am not sure how to tell it how long the company name is. Is there a 'Line Break' I could put in the token?

That way, it would take everything after 'Sold to:' but not what is on the next line (the contact name underneath). If I specify a number of words and the company name is shorter, it also takes the following words (the contact name underneath).
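To make the goal concrete outside of Hazel, here is a minimal Python sketch of "take everything after the label, but stop at the line break". It is not Hazel's token syntax. The function name `rest_of_line_after` and the sample text (including the contact name) are hypothetical, and it assumes the label and the company name end up on the same line of the extracted text.

```python
import re

def rest_of_line_after(label, text):
    """Return whatever follows `label` on the same line, or None if absent."""
    # "." does not match a newline by default, so the capture stops at the
    # end of the line no matter how many words the name contains.
    match = re.search(re.escape(label) + r"[ \t]*(.+)", text)
    return match.group(1).strip() if match else None

# Hypothetical extracted text: company name on the label line, contact below.
sample = "Sold to: Senior Center for the Disabled\nJohn Smith\n"
client = rest_of_line_after("Sold to:", sample)
print(client)
```

The word count never appears in the pattern; the line break is the only delimiter.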

thanks!

David
sd001632
 
Posts: 4
Joined: Mon Jul 01, 2013 9:45 am

I'm not sure if this will work since "Sold to:" is on a separate line (patterns can't cross lines). Even then, I'm not sure how this would work. I'll need to take a closer look at how the text is structured internally and get back to you.
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Taking a closer look, it appears this document is not OCRed. So, there's no text content that can be matched at all. I say OCR it and then re-send it.
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Sorry, my initial example was a French invoice, and when I switched to an English one, I grabbed an older invoice that was not OCRed.

Now the file is OCRed.

I need to find a way to have that client name delimited so Hazel can match it without going on to the next line.

thanks
sd001632
 
Posts: 4
Joined: Mon Jul 01, 2013 9:45 am

I think in the next version, I'm going to allow spaces to also match a line separator so that you can have patterns traverse lines (at least traverse them wherever there are spaces). While useful in general, in this particular case, it may not work so well because the text is encoded in an order that you probably wouldn't expect. For instance, I get something like this:
Code:
Sold to:
Ship to:
Baycrest Center for Geriatric Care

The "Sold to:" and "Ship to:" are considered to be on the same line so you would have to take that into account when doing your patterns.
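As a rough illustration of taking that ordering into account, the Python sketch below skips over label-only text and returns the first real content it finds after "Sold to:". It is only a sketch of the idea, not something Hazel does internally. The function name `client_name` and the `LABELS` tuple are hypothetical, and it assumes the extracted text looks like the sample above.

```python
LABELS = ("Sold to:", "Ship to:")  # assumed set of labels on the invoice

def client_name(extracted_text):
    """Return the first non-label text at or after the "Sold to:" line."""
    lines = extracted_text.splitlines()
    for i, line in enumerate(lines):
        if "Sold to:" not in line:
            continue
        # Walk forward from the label line, stripping known labels, and
        # return the first line that still has text left over.
        for candidate in lines[i:]:
            remainder = candidate
            for label in LABELS:
                remainder = remainder.replace(label, "")
            remainder = remainder.strip()
            if remainder:
                return remainder
        return None
    return None

sample = "Sold to:\nShip to:\nBaycrest Center for Geriatric Care\n"
name = client_name(sample)
print(name)
```

The same function also copes with both labels arriving on one line, since it strips every known label before checking what is left.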
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Interesting discovery: this document was OCR'd after scanning by PDFpenPro 6, while our usual workflow is to OCR upon scanning with the snapscan software. Perhaps it does not OCR the same way, in the sense that it places text in a different order?

Hmmm, gotta look into that.

The line separator update I think will be great though.

Thanks,
sd001632
 
Posts: 4
Joined: Mon Jul 01, 2013 9:45 am

The version released this morning allows line breaks in patterns. Basically, anywhere there is a space in your pattern, it will also match a line break if possible. That said, there still may be issues with the ordering of text. PDF is a free-form format where text is laid out graphically and can be specified in any order. On top of that, if you are OCRing it, the OCR software may not consider chunks of text to be together even though they look that way to you.
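For readers who think in regular expressions, the "a space may also match a line break" behavior can be approximated in Python as below. This is an analogy under assumptions, not Hazel's actual implementation: the helper name `flexible_space_pattern` is hypothetical, and `\s+` accepts any run of whitespace, which may be broader than what Hazel allows.

```python
import re

def flexible_space_pattern(literal):
    """Compile `literal` so each space also matches a line break."""
    # Escape every word so it is matched literally, then join the words
    # with \s+ so a space, a newline, or a run of both is accepted.
    words = [re.escape(word) for word in literal.split(" ") if word]
    return re.compile(r"\s+".join(words))

pattern = flexible_space_pattern("Sold to:")
same_line = pattern.search("Sold to: Baycrest Hospital") is not None
across_lines = pattern.search("Sold\nto: Baycrest Hospital") is not None
print(same_line, across_lines)
```

This only relaxes where the whitespace falls. It does not help when the extracted text comes out in a different order than it appears on the page.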
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

