
Content Matching: Multiple words as token?

Posted: Tue Jul 02, 2013 10:20 pm
by sd001632
Hi,

I have set up a query with content matching whereby Hazel looks at an invoice's PDF and successfully recognizes the invoice number, BUT I can't figure out how to recognize the client's name BECAUSE the names are made up of a variable number of words. For example, some clients are called 'Baycrest Hospital' (2 words), some 'Senior Center for the Disabled' (5 words), and I don't know how to program the token.

I need to find a way to recognize that name, regardless of how many words it contains, because Hazel then sorts invoices into subfolders according to that token.

Everything works except this element.

Thanks for your help,

Regards,
David
www.solpak.ca
P.S. I am a noob here, but I did try to find this issue in the forum. Thanks!

Re: Content Matching: Multiple words as token?

Posted: Wed Jul 03, 2013 12:50 pm
by Mr_Noodle
Can you post an actual example? In particular, is there any context around this that you can leverage? Like something saying "Client name: " and then the client name itself?

Re: Content Matching: Multiple words as token?

Posted: Wed Jul 03, 2013 5:46 pm
by sd001632
Hi,

Here is the original invoice we scan: https://dl.dropboxusercontent.com/u/241 ... 20copy.pdf

Here is the rule:
https://dl.dropboxusercontent.com/u/241 ... 3%20PM.png

Here is the token:
https://dl.dropboxusercontent.com/u/241 ... 2%20PM.png

I am able to tell it to look after "Sold to:" in the OCR'd text, but I am not sure how to tell it how long the company name is. Is there a 'Line Break' I could put in the token?

That way, it would take everything that is after 'Sold to:' but not what is on the next line (the contact name underneath). If I specify a number of words but the company name is shorter, it takes the next words in line (the contact name underneath).
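As an illustration only, the "everything after 'Sold to:' up to the line break" idea can be sketched with a Python regular expression. This is not Hazel's token syntax; the function name `client_name` and the contact name in the sample are made up for the example, and it assumes the label and the company name end up on the same line of extracted text.

```python
import re

def client_name(text):
    """Return whatever follows 'Sold to:' on the same line, or None."""
    # [^\r\n]+ stops at the first line break, so a contact name on the
    # next line is never included, however many words the name has.
    match = re.search(r"Sold to:[ \t]*([^\r\n]+)", text)
    return match.group(1).strip() if match else None

# Hypothetical extracted text; "Jane Doe" stands in for the contact name.
sample = "Sold to: Senior Center for the Disabled\nJane Doe\n"
print(client_name(sample))
```

Because the capture is bounded by the end of the line rather than by a word count, a two-word name and a five-word name are handled the same way.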

thanks!

David

Re: Content Matching: Multiple words as token?

Posted: Fri Jul 05, 2013 3:16 pm
by Mr_Noodle
I'm not sure if this will work since "Sold to:" is on a separate line (patterns can't cross lines). Even then, I'm not sure how this would work. I'll need to take a closer look at how the text is structured internally and get back to you.

Re: Content Matching: Multiple words as token?

Posted: Fri Jul 05, 2013 3:54 pm
by Mr_Noodle
Taking a closer look, it appears this document is not OCRed. So, there's no text content that can be matched at all. I say OCR it and then re-send it.

Re: Content Matching: Multiple words as token?

Posted: Fri Jul 05, 2013 9:37 pm
by sd001632
Sorry, my initial example was a French invoice, and when I switched to an English one, I took an older invoice that was not OCRed.

Now the file is OCRed.

I need to find a way to have that client name delimited so Hazel can match it without going on to the next line.

thanks

Re: Content Matching: Multiple words as token?

Posted: Mon Jul 08, 2013 12:25 pm
by Mr_Noodle
I think in the next version, I'm going to allow spaces to also match a line separator so that you can have patterns traverse lines (at least traverse them wherever there are spaces). While useful in general, in this particular case, it may not work so well because the text is encoded in an order that you probably wouldn't expect. For instance, I get something like this:
Code:
Sold to:
Ship to:
Baycrest Center for Geriatric Care

The "Sold to:" and "Ship to:" are considered to be on the same line so you would have to take that into account when doing your patterns.
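To make the ordering problem concrete, here is a rough Python sketch (again, not Hazel syntax) that takes the extracted order shown above into account: it skips an optional "Ship to:" label that follows "Sold to:" and captures the next line of text. The function name `name_after_labels` is hypothetical, and the sketch assumes the company name is the first non-label text after the two labels.

```python
import re

# "Sold to:", an optional "Ship to:" label (same line or next line),
# then the first remaining run of text up to a line break.
PATTERN = re.compile(
    r"Sold to:\s*(?:Ship to:\s*)?(?!Ship to:)([^\r\n]+)"
)

def name_after_labels(text):
    """Return the first line of text after the labels, or None."""
    match = PATTERN.search(text)
    return match.group(1).strip() if match else None

extracted = "Sold to:\nShip to:\nBaycrest Center for Geriatric Care\n"
print(name_after_labels(extracted))
```

The negative lookahead `(?!Ship to:)` is there so that the label itself is never returned as the name when nothing follows it.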

Re: Content Matching: Multiple words as token?

Posted: Wed Jul 24, 2013 8:20 am
by sd001632
Interesting discovery: this document was OCR'd after scanning by PDFpenPro 6, while our usual workflow is to OCR upon scanning with snapscan software. It does not scan the same way, in the sense that it places text in a different order?

Hmmm, gotta look into that.

The line separator update I think will be great though.

Thanks,

Re: Content Matching: Multiple words as token?

Posted: Thu Jul 25, 2013 12:20 pm
by Mr_Noodle
The version released this morning allows line breaks in patterns. Basically, anywhere there is a space in your pattern, it will also match a line break if possible. That said, there still may be issues with the ordering of text. PDF is a free-form format where text is laid out graphically and can be specified in any order. On top of that, if you are OCRing it, the OCR software may not consider chunks of text to be together even though they look that way to you.
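The "a space also matches a line break" behaviour can be approximated outside Hazel with a small Python sketch. This is an analogy only, not how Hazel is implemented; the helper name `compile_with_flexible_spaces` is invented for the example.

```python
import re

def compile_with_flexible_spaces(literal):
    """Compile literal text so each space also matches a line break."""
    # Escape each word, then join with \s+, which matches spaces,
    # tabs, and line breaks alike.
    words = [re.escape(word) for word in literal.split()]
    return re.compile(r"\s+".join(words))

pattern = compile_with_flexible_spaces("Sold to:")
print(bool(pattern.search("Sold to: Baycrest Hospital")))   # same line
print(bool(pattern.search("Sold\nto: Baycrest Hospital")))  # split across lines
```

This only lets a pattern cross a line break where it already contains a space. It does nothing about text that the PDF or the OCR software stores in an unexpected order.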