
Extract text from OCR-ed PDF, use it for filename

Posted: Mon Jun 08, 2015 4:13 pm
by SamPieter
Hi all,
I scanned and OCR-ed a utility bill and want to use part of its text as the filename of the document.

The text I want to use is the bill's period, for example: "2015 March"
Here's an excerpt of the OCR-ed PDF containing the text string I need:
Code:
...
Resibu Aqualectra
11414527 2015 March
673.66 673.66
...

The string I'm looking for comes directly after "11414527", which always appears on my bill because it is my account number.

How can I extract "2015 March" from the text?
I need to look for "11414527" and then select the text immediately after that.
All help is greatly appreciated!

Re: Extract text from OCR-ed PDF, use it for filename

Posted: Mon Jun 08, 2015 7:41 pm
by Mr_Noodle
If there is only one date in the file, you can use a date attribute and just match that date. If there are several dates and their positions vary, then using any surrounding context, like the account number, will help.

Re: Extract text from OCR-ed PDF, use it for filename

Posted: Tue Jun 09, 2015 11:52 am
by SamPieter
Quote:
...then using any surrounding context...

Yes, that's exactly what I'd like to do, but how?

Re: Extract text from OCR-ed PDF, use it for filename

Posted: Tue Jun 09, 2015 6:34 pm
by Mr_Noodle
Enter the text into the pattern, like "11414527 (• your date attribute)". With a pattern like that, the date would have to appear after that account number.
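
For anyone who wants to see the same idea outside Hazel, here is a minimal sketch in Python using a regular expression. This is not Hazel's pattern syntax, just an illustration of anchoring on the account number and capturing what follows. The sample text is copied from the first post; the names OCR_TEXT, ACCOUNT_NUMBER and extract_period are made up for this example, and it assumes the period is always a four-digit year followed by an English month name.

```python
import re

# Sample OCR text, copied from the first post in this thread.
OCR_TEXT = """Resibu Aqualectra
11414527 2015 March
673.66 673.66
"""

# The account number that always precedes the billing period.
ACCOUNT_NUMBER = "11414527"

# Assumption: the month is spelled out in English.
MONTHS = (
    "January|February|March|April|May|June|"
    "July|August|September|October|November|December"
)


def extract_period(text, account=ACCOUNT_NUMBER):
    """Return 'YYYY Month' found right after the account number, or None."""
    pattern = re.compile(
        r"\b" + re.escape(account) + r"\s+(\d{4})\s+(" + MONTHS + r")\b"
    )
    match = pattern.search(text)
    if match is None:
        return None
    # Rebuild the period from the two captured groups: year and month.
    return f"{match.group(1)} {match.group(2)}"


print(extract_period(OCR_TEXT))  # prints: 2015 March
```

Because the account number is part of the pattern, any other dates elsewhere in the text are ignored; only a date that directly follows the account number is captured.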

Re: Extract text from OCR-ed PDF, use it for filename

Posted: Wed Jun 10, 2015 2:52 pm
by SamPieter
Quote:
Enter the text into the pattern, like "11414527 (• your date attribute)".

Where in Hazel should I do this?

Re: Extract text from OCR-ed PDF, use it for filename

Posted: Thu Jun 11, 2015 1:05 pm
by Mr_Noodle
Since you are looking for it in the contents, you probably want to have a condition like "Contents contain match". Search the help for "match patterns" for more info on using patterns.