Extract date from OCR'd PDF

I'm a newbie here, and I'm sure this has been addressed before, but I can't find an answer. I'd like to be able to file copies of various old financial documents. I'd like to extract the date from these documents and use it in the filename. Any ideas, pointers, already created rules, etc. would be appreciated.

Is the date in the contents of the document itself? If so, then you need to have the OCR software somehow export that. Many programs will allow you to specify keywords which Hazel can then use.

Yes, the date is in the contents of the file. For instance, a copy of a payroll check, where there is language to the effect of Pay Date: 01/02/2012. I know Hazel can find the document if I search specifically for 01/02/2012, and it does. But what I want is a script or some other solution that finds "Pay Date:", parses the date immediately following it, and puts it in a variable I can use to rename the file, e.g. 2012-01-02 - pay stub. There are several types of documents I'd like to do this date extraction with, but maybe that just can't be done.

Here's a shell script that uses some Unix commands to extract and rename the file based on its contents. You can modify it for your needs. This particular script expects the date in the format: "Closing Date mm/dd/yy".

The key is mdimport, which creates a text file of the text in the PDF. This would be part of a Hazel rule, of course.

Code:
path=`dirname "$1"`
/usr/bin/mdimport -d2 "$1" >& "$path"/"_extract.txt"
x=`egrep -o 'Closing Date ..\/..\/..' "$path"/"_extract.txt" | awk '{print $3}'`
rm -rf "$path"/"_extract.txt"
m=`echo $x | awk -F'/' '{print $1}'`
d=`echo $x | awk -F'/' '{print $2}'`
y=`echo $x | awk -F'/' '{print $3}' | awk '{print "20"$1}'`
touch -t "$y$m$d"0000 "$1"
mv "$1" "$path"/"$m-$d.pdf"
exit 0
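For the "Pay Date: 01/02/2012" case asked about earlier, the parsing step could look roughly like the sketch below. It is not part of the script above: the function name extract_iso_date is made up for illustration, it assumes a four-digit year in mm/dd/yyyy order, and it reads plain text on standard input rather than calling mdimport, so the sample line stands in for the extracted PDF text.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print the first
# "Pay Date: mm/dd/yyyy" it finds as yyyy-mm-dd.
# Prints nothing if the pattern is not found.
extract_iso_date() {
    grep -Eo 'Pay Date: [0-9]{2}/[0-9]{2}/[0-9]{4}' |
        head -n 1 |
        awk '{print $NF}' |
        awk -F'/' '{print $3 "-" $1 "-" $2}'
}

# Example with a sample line instead of real extracted text.
iso=$(printf 'Employee: Jane\nPay Date: 01/02/2012\n' | extract_iso_date)
echo "$iso - pay stub.pdf"
```

Using $NF (the last field) instead of a fixed field number means the awk step keeps working if the label has more or fewer words.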

Thanks, that looks perfect. While I understand some basic scripting, this is a bit over my head. Is there any way you could illustrate how this could be used with a basic rename-and-move Hazel rule, if that's possible? Thanks again.

Outstanding. I was able to paste in the script and modify it for my needs.

Thanks so much

Might make a useful future update (extracting text patterns from a PDF without using direct scripting).

Doing pattern matching against text would be very resource-intensive. I'm only able to do a basic text "contains" search because Spotlight indexes the stuff ahead of time. Pattern matching would require doing it on the fly and having to do that for every file would make things very slow. You'd have to be very careful to structure your rule conditions to minimize this.

I think I was suggesting an action rather than a condition.

Ah, yes, that would have less of an impact, but I'd have to consider how general-purpose it is. It could still be quite resource-intensive, and I'd have to deal with the case where the pattern doesn't match.

Hi there, first post and new to Hazel. I'm doing OK with the rules, Hazel is awesome, and the functionality described in this thread is exactly what I need.
I've never worked with scripts before, so I'm a little lost here.

I pasted the script into the Hazel action "Run shell script" and modified two things to try to match my needs, as seen below (bold). (Note: "Data de Vencimento" means "Closing Date" in Portuguese.)

path=`dirname "$1"`
/usr/bin/mdimport -d2 "$1" >& "$path"/"_extract.txt"
x=`egrep -o 'Data de Vencimento ..\/..\/....' "$path"/"_extract.txt" | awk '{print $3}'`
rm -rf "$path"/"_extract.txt"

d=`echo $x | awk -F'/' '{print $1}'`
m=`echo $x | awk -F'/' '{print $2}'`
y=`echo $x | awk -F'/' '{print $3}' | awk '{print "20"$1}'`

touch -t "$y$m$d"0000 "$1"
mv "$1" "$path"/"$m-$d.pdf"
exit 0

I modified the date format to use year with 4 digits and changed day - month order to match my date format.
The result was a filename with a dash and the extension, like this: "-.pdf"

The file I'm trying to extract the date from has the label "Data de Vencimento" above the date, see image attached.

How can I make it work?

Thanks and appreciate your help!!!

I suggest trying some of the individual parts of the script on the commandline. I'm guessing the egrep is failing.

Mr_Noodle wrote: I suggest trying some of the individual parts of the script on the commandline. I'm guessing the egrep is failing.

Thanks Mr_Noodle, but I couldn't make it work. I tried a different document with the text on the same line, like "Date: 05/06/2012", but got the same -.pdf filename.
Plus, I screwed up the rules and ended up losing about 10 files. I was too generic in the "File name contains" condition, and I think each file overwrote the previous one, so I ended up with just 1 because they all had the same name in the beginning.
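One way to guard against this kind of loss is to have the script refuse to rename when no date was extracted or when the target name is already taken. The following is only a sketch of that idea, not something from the original script: safe_rename and its three arguments are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: move src to dest_dir/new_name.pdf, but only if
# a name was actually extracted and nothing with that name exists yet.
# Returns non-zero (and leaves src untouched) otherwise.
safe_rename() {
    src=$1
    dest_dir=$2
    new_name=$3

    # An empty date yields an empty name or a bare "-"; treat both as failure.
    if [ -z "$new_name" ] || [ "$new_name" = "-" ]; then
        echo "no date extracted, leaving $src alone" >&2
        return 1
    fi

    target="$dest_dir/$new_name.pdf"
    if [ -e "$target" ]; then
        echo "$target already exists, leaving $src alone" >&2
        return 1
    fi

    mv "$src" "$target"
}

# Possible use at the end of the script, in place of the plain mv:
#   safe_rename "$1" "$path" "$m-$d" || exit 1
```

Testing a rule like this on copies of the files in a scratch folder first is a cheap extra precaution.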

I really suck at scripts and don't have a clue what I'm doing here. Do I have to replace some of the code, like dirname with a real directory name such as Documents, or $path with a real path?
I'm really lost.

I say just try the egrep command by itself. I'm guessing that it's not returning anything. It may be the case that the text in your PDF is not searchable (i.e. it's a big bitmap).
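Trying the pieces one at a time might look like the sketch below. The sample line is an assumption standing in for the contents of _extract.txt, and grep -E is used as the equivalent of egrep. Each stage prints its result in brackets so an empty value is easy to spot.

```shell
#!/bin/sh
# Sample line standing in for the extracted PDF text (an assumption).
sample='Data de Vencimento 05/01/2012'

# Stage 1: does the pattern match at all? Empty brackets mean no match.
match=$(printf '%s\n' "$sample" | grep -Eo 'Data de Vencimento ../../....')
echo "match: [$match]"

# Stage 2: which field holds the date? This label has three words, so
# the date is field 4, not field 3. $NF always picks the last field.
field3=$(printf '%s\n' "$match" | awk '{print $3}')
last=$(printf '%s\n' "$match" | awk '{print $NF}')
echo "field 3: [$field3]"
echo "last field: [$last]"

# Stage 3: split dd/mm/yyyy. The year already has four digits here,
# so no "20" prefix is added.
d=$(printf '%s\n' "$last" | awk -F'/' '{print $1}')
m=$(printf '%s\n' "$last" | awk -F'/' '{print $2}')
y=$(printf '%s\n' "$last" | awk -F'/' '{print $3}')
echo "date: [$y-$m-$d]"
```

If stage 1 already prints empty brackets when run against the real extracted text, the problem is in the pattern or the text itself (for example a line break between the label and the date) rather than in the later awk steps.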

Ok, I tested the file and the PDF is searchable. I was able to copy text from it and paste in Text Editor.
I isolated the egrep part leaving just the first 3 lines of the code and ran the script. It successfully returned a file _extract.txt with all the text of the PDF in it, including Data de Vencimento 05/01/2012.

Is this correct? Is the script supposed to return all text or isolate just the quoted part ('Data de Vencimento ..\/..\/....') of the script?