Extract date from OCR'd PDF


Moderator: Mr_Noodle

Extract date from OCR'd PDF Sat May 26, 2012 2:41 pm • by Bill0722
I'm a newbie here, and I'm sure this has been addressed before, but I can't find an answer. I'd like to be able to file copies of various old financial documents. I'd like to extract the date from these documents and use it in the filename. Any ideas, pointers, already created rules, etc. would be appreciated.
Bill0722
 
Posts: 5
Joined: Sat May 26, 2012 2:34 pm

Re: Extract date from OCR'd PDF Tue May 29, 2012 2:55 pm • by Mr_Noodle
Is the date in the contents of the document itself? If so, then you need to have the OCR software somehow export that. Many programs will allow you to specify keywords which Hazel can then use.
Mr_Noodle
Site Admin
 
Posts: 11240
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Tue May 29, 2012 3:58 pm • by Bill0722
Yes, the date is in the contents of the file. For instance, a copy of a payroll check, where there is language to the effect of Pay Date: 01/02/2012. I know Hazel can find the document if I search specifically for 01/02/2012, and it does. But what I want is a script or some other solution that finds the Pay Date: label, parses the date immediately following it, and puts it in a variable I can use to rename the file, e.g. 2012-01-02 - pay stub. There are several types of documents I'd like to do this date extraction with, but maybe that just can't be done.
Bill0722
 
Posts: 5
Joined: Sat May 26, 2012 2:34 pm
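
As a rough illustration of the parsing step described above, here is a minimal sketch. It assumes the PDF text has already been extracted to plain text, and that the label and a mm/dd/yyyy date sit on the same line. The function name `pay_date_iso` and the sample text are made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: read text on stdin and print the first
# "Pay Date: mm/dd/yyyy" it finds as yyyy-mm-dd.
pay_date_iso() {
  grep -Eo 'Pay Date: *[0-9]{2}/[0-9]{2}/[0-9]{4}' |
    head -n 1 |
    sed -E 's#.*([0-9]{2})/([0-9]{2})/([0-9]{4})#\3-\1-\2#'
}

# Made-up sample text standing in for the extracted PDF contents.
stamp=$(printf 'Employee: J. Doe\nPay Date: 01/02/2012\n' | pay_date_iso)
echo "$stamp - pay stub.pdf"   # prints "2012-01-02 - pay stub.pdf"
```

If the label is missing, the function prints nothing, so a caller can test for an empty result before renaming anything.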

Re: Extract date from OCR'd PDF Wed May 30, 2012 1:12 am • by JBB
Here's a shell script that uses some Unix commands to extract and rename the file based on its contents. You can modify it for your needs. This particular script expects the date in the format: "Closing Date mm/dd/yy".

The key is mdimport, which creates a text file of the text in the PDF. This would be part of a Hazel rule, of course.

Code:
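# "$1" is the file Hazel passes to the script.
path=`dirname "$1"`
# Dump the text Spotlight extracts from the PDF into a temporary file.
/usr/bin/mdimport -d2 "$1" >& "$path"/"_extract.txt"
# Keep only the date from the first "Closing Date mm/dd/yy" match.
x=`egrep -o 'Closing Date ..\/..\/..' "$path"/"_extract.txt" | head -n 1 | awk '{print $3}'`
rm -rf "$path"/"_extract.txt"

# Split the date into month, day, and a four-digit year.
m=`echo "$x" | awk -F'/' '{print $1}'`
d=`echo "$x" | awk -F'/' '{print $2}'`
y=`echo "$x" | awk -F'/' '{print $3}' | awk '{print "20"$1}'`

# Set the modification date, then rename the file to mm-dd.pdf.
touch -t "$y$m$d"0000 "$1"
mv "$1" "$path"/"$m-$d.pdf"
exit 0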
JBB
 
Posts: 21
Joined: Mon Sep 12, 2011 6:09 pm
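
The script above renames the file even when no date is found. One possible guard is sketched below, assuming the same mm/dd/yy format. The helper name `date_to_name` is hypothetical and is not part of the posted script.

```shell
#!/bin/sh
# Hypothetical helper: turn "mm/dd/yy" into "20yy-mm-dd".
# Returns non-zero for anything that does not look like mm/dd/yy,
# so an empty match never becomes a file name like "-.pdf".
date_to_name() {
  case "$1" in
    [0-9][0-9]/[0-9][0-9]/[0-9][0-9]) ;;
    *) return 1 ;;
  esac
  m=${1%%/*}      # text before the first slash
  rest=${1#*/}    # text after the first slash
  d=${rest%%/*}
  y=${rest#*/}
  echo "20$y-$m-$d"
}

name=$(date_to_name "05/31/12") && echo "$name.pdf"   # prints "2012-05-31.pdf"
date_to_name "" || echo "no date found, leaving the file alone"
```

In a rule script, the `mv` would run only when the helper succeeds.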

Re: Extract date from OCR'd PDF Wed May 30, 2012 9:05 am • by Bill0722
Thanks, that looks perfect. While I understand some basic scripting, this is a bit over my head. Is there any way you could illustrate how this could be used with a basic rename-and-move Hazel rule, if that's possible? Thanks again.
Bill0722
 
Posts: 5
Joined: Sat May 26, 2012 2:34 pm

Re: Extract date from OCR'd PDF Wed May 30, 2012 11:49 am • by Bill0722
Outstanding. I was able to paste in the script and modify it for my needs.

Thanks so much
Bill0722
 
Posts: 5
Joined: Sat May 26, 2012 2:34 pm

Re: Extract date from OCR'd PDF Fri Jun 01, 2012 2:08 am • by JBB
Might make a useful future update (extracting text patterns from a PDF without using direct scripting).
JBB
 
Posts: 21
Joined: Mon Sep 12, 2011 6:09 pm

Re: Extract date from OCR'd PDF Fri Jun 01, 2012 2:54 pm • by Mr_Noodle
Doing pattern matching against text would be very resource-intensive. I'm only able to do a basic text "contains" search because Spotlight indexes the stuff ahead of time. Pattern matching would require doing it on the fly and having to do that for every file would make things very slow. You'd have to be very careful to structure your rule conditions to minimize this.
Mr_Noodle
Site Admin
 
Posts: 11240
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Sat Jun 02, 2012 1:35 am • by JBB
I think I was suggesting an action rather than a condition.
JBB
 
Posts: 21
Joined: Mon Sep 12, 2011 6:09 pm

Re: Extract date from OCR'd PDF Mon Jun 04, 2012 4:35 pm • by Mr_Noodle
Ah, yes, that would have less of an impact, but I'd have to consider how general purpose it is. It could still be quite resource intensive, and I'd have to deal with the case where the pattern doesn't match.
Mr_Noodle
Site Admin
 
Posts: 11240
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Wed Jun 27, 2012 6:40 am • by rpedro
Hi there, this is my first post and I'm new to Hazel. I'm doing OK with the rules, Hazel is awesome, and the functionality described in this thread is exactly what I need.
I've never worked with scripts before, so I'm a little lost here.

I pasted the script into Hazel's "Run shell script" action and modified two things to try to match my needs, as seen below (bold). (Note: Data de Vencimento means Closing date in Portuguese):

path=`dirname "$1"`
/usr/bin/mdimport -d2 "$1" >& "$path"/"_extract.txt"
x=`egrep -o 'Data de Vencimento ..\/..\/....' "$path"/"_extract.txt" | awk '{print $3}'`
rm -rf "$path"/"_extract.txt"

d=`echo $x | awk -F'/' '{print $1}'`
m=`echo $x | awk -F'/' '{print $2}'`
y=`echo $x | awk -F'/' '{print $3}' | awk '{print "20"$1}'`

touch -t "$y$m$d"0000 "$1"
mv "$1" "$path"/"$m-$d.pdf"
exit 0

I modified the date format to use year with 4 digits and changed day - month order to match my date format.
The result was a filename with a dash and the extension, like this: "-.pdf"

In the file I'm trying to extract the date from, the label "Data de Vencimento" appears above the date (see the attached image).

How can I make it work?

Thanks and appreciate your help!!!
rpedro
 
Posts: 29
Joined: Wed Jun 27, 2012 6:17 am

Re: Extract date from OCR'd PDF Thu Jun 28, 2012 3:25 pm • by Mr_Noodle
I suggest trying some of the individual parts of the script on the commandline. I'm guessing the egrep is failing.
Mr_Noodle
Site Admin
 
Posts: 11240
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City
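
One way to try the individual parts on the command line is to feed a sample line through the same commands. The sketch below assumes the label and date end up on one line separated by a single space, which may not hold for a real document. It shows that with a three-word label, the third awk field is part of the label rather than the date.

```shell
#!/bin/sh
# Made-up sample line; real mdimport output may be laid out differently.
line='Data de Vencimento 05/01/2012'

match=$(echo "$line" | grep -Eo 'Data de Vencimento ../../....')
third=$(echo "$match" | awk '{print $3}')    # third space-separated field
last=$(echo "$match" | awk '{print $NF}')    # last field, whatever the label length

echo "third=$third"   # prints "third=Vencimento"
echo "last=$last"     # prints "last=05/01/2012"
```

With a four-digit year, the "20" prefix added in the original script would also need to be dropped.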

Re: Extract date from OCR'd PDF Fri Jun 29, 2012 5:32 am • by rpedro
Mr_Noodle wrote: I suggest trying some of the individual parts of the script on the commandline. I'm guessing the egrep is failing.

Thanks, Mr_Noodle, but I couldn't make it work. I tried a different document with the text on the same line, like "Date: 05/06/2012", but got the same -.pdf filename.
Plus, I screwed up in the rules and ended up losing about 10 files. I was too generic in the "File name contains" condition, and I think it overwrote all the files; I ended up with just one because they had the same name in the beginning.

I really suck at scripts, and I don't have a clue what I'm doing here. Do I have to replace some of the code, like dirname with a real directory name like Documents, or $path with a real path?
I'm really lost. :(
rpedro
 
Posts: 29
Joined: Wed Jun 27, 2012 6:17 am

Re: Extract date from OCR'd PDF Fri Jun 29, 2012 2:58 pm • by Mr_Noodle
I say just try the egrep command by itself. I'm guessing that it's not returning anything. It may be the case that the text in your PDF is not searchable (i.e. it's a big bitmap).
Mr_Noodle
Site Admin
 
Posts: 11240
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Sat Jun 30, 2012 5:02 pm • by rpedro
Ok, I tested the file and the PDF is searchable. I was able to copy text from it and paste in Text Editor.
I isolated the egrep part leaving just the first 3 lines of the code and ran the script. It successfully returned a file _extract.txt with all the text of the PDF in it, including Data de Vencimento 05/01/2012.

Is this correct? Is the script supposed to return all text or isolate just the quoted part ('Data de Vencimento ..\/..\/....') of the script?
rpedro
 
Posts: 29
Joined: Wed Jun 27, 2012 6:17 am
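
To illustrate the difference asked about above: the extract file holds all of the text, while `grep -o` (or `egrep -o`) prints only the part of a line that matches the pattern. This is a minimal sketch that uses a made-up three-line file in place of the real `_extract.txt`.

```shell
#!/bin/sh
# Made-up stand-in for _extract.txt; the real file comes from mdimport.
tmp=$(mktemp)
printf 'Banco Exemplo\nData de Vencimento 05/01/2012\nValor 100,00\n' > "$tmp"

all_lines=$(wc -l < "$tmp" | tr -d ' ')
match=$(grep -Eo 'Data de Vencimento [0-9]{2}/[0-9]{2}/[0-9]{4}' "$tmp")
rm -f "$tmp"

echo "lines in extract: $all_lines"   # prints "lines in extract: 3"
echo "match: $match"                  # prints "match: Data de Vencimento 05/01/2012"
```

If the label and the date are on separate lines in the extracted text, this pattern matches nothing and `match` stays empty.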
