Extract date from OCR'd PDF


Moderator: Mr_Noodle

Re: Extract date from OCR'd PDF Sat Jul 07, 2012 6:07 am • by rpedro
Ok, it's working now.
I learned a little about grep and awk commands, and had to change one part of the code in this line:
x=`egrep -o 'Vencimento ..\/..\/....' "$path"/"_extract.txt" | awk '{print $2}'`
it was $3 before.

Worked like magic.

The only thing is the PDF has to be searchable, or it won't get the text from it. I got errors in the last 2 files I tried because they weren't OCR'd, despite the fact that they are from the same utility company, same format, etc. Go figure.
rpedro
 
Posts: 29
Joined: Wed Jun 27, 2012 6:17 am
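
A note on rpedro's last point: when a PDF has no text layer, the egrep pipeline finds nothing and x ends up empty. A minimal sketch of a guard for that case follows. The function name safe_name and the ".pdf" naming are assumptions for illustration, not part of the script discussed in this thread.

```shell
# safe_name DATE: print "DATE.pdf", or fail when DATE is empty
# (hypothetical helper; an empty DATE is what you get when the PDF is not searchable).
safe_name() {
  if [ -z "$1" ]; then
    echo "no date extracted; leaving the file name alone" >&2
    return 1
  fi
  printf '%s.pdf\n' "$1"
}

safe_name '2012-07-07'
safe_name '' 2>/dev/null || echo "skipped"
```

Testing the variable before renaming means the two non-OCR'd files would be skipped rather than renamed to something meaningless.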

Re: Extract date from OCR'd PDF Mon Jul 09, 2012 2:02 am • by JBB
The "print $2" in your awk command prints the second field in the text (where fields are separated by spaces). You could have used "print $4" in your original script.
JBB
 
Posts: 22
Joined: Mon Sep 12, 2011 6:09 pm
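
JBB's point about awk fields is easy to check in Terminal. The sketch below uses made-up sample lines and a hypothetical helper, extract_field, just to show that the field number depends on how many words the label has.

```shell
# extract_field LINE N: print the Nth whitespace-separated field of LINE
# (hypothetical helper for illustration only).
extract_field() {
  printf '%s\n' "$1" | awk -v n="$2" '{print $n}'
}

# One-word label: the date is field 2.
extract_field 'Vencimento 07/07/2012' 2

# Two-word label: the date moves to field 3.
extract_field 'Billing Date 1/30/2013' 3
```

So if you change the label text in the egrep pattern, count its words and adjust the number after "print $" to match.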

Re: Extract date from OCR'd PDF Tue Sep 18, 2012 12:29 am • by Scooter
Ugg. Shell scripting.

I've been playing with this and what I need to rename had a different date format. The good news is that date is always on line 6 of the file. Is there a way to get it to return only line six for renaming?
Scooter
 
Posts: 18
Joined: Tue May 18, 2010 12:55 pm
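
Scooter's question did not get a reply in the thread. One common way to pull a single line out of a text file is sed, sketched below. The helper name print_line and the demo file are assumptions; in the script from this thread the input would presumably be "$path"/"_extract.txt".

```shell
# print_line FILE N: print only line N of FILE, then stop reading
# (hypothetical helper).
print_line() {
  sed -n "${2}{p;q;}" "$1"
}

# Demo on a throwaway file with a made-up date on line 6.
demo=$(mktemp)
printf '%s\n' one two three four five '11/28/2012' seven > "$demo"
x=$(print_line "$demo" 6)
echo "$x"
rm -f "$demo"
```

If line 6 holds more than just the date, the result can still be piped through egrep -o or awk as in the earlier posts.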

Re: Extract date from OCR'd PDF Wed Nov 28, 2012 7:19 pm • by swkent
Having trouble getting this to work. The PDF is searchable. It seems like the path variable is not working. If I egrep in terminal and list the exact file path, it returns the date I am looking for. Any suggestions?

SK
swkent
 
Posts: 3
Joined: Wed Nov 28, 2012 5:37 pm

Re: Extract date from OCR'd PDF Fri Nov 30, 2012 4:53 pm • by Mr_Noodle
Try to use absolute paths in shell scripts. It's a good practice for security and safety reasons and it'll work better in Hazel.
Mr_Noodle
Site Admin
 
Posts: 11866
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Mon Dec 03, 2012 12:12 am • by Micah643
I have embedded the run shell script and I suppose it is doing what it is supposed to do, but it still is not putting the date in the title. Am I missing something? Hazel runs and does everything I've asked. I just copied and pasted the script that was posted. Do I need to tinker with it? I would like the file name to end up being 2012-12-2, based on what it pulls from the PDF file. Could someone please help me? Thank you so much!
Micah643
 
Posts: 3
Joined: Mon Dec 03, 2012 12:09 am

Re: Extract date from OCR'd PDF Mon Dec 03, 2012 4:06 pm • by Mr_Noodle
You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.
Mr_Noodle
Site Admin
 
Posts: 11866
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Tue Dec 04, 2012 8:55 pm • by Micah643
Mr_Noodle wrote:You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.



I have altered it. Instead of "Closing date" I changed it to "Date:", which seems to work. However, my file ends up just being named "-.pdf". I don't know enough to change the script past that. Thanks.
Micah643
 
Posts: 3
Joined: Mon Dec 03, 2012 12:09 am

Re: Extract date from OCR'd PDF Sun Feb 24, 2013 12:37 pm • by wpatton
I am new to Hazel and scripting. I am trying to move billing statements from a "downloaded/scanned" folder to their respective folders on my hard drive. I have no problem moving the files from one folder to another, but I would also like to rename the files with their "statement date" included in the new file name, e.g. EnergyUnited1-30-2013. I tried the above script and it works well when the date in the file is xx/xx/xxxx, but not when the date is x/xx/xxxx or xx/x/xxxx. How can I change the script to allow for variable date formats?

This is the line of the script I am using:
x=`egrep -o 'Billing Date ..\/..\/....' "$path"/"_extract.txt" | awk '{print $3}'`

Thanks
wpatton
 
Posts: 2
Joined: Sun Feb 24, 2013 11:31 am

Re: Extract date from OCR'd PDF Mon Feb 25, 2013 3:28 am • by JBB
Try this:

x=`egrep -o 'Billing Date [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}' "$path"/"_extract.txt" | awk '{print $3}'`
JBB
 
Posts: 22
Joined: Mon Sep 12, 2011 6:09 pm
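
To see why JBB's pattern handles the shorter dates, here is a small sketch you can run in Terminal on made-up input. It uses grep -E, which is the same thing as egrep, and the function name extract_billing_date is just for illustration.

```shell
# Read text on stdin, print the date that follows "Billing Date".
# {1,2} accepts one or two digits for day and month; {2,4} accepts
# a two- to four-digit year.
extract_billing_date() {
  grep -E -o 'Billing Date [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}' | awk '{print $3}'
}

printf '%s\n' 'Account 12345' 'Billing Date 1/30/2013' | extract_billing_date
printf '%s\n' 'Billing Date 12/5/13 Amount due' | extract_billing_date
```

The dots in the original pattern each demanded exactly one character, which is why only xx/xx/xxxx matched.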

Re: Extract date from OCR'd PDF Mon Feb 25, 2013 9:31 pm • by wpatton
Thanks JBB! Exactly what I needed.
wpatton
 
Posts: 2
Joined: Sun Feb 24, 2013 11:31 am

Re: Extract date from OCR'd PDF Tue Mar 05, 2013 2:43 pm • by HirnHorn
Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD?
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris
HirnHorn
 
Posts: 3
Joined: Tue Mar 05, 2013 2:34 pm
Location: Hamburg (GER) and London (GBR)

Re: Extract date from OCR'd PDF Thu Mar 07, 2013 7:23 pm • by vanduse1
I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify the awk command to split this into separate variables so that I can name my file YYYY-MM.pdf?

I played around with split but couldn't get it to work.
vanduse1
 
Posts: 4
Joined: Thu Mar 07, 2013 7:20 pm

Re: Extract date from OCR'd PDF Fri Mar 08, 2013 11:21 am • by a_freyer
HirnHorn wrote:Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD.
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris


Add another awk pipe at the end:

Code:
 ... | awk -F "." '{print $3"-"$2"-"$1}'
a_freyer
 
Posts: 631
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado
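
For anyone who wants to try a_freyer's awk step on its own before wiring it into Hazel, here is a sketch with a made-up date. The function name to_iso is an assumption, and the printf variant (which pads single-digit days and months) is an addition, not part of the original suggestion.

```shell
# to_iso DATE: turn DD.MM.YYYY into YYYY-MM-DD.
# -F "." splits on the literal dot; printf pads day and month to two digits,
# so 5.3.2013 and 05.03.2013 give the same result.
to_iso() {
  printf '%s\n' "$1" | awk -F "." '{printf "%04d-%02d-%02d\n", $3, $2, $1}'
}

to_iso '05.03.2013'
to_iso '5.3.2013'
```

Both calls should print 2013-03-05.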

Re: Extract date from OCR'd PDF Fri Mar 08, 2013 11:44 am • by a_freyer
vanduse1 wrote:I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify the awk command to split this into separate variables so that I can name my file YYYY-MM.pdf?

I played around with split but couldn't get it to work.


The easy way would be to store your date as a variable and extract the fields separately.

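Code:
BAD_DATE=$( this is the line you've already had success with)
# printf '%s' instead of echo: echo appends a newline, which tail -c 4 would count as one of its 4 bytes
YEAR=$(printf '%s' "$BAD_DATE" | tail -c 4)
MONTH=$(printf '%s' "$BAD_DATE" | head -c 2)
# Year first, to get the requested YYYY-MM
GOOD_DATE="$YEAR-$MONTH"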
a_freyer
 
Posts: 631
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado
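
Since vanduse1 asked about awk specifically, the same split can also be sketched with awk's substr function. This is an alternative to the head/tail approach, not something posted in the thread, and the function name ym_from_mmddyyyy is made up for the example.

```shell
# ym_from_mmddyyyy DATE: turn MMDDYYYY into YYYY-MM.
# substr(s, start, length) counts characters from 1, so the year is
# characters 5-8 and the month is characters 1-2.
ym_from_mmddyyyy() {
  printf '%s\n' "$1" | awk '{print substr($0, 5, 4) "-" substr($0, 1, 2)}'
}

ym_from_mmddyyyy '03072013'
```

For the sample above this should print 2013-03. It only works if the string really is eight digits with no spaces, so check the extracted text first.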
