Extract date from OCR'd PDF

Moderator: Mr_Noodle

Re: Extract date from OCR'd PDF

Postby rpedro » Sat Jul 07, 2012 6:07 am

Ok, it's working now.
I learned a little about grep and awk commands, and had to change one part of the code in this line:
x=`egrep -o 'Vencimento ..\/..\/....' "$path"/"_extract.txt" | awk '{print $2}'`
it was $3 before.

Worked like magic.

The only thing is the PDF has to be searchable, or it won't get the text from it. I got errors in the last 2 files I tried because they weren't OCR'd, despite the fact that they are from the same utility company, same format, etc. Go figure.
rpedro
 
Posts: 4
Joined: Wed Jun 27, 2012 6:17 am

Re: Extract date from OCR'd PDF

Postby JBB » Mon Jul 09, 2012 2:02 am

The "print $2" in your awk command prints the second field in the text (where fields are separated by spaces). You could have used "print $4" in your original script.
JBB
 
Posts: 12
Joined: Mon Sep 12, 2011 6:09 pm
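A small sketch of the field numbering described above. The sample lines and date values are made up for illustration; they are not taken from rpedro's actual bill.

```shell
# Hypothetical sample of what "egrep -o" might return for the Vencimento pattern.
match='Vencimento 07/07/2012'

# awk splits on whitespace by default: $1 is "Vencimento", $2 is the date.
x=$(printf '%s\n' "$match" | awk '{print $2}')
echo "$x"

# A two-word label such as "Closing date" pushes the date to the third field.
y=$(printf '%s\n' 'Closing date 07/07/2012' | awk '{print $3}')
echo "$y"

# $NF is always the last field, so it works regardless of how many words the label has.
z=$(printf '%s\n' 'Closing date 07/07/2012' | awk '{print $NF}')
echo "$z"
```

If the date is always the last thing the pattern matches, `$NF` saves counting the words in the label.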

Re: Extract date from OCR'd PDF

Postby Scooter » Tue Sep 18, 2012 12:29 am

Ugg. Shell scripting.

I've been playing with this and what I need to rename had a different date format. The good news is that date is always on line 6 of the file. Is there a way to get it to return only line six for renaming?
Scooter
 
Posts: 5
Joined: Tue May 18, 2010 12:55 pm
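One possible way to return only line six is `sed -n '6p'`. The sketch below assumes the extracted text is a plain file with one item per line; the helper name `nth_line` and the sample contents are hypothetical.

```shell
# Print only line N of a text file (hypothetical helper).
nth_line() {
    # $1 = line number, $2 = file. "sed -n" suppresses normal output,
    # and "Np" prints just line N.
    sed -n "${1}p" "$2"
}

# Demo with a throwaway file standing in for the extracted PDF text.
tmp=$(mktemp)
printf 'a\nb\nc\nd\ne\n09-17-2012\ng\n' > "$tmp"
x=$(nth_line 6 "$tmp")
echo "$x"
rm -f "$tmp"
```

In the script from this thread, the egrep/awk line could probably be swapped for something like `x=$(sed -n '6p' "$path"/"_extract.txt")`, with further trimming if the line holds more than the date.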

Re: Extract date from OCR'd PDF

Postby swkent » Wed Nov 28, 2012 7:19 pm

Having trouble getting this to work. The PDF is searchable. It seems like the path variable is not working. If I egrep in terminal and list the exact file path, it returns the date I am looking for. Any suggestions?

SK
swkent
 
Posts: 1
Joined: Wed Nov 28, 2012 5:37 pm

Re: Extract date from OCR'd PDF

Postby Mr_Noodle » Fri Nov 30, 2012 4:53 pm

Try to use absolute paths in shell scripts. It's a good practice for security and safety reasons and it'll work better in Hazel.
Mr_Noodle
Site Admin
 
Posts: 2917
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City
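A rough sketch of building an absolute path before calling egrep. It assumes the script receives the file as its first argument (`$1`), which is how Hazel's embedded shell scripts are generally described; the helper name `abs_dir` and the fallback file name are illustrative only.

```shell
# Resolve the absolute directory that contains a file (hypothetical helper).
abs_dir() {
    # cd into the file's directory inside a subshell and print its physical path,
    # so the caller's working directory is left untouched.
    (cd "$(dirname "$1")" && pwd -P)
}

# Assumption: the file to process arrives as $1; fall back to a local name for the demo.
file=${1:-./_extract.txt}
path=$(abs_dir "$file")

# Every later command can now refer to the file without depending on the current directory.
echo "$path/_extract.txt"
```

Echoing the path to a log file during testing is a quick way to see what the script is actually looking at.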

Re: Extract date from OCR'd PDF

Postby Micah643 » Mon Dec 03, 2012 12:12 am

I have embedded the run shell script and I suppose it is doing what it is supposed to do, but it still is not putting the date in the title. Am I missing something? Hazel runs and does everything I've asked. I just copied and pasted the script that was posted. Do I need to tinker with it??? I would like the file name to end up being 2012-12-2 based on what it pulls from the PDF file. Could someone please help me? Thank you so much!
Micah643
 
Posts: 3
Joined: Mon Dec 03, 2012 12:09 am

Re: Extract date from OCR'd PDF

Postby Mr_Noodle » Mon Dec 03, 2012 4:06 pm

You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.
Mr_Noodle
Site Admin
 
Posts: 2917
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF

Postby Micah643 » Tue Dec 04, 2012 8:55 pm

Mr_Noodle wrote:You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.



I have altered it. Instead of "Closing date" I changed it to "Date:", which seems to work. However, my file ends up named just "-.pdf". I don't know enough to change the script past that. Thanks.
Micah643
 
Posts: 3
Joined: Mon Dec 03, 2012 12:09 am
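A name like "-.pdf" may mean the date variables ended up empty, for example because the pattern matched nothing or the wrong awk field was printed. A minimal sketch of checking for that before renaming follows; the function name `extract_date`, the sample text, and the assumed M/D/YYYY layout after "Date:" are all hypothetical.

```shell
# Pull the date that follows "Date:" out of some text; print nothing if it is absent.
extract_date() {
    printf '%s\n' "$1" |
        grep -E -o 'Date: *[0-9]{1,2}/[0-9]{1,2}/[0-9]{4}' |
        sed 's/^Date: *//' |
        head -n 1
}

x=$(extract_date 'Invoice Date: 12/2/2012 Total due')

if [ -z "$x" ]; then
    # Nothing matched: report it instead of renaming the file to "-.pdf".
    echo "no date found; leaving the file name alone" >&2
else
    echo "$x"
fi
```

Stripping the label with `sed` avoids having to work out which awk field the date lands in.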

Re: Extract date from OCR'd PDF

Postby wpatton » Sun Feb 24, 2013 12:37 pm

I am new to Hazel and scripting. I am trying to move billing statements from a "downloaded/scanned" folder to their respective folders on my hard drive. I have no problem moving the files from one folder to another, but I would also like to rename the files with their "statement date" included in the new file name, e.g. EnergyUnited1-30-2013. I tried the above script and it works well when the date in the file is xx/xx/xxxx, but not when the date is x/xx/xxxx or xx/x/xxxx. How can I change the script to allow for variable date formats?

This is the line of the script I am using:
x=`egrep -o 'Billing Date ..\/..\/....' "$path"/"_extract.txt" | awk '{print $3}'`

Thanks
wpatton
 
Posts: 2
Joined: Sun Feb 24, 2013 11:31 am

Re: Extract date from OCR'd PDF

Postby JBB » Mon Feb 25, 2013 3:28 am

Try this:

x=`egrep -o 'Billing Date [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}' "$path"/"_extract.txt" | awk '{print $3}'`
JBB
 
Posts: 12
Joined: Mon Sep 12, 2011 6:09 pm
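A quick sketch of how that kind of pattern behaves on the variable formats. The three sample lines are invented, and `grep -E` is used here as the modern spelling of `egrep`. The final `tr` step is an addition: it swaps the slashes for dashes, since "/" cannot appear in a file name.

```shell
# {1,2} accepts one or two digits, so 1/30/2013 and 01/30/2013 both match.
pattern='Billing Date [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}'

dates=$(
    for line in 'Billing Date 01/30/2013' 'Billing Date 1/30/2013' 'Billing Date 12/5/13'; do
        # Keep only the matched text, take the third field (the date),
        # then replace "/" with "-" so it is safe in a file name.
        printf '%s\n' "$line" | grep -E -o "$pattern" | awk '{print $3}' | tr '/' '-'
    done
)
echo "$dates"
```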

Re: Extract date from OCR'd PDF

Postby wpatton » Mon Feb 25, 2013 9:31 pm

Thanks JBB! Exactly what I needed.
wpatton
 
Posts: 2
Joined: Sun Feb 24, 2013 11:31 am

Re: Extract date from OCR'd PDF

Postby HirnHorn » Tue Mar 05, 2013 2:43 pm

Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD.
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris
HirnHorn
 
Posts: 3
Joined: Tue Mar 05, 2013 2:34 pm
Location: Hamburg (GER) and London (GBR)

Re: Extract date from OCR'd PDF

Postby vanduse1 » Thu Mar 07, 2013 7:23 pm

I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify AWK to separate these into the separate variables so that I can properly name my file YYYY-MM.pdf

I played around with split but couldn't get it to work.
vanduse1
 
Posts: 4
Joined: Thu Mar 07, 2013 7:20 pm

Re: Extract date from OCR'd PDF

Postby a_freyer » Fri Mar 08, 2013 11:21 am

HirnHorn wrote:Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD.
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris


Add another awk pipe at the end:

Code:
 ... | awk -F "." '{print $3"-"$2"-"$1}'
a_freyer
 
Posts: 569
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado
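A self-contained sketch of the same idea, with one variation: `printf` inside awk zero-pads the month and day, so a date such as 5.3.2013 also comes out as YYYY-MM-DD. The function name `to_iso` and the sample dates are hypothetical.

```shell
# Convert DD.MM.YYYY to YYYY-MM-DD (hypothetical helper).
to_iso() {
    # -F '.' splits on the literal dot: $1 = day, $2 = month, $3 = year.
    # %02d pads single-digit days and months with a leading zero.
    printf '%s\n' "$1" | awk -F '.' '{printf "%04d-%02d-%02d\n", $3, $2, $1}'
}

iso=$(to_iso '05.03.2013')
echo "$iso"

short=$(to_iso '5.3.2013')
echo "$short"
```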

Re: Extract date from OCR'd PDF

Postby a_freyer » Fri Mar 08, 2013 11:44 am

vanduse1 wrote:I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify AWK to separate these into the separate variables so that I can properly name my file YYYY-MM.pdf

I played around with split but couldn't get it to work.


The easy way would be to store your date as a variable and extract the fields separately.

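Code:
BAD_DATE=$( this is the line you've already had success with)
# printf is used instead of echo because echo's trailing newline would count as one of tail's four bytes
YEAR=$(printf '%s' "$BAD_DATE" | tail -c 4)
MONTH=$(printf '%s' "$BAD_DATE" | head -c 2)
# YYYY-MM, as requested
GOOD_DATE="$YEAR-$MONTH"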
a_freyer
 
Posts: 569
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado
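Since the question asked about awk specifically, here is an awk-only sketch using `substr`, which takes a start position (counting from 1) and a length. It assumes the extracted string is exactly eight digits in MMDDYYYY order; the sample value is made up.

```shell
# Hypothetical extracted value in MMDDYYYY form.
bad_date='03072013'

# substr($0, 5, 4) is the year (characters 5-8); substr($0, 1, 2) is the month.
good_date=$(printf '%s\n' "$bad_date" | awk '{print substr($0, 5, 4) "-" substr($0, 1, 2)}')
echo "$good_date"
```

`split` is awkward here because there is no separator character to split on, which is why fixed positions are used instead.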

Copyright © 2006-2012 Noodlesoft, LLC, All Rights Reserved. Mac and the Mac logo are trademarks of Apple Computer, Inc., registered in the U.S. and other countries.