Extract date from OCR'd PDF

Ok, it's working now.
I learned a little about grep and awk commands, and had to change one part of the code in this line:
x=`egrep -o 'Vencimento ..\/..\/....' "$path"/"_extract.txt" | awk '{print $2}'`
it was $3 before.

Worked like magic.

The only thing is the PDF has to be searchable, or it won't get the text from it. I got errors in the last 2 files I tried because they weren't OCR'd, despite the fact that they are from the same utility company, same format, etc. Go figure.

The "print $2" in your awk command prints the second field in the text (where fields are separated by spaces). You could have used "print $4" in your original script.
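To make the field numbering concrete, here is a small sketch. The sample strings are made up for illustration (they are not from anyone's actual bill), and `$NF` is an extra suggestion that does not appear in the original script.

```shell
# Hypothetical samples of what egrep -o would hand to awk.
match='Vencimento 15/03/2013'
match2='Closing date 15/03/2013'

# One-word label: field 1 is the label, field 2 is the date.
x=$(printf '%s\n' "$match" | awk '{print $2}')
echo "$x"

# Two-word label: the date moves to field 3.
y=$(printf '%s\n' "$match2" | awk '{print $3}')
echo "$y"

# $NF means "last field", so it picks the date regardless of label length.
z=$(printf '%s\n' "$match2" | awk '{print $NF}')
echo "$z"
```

If the date is always the last thing matched, `$NF` saves you from recounting fields every time the label changes.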

Ugg. Shell scripting.

I've been playing with this and what I need to rename had a different date format. The good news is that date is always on line 6 of the file. Is there a way to get it to return only line six for renaming?
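One possible way to return only line six, sketched with `sed`. The temporary file and its contents are stand-ins for the extracted text file; in a real rule you would point `sed` at your own extract instead.

```shell
# Hypothetical stand-in for the extracted text file.
tmp=$(mktemp)
printf 'line %s\n' 1 2 3 4 5 > "$tmp"
printf '01-30-2013\n' >> "$tmp"
printf 'line 7\n' >> "$tmp"

# -n suppresses the default output; 6p prints line 6 only.
x=$(sed -n '6p' "$tmp")
echo "$x"

rm -f "$tmp"
```

`awk 'NR==6'` does the same job if you would rather stay with awk. Either way you may still need a second step to cut the date out of that line if it contains other text.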

Having trouble getting this to work. The PDF is searchable. It seems like the path variable is not working. If I egrep in terminal and list the exact file path, it returns the date I am looking for. Any suggestions?

SK

Try to use absolute paths in shell scripts. It's a good practice for security and safety reasons and it'll work better in Hazel.
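A minimal sketch of what that can look like. It assumes the script receives the full path of the matched file as its first argument (`$1`); check that against your own setup. The `set --` line and the demo path only simulate that argument so the snippet runs on its own.

```shell
# Assumption: $1 holds the full path of the file being processed.
# Simulated here with a made-up path so the sketch is self-contained.
set -- "/tmp/hazel demo/bill.pdf"

file="$1"
dir=$(dirname "$file")        # absolute folder of the file
extract="$dir/_extract.txt"   # absolute path of the extracted text

echo "$extract"
```

Quoting every variable matters as much as the absolute path: folder names with spaces will otherwise break the `egrep` call.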

I have embedded the run shell script and I suppose it is doing what it is supposed to do, but it still is not putting the date in the title. Am I missing something? Hazel runs and does everything I've asked. I just copied and pasted the script that was posted. Do I need to tinker with it??? I would like the file name to end up being 2012-12-2 based on what it pulls from the PDF file. Could someone please help me? Thank you so much!

You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.

Mr_Noodle wrote: You may need to tweak the script. It is only looking for dates after "Closing date". Unless your files have that text, it's not going to work.

I have altered it. Instead of "Closing date" I changed it to "Date:", which seems to work. However, my file ends up being named just "-.pdf". I don't know enough to change the script past that. Thanks.

I am new to Hazel and scripting. I am trying to move billing statements from a "downloaded/scanned" folder to their respective folders on my hard drive. I have no problem moving the files from one folder to another, but I would also like to rename the files with their "statement date" included in the new file name, e.g. EnergyUnited1-30-2013. I tried the above script and it works well when the date in the file is xx/xx/xxxx, but not when the date is x/xx/xxxx or xx/x/xxxx. How can I change the script to allow for variable date formats?

This is the line of the script I am using:
x=`egrep -o 'Billing Date ..\/..\/....' "$path"/"_extract.txt" | awk '{print $3}'`

Thanks

Try this:

x=`egrep -o 'Billing Date [0-9]{1,2}\/[0-9]{1,2}\/[0-9]{2,4}' "$path"/"_extract.txt" | awk '{print $3}'`
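A quick way to check that pattern against the different date shapes before putting it in a rule. The three sample lines are invented for the test, and `grep -E` is used here as the equivalent of `egrep`.

```shell
# [0-9]{1,2} accepts one or two digits; [0-9]{2,4} accepts 2- to 4-digit years.
pattern='Billing Date [0-9]{1,2}/[0-9]{1,2}/[0-9]{2,4}'

# Hypothetical sample lines covering xx/xx/xxxx, x/xx/xxxx and xx/x/xx.
results=$(
  for line in 'Billing Date 01/30/2013' 'Billing Date 1/30/2013' 'Billing Date 12/5/13'; do
    printf '%s\n' "$line" | grep -Eo "$pattern" | awk '{print $3}'
  done
)
echo "$results"
```

The backslashes before the slashes in the original are harmless but not required, since `/` has no special meaning in the pattern.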

Thanks JBB! Exactly what I needed.

Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD.
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris

I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify AWK to separate these into separate variables so that I can properly name my file YYYY-MM.pdf?

I played around with split but couldn't get it to work.

HirnHorn wrote: Thanks for your help up until now, it works fine but...

...can somebody please tell me how to change the script to automatically name my document YYYY-MM-DD.
I just can't get it sorted out. The dates I extract are formatted DD.MM.YYYY

Thanks,

Chris

Add another awk pipe at the end:

... | awk -F "." '{print $3"-"$2"-"$1}'
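Here is that reordering run on its own, as a sketch. The sample dates are made up, and the zero-padding variant with `printf` is an extra suggestion for dates such as 5.1.2013.

```shell
# Hypothetical date in DD.MM.YYYY form.
raw='24.12.2012'

# -F "." splits on the dots; the fields are then printed in reverse order.
iso=$(printf '%s\n' "$raw" | awk -F "." '{print $3"-"$2"-"$1}')
echo "$iso"

# Variant that pads single-digit days and months with a leading zero.
padded=$(printf '%s\n' '5.1.2013' | awk -F "." '{printf "%04d-%02d-%02d\n", $3, $2, $1}')
echo "$padded"
```

The padded form keeps the file names sorting correctly in Finder when some statements have single-digit days or months.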

vanduse1 wrote: I have a PDF that only contains the date in a string like this MMDDYYYY.

How can I modify AWK to separate these into the separate variables so that I can properly name my file YYYY-MM.pdf

I played around with split but couldn't get it to work.

The easy way would be to store your date as a variable and extract the fields separately.

BAD_DATE=$( this is the line you've already had success with )
YEAR=$(echo "$BAD_DATE" | tail -c 5)    # last 4 digits; 5 bytes because echo appends a newline
MONTH=$(echo "$BAD_DATE" | head -c 2)   # first 2 digits
GOOD_DATE="$YEAR-$MONTH"                # YYYY-MM
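Since the question was about awk specifically, here is an alternative sketch using awk's `substr`, which cuts by character position. The sample value is invented and assumes the string is exactly eight digits in MMDDYYYY order.

```shell
# Hypothetical MMDDYYYY string as extracted from the PDF text.
raw='01302013'

# substr(string, start, length): characters 5-8 are the year, 1-2 the month.
good=$(printf '%s\n' "$raw" | awk '{print substr($0,5,4) "-" substr($0,1,2)}')
echo "$good"
```

`split` did not work because it needs a separator between the parts, and this string has none.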