Extract date from OCR'd PDF

I hope I'm not derailing this thread, but I have an alternate method, using a Ruby script, for achieving the same thing.

DISCLAIMER: Until today, I have never interacted with Ruby before, so if I'm not following best practices, etc., feel free to correct me...

I was having trouble getting the script from this thread to work for my scenario, and through some googling I found another thread. Someone there had posted a Ruby script that would extract the date from a PDF and rename the file using the extracted date.
However, that didn't work well with the flow of my Hazel rule, so I found some commands that allowed me to simply update the 'date modified' of the file, which I could then use in the renaming process.

Here is the code:

The "Shell:" path should be set to "/usr/bin/ruby", and "SearchStringHere" should be changed to whatever text precedes the date you are trying to extract.

Code: Select all
require 'osx/cocoa'
require 'fileutils'
require 'time'
include OSX
OSX.require_framework('PDFKit')

path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end

pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end

pdfText = pdfDoc.string.to_s

# The date is expected as month/day/year right after the search string
match = pdfText.match(/SearchStringHere\s*(\d+)\/(\d+)\/(\d+)/i)
if not match
  puts "Didn't find search string"
  exit 1
end

month, day, year = match[1], match[2], match[3]

# Expand two-digit years: 00-49 become 20xx, 50-99 become 19xx
year = year.to_i
year += 2000 if year < 50
year += 1900 if year < 100

# Set both the access and modification time of the file to the extracted date
t = Time.parse("#{day}-#{month}-#{year}")
File.utime(t, t, path)
exit 0

Thanks a lot...works like a charm!

a_freyer wrote: Add another awk pipe at the end:

Code: Select all
... | awk -F "." '{print $3"-"$2"-"$1}'

I have another scenario where the dates in my PDF are only 1 digit for month and date if < 10.

I want to ensure a two digit value. So if my result is M/D/YYYY how to add logic to make it MM/DD/YYYY?

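Replace {print $1} with {printf "%02d", $1} (%02d zero-pads a numeric field in every awk implementation, whereas zero-padding with %02s is not portable).

JBB wrote: Replace {print $1} with {printf "%02d", $1}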

Great! It works. Thanks!
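
For anyone doing the padding inside a Ruby script instead of awk, here is a minimal sketch of the same idea. The helper name pad_us_date is made up for illustration, and it assumes the input is already a clean M/D/YYYY string.

```ruby
# Hypothetical helper: pad the month and day of an M/D/YYYY string to two digits.
def pad_us_date(text)
  month, day, year = text.strip.split('/')
  # %02d zero-pads integers to a width of two; the year is passed through as-is
  format('%02d/%02d/%s', month.to_i, day.to_i, year)
end

puts pad_us_date('4/6/2013')
puts pad_us_date('12/25/2013')
```

Values that already have two digits pass through unchanged, so the helper can be applied to every date without checking its length first.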

Guys, I have read all of the posts above and I was wondering if someone could help me with the code. Here is what I would like to do with mine. It is close and after hours of frustration, I resorted to signing up on the forums to ask for help before I pull out the few remaining hairs I have left.

The first thing I will do is scan the receipt, then I have a program OCR it, and after the OCR it runs the rule.

I would like for Hazel to do the following in a rule:

1.) Kind - PDF

2.) Contents - Is there a limit to the contents field? I would love to use a long list of vendor names that I have on receipts and if any are matched it goes to step 3

3.) Some kind of script - I want the script to search the PDF for a date in multiple formats (it could be in mm/dd/yyyy, mm/dd/yy, m/dd/yy, m/d/yy, m/dd/yyyy, m/d/yyyy, Jan dd' yy or any variation that could be a date string)

4.) If it matches any of the vendors and finds a date (in any format), it then moves the file to a specified folder. There are more details I would like beyond this, and I would love to talk with someone about them, but if I could just get the part above working, that would be awesome for my paperless workflow. I look forward to anyone's help on this. If it is beyond the scope of this thread, then let me know. I would be willing to pay someone for help with the scripting.
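
The decision in steps 2 to 4 (vendor found and date found, then file it) could be sketched in Ruby roughly as follows. This is only an outline under assumptions: the vendor names are placeholders, the date pattern covers numeric dates only, and the text is assumed to have been extracted from the PDF already.

```ruby
# Hypothetical vendor list; replace with the names that appear on your receipts.
VENDORS = ['Acme Hardware', 'Example Grocery'].freeze

# Numeric dates only: m/d/yy, mm/dd/yyyy and the same with dashes.
DATE_PATTERN = %r{\b\d{1,2}[/-]\d{1,2}[/-](\d{4}|\d{2})\b}

# Returns the first vendor whose name appears in the text (case-insensitive), or nil.
def matched_vendor(text)
  lowered = text.downcase
  VENDORS.find { |name| lowered.include?(name.downcase) }
end

# True only when both a known vendor and a date are present.
def should_file?(text)
  !matched_vendor(text).nil? && !(text =~ DATE_PATTERN).nil?
end

puts should_file?("ACME HARDWARE\nDate: 4/16/13\nTotal 12.50")
puts should_file?("Unknown Shop\nDate: 4/16/13")
```

The list can be as long as needed because it lives in the script rather than in the rule's contents field; the boolean result could then be turned into the script's exit status.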

Is there a clever way to extract a date in the format
"Statement ends 16 April 2013"
and then output the date in the format
"2013-04-16 - my file name.pdf"

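One possible approach to the "Statement ends 16 April 2013" case, sketched in Ruby. The helper name dated_name is invented for this example, and it assumes the statement text has already been extracted from the PDF and uses English month names.

```ruby
require 'date'

# Hypothetical helper: find "Statement ends <day> <month name> <year>" in the
# text and return the new file name, or nil when no usable date is present.
def dated_name(text, filename)
  match = text.match(/Statement ends\s+(\d{1,2})\s+([A-Za-z]+)\s+(\d{4})/i)
  return nil unless match

  # %B accepts month names such as "April"; an unknown name raises an error
  date = Date.strptime("#{match[1]} #{match[2]} #{match[3]}", '%d %B %Y')
  "#{date.strftime('%Y-%m-%d')} - #{filename}"
rescue ArgumentError
  nil # the captured words did not form a valid date
end

puts dated_name('Statement ends 16 April 2013', 'my file name.pdf')
```

The returned string could be used as the target name in a rename step; returning nil leaves the file alone when no date is found.
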
jgse01 wrote: 3.) Some kind of script - I want the script to search the PDF for a date in multiple formats (it could be in mm/dd/yyyy, mm/dd/yy, m/dd/yy, m/d/yy, m/dd/yyyy, m/d/yyyy, Jan dd' yy or any variation that could be a date string)

The part of the script that checks whether a date string exists in the document is best done with a regex, shown below with grep:

Code: Select all
grep -E "\b([0-9]{1}|[0-9]{2}|[0-9]{4})[/\-]([0-9]{1}|[0-9]{2})[/\-]([0-9]{1}|[0-9]{2}|[0-9]{4})\b"

Text piped to this grep is searched for:

[word break][one, two, or four digits][dash slash or backslash][one or two digits][dash slash or backslash][one, two, or four digits][word break]

This does not check for month word strings, but you could do that like this:

Code: Select all
grep -iEe "\b([0-9]{1}|[0-9]{2}|[0-9]{4})[/\-]([0-9]{1}|[0-9]{2})[/\-]([0-9]{1}|[0-9]{2}|[0-9]{4})\b" -e "\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"

Which tests for the name of the month within the file. Note here that since all month abbreviations are at least the first three letters of the month, we do not include a word break search \b at the end of this second string. Also, this search is case insensitive.
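
If the check has to live inside a Ruby script rather than a grep pipeline, the same two patterns can be written roughly like this. It is a sketch only: the constant and method names are made up, and the month pattern will also fire on ordinary words that start with a month abbreviation (for example "Marked" or "Decimal").

```ruby
# Numeric dates such as 4/16/2013, 04-16-13 or 2013/04/16.
# The separator class allows a slash, a backslash or a dash.
NUMERIC_DATE = %r{\b(\d{4}|\d{1,2})[/\\-](\d{1,2})[/\\-](\d{4}|\d{1,2})\b}

# Month names or their three-letter abbreviations, case-insensitive,
# with no trailing word break so that "Apr" also matches "April".
MONTH_NAME = /\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/i

# True when the text contains something that looks like a date.
def contains_date?(text)
  !!(text =~ NUMERIC_DATE || text =~ MONTH_NAME)
end

puts contains_date?('Total due 4/16/2013')
puts contains_date?('Invoice dated 16 april 2013')
puts contains_date?('version 1.2.3 of the form')
```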

Here is a Ruby script which I have used successfully so far to extract the date from the OCR'd PDF and store that date as the modification date. As you might guess from the naming of the months, I am from Germany, but US month names work as well.

Code: Select all
require 'osx/cocoa'
require 'fileutils'
require 'date'
require 'time'
include OSX
OSX.require_framework('PDFKit')

path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end

pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end

pdfText = pdfDoc.string.to_s

# Standard German matches: dd.mm.yyyy, then dd-mm-yyyy, then dd/mm/yyyy
match = pdfText.match(/(\d\d?)(\.)(\d\d?)(\.)((20|19)?(\d\d))\s/i)
if match then
  day, month, year = match[1], match[3], match[5]
end

if not match then
  match = pdfText.match(/(\d\d?)(\-)(\d\d?)(\-)((20|19)?(\d\d))\s/i)
  if match then
    day, month, year = match[1], match[3], match[5]
  end
end

if not match then
  match = pdfText.match(/(\d\d?)(\/)(\d\d?)(\/)((20|19)?(\d\d))\s/i)
  if match then
    day, month, year = match[1], match[3], match[5]
  end
end

# Match with full German month name, e.g. "16. April 2013"
if not match then
  match = pdfText.match(/(\d\d?)\.\s?(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s((20|19)?(\d\d))/i)
  if match then
    day, month, year = match[1], match[2], match[3]
    case month
    when "Januar" then month = 1
    when "Februar" then month = 2
    when "März" then month = 3
    when "April" then month = 4
    when "Mai" then month = 5
    when "Juni" then month = 6
    when "Juli" then month = 7
    when "August" then month = 8
    when "September" then month = 9
    when "Oktober" then month = 10
    when "November" then month = 11
    when "Dezember" then month = 12
    else match = nil # nil (not "", which is truthy in Ruby) discards the match
    end
  end
end

# US match with full month name, e.g. "April 16, 2013"
if not match then
  match = pdfText.match(/(January|February|March|April|May|June|July|August|September|October|November|December)\s(\d\d?),\s?((20|19)?(\d\d))/i)
  if match then
    day, month, year = match[2], match[1], match[3]
    case month
    when "January" then month = 1
    when "February" then month = 2
    when "March" then month = 3
    when "April" then month = 4
    when "May" then month = 5
    when "June" then month = 6
    when "July" then month = 7
    when "August" then month = 8
    when "September" then month = 9
    when "October" then month = 10
    when "November" then month = 11
    when "December" then month = 12
    else match = nil
    end
  end
end

if not match
  puts "Didn't find period ending"
  exit 1
end

# Expand two-digit years: 00-49 become 20xx, 50-99 become 19xx
year = year.to_i
year += 2000 if year < 50
year += 1900 if year < 100

parentFolder = File.dirname(path)
filename = File.basename(path)
fileDate = "#{year}-#{month}-#{day}"
begin
  mydate = Date.parse(fileDate)
rescue
  puts "Not a valid date " + fileDate
  exit 1
end

# Prefix the file name with the date, set the modification date, then move the file
newPath = parentFolder + "/" + mydate.strftime('%Y-%m-%d') + " " + filename
File.utime(File.atime(path), Time.parse(mydate.to_s), path)
FileUtils.mv(path, newPath)
puts newPath
exit 0

I have been fooling around with grep and it is great, but I am finding that the files created by OCR, both the native one that comes with the ScanSnap and PDFpenPro, all seem to have invalid characters in them that cause grep to fail. How is everyone getting around this? I can open the file with Emacs, but when you check the exit status of grep, it has failed and never finds anything.

I have created a couple of different files and it happens on all of them. On straight text files created on the computer, grep works beautifully, but on anything that I have scanned and OCRed I get nothing. Even if I extract the text out of the PDF, these characters are still in there and cause grep to abort.

Weird. I have not tried the Ruby script yet, but this was interesting. Any ideas? OS X Spotlight seems to be able to find stuff in these files okay, but grep is not working for me.
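
One thing that might be worth trying, assuming the problem really is byte sequences that are not valid UTF-8 (that is a guess, not something confirmed here): strip those bytes before searching. A minimal Ruby sketch, with a made-up helper name:

```ruby
# Hypothetical helper: drop byte sequences that are not valid UTF-8 so that
# later pattern matching works on a clean string.
def clean_ocr_text(raw)
  raw.dup.force_encoding('UTF-8').scrub('')
end

# Simulated OCR output with two stray bytes at the end.
raw = "Date: 04/16/2013 \xFF\xFE".b
cleaned = clean_ocr_text(raw)

puts cleaned.valid_encoding?
puts cleaned[/\d+\/\d+\/\d+/]
```

String#scrub needs Ruby 2.1 or later, so this would not run on a very old system Ruby.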

The best practice is frankly to use sed or another tool to correct for common errors before grepping. For example, in many OCR scripts that I run, I will replace "l" with "1" or "S" with "5" or "O" with "0" when they are adjacent to other recognized numbers.
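
The same kind of correction can be sketched in Ruby instead of sed. This is only an illustration of the idea: the helper name is invented, only the three substitutions mentioned above are handled, and a letter is changed only when the character directly before or after it is a digit.

```ruby
# Letters that OCR commonly confuses with digits.
LOOKALIKES = { 'l' => '1', 'S' => '5', 'O' => '0' }.freeze

# Hypothetical helper: replace a lookalike letter only when it sits directly
# next to a digit, so ordinary words are left alone.
def fix_ocr_digits(text)
  text.gsub(/(?<=\d)[lSO]|[lSO](?=\d)/) { |letter| LOOKALIKES[letter] }
end

puts fix_ocr_digits('Statement ends l6/O4/2O13')
puts fix_ocr_digits('Sold to Oslo')
```

Running the correction before the date search means the date patterns themselves can stay strict about digits.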

Is there still a need for this, given 3.1's ability to extract dates from contents?

Do you execute this code in a shell or as a script? Does it use the default Ruby on OS X, or do you need to install another package or packages?

I'd be all over any way to automatically pull out a date in common formats and use it in the rename - every single statement I download would be suitable for example!

hey im also looking for a way to take out text from some of my old pdfs... The text is really long so copying by hand would take forever... so far only found http://pdftoword.pro/ ... but kind of jumbles text up

does anyone know a better converter?

regards jules