Extract date from OCR'd PDF

Re: Extract date from OCR'd PDF Fri Mar 08, 2013 3:27 pm • by semiamaurotic
I hope I'm not derailing this thread, but I have an alternate method, using a Ruby script, for achieving the same thing.

DISCLAIMER: Until today, I have never interacted with Ruby before, so if I'm not following best practices, etc., feel free to correct me...

I was having trouble getting the script from this thread to work for my scenario, and through some googling I found this thread. Someone there had posted a Ruby script that would extract the date from a PDF and rename the file using the extracted date.
However, that didn't work well with the flow of my Hazel rule, so I found some commands that allowed me to simply update the 'date modified' of the file, which I could then use in the renaming process.

Here is the code:

The "Shell:" path should be set to "/usr/bin/ruby", and "SearchStringHere" should be changed to whatever text precedes the date you are trying to extract.

Code:
require 'osx/cocoa'
require 'fileutils'
require 'time'
include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
   puts "Missing path."
   exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
   puts "Failed to open pdf"
   exit 1
end
pdfText = pdfDoc.string.to_s
match = pdfText.match(/SearchStringHere\s*(\d+)\/(\d+)\/(\d+)/i)
if not match
   puts "Didn't find search string"
   exit 1
end
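# The pattern captures a US-style date: month/day/year.
month, day, year = match[1].to_i, match[2].to_i, match[3].to_i
# Expand two-digit years: 00-49 become 20xx, 50-99 become 19xx.
year += 2000 if year < 50
year += 1900 if year < 100
# Build the time from explicit fields rather than letting Time.parse guess the field order.
t = Time.local(year, month, day)
# Set both the access time and the modification time to the extracted date.
File.utime(t, t, path)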
exit 0
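
For anyone who wants to try the date handling without Hazel or PDFKit, here is a minimal sketch of just that part, run against a plain string. The names `extract_date` and `label` are made up for illustration and the sample text is invented. It assumes US month/day/year order, as the script above does.

```ruby
# Sketch of the script's date handling, without PDFKit.
# `extract_date` and `label` are hypothetical names used for illustration.
def extract_date(text, label)
  # Same idea as the script's pattern: the label, optional whitespace,
  # then month/day/year separated by slashes.
  pattern = /#{Regexp.escape(label)}\s*(\d+)\/(\d+)\/(\d+)/i
  match = text.match(pattern)
  return nil unless match

  month, day, year = match[1].to_i, match[2].to_i, match[3].to_i
  # Two-digit years: 00-49 are read as 20xx, 50-99 as 19xx.
  year += 2000 if year < 50
  year += 1900 if year < 100
  Time.local(year, month, day)
end

# Invented sample text standing in for the OCR'd PDF contents.
t = extract_date("Invoice Date: 3/8/13 Total: $42.00", "Invoice Date:")
puts t.strftime("%Y-%m-%d") if t
```

Returning nil instead of exiting keeps the sketch easy to test; the real script would still need the `exit 1` paths so that Hazel sees the failure.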
semiamaurotic
 
Posts: 1
Joined: Fri Mar 08, 2013 2:54 pm

Re: Extract date from OCR'd PDF Sat Mar 09, 2013 4:01 am • by HirnHorn
Thanks a lot...works like a charm!

a_freyer wrote: Add another awk pipe at the end:

Code:
 ... | awk -F "." '{print $3"-"$2"-"$1}'
HirnHorn
 
Posts: 3
Joined: Tue Mar 05, 2013 2:34 pm
Location: Hamburg (GER) and London (GBR)

Re: Extract date from OCR'd PDF Tue Apr 02, 2013 6:47 pm • by vanduse1
I have another scenario where the dates in my PDF have only one digit for the month and day if they are less than 10.

I want to ensure a two-digit value. So if my result is M/D/YYYY, how do I add logic to make it MM/DD/YYYY?
vanduse1
 
Posts: 4
Joined: Thu Mar 07, 2013 7:20 pm

Re: Extract date from OCR'd PDF Fri Apr 05, 2013 2:27 am • by JBB
Replace {print $1} with {printf "%02s", $1}
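
If the date is being handled in one of the Ruby scripts from this thread rather than in awk, the same padding can be done with a numeric format. This is only a sketch: `pad_date` is a hypothetical helper, and it assumes the input is already an M/D/YYYY string. Note that in printf-style formats the 0 flag is only defined for numeric conversions, so %02d is the portable choice in awk as well; whether %02s pads with zeros depends on the awk in use.

```ruby
# `pad_date` is a hypothetical helper name used for illustration.
def pad_date(date_string)
  # Integer(part, 10) forces base 10, so "08" is read as 8 instead of failing as octal.
  month, day, year = date_string.split("/").map { |part| Integer(part, 10) }
  # %02d pads month and day to two digits with leading zeros.
  format("%02d/%02d/%04d", month, day, year)
end

puts pad_date("4/5/2013")
```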
JBB
 
Posts: 22
Joined: Mon Sep 12, 2011 6:09 pm

Re: Extract date from OCR'd PDF Fri Apr 05, 2013 1:31 pm • by vanduse1
JBB wrote: Replace {print $1} with {printf "%02s", $1}


Great! It works. Thanks!

Re: Extract date from OCR'd PDF Thu Apr 11, 2013 3:14 am • by jgse01
Guys, I have read all of the posts above and I was wondering if someone could help me with the code. Here is what I would like to do with mine. It is close and after hours of frustration, I resorted to signing up on the forums to ask for help before I pull out the few remaining hairs I have left.

The first thing I will do is scan the receipt, then I have a program OCR it, and after the OCR it runs the rule.


I would like for Hazel to do the following in a rule:

1.) Kind - PDF

2.) Contents - Is there a limit to the contents field? I would love to use a long list of vendor names that I have on receipts and if any are matched it goes to step 3

3.) Some kind of script - I want the script to search the PDF for a date in multiple formats (it could be in mm/dd/yyyy, mm/dd/yy, m/dd/yy, m/d/yy, m/dd/yyyy, m/d/yyyy, Jan dd' yy, or any variation that could be a date string)

4.) If it matches any of the vendors and finds a date (in any format), it moves the file to a specified folder. I have more details in mind than this and would love to talk with someone about them, but if I could just get the part described above working, that would be awesome for my paperless workflow. I look forward to anyone's help on this. If it is beyond the scope of this thread, let me know. I would be willing to pay someone for help with the scripting.
jgse01
 
Posts: 1
Joined: Thu Apr 11, 2013 3:03 am

Re: Extract date from OCR'd PDF Sun Apr 28, 2013 9:11 am • by sgfido
Is there a clever way to extract a date in the format
"Statement ends 16 April 2013"
and then output the date in the format
"2013-04-16 - my file name.pdf"?
sgfido
 
Posts: 1
Joined: Sun Apr 28, 2013 9:07 am
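
One possible approach to the "Statement ends 16 April 2013" question, sketched in Ruby to match the other scripts in this thread. It assumes the text has already been extracted from the PDF and that the month is spelled out in English. `dated_name` is a hypothetical helper name, not something from Hazel or from the scripts above.

```ruby
require 'date'

# `dated_name` is a hypothetical helper name used for illustration.
def dated_name(text, filename)
  # Look for "Statement ends", then day, month name, and four-digit year.
  match = text.match(/Statement ends\s+(\d{1,2}\s+[A-Za-z]+\s+\d{4})/)
  return nil unless match

  # Collapse runs of whitespace, then parse day, full month name, year.
  date = Date.strptime(match[1].split.join(" "), "%d %B %Y")
  "#{date.strftime('%Y-%m-%d')} - #{filename}"
rescue ArgumentError
  # strptime raises when the words do not form a real date.
  nil
end

puts dated_name("Statement ends 16 April 2013", "my file name.pdf")
```

The result could then be passed to `FileUtils.mv` to rename the file, in the same way as the longer Ruby script in this thread does.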

Re: Extract date from OCR'd PDF Sun Apr 28, 2013 10:48 am • by a_freyer
jgse01 wrote: 3.) Some kind of script - I want the script to search the PDF for a date in multiple formats (it could be in mm/dd/yyyy, mm/dd/yy, m/dd/yy, m/d/yy, m/dd/yyyy, m/d/yyyy, Jan dd' yy, or any variation that could be a date string)



The part of the script that checks whether a date string exists in the document is best done with a regex, shown here with grep:

Code:
grep -E "\b([0-9]{1}|[0-9]{2}|[0-9]{4})[/\-]([0-9]{1}|[0-9]{2})[/\-]([0-9]{1}|[0-9]{2}|[0-9]{4})\b"


This searches the text piped to grep for:

[word break][one, two, or four digits][dash, slash, or backslash][one or two digits][dash, slash, or backslash][one, two, or four digits][word break]

This does not check for month word strings, but you could do that like this:

Code:
grep -iEe "\b([0-9]{1}|[0-9]{2}|[0-9]{4})[/\-]([0-9]{1}|[0-9]{2})[/\-]([0-9]{1}|[0-9]{2}|[0-9]{4})\b" -e "\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)"


The second pattern tests for the name of a month within the file. Because each alternative is only the beginning of a month name, there is no closing word break \b at the end of this second pattern, so longer forms such as "January" also match. The search is also case insensitive.
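
For use inside one of the Ruby scripts in this thread, the same two checks can be sketched as Ruby patterns. This is an illustration only: the constant names and `contains_date?` are made up here, and the sketch only reports whether something date-like is present, not which date it is.

```ruby
# One, two, or four digits; dash, slash, or backslash; one or two digits;
# the same separators again; one, two, or four digits.
NUMERIC_DATE = %r{\b(\d{1,2}|\d{4})[/\\\-](\d{1,2})[/\\\-](\d{1,2}|\d{4})\b}

# The beginning of an English month name, case insensitive, with no closing
# word break so that "Jan" and "January" both match.
MONTH_NAME = /\b(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/i

# `contains_date?` is a hypothetical helper name.
def contains_date?(text)
  !!(text =~ NUMERIC_DATE || text =~ MONTH_NAME)
end

puts contains_date?("Paid 4/5/2013")
puts contains_date?("total: 42 items")
```

Because the month pattern has no closing word break, it also matches ordinary words that start the same way, such as "Mary" or "decide", so it is better suited to filtering candidates than to confirming a date.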
a_freyer
 
Posts: 631
Joined: Tue Sep 30, 2008 9:21 am
Location: Colorado

Re: Extract date from OCR'd PDF Wed May 29, 2013 6:09 am • by sussdorff
Here is a Ruby script that I have used successfully so far to extract the date from the OCR'd PDF and store it as the modification date. It also prepends the date to the file name. As you might guess from the month names, I am from Germany, but US-style dates work as well.

Code:
require 'osx/cocoa'
require 'fileutils'
require 'date'
require 'time'

include OSX
OSX.require_framework('PDFKit')
path = ARGV[0]
if not path
  puts "Missing path."
  exit 1
end
pdfDoc = PDFDocument.alloc.initWithURL(NSURL.fileURLWithPath(path))
if not pdfDoc
  puts "Failed to open pdf"
  exit 1
end
pdfText = pdfDoc.string.to_s

# Numeric dates, read in German day-month-year order.
# Dotted form, e.g. 16.04.2013
match = pdfText.match(/(\d\d?)(\.)(\d\d?)(\.)((20|19)?(\d\d))\s/i)
if match then
  day, month, year = match[1], match[3], match[5]
end

# Dashed form, e.g. 16-04-2013
if not match then
  match = pdfText.match(/(\d\d?)(\-)(\d\d?)(\-)((20|19)?(\d\d))\s/i)
  if match then
    day, month, year = match[1], match[3], match[5]
  end
end

# Slashed form, e.g. 16/04/2013 (also read day first, so US month/day dates come out swapped)
if not match then
  match = pdfText.match(/(\d\d?)(\/)(\d\d?)(\/)((20|19)?(\d\d))\s/i)
  if match then
    day, month, year = match[1], match[3], match[5]
  end
end

# Match with full date name
if not match then
  match = pdfText.match(/(\d\d?)\.\s?(Januar|Februar|März|April|Mai|Juni|Juli|August|September|Oktober|November|Dezember)\s((20|19)?(\d\d))/i)
  if match then
    day, month, year = match[1], match[2], match[3]
    case month
    when "Januar"
      month = 1
    when "Februar"
      month = 2
    when "März"
      month = 3
    when "April"
      month = 4
    when "Mai"
      month = 5
    when "Juni"
      month = 6
    when "Juli"
      month = 7
    when "August"
      month = 8
    when "September"
      month = 9
    when "Oktober"
      month = 10
    when "November"
      month = 11
    when "Dezember"
      month = 12
    else
      match = nil  # nil, not "": an empty string is truthy in Ruby and would skip the failure check below
    end
  end
end

# US match with full month name, e.g. "March 8, 2013"
if not match then
  match = pdfText.match(/(January|February|March|April|May|June|July|August|September|October|November|December)\s(\d\d?),\s?((20|19)?(\d\d))/i)
  if match then
    day, month, year = match[2], match[1], match[3]
    # The pattern is case insensitive, so normalize the name before comparing.
    case month.capitalize
    when "January"
      month = 1
    when "February"
      month = 2
    when "March"
      month = 3
    when "April"
      month = 4
    when "May"
      month = 5
    when "June"
      month = 6
    when "July"
      month = 7
    when "August"
      month = 8
    when "September"
      month = 9
    when "October"
      month = 10
    when "November"
      month = 11
    when "December"
      month = 12
    else
      match = nil  # nil, not "": an empty string is truthy in Ruby
    end
  end
end

if not match
  puts "Didn't find period ending"
  exit 1
end

year = year.to_i
year += 2000 if year < 50
year += 1900 if year < 100
parentFolder = File.dirname(path)
filename = File.basename(path)
fileDate = "#{year}-#{month}-#{day}"

begin
   mydate = Date.parse(fileDate)
rescue
  puts "Not a valid date " + fileDate
  exit 1
end
newPath = parentFolder + "/" + mydate.strftime('%Y-%m-%d') + " " + filename
File.utime(File.atime(path), Time.parse(mydate.to_s), path)
FileUtils.mv(path, newPath)
puts newPath
exit 0
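
As a possible simplification, the two long case statements could be replaced by a single lookup table. The sketch below is an assumption about how that might look, not part of the script above: `MONTH_NUMBER` and `month_number` are made-up names, and the case-insensitive handling of "März" relies on Unicode-aware `downcase`, which needs Ruby 2.4 or later.

```ruby
GERMAN_MONTHS = %w[Januar Februar März April Mai Juni Juli August
                   September Oktober November Dezember].freeze
ENGLISH_MONTHS = %w[January February March April May June July August
                    September October November December].freeze

# Map every lowercased month name to its number (1-12).
# Names shared by both languages, such as "April", map to the same number.
MONTH_NUMBER = {}
[GERMAN_MONTHS, ENGLISH_MONTHS].each do |names|
  names.each_with_index { |name, index| MONTH_NUMBER[name.downcase] = index + 1 }
end

# `month_number` is a hypothetical helper; it returns nil for unknown names.
def month_number(name)
  MONTH_NUMBER[name.downcase]
end

puts month_number("Oktober")
puts month_number("october")
```

A nil result can then be treated like a failed match, which avoids the need to reset `match` by hand in an else branch.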
sussdorff
 
Posts: 2
Joined: Fri Jul 27, 2012 4:05 am

Re: Extract date from OCR'd PDF Mon Jun 24, 2013 2:32 pm • by craw
I have been fooling around with grep and it is great, but the files created by OCR, both from the software that comes with the ScanSnap and from PDFpenPro, all seem to contain invalid characters that cause grep to fail. How is everyone getting around this? I can open the file with Emacs, but grep's exit status shows that it failed, and it never finds anything.

I have created a couple of different files and it happens with all of them. grep works beautifully on plain text files created on the computer, but on anything that I have scanned and OCR'd I get nothing. Even if I extract the text from the PDF, these characters are still in there and cause grep to abort.

Weird. I have not tried the Ruby script yet, but this was interesting. Any ideas? OS X Spotlight seems to be able to find things in these files, but grep is not working for me.
craw
 
Posts: 5
Joined: Fri May 10, 2013 6:03 pm

Re: Extract date from OCR'd PDF Mon Jun 24, 2013 2:54 pm • by a_freyer
The best practice is frankly to use sed or another tool to correct for common errors before grepping. For example, in many OCR scripts that I run, I will replace "l" with "1" or "S" with "5" or "O" with "0" when they are adjacent to other recognized numbers.
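
A rough Ruby sketch of that kind of cleanup, for use before matching dates. It is not a_freyer's script: `fix_ocr_digits` and `OCR_DIGIT_FIXES` are made-up names, the rule "only when directly next to a digit" is one simple reading of "adjacent to other recognized numbers", and dropping invalid bytes with `String#scrub` (Ruby 2.1 or later) is an assumption about what makes grep give up on these files.

```ruby
# Letters that OCR commonly confuses with digits (the three named in the post).
OCR_DIGIT_FIXES = { "l" => "1", "S" => "5", "O" => "0" }.freeze

# `fix_ocr_digits` is a hypothetical helper name.
def fix_ocr_digits(text)
  # Drop invalid byte sequences first so the pattern match cannot raise.
  current = text.scrub("")
  loop do
    # Replace a look-alike letter only when a digit sits directly before or after it.
    fixed = current.gsub(/(?<=\d)[lSO]|[lSO](?=\d)/) { |char| OCR_DIGIT_FIXES[char] }
    # Repeat until nothing changes, since one fix can expose the next letter.
    break fixed if fixed == current
    current = fixed
  end
end

puts fix_ocr_digits("Statement date: O4/l5/2Ol3")
```

This is a heuristic, so it can also change legitimate text such as a product code that mixes letters and digits; it is safest applied only to the text that is searched for dates.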

Re: Extract date from OCR'd PDF Mon Jun 24, 2013 3:17 pm • by Mr_Noodle
Is there still a need for this, given Hazel 3.1's ability to extract dates from contents?
Mr_Noodle
Site Admin
 
Posts: 11866
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: Extract date from OCR'd PDF Mon Jun 24, 2013 7:37 pm • by craw
Do you execute this code in a shell or as a script? Does it use the default Ruby on OS X, or do you need to install another package or packages?

Re: Extract date from OCR'd PDF Tue Jun 25, 2013 11:38 am • by Tetsugaku
I'd be all over any way to automatically pull out a date in common formats and use it in the rename - every single statement I download would be suitable for example!
Tetsugaku
 
Posts: 9
Joined: Thu Nov 08, 2012 9:16 am

Re: Extract date from OCR'd PDF Thu Oct 31, 2013 5:56 am • by Jules
Hey, I'm also looking for a way to get the text out of some of my old PDFs. The text is really long, so copying it by hand would take forever. So far I have only found http://pdftoword.pro/, but it kind of jumbles the text up.

Does anyone know a better converter?

Regards, Jules :)
Jules
 
Posts: 1
Joined: Thu Oct 31, 2013 5:54 am
