OCR PDF in Acrobat

OCR PDF in Acrobat Fri Jul 24, 2009 12:28 pm • by macdrifter
I was trying out Hazel to see if it could handle an OCR project. I have a folder that all ScanSnap scans get dumped to. I want each scan run through Acrobat Pro's OCR, saved, and then moved to an inbox. This worked perfectly. Here is the AppleScript I cobbled together to do this:
Code:
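-- Embedded in a Hazel "Run AppleScript" action; Hazel supplies theFile.
-- The try block has no error handler, so any failure is silently ignored.
try
   tell application "Adobe Acrobat Professional"
      activate
      open theFile
      -- Drive the OCR menu item through GUI scripting (System Events).
      tell application "System Events"
         tell process "Acrobat"
            tell menu bar 1
               tell menu "Document"
                  tell menu item "OCR Text Recognition"
                     tell menu 1
                        click menu item "Recognize Text Using OCR..."
                     end tell
                  end tell
               end tell
            end tell
            -- Press Return to accept the defaults in the OCR dialog.
            keystroke return
         end tell
      end tell
      -- Save the recognized text back into the PDF, then close it.
      save the front document
      close the front document
   end tell
end try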


Follow this up in Hazel with a "move to folder" and "growl notification" and I think it is a pretty elegant solution. I could have done the whole thing with folder actions, but I find them pretty unreliable at times, and I would have to maintain the code whenever I change the workflow. It's not worth the effort.

I hope this helps someone else. I guess I will go and buy Hazel now.

Gabe
macdrifter
 
Posts: 8
Joined: Fri Jul 24, 2009 12:23 pm

Re: OCR PDF in Acrobat Mon Jul 27, 2009 2:45 pm • by Mr_Noodle
Thanks for the script. As for what to do with it after you've scanned it, here's a nice little article describing one way to file it: http://www.macsparky.com/2009/06/02/sor ... ith-hazel/
Mr_Noodle
Site Admin
 
Posts: 11193
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: OCR PDF in Acrobat Mon Jul 27, 2009 6:49 pm • by macdrifter
Thanks David. That was actually the inspiration. It was the Mac Power User Podcast that got me looking into Hazel to solve this problem. It's really annoying that Adobe doesn't provide a good solution even when you shell out the money for Acrobat Pro.

Anywho, thanks for the heads up on Hazel. I'm a believer.

Re: OCR PDF in Acrobat Sun Aug 23, 2009 10:11 pm • by josephbphillips
I am having a hard time with this. In my workflow, Hazel seems to be sending the file to Acrobat before it is finished scanning. Did you have that problem? What is your whole workflow?

Thanks!!
josephbphillips
 
Posts: 8
Joined: Sat Nov 17, 2007 4:49 pm

Re: OCR PDF in Acrobat Mon Aug 24, 2009 8:14 pm • by macdrifter
josephbphillips wrote: I am having a hard time with this. In my workflow, Hazel seems to be sending the file to Acrobat before it is finished scanning. Did you have that problem? What is your whole workflow?

Thanks!!

I don't seem to have that problem. I'm using a ScanSnap and scanning directly to PDF. The Hazel rule simply watches the OCR inbox I created, waiting for a document of Kind PDF. It runs the embedded AppleScript, moves the OCR'd file to a new folder, and renames it to date_OCR. The file is then colored green. (For me, all OCR'd PDFs are colored green; it's an easy way to tell while browsing.) This probably will not work if the scanner is too slow. Try adding a delay at the beginning of the script to wait a few seconds for the file to finish. Good luck and let us know if you figure it out.
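
A minimal sketch of that delay idea, meant to sit at the top of the embedded script. Instead of a fixed pause, it polls the file size until two readings in a row match. The values pollSeconds and maxChecks are illustrative assumptions, not Hazel settings, and theFile is the alias Hazel hands to the script. If the size never settles, the script simply carries on after roughly pollSeconds × maxChecks seconds.

```applescript
-- Sketch only: pollSeconds and maxChecks are illustrative values.
-- theFile is the alias Hazel passes to an embedded script.
set filePath to POSIX path of theFile
set pollSeconds to 2
set maxChecks to 15
set lastSize to -1
repeat maxChecks times
	tell application "System Events"
		set currentSize to size of disk item filePath
	end tell
	-- Two equal readings in a row suggest the scanner has stopped writing.
	if currentSize = lastSize then exit repeat
	set lastSize to currentSize
	delay pollSeconds
end repeat
-- The Acrobat OCR steps would follow here.
```

A plain `delay 10` as the first line is the simpler version of the same thing, but it wastes time on small scans and may still be too short for long ones.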

Re: OCR PDF in Acrobat Mon Aug 24, 2009 9:37 pm • by josephbphillips
I have the fix. I was using "extension is pdf" instead of "kind is pdf". I don't understand why that changes the behavior, but it does and is reproducible if I change it back.

Thanks for walking me through your workflow!

As long as we are on the topic, how do you organize your PDFs?

I set up ScanSnap with profiles for each of the paper bills/statements I typically receive, and I have a generic profile for other items I want to scan. Each profile gives the PDF an appropriate name (e.g. DWP 001.pdf) and dumps it in my inbox. Hazel watches the inbox and has a rule set up to match each bill name (e.g. name contains DWP and Kind is PDF), OCRs it, renames it to "DWP mm/dd/yy.pdf", then moves it to ~/Documents/Bills/DWP/.

I am trying to decide whether I should quit there, but for now I have also been using Yep! to tag the files for easier retrieval. I think it may be organization for organization's sake.

I have also tried having the documents scan into NeatWorks, but for the sake of longevity I don't really want to be locked into their software.

Anything brilliant you guys are doing?

Thanks!

Re: OCR PDF in Acrobat Mon Aug 24, 2009 9:58 pm • by macdrifter
Good to hear.
Even with a bunch of great Spotlight comments and meta tags, I still keep many of my PDFs organized in folders (much like you describe), partly because I learned the hard way that Spotlight meta tags are fragile. Spotlight metadata is not actually embedded in the file. I made the mistake of archiving to a Windows NTFS disk and lost ALL of my metadata.
My system is a series of structured folders and disk images. I keep all of my personal and financial docs on a secure disk image that I mount using Knox. I keep all of my bills in another encrypted disk image. Finally I have an encrypted disk image for my receipts. Each disk image has a traditional folder structure for things like "Taxes", "Bank Statements" and "Mortgage".

I still add Spotlight meta tags but I do not rely on them. I figure the worst case scenario now is that I can still search all of the OCR'd text and the folders are structured in a meaningful way so I can quickly narrow a search.

The advantage of the encrypted disk images is that if my Mac is stolen, there is NO WAY anyone can get at those documents. Also, I can quickly archive a copy of the disk image onto a thumb drive and have an entire encrypted portable copy.

I tried DEVONthink Pro Office but found it too cumbersome for PDF management. I try to keep my PDFs as universal as possible.

NeatWorks looks nice, but when I tried it out, performance with a large library was terrible.

Since you use multiple ScanSnap profiles (I prefer to use the scan button on the front), you could have Hazel add meta tags to the document's Spotlight comments field when it is filed. I recommend prefixing Spotlight comments with an "@" symbol to help with searching. That way you know you are searching on the tag only, since only your meta tags will begin with an "@".
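
For anyone who would rather script the tagging than use Hazel's built-in comment action, here is a rough sketch of the "@" convention. It assumes theFile is the alias Hazel supplies, and tagName is just an illustrative value. It appends the tag to whatever Spotlight comment is already there rather than overwriting it.

```applescript
-- Sketch only: tagName is an illustrative value; theFile comes from Hazel.
set tagName to "@mortgage"
tell application "Finder"
	set theItem to item (theFile as text)
	set oldComment to comment of theItem
	if oldComment is "" then
		set comment of theItem to tagName
	else if oldComment does not contain tagName then
		-- Keep existing comments and append the new tag.
		set comment of theItem to oldComment & " " & tagName
	end if
end tell
```

Searching Spotlight for the tag with its "@" prefix should then match only tagged files, since ordinary document text rarely contains it.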

Re: OCR PDF in Acrobat Mon Aug 24, 2009 10:15 pm • by josephbphillips
I do use encrypted disk images for bank records and medical paperwork, but I haven't automated that piece yet.

Do you manually sort your OCR files into the disk images and folder structure? I am just setting up my workflow, but I found the profiles to be the easiest way to make this automatic. I haven't tried to have it put the files in a disk image. Is there an easy way to do that?

I have been using rsync to back up my files to a web server, but encrypted disk images obviously eliminate the advantage of incremental backup. I have been considering getting Mozy or something, moving the vanity page/domain to Google Sites, and getting rid of the server. Do you use offsite storage for your disk images? Any tips?

Thanks again for the help!

Re: OCR PDF in Acrobat Tue Aug 25, 2009 11:32 am • by macdrifter
Long ago I used Quicksilver to move files to their correct directory. Then I did some terminal commands to accomplish this. Since starting to use Hazel, I've set up a series of rules that look at my inbox. As soon as I tag a file with a specific tag or set of tags, the file gets moved to where it is supposed to go. For example, adding the Spotlight comment "@mortgage" triggers Hazel to move the file to my "Mortgage" archive folder on the secure disk image. It gets complicated if you want fine control over moving the files.
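
Hazel's own conditions and "move" action handle this without any code, but the logic of one such rule can be sketched in AppleScript. The destination path below is a hypothetical HFS path for a mounted disk image, not my real layout, and theFile is the alias Hazel supplies. The exists check means nothing is moved when the image is not mounted.

```applescript
-- Sketch only: destFolder is a hypothetical path on a mounted disk image.
set tagName to "@mortgage"
set destFolder to "SecureDocs:Mortgage:"
tell application "Finder"
	set theItem to item (theFile as text)
	-- Move only when the tag is present and the destination is available.
	if (comment of theItem) contains tagName and (exists folder destFolder) then
		move theItem to folder destFolder
	end if
end tell
```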

It sounds like your ScanSnap profiles will accomplish what you need. It's a cool idea. I just hate to go to the mouse every time I want to scan a doc.

I do not use a traditional online backup service. I paid for Mozy for one year but found it to be too buggy and unreliable. It is also ridiculously slow. I have not heard a firsthand account of someone recovering their files that way. I use Dropbox for a fair amount of critical stuff and for quick file exchange between machines. Dropbox definitely works and is very fast.

The real backup combo for me is a Drobo at home and an external drive that I keep at work. Once a month I bring it home and re-image my boot drive. That way, if there is a fire at home my data is still safe.

If you have a web server and it is working for backup then I would stick with that. Just make sure it is secure and the storage is reliable. Can the host actually restore data if there is an outage?

Re: OCR PDF in Acrobat Thu Aug 27, 2009 11:54 am • by macdrifter
I also used Leap for quite a while. It essentially just catalogs PDF files (along with tags) and allows you to search. There's also a smart folder feature that allows you to quickly locate collections of PDFs. http://www.ironicsoftware.com/leap/index.html

Re: OCR PDF in Acrobat Thu Oct 01, 2009 6:06 am • by noodle1965
Hi,

It's interesting to read the different approaches to the filing of scanned documents. I have written up my approach on my blog at http://103home.com/?p=193. Whatever approach is taken, Hazel sure makes the process much easier.

Cheers

Iain
noodle1965
 
Posts: 10
Joined: Sun Sep 13, 2009 12:37 am
Location: Perth, Western Australia

Re: OCR PDF in Acrobat Fri Nov 13, 2009 2:26 pm • by noddy67
Has anyone tried to use the highlighter pen functionality in the ScanSnap software to attach keywords to the scanned PDFs and then have Hazel automatically rename the files based on the keyword data and move them to the correct folder?

If so I'd be very grateful if you could share exactly how you've done it.
Thanks
noddy67
 
Posts: 4
Joined: Wed Oct 28, 2009 4:41 am

Re: OCR PDF in Acrobat Sun Feb 28, 2010 1:42 pm • by macdrifter
Well, I've finally dumped all of my Adobe products. They were constantly consuming 30% of my 4-core Mac Pro, even when they were not running.

I already own DEVONthink Pro Office 2, so I decided to check out the included ABBYY OCR engine. The accuracy is actually better than Acrobat's. It's pretty slow in comparison to Acrobat, but the quality of the text recognition is what matters, and the process is completely automated by DEVONthink.

So now my workflow is to scan to the DEVONthink inbox (OCR is done automatically). Weekly, I do a review and export to a folder for Hazel to finish processing. It renames the files, applies some tags and labels, and then archives them to an encrypted disk image.

I'm pretty happy with the new workflow and VERY happy to get rid of all of that bloated Adobe crap-ware.

Re: OCR PDF in Acrobat Tue Mar 02, 2010 4:33 pm • by Mr_Noodle
Thanks for the update. I don't do much OCR now but have been thinking of getting some OCR software at some point for a project so it's good to hear about what people are using and how.

Re: OCR PDF in Acrobat Sun May 30, 2010 11:40 pm • by mjc
This is a great idea! We've been doing those in batches. I'll have to see if I can do this in Hazel on an unused machine... :)
mjc
 
Posts: 2
Joined: Sat Aug 02, 2008 2:03 am
