OCR search and Blocks of Text - AppleScript needed for ABBYY

Moderator: Mr_Noodle

Hi,

I understand this is not specifically a Hazel issue, but it's all part of the workflow I have been working on for the last week.

I feel like I am 99.9% of the way to solving the issues.

I am using Hazel to automatically OCR files as they appear in a "Watch Folder", using PDFpenPro.

The AppleScript I am using is this:
Code:
try
   tell application "PDFpenPro"
      open theFile as alias
      tell document 1
         ocr
         repeat while performing ocr
            delay 1
         end repeat
         delay 1
         close with saving
      end tell
      
   end tell
   
on error
   --This captures the error so that a document isn't OCR'd ad infinitum.
   tell application "PDFpenPro"
      try
         -- Discard changes so no save dialog blocks the workflow.
         close document 1 saving no
      end try
      quit
   end tell
end try


I have found that the OCR in PDFpenPro is not consistent. On some PDFs the OCR'd text will allow a pattern match, on others the match will miss.

However, I tried ABBYY FineReader Express, and so far it is working 100% of the time. It seems to OCR the documents as blocks of text, in a far more logical manner.

So on to my question. Can someone please provide an AppleScript for Hazel that converts PDF files to searchable PDFs with ABBYY FineReader? My AppleScript ability is about 5%.

Regards

Hans


NB: I did try this script, but it automates a folder directly rather than using Hazel, it did not work with OS X 10.8.4, and I think I would need to amend it to work with ABBYY FineReader Express.

http://paperjammed.com/2010/01/04/automate-scansnap-ocr-process-on-your-mac-with-applescript-snow-leopard-edition/

Code:
(*

NOTE: This script was written for Snow Leopard. It may work
on Leopard, but I never tried it.

This is a folder listener script that will act as a queue, receiving
PDF files from the ScanSnap scanner and feeding them, one by one, to
the Abbyy FineReader OCR software.

This allows you to keep scanning while the OCR job runs in the background
on all of the unprocessed files.

Why do we want to do this?

The ScanSnap Manager software does not support this by default, so
when you scan in a file, it sends it to FineReader for OCR. You then
must wait until FineReader finishes its work before scanning in another
document.

This script allows you to keep scanning without waiting for OCR.

Installation:

o   Copy this script to:

   <home>/Library/Scripts/Folder Action Scripts

   You may have to create the "Folder Action Scripts" folder.

o   Open a Finder window and navigate to the parent folder
  of the scanned documents folder.

o Right click (control-click) the scanned documents folder and
  choose:

   Folder Actions Setup...

o At this point if folder actions are not enabled, you will
  likely have to enable them and add the script manually.
   - check "Enable Folder Actions"
   - Use the "+" buttons on the left and right sides to add the
     scan folder and then this script.
   
o Otherwise, a list of scripts will come up. Choose this script
  from the "Choose a Script to Attach" dialog.

o Close all windows.

Copyright (C) 2010 Tad Harrison
*)
property ocrFileSuffix : " processed by Abbyy"
property ocrApplicationName : "Scan to Searchable PDF"
property ocrApplicationWindow : "Converting the document"
property ocrLockFileName : "OCR in Progress"
on adding folder items to this_folder after receiving added_items
   set lockFilePath to (POSIX path of (path to desktop folder as text)) & ocrLockFileName
   try
      logEvent("=== Run OCR on New Folder Items ===")
      -- Test for lockfile; exit if lockfile exists
      tell application "System Events" to set lockFileExists to exists file lockFilePath
      if lockFileExists then
         logEvent("Other script running. Exiting...")
         return
      else
         do shell script "/usr/bin/touch \"" & lockFilePath & "\""
      end if
      -- Main loop
      set moreWorkToDo to true
      repeat while moreWorkToDo
         set aFile to getNextFile(this_folder)
         if not aFile = "" then
            ocrFile(aFile)
         else
            set moreWorkToDo to false
         end if
      end repeat
      logEvent("No more work.")
      exitApp(ocrApplicationName)
   on error errorStr number errNum
      -- (Removed "set my isRunning to false": no isRunning property is defined in this script.)
      display dialog "Error " & errNum & " while running OCR: " & errorStr
   end try
   -- Get rid of the lockfile, ignoring any errors
   try
      do shell script "/bin/rm \"" & lockFilePath & "\""
   end try
end adding folder items to
(*
Name: ocrFile
Description: Runs OCR on the next un-OCR'd file
Parameters:
  aFile - the file to be OCR'd
*)
on ocrFile(aFile)
   set posixFilePath to POSIX path of aFile
   set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
   logEvent("OCR: " & posixFilePath)
   tell application ocrApplicationName to open aFile
   --
   -- Now sit in a loop checking once per second for the OCR file
   -- Give up after five minutes
   --
   -- ("with timeout" only limits Apple events, so the wait is counted by hand.)
   set secondsWaited to 0
   set ocrFileExists to false
   repeat until ocrFileExists
      set ocrFileExists to posixFileExists(posixOcrFilePath)
      if ocrFileExists then
         logEvent("OCR file generated.")
         -- Wait 5 even if the file was found, to let things settle
         delay 5
      else if secondsWaited >= 300 then
         error "Timed out waiting for OCR file: " & posixOcrFilePath
      else
         -- Wait a second before checking again
         delay 1
         set secondsWaited to secondsWaited + 1
      end if
   end repeat
end ocrFile
(*
Name: appIsRunning
Description: Determines if a particular application is running.
Parameters:
   appName - the name of the application to be tested
Returns: True if the application is running; otherwise False
*)
on appIsRunning(appName)
   tell application "System Events" to (name of processes) contains appName
end appIsRunning
(*
Name: posixFileExists
Description: Determines if a particular file exists.
Parameters:
   posixFilePath - the POSIX path to the file
Returns: True if the file exists; otherwise False
*)
on posixFileExists(posixFilePath)
   tell application "System Events" to exists file posixFilePath
end posixFileExists
(*
Name: exitApp
Description: Exits the specified app if it is running.
Parameters:
   appName - the application name
*)
on exitApp(appName)
   if appIsRunning(appName) then
      tell application appName to quit
   end if
end exitApp
(*
Name: getPosixOcrFilePath
Description: Gets the OCR output filename for a given input filename.
Parameters:
   posixFilePath - the full path to the source file
Return: the POSIX path of the OCR output file
*)
on getPosixOcrFilePath(posixFilePath)
   set posixBaseName to do shell script ¬
      "filename=" & quoted form of posixFilePath & "; echo ${filename%\\.*}"
   set posixOcrFilePath to posixBaseName & ocrFileSuffix
   return posixOcrFilePath
end getPosixOcrFilePath
(*
Name: getNextFile
Description: Finds the next unprocessed ScanSnap PDF
Return: the file or ""
*)
on getNextFile(aFolder)
   logEvent("Getting next file...")
   set masterFileList to list folder aFolder ¬
      without invisibles
   set posixPath to POSIX path of aFolder
   repeat with i from 1 to count masterFileList
      set fileName to item i of masterFileList
      set posixFilePath to posixPath & fileName
      log posixFilePath
      --
      -- Construct a FineReader file name from our file
      --
      set posixOcrFilePath to getPosixOcrFilePath(posixFilePath)
      --
      -- See if the FineReader file we constructed exists
      --
      set ocrFileExists to posixFileExists(posixOcrFilePath)
      tell me to set fileCreator to getSpotlightInfo for "kMDItemCreator" from posixFilePath
      log ("Creator: " & fileCreator)
      if not ocrFileExists and fileCreator = "ScanSnap Manager" then
         return POSIX file posixFilePath
      end if
   end repeat
   return ""
end getNextFile
(*
Name: getSpotlightInfo
Description: Gets a named attribute from metadata for a specific file.
Parameters:
   for myattribute - the name of the attribute
   from myfile - the name of the file
Returns: the attribute value or "" if none found
*)
on getSpotlightInfo for myattribute from myfile
   try
      set this_kMDItemResult to ""
      
      tell application "Finder"
         set this_item to myfile as string
         set this_item to POSIX path of this_item
         set this_kMDItem to myattribute
         set theResult to words of (do shell script "/usr/bin/mdls -name " & this_kMDItem & " -raw -nullMarker None " & quoted form of this_item)
         log "Result: " & theResult as string
         repeat with j from 1 to number of items in theResult
            set this_kMDItemResult to this_kMDItemResult & item j of theResult as string
            if j < number of items in theResult then
               set this_kMDItemResult to this_kMDItemResult & " "
            end if
         end repeat
      end tell
   on error
      set this_kMDItemResult to ""
   end try
   return this_kMDItemResult
end getSpotlightInfo
(*
Name: logEvent
Description: Write an event to an event log
Parameters:
   themessage - the message to write to the log
*)
on logEvent(themessage)
   set theLine to (do shell script ¬
      "date  +'%Y-%m-%d %H:%M:%S'" as string) ¬
      & " " & themessage
   -- Quote the line so spaces and shell metacharacters in the message are safe
   do shell script "echo " & quoted form of theLine & ¬
      " >> ~/Library/Logs/AppleScript-events.log"
end logEvent
hansvanderdrift
 
Posts: 13
Joined: Wed Jul 31, 2013 6:55 am

Yes, you'd need to modify that script to work with Hazel. Hazel expects a particular handler and sends an argument. Check the help for AppleScript for more details.
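
As a rough starting point, an external script file for Hazel could look something like the sketch below. The handler name hazelProcessFile is the one Hazel's help describes for external scripts; everything about ABBYY here is an assumption. The application name is a placeholder, and I am assuming that simply opening a PDF in the app is enough to start the conversion, which may not be true for your version.

```applescript
-- Minimal sketch of an external script for a Hazel "Run AppleScript" action.
-- Hazel calls this handler and passes the matched file as theFile.
-- Newer Hazel versions may pass a second argument; check Hazel's help.
on hazelProcessFile(theFile)
	-- ASSUMPTION: placeholder name; use the exact name of the ABBYY
	-- application installed on your Mac.
	set ocrApplicationName to "ABBYY FineReader Express"
	-- ASSUMPTION: opening the PDF is enough to start the conversion.
	-- If it is not, app-specific commands would have to go here.
	tell application ocrApplicationName to open theFile
end hazelProcessFile
```

If the script is pasted directly into the Hazel rule instead of being saved as a file, the handler wrapper is not needed and theFile is available as-is, which is how the PDFpenPro script above uses it.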
Mr_Noodle
Site Admin
 
Posts: 11865
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

