DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Wed Apr 04, 2018 10:29 am
by pav
So I am using Hazel to move scanned and OCR'ed PDF documents into a specific database within Devonthink. But there is a major flaw, and I really don't know how to overcome it.

After scanning has finished, the PDF is placed in the folder watched by Hazel. Meanwhile, the OCR process begins, using ScanSnap's own "Searchable PDF Converter", but it cannot complete because Hazel has already moved the unfinished PDF to Devonthink. I know that I can set up additional static time rules in Hazel ("move the file after xx seconds/minutes"). But the processing time depends on how many pages have to be processed. So a fixed delay might work well for documents with few pages, but what about a stack of 30 pages in one session?

Have a look at my rules and script so far.

The AppleScript tells Devonthink to import the file into the inbox of the database "__ SCANSNAP":
Code:
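-- theFile is supplied by Hazel: it is the file that matched the rule
tell application id "DNtp"
   -- look up the Inbox group of the "__ SCANSNAP" database
   set img_ib to get record at "/Inbox" in database "__ SCANSNAP"
   import theFile to img_ib
end tell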



My suggestion:

Additional code that prevents any import into Devonthink until the OCR process run by the ScanSnap app called "SearchablePDFConverter" is done. However, I am NOT a scripting pro.
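One way to sketch this idea is a check that Hazel could run as a "Passes shell script" condition in front of the import action. The following is only a sketch under several assumptions: that Hazel hands the file path to the script as `$1` and treats exit status 0 as a match, that the OCR helper shows up as a process named exactly `SearchablePDFConverter` (the name is taken from the post above and has not been verified), and that two minutes without modification is a reasonable quiet period. The function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: decide whether a scanned PDF looks safe to import.
# Assumptions: the process name and the quiet period below are guesses.
CONVERTER_NAME="SearchablePDFConverter"  # assumed process name of the OCR helper
QUIET_MINUTES=2                          # assumed quiet period, adjust to taste

# Succeeds if a process with exactly this name is currently running.
converter_running() {
    pgrep -x "$1" >/dev/null 2>&1
}

# Succeeds if the file was last modified more than $2 minutes ago.
file_is_quiet() {
    [ -n "$(find "$1" -prune -mmin +"$2" 2>/dev/null)" ]
}

# Succeeds only if the converter is not running and the file is quiet.
ready_for_import() {
    ! converter_running "$CONVERTER_NAME" && file_is_quiet "$1" "$QUIET_MINUTES"
}

# When called with a file path, report the result through the exit status.
if [ -n "${1:-}" ]; then
    ready_for_import "$1"
    exit $?
fi
```

With a condition like this, the rule would simply not match while the converter is still busy, and Hazel would get another chance on a later pass over the folder. Both checks are heuristics: a converter that keeps running between jobs, or one that pauses for longer than the quiet period, would defeat them.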

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Wed Apr 04, 2018 11:05 am
by Mr_Noodle
I'm not clear on this extra conversion step. Can you not move the file into the folder until after that is done?

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Wed Apr 04, 2018 11:27 am
by pav
The only reason for all this is that, since Sierra, Preview/macOS prevents subsequent modification of files within Devonthink's Global Inbox database. Instead, it creates a copy of the file, which has to be saved again. This behaviour appears only within the macOS system path environment, and since the Global Inbox database is located there, I had to create a separate database outside of that path.
In addition, I don't like Devonthink's internal OCR process. Compared to the ScanSnap solution, it creates larger PDF files and takes significantly longer.

I really need help in this regard.

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Wed Apr 04, 2018 4:24 pm
by NaOH
Different OCR software often changes file metadata. DevonThink, for example, changes the Content Creator field to DevonThink Pro Office after running OCR on a PDF. On the other hand, Doxie scanning software doesn't apply any metadata that's associated with the Doxie software when doing the same.

So you may be able to have Hazel look for ScanSnap-specific metadata that may only be present after the OCR process is complete. An easy way to check for ScanSnap metadata is to open a PDF in Preview, show the inspector (Tools > Show Inspector), then review the General Info tab. If there's a ScanSnap metadata association, you could add that criterion to a Hazel rule, and that may provide the delay you need for the OCR process to complete.

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Thu Apr 05, 2018 8:19 am
by pav
Thanks for the advice! I have compared both files (before and after the OCR process); both have the same metadata, as far as I can tell using Preview. Is it possible to check the metadata in more detail, for example with a terminal command (through poppler)?

Edit: In addition, I have used the command line tool "pdfimages", which is part of poppler. There is no difference in metadata!

Any other advice is welcome!

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Thu Apr 05, 2018 11:53 am
by Mr_Noodle
Is there any sort of ability to set a tag or comment on the file after the ScanSnap stage? If so, you can have Hazel only process after that is set.

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Thu Apr 05, 2018 4:31 pm
by pav
Well, I have decided to learn some AppleScript coding. My aim is to implement some code that handles the OCR process.

Something like:

tell the application to process the OCR job -> verify the job is done -> if so, start importing the file into Devonthink.


Let's hope for the best.
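
The "verify the job is done" step of that plan could be approximated without talking to the OCR application at all, by polling the file until its size stops changing. The sketch below is written in shell rather than AppleScript, and everything in it is an assumption: the function name `wait_until_stable` is made up, and an unchanged file size between two consecutive checks is only a rough stand-in for "the OCR job has finished".

```shell
#!/bin/sh
# Sketch only: wait until a file's size is the same on two consecutive checks.
# Usage: wait_until_stable FILE INTERVAL_SECONDS MAX_CHECKS
# Returns 0 once the size is stable, 1 if the file is missing or the
# checks run out first.
wait_until_stable() {
    file=$1
    interval=$2
    max_checks=$3
    last_size=-1
    checks=0
    while [ "$checks" -lt "$max_checks" ]; do
        [ -f "$file" ] || return 1
        # Current size in bytes (tr strips the padding some wc versions add).
        size=$(wc -c < "$file" | tr -d ' ')
        if [ "$size" = "$last_size" ]; then
            return 0
        fi
        last_size=$size
        checks=$((checks + 1))
        sleep "$interval"
    done
    return 1
}
```

A wrapper could call, for example, `wait_until_stable "$1" 30 20` and only hand the file to the DevonThink import script when the call succeeds. The interval has to be longer than any pause the converter makes while it rewrites the PDF, otherwise the check reports success too early.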

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Thu Apr 05, 2018 5:06 pm
by NaOH
pav wrote:Is it possible to check the metadata in more detail, for example by using a terminal command (through poppler) ?


In Terminal, enter

Code:
mdls


followed by a space, then drag a file to the Terminal window and press Return. I tested this on a Doxie-scanned (and OCR'd) receipt, and it showed the Creator metadata and the PDF Producer metadata associated with Doxie.

It may be worth doing a test on your end, whereby you check the metadata of a PDF after scanning but before the OCR process (if the ScanSnap software process makes that possible). If you can do that, you may see some metadata change when looking at scanned versus post-OCR.
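
The before/after comparison described here can be scripted so that only the differing lines are shown. This is a minimal sketch: the function name `metadata_differs` and the `META_CMD` variable are invented for illustration, `mdls` is used as the default dump command because it is the one suggested above, and another dump tool such as poppler's `pdfinfo` could be substituted by setting `META_CMD`.

```shell
#!/bin/sh
# Sketch only: dump the metadata of two files and show where the dumps differ.
# META_CMD is the command used to dump metadata (default: mdls, macOS only).
META_CMD=${META_CMD:-mdls}

# Usage: metadata_differs BEFORE.pdf AFTER.pdf
# Prints the differing lines and returns 0 if the two dumps differ.
metadata_differs() {
    before_dump=$(mktemp)
    after_dump=$(mktemp)
    $META_CMD "$1" > "$before_dump" 2>/dev/null
    $META_CMD "$2" > "$after_dump" 2>/dev/null
    diff "$before_dump" "$after_dump"
    status=$?
    rm -f "$before_dump" "$after_dump"
    # diff exits with 1 when the inputs differ.
    [ "$status" -eq 1 ]
}

# When called with two file paths, run the comparison directly.
if [ "$#" -eq 2 ]; then
    metadata_differs "$1" "$2"
fi
```

With `mdls`, two different files will almost always differ in a few attributes such as names, sizes and dates, so the lines worth reading are the ones about the creating or encoding application rather than the mere fact that a difference exists.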

Re: DevonThink & ScanSnap with Hazel. Rule & Script problem

Posted: Thu Apr 05, 2018 5:50 pm
by pav

Thanks once again. I had already done that comparison, and mdls shows the same result: the metadata of both files is identical.