Recognising OCR'd text in a PDF

Posted: Sat Mar 19, 2011 7:41 am
by Lusule
I'm trying to set up rules based on the contents of documents I've scanned to PDF and OCR'd. For the most part, these rules have worked admirably.

However, I have a bunch of files that Hazel is not processing for some reason. I have run Spotlight searches and Preview searches on the text I want to base the rule on (a very distinctive account number), and both Spotlight and Preview find the correct text in every single document. Nevertheless, Hazel seems to have decided that the rule does not apply to these documents. It has applied the rule just fine to a couple of similar documents (monthly bills).

My rule is simply: if all of the following conditions are met:

Contents contain <account no.>
Kind is PDF

Then do the following

<process>

I can't work out why Hazel isn't recognising the required text in these documents; does anyone have any idea, or know how best to troubleshoot this?

Thanks

Re: Recognising OCR'd text in a PDF

Posted: Wed Mar 23, 2011 2:34 pm
by Mr_Noodle
Have you checked the preview? Make sure it's not matching some other rule before it gets to this one.

I suggest emailing me at the support address and sending me your rule and one of the files in question. I'll see if I can replicate it here.

Re: Recognising OCR'd text in a PDF

Posted: Wed May 30, 2012 3:35 pm
by andief
Did you resolve this? I'm hitting the same problem: I can see the PDF has been correctly OCR'd, the text is embedded (I can copy and paste it all into TextEdit), I can search it in Preview and Acrobat and find the exact text in my rule, and I can enter the same text in Spotlight and the file appears in the list. Hazel's preview shows the file isn't matching any rules, including the one it should match, which I've even moved to the top of the list.

I have quite a lot of rules on the same folder, as it's where I scan all my letters to. Lots of other similar rules run successfully, with just a couple that don't. I've tried to find the cause for quite a while and just don't get it.

The criteria are the same as in many of my other rules:
All conditions met for the file/folder being matched
Kind is PDF
Contents contain <my text>

Re: Recognising OCR'd text in a PDF

Posted: Thu May 31, 2012 4:45 pm
by Mr_Noodle
The original post was a long time ago and I can't say whether your issue is the same. As I asked the other poster, please email support with the supporting documentation (rules, logs, etc.) so I can analyze it further.

Re: Recognising OCR'd text in a PDF

Posted: Tue Jun 19, 2012 10:55 am
by dani_w
Most PDF documents embed their text in a compressed format, so Hazel might not be able to process such files correctly unless the text is decompressed first.

As a workaround, you might use AppleScript and tools such as "pdftotext", "Adobe Acrobat Standard/Professional", or "Skim" to retrieve the text and process it further. (Whichever tool you use needs to be installed on your Mac.)


A possible AppleScript implementation using pdftotext could look as follows (searching for 'Hazel'):
Code:
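-- quote the path once so spaces and special characters survive the shell
set itemPath to quoted form of (POSIX path of theFile)
-- use grep -w 'Hazel' for word matching
-- "|| true" stops do shell script from raising an error when grep finds nothing
set res to do shell script "/usr/local/bin/pdftotext " & itemPath & " - | grep 'Hazel' || true"
return res is not equal to ""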
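
The same check could also be written as a plain shell script, for example for a "Passes shell script" condition. This is only a sketch under assumptions that are not from this thread: that Hazel hands the file path to the script as $1 and treats exit status 0 as a match, and that pdftotext lives at /usr/local/bin/pdftotext. The names PDFTOTEXT, NEEDLE, text_contains and pdf_contains are made up for illustration.

```shell
#!/bin/sh
# Sketch only. Assumptions: the file path arrives as $1, exit status 0 means
# "match", and pdftotext is installed at the path below.
PDFTOTEXT="${PDFTOTEXT:-/usr/local/bin/pdftotext}"
NEEDLE="${NEEDLE:-Hazel}"   # placeholder; put your own search text here

# Succeeds if the text on stdin contains the fixed string $1.
text_contains() {
    grep -qF -- "$1"
}

# Succeeds if the text layer of the PDF at $1 contains the fixed string $2.
pdf_contains() {
    "$PDFTOTEXT" "$1" - 2>/dev/null | text_contains "$2"
}

# Only run the check when a file argument was supplied.
if [ "$#" -ge 1 ]; then
    pdf_contains "$1" "$NEEDLE"
    exit $?
fi
```

Running pdftotext on one of the problem files by hand and reading its output is also a quick way to see what text is actually embedded in the PDF.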


I hope this helps.

Re: Recognising OCR'd text in a PDF

Posted: Sun Aug 05, 2012 12:55 pm
by bilke
I am having the same problem. I have an OCR'd PDF (I tried both the OCR from the Doxie Go scanner and the OCR from PDFpen) and I want to match it with a Hazel "Contents" "contain" rule. I have only one rule for the folder and I am using the latest version of Hazel. How can I debug this further?

Thanks!

Re: Recognising OCR'd text in a PDF

Posted: Sun Aug 05, 2012 8:40 pm
by flynn
I hope a fix is found for this, as I was planning on buying a Doxie Go scanner to scan all the papers from my classes and then have Hazel sort them into subfolders according to name and date. I might have to reconsider if Hazel is not able to do the job.

Re: Recognising OCR'd text in a PDF

Posted: Mon Aug 06, 2012 4:01 pm
by Mr_Noodle
Are the files appearing in Spotlight when you search for the text? If not, then it's an issue outside of Hazel. You need to use proper OCR that will index the document in Spotlight properly.
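
The Spotlight question can also be answered from Terminal with macOS's mdfind. The following is a rough sketch, not an official procedure: the function names are invented for illustration, the search text is assumed to contain no double quotes, backslashes or asterisks, and the file is assumed to be given as an absolute path in the same form mdfind prints it.

```shell
#!/bin/sh
# Sketch only: ask Spotlight whether it finds a given file by its text content.

# Print a Spotlight query for items whose text content contains $1.
# The trailing "c" makes the comparison case-insensitive.
spotlight_query() {
    printf 'kMDItemTextContent == "*%s*"c' "$1"
}

# Succeeds if Spotlight lists file $1 when its folder is searched for text $2.
spotlight_finds() {
    mdfind -onlyin "$(dirname "$1")" "$(spotlight_query "$2")" | grep -qxF -- "$1"
}

# Usage: script.sh /absolute/path/to/file.pdf "search text"
if [ "$#" -ge 2 ]; then
    if spotlight_finds "$1" "$2"; then
        echo "Spotlight finds this file for that text"
    else
        echo "Spotlight does not find this file for that text"
    fi
fi
```

If this reports no match for a file that Preview can search, the text has probably not been indexed by Spotlight.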

Otherwise, as noted above, you need to email support with evidence of the problem as I can't fix anything if I don't have proof that it's broken in the first place.

Re: Recognising OCR'd text in a PDF

Posted: Tue Aug 07, 2012 1:34 pm
by a_freyer