Find PDFs with/without text content

Hello,

I'm new to Hazel and not a native English speaker.

I have lots of scanned documents here (manuals, receipts, invoices, contracts, etc.), which were saved as PDFs right after scanning.
My OCR engine at the time did not reliably convert the PDFs to text PDFs, so there are still lots of unprocessed image PDFs.
My problem is that I can't tell from the "outside" whether a file is an image PDF or a text PDF.
I would have to open each PDF and try to highlight text, for about 3000 PDFs.

Here are my questions:

1) How do I teach Hazel what to look for in the PDF content so that it can clearly distinguish image PDFs from text PDFs?

2) How do I then search for image PDFs so that I can move them to a separate directory, where an OCR tool can subsequently convert them all to text PDFs?

Search the forums here, as I believe someone came up with a script to do this. Note that it's not straightforward: AFAIK there is no standard for this, so inexact heuristics have to be used.
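For illustration, here is a minimal sketch of one such inexact heuristic. It is not the forum script mentioned above. It assumes Poppler's `pdftotext` command-line tool is installed, and the function names and the `min_chars` threshold are made up for this example. It extracts whatever text layer a PDF has, counts letters and digits, and moves files that fall below the threshold into a separate folder.

```python
import shutil
import subprocess
from pathlib import Path


def extract_text(pdf_path: Path) -> str:
    """Return the text layer of a PDF via Poppler's pdftotext (assumed installed)."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", str(pdf_path), "-"],
        capture_output=True,
        text=True,
        encoding="utf-8",
        errors="replace",
        check=False,
    )
    return result.stdout


def looks_like_image_pdf(text: str, min_chars: int = 20) -> bool:
    """Heuristic: fewer than min_chars letters/digits means 'probably no text layer'.

    min_chars is an arbitrary threshold chosen for this sketch.
    """
    alnum = sum(1 for ch in text if ch.isalnum())
    return alnum < min_chars


def move_image_pdfs(source: Path, target: Path) -> list[Path]:
    """Move PDFs that appear to lack a text layer from source into target."""
    target.mkdir(parents=True, exist_ok=True)
    moved = []
    for pdf in sorted(source.glob("*.pdf")):
        if looks_like_image_pdf(extract_text(pdf)):
            destination = target / pdf.name
            shutil.move(str(pdf), str(destination))
            moved.append(destination)
    return moved
```

Something like `move_image_pdfs(Path("Scans"), Path("Scans/needs-ocr"))` would then collect the candidates for OCR. A threshold rather than a plain "is empty" test is used because scanned PDFs sometimes carry a few stray characters; it can still misclassify very short documents.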

truecolor wrote: [...] how do I search for image PDFs so that I can move them to a separate directory so that an OCR can subsequently convert all image PDFs to text PDF?

Did you figure out how to do this?

I can't figure out how to post a picture here, so here is what I did: I created a rule that I titled "OCR If PDF does not contain a LETTER or NUMBER".

If "ALL" of the conditions are met
Kind - Is - PDF

Then you need to add a nested condition by holding down the + button. The nested condition would then read:
If "ALL" of the current conditions are met for "The Current File or Folder"
Contents - Does not Contain - a
Contents - Does not Contain - e
Contents - Does not Contain - i
Contents - Does not Contain - o
Contents - Does not Contain - u
Contents - Does not Contain - 1
Contents - Does not Contain - 2
Contents - Does not Contain - 3
Contents - Does not Contain - 4
Contents - Does not Contain - 5
Contents - Does not Contain - 6
Contents - Does not Contain - 7
Contents - Does not Contain - 8
Contents - Does not Contain - 9
Contents - Does not Contain - 0

Do the following to the matched file or folder

Open - with application - (Insert your OCR application)
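To make the logic of the rule above explicit, here is a small Python sketch of the same test, applied to text that has already been extracted from a PDF. The names `RULE_CHARS` and `rule_matches` are hypothetical, and the case-insensitive comparison is an assumption about how the contents match behaves, not something confirmed in this thread.

```python
# Characters the rule checks for: the five vowels and the ten digits.
RULE_CHARS = "aeiou0123456789"


def rule_matches(contents: str) -> bool:
    """Return True when the contents contain none of the rule's characters.

    Assumption: matching is case-insensitive, so the text is lowercased first.
    """
    lowered = contents.lower()
    return not any(ch in lowered for ch in RULE_CHARS)
```

A PDF whose extracted text is empty or only whitespace matches and would be sent to OCR. Note that this is a heuristic: a text PDF that happens to contain no vowels and no digits would also match, although that is unlikely for real documents.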

Mr. Mohon wrote: I can't figure out how to post a picture here, so here is what I did: I created a rule that I titled "OCR If PDF does not contain a LETTER or NUMBER".

Great idea!

I modified this to make it (perhaps) a little bit more stable and readable:

Works like a charm.

Thanks a lot.

PS: Is there a way to upload screenshots (e.g. a phpBB image upload extension)?

CONSULitAS wrote:
Mr. Mohon wrote: PS: Is there a way to upload screenshots (e.g. a phpBB image upload extension)?

You need to use a third-party file/image hosting service. imgur.com is a free and readily available one.