find PDF with/without text-content


Moderator: Mr_Noodle

find PDF with/without text-content Wed Dec 16, 2020 1:25 pm • by truecolor
Hello,

I'm new to Hazel and not a native English speaker.

I have lots of scanned documents here (manuals, receipts, invoices, contracts, etc.), which were saved as PDFs right after scanning.
My OCR engine at the time did not reliably convert the PDFs to text PDFs, so there are still lots of unprocessed image PDFs.
My problem is that I can't tell from the "outside" whether a file is an image PDF or a text PDF.
I would have to open each PDF and try to highlight text, and there are about 3000 PDFs.

Here are my questions:

1) How do I teach Hazel what to look for in the PDF content so that it can clearly distinguish image PDFs from text PDFs?

2) How do I then find the image PDFs and move them to a separate directory, so that an OCR tool can subsequently convert them all to text PDFs?
truecolor
 
Posts: 2
Joined: Wed Dec 16, 2020 12:24 pm

Re: find PDF with/without text-content Thu Dec 17, 2020 11:03 am • by Mr_Noodle
Search the forums here, as I believe someone came up with a script to do this. Note that it's not straightforward: AFAIK, there is no standard for this, so inexact heuristics have to be used.
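
To illustrate what an inexact heuristic can look like, here is a minimal Python sketch. It is not the forum script mentioned above. It assumes that a PDF with a text layer mentions a /Font resource somewhere in its raw bytes, while a pure scan often does not. The function names are made up for this example, and the check can misfire: PDFs that store their objects in compressed object streams may hide the /Font token, and an image-only file can still declare an unused font.

```python
from pathlib import Path


def has_font_marker(pdf_bytes: bytes) -> bool:
    """Return True if the raw PDF data mentions a /Font resource anywhere."""
    # Assumption: text drawn on a page needs a font resource, so the
    # token "/Font" shows up in the file unless it is hidden inside a
    # compressed object stream.
    return b"/Font" in pdf_bytes


def looks_image_only(path: str) -> bool:
    """Guess whether the PDF at `path` is a scan without a text layer."""
    return not has_font_marker(Path(path).read_bytes())
```

A file flagged by looks_image_only is only a candidate for OCR; opening a few of the flagged files by hand is a cheap way to see how well the guess holds for a given scanner.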
Mr_Noodle
Site Admin
 
Posts: 11235
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: find PDF with/without text-content Sat Apr 17, 2021 12:32 am • by Mr. Mohon
Did you figure out how to do this?
Mr. Mohon
 
Posts: 8
Joined: Wed Apr 09, 2014 1:17 am

Re: find PDF with/without text-content Sat Apr 17, 2021 1:06 am • by Mr. Mohon
I can't figure out how to post a picture here, so here is what I did, described in text. I created a rule that I titled "OCR If PDF does not contain a LETTER or NUMBER".

If "ALL" of the conditions are met
Kind - Is - PDF

Then add a nested condition by holding down the + button. The nested condition would then read:
If "ALL" of the current conditions are met for "The Current File or Folder"
Contents - Does not Contain - a
Contents - Does not Contain - e
Contents - Does not Contain - i
Contents - Does not Contain - o
Contents - Does not Contain - u
Contents - Does not Contain - 1
Contents - Does not Contain - 2
Contents - Does not Contain - 3
Contents - Does not Contain - 4
Contents - Does not Contain - 5
Contents - Does not Contain - 6
Contents - Does not Contain - 7
Contents - Does not Contain - 8
Contents - Does not Contain - 9
Contents - Does not Contain - 0

Do the following to the matched file or folder

Open - with application - (Insert your OCR application)
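
For anyone who wants to try the same logic outside Hazel, the nested condition above can be sketched in Python roughly as follows. This is only an approximation under stated assumptions: the text has already been extracted from the PDF by some other tool, the comparison ignores case, and the name needs_ocr is made up for this example.

```python
# Characters the rule probes for: the five vowels and the ten digits.
PROBE_CHARS = set("aeiou0123456789")


def needs_ocr(extracted_text: str) -> bool:
    """Return True when none of the probe characters occur in the text.

    Mirrors the "Contents - Does not Contain" list: the file is treated
    as an image PDF only if every single probe character is absent.
    """
    lowered = extracted_text.lower()  # assumption: matching ignores case
    return not any(ch in PROBE_CHARS for ch in lowered)
```

The check is deliberately crude. An empty text layer is flagged for OCR, but so is a text that happens to contain no vowels and no digits, so the result is a heuristic, not a guarantee.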
Mr. Mohon
 
Posts: 8
Joined: Wed Apr 09, 2014 1:17 am

Re: find PDF with/without text-content Mon Jan 17, 2022 11:52 am • by CONSULitAS
Mr. Mohon wrote:I created a rule that I titled "OCR If PDF does not contain a LETTER or NUMBER".


Great idea!

I modified this to make it (perhaps) a little bit more stable and readable:

    * Contents do not contain match abc (Word)
    * Contents do not contain match 123 (Number)
    * Contents do not contain match ab12 (Letters & Digits)

Works like a charm.
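
As a rough illustration of why three pattern checks can replace fifteen single-character checks, the same idea can be sketched in Python with regular expressions. The regexes only approximate Hazel's Word, Number and Letters & Digits tokens (Hazel's exact matching rules are not reproduced here), the text is assumed to be already extracted from the PDF, and the name is_image_only is made up for this example.

```python
import re

# Approximations of the three Hazel match tokens (an assumption, not
# Hazel's actual definitions).
WORD = re.compile(r"[^\W\d_]+")            # a run of letters
NUMBER = re.compile(r"\d+")                # a run of digits
LETTERS_AND_DIGITS = re.compile(r"[^\W_]+")  # a run of letters and/or digits


def is_image_only(extracted_text: str) -> bool:
    """Return True when none of the three patterns match anywhere."""
    patterns = (WORD, NUMBER, LETTERS_AND_DIGITS)
    return not any(p.search(extracted_text) for p in patterns)
```

Unlike a vowel-and-digit list, this version also recognises text that contains no vowels at all, which is one way to read "more stable".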

Thanks a lot. :)

PS: Is there a way to upload screenshots (e.g. with the phpBB Image Upload Extension)?
CONSULitAS
 
Posts: 8
Joined: Wed Feb 12, 2020 4:16 pm

Re: find PDF with/without text-content Mon Jan 17, 2022 1:33 pm • by Mr_Noodle
CONSULitAS wrote:PS: Is there a way to upload screenshots (eg phpBB Image Upload Extension)?


You need to use a third-party file/image hosting service. imgur.com is a free and readily available one.
Mr_Noodle
Site Admin
 
Posts: 11235
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

