Noodlesoft Forums

Posted: **Thu Sep 13, 2018 5:30 pm**

What is the best way to evaluate a large number of PDF files and see which have already undergone OCR?

I use Hazel to process PDF files and OCR them, but I do not want to run them through OCR twice.

Thank you.

Posted: **Fri Sep 14, 2018 11:07 am**

There's no standard way of detecting that, but search the forums: I believe someone has come up with a script that checks whether any fonts are used, which is usually a good indication.

Posted: **Mon Oct 01, 2018 8:27 pm**

I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", and if a PDF has been OCR'd, then a Finder contents search will find an "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. In Finder, I first search for Kind is PDF, then click + to add a Contents search for "a".

I have a Hazel folder - called ScanThis - to scan using PDFpen and colour files green and purple, so I know which files have a text layer. Katie Floyd has a script for this.

Posted: **Tue Oct 02, 2018 11:01 am**

Possibly. Not sure how reliable that is. The scripts others are using look for a font directive, which usually indicates text content.

Posted: **Wed Aug 21, 2019 2:30 pm**

SimonBrowneNZ wrote: I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", and if a PDF has been OCR'd, then a Finder contents search will find an "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. In Finder, I first search for Kind is PDF, then click + to add a Contents search for "a".

I have a Hazel folder - called ScanThis - to scan using PDFpen and colour files green and purple, so I know which files have a text layer. Katie Floyd has a script for this.

Could you please post the code for how you perform the above check? Thanks!

Posted: **Mon Aug 26, 2019 12:23 pm**

I use a program called ‘pdftotext’ from https://www.bluem.net/files/pdftotext.dmg

It is abandonware (no longer being developed) but it works for now, and if it stops working in the future, I’ll deal with it then.

The use is very simple:

```
pdftotext file.pdf file.txt
```

then check to see if ‘file.txt’ is just a newline. If it is, that means it wasn’t able to find any text in ‘file.pdf’.

I’ve yet to find a more reliable test.
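A minimal sketch of that check as a pair of shell functions, assuming a `pdftotext` binary is on your PATH. The function names `text_file_has_content` and `pdf_has_text` are made up for illustration. It treats output consisting only of whitespace (newlines, form feeds) as "no text", which is slightly broader than testing for a single newline.

```shell
#!/bin/bash

# Hypothetical helper: succeeds (exit 0) if the file contains at least
# one non-whitespace character.
text_file_has_content() {
    grep -q '[^[:space:]]' "$1"
}

# Hypothetical wrapper: extracts text from the PDF given as $1 into a
# temporary file and reports whether anything was found.
# Returns 0 if text was found, 1 if not, 2 if extraction itself failed.
pdf_has_text() {
    local tmp status
    tmp=$(mktemp) || return 2
    if ! pdftotext "$1" "$tmp" 2>/dev/null; then
        rm -f "$tmp"
        return 2
    fi
    text_file_has_content "$tmp"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

After sourcing the file, something like `pdf_has_text scan.pdf || echo "needs OCR"` would flag a file with no extractable text. Note that this also prints the message when extraction fails (status 2), so check the exit status explicitly if you need to tell the two cases apart.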

Posted: **Tue Sep 03, 2019 11:50 am**

You may also try Grooper to tackle this particular problem: https://www.bisok.com/grooper-data-capt ... -pass-ocr/ . I have used it for this job several times and have been pretty happy with the results. Good luck!

Posted: **Mon Jul 13, 2020 3:57 pm**

Steelersfan, thanks, that works!

Posted: **Wed Sep 09, 2020 8:42 am**

Hi, here is my Hazel rule:

```
If all of the following conditions are met
    Kind is PDF
    Contents do not contain match a
```

```
Do the following to the matched file or folder:
    Run AppleScript embedded script
    Toggle extension visible
    Rename with pattern: name extension
    Continue matching rules
```

Embedded script:

```applescript
tell application "PDFpenPro"
	open theFile as alias
	tell document 1
		ocr
		repeat while performing ocr
			delay 1
		end repeat
		delay 1
		close with saving
	end tell
	quit
end tell
```

gacorrea wrote:
SimonBrowneNZ wrote: I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", and if a PDF has been OCR'd, then a Finder contents search will find an "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. In Finder, I first search for Kind is PDF, then click + to add a Contents search for "a".

I have a Hazel folder - called ScanThis - to scan using PDFpen and colour files green and purple, so I know which files have a text layer. Katie Floyd has a script for this.

Could you please post the code for how you perform the above check? Thanks!

Posted: **Mon Sep 14, 2020 1:11 pm**

I have had success with the following shell script for determining whether a PDF has been OCR'd. It is pretty fast:

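```shell
#!/bin/bash
# Exit 0 when the string "Font" is absent from the PDF (likely no text
# layer yet), and 1 when it is present. -q suppresses grep's output.
if ! grep -q Font "$1"
then
    exit 0
else
    exit 1
fi
```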
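For the original question (checking a large number of files at once), the same heuristic can be looped over a folder. This is only a sketch: the function names `pdf_mentions_font` and `list_pdfs_without_fonts` are made up for illustration, and it inherits the limits of the grep approach, so a PDF whose font entries sit inside compressed object streams may be reported as having no fonts.

```shell
#!/bin/bash

# Hypothetical helper: succeeds if the raw bytes of the file contain the
# string "Font". -a makes grep treat the binary PDF as text, -q silences it.
pdf_mentions_font() {
    grep -aq 'Font' "$1"
}

# Hypothetical batch check: prints every *.pdf directly inside the given
# folder that does not mention "Font", i.e. the likely OCR candidates.
list_pdfs_without_fonts() {
    local dir=$1 pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip the literal pattern if nothing matched
        pdf_mentions_font "$pdf" || printf '%s\n' "$pdf"
    done
}
```

After sourcing the file, `list_pdfs_without_fonts ~/Scans` (the folder name is just an example) would print one path per line, which can then be handed to whatever does the OCR.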

OCR - determine if pdf has text layer
