OCR - determine if pdf has text layer


OCR - determine if pdf has text layer Thu Sep 13, 2018 5:30 pm • by t2clej
What is the best way to evaluate a large number of pdf files and see which have already undergone OCR?

I use Hazel to process PDF files and OCR them, but I do not want to run them through OCR twice.

Thank you.
t2clej
 
Posts: 5
Joined: Sun Jan 28, 2018 6:42 pm

Re: OCR - determine if pdf has text layer Fri Sep 14, 2018 11:07 am • by Mr_Noodle
There's no standard way to detect that, but search the forums, as I believe someone has come up with a script that checks whether any fonts are used, which is usually a good indication.
Mr_Noodle
Site Admin
 
Posts: 11195
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", so if a PDF has been OCR'd, a Finder contents search will find that "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. In Finder, I search for Kind is PDF first, then click + to add a Contents search for "a".

I have a Hazel folder, called ScanThis, that scans files using PDFpen and colours them green and purple, so I know which files have a text layer. Katie Floyd has a script for this.
SimonBrowneNZ
 
Posts: 1
Joined: Tue Jul 09, 2013 11:30 pm

Re: OCR - determine if pdf has text layer Tue Oct 02, 2018 11:01 am • by Mr_Noodle
Possibly. Not sure how reliable that is. The scripts others are using look for a font directive, which usually indicates text content.
Mr_Noodle
Site Admin
 
Posts: 11195
Joined: Sun Sep 03, 2006 1:30 am
Location: New York City

Re: OCR - determine if pdf has text layer Wed Aug 21, 2019 2:30 pm • by gacorrea
SimonBrowneNZ wrote:I use Spotlight or Finder to search "contents" for "a", on the basis that the vast majority of documents contain an "a", so if a PDF has been OCR'd, a Finder contents search will find that "a" as text. Is this thinking correct?
By default, Spotlight searches inside files for text. In Finder, I search for Kind is PDF first, then click + to add a Contents search for "a".

I have a Hazel folder, called ScanThis, that scans files using PDFpen and colours them green and purple, so I know which files have a text layer. Katie Floyd has a script for this.


Could you please post the code for how you perform the above check? Thanks!
gacorrea
 
Posts: 5
Joined: Fri Oct 14, 2016 10:39 pm

Re: OCR - determine if pdf has text layer Mon Aug 26, 2019 12:23 pm • by luomat
I use a program called ‘pdftotext’ from https://www.bluem.net/files/pdftotext.dmg

It is abandonware (no longer being developed) but it works for now, and if it stops working in the future, I’ll deal with it then.

The use is very simple:

Code: Select all
pdftotext file.pdf file.txt


then check to see if ‘file.txt’ is just a newline. If it is, that means it wasn’t able to find any text in ‘file.pdf’.

I’ve yet to find a more reliable test.
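
A minimal sketch of that check as shell functions, assuming `pdftotext` is on the PATH and takes an input and an output path as in the command above. The function names `text_file_has_content` and `pdf_has_text` are made up for illustration. Instead of comparing against a single newline, the sketch looks for any non-whitespace character, on the assumption that some builds may also emit form feeds or blank lines for pages without text.

```shell
#!/bin/bash
# Sketch only: function names are hypothetical, not part of pdftotext.

# Returns 0 if the given text file holds at least one non-whitespace character.
text_file_has_content() {
    grep -q '[^[:space:]]' "$1"
}

# Returns 0 if pdftotext extracted some text from the PDF,
# 1 if the output was empty or whitespace only,
# 2 if the temp file could not be created or pdftotext failed.
pdf_has_text() {
    local pdf="$1" txt status
    txt="$(mktemp "${TMPDIR:-/tmp}/pdftext.XXXXXX")" || return 2
    if pdftotext "$pdf" "$txt" 2>/dev/null; then
        if text_file_has_content "$txt"; then
            status=0
        else
            status=1
        fi
    else
        status=2
    fi
    rm -f "$txt"
    return "$status"
}
```

For example, `pdf_has_text scan.pdf || echo "scan.pdf may need OCR"` prints the message when no text was found or when extraction failed.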
luomat
 
Posts: 78
Joined: Wed Mar 10, 2010 3:57 pm

Re: OCR - determine if pdf has text layer Tue Sep 03, 2019 11:50 am • by steelersfan
You may also try Grooper to tackle this particular problem: https://www.bisok.com/grooper-data-capt ... -pass-ocr/ . I have used it for this job several times and have been pretty happy with the results. Good luck!
steelersfan
 
Posts: 1
Joined: Tue Sep 03, 2019 11:47 am

Re: OCR - determine if pdf has text layer Mon Jul 13, 2020 3:57 pm • by Betty35
Steelersfan, thanks, that works!
Betty35
 
Posts: 1
Joined: Mon Jul 13, 2020 11:56 am

Re: OCR - determine if pdf has text layer Wed Sep 09, 2020 8:42 am • by Ni€ls
Hi, here is my Hazel rule

Code: Select all
If all of the following conditions are met

Kind is PDF
Contents do not contain match a


Code: Select all
Do the following to the matched file or folder:
Run AppleScript embedded script
Toggle extension visible
Rename with pattern: name extension
Continue matching rules



embedded script:
Code: Select all
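-- Hazel passes the matched file to the embedded script as theFile.
tell application "PDFpenPro"
   open theFile as alias
   tell document 1
      ocr
      -- Poll once a second until PDFpenPro reports that OCR has finished.
      repeat while performing ocr
         delay 1
      end repeat
      delay 1
      close with saving
   end tell
   quit
end tell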




Ni€ls
 
Posts: 20
Joined: Thu May 16, 2019 10:50 am

Re: OCR - determine if pdf has text layer Mon Sep 14, 2020 1:11 pm • by IVG
I have had success with the following shell script to determine whether a PDF has been OCR'd. It is pretty fast.

Code: Select all
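#!/bin/bash
# Exit 0 when no "Font" string is found, i.e. the PDF has probably not been OCR'd.
# Exit 1 when "Font" appears, which usually means the PDF already contains text.
# -q keeps grep quiet; only its exit status is used.
if ! grep -q Font "$1"
then
     exit 0
else
     exit 1
fi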
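
For the original question about a large number of files, the same "Font" check could be run over a whole folder outside Hazel. This is only a sketch: the function names `lacks_font_reference` and `list_pdfs_needing_ocr` are hypothetical, the match is case-sensitive, and a PDF whose font objects sit inside compressed streams might be reported as needing OCR even though it has text.

```shell
#!/bin/bash
# Sketch only: list the PDFs in a folder that show no "Font" string,
# i.e. the ones that probably still need OCR.

# Returns 0 when the file contains no "Font" string at all.
lacks_font_reference() {
    ! grep -q Font "$1"
}

# Prints the path of every *.pdf directly inside the given folder
# that lacks a font reference. Subfolders are not searched.
list_pdfs_needing_ocr() {
    local dir="$1" pdf
    for pdf in "$dir"/*.pdf; do
        [ -f "$pdf" ] || continue   # skip the literal pattern when no PDFs exist
        if lacks_font_reference "$pdf"; then
            printf '%s\n' "$pdf"
        fi
    done
}
```

For example, `list_pdfs_needing_ocr ~/Scans` prints one path per line for the files that look like image-only scans.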
IVG
 
Posts: 18
Joined: Tue Jan 22, 2019 5:00 pm

