How to Identify Non-OCR'd Files with Hazel and OCR Them in PDFpenPro

Jose_Ariel_Henriquez · November 2, 2018, 3:45am

I was looking up how to do this and wanted to share. I came up with a way to identify PDFs that were not scanned by my ScanSnap.

I’ve got a folder with over 134 PDF files that I kept postponing the creation of Hazel rules for. I wanted to identify which ones were not scanned (and thereby OCR’d) by my ScanSnap. In Hazel, I used the condition “Content Creator - does not contain - ScanSnap” along with a second condition that the file does not have the tag “OCR’d”.


Then I had Hazel run a shell script of
sleep 30

to have it pause for 30 seconds.
Then I have Hazel run this AppleScript (sorry, I forget where I found it when Katie Floyd’s version didn’t work for me):
tell application "PDFpenPro"
	open theFile as alias
	tell document 1
		ocr
		repeat while performing ocr
			delay 1
		end repeat
		delay 1
		close with saving
	end tell
end tell

Finally, have Hazel add the tag “OCR’d”.


Hope this helps.

tomalmy · November 3, 2018, 5:38pm

But what if the PDF was already readable? It might not have come from the ScanSnap but be a text PDF instead of an image PDF, or might have been OCRed somewhere else. I use the stupid but effective rule that the PDF must contain an “e” and an “a” to be considered scanned.

Jose_Ariel_Henriquez · November 8, 2018, 8:00pm

I thought about that and figured it wouldn’t hurt to have it OCR’d by PDFpen anyway, since that is my preferred way to OCR. But checking for an “a” or an “e” is a good way to think of it. Thanks

tonywatkins · January 11, 2021, 11:48am

I had inconsistent results using ‘contents do not contain e’ (I didn’t bother with ‘a’ as ‘e’ is by far the most prevalent letter in English): some PDFs were not matched and thus flagged as needing OCR, when they already contained text which I could select and copy or find with a search. I can’t for the life of me figure out why this happened. So instead I used ‘contents contain match’ and used ‘match word’, which has worked faultlessly.

tomalmy · January 11, 2021, 5:53pm

I had both “a” and “e” in my “stupid but effective rule” because I found that just “e” wasn’t enough. But your approach seems even better.

ldebritto · January 11, 2021, 10:57pm

I once reached out to the Smile folks for help with my Hazel script and discovered that the PDFpen AppleScript library has a command that checks whether the document needs OCR.

Some good soul replied to my question with an even more complete AppleScript that includes a few lines of code to have PDFpen check if the file needs OCR.

Here’s the code:

tell application "PDFpen"
	open theFile as alias
	-- does the document need to be OCR'd?
	get the needs ocr of document 1
	if result is true then
		tell document 1
			ocr
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
	else
		-- Scan Doc was previously OCR'd or is already a text type PDF.
		tell document 1
			close without saving
		end tell
	end if
	--In PDFpen, when no documents are open, window 1 is "Preferences"
	--If other documents are open, do not close the App.
	if name of window 1 is "Preferences" then
		quit
	end if
end tell

If you happen to use PDFpenPro, the code should work just as well after changing the tell line to tell application "PDFpenPro"

Personally, I have this script set to run on every PDF in my Downloads folder that does not have an OCR tag on it.

As for the actions, the first one is this script and the second is to add a tag named OCR. This way, even if the script finishes without OCR’ing (because the file did not need it), the OCR tag is still added, which breaks the match.

swh · June 4, 2021, 11:05pm

Does anyone have any idea how to modify the PDFpen-sourced AppleScript to use “display notification” to show a banner alert with a message that lets the user know if the source PDF file has or has not been OCR’d?

It would be an “and” function that would come right after the “if result is true then” statement in line 5. In other words, I still want the script to do the rest of the work, I just want to include a notification so I know if the document has been OCR’d or not.

If it has, the message would say, “this has been OCR’d”, and if not, it would say “this has not been OCR’d.”

I know Hazel can send notifications, but I don’t know how that would work when most of the action is happening within the script.

Thanks for any suggestions and alternatives.

//SWH
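
One possible way to get the banner swh describes, offered as an untested sketch rather than a confirmed answer: store the result of the "needs ocr" check in a variable, let the script do its usual work, then call "display notification" at the end. It assumes Hazel supplies theFile as in the scripts above and that PDFpen exposes "needs ocr" and "performing ocr" as shown earlier in the thread. The variable name needsOCR and the notification title are made up for illustration.

```applescript
-- Sketch only: assumes an embedded Hazel script where theFile is provided,
-- and the PDFpen "needs ocr" property quoted earlier in this thread.
tell application "PDFpen"
	open theFile as alias
	-- Remember the answer so it can be reported after the work is done.
	set needsOCR to needs ocr of document 1
	if needsOCR then
		tell document 1
			ocr
			-- Wait for OCR to finish before saving.
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
	else
		-- Already OCR'd, or a text-type PDF: leave the file untouched.
		tell document 1
			close without saving
		end tell
	end if
	-- Only quit if no other documents are open (see the script above).
	if name of window 1 is "Preferences" then
		quit
	end if
end tell

-- display notification is a Standard Additions command (OS X 10.9 and later).
-- The title text here is a placeholder.
if needsOCR then
	display notification "This has not been OCR'd (OCR was just run)." with title "PDFpen OCR check"
else
	display notification "This has been OCR'd." with title "PDFpen OCR check"
end if
```

The notification calls sit outside the PDFpen tell block so they still run after PDFpen has quit. The Hazel rule can keep adding the OCR tag as its second action, exactly as before.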