OCR PDF Script Help/PDFPenPro

gkmlaw · August 13, 2018, 12:50pm

I have a Mac mini that is shared and stores all of my law firm’s files, including PDFs. I have tried to use the AppleScript that Katie published on her blog to OCR them, but sometimes there are just too many PDFs for the process to keep up, which causes the whole machine to lock up.

I am still learning AppleScript and was wondering if someone can help modify the script below to limit the number of PDFs that PDFpen opens at one time to OCR. It could even be as few as one at a time. The Mac mini is always on, so it would eventually catch up. Below is the script I am using. Thanks.

tell application "PDFpen"
	-- theFile is supplied by the Hazel rule that runs this script
	open theFile as alias
	-- Does the document need to be OCR'd?
	if needs ocr of document 1 then
		tell document 1
			ocr
			-- Wait for OCR to finish before saving
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
	else
		-- The scan was previously OCR'd or is already a text-type PDF.
		tell document 1
			close without saving
		end tell
	end if
	-- In PDFpen, when no documents are open, window 1 is "Preferences".
	-- If other documents are open, do not quit the app.
	if name of window 1 is "Preferences" then quit
end tell

Doc5506 · August 13, 2018, 4:13pm

How are you triggering the script? Do you have it as part of a Hazel rule? Are you trying to OCR a backlog of PDFs, or does your law firm simply create so many new PDFs that the Mac mini locks up just from the ones added on a daily basis?

gkmlaw · August 13, 2018, 4:15pm

Yes, I am triggering the script from a Hazel rule, and yes, there is a backlog of PDFs that need to be OCR’d.

Doc5506 · August 13, 2018, 5:00pm

I think you’re going to have to rethink exactly how you are doing this, because Katie’s AppleScript is designed to process only one file at a time. In fact, it is Hazel that is trying to run the script on all the matched files at the same time.
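
One way to keep simultaneous runs from piling up is to make each run wait its turn before it touches PDFpen. What follows is a minimal sketch under stated assumptions, not a tested fix: it uses `mkdir` as a crude lock, since creating a directory fails when that directory already exists. The lock path `/tmp/pdfpen-ocr.lock` and the 5-second polling interval are hypothetical values picked for illustration.

```applescript
-- Hypothetical lock location; any path that every run of the script can write to will do.
set lockPath to "/tmp/pdfpen-ocr.lock"

-- Wait for our turn: mkdir fails (and do shell script raises an error)
-- for as long as another run still holds the lock directory.
repeat
	try
		do shell script "mkdir " & quoted form of lockPath
		exit repeat
	on error
		delay 5 -- assumed polling interval
	end try
end repeat

try
	-- Run the PDFpen OCR steps here (the tell application "PDFpen" block
	-- from the script earlier in the thread). Only one run gets here at a time.
on error errMsg number errNum
	-- Release the lock even if OCR fails, then re-raise the error.
	do shell script "rmdir " & quoted form of lockPath
	error errMsg number errNum
end try

-- Release the lock so the next waiting run can proceed.
do shell script "rmdir " & quoted form of lockPath
```

If a run is killed while it holds the lock, the directory stays behind and has to be removed by hand before anything else can proceed. Whether Hazel tolerates many script runs waiting like this is not something the sketch settles.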

One thought I had is to modify your rule so Hazel only matches files created, say, 10 or more years ago. Let Hazel churn through those, then change the rule so it matches files created 9 or more years ago, and so on, until you have gradually worked through your backlog. Not a fully automated solution, but perhaps a functional one.

The alternative is to trigger the script from Keyboard Maestro or a similar tool instead, and modify the AppleScript to move through the files recursively. That is a little above my pay grade, but here is a link that may help:
Applescript recursive processing

Good luck!
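
For the non-Hazel route, the backlog can be walked strictly one file at a time. The sketch below is an assumption-laden illustration rather than the approach from the linked article: instead of a recursive AppleScript handler, it lets the shell's `find` command list every PDF under a folder (subfolders included) and then loops over the results. The PDFpen commands (`needs ocr`, `ocr`, `performing ocr`) are taken from the script earlier in the thread; the folder prompt is hypothetical.

```applescript
-- Ask for the top-level folder; a fixed path could be used instead.
set rootFolder to choose folder with prompt "Folder containing the PDF backlog"

-- find walks all subfolders; each line of its output is one PDF path.
set findCommand to "find " & quoted form of POSIX path of rootFolder & " -type f -iname '*.pdf'"
set pdfPaths to paragraphs of (do shell script findCommand)

repeat with aPath in pdfPaths
	-- Convert the POSIX path to an alias outside the PDFpen tell block.
	set pdfAlias to (POSIX file (contents of aPath)) as alias
	tell application "PDFpen"
		open pdfAlias
		if needs ocr of document 1 then
			tell document 1
				ocr
				-- Block until this document is done, so only one is ever open.
				repeat while performing ocr
					delay 1
				end repeat
				delay 1
				close with saving
			end tell
		else
			-- Already searchable; leave the file untouched.
			tell document 1
				close without saving
			end tell
		end if
	end tell
end repeat
```

Because the loop does not open the next file until the current one has been closed, PDFpen never has more than one of these documents open at once. File names that contain line breaks would confuse the `paragraphs of` split; that case is ignored here.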

gkmlaw · August 13, 2018, 6:03pm

Thanks. I will take a look at those suggestions. I kept thinking that, once all of the old PDFs were OCR’d, the workload would lighten significantly, but it just hasn’t happened. Thanks again.

Doc5506 · August 13, 2018, 6:54pm

A couple of additional thoughts if you want to try to keep the Hazel workflow:

[1] Have Hazel rename OCR’d files to original name + OCR. Then change your original OCR rule to exclude files whose name contains “OCR”. That way Hazel won’t re-match files that have already been OCR’d.

[2] You could also place all new PDF files into a “Triage” folder. Hazel can then OCR anything put in that Triage folder and subsequently move the file into your regular storage folder. That way Hazel doesn’t have to monitor the folders holding tons of existing PDF documents.

Best of luck!
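
Hazel can do the rename in suggestion [1] with its own rule actions, so no code is strictly needed. For anyone driving the process from a script instead, here is a rough sketch of the same logic. The marker text " OCR" and the file prompt are hypothetical choices, and the comparison is a plain substring check, so a file whose name already contains " ocr" for unrelated reasons would be skipped too.

```applescript
-- Hypothetical marker appended to the base name of processed files.
set marker to " OCR"
set theFile to choose file of type {"com.adobe.pdf"} with prompt "PDF that has just been OCR'd"

tell application "Finder"
	set oldName to name of theFile
	set ext to name extension of theFile
end tell

-- Only rename files that do not carry the marker yet.
if oldName does not contain marker then
	if ext is "" then
		set newName to oldName & marker
	else
		-- Drop ".ext" from the end, then rebuild the name as "base OCR.ext".
		set cutoff to -((count of ext) + 2)
		set baseName to text 1 thru cutoff of oldName
		set newName to baseName & marker & "." & ext
	end if
	tell application "Finder" to set name of theFile to newName
end if
```

With this in place, "Scan 001.pdf" would become "Scan 001 OCR.pdf", and a rule that excludes names containing "OCR" would no longer pick it up.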

gkmlaw · August 13, 2018, 8:14pm

Thanks… and your response lets me know that I’m not completely incompetent. After fighting with the script for a couple of days, I realized that it might be re-OCRing documents over and over. So I implemented an adjustment similar to what you suggested. I included an instruction to apply a tag (OCR) when PDFpen finished OCR’ing a document. Then I created another instruction directing PDFpen not to OCR any document with that tag.