Re-OCR and/or batch detect for OCR'd files?

wbarnes4393 · June 9, 2022, 2:40pm

I’m trying to get started on organizing a digital filing system for my home documents. I have some scanned PDFs. I think most are already OCR’d, but some may not be. Two questions:

Is there any harm in simply batch re-OCR’ing them all or does OCR’ing a file for a second time wreak any havoc?
Is there a way to batch search a folder to quickly determine which files are and are not OCR’d? Right now, all I know to do is pull up each file individually in a PDF application and try searching for text. Surely there is a better way? (In tinkering with the trial of DEVONthink Pro, it seems like there is a “PDF+text” attribute that does this, but upon some reflection I don’t think I am taking the DEVONthink plunge for now; simple file/folder management and OCR text search should be sufficient for me.)

Ulli · June 9, 2022, 3:05pm

I have run into the problem that running OCR a second (or third…) time decreases the quality of the original PDF, at least with some OCR apps.

tomalmy · June 9, 2022, 3:11pm

I handle this in Hazel, which inspects all incoming documents. It looks at the contents to determine where to move the PDF. If there is no match, which might be because the PDF has not been OCRed, it tries this rule:

Explained: if the file extension is PDF and the file is not tagged “OCRed” and there is no OCRed word in the file, then the file is OCRed (done via an AppleScript), the OCRed tag is added, and all the rules are checked again. The OCRed tag prevents an infinite loop if no other rules are matched.
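
For anyone without Hazel, the decision part of the rule described above (is it a PDF, is it untagged, does it contain any text?) can be roughly sketched as a Python script that scans a folder. This is only a sketch under assumptions, not the actual Hazel rule or AppleScript: it assumes the third-party pypdf package for text extraction, the function names and the `min_chars` threshold are made up for illustration, and the "OCRed" tag is modeled as a plain set of paths rather than a Finder tag. It only reports files; it does not run OCR.

```python
from pathlib import Path
from typing import Callable, Collection, List


def extract_text_with_pypdf(pdf_path: Path) -> str:
    """Return whatever text pypdf can extract (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party; imported lazily on purpose

    reader = PdfReader(str(pdf_path))
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def needs_ocr(
    pdf_path: Path,
    already_tagged: bool,
    extract: Callable[[Path], str] = extract_text_with_pypdf,
    min_chars: int = 20,  # hypothetical threshold, tune for your own scans
) -> bool:
    """Mirror the rule: PDF extension, no "OCRed" tag, no text in the file."""
    if pdf_path.suffix.lower() != ".pdf":
        return False
    if already_tagged:
        # The tag is what stops a file from being OCRed over and over.
        return False
    text = extract(pdf_path)
    # Ignore whitespace so a few stray blanks do not count as a text layer.
    return len("".join(text.split())) < min_chars


def find_pdfs_needing_ocr(
    folder: Path,
    tagged: Collection[Path] = frozenset(),
    extract: Callable[[Path], str] = extract_text_with_pypdf,
) -> List[Path]:
    """List PDFs under `folder` that appear to have no text layer yet."""
    return sorted(
        p
        for p in Path(folder).rglob("*")
        if p.is_file() and needs_ocr(p, p in tagged, extract)
    )
```

Calling `find_pdfs_needing_ocr(Path("~/Documents/Scans").expanduser())` (the folder name is just an example) would return the files worth OCRing, so the rest can be left alone and never get OCRed a second time. Encrypted or damaged PDFs may make the extractor raise an error, so a real script would probably want to catch that and list those files separately.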

wbarnes4393 · June 10, 2022, 11:33pm

Thank you so much! I will give this a whirl.