I am looking for help in doing bulk OCR

SteveU75 · March 26, 2020, 10:52pm

I am looking for help in doing bulk OCR.

I have the following software: ABBYY FineReader for ScanSnap, DevonThink 3, PDFelement, ABBYY FineReader, Evernote, EagleFiler, OCR Wizard, PDF Expert, Keyboard Maestro, Hazel, and MacSparky’s excellent book on going paperless.

Part of me thinks that DevonThink 3 may be the answer, but I have avoided it because it seems like it will take quite an investment of time and effort to get it working well.

In the past I’ve tried using the Mac Power Users script along with Hazel to call PDFpen Pro and have it OCR about 500 files.

I’ve been trying off and on to get this to work for the last six years. So I’ve been through multiple changes of the operating system as well as new versions of PDFpen Pro. It will do several files and then just hang.

I don’t think it’s the PDF files themselves, as I have an old version of Adobe Acrobat Pro running on a Windows machine and it’s able to OCR the files just fine.

I was really excited when the latest version of PDFpen Pro included “OCR Files” right inside the application. Unfortunately, it exhibits the same behavior: it will OCR 10 to 20 files and then just hang.

Computers are complicated, so I’m sure there could be something on my system that is causing these problems. However, I don’t have any other applications that behave like this, i.e. work for a while and then hang.

I’m not sure if it makes any difference but my files are stored in iCloud.

I’ve asked Smile Software to see if they could provide an updated script to replace the one that I got from Mac Power Users. I wonder whether adding a few more delays, or longer ones, so that the script does not overwhelm the application would make it more reliable.

I have sent Smile Software log files, but even after all this time we’ve not been able to get it working.

Here is the script that I am using.

tell application "PDFpenPro"
	open theFile as alias
	-- does the document need to be OCR'd?
	get the needs ocr of document 1
	if result is true then
		tell document 1
			ocr
			repeat while performing ocr
				delay 1
			end repeat
			delay 1
			close with saving
		end tell
		-- In PDFpen, when no documents are open, window 1 is "Preferences"
		-- If other documents are open, do not close the App.
		if name of window 1 is "Preferences" then
			tell application "PDFpenPro"
				quit
			end tell
		end if
	else
		-- Scan Doc was previously OCR'd or is already a text type PDF.
		tell document 1
			close without saving
		end tell
		-- In PDFpen, when no documents are open, window 1 is "Preferences"
		-- If other documents are open, do not close the App.
		if name of window 1 is "Preferences" then
			tell application "PDFpenPro"
				quit
			end tell
		end if
	end if
end tell

I hate having to go back to Windows to be able to OCR the files, and when those files are on iCloud it takes a while for the changes to be reflected on my Mac.

Is anyone else having the same problem with PDFpen Pro hanging?

Do you see any additions to the above script that might help?

Can you recommend any additional software that might do a better job at bulk OCR?

I purchased ABBYY FineReader Pro, which offers quite a few conversion utilities, but I haven’t figured out how to have it do bulk OCR yet.

As you can imagine I’m beyond frustrated as I have spent a lot of money on software to try and solve this as well as a lot of time scanning my documents in only to find that it’s difficult to locate what I need.

Evan · March 27, 2020, 2:18am

Put them in a group in Devonthink (or use a smart group), select all, then right-click and pick the OCR option. It will take a while to process them all.

jahala · March 27, 2020, 2:42am

I have been in your situation. For a solution to run on your personal computer, I think using Acrobat Pro is the easiest way. It has an option to OCR multiple files. Just point it at your files and tell it whether to save the OCR result back to the original file or make a copy. Then press go and wait. Its OCR engine is very good these days.

You might consider subscribing to Acrobat Pro DC for a month to do the OCR job.

fairlydoughnut · March 27, 2020, 3:17am

ABBYY FineReader has an AppleScript interface that you can use to automate OCRing. Here’s a script you can adjust for your needs:

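-- theSelection (the items to process) and itemPath (the path of the current
-- item) are not defined in this excerpt; set them before running.
repeat with thisItem in theSelection
	-- Every item is written to the same output path; change this per item.
	set outputPath to "/tmp/blah.pdf"

	tell application "FineReader"
		-- wait until FineReader is idle
		repeat while is busy
			delay 1
		end repeat

		export to pdf outputPath from file itemPath image quality high quality ocr languages enum [Japanese, English, GermanNewSpelling]

		-- wait for the export to finish before moving to the next item
		repeat while is busy
			delay 1
		end repeat
	end tell
end repeat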

Rob_Polding · March 27, 2020, 10:59am

I too have tried to get PDFpen to work and given up with bulk OCR. It always crashes for me too. DEVONthink on the other hand does not. I’ve processed over 1000 files before and it has worked well. I now use Adobe Acrobat because the OCR is much better than Abbyy, and it also works with bulk OCR. It is far more accurate, especially with scans that are lower resolution and with non-English documents (I have to deal with a lot of Spanish and English documents).

Rob_Polding · March 27, 2020, 12:09pm

There are programs that can handle this with ease. I do bulk OCR regularly, and it is wrong to state that no applications are capable of processing 500 files. In fact, decent apps don’t have any problem with more than double this amount.

tjluoma · March 28, 2020, 3:11am

For a bulk process like this, I would try a command-line tool which can be resumed if the process gets interrupted.

For example, OCRmyPDF is a free command-line tool that will let you do this, and you can install it via brew. brew will take care of installing all of the parts that you need.

bulkocrpdfs.sh is a shell script that I wrote to handle the “bulk” part.

Once you have ocrmypdf installed, put all of the PDFs that you want OCR’d into a directory called ~/AllMyPDFs/

If you want to change the location of the directory with all of your PDFs, edit the line in the file

DIR="$HOME/AllMyPDFs"

to point to your preferred directory.

Download bulkocrpdfs.sh and save it somewhere such as /usr/local/bin/bulkocrpdfs.sh

Then make it executable:

chmod 755 /usr/local/bin/bulkocrpdfs.sh

Then just run it. It should tell you if anything goes wrong.

If something goes wrong and you need to start over, just run bulkocrpdfs.sh again. It will skip any files that it has already OCR’d.

It will also save a log of any files it fails to OCR so you can review them and try to OCR them by hand or using some other tool.
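The resume-and-log behavior described above can be sketched in a few lines of shell. This is not the bulkocrpdfs.sh script itself, only a minimal illustration under assumptions: the function name bulk_ocr, the OCR_CMD variable, the ocr/ output folder, and the ocr-failed.log file are all made up for the example. It assumes ocrmypdf is on the PATH and uses its --skip-text option so pages that already contain text are left alone.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a resumable bulk-OCR loop (not bulkocrpdfs.sh).

# Command used to OCR one file; it is called as: $OCR_CMD input.pdf output.pdf
# Assumption: ocrmypdf is installed. Override OCR_CMD to try another tool.
OCR_CMD="${OCR_CMD:-ocrmypdf --skip-text}"

bulk_ocr() {
    local dir="$1"
    local out_dir="$dir/ocr"              # OCR'd copies are written here
    local fail_log="$dir/ocr-failed.log"  # one line per file that failed
    local pdf name tmp

    mkdir -p "$out_dir" || return 1

    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue               # the glob matched nothing
        name="$(basename "$pdf")"
        [ -e "$out_dir/$name" ] && continue     # finished on an earlier run

        # Write to a temporary name first, so an interrupted run never
        # leaves a half-written file that looks finished.
        tmp="$out_dir/.partial-$name"
        if $OCR_CMD "$pdf" "$tmp" && mv "$tmp" "$out_dir/$name"; then
            printf 'done: %s\n' "$name"
        else
            rm -f "$tmp"
            printf '%s\n' "$pdf" >> "$fail_log" # retried (and re-logged) next run
        fi
    done
    return 0
}
```

After `brew install ocrmypdf`, calling `bulk_ocr "$HOME/AllMyPDFs"` would process that folder. Running it again picks up where it left off, because finished files already exist in ocr/, and the originals are never modified.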

Let me know if you want to try this method but have questions.

KVZ · March 28, 2020, 10:46am

Yes, I have used Acrobat Pro in the past (I no longer subscribe because $$$), and I would set it to crank through OCRing a folder of hundreds of PDF scans while I went about my business elsewhere on my laptop. I don’t recall it ever failing due to bulk or quantity.

Sander · March 28, 2020, 3:37pm

This is really a simple solution and does not require any further knowledge of DT3. After the OCR is done you can just copy the files out of DT3 again (DT3 has no lock-in whatsoever).

SteveU75 · April 7, 2020, 10:25pm

Thanks for your comment. I will give that a try. I would love to find a consultant on DevonThink because there are just several basic concepts that I am worried about messing up and then having to redo it.

For example, I had it index all of the files in Documents, about 120 GB, and then I found that when I moved them in Finder the index did not get updated.

Using your method, is there a way to mark the files that have been OCR’d, for example with a tag or by adding something to the name?
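Outside DevonThink, one simple way to mark processed files is in the file name itself. The sketch below is only an illustration of that idea, not a DevonThink feature: the function name mark_ocrd and the -ocr suffix are assumptions, and it only handles names ending in lowercase .pdf.

```shell
#!/usr/bin/env bash
# Hypothetical helper: rename "name.pdf" to "name-ocr.pdf" to mark it as OCR'd.
# Prints the resulting path; returns non-zero if nothing could be renamed.
mark_ocrd() {
    local pdf="$1" base target
    case "$pdf" in
        *-ocr.pdf) printf '%s\n' "$pdf"; return 0 ;;  # already marked
        *.pdf)     ;;                                  # candidate for renaming
        *)         return 1 ;;                         # not a .pdf file
    esac
    base="${pdf%.pdf}"
    target="$base-ocr.pdf"
    [ -e "$target" ] && return 1   # never overwrite an existing file
    mv "$pdf" "$target" && printf '%s\n' "$target"
}
```

Calling `mark_ocrd scan.pdf` would rename the file to scan-ocr.pdf, and calling it again on scan-ocr.pdf leaves the file alone, so a search for names containing "-ocr" lists everything that has been processed.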

SteveU75 · April 7, 2020, 10:26pm

I have not had it fail or crash like PDFpen Pro but I have had it complain that the files are password protected when they are not.

SteveU75 · April 7, 2020, 10:30pm

I used to be technical.

I have not installed anything with brew yet.

If I can’t get DevonThink to do it I will give this method another look.

Rob_Polding · April 8, 2020, 7:48am

I did a batch OCR on just over 2500 PDFs yesterday in Acrobat Pro for a data science project, more than double the amount I’d tried before, and it worked with no problem at all. They were all scanned books from an archive so they were big PDFs.

I was also running multiple other apps at the same time and took part in a video conference on Connect while it was running in the background. It finished without any issues or errors.