So I’ve recently tried OCRing all of my scanned documents, but I don’t want to spend the money required for PDFpen, PDF Expert, or Adobe Acrobat Pro DC. In my research I discovered a free, open-source command-line application called OCRmyPDF. Having only used OCR sparingly before, I don’t have much to compare it to, but it seems to work OK. My question is: would the quality of the OCR be better in the proprietary applications, and if so, which one? Or do they all use the same OCR technology, so there isn’t much difference in the quality of the OCR I would get from them?
PDF applications use a number of different OCR engines; they are NOT all using the same one. For example, ABBYY makes their own very well-respected OCR engine, which they sell to consumers (FineReader) and also license to third parties (for example, DEVONthink Pro Office uses ABBYY for OCR).
Adobe has their own OCR engine and I am fairly sure they do not license it to any third parties. In my experience they are one of the best in terms of accuracy and maintaining reasonable file sizes without sacrificing image quality.
It looks like Smile uses an OCR system from Nuance in PDFpen Pro.
Digging through the Legal Notes for Scanner Pro (they don’t offer OCR in PDF Expert for iOS or Mac), it appears that Scanner Pro uses Tesseract, an open-source OCR engine that OCRmyPDF also uses.
If you want to see what’s out there, you can consult this Wikipedia comparison.
Some of these engines are not hard to set up yourself. Tesseract, for example, can be installed and run from Terminal on your Mac. It’s not particularly hard to use as long as you have even a little terminal experience.
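To give a rough idea of what driving these tools from a script looks like, here is a minimal Python sketch. It assumes the `tesseract` and `ocrmypdf` executables are already installed and on your PATH, and that your Tesseract version accepts the `--dpi` option; the file names and the helper function names are made up for illustration.

```python
import shutil
import subprocess


def tesseract_command(image_path, output_base, language="eng", dpi=300):
    """Build a Tesseract command line that writes a searchable PDF."""
    return [
        "tesseract",
        image_path,         # input image, e.g. a PNG or TIFF scan
        output_base,        # output name without extension; Tesseract appends .pdf
        "-l", language,     # recognition language
        "--dpi", str(dpi),  # tell Tesseract the scan resolution (assumed supported)
        "pdf",              # output config: produce a PDF with a text layer
    ]


def ocrmypdf_command(input_pdf, output_pdf, language="eng"):
    """Build an OCRmyPDF command line that adds a text layer to an existing PDF."""
    return ["ocrmypdf", "-l", language, input_pdf, output_pdf]


def run_if_installed(command):
    """Run the command only if its executable can be found on PATH."""
    if shutil.which(command[0]) is None:
        print(f"{command[0]} not found; would have run: {' '.join(command)}")
        return False
    subprocess.run(command, check=True)
    return True


if __name__ == "__main__":
    # Only print the commands here; call run_if_installed(...) on real files.
    print(" ".join(tesseract_command("scan.png", "scan_ocr")))
    print(" ".join(ocrmypdf_command("scan.pdf", "scan_ocr.pdf")))
```

Each printed line is also what you would type directly into Terminal, so the script is optional; it just makes it easier to batch a folder of scans later.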
While OCR engines do vary considerably in their accuracy and the resulting image quality/filesize, another significant factor is the quality of the image you are feeding them. Even a very good OCR engine like Adobe’s will struggle to produce good results if the image is poor.
Any suggestions on what DPI I should scan a document at when doing this? I feel like my searches turn up answers all over the place.
Use the highest quality you can muster. Tesseract apparently works best with at least 300 DPI. I think PDFpen Pro will make you jump through a hoop to OCR a document under 300 DPI (I THINK it’s 300; it might be 250, I can’t remember). So 300 DPI is a safe bet.
Remember though, it’s not just about DPI. Also consider the clarity of the image, the contrast between the text and the background, etc.
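On the DPI side, if you already have scans and aren’t sure what resolution they were made at, you can work it out from the pixel dimensions and the paper size. This is just a small sketch of that arithmetic; the function names are hypothetical, and the 300 DPI default is simply the rule of thumb mentioned above.

```python
def effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in):
    """Return the lower of the horizontal and vertical resolutions, in DPI."""
    horizontal = pixel_width / page_width_in
    vertical = pixel_height / page_height_in
    return min(horizontal, vertical)


def meets_minimum(pixel_width, pixel_height, page_width_in, page_height_in,
                  minimum_dpi=300):
    """Check a scan against a minimum DPI (300 is a rule of thumb, not a hard limit)."""
    dpi = effective_dpi(pixel_width, pixel_height, page_width_in, page_height_in)
    return dpi >= minimum_dpi


if __name__ == "__main__":
    # A US Letter page (8.5 x 11 in) scanned to 2550 x 3300 pixels.
    print(effective_dpi(2550, 3300, 8.5, 11))   # 300.0
    print(meets_minimum(1275, 1650, 8.5, 11))   # False: that scan is only 150 DPI
```

It says nothing about clarity or contrast, of course; it only tells you whether the resolution is in the right range.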
Which DPI to use depends a lot on the quality of the software doing the recognition, but it is also heavily influenced by the source material. You can do a quick test of the results by OCRing a scan at various DPIs, then opening the resulting PDF in Preview. Use Edit > Select All, copy, then paste into a plain text document in TextEdit. You’ll see the text layer that OCRing creates and be able to judge whether it is close to what you see and read in the image.
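If eyeballing the pasted text gets tedious, you could score it instead. The sketch below assumes you have typed out (or otherwise have) a correct copy of a short passage to use as the reference, and that you have pasted the OCR output for each DPI into a string; it uses Python’s standard `difflib` module, and the function names and sample strings are invented for illustration.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Collapse runs of whitespace so line breaks don't affect the score."""
    return " ".join(text.split())


def similarity(reference, ocr_text):
    """Return a 0..1 similarity ratio between the reference and the OCR output."""
    return SequenceMatcher(None, normalize(reference), normalize(ocr_text)).ratio()


def rank_scans(reference, ocr_by_dpi):
    """Return (dpi, score) pairs sorted from best to worst match."""
    scores = {dpi: similarity(reference, text) for dpi, text in ocr_by_dpi.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    reference = "Invoice total: 42 dollars"
    # Made-up sample output, standing in for text pasted from each scan.
    ocr_by_dpi = {
        150: "lnvoice tota1: 42 do11ars",
        300: "Invoice total: 42 dollars",
    }
    for dpi, score in rank_scans(reference, ocr_by_dpi):
        print(dpi, round(score, 3))
```

The ratio is a rough character-level similarity, not a formal OCR accuracy metric, but it is enough to tell you which DPI setting got closest on your own documents.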