Looking for a workaround for Preview app corrupting OCR in PDFs

klarsichtfolie · October 17, 2021, 11:25am

Hi,

There is an issue where the Preview app in Big Sur corrupts the OCR layer in PDF files when it modifies them. [1]

How do you work around it? What tools do you use? Do you just rerun the OCR after modifying the PDF?

I am trying to find a setup for scanning and OCRing a stack of papers and ran into this problem myself, both when I copied pages from one PDF to another to merge them and when I removed pages.

My scanner is the Brother ADS-2400N and I tried Abbyy Finereader Sprint 12 on Mojave and Abbyy Finereader PDF 15 on Big Sur.

[1] https://annoying.technology/posts/86f4ea27e4cd90d0/

95omega · October 17, 2021, 11:31am

This is scary stuff. I haven’t run into it, but I suspect I have PDFs with OCR issues as a result of this. The referenced post is from almost a year ago. I’m assuming, since you’re posting, that there have been software updates and it remains an issue (sorry, dumb question but had to ask)?

klarsichtfolie · October 17, 2021, 11:36am

Hi 95omega, this is a valid question. I am running 11.6, which is the newest Big Sur version, but I have not tried Monterey yet. Finereader Sprint 12 is a very old version, but I checked yesterday that I was using the most up-to-date Finereader PDF 15 version.

95omega · October 17, 2021, 11:42am

I just went back and read the post again. So you’re only experiencing this after modifying the file in Preview? I don’t use Preview for PDF modification, but I know a bunch of suggestions have been made before regarding PDF management. Personally, I’m not sure what you can do without purchasing an app.

klarsichtfolie · October 17, 2021, 11:50am

I don’t mind paying for another app. I used Preview because it is the default and I often found other PDF apps to be clunky(?). Which PDF app do you use or recommend for editing?

anon41602260 · October 17, 2021, 11:53am

If you are looking to ingest a stack of PDFs, copy pages from some PDFs and insert them into other PDFs, and also merge two or more PDFs, then you might want to consider DEVONthink.

Of course, get a trial and test it before you decide.

JohnAtl · October 17, 2021, 12:34pm

Welcome!

PDF Expert is good at copying pages from PDF to PDF, deleting pages, etc.
I don’t know the implications with OCR. I have a few OCRed PDFs, and haven’t noticed any problems with them.

agmalis · October 17, 2021, 3:46pm

I’ve also seen Preview remove the OCR layer from scanned documents if you delete a page, move the pages around, or combine PDFs. As mentioned previously in this conversation, DEVONthink does all this without an issue, and the OCR layer remains intact.
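
A rough way to check whether an edit stripped the text layer is to list the pages that no longer yield any extractable text. The sketch below assumes the third-party pypdf package is installed; the function names are made up for illustration. It only detects a missing text layer, so it would not catch text that is still present but garbled.

```python
def extract_page_texts(path: str) -> list[str]:
    """Return the extractable text of each page (empty string if none)."""
    # Assumption: the third-party pypdf package is installed (pip install pypdf).
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [(page.extract_text() or "").strip() for page in reader.pages]


def pages_without_text(texts: list[str]) -> list[int]:
    """Return the 1-based numbers of pages that have no extractable text."""
    return [number for number, text in enumerate(texts, start=1) if not text]


# Example with made-up page texts: page 2 has lost its text layer.
print(pages_without_text(["Invoice 2021", "", "Kind regards"]))  # prints [2]
```

For a real file you would call something like pages_without_text(extract_page_texts("edited.pdf")), where "edited.pdf" is a placeholder name, and compare the result with the same check on the file before editing.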

karlnyhus · October 17, 2021, 4:12pm

The standard for Portable Document Format files encompasses more than a thousand pages. Software to implement such a standard can hide a multitude of errors and omissions. In a way, PDF files were never intended to be editable without the original source text, images, etc. They were intended to allow documents to be shared across incompatible platforms but still allow all viewers to see the same document. In my view, all PDF editing apps are hacks. If they work (that is, do what you want them to), then fine. But I’m not surprised when a PDF document proves uneditable or comes out “wrong” with whatever software I’m using to make my edit.

karlnyhus · October 17, 2021, 5:59pm

Add and remove pages first and then OCR?

karlnyhus · October 17, 2021, 6:22pm

Think of it this way: you have a PDF document with a pre-existing “OCR layer”. You use the Preview app to add and remove pages. What is Preview to do when you ask it to save the edited PDF? From its point of view, you have corrupted the PDF by rendering the pre-existing OCR incorrect, perhaps embarrassingly so. So Preview removes the out-of-date OCR text and the PDF is no longer corrupt. For all I know, this behavior is spelled out in the PDF standard for apps that lack OCR capability!

OogieM · October 18, 2021, 2:12pm

I think that the OCR layer is tied to the page in the PDF, since PDF is meant to be page oriented and not editable within a page much, if at all. So I see no reason why the OCR layer would be “corrupted” just by moving and deleting whole pages. Preview just has to delete the OCR associated with the PDF page you deleted, or move it around and leave it attached. What I’d do if I was programming a PDF app is tie the OCR layer for a page to the PDF image of the page and treat it as a block that gets moved, deleted etc. It’s been a LONG time since I read the PDF spec, but I thought that was how it worked.
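
The "treat it as a block" idea can be sketched as a toy model. This is not how Preview or any real PDF library stores pages; the Page class and the helper functions are hypothetical and only illustrate that, if image and OCR text live in one unit, whole-page edits cannot separate them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    """One scanned page: the image and its OCR text travel together."""
    image: str     # placeholder for the scanned image data
    ocr_text: str  # text layer that OCR produced for this page


def delete_page(pages: list[Page], index: int) -> list[Page]:
    """Return a new page list without the page at `index` (0-based)."""
    return pages[:index] + pages[index + 1:]


def move_page(pages: list[Page], src: int, dst: int) -> list[Page]:
    """Return a new page list with the page at `src` moved to `dst`."""
    result = list(pages)
    result.insert(dst, result.pop(src))
    return result


def merge(first: list[Page], second: list[Page]) -> list[Page]:
    """Return a new document containing all pages of both inputs."""
    return first + second


doc = [Page("scan-1", "invoice"), Page("scan-2", "receipt"), Page("scan-3", "letter")]

# Delete the second page, then move the last remaining page to the front.
edited = move_page(delete_page(doc, 1), 1, 0)
print([(p.image, p.ocr_text) for p in edited])
# prints [('scan-3', 'letter'), ('scan-1', 'invoice')]
```

Every operation only rearranges whole Page objects, so each image keeps its own text no matter how pages are deleted, reordered or merged.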

karlnyhus · October 18, 2021, 3:03pm

Types of PDFs: Searchable PDF, Image-Only, True PDF | ABBYY

An “image-only” PDF can be made searchable by applying OCR with which a text layer is added, normally under the page image.

This source would seem to agree with your idea that OCR’d text is associated with its page, although it offers a qualification of “normally.” PDF documents vary enormously in their construction and adherence to the “standard,” which has changed a lot over the years. Preview may be at fault in this situation or there may be an error or difference of opinion in how the OCR layer should be written by the other unspecified PDF editing app. I’ve gone beyond what I really know about interpreting PDFs, so I will now retire from the field.