Redoing PDF OCR Issue

Lampornis · July 13, 2021, 2:45pm

I have a bunch of Hazel rules set up to move and rename the various bills and statements I regularly download, which, as most of you know, saves tons of time in organizing my life. However, each month I get a few documents in which the OCR quality is poor, so Hazel can’t read them well enough to apply certain rules. My standard tack for dealing with this is simply to redo the OCR, and then all is well. But when I get a badly OCR’d document from my insurance company I hit a snag: each time I try to redo the OCR I get an error saying I need the document owner password to do any sort of editing on the document. I need no password to open or read the PDF, just to edit it. Does anyone have ideas for a workaround?

JKoopmans · July 13, 2021, 6:17pm

Take a screenshot and pdf + ocr that?

nlippman · July 13, 2021, 6:33pm

What app are you using to open and to OCR the document?

Lampornis · July 14, 2021, 2:44am

I am using PDFPen Pro.

Stephen_C · July 14, 2021, 5:56am

Are you able to highlight the file in Finder, display the information panel (⌘ + I) and give yourself read and write permissions?

Stephen

vco1 · July 14, 2021, 7:10am

It does make sense, doesn’t it? Your insurance company obviously doesn’t want anyone to tamper with their documents.

Adobe PDFs can have several levels of protection, one of them being a restriction on editing/changing the PDF, which is what you do if you (try to) re-OCR.

That being said, if you receive a pdf from a company there should be no need to ocr it. It already contains a text representation of the pdf ‘image’. So my first step would be to skip the initial ocr and use the document you received.
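
The protection levels mentioned above can be checked from the command line. What follows is a minimal sketch, assuming qpdf is installed (for example via Homebrew) and is version 9.1 or later, where `--requires-password` is available. The function name `pdf_protection` is made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper: report how a PDF is protected, using qpdf's exit codes.
# Assumes qpdf >= 9.1, where --requires-password exits with
#   0 = a password is needed to open the file
#   2 = the file is not encrypted
#   3 = encrypted, but it opens with the password supplied (here: none)
pdf_protection() {
  # qpdf also uses exit code 2 for errors, so rule out a missing file first.
  [ -f "$1" ] || { echo "missing-file"; return 1; }
  local status=0
  qpdf --requires-password "$1" || status=$?
  case "$status" in
    0) echo "needs-password" ;;
    2) echo "not-encrypted" ;;
    3) echo "restricted-only" ;;
    *) echo "unknown" ;;
  esac
}

# Example: pdf_protection statement.pdf
# For the detailed permission list: qpdf --show-encryption statement.pdf
```

A result of `restricted-only` would match the situation described in this thread: the file opens freely, but editing is locked behind an owner password.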

Lampornis · July 14, 2021, 12:35pm

Nope, says it is password protected.

Lampornis · July 14, 2021, 12:42pm

Perhaps it makes sense, but it simply is uncommon. They are the only ones that do this. My bank does not do this. It is not a huge deal, just a question of curiosity.

I always use the document I receive first. This is only an issue when for whatever reason the OCR is bad and Hazel cannot read it. It happens from time-to-time. When this happens I redo the OCR and life is good.

vco1 · July 14, 2021, 1:10pm

If you receive the document as a pdf the original (i.e. source) text is embedded. That is not OCR’ed. So it simply can’t be bad. It’s the source text from which the pdf was generated. Running OCR on that pdf will only worsen the situation.

Compare it to a pdf that you generate from a Word doc. You don’t have to OCR it either. The text is already there. Not because Word or macOS ran some mysterious OCR task when creating the pdf. But because it already had the source text to start with.

There is no way mass-generated pdf’s are OCR’ed by the company/organization/whatever that sends it to you. There is simply no need to OCR these documents. The text layer is already there. And that text layer is the source from which the pdf is generated. It should (will) be a 100% match.

If Hazel has trouble with these files, there is something wrong with Hazel or (most likely) your rule. Keep in mind that the text in a pdf doesn’t always have the same structure as the layout you see on screen. However, there won’t be any misspellings (“bad OCR”), unless they are visible on screen too. In which case… well, you get it… they were in the source.

DEVONtech_Jim · July 14, 2021, 2:09pm

Any electronic documents from my insurance company are password-protected.

Stephen_C · July 14, 2021, 2:47pm

Sorry, I did figure that might be the case. It’s just something I’ve used as a workaround on read-only invoices (which I appreciate are not the same as password-protected files) that I’ve needed to imprint as “Paid” in DEVONThink.

Stephen

Stephen_C · July 14, 2021, 2:50pm

When I have received password protected pdf files I’ve been able to:

import them into DEVONThink;
open them there (using the password); and then
OCR them in DEVONThink

which has worked (even though I’m not sure that it should—or, indeed, would in every case).

Stephen

ldebritto · July 14, 2021, 4:06pm

If you don’t mind having the password stripped altogether, you could use qpdf with your Hazel rule.

First you’ll need to install qpdf via Homebrew.

After that, just add a Do shell script with

qpdf --decrypt --password=YOUR_PASSWORD --replace-input "$1"

Remember to place this command before your OCR command.
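
As a fuller sketch of that Hazel step (not a tested recipe), the command can be wrapped so that it also covers PDFs that open without a password but are locked for editing. In that case qpdf can usually remove the restrictions with no password at all. The function name `strip_pdf_encryption` and the PATH line are assumptions for illustration; adjust them to your setup.

```shell
#!/bin/bash
# Sketch of a Hazel "Run shell script" action; Hazel passes the matched file as $1.
# YOUR_PASSWORD is a placeholder, as in the command above.
# Assumption: Hazel may run scripts with a minimal PATH, so Homebrew's bin
# directories are added explicitly (/opt/homebrew/bin on Apple silicon,
# /usr/local/bin on Intel).
PATH="/opt/homebrew/bin:/usr/local/bin:$PATH"

strip_pdf_encryption() {
  local file="$1" password="${2:-}"
  if [ -n "$password" ]; then
    qpdf --decrypt --password="$password" --replace-input "$file"
  else
    # No password given: this only works if the PDF opens without one.
    qpdf --decrypt --replace-input "$file"
  fi
}

# Run only when a file argument is present (as it is when Hazel calls the script).
if [ -n "${1:-}" ]; then
  strip_pdf_encryption "$1" "YOUR_PASSWORD"
fi
```

Quoting `"$1"` matters because statement filenames often contain spaces. To try the no-password path, drop the second argument in the last call.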

Lampornis · July 15, 2021, 12:11pm

Thanks! I’ll give that a go.

vco1 · July 15, 2021, 1:14pm

I doubt @Lampornis has the password for this document.

And this is the final time I say this – I promise – but a generated pdf document that’s received from an insurance company does not need to be (re) ocr’ed. The source text is already there. You usually don’t receive scanned documents from companies.

ldebritto · July 15, 2021, 4:07pm

I think you’re right that one shouldn’t need to OCR an insurance company document. But that may not always hold, and only @Lampornis can be the judge of that.

From my own experience, some companies use part of a personal document number as the password (say, the last 5 digits of your Social Security number).

As for the OCR part, sometimes not all the text of the PDF is in a proper text layer (maybe because the bill’s template is an image). Those cases still need OCR even though the documents have been automatically mass-generated.

That said, there may be some legal issues regarding the validity of the OCR’ed version vs. the unprocessed one. Something to keep in mind when dealing with insurance companies.

Lampornis · July 17, 2021, 1:55pm

Apologies if I was not clear. MOST of the time the documents I get from my insurance company are just fine. This is only an occasional occurrence that results in a document that gets stuck in the filing/renaming process. I check these documents and I cannot even highlight specific text items that Hazel is looking for. This has happened with the odd downloaded PDF before but a quick redo of the OCR and the file gets renamed and sent to the proper place.
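
Whether a stuck document has any selectable text at all can also be checked with a script before deciding to redo the OCR. This is a rough sketch, assuming `pdftotext` from Poppler is installed (for example with `brew install poppler`). The function name `has_text_layer` is hypothetical.

```shell
#!/bin/bash
# Hypothetical check: does the PDF expose any extractable text?
# Returns 0 if text was found, 1 if the output was empty, 2 if pdftotext failed.
has_text_layer() {
  local text
  # "-" sends the extracted text to stdout instead of writing a .txt file.
  text=$(pdftotext "$1" - 2>/dev/null) || return 2
  # Strip all whitespace (including form feeds between pages) before testing.
  [ -n "$(printf '%s' "$text" | tr -d '[:space:]')" ]
}

# Example:
# if has_text_layer statement.pdf; then echo "text found"; else echo "no text, or unreadable"; fi
```

Note that a PDF with copy restrictions may make pdftotext refuse to extract anything, in which case the function returns 2 rather than 1. And a page that has some text but is missing the specific fields Hazel looks for would still pass this check.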

Lampornis · July 17, 2021, 1:56pm

Got this all set up and will try it next time this happens. Thanks!