I am trying to set up a Hazel rule to move a medical EOB file I download from my healthcare provider’s website, but for whatever reason Hazel is struggling to recognize any text contained within the downloaded PDF. The OCR on the PDF appears really good, as I can copy and paste just about anything into the rule fields. I have rule preview on and, oddly, Hazel does seem to find the correct EOB statement date (I tested this by having Hazel rename the file using the date) but for whatever reason does not recognize any other text. The PDF was created from an ASPX webpage, if that makes any difference. Also, I tried removing the OCR layer using PDFPen Pro so I could redo the OCR, but that did not work either, suggesting that there is something odd about this PDF file. I am going to get these files regularly and would really like to automate this. Thanks for any help!
2017 MacBook Pro with a crappy keyboard, 16GB RAM, Big Sur
Not a long-term solution, but to help isolate the problem, “print” a brand new PDF from this file and add OCR. See if it helps. If yes, maybe the original file is the problem.
I have seen a similar problem where the OCRed text in a PDF cannot be read until the file has been “touched” by opening it in some application first. I was never able to figure out what was going on.
You might be having the same or similar problem – open the file in Preview first and then see if Hazel recognizes the text.
Actually, I have had that problem as well. Sometimes a rule will not run on a file until I open the file, usually in Preview (only because that is the default PDF app); then the rules run. Go figure. In any event, the problem I posted here did resolve itself once Hazel did the rename. Again, go figure.
Getting all this to work has taken time, but it is all starting to pay off. Doubt I will ever be free of problems to solve. My latest occurred just a few minutes ago while trying to run a rule to move my latest insurance card PDF. It appears to be a password-protected PDF that will not allow me to copy and paste anything without the password. Hazel is struggling with this one as well.
Thanks for your input!
I have an ‘autofile’ folder. After a suitable wait period, Hazel processes the folder, filing stuff by file name or, for PDFs, running a check and then OCR if the file fails the check.
The check is to see whether the content contains an a, e, i, o or u. If not, it sets a tag. The next match cycle then triggers OCR via PDFpen and removes the tag. After that, matches on account numbers etc. enable automatic renaming.
Try looking at the extended attributes of a couple of your files in the Finder. You might find an attribute that you can use in Hazel to trigger an AppleScript that opens and closes the file in an appropriate application (assuming that the suggestions of others to ‘reset’ the OCR work).
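For anyone who would rather do the vowel check outside Hazel's built-in content matching, here is a rough sketch of the same idea as a shell script condition. It is only a sketch under assumptions: has_vowels is a made-up helper name, pdftotext comes from poppler (not mentioned above, and you would have to install it yourself), and it relies on Hazel treating exit status 0 from a script condition as a match.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (status 0) if the text on stdin
# contains at least one vowel, upper or lower case.
has_vowels() {
    grep -qi '[aeiou]'
}

# Hazel passes the matched file as $1. Nothing runs without an argument.
if [ -n "${1:-}" ]; then
    # Assumption: pdftotext (poppler) is installed; "-" sends text to stdout.
    if pdftotext "$1" - 2>/dev/null | has_vowels; then
        exit 1   # readable text found: no OCR needed, so do not match
    else
        exit 0   # no vowels found: match, so the rule can tag the file for OCR
    fi
fi
```

The vowel test is deliberately crude: a scanned page with no text layer yields nothing, while almost any real text contains at least one vowel.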
Not sure if it’s what’s going on here, but I recall having a similar issue and it turned out to be happening because the “contains” rule (forgive me if I have the name wrong, I’m not in front of my computer) was dependent on Spotlight which didn’t always index immediately. Changing to the “contains match” option instead resolved this for me.
Hmmm…I’ll give that a try. Thanks!
Also, check in Preview that, in addition to being able to copy/paste from the PDF, you can also search for the target text you are asking Hazel to find. I mention this because I have had this issue with some bank statements for which Hazel worked for years and now doesn’t: the bank gives me PDFs with OCR in them, but for whatever reason when I tried searching for the account number (as I asked Hazel to do), it didn’t work. For those I could just print a “fresh” PDF with new OCR, but frankly, I just move the document into place once a month.
I had that issue with one of my bank statements, and have resolved it by setting up a Hazel rule to decrypt the file, which runs the shell script below. (I didn’t write this myself but unfortunately have no idea where I got it from, so can’t give credit; I’m fairly certain it was here or on the Automators forum.)
o=$(dirname "$1")/$(basename "$1" ".pdf")_decrypted.pdf; qpdf --decrypt --password=PASSWORD "$1" "$o"
(Change PASSWORD to your password, obviously.)
This requires qpdf to be installed. I don’t 100% remember how I did that, but I believe I just installed Homebrew and then ran brew install qpdf in the terminal.
This creates a decrypted copy of the file with _decrypted appended to the file name, and places it in the same folder. I then have Hazel delete the original file and have a separate rule to match the decrypted version. (I place that second rule BEFORE the decrypt rule in the Hazel listing so that it doesn’t keep matching the encrypted version forever!)
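In case it helps, here is the same one-liner spelled out a bit more, as a sketch rather than a tested Hazel setup. The decrypted_path function name and the guard that skips files already ending in _decrypted.pdf are my own additions (the guard is just a belt-and-braces alternative to relying on rule order); PASSWORD is still a placeholder you need to change.

```shell
#!/bin/sh
# Hypothetical helper: build the output path by inserting "_decrypted"
# before the .pdf extension, keeping the file in the same folder.
decrypted_path() {
    dir=$(dirname "$1")
    base=$(basename "$1" .pdf)
    printf '%s/%s_decrypted.pdf\n' "$dir" "$base"
}

# Hazel passes the matched file as $1. Nothing runs without an argument.
if [ -n "${1:-}" ]; then
    case "$1" in
        # Assumption: skip files this script has already produced.
        *_decrypted.pdf) exit 0 ;;
    esac
    # PASSWORD is a placeholder; replace it with the PDF's real password.
    qpdf --decrypt --password=PASSWORD "$1" "$(decrypted_path "$1")"
fi
```

Note that basename only strips a lowercase .pdf suffix here, so a file ending in .PDF would come out as name.PDF_decrypted.pdf.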
I’ve had an issue pop up since the new year started that I recently discovered. Some rules that use date match naming, and that have always triggered without an issue, are all of a sudden in 2021 naming the file with a date that’s just totally random. For example, something I scanned on 2/16/2021 that should’ve been named 2021-02-17 - xyz.PDF got named 2020-05-31 - xyz.PDF. This is happening on more than one rule too. Haven’t had time to investigate, but it definitely started with 2021.