The matching rule inconsistency with Hazel is so frustrating. I’m trying to build out a system where Hazel matches particular bank account statements by the account number in a scanned PDF, but Hazel just randomly fails to match the number.
I’ve copy/pasted the number out of the PDF, and I’ve copy/pasted the number out of Hazel’s window that shows you what Hazel can see. Both have the exact same number, but Hazel refuses to match on it.
I’ve had the same problem and never found a solution. Hazel will be able to read an account number in one statement and fail to do it in another, from the same bank.
My “solution” is to download statements on my iPad Pro to specific iCloud folders and let the folder determine where Hazel files the document.
Frankly, I do not think the root cause is Hazel. In my experience, the problem is usually that the OCR text layer in the PDF (which I assume you are using) is inconsistent or changes between documents, or that the pattern you specified is incorrect.
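One way to check whether the OCR text layer is the culprit is to paste the text Hazel shows you into a small script and list every character that is not plain printable ASCII. What follows is only a minimal Python sketch, not anything Hazel provides; the function name `suspicious_chars` and the sample string are made up for illustration.

```python
import unicodedata

def suspicious_chars(text):
    """Return (index, code point, Unicode name) for each character
    outside printable ASCII, ignoring newlines and tabs."""
    found = []
    for i, ch in enumerate(text):
        if ch in "\n\t":
            continue
        if not (32 <= ord(ch) <= 126):
            name = unicodedata.name(ch, "UNKNOWN")
            found.append((i, f"U+{ord(ch):04X}", name))
    return found

# Hypothetical sample: it looks like "5555-5555" on screen, but the dash is a
# non-breaking hyphen and there is a zero-width space at the end.
sample = "5555\u20115555\u200b"
for item in suspicious_chars(sample):
    print(item)
```

If the list comes back non-empty for the account number, the text that looks identical on screen is not identical character for character, which would explain an exact match failing.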
I have found the Hazel forum very helpful in sorting these things out.
I have found good results with Hazel and have found nothing better.
So, in here let’s say my bank account number is a bunch of 5s. I know that’s right because I copied it straight out of the PDF statement that I scanned and OCR’d with my ScanSnap. Older statements have dashes, newer ones do not. When that doesn’t match, if you click on that red X you can get to the text that Hazel can see:
So, I copy everything out of there and paste into BBEdit, find my account number, copy it from BBEdit back into Hazel, and it still doesn’t recognize the number. Not sure what else I could do.
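Since older statements have dashes and newer ones do not, a comparison that ignores separators is more forgiving than an exact string match. Here is a rough Python sketch of that idea, assuming the account number is a bunch of 5s as in the post above. It is not a Hazel feature; `account_pattern` and `contains_account` are hypothetical helper names, and the set of separator characters is a guess at what an OCR layer might insert.

```python
import re

def account_pattern(digits):
    """Build a regex that matches the given digits with optional
    separators between them, but not inside a longer run of digits."""
    # Whitespace, ASCII hyphen, Unicode hyphens/dashes, zero-width space.
    sep = r"[\s\-\u2010-\u2015\u200b]*"
    body = sep.join(re.escape(d) for d in digits)
    return re.compile(r"(?<!\d)" + body + r"(?!\d)")

def contains_account(text, digits):
    """True if the account number appears in text, with or without separators."""
    return account_pattern(digits).search(text) is not None

account = "55555555"  # placeholder account number
print(contains_account("Account No. 5555-5555", account))   # dashed, older style
print(contains_account("Account No. 55555555", account))    # undashed, newer style
print(contains_account("Reference 1555555559", account))    # part of a longer number
```

The lookarounds at each end keep the pattern from matching when the digits are only part of a longer number, which matters if statements also carry reference or routing numbers.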
Your structure looks good. Does “contents contain match” work any better than “contents contain”? Supposedly it can fix situations where something is wrong with Spotlight indexing, because Hazel does its own slower crawl of the document.
There are two solutions. You can OCR again to a better standard, or instead of just using “contents contain”, use “contents contain match”. Hazel seems a little more reliable with “contents contain match”.
Yep. Just use “contents contain match”. Then Hazel will scan the actual file contents rather than just using the Spotlight data. It’s slower but far more accurate, and since it’s working in the background if it takes a little longer it doesn’t matter.
I’ve switched over to “contents contain match” and you are all correct, it works much, much better. For some reason I guess I thought that what “contents contain match” does is what “contents contain” did the entire time. Didn’t realize it was relying on Spotlight.
Readdressing this, the just-released Hazel version 6 has this feature under Core changes:
“Contents contain” should now be more reliable. While it still uses Spotlight, if it fails, it will fall back to using the same mechanism that “Contents contain match” does.
Also of interest, Hazel now appears to have built-in OCR on the fly for text matching.
Be aware that it is OCR only in the technical sense (characters are recognized); it is not traditional OCR, which creates a proper text layer or generates a document of the text. This is mentioned in the release notes.