Searching PDFs with regular expressions

kerjsmit · December 1, 2020, 2:24am

Does anyone know if there’s a Mac app that can search PDFs using regular expressions? My most immediate need is to be able to find word1 near word2, such as “Mac” within 5 words of “Power Users.” So far my research has turned up nothing. I’m also open to scripting or other methods that might somehow make this possible.

Thank you for your time!
Kerry

jahala · December 1, 2020, 2:27am

Do you have to search within a PDF app? What are you going to do with the results? I already have a script that can batch convert PDF files to text and search through the results using regular expressions. It uses the poppler libraries, which you can install via Homebrew.

jahala · December 1, 2020, 2:29am

Here is a web solution, but you have to give your documents to someone else: https://products.aspose.app/pdf/redaction

r2d2 · December 1, 2020, 3:04am

That’s really cool. How do you do that?

ChrisUpchurch · December 1, 2020, 11:43am

It seems like DEVONthink would fit the bill.

anon41602260 · December 1, 2020, 12:44pm

Yes, DEVONthink is very good for regex searching of PDFs and other document types.

If you do not want to go that route, you could look at FoxTrot Professional which supports regex searching. Or HoudahSpot, which also supports regex searching. Just limit your searches to a folder or file type and fire away.

dustinknopoff · December 1, 2020, 3:49pm

If you’re familiar with the command line, ripgrep-all will search pretty much any file format (and fast!). Its tagline mentions PDFs and Word docs, and it will even search chapter metadata in a video file!

kerjsmit · December 1, 2020, 11:53pm

My workflow would work best if I can search within a PDF app because I can automate the workflow with Keyboard Maestro. I have considered converting the PDF to text and searching it with something like BBEdit, but I was hoping to avoid that because searching that way would add hours and hours to the workflow. Thanks for your thoughts!

kerjsmit · December 1, 2020, 11:54pm

Thanks for the idea! Unfortunately, I work with proprietary files, so giving them to someone else isn’t an option.

kerjsmit · December 1, 2020, 11:55pm

I had considered checking out DEVONthink, but I wasn’t readily able to tell that it permitted regex. I’ll look into it further. Thanks!

kerjsmit · December 2, 2020, 12:05am

HoudahSpot can search text content with regex? I own it, but I thought it could only filter file names, paths, etc. that way. I’ll have to check into that. And I also will look into FoxTrot Professional. Thanks!

kerjsmit · December 2, 2020, 12:06am

Thanks! I don’t know if the command line approach would make my workflow any faster, but I’ll give it some thought.

anon41602260 · December 2, 2020, 12:09am

Yep

HoudahSpot 5 Manual 4.6.1: You may also further narrow down your search results using the filter text field on the top of the Results pane. This allows you to filter – show or exclude – files where the name, path, file name, parent folder or any folder in the path matches text or a regular expression you enter in the filter field.

So, find .pdf then filter the results with regex.

kerjsmit · December 2, 2020, 12:17am

Okay, thanks! I also see that FoxTrot Professional includes “neighboring words” as a built-in search parameter, which sounds exactly like what I need, depending on how the results are presented. Shall explore. Thanks so much for the help!

anon41602260 · December 2, 2020, 12:25am

You can’t go wrong with FoxTrot Pro.

jahala · December 3, 2020, 4:51am

This reminds me of pdfgrep. I had forgotten about it. It is available in homebrew.
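
For the original "word1 within N words of word2" question, pdfgrep takes an extended regular expression by default, so a proximity search can be approximated with a pattern like the one built below. This is only a sketch under a few assumptions: near_pattern is a made-up helper name, a "word" is taken to be a run of letters and digits, and the search terms are dropped into the pattern unescaped.

```shell
# near_pattern WORD1 WORD2 N
# Prints a POSIX extended regex that matches WORD1 followed by WORD2 with at
# most N other words in between, or the same with the two terms swapped.
# Hypothetical helper: the terms are used as-is, so escape any regex
# metacharacters in them yourself. There is no left-hand word boundary, so
# "Mac" would also match the end of "iMac".
near_pattern() {
  # A separator, then up to N "word + separator" pairs.
  _gap="[^[:alnum:]]+([[:alnum:]]+[^[:alnum:]]+){0,$3}"
  printf '%s%s%s|%s%s%s\n' "$1" "$_gap" "$2" "$2" "$_gap" "$1"
}
```

With pdfgrep installed, something like pdfgrep -i -n "$(near_pattern Mac 'Power Users' 5)" report.pdf (report.pdf being a placeholder) should print the matching lines with their page numbers. I am not sure how pdfgrep treats pairs that straddle a line break, so test that on your own files first.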

jahala · December 3, 2020, 5:01am

I am still curious about what you do with the results? Are you highlighting them? Copying them somewhere else? Just reading the text around them? Counting them? Are you saying that you have that action covered now by a Keyboard Maestro script?

I installed poppler with homebrew to get the pdftotext command, then wrote a script to recursively traverse a directory tree and convert all PDF files. For one directory, a zsh script would look like the following:

for i in *.pdf
do
  pdftotext -q -nopgbrk "$i"
done

You will get a .txt file next to every .pdf file. Then you can do your regex search however you want. I actually developed my script in PowerShell because I need to run it on Windows as well. You can also install PowerShell using Homebrew. Here is the equivalent PowerShell script, which will also recurse through all subdirectories.

param ([Parameter(Mandatory)]$sourcedir)
echo "Converting PDF files to text files. This may take a long time..."
Get-ChildItem "$sourcedir" -Filter "*.pdf" -Recurse | foreach {
  ## convert to text
  pdftotext -q -nopgbrk $_.Fullname
}
echo "Text files created."

Here is an example of how to search and replace in all the text files, applying several regexes at once, in PowerShell:

Get-ChildItem $sourcedir -Filter "*.txt" -Recurse | foreach {
  $filename = $_.Fullname
  (Get-Content $filename) -replace "(..-\d{6}).(\d{3})",'$1-$2' `
    -replace "(..-\d{6}) ?SHT\.? ?(\d{3})",'$1-$2' `
    -replace "(..-\d{6})(\d{3})",'$1-$2' `
    -replace "(..-\d{6})([^-.0-9])",'$1-001$2' | Set-Content $filename
}
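
Once the .txt files exist, a search that ignores line breaks can be sketched in plain shell as follows. The function name search_txt is hypothetical, and the approach (joining each file into one long line before running grep) is just one way to let a match span the line wraps that pdftotext leaves behind.

```shell
# search_txt DIR PATTERN
# For every .txt file under DIR, joins the lines into one long line and prints
# "path: match" for each case-insensitive match of the extended regex PATTERN.
# Hypothetical helper; assumes file names contain no newlines.
search_txt() {
  find "$1" -type f -name '*.txt' | sort | while IFS= read -r f; do
    # Turn line, carriage-return, and page breaks into single spaces.
    tr -s '\n\r\f' '   ' < "$f" | grep -Eio "$2" |
      while IFS= read -r hit; do
        printf '%s: %s\n' "$f" "$hit"
      done || true   # grep exits 1 when a file has no match; not an error here
  done
}
```

For example, search_txt . 'Mac[^[:alnum:]]+([[:alnum:]]+[^[:alnum:]]+){0,5}Power Users' would list every place where "Mac" is followed by "Power Users" with at most five words in between, even when the two fall on different lines of the converted text.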

MartinPacker · December 3, 2020, 8:15am

Not to dampen anyone’s spirits, but it is quite possible to have two strings physically next to each other on a page, such that they look like one string. A search for a hit spanning the pair would then be more difficult to accomplish, particularly if there were a third string between them in the PDF data stream.

jahala · December 3, 2020, 5:15pm

Good point. I have experienced that before, mostly in drawings and tabular data, but it can happen in narrative style documents as well.

ygjose · December 3, 2020, 6:48pm

I don’t know if you are interested in jumping into the corpus linguistics area. A free corpus tool called LancsBox lets you upload all your PDF files to the software as a corpus, supports regex search, and finds words (collocates) of a given search term within a specific span (for example, from 5 words to the left to 5 words to the right). You can find a video tutorial in the User Guide (GraphColl tab).
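
To make the idea of collocates within a span concrete, here is a rough shell sketch that counts them in a plain text file. It is not how LancsBox works internally; collocates is a made-up function name, the search term must be a single word, and words are assumed to be runs of letters and digits compared case-insensitively.

```shell
# collocates FILE NODE SPAN
# Prints "count word" for every word found within SPAN words to the left or
# right of NODE in FILE, most frequent first. Hypothetical helper: NODE must
# be a single word, and overlapping windows count a word once per window.
collocates() {
  tr -cs '[:alnum:]' '\n' < "$1" | tr '[:upper:]' '[:lower:]' |
    awk -v node="$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')" -v span="$3" '
      NF { w[++n] = $0 }    # collect the words in order, skipping blanks
      END {
        for (i = 1; i <= n; i++) {
          if (w[i] != node) continue
          for (j = i - span; j <= i + span; j++)
            if (j >= 1 && j <= n && j != i) count[w[j]]++
        }
        for (k in count) print count[k], k
      }' | sort -k1,1nr -k2,2
}
```

Running collocates notes.txt Mac 5 on a converted text file (notes.txt is a placeholder) lists the words that occur within five words of "Mac", with the most frequent at the top.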