Extract PDF text with Mac Automator

It’s a shame that I only just learned about Automator, an app built into macOS. After a quick search, I found that it can automate a lot of tasks.

One automation that benefits me is extracting text from PDF files automatically. I often do textual analysis on plain text, so PDF is not a convenient format. I only need to create one action in Automator and then drag PDF files onto it to get all my TXT files. The quality is pretty good, better than a paid app.

Steps:

  1. Open Automator
  2. Select Application when the window prompts you for a document type
  3. Select PDFs in the library, then select “Extract PDF Text” and drag it into the workflow
  4. Save the workflow and you’ll see the application on your desktop or wherever you saved it
  5. Drag PDF file(s) onto that app icon and voilà!

That’s a good one.

What I could really use is something that would parse a PDF into a tree, so that I could modify the tree and then write it back out again.

Nice! Thanks for posting!

@ygjose I was not aware of this feature. Can it extract text from PDFs in bulk? For example, my potential use case is an export of several hundred Apple Notes. As you probably know, one can only export Apple Notes as PDFs. Will the Automator action handle multiple PDFs in bulk, or does one have to perform the action one PDF at a time?

Yes, create this Automator workflow as a ‘Service’ (instead of the ‘Application’ shown above) and call it, say, Extract PDF text. Then select all the PDFs you want to extract text from and right click->Service->Extract PDF text. You should be able to batch extract the text.

Ah, what a blast. I am looking for a way to extract text from my PDF bank statements and write the information that matters most to me into an Excel file.

But how could I use this script and Automator to

  1. reduce the extracted text to only the parts I definitely need and
  2. write them into an external file that I could import and use in Excel (for example, CSV)

For 1)
Since I only need specific parts of my banking reports, I could either export all the data into a CSV and import only the columns I need into Excel, or I would need to pick out specific parts of the PDF.

Does anyone have an idea of how to accomplish this?

Many thanks…

Unless the information you want is minimal and your bank statement is better organized and formatted than any I have seen, it won’t be easy. If every line that you want from the statement has an identifying label, you could use a shell script in the terminal to “grep” all desired lines into a separate file. And you would still need some cleanup to get it ready to import. I once had to write a COBOL program to extract all patient financial data for a client converting to our system. They had lost their old system and all they had was a printed report. I used every trick I could think of to sort through the data as page breaks occurred randomly and special cases in the format abounded.
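To make the “grep the labelled lines” idea concrete, here is a minimal sketch in Python rather than a shell script. It assumes the text has already been extracted by the Automator action, and that every line you want starts with a known label. The labels `Deposit` and `Withdrawal`, the sample statement, and the function names are all made up for illustration; a real statement will almost certainly need more cleanup than this.

```python
import csv
import io

# Hypothetical labels: replace with whatever marks the lines you want to keep.
LABELS = ("Deposit", "Withdrawal")


def filter_lines(text, labels=LABELS):
    """Return (label, rest_of_line) pairs for lines starting with a known label."""
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        for label in labels:
            if stripped.startswith(label):
                rows.append((label, stripped[len(label):].strip()))
                break
    return rows


def write_csv(rows, out):
    """Write the filtered rows to a file-like object as CSV with a header row."""
    writer = csv.writer(out)
    writer.writerow(["label", "rest"])
    writer.writerows(rows)


# Made-up sample standing in for the text extracted from a statement.
sample = """Statement for March
Deposit 2021-03-01 1,200.00
Some page footer
Withdrawal 2021-03-05 45.10
"""

rows = filter_lines(sample)
buf = io.StringIO()
write_csv(rows, buf)
print(buf.getvalue())
```

For real use you would read the extracted TXT file instead of `sample` and write to a file opened with `newline=""` instead of the in-memory buffer. Splitting the rest of each line into date and amount columns is deliberately left out, since that depends entirely on how your bank formats its statements.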

You’d probably be better off hiring @drdrang (Dr. Drang) to do it for you. 🙂

Take a look at the Hazel app.

Would PDFZone work for you? (PDFZone - PDF Automation, on the Mac App Store)

"It can also extract the desired content to useful data (CSV). In seconds."

Great tip! I just did a quick compare/contrast with Nitro PDF’s export function and Automator won because it reflowed the text and Nitro did not. It wasn’t perfect, of course, but getting usable text was much faster since I didn’t have to do the work of taking out the line breaks.
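For anyone stuck with an export that does not reflow the text, taking out the line breaks can be roughly scripted. This is only a sketch of the general idea, not what Automator or Nitro does internally. It assumes paragraphs are separated by blank lines, and it does not rejoin words hyphenated across lines. The function name `unwrap` is my own.

```python
import re


def unwrap(text):
    """Join hard-wrapped lines inside each paragraph.

    Assumes blank lines separate paragraphs; words hyphenated across
    line breaks are NOT rejoined.
    """
    paragraphs = re.split(r"\n\s*\n", text.strip())
    joined = []
    for paragraph in paragraphs:
        lines = [ln.strip() for ln in paragraph.splitlines() if ln.strip()]
        joined.append(" ".join(lines))
    return "\n\n".join(joined)


# Made-up sample with hard line breaks inside two paragraphs.
wrapped = "This line was\nbroken by the export.\n\nSecond paragraph\nalso wrapped."
print(unwrap(wrapped))
```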