Is there a Mac app that can extract clean text from PDF files?

Hi all, first post here after a couple on the Facebook group previously.

I’m hoping that someone here will be able to point me in the direction of a Mac app that can extract clean, unformatted text from PDF files.

Copy & paste is no good to me, as the text I get from these files (magazine layouts, originally made in InDesign) comes out with extra spaces and other formatting glitches caused by the justified column layout on the page.

So I’m looking for a program that can pull out the clean text, ready for web upload. Hoping I don’t have to go into every InDesign file again, as there are A LOT of them…

Thanks in advance.

Simon.

I don’t know of any Mac apps, free or paid, that do a great job of automatically reformatting text and removing extra spaces from PDFs on their own. I’ve had good, but hardly perfect, results using Readdle’s PDF Expert.

https://pdfexpert.com/how-to-convert-pdf

You might try a free app like Skim (which I love and used to use as my main PDF reader prior to buying PDF Expert), and see what, if anything, it can do for you.

The cheapest and perhaps best option is to do it manually: select, copy, and paste, then use an app like TextSoap (which has been around for decades) to remove extra spaces and garbage characters.

I don’t own TextSoap, but use the powerful GREP functionality in my purchased version of BBEdit to massage text.
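
For anyone without TextSoap or BBEdit, the same kind of clean-up can be roughed out in a few lines of Python. This is only a sketch of the idea, not what either app actually does: the function name `clean_pasted_text` and the specific patterns are my own assumptions about what "extra spaces and garbage characters" look like in text pasted from a justified layout.

```python
import re


def clean_pasted_text(raw: str) -> str:
    """Tidy text pasted from a justified, multi-column PDF layout.

    Hypothetical helper: the patterns below are assumptions about typical
    copy/paste debris, not a reproduction of any particular app's rules.
    """
    text = raw.replace("\u00ad", "")                        # drop soft hyphens
    text = text.replace("\r\n", "\n").replace("\r", "\n")   # normalize line endings
    text = re.sub(r"[ \t\u00a0]+", " ", text)               # collapse runs of spaces, tabs, no-break spaces
    text = re.sub(r" ?\n ?", "\n", text)                    # trim spaces around line breaks
    # Rejoin words split across a line break. Caveat: this also merges
    # genuine compounds such as "well-\nknown" into "wellknown".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)            # unwrap single line breaks
    text = re.sub(r"\n{2,}", "\n\n", text)                  # one blank line between paragraphs
    return text.strip()
```

Paste the copied text into a file, run it through the function, and check the result by eye. The hyphen rule in particular will need tuning for real magazine copy.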

Adobe Acrobat Pro does that for me on my PC at work. I haven’t tried it on my Mac at home, but assume it will do the same thing. It does require a little clean-up of text and formatting, however.

Some suggestions:

https://www.abbyy.com/en-us/finereader/pro-for-mac/

https://pandoc.org

Script InDesign using AppleScript

Amazon’s mechanical Turks?

Calibre, Abbyy FineReader, and Adobe Acrobat all do a pretty good job of converting an entire PDF into a plain text document, Word document, or EPUB, but if you want to extract your highlights as clean text there are fewer options. My workflow involves extracting highlights on iOS using PDF Expert, copying and pasting from the email it creates into Drafts 5, and then using a Drafts action to clean up the text with RegEx and send the text file to Bear. I used to do something similar on the desktop using BBEdit and a saved RegEx search, but I find this workflow easier since I generally read my PDFs on iOS.

https://jon.dehdari.org/tutorials/pdf_tricks.html gives some clues about some things you might be able to install and use.
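
With A LOT of files to get through, a command-line tool driven by a short script may save time. The sketch below assumes you install `pdftotext` from Poppler (for example with `brew install poppler`); that tool is my assumption, and the page linked above may point you to something else. The function names `pdftotext_command` and `extract_folder` are made up for illustration.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf, txt, first_page=None, last_page=None):
    """Build a pdftotext command line (hypothetical helper).

    -layout is deliberately left out so the tool tries to follow reading
    order rather than preserve the physical column layout.
    """
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]   # first page to convert
    if last_page is not None:
        cmd += ["-l", str(last_page)]    # last page to convert
    return cmd + [str(pdf), str(txt)]


def extract_folder(folder):
    """Write a .txt file next to every .pdf in the folder."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        txt = pdf.with_suffix(".txt")
        # check=True raises if pdftotext fails on a file.
        subprocess.run(pdftotext_command(pdf, txt), check=True)
```

Calling `extract_folder` with the path to the folder of magazine PDFs would convert everything in one go. The output will probably still need a clean-up pass for stray spaces and hyphenation.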

Thanks very much, everyone, for the suggestions. I’m going to try some of these now and will let you know which ones did the job.

Hi Simon

If you want to extract unformatted text from PDF files on macOS, you can try SysTools PDF Text Extractor. It has an option to maintain the page formatting, which you can enable if you need it, and it can also keep the page numbers or Bates numbers.

Its Page Setting options let you choose which pages to extract text from, either by entering individual page numbers or by giving a page range.