Extracting highlights from PDFs

Gregg_Fox · August 5, 2018, 11:13pm

Hi, all.

As a graduate student in the life sciences, I read and mark up a lot of academic papers related to my field. One of the practices that I’ve found helpful for retention is storage of highlights I make from papers I read as plain text, usually paired with the relevant citation and, depending on where I choose to store the text, a copy of the marked-up PDF. This workflow is also quite helpful when it comes time for grant or paper writing.

Despite a great deal of tweaking over the years, alas, I’m still not completely satisfied with this process. In the current incarnation, I import PDFs into PDF Expert on my iPad Pro for reading, highlighting, and annotation using the Apple Pencil. From there, I use one of the sharing features in PDFE to email a summary of the highlights to myself, which is then funneled into Ulysses for storage (currently in the process of switching this latter stage over to Keep It, which is obviously more suited for archival purposes).

While this works in general, pieces of it often fail in the particular. The text of the extracted highlights especially is often plagued by broken formatting (errant spaces, carriage returns, etc.) that has proven difficult to resolve in an automated fashion, even using tools like Clean Text. It is also much more difficult than I would prefer to quickly generate a citation to append to these highlights for later reference.
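For the broken-formatting part, here is a minimal Python sketch of the kind of cleanup pass that could be scripted. The function name `clean_highlight` and the specific rules are assumptions for illustration, not something PDF Expert or Clean Text provides. It normalizes carriage returns, rejoins words split by a hyphen at a line break, collapses runs of whitespace, and removes spaces before punctuation.

```python
import re

def clean_highlight(text: str) -> str:
    """Tidy one extracted highlight (hypothetical helper, rules are assumptions)."""
    # Normalize carriage returns to plain newlines first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words hyphenated across a line break, e.g. "phos-\nphorylates".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat remaining line breaks as spaces and collapse whitespace runs.
    text = re.sub(r"\s+", " ", text)
    # Remove stray spaces before punctuation.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return text.strip()

raw = "The kinase  phos-\r\nphorylates its\rtarget , then\n stops ."
print(clean_highlight(raw))
```

One caveat: the hyphen rule also joins genuine compounds that happen to break at a line end (for example "wild-" / "type"), so the output still deserves a quick read-through.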

I would appreciate any advice from people who have arrived at a satisfying system for this sort of work, particularly as it’s done on iPad.

Thanks!

dfay · August 6, 2018, 12:01am

PDF Expert’s note exporting is kind of awful, and Readdle hasn’t shown any interest in improving it.

Highlights for iOS is currently in beta – I have been using it through several beta versions and it’s totally stable at this point, and the developer has been receptive to some of my suggestions about customising the output. It has the advantage of using standard PDF annotations. And the Mac version can automatically generate a citation if your article has a valid DOI.
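If you already have the DOI, a citation can also be scripted outside any particular app. The sketch below is not how Highlights does it; it assumes the DOI resolver's content negotiation, where sending an `Accept: text/x-bibliography` header to `https://doi.org/<doi>` returns a formatted reference. The function names and the example DOI are hypothetical, and `fetch_citation` needs a network connection.

```python
import urllib.parse
import urllib.request

def build_citation_request(doi: str, style: str = "apa") -> urllib.request.Request:
    """Build (but do not send) a request asking doi.org for a formatted citation."""
    url = "https://doi.org/" + urllib.parse.quote(doi, safe="/")
    # Assumption: the resolver honours content negotiation for this media type.
    headers = {"Accept": f"text/x-bibliography; style={style}"}
    return urllib.request.Request(url, headers=headers)

def fetch_citation(doi: str, style: str = "apa") -> str:
    """Send the request and return the citation text (requires network access)."""
    request = build_citation_request(doi, style)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8").strip()

# "10.1000/xyz123" is a placeholder DOI used only to show the request shape.
request = build_citation_request("10.1000/xyz123")
print(request.full_url)
```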

The alternative that I’ve been using for the last decade (since before there was native PDF annotation in Preview) is Skim – it’s Mac only with Skim annotations but it imports and exports standard PDF annotations flawlessly. It has the advantages of great AppleScript support and a robust templating system so you can get exactly the output you want.

The missing piece of the system you describe, though, is a citation manager. It may not seem necessary early in grad school but it will make life much easier in the long term if you start working in it now. I use BibDesk, even though I don’t actually write in BibTeX, because I can attach multiple files to each record. It autofiles all my PDFs, and I have scripts to generate a new Ulysses sheet for reading notes from a template, including the citation, and to add corresponding keywords (in Ulysses) and/or MacOS tags (for other files). Once I’m done reading the PDF in Skim, another applescript automatically extracts all the notes to the clipboard, ready to paste into Ulysses.

Ideally, everything I read begins with a BibDesk entry prepared first – in practice I read a lot on iOS first, then convert PDF annotations to Skim annotations & create the citation and links later. But I never leave anything I’ve read (for work – I’m a cultural anthropologist) in a free-floating PDF without getting it into BibDesk eventually.

Unfortunately BibDesk is also Mac-only – I built a custom FileMaker Go database to mirror my library on iOS.

Most of what I describe could be done with Papers or Bookends – I think Bookends has more robust scripting – but I’ve stuck with BibDesk b/c I have a working system.

Why would you move your notes out of Ulysses? I have close to 1,000,000 words in Ulysses and search is still blazing fast on all devices.

Quahog · August 6, 2018, 1:23am

I think some academic publishers really do their darndest to make copying and pasting from PDFs of reprints (does anyone not ancient like me still call them reprints??) difficult. I am a professor and run a senior seminar that involves students giving talks based upon primary literature. They present them as though they were the authors at a scientific meeting.

I have to make a schedule, which includes the titles of the papers. You’d think simply copying and pasting the title from the PDF would work, but noooooo… it is amazing the number and variety of extraneous characters that get carried over when I copy and paste just the title. This wasn’t the case just a few years ago, but this past year it was really really bad. I think it’s a newish thing.

I wish I had a solution for you. I just manually clean up the formatting. But it’s not a lot of work in my situation because I’m pasting only the titles.

One thing that might work is to strip out the OCR and then OCR it again, but in the sciences, I think a lot of the special characters we use would create its own set of problems.

oldblueday · August 6, 2018, 2:36am

I used to use Papers for iOS and it did a pretty good job of this, but I believe they were bought out by a different company and so I don’t know what the new version (ReadCube) will be like.

Papers had the benefit of getting all the metadata (author, journal, issue, pages, etc) used for references automatically.

Just tried it on PDF Expert and it seems to work fine. May need some regex replacing to remove the “highlight [page xxx]:” in front of every line.
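A rough sketch of that regex replacement in Python, assuming each exported line really does start with a prefix shaped like "highlight [page 12]:" (the exact wording of the prefix is taken from the description above and may differ in your export). `strip_prefixes` is a hypothetical name.

```python
import re

# Matches an optional indent, "highlight", "[page N]", and the trailing colon
# at the start of any line. Case-insensitive, since capitalization may vary.
PREFIX = re.compile(
    r"^[ \t]*highlight[ \t]*\[page[ \t]+\d+\]:[ \t]*",
    re.IGNORECASE | re.MULTILINE,
)

def strip_prefixes(summary: str) -> str:
    """Remove the per-line highlight prefix, leaving other lines untouched."""
    return PREFIX.sub("", summary)

summary = "Highlight [page 3]: First point\nhighlight [page 12]: Second point\nPlain line"
print(strip_prefixes(summary))
```

If you want to keep the page numbers for citations, capturing the `\d+` in a group and moving it to the end of the line would be a small change.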

BradG · August 6, 2018, 2:44am

Some very helpful suggestions here.

Still blows my mind that this isn’t something that an application/software company has tried to fix, once and for all, and offer a ‘start-to-finish’ product that does this all – which suggests it’s probably not as easy as one would think…
It’s remarkable how similar this feels to the email application débâcle: there are plenty of them out there, and several that do most of what one wants well, but none that do all one wants, well…

Good news to hear that Highlights’ iOS app is coming along nicely!

BradG · August 6, 2018, 2:55am

My partial “solution” to this, was (a.) using different highlighting colours for different ‘types’ of highlighting (i.e. Red = crux; purple = quotable; electric blue = consider further etc.); and (b.) using an iOS snippet shortcut to prefix any annotation ‘notes/comments’ I wrote to myself.

Goodreader exports a similar summary, but also adds the highlight colour ‘code’ in the summary text. Since it allows you to keep using the same colours (it remembers the 14(?) most recently used – I use 7 of them), those codes are consistent – and allows one to filter/search, based on them, later.

The unique snippet shortcut that prefixes all my comments, also allows for later searching, to only find those comments that I have made on the PDF.

By adding the UUID of the PDF to the comments summary, and to the metadata of the PDF, I was then able to ‘tie-together’ the summary and PDF itself. So, in short, I tried to introduce the link by hard-wiring it into both at the ‘source’, so to speak.

This is not exactly what OP is after, but since it might trigger some thoughts on how to approach this, figured I would pop this up.

JohnAtl · August 6, 2018, 3:03am

Readdle has always been very responsive to my support requests. The latest involves extraneous characters in exported highlights. Turns out this is due to ligatures in the PDF files themselves. Readdle’s support passed this on to the developers, so hopefully it will be resolved soon.
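In the meantime, if the exported text contains actual Unicode ligature characters (an assumption; some PDFs produce other garbage that this will not catch), they can be expanded with a small translation table. This sketch deliberately maps only the Latin ligatures U+FB00 to U+FB06 rather than applying full NFKC normalization, because NFKC would also flatten superscripts and similar characters that matter in scientific text. `expand_ligatures` is a hypothetical name.

```python
# Latin ligature code points and their plain-letter expansions.
LIGATURES = str.maketrans({
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
    "\ufb05": "st",  # long s + t
    "\ufb06": "st",
})

def expand_ligatures(text: str) -> str:
    """Replace ligature glyphs with ordinary letters, leaving all else as-is."""
    return text.translate(LIGATURES)

# The superscript characters in "Ca²⁺" are left alone.
print(expand_ligatures("e\ufb03cient \ufb01ndings in Ca\u00b2\u207a \ufb02ux"))
```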

My suggestion: use the highlights that you export as a reference and rewrite everything in your own words. This will help with retention, and will help when you’re presenting, writing your dissertation, etc.

DavidBeck · August 6, 2018, 3:14am

Liquidtext will let you highlight in multiple colors or even relate quotes from different parts of the document. You can export your highlights as a Word document, while you still have the original highlighted PDF available. Each excerpt automatically has listed below it the file name and page number, which would help you with citations.

I’ve not tried it on academic papers (I’m a lawyer), so I don’t know if academic footnotes throw off the excerpting, but I use it to read things on iPad and then export excerpts through WebDav to use on other computers.

–David Beck

warren · August 6, 2018, 8:35am

I use PDF Expert on iPad and Mac. I extract highlights and clean them up with Text Soap. The text files are saved to a single folder. I have The Archive and DevonthinkPro pointed at that folder, which gives me a lot of search power. I have given up on expensive citation managers - Papers and Zotero. I have tried others, but they seem to fail once your library gets too big. I now save the bibliographic information into BibDesk and keep all my PDFs in the one Dropbox folder, which is great for Mac and iPad. I have a HoudahSpot search saved to find the PDFs when I need them.

dfay · August 6, 2018, 11:14am

The only issue I’ve encountered trying to do something similar is that Finder (and iOS Files app) performance can suffer when there are a lot of files in one folder - I have +/- 3500 PDFs linked to BibDesk records, & I have BD auto filing into folders by first letter of author’s surname.

Do you use BD linked to files on Dropbox on multiple Macs? Or are your PDFs not linked to BD? I recall there were sporadic issues b/c of the way BD did file aliases a couple years back, & so don’t keep linked PDFs in DB.

dfay · August 6, 2018, 11:16am

I have LiquidText and feel like I should use it more - maybe if I ever get a 12” iPad - it feels like there’s not enough room to fully take advantage on 10.5.

Gregg_Fox · August 6, 2018, 4:55pm

Thanks for your response and helpful series of suggestions! I actually did use Highlights for a while a few years ago (i.e. much earlier in its development life cycle) but was eventually turned off by its instability. Glad to hear that this has been largely resolved—I really liked the app when it was working.

With regard to citation managers, you’re quite right, of course. I use a combination of Papers (for citation insertion and bibliography generation) and Readcube (for paper organization, reference follow-up, and discovery). This system has been a bit unwieldy, so I’m actually quite excited for the upcoming unified app, the beta for which is launching this week. I had never heard of BibDesk and will definitely give it a look—thanks for the recommendation!

Finally, the Ulysses/Keep It transition is more of an experiment than anything. Similar to your experience, I’ve been nothing but pleased with Ulysses’s performance and search speed, even as the number of sheets balloons. The switch to Keep It was motivated more by the desire to reduce mental clutter by maintaining a more rigid functional separation between Ulysses as strictly for writing, with a distinct place (Keep It, in this case) for research and archiving. I also really like to take screenshots of key, marked up figures for later reference, and Keep It’s image handling is superior to Ulysses’s, in my opinion. Once I incorporate some of the suggestions from this thread, I’ll definitely need to re-evaluate and see if the new arrangement is actually more effective and friction-free.

Gregg_Fox · August 6, 2018, 4:59pm

Thanks, everyone, for your responses and suggestions! It’s heartening (and frustrating) to hear that I’m not alone in my search for a more streamlined methodology.

I’m eager to experiment with a few of these tools and see how they fit!

dfay · August 6, 2018, 5:25pm

That makes a lot of sense - I generally write in Scrivener and use Ulysses for notes. I hated trying to write in it, but as a repository the combination of keywords, fixed groups & smart filters, plus the inbox model, can’t be beat IMHO. Never going back to NVAlt.

Also looking forward to seeing Papers/Readcube, just hoping they finally import BibTeX with keywords intact…

warren · August 7, 2018, 8:08am

I have my pdfs in a Dropbox folder (over 5,500: so far so good); but interesting you say that about the iOS files app, because I have noticed some flakiness there too when I am trying to access another folder with large numbers of documents. I keep my BibDesk in Dropbox also, but I don’t link the PDF files to BibDesk. I use a naming convention (first author, year, title) to help me with search. I mostly use Alfred just to search up the PDF, but Houdahspot when I need something more heavy duty.
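A small Python sketch of a naming convention like this one. The order (first author, year, title) comes from the description above; the space separators, the character filtering, and the 60-character title limit are assumptions, and `pdf_filename` is a hypothetical name.

```python
import re

def pdf_filename(first_author: str, year: int, title: str, max_title_len: int = 60) -> str:
    """Build a 'Author Year Title.pdf' name that is safe for Dropbox and iOS."""
    # Replace anything that is not a word character, space, or hyphen with a space.
    safe_title = re.sub(r"[^\w\s-]", " ", title)
    # Collapse whitespace, then trim long titles so filenames stay manageable.
    safe_title = re.sub(r"\s+", " ", safe_title).strip()[:max_title_len].rstrip()
    return f"{first_author} {year} {safe_title}.pdf"

print(pdf_filename("Smith", 2018, "Extracting highlights: a review/of tools"))
```

Keeping the three fields in a fixed order means a launcher or Spotlight-style search for "Smith 2018" lands on the right file without any linked database.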

andreasl · March 27, 2020, 4:35pm

I’m currently in the process of finding a way to extract/export highlights that I have done in PDF Expert on iPadOS.

Ideally I want to export those highlights as text or .md, but I can’t figure out how to do it on the iPad and I don’t have a Mac anymore.

Can anyone help out? Perhaps @dfay can shed some light on how to do this in PDF EXPERT on iOS if it’s working alright now?

dfay · March 27, 2020, 5:22pm

I looked briefly at PDF Expert and it seems to still only export in its crazy html format. You could write a script or something to clean that up - I did in Editorial with a series of find replace but it no longer seems to work & I haven’t used it in a few years.
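For anyone who wants to try the script route, here is a generic sketch in Python using only the standard library. I have not checked it against PDF Expert's actual export, so the tag handling is an assumption: it treats common block-level tags as line breaks, skips `style` and `script` content, and emits each remaining line as a Markdown bullet. The names `TextExtractor` and `html_to_markdown_list` are hypothetical.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, starting a new line at block-level tags."""
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3"}
    SKIP_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)

def html_to_markdown_list(html: str) -> str:
    """Strip tags and return one Markdown bullet per non-empty line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(f"- {line}" for line in lines if line)

sample = "<html><style>p{color:red}</style><p>Cells &amp; tissues</p><p>Second <b>bold</b> one</p></html>"
print(html_to_markdown_list(sample))
```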

At this point I’m using PDF Viewer instead, and the Highlights TestFlight for the occasional export (which works great for markdown), but mostly using Skim on the Mac.

andreasl · March 27, 2020, 7:05pm

So there is no way of exporting highlights as just text from PDF Expert on iOS?

Can I import the highlighted PDFs to another app in iOS or iPadOS and then export to plain text or text from there? Will PDF Viewer do this?

dfay · March 28, 2020, 4:37am

Yes, that’s correct as far as I can tell. Back when I used the app I emailed the developers to request it but that was years ago & it didn’t seem to be a priority.

But the PDF Expert annotations are readable by other programs.

PDF Viewer can export the text of annotations but it puts them in a PDF file. You can use Shortcuts to Get File Contents to get the text but there’s no formatting.

Best bet though is to get Highlights. It’s close to release and certainly usable.

ryanjamurphy · March 28, 2020, 6:29am

@andreasl whatcha trying to do with the highlights? I left PDF Expert for PDF Viewer to get at exported highlights as @dfay describes. Share → Annotation Summary → the shortcut, and then it uses the exported text to create a link post on my blog.

Here it is, as-is: https://www.icloud.com/shortcuts/5b0b9634183e46b089c4970e9762f1fd

(It switches back to PDF Viewer in the middle so that you can copy a link to the article, then inserts that link in a multi-markdown header. I found this more reliable than trying to remember to copy the link before exporting the highlights.)
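For readers not on Shortcuts, the same idea can be sketched in a few lines of Python. This is not a port of the shortcut: the `Title` and `Link` metadata keys and the blockquote layout are assumptions about what a MultiMarkdown-style link post might look like, and `link_post` is a hypothetical name.

```python
def link_post(title: str, url: str, highlights: list[str]) -> str:
    """Build a post with a metadata header followed by blockquoted highlights."""
    # MultiMarkdown-style metadata: "Key: value" lines, then a blank line.
    header = f"Title: {title}\nLink: {url}\n"
    # One blockquote per highlight, separated by blank lines.
    body = "\n\n".join(f"> {text}" for text in highlights)
    return f"{header}\n{body}\n"

print(link_post("A paper", "https://example.com/paper", ["One", "Two"]))
```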