Creating a list of all my PDFs that auto-updates?

Hi Everyone,

I was hoping someone would have an idea of how I could solve (or get started on solving) an issue with my PDF collection in my lab folder. I have several individuals working from one Box folder (it's like Dropbox, but mostly focused on big organizations).

I have several project folders with article pdfs that my team has collected. I’d like to create a list of the PDFs we have collected across folders so that new team members can use the list to see whether we have an article they found or are looking for.

Is there a way to create a list that includes the name of each article (from the metadata or the PDF file name), pulls the metadata so it can include the author and publication year, and includes the file location (possibly as a link)? Would it be possible for the list to auto-update as PDFs are added and removed? Possibly also listing duplicates.

I have these folders synced to my desktop at work, so I can use Hazel scripts and AppleScripts. It would have to scan through subfolders.

I know I could use something like DEVONthink, which is fine for personal use, but not all my students have DEVONthink.

Does anyone have any ideas or suggestions?

Are you certain that the metadata for every PDF is accurate (or exists at all)? I find metadata for academic articles, even newer ones, is hit or miss. How important is it that you have that metadata included in your listing, rather than just a list of file names and paths?

If you can live with just file names and paths, then terminal could do this handily (once I’m back on my Mac I’ll type up some options).


You can do some of that (probably not the links) with HoudahSpot and export the resulting list.


Nice! I have a license for HoudahSpot too. I may be able to automate some of it based on https://forums.houdah.com/post/tips-tricks-automate-houdahspot-6829942

The links would be nice, but the file path works too.


Could probably use the path info in the export and prefix the necessary "file:///" or "box://" URL type, though URL-encoding the file names would take some extra handling.
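Something like this rough sketch could handle it, assuming pdflist.txt holds one absolute path per line (the paths here are just examples); it leans on python3 for the percent-encoding:

while IFS= read -r p; do
  # quote() leaves "/" alone but percent-encodes spaces and the like
  python3 -c 'import sys, urllib.parse; print("file://" + urllib.parse.quote(sys.argv[1]))' "$p"
done < ~/Desktop/pdflist.txt > ~/Desktop/pdfurls.txt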

This thread is long, but it has a lot of useful info and discussion of the pros and cons of using PDF metadata (some by me):


I would love this; you should make a new thread. This will go unnoticed buried here.

Yes, please share how this could be done with the terminal. The file names should be enough; I'm just being greedy. I was planning on passing on the fun of entering/editing the metadata to my students. The author info might be the only thing I enter, and that would give them another way to look up articles.

Maybe a different tack. What about a central repository of all references, with each student using a reference manager that feeds from/to that repo? They can filter the relevant references based on some tag, smart or dumb folders, etc.

I assume there is some common thread weaving through the references. A common repo would facilitate serendipitous discovery among the references.


The most basic terminal search would look something like

find [DIRECTORY] -iname "*.pdf" > ~/Desktop/pdflist.txt

where [DIRECTORY] is the root directory where your PDFs are stored (find will search subdirectories too). This writes the results to a text file on your desktop, but in theory the file could live anywhere, including in a shared directory. It does include the full path for each PDF, which may or may not be desirable.

You could attach this command as a folder action script so that any time a new PDF is added to a given folder, it triggers this terminal command and updates the pdflist.txt file. If you change the command so that the pdflist.txt file is written to a shared directory, then the file is updated for everyone.
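For example (the Box path here is hypothetical; substitute your own synced folder):

find ~/Box/Lab -iname "*.pdf" > ~/Box/Lab/pdflist.txt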

This isn't especially elegant, though; I really do feel like good file naming and a good search tool (like HoudahSpot) would be more effective than having someone grep/search through a text file.


Getting a list of filenames via terminal is easy enough.

Now, if you want to pull metadata, you can go one of two routes: pdftk or exiftool.

So, you want to get author and title from the metadata? "exiftool -author -title filename.pdf" will do that for you. Somebody already posted an example of how to recursively get the filenames; just pipe them to exiftool and then you have filename, author, title. Add whatever you want. Easy enough.
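For instance, a quick sketch (the directory is just a placeholder; -print0/xargs -0 keeps filenames with spaces intact):

find ~/Documents/PDF -iname "*.pdf" -print0 | xargs -0 exiftool -title -author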

PS: "exiftool -title -author -directory -r ." adds the path and is recursive. No piping involved. The list is ugly but contains everything, and since the formatting is consistent, converting to CSV is trivial.
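In fact, exiftool can emit CSV directly; a minimal sketch, using the same placeholder directory:

exiftool -csv -title -author -r ~/Documents/PDF > ~/Desktop/pdflist.csv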


Thank you Scott, I figured this out. So much text! I didn't know about folder action scripts before, so thanks for teaching me something new!

The PDFs are spread across folders by project. It's easy enough for me to find my PDFs in my folders, but the students don't have the option to sync the files to their own desktops. They only have access through the web browser (I removed local download of files from my lab computers due to previous issues with accidental deletion of data/files). Any idea how to shorten the file paths in the text document? I can't seem to find information on this; I'm probably using the wrong search terms.

I am working on a better naming system, but in the meanwhile (it will take some time to decide on one and then switch) this might be helpful. Any recommendations on useful naming?
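On the path-shortening question, a couple of terminal sketches that should do it (the Box folder here is hypothetical):

# 1. Paths relative to the root: cd into the folder first, so entries start with "./"
cd ~/Box/Lab && find . -iname "*.pdf" > ~/Desktop/pdflist.txt

# 2. Filenames only, dropping the directories entirely
find ~/Box/Lab -iname "*.pdf" -exec basename {} \; > ~/Desktop/pdflist.txt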


Lars, thanks for the suggestion!

I downloaded exiftool. Would you mind giving me more of an indication of how I would use exiftool?

Would it be something like this in terminal?

exiftool find [DIRECTORY] -iname "*.pdf" -title -author -directory -r ~/Desktop/pdflist

As you can see I'm an amateur, but I can at least figure it out once I see an example.

If your PDFs are in ~/Documents/PDF: exiftool -title -author -directory -r ~/Documents/PDF

To get that into a text file:
exiftool -title -author -directory -r ~/Documents/PDF > ~/Desktop/pdflist.txt

This, of course, assumes there are only PDFs in the folder and subfolders.
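If other file types are mixed in, exiftool's -ext switch can restrict it to PDFs; a sketch with the same example path:

exiftool -title -author -directory -r -ext pdf ~/Documents/PDF > ~/Desktop/pdflist.txt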

Ok, rewrote the thing.

This will recursively run through a folder (all subfolders) and output a tab-separated table (filename, subfolder, title from metadata, author from metadata), which can be imported into Numbers or whatever (cd into the folder first):

find . -iname "*.pdf" -exec exiftool -T -filename -directory -title -author {} \; > ~/Desktop/pdf.csv

The shell is your friend! :)


Yeah, in the long term I think that would be the most useful solution for the OP.

Here's a FileMaker-based example; I only know of it because its creator has also written a lot of BibDesk scripts that I use.

http://www.sysecol2.ethz.ch/intranet/How_to_write.html#LiteratureMY

But if I were to do it, I'd probably look at Airtable: use BibDesk to collect the bibliographic info and auto-file the PDFs, then export to CSV (or automate the updates with Hazel and a bit of Python) for the public version on Airtable.


Thanks @Lars! Looking forward to trying that tonight.

@dfay @JohnAtl Yes, that's ideal. I'll look into this too. Thanks for sharing the example. I could swing licenses for my lab computers, but there's no way I could provide students with licenses for off-campus work on their own computers (I have 15 students working with me). I allow them to work outside the lab for a percentage of their hours on literature reviews, and that's where this type of solution might be helpful. I don't want to ask them to pay for their own licenses either. I did receive positive reviews on a grant I'm resubmitting, so this might be something I include as part of the grant.

Never mind, I reread your entry and see what you mean now. I haven't tried Airtable in over a year. Great idea! Thanks for sharing it.

I forgot I had posted this on GitHub: here are some templates and a script to export from BibDesk to CSV that can be imported into Airtable (or FileMaker; if you have FM or FM Go, I can share an empty version of the database as well if you want to take a look).