2 TB of PDFs - Workflow Recommendations Needed

Hey MPU,

I started a project and I am not sure of the best workflow / app to organize and arrange things. I have roughly 2 TB of PDFs that I need to organize.

Do I just dump it all into a folder on my NAS and let DevonThink index?
Do I arrange it in folders by category/topic? (but overlap could occur)
Do I arrange by authors? (but then I don’t think of author names when I think of a book; I typically think of the book title)

Goals

  1. Catalog it all
  2. Ability to search and find later on whether by keyword, topic, title, author

Cool Future Ideas (but probably no time)

  1. Finding any correlations/connections between the PDFs

Have you thought of an academic solution such as Zotero or Mendeley?

Both designed to work with lots of PDFs.

You can also connect Zotero and Obsidian through plug-ins such as Zotero Integration. That allows you to make some of your connections.

1 Like

I’d just put it all in Devonthink, especially since it isn’t preorganized.

4 Likes

+1 for a Zotero or similar Bibliographic solution. That’s what those apps are made for.

As of now I keep them in Dropbox. Any changes I make to these PDFs are synced to other devices.
I’m not sure how best to use DEVONthink To Go alongside DEVONthink.

I would go with DEVONthink 3. That is really what it is for. I find it does the job for you frankly. I just put things in there now and I can usually find it without a lot of ‘hand sorting’ as it were if that makes sense?

2 TB of PDFs though!! wow. I have, over a lifetime now, maybe 4 GB and that includes a fair number of pictures and emails.

3 Likes

That is a lot of PDFs! Over my academic life I’ve acquired perhaps 40GB worth of PDFs, many of which will probably remain unread. I guess you don’t tell us how many files that equates to, which impacts on your workflow.

Anyway – Devonthink 3 all the way, I’d say. Import rather than index for speed and ease of use. A lot depends on the quality of the OCR (and the age of the files), which will massively impact what DT3 (or any other repository) can do with the material.

But I’d also urge some triaging before you dump it all in. It seems key to me that at least the file naming is consistent in some way and allows you to see easily what you are dealing with: e.g. Author - date - Short title.
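A consistent naming scheme is also easy to check by script before anything goes into a database. Here is a minimal sketch, assuming the convention is written as "Author - YYYY - Short title.pdf"; the exact separator, the four-digit year, and the function names are my assumptions for illustration, not anything DT3 requires.

```python
import re

# Assumed convention: "Author - YYYY - Short title.pdf".
# The " - " separator and four-digit year are illustrative choices.
NAME_PATTERN = re.compile(
    r"^(?P<author>.+?) - (?P<year>\d{4}) - (?P<title>.+)\.pdf$",
    re.IGNORECASE,
)


def parse_name(filename):
    """Return (author, year, title) if the name fits the convention, else None."""
    match = NAME_PATTERN.match(filename)
    if match is None:
        return None
    return (
        match.group("author").strip(),
        int(match.group("year")),
        match.group("title").strip(),
    )


def split_by_convention(filenames):
    """Split names into (conforming, needs_renaming) lists."""
    conforming, needs_renaming = [], []
    for name in filenames:
        if parse_name(name) is not None:
            conforming.append(name)
        else:
            needs_renaming.append(name)
    return conforming, needs_renaming
```

Running `split_by_convention` over a folder listing gives you a to-do list of files that still need renaming, which is a manageable way to triage a large pile.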

Then, some pretty clear ideas about what you want to get out of such a database. In my experience, getting your research questions (or aims) clear and then building a collection around those is much the better way than entering into a huge digital archive with only some vague notions of what it might produce for you at some future point. So finding ‘correlations’ is entirely dependent on your questions. The same goes for ‘arrange it in folders’ – you’d need a good taxonomy to start with, and you can only have that with a deep understanding of the material. No amount of AI (DT or otherwise) can really help beyond the basics here, I think. Ditto for keywording, which, personally, I’ve never had great luck with, e.g. by way of topic tagging: this is because I tend to revisit any one source with quite different research questions that were rarely anticipated by past me!

Tell us more about the nature of the texts: are they VCR manuals or epic poetry? Travel receipts or math publications?

Sounds like a fun project – let us know how it goes!

3 Likes

I’m interested in this too. I am downsizing my office (I’m a professor) and have piles of paper articles to scan. I already use Bookends for the journals and stuff I get online — which is basically everything for the past 15 years.

Maybe it’s time to bite the bullet and get Devonthink, but … $$$ …

I have separated my document management from my bibliography/citations manager (Zotero in my case, but I am very interested in Bookends and have a plan to look into it more, because it seems much more flexible in editing output styles). So all PDFs live in DT3, which I think is much more scalable and able to put up with robust search routines. I am very glad I have done so, since I have much more flexibility. But I appreciate that was easier done at the start of my career than backwards from the end!

DT3 is worth the money, I’d say. They also have a very generous trial period: ‘30 days or 150 hours of runtime, whichever is later’: see here.

2 Likes

Yes, I would very much like a follow-up on this one. That is a lot of PDFs.

$$$, really it is cheap for what it is. I know I sound like a fan boi and I am I guess. I use Bookends but just for citations and formatting now; I don’t keep any PDFs in it, for @SebMacV’s interest. It is worth it, and the few tools that I add to the cost of my MacBook Air, which is all I have nowadays, come to really very little in terms of cost. Some journals now cost more per year if, like I do, one likes one’s own personal subscription, than the total cost of my DEVONthink 3 type tools.
I use DEVONthink 3 all the time, it is my reason for being on Mac now really, but though I don’t have heavy use out of them I also have Mellel and Bookends and find them well worth the costs. I use Keyboard Maestro all the time too but that is a slightly different category, I have all my snippets in that as a matter of fact…

2 Likes

I’ve just contacted DT about an academic discount. And I really love Bookends — not just as a piece of software (which is very good) but because of the frequent updates that respond to user requests, and the community that is very helpful.

2 Likes

What you propose sounds to me to be akin to creating an electronic library. You do not however suggest that you are doing this exercise to create a library that you will use as a bibliography for citations. So, while apps such as Bookends, Zotero, and Papers used for bibliography databases will provide what you want, they are also loaded with tools for citation management, and you may find using them to be distracting as such.

The next set of apps to consider are database apps. Devonthink (DT) is oft mentioned here, but do not neglect to review apps such as EagleFiler, KeepIt, and others that can provide for your needs to a) catalogue and b) search. Here I would echo the recommendation from @SebMacV – before you start cataloguing, determine what needs the catalogue structure must address. By example, you might create a perfect catalogue structure based on the size of and number of images in the PDF, but what good is such perfection when you discover (later) that your true interests are better served if you would instead catalogue the PDFs by number of words and number of images? With electronic database apps, you can change the catalogue approach “on the fly”, so doing an up front consideration is not critical. It simply helps limit frustration in playing around with only a vague idea about the outcome you want by creating a catalogue.

Given the size of the content you have, you may need to be aware of differences that occur when apps index files in place versus importing files into their own database structure. DT does either approach. With 2 TB of files, importing could likely require at least another 2 TB of storage, while indexing will require significantly less than another 2 TB of storage. Other caveats apply.

Finally, with your desire to “find connections”, you step into the realm of using the bibliography for research. Why and how you do this are personal choices. The one caveat here is … the app that you use to manage the database should not negatively impact your ability to carry out research within the database in a manner that you find to be most effective to your personal style. By example, from experience, I would find some of the apps mentioned here … Mendeley and Zotero … to be cumbersome or entirely ineffective in doing research. On the other side, while DT would be among my top suggestions, I would caution to review the index versus import approaches well enough before you start your project. And in closing, I might suggest the bibliography app Bookends, even with the citation management overhead, simply because it is not encumbered by the need for additional storage (all PDFs remain locally stored), because it more recently offers great options to export “research” documents in markdown for apps such as Obsidian, and because reading/annotating PDFs on the iPad version is a pleasure to do.

Hope this gives some useful insights to start your new adventure.


JJW

3 Likes

Thank you everyone for the insight.

It started as a small personal project and then kept growing with the number of colleagues, priests, etc involved.

@SebMacV If they were travel receipts or VCR manuals, now that would be much easier and perhaps interesting to find correlations. LOL. The texts are a mix of books, research papers, and scanned manuscripts, in multiple languages, with material from various churches and the Diocese, lots of theology, etc.

In one of the folders I opened, I found an archive of digitized photos and videos, which might explain the 2 TB jump. Not sure how many of them there are, but I feel like this could be a lifetime project if I continue doing this on my own :rofl:

I do have DT3, but have always used it for small things, nothing of this magnitude yet. My own OCD was at war with me when I perused the folders; there are very broad categories.

It is also likely that all this will need to be shared and available internally since more colleagues are involved, maybe OneDrive?

There’s not much concern for citations (in my head) at the moment. My concern is knowing that Book A exists as Book A.pdf and resides in Folder B (if the folder is a topic). OR not worrying so much about a folder hierarchy and focusing on what was mentioned earlier: each file is properly named, e.g. Author - Title.pdf
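For the "Book A exists as Book A.pdf and resides in Folder B" part, a plain catalog file can be generated straight from the folder tree, whatever app ends up on top. This is only a rough sketch: the column names and the `build_catalog` / `write_catalog` helpers are made up for illustration, and it treats the file name (minus extension) as the title.

```python
import csv
from pathlib import Path


def build_catalog(root):
    """List every PDF under root with its title (file stem) and containing folder."""
    root = Path(root)
    rows = []
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() == ".pdf":
            rows.append({
                "title": path.stem,
                # Folder relative to the library root; "." means the root itself.
                "folder": path.parent.relative_to(root).as_posix(),
                "file": path.name,
            })
    return rows


def write_catalog(rows, csv_path):
    """Write the catalog rows to a CSV that opens in any spreadsheet app."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["title", "folder", "file"])
        writer.writeheader()
        writer.writerows(rows)
```

The resulting CSV can be shared with colleagues as a simple "what do we have and where is it" list while the bigger decisions are still open.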

Note - When I say personal project and colleagues, and doing things on my own, it may sound contradictory. As it stands, the project library is growing with the number of folders and files that keep getting contributed by colleagues and/or churches. I am the only one brave enough to categorize LOL

1 Like

You might even want to look into DevonThink Server which allows others to view and/or edit your database. I think it’d be loads more powerful than OneDrive—though there may be other specialised server based document management systems out there.

3 Likes

Perhaps it would be worthwhile analysing the sizes of individual files and seeing if you can cut out a lot of the data storage bulk by just storing a few excessively large files elsewhere, or compressing things differently.
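Something along these lines would surface the outliers. It is a minimal sketch that only assumes the library is reachable as a normal folder; the function name and the default of 20 results are arbitrary choices.

```python
import heapq
from pathlib import Path


def largest_files(root, count=20):
    """Return the `count` biggest files under root as (size_in_bytes, path) pairs,
    largest first."""
    sizes = (
        (path.stat().st_size, str(path))
        for path in Path(root).rglob("*")
        if path.is_file()
    )
    return heapq.nlargest(count, sizes)
```

If a handful of files account for a big share of the 2 TB, those are the candidates for moving elsewhere or recompressing.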

1 Like

With these questions, I would go DEVONthink all the way. Overlapping can be managed with replicants, for example, search would not pose a problem, and folder categorisation should be helped by its AI indexing capability once you get a significant corpus catalogued.

3 Likes

Cool!

I think that, as @Vincent_Ardern recommends, it would make a lot of sense to sort by file type and consider splitting the image and video data from e.g. PDFs and other document types. Perhaps a job for DaisyDisk or similar.
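If you would rather script it, a per-extension total shows quickly how much of the bulk is video and images versus PDFs. A rough sketch, assuming file extensions are a good enough proxy for file type; the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def bytes_by_extension(root):
    """Total the bytes stored under root, grouped by lower-cased file extension."""
    totals = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            extension = path.suffix.lower() or "(none)"
            totals[extension] += path.stat().st_size
    return totals
```

Calling `.most_common(10)` on the result lists the ten heaviest file types, which should make the split between document and media databases fairly obvious.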

In my experience, DT3 can function as a perfectly adequate storage site but doesn’t offer much benefit to hoarding large amounts of video and images: YMMV. DT3 thrives on text, and so that’s where I would focus my efforts, even splitting databases?

But you have a complex job here, with various stakeholders in an ever-growing archive. It sounds like you are providing a sterling service to your community and your diocese has just acquired its own archivist :slight_smile: – good luck!

Sorry, didn’t read this. I think this makes using DEVONthink trickier.

If you need to use OneDrive or Google Drive (both should find everything easily as long as the files are properly named as you suggest) then the issue is that you need to have a “taxonomy” for stuff and the metadata you need for easy retrieval.

Not so sure how I would add DEVONthink on top of that, perhaps indexing the OneDrive folder in your DEVON database? Beware of syncing here!

1 Like

FWIW, OneDrive will index PDFs, Microsoft file types, and possibly other kinds of files.
Google Drive automatically indexes most text based, image, and video file types.

OneDrive file indexing