2 TB of PDFs - Workflow Recommendations Needed

The issue with Google Drive and OneDrive is that, while they are powerful for retrieving relevant documents and have glorious sharing capabilities for teams, they are not exactly comfortable for doing research, annotating, and generally working across a bunch of documents, with everything sitting in its own browser tab. Now, I know that Google Drive for Desktop on a Mac is basically the same thing as iCloud Drive, and I assume OneDrive has similar capabilities, so having the document repository as a network file provider location on the Mac is a possibility.

I agree with the idea of going with DT3 – it’s what it’s made for.

But you may have to be careful with that amount of data, as you almost certainly won’t be able to put it all into a single database, so you’ll have to do some thinking about database structure beforehand.

AFAIUI the limiting factor for databases isn’t so much the physical size, but the total number of words (because that’s what’s being indexed). See this from the FAQ:

The suggested size of a DEVONthink database is measured in words, more than in bytes. You can see this in File > Database Properties for each database. DEVONthink indexes the text of files and stores that in an index for each database. When you open a database, its index is loaded into memory for faster operations. The more unique words in the indices, the more memory is required.

Generally speaking, a modern machine with 8 GB RAM can easily hold one million unique words and 100,000 items in the database.

Some thoughts:

  1. this is true even if you index rather than import the data.
  2. your digitised photos and videos have much less impact than the OCR’d PDFs, DOCs etc, but they’ll still have some impact. Consider having them in a separate database that you only open when you need it.
  3. consider having an archive database for items you know you’ll rarely use.

So, the more RAM you have, the bigger the database can be (or the more databases you can have open at once), but in any case it’s likely that you’re going to have to split the 2 TB into smaller databases unless you want to cripple performance.
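Since the limit is counted in unique words rather than bytes, it can be worth estimating that number before deciding how to split. Here’s a rough Python sketch (my own, not anything DEVONthink provides): it counts unique words across plain-text exports, so it assumes you’ve already extracted text from the PDFs into .txt files, e.g. as part of your OCR step.

```python
import re
from pathlib import Path

def unique_word_count(folder: str) -> int:
    """Rough estimate of the unique words in a folder of .txt exports.

    Assumes text has already been extracted from the PDFs; real PDFs
    would need a text-extraction step first.
    """
    words: set[str] = set()
    for txt in Path(folder).rglob("*.txt"):
        text = txt.read_text(errors="ignore").lower()
        words.update(re.findall(r"[a-z']+", text))
    return len(words)
```

Compare the result for a candidate split against the rule of thumb quoted above (roughly a million unique words per 8 GB of RAM) before committing to a structure.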

Given that you already have a naming convention (Author Title), you could try starting with an alphabetical arrangement (Database 1 A–E, Database 2 F–K, etc.) as an initial stab to see what the impact is. You can now search and see similar documents (using the AI) across any open database (you couldn’t in earlier versions), so this isn’t as limiting as it used to be. Once you’ve got all the documents into databases, you can try using the AI to pick up broad themes, which you could use to decide on thematic databases, if that’s what you want.
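If you want to pre-sort the files before building the databases, the alphabetical split can be sketched in a few lines of Python. The bucket ranges below are placeholders of my own, not a recommendation; adjust them once you see how many words land in each database.

```python
# Hypothetical bucket ranges; tune these after checking each bucket's size.
BUCKETS = {
    "A-E": "ABCDE",
    "F-K": "FGHIJK",
    "L-R": "LMNOPQR",
    "S-Z": "STUVWXYZ",
}

def bucket_for(filename: str) -> str:
    """Pick a database bucket from the first letter of an 'Author Title' name."""
    first = filename.strip()[:1].upper()
    for name, letters in BUCKETS.items():
        if first in letters:
            return name
    return "Misc"  # digits, punctuation, non-Latin initials
```

You could then move (or just list) files bucket by bucket and watch the word counts as each database fills up.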

There are many ways of going about this, but it’s still a huge amount of data that is going to take some thought to make usable (whatever app you use), so I think your best bet would be to ask a more detailed version of this question on the DT3 forum, because then the real experts will be able to give you a better idea.

HTH

I need to clarify my comments. Up thread I said I used DT extensively. I said it was effortless and easy.

I no longer use it at all.

I still use Zotero for my PDF library of scientific articles, but I no longer trust or use DEVONthink for anything vital or important to me. Everything is now moving into Obsidian.

I just don’t want anyone to go to DT based on my old suggestions, as I can no longer recommend it at all.

Can you explain more why you can’t recommend it?

Long and sordid tale of data loss and denial by DT tech support for months before admitting there was a problem. I quit trusting it with critical data and I love not being denigrated for posting about issues. If the attitude of tech support wasn’t so dismissive and arrogant I would have continued to try to work with it and continued to try to help debug the issue. I loved how that program worked but I finally said it was a dead horse and got off. Never going back either.

Similar love-hate relationship with DEVONthink. I still use it, it’s very useful and there’s very little else that works as a “data bucket”, but I don’t trust it. I verify and repair all my databases regularly. I’ve had too many problems with bugs and other strange issues over the years, where things might go wrong with no way to know until severe damage has been done. I have pretty much zero confidence that their technical support will ever be supportive, and I hate being blamed too.

+1 on DT support; one of them has an attitude problem… Let’s just say I’m happier after moving to another platform, though I’m still keeping most of my old data in DT.

Have you tried PDF Squeezer? I use it for big PDF files.

+1

PDF Squeezer does the job, no muss, no fuss.

How is the data segregated now?

Which other platform have you tried?

It’s all currently sitting in cloud storage (OneDrive). I made some attempts at sorting into general categories of folders, so I can at least work on Folder # and feel like I’m making progress. My main focus right now is making sure every file is properly named. I think once that’s done, it will give me more room and options.
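Since proper naming is the current focus, a short script can flag the stragglers. The “Author - Title.pdf” pattern below is only an assumed example of a convention; swap in a regex for whatever scheme you’re actually using.

```python
import re
from pathlib import Path

# Hypothetical convention: "Author - Title.pdf"; adjust to your own scheme.
NAME_PATTERN = re.compile(r"^[^-]+ - .+\.pdf$", re.IGNORECASE)

def badly_named(folder: str) -> list[str]:
    """Return the names of PDFs that don't match the naming convention."""
    return sorted(p.name for p in Path(folder).rglob("*.pdf")
                  if not NAME_PATTERN.match(p.name))
```

Running it against a local sync of the OneDrive folder would give you a to-do list per folder instead of eyeballing everything.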

For accessing PDFs via search, Foxtrot Professional Search is excellent.

I’ve plenty of experience with Devonthink (which I still use daily for PDF automation workflows) and Houdah. But my experience has been that Foxtrot helps me find what I’m looking for faster and with less mental friction than any other tool.

Agree. Foxtrot is my preferred search app on my Mac.