Feeding LLMs with PDFs

I want to build a local knowledge base for my Buddhist society, to feed it our scriptures so that users can chat with an LLM about Buddhism. Googling this subject seems to point to using Ollama, DeepSeek, RAG, and some PDF program, but what I don’t see is the front end for users to upload PDFs and train the LLM on them. The tutorials I’ve found were written in Nov 2024 or Jan 2025, so I’m thinking that surely by now there’s an easier way to do this. I wouldn’t mind paying for a cloud provider that offered such an interface. Has anyone here played with tools like this who can point me in the right direction?
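For anyone curious what those tutorials are wiring together, the core RAG idea is simple: find the passage in your documents most relevant to a question, then paste it into the prompt sent to the model. Here is a toy sketch in plain Python; real setups use proper embeddings (e.g. from Ollama) and a vector store instead of this crude word-overlap scoring, and the final model call is only indicated in a comment:

```python
# Toy illustration of retrieval-augmented generation (RAG).
# Not production code: real pipelines use learned embeddings and a
# vector database; all names here are illustrative.
from collections import Counter
import math

def embed(text: str) -> Counter:
    """Crude bag-of-words 'embedding', for illustration only."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

chunks = [
    "The Four Noble Truths describe suffering and the path to its end.",
    "The society meets on Sunday mornings for meditation practice.",
]
question = "What are the Four Noble Truths?"
context = retrieve(question, chunks)[0]
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
# 'prompt' would then be sent to the local model, e.g. via Ollama's API.
```

The "front end" the tutorials skip is essentially chrome around these steps: a PDF-to-text extractor feeding the chunk list, and a chat box feeding the question.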

Have you looked at NotebookLM?

2 Likes

I have not. I’ll take a look. Thanks @WayneG

1 Like

Disclaimer: I have not used this:

2 Likes

I did use NotebookLM briefly for a Google class, and must say it was a nice experience. Of course, it’s still LLM based, so you shouldn’t trust it entirely. The upside was that it (as far as I can recall) DID keep track of the sources used for its reply, making it fairly easy to either fact-check OR use as a pointer to an interesting passage in the original text.

1 Like

Just browsing while in bed (it’s night here). I like that Msty can run LLMs locally, and the Studio is a nice chat interface (good), but it does not seem to come with a way to upload PDFs and train. I’ll read up on their blog and features later.

Another vote for NotebookLM. I’ve built a couple of notebooks to help me probe some key texts relating to the history of and theories about the visual arts, photography in particular. Nothing replaces reading the texts themselves, but NotebookLM has helped me surface connections between texts that I might not have found on my own.

I decided to give it a try after I came across this passage in an article by D. Graham Burnett (Will the Humanities Survive Artificial Intelligence?):

Another example: I’ve spent the past fifteen years studying the history of laboratory research on human attention. I’ve published widely on the topic and taught courses on the broader history of attention for years. Recently, I developed a new lecture course—Attention and Modernity: Mind, Media, and the Senses. It traces shifting modes of attention, from the age of desert monks to that of surveillance capitalism.

It’s a demanding class. To teach it, I assembled a nine-hundred-page packet of primary and secondary sources—everything from St. Augustine’s “Confessions” to a neurocinematic analysis of “The Epic Split” (a highly meme-able 2013 Volvo ad starring Jean-Claude Van Damme). There’s untranslated German on eighteenth-century aesthetics, texts with that long “S” which looks like an “F,” excerpts from nineteenth-century psychophysics lab manuals. The pages are photocopied every which way. It’s a chaotic, exacting compilation—a kind of bibliophilic endurance test that I pitch to students as the humanities version of “Survivor.” Harder than organic chemistry, and with more memorization.

On a lark, I fed the entire nine-hundred-page PDF—split into three hefty chunks—to Google’s free A.I. tool, NotebookLM, just to see what it would make of a decade’s worth of recondite research. Then I asked it to produce a podcast. It churned for five minutes while I tied on an apron and started cleaning my kitchen. Then I popped in my earbuds and listened as a chirpy synthetic duo—one male, one female—dished for thirty-two minutes about my course.

What can I say? Yes, parts of their conversation were a bit, shall we say, middlebrow. Yes, they fell back on some pedestrian formulations (along the lines of “Gee, history really shows us how things have changed”). But they also dug into a fiendishly difficult essay by an analytic philosopher of mind—an exploration of “attentionalism” by the fifth-century South Asian thinker Buddhaghosa—and handled it surprisingly well, even pausing to acknowledge the tricky pronunciation of certain terms in Pali. As I rinsed a pot, I thought, A-minus.

But it wasn’t over. Before I knew it, the cheerful bots began drawing connections between Kantian theories of the sublime and “The Epic Split” ad—with genuine insight and a few well-placed jokes. I removed my earbuds. O.K. Respect, I thought. That was straight-A work.

2 Likes

Another vote for NotebookLM – Steven Berlin Johnson, who writes fascinating books, is a key member of the Google team building NotebookLM. It’s fast, simple to use, and far more robust than most public LLMs.

Katie

3 Likes

I love that essay from D. Graham Burnett. A rare moment of grounded hope in the AI/education conversation.

1 Like

Agreed! Burnett’s article really did help me think about how I might use an LLM for my own learning projects. So did Steven Johnson’s articles about the tool (e.g. see here and here)—as @KVZ mentions above, he was deeply involved in its development.

Here’s a video, too.

2 Likes

I downloaded Msty Studio and I am having fun with it. Before I posted this thread on MPU, I watched some YouTube tutorials and downloaded Ollama, a RAG tool, and Jupyter Notebook to try to get this running locally. But I failed miserably because there are a lot of command-line thingamajigs that need to be executed: bash commands, Python scripts. Msty Studio made all of this simple. Just click and go.

It has a nice interface that lets me update the DeepSeek model I had downloaded, instead of my having to write out the command line to do so.

It has a Library of Prompts which I can create and choose in my conversation.

Most important for me, it has a Knowledge Stack feature where I can create groups of stacks. Each stack goes by a project name and can contain Files, Folders, Notes, YouTube links, etc.

I am having fun exploring this tool, so thanks @MevetS for your recommendation! You should try it yourself :slight_smile:

3 Likes

I have used this, with mixed but compelling results. I’m going to dig into their training series and try to learn what to tweak.

I had used it to break down hours of transcripts from interviews, so my main use case was ensuring the data stayed private. It helped a bit, if I recall, but didn’t change my game enough to overhaul my workflow.

1 Like

There are a couple of ways to influence an AI/LLM, and what is being described here is more like providing context than training. (And thus also the “middlebrow” nature of responses noted in the article quoted above.) The fact is that most LLMs were first trained on Wikipedia entries and on “verified” sources like news reporting. That is the core of the model. They have had extensive additional training, but the core remains that middlebrow prose.

And I want to emphasize prose here: these are not knowledge machines so much as prose machines. I have no doubt that much of what people do by providing context (through files uploaded to Google Drive and then linked into NotebookLM or Gemini, or through the other model interfaces) will be largely useful, but I would caution against thinking that errors will not emerge. Such “hallucinations” are simply baked into a model that only knows how to mimic prose. (I can’t say that often enough: the model has no notion of semantics, only pragmatic discourse production.)
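To make the context-vs-training distinction concrete: “training” changes the model’s weights, while “providing context” only changes the text of the prompt. Uploading a PDF to NotebookLM does the latter. A minimal sketch (the function and its wording are illustrative, not any real product’s API):

```python
# "Providing context" is just string assembly: retrieved passages are
# stuffed into the prompt, and the model itself is left untouched.
def build_prompt(question: str, passages: list[str]) -> str:
    """Assemble a prompt from source passages; no weights are modified."""
    context = "\n\n".join(passages)
    return (
        "Answer strictly from the sources below and cite them.\n\n"
        f"SOURCES:\n{context}\n\n"
        f"QUESTION: {question}"
    )

prompt = build_prompt(
    "What does Buddhaghosa say about attention?",
    ["[Excerpt] ...manasikara (attention) is the directing of mind..."],
)
```

This is why the quality ceiling is set by the underlying model’s prose, as the post above notes: the uploaded sources constrain what the model talks about, not how well it reasons.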

3 Likes

I heard Demis Hassabis spin LLM hallucinations as something akin to creativity. Let’s just say that one of my eyebrows shot way, way up.

I think the researchers who liken LLMs to “bullshit machines” are onto something. They’re building off of philosopher Harry Frankfurt’s definition of the term in his book On Bullshit.

An academic paper: Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models.

An online course two professors put together for their students: Modern-Day Oracles or Bullshit Machines?

2 Likes

Well yes, but the AI scrapers just consume EVERYTHING in their path. Which includes sources like Facebook, TikTok, Reddit, X, Truth Social, 4Chan and others that might not be entirely accurate in all respects.

If you can get to it, so can the AI scrapers - and whatever they found is embedded somewhere in their models.

1 Like