LLM question regarding training/indexing my own knowledge base

I’m not sure if this is a question for our group, but I thought I’d try here. Is it possible to install a large language model on my machine to have it ingest all the materials in my knowledge management system? I’d want this so that I could search through my “knowledge” and find things, maybe find connections I had not necessarily manually indexed, etc. I’m not interested in mining my material to have AI write. I just want to be able to unearth the treasures from this library of material I’ve assembled over the years.

Forgive me if this is a banal question that is LLM 101 and everyone is already doing this.

1 Like

Good question. I haven’t tried running an LLM locally, but I do know that most LLMs at least allow a fairly large “system prompt” that you can configure as a starting point for a query.

Something like “Assume you are a [insert quick summary of your skill/background/job] and have the following information [insert local knowledge data]”.

Obviously, there are limits to the size of these “system instructions” and the format of the data.

Hopefully, running an LLM locally would allow either a much bigger set of system instructions or a way to scan a local database.
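In code, that prompt-stuffing idea might look roughly like this with the ollama Python package (just a sketch; it assumes Ollama is installed and running, and the model name and note path are placeholders):

```python
# Minimal sketch: stuff a local note into the system prompt, then ask a question.
# Assumes Ollama is running locally and the `ollama` package is installed
# (pip install ollama); the model name and file path are placeholders.
from pathlib import Path

import ollama

notes = Path("~/notes/project-summary.md").expanduser().read_text()

system_prompt = (
    "Assume you are a personal knowledge-management assistant. "
    "Answer questions using only the following notes:\n\n" + notes
)

response = ollama.chat(
    model="llama3",  # whichever model you have pulled locally
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "What themes connect these notes?"},
    ],
)
print(response["message"]["content"])
```

The obvious catch is the one above: everything you stuff into the system prompt has to fit in the model’s context window.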

I’m not sure you could do full training on your own large database - that might be beyond the capacity of an individual computer in both processing power and storage?

1 Like

This part is certainly possible. I run LLMs locally with Ollama. I have a Mac Studio M1 Max with 64 GB RAM and so far so good. The more memory the better for running LLMs.

I suspect it is possible to train a model with your data, but I’ve no experience with that.

Good luck and have fun!

1 Like

A couple of GUI options to look at are LM Studio and AnythingLLM. I think both might struggle with huge numbers of documents, though.

2 Likes

I use Enchanted, a macOS-native GUI front end for Ollama.

I’ve set up a Keyboard Maestro macro that launches Enchanted when Ollama is launched.

There are multiple options to run LLMs locally. Hopefully one meets @iPersuade's needs.

2 Likes

For sure. I picked those two because they do either RAG or inference tuning on a local store of documents.

4 Likes

This is great! Thank you @SpivR, @MevetS, and @cornchip. I’m going to try some of this out. I have an M1 Max with 64 GB of RAM, so I may be in the ballpark - I’ve certainly seen the advice about the copious quantities of RAM that large language models seem to gobble up.

I did read an article about training, and putting aside the machine specs, it seems like a Herculean task: cleaning up and normalizing data, organizing the data in specific ways, etc. Anyway, I’ll let you know if I discover anything interesting along the way!

2 Likes

Quick bits to follow up. I think @cornchip is our most experienced player in this regard. Truly training a model is beyond any of us - none of us have the compute power or the data to do it.

So we’re left with either fine-tuning or loading text into the context window of the model.

Context Window - this is what NotebookLM is doing: it has a massive context window (around 2 million tokens) that gets loaded at the front of your chat, as if the material were pre-loaded at the start of the conversation. The challenge for local models is that I’ve not seen anything with a context window larger than 128K tokens (very roughly a couple hundred printed pages). In addition, the context is re-processed whenever the model reloads, which results in high GPU and fan usage. (Ask me how I know.)
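To get a feel for whether your notes would even fit, a back-of-the-envelope token estimate is enough. This sketch uses the common “roughly four characters per token” rule of thumb, which is only an approximation (real tokenizers differ), and the folder path and 128K limit are placeholders:

```python
# Rough sketch: estimate whether a folder of Markdown notes fits in a context window.
# The ~4 characters per token figure is a rule of thumb, not an exact count.
from pathlib import Path

CONTEXT_LIMIT_TOKENS = 128_000  # example limit; depends on the model you run

total_chars = sum(
    len(p.read_text(errors="ignore"))
    for p in Path("~/notes").expanduser().rglob("*.md")
)
approx_tokens = total_chars // 4

print(f"~{approx_tokens:,} tokens of notes vs. a {CONTEXT_LIMIT_TOKENS:,}-token window")
if approx_tokens > CONTEXT_LIMIT_TOKENS:
    print("Too big to load wholesale - this is where RAG or fine-tuning comes in.")
```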

Fine Tuning - your work gets added to the model itself. I don’t know the details, because I’ve not fallen down this rabbit hole yet. The upside would be that you pay the time/GPU/energy price once and it’s baked into your model. However, you would only want to bake in bits of knowledge that are unlikely to change.
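If anyone does fall down that rabbit hole: most fine-tuning tooling wants training examples in a simple JSONL file, one example per line. The exact schema varies by tool, so treat this as an illustrative sketch with made-up content rather than a recipe:

```python
# Illustrative sketch only: write question/answer pairs mined from your notes as JSONL,
# the general shape most fine-tuning tools expect. Field names vary by tool.
import json

examples = [
    {
        "messages": [
            {"role": "user", "content": "What does my 2019 memo conclude about venue selection?"},
            {"role": "assistant", "content": "It recommends filing in the district where the contract was signed."},
        ]
    },
    # ...one entry per question/answer pair
]

with open("finetune-data.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```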

Last thoughts. Always assume the LLM (even your own) is misrepresenting your own notes/facts. Always double-check. It also helps to understand that these things don’t reason; they’re just incredibly good pattern matchers.

2 Likes

Thank you. This is quite helpful. I do get the part about pattern-matching vs. actual reasoning. My own legal research tests have shown me that large language models will not be replacing junior associates anywhere in the near term. :slight_smile:

Be careful what you optimize and watch for second-order effects. I wrote a longer article on this last week:

// You may not care about the role of a ScrumMaster. What matters here is that people are suggesting the role could be replaced by an LLM.

Why Your ScrumMaster Shouldn’t Fear AI (Unless They’re Just a Ticket Jockey)

LLMs are all the rage, and people are finding some interesting uses for them. For years we’ve been hearing they will eliminate various knowledge-worker jobs. Ten years ago, Geoff Hinton promised us that radiologists would soon be eliminated. This week, I saw someone say ScrumMasters should be scared.

If your ScrumMaster is a Jira jockey and your Product Owner only creates tickets, then both should probably be scared. However, the risk isn’t the LLM. Neither person has taken on the role. Instead of scare tactics (which get great LinkedIn engagement), it’s more interesting to understand where we might want to use tools that save time and energy.

Optimizing the right thing. Making everything in the system go faster often creates new and unexpected problems. Generating code with Copilot speeds up a small part of the development process while greatly increasing the number of defects (See: Can GenAI Actually Improve Developer Productivity? | Agile Pain Reliefs Experimental Blog) - this isn’t effective.

A good ScrumMaster (and their team) should regularly study their workflow to understand where their bottlenecks are. The simplest way to do this is to look at where and why work is piling up. In front of a workstation (e.g. Analysis or QA)? Do we spend a lot of time fixing defects in the work initially sent for testing? Do we have many items blocked or waiting on people outside the team?

With the bottleneck in hand, we can work on making improvements. Maybe an LLM will be useful here; it depends. The Copilot code-generation tool isn’t going to help a team that is already piling work up at QA. The Theory of Constraints taught us decades ago that if a step in your process isn’t the bottleneck, optimizing it is a waste. If we optimize outside the bottleneck, we make the bottleneck worse.

LLM to generate code? Under most circumstances, this is a waste. Replace Daily Scrum with an LLM? Also a waste. (Even worse, it just means the tool vendor thought Daily Scrum was about reporting and not collaboration.) …

When you’re evaluating tools, you should engage in a little Systems Thinking and look for second-order effects:

  • LLM-generated code has more defects and is often harder to read, which increases our maintenance burden. Eliminating Daily Scrum reduces communication and collaboration, pushing our work closer to a feature factory.
  • LLMs use randomness to do their work. Sometimes they hallucinate and make mistakes. When considering any tool, ask whether you will quickly notice and correct its mistakes. (Even Grammarly’s rephrasing errors take mental energy to catch, and that isn’t even code.)
  • LLMs don’t reason; the tools currently used and touted aren’t reasoning, they’re pattern matchers. A recent study from some Apple researchers lays bare the problem: LLMs don’t do formal reasoning - and that is a HUGE problem

New tools might be helpful in places where we either don’t have deep knowledge or we spend a lot of time: tools that help you gain insight from your team’s Scrum or Kanban board; tools that explain parts of legacy code and write better sample unit tests for it; tools that help QA, developers, and analysts collaborate more effectively from the start to write defect-free code.

2 Likes

What you are trying to do is called “retrieval-augmented generation,” or RAG. Basically, you are NOT training an LLM (which is very compute-intensive); rather, you are using an existing LLM but instructing the program to reference specific documents before responding to a prompt. Slightly more in-depth: the documents are pre-processed, encoded into numbers (embeddings), and stored in a vector database that the system searches when responding to a prompt.
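A stripped-down version of that pipeline might look like the sketch below. It is not production code: it assumes the ollama Python package, an embedding model such as nomic-embed-text, and a chat model are already pulled, the documents are placeholders, and it uses brute-force cosine similarity in place of a real vector database:

```python
# Minimal RAG sketch: embed documents, retrieve the ones closest to a question,
# and hand them to a local model as context. Assumes Ollama is running with an
# embedding model (e.g. nomic-embed-text) and a chat model (e.g. llama3) pulled.
import math

import ollama

documents = [
    "Notes on the 2021 contract dispute and venue questions...",
    "Research summary: limitation periods for fraud claims...",
    "Reading notes on knowledge-management workflows...",
]

def embed(text: str) -> list[float]:
    return ollama.embeddings(model="nomic-embed-text", prompt=text)["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# "Index" the documents; a real setup would persist these vectors in a database.
index = [(doc, embed(doc)) for doc in documents]

question = "What did I conclude about venue selection?"
q_vec = embed(question)
top_docs = sorted(index, key=lambda item: cosine(q_vec, item[1]), reverse=True)[:2]

context = "\n\n".join(doc for doc, _ in top_docs)
answer = ollama.chat(
    model="llama3",
    messages=[
        {"role": "system", "content": "Answer using only this context:\n\n" + context},
        {"role": "user", "content": question},
    ],
)
print(answer["message"]["content"])
```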

Many LLMs can now be used for this purpose. As mentioned above, Ollama is a very easy way to run LLMs on your computer. This is essentially a menubar app that runs a localhost server, and you would interact with the LLM via an API. One way to do that programmatically via Python is to use the excellent Llamabot. Llamabot includes a class called “QueryBot” that is exactly what you want to do – point the LLM to a collection of documents that it will use to answer your question. See examples here.
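If you would rather skip the wrapper libraries entirely, you can also talk to that localhost server directly. Ollama listens on port 11434 by default, so a quick test with the requests library looks roughly like this (the model name is a placeholder):

```python
# Sketch of calling Ollama's local HTTP API directly, without a wrapper library.
# Assumes Ollama is running on the default port 11434 and the model is pulled.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3",  # placeholder; any locally pulled model
        "messages": [
            {"role": "user", "content": "Summarize retrieval-augmented generation in one sentence."}
        ],
        "stream": False,  # return one JSON object instead of a stream
    },
    timeout=120,
)
print(resp.json()["message"]["content"])
```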

The desktop GUIs that were linked above, such as LM Studio, are very nice. They simplify interacting with the LLM through a dedicated application window. (In the Llamabot method above, you would do so programmatically through code or in a Jupyter window – great if you are including LLM functionality in your own personal app, but not so great if you simply want to chat interactively.) The ability to do RAG depends on the model that you choose to run. LM Studio’s blog has recommendations for doing RAG in LM Studio.

Good luck!

7 Likes