In the article below (and it’s well worth watching the video) they state:
The o3 model was found to hallucinate in 33% of answers to questions when tested on publicly available facts; the o4-mini version did worse, generating false, incorrect or imaginary information 48% of the time.
That’s not restricted to the current U.S. president. I’d argue it applies to most politicians in most countries. And it’s not merely a matter of inaccuracy—it’s mendacity at the highest levels.
LLMs that do not provide links to their sources should never be used for facts. They should only be used for editing what you wrote or for writing fiction.
If you want to use an LLM to answer a non-fiction question, then only use an LLM that provides URLs to the sources of its facts; and always read those sources.
Totally agree - the best non-fiction writing acknowledges all viewpoints.
If I ask AI to summarize news on a given topic, one of the best indicators of its usefulness is whether it quotes both CNN and FoxNews on the same topic. Ditto for human research.
The new “fast” models were primarily built to save OpenAI money and reduce data center costs (they marketed it as an efficiency drive). Ever since GPT-4, they have been trying to make the models cheaper to run. They were pretty transparent about this, and it comes at the expense of accuracy.
For coding, the situation is worse: the new models are producing much lower-quality code.
Sorry for being a bit thick. How would cheaper code make it less accurate? Don’t you just code in that it tells no lies? As it’s a written programme, I can’t understand how the developers are producing code that, when run, makes up information, unless they are telling it to do so. Seems nuts to me. It’s like making a coffee machine designed to make bad coffee.
They compressed and optimized the model to reduce the size of the vector database, so it has less to process, meaning lower data center bills and faster output. It has nothing to do with code (I was referring to the LLM’s ability to produce code being weaker). In the end, they are really listening to shareholders who want more profit, not users who want better results.
But should the LLM only be showing what’s actually out there, no matter the vector database size? The bit I don’t follow is that I thought an LLM is asked a question, searches through all the available information it has access to, and then summarises that data. From what I’m reading it’s more like, ‘stuff all this research, I’ll make it up instead.’
To try to simplify: they have been optimizing the data available to the LLM before training it, trying to make it as efficient as possible (so some data is removed as a result). They have also been tweaking other things to make it use fewer resources in the data center, such as optimizing the algorithms, which also changes the results to favor efficiency and speed.
As a result, the LLM has less available data and is tuned for efficiency at the expense of accuracy, so overall it hallucinates more.
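To make the general trade-off concrete, here is a minimal sketch of lossy compression (purely illustrative, and not a claim about OpenAI’s actual pipeline): squeezing numbers into a smaller format saves memory and is cheaper to serve, but the rounding throws away detail.

```python
import numpy as np

# Hypothetical illustration: storing values as 8-bit integers instead of
# 32-bit floats uses a quarter of the memory, but the rounding loses detail.
weights = np.array([0.1234, -0.5678, 0.9012, -0.3456], dtype=np.float32)

scale = np.abs(weights).max() / 127                    # map the range onto int8
quantized = np.round(weights / scale).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print("original:", weights)
print("restored:", restored)                           # close, but not identical
print("memory:", weights.nbytes, "bytes ->", quantized.nbytes, "bytes")
```

The same idea scales up: the smaller the representation, the cheaper it is to run, and the more detail gets rounded away.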
Shouldn’t that just result in either a less detailed answer or a “sorry, I don’t have that information”? “Hallucination” seems a politically correct term for making it up.
If LLMs reasoned, that would be possible, but they don’t; they just pick the most likely next word from the dataset. The LLM only knows relationships between words and picks the most probable word to come next; it doesn’t know whether the information is true or not. Words are encoded as numbers in the vector database, so the LLM never understands them the way we do.
This is an oversimplification, but it’s essentially a form of probability, along the lines of Bayesian statistics (see the Wikipedia article). It predicts how likely the numbers in the database are to appear together, then converts the numbers back into words when outputting the results.
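As a toy illustration of the “most probable next word” idea (the words and scores here are made up, not from any real model): the model ends up with a score for every possible next word, turns those scores into probabilities, and the most probable word wins, whether or not it happens to be true.

```python
import math

# Made-up scores for the next word after "The capital of France is ..."
scores = {"Paris": 4.1, "Lyon": 2.3, "Atlantis": 1.7}

# Softmax: turn raw scores into probabilities that sum to 1.
total = sum(math.exp(s) for s in scores.values())
probs = {word: math.exp(s) / total for word, s in scores.items()}

next_word = max(probs, key=probs.get)
print(probs)      # roughly Paris 0.80, Lyon 0.13, Atlantis 0.07
print(next_word)  # "Paris" - picked because it is probable, not because it is verified
```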
Put differently: LLMs take all the words they’ve been given (the system prompt, any of the user’s previous prompts, i.e. the “context”, and the user’s most recent prompt) and then guess what the most likely next words would be.
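In rough pseudocode (a sketch only; predict_next_word() is a stand-in for the actual neural network, which obviously can’t be reproduced here), the whole process is a loop that keeps appending the most likely next word to the pile of words it already has:

```python
def predict_next_word(context: str) -> str:
    """Placeholder for the real model: returns the statistically most likely next word."""
    ...

def generate(system_prompt: str, history: list[str], user_prompt: str, max_words: int = 200) -> str:
    # Everything the model "knows" about this conversation is one long string of words.
    context = "\n".join([system_prompt, *history, user_prompt])
    answer = []
    for _ in range(max_words):
        word = predict_next_word(context)
        if word == "<end>":        # the stop marker is predicted like any other word
            break
        answer.append(word)
        context += " " + word      # each guess is fed straight back in as more context
    return " ".join(answer)
```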
An easy way to think about this is to consider a forum thread. Imagine someone writes a post here on how iPads are not computers. Given all of the previous examples of that conversation, what are the likely replies to that post? That’s basically what LLMs do.
Chain-of-thought models are a funny example of this: these models “think out loud,” and in doing so they increase the number of words that go into the context. They aren’t actually thinking; they’re just adding to their own context in the hope that it increases the likelihood of successfully predicting the “right” final words.
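Continuing the sketch above (still hypothetical, reusing the made-up generate() helper), chain of thought boils down to this: generate some “reasoning” text first, drop it into the context, then generate the final answer on top of it.

```python
def answer_with_chain_of_thought(system_prompt: str, user_prompt: str) -> str:
    # Step 1: have the model "think out loud". These are predicted words like
    # any others; they simply get added to the context.
    reasoning = generate(system_prompt, [], user_prompt + "\nLet's think step by step.")

    # Step 2: generate the final answer with the reasoning now in the context,
    # which tends to make the "right" final words more probable.
    return generate(system_prompt, [user_prompt, reasoning], "So the final answer is:")
```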
So shouldn’t they be called LLMGs, with the G added to signify that they’re guessing? It’s tongue-in-cheek, but if they had ‘guess’ in the name they would probably be used more appropriately. As I tend not to read up too much on these things, I had incorrectly assumed they had knowledge, and lots of it, and would give me a good answer. Knowing they’re guessing changes my view on LLMs and places them in a similar position to politicians, who are also guessing most of the time.
You may find this article interesting. It breaks down what is actually happening when you talk to one of these chat thingies. Basically, as I understand it, you’re not talking to the large language model directly; you’re talking to a persona designed by the company, through an interface which pushes you towards assuming that you’re in a conversation. The persona is ‘incentivised’ to keep you talking and will adopt any strategy to continue the conversation; ‘truth’ is neither here nor there. The author has access to a base LLM without the other layers, so it’s interesting to see the difference in responses when you strip away the ‘chat elements’ from the base engine.
There are some consequences to this approach, which the author discusses, and some of them are surprising if you’re used to thinking of, say, ChatGPT as the ‘AI’ in its entirety.
I’ve only just started looking into AI and don’t pretend to understand much about the theory, but I found the article interesting. It’s a while since I read it and I have certainly missed out some nuances (or misunderstood some of it…), so if you are using chatbots or thinking about doing so, it’s worth reading the article directly to make up your own mind.