In the article below (and it’s well worth watching the video) they state:
The o3 model was found to hallucinate in 33% of answers to questions when tested on publicly available facts; the o4-mini version did worse, generating false, incorrect or imaginary information 48% of the time.
That’s not restricted to the current U.S. president. I’d argue it applies to most politicians in most countries. And it’s not merely a matter of inaccuracy—it’s mendacity at the highest levels.
LLMs that do not provide links to their sources should never be used for facts. They should only be used for editing what you wrote or for writing fiction.
If you want to use an LLM to answer a non-fiction question, then only use an LLM that provides URLs to the sources of its facts; and always read those sources.
Totally agree - the best non-fiction writing acknowledges all viewpoints.
If I ask AI to summarize news on a given topic, one of the best indicators of its usefulness is whether it quotes both CNN and FoxNews on the same topic. Ditto for human research.
The new “fast” models were primarily built to save OpenAI money and reduce data center costs (they marketed it as an efficiency drive). Ever since GPT-4, they have been trying to make the models cheaper to run. They were pretty transparent about this, and it comes at the expense of accuracy.
For coding, the situation is worse: the new models are producing much lower-quality code.
Sorry for being a bit thick. How would cheaper code make it less accurate? Don’t you just code in that it tells no lies? As it’s a written programme, I can’t understand how the developers are producing code that, when run, makes up information, unless they are telling it to do so. Seems nuts to me. It’s like making a coffee machine designed to make bad coffee.
They compressed and optimized the model to reduce the size of the vector database, so it has less to process, meaning lower data center bills and faster output. It has nothing to do with code (I was referring to the LLM’s ability to produce code being weaker). In the end, they are really listening to shareholders who want more profit, not users who want better results.
But should the LLM only be showing what’s actually out there, no matter the vector database size? The bit I don’t follow is that I thought an LLM is asked a question, searches through all the available information it has access to, and then summarises that data. From what I’m reading it’s more like, ‘stuff all this research, I’ll make it up instead.’
To try to simplify: they have been optimizing the data available to the LLM before training it, trying to make it as efficient as possible (so some data is removed as a result). They have also been tweaking other things to make it use fewer resources in the data center, such as optimizing the algorithms, which also changes the results to favor efficiency and speed.
As a result, the LLM has less available data and is tuned for efficiency at the expense of accuracy, so overall it hallucinates more.
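To make the general trade-off concrete, here is a minimal sketch of lossy compression (purely illustrative, and not a claim about OpenAI’s actual pipeline): squeezing numbers into a smaller format saves memory and is cheaper to serve, but the rounding throws away detail.

```python
import numpy as np

# Hypothetical illustration: storing values as 8-bit integers instead of
# 32-bit floats uses a quarter of the memory, but the rounding loses detail.
weights = np.array([0.1234, -0.5678, 0.9012, -0.3456], dtype=np.float32)

scale = np.abs(weights).max() / 127                    # map the range onto int8
quantized = np.round(weights / scale).astype(np.int8)
restored = quantized.astype(np.float32) * scale

print("original:", weights)
print("restored:", restored)                           # close, but not identical
print("memory:", weights.nbytes, "bytes ->", quantized.nbytes, "bytes")
```

The same idea scales up: the smaller the representation, the cheaper it is to run, and the more detail gets rounded away.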
Shouldn’t that just result in either a less detailed answer or a “sorry, I don’t have that information”? “Hallucination” seems a politically correct term for making it up.
If LLMs reasoned, that would be possible, but they don’t; they just pick the most likely next word from the dataset. The LLM only knows relationships between words and picks the most probable word to come next; it doesn’t know whether the information is true or not. Words are encoded as numbers in the vector database, so the LLM never understands them the way we do.
This is an oversimplification, but it’s essentially a form of probability, along the lines of Bayesian statistics (see the Wikipedia article). It predicts how likely the numbers in the database are to appear together, then converts the numbers back into words when outputting the results.
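As a toy illustration of the “most probable next word” idea (the words and scores here are made up, not from any real model): the model ends up with a score for every possible next word, turns those scores into probabilities, and the most probable word wins, whether or not it happens to be true.

```python
import math

# Made-up scores for the next word after "The capital of France is ..."
scores = {"Paris": 4.1, "Lyon": 2.3, "Atlantis": 1.7}

# Softmax: turn raw scores into probabilities that sum to 1.
total = sum(math.exp(s) for s in scores.values())
probs = {word: math.exp(s) / total for word, s in scores.items()}

next_word = max(probs, key=probs.get)
print(probs)      # roughly Paris 0.80, Lyon 0.13, Atlantis 0.07
print(next_word)  # "Paris" - picked because it is probable, not because it is verified
```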
Put differently: LLMs take all the words they’ve been given (the system prompt, any of the user’s previous prompts, i.e. the “context”, and the user’s most recent prompt) and then guess what the most likely next words would be.
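In rough pseudocode (a sketch only; predict_next_word() is a stand-in for the actual neural network, which obviously can’t be reproduced here), the whole process is a loop that keeps appending the most likely next word to the pile of words it already has:

```python
def predict_next_word(context: str) -> str:
    """Placeholder for the real model: returns the statistically most likely next word."""
    ...

def generate(system_prompt: str, history: list[str], user_prompt: str, max_words: int = 200) -> str:
    # Everything the model "knows" about this conversation is one long string of words.
    context = "\n".join([system_prompt, *history, user_prompt])
    answer = []
    for _ in range(max_words):
        word = predict_next_word(context)
        if word == "<end>":        # the stop marker is predicted like any other word
            break
        answer.append(word)
        context += " " + word      # each guess is fed straight back in as more context
    return " ".join(answer)
```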
An easy way to think about this is to consider a forum thread. Imagine someone writes a post here on how iPads are not computers. Given all of the previous examples of that conversation, what are the likely replies to that post? That’s basically what LLMs do.
Chain-of-thought models are a funny example of this: these models “think out loud,” and in doing so they increase the number of words that go into the context. They aren’t actually thinking; they’re just adding to their own context in the hope that it increases the likelihood of successfully predicting the “right” final words.
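Continuing the sketch above (still hypothetical, reusing the made-up generate() helper), chain of thought boils down to this: generate some “reasoning” text first, drop it into the context, then generate the final answer on top of it.

```python
def answer_with_chain_of_thought(system_prompt: str, user_prompt: str) -> str:
    # Step 1: have the model "think out loud". These are predicted words like
    # any others; they simply get added to the context.
    reasoning = generate(system_prompt, [], user_prompt + "\nLet's think step by step.")

    # Step 2: generate the final answer with the reasoning now in the context,
    # which tends to make the "right" final words more probable.
    return generate(system_prompt, [user_prompt, reasoning], "So the final answer is:")
```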
So shouldn’t they be called LLMGs, with the G added to signify that they’re guessing? It’s tongue-in-cheek, but if they had ‘guess’ in the name they would probably be used more appropriately. As I tend not to read up too much on these things, I had incorrectly assumed they had knowledge, and lots of it, and would give me a good answer. Knowing they’re guessing changes my view on LLMs and places them in a similar position to politicians, who are also guessing most of the time.
You may find this article interesting. It breaks down what is actually happening when you talk to one of these chat thingies. Basically, as I understand it, you’re not talking to the large language model directly; you’re talking to a persona designed by the company, through an interface which pushes you towards assuming that you’re in a conversation. The persona is ‘incentivised’ to keep you talking and will adopt any strategy to continue the conversation; ‘truth’ is neither here nor there. The author has access to a base LLM without the other layers, so it’s interesting to see the difference in responses when you strip away the ‘chat elements’ from the base engine.
There are some consequences to this approach, which the author discusses, and some of them are surprising if you’re used to thinking of, say, ChatGPT as the ‘AI’ in its entirety.
I’ve only just started looking into AI and don’t pretend to understand much about the theory, but I found the article interesting. It’s a while since I read it and I have certainly missed out some nuances (or misunderstood some of it…), so if you are using chatbots or thinking about doing so, it’s worth reading the article directly to make up your own mind.