Why ChatGPT Is Not Reliable
I’ll start with the simple fact: ChatGPT is not a reliable answerer of questions.
To try to explain why from scratch would be a heavy lift, but fortunately, Stephen Wolfram has already done the heavy lifting for us in his article, “What is ChatGPT Doing… and Why Does It Work?” [1] In a PF thread discussing this article, I tried to summarize as briefly as I could the key message of Wolfram’s article. Here is what I said in my post there [2]:
ChatGPT does not make use of the meanings of words at all. All it is doing is generating text word by word based on relative word frequencies in its training data. It is using correlations between words, but that is not the same as correlations in the underlying information that the words represent (much less causation). ChatGPT literally has no idea that the words it strings together represent anything.
In other words, ChatGPT is not designed to actually answer questions or provide information. In fact, it is explicitly designed not to do those things, because, as I said in the quote above, it only works with words in themselves; it does not work with, and does not even have any concept of, the information that the words represent. And that makes it unreliable, by design.
So, to give some examples of misconceptions that I have encountered: when you ask ChatGPT a question that you might think would be answerable by a Google Search, ChatGPT is not doing that. When you ask ChatGPT a question that you might think would be answerable by looking in a database (as Wolfram Alpha, for example, does when you ask it something like “what is the distance from New York to Los Angeles?”), ChatGPT is not doing that. And so on, for any value of “which you might think would be answerable by…”. And the same is true if you substitute “looking for information in its training data” for any of the above: the fact that, for example, there is a huge body of posts on Instagram in ChatGPT’s training data does not mean that if you ask it a question about Instagram posts, it will look at those posts in its training data and analyze them in order to answer the question. It won’t. While there is, of course, voluminous information in ChatGPT’s training data for a human reader, ChatGPT does not use, or even comprehend, any of that information. Literally all it gets from its training data is relative word frequencies.
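To make "generating a continuation from relative word frequencies" concrete, here is a minimal sketch of a toy bigram model in Python. It is not ChatGPT's implementation, which is a large neural network and far more elaborate; the corpus, the function names, and the fixed random seed are all assumptions made up for this illustration. The point is only that such a procedure extends a prompt without ever looking anything up.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    counts = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts

def continue_text(counts, prompt, n_words, seed=0):
    """Extend the prompt one word at a time, sampling by relative frequency."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same text
    output = prompt.lower().split()
    for _ in range(n_words):
        followers = counts.get(output[-1])
        if not followers:
            break  # the last word never had a successor in the corpus
        candidates = list(followers.keys())
        weights = list(followers.values())
        output.append(rng.choices(candidates, weights=weights, k=1)[0])
    return " ".join(output)

# Hypothetical toy corpus, invented for this sketch.
corpus = (
    "the distance from new york to los angeles is large . "
    "the distance from new york to chicago is smaller . "
    "the distance from the earth to the moon is larger still ."
)
counts = build_bigram_counts(corpus)
print(continue_text(counts, "the distance from", n_words=8))
```

Whatever this prints will read like a sentence about distances, but nothing in the procedure checks whether the sentence is true: the only input is which word tended to follow which.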
So why do ChatGPT responses seem like they are reliable? Why do they seem like they must be coming from a process that “knows” the information involved? Because our cognitive systems are designed to interpret things that way. When we see text that looks syntactically, grammatically correct and seems like it is confidently asserting something, we assume that it must have been produced, if not by an actual human, at least by an “AI” that is generating the text based on some kind of actual knowledge. In other words, ChatGPT fools our cognitive systems into attributing qualities to it that it does not actually have.
This security hole, if you will, in our cognitive systems is not a recent discovery. Human con artists have made use of much the same tricks throughout human history. The only difference with the human con artists is that they were doing it intentionally, whereas ChatGPT has no intentions at all and is doing it as a side effect of its design. But the end result is much the same: let the reader beware.
[1] https://writings.stephenwolfram.com/2023/02/what-is-chatgpt-doing-and-why-does-it-work/
- Completed Educational Background: MIT Master’s
- Favorite Area of Science: Relativity
So what? That doesn't make ChatGPT good enough to rely on.
I don't personally intend to, no. But, there are worse ways to get answers.
Could you make the same argument for astrology? Yesterday it told me to talk to a loved one and it worked!
For your sake I sincerely hope you don't try this. Unless, of course, you only ask it questions whose answers you don't really care about anyway and aren't going to use to determine any actions. Particularly any actions that involve risk of harm to you or others.
Sure it is. You can ask it questions in natural language about everyday matters and it gives you answers, if the answers are in its databases. Unlike ChatGPT, it "knows" when it doesn't know an answer and tells you so. ChatGPT doesn't even have the concept of "doesn't know", because it doesn't even have the concept of "know". All it has is the relative word frequencies in its training data, and all it does is produce a "continuation" of the text you give it as input, according to those relative word frequencies.
Granted, Wolfram Alpha doesn't communicate its answers in natural language, but the answers are still understandable. Plus, it also includes in its answers the assumptions it made while parsing your natural language input (which ChatGPT doesn't even do at all–not just that it doesn't include any assumptions in its output, but it doesn't even parse its input). For example, if you ask Wolfram Alpha "what is the distance from New York to Los Angeles", it includes in its answer that it assumed that by "New York" you meant the city, not the state.
Huh? The Insights article under discussion, and the Wolfram article it references, are entirely about what ChatGPT does, and what it doesn't do. Wolfram also goes into some detail about the "how", but the "what" is the key part I focused on.
In a practical sense, you could live according to what answers ChatGPT gives you. Wolfram Alpha is a mathematical engine. It's not able to communicate on practical everyday matters. Nor can a rock.
How any software works is not really the issue if you are an end user. The important thing is what it outputs.
You are too focused, IMO, on how it does things and not what it does.
ChatGPT does not have any "objective assessment of the world". All it has is the relative word frequencies in its training data.
Wolfram Alpha, ironically, would be a much better thing to describe with the phrase you use here. It actually does contain a database (more precisely multiple databases with different entry and lookup criteria) with validated information about the world, which it uses to answer questions.
Only for the same reason a rock doesn't.
Chat GPT, if anything, is more reliable in terms of its objective assessment of the world than the vast majority of human beings.
Chat GPT doesn't have gross political, religious or philosophical prejudices.
If you talked to an Oil Company Executive, then there was no climate change and the biggest threat to humanity was the environmental movement.
Most human beings deliberately lie if it is in their interests. With Chat GPT at least you know it isn't deliberately lying to you.
I don't know where AI is going, or where we are heading, but I could make a case that Chat GPT is more rational, intelligent and truthful than 99% of the people on this planet.
However, an epistemologist would say that an envelope that contained the phrase "It is after 2:30 and before 2:00" does not possess knowledge, even though it is correct about as often as ChatGPT.
Thanks! It looks like, at the very least, the paid version includes searching the Internet for actual answers to prompts, so it is not the same thing as the free version that my Insights article (and the Wolfram article it references) discuss.
Is there any reference online about this paid version and how it differs from the free version?
It doesn't even "try"–it will always output text in response to a prompt.
It never does, since it has no "idea" of any content at all. All it has any "idea" of is relative word frequencies.
As discussed, ChatGPT doesn't even try to output what is correct. It tries to output what is written often. There is some hope that there is a correlation between that and correctness, but that's not always true and it was not hard to come up with examples.
ChatGPT is the love child of Clever Hans and the Mechanical Turk.
Second, the problem with discussing "AI", much less its purpose, is that it is such a huge area that lumping it all together is seldom helpful. Personally I feel that the most interesting work has been done in motion, balance and sensors.
Third, we had this technology almost 40 years ago. That was based on letters, not words, and it was much slower than real-time. And nobody got excited.
….and this Insight addresses an important but not well discussed problem that more to the point is why we frown upon chat-bot questions and answers on PF.
*Edit: Also, this isn't what AI is "for". AI's promise is in being able to solve problems that are currently out of reach of computers but don't even require conscious thought by people. These problems – such as self-driving cars – are often ones where reliability is important.
edit2: Ok, I say that, but I can't be so sure it's true, particularly because of wildcards like Elon Musk who are willing to put the public at risk to test experimental software.
You're contradicting yourself. The "limitations that are fundamental" are crucial effects on the result. They're not just irrelevant side issues.
If I know that's what your business is doing, you won't get my business.
I suspect that a lot of people feel this way; they just don't know that that's what the business is doing. Certainly OpenAI has not done anything to inform the public of what ChatGPT is actually doing, and not doing. I suspect that is because if they did do so, interest in what OpenAI is doing would evaporate.
I think they already do it the Max Power way:
Is it?
Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.
But for lots of other purposes, it seems wrong. It's not even a matter of percentage accuracy; it's a matter of what the thing is doing and not doing, as compared with what my purpose is. If my purpose is to actually understand the subject matter, I need to learn from a source that actually understands the subject matter. If my purpose is to learn a particular fact, I need to learn from a source that will respond based on that particular fact. For example, if I ask for the distance from New York to Chicago, I don't want an answer from a source that will generate text based on word frequencies in its input data; I want an answer from a source that will look up that distance in a database of verified distances and output what it finds. (Wolfram Alpha, for example, does this in response to queries of that sort.)
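As a rough illustration of the kind of source described above, here is a minimal Python sketch of a lookup against a table of stored distances. It is not a description of how Wolfram Alpha is implemented; the table name, the function names, and the two approximate distance figures are assumptions included only for illustration. What matters is the behavior: the answer comes from a stored fact, and a missing fact produces an explicit "not on file" instead of plausible-sounding text.

```python
from typing import Optional

# Hypothetical table; the figures are approximate great-circle distances
# included only as placeholders for this sketch, not authoritative data.
VERIFIED_DISTANCES_KM = {
    frozenset({"new york", "los angeles"}): 3940,
    frozenset({"new york", "chicago"}): 1145,
}

def lookup_distance_km(city_a: str, city_b: str) -> Optional[int]:
    """Return the stored distance, or None when the table has no entry."""
    key = frozenset({city_a.strip().lower(), city_b.strip().lower()})
    return VERIFIED_DISTANCES_KM.get(key)

def answer(city_a: str, city_b: str) -> str:
    """Answer only from the table; say so explicitly when nothing is stored."""
    distance = lookup_distance_km(city_a, city_b)
    if distance is None:
        return f"No verified distance on file for {city_a} and {city_b}."
    return f"{city_a} to {city_b}: about {distance} km (from the table)."

print(answer("New York", "Chicago"))
print(answer("New York", "Xuxor"))
```

The second call is the interesting one: the lookup has a well-defined notion of "doesn't know", which a pure text-continuation procedure lacks.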
I think they are planning to monetize this by first making a name for themselves and then selling a product where "close enough is good enough". For example, customer service chatbots.
"New from OceanGate: now 99% Reliable – Twice as Reliable as our Previous Subs!"
(too soon?)
I go back again to wondering what the creators are thinking about this…
OpenAI's website is really weird. It is exceptionally thin on content and heavy on flash, with most of the front page just being pointless slogans and photos of people doing office things (was it created by ChatGPT?). It even features a video on top that apparently has no sound? All this to sell a predominantly text-based application (ironic)? The first section of the front page, though, contains one actual piece of information, in slogan form:
"Creating safe AGI that benefits all of humanity"
That's quite an ambitious goal/claim. It's not surprising that everyday people believe it's more than it really is, when that's what the company is saying.
The trajectory of the app and the way they've talked about the flaws such as hallucinations does imply they think their approach is viable and that refinements that improve its reliability should result in it becoming "reliable enough". Ironically this may increase the risk/danger of misuse, as people apply it to more and more situations where reliability should matter. I can't see how this approach would ever be acceptable for industrial automation. Maybe for a toy drone it won't matter if it unexpectedly/unpredictably crashes for no apparent reason "only" 0.1% of the time, but that won't ever be acceptable for a self driving car or airplane.
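To put a number on why a 0.1% failure rate is not small for something used repeatedly, here is a short Python sketch. It assumes each use fails independently with the same probability, which is my simplifying assumption and not something claimed in the post; the function name is likewise made up for illustration. Under that assumption the chance of at least one failure in n uses is 1 - (1 - p)^n.

```python
def prob_at_least_one_failure(per_use_failure_rate: float, uses: int) -> float:
    """Chance of one or more failures in `uses` independent trials."""
    return 1.0 - (1.0 - per_use_failure_rate) ** uses

# Assumed per-use failure rate of 0.1%, taken from the figure in the post.
for uses in (1, 100, 1000, 10000):
    p = prob_at_least_one_failure(0.001, uses)
    print(f"{uses:>6} uses: {p:.3f}")
```

With these assumptions, a thousand uses already give roughly a 63% chance of at least one failure, and ten thousand make it a near certainty. For a toy drone that is a nuisance; for a car or an airplane it is disqualifying.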
It can't do this. There is no there there.
Or because the testers didn't bother writing a good test, that actually can distinguish between ChatGPT, an algorithm that generates text based on nothing but relative word frequencies in its training data, and an actual human with actual human understanding of the subject matter. The test is supposed to be testing for the latter, so if the former can pass the test, the test is no good.
See above.
Which, as I said, is already well known: that humans can pass SAT tests without having any actual knowledge of the topic areas. For example, they can pass the SAT math test without being able to actually use math to solve real world problems–meaning, by gathering information about the problem, using that information to set up relevant mathematical equations, then solving them. So in this case, ChatGPT is not going beyond human performance in any respect.
ChatGPT is not parsing natural language. It might well give the appearance of doing so, but that's only an appearance. The text it outputs is just a continuation of the text you input, based on relative word frequencies in its training data. It does not break up the input into sentence structures or anything like that, which is what "parsing natural language" would mean. All it does is output continuations of text based on word frequencies.
"ChatGPT Airlines – now 96% of our takeoffs have landings at airports!"
Let's go back to "knowledge". Yes, it's philosophical, but some of the elements can be addressed scientifically. An old-fashioned definition of knowledge was "justified true belief". Let's dispense with "belief" as too fuzzy. Is what ChatGPT says true? Sometimes. As stated, 96% of the time is not very impressive. Is it justified? Absolutely not – it "knows" only what words others used, and in what order. That's it.
In no sense is there "knowledge" there.
It's not just unreliable – we have no reason to believe it should be reliable, or that this approach will ever be reliable.
But in post #13 you also said it can "repeatably" give accurate answers to questions. That seems to contradict "unreliable". I asked you about this apparent contradiction in post #15 and you haven't responded.
Exactly!
What you show here is nothing like what AndreasC described.
You could write that on the box of any new piece of software. Otherwise there's no reason to use it. But you're seeing the point now:
Right. What's impressive about it is that it can converse with a human and sound pretty human. But now please reread the title of the thread. "Sounds human" is a totally different accomplishment from "reliable".
Your take is weird to me, but it seems common, especially in the media. Consider this potential headline from 1979:
"New 'Spreadsheet' Program 'VisiCalc' Boasts 96% Accuracy – Might it be the New Killer App?"
[ChatGPT was 96th percentile on the SAT, not accuracy, but close enough.]
That's not impressive, it's a disaster. It's orders of magnitude worse than acceptable accuracy from a computer. It seems that because ChatGPT sounds confidently human people have lowered the bar from "computer" to "human" in judging its intelligence – and don't even realize they've done it. That's a dangerous mistake.
I've just tried it:
Me: What are the customs of the aliens at the planet Xuxor?
ChatGPT:
As an AI language model, I don't have real-time information on specific planets or extraterrestrial civilizations outside of what is available up until my knowledge cutoff in September 2021. Therefore, I'm not aware of any specific customs of aliens on a planet called Xuxor or any other extraterrestrial civilization.
It's important to note that the existence of aliens or their customs on specific planets is purely speculative at this point, as we have not made contact with extraterrestrial life. If and when we do encounter extraterrestrial civilizations, their customs and behaviors would likely be vastly different from our own, given the vastness of the universe and the potential for diverse forms of life.
Until further information becomes available, any details about the customs of aliens on specific planets would be purely fictional or based on imagination, as depicted in science fiction literature, movies, or other creative works.
By asking experts to give their educated guess estimates, say two years ago.
Please give a reference: where has this been done?
Yes, because while the information that is contained in the relative word frequencies in the training data is extremely sparse compared to the information that a human reader could extract from the same data, it is still not zero information. There is information contained in those word frequencies. For example, "Thomas Jefferson" is going to appear correlated with "July 4, 1776" in the training data to a much greater degree than "Will Smith" does.
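A minimal Python sketch of that kind of correlation, counting how often two phrases turn up in the same sentence. The four-sentence corpus and the function name are assumptions invented for this illustration, and real training data and real models are vastly larger and more complicated; this only shows that bare co-occurrence counts carry some information.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentences, phrases):
    """Count, for each pair of phrases, how many sentences contain both."""
    counts = Counter()
    for sentence in sentences:
        lowered = sentence.lower()
        present = sorted(p for p in phrases if p in lowered)
        for pair in combinations(present, 2):
            counts[pair] += 1
    return counts

# Hypothetical mini-corpus, invented for this sketch.
sentences = [
    "Thomas Jefferson drafted the Declaration adopted on July 4, 1776.",
    "On July 4, 1776, the Declaration that Thomas Jefferson drafted was adopted.",
    "Will Smith starred in the film Independence Day.",
    "Thomas Jefferson later became president.",
]
phrases = ["thomas jefferson", "will smith", "july 4, 1776"]
counts = cooccurrence_counts(sentences, phrases)
print(counts[("july 4, 1776", "thomas jefferson")])  # prints 2
print(counts[("july 4, 1776", "will smith")])        # prints 0
```

The counts favor the right name without the program having any notion of who either person is or what happened on that date.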
It's not; it was an intentional feature of the design that only the relative word frequencies in the training data would be used. The designers, from what I can tell, actually believe that piling up enough training data with such word frequencies can lead to actual "knowledge" of subject matter.
This just shows that SAT tests can be gamed. Which we already knew anyway.
Impressive how? Doesn't this just tell you that it doesn't know the difference between fiction and reality, and more to the point, there's no way for you to know if it is providing you fictional or real answers*?
*Hint: always fictional.
https://betonit.substack.com/p/chatgpt-takes-my-midterm-and-gets (GPT-3, D)
https://betonit.substack.com/p/gpt-retakes-my-midterm-and-gets-an (GPT-4, A)
https://scottaaronson.blog/?p=7209 (GPT-4, B)
I think it's just a qualitative feeling, but I feel the same way. When first learning about it, it never occurred to me that it didn't access stored information (either its own or 3rd party) to form its replies*. Now that I know it doesn't, it surprises me that it gets so much right. If it's just doing word association and statistical analysis, I'm surprised that asking about Independence Day doesn't return "On July 4, 1776 Will Smith fought a group of alien invaders before signing the Declaration of Independence in Philadelphia…" It seems that through statistical analysis it is able to build a model that approximates or simulates real information. To me, surprisingly well.
*I don't know the intent of the designers, but I can't imagine this is an oversight. Maybe the intent was always to profit from 3rd parties using it as an interface for their data sources (some of which, it appears, they are already doing)?
But whatever the real goals of the company, I think it is wrong and risky that it's been hyped (whether by the media or the company) to make people think that it is a general purpose AI with real knowledge. As a result, people have their guard down and are likely to mis/over-use it.
I wonder if the developers really believe it qualifies for the title "AI" or that complexity = intelligence?
Edit: Btw, while I'm not certain of this, here's how I can often tell: it's the lack of focus in the responses. When the content is dumped into the middle of an ongoing conversation, it doesn't acknowledge or respond to the ongoing conversation; it just provides generic information that is often not useful for/connected to the discussion.
That is incorrect. "Limited understanding" implies that there is at least SOME understanding, but ChatGPT has zero understanding of anything.
This does not seem consistent with many posts here at PF by people who have tried ChatGPT and posted the output. The general sense I get from those posts is that ChatGPT was less reliable than they expected–because they did not realize what it is actually doing and not doing. For example, apparently many people expected that when you asked it a factual question about something in its training data, it would go look in its training data to find the answer. But it doesn't, even if the right answer is in its training data. Wolfram's article, referenced in my Insights article, makes all this clear.
Not necessarily pre-existing. We build structures in our brains to represent things in the world as a result of our interactions with them. ChatGPT does not. (Nor does ChatGPT have any "pre-existing" structures that are relevant for this.)
If the reason they passed was that their graders were lazy and didn't actually check the accuracy of the answers, yes. And that is exactly what has happened in cases where ChatGPT supposedly "passed" tests. If you think graders would never be so lazy, you have led a very sheltered life. It's just a more extreme version of students getting a passing grade on a book report without ever having read the book, and I can vouch for that happening from my own personal experience.
Please produce your evidence for this claim. It is contrary to both the analysis of how ChatGPT actually works, which I discuss in the Insights article, and the statements of many, many people who have used it. Including many posts here at PF where people have given ChatGPT output that is confident-sounding but wrong.
Doesn't this contradict your claim quoted above?
The fact that it is observed to be unreliable is just a fact, yes. But in previous discussions of ChatGPT here at PF, it became clear to me that many people do not understand how ChatGPT works and so do not understand both that it is unreliable and why it is unreliable. That is why I wrote this article.
And I have already responded to @Demystifier that such a claim is meaningless unless you can actually quantify what "you would naively expect" and then compare ChatGPT's actual accuracy to that. Just saying that subjectively it seems more accurate than you would expect is meaningless.
Yes, I think one way of describing ChatGPT is that it is crudely simulating a human con artist: it produces statements that seem to come from an entity that is knowledgeable, but actually don't.
Is it? How would one even determine that?
That's because our minds have semantic connections between words and things in the world. When we read words, we make use of those connections–in other words, we know that the words have meanings, and what those meanings are. If we get the meanings of words wrong, we "learn" things that are wrong.
ChatGPT has none of this. It has no connections between words and anything else. It doesn't even have the concept of there being connections between words and anything else. The only information it uses is relative word frequencies in its training data.
No, it can't. It can get lucky sometimes and happen to give an "answer" that happens to be accurate, but, as you will quickly find out if you start looking, it also happily gives inaccurate answers with the same level of confidence. That's because it's not designed to give accurate answers to questions; that's not what it's for.
Only because the "tests" are graded so poorly that even the inaccurate but confident-sounding responses that ChatGPT gives "pass" the tests. That is a reflection of the laziness and ignorance of the test graders, not of the knowledge of ChatGPT.
Sure, because it can generate text in response to any prompt whatever. But the responses it gives will have no reliable relationship to reality. Sometimes they might happen to be right, other times they will be wrong, often egregiously wrong. But all of the responses seem just as confident.
ChatGPT does not and cannot do these things. What it does do is, as a side effect of its design, produce text that seems, to a naive observer, to be produced by something that does these things. But the illusion is quickly shattered when you start actually checking up on its responses.
Perhaps, but they are irrelevant to this article. The article is not about an abstract philosophical concept of "knowledge". It is about what ChatGPT is and is not actually doing when it emits text in response to a prompt.
Not as long as there are no semantic connections between the network and the world. No entity forms "representations" of actual things just by looking at relative word frequencies in texts. There has to be two-way interaction with the actual world. That's how, for example, we humans form our mental representations of things. We interact with them and learn how they work.
…, pre-politically correct characteristics included "often precocious vocabulary with no apparent 'real understanding/ability' for use/application/reasoning." That is my recollection from Googling ten-fifteen years ago; ymmv.
This is another/one more case (https://www.physicsforums.com/threa…an-appropriate-source-for-discussion.1053525/); some wiki/google sources lack "shelf life."