
Where truth can become blurry: explaining factuality in AI chatbots

AI chatbots aren't always factual—why? Learn how language models like ChatGPT can generate non-factual content and what users can do to minimize risk.

ChatGPT claimed the title of fastest-growing consumer application in history [1] this past February, decisively steering public attention towards conversational AI systems, or chatbots. This was not the first attempt by a technology company at such a feat, but earlier versions failed because of what we now call hallucinations: the generation of non-factual content.

One lens through which to view factuality is as a scale on which conversational systems can be ranked. A system that scores high avoids stating falsehoods. Conversely, a system that scores low is prone to stating incorrect historical or scientific information, misattributing quotes or inventions, or generating inconsistent responses. To anthropomorphize, chatbots that score low on factuality are not truthful; they hallucinate, and modern chatbots like ChatGPT are not exempt. Researchers clearly have a lot of work to do in the next few years. What can end users of these systems do in the meantime? What options are there to minimize the risk of being misled?

Any progress?

Language models, algorithms that conditionally generate text, are the driving force behind chatbots. When earlier systems were released, the cutting-edge language models of the time were built on deep neural architectures, computational models loosely inspired by the structure of the human brain. The generated text was coherent partly because these algorithms could tease out the important parts of a text before generating what comes next. The Transformer architecture, which powers most modern language models, can additionally uncover correlations between those important parts. Arguably, it brought neural architectures a big step closer to their human-brain inspiration. Predictably, it is now near impossible to differentiate human-written text from text written by modern chatbots like ChatGPT. So much progress. What about factuality?
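To make "conditionally generate text" concrete, here is a minimal sketch using the open-source Hugging Face transformers library, with GPT-2 standing in for the far larger models behind modern chatbots. The model name and prompt are illustrative assumptions, not details from the studies discussed in this post.

```python
# Minimal sketch: a language model continues a prompt one token at a time.
# Assumes the open-source "transformers" package; GPT-2 is a small stand-in
# for the far larger models that power chatbots like ChatGPT.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

prompt = "The Transformer architecture changed natural language processing because"
outputs = generator(prompt, max_new_tokens=40, do_sample=True, temperature=0.8)

print(outputs[0]["generated_text"])  # the prompt plus the model's continuation
```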

A task commonly handled by chatbots is summarization: writing a shorter version of a text that captures its most important elements. The language models behind the chatbots do the heavy lifting. In a study [2] by a team from Google Research, language models were asked to summarize news articles. Seven out of ten generated single-sentence summaries either manipulated information in the original text to infer new information or ignored the original altogether and generated new, unrelated information: obvious hallucinations. Moreover, nine out of ten of those hallucinations were not factual.
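For readers who want to see what the task looks like in practice, below is a minimal summarization sketch using a publicly available model from the transformers library. The model checkpoint and the input text are illustrative assumptions, not the systems evaluated in the Google study.

```python
# Minimal summarization sketch. The BART CNN/DailyMail checkpoint is an
# illustrative public model, not one of the systems from the cited study.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

article = (
    "The city council approved a new budget on Tuesday after weeks of debate. "
    "The plan increases funding for public transit and road repairs, while "
    "keeping property taxes at their current rate for the next fiscal year."
)

summary = summarizer(article, max_length=30, min_length=10, do_sample=False)
print(summary[0]["summary_text"])

# Checking whether every claim in the one-sentence summary is actually
# supported by the article is exactly where hallucinations show up.
```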

Perhaps two consoling, related factors in this gloom are that the most fluent of the models in the Google study also generated factual hallucinations [3], and that the models studied are early versions of those used in modern chatbots. In other words, these models added information to the summary that was not included in the long text but was factually correct. This is not surprising, because humans sometimes summarize by providing additional context that may help situate the audience.

Later versions of these models ballooned in size. For example, one of the models in the Google study had 1.5 billion parameters; GPT-3 has 175 billion. As the models grew, they became more creative at the expense of correctness. Even more unsettling, larger models exhibit such fluency that they produce increasingly convincing yet non-factual content. Early this year, a popular podcast host and neuroscientist asked ChatGPT why he quit Twitter. He noted that while the main reason cited by ChatGPT had elements of truth, it was only marginally relevant. As if aware of its underwhelming answer, ChatGPT embellished the response by giving a wrong date and inventing quotes from a non-existent blog post. This experience is more than anecdotal. A joint study [4] from OpenAI and the University of Oxford tested multiple large language models on human-crafted questions that some humans would answer incorrectly. Out of 100 questions, the best model answered 60 truthfully, while humans answered more than 90. To compound the problem, the largest models were the least truthful.

Researchers, get busy!

Why is factuality in conversational AI systems challenging to accomplish? I can think of two reasons, and they have different potential solutions. It may be that during training the model did not see enough patterns to generalize well, so when asked, it spits out wrong patterns. Take as an example an early version of GPT-3. When asked to multiply 1,241 by 123, it blurts out 14,812, which is incorrect (the answer is 152,643). This could be because, at that point, it simply had not mastered patterns pertaining to multiplication. The more varied multiplication data the model sees during training, the more likely it is to fix this issue.
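As a quick sanity check of that arithmetic, and a sketch of how one might probe a chatbot for this kind of error, the snippet below compares a model's reply to the exact product computed in ordinary code. The use of the openai Python client and the "gpt-4" model name are assumptions for illustration, and running it requires an API key.

```python
# Sketch: probe a chatbot on arithmetic and compare against the exact answer.
# Assumes the "openai" Python package (v1+) and an API key in the environment;
# the model name is an illustrative choice.
from openai import OpenAI

client = OpenAI()

a, b = 1_241, 123
exact = a * b  # 152,643: ordinary code has no trouble with this

reply = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"What is {a} times {b}? Answer with the number only.",
    }],
    temperature=0,
)

print("model says:  ", reply.choices[0].message.content)
print("exact answer:", exact)
```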


A more challenging, but not intractable, reason could be that the model's training objective just did not incentivize telling the truth. A somewhat simplified explanation of the primary objective for most language models is that they are designed to predict the next word based on vast amounts of text gathered from the internet. The more often a misconception is repeated on the internet, the more likely the model is to take it as truth. Stuffing the model with torrents of internet text will not fix this. Asked what happens to your nose when you tell a lie, an early version of GPT-3 confidently assures you that your nose grows longer. Comical, but potentially malicious.
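To see why repetition wins under a next-word objective, here is a deliberately tiny, hypothetical sketch: a frequency-based predictor "trained" on a toy corpus in which a misconception outnumbers the correction. Real language models are vastly more sophisticated, but the pull toward the most repeated continuation is the same.

```python
# Toy sketch: a frequency-based "next word" predictor. Not a real language
# model; it only illustrates that the most repeated continuation wins.
from collections import Counter, defaultdict

corpus = [
    "when you lie your nose grows",           # misconception, repeated often
    "when you lie your nose grows",
    "when you lie your nose grows",
    "when you lie your nose stays the same",  # the truth, repeated rarely
]

# Count which word follows each word across the corpus.
next_word_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for current, following in zip(words, words[1:]):
        next_word_counts[current][following] += 1

# The "model" predicts the continuation it has seen most often.
print(next_word_counts["nose"].most_common(1))  # [('grows', 3)]
```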

With GPT-4, the latest successor to ChatGPT, OpenAI raised the bar on addressing these two reasons. I have played around with GPT-4, and it indeed demonstrates a significant step towards mitigating hallucinations. Recognizing that the language model by itself is prone to imitating human misconceptions, the OpenAI team invested heavily in Reinforcement Learning from Human Feedback (RLHF), a technique in which humans rank candidate answers by correctness and that feedback is fed back into the language model. To determine the best response to what happens to your nose when you tell a lie, humans are asked to rank candidate answers. When posed with the same inquiry about lying, GPT-4 provides a more nuanced response. Notably, it even mentions the associated myth about a growing nose.
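The heart of RLHF is turning human rankings into a training signal. The sketch below shows only the first step, converting a ranked list of candidate answers into (preferred, rejected) pairs that a reward model could later be trained on; the candidate answers and the ranking are invented for illustration, and OpenAI's actual pipeline involves much more than this.

```python
# Sketch: turn a human ranking of candidate answers into preference pairs.
# The candidates and ranking are hypothetical; a reward model would be trained
# on pairs like these, and the language model then tuned against that reward.
from itertools import combinations

prompt = "What happens to your nose when you tell a lie?"

# Candidates ordered best-to-worst by a human labeler (hypothetical).
ranked_answers = [
    "Nothing physically happens; the growing nose is a myth from Pinocchio.",
    "Your nose stays the same, though lying can make you feel nervous.",
    "Your nose grows longer.",
]

# Every higher-ranked answer is preferred over every lower-ranked one.
preference_pairs = [
    {"prompt": prompt, "preferred": better, "rejected": worse}
    for better, worse in combinations(ranked_answers, 2)
]

for pair in preference_pairs:
    print(pair["preferred"][:40], ">", pair["rejected"][:40])
```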

But GPT-4 still falls short of human standards. On the OpenAI/University of Oxford benchmark I referred to earlier, it answers 70 out of the 100 questions factually; in comparison, humans answer over 90 factually. An area of research that has shown promise is Retrieval Augmentation, a technique adapted by a research team at Meta [5], though none of the modern language models employ it yet. Before generating an answer, the language model enhances the context of the question with selections from a knowledge base. The retriever selects relevant texts in a manner that maximizes truth-telling. Using smaller language models, the Meta team demonstrated that grounding through Retrieval Augmentation significantly reduces hallucinations. GPT-4 and its peers would likely benefit similarly.
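A minimal sketch of the retrieval step is below: a TF-IDF retriever pulls the most relevant passage from a small, made-up knowledge base and prepends it to the question before the language model answers. The knowledge base, the scoring choice, and the prompt format are all illustrative assumptions, not the Meta team's actual system.

```python
# Sketch of retrieval augmentation: ground the prompt in retrieved text.
# The tiny knowledge base and TF-IDF retriever are stand-ins for the much
# larger corpora and learned retrievers used in the research literature.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Pinocchio is a fictional character whose nose grows when he lies.",
    "Telling a lie has no physical effect on your nose.",
    "The Transformer architecture was introduced in 2017.",
]

question = "What happens to your nose when you tell a lie?"

# Score each passage against the question and keep the closest match.
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(knowledge_base + [question])
scores = cosine_similarity(vectors[-1], vectors[:-1]).ravel()
best_passage = knowledge_base[scores.argmax()]

# The grounded prompt is what the language model would actually see.
grounded_prompt = f"Context: {best_passage}\n\nQuestion: {question}\nAnswer:"
print(grounded_prompt)
```

With this toy corpus, the factual passage about lying should be the closest match, so the model answers with the myth-busting context already in front of it.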

End users, proceed with caution

Compared to humans, GPT-4 scores low on factuality when assessed on an open domain. However, it is possible that in some narrow domains its performance surpasses that of humans. I end by offering a few notes for end users.

  1. Intentionally narrow the domain to which you want to apply the language model. In addition, exhaustively test the system in that narrow domain before deploying it to production.
  2. As I alluded to earlier, language models sometimes trade factuality for creativity. End users need to balance the two carefully to preserve factuality. In production, I recommend dialing creativity back significantly (see the sketch after this list).
  3. Finally, if the domain cannot be narrowed down, good luck to you.
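As an illustration of that second note, most chat APIs expose a sampling temperature that controls how adventurous the model's word choices are. The sketch below shows one way to dial it down; the openai client, the "gpt-4" model name, and the value of 0.2 are assumptions for illustration, not a universal recommendation.

```python
# Sketch: dial creativity back by lowering the sampling temperature.
# Assumes the "openai" Python package (v1+) and an API key; the model name
# and the temperature value are illustrative choices, not fixed guidance.
from openai import OpenAI

client = OpenAI()

def ask(question: str, temperature: float = 0.2) -> str:
    """Query the model with a low temperature to favor predictable wording."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=temperature,  # 0 = most deterministic, higher = more creative
    )
    return response.choices[0].message.content

print(ask("What does APR stand for in consumer lending?"))
```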

If you're evaluating AI solutions for your business, download this white paper to learn the essential questions to ask AI vendors.


Sources:

1. According to a UBS study: “Let's chat about ChatGPT.” Feb 23, 2023.

2. https://aclanthology.org/2020.acl-main.173 

3. This is not an oxymoron in the context of summarization; the model may give additional context that isn’t in the longer text.

4. https://arxiv.org/pdf/2109.07958.pdf 

5. https://arxiv.org/abs/2104.07567 

The information regarding ChatGPT and other AI tools provided herein is for informational purposes only and is not intended to constitute a recommendation, or development, security, or assessment advice of any kind.

The opinions provided are those of the author and not necessarily those of Fidelity Investments or its affiliates. Fidelity does not assume any duty to update any of the information. Fidelity and any other third parties are independent entities and not affiliated. Mentioning them does not suggest a recommendation or endorsement by Fidelity.

1083744.2.0

Last Feremenga

Director, Data Science
Last is one of Saifr's AI experts, specializing in natural language processing (NLP) and regulatory compliance in financial services, and currently heads the AI-Applied Research Team at Saifr. He has an extensive background in research and has previously built NLP-based AI systems for risk mitigation.
