Could AI-generated data lead to model collapse? How to prevent it.

When a mischievous security expert dared Grok, a chatbot developed by xAI, to write malware, the AI refused, replying, "… it goes against OpenAI's use case policy." (Winterbourne, 2023) This raised eyebrows, since Grok was not created by OpenAI. Was it a botched attempt at humor, given Grok's programmed 'rebellious streak'? Or did Grok's training data, scraped from across the internet, include outputs from OpenAI models? This leads to an intriguing question: can generative AI systems learn effectively and safely from their AI predecessors? And if so, what unexpected consequences might follow?

Generative AI systems capable of producing content without human intervention did not emerge until 2021. Before then, even the most sophisticated systems needed human input at every stage. Text generators could barely construct a paragraph, particularly in languages other than English. Neural-network image generators could only produce deepfakes (images formed by altering a target image using features learned from other images), and these alterations were typically confined to human features such as faces and the upper body. Prior to 2021, then, a flood of AI-generated data polluting the internet was not a significant worry, and the first generation of generative models, such as ChatGPT and DALL-E, was trained mostly on data created by real humans. But what about the generations to come?

For future generations of models, the shift towards training with synthetic data from prior generative models is inevitable. This trend is driven by two main factors: strategic advantages and uncontrollable elements. The strategic advantages of synthetic data are multifold. It can alleviate privacy concerns during the development phase when systems require personal information for training, as synthetic data can pose a minimal risk of exposing such details. Moreover, the use of synthetic data, especially in image generation, can enhance model performance when used judiciously. It's also generally more economical and efficient to extract data from a generative model than to engage a large human workforce.

Developers can manage the strategic drivers during model development by consistently monitoring the model's performance and its ability to generalize. Managing the uncontrollable elements is more challenging. For instance, even if developers use human annotators to generate new data, the widespread use of generative models to increase productivity makes it difficult to determine whether that content is entirely human-generated. Developers and business leaders alike therefore need to seriously consider the impact of AI-generated data on downstream tasks.

Model collapse

One significant concern is 'model collapse', where the quality of future generations of models can deteriorate. In this scenario, not only can the model start to misinterpret reality from a human perspective, but it can also believe this misinterpretation to be real. A research team from the Universities of Oxford and Cambridge identified this phenomenon through controlled experiments. They discovered that if future generative models are trained with outputs from their predecessors, they will inevitably and irreversibly collapse, regardless of the model architecture.


To understand the causes of model collapse, think of data as a distribution: in the case of text, a snapshot of all word arrangements and their likelihood of occurring. For instance, the word "Obama" is more likely to occur next to the words "President", "Michelle", or "USA" than next to "Peter" or "John". Internet data, whether text or images, similarly follows specific, albeit unknown, distributions. The objective of generative models is to learn these distributions so that their outputs reflect the world as humans see it.
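As a rough illustration of this idea, the sketch below estimates the distribution of words that follow a given word from raw counts. The four-sentence corpus and the function name `next_word_distribution` are made up for this example, and real generative models learn far richer distributions than adjacent-word counts.

```python
from collections import Counter


def next_word_distribution(corpus, word):
    """Estimate P(next word | word) from raw counts over a list of sentences."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.lower().split()
        # Walk over adjacent word pairs and count what follows `word`.
        for current, following in zip(tokens, tokens[1:]):
            if current == word:
                counts[following] += 1
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}


# Hypothetical toy corpus, invented purely for illustration.
toy_corpus = [
    "president obama spoke today",
    "president obama visited ohio",
    "president biden spoke today",
    "peter spoke today",
]

dist = next_word_distribution(toy_corpus, "president")
for following_word, probability in sorted(dist.items()):
    print(f"P({following_word} | president) = {probability:.2f}")
```

In this toy corpus "obama" follows "president" in two of three cases, so it receives the largest share of probability, while "peter" never follows "president" and receives none.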

There are two salient causes of future model collapse. First, the first-generation model might learn an incorrect data distribution, focusing on areas where humans lack expertise or overlooking areas that warrant more attention. If Grok had expertise in creating malware, it would be surprising, as one would hope that xAI's training data did not include techniques for creating malware. For future models whose training data comes recursively from the outputs of a first-generation model with an incorrect data distribution, model collapse is all but certain.

Second, collapse remains very likely, albeit with a delay, even for future models whose first-generation ancestor learned the human data distribution accurately. A typical progression across generations might look like this: the second-generation model learns from a mix of human data and outputs from the first-generation model. Likewise, the third-generation model learns from a mix of human data and outputs from the second-generation model. Eventually, a future model will learn from a significant amount of data generated by its predecessor. Because each successive model learns from only a fraction of the original human data, that fraction will almost inevitably miss parts of the original data's distribution. Repeated over multiple generations, the errors from this distortion of the human data distribution can lead to model collapse.
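The loss of rare content across generations can be sketched with a small simulation. This is a deliberately simplified toy, not the experiment from the research cited above: the "model" is just a table of token frequencies, each generation trains only on its predecessor's output (no human data is mixed back in), and the vocabulary size, sample count, and Zipf-like weights are assumptions chosen for illustration.

```python
import random
from collections import Counter


def train_and_sample(data, n_samples, rng):
    """'Train' by counting token frequencies, then 'generate' by sampling from them."""
    counts = Counter(data)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=n_samples)


def simulate_generations(n_generations=10, n_samples=200, seed=0):
    """Return the number of distinct tokens that survive in each generation."""
    rng = random.Random(seed)
    # Hypothetical "human" distribution: a few common tokens and a long tail of rare ones.
    vocab = [f"tok{i}" for i in range(50)]
    weights = [1.0 / (i + 1) for i in range(50)]
    data = rng.choices(vocab, weights=weights, k=n_samples)
    support_sizes = [len(set(data))]
    for _ in range(n_generations):
        # Each new generation sees only what its predecessor produced.
        data = train_and_sample(data, n_samples, rng)
        support_sizes.append(len(set(data)))
    return support_sizes


support_sizes = simulate_generations()
print(support_sizes)
```

Because each generation can only reproduce tokens it has seen, a token that fails to appear in one generation's sample can never return. The count of distinct tokens therefore never grows, and rare tokens in the tail are the first to vanish.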

How to dodge model collapse

The scenario above, in which each model's training data is drawn from the output of the preceding generation and combined with samples from a static human-produced dataset, is an extreme example of AI systems recycling data. In practice, the approach is likely to diverge from this example, either by pulling data from multiple AI systems or by integrating new human-generated data at each stage. Both alternatives hinge on striking the right balance between human and AI-generated data.

Drawing on AI-generated data from multiple models can enrich and diversify the training set. Because different models capture different facets of the original human data distribution, combining their outputs broadens coverage of that distribution. Even so, developers should tread carefully. Rather than indiscriminately using AI-generated data for training, they should target samples that are traditionally hard to obtain and enrich them with contextual cues drawn from human communication. Ideally, these cues should be novel, not already present in the preceding AI model. With the right safeguards, developers could potentially achieve not model collapse, but its reversal.

Integrating new human data at each stage is also an ideal strategy. It offers two advantages. First, because each sampling will inevitably miss portions of the original human data distribution, continuous re-sampling reduces the risk of consistently overlooking the same segments. Second, given the temporal evolution of human data, it's vital to keep it updated. Bearing in mind these dual benefits, it's unsurprising that a study conducted by teams from Stanford and Rice Universities found that generative models, when supplemented with sufficient fresh data, do not degrade over time. Yet again, developers need to proceed with caution. While a limited, judiciously sourced quantity of AI-generated data can enhance the fresh human data and improve model performance, an excess can corrupt the training dataset and lead to a decline in model performance.
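One simple way to keep the quantity of AI-generated data limited is to cap its share of the training set. The sketch below is a minimal illustration of that idea; the helper name `build_training_mix` and the 25% cap are hypothetical choices for this example, not a threshold taken from the studies mentioned above.

```python
import random


def build_training_mix(human, synthetic, max_synthetic_fraction=0.25, seed=0):
    """Combine all human samples with at most a capped share of synthetic ones."""
    if not 0.0 <= max_synthetic_fraction < 1.0:
        raise ValueError("max_synthetic_fraction must be in [0, 1)")
    # Largest synthetic count s such that s / (len(human) + s) stays within the cap.
    limit = int(max_synthetic_fraction * len(human) / (1.0 - max_synthetic_fraction))
    rng = random.Random(seed)
    chosen = rng.sample(synthetic, min(limit, len(synthetic)))
    mix = list(human) + chosen
    rng.shuffle(mix)
    return mix


# Hypothetical placeholder samples: 90 human-written items, 500 AI-generated ones.
human = [f"h{i}" for i in range(90)]
synthetic = [f"s{i}" for i in range(500)]

mix = build_training_mix(human, synthetic, max_synthetic_fraction=0.25)
synthetic_share = sum(item.startswith("s") for item in mix) / len(mix)
print(len(mix), synthetic_share)
```

The appropriate cap depends on the task and the quality of the synthetic data, so in practice it would be tuned by monitoring model performance rather than fixed in advance.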

How to plan for the future

At the time of writing this piece, the market is teeming with over 30 generative models. A quick online search for a list of generative text models yields page upon page of “top 10 generative AI tools” rankings. This abundance of choice provides executives, whose businesses are intertwined with the development of generative models, with a wealth of options for producing diverse synthetic data to bolster their real human data. Nevertheless, without a consistent influx of fresh human data, models will likely continue to degrade, even with synthetic data pulled from a variety of generative models. It's critical for executives to collaborate with developers to ensure a sufficient supply of human data, at least within the narrow scope their models are designed to tackle.

The question of how to effectively differentiate between human and AI-generated data on the web remains open-ended. Preliminary research into ‘watermarking’ text written by language models is in progress, with the aim of enabling users to identify AI-generated text. I would urge business leaders to keep a close eye on these developments. In the interim, a word of advice—use caution when collecting training data.

The opinions provided are those of the author and not necessarily those of Fidelity Investments or its affiliates. Fidelity does not assume any duty to update any of the information. Fidelity and any other third parties are independent entities and not affiliated. Mentioning them does not suggest a recommendation or endorsement by Fidelity. The information regarding AI tools provided herein is for informational purposes only and is not intended to constitute a recommendation, development, security assessment advice of any kind.


Last Feremenga
