Could AI-generated data lead to model collapse? How to prevent it.

The emergence of generative AI and the use of AI-generated data have raised concerns about model collapse. What's the right mix of human and AI data?

When an impish security expert dared Grok, a chatbot developed by xAI, to conjure up malware, the AI stood its ground, replying, "… it goes against OpenAI's use case policy" (Winterbourne, 2023). This raised eyebrows; after all, Grok wasn't a creation of OpenAI. Was this a botched attempt at humor, given Grok's pre-programmed 'rebellious streak'? Or could it be that Grok's training data, scraped from the vast corners of the internet, included outputs from OpenAI models? This brings us to an intriguing question: can generative AI systems effectively and safely learn from their AI predecessors? And if so, what unexpected consequences might we encounter?

Generative AI systems capable of producing content without human intervention did not emerge until 2021. Before this, even the most sophisticated systems needed human input at every stage. Text generators could barely construct a paragraph, particularly in a language other than English. Image generators based on neural networks could only muster up deepfakes (images formed by altering a target image using insights from other images), and these alterations were typically confined to human features such as faces and parts of the upper body. Prior to 2021, then, a vast influx of AI-generated data corrupting the internet was not a significant worry, and the initial generation of generative models, like ChatGPT and DALL-E, was trained mostly on data created by real humans. But what about the generations to come?

For future generations of models, the shift towards training with synthetic data from prior generative models is inevitable. This trend is driven by two main factors: strategic advantages and uncontrollable elements. The strategic advantages of synthetic data are manifold. It can alleviate privacy concerns during development when systems require personal information for training, since synthetic data carries little risk of exposing such details. Moreover, synthetic data, especially in image generation, can enhance model performance when used judiciously. It's also generally more economical and efficient to extract data from a generative model than to engage a large human workforce.

Developers can manage the strategic drivers during model development by consistently monitoring the model's performance and its ability to generalize. The uncontrollable elements are more challenging. For instance, even if developers use human annotators to generate new data, the widespread use of generative models to increase productivity makes it difficult to determine whether the content is entirely human-generated. Thus, developers and business leaders alike need to seriously consider the impact of AI-generated data on downstream tasks.

Model collapse

One significant concern is 'model collapse', where the quality of future generations of models can deteriorate. In this scenario, not only can the model start to misinterpret reality from a human perspective, but it can also believe this misinterpretation to be real. A research team from the Universities of Oxford and Cambridge identified this phenomenon through controlled experiments. They discovered that if future generative models are trained with outputs from their predecessors, they will inevitably and irreversibly collapse, regardless of the model architecture.

To comprehend the causes of model collapse, consider data as a distribution: in the case of text, a snapshot of all word arrangements and their likelihood of occurring. For instance, the word "Obama" is more likely to occur next to the words "President", "Michelle", or "USA" than next to "Peter" or "John". Internet data, whether text or images, similarly follows specific, albeit unknown, distributions. The objective of generative models is to learn these distributions so that their outputs reflect the worldview from a human perspective.
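As a rough illustration of what "data as a distribution" means, the Python sketch below estimates how often words appear next to a target word. The tiny corpus and the helper name `neighbor_probabilities` are invented for this example; real generative models learn far richer distributions than adjacent-word counts.

```python
from collections import Counter

# Toy corpus, invented purely for illustration.
corpus = [
    "president obama spoke today",
    "michelle obama wrote a book",
    "president obama visited the usa",
    "peter met john today",
]

def neighbor_probabilities(sentences, target):
    """Estimate P(neighbor | target) from counts of directly adjacent words."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            if word != target:
                continue
            if i > 0:
                counts[words[i - 1]] += 1  # word just before the target
            if i + 1 < len(words):
                counts[words[i + 1]] += 1  # word just after the target
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()} if total else {}

probs = neighbor_probabilities(corpus, "obama")
# Print neighbors from most to least likely.
for word, p in sorted(probs.items(), key=lambda item: -item[1]):
    print(f"{word}: {p:.2f}")
```

In this toy corpus, "president" is the most likely neighbor of "obama", while "peter" and "john" never appear next to it and so receive no probability at all.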

There are two salient causes of future model collapse. First, the first-generation model might learn an incorrect data distribution, focusing on areas where humans lack expertise or overlooking areas that warrant more attention. If Grok had expertise in creating malware, it would be surprising, as one would hope that xAI’s training data distribution did not include techniques on how to create malware. For future models whose training data comes recursively from the outputs of a first-generation model with an incorrect data distribution, model collapse is all but certain.

Second, collapse remains very likely, albeit delayed, even for future models whose first-generation model learned the human data distribution accurately. A typical progression across generations might look like this: the second-generation model learns from a mix of human data and outputs from the first-generation model. Likewise, the third-generation model learns from a mix of human data and outputs from the second-generation model. Eventually, a future model learns from a significant amount of data generated by its predecessor. Because each successive model sees only a fraction of the original human data, that fraction will almost inevitably miss parts of the original data’s distribution. Repeated over multiple generations, the errors from this distortion of the human data distribution can lead to model collapse.
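The loss of rare parts of a distribution can be sketched with a deliberately simplified simulation. Here the "model" is only a frequency table that is resampled at each generation, and the 50-token vocabulary, Zipf-like weights, and sample size are all assumptions chosen for illustration; this is not the experimental setup used in the research described above.

```python
import random
from collections import Counter

def next_generation(data, sample_size, rng):
    """'Train' by counting token frequencies, then generate a new dataset from them."""
    counts = Counter(data)
    tokens = list(counts)
    weights = [counts[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=sample_size)

def simulate_recursive_training(generations=30, sample_size=200, seed=0):
    """Return the number of distinct tokens that survive at each generation."""
    rng = random.Random(seed)
    # Hypothetical human data: a few common tokens and many rare ones.
    vocab = [f"w{i}" for i in range(50)]
    human_weights = [1 / (i + 1) for i in range(50)]
    data = rng.choices(vocab, weights=human_weights, k=sample_size)
    support = [len(set(data))]
    for _ in range(generations):
        # Each generation learns only from its predecessor's output.
        data = next_generation(data, sample_size, rng)
        support.append(len(set(data)))
    return support

support = simulate_recursive_training()
print("distinct tokens per generation:", support)
```

Because each generation can only emit tokens it saw in its own training data, a token that drops out once can never return, so the count of distinct tokens can only stay flat or shrink from one generation to the next.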

How to dodge model collapse

In the scenario where training data is drawn from the output of its preceding generation, and combined with samples from a static human-produced dataset, we find an extreme example of AI systems recycling data. However, in practice, the approach is likely to diverge from this example, either by pulling data from multiple AI systems or by integrating new human-generated data at each stage. Both alternatives hinge on striking the right balance between human and AI-generated data.

Drawing on AI-generated data from multiple models can enrich and diversify the training set. Given that different models capture varying facets of the original human data distribution, this multifaceted approach optimizes the coverage of these differences. Despite this, developers should tread carefully. Rather than indiscriminately using AI-generated data for training, they should target samples that are traditionally elusive and enrich these with contextual cues drawn from human communication. Ideally, these cues should be novel, not present in the preceding AI model. With the right safeguards, developers could potentially achieve not model collapse, but its reversal.

Integrating new human data at each stage is also an ideal strategy. It offers two advantages. First, because each sampling will inevitably miss portions of the original human data distribution, continuous re-sampling reduces the risk of consistently overlooking the same segments. Second, given the temporal evolution of human data, it's vital to keep it updated. Bearing in mind these dual benefits, it's unsurprising that a study conducted by teams from Stanford and Rice Universities found that generative models, when supplemented with sufficient fresh data, do not degrade over time. Yet again, developers need to proceed with caution. While a limited, judiciously sourced quantity of AI-generated data can enhance the fresh human data and improve model performance, an excess can corrupt the training dataset and lead to a decline in model performance.
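To make the effect of fresh human data concrete, the toy simulation below mixes a configurable share of newly sampled "human" tokens into each generation's training set. The frequency-table "model", the Zipf-like vocabulary, and the `fresh_fraction` parameter are assumptions made for this sketch; it is not the method of the study mentioned above, and the numbers it prints say nothing about real models.

```python
import random
from collections import Counter

VOCAB = [f"w{i}" for i in range(50)]
HUMAN_WEIGHTS = [1 / (i + 1) for i in range(50)]  # assumed Zipf-like human data

def sample_human(k, rng):
    """Draw k tokens of fresh 'human' data from the assumed distribution."""
    return rng.choices(VOCAB, weights=HUMAN_WEIGHTS, k=k)

def sample_model(training_data, k, rng):
    """Stand-in for a generative model: resample tokens by training frequency."""
    counts = Counter(training_data)
    tokens = list(counts)
    return rng.choices(tokens, weights=[counts[t] for t in tokens], k=k)

def run(fresh_fraction, generations=30, size=200, seed=0):
    """Track distinct tokens per generation for a given share of fresh human data."""
    rng = random.Random(seed)
    n_fresh = round(fresh_fraction * size)
    data = sample_human(size, rng)
    support = [len(set(data))]
    for _ in range(generations):
        synthetic = sample_model(data, size - n_fresh, rng)
        # The next training set mixes predecessor output with fresh human samples.
        data = synthetic + sample_human(n_fresh, rng)
        support.append(len(set(data)))
    return support

synthetic_only = run(fresh_fraction=0.0)
mixed = run(fresh_fraction=0.3)
print("no fresh data, final distinct tokens:", synthetic_only[-1])
print("30% fresh data, final distinct tokens:", mixed[-1])
```

With `fresh_fraction=0.0`, lost tokens never come back. With any positive share, fresh samples can reintroduce tokens the previous generation had dropped, which is the first advantage described above.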

How to plan for the future

At the time of writing this piece, the market is teeming with over 30 generative models. A quick online search for a list of generative text models yields page upon page of “top 10 generative AI tools” rankings. This abundance of choice provides executives, whose businesses are intertwined with the development of generative models, with a wealth of options for producing diverse synthetic data to bolster their real human data. Nevertheless, without a consistent influx of fresh human data, models will likely continue to degrade, even with synthetic data pulled from a variety of generative models. It's critical for executives to collaborate with developers to ensure a sufficient supply of human data, at least within the narrow scope their models are designed to tackle.

The question of how to effectively differentiate between human and AI-generated data on the web remains open. Preliminary research into ‘watermarking’ text written by language models is in progress, with the aim of enabling users to identify AI-generated text. I would urge business leaders to keep a close eye on these developments. In the interim, a word of advice: use caution when collecting training data.



The opinions provided are those of the author and not necessarily those of Fidelity Investments or its affiliates. Fidelity does not assume any duty to update any of the information. Fidelity and any other third parties are independent entities and not affiliated. Mentioning them does not suggest a recommendation or endorsement by Fidelity. The information regarding AI tools provided herein is for informational purposes only and is not intended to constitute a recommendation, development, security assessment advice of any kind.


Last Feremenga

Director, Data Science
Last is one of Saifr’s AI experts, specializing in natural language processing (NLP) and regulatory compliance in financial services. He currently heads the AI-Applied Research Team at Saifr. Previously, Last was a senior data scientist at Digital Reasoning and Wells Fargo, where he built NLP-based AI systems for risk mitigation. Last has an extensive background in research, including an ATLAS fellowship and visiting researcher positions at the European Organization for Nuclear Research and Argonne National Laboratory. He holds a doctorate degree in physics from the University of Texas at Arlington and a bachelor’s degree from the University of Chicago.
