Can AI be Trained on Data Generated by Another AI?

By Tanu Chahal

14/10/2024

In today's AI-driven landscape, training AI models on data generated by another AI might sound circular, even far-fetched. Yet the idea has been around for some time, and as real-world data grows scarcer, it's gaining traction. In this article, we'll explore the current state of AI-generated data, its potential benefits, and the challenges that come with relying on synthetic data.

To understand the significance of AI-generated data, let's first look at why AI models need labeled data at all. AI systems are statistical machines that learn patterns from vast datasets, and annotations, the labels that tell a model what its data represents, play a crucial role in that process. A photo-classifying model, for instance, learns to recognize kitchens from photos annotated with the label 'kitchen.' Good annotation is valuable enough that the market for annotation services is projected to reach $10.34 billion within the next decade.
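
To make the role of labels concrete, here is a minimal sketch of supervised training. The feature vectors are random stand-ins for image embeddings (the article doesn't describe any particular model or dataset), but the mechanics are the point: the 'kitchen' labels are what let the classifier learn the pattern.

```python
# Minimal sketch: how labeled annotations steer a classifier.
# The feature vectors are random stand-ins for image embeddings;
# in a real system they would come from a vision model.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Pretend embeddings for photos annotated "kitchen" vs. "garden".
kitchen = rng.normal(loc=1.0, size=(50, 8))   # labeled 1
garden = rng.normal(loc=-1.0, size=(50, 8))   # labeled 0

X = np.vstack([kitchen, garden])
y = np.array([1] * 50 + [0] * 50)

# The labels y are the annotations; without them there is
# nothing for the model to fit.
clf = LogisticRegression().fit(X, y)

# A new photo's embedding is classified using the learned pattern.
new_photo = rng.normal(loc=1.0, size=(1, 8))
print("kitchen" if clf.predict(new_photo)[0] == 1 else "garden")
```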

However, the industry faces a daunting challenge: data scarcity. More than 35% of the world's top 1,000 websites now block OpenAI's web scraper, and roughly 25% of data from 'high-quality' sources has been restricted. Add the fear of copyright lawsuits and of objectionable material lurking in open datasets, and the industry is facing a reckoning. Synthetic data looks like a way out of these problems, letting developers generate annotations and training data in effectively unlimited quantities.

Os Keyes, a PhD candidate at the University of Washington, likens synthetic data to biofuel: it can be created without the negative externalities of real data. Synthetic data is generated by simulating and extrapolating new entries from a small starting set. The AI industry has taken the concept and run with it, with companies like Writer, Microsoft, and Google fine-tuning their models on synthetic data.
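
As a rough illustration of "simulating and extrapolating from a small starting set," here is a deliberately simple sketch: it fits a distribution to a handful of seed records and samples new synthetic rows. Real pipelines use far richer generative models, and the seed columns here (age, income) are invented for the example.

```python
# Minimal sketch: extrapolating synthetic records from a small seed set.
# Fits a simple distribution to the seed data and samples new entries;
# production pipelines use far richer generative models, but the idea
# is the same: new data that follows the statistics of the old.
import numpy as np

rng = np.random.default_rng(42)

# Small seed set: (age, annual_income) rows, invented for illustration.
seed = np.array([
    [34, 52_000],
    [29, 48_500],
    [41, 61_000],
    [37, 58_200],
    [45, 67_300],
], dtype=float)

mean = seed.mean(axis=0)
cov = np.cov(seed, rowvar=False)

# Sample 1,000 synthetic rows that mimic the seed's distribution.
synthetic = rng.multivariate_normal(mean, cov, size=1000)
print(synthetic[:3].round(1))
```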

Synthetic data offers real benefits: lower costs, greater scalability, and, in some cases, improved data quality. It can produce training data in formats that are hard to obtain through scraping or content licensing. Meta, for instance, used its Llama 3 model to draft captions for footage in its training data, which humans then refined. OpenAI fine-tuned its GPT-4o model on synthetic data to build the sketchpad-like Canvas feature for ChatGPT. And Amazon generates synthetic data to supplement the real-world data used to train speech recognition models for Alexa.
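
The Meta example follows a common "model drafts, human refines" pattern. The sketch below shows that loop in schematic form only; caption_model and human_review are hypothetical stand-ins, since the article doesn't describe Meta's actual pipeline or APIs.

```python
# Minimal sketch of a "model drafts, human refines" captioning loop,
# in the spirit of the Meta example above. Both functions below are
# hypothetical stand-ins, not real APIs.
def caption_model(frame_id: str) -> str:
    """Hypothetical model call; returns a draft caption for a frame."""
    return f"A draft caption for {frame_id}."

def human_review(draft: str) -> str:
    """Stand-in for an annotation tool where a person edits the draft."""
    return draft.replace("draft ", "")  # pretend the annotator tightened it

frames = ["clip_001/frame_17", "clip_002/frame_03"]
dataset = []
for frame in frames:
    draft = caption_model(frame)   # cheap, scalable first pass
    final = human_review(draft)    # human catches the model's mistakes
    dataset.append({"frame": frame, "caption": final})

print(dataset)
```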

But synthetic data is no panacea. It suffers from the same 'garbage in, garbage out' problem as the rest of AI: if the model generating it was trained on biased or limited data, its outputs will be similarly tainted. Worse, complex models like OpenAI's o1 can produce harder-to-spot hallucinations in their synthetic output, quietly degrading the accuracy of any model trained on it.

For the foreseeable future, then, we'll need humans in the loop to keep a model's training from going awry. Luca Soldaini, a researcher at the Allen Institute for AI (Ai2), emphasizes that synthetic data must be thoroughly reviewed, curated, and filtered, and paired with fresh, real data, to avoid training forgetful chatbots and homogeneous image generators.
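
In code, Soldaini's advice amounts to a filter-then-mix step before training. The sketch below is one minimal way to express it; quality_score is a hypothetical stand-in for a judge model or heuristic, and the real-to-synthetic mixing ratio is an illustrative choice, not a published recipe.

```python
# Minimal sketch: curating synthetic data before training.
# Step 1 filters low-quality generations (quality_score is a
# hypothetical stand-in for a judge model or heuristics);
# step 2 mixes the survivors with real data at a fixed ratio.
import random

def quality_score(example: str) -> float:
    """Hypothetical scorer; toy proxy that prefers longer text."""
    return min(len(example.split()) / 20, 1.0)

real = [f"real example {i}" for i in range(800)]
synthetic = [f"synthetic example {i} " + "word " * random.randint(1, 30)
             for i in range(2000)]

# 1. Filter: keep only synthetic examples above a quality threshold.
curated = [s for s in synthetic if quality_score(s) >= 0.5]

# 2. Mix: cap synthetic data so fresh, real data stays the majority.
cap = len(real) // 2
training_set = real + random.sample(curated, min(cap, len(curated)))
random.shuffle(training_set)
print(f"{len(real)} real + {len(training_set) - len(real)} synthetic")
```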

As AI-generated data becomes more widespread, it's essential to weigh its benefits against its challenges. Synthetic data can ease data scarcity and cut costs, but it demands careful oversight to keep biases and hallucinations in check. For now, training high-quality AI models will take a combination of real and synthetic data. As the industry evolves, prioritizing transparency, data quality, and human oversight is what will keep AI-generated data an asset rather than a liability.