
Will Synthetic Data Make AI Dumber?

The warning has been sounded in chorus by big tech leaders and analysts: the human data used to train AI is running out, and the solution seems to lie in synthetic data, which however carries several risks. What are the problems with training AI on synthetic data? What are the alternatives? What awaits us?
February 26 2025

Human data for training artificial intelligence is running out, and the solution seems to lie in synthetic data.

This alarming dynamic has been highlighted by figures such as Elon Musk (there is a Guardian article on the subject) and, among others, by Ilya Sutskever, co-founder of OpenAI; in other words, it comes from one of the most advanced and influential big tech AI players in the world.

This fact should not surprise us, because it was predictable. But is it just alarmism or a real issue?

Are we really running out of human data?

The extraordinary processing capabilities of today's models enable very fast training. In essence, within months it is possible to train an AI on several petabytes of useful data (a petabyte is equivalent to one million gigabytes).

However, it is estimated that the global volume of human data on the web is around 200 zettabytes (a zettabyte is equivalent to one million petabytes).

And this might make us think the alarm is just smoke. In practical terms, 1 zettabyte is a huge amount of data. For example, if a petabyte is equal to about 500 billion pages of text, a zettabyte could contain all the content written in the history of humanity thousands of times over.
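To get a feel for these orders of magnitude, here is a back-of-the-envelope check in Python. The 500-billion-pages-per-petabyte figure is the article's own rough estimate, used here purely for illustration:

```python
# Back-of-the-envelope scale check for the figures quoted above.

ZETTABYTE_PB = 1_000_000           # 1 zettabyte = one million petabytes

web_data_zb = 200                  # estimated human data on the web
web_data_pb = web_data_zb * ZETTABYTE_PB

pages_per_pb = 500_000_000_000     # ~500 billion pages of text per petabyte
total_pages = web_data_pb * pages_per_pb

print(f"{web_data_pb:,} PB on the web, roughly {total_pages:.2e} pages of text")
```

On paper, then, the raw quantity is enormous; the catch, as we will see, is how little of it is actually usable.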

But let's not be too impressed: the numbers that matter in practice are decidedly lower. Usable data shrinks drastically once we discard the huge share of the web that consists of (a toy filtering pass is sketched right after this list):

  • Duplicate data,
  • Private and copyrighted content,
  • Low-quality data.
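Such a filtering pass might look like the sketch below. The 20-word quality threshold and the sample documents are invented for illustration; real pipelines also use near-duplicate detection (e.g. MinHash) and license filters:

```python
import hashlib

def clean_corpus(docs):
    """Drop exact duplicates and obviously low-quality documents."""
    seen_hashes = set()
    kept = []
    for text in docs:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:     # exact duplicate: skip
            continue
        if len(text.split()) < 20:    # crude low-quality heuristic (assumed threshold)
            continue
        seen_hashes.add(digest)
        kept.append(text)
    return kept

long_doc = "a plausible web page with enough words to pass the quality check " * 3
corpus = [long_doc, long_doc, long_doc, "too short"]
print(len(clean_corpus(corpus)))      # -> 1: two duplicates and one short doc dropped
```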

So yes: it is perfectly plausible that useful human data is running out.


This is why the AI industry is increasingly taking a serious look at adopting synthetic data. But what is this synthetic data, and what is the problem with it?

What is synthetic data and why is it used?

Synthetic data is information generated artificially by algorithms, simulations or other artificial intelligences, rather than being collected from human sources (a minimal example of what such data can look like is sketched after the list below). It is used to:

  • Extend existing datasets when real-world data is limited.
  • Avoid privacy concerns by eliminating the need to use sensitive data.
  • Improve data diversity to reduce model bias.
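For a concrete (and deliberately simplified) picture, here is what statistically synthetic data can look like: records sampled from an assumed model of the real population rather than copied from real users. The field names and distributions are invented for illustration:

```python
import random

def synth_customer():
    """Generate one synthetic record; no real person's data is exposed."""
    return {
        "age": max(18, round(random.gauss(mu=42, sigma=12))),  # assumed distribution
        "city": random.choice(["Turin", "Milan", "Rome"]),
        "monthly_visits": random.randint(0, 30),
    }

synthetic_dataset = [synth_customer() for _ in range(1_000)]
print(synthetic_dataset[0])
```

In the world of language models the idea is the same, except that the generator is itself an AI model, which is exactly where the problems discussed below begin.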

Companies like OpenAI, Google and Meta are already experimenting with synthetic data to train advanced models: within a few years, therefore, most of the datasets used for AI could be made up entirely of synthetic data.

The risk of the “model that trains itself”

One of the main dangers of excessive use of synthetic data is the phenomenon known as Model Collapse. If an AI is trained on data generated by other AIs, it can develop “recycled” knowledge, progressively losing the ability to produce innovative and accurate content.

A paper published on arXiv highlights that the continuous reuse of synthetic data leads to a gradual degradation of model quality, making the answers increasingly distorted and less grounded in reality.
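To make the mechanism concrete, here is a stylized toy simulation (our own sketch, not the experiment from that paper): each “generation” refits a simple Gaussian model on a small sample drawn from the previous generation's model. Sampling error compounds, and on most runs the fitted variance drifts toward zero, which is the statistical signature of a model losing diversity:

```python
import random
import statistics

random.seed(42)
mu, sigma = 0.0, 1.0                  # generation 0: the "human" distribution
for gen in range(1, 51):
    # Each generation trains only on synthetic samples from the previous one.
    samples = [random.gauss(mu, sigma) for _ in range(10)]
    mu = statistics.fmean(samples)    # refit the "model" on its own output
    sigma = statistics.stdev(samples)
    if gen % 10 == 0:
        print(f"generation {gen}: sigma = {sigma:.4f}")
```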

So what are the concrete risks of this mechanism?

  • Amplified Bias: without human intervention, a model risks reinforcing its errors and distorting reality.
  • Loss of creativity and innovation: AI could generate increasingly predictable and less original responses.
  • “Closed loop” effect: If an AI is trained only on data generated by other AIs, it may develop a language that is less realistic.

As we can see – especially regarding bias – some of these risks contradict one of the very purposes for which synthetic data is used in the first place: reducing model bias.

Hallucinations and loss of reliability

AIs already suffer from hallucinations, that is, incorrect or invented answers.


Using synthetic data could exacerbate the problem, making it increasingly difficult to distinguish between real and artificial information.

The Future of AI Training: What Lies Ahead?

With the depletion of quality human data and the growing reliance on synthetic data, the future of AI training is at a crossroads. Experts are wondering what strategies will ensure AI progress without compromising its reliability and accuracy.

Alternative strategies to synthetic data

Since synthetic data is useful but controversial, many companies are looking for new ways to collect real data without violating privacy or running into legal constraints. Some possible solutions include:

  • Direct interactions with users: Tech giants are exploring ways to collect data directly from users through conversations with chatbots and interactive models.
  • Partnerships with companies and institutions: Access to scientific, journalistic and academic databases could provide new sources of high-quality data.
  • Federated approach: Google has experimented with federated learning, which allows models to train on decentralized data without transferring it to central servers (a minimal sketch of the idea follows this list).
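As an illustration of that last point, here is a minimal federated-averaging sketch, with a toy one-parameter model and made-up client data (real systems add secure aggregation, weighting by dataset size, and far richer models): each client computes an update on its own private data, and the server only ever sees and averages parameters, never the data itself.

```python
from statistics import fmean

def local_update(w, local_data, lr=0.1):
    # One gradient step for the toy model y = w * x (least squares),
    # computed entirely on the client's private (x, y) pairs.
    grad = fmean(2 * (w * x - y) * x for x, y in local_data)
    return w - lr * grad

clients = [
    [(1.0, 2.1), (2.0, 3.9)],   # client A's data never leaves client A
    [(1.5, 3.2), (3.0, 6.1)],   # client B's data never leaves client B
]

w = 0.0
for _ in range(20):
    local_ws = [local_update(w, data) for data in clients]  # runs "on-device"
    w = fmean(local_ws)               # the server averages parameters only
print(f"global weight after 20 federated rounds: {w:.3f}")  # ~2, since y ≈ 2x
```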

Regulation and ethics in data use

With growing concern for privacy and data security, governments and institutions are working on regulations to limit the uncontrolled use of synthetic data. The European Union, with the AI Act, is setting stricter criteria for training models, while in the United States there is debate about the need to label artificially generated content.

Organizations like DeepMind and OpenAI are promoting the use of traceable and verifiable data, in order to ensure that the AIs of the future are more transparent, reliable and less subject to manipulation.

And here we run into another contradiction that generates controversy: how can synthetic data be used if it does not comply with regulations? And are those regulations really able to distinguish human data from synthetic data?

So will training on synthetic data make AIs dumber or not?

From what we have seen, the answer is not so simple or clear-cut. On the one hand, using synthetic data can be practical and, in some respects, has a positive impact on model performance; on the other, it has been shown that massive, exclusive use of synthetic data makes AI responses even more unreliable than they often already are. You only need to probe any model in depth to see this for yourself.

For these reasons it is important for users to first learn to write accurate prompts. AI, of course, is gradually being refined to approach human levels of natural language understanding, through neural networks and advanced machine learning and deep learning techniques.

But it is always us humans – today – who must adequately guide the models to give us the answers we want and to verify them.

And this is one step we can take ourselves, and test directly.

For the rest, we will continue to observe what happens and, of course, keep you updated!


