Elon Musk recently voiced concern during a livestream that the real-world data available for training AI models is close to exhausted. His view echoes that of other experts in the AI field and has prompted the industry to rethink how future models will be developed. Musk sees synthetic data as the key solution to the shortage and noted that many technology companies have already adopted it, a shift that will significantly affect both how AI models are trained and what that training costs.
In a recent livestreamed conversation with Mark Penn, chairman of Stagwell's board of directors, Tesla and SpaceX CEO Elon Musk said that the real-world data available for training artificial intelligence models has almost been exhausted. "We've now exhausted basically the cumulative sum of human knowledge ... in AI training," Musk said. "That happened basically last year."
Musk's view is similar to the "peak data" idea that former OpenAI chief scientist Ilya Sutskever put forward at the NeurIPS conference in December last year. Sutskever said that the AI industry is facing a shortage of training data, and that this shortage will force changes in the way AI models are developed.
To address this problem, Musk believes synthetic data is the viable alternative. He argues that the only way to supplement real-world data is with synthetic data, where the AI generates its own training data. According to Musk, a model can improve its performance by evaluating its own output and learning from the result in a repeated cycle.
Many technology companies, including Microsoft, Meta, OpenAI and Anthropic, already use synthetic data to train their flagship AI models. Gartner estimated that 60% of the data used in artificial intelligence and analytics projects in 2024 would be synthetically generated.
A major advantage of synthetic data is that it can substantially reduce development costs. However, Musk and other experts also point out that synthetic data is not without risks. Research shows that training on synthetic data can degrade model performance, making outputs less creative and more prone to bias. If the synthetic data itself has limitations, the final model's output will inherit them.
Highlights:
Musk is concerned that real-world data available for training AI is almost exhausted.
Synthetic data is considered an important solution for the future and many technology companies are already adopting it.
Using synthetic data can significantly reduce development costs, but it also carries the risk of degrading model performance.
In short, the exhaustion of AI training data is an imminent problem. Synthetic data brings new opportunities, but it also presents challenges. The future direction of AI development will depend on how effectively the industry can use and improve synthetic data while balancing its costs and risks.