Researchers at the Complexity Science Hub (CSH) in Vienna, Austria, recently evaluated the historical knowledge of three leading large language models (LLMs): GPT-4, Llama, and Gemini, and the results were surprising. The findings, presented at the NeurIPS artificial intelligence conference, have prompted deeper reflection on how well LLMs perform in complex domains.
To assess the models' grasp of historical knowledge, the researchers developed a benchmark tool called "Hist-LLM". Built on the Seshat Global History Databank, it checks the accuracy of AI answers to historical questions. The data showed that even the best-performing model, GPT-4 Turbo, reached an accuracy of only 46%, only slightly better than random guessing.
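To make the comparison against random guessing concrete, here is a minimal sketch of how a Hist-LLM-style evaluation could be scored, assuming closed-form questions with a fixed set of answer options. The sample items, the `ask_model` stub, and the answer format are illustrative assumptions, not details published with the study.

```python
# Minimal sketch of a Hist-LLM-style accuracy evaluation.
# The sample items, the ask_model() stub, and the answer format are
# illustrative assumptions; the study's actual harness is not shown here.

import random

# Hypothetical benchmark items: (question, answer options, index of correct option).
ITEMS = [
    ("Was scale armor present in ancient Egypt in this period?", ["yes", "no"], 1),
    ("Did ancient Egypt maintain a professional standing army?", ["yes", "no"], 1),
]

def ask_model(question: str, options: list[str]) -> int:
    """Stand-in for a real LLM call; here it just guesses uniformly at random."""
    return random.randrange(len(options))

def accuracy(items) -> float:
    """Fraction of items on which the model picks the correct option."""
    correct = sum(ask_model(q, opts) == answer for q, opts, answer in items)
    return correct / len(items)

if __name__ == "__main__":
    # The chance baseline depends on the number of options per question:
    # 50% for yes/no items, 25% for four-choice items.
    print(f"accuracy: {accuracy(ITEMS):.0%}")
```

Under this kind of scoring, what counts as "barely better than guessing" depends on the number of answer options per question, which is why the chance baseline matters when interpreting the 46% figure.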
Maria del Rio-Chanona, associate professor of computer science at University College London, said: "While large language models are impressive, their depth of understanding of advanced historical knowledge falls short. They are good at handling simple facts, but struggle with more complex ones." For example, when asked whether scale armor existed in ancient Egypt during a specific period, GPT-4 Turbo incorrectly answered yes, when in fact the technology did not appear in Egypt until roughly 1,500 years later. Likewise, when the researchers asked whether ancient Egypt had a professional standing army during a given period, GPT-4 again answered yes when the correct answer was no.
The study also revealed that the models performed notably worse on questions about certain regions, such as sub-Saharan Africa, suggesting possible biases in their training data. Study lead Peter Turchin noted that these results show LLMs cannot yet substitute for human expertise in certain domains.
Highlights:
- GPT-4 Turbo scored an accuracy of only 46% on the advanced-history benchmark.
- The research shows that large language models still fall short in understanding complex historical knowledge.
- The research team hopes to refine the benchmark in order to better realize the models' potential in historical research.
The results of this study remind us that although large language models have made significant progress in many areas, they still have limitations when dealing with complex problems that demand deep understanding and careful analysis. Future research should focus on improving models' training data and algorithms to strengthen their capabilities across domains and, ultimately, move toward true artificial general intelligence.