Artificial intelligence has shown powerful capabilities across many fields, but its limitations in handling complex historical questions are becoming increasingly apparent. Recent research shows that even state-of-the-art large language models have significant gaps when it comes to nuanced historical detail. The finding raises new questions about the reliability and scope of current AI models, and points to directions for improving future ones.
New research shows that although artificial intelligence excels in areas such as programming and content creation, it still falls short on complex historical questions. A recent study presented at the NeurIPS conference found that even the most advanced large language models (LLMs) struggle to achieve satisfactory results on tests of historical knowledge.
The research team developed a benchmark called Hist-LLM to evaluate three leading language models: OpenAI's GPT-4, Meta's Llama, and Google's Gemini. The test drew on the Seshat global historical database, and the results were disappointing: the best performer, GPT-4 Turbo, reached an accuracy of only 46%.
Maria Del Rio-Chanona, an associate professor at University College London, explained: "These models perform well on basic historical facts, but fall short when it comes to in-depth, PhD-level historical research." The study found that the models frequently get details wrong, for example misjudging whether ancient Egypt possessed certain military technologies or a standing army during particular periods.
The researchers believe this poor performance stems from the models' tendency to extrapolate from prominent, widely documented historical narratives while failing to accurately capture finer historical details. The study also found that the models performed worse on historical questions about regions such as sub-Saharan Africa, pointing to possible biases in the training data.
Peter Turchin, who leads a research group at the Complexity Science Hub (CSH), said the finding shows that AI cannot yet replace human experts in certain specialized fields. Even so, the research team remains optimistic about AI's prospects in historical research and is refining the benchmark to help develop better models.
The results of this study are a reminder that, despite the rapid development of artificial intelligence, the knowledge and judgment of human experts remain irreplaceable in certain fields. AI models will need further improvement before they can handle complex historical information well and serve as genuinely useful auxiliary tools for historical research.