The editor of Downcodes learned that recent research from OpenAI shows that, despite the rapid advancement of AI technology, even the most advanced language models still answer factual questions with worryingly low accuracy. In the tests, even OpenAI's best models scored far below expectations, prompting a re-examination of how well AI models can acquire and recall knowledge.
The study used OpenAI's own SimpleQA benchmark, which contains 4,326 questions spanning fields such as science, politics, and art, each with a single, clearly correct answer.
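To make the evaluation setup concrete, here is a minimal sketch of how a SimpleQA-style accuracy score could be computed. The file name, the two-column CSV schema, and the naive substring grading are illustrative assumptions for this example only; OpenAI's actual release may use a different schema and a more sophisticated (e.g. model-based) grader.

```python
import csv

def normalize(text: str) -> str:
    """Lowercase and strip punctuation for a naive string comparison."""
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())

def score_simpleqa(rows, answer_fn) -> float:
    """Compute accuracy over (question, gold_answer) pairs.

    `answer_fn` is any callable mapping a question string to the model's
    answer string (e.g. a wrapper around a chat-completion API).
    """
    correct = 0
    for question, gold in rows:
        prediction = answer_fn(question)
        if normalize(gold) in normalize(prediction):  # naive substring grading
            correct += 1
    return correct / len(rows)

if __name__ == "__main__":
    # Hypothetical local copy of the benchmark as a question,answer CSV.
    with open("simpleqa.csv", newline="") as f:
        rows = [(r["question"], r["answer"]) for r in csv.DictReader(f)]
    # Stand-in answer_fn: a human typing answers; swap in a model call.
    accuracy = score_simpleqa(rows, answer_fn=lambda q: input(f"{q}\n> "))
    print(f"Accuracy: {accuracy:.1%} over {len(rows)} questions")
```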
With answers verified by two independent reviewers, the results show that OpenAI's best model, o1-preview, achieved an accuracy of only 42.7%, while GPT-4o came in slightly lower at 38.2%. The smaller GPT-4o-mini managed just 8.6%. Anthropic's models fared worse than OpenAI's flagships: Claude-3.5-sonnet reached only 28.9%.
The significance of this research lies in its test design: it not only measures AI performance but also makes users aware of the limitations of AI models in knowledge acquisition. The researchers emphasize that these models should be treated as information-processing tools rather than as authoritative sources of knowledge. To obtain more accurate answers, it is best to supply the AI with reliable data rather than relying solely on its built-in knowledge, as in the sketch below.
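As one way to follow that advice, a caller can pass trusted reference material in the prompt and instruct the model to answer only from it. This is a minimal sketch using the OpenAI Python client; the model name, system prompt wording, and `grounded_answer` helper are illustrative choices, not a method prescribed by the study.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def grounded_answer(question: str, reference_text: str, model: str = "gpt-4o") -> str:
    """Ask the model to answer from the supplied reference text instead of
    relying on its built-in (parametric) knowledge."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer strictly from the provided reference text. "
                        "If the answer is not in the text, say you don't know."},
            {"role": "user",
             "content": f"Reference:\n{reference_text}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```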
It is also worth noting that AI models tend to be overly optimistic about their own capabilities. The researchers found that when these models were asked to rate confidence in their answers, they consistently overstated how likely those answers were to be correct. Even when the same question was posed repeatedly and the model gave the same answer each time, its actual success rate remained below its self-assessed accuracy. This is consistent with the outside criticism that language models often produce absurd answers while sounding confident.
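As an illustration of how such a calibration gap can be measured, the sketch below buckets a model's self-reported confidence scores and compares each bucket's average stated confidence against the fraction of answers actually graded correct. The record format and bin count are assumptions for the example, not the study's methodology.

```python
from collections import defaultdict

def calibration_gap(records, num_bins: int = 10) -> None:
    """Compare stated confidence with actual accuracy per confidence bin.

    `records` is an iterable of (confidence, is_correct) pairs, where
    confidence is the model's self-reported probability in [0, 1] and
    is_correct is a boolean from grading against the gold answer.
    """
    bins = defaultdict(lambda: [0.0, 0, 0])  # conf_sum, num_correct, total
    for confidence, is_correct in records:
        b = min(int(confidence * num_bins), num_bins - 1)
        bins[b][0] += confidence
        bins[b][1] += int(is_correct)
        bins[b][2] += 1
    for b in sorted(bins):
        conf_sum, num_correct, total = bins[b]
        print(f"stated ~{conf_sum / total:.0%} -> "
              f"actual {num_correct / total:.0%} ({total} answers)")

# Toy data: a model claiming 90% confidence that is right only 40% of the time.
calibration_gap([(0.9, i < 4) for i in range(10)])
```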
The researchers conclude that current AI systems have clear gaps in factual accuracy and urgently need improvement. They also pose an open question: does a model's performance on short factual questions predict its performance on longer, more complex responses? To support the development of more reliable language models, OpenAI has publicly released the SimpleQA benchmark data on GitHub.
This research sounds an alarm about the reliability of AI models and points to directions for future improvement. Users should apply AI tools more cautiously while awaiting further breakthroughs in factual accuracy, and the publicly released SimpleQA benchmark should help drive progress across the entire AI field.