A new study from Harvard Medical School and Stanford University shows that OpenAI's o1-preview artificial intelligence system performs well at diagnosing complex medical cases and may even surpass human doctors. In comprehensive tests, o1-preview's diagnostic accuracy and medical reasoning significantly outperformed earlier models as well as experienced physicians and medical residents. The research points to new directions for AI in medicine, while also prompting discussion of the ethical and practical issues raised by applying AI in clinical practice.
OpenAI's o1-preview artificial intelligence system may be better than human doctors at diagnosing complex medical cases, the new study suggests. Research teams from Harvard Medical School and Stanford University ran comprehensive medical diagnostic tests on o1-preview, and the results showed that the system has improved significantly over earlier versions.
According to the study results, o1-preview achieved a correct diagnosis rate of 78.3% across all tested cases. In a direct comparison on 70 specific cases, its accurate diagnosis rate reached 88.6%, significantly surpassing the 72.9% of its predecessor GPT-4. O1-preview's medical reasoning was equally impressive. On the R-IDEA scale, a standard for assessing the quality of medical reasoning, the AI system received a perfect score on 78 of 80 cases. By comparison, experienced physicians achieved perfect scores on only 28 cases, and medical residents on only 16.
The researchers acknowledge that some test cases may have been included in o1-preview's training data. However, when they tested the system on new cases, performance dropped only slightly. Dr. Adam Rodman, one of the study's authors, emphasized that although this is a benchmark study, the results have important implications for medical practice.
o1-preview performed particularly well on complex management cases specially designed by 25 experts. "Humans are powerless in the face of these problems, but o1's performance is amazing," Rodman explained. On these complex cases, o1-preview scored 86%, while doctors using GPT-4 scored only 41% and those using traditional tools only 34%.
However, o1-preview is not without flaws. Its probability assessments did not improve significantly: when estimating the likelihood of pneumonia, for example, o1-preview gave an estimate of 70%, well above the scientifically supported range of 25%-42%. The researchers found that o1-preview did well on tasks requiring critical thinking but fell short on more abstract challenges such as estimating probabilities.
Additionally, o1-preview often provides detailed answers, which may have boosted its ratings. The study also evaluated o1-preview working alone and did not assess how it performs in collaboration with doctors. Some critics point out that the diagnostic tests o1-preview suggests are often costly and impractical.
Although OpenAI has since released new o1 and o3 versions that perform well on complex reasoning tasks, these more powerful models still do not resolve the practical-application and cost concerns raised by critics. Rodman called for better ways to evaluate medical AI systems, ones that capture the complexity of real-world medical decisions. He emphasized that this research is not meant to replace doctors, and that actual medical care still requires human participation.
Paper: https://arxiv.org/abs/2412.10849
Highlights:
o1-preview surpassed doctors in diagnosis, reaching an accuracy rate of 88.6%.
In medical reasoning, o1-preview achieved perfect scores on 78 of 80 cases, far exceeding the performance of doctors.
Despite its excellent performance, o1-preview's high cost and impractical testing recommendations in real-world use still need to be addressed.
All in all, this study demonstrates the great potential of artificial intelligence in medical diagnosis, but it also reminds us to be cautious about applying AI in medical practice and to pay attention to its limitations and potential risks. Further research and refinement will be needed to ensure that AI can safely and effectively assist medical work and better serve human health.