A recent study that tested leading large language models (LLMs) on the Montreal Cognitive Assessment (MoCA) found that these AI models exhibited cognitive impairments similar to those of early-stage dementia patients. The research, published in the Christmas special issue of the British Medical Journal (The BMJ), has prompted a rethink of AI's application prospects in medicine, exposing its limitations in tasks that require visuospatial abilities and executive function. The results challenge the view that AI is about to replace human doctors and raise new questions for the further development of AI in clinical applications.
The study found that nearly all leading large language models, or "chatbots," showed signs of mild cognitive impairment when assessed with a test commonly used to detect early-stage dementia.
The study also found that older versions of these chatbots, like aging human patients, performed worse on the test. The researchers believe these findings "challenge the assumption that artificial intelligence will soon replace human doctors."
Recent advances in artificial intelligence have sparked excitement and concern about whether chatbots will surpass human doctors in medical tasks.
Although previous research has shown that large language models (LLMs) perform well on a variety of medical diagnostic tasks, whether they are susceptible to human-like impairments such as cognitive decline had remained largely unexplored, until now.
To fill this knowledge gap, the researchers used the Montreal Cognitive Assessment (MoCA) test to evaluate the cognitive abilities of leading publicly available LLMs, including ChatGPT 4 and 4o developed by OpenAI, Claude 3.5 "Sonnet" developed by Anthropic, and Gemini 1 and 1.5 developed by Alphabet.
The MoCA test is widely used to detect signs of cognitive impairment and early dementia, typically in older adults. Through a series of short tasks and questions, it assesses a range of abilities including attention, memory, language skills, visuospatial skills, and executive function. The maximum score is 30 points, and a score of 26 or above is generally considered normal.
The researchers gave the LLMs task instructions identical to those given to human patients. Scoring followed official guidelines and was assessed by a practicing neurologist.
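The paper does not publish its prompts or code, but the administration step can be pictured roughly as follows. This is a minimal sketch assuming the OpenAI Python client; the model name and the instruction wording are illustrative, not the study's materials:

```python
# A minimal sketch, not the study's actual code: presenting a MoCA-style
# task instruction to a chat model exactly as it would be read to a patient.
# Assumes the OpenAI Python client; model name and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Instruction paraphrased in the style of the MoCA's memory task; the real
# study read the official test script to each model verbatim.
task_instruction = (
    "This is a memory test. I am going to read a list of words that you "
    "will have to remember. Repeat them back to me: "
    "FACE, VELVET, CHURCH, DAISY, RED."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": task_instruction}],
)
print(response.choices[0].message.content)
```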
In the MoCA test, ChatGPT 4o achieved the highest score (26 out of 30 points), followed by ChatGPT 4 and Claude (25 out of 30 points), while Gemini 1.0 scored the lowest (16 out of 30 points).
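Set against the 26-point cutoff described above, only the top model lands in the normal range. The snippet below simply applies that threshold to the scores reported in the study (an illustrative calculation, not part of the paper):

```python
# Apply the standard MoCA cutoff (a score of 26 or above out of 30 is
# considered normal) to the model scores reported in the study.
MOCA_MAX = 30
NORMAL_CUTOFF = 26

scores = {
    "ChatGPT 4o": 26,
    "ChatGPT 4": 25,
    "Claude 3.5 Sonnet": 25,
    "Gemini 1.0": 16,
}

for model, score in scores.items():
    status = "normal" if score >= NORMAL_CUTOFF else "below the normal cutoff"
    print(f"{model}: {score}/{MOCA_MAX} -> {status}")
```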
All chatbots performed poorly on visuospatial skills and executive tasks, such as the trail making task (connecting encircled numbers and letters in ascending order) and the clock drawing test (drawing a clock face showing a specific time). The Gemini models also failed the delayed recall task (remembering a sequence of five words).
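The trail making rule itself is simple to state precisely: the path must alternate between numbers and letters, each in ascending order (1, A, 2, B, 3, C, ...). Here is an illustrative checker for that rule, not taken from the study:

```python
# Illustrative only: check that a proposed trail making path alternates
# numbers and letters in ascending order (1, A, 2, B, 3, C, ...).
def is_valid_trail(path: list[str], pairs: int = 5) -> bool:
    expected = []
    for i in range(pairs):
        expected.append(str(i + 1))          # "1", "2", "3", ...
        expected.append(chr(ord("A") + i))   # "A", "B", "C", ...
    return path == expected

print(is_valid_trail(["1", "A", "2", "B", "3", "C", "4", "D", "5", "E"]))  # True
print(is_valid_trail(["1", "2", "A", "B", "3", "C", "4", "D", "5", "E"]))  # False
```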
All chatbots performed well on most other tasks including naming, attention, language, and abstraction.
However, in further testing, the chatbots were unable to demonstrate empathy or accurately interpret complex visual scenes. Only ChatGPT 4o succeeded in the incongruent stage of the Stroop test, which uses combinations of color names and font colors to measure how interference affects reaction time.
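To make the Stroop setup concrete: a congruent item prints a color word in its own ink color, while an incongruent item prints it in a different ink color, and naming the ink color under that mismatch is what produces the interference effect. A small illustrative construction (not the study's materials):

```python
# Illustrative construction of Stroop stimuli. Congruent items show a color
# word in its own ink color; incongruent items pair the word with a
# different ink color, which is what creates the interference.
import random

COLORS = ["red", "green", "blue", "yellow"]

def make_stimulus(congruent: bool) -> tuple[str, str]:
    word = random.choice(COLORS)
    ink = word if congruent else random.choice([c for c in COLORS if c != word])
    return word, ink  # the task is to name the ink color, not read the word

word, ink = make_stimulus(congruent=False)
print(f'The word "{word.upper()}" printed in {ink} ink -> correct answer: {ink}')
```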
These are observational findings, and the researchers acknowledge that there are fundamental differences between the human brain and large language models. Even so, they noted that all of the models consistently failed tasks requiring visual abstraction and executive function, highlighting a significant weakness that could hinder their use in clinical settings.
As a result, they conclude: "Not only are neurologists unlikely to be replaced by large language models any time soon, but our findings suggest that they may soon find themselves treating new, virtual patients: artificial intelligence models presenting with cognitive impairment."
All in all, this research sounds a wake-up call for the application of artificial intelligence in medicine: we cannot be blindly optimistic, and we need a clear-eyed view of AI's limitations while exploring safe and reliable ways to deploy it. How to make up for AI's deficiencies in cognitive abilities will be an important direction for its future development.