Recently, the performance evaluation of multimodal large language models (MLLMs) has become a research hotspot in artificial intelligence. The Mementos benchmark, jointly released by the University of Maryland and the University of North Carolina at Chapel Hill, provides a new standard for evaluating how well MLLMs handle image sequences. The benchmark covers image sequences from a variety of scenarios, including the real world, robotics, and comics/animation, and is designed to comprehensively examine the sequential reasoning capabilities of MLLMs. Its results offer valuable data for understanding the strengths and limitations of current models.
The results, however, are sobering: MLLMs such as GPT-4V and Gemini achieve less than 20% accuracy on the comics subset, revealing clear weaknesses in object and behavior understanding and a marked tendency to hallucinate when reasoning over image sequences.

The Mementos results show that current mainstream MLLMs still fall significantly short when processing complex image sequences, especially animated ones. This provides an important reference for future MLLM research and is a reminder to be cautious about the reliability of MLLMs across application scenarios. Future work will need to focus on improving MLLMs' understanding of image sequences, reducing hallucinations, and strengthening their generalization across different types of image sequences.
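To make concrete what "accuracy" on an image-sequence benchmark can involve, the sketch below shows one simple way to score a model's free-form description of a sequence against ground-truth object and behavior keywords. This is purely illustrative: the article does not describe Mementos' actual scoring procedure, and all names, data, and the matching rule here are invented assumptions, not the benchmark's pipeline.

```python
"""Hypothetical sketch of keyword-recall scoring for an MLLM's description
of an image sequence. All names, data, and the matching rule are invented
for illustration; this is not the actual Mementos evaluation pipeline."""

from dataclasses import dataclass


@dataclass
class SequenceAnnotation:
    """Ground-truth keywords for one image sequence."""
    objects: set    # e.g. {"robot arm", "mug"}
    behaviors: set  # e.g. {"grasps", "pours"}


def keyword_recall(description: str, keywords: set) -> float:
    """Fraction of ground-truth keywords mentioned in the model's description
    (naive case-insensitive substring match)."""
    if not keywords:
        return 1.0
    text = description.lower()
    hits = sum(1 for kw in keywords if kw.lower() in text)
    return hits / len(keywords)


if __name__ == "__main__":
    # Invented example: a model's description of a short robot sequence.
    model_output = "A robot arm grasps a mug and places it on the table."
    annotation = SequenceAnnotation(
        objects={"robot arm", "mug", "table"},
        behaviors={"grasps", "pours"},
    )
    print(f"object recall:   {keyword_recall(model_output, annotation.objects):.2f}")    # 1.00
    print(f"behavior recall: {keyword_recall(model_output, annotation.behaviors):.2f}")  # 0.50, "pours" missing
```

Under a scheme like this, a model that names the right objects but misdescribes what happens across frames scores well on object recall and poorly on behavior recall, which mirrors the kind of behavior-understanding gap the Mementos results highlight.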