Apple has released MM1.5, a major upgrade to its multi-modal artificial intelligence model MM1. More than a simple version bump, the update is an across-the-board improvement that markedly strengthens the model's image understanding, text recognition, and visual instruction-following. The editor of Downcodes explains the improvements in MM1.5 and their significance for multi-modal artificial intelligence below.
The move from MM1 to MM1.5 is not just a change of version number but a comprehensive capability upgrade, allowing the model to deliver stronger performance across a wide range of tasks.
The core of the MM1.5 upgrade is its data-centric training approach, in which the training datasets are carefully curated and optimized. Specifically, MM1.5 uses high-quality OCR data and synthetic image captions, together with an optimized visual instruction-tuning data mixture for fine-tuning. These data noticeably improve the model's text recognition, image understanding, and ability to follow visual instructions.
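To make the idea of a fine-tuning data mixture concrete, here is a minimal sketch of how such a mixture might be expressed and sampled in code. The category names and weights are illustrative assumptions for this article, not the actual proportions or categories reported in the paper.

```python
import random

# Hypothetical SFT data mixture: category names and weights are illustrative,
# not MM1.5's actual configuration.
DATA_MIXTURE = {
    "ocr_text_rich": 0.25,       # high-quality OCR / document images
    "synthetic_captions": 0.20,  # model-generated image descriptions
    "visual_instructions": 0.40, # general visual instruction-following data
    "multi_image": 0.10,         # multi-image reasoning samples
    "mobile_ui": 0.05,           # UI screenshots with grounded annotations
}

def sample_category(mixture=DATA_MIXTURE):
    """Pick a data category according to the mixture weights."""
    categories = list(mixture.keys())
    weights = list(mixture.values())
    return random.choices(categories, weights=weights, k=1)[0]

if __name__ == "__main__":
    # Draw samples to see roughly how often each category appears.
    counts = {}
    for _ in range(10_000):
        c = sample_category()
        counts[c] = counts.get(c, 0) + 1
    print(counts)
```

In practice, tuning the relative weights of such categories is one of the main levers a data-centric approach adjusts during training.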
In terms of model size, MM1.5 spans versions from 1 billion to 30 billion parameters, including both dense and mixture-of-experts (MoE) variants. Notably, even the smaller 1-billion- and 3-billion-parameter models reach impressive performance levels thanks to carefully designed data and training strategies.
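To illustrate the difference between the dense and MoE variants mentioned above, the sketch below shows a toy mixture-of-experts feed-forward block in PyTorch, where each token is routed to one of several expert networks instead of a single shared one. This is a generic top-1 routing example under simple assumptions (four experts, arbitrary sizes); it is not Apple's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoEFeedForward(nn.Module):
    """Toy top-1 mixture-of-experts feed-forward block.

    Illustrative only: expert count, hidden size, and routing scheme are
    arbitrary choices, not MM1.5's real configuration.
    """

    def __init__(self, d_model=64, d_hidden=256, num_experts=4):
        super().__init__()
        # The router scores each token against every expert.
        self.router = nn.Linear(d_model, num_experts)
        # Each expert is an independent feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        scores = F.softmax(self.router(x), dim=-1)   # routing probabilities
        top_p, top_idx = scores.max(dim=-1)          # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                      # tokens routed to expert i
            if mask.any():
                out[mask] = top_p[mask].unsqueeze(-1) * expert(x[mask])
        return out

if __name__ == "__main__":
    layer = ToyMoEFeedForward()
    tokens = torch.randn(2, 8, 64)
    print(layer(tokens).shape)  # torch.Size([2, 8, 64])
```

The appeal of MoE variants is that only a fraction of the parameters is active per token, which is one way to grow total capacity without a proportional increase in compute per inference step.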
The capability improvements in MM1.5 show up mainly in the following areas: understanding of text-rich images, visual referring and grounding, multi-image reasoning, video understanding, and mobile UI understanding. These capabilities let MM1.5 serve a broader range of scenarios, such as identifying performers and instruments in concert photos, reading chart data and answering related questions, or locating specific objects in complex scenes.
To evaluate MM1.5, the researchers compared it with other advanced multi-modal models. The results show that MM1.5-1B performs strongly among models at the 1-billion-parameter scale, clearly outperforming peers of the same size, while MM1.5-3B outperforms MiniCPM-V2.0 and is on par with InternVL2 and Phi-3-Vision. The study also found that, for both dense and MoE models, performance improves significantly as scale increases.
The success of MM1.5 not only reflects Apple's research and development strength in artificial intelligence, but also points the way for the future development of multi-modal models. By optimizing data processing and model architecture, even smaller-scale models can achieve strong performance, which matters greatly for deploying high-performance AI models on resource-constrained devices.
Paper address: https://arxiv.org/pdf/2409.20566
All in all, the release of MM1.5 marks a significant advancement in multi-modal artificial intelligence technology. Its innovations in data processing and model architecture provide new ideas and directions for the development of future AI models. We look forward to Apple continuing to bring more breakthrough results in the field of artificial intelligence.