The editor of Downcodes has learned that a Chinese research team has released Infinity-MM, an ultra-large-scale multimodal dataset, along with Aquila-VL-2B, an AI model trained on it. The dataset contains massive volumes of image captions and visual instruction data, and the team used advanced image analysis and information extraction techniques to ensure its quality and diversity. Aquila-VL-2B performed strongly across multiple benchmarks, surpassing comparable systems and demonstrating China's significant progress in multimodal AI. Its open-source release should greatly advance academic research and technological development.
The scale of the Infinity-MM dataset is staggering. It comprises four major categories of data: 10 million image captions, 24.4 million general visual instruction samples, 6 million curated high-quality instruction samples, and 3 million samples generated by AI models such as GPT-4. The research team used the open-source RAM++ model for image analysis and information extraction, and relied on a dedicated six-category classification system to ensure the quality and diversity of the generated data.
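For readers who want the composition at a glance, here is a minimal Python sketch that tallies the four categories reported above. The category counts come from the article; the dictionary keys and helper function are illustrative naming choices, not the dataset's actual schema or tooling.

```python
# Sketch of the Infinity-MM composition as reported in the article.
# Key names are illustrative assumptions, not the dataset's real schema.
INFINITY_MM_COMPOSITION = {
    "image_captions": 10_000_000,
    "general_visual_instructions": 24_400_000,
    "curated_high_quality_instructions": 6_000_000,
    "model_generated_data": 3_000_000,  # e.g. produced with GPT-4-class models
}

def total_samples(composition: dict[str, int]) -> int:
    """Sum the per-category counts (about 43.4 million samples overall)."""
    return sum(composition.values())

if __name__ == "__main__":
    print(f"Total samples: {total_samples(INFINITY_MM_COMPOSITION):,}")
```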
In terms of architecture, Aquila-VL-2B is built on the LLaVA-OneVision framework and combines the Qwen-2.5 language model with SigLIP for image processing. The team adopted a four-stage progressive training method: starting with basic image-text alignment, then moving to general visual tasks and specific instruction following, and finally incorporating synthetic data, while gradually raising the maximum image resolution at each stage.
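To make the staged schedule more concrete, here is a minimal Python sketch of how such a four-stage curriculum with a rising image-resolution cap could be organized. The stage names, data mixes, and resolution values are illustrative assumptions, not the team's actual training code or configuration.

```python
# Conceptual sketch of a four-stage progressive training schedule.
# All names and numbers here are assumptions for illustration only.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    data_mix: list[str]   # which data sources feed this stage
    max_image_res: int    # upper limit on image resolution (pixels per side)

STAGES = [
    Stage("image_text_alignment", ["image_captions"], 384),
    Stage("general_visual_tasks", ["general_visual_instructions"], 512),
    Stage("instruction_following", ["curated_high_quality_instructions"], 768),
    Stage("synthetic_data_mix",
          ["curated_high_quality_instructions", "model_generated_data"], 1024),
]

def run_curriculum(stages: list[Stage]) -> None:
    """Walk the stages in order, raising the image-resolution cap each time."""
    for stage in stages:
        print(f"Stage '{stage.name}': mix={stage.data_mix}, "
              f"max resolution={stage.max_image_res}px")
        # train_one_stage(model, stage)  # placeholder; actual training step not public

if __name__ == "__main__":
    run_curriculum(STAGES)
```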
Despite having only 2 billion parameters, Aquila-VL-2B performed strongly across a range of benchmarks. It achieved a best-in-class score of 54.9% on the multimodal understanding benchmark MMStar and reached 59% on the mathematical reasoning benchmark MathVista, significantly surpassing comparable systems. In general image understanding, it scored 43% on HallusionBench and 75.2% on MMBench.
The research also found that synthetic data contributes substantially to model performance: in ablation experiments, removing this additional data lowered performance by an average of 2.4%. From the third training stage onward, Aquila-VL-2B clearly outperformed reference models such as InternVL2-2B and Qwen2VL-2B, and in the fourth stage the gains grew more pronounced as the amount of data increased.
Notably, the team has released both the dataset and the model to the research community, which should greatly accelerate the development of multimodal AI. The model was trained not only on Nvidia A100 GPUs but also on China's domestically developed chips, demonstrating strong hardware adaptability.
The launch of Aquila-VL-2B marks a major breakthrough for multimodal AI in China. Its open-source release and strong performance will drive technological development and application innovation in the field, injecting new vitality into the future of artificial intelligence. The editor of Downcodes looks forward to more breakthroughs of this kind.