A Chinese research team has released Infinity-MM, a super-large-scale multimodal dataset, together with Aquila-VL-2B, an AI model trained on it, marking a significant advance in multimodal AI. Infinity-MM contains massive amounts of image captions, visual instruction data, and data generated by GPT-4-class models; the team used the RAM++ model for image analysis and a unique six-category classification system to ensure data quality. Aquila-VL-2B is built on the LLaVA-OneVision architecture, combining the Qwen-2.5 language model with SigLIP image processing, and was trained with a four-stage progressive method. It performs strongly on multiple benchmarks, surpassing comparable systems.
The Infinity-MM dataset is remarkable in scale and contains four categories of data: 10 million image captions, 24.4 million general visual instruction samples, 6 million selected high-quality instruction samples, and 3 million samples generated by AI models such as GPT-4. The research team used the open-source RAM++ model for image analysis and information extraction, and ensured the quality and diversity of the generated data through a unique six-category classification system.
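The composition can be summarized in a minimal sketch; the category names and the loader-style dictionary below are illustrative assumptions, not the dataset's actual schema, with only the reported sample counts taken from the article.

```python
# Illustrative summary of the reported Infinity-MM composition.
# Category keys are hypothetical labels, not official dataset field names.
INFINITY_MM_CATEGORIES = {
    "image_captions": 10_000_000,                 # image description data
    "general_visual_instructions": 24_400_000,    # general visual instruction data
    "selected_high_quality_instructions": 6_000_000,
    "model_generated_instructions": 3_000_000,    # generated by GPT-4-class models
}

total = sum(INFINITY_MM_CATEGORIES.values())
print(f"Total samples: {total:,}")  # 43,400,000
```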
In terms of model architecture, Aquila-VL-2B is built on LLaVA-OneVision and integrates the Qwen-2.5 language model with SigLIP image processing technology. The research team adopted a four-stage progressive training method: starting from basic image-text alignment learning, gradually transitioning to general visual tasks and specific instruction processing, and finally incorporating synthetic data, while progressively raising the upper limit on image resolution, as sketched below.
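The sketch below shows the general shape of a LLaVA-OneVision-style assembly, where a SigLIP-like vision encoder feeds an MLP projector whose outputs are prepended to the language model's input embeddings, followed by a staged curriculum. The projector design, module dimensions, and the specific resolution values per stage are assumptions for illustration, not details confirmed by the source.

```python
# Minimal sketch of a LLaVA-OneVision-style vision-language model.
# vision_encoder and language_model are stand-ins for SigLIP and Qwen-2.5.
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    def __init__(self, vision_encoder, language_model, vision_dim, text_dim):
        super().__init__()
        self.vision_encoder = vision_encoder      # stand-in for SigLIP
        self.language_model = language_model      # stand-in for Qwen-2.5
        # Two-layer MLP projector mapping vision features into the text embedding space
        self.projector = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, images, text_embeds):
        vision_feats = self.vision_encoder(images)       # (B, N_img, vision_dim)
        vision_tokens = self.projector(vision_feats)     # (B, N_img, text_dim)
        # Prepend projected image tokens to the text embedding sequence
        fused = torch.cat([vision_tokens, text_embeds], dim=1)
        return self.language_model(fused)

# Four-stage curriculum: from caption alignment to general and specific
# instructions and finally synthetic data, with a rising resolution cap.
# Resolution values here are hypothetical placeholders.
TRAINING_STAGES = [
    {"stage": 1, "data": "image-text alignment (captions)",  "max_resolution": 384},
    {"stage": 2, "data": "general visual instructions",      "max_resolution": 512},
    {"stage": 3, "data": "specific instruction processing",  "max_resolution": 768},
    {"stage": 4, "data": "synthetic instruction data",       "max_resolution": 1024},
]
```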
Despite having only 2 billion parameters, Aquila-VL-2B performed well across a range of benchmarks. It achieved a best-in-class score of 54.9% on the multimodal understanding benchmark MMStar and a high score of 59% on the mathematics benchmark MathVista, significantly surpassing similar systems. In general image understanding tests, the model scored 43% on HallusionBench and 75.2% on MMBench.
The study found that introducing synthetic data contributed significantly to the improvement in model performance. Experiments showed that without this additional data, model performance dropped by an average of 2.4%. From the third training stage onward, Aquila-VL-2B clearly surpassed reference models such as InternVL2-2B and Qwen2VL-2B; in the fourth stage in particular, the performance gains became more pronounced as the amount of data increased.
It is worth mentioning that the research team has opened the dataset and model to the research community, which will greatly promote the development of multimodal AI technology. The model was not only trained on Nvidia A100 GPUs but also supports China's domestically developed chips, demonstrating strong hardware adaptability.
The success of the Aquila-VL-2B model, together with the open-sourcing of the dataset and model, marks significant progress in China's multimodal artificial intelligence field, provides a solid foundation for future AI development, and suggests that multimodal AI technology will see broader application prospects.