Apple and the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland have jointly developed 4M-21, an any-to-any multimodal model trained on 21 different modalities. It outperforms existing models across a wide range of tasks and supports capabilities such as cross-modal retrieval and controllable generation. The researchers improved performance and adaptability by scaling up the model and dataset, increasing the type and number of training modalities, and training jointly across multiple datasets. 4M-21 uses a Transformer-based encoder-decoder architecture, with additional modality embeddings added to accommodate the new modalities, and its training pipeline applies modality-specific tokenization methods tailored to the characteristics of each modality.
The study builds on the 4M pre-training scheme, improving performance and adaptability by enlarging the model and datasets, increasing the type and number of modalities involved in training, and training jointly on multiple datasets. The researchers use different tokenization methods to discretize modalities with different characteristics, such as global image embeddings, human poses, and semantic instances. For the architecture, they retain the Transformer-based 4M encoder-decoder and add extra modality embeddings to accommodate the new modalities.
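To make the design concrete, here is a minimal sketch of the general idea: every modality is first discretized into tokens by its own tokenizer, and a shared Transformer encoder-decoder then consumes those tokens together with a learned per-modality embedding. All class names, vocabulary sizes, and dimensions below are illustrative placeholders, not the actual 4M-21 implementation.

```python
# Minimal sketch of the 4M-style idea: each modality is mapped to discrete
# tokens by its own tokenizer, then a shared Transformer encoder-decoder
# consumes the tokens plus a learned modality embedding.
# Names and sizes are illustrative, not the real 4M-21 code.
import torch
import torch.nn as nn

VOCAB_SIZE = 1024      # assumed size of the shared token vocabulary
NUM_MODALITIES = 21    # RGB, depth, normals, poses, SAM instances, ...
D_MODEL = 512

class ToyMultimodalTransformer(nn.Module):
    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        # one learned embedding per modality, added to every token of that modality
        self.modality_emb = nn.Embedding(NUM_MODALITIES, D_MODEL)
        self.backbone = nn.Transformer(
            d_model=D_MODEL, nhead=8,
            num_encoder_layers=2, num_decoder_layers=2,
            batch_first=True,
        )
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def embed(self, tokens, modality_ids):
        # tokens: (B, T) discrete token ids; modality_ids: (B, T) modality of each token
        return self.token_emb(tokens) + self.modality_emb(modality_ids)

    def forward(self, src_tokens, src_mod, tgt_tokens, tgt_mod):
        src = self.embed(src_tokens, src_mod)
        tgt = self.embed(tgt_tokens, tgt_mod)
        return self.head(self.backbone(src, tgt))

# Example: predict depth tokens (modality 1) from RGB tokens (modality 0).
model = ToyMultimodalTransformer()
rgb = torch.randint(0, VOCAB_SIZE, (2, 16))
depth = torch.randint(0, VOCAB_SIZE, (2, 16))
logits = model(rgb, torch.zeros_like(rgb), depth, torch.ones_like(depth))
print(logits.shape)  # (2, 16, 1024)
```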
Out of the box, the model can perform a range of common vision tasks, such as DIODE surface normal and depth estimation, COCO semantic and instance segmentation, and 3DPW 3D human pose estimation. It can also generate any of its training modalities, supports several ways of performing fine-grained and multimodal generation, and can retrieve RGB images or other modalities using any other modality as the query. In addition, the researchers conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.
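Cross-modal retrieval of this kind can be pictured as comparing a global embedding predicted from the query modality against a gallery of embeddings by cosine similarity. The sketch below assumes a query embedding has already been predicted by the model; it illustrates the retrieval step only and is not the 4M-21 code.

```python
# Hedged sketch of cross-modal retrieval: a query of any modality is mapped
# to a global embedding, then nearest neighbours in an image gallery are
# found by cosine similarity. The query embedding here is a random stand-in.
import numpy as np

def cosine_retrieve(query_vec: np.ndarray, gallery: np.ndarray, top_k: int = 5):
    """Return indices of the top_k gallery embeddings most similar to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q
    return np.argsort(-scores)[:top_k], scores

# Toy usage: 1000 gallery images with 512-d global embeddings; the query
# embedding would normally be predicted from e.g. a depth map or a caption.
rng = np.random.default_rng(0)
gallery_embeddings = rng.normal(size=(1000, 512))
query_embedding = rng.normal(size=512)
idx, scores = cosine_retrieve(query_embedding, gallery_embeddings)
print(idx, scores[idx])
```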
Its key features include:
Any-to-any modalities: increases the modality count from the 7 supported by the best existing any-to-any models to 21 different modalities, enabling cross-modal retrieval, controllable generation, and strong out-of-the-box performance (a generation sketch follows this list).
Diversity: adds support for more structured data such as human poses, SAM instances, metadata, and more.
Tokenization: studies discrete tokenization of diverse modalities using modality-specific approaches, for example for global image embeddings, human poses, and semantic instances.
Scaling: expands the model size to 3B parameters and the dataset to 0.5B samples.
Co-training: trains jointly on vision and language at the same time.
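As referenced in the any-to-any item above, one way to picture any-to-any use is chained generation: decode the tokens of one target modality from a source modality, then feed the result back in as the source for the next modality. The sketch below uses a simple greedy decoding loop over a model with the same interface as the toy Transformer sketched earlier; the actual 4M-21 generation scheme differs, and the tokenizer call in the usage comment is a hypothetical placeholder.

```python
# Hedged sketch of chained any-to-any generation: target-modality tokens are
# decoded step by step from a source modality, and the output can in turn
# serve as the source for generating the next modality.
import torch

VOCAB_SIZE, BOS = 1024, 0  # assumed vocabulary size and start token

@torch.no_grad()
def greedy_generate(model, src_tokens, src_mod, tgt_mod_id, tgt_len=16):
    """Greedily decode tgt_len tokens of the target modality from the source tokens."""
    tgt = torch.full((src_tokens.size(0), 1), BOS, dtype=torch.long)
    tgt_mod = torch.full_like(tgt, tgt_mod_id)
    for _ in range(tgt_len):
        logits = model(src_tokens, src_mod, tgt, tgt_mod)   # (B, T, vocab)
        next_tok = logits[:, -1].argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_tok], dim=1)
        tgt_mod = torch.full_like(tgt, tgt_mod_id)
    return tgt[:, 1:]

# Chain RGB (modality 0) -> depth (1) -> semantic segmentation (2):
# rgb_tokens = tokenize_rgb(image)  # hypothetical modality-specific tokenizer
# depth_tokens = greedy_generate(model, rgb_tokens, torch.zeros_like(rgb_tokens), 1)
# seg_tokens = greedy_generate(model, depth_tokens, torch.ones_like(depth_tokens), 2)
```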
Paper address: https://arxiv.org/pdf/2406.09406
Highlights:
- Researchers from Apple and the École Polytechnique Fédérale de Lausanne (EPFL) in Switzerland jointly developed a single any-to-any modality model trained on 21 different modalities.
- The model can perform a range of common vision tasks out of the box, can generate any of its training modalities, and supports several ways of performing fine-grained and multimodal generation.
- The researchers also conducted multimodal transfer experiments on NYUv2, Hypersim semantic segmentation, and ARKitScenes.
Overall, the 4M-21 model marks significant progress in multimodal research. Its strong performance and broad range of applications point to a new direction for the development of multimodal artificial intelligence, and the model's open-source release and future applications are worth watching.