The editor of Downcodes takes you through LLM2CLIP, an innovative technique for improving the performance of CLIP models. As a key multi-modal foundation model, CLIP excels at tasks such as image-text retrieval but struggles with long texts. Researchers from Microsoft and Tongji University proposed the LLM2CLIP method, which cleverly uses large language models (LLMs) to enhance CLIP's visual representation learning and overcome the limitations of the original CLIP model.
As a search engine, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval. As a feature extractor, it dominates nearly all cross-modal representation tasks, including image understanding, video understanding, and text-to-image or text-to-video generation. CLIP's power lies in its ability to connect images with natural language and capture human knowledge, thanks to its training on large-scale web data containing detailed textual descriptions.
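To ground the discussion, here is a minimal sketch of how CLIP matches an image against candidate captions using the Hugging Face transformers API; the checkpoint name, image path, and captions are illustrative placeholders and are not part of the LLM2CLIP release.

```python
# Minimal sketch of CLIP-style image-text matching / zero-shot classification.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder image path
captions = ["a photo of a cat", "a photo of a dog", "a diagram of a neural network"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds scaled cosine similarities between the image and each caption
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```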
However, CLIP has clear limitations when handling long and complex text descriptions. To overcome this problem, researchers from Microsoft and Tongji University proposed the LLM2CLIP method, which aims to enhance visual representation learning by integrating large language models (LLMs). The method boldly replaces the original CLIP text encoder, drawing on the rich knowledge of LLMs to improve CLIP's visual encoder. The researchers found, however, that integrating an LLM directly into CLIP degrades performance, so this challenge had to be addressed first.
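Conceptually, the idea can be pictured as follows: caption embeddings produced by the LLM take the place of CLIP's original text tower, and the vision encoder is fine-tuned against them with a contrastive loss. This is an illustrative sketch, not the authors' implementation; the adapter layers, dimensions, and the assumption that LLM caption features are precomputed and frozen are placeholders.

```python
# Illustrative sketch of the core LLM2CLIP idea (not the official code):
# align the CLIP vision encoder with LLM-derived caption embeddings via InfoNCE.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LLM2CLIPAligner(nn.Module):
    def __init__(self, vision_encoder: nn.Module, vision_dim: int, llm_dim: int, embed_dim: int = 512):
        super().__init__()
        self.vision_encoder = vision_encoder            # trainable CLIP vision tower (assumed to return (batch, vision_dim))
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(llm_dim, embed_dim)  # small adapter over frozen LLM caption features
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~log(1/0.07), as in CLIP

    def forward(self, images: torch.Tensor, llm_caption_feats: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        txt = F.normalize(self.text_proj(llm_caption_feats), dim=-1)
        logits = self.logit_scale.exp() * img @ txt.t()
        targets = torch.arange(images.size(0), device=images.device)
        # symmetric contrastive loss: image-to-text and text-to-image
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```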
LLM2CLIP addresses this by introducing a "caption contrastive fine-tuning" stage, which greatly improves the LLM's ability to separate image captions in its output space and thereby yields significant performance gains.
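One plausible way to picture this fine-tuning stage is a SimCSE-style objective in which two captions describing the same image are pulled together while captions of other images in the batch are pushed apart. The sketch below assumes paired caption embeddings are already available; how the LLM produces sentence-level embeddings is not shown, and the function name is hypothetical.

```python
# Hedged sketch of a caption contrastive loss: paired captions of the same
# image are positives, all other captions in the batch serve as negatives.
import torch
import torch.nn.functional as F

def caption_contrastive_loss(anchor_emb: torch.Tensor,
                             positive_emb: torch.Tensor,
                             temperature: float = 0.05) -> torch.Tensor:
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired captions."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.t() / temperature                     # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, targets)
```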
The researchers ran fine-tuning experiments at three data scales: small (CC-3M), medium (CC-3M and CC-12M), and large (CC-3M, CC-12M, YFCC-15M, and Recaption-1B). The results show that models trained with LLM2CLIP outperform traditional CLIP and EVA models on image-to-text and text-to-image retrieval tasks.
When combined with models such as LLaVA 1.5 for multi-modal training, LLM2CLIP performed well on almost all benchmarks, improving the performance of the previous EVA02 model by 16.5% on long- and short-text retrieval tasks. This innovative approach not only turns CLIP from an English-only model into a powerful cross-lingual model, but also lays the groundwork for future research on CLIP training.
Model: https://huggingface.co/collections/microsoft/llm2clip-672323a266173cfa40b32d4c
Code: https://github.com/microsoft/LLM2CLIP/
Paper: https://arxiv.org/abs/2411.04997
The emergence of LLM2CLIP points the development of multi-modal models in a new direction, and its breakthroughs in long-text processing and cross-lingual tasks deserve attention. For more information, please visit the links above. We look forward to seeing more applications built on LLM2CLIP in the future!