Meta AI researchers and academic partners have developed an innovative system, MILS (Multimodal Iterative LLM Solver), which teaches large language models to process images, video, and audio without specialized training. Rather than relying on large amounts of training data, MILS leverages the natural problem-solving ability of language models, which is its distinctive advantage.
MILS works by pairing two AI models: a "generator" that proposes candidate solutions to a task, and a "scorer" that evaluates how effective each candidate is. Feedback from the scorer guides the generator to refine its answer iteratively until the result is satisfactory. In an image captioning task, for example, MILS can progressively refine a caption so that it accurately describes image details at multiple levels.
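The generate-score loop described above can be sketched in a few lines. The functions below are toy stand-ins (an assumption for illustration), not the actual MILS implementation, which uses an LLM as the generator and a model such as CLIP as the scorer:

```python
# Minimal sketch of a MILS-style generate-score loop.
# generate_candidates and score are toy placeholders for the real
# LLM generator and CLIP-style scorer used in the paper.

def generate_candidates(best_so_far, n=4):
    # Toy "generator": propose n variations of the current best caption.
    return [f"{best_so_far} detail-{i}" for i in range(n)]

def score(candidate, target="detail-2"):
    # Toy "scorer": reward candidates containing the target token.
    return candidate.count(target)

def mils_loop(seed="a photo", steps=3):
    best, best_score = seed, score(seed)
    for _ in range(steps):
        # Score each candidate and keep the best, then iterate again.
        for cand in generate_candidates(best):
            s = score(cand)
            if s > best_score:
                best, best_score = cand, s
    return best, best_score
```

Because only the scorer's feedback steers the search, neither model's weights are updated at any point, which is what lets MILS work without task-specific training.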
MILS performs particularly well at image captioning. Using the Llama-3.1-8B model as the generator and the CLIP model as the scorer, MILS produces captions comparable to those of current leading methods, even though CLIP was never trained specifically for captioning. MILS also improves text-to-image generation by refining text prompts, and it can combine AI-generated prompts with image-processing tools to handle editing tasks such as style transfer.
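CLIP can act as a scorer because it embeds images and captions into a shared space and ranks captions by similarity to the image. The sketch below illustrates that ranking step with hand-made toy embeddings (an assumption; in practice CLIP's image and text encoders would produce them):

```python
import math

def cosine(u, v):
    # Cosine similarity, the measure CLIP-style scoring is based on.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy vectors standing in for CLIP image/text embeddings.
image_emb = [0.9, 0.1, 0.0]
captions = {
    "a dog on grass": [0.85, 0.15, 0.1],
    "a city skyline": [0.1, 0.2, 0.9],
}

def best_caption(image_emb, captions):
    # Pick the caption whose embedding is most similar to the image's.
    return max(captions, key=lambda c: cosine(image_emb, captions[c]))
```

In MILS, this similarity score is the feedback signal returned to the generator at each iteration.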
Caption accuracy increases with the number of iterations between the generator and the scorer. | Photo: Ashutosh et al.
MILS's capabilities are not limited to images; they also extend to video and audio. When tested on the MSR-VTT video dataset, MILS outperforms existing models at describing video content. Because MILS does not modify model parameters during operation, it can convert different types of data into readable text, supporting the merging and conversion of information from sources such as images and audio into a desired format, opening up new possibilities for applications that fuse multimodal information.
Tests show that using larger generator and scorer models produces more accurate results, and that increasing the number of candidate solutions per step also significantly improves performance. The researchers found that scaling to a larger language model improves the quality of the results as well.
Landscape images evolve from simple basic descriptions to complex representations with more precise details and more natural elements. | Photo: Ashutosh et al.
The strategy adopted by MILS is in line with the broader trend in artificial intelligence toward stronger reasoning capabilities. The Meta team also says that MILS may show great potential in areas such as 3D data processing, further advancing multimodal AI.
With the rapid development of OpenAI's GPT-4 and open-source alternatives such as Meta's Llama 3.2, Mistral's Pixtral, and DeepSeek's Janus Pro, these emerging multimodal AI systems are accelerating into everyday applications and laying an important foundation for the future development of artificial intelligence.