Alibaba launches multi-modal large model mPLUG-Owl3, which can watch a 2-hour movie in 4 seconds

Author: Eve Cole  Update Time: 2024-12-22 11:32:01

Alibaba’s latest general-purpose multimodal large model mPLUG-Owl3 has set off a storm in the field of artificial intelligence with its powerful multimodal understanding capabilities and amazing reasoning efficiency. It can understand 2 hours of video content in 4 seconds and accurately answer various questions raised by users, demonstrating excellent performance in image, video and text understanding. This technological breakthrough is not only a milestone in academia, but also heralds a future change in the way AI interacts with humans.

In this era of information explosion, we use pictures and videos to record our lives and share our happiness every day. But have you ever thought about what would happen if there was a technology that allowed machines to not only understand these pictures and videos like humans, but also communicate with us in depth?

mPLUG-Owl3, the latest general-purpose multi-modal large model released by the Alibaba team, is efficient enough to take in a 2-hour movie in 4 seconds. It is more than just a model: it behaves like an AI assistant that can see, listen, speak, and think.

The name mPLUG-Owl3 suggests an owl wearing glasses, smart and alert. Its core capability is understanding long image sequences. Whether the input is a series of photos or a video, it can follow the content and even the storyline.

To let mPLUG-Owl3 process so much information, the researchers equipped it with a "super brain": the Hyper Attention module. This module processes visual and language information at the same time, allowing the model to understand both images and the text related to them.

The mPLUG-Owl3 model has made a major breakthrough in multi-modal understanding, combined with excellent inference efficiency. It not only reaches state-of-the-art (SOTA) results on benchmarks covering single-image, multi-image, and video scenarios, but also cuts first-token latency by a factor of 6 and increases the number of images a single A100 GPU can process by a factor of 8, to 400 images.
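The efficiency figures above imply some simple arithmetic, sketched below. Only the 6x, 8x, and 400 values come from the article; the function names and the 3.0-second baseline latency are hypothetical and chosen purely for illustration, not taken from the paper.

```python
# Back-of-the-envelope arithmetic for the reported efficiency gains.
# Only the 6x, 8x, and 400 figures come from the article; everything
# else here is an illustrative assumption.

LATENCY_SPEEDUP = 6   # reported reduction factor for first-token latency
CAPACITY_GAIN = 8     # reported increase factor for images per A100
IMAGES_NOW = 400      # reported images handled on a single A100


def implied_baseline_images(images_now: int, gain: int) -> float:
    """Image capacity of the comparison baseline implied by the gain."""
    return images_now / gain


def reduced_latency(baseline_seconds: float, speedup: int) -> float:
    """Latency after an N-fold reduction of a baseline latency."""
    return baseline_seconds / speedup


if __name__ == "__main__":
    # An 8x gain that ends at 400 images implies a baseline of 50 images.
    print(implied_baseline_images(IMAGES_NOW, CAPACITY_GAIN))  # 50.0
    # 3.0 s is a hypothetical baseline latency, not a measured value.
    print(reduced_latency(3.0, LATENCY_SPEEDUP))  # 0.5
```

In other words, if the 8x claim is read literally, the comparison baseline would fit roughly 50 images on the same card.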

mPLUG-Owl3 can accurately understand the incoming multi-modal knowledge and use it to answer questions. It can even tell you which piece of knowledge it bases its judgment on, as well as the detailed basis for its judgment.

mPLUG-Owl3 can correctly understand the content relationships in different materials and conduct in-depth reasoning. Whether it's stylistic differences or character recognition, it handles it all with ease.

mPLUG-Owl3 is able to watch and understand videos up to 2 hours long and can start answering user questions within 4 seconds, no matter which part of the video the question involves.
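The article does not say how frames are chosen from a long video. One common approach is uniform sampling, sketched below under two assumptions: that frames are taken from the center of equal-length segments, and that the 400-image single-A100 figure reported in this article is used as the frame budget. The function name `uniform_frame_times` is hypothetical and not part of the mPLUG-Owl3 code.

```python
from typing import List


def uniform_frame_times(duration_s: float, num_frames: int) -> List[float]:
    """Return timestamps (in seconds) at the center of equal segments.

    Assumption: uniform sampling. The article does not describe the
    actual frame-selection strategy used by mPLUG-Owl3.
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    step = duration_s / num_frames
    return [step * (i + 0.5) for i in range(num_frames)]


if __name__ == "__main__":
    two_hours = 2 * 60 * 60  # 7200 seconds
    times = uniform_frame_times(two_hours, 400)
    print(len(times))            # 400
    print(times[1] - times[0])   # 18.0 (one frame every 18 seconds)
```

Under these assumptions, a 2-hour video is covered by one frame roughly every 18 seconds, which is why a fixed image budget can span a full-length film.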

mPLUG-Owl3 uses a lightweight Hyper Attention module to extend the Transformer block into a new module that handles both image-text feature interaction and text modeling. This design keeps the number of newly introduced parameters small, which makes the model easier to train and improves both training and inference efficiency.
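The idea of one block doing both text modeling and image-text interaction can be illustrated with a small sketch. This is not the paper's Hyper Attention implementation: it assumes a single attention head, omits all learned projections, normalization, and causal masking, and blends the two paths with a fixed scalar `gate`. The names `attend` and `gated_fusion_block` and the blending rule are assumptions of this sketch.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention with no learned projections."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    output = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        output.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return output


def gated_fusion_block(text: Matrix, image: Matrix, gate: float) -> Matrix:
    """Blend text self-attention with text-to-image cross-attention.

    `gate` is a hypothetical scalar in [0, 1]: 0 ignores the image
    features entirely, 1 uses only the cross-attention output.
    """
    self_out = attend(text, text, text)     # text modeling path
    cross_out = attend(text, image, image)  # image-text interaction path
    return [
        [(1.0 - gate) * s + gate * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(self_out, cross_out)
    ]


if __name__ == "__main__":
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]  # two toy text embeddings
    image_tokens = [[2.0, 2.0]]             # one toy image embedding
    # With gate=1.0 every text token simply reads the single image token.
    print(gated_fusion_block(text_tokens, image_tokens, 1.0))
    # [[2.0, 2.0], [2.0, 2.0]]
```

Because the text tokens serve as queries in both paths, the image features are read by an existing block rather than appended to the text sequence, which is one plausible reason such a design stays cheap as the number of images grows.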

In experiments across a wide range of datasets, mPLUG-Owl3 achieves SOTA results on most single-image multi-modal benchmarks. In multi-image evaluations, it surpasses models specifically optimized for multi-image scenarios. On LongVideoBench, it also surpasses existing models, demonstrating its strength in long-video understanding.

The release of Alibaba mPLUG-Owl3 is not only a technological leap, but also provides new possibilities for the application of multi-modal large models. As the technology continues to improve, we look forward to mPLUG-Owl3 bringing more surprises in the future.

Paper address: https://arxiv.org/pdf/2408.04840

Code: https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl3

Online experience: https://huggingface.co/spaces/mPLUG/mPLUG-Owl3

The emergence of mPLUG-Owl3 marks a new stage in the development of multi-modal large model technology. Its efficient processing and accurate understanding open up broad prospects for future AI applications. I believe that as the technology continues to mature, mPLUG-Owl3 will bring more convenience and surprises to people's lives, and I look forward to more innovative applications built on it.