A research team from the National University of Singapore has developed an advanced audio-visual large language model (av-LLM) called video-SALMONN, which can understand the visual, audio, and speech content in videos. The model connects pre-trained audio and video encoders to a large language model through an innovative multi-resolution causal Q-Former structure, achieving a comprehensive understanding of video content. The technology has achieved remarkable results on tasks such as video question answering, opening a new path for applying artificial intelligence to video understanding and reasoning, and is expected to see wide application in fields such as education and healthcare.
Recently, Wenyi Yu and his team at the National University of Singapore proposed a new technology called video-SALMONN, which can understand not only the visual frame sequences, audio events, and music in videos, but also their speech content. The introduction of this technology marks an important step toward letting machines truly understand video content.
video-SALMONN is an end-to-end audio-visual large language model (av-LLM) that connects pre-trained audio and video encoders to the backbone of a large language model through a novel multi-resolution causal Q-Former (MRC Q-Former) structure. This structure not only captures the fine-grained temporal information required for speech understanding, but also keeps the processing of other video elements efficient.
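To make the connector idea concrete, below is a minimal PyTorch sketch of a Q-Former-style module: a small set of learnable query vectors cross-attends to frame-level audio-visual features and emits a compact token sequence in the LLM's embedding space. The class name, dimensions, and layer count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class QFormerConnector(nn.Module):
    """Illustrative Q-Former-style connector (hypothetical, not the paper's
    exact module): learnable queries cross-attend to encoder features and
    are projected into the LLM's embedding space."""

    def __init__(self, feat_dim=1024, hidden_dim=768, llm_dim=4096, num_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, hidden_dim) * 0.02)
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)   # encoder features -> Q-Former width
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)
        self.out_proj = nn.Linear(hidden_dim, llm_dim)     # Q-Former width -> LLM token space

    def forward(self, av_feats):
        # av_feats: (batch, time, feat_dim) synchronized audio-visual features
        kv = self.feat_proj(av_feats)
        q = self.queries.unsqueeze(0).expand(av_feats.size(0), -1, -1)
        fused, _ = self.cross_attn(q, kv, kv)              # queries attend over all frames
        return self.out_proj(fused)                        # (batch, num_queries, llm_dim)
```

For example, feeding a batch of two clips with 100 synchronized feature frames, `QFormerConnector()(torch.randn(2, 100, 1024))`, yields a (2, 32, 4096) tensor that can be prepended to the LLM's text embeddings.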
To improve the model's balanced handling of different video elements, the research team proposed specialized training methods, including a diversity loss and a mixed training strategy with unpaired audio and video data, which prevent the video frames or any single modality from dominating.
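One common way to implement a diversity loss of this kind is to penalize pairwise cosine similarity among the Q-Former's output query tokens, pushing different queries to summarize different parts of the input. The sketch below is a generic formulation under that assumption, not necessarily the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def diversity_loss(query_outputs):
    """Mean absolute off-diagonal cosine similarity among output query
    tokens. query_outputs: (batch, num_queries, dim). Hypothetical
    formulation for illustration; the paper's loss may differ."""
    q = F.normalize(query_outputs, dim=-1)
    sim = q @ q.transpose(1, 2)                            # (batch, Q, Q) cosine similarities
    num_q = sim.size(-1)
    off_diag = sim - torch.eye(num_q, device=sim.device)   # zero out self-similarity
    return off_diag.abs().sum(dim=(1, 2)).mean() / (num_q * (num_q - 1))
```

Added to the main training objective with a small weight, such a term discourages the queries from collapsing onto the dominant modality.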
On the newly introduced Speech-Audio-Visual Evaluation benchmark (SAVE), video-SALMONN achieved absolute accuracy improvements of more than 25% on the video question answering (video-QA) task and more than 30% on audio-visual QA tasks involving human speech. In addition, video-SALMONN demonstrates excellent video understanding and reasoning capabilities on tasks that are unprecedented for other av-LLMs.
The core of video-SALMONN is the multi-resolution causal (MRC) Q-Former structure, which aligns synchronized audio-visual input features with the text representation space at three different time scales, meeting the varying dependence of different tasks on different video elements. In addition, to strengthen the temporal causal relationships between consecutive video frames, the MRC Q-Former includes a causal self-attention structure with a special causal mask.
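As a rough illustration of these two ideas, the sketch below builds a boolean causal attention mask (each position may attend only to itself and earlier positions) and splits a synchronized feature sequence into windows at several time scales for a Q-Former to summarize. The helper names and window sizes are assumptions for illustration, not the paper's settings.

```python
import torch

def causal_mask(num_tokens):
    """Boolean mask in nn.MultiheadAttention's attn_mask convention
    (True = blocked): position i cannot attend to any position j > i."""
    return torch.triu(torch.ones(num_tokens, num_tokens, dtype=torch.bool), diagonal=1)

def multi_resolution_windows(av_feats, window_sizes=(1, 5, 25)):
    """Split a (batch, time, dim) feature sequence into fixed-size windows
    at several time scales; a Q-Former would then summarize each window.
    Window sizes are illustrative, not the paper's values."""
    outputs = {}
    for w in window_sizes:
        t = av_feats.size(1) // w * w                      # drop the ragged tail for simplicity
        outputs[w] = av_feats[:, :t].reshape(av_feats.size(0), -1, w, av_feats.size(-1))
    return outputs                                          # {scale: (batch, num_windows, w, dim)}
```

Running the Q-Former per window at each scale, with the causal mask applied to its self-attention, would yield fine-grained tokens for speech alongside coarser tokens for visual and audio events.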
video-SALMONN not only gives the research community a new tool, but also opens up broad possibilities for practical applications. It makes interaction between technology and humans more natural and intuitive, lowering the barrier for users, especially children and the elderly, to learn to use technology. It also has the potential to improve the accessibility of technology, including for people with movement disabilities.
video-SALMONN is also an important step toward realizing general artificial intelligence (AGI). By integrating speech input alongside non-speech audio and visual input, such models gain a comprehensive understanding of human interactions and environments, allowing them to be applied in a wider range of domains.
The development of this technology will have a profound impact on video content analysis, educational applications, and people's quality of life. As technology continues to advance, we have reason to believe that future AI will be more intelligent and closer to human needs.
Paper address: https://arxiv.org/html/2406.15704v1
The breakthrough progress of video-SALMONN marks a new milestone for artificial intelligence in the field of video understanding, and its broad application prospects are promising. The continued development of similar technologies will further deepen the integration of artificial intelligence into human society.