Fei-Fei Li's team recently released a breakthrough research result: a new multimodal model that can understand and generate human motion and, by coupling it with a language model, unifies the processing of verbal and non-verbal language. This innovation enables machines not only to understand human instructions but also to interpret the emotions behind movements, leading to more natural and fluid human-computer interaction. At the core of the model is a multimodal language-model framework that integrates audio, motion, and text inputs and outputs data in the required modality. It performs strongly on tasks such as co-speech gesture generation, significantly reduces the amount of data required for training, and opens up new applications such as editable gesture generation and predicting emotion from motion.
Fei-Fei Li's team has launched a new multimodal model that understands and generates human motion and, by building on language models, achieves unified processing of verbal and non-verbal language. This breakthrough enables machines not only to understand human instructions but also to read the emotions conveyed by movements, allowing for more natural human-computer interaction.
The core of the model is its multimodal language-model framework, which accepts multiple forms of input, such as audio, motion, and text, and outputs data in the required modality. Combined with a generative pre-training strategy, the model performs strongly across a range of tasks. In co-speech gesture generation, for example, it not only surpasses the previous state of the art but also substantially reduces the amount of data needed for training. In addition, the model unlocks new application scenarios, such as editable gesture generation and predicting emotion from motion.
Human communication is inherently multimodal, combining verbal and non-verbal cues such as speech, facial expressions, and body posture. The ability to understand these multimodal behaviors is critical for creating virtual characters that communicate naturally in applications such as games, film, and virtual reality. Existing motion generation models, however, are typically limited to specific input modalities (speech, text, or motion data) and fail to exploit the full diversity of available data.
This model utilizes language models to unify verbal and non-verbal language for three main reasons:
Language models naturally connect different modalities.
Speech is highly semantic, and tasks such as modeling responses to jokes require strong semantic reasoning capabilities.
The language model acquires strong semantic understanding capabilities through extensive pre-training.
To achieve this, the research team first divided the body into parts (face, hands, upper body, lower body) and tokenized the motion of each part separately. Combined with existing text and speech tokenizers, input in any modality can then be represented as a sequence of tokens that a language model can consume. The model is trained in two stages: a pre-training stage that aligns the various modalities with compositional body motion and aligns audio with text, followed by a stage in which downstream tasks are converted into instructions and the model is trained on them so that it can follow a variety of task instructions.
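To make the tokenization idea concrete, here is a minimal Python sketch of how per-body-part motion codes, audio codes, and text ids might be mapped into one shared token space for a language model. The class names, vocabulary sizes, and id offsets are illustrative assumptions, not the paper's actual interface.

```python
# Minimal sketch: flattening multimodal inputs into one token stream.
# All names and vocabulary sizes/offsets below are hypothetical.
from dataclasses import dataclass
from typing import List

@dataclass
class ModalitySpec:
    name: str
    vocab_size: int   # number of discrete codes for this modality
    offset: int = 0   # assigned position in the unified vocabulary

def build_unified_vocab(specs: List[ModalitySpec]) -> List[ModalitySpec]:
    """Assign each modality a disjoint id range in one shared vocabulary."""
    offset = 0
    for spec in specs:
        spec.offset = offset
        offset += spec.vocab_size
    return specs

def to_unified_ids(local_ids: List[int], spec: ModalitySpec) -> List[int]:
    """Shift modality-local token ids into the shared id space."""
    return [spec.offset + i for i in local_ids]

specs = build_unified_vocab([
    ModalitySpec("face", 256), ModalitySpec("hands", 256),
    ModalitySpec("upper_body", 256), ModalitySpec("lower_body", 256),
    ModalitySpec("audio", 1024), ModalitySpec("text", 32000),
])
by_name = {s.name: s for s in specs}

# Example: per-part motion codes plus audio codes, interleaved into a
# single sequence the language model can consume.
face_codes, hand_codes = [3, 17, 42], [5, 9, 11]
audio_codes = [101, 202, 303]
sequence = (to_unified_ids(audio_codes, by_name["audio"])
            + to_unified_ids(face_codes, by_name["face"])
            + to_unified_ids(hand_codes, by_name["hands"]))
print(sequence)
```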
The model performs strongly on the BEATv2 co-speech gesture generation benchmark, far exceeding existing models. The benefit of the pre-training strategy has also been verified: it shows strong generalization, especially when data is scarce. After post-training on speech-to-motion and text-to-motion tasks, the model can not only follow audio and text prompts but also perform new functions such as predicting emotion from motion data.
In terms of technical details, the model employs modality-specific tokenizers to handle the various input modalities. Specifically, it trains a compositional body-motion VQ-VAE that converts facial, hand, upper-body, and lower-body movements into discrete tokens. These modality-specific vocabularies (motion, audio, and text) are then merged into a single unified multimodal vocabulary. During training, mixed tokens from different modalities serve as input, and the output is generated by an encoder-decoder language model.
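The sketch below illustrates the body-part quantization step under simplifying assumptions: a per-part codebook lookup that turns continuous motion features into discrete token ids. It is a bare-bones VQ layer in PyTorch, not the authors' architecture; shapes, codebook sizes, and the omission of the straight-through estimator are all simplifications.

```python
# Minimal sketch of a per-body-part vector quantizer (VQ-VAE codebook lookup).
import torch
import torch.nn as nn

class PartQuantizer(nn.Module):
    """Maps continuous per-frame features of one body part to discrete codes."""
    def __init__(self, num_codes: int = 256, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, features: torch.Tensor):
        # features: (frames, dim) continuous encoder output for one body part
        dists = torch.cdist(features, self.codebook.weight)  # (frames, num_codes)
        codes = dists.argmin(dim=-1)                          # discrete token ids
        quantized = self.codebook(codes)                      # nearest codebook vectors
        return codes, quantized

# One quantizer per body part, mirroring the face/hands/upper/lower split.
parts = ["face", "hands", "upper_body", "lower_body"]
quantizers = {p: PartQuantizer() for p in parts}

frames = 120  # e.g. 4 seconds of motion at 30 fps (illustrative)
motion_features = {p: torch.randn(frames, 64) for p in parts}
motion_tokens = {p: quantizers[p](motion_features[p])[0] for p in parts}
print({p: t.shape for p, t in motion_tokens.items()})
```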
The model also uses the unified multimodal vocabulary to convert data from different modalities into a common format for processing. In the pre-training stage, it learns correspondences between modalities by performing translation tasks between them; for example, it can learn to translate upper-body movements into lower-body movements, or audio into text. It also learns the temporal evolution of motion by randomly masking certain motion frames and predicting them.
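As a rough illustration of these pre-training objectives, the following sketch assembles two kinds of training examples: a modality-to-modality translation pair and a masked-frame reconstruction pair. The helper names, the MASK_ID placeholder, and the toy token sequences are assumptions for illustration, not the paper's data format.

```python
# Minimal sketch of assembling pre-training examples: modality translation
# pairs plus a masked-frame objective. Names and token values are hypothetical.
import random
from typing import Dict, List, Tuple

MASK_ID = -1  # placeholder id for masked motion frames

def make_translation_pair(tokens: Dict[str, List[int]],
                          src: str, tgt: str) -> Tuple[List[int], List[int]]:
    """Source modality tokens in, target modality tokens out (e.g. upper -> lower body)."""
    return tokens[src], tokens[tgt]

def mask_motion_frames(motion_tokens: List[int],
                       mask_ratio: float = 0.3) -> Tuple[List[int], List[int]]:
    """Randomly mask a fraction of motion tokens; the model must reconstruct them."""
    masked, targets = list(motion_tokens), list(motion_tokens)
    for i in range(len(masked)):
        if random.random() < mask_ratio:
            masked[i] = MASK_ID
        else:
            targets[i] = MASK_ID  # only score the masked positions
    return masked, targets

# Toy token sequences standing in for tokenizer outputs.
tokens = {
    "upper_body": [12, 7, 99, 3, 55],
    "lower_body": [4, 4, 81, 30, 2],
    "audio":      [500, 501, 502, 503, 504],
    "text":       [1001, 1002, 1003],
}

src_seq, tgt_seq = make_translation_pair(tokens, "upper_body", "lower_body")
masked_in, masked_tgt = mask_motion_frames(tokens["upper_body"])
print(src_seq, "->", tgt_seq)
print(masked_in, "predict:", masked_tgt)
```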
In the post-training phase, the model is fine-tuned on paired data for downstream tasks such as co-speech gesture generation and text-to-motion generation. To enable the model to follow natural human instructions, the researchers built a multi-task instruction-following template that casts tasks such as audio-to-motion, text-to-motion, and emotion-to-motion as instructions. The model can also edit gestures, generating coordinated full-body movements from text and audio prompts.
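To show what such instruction templating might look like, here is a small sketch that wraps task-specific inputs in natural-language prompts. The prompt wording and task names are hypothetical; the paper's actual templates may differ.

```python
# Minimal sketch of an instruction-following template for the downstream tasks.
from typing import Dict

TASK_PROMPTS: Dict[str, str] = {
    "audio_to_motion":   "Generate full-body gestures that match this speech audio.",
    "text_to_motion":    "Generate a motion sequence described by the following text.",
    "emotion_to_motion": "Generate gestures expressing the following emotion.",
    "motion_to_emotion": "Predict the emotion conveyed by the following motion.",
}

def build_instruction(task: str, payload: str) -> str:
    """Wrap task-specific input in a natural-language instruction."""
    return f"Instruction: {TASK_PROMPTS[task]}\nInput: {payload}\nOutput:"

# Example: an emotion-conditioned gesture request and an emotion-prediction query.
print(build_instruction("emotion_to_motion", "joyful"))
print(build_instruction("motion_to_emotion", "<motion tokens 812 44 907 ...>"))
```

The last example mirrors how emotion prediction from motion, described next, could be phrased as just another instruction over the same token vocabulary.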
Finally, the model unlocks a new capability: predicting emotion from motion, which has important implications for fields such as mental health and psychiatry. It predicts the emotions expressed in movements more accurately than other models, demonstrating strong body-language understanding.
The research shows that unifying verbal and non-verbal language of human actions is critical for practical applications, and language models provide a powerful framework for this.
Paper address: https://arxiv.org/pdf/2412.10523v1
All in all, this research marks significant progress in multimodal artificial intelligence. Its potential in human-computer interaction, virtual character creation, and emotion recognition is substantial and deserves further attention and study. In the future, the model is expected to find use in more fields and to drive the development of artificial intelligence technology.