Alibaba Damo Academy and the ModelScope community have jointly open sourced P-MMEval, a new multilingual benchmark designed to evaluate the multilingual capabilities of large language models (LLMs) more comprehensively and to compare their cross-language transfer abilities. The benchmark integrates efficient datasets covering both fundamental and capability-specialized tasks, ensures consistent multilingual coverage, and provides parallel samples in up to 10 languages spanning 8 language families. P-MMEval was created to address shortcomings in current LLM evaluation, such as the lack of accurate, parallel multilingual evaluation results and the inconsistent language coverage of existing benchmarks.
P-MMEval selects suitable benchmark test sets using a significance-test-based method, integrates fundamental natural language processing tasks with capability-specialized evaluation tasks, ensures consistent language coverage across tasks, and provides parallel samples across languages so that results can be compared consistently. For task diversity, P-MMEval covers two key fundamental NLP tasks (generation and understanding) as well as five core capabilities of current LLMs. For linguistic diversity, P-MMEval unifies ten languages spanning eight language families.
The P-MMEval dataset has been integrated into both the OpenCompass (Sinan) evaluation system and the EvalScope evaluation framework, and evaluation tasks can be run with either one. OpenCompass is an open-source, efficient, and comprehensive platform for large-model evaluation that supports one-stop evaluation of large language models, multimodal models, and other model types, and regularly publishes evaluation leaderboards. P-MMEval is among the first datasets connected to the OpenCompass system in this way, so evaluation tasks can be completed with the open-source OpenCompass toolkit.
The researchers evaluated several representative instruction-tuned models, including the closed-source GPT-4o and Claude-3.5 and the open-source LLaMA3.1, LLaMA3.2, Qwen2.5, and others. Experimental results show that, except for the LLaMA3.2 series, the multilingual capabilities of all models improve as model size increases. Qwen2.5 shows strong multilingual performance on understanding and capability-specialized tasks, while Gemma2 performs well on generation tasks. Closed-source models generally outperform open-source models.
The launch of P-MMEval provides new tools and methods for multilingual ability assessment of large models, helping to promote the development and application of multilingual NLP technology.
Dataset link:
https://www.modelscope.cn/datasets/modelscope/P-MMEval
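For readers who want to inspect the data directly rather than run a full framework evaluation, the dataset can also be loaded with the ModelScope SDK. The following is a minimal sketch under stated assumptions: the subset name ("mgsm") and the split name ("test") are illustrative guesses and should be checked against the subsets actually published on the dataset page.

# Minimal sketch: loading one P-MMEval subset with the ModelScope SDK.
# The subset_name and split values are assumptions for illustration only;
# consult the dataset page above for the subsets that actually exist.
from modelscope.msdatasets import MsDataset

# Dataset ID taken from the link above.
dataset = MsDataset.load(
    'modelscope/P-MMEval',
    subset_name='mgsm',   # assumed subset name
    split='test',         # assumed split name
)

# Print a few samples to inspect the parallel multilingual data.
for i, sample in enumerate(dataset):
    print(sample)
    if i >= 2:
        break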
The open sourcing of P-MMEval provides a more comprehensive and standardized benchmark for evaluating the multilingual capabilities of large language models. With its broad language coverage and diverse task types, it offers a valuable resource for researchers and developers and promotes progress in multilingual NLP. We look forward to P-MMEval being continuously improved to better serve the evaluation and improvement of multilingual LLMs.