The latest Multi-IF benchmark released by Meta aims to evaluate the instruction-following ability of large language models (LLMs) in multi-turn conversations and multilingual environments more comprehensively. Unlike previous evaluation standards, which focused on single-turn, single-language tasks, Multi-IF covers eight languages and contains 4,501 three-turn dialogue tasks. It strives to simulate the complexity of real application scenarios and provide a more rigorous and challenging evaluation standard for LLMs. The benchmark results reveal many shortcomings of current LLMs in multi-turn and multilingual tasks, offering a valuable reference for future model improvements.
Meta recently released a new benchmark called Multi-IF, designed to evaluate the instruction-following ability of large language models (LLMs) in multi-turn conversations and multilingual environments. The benchmark covers eight languages and contains 4,501 three-turn dialogue tasks, focusing on how current models perform in complex multi-turn, multilingual scenarios.
Most existing evaluation standards focus on single-turn dialogue and single-language tasks, and therefore struggle to reflect how a model performs in practical applications. Multi-IF was launched to fill this gap. The research team built complex dialogue scenarios by extending single-turn instructions into multi-turn sequences, ensuring that the instructions in each turn are logically coherent and build on the previous ones. The dataset also adds multilingual support through automatic translation followed by manual proofreading.
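The exact data format is defined in the paper and its release; purely as an illustration of the three-turn structure described above, and assuming that each turn adds instructions on top of those from earlier turns, a task entry might look like the following sketch (all field names are hypothetical):

```python
# Hypothetical sketch of a three-turn Multi-IF task entry; field names are
# illustrative only and do not reflect the official release format.
example_task = {
    "language": "en",
    "turns": [
        {
            "prompt": "Write a short product description for a coffee maker.",
            "instructions": ["respond in fewer than 100 words"],
        },
        {
            "prompt": "Now rewrite it as a bullet list.",
            # Later turns add constraints while earlier ones still apply.
            "instructions": ["respond in fewer than 100 words", "use bullet points"],
        },
        {
            "prompt": "Add a closing sentence that mentions the warranty.",
            "instructions": [
                "respond in fewer than 100 words",
                "use bullet points",
                "mention the warranty",
            ],
        },
    ],
}
```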
Experimental results show that the performance of most LLMs drops significantly as the dialogue progresses. Taking the o1-preview model as an example, its average accuracy is 87.7% in the first turn but falls to 70.7% by the third turn. In languages with non-Latin scripts, such as Hindi, Russian, and Chinese, model performance is generally lower than in English, revealing limitations in multilingual tasks.
In an evaluation of 14 frontier language models, o1-preview and Llama 3.1 405B performed best, with average accuracy of 78.9% and 78.1% across the three turns of instructions, respectively. Across multiple turns, however, every model showed a decline in its ability to follow instructions, reflecting the challenges these complex tasks pose. The research team also introduced an Instruction Forgetting Rate (IFR) to quantify how often a model forgets earlier instructions over the course of a dialogue; higher-performing models generally fared better on this metric as well.
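The paper gives the precise definition of IFR; as a minimal sketch of the underlying idea, assuming IFR at a given turn is the fraction of instructions satisfied at the previous turn that are no longer satisfied at the current turn, the computation might look like this (function and variable names are hypothetical):

```python
def instruction_forgetting_rate(followed_prev: set[str], followed_curr: set[str]) -> float:
    """Fraction of instructions followed at the previous turn that the model
    no longer follows at the current turn (hypothetical formulation)."""
    if not followed_prev:
        return 0.0
    forgotten = followed_prev - followed_curr
    return len(forgotten) / len(followed_prev)

# Example: 4 instructions followed in turn 2, only 3 of them still followed in turn 3.
ifr = instruction_forgetting_rate({"i1", "i2", "i3", "i4"}, {"i1", "i2", "i4"})
print(f"IFR = {ifr:.2f}")  # IFR = 0.25
```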
The release of Multi-IF gives researchers a challenging benchmark and promotes the development of LLMs for global, multilingual applications. It not only reveals the shortcomings of current models in multi-turn and multilingual tasks, but also points to clear directions for future improvement.
Paper: https://arxiv.org/html/2410.15553v2
Highlights:
The Multi-IF benchmark covers eight languages and contains 4,501 three-turn dialogue tasks, evaluating LLM performance in complex scenarios.
Experiments show that the accuracy of most LLMs drops significantly over multiple turns of dialogue, especially in languages with non-Latin scripts.
The o1-preview and Llama 3.1 405B models performed best, with average accuracy of 78.9% and 78.1% across three turns of instructions, respectively.
All in all, the Multi-IF benchmark offers a new direction for the evaluation of large language models. Its focus on multi-turn dialogue and multilingual support will help drive the application and development of LLM technology worldwide and foster models that are smarter and closer to real-world use.