The editor of Downcodes learned that the Beijing Academy of Artificial Intelligence (BAAI) has launched FlagEval Debate, the world's first Chinese large-model debate platform. With model debate at its core, the platform offers a new way to measure the capabilities of large language models, aiming to distinguish their differences more effectively. It cleverly uses debate, a language-based intellectual activity, to comprehensively examine a model's information understanding, logical reasoning, and language generation, and it ensures the scientific rigor and authority of the results through a combination of open public testing and expert review. The launch marks a new milestone in large model evaluation and offers a valuable reference for the industry.
The Beijing Academy of Artificial Intelligence (BAAI) recently launched FlagEval Debate, the world's first Chinese large-model debate platform. The platform aims to provide a new way to evaluate the capabilities of large language models through a competitive model-debate mechanism. It extends BAAI's model-battle evaluation service, the FlagEval large model arena, with the goal of identifying capability differences between large language models.
Existing large model battles have several problems: results are often tied, making it hard to tell models apart; test content relies on user voting, which requires a large number of participants; and existing battle formats lack interaction between the models themselves. To address these issues, BAAI adopted large model debate as its evaluation format.
As a language-based intellectual activity, debate reflects participants' logical thinking, language organization, and ability to analyze and process information. Model debates can reveal a large model's level of information understanding, knowledge integration, logical reasoning, language generation, and dialogue ability, while also testing the depth of its information processing and its adaptability in complex contexts.
BAAI found that interactive battles such as debates highlight the gaps between models and yield effective rankings from only a small number of samples. It therefore launched FlagEval Debate, a Chinese large-model debate platform based on open public testing.
The platform pits two models against each other on a debate topic randomly selected by the platform. The topic pool consists mainly of trending topics along with debate topics curated and commissioned from evaluation experts and top debaters. Every debate can be judged by all users on the platform, enhancing engagement.
Each debate consists of five rounds of argument, with each side speaking once per round. To avoid bias from side assignment, the two models debate each topic twice, each taking the pro side once and the con side once. Every large model competes in multiple debates against the other models, and the final ranking is calculated from the win points accumulated.
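The win-point ranking described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not the platform's actual scoring code: it assumes the simplest possible scheme of one point per debate won, with each pair of models meeting twice so that both take the pro and con sides.

```python
# Hypothetical win-point ranking sketch (NOT FlagEval's actual code).
# Assumption: 1 point per debate win, 0 for a loss; ranking is by total points.
from collections import defaultdict

def rank_models(results):
    """results: list of (pro_model, con_model, winner) tuples,
    where winner is the name of the model that won the debate."""
    points = defaultdict(int)
    for pro, con, winner in results:
        points[pro] += 0   # make sure every participant appears,
        points[con] += 0   # even with zero wins
        points[winner] += 1
    # Sort by points, descending.
    return sorted(points.items(), key=lambda kv: -kv[1])

# Each pair debates the same topic twice, swapping pro/con sides:
results = [
    ("ModelA", "ModelB", "ModelA"),
    ("ModelB", "ModelA", "ModelA"),
    ("ModelA", "ModelC", "ModelC"),
    ("ModelC", "ModelA", "ModelA"),
]
print(rank_models(results))
# → [('ModelA', 3), ('ModelC', 1), ('ModelB', 0)]
```

The side swap matters because a fixed pro/con assignment could systematically favor one position; counting wins across both orderings removes that bias from the tally.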
Evaluation combines two methods: open public testing and expert review. The expert jury is composed of competitors and judges from professional debate tournaments, while public audiences are free to watch and vote.
BAAI stated that it will continue to explore the technical path and application value of model debate, adhere to the principles of science, authority, fairness, and openness, continuously improve the FlagEval large model evaluation system, and provide new insights for the large model evaluation ecosystem.
FlagEval Debate official website:
https://flageval.baai.org/#/debate
The launch of FlagEval Debate provides new ideas and methods for large model evaluation, and also contributes to the development of large model technology. The editor of Downcodes hopes that the platform will continue to improve in the future and bring more innovations and breakthroughs to the field of large models.