CompassArena large model evaluation platform upgraded with new Judge Copilot feature

Author: Eve Cole  Update Time: 2024-12-24 19:00:01

CompassArena, the large model evaluation platform launched jointly by the Sinan (OpenCompass) team at Shanghai Artificial Intelligence Laboratory and the ModelScope platform, has recently received a major upgrade. The upgrade aims to improve the user experience and make model evaluation more rigorous and comprehensive. Drawing on a large volume of collected user data and continued optimization, CompassArena has added the Judge Copilot feature, improved its ranking algorithm, and added more than 20 new models, spanning commercial models from China and abroad as well as open source models, giving users more choice and more accurate model rankings.

Judge Copilot uses the evaluation model Compass-Judger-1-32B-Instruct to help users compare and analyze the performance of dialogue models. It offers multi-dimensional evaluation, real-time comparison, and decision assistance, with the goal of making evaluation more efficient and accurate. The upgraded ranking algorithm extends the Bradley-Terry statistical model with control variables, which reduces the impact of confounding factors and makes the model rankings more reliable. The platform also collects user feedback to keep improving the Judge model's overall capability and alignment.


CompassArena pays close attention to how the Judge model performs in practice. Users can rate the Judge model's output by clicking the "Like" and "Dislike" buttons, and this feedback is used to further improve the model's overall capability and alignment. On the ranking side, fitting a Bradley-Terry statistical model that includes control variables lets CompassArena estimate the effect of many external factors, each of which can be expressed as an odds ratio.
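To make the idea of a Bradley-Terry model with control variables more concrete, here is a minimal Python sketch. It is not CompassArena's actual implementation, which the announcement does not describe. It assumes the common formulation P(A beats B) = sigmoid(beta_A - beta_B + gamma · x), where x is a hypothetical per-battle control variable (for example, a standardized difference in response length), and fits it as a logistic regression with Newton's method. The function names, the synthetic data, and the choice of control variable are all assumptions made for illustration.

```python
import numpy as np


def fit_bt_with_controls(model_a, model_b, a_wins, controls, n_models,
                         n_iter=50, ridge=1e-6):
    """Fit P(A beats B) = sigmoid(beta[a] - beta[b] + controls @ gamma).

    Model 0 is the reference (beta[0] = 0) so the strengths are identifiable.
    Returns (beta, gamma).
    """
    model_a = np.asarray(model_a)
    model_b = np.asarray(model_b)
    y = np.asarray(a_wins, dtype=float)
    n = len(y)
    z = np.asarray(controls, dtype=float).reshape(n, -1)

    # Design matrix: +1 in the column of model A, -1 in the column of model B.
    d = np.zeros((n, n_models))
    d[np.arange(n), model_a] += 1.0
    d[np.arange(n), model_b] -= 1.0
    x = np.hstack([d[:, 1:], z])  # drop the reference model's column

    w = np.zeros(x.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(x @ w)))
        grad = x.T @ (y - p) - ridge * w
        hess = (x * (p * (1.0 - p))[:, None]).T @ x + ridge * np.eye(x.shape[1])
        step = np.linalg.solve(hess, grad)  # Newton update
        w += step
        if np.max(np.abs(step)) < 1e-10:
            break

    beta = np.concatenate([[0.0], w[: n_models - 1]])
    gamma = w[n_models - 1:]
    return beta, gamma


def simulate_battles(true_beta, true_gamma, n, seed=0):
    """Generate synthetic pairwise battles with one control variable."""
    rng = np.random.default_rng(seed)
    k = len(true_beta)
    a = rng.integers(0, k, n)
    b = (a + rng.integers(1, k, n)) % k  # guarantees b != a
    ctrl = rng.normal(size=n)            # hypothetical control variable
    logits = true_beta[a] - true_beta[b] + true_gamma * ctrl
    wins = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))
    return a, b, wins, ctrl


# Synthetic example: three hypothetical models and one control variable.
true_beta = np.array([0.0, 0.5, 1.0])
a, b, wins, ctrl = simulate_battles(true_beta, true_gamma=0.5, n=20000)
beta, gamma = fit_bt_with_controls(a, b, wins, ctrl, n_models=3)

print("estimated strengths:", np.round(beta, 3))
print("odds ratio per unit of control variable:", np.round(np.exp(gamma), 3))
```

In this sketch, `exp(gamma)` is the odds ratio: the factor by which the odds of model A winning change when the control variable increases by one unit, holding the two models' strengths fixed. The estimated `beta` values then rank the models with that factor accounted for.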

With this upgrade, CompassArena adds domestic commercial models including 360gpt2-pro, deep-seek-v2.5-chat, and doubao-pro-32k-240828, foreign commercial models such as claude-3.5-sonnet-20241022 and gemini-exp-1121, and a series of open source models. The new models come from organizations including 360, DeepSeek, and Doubao, giving users more options for model battles.

Try it here: https://www.modelscope.cn/studios/opencompass/CompassArena

This upgrade makes CompassArena's model evaluation more rigorous and accurate and gives users a wider choice of models and a more convenient experience, marking a new stage for the platform. Readers are welcome to visit the address above, take part in model evaluation, and help advance large model technology.