Recently, a unique AI capability evaluation launched on the "Minecraft" platform has been attracting wide attention. The old and new versions of Claude 3.5 Sonnet went head-to-head at in-game architecture, revealing clear differences in ability, and the performance of the new version (tentatively dubbed "Sonnet 3.6") was particularly impressive.
This test, initiated by developer Adi, has been nicknamed "the only reliable evaluation benchmark." Evaluation researcher Aidan McLau argues that the approach fills a real gap in current AI evaluation, pointing out that aesthetic ability is closely tied to intellectual level. The project quickly won support from the open source community, and the code has been published on GitHub.
The test results show that each major model displays a unique "personality":
Sonnet 3.6 has a slight edge in creativity and won the votes of more than 2,000 netizens
OpenAI's o1-preview builds slowly but does well at reproducing real buildings (such as the Taj Mahal)
o1-mini fails to complete the tasks
Llama 3 405B builds a "diamond wall over a fire pit" that symbolizes itself
Alibaba's Qwen2.5-14B also puts in a strong showing
Notably, the AI does not build through visual understanding or by directly controlling input devices: it receives context and emits actions purely as text, much like playing blindfold chess. The implementation rests mainly on two open source libraries (a minimal sketch of the loop follows the list):
mineflayer open source library: converts AI-generated instructions into executable API calls
Mindcraft open source library: provides common prompts and examples, and lets a variety of models connect to the game
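To make the text-to-action loop concrete, here is a minimal sketch (not the actual Mindcraft code) of how a plain-text command emitted by a model might be parsed and mapped onto real mineflayer API calls. The command format, block name, and server settings are assumptions for illustration.

```typescript
import mineflayer from "mineflayer";
import { Vec3 } from "vec3";

const bot = mineflayer.createBot({
  host: "localhost",      // assumed local test server
  port: 25565,
  username: "llm_builder",
});

// Execute one LLM-generated text command, e.g. "place stone 10 64 10".
async function executeCommand(command: string): Promise<void> {
  const [verb, blockName, x, y, z] = command.trim().split(/\s+/);
  if (verb !== "place") throw new Error(`unsupported verb: ${verb}`);

  // Equip the requested block from the bot's inventory.
  const item = bot.inventory.items().find((i) => i.name === blockName);
  if (!item) throw new Error(`no ${blockName} in inventory`);
  await bot.equip(item, "hand");

  // mineflayer places a block *against* an existing reference block, so
  // target the block below the desired position and place on its top face.
  // (The bot must already be within reach; a real agent would pathfind first.)
  const target = new Vec3(Number(x), Number(y), Number(z));
  const reference = bot.blockAt(target.offset(0, -1, 0));
  if (!reference) throw new Error("reference block not loaded");
  await bot.placeBlock(reference, new Vec3(0, 1, 0));
}

bot.once("spawn", () => {
  executeCommand("place stone 10 64 10").catch(console.error);
});
```

In the real setup, Mindcraft supplies the prompts that get the model to emit commands in a parseable format; the parsing step above is a simplified stand-in for that layer.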
The project team plans to refine this evaluation mechanism into a scoring system similar to the LMSYS Chatbot Arena, using the Elo algorithm to rank models according to human votes. Reportedly, a complete test environment can be set up in just 15 minutes.
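For reference, such a ranking could work along these lines: a standard Elo update applied to each pairwise human vote. This is a generic sketch, not the project's implementation; the K-factor, starting rating, and model names are illustrative assumptions.

```typescript
type Ratings = Map<string, number>;

const K = 32;          // assumed K-factor
const INITIAL = 1000;  // assumed starting rating

// Expected score of a player rated `ra` against one rated `rb`.
function expectedScore(ra: number, rb: number): number {
  return 1 / (1 + Math.pow(10, (rb - ra) / 400));
}

// Record one vote: the human preferred `winner`'s build over `loser`'s.
function recordVote(ratings: Ratings, winner: string, loser: string): void {
  const ra = ratings.get(winner) ?? INITIAL;
  const rb = ratings.get(loser) ?? INITIAL;
  const ea = expectedScore(ra, rb);
  ratings.set(winner, ra + K * (1 - ea)); // winner scored 1, expected ea
  ratings.set(loser, rb + K * (ea - 1));  // loser scored 0, expected 1 - ea
}

// Usage: replay all votes, then sort descending for the leaderboard.
const ratings: Ratings = new Map();
recordVote(ratings, "sonnet-3.6", "o1-preview");
console.log([...ratings.entries()].sort((a, b) => b[1] - a[1]));
```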
This novel evaluation method not only showcases AI creativity but also offers a new angle for objectively assessing large-model capabilities. When given free rein, o1-preview chose to build a robot and spell out the word "GPT"; the models seem to be showing "personality" in this virtual world. As more models join the test, this classic game is becoming a unique stage for watching AI develop.
Video tutorial:
https://x.com/mckaywrigley/status/1849613686098506064
Open source code:
https://github.com/kolbytn/mindcraft
https://github.com/mc-bench/orchestrator
Evaluating AI models' building ability on the Minecraft platform offers a novel lens on their creativity and intelligence, and shows the potential AI still has to grow in virtual worlds. As more models take part and the evaluation mechanism matures, this benchmark should become an increasingly valuable reference for the AI field.