Recently, a unique AI capability evaluation launched on the Minecraft platform has attracted widespread attention. Old and new versions of Claude 3.5 Sonnet went head-to-head in building contests inside the game, revealing clear differences in capability, and the performance of the new version (tentatively called Sonnet 3.6) was especially eye-catching. The Downcodes editor takes you through this unusual AI competition, along with the technical details and future prospects behind it.
This test, initiated by developer adi, has been jokingly dubbed "the only reliable AI evaluation benchmark". Benchmark researcher Aidan McLau believes this approach fits the needs of current AI evaluation, pointing out that aesthetic ability is closely tied to intelligence. The project quickly gained support from the open-source community, and the relevant code is already online on GitHub.
The test results show that each major model displays a distinct personality:
Sonnet 3.6 edged ahead in creativity, earning votes from more than 2,000 netizens.
Although OpenAI's o1-preview builds slowly, it performs well when reproducing real buildings (such as the Taj Mahal).
o1-mini was unable to complete the tasks.
Llama 3 405B built a diamond wall over a fire pit that it said symbolized itself.
Alibaba's Qwen2.5-14B also showed considerable strength.
It is worth noting that the AI's building process in the game does not rely on visual understanding or direct control of input devices. Instead, the game state is provided to the model as textual context, and the model generates operation instructions as text, much like playing blind chess. The technical implementation mainly relies on:
mineflayer open-source library: converts AI-generated instructions into executable API calls
mindcraft open-source library: provides common prompts and examples, and supports connecting various models to the game
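Conceptually, this pipeline boils down to parsing the model's text output into structured actions that a game API can execute. The sketch below illustrates that idea only; the command grammar and function names are hypothetical and not taken from mindcraft's actual code:

```python
# Minimal sketch of turning model text output into structured actions,
# in the spirit of how mindcraft maps instructions onto mineflayer calls.
# The "!command arg ..." grammar here is an illustrative assumption.

def parse_command(line):
    """Parse one line of model output like '!place stone 10 64 -3'
    into an action dict, or return None for plain chat text."""
    line = line.strip()
    if not line.startswith("!"):
        return None  # ordinary chat, not an action
    parts = line[1:].split()
    action, args = parts[0], parts[1:]
    if action == "place" and len(args) == 4:
        return {"action": "place", "block": args[0],
                "pos": tuple(int(a) for a in args[1:])}
    if action == "move" and len(args) == 3:
        return {"action": "move", "pos": tuple(int(a) for a in args)}
    return {"action": "unknown", "raw": line}

# Example: the model "speaks" its build plan as text, one command per line.
plan = """!move 10 64 -3
!place stone 10 64 -3
nice, the wall is coming along"""
actions = [a for a in (parse_command(l) for l in plan.splitlines()) if a]
```

In a real setup, each resulting dict would be dispatched to the corresponding mineflayer API call running against a live Minecraft server.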
The project team plans to further refine this evaluation mechanism and build a scoring system similar to LMSYS Arena, using the Elo algorithm to rank models based on human votes. Reportedly, a complete test environment can be set up in just 15 minutes.
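The standard Elo update used by arena-style leaderboards is simple: after each human vote, the winner takes rating points from the loser in proportion to how surprising the result was. A minimal sketch (the K-factor of 32 and the starting rating of 1000 are common illustrative defaults, not the project's published parameters):

```python
# Standard Elo rating update after a single pairwise human vote.
# K controls how fast ratings move; 400 is the conventional scale factor.

def elo_update(r_winner, r_loser, k=32):
    """Return updated (winner, loser) ratings after one vote."""
    expected_win = 1 / (1 + 10 ** ((r_loser - r_winner) / 400))
    delta = k * (1 - expected_win)
    return r_winner + delta, r_loser - delta

# Two evenly matched models (expected win probability 0.5):
# the winner gains exactly k/2 = 16 points, the loser drops by 16.
a, b = elo_update(1000, 1000)  # → (1016.0, 984.0)
```

Because the expected-win term shrinks the reward for beating a weaker opponent, rankings converge toward a stable ordering as votes accumulate.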
This novel evaluation method not only demonstrates AI's creativity but also offers a new perspective on objectively evaluating large model capabilities. Just as o1-preview chose to build a robot and spell out the letters "GPT" during free play, AI models seem to be showing distinct personalities in this virtual world. As more models join the test, this classic game is becoming a unique platform for witnessing AI's development.
Video tutorial:
https://x.com/mckaywrigley/status/1849613686098506064
Open source code:
https://github.com/kolbytn/mindcraft
https://github.com/mc-bench/orchestrator
Through this unique Minecraft AI building competition, we have seen how differently models perform in creativity and problem solving. The test offers a fresh approach to AI capability assessment and suggests that AI technology will have even broader room to develop. We look forward to more models joining in and to witnessing what AI can create in Minecraft!