Zhipu AI has open-sourced CogAgent-9B, a model trained from GLM-4V-9B. It is an agent model that reads a screenshot together with the user's instruction and predicts the next GUI operation. The model is general-purpose and suits GUI interaction scenarios on personal computers, mobile phones, and in-car systems. Compared with the previous version, CogAgent-9B-20241220 improves in many respects, supports both Chinese and English, and outputs a detailed thinking process, action descriptions, and sensitivity judgments. It achieves leading results on multiple datasets, showing strengths in GUI grounding (locating on-screen elements), single-step operations, and multi-step operations. The open-source release not only advances large-model technology but also opens up new possibilities for visually impaired users.
Compared with the first CogAgent model, which was open-sourced in December 2023, CogAgent-9B-20241220 is significantly better in GUI perception, reasoning and prediction accuracy, completeness of the action space, task universality, and generalization. It also supports screenshots and language interaction in both Chinese and English. The model's input consists only of the user's natural-language instruction, the record of actions already executed, and a GUI screenshot; it needs no textual representation of the layout and no additional element labels. Its output covers four parts: the thinking process, a natural-language description of the next action, a structured description of the next action, and a sensitivity judgment for that action.
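To make the input and output described above concrete, here is a minimal Python sketch of the data that flows through one step of such an agent. Everything in it is an assumption made for illustration: the class names, field names, the text layout produced by render_text_query, and the placeholder action string are hypothetical and are not CogAgent's actual prompt format or API. For the real interface, see the repository linked below.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AgentInput:
    """The three inputs named in the article (field names are hypothetical)."""
    instruction: str       # user's natural-language instruction
    screenshot_path: str   # path to the current GUI screenshot
    history: List[str] = field(default_factory=list)  # actions already executed


@dataclass
class AgentOutput:
    """The four output parts named in the article (field names are hypothetical)."""
    thinking: str            # thinking process
    action_description: str  # natural-language description of the next action
    structured_action: str   # structured description of the next action
    sensitive: bool          # sensitivity judgment for the next action


def render_text_query(agent_input: AgentInput) -> str:
    """Join the instruction and numbered history into one text query.

    The layout is illustrative only, not the model's real prompt template.
    """
    lines = [f"Task: {agent_input.instruction}", "History:"]
    if agent_input.history:
        lines += [f"{i}. {step}" for i, step in enumerate(agent_input.history, start=1)]
    else:
        lines.append("(none)")
    return "\n".join(lines)


def record_step(agent_input: AgentInput, output: AgentOutput,
                next_screenshot: str) -> AgentInput:
    """Build the next step's input: same instruction, new screenshot,
    and the predicted action appended to a copy of the history."""
    return AgentInput(
        instruction=agent_input.instruction,
        screenshot_path=next_screenshot,
        history=agent_input.history + [output.structured_action],
    )
```

In a multi-step task, a caller would send render_text_query(...) plus the screenshot to the model, parse the reply into an AgentOutput, check the sensitive flag before executing the action, and then call record_step to prepare the next step. Note that no layout text or element labels appear anywhere in the input, which matches the article's description.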
In performance tests, CogAgent-9B-20241220 achieved leading results on multiple datasets, showing strengths in GUI grounding, single-step operations, the Chinese step-wise leaderboard, and multi-step operations. Zhipu AI's release not only advances large-model technology but also gives visually impaired IT practitioners new tools and possibilities.
Code:
https://github.com/THUDM/CogAgent
Model:
Hugging Face: https://huggingface.co/THUDM/cogagent-9b-20241220
ModelScope community: https://modelscope.cn/models/ZhipuAI/cogagent-9b-20241220
The open-source release of CogAgent-9B marks an important step for the large-model agent ecosystem. Its efficient GUI interaction and broad applicability point to a new direction for intelligent interaction technology and herald more convenient, smarter applications. We look forward to seeing more innovative applications built on CogAgent-9B.