Anthropic's "Computer Use" feature for Claude, released in October, gave AI agents the ability to operate a computer through its graphical user interface (GUI), and it attracted widespread attention. The feature moves beyond the limits of traditional API integrations and allows Claude to control the computer directly to complete more complex tasks. The Show Lab at the National University of Singapore tested Claude comprehensively across different scenarios, revealing both the potential and the limitations of this technology.
Since Anthropic launched Claude's "Computer Use" feature in October, the capabilities of the AI agent have drawn considerable interest. The feature makes Claude the first frontier model to interact with a computer through the same graphical user interface (GUI) that a human uses.
Claude receives screenshots of the desktop and completes tasks by issuing keyboard and mouse actions, giving users a convenient way to automate operations in applications that offer no API.
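The screenshot-in, action-out cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Anthropic's implementation: `FakeDesktop`, `Action`, `choose_action`, and `run` are hypothetical names, the "screenshot" is a plain dictionary rather than pixels, and a hand-written rule stands in for the model's decision.

```python
from dataclasses import dataclass, field


@dataclass
class Action:
    kind: str          # "scroll", "click", or "done"
    payload: str = ""


@dataclass
class FakeDesktop:
    """Hypothetical stand-in for a desktop: a page whose button sits below the fold."""
    scroll_y: int = 0
    subscribed: bool = False
    log: list = field(default_factory=list)

    def screenshot(self) -> dict:
        # A real agent would receive an image; here the observation is a dict.
        return {"button_visible": self.scroll_y >= 2, "subscribed": self.subscribed}

    def apply(self, action: Action) -> None:
        # Record the action, then update the simulated page state.
        self.log.append(action.kind)
        if action.kind == "scroll":
            self.scroll_y += 1
        elif action.kind == "click" and self.scroll_y >= 2:
            self.subscribed = True


def choose_action(observation: dict) -> Action:
    """Hand-written rule standing in for the model's decision (an assumption)."""
    if observation["subscribed"]:
        return Action("done")
    if observation["button_visible"]:
        return Action("click", "Subscribe")
    return Action("scroll", "down")


def run(desktop: FakeDesktop, max_steps: int = 10) -> bool:
    """Observe, decide, act; stop when the policy reports the task is done."""
    for _ in range(max_steps):
        action = choose_action(desktop.screenshot())
        if action.kind == "done":
            return True
        desktop.apply(action)
    return False


desktop = FakeDesktop()
finished = run(desktop)
print(finished, desktop.log)
```

The point of the sketch is that the agent only ever sees the current observation and must decide the next keyboard or mouse action from it, which is why a button hidden below the fold requires an explicit scroll step.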
In a study conducted by the National University of Singapore's Show Lab, researchers tested Claude on a variety of tasks, including web searches, workflow completion, office productivity and video games. These tasks tested Claude's ability in different scenarios, such as searching for and purchasing items on the web, or extracting information from a website and inserting it into a spreadsheet. Through these tests, the researchers evaluated Claude's performance along three dimensions: planning, action, and evaluation.
Claude's performance is impressive when it comes to executing complex tasks. It is able to formulate a clear plan, follow it step by step, and evaluate its progress at each step. It can also coordinate between multiple applications, for example by copying information from web pages into a spreadsheet. In some cases, Claude even reviews the results at the end of the task to make sure everything matches the goal.
However, Claude also makes simple mistakes that an average user would easily avoid. In one task, for example, it failed to complete a subscription because it did not scroll down the page to find the relevant button.
It was also clumsy at some seemingly obvious tasks, such as selecting and replacing text or changing bullet points to numbers. In addition, Claude sometimes does not recognize its mistakes, or makes incorrect assumptions about why it failed to achieve its goal.
The researchers pointed out that weaknesses in Claude's self-assessment may be the cause of these errors, and that future GUI agent frameworks may need a stricter self-assessment module. The results also show that existing GUI agents do not fully replicate the basic nuances of how humans use computers.
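One way to picture a stricter self-assessment step is a check that runs after every action and compares the observed state with the expected outcome, instead of assuming the action worked. The sketch below is a hypothetical illustration, not the design proposed by the researchers; `execute_with_check` and the toy `state` dictionary are assumptions introduced for this example.

```python
from typing import Callable


def execute_with_check(
    act: Callable[[], None],
    observe: Callable[[], dict],
    expected: Callable[[dict], bool],
    max_retries: int = 2,
) -> dict:
    """Run act(), then verify the result; retry a few times before reporting failure."""
    total_attempts = max_retries + 1
    for attempt in range(1, total_attempts + 1):
        act()
        if expected(observe()):
            return {"ok": True, "attempts": attempt}
    # Report the failure explicitly rather than assuming success.
    return {"ok": False, "attempts": total_attempts}


# Toy environment (hypothetical): the first click misses, the second one lands.
state = {"clicks": 0, "subscribed": False}


def click_subscribe() -> None:
    state["clicks"] += 1
    if state["clicks"] >= 2:
        state["subscribed"] = True


result = execute_with_check(
    click_subscribe,
    observe=lambda: dict(state),
    expected=lambda s: s["subscribed"],
)
print(result)
```

Separating the action from the verification means a silent failure, such as a click that never reached the button, surfaces as an explicit `ok: False` result that the agent can react to.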
For businesses, the prospect of automating tasks simply by describing them in plain text is enticing, but the technology is not yet ready for large-scale adoption. The model's behavior is erratic, which can lead to unpredictable consequences in sensitive applications. Operating through an interface designed for humans is also not the fastest way to complete a task.
Before widespread deployment, enterprises also need to consider the security risks of handing control of the mouse and keyboard to large language models (LLMs). For example, research has shown that web agents are vulnerable to adversarial attacks that a human would easily ignore. Still, tools like Claude can help product teams explore ideas and iterate on solutions, saving time and money before they develop new features or services.
Highlights:
1. Claude excels at automating complex tasks through a graphical user interface.
2. Claude makes mistakes on simple tasks, reflecting the weakness of its self-evaluation mechanism.
3. At this stage, the technology is not suitable for large-scale deployment, and enterprises need to be cautious about potential security risks.
All in all, Claude's "Computer Use" feature demonstrates the great potential of AI in automation, but it also exposes areas that still need improvement in stability and security. As the technology continues to develop, AI agents like Claude are likely to play an important role in more fields.