Anthropic security protection faces challenges, general jailbreak testing of AI models reveals breakthroughs - AI Articles

Author：Eve Cole Update Time：2025-02-14 08:48:01

This article discusses the test results of Anthropic's AI model Claude 3.5 security protection system. The researchers tested the effectiveness of its new safety mechanism, the physique classifier, through a six-day open challenge. Participants tried to bypass all security protection measures of Claude 3.5 and finally successfully broke through all eight security levels, triggering in-depth discussions on AI security protection. Although the challenger successfully made a breakthrough, no common "jailbreak method" was discovered, which shows that there are still challenges in AI security protection, but it is not completely unbreakable.

In just six days, participants successfully bypassed all security protection measures in Anthropic artificial intelligence (AI) model Claude 3.5, a breakthrough that brings new discussions to the field of AI security protection. Jan Leike, a former OpenAI alignment team member and now working for Anthropic, announced on the X platform that a participant successfully broke all eight security levels. This collective effort involved approximately 3,700 hours of tests and 300,000 messages from participants.

Despite the challenger's successful breakthrough, Leike stressed that no one has been able to propose a common "jailbreak method" to solve all security challenges at once. This means that despite the breakthrough, there is still no way to find a universal way to bypass all security protections.

Claude2，Anthropic，人工智能，聊天机器人克劳德

Challenges and Improvements of Physical Classifiers

As AI technologies become increasingly powerful, how to protect them from manipulation and abuse, especially when it comes to harmful output, has become an increasingly important issue. Anthropic has developed a new security method - a constitution classifier, specifically to prevent the occurrence of general jailbreaks. This method uses preset rules to determine whether the input content is possible to manipulate the model, thereby preventing dangerous responses.

To test the effectiveness of this system, Anthropic recruited 183 participants over a two-month period to try to break through the security protection of the Claude 3.5 model. Participants were asked to try to bypass the security mechanism, causing Claude to answer ten "taboo questions". Despite offering a $15,000 bonus and nearly 3,000 hours of testing, no one has been able to bypass all the security protections.

Earlier versions of the constitution classifier had some problems, including the error marking of harmless requests as dangerous requests and the need for a lot of computing power. But with subsequent improvements, these problems have been effectively solved. Test data show that 86% of manipulation attempts were passed in the unprotected Claude model, while the protected version prevented more than 95% of manipulation attempts, although the system still requires high computing power.

Synthesized training data and future security challenges

The security system is based on synthetic training data, using predefined rules to build a model’s “constitution” that determines which inputs are allowed and which are forbidden. The classifier trained through these synthetic examples can effectively identify suspicious inputs. However, the researchers acknowledge that this system is not perfect and cannot cope with all forms of universal jailbreak attacks, so it is recommended to use it in combination with other security measures.

In order to further strengthen the verification of the system, Anthropic released a public demonstration version between February 3 and 10, 2025, inviting security experts to participate in the challenge, and the results will be shared with you through subsequent updates.

This contest on AI security demonstrates the huge challenges and complexity of AI model protection. With the continuous advancement of technology, how to improve the functionality of the model while ensuring security is still an important issue that the AI industry needs to solve urgently.

In short, the results of this security challenge not only reveal the shortcomings of AI security protection, but also show Anthropic's efforts and progress in improving AI security. In the future, AI security still needs to be continuously improved and improved to meet the ever-evolving challenges.