Anthropic's new "Constitutional Classifiers" technique marks a notable advance in AI safety. It is designed to resist "universal jailbreak" attacks and prevent AI models from generating harmful content. In large-scale testing, 183 participants were unable to fully bypass the system's protections despite the incentive of a sizable bounty and ample time, underscoring the strength of the defense. This article looks at how Constitutional Classifiers work, how they have been improved, and where the approach may go next.
Artificial intelligence company Anthropic recently announced a new safety method called "Constitutional Classifiers", aimed at protecting its language models from malicious manipulation. The technique specifically targets "universal jailbreaks": inputs designed to systematically bypass all safety measures and coax an AI model into generating harmful content.
To verify the effectiveness of the technique, Anthropic ran a large-scale test. The company recruited 183 participants to try to break through its defense system over two months. Participants attempted to get the Claude 3.5 model to answer ten forbidden questions by crafting specific prompts. Despite a bounty of up to $15,000 and roughly 3,000 hours of testing time, no participant was able to fully bypass Anthropic's safeguards.
Advances driven by early challenges
Anthropic's early version of the Constitutional Classifiers had two main problems: it flagged too many harmless requests as dangerous, and it required a large amount of computing resources. The improved classifier significantly reduces this false-positive rate and is more computationally efficient. Automated testing shows that the improved system blocks more than 95% of jailbreak attempts, though running it adds roughly 23.7% in compute overhead. By comparison, an unprotected Claude model lets 86% of jailbreak attempts through.
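As a rough illustration of the metrics cited above, the sketch below shows how a block rate and a false-positive rate might be computed from red-team test logs. The record fields and helper names are assumptions made for this example, not Anthropic's actual evaluation setup.

```python
# Illustrative only: computing the kind of metrics reported above from test logs.
# The field names (is_jailbreak_attempt, was_blocked, is_harmless, was_flagged)
# are assumptions for this sketch, not Anthropic's evaluation schema.

def block_rate(results):
    """Fraction of jailbreak attempts that the classifier blocked."""
    attempts = [r for r in results if r["is_jailbreak_attempt"]]
    blocked = [r for r in attempts if r["was_blocked"]]
    return len(blocked) / len(attempts) if attempts else 0.0

def false_positive_rate(results):
    """Fraction of harmless requests that were wrongly refused."""
    harmless = [r for r in results if r["is_harmless"]]
    flagged = [r for r in harmless if r["was_flagged"]]
    return len(flagged) / len(harmless) if harmless else 0.0

# Tiny example log: one blocked jailbreak attempt, one harmless request let through.
logs = [
    {"is_jailbreak_attempt": True,  "is_harmless": False, "was_blocked": True,  "was_flagged": True},
    {"is_jailbreak_attempt": False, "is_harmless": True,  "was_blocked": False, "was_flagged": False},
]
print(f"block rate: {block_rate(logs):.1%}, false positives: {false_positive_rate(logs):.1%}")
```

A block rate above 0.95 on a large jailbreak test set would correspond to the "more than 95% blocked" figure reported above.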
Training based on synthetic data
At the core of the Constitutional Classifiers is a set of predefined rules (the "constitution") that distinguishes allowed content from prohibited content. Based on these rules, the system generates synthetic training examples in many languages and styles and uses them to train classifiers to recognize suspicious inputs. This approach improves accuracy and strengthens the system against diverse attacks, as sketched below.
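To make the pipeline concrete, here is a minimal sketch of the idea: constitution rules are expanded into synthetic training examples across languages and styles, which are then used to train a classifier. The rule lists, the template-based generator, and the train_classifier placeholder are illustrative assumptions, not Anthropic's implementation; in practice a language model would paraphrase each rule into natural requests rather than filling in a template.

```python
# A minimal sketch of the synthetic-data training idea described above.
# Constitution rules, templates, and train_classifier are illustrative assumptions.

import random

CONSTITUTION = {
    "allowed":    ["general chemistry homework help"],
    "prohibited": ["step-by-step synthesis of dangerous chemicals"],
}

LANGUAGES = ["English", "Spanish", "Mandarin"]
STYLES = ["formal", "casual", "role-play"]

def generate_synthetic_examples(rules, n_per_rule=100):
    """Expand each constitution rule into many varied (text, label) pairs.
    A real pipeline would prompt an LLM to paraphrase each rule across
    languages and styles; here that step is stubbed out with a template."""
    examples = []
    for label, topics in (("allowed", rules["allowed"]),
                          ("prohibited", rules["prohibited"])):
        for topic in topics:
            for _ in range(n_per_rule):
                lang, style = random.choice(LANGUAGES), random.choice(STYLES)
                text = f"[{lang}/{style}] request about: {topic}"
                examples.append((text, label))
    return examples

def train_classifier(examples):
    """Placeholder for fitting an input/output classifier on the synthetic set."""
    ...

training_set = generate_synthetic_examples(CONSTITUTION)
classifier = train_classifier(training_set)
```

The key design point is that the classifier never needs hand-labeled real attacks to get started: the constitution itself, expanded into varied synthetic examples, supplies the training signal.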
Despite the significant progress, Anthropic's researchers acknowledge that the system is not perfect: it may not withstand every type of universal jailbreak, and new attack methods are likely to emerge. Anthropic therefore recommends using Constitutional Classifiers alongside other safety measures for more comprehensive protection.
Public testing and future prospects
To further stress-test the system, Anthropic plans to open a public demo from February 3 to 10, 2025, inviting security experts to try to crack it; the results will be announced in subsequent updates. The move demonstrates Anthropic's commitment to transparency and will provide valuable data for AI safety research.
Anthropic's "body classifier" marks an important progress in the security protection of AI models. With the rapid development of AI technology, how to effectively prevent the abuse of models has become the focus of industry attention. Anthropic's innovations provide new solutions to this challenge, while also pointing out the direction for future AI security research.
Anthropic's "body classifier" sets a new benchmark for the field of AI security, and its concepts of public testing and continuous improvement are worth learning from. In the future, with the continuous development of technology and the evolution of security threats, the improvement and upgrading of "physical classifiers" will play a more critical role in ensuring AI security.