Meta recently released Prompt-Guard-86M, a machine learning model designed to defend against prompt injection attacks, but the model itself was quickly discovered to have serious security vulnerabilities. Hint injection attacks involve inducing large language models (LLMs) to violate security constraints or produce inappropriate behavior through carefully crafted inputs. Prompt-Guard-86M is intended to filter out these harmful prompts, however, researchers found that simple character spacing and punctuation removal can easily bypass the model's defense mechanism, rendering it ineffective.
Recently, Meta launched a machine learning model called Prompt-Guard-86M, designed to detect and respond to prompt injection attacks. This type of attack usually involves special inputs that cause a large language model (LLM) to behave inappropriately or circumvent security restrictions. Surprisingly, however, the new system itself also exposes itself to the risk of being attacked.
Picture source note: The picture is generated by AI, and the picture is authorized by the service provider Midjourney
Prompt-Guard-86M was launched by Meta together with its Llama3.1 generation model, mainly to help developers filter out prompts that may cause problems. Large language models typically process large amounts of text and data, and if left unchecked, they may arbitrarily repeat dangerous or sensitive information. Therefore, developers built "guardrails" into the model to capture inputs and outputs that could cause harm.
However, users of AI seem to see bypassing these guardrails as a challenge, using hint injection and jailbreaking to make models ignore their own safety instructions. Recently, some researchers pointed out that Meta's Prompt-Guard-86M is vulnerable when processing some special inputs. For example, when typing "Ignore previous instructions" with a space between the letters, Prompt-Guard-86M will obediently ignore the previous instructions.
The discovery was made by a vulnerability hunter named Aman Priyanshu, who discovered the security flaw while analyzing Meta models and Microsoft's benchmark models. Priyanshu said the process of fine-tuning Prompt-Guard-86M had very little impact on individual English letters, allowing him to devise this attack. He shared this finding on GitHub, pointing out that by simply spacing characters and removing punctuation, the classifier can lose its detection capabilities.
Hyrum Anderson, chief technology officer of Robust Intelligence, also agreed. He pointed out that the attack success rate of this method is almost 100%. Although Prompt-Guard is only part of the defense line, the exposure of this vulnerability has indeed sounded the alarm for companies when using AI. Meta has not yet responded, but sources say they are actively looking for a solution.
Highlights:
Meta's Prompt-Guard-86M was found to have a security vulnerability and is vulnerable to prompt injection attacks.
By adding spaces between letters, the system can be made to ignore security instructions, with an attack success rate of almost 100%.
⚠️ This incident reminds companies to be cautious when using AI technology and that security issues still need to be taken into consideration.
The vulnerability of Prompt-Guard-86M exposed the huge challenges facing the field of AI security and once again emphasized that security must be given priority when developing and deploying AI systems. In the future, more powerful and reliable security mechanisms will be the key to the development of AI technology.