prompt guard
1.0.0
Prompt Guard is a classifier model by Meta, trained on a large corpus of attacks, capable of detecting both explicitly malicious prompts (jailbreaks) as well as data that contains injected inputs (prompt injections). Upon analysis, it returns one or more of the following verdicts, along with a confidence score for each:
This repository contains a Streamlit app for testing Prompt Guard. Note that you'll need an HuggingFace access token to access the model. For a more detailed writeup, see this blog post.
Here's a sample response by Prompt Guard upon detecting a prompt injection attempt.
Here's a sample response by Prompt Guard upon detecting a jailbreak attempt.