Large language models (LLMs) hold great promise for many language-based tasks but can also produce harmful or incorrect content. Traditionally, these problems are identified and fixed through red-teaming, in which human testers craft prompts that elicit unwanted model responses. This process is expensive and time-consuming, and while recent attempts to automate it with reinforcement learning have shown promise, the prompts they generate tend to cover only a narrow slice of possible failure cases, limiting their effectiveness. Our research introduces curiosity-driven red-teaming (CRT), which uses curiosity-driven exploration to generate a broader range of test cases. CRT produces novel and diverse prompts, often exceeding the effectiveness of current methods, and can even uncover prompts that elicit toxic responses from advanced models. However, CRT depends on novelty rewards whose weight must be carefully tuned. To address this, we propose Extrinsic-Intrinsic Policy Optimization (EIPO), a reinforcement learning approach that automatically adjusts the importance of the intrinsic reward: it suppresses exploration when it is unnecessary and amplifies it when it is needed, ensuring effective exploration without manual tuning and yielding consistent performance gains across tasks. By integrating EIPO, our CRT method further strengthens automated red-teaming, offering a more robust way to test LLMs and underscoring the value of curiosity-driven exploration for LLM safety.
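To make these two ideas concrete, the sketch below illustrates, in highly simplified form, how a red-teaming reward might combine an extrinsic attack score (e.g., the toxicity of the target model's response) with an intrinsic novelty bonus, and how the intrinsic weight might be adjusted automatically instead of being hand-tuned. The function names, the cosine-similarity novelty measure, and the threshold-based weight update are illustrative assumptions for exposition, not the exact CRT or EIPO formulations.

```python
import numpy as np


def novelty_bonus(prompt_embedding, past_embeddings, eps=1e-8):
    """Intrinsic reward: higher when the new prompt differs from past prompts.

    Illustrative measure: 1 minus the maximum cosine similarity to
    previously generated prompt embeddings.
    """
    if len(past_embeddings) == 0:
        return 1.0
    p = prompt_embedding / (np.linalg.norm(prompt_embedding) + eps)
    past = np.stack(past_embeddings)
    past = past / (np.linalg.norm(past, axis=1, keepdims=True) + eps)
    max_sim = float(np.max(past @ p))
    return 1.0 - max_sim


def combined_reward(toxicity_score, prompt_embedding, past_embeddings, beta):
    """Extrinsic attack reward plus a beta-weighted curiosity (novelty) bonus."""
    return toxicity_score + beta * novelty_bonus(prompt_embedding, past_embeddings)


def update_beta(beta, extrinsic_return, best_extrinsic_return, step=0.05):
    """EIPO-style idea, heavily simplified: shrink the intrinsic weight when
    exploration is no longer improving the extrinsic objective, and grow it
    when extrinsic progress stalls."""
    if extrinsic_return >= best_extrinsic_return:
        return max(0.0, beta - step)   # exploitation is working: explore less
    return beta + step                 # progress has stalled: explore more
```

The intent of the sketch is only to show the trade-off being automated: rather than fixing the novelty weight `beta` by hand, the training loop would update it based on whether exploration is still paying off in extrinsic reward.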