v0.2.3 Released!

xFinder: Robust and Pinpoint Answer Extraction for Large Language Models

Qingchen Yu¹*, Zifan Zheng¹*, Shichao Song²*, Zhiyu Li¹†, Feiyu Xiong¹, Bo Tang¹, Ding Chen¹

¹Institute for Advanced Algorithms Research, Shanghai, ²Renmin University of China

For business inquiries, please contact us at [email protected].

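Who should pay attention to our work?

If you are developing a benchmark, you can use xFinder to replace the traditional RegEx approach to extracting key answers from LLM responses. This improves the accuracy of evaluation results and makes comparisons and validation of model performance more reliable and meaningful.
If you design evaluation frameworks, you can integrate xFinder into your framework's answer extraction component to make the evaluation process more robust and reliable.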
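
To make the motivation concrete, the sketch below shows how a fixed RegEx extractor can miss or mis-extract a key answer. The pattern `ANSWER_PATTERN` and the helper `regex_extract` are hypothetical, written only for illustration; they are not taken from any particular evaluation framework.

```python
import re
from typing import Optional

# Hypothetical pattern of the kind a RegEx-based extractor might use.
ANSWER_PATTERN = re.compile(r"answer is \(?([A-D])\)?", re.IGNORECASE)


def regex_extract(llm_output: str) -> Optional[str]:
    """Return the option letter if the fixed pattern matches, else None."""
    match = ANSWER_PATTERN.search(llm_output)
    return match.group(1).upper() if match else None


outputs = [
    "The answer is (B).",                                  # expected format
    "Option A is the correct choice as it describes ...",  # different phrasing
    "The answer is definitely B.",                         # 'd' of 'definitely' is captured
]
for text in outputs:
    print(repr(text), "->", regex_extract(text))
```

The second output yields no match at all, and the third yields the wrong letter because the case-insensitive pattern captures the first letter of "definitely". These are the kinds of failures a learned extractor is meant to avoid.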

Important

Star us! Starring the project on GitHub lets you receive all release notifications immediately. Thank you for your support.

News

[2024/10] We open-sourced the KAF-Dataset and released xFinder as a PyPI package.
[2024/09] xFinder was integrated into OpenCompass.
[2024/08] We updated xFinder: the model now supports both English and Chinese.
[2024/05] We released xFinder: Robust and Pinpoint Answer Extraction for Large Language Models. Check out the paper.

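Overview

Abstract

The continuous advancement of large language models (LLMs) has drawn increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. In particular, the emergence of subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. Because evaluation frameworks often use regular expressions (RegEx) for answer extraction, some models may adjust their responses to conform to specific formats that RegEx can extract easily. Nevertheless, RegEx-based key answer extraction modules frequently suffer from extraction errors. This paper conducts a comprehensive analysis of the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy, reduces LLMs' reliance on specific answer formats, and enhances the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. In generalization testing and evaluation in real-world scenarios, the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. xFinder exhibits stronger robustness and higher accuracy than existing evaluation frameworks.

We summarize our main contributions as follows:

We provide a comprehensive review of LLM evaluation processes in the industry and identify critical factors that can lead to unreliable evaluation results.
We introduce xFinder, a model specifically designed for key answer extraction. The KAF dataset supports its effective training and evaluation.
Our extensive experiments demonstrate that RegEx-based evaluation methods are unreliable, while the xFinder model significantly improves reliability.

The figure shows instances where evaluation frameworks such as LM Eval Harness and OpenCompass failed to extract the key answer. Specifically, A/T/C/M denote tasks with alphabet / short text / categorical label / math options, respectively.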

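Quick Start

Create benchmark datasets: To streamline evaluation with xFinder, we have standardized various mainstream benchmark datasets into a unified JSON format. For implementation details, see create_benchmark_dataset.py. If you want to evaluate your own dataset with xFinder, refer to the format conversion guidance in the provided script template, benchmark_dataset_template.py.
Prepare QA pairs and LLM outputs: Collect the LLM outputs you want to evaluate. Make sure your data includes the following elements:
- Original question
- Key answer type (options: alphabet, short text, categorical label, math)
- LLM output
- Standard answer range
Deploy the xFinder model: Choose one of the following models for deployment:
- xFinder-qwen1505
- xFinder-llama38it

After deploying the xFinder model, follow these steps to run an evaluation: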
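
As a rough illustration of the data preparation step above, the following sketch bundles the four required elements into one record and checks them before serializing to JSON. The `EvalRecord` class, the field names, and the spellings in `KEY_ANSWER_TYPES` are assumptions made for this example; the format the xfinder package actually expects is defined by its own scripts and may differ.

```python
import json
from dataclasses import asdict, dataclass

# Assumed spellings; only "short_text" appears in this README's code.
KEY_ANSWER_TYPES = {"alphabet", "short_text", "categorical_label", "math"}


@dataclass
class EvalRecord:
    """One QA pair plus the LLM output to be judged (hypothetical layout)."""
    question: str
    key_answer_type: str
    llm_output: str
    standard_answer_range: str

    def __post_init__(self):
        if self.key_answer_type not in KEY_ANSWER_TYPES:
            raise ValueError(f"unknown key answer type: {self.key_answer_type!r}")
        # Every element is required, so reject empty strings early.
        for name, value in asdict(self).items():
            if not value.strip():
                raise ValueError(f"{name} must not be empty")


record = EvalRecord(
    question="What is the capital of France?",
    key_answer_type="short_text",
    llm_output="The capital of France is Paris.",
    standard_answer_range='["Paris", "Lyon", "Marseille"]',
)
print(json.dumps(asdict(record), indent=2))
```

Validating records up front keeps a missing or misspelled field from surfacing later as a confusing extraction failure.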

# Install xfinder
conda create -n xfinder_env python=3.10 -y
conda activate xfinder_env
pip install xfinder

# Perform an evaluation with xFinder (a built-in example)
CUDA_VISIBLE_DEVICES=0 python -m xfinder.eval \
    --run-example \
    --model-name xFinder-qwen1505 \
    --inference-mode local \
    --model-path-or-url /path/to/anonymized/model/xFinder-qwen1505

xFinder supports two forms of evaluation

Batch evaluation of summarized experimental results

This method lets you evaluate multiple examples stored in a JSON file.

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",  # Model name
    inference_mode="api",           # Inference mode, 'local' or 'api'
    model_path_or_url="http://your-anonymized-url/generate",  # Model path or URL (placeholder)
)

# Perform batch evaluation
data_path = "/path/to/your/data/example.json"  # Replace with the path to your own data
accuracy = evaluator.evaluate(data_path)

print(f"Batch evaluation accuracy: {accuracy}")

Single-instance evaluation mode

This method lets you evaluate individual examples and can be integrated into an LLM evaluation framework.

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",  # Model name
    inference_mode="local",         # Inference mode, 'local' or 'api'
    model_path_or_url="IAAR-Shanghai/xFinder-qwen1505",  # Model path or URL
)

# Define input for a single evaluation
question = "What is the capital of France?"
llm_output = "The capital of France is Paris."
standard_answer_range = '["Paris", "Lyon", "Marseille"]'  # List of options, passed as a string
key_answer_type = "short_text"
correct_answer = "Paris"

# Perform single example evaluation
result = evaluator.evaluate_single_example(
    question,
    llm_output,
    standard_answer_range,
    key_answer_type,
    correct_answer,
)
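
One possible way to wire the single-instance call into an evaluation loop is sketched below. `run_framework` and `fake_judge` are hypothetical names introduced here. The sketch assumes only that the judge accepts the same five positional arguments as `evaluate_single_example` above; what that method returns is not documented in this README, so results are passed through unchanged.

```python
from typing import Any, Callable, Iterable, List


def run_framework(samples: Iterable[dict], judge: Callable[..., Any]) -> List[Any]:
    """Call the judge once per sample, in the argument order used above."""
    results = []
    for sample in samples:
        results.append(
            judge(
                sample["question"],
                sample["llm_output"],
                sample["standard_answer_range"],
                sample["key_answer_type"],
                sample["correct_answer"],
            )
        )
    return results


# Stand-in judge so the sketch runs without a model. It only checks for a
# substring and is NOT how xFinder works.
def fake_judge(question, llm_output, standard_answer_range, key_answer_type, correct_answer):
    return correct_answer in llm_output


samples = [
    {
        "question": "What is the capital of France?",
        "llm_output": "The capital of France is Paris.",
        "standard_answer_range": '["Paris", "Lyon", "Marseille"]',
        "key_answer_type": "short_text",
        "correct_answer": "Paris",
    }
]
print(run_framework(samples, fake_judge))
```

Passing the judge in as a callable keeps the framework loop independent of the model, so `evaluator.evaluate_single_example` can be swapped in without changing the loop.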

Tips

For more detailed examples, see demo.ipynb.
If you cannot connect to Hugging Face, run export HF_ENDPOINT=https://hf-mirror.com to use the Chinese mirror.
xFinder currently supports loading via an API deployed with vLLM.
We provide scripts for fine-tuning xFinder in xfinder_training.
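
If exporting the variable in the shell is inconvenient, for example in a notebook, the mirror tip can also be applied from Python. This is a minimal sketch. It assumes the variable is set before any Hugging Face library is imported, because such libraries typically read `HF_ENDPOINT` when they are first loaded.

```python
import os

# Use the mirror only if no endpoint has been configured already.
# Run this before importing libraries that download from Hugging Face.
os.environ.setdefault("HF_ENDPOINT", "https://hf-mirror.com")

print(os.environ["HF_ENDPOINT"])
```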

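Examples: RegEx vs. xFinder

We show examples for four types of questions where RegEx either fails to extract an answer or frequently extracts an incorrect one, whereas xFinder extracts the key answer accurately.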

[
    {
        "key_answer_type": "alphabet option",
        "question": "A man is seen playing guitar on a stage with others playing instruments behind him. The man grabs a guitar from the audience and begins playing both one after the other ...",
        "llm_output": "Option A is the correct choice as it describes ...",
        "standard_answer_range": "[['A', 'strums the guitar in the end, continues playing the guitar with the crowd following him as well as lining up next to him.'], ['B', 'continues playing the instruments and ends by waving to the crowd and walking off stage.'], ['C', 'then turns to the audience and gives a stuffed toy to the audience and continues playing.'], ['D', 'finally stops playing and moves his hands for the crowd to see.']]",
        "gold_label": "A",
        "xFinder_output": "A"
    },
    {
        "key_answer_type": "short text",
        "question": "If you really wanted a grape, where would you go to get it? Answer Choices: winery / fruit stand / field / kitchen / food",
        "llm_output": "The answer is winery / fruit stand / field / kitchen / food ...",
        "standard_answer_range": "[\"winery\", \"fruit stand\", \"field\", \"kitchen\", \"food\"]",
        "gold_label": "[No valid answer]",
        "xFinder_output": "[No valid answer]"
    },
    {
        "key_answer_type": "categorical label",
        "question": "How tall is the Sears Building ?",
        "llm_output": "The Sears Building is a specific structure, so the answer would be a Location ...",
        "standard_answer_range": "['Abbreviation', 'Entity', 'Description', 'Person', 'Location', 'Number']",
        "gold_label": "Location",
        "xFinder_output": "Location"
    },
    {
        "key_answer_type": "math",
        "question": "Mike made 69 dollars mowing lawns over the summer. If he spent 24 dollars buying new mower blades, how many 5 dollar games could he buy with the money he had left?",
        "llm_output": "To find out how many 5 dollar ... Let's calculate that:\n\n$45 / $5 = 9\n\nSo, Mike could buy 9 5 dollar games with the money he had left.",
        "standard_answer_range": "a(n) number / set / vector / matrix / interval / expression / function / equation / inequality",
        "gold_label": "9",
        "xFinder_output": "9"
    }
]
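
In the examples above, `standard_answer_range` is always a string, but its content differs by key answer type: a list of label/text pairs, a flat list of options, or a free-text description for math. The sketch below shows one way to turn those strings into Python data. The helper `parse_answer_range` is hypothetical and is not part of the xfinder package.

```python
import ast


def parse_answer_range(key_answer_type: str, raw: str):
    """Convert the string form of standard_answer_range into Python data."""
    if key_answer_type == "math":
        # Math items carry a free-text description, not a list of options.
        return raw.strip()
    # literal_eval safely parses list literals with single or double quotes.
    parsed = ast.literal_eval(raw.strip())
    if key_answer_type == "alphabet option":
        # [['A', 'text'], ...] -> {'A': 'text', ...}
        return {label: text for label, text in parsed}
    return list(parsed)


print(parse_answer_range("alphabet option", "[['A', 'first'], ['B', 'second']]"))
print(parse_answer_range("short text", '["winery", "fruit stand"]'))
print(parse_answer_range("categorical label", " ['Entity', 'Number'] "))
print(parse_answer_range("math", " a(n) number / set / vector "))
```

`ast.literal_eval` is used instead of `eval` because it accepts only literals and so cannot execute arbitrary code found in a data file.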

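Extraction accuracy results

Baselines: OpenCompass, LM Eval Harness, UltraEval, GPT-4. Our methods: xFinder-qwen1505, xFinder-qwen1518, xFinder-gemma7, xFinder-chatglm36base, xFinder-llama38, xFinder-llama38it.

We evaluated the accuracy of key answer extraction on both the KAF test set and the generalization set. The metric in the table is accuracy.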
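
For readers who want to compute a comparable figure on their own records, extraction accuracy can be read as the fraction of examples whose extracted answer equals the gold label. The sketch below assumes exact string matching after trimming whitespace and reuses the field names from the JSON examples in this README. The paper's exact matching rule may differ, and the third record is a made-up miss added only to make the arithmetic visible.

```python
from typing import Dict, List


def extraction_accuracy(records: List[Dict[str, str]]) -> float:
    """Fraction of records whose extracted answer equals the gold label."""
    if not records:
        raise ValueError("no records to score")
    hits = sum(
        r["xFinder_output"].strip() == r["gold_label"].strip() for r in records
    )
    return hits / len(records)


records = [
    {"gold_label": "A", "xFinder_output": "A"},
    {"gold_label": "[No valid answer]", "xFinder_output": "[No valid answer]"},
    {"gold_label": "Location", "xFinder_output": "Number"},  # made-up miss
    {"gold_label": "9", "xFinder_output": "9"},
]
print(extraction_accuracy(records))  # 3 of 4 match -> 0.75
```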

Citation

@article{xFinder,
      title={xFinder: Robust and Pinpoint Answer Extraction for Large Language Models},
      author={Qingchen Yu and Zifan Zheng and Shichao Song and Zhiyu Li and Feiyu Xiong and Bo Tang and Ding Chen},
      journal={arXiv preprint arXiv:2405.11874},
      year={2024},
}