xFinder

v0.2.3 Released!

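xFinder: Robust and Pinpoint Answer Extraction for Large Language Models

Qingchen Yu¹,*, Zifan Zheng¹,*, Shichao Song²,*, Zhiyu Li¹,†, Feiyu Xiong¹, Bo Tang¹, Ding Chen¹

¹Institute for Advanced Algorithms Research, Shanghai, ²Renmin University of China

For business inquiries, please contact us at [email protected].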

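Who Should Pay Attention to Our Work?

If you are developing a benchmark, you can use xFinder instead of traditional RegEx methods to extract key answers from LLM responses. This improves the accuracy of your evaluation results and makes comparisons of model performance more reliable and meaningful.
If you are designing an evaluation framework, you can integrate xFinder into your framework's answer extraction component to make the evaluation process more robust and reliable.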

Important

Star us! By starring our project on GitHub, you will receive all release notifications instantly. Thank you for your support!

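News

[2024/10] We open-sourced the KAF-Dataset and released xFinder as a PyPI package.
[2024/09] xFinder was successfully integrated into OpenCompass!
[2024/08] We updated xFinder: the model now supports both English and Chinese.
[2024/05] We released xFinder: Robust and Pinpoint Answer Extraction for Large Language Models. Check out the paper.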

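Overview

Abstract

The continuous advancement of large language models (LLMs) has drawn increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. In particular, the emergence of subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. Because evaluation frameworks often use regular expressions (RegEx) for answer extraction, some models may adjust their responses to match specific formats that RegEx can extract easily. Nevertheless, RegEx-based key answer extraction modules frequently suffer from extraction errors. This paper conducts a comprehensive analysis of the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy, reduces LLMs' reliance on specific answer formats, and enhances the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. In generalization tests and evaluations in real-world scenarios, the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%. In contrast, RegEx in the best evaluation framework reaches 74.38%. xFinder exhibits stronger robustness and higher accuracy than existing evaluation frameworks.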

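Our main contributions are summarized as follows:

We provide a comprehensive review of LLM evaluation processes in the industry and identify critical factors that can lead to unreliable evaluation results.
We introduce xFinder, a model specifically designed for key answer extraction. The KAF dataset supports its effective training and evaluation.
In extensive experiments, we demonstrate that RegEx-based evaluation methods are unreliable, while our xFinder model significantly improves reliability.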

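The figure shows instances where evaluation frameworks such as LM Eval Harness and OpenCompass failed to extract key answers. Specifically, A/T/C/M denote tasks with alphabet / short text / categorical label / math options, respectively.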

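Quick Start

Create a benchmark dataset: To streamline evaluation with xFinder, we have standardized various mainstream benchmark datasets into a unified JSON format. For implementation details, see create_benchmark_dataset.py. If you want to evaluate your own dataset with xFinder, see the script template benchmark_dataset_template.py for guidance on format conversion.
Prepare QA pairs and LLM outputs: Collect the LLM outputs you want to evaluate. Make sure your data includes the following elements:
- Original question
- Key answer type (options: alphabet, short text, categorical label, math)
- LLM output
- Standard answer range
Deploy the xFinder model: Choose one of the following models to deploy:
- xFinder-qwen1505
- xFinder-llama38it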
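
As a rough illustration of the data preparation step, the four elements listed above can be bundled into one record per model response. This is only a sketch under assumptions: the field names are borrowed from the examples elsewhere in this README, `make_record` is a hypothetical helper, and the exact schema xFinder expects should be confirmed against create_benchmark_dataset.py and benchmark_dataset_template.py.

```python
import json

# Field names follow the examples in this README; treat them as an assumption
# and confirm the exact schema against benchmark_dataset_template.py.
REQUIRED_FIELDS = ("question", "key_answer_type", "llm_output", "standard_answer_range")


def make_record(question, key_answer_type, llm_output, standard_answer_range, correct_answer):
    """Bundle one question/response pair into a plain dict (hypothetical helper)."""
    record = {
        "question": question,
        "key_answer_type": key_answer_type,
        "llm_output": llm_output,
        # The README examples store the answer range as a string, so serialize it.
        "standard_answer_range": json.dumps(standard_answer_range),
        "correct_answer": correct_answer,
    }
    missing = [name for name in REQUIRED_FIELDS if not record[name]]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return record


records = [
    make_record(
        question="What is the capital of France?",
        key_answer_type="short_text",
        llm_output="The capital of France is Paris.",
        standard_answer_range=["Paris", "Lyon", "Marseille"],
        correct_answer="Paris",
    )
]

# Serialize all records; writing this string to a .json file is left to the caller.
serialized = json.dumps(records, ensure_ascii=False, indent=2)
print(serialized)
```

Keeping one flat dict per response makes it straightforward to dump many responses into a single JSON file for batch evaluation.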

After deploying the xFinder model, follow these steps to run an evaluation:

# Install xfinder
conda create -n xfinder_env python=3.10 -y
conda activate xfinder_env
pip install xfinder

# Perform an evaluation with xFinder (a built-in example)
CUDA_VISIBLE_DEVICES=0 python -m xfinder.eval --run-example --model-name xFinder-qwen1505 --inference-mode local --model-path-or-url /path/to/anonymized/model/xFinder-qwen1505

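xFinder supports two forms of evaluation

Batch Evaluation of Summarized Experimental Results

This method lets you evaluate multiple examples stored in a JSON file.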

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",   # Model name
    inference_mode="api",            # Inference mode, 'local' or 'api'
    model_path_or_url="http://your-anonymized-url/generate",  # Anonymized model path or URL
)
# Perform batch evaluation
data_path = "/path/to/your/data/example.json"  # User needs to provide their own data path
accuracy = evaluator.evaluate(data_path)

print(f"Batch evaluation accuracy: {accuracy}")

Single-Instance Evaluation Mode

This method lets you evaluate individual examples, so it can be integrated into an LLM evaluation framework.

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",   # Model name
    inference_mode="local",          # Inference mode, 'local' or 'api'
    model_path_or_url="IAAR-Shanghai/xFinder-qwen1505",  # Anonymized model path or URL
)
# Define input for a single evaluation
question = "What is the capital of France?"
llm_output = "The capital of France is Paris."
# The answer range is passed as a string containing the serialized list
standard_answer_range = '["Paris", "Lyon", "Marseille"]'
key_answer_type = "short_text"
correct_answer = "Paris"
# Perform single example evaluation
result = evaluator.evaluate_single_example(
    question,
    llm_output,
    standard_answer_range,
    key_answer_type,
    correct_answer,
)
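
One way to picture the integration into an evaluation framework: the framework loops over its samples and delegates each correctness judgment to a judge with the same five arguments as `evaluate_single_example`. The sketch below is a stand-in, not xFinder's API. `run_framework` and `toy_judge` are hypothetical names, `toy_judge` is a plain substring check so the code runs without a model, and the judge is assumed to return a boolean, which this README does not specify for `evaluate_single_example`.

```python
from typing import Callable, Dict, Iterable

Sample = Dict[str, str]
# Same argument order as evaluate_single_example; the bool return is an assumption.
Judge = Callable[[str, str, str, str, str], bool]


def toy_judge(question, llm_output, standard_answer_range, key_answer_type, correct_answer):
    """Hypothetical placeholder: correct if the gold answer appears verbatim in the output."""
    return correct_answer.lower() in llm_output.lower()


def run_framework(samples: Iterable[Sample], judge: Judge) -> float:
    """Score every sample with the supplied judge and return the fraction judged correct."""
    verdicts = [
        judge(
            s["question"],
            s["llm_output"],
            s["standard_answer_range"],
            s["key_answer_type"],
            s["correct_answer"],
        )
        for s in samples
    ]
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


samples = [
    {
        "question": "What is the capital of France?",
        "llm_output": "The capital of France is Paris.",
        "standard_answer_range": '["Paris", "Lyon", "Marseille"]',
        "key_answer_type": "short_text",
        "correct_answer": "Paris",
    },
    {
        "question": "What is the capital of France?",
        "llm_output": "I believe it is Lyon.",
        "standard_answer_range": '["Paris", "Lyon", "Marseille"]',
        "key_answer_type": "short_text",
        "correct_answer": "Paris",
    },
]

score = run_framework(samples, toy_judge)
print(f"Fraction judged correct: {score}")
```

Passing the judge in as a callable keeps the framework loop independent of the extraction backend, so the placeholder could later be swapped for a thin wrapper around a real evaluator.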

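Tips

For more detailed examples, see demo.ipynb.
If you cannot connect to Hugging Face, run export HF_ENDPOINT=https://hf-mirror.com to use the Chinese mirror.
For API inference, xFinder currently supports models deployed with vllm.
We provide scripts for fine-tuning xFinder in xfinder_training.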
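
If you launch xFinder from a Python script or notebook rather than a shell, the mirror tip can also be applied from inside Python. This is a minimal sketch, assuming the Hugging Face libraries read `HF_ENDPOINT` when they are first imported, so the variable has to be set before any such import.

```python
import os

# Set the mirror before importing any Hugging Face library (or xfinder),
# on the assumption that the endpoint is read at import time.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

print(os.environ["HF_ENDPOINT"])
```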

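Examples: RegEx vs. xFinder

We show one instance for each of the four question types where RegEx either fails to extract an answer or frequently extracts the wrong one, while xFinder extracts the key answer accurately.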

{
    "key_answer_type": "alphabet option",
    "question": "A man is seen playing guitar on a stage with others playing instruments behind him. The man grabs a guitar from the audience and begins playing both one after the other ...",
    "llm_output": "Option A is the correct choice as it describes ...",
    "standard_answer_range": "[['A', 'strums the guitar in the end, continues playing the guitar with the crowd following him as well as lining up next to him.'], ['B', 'continues playing the instruments and ends by waving to the crowd and walking off stage.'], ['C', 'then turns to the audience and gives a stuffed toy to the audience and continues playing.'], ['D', 'finally stops playing and moves his hands for the crowd to see.']]",
    "gold_label": "A",
    "xFinder_output": "A"
},
{
    "key_answer_type": "short text",
    "question": "If you really wanted a grape, where would you go to get it? Answer Choices: winery / fruit stand / field / kitchen / food",
    "llm_output": "The answer is winery / fruit stand / field / kitchen / food ...",
    "standard_answer_range": "[\"winery\", \"fruit stand\", \"field\", \"kitchen\", \"food\"]",
    "gold_label": "[No valid answer]",
    "xFinder_output": "[No valid answer]"
},
{
    "key_answer_type": "categorical label",
    "question": "How tall is the Sears Building ?",
    "llm_output": "The Sears Building is a specific structure, so the answer would be a Location ...",
    "standard_answer_range": "['Abbreviation', 'Entity', 'Description', 'Person', 'Location', 'Number']",
    "gold_label": "Location",
    "xFinder_output": "Location"
},
{
    "key_answer_type": "math",
    "question": "Mike made 69 dollars mowing lawns over the summer. If he spent 24 dollars buying new mower blades, how many 5 dollar games could he buy with the money he had left?",
    "llm_output": "To find out how many 5 dollar ... Let's calculate that:\n\n$45 / $5 = 9\n\nSo, Mike could buy 9 5 dollar games with the money he had left.",
    "standard_answer_range": "a(n) number / set / vector / matrix / interval / expression / function / equation / inequality",
    "gold_label": "9",
    "xFinder_output": "9"
}
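
To make the failure modes concrete, the sketch below runs two hand-written patterns against the `llm_output` strings of the first two examples above. The patterns are illustrative assumptions only. They are not the rules used by OpenCompass, LM Eval Harness, or any other framework, and the function names are made up.

```python
import re

# An illustrative pattern: expects the phrase "answer is X" with X in A-D.
LETTER_PATTERN = re.compile(r"answer is\s*\(?([A-D])\)?", re.IGNORECASE)


def regex_extract_letter(llm_output):
    """Return the option letter matched by the pattern, or None if nothing matches."""
    match = LETTER_PATTERN.search(llm_output)
    return match.group(1).upper() if match else None


def regex_extract_choice(llm_output, choices):
    """Return the first listed choice found after 'answer is', or None."""
    match = re.search(r"answer is\s*(.+)", llm_output, re.IGNORECASE)
    if not match:
        return None
    tail = match.group(1).lower()
    for choice in choices:
        if choice.lower() in tail:
            return choice
    return None


# Example 1: the response names option A, but not in the phrasing the pattern expects.
letter = regex_extract_letter("Option A is the correct choice as it describes ...")
print(letter)  # None, although the gold label is "A"

# Example 2: the response repeats every choice, so there is no valid answer,
# yet the pattern picks the first choice it sees.
choice = regex_extract_choice(
    "The answer is winery / fruit stand / field / kitchen / food ...",
    ["winery", "fruit stand", "field", "kitchen", "food"],
)
print(choice)  # "winery", although the gold label is "[No valid answer]"
```

The first case is a missed extraction and the second is a wrong extraction; both kinds of error distort the final score even when the model's response itself is unambiguous to a human reader.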

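Extraction Accuracy Results

Baselines: OpenCompass, LM Eval Harness, UltraEval, GPT-4. Our methods: xFinder-qwen1505, xFinder-qwen1518, xFinder-gemma7, xFinder-chatglm36base, xFinder-llama38, xFinder-llama38it.

We evaluated their accuracy in extracting key answers on both the KAF test set and the generalization sets. The metric reported in the table is accuracy.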
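
For readers who want the metric spelled out, extraction accuracy can be read as the share of responses whose extracted key answer equals the gold label. The sketch below is an assumed formulation, not the project's evaluation code. The gold labels are those of the four README examples, while the `extracted` list is made-up toy output, not a reported result.

```python
def extraction_accuracy(extracted, gold):
    """Fraction of positions where the extracted answer equals the gold label."""
    if len(extracted) != len(gold):
        raise ValueError("extracted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(e == g for e, g in zip(extracted, gold))
    return correct / len(gold)


# Gold labels from the four examples in this README.
gold = ["A", "[No valid answer]", "Location", "9"]
# Toy extractor output, invented for illustration only.
extracted = [None, "winery", "Location", "9"]

accuracy = extraction_accuracy(extracted, gold)
print(f"Extraction accuracy: {accuracy:.2f}")
```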

Citation

@article{xFinder,
      title={xFinder: Robust and Pinpoint Answer Extraction for Large Language Models},
      author={Qingchen Yu and Zifan Zheng and Shichao Song and Zhiyu Li and Feiyu Xiong and Bo Tang and Ding Chen},
      journal={arXiv preprint arXiv:2405.11874},
      year={2024},
}