xFinder

v0.2.3 Released!

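xFinder: Robust and Pinpoint Answer Extraction for Large Language Models

Qingchen Yu^1,*, Zifan Zheng^1,*, Shichao Song^2,*, Zhiyu Li^1,†, Feiyu Xiong^1, Bo Tang^1, Ding Chen^1

^1 Institute for Advanced Algorithms Research, Shanghai, ^2 Renmin University of China

For business inquiries, please contact us at [email protected].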

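Who should pay attention to our work?

If you are developing a benchmark, you can use xFinder in place of traditional RegEx methods to extract key answers from LLM responses. This improves the accuracy of your evaluation results and makes comparisons and validation of model performance more reliable and meaningful.
If you are designing an evaluation framework, you can integrate xFinder into the answer extraction component of your framework to make the evaluation process more robust and reliable.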

Important

Star us! By starring our project on GitHub, you will receive all release notifications immediately. Thank you for your support!

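News

[2024/10] We open-sourced the KAF-Dataset and released xFinder as a PyPI package.
[2024/09] xFinder has been successfully integrated into OpenCompass!
[2024/08] We updated xFinder: the model now supports both English and Chinese.
[2024/05] We released xFinder: Robust and Pinpoint Answer Extraction for Large Language Models. See the paper.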

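Overview

Abstract

The continuous advancement of large language models (LLMs) has drawn increasing attention to the critical issue of developing fair and reliable methods for evaluating their performance. In particular, the emergence of subjective or non-subjective cheating phenomena, such as test set leakage and prompt format overfitting, poses significant challenges to the reliable evaluation of LLMs. Because evaluation frameworks often use regular expressions (RegEx) for answer extraction, some models may adjust their responses to comply with specific formats that RegEx can extract easily. Nevertheless, RegEx-based key answer extraction modules frequently suffer from extraction errors. This paper conducts a comprehensive analysis of the entire LLM evaluation chain and demonstrates that optimizing the key answer extraction module improves extraction accuracy, reduces LLMs' reliance on specific answer formats, and enhances the reliability of LLM evaluation. To address these issues, we propose xFinder, a model specifically designed for key answer extraction. As part of this process, we create a specialized dataset, the Key Answer Finder (KAF) dataset, to ensure effective model training and evaluation. Through generalization testing and evaluation in real-world scenarios, the results show that the smallest xFinder model, with only 500 million parameters, achieves an average answer extraction accuracy of 93.42%. In contrast, RegEx accuracy in the best evaluation framework is 74.38%. xFinder exhibits stronger robustness and higher accuracy than existing evaluation frameworks.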

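Our main contributions are summarized as follows:

We provide a comprehensive review of LLM evaluation processes in the industry and identify the critical factors that can lead to unreliable evaluation results.
We introduce xFinder, a model specifically designed for key answer extraction. The KAF dataset supports its effective training and evaluation.
Through extensive experiments, we demonstrate that RegEx-based evaluation methods are unreliable, while our xFinder model significantly improves reliability.

The figure illustrates instances where evaluation frameworks such as LM Eval Harness and OpenCompass fail to extract key answers. Here, A/T/C/M denote tasks with alphabet / short text / categorical label / math options, respectively.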

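Quick Start

1. Create the benchmark dataset: To streamline evaluation with xFinder, we have standardized various mainstream benchmark datasets into a unified JSON format. For implementation details, see create_benchmark_dataset.py. If you want to evaluate your own dataset with xFinder, refer to our script template benchmark_dataset_template.py for format conversion guidance.
2. Prepare QA pairs and LLM outputs: Collect the LLM outputs you want to evaluate. Make sure your data includes the following elements:
   - Original question
   - Key answer type (options: alphabet, short text, categorical label, math)
   - LLM output
   - Standard answer range
3. Deploy the xFinder model: Choose one of the following models to deploy:
   - xFinder-qwen1505
   - xFinder-llama38it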
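
The elements listed in step 2 can be bundled into one record per sample. The following is a minimal sketch rather than xFinder's own loader: the field names mirror the keys used in the examples of this README, the `validate_record` helper is hypothetical, and the type identifiers other than `short_text` are assumed. The schema that xFinder actually expects should be checked against create_benchmark_dataset.py.

```python
import json

# Assumed identifiers for the four key answer types; only "short_text"
# appears verbatim in this README's code, the others follow the same pattern.
KEY_ANSWER_TYPES = {"alphabet_option", "short_text", "categorical_label", "math"}

# Field names mirror the keys used in this README's examples (an assumption,
# not a confirmed schema).
REQUIRED_FIELDS = ("question", "key_answer_type", "llm_output", "standard_answer_range")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one record (empty if it looks fine)."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in record]
    answer_type = record.get("key_answer_type")
    if answer_type is not None and answer_type not in KEY_ANSWER_TYPES:
        problems.append(f"unknown key_answer_type: {answer_type}")
    return problems


record = {
    "question": "What is the capital of France?",
    "key_answer_type": "short_text",
    "llm_output": "The capital of France is Paris.",
    "standard_answer_range": json.dumps(["Paris", "Lyon", "Marseille"]),
}

print(validate_record(record))
```

Running a check like this before evaluation catches missing fields early, before any model is loaded.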

After deploying the xFinder model, follow these steps to run an evaluation:

# Install xfinder
conda create -n xfinder_env python=3.10 -y
conda activate xfinder_env
pip install xfinder

# Perform an evaluation with xFinder (a built-in example)
CUDA_VISIBLE_DEVICES=0 python -m xfinder.eval --run-example --model-name xFinder-qwen1505 --inference-mode local --model-path-or-url /path/to/anonymized/model/xFinder-qwen1505

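xFinder supports two forms of evaluation

Batch evaluation of summarized experimental results

This method lets you evaluate multiple examples stored in a JSON file.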

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",   # Model name
    inference_mode="api",            # Inference mode, 'local' or 'api'
    model_path_or_url="http://your-anonymized-url/generate",  # Anonymized model path or URL
)

# Perform batch evaluation
data_path = "/path/to/your/data/example.json"  # User needs to provide their own data path
accuracy = evaluator.evaluate(data_path)

print(f"Batch evaluation accuracy: {accuracy}")

Single-instance evaluation mode

This method lets you evaluate individual examples and can be integrated into an LLM evaluation framework.

from xfinder import Evaluator

# Initialize Evaluator object
evaluator = Evaluator(
    model_name="xFinder-qwen1505",   # Model name
    inference_mode="local",          # Inference mode, 'local' or 'api'
    model_path_or_url="IAAR-Shanghai/xFinder-qwen1505",  # Anonymized model path or URL
)

# Define input for a single evaluation
question = "What is the capital of France?"
llm_output = "The capital of France is Paris."
standard_answer_range = '["Paris", "Lyon", "Marseille"]'  # List serialized as a string
key_answer_type = "short_text"
correct_answer = "Paris"

# Perform single example evaluation
result = evaluator.evaluate_single_example(
    question,
    llm_output,
    standard_answer_range,
    key_answer_type,
    correct_answer,
)
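
When single-instance evaluation is wired into an evaluation framework, the call is typically made once per sample and the outcomes are aggregated. The sketch below shows only that loop, using a stand-in judge. The names `Judge`, `accuracy_over`, and `exact_match_judge`, as well as the boolean return convention, are assumptions for illustration, since the return format of `evaluate_single_example` is not described here.

```python
from typing import Callable, Iterable

# A judge takes (question, llm_output, standard_answer_range, key_answer_type,
# correct_answer) and returns True when the output is judged correct.
Judge = Callable[[str, str, str, str, str], bool]


def accuracy_over(samples: Iterable[dict], judge: Judge) -> float:
    """Fraction of samples the judge marks correct (0.0 for no samples)."""
    total = correct = 0
    for s in samples:
        total += 1
        if judge(s["question"], s["llm_output"], s["standard_answer_range"],
                 s["key_answer_type"], s["correct_answer"]):
            correct += 1
    return correct / total if total else 0.0


def exact_match_judge(question, llm_output, standard_answer_range,
                      key_answer_type, correct_answer) -> bool:
    """Stand-in judge: naive substring check, NOT what xFinder does."""
    return correct_answer.lower() in llm_output.lower()


samples = [
    {"question": "What is the capital of France?",
     "llm_output": "The capital of France is Paris.",
     "standard_answer_range": '["Paris", "Lyon", "Marseille"]',
     "key_answer_type": "short_text",
     "correct_answer": "Paris"},
    {"question": "What is the capital of France?",
     "llm_output": "I believe it is Lyon.",
     "standard_answer_range": '["Paris", "Lyon", "Marseille"]',
     "key_answer_type": "short_text",
     "correct_answer": "Paris"},
]

print(accuracy_over(samples, exact_match_judge))
```

To use xFinder in this loop, one would wrap `evaluator.evaluate_single_example` in a function that converts its result into a boolean and pass that wrapper as `judge`.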

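Tips

For more detailed examples, see demo.ipynb.
If you cannot connect to Hugging Face, run export HF_ENDPOINT=https://hf-mirror.com to use the Chinese mirror.
xFinder currently supports loading through an API served by vllm.
We provide scripts for fine-tuning xFinder in xfinder_training.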
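
When xFinder is launched from a Python script or notebook rather than a shell, the mirror setting can also be applied from within Python. This is a small sketch under the assumption that the variable is set before any Hugging Face library is imported, since `huggingface_hub` reads `HF_ENDPOINT` when it is first loaded.

```python
import os

# Set the mirror endpoint before importing huggingface_hub, transformers,
# or xfinder, so that the libraries pick it up when they are first loaded.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

print(os.environ["HF_ENDPOINT"])
```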

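Examples: RegEx vs. xFinder

We show instances of four types of questions where RegEx fails to extract the answer or frequently extracts the wrong one, while xFinder extracts the key answer accurately.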

{
    "key_answer_type": "alphabet option",
    "question": "A man is seen playing guitar on a stage with others playing instruments behind him. The man grabs a guitar from the audience and begins playing both one after the other ...",
    "llm_output": "Option A is the correct choice as it describes ...",
    "standard_answer_range": "[['A', 'strums the guitar in the end, continues playing the guitar with the crowd following him as well as lining up next to him.'], ['B', 'continues playing the instruments and ends by waving to the crowd and walking off stage.'], ['C', 'then turns to the audience and gives a stuffed toy to the audience and continues playing.'], ['D', 'finally stops playing and moves his hands for the crowd to see.']]",
    "gold_label": "A",
    "xFinder_output": "A"
},
{
    "key_answer_type": "short text",
    "question": "If you really wanted a grape, where would you go to get it? Answer Choices: winery / fruit stand / field / kitchen / food",
    "llm_output": "The answer is winery / fruit stand / field / kitchen / food ...",
    "standard_answer_range": "[\"winery\", \"fruit stand\", \"field\", \"kitchen\", \"food\"]",
    "gold_label": "[No valid answer]",
    "xFinder_output": "[No valid answer]"
},
{
    "key_answer_type": "categorical label",
    "question": "How tall is the Sears Building ?",
    "llm_output": "The Sears Building is a specific structure, so the answer would be a Location ...",
    "standard_answer_range": "['Abbreviation', 'Entity', 'Description', 'Person', 'Location', 'Number']",
    "gold_label": "Location",
    "xFinder_output": "Location"
},
{
    "key_answer_type": "math",
    "question": "Mike made 69 dollars mowing lawns over the summer. If he spent 24 dollars buying new mower blades, how many 5 dollar games could he buy with the money he had left?",
    "llm_output": "To find out how many 5 dollar ... Let's calculate that:\n\n$45 / $5 = 9\n\nSo, Mike could buy 9 5 dollar games with the money he had left.",
    "standard_answer_range": "a(n) number / set / vector / matrix / interval / expression / function / equation / inequality",
    "gold_label": "9",
    "xFinder_output": "9"
}

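Extraction accuracy results

Baselines: OpenCompass, LM Eval Harness, UltraEval, GPT-4. Our methods: xFinder-qwen1505, xFinder-qwen1518, xFinder-gemma7, xFinder-chatglm36base, xFinder-llama38, xFinder-llama38it.

We evaluated their accuracy in extracting key answers from both the KAF test set and the generalization sets. The metric reported in the table is accuracy.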
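
Extraction accuracy here can be read as the share of samples whose extracted key answer equals the gold label. The sketch below computes it per key answer type, matching the A/T/C/M breakdown. It is an assumed formulation rather than the authors' evaluation script: `accuracy_by_type` is a hypothetical helper, and the last record is invented to show a mismatch, so the printed numbers are not reported results.

```python
from collections import defaultdict


def accuracy_by_type(records: list[dict], output_key: str = "xFinder_output") -> dict[str, float]:
    """Per key_answer_type share of records whose extraction equals the gold label."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for r in records:
        answer_type = r["key_answer_type"]
        totals[answer_type] += 1
        if r[output_key].strip() == r["gold_label"].strip():
            hits[answer_type] += 1
    return {t: hits[t] / totals[t] for t in totals}


records = [
    {"key_answer_type": "alphabet option", "gold_label": "A", "xFinder_output": "A"},
    {"key_answer_type": "short text", "gold_label": "[No valid answer]",
     "xFinder_output": "[No valid answer]"},
    {"key_answer_type": "categorical label", "gold_label": "Location",
     "xFinder_output": "Location"},
    {"key_answer_type": "math", "gold_label": "9", "xFinder_output": "9"},
    # Hypothetical mismatch, added only to make the breakdown non-trivial.
    {"key_answer_type": "math", "gold_label": "9", "xFinder_output": "45"},
]

print(accuracy_by_type(records))
```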

Citation

@article{xFinder,
      title={xFinder: Robust and Pinpoint Answer Extraction for Large Language Models},
      author={Qingchen Yu and Zifan Zheng and Shichao Song and Zhiyu Li and Feiyu Xiong and Bo Tang and Ding Chen},
      journal={arXiv preprint arXiv:2405.11874},
      year={2024},
}