DiSQ Score下载 - DiSQ Score源代码下载

话语苏格拉底式提问 (DiSQ) 的正式实施

我们论文的正式实施：话语苏格拉底式提问：评估语言模型对话语关系理解的忠实性 (2024) Yisong Miao , Hongfu Liu, Wenqiang Lei, Nancy F. Chen, Min-Yen Kan. ACL 2024。
论文PDF：https://yisong.me/publications/acl24-DiSQ-CR.pdf
幻灯片：https://yisong.me/publications/acl24-DiSQ-Slides.pdf
海报：https://yisong.me/publications/acl24-DiSQ-Poster.pdf

安装？

 git clone [email protected]:YisongMiao/DiSQ-Score.git
conda activate
cd DiSQ-Score
cd scripts
pip install -r requirements.txt

用一行命令评估一个模型???

您想知道任何语言模型的DiSQ Score吗？欢迎您使用这一行命令！

苏格拉底式摘要

我们提供了一个简化的命令来评估 HuggingFace 模型中心中托管的任何语言模型 (LM)。建议您将其用于任何新模型（尤其是我们论文中未研究的模型）。

 bash scripts/one_model.sh <modelurl>

< modelurl > 变量指定 Huggingface Hub 中的缩短路径，例如，

 bash scripts/one_model.sh meta-llama/Meta-Llama-3-8B

指定你的路径??️??️

在运行 bash 文件之前，请编辑 bash 文件以指定本地 HuggingFace 缓存的路径。
例如，在scripts/one_model.sh中：

 #!/bin/bash

# Please define your own path here
huggingface_path=YOUR_PATH

您可以将YOUR_PATH更改为 Huggingface 缓存的绝对目录位置（例如/disk1/yisong/hf-cache ）。
我们建议至少 200GB 可用空间。

输出文本文件将保存在data/results/verbalizations/Meta-Llama-3-8B.txt ，其中包含：

DiSQ Score: 0.206 Targeted Score: 0.345 Counterfactual Score: 0.722 Consistency: 0.827 DiSQ Score for Comparison.Concession: 0.188 DiSQ Score for Comparison.Contrast: 0.22 DiSQ Score for Contingency.Reason: 0.164 DiSQ Score for Contingency.Result: 0.177 DiSQ Score for Expansion.Conjunction: 0.261 DiSQ Score for Expansion.Equivalence: 0.221 DiSQ Score for Expansion.Instantiation: 0.191 DiSQ Score for Expansion.Level-of-detail: 0.195 DiSQ Score for Expansion.Substitution: 0.151 DiSQ Score for Temporal.Asynchronous: 0.312 DiSQ Score for Temporal.Synchronous: 0.084 === End of the results for model: Meta-Llama-3-8B === === The results for model: Meta-Llama-3-8B === Dataset: ted DiSQ Score : 0.233 Targeted Score: 0.605 Counterfactual Score: 0.489 Consistency: 0.787 DiSQ Score for Comparison.Concession: 0.237 DiSQ Score for Comparison.Contrast: 0.268 DiSQ Score for Contingency.Reason: 0.136 DiSQ Score for Contingency.Result: 0.211 DiSQ Score for Expansion.Conjunction: 0.268 DiSQ Score for Expansion.Equivalence: 0.205 DiSQ Score for Expansion.Instantiation: 0.194 DiSQ Score for Expansion.Level-of-detail: 0.222 DiSQ Score for Expansion.Substitution: 0.176 DiSQ Score for Temporal.Asynchronous: 0.156 DiSQ Score for Temporal.Synchronous: 0.164 === End of the results for model: Meta-Llama-3-8B ===">

 === The results for model: Meta-Llama-3-8B ===
Dataset: pdtb
DiSQ Score : 0.206
Targeted Score: 0.345
Counterfactual Score: 0.722
Consistency: 0.827
DiSQ Score for Comparison.Concession: 0.188
DiSQ Score for Comparison.Contrast: 0.22
DiSQ Score for Contingency.Reason: 0.164
DiSQ Score for Contingency.Result: 0.177
DiSQ Score for Expansion.Conjunction: 0.261
DiSQ Score for Expansion.Equivalence: 0.221
DiSQ Score for Expansion.Instantiation: 0.191
DiSQ Score for Expansion.Level-of-detail: 0.195
DiSQ Score for Expansion.Substitution: 0.151
DiSQ Score for Temporal.Asynchronous: 0.312
DiSQ Score for Temporal.Synchronous: 0.084
=== End of the results for model: Meta-Llama-3-8B ===
=== The results for model: Meta-Llama-3-8B ===
Dataset: ted
DiSQ Score : 0.233
Targeted Score: 0.605
Counterfactual Score: 0.489
Consistency: 0.787
DiSQ Score for Comparison.Concession: 0.237
DiSQ Score for Comparison.Contrast: 0.268
DiSQ Score for Contingency.Reason: 0.136
DiSQ Score for Contingency.Result: 0.211
DiSQ Score for Expansion.Conjunction: 0.268
DiSQ Score for Expansion.Equivalence: 0.205
DiSQ Score for Expansion.Instantiation: 0.194
DiSQ Score for Expansion.Level-of-detail: 0.222
DiSQ Score for Expansion.Substitution: 0.176
DiSQ Score for Temporal.Asynchronous: 0.156
DiSQ Score for Temporal.Synchronous: 0.164
=== End of the results for model: Meta-Llama-3-8B ===

一步一步走过来？？？

初步：数据集？？？

我们将数据集存储在位于data/datasets/dataset_pdtb.json和data/datasets/dataset_ted.json的 JSON 文件中。例如，让我们从 PDTB 数据集中获取一个实例：

 "2": {
        "Didx": 2,
        "arg1": "and special consultants are springing up to exploit the new tool",
        "arg2": "Blair Entertainment, has just formed a subsidiary -- 900 Blair -- to apply the technology to television",
        "DR": "Expansion.Instantiation.Arg2-as-instance",
        "Conn": "for instance",
        "events": [
            [
                "special consultants springing",
                "Blair Entertainment formed a subsidiary -- 900 Blair -- to apply the technology to television"
            ],
            [
                "special consultants exploit the new tool",
                "Blair Entertainment formed a subsidiary -- 900 Blair -- to apply the technology to television"
            ]
        ],
        "context": "Other long-distance carriers have also begun marketing enhanced 900 service, and special consultants are springing up to exploit the new tool. Blair Entertainment, a New York firm that advises TV stations and sells ads for them, has just formed a subsidiary -- 900 Blair -- to apply the technology to television.  "
    },

以下是该字典条目中的字段：

Didx ：会话 ID。
arg1和arg2 ：两个参数。
DR ：话语关系。
Conn ：话语连接词。
events ：对的列表，存储预测为显着信号的事件对。
context ：话语上下文。

步骤 1 问题生成??‍?

 cd DiSQ-Score
bash scripts/question_generation.sh

这个bash文件将调用question_generation.py来生成不同配置下的问题。

question_generation.py的参数如下：

--dataset ：指定数据集，可以是pdtb或ted 。
--modelname ：模型的别名已创建。 13b指 LLaMA2-13B， 13bchat指 LLaMA2-13B-Chat， vicuna-13b指 Vicuna-13B。这些模型的具体 URL 可以在disq_config.py中找到。
--version ：指定要使用的提示模板版本，选项为v1 、 v2 、 v3和v4 。
--paraphrase ：用其释义版本替换标准问题，并带有选项p1和p2 。与调用qa_utils.py的标准函数不同，释义函数分别调用qa_utils_p1.py和qa_utils_p2.py 。
--feature ：指定讨论问题使用哪些语言功能。语言特征包括conn （话语连接词）和context （话语上下文）。历史 QA 数据需要单独的脚本。

例如，输出将存储在配置dataset==pdtb和version==v1下的data/questions/dataset_pdtb_prompt_v1.json中。

我们要求用户自己生成问题，因为这种方法是自动的，有助于节省 GitHub 存储库的空间（最多可达约 200 MB）。如果您无法运行 bash 文件，请联系我们获取问题文件。

步骤 2 问答 ?

 cd DiSQ-Score
bash scripts/question_answering.sh

此 bash 文件将调用question_answering.py对任何给定模型执行话语苏格拉底提问 (DiSQ)。 question_answering.py获取question_generation.py中的所有参数，以及以下新参数：

--modelurl ：指定当前不在配置文件中的任何新模型的 URL。例如，“meta-llama/Meta-Llama-3-8B”指定 LLaMA3-8B 模型并将覆盖modelname参数。
--hf-path ：指定存储大模型参数的路径。建议至少有 200 GB 的可用磁盘空间。
--device_number ：指定要使用的 GPU 的 ID。

输出将存储在例如data/results/13bchat_dataset_pdtb_prompt_v1/处。每个问题的预测是一个标记及其概率的列表，存储在文件夹内的 pickle 文件中。

警告：向导模型已被开发人员删除。我们建议用户不要尝试这些型号。检查讨论线程：https://huggingface.co/posts/WizardLM/329547800484476。

步骤 3 评估和评分☑️？

 cd DiSQ-Score
bash scripts/eval.sh

该 bash 文件将调用eval.py来评估先前获得的模型预测。

eval.py采用与question_answering.py相同的参数集。

如果指定的数据集是 PDTB，则评估结果将存储在disq_score_pdtb.csv中。

CSV 文件中有 20 列，即：

taskcode ：指示正在测试的配置，例如dataset_pdtb_prompt_v1_13bchat 。
modelname ：指定正在测试的语言模型。
version ：指示提示的版本。
paraphrase ：释义参数。
feature ：指定已使用哪个功能。
Overall ：总体DiSQ Score 。
Targeted ：目标分数， DiSQ Score的三个组成部分之一。
Counterfactual ：反事实分数， DiSQ Score的三个组成部分之一。
Consistency ：一致性得分， DiSQ Score的三个组成部分之一。
Comparison.Concession ：此特定话语关系的DiSQ Score 。
...（其他话语关系）

请注意，我们选择版本 v1 到 v4 中的最佳结果，以边缘化提示模板的影响。

为此， eval.py自动提取最佳结果：

任务代码	型号名称	版本	全面的	有针对性	反事实	一致性	比较.让步	比较.对比	偶然性.原因	意外事件.结果	扩展.连接	展开式等价	扩展.实例化	扩展.详细程度	扩展.替代	时间异步	时间同步
dataset_pdtb_prompt_v4_7b	7b	v4	0.074	0.956	0.084	0.929	0.03	0.083	0.095	0.095	0.077	0.054	0.086	0.068	0.155	0.036	0.047
dataset_pdtb_prompt_v1_7bchat	7b聊天	v1	0.174	0.794	0.271	0.811	0.231	0.435	0.132	0.173	0.214	0.105	0.121	0.15	0.199	0.107	0.04
dataset_pdtb_prompt_v2_13b	13b	v2	0.097	0.945	0.112	0.912	0.037	0.099	0.081	0.094	0.126	0.101	0.113	0.107	0.077	0.083	0.093
dataset_pdtb_prompt_v1_13bchat	13b聊天	v1	0.253	0.592	0.545	0.785	0.195	0.485	0.129	0.173	0.289	0.155	0.326	0.373	0.285	0.194	0.028
dataset_pdtb_prompt_v2_vicuna-13b	vicuna-13b	v2	0.325	0.512	0.766	0.829	0.087	0.515	0.201	0.352	0.369	0.0	0.334	0.46	0.199	0.511	0.074

例如，此表显示了可用开源模型的 PDTB 数据集的最佳结果，它再现了我们论文中的雷达图：

模型在 DiSQ 上的整体表现，以雷达数据显示

讨论实验??

我们还提供了评估有关语言特征的讨论问题的说明：

要评估话语连接词和话语上下文，请在question_generation.py中指定--feature作为conn和context （步骤 1）并重新运行所有实验。
要评估历史 QA 数据，请运行question_generation_history.py 。该脚本将从存储的 QA 结果中提取答案并生成新问题。

环境？

遗留环境??️??️

对于大多数 NLPers 来说，您可能能够在现有的虚拟 (conda) 环境中运行我们的代码。

当我们进行实验时，软件包版本如下：

 torch==2.0.1
transformers==4.30.0
sentencepiece
protobuf
scikit-learn
pandas

现代环境??️??️

但是，我们观察到较新的型号需要升级的软件包版本：

 torch==2.4.0
transformers==4.43.3
sentencepiece
protobuf
scikit-learn
pandas

引文

如果您发现我们的工作有趣，非常欢迎您尝试我们的数据集/代码库。
如果您使用过我们的数据集/代码库，请引用我们的研究：

 @inproceedings{acl24discursive,
  title={Discursive Socratic Questioning: Evaluating the Faithfulness of Language Models' Understanding of Discourse Relations},
  author={Yisong Miao , Hongfu Liu, Wenqiang Lei, Nancy F. Chen, and Min-Yen Kan},
  booktitle={Proceedings of the Annual Meeting fof the Association of Computational Linguistics},
  month={August},
  year={2024},
  organization={ACL},
  address = "Bangkok, Thailand",
}

接触？？

如果您有疑问或错误报告，请提出问题或直接通过电子邮件联系我们：
电子邮件地址：?@?
其中 ?️= yisong , ?= comp.nus.edu.sg

执照？？

CC 4.0

展开

DiSQ Score

话语苏格拉底式提问 (DiSQ) 的正式实施

安装？

用一行命令评估一个模型???

指定你的路径??️??️

一步一步走过来？？？

初步：数据集？？？

步骤 1 问题生成??‍?

步骤 2 问答 ?

步骤 3 评估和评分☑️？

讨论实验??

环境？

遗留环境??️??️

现代环境??️??️

引文

接触？？

执照？？

GitHub sgrebnov/cordova plugin background download

Wa ch navra maza navsacha 2 2024 ull ovie Online For Fr e Strea ings At Home

Wa ch the greatest of all time 2024 ull ovie Online For Fr e Strea ings At Home

wolfs 2024 f llmo ie f lmyz lla dow load ree 7 0p 4 0p a d 10 0p

Score Hero apk游戏手机版

Game Score.软件

chat.petals.dev

GPT Prompt Templates

GPTyped

node telegram bot api

typebot.io

python wechaty getting started

waymo open dataset

termwind

wp functions

DiSQ Score

话语苏格拉底式提问 (DiSQ) 的正式实施

安装 ？

用一行命令评估一个模型???

指定你的路径??️??️

一步一步走过来？？？

初步：数据集？？？

步骤 1 问题生成??‍?

步骤 2 问答 ?

步骤 3 评估和评分☑️？

讨论实验??

环境 ？

遗留环境??️??️

现代环境??️??️

引文

接触 ？？

执照 ？？

安装？

环境？

接触？？

执照？？