ICE-Score: Instructing Large Language Models to Evaluate Code

January 2024 - ICE-Score has been accepted to EACL 2024.

Example
Environment Setup
Folder Description
Usage
Citation
Acknowledgement

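Example

Environment Setup

Our experiments are built mainly on the codegen-metrics and code-bert-score repositories. To replicate all experiments, follow their instructions to set up the environment.

To run compute_results.ipynb and the modules in the llm-code-eval folder, install all dependencies with the following command: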

pip install -r requirements.txt

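Folder Description

data/ contains all processed data used in the paper.
- data/conala/ contains the CoNaLa dataset with all automatic evaluation results.
- data/humaneval/ contains the HumanEval dataset with all automatic evaluation results.
  - data/humaneval/humaneval_java_grade.json: Java split
  - data/humaneval/humaneval_cpp_grade.json: C++ split
  - data/humaneval/humaneval_python_grade.json: Python split
  - data/humaneval/humaneval_js_grade.json: JavaScript split
experiment_source/ contains the scripts used to collect all automatic evaluation results. They need specific modifications before they will run on your machine. Note that any script that uses metrics_evaluation.metrics requires the implementation in the metrics_evaluation folder of codegen-metrics.
llm_code_eval contains the implementation of a minimum viable product (MVP) of this project. You can use it to evaluate any generated code snippet. See Use Large Language Models To Downstream Tasks Of Source Code for more details.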
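The layout of the JSON files under data/ is not described here, so it may help to inspect one before building on it. The following is a minimal sketch under that assumption: summarize_json is a hypothetical helper, not part of the repository, and it reports only the top-level type, the number of entries, and the keys of the first record when there is one.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Return a small summary of a JSON file without assuming its schema."""
    with Path(path).open(encoding="utf-8") as f:
        data = json.load(f)

    summary = {"type": type(data).__name__, "size": None, "first_keys": None}
    if isinstance(data, (list, dict)):
        summary["size"] = len(data)

    # Peek at the first record to see which fields it carries.
    if isinstance(data, list) and data:
        first = data[0]
    elif isinstance(data, dict) and data:
        first = next(iter(data.values()))
    else:
        first = None
    if isinstance(first, dict):
        summary["first_keys"] = sorted(first.keys())
    return summary


if __name__ == "__main__":
    # Path taken from the folder listing above; skipped if the file is absent.
    target = Path("data/humaneval/humaneval_python_grade.json")
    if target.exists():
        print(summarize_json(target))
```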

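Usage

We implemented a minimum viable product (MVP) for this project. To install it, use the following command:

pip install -e .

You can use it to evaluate any generated code snippet. The inputs are problem, output, task, aspect, and model, as in the following example: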

from llm_code_eval import evaluate

# problem is the task description; output is the generated code to grade.
score = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                 output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)
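To score several snippets in one pass, the call above can be wrapped in a small loop. The sketch below is one possible way to do that and is not part of llm_code_eval: score_all is a hypothetical helper that takes the evaluation function as a parameter, so it can be tried with a stub before spending API calls, and it records None for any snippet whose evaluation raises. The stub and its fixed score are made up for illustration.

```python
def score_all(samples, evaluate_fn, task="code-gen", aspect="usefulness",
              model="gpt-3.5-turbo"):
    """Score (problem, output) pairs with evaluate_fn; failed calls yield None."""
    scores = []
    for problem, output in samples:
        try:
            scores.append(evaluate_fn(problem=problem, output=output,
                                      task=task, aspect=aspect, model=model))
        except Exception as exc:  # e.g. a network or API error
            print(f"evaluation failed: {exc}")
            scores.append(None)
    return scores


def stub_evaluate(problem, output, task, aspect, model):
    """Stand-in for llm_code_eval.evaluate; returns a fixed, made-up score."""
    if not output:
        raise ValueError("empty output")
    return 3


samples = [
    ("Return the sum of a list.", "return sum(xs)"),
    ("Return the max of a list.", ""),  # empty output triggers the failure path
]
print(score_all(samples, stub_evaluate))
```

With the package installed, passing evaluate from llm_code_eval in place of stub_evaluate should work the same way, assuming it accepts these keyword arguments as in the example above.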

If you want to evaluate against reference code, use the reference option as in the following example:

from llm_code_eval import evaluate

# reference is the ground-truth code that the output is compared against.
score = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                 output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 reference="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                 task="code-gen", aspect="usefulness", model="gpt-3.5-turbo")

print(score)

You can also enable zero-shot chain-of-thought evaluation with the cot=True option, as in the following example:

from llm_code_eval import evaluate

# With cot=True, evaluate returns the score together with the evaluation steps.
score, eval_step = evaluate(problem="Given a list of integers, return the sum of all the integers.",
                            output="sum = 0\nfor i in range(len(list)):\n\tsum += list[i]\nreturn sum",
                            task="code-gen", aspect="usefulness", model="gpt-3.5-turbo", cot=True)

print(score)
print(eval_step)
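Going by the examples above, evaluate returns a score on its own by default and a (score, eval_step) pair when cot=True. Code that toggles cot may therefore want one uniform shape. A minimal sketch, assuming exactly those two return shapes; unpack_result is a hypothetical helper, not part of llm_code_eval:

```python
def unpack_result(result):
    """Return (score, steps); steps is None when no reasoning was returned."""
    if isinstance(result, tuple):
        score, steps = result
        return score, steps
    return result, None


# Made-up values standing in for real evaluate() results.
print(unpack_result(4))
print(unpack_result((4, "Step 1: the code sums the list.")))
```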

Citation

@inproceedings{zhuo2024ice,
  title={ICE-Score: Instructing Large Language Models to Evaluate Code},
  author={Zhuo, Terry Yue},
  booktitle={Findings of the Association for Computational Linguistics: EACL 2024},
  pages={2232--2242},
  year={2024}
}