PropertyExtractor: An Open-Source, LLM-Based Conversational Tool

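Introduction

The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. Ensuring that the extracted data is trustworthy, however, remains a significant challenge. PropertyExtractor is an open-source tool that leverages advanced conversational LLMs such as Google Gemini Pro and OpenAI GPT-4. It blends zero-shot with few-shot in-context learning and uses engineered prompts to dynamically refine structured information hierarchies. This enables autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data, from which a material property database is generated.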

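Features

Advanced LLM integration: supports Google Gemini Pro and OpenAI GPT-4.
Zero-shot and few-shot learning: blends both forms of in-context learning to improve extraction accuracy.
Engineered prompts: dynamic refinement of structured information hierarchies.
Autonomous extraction: efficient and scalable identification and extraction of material properties.
High precision and recall: precision and recall above 90%, with an error rate of about 10%.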
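
To make the blend of zero-shot and few-shot in-context learning concrete, here is a minimal sketch of how such a prompt might be assembled. The function name `build_prompt`, the instruction wording, and the example pair are illustrative assumptions; they are not PropertyExtractor's actual prompts.

```python
def build_prompt(text, prop, unit, examples=()):
    """Assemble an extraction prompt; with no examples it is zero-shot."""
    lines = [
        f"Extract the {prop} of each material mentioned in the text.",
        f"Report values in {unit}. Reply 'none' if no value is given.",
    ]
    # Few-shot part: each worked example shows the expected answer format.
    for example_text, example_answer in examples:
        lines.append(f"Text: {example_text}")
        lines.append(f"Answer: {example_answer}")
    # The passage to process always comes last, with an open answer slot.
    lines.append(f"Text: {text}")
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical example pair, invented for illustration only.
examples = [("The monolayer was measured to be 0.7 nm thick.", "monolayer: 7 Angstrom")]
zero_shot = build_prompt("Some passage.", "thickness", "Angstrom")
few_shot = build_prompt("Some passage.", "thickness", "Angstrom", examples)
```

The only difference between the two modes in this sketch is whether worked examples precede the passage, which is why a single function can serve both.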

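Installation

PropertyExtractor offers straightforward installation options suited to a range of user preferences, described below. In every option, all libraries and dependencies are resolved automatically and installed together with the PropertyExtractor executable "propertyextract".

Using pip: our recommended way to install the PropertyExtractor package is with pip.
- Install the latest version of the PropertyExtractor package quickly with pip by running: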
```
 pip install -U propertyextract
```
From source:
- Alternatively, download the source code with:
```
 git clone git@github.com:gmp007/PropertyExtractor.git
```
- Then change into the main directory and install PropertyExtractor by running:
```
 pip install .
```
Installation via setup.py:
- PropertyExtractor can also be installed with the setup.py script:
```
 python setup.py install [--prefix=/path/to/install/]
```
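- The optional --prefix argument is useful in environments such as shared high-performance computing (HPC) systems, where administrative privileges may be restricted.
- Although this method is still supported, its use is declining in favor of more modern installation practices. We recommend it only where standard methods such as pip are not applicable.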
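
Whichever option you choose, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The sketch below relies only on the standard library's `importlib.metadata`; it assumes the installed distribution is named `propertyextract`, as in the pip command above, and the helper name `installed_version` is hypothetical.

```python
from importlib import metadata


def installed_version(dist_name="propertyextract"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version:
    print(f"propertyextract version: {version}")
else:
    print("propertyextract is not installed in this environment")
```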

Usage

Configuration

Do not expose your API keys. Before running PropertyExtractor, configure the API keys for Google Gemini Pro and OpenAI GPT-4 as environment variables.

On Linux/macOS

 export GPT4_API_KEY='your_gpt4_api_key_here'
 export GEMINI_PRO_API_KEY='your_gemini_pro_api_key_here'

On Windows

In the Windows command prompt, set stores quotation marks as part of the value, so do not quote the keys.

 set GPT4_API_KEY=your_gpt4_api_key_here
 set GEMINI_PRO_API_KEY=your_gemini_pro_api_key_here
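
Before starting a run, you may want to confirm that both variables are actually visible to Python. The following is a minimal sketch that uses the two variable names given above; the helper name `missing_api_keys` is hypothetical, and the check treats both keys as required, following the instruction above. It reports only the names of missing variables and never prints a key value.

```python
import os

REQUIRED_KEYS = ("GPT4_API_KEY", "GEMINI_PRO_API_KEY")


def missing_api_keys(environ=os.environ):
    """Return the names of required keys that are unset or blank."""
    return [name for name in REQUIRED_KEYS if not environ.get(name, "").strip()]


missing = missing_api_keys()
if missing:
    print("Missing API keys:", ", ".join(missing))
else:
    print("Both API keys are set.")
```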

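Using and Running PropertyExtractor

PropertyExtractor is easy to run. The key steps to initialize PropertyExtractor are as follows:

Generate the unstructured data: use an API to obtain, from the publisher of your choice, the texts covering the material property for which you want to build a database. We have written API functions for Elsevier's ScienceDirect API, the CrossRef REST API, and the PubMed API, and can share some of them on request.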
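
The input template shown later in this section expects a csv/excel file and names the column to process (column_name = Text). As a minimal sketch of that hand-off, the code below writes already-fetched passages into a one-column CSV. The function name `write_input_csv`, the file name, and the placeholder passages are hypothetical; fetching from a publisher API is not shown.

```python
import csv


def write_input_csv(path, passages, column_name="Text"):
    """Write one passage per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column_name])
        for passage in passages:
            # Collapse whitespace so each passage stays on one logical line.
            writer.writerow([" ".join(passage.split())])


# Placeholder passages; in practice these come from the publisher API.
write_input_csv("my_abstracts.csv", ["First abstract text.", "Second\nabstract text."])
```

The header written here must match the column_name value in the input file, and the CSV file name must match inputfile_name.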

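Create a calculation directory:

First create a directory for your calculation.
Run propextract -0 to generate PropertyExtractor's main input template, extract.in. Modify it following the detailed instructions it contains.

Optional files are also generated: "additionalprompt.txt", for adding custom prompts, and "keywords.json", for custom additional keywords that supplement the main keyword. Modify them to suit the material property being extracted. The main input template "extract.in" looks like this: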

 ###############################################################################
 ### The input file to control the calculation details of PropertyExtract    ###
 ###############################################################################
 # Type of LLM model: gemini/chatgpt 
 model_type = gemini
 # LLM model name: gemini-pro/gpt-4
 model_name = gemini-pro
 # Property to extract from texts
 property = thickness
 # Harmonized unit for the property to be extracted
 property_unit = Angstrom
 # temperature to max_output_tokens are LLM model parameters
 temperature = 0.0
 top_p = 0.95
 max_output_tokens = 80
 # You can supply additional keywords to be used in conjunction with the property: modify the file keywords.json
 use_keywords = True
 # You can add additional custom prompts: modify the file additionalprompt.txt
 additional_prompts = additionalprompt.txt
 # Name of input file to be processed: csv/excel format
 inputfile_name = 2Dthickness_Elsevier.csv
 # Column name in the input file to be processed
 column_name = Text
 # Name of output file
 outputfile_name = ppt_test
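
The template is a plain list of key = value lines with # comments, so it is easy to sanity-check before a run. The sketch below reads that format into a dictionary; it is an illustration written for this guide, not PropertyExtractor's own parser, and the function name `parse_input` is hypothetical.

```python
def parse_input(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected 'key = value', got: {raw_line!r}")
        settings[key.strip()] = value.strip()
    return settings


# A few lines copied from the template above.
sample = """
# Property to extract from texts
property = thickness
property_unit = Angstrom
temperature = 0.0
max_output_tokens = 80
"""
settings = parse_input(sample)
temperature = float(settings["temperature"])
max_tokens = int(settings["max_output_tokens"])
```

All values come back as strings, so numeric parameters such as temperature and max_output_tokens are converted explicitly; a conversion error at this point flags a typo in the input file before any API call is made.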

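Initialize the job:
- Run propextract to start the calculation.
Understanding the PropertyExtractor options:
- The main input file extract.in contains descriptive text for each flag, which makes it user-friendly.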
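
If you prefer to drive these steps from a script, the two commands above (propextract -0, then propextract) can be wrapped with the standard library. This is a sketch under assumptions: the wrapper `run_tool` and the directory name `thickness_run` are hypothetical, and only the two command invocations come from the steps above.

```python
import shutil
import subprocess
from pathlib import Path


def run_tool(workdir, args=(), executable="propextract"):
    """Run the command-line tool inside workdir, creating it if needed."""
    exe_path = shutil.which(executable)
    if exe_path is None:
        raise FileNotFoundError(f"{executable!r} was not found on PATH")
    Path(workdir).mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if the tool exits with an error.
    return subprocess.run([exe_path, *args], cwd=workdir, check=True)


# Typical sequence (commented out because it needs the tool and API keys):
# run_tool("thickness_run", ["-0"])   # writes the extract.in template
# ... edit thickness_run/extract.in ...
# run_tool("thickness_run")           # starts the extraction
```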

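Citing PropertyExtractor

If you have used the PropertyExtractor package in your research, please cite:

Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction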

@article{Ekuma2024,
  title = {Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction},
  journal = {XXX},
  volume = {xx},
  pages = {xx},
  year = {xx},
  doi = {xx},
  url = {xx},
  author = {Chinedu Ekuma}
}

@misc{PropertyExtractor,
  author = {Chinedu Ekuma},
  title = {PropertyExtractor -- LLM-based model to extract material property from unstructured dataset},
  year = {2024},
  howpublished = {\url{https://github.com/gmp007/PropertyExtractor}},
  note = {Open-source tool leveraging LLMs like Google Gemini Pro and OpenAI GPT-4 for material property extraction},
}