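PropertyExtractor: An Open-Source Conversational LLM-Based Tool

Introduction

Natural language processing and large language models (LLMs) have transformed how data is extracted from unstructured scholarly papers. Ensuring that the extracted data is trustworthy, however, remains a significant challenge. PropertyExtractor is an open-source tool that uses advanced conversational LLMs such as Google Gemini Pro and OpenAI GPT-4. It blends zero-shot with few-shot in-context learning and uses engineered prompts to dynamically refine a structured information hierarchy. This enables autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data, from which a material property database is generated.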
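
To make the zero-shot versus few-shot distinction concrete, here is a minimal Python sketch of how an extraction prompt could be assembled with or without worked examples. The function name `build_prompt`, the prompt wording, and the placeholder passages are illustrative assumptions; they are not PropertyExtractor's actual engineered prompts.

```python
def build_prompt(text, prop, unit, examples=()):
    """Assemble an extraction prompt (hypothetical wording).

    With no examples this is a zero-shot prompt; each (passage, answer)
    pair in `examples` turns it into a few-shot prompt.
    """
    lines = [
        f"Extract the {prop} of each material mentioned in the passage.",
        f"Report values in {unit}. Answer 'none' if no value is given.",
    ]
    for passage, answer in examples:
        lines += [f"Passage: {passage}", f"Answer: {answer}"]
    # The passage to be processed always comes last, with an open answer slot.
    lines += [f"Passage: {text}", "Answer:"]
    return "\n".join(lines)


# Placeholder sentences, for illustration only (not real measurements).
target = "Material X is 5.0 Angstrom thick."
zero_shot = build_prompt(target, "thickness", "Angstrom")
few_shot = build_prompt(
    target,
    "thickness",
    "Angstrom",
    examples=[("Material Y has a thickness of 7.0 Angstrom.", "Material Y: 7.0")],
)
```

The only difference between the two prompts is the block of solved examples placed before the target passage, which is the sense in which few-shot learning happens "in context".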

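Features

Advanced LLM integration: Supports Google Gemini Pro and OpenAI GPT-4.
Zero-shot and few-shot learning: Blends both forms of in-context learning to improve extraction accuracy.
Engineered prompts: Dynamically refine the structured information hierarchy.
Autonomous extraction: Efficient and scalable identification and extraction of material properties.
High precision and recall: Precision and recall above 90%, with an error rate of about 10%.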
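
For readers unfamiliar with the two metrics quoted above, the following Python sketch shows the standard definitions of precision and recall. The function name and the counts are hypothetical and are not results produced by PropertyExtractor.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true-positive, false-positive
    and false-negative counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical counts, for illustration only.
p, r = precision_recall(tp=90, fp=10, fn=5)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision measures how many of the extracted values are correct, while recall measures how many of the values present in the texts were actually found.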

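Installation

PropertyExtractor offers straightforward installation options to suit different user preferences, as described below. With every option, all libraries and dependencies are resolved automatically and installed alongside the PropertyExtractor executable propextract.

Using pip: Our recommended way to install the PropertyExtractor package is with pip.
- Run the following command to install the latest version of the PropertyExtractor package: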
```
 pip install -U propertyextract
```
From source:
- Alternatively, download the source code with:
```
 git clone https://github.com/gmp007/PropertyExtractor.git
```
- Then change into the top-level directory and install PropertyExtractor by running:
```
 pip install .
```
Installation via setup.py:
- PropertyExtractor can also be installed with the setup.py script:
```
 python setup.py install [--prefix=/path/to/install/]
```
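- The optional --prefix argument is useful in environments such as shared high-performance computing (HPC) systems, where administrative privileges may be restricted.
- Note that although this method is still supported, it is gradually being phased out in favor of more modern installation practices. We recommend it only where standard methods such as pip are not applicable.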

Usage

Configuration

Do not expose your API keys. Before running PropertyExtractor, configure your API keys for Google Gemini Pro and OpenAI GPT-4 as environment variables.

On Linux/macOS

 export GPT4_API_KEY='your_gpt4_api_key_here'
 export GEMINI_PRO_API_KEY='your_gemini_pro_api_key_here'

On Windows (Command Prompt)

 set GPT4_API_KEY=your_gpt4_api_key_here
 set GEMINI_PRO_API_KEY=your_gemini_pro_api_key_here
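
Before starting a run, it can help to confirm that the variables are visible to Python. The sketch below is a small pre-flight check, not part of PropertyExtractor; the helper name `missing_api_keys` is hypothetical, and it assumes both keys are wanted even though a single run may only need the key for the selected model. It reports variable names only and never prints key values.

```python
import os

# Variable names taken from the configuration instructions above.
REQUIRED_KEYS = ("GPT4_API_KEY", "GEMINI_PRO_API_KEY")


def missing_api_keys(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_api_keys()
    if missing:
        print("Missing environment variables: " + ", ".join(missing))
    else:
        print("Both API keys are set.")
```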

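Using and running PropertyExtractor

PropertyExtractor is easy to run. The key steps for getting started are:

Generate the unstructured data: Use a publisher API of your choice to retrieve the texts for the material property you want to build a database for. We have written API functions for Elsevier's ScienceDirect API, the CrossRef REST API, and the PubMed API, and can share some of them on request.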
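
However the texts are retrieved, they need to end up in a CSV or Excel file with the text in a named column, since the extract.in template later in this document sets `inputfile_name` and `column_name` (`Text` in the template). The following Python sketch writes a list of passages to such a CSV file. The helper name `write_input_csv` is hypothetical, and a single-column layout is an assumption; the tool may accept additional columns.

```python
import csv


def write_input_csv(passages, path, column_name="Text"):
    """Write one passage per row under a single header column."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow([column_name])
        for passage in passages:
            # Collapse line breaks and repeated spaces so each
            # passage occupies a single physical line in the file.
            writer.writerow([" ".join(passage.split())])
```

For example, after calling `write_input_csv(abstracts, "abstracts.csv")` on a list of retrieved abstracts, set `inputfile_name = abstracts.csv` and `column_name = Text` in extract.in.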

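Create a calculation directory:

- Start by creating a directory for your calculations.
- Run propextract -0 to generate the main PropertyExtractor input template, extract.in. Modify it following the detailed instructions it contains.

Optional files are also generated: additionalprompt.txt, for adding custom prompts, and keywords.json, for custom additional keywords that support the main keyword. Modify them to suit the material property being extracted. The main input template extract.in looks like this: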

 ###############################################################################
 ### The input file to control the calculation details of PropertyExtract    ###
 ###############################################################################
 # Type of LLM model: gemini/chatgpt 
 model_type = gemini
 # LLM model name: gemini-pro/gpt-4
 model_name = gemini-pro
 # Property to extract from texts
 property = thickness
 # Harmonized unit for the property to be extracted
 property_unit = Angstrom
 # temperature to max_output_tokens are LLM model parameters
 temperature = 0.0
 top_p = 0.95
 max_output_tokens = 80
 # You can supply additional keywords to be used in conjunction with the property: modify the file keywords.json
 use_keywords = True
 # You can add additional custom prompts: modify the file additionalprompt.txt
 additional_prompts = additionalprompt.txt
 # Name of input file to be processed: csv/excel format
 inputfile_name = 2Dthickness_Elsevier.csv
 # Column name in the input file to be processed
 column_name = Text
 # Name of output file
 outputfile_name = ppt_test
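
Because the template is a plain list of `key = value` lines with `#` comments, it is easy to sanity-check an edited file before starting a run. The Python sketch below is a hypothetical stand-alone parser written for illustration; it is not the parser PropertyExtractor itself uses.

```python
def parse_extract_in(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:  # ignore lines that contain no '=' at all
            settings[key.strip()] = value.strip()
    return settings


# A shortened excerpt of the template shown above.
sample = """
# Type of LLM model: gemini/chatgpt
model_type = gemini
temperature = 0.0
max_output_tokens = 80
"""

settings = parse_extract_in(sample)
# Values are read as strings; convert the numeric ones explicitly.
temperature = float(settings["temperature"])
max_tokens = int(settings["max_output_tokens"])
```

Converting the numeric fields this way surfaces typos such as a stray character in `temperature` before any API call is made.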

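Initialize the job:
- Run propextract to start the calculation.
Understanding PropertyExtractor options:
- The main input file extract.in includes descriptive text for each flag, which makes it user-friendly.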
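
The directory setup and template generation steps above can also be scripted. The following Python sketch creates a working directory and calls `propextract -0` inside it. The helper name `generate_template` is hypothetical, and the sketch assumes that `propextract -0` runs non-interactively and writes extract.in into the current directory.

```python
import shutil
import subprocess
from pathlib import Path


def generate_template(workdir):
    """Create `workdir` and run `propextract -0` inside it.

    Returns the expected path of extract.in, or None when the
    propextract executable is not found on PATH.
    """
    workdir = Path(workdir)
    workdir.mkdir(parents=True, exist_ok=True)
    exe = shutil.which("propextract")
    if exe is None:
        return None
    # check=True raises CalledProcessError if the command fails.
    subprocess.run([exe, "-0"], cwd=workdir, check=True)
    return workdir / "extract.in"
```

After editing the generated extract.in, the run itself is started by calling propextract with no arguments from the same directory.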

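Citing PropertyExtractor

If you have used the PropertyExtractor package in your research, please cite:

Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction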

@article{Ekuma2024,
  title = {Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction},
  journal = {XXX},
  volume = {xx},
  pages = {xx},
  year = {xx},
  doi = {xx},
  url = {xx},
  author = {Chinedu Ekuma}
}

@misc{PropertyExtractor,
  author = {Chinedu Ekuma},
  title = {PropertyExtractor -- LLM-based model to extract material property from unstructured dataset},
  year = {2024},
  howpublished = {\url{https://github.com/gmp007/PropertyExtractor}},
  note = {Open-source tool leveraging LLMs like Google Gemini Pro and OpenAI GPT-4 for material property extraction},
}