The advent of natural language processing and large language models (LLMs) has revolutionized the extraction of data from unstructured scholarly papers. However, ensuring data trustworthiness remains a significant challenge. PropertyExtractor is an open-source tool that leverages advanced conversational LLMs such as Google Gemini Pro and OpenAI GPT-4, blends zero-shot with few-shot in-context learning, and employs engineered prompts for the dynamic refinement of structured information hierarchies. Together, these enable autonomous, efficient, scalable, and accurate identification, extraction, and verification of material property data to generate a material property database.
PropertyExtractor offers straightforward installation options suitable for various user preferences, as explained below. In every installation option, all required libraries and dependencies are automatically determined and installed alongside the PropertyExtractor executable "propextract".
Using pip: Our recommended way to install the PropertyExtractor package is using pip.
pip install -U propertyextract
From Source Code:
git clone git@github.com:gmp007/PropertyExtractor.git
cd PropertyExtractor
pip install .
Installation via setup.py: PropertyExtractor can also be installed using the setup.py script:
python setup.py install [--prefix=/path/to/install/]
The optional --prefix argument is useful for installations in environments like shared High-Performance Computing (HPC) systems, where administrative privileges might be restricted. This route is intended for cases where pip and similar methods are not applicable.
Please do not expose your API keys. Before running PropertyExtractor, configure the API keys for Google Gemini Pro and OpenAI GPT-4 as environment variables.
On Linux or macOS:
export GPT4_API_KEY='your_gpt4_api_key_here'
export GEMINI_PRO_API_KEY='your_gemini_pro_api_key_here'
On Windows (Command Prompt, where quotes would become part of the value):
set GPT4_API_KEY=your_gpt4_api_key_here
set GEMINI_PRO_API_KEY=your_gemini_pro_api_key_here
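As a quick sanity check before starting a long extraction run, the keys can be read back from the environment in Python. This is a minimal sketch, not part of PropertyExtractor: the helper names `load_api_keys` and `mask` are hypothetical, and only the two variable names come from the commands above. Whether PropertyExtractor needs both keys or only the one for the selected model is not stated here, so the list of names is left as a parameter.

```python
import os

# Variable names taken from the export/set commands in this README.
REQUIRED_KEYS = ("GPT4_API_KEY", "GEMINI_PRO_API_KEY")


def load_api_keys(names=REQUIRED_KEYS, environ=None):
    """Return {name: value} for each key; raise if any is missing or empty."""
    env = os.environ if environ is None else environ
    missing = [n for n in names if not env.get(n, "").strip()]
    if missing:
        raise RuntimeError(
            "Missing API key environment variables: " + ", ".join(missing)
        )
    return {n: env[n].strip() for n in names}


def mask(value, visible=4):
    """Hide all but the last few characters so a key can be logged safely."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    # Hypothetical usage: check only the key for the model you plan to use.
    keys = load_api_keys(names=("GEMINI_PRO_API_KEY",))
    for name, value in keys.items():
        print(name, "=", mask(value))
```

Printing only the masked form keeps the key out of terminal logs and job output files on shared systems.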
PropertyExtractor is easy to run. The key steps for initializing PropertyExtractor follow:
Unstructured data generation: Use a publisher API to retrieve, from the publishers of your choice, the texts on the material property for which you want to generate the database. We have written API functions for Elsevier's ScienceDirect API, the CrossRef REST API, and the PubMed API, and can share some of these on request.
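For readers who want to try this step on their own, the sketch below queries the public CrossRef REST API `works` endpoint and writes the returned abstracts to a CSV file. It is an illustration only and is not the authors' API function: the helper names, the `DOI` and `Title` columns, and the output file name are assumptions. The `Text` column is chosen to match the `column_name` setting in the input template. CrossRef abstracts, where present, are JATS XML strings and may need tag stripping before use.

```python
import csv
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS_URL = "https://api.crossref.org/works"


def build_query_url(query, rows=20):
    """Build a CrossRef works query URL for a free-text search."""
    params = urllib.parse.urlencode({"query": query, "rows": rows})
    return f"{CROSSREF_WORKS_URL}?{params}"


def records_from_response(payload):
    """Pull DOI, title, and abstract out of a CrossRef works response."""
    records = []
    for item in payload.get("message", {}).get("items", []):
        abstract = item.get("abstract")
        if not abstract:
            continue  # many CrossRef records carry no abstract
        title = (item.get("title") or [""])[0]
        records.append(
            {"DOI": item.get("DOI", ""), "Title": title, "Text": abstract}
        )
    return records


def fetch_records(query, rows=20, timeout=30):
    """Run the query over the network and return the parsed records."""
    url = build_query_url(query, rows)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return records_from_response(json.load(resp))


def write_csv(records, path):
    """Write records to a CSV file with a 'Text' column for extraction."""
    with open(path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=["DOI", "Title", "Text"])
        writer.writeheader()
        writer.writerows(records)


if __name__ == "__main__":
    # Hypothetical query and file name, for illustration only.
    write_csv(fetch_records("2D material thickness", rows=50), "crossref_texts.csv")
```

Separating URL construction and response parsing from the network call keeps the parsing logic testable offline.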
Create a Calculation Directory: Run
propextract -0
to generate the main input template of PropertyExtractor, which is extract.in. Modify it following the detailed instructions included. Two further files are also generated: `additionalprompt.txt' for supplying additional custom prompts and `keywords.json' for additional custom keywords that support the primary keyword. Modify them to suit the material property being extracted. The main input template `extract.in' looks like this:
###############################################################################
### The input file to control the calculation details of PropertyExtract ###
###############################################################################
# Type of LLM model: gemini/chatgpt
model_type = gemini
# LLM model name: gemini-pro/gpt-4
model_name = gemini-pro
# Property to extract from texts
property = thickness
# Harmonized unit for the property to be extracted
property_unit = Angstrom
# temperature to max_output_tokens are LLM model parameters
temperature = 0.0
top_p = 0.95
max_output_tokens = 80
# You can supply additional keywords to be used in conjunction with the property: modify the file keywords.json
use_keywords = True
# You can add additional custom prompts: modify the file additionalprompt.txt
additional_prompts = additionalprompt.txt
# Name of input file to be processed: csv/excel format
inputfile_name = 2Dthickness_Elsevier.csv
# Column name in the input file to be processed
column_name = Text
# Name of output file
outputfile_name = ppt_test
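The template uses a simple `key = value` layout with `#` comment lines. To inspect or validate a modified `extract.in` before launching a job, it can be read with a few lines of Python. This is a minimal sketch under that layout assumption, not PropertyExtractor's own parser: the function names are hypothetical, and the allowed `model_type` values are taken only from the comment in the template.

```python
def parse_extract_in(text):
    """Parse 'key = value' lines, skipping blank lines and '#' comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"Expected 'key = value', got: {raw!r}")
        settings[key.strip()] = value.strip()
    return settings


def typed_settings(settings):
    """Convert the numeric and boolean fields shown in the template."""
    out = dict(settings)
    for name in ("temperature", "top_p"):
        if name in out:
            out[name] = float(out[name])
    if "max_output_tokens" in out:
        out["max_output_tokens"] = int(out["max_output_tokens"])
    if "use_keywords" in out:
        out["use_keywords"] = out["use_keywords"].lower() == "true"
    # Allowed values as listed in the template comment (an assumption).
    if out.get("model_type") not in (None, "gemini", "chatgpt"):
        raise ValueError(f"Unexpected model_type: {out['model_type']!r}")
    return out


if __name__ == "__main__":
    with open("extract.in", encoding="utf-8") as fh:
        config = typed_settings(parse_extract_in(fh.read()))
    for key, value in config.items():
        print(f"{key:20s} {value!r}")
```

A check like this catches a mistyped number or model name before any API calls are made.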
Initialize the Job: Run
propextract
to begin the calculation process.
Understanding PropertyExtractor Options: The main input file extract.in includes descriptive text for each flag, making it user-friendly.
If you have used the PropertyExtractor package in your research, please cite:
@article{Ekuma2024,
title = {Dynamic In-context Learning with Conversational Models for Data Extraction and Materials Property Prediction},
journal = {XXX},
volume = {xx},
pages = {xx},
year = {xx},
doi = {xx},
url = {xx},
author = {Chinedu Ekuma}
}
@misc{PropertyExtractor,
author = {Chinedu Ekuma},
title = {PropertyExtractor -- LLM-based model to extract material property from unstructured dataset},
year = {2024},
howpublished = {\url{https://github.com/gmp007/PropertyExtractor}},
note = {Open-source tool leveraging LLMs like Google Gemini Pro and OpenAI GPT-4 for material property extraction},
}
If you have any questions or if you find a bug, please reach out to us.
Feel free to contact us via email:
Your feedback and questions are invaluable to us, and we look forward to hearing from you.
This project is licensed under the GNU GPL version 3 - see the LICENSE file for details.