Automated Filtering of Undesirable Web Data to Update LLM Knowledge
Created by Praneeth Vadlapati (@prane-eth)
Note
Please star the repository to show your support.
LLMs (Generative AI) such as ChatGPT do not have access to the latest information. One reason they are not updated automatically with new data is that much of the text on the web is unsafe or unwanted.
This project automatically collects web data and filters out unwanted text using AI and LLMs. The filtered data can then be used to update the knowledge of LLMs automatically.
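The collect-then-filter idea can be illustrated with a minimal sketch. This is not the code from the notebooks: the `Row` class, the `BLOCKLIST` phrases, and the `keyword_flagger` function are hypothetical stand-ins, and in the real project the flagging decision comes from AI models and LLMs rather than a keyword match.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Row:
    """One collected piece of web text (hypothetical structure)."""
    url: str
    text: str
    flagged: bool = False
    reason: str = ""


# Hypothetical phrases; a real filter would call a safety model or an LLM.
BLOCKLIST = ("buy now", "click here", "miracle cure")


def keyword_flagger(text: str) -> str:
    """Return a reason if the text looks unwanted, else an empty string."""
    lowered = text.lower()
    for phrase in BLOCKLIST:
        if phrase in lowered:
            return f"contains '{phrase}'"
    return ""


def flag_rows(rows: Iterable[Row], flagger: Callable[[str], str]) -> List[Row]:
    """Run the flagger on every row and record its decision and reason."""
    result = []
    for row in rows:
        reason = flagger(row.text)
        result.append(Row(row.url, row.text, bool(reason), reason))
    return result


def keep_clean(rows: Iterable[Row]) -> List[Row]:
    """Keep only the rows that were not flagged."""
    return [row for row in rows if not row.flagged]
```

Because `flag_rows` accepts any callable that maps text to a reason string, the keyword check could be swapped for a function that wraps an LLM call without changing the rest of the pipeline.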
Languages supported: Only English for now (more languages will be added when contributors are available)
A published research paper is available at JMCA/2024(3)E121
To reference the paper, please cite it as follows:
@article{vadlapati2024autopuredata,
title={{AutoPureData: Automated Filtering of Undesirable Web Data to Update LLM Knowledge}},
author={{Praneeth Vadlapati}},
journal={{Journal of Mathematical \& Computer Applications}},
volume={3},
number={4},
pages={1--4},
year={2024},
month={July},
doi={10.47363/JMCA/2024(3)E121},
issn={2754-6705}
}
pip install -r requirements.txt
cp .env.example .env
Now, edit the .env file and add your API keys.
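Before running the notebooks, it can help to confirm that no key in .env was left empty. The sketch below uses only the standard library and assumes simple `KEY=VALUE` lines. The key names in the sample are placeholders, since the actual names are listed in .env.example.

```python
from typing import Dict, Iterable, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(text: str, required: Iterable[str]) -> List[str]:
    """Return the required keys that are absent or have an empty value."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


# Placeholder key names for illustration only.
sample = "FIRST_API_KEY=abc123\nSECOND_API_KEY=\n"
print(missing_keys(sample, ["FIRST_API_KEY", "SECOND_API_KEY"]))
```

To check the real file, pass `pathlib.Path(".env").read_text()` as the first argument along with the key names from .env.example.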
Run the file Data_flagging.ipynb to collect and filter the latest web data.
Run the file Analytics_and_Filtering.ipynb to manually correct the flagging.
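Manual correction amounts to overriding automatic flags with a reviewer's decision. The following is a rough sketch of that idea, not the notebook's implementation. The row layout (dictionaries with `url` and `flagged` fields) and the `reviewed` marker are assumptions made for illustration.

```python
from typing import Dict, List


def apply_corrections(rows: List[dict], corrections: Dict[str, bool]) -> List[dict]:
    """Return copies of the rows with `flagged` overridden by reviewer decisions.

    `corrections` maps a row's URL to the flag value the reviewer chose.
    """
    corrected = []
    for row in rows:
        new_row = dict(row)  # copy so the automatic flags stay untouched
        if row["url"] in corrections:
            new_row["flagged"] = corrections[row["url"]]
            new_row["reviewed"] = True
        corrected.append(new_row)
    return corrected
```

Returning copies keeps the original automatic flags available, which makes it possible to compare them against the reviewed flags afterwards.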
After the filtering process, the data can be used with an LLM as shown in Usage_with_LLMs.ipynb.
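One common way to use filtered text with an LLM is to pass it as context in the prompt. The sketch below assumes that approach, which may differ from what the notebook does. The `build_prompt` function, its prompt wording, and the `max_chars` limit are all hypothetical.

```python
from typing import List


def build_prompt(question: str, snippets: List[str], max_chars: int = 2000) -> str:
    """Build a prompt that places filtered snippets before the question.

    Snippets are added in order until the character budget would be exceeded.
    """
    context_lines = []
    used = 0
    for snippet in snippets:
        if used + len(snippet) > max_chars:
            break
        context_lines.append(f"- {snippet}")
        used += len(snippet)
    context = "\n".join(context_lines)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
```

The resulting string can be sent to any chat or completion API. Only snippets that passed filtering should be supplied to it.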
For more projects, open the profile: @Pro-GenAI
Contributions are welcome! Feel free to create an issue for any bug reports or suggestions.
Please contribute to the code by adding more filters and making the code more efficient.
To contribute, star the repository and create an issue. If I cannot resolve it myself, I will open it up for anyone to submit a pull request.
Copyright (c) 2024 Praneeth Vadlapati
Please refer to the LICENSE file for more information.
The code is not intended for use in production environments. This code is for educational and research purposes only.
No author is responsible for any misuse or damage caused by this code. Use it at your own risk. The code is provided as is without any guarantees or warranty.
For personal queries, please find my contact details here: linktr.ee/prane.eth