This project is an AI-powered web scraper that allows you to extract information from HTML sources based on user-defined requirements. It generates scraping code and executes it to retrieve the desired data.
Before running the AI Web Scraper, ensure you have Python 3 and the packages listed in the requirements.txt file installed.

Clone the project repository:
git clone https://github.com/dirkjbreeuwer/gpt-automated-web-scraper
Navigate to the project directory:
cd gpt-automated-web-scraper
Install the required Python packages:
pip install -r requirements.txt
Set up the OpenAI GPT-4 API key:
Obtain an API key from OpenAI by following their documentation.
Rename the file called .env.example to .env in the project directory.

Add the following line to the .env file, replacing YOUR_API_KEY with your actual API key:
OPENAI_API_KEY=YOUR_API_KEY
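If you want to double-check that the key is being picked up, a small sketch like the one below can help. It assumes the key is loaded from .env via the python-dotenv package, which is an assumption about this project rather than something it documents, so treat it purely as a sanity check.

```python
# Sanity-check sketch: confirm the key in .env is visible to Python.
# Assumes the python-dotenv package is available; gpt-scraper.py itself
# may load the key differently.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

key = os.getenv("OPENAI_API_KEY")
if key:
    print(f"OPENAI_API_KEY loaded (ends in ...{key[-4:]})")
else:
    print("OPENAI_API_KEY not found -- check your .env file")
```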
To use the AI Web Scraper, run the gpt-scraper.py script with the desired command-line arguments.
The following command-line arguments are available:
- --source: The URL or local path to the HTML source to scrape.
- --source-type: Type of the source. Specify either "url" or "file".
- --requirements: User-defined requirements for scraping.
- --target-string: Due to the maximum token limit of GPT-4 (4k tokens), the AI model processes a smaller subset of the HTML where the desired data is located. The target string should be an example string that can be found within the website you want to scrape (see the illustrative sketch after the examples below).

Here are some example commands for using the AI Web Scraper:
python3 gpt-scraper.py --source-type "url" --source "https://www.scrapethissite.com/pages/forms/" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"
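If you have already saved the page as a local HTML file, a command along the following lines should work; the path below is only a placeholder for your own file:

python3 gpt-scraper.py --source-type "file" --source "./page.html" --requirements "Print a JSON file with all the information available for the Chicago Blackhawks" --target-string "Chicago Blackhawks"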
Replace the values for --source, --requirements, and --target-string with your specific values.
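To make the role of --target-string more concrete, the sketch below shows one plausible way to narrow a large HTML document to the region around the target string so it fits within the model's context window. It is illustrative only: the function extract_relevant_html and the 2000-character window are assumptions, not the documented behavior of gpt-scraper.py.

```python
# Illustrative only: one possible way to narrow a large HTML document to the
# region around a target string so it fits in a model's context window.
# extract_relevant_html and the 2000-character window are hypothetical.
def extract_relevant_html(html: str, target_string: str, window: int = 2000) -> str:
    """Return a slice of `html` centered on the first occurrence of `target_string`."""
    index = html.find(target_string)
    if index == -1:
        # Fall back to the start of the document if the target string is absent.
        return html[:window]
    start = max(0, index - window // 2)
    end = min(len(html), index + len(target_string) + window // 2)
    return html[start:end]


if __name__ == "__main__":
    sample = "<html><body>" + "x" * 5000 + "<td>Chicago Blackhawks</td>" + "y" * 5000 + "</body></html>"
    snippet = extract_relevant_html(sample, "Chicago Blackhawks")
    print(len(snippet), "characters kept around the target string")
```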
This project is licensed under the MIT License. Feel free to modify and use it according to your needs.