YouTube Data Harvesting and Warehousing using SQL, MongoDB, and Streamlit
https://iambitttu-youtube-data-harvesting-and-warehousing-ytub-k90wgj.streamlit.app/
Introduction
YouTube Data Harvesting and Warehousing is a project that aims to allow users to access and analyze data from multiple YouTube channels. The project utilizes SQL, MongoDB, and Streamlit to create a user-friendly application that allows users to retrieve, store, and query YouTube channel and video data.
Project Overview
The YouTube Data Harvesting and Warehousing project consists of the following components:
- Streamlit Application: A user-friendly UI built using Streamlit library, allowing users to interact with the application and perform data retrieval and analysis tasks.
- YouTube API Integration: Integration with the YouTube API to fetch channel and video data based on the provided channel ID.
- MongoDB Data Lake: Storage of the retrieved data in a MongoDB database, providing a flexible and scalable solution for storing unstructured and semi-structured data.
- SQL Data Warehouse: Migration of data from the data lake to an SQL database, allowing for efficient querying and analysis using SQL queries.
- Data Visualization: Presentation of retrieved data using Streamlit's data visualization features, enabling users to analyze the data through charts and graphs.
Technologies Used
The following technologies are used in this project:
- Python: The programming language used for building the application and scripting tasks.
- Streamlit: A Python library used for creating interactive web applications and data visualizations.
- YouTube API: Google API is used to retrieve channel and video data from YouTube.
- MongoDB: A NoSQL database used as a data lake for storing retrieved YouTube data.
- SQL (MySQL): A relational database used as a data warehouse for storing migrated YouTube data.
- SQLAlchemy: A Python library used for SQL database connectivity and interaction.
- Pandas: A data manipulation library used for data processing and analysis.
- Matplotlib: A data visualization library used for creating charts and graphs.
Installation and Setup
To run the YouTube Data Harvesting and Warehousing project, follow these steps:
- Install Python: Install the Python programming language on your machine.
- Install Required Libraries: Install the necessary Python libraries using pip or conda package manager. Required libraries include Streamlit, MongoDB driver, SQLAlchemy, Pandas, and Matplotlib.
- Set Up Google API: Set up a Google API project and obtain the necessary API credentials for accessing the YouTube API.
- Configure Database: Set up a MongoDB database and SQL database (MySQL) to store the data.
- Configure Application: Update the configuration file or environment variables with the necessary API credentials and database connection details.
- Run the Application: Launch the Streamlit application using the command-line interface.
Usage
Once the project is setup and running, users can access the Streamlit application through a web browser. The application will provide a user interface where users can perform the following actions:
- Enter a YouTube channel ID to retrieve data for that channel.
- Store the retrieved data in the MongoDB data lake.
- Collect and store data for multiple YouTube channels in the data lake.
- Select a channel and migrate its data from the data lake to the SQL data warehouse.
- Search and retrieve data from the SQL database using various search options.
- Perform data analysis and visualization using the provided features.
Features
The YouTube Data Harvesting and Warehousing application offers the following features:
- Retrieval of channel and video data from YouTube using the YouTube API.
- Storage of data in a MongoDB database as a data lake.
- Migration of data from the data lake to a SQL database for efficient querying and analysis.
- Search and retrieval of data from the SQL database using different search options, including joining tables.
- Data analysis and visualization through charts and graphs using Streamlit's data visualization capabilities.
- Support for handling multiple YouTube channels and managing their data.
Future Enhancements
Here are some potential future enhancements for the YouTube Data Harvesting and Warehousing project:
- Authentication and User Management: Implement user authentication and management functionality to secure access to the application.
- Scheduled Data Harvesting: Set up automated data harvesting for selected YouTube channels regularly.
- Advanced Search and Filtering: Enhance the search functionality to allow for more advanced search criteria and filtering options.
- Additional Data Sources: Extend the project to support data retrieval from other social media platforms or streaming services.
- Advanced-Data Analysis: Incorporate advanced analytics techniques and machine learning algorithms for deeper insights into YouTube data.
- Export and Reporting: Add features to export data and generate reports in various formats for further analysis and sharing.
Conclusion
The YouTube Data Harvesting and Warehousing project provides a powerful tool for retrieving, storing, and analyzing YouTube channel and video data. By leveraging SQL, MongoDB, and Streamlit, users can easily access and manipulate YouTube data in a user-friendly interface. The project offers flexibility, scalability, and data visualization capabilities, empowering users to gain insights from the vast amount of YouTube data available.
References
- Streamlit Documentation: https://docs.streamlit.io/
- YouTube API Documentation: https://developers.google.com/youtube
- MongoDB Documentation: https://docs.mongodb.com/
- SQLAlchemy Documentation: https://docs.sqlalchemy.org/
- Python Documentation: https://docs.python.org/
- Matplotlib Documentation: https://matplotlib.org/