Description: This is the amazing Google Gemini Vision Pro, a powerful tool that scans images, generates descriptions using the Gemini AI Pro Vision API, and provides speech feedback. It also captures images using the webcam.
Introduction
Google Gemini Vision Pro is a versatile application that combines image processing, speech recognition, and text-to-speech capabilities. With this application, you can capture images using your webcam, convert spoken words to text, generate image descriptions, and even have the descriptions spoken back to you.
Installation Guide
Step 1: Clone the repository
git clone https://github.com/haseeb-heaven/Gemini-Vision-Pro
cd Gemini-Vision-Pro
Step 2: Install the dependencies
pip install -r requirements.txt
Step 3: Run the application
Step 4: Obtain the Google Palm API key and set up the application
- Obtain the Google Palm API key:
- Visit the following URL: Google AI Studio
- Click the Create API Key button.
- The generated key is your API key. Copy it and paste it into the application settings.
- The application cannot function without the API key. Keep it safe and do not share it with anyone.
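One common way to keep the key out of source control is to read it from an environment variable instead of hard-coding it. The sketch below is not taken from this project: the variable name `GEMINI_API_KEY` and both helper functions are hypothetical, and the second helper shows how the key could be masked before it appears in a log.

```python
import os


def load_api_key(env_var: str = "GEMINI_API_KEY") -> str:
    """Read the API key from the environment (the variable name is an assumption)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key.")
    return key


def mask_key(key: str, visible: int = 4) -> str:
    """Hide all but the last few characters so the key is safer to log."""
    if len(key) <= visible:
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]
```

The value returned by `load_api_key()` could then be pasted into the application settings or handed to whichever client the application uses.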
Gemini AI settings:
AI Sections
The core AI sections of this project include:
- Webcam detection using WebRTC, OpenCV, and PIL
- Speech-to-text conversion using Google Cloud Speech-to-Text API
- Text-to-speech conversion using Google Cloud Text-to-Speech API
- Image processing using Gemini AI Pro Vision API
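As a rough illustration of the image description step, the sketch below sends a prompt and an image to a Gemini-style vision model and returns the text of the reply. It is a minimal sketch under assumptions, not this project's code: the function name `describe_image` is hypothetical, and the default path assumes the `google-generativeai` package and the `gemini-pro-vision` model name. Any object with a `generate_content` method can be passed as `model`, which keeps the function easy to try without a key.

```python
from typing import Any, Optional


def describe_image(image: Any, prompt: str = "Describe this image.",
                   model: Optional[Any] = None,
                   api_key: Optional[str] = None) -> str:
    """Return a text description of `image` from a Gemini-style vision model.

    `model` may be any object exposing generate_content(parts) and returning
    a response with a `.text` attribute. When it is omitted, a real client is
    created (assumption: google-generativeai and "gemini-pro-vision").
    """
    if model is None:
        # Imported lazily so the function can be exercised without the package.
        import google.generativeai as genai
        genai.configure(api_key=api_key)
        model = genai.GenerativeModel("gemini-pro-vision")
    response = model.generate_content([prompt, image])
    return response.text.strip()
```

In the application, `image` would typically be a PIL image built from a captured webcam frame.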
Features
- Webcam detection with real-time image capture
- Speech-to-text conversion for spoken words
- Text-to-speech for generating spoken descriptions
- Image processing using AI to provide detailed descriptions
- Logging using Python's logging module
- Error handling with Python's exception handling
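The last two features can be combined in a small wrapper that runs one step, logs the outcome, and returns a fallback value if the step raises. This is only a sketch of the general pattern using Python's standard `logging` module; the logger name `gemini_vision` and the helper `safe_call` are illustrative and not taken from the project.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger("gemini_vision")  # logger name is illustrative


def safe_call(step_name, func, *args, default=None, **kwargs):
    """Run one step, logging a failure instead of letting it crash the app."""
    try:
        result = func(*args, **kwargs)
        logger.info("%s succeeded", step_name)
        return result
    except Exception:
        # logger.exception records the message together with the traceback.
        logger.exception("%s failed", step_name)
        return default
```

A call such as `safe_call("describe", some_function, image, default="")` would then return an empty string when `some_function` raises, while the traceback still reaches the log.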
WebUI - Application Showcase
YouTube demo:
Webcam with live feed:
Gemini AI Vision demo with a cap as the object:
Gemini AI Vision demo with a hand:
Gemini AI Vision demo with a gesture:
Packages Used
This project relies on various Python packages, including:
- Streamlit - A web app framework used to build the application
- Streamlit Webrtc - Used for capturing images from the webcam
- OpenCV - Utilized for webcam image capture
- PIL (Pillow) - Used for image processing and conversion
- gTTS (Google Text-to-Speech) - Converts text to speech
- SpeechRecognition - Converts speech to text
- google.cloud.speech - Part of Google Cloud services for speech-to-text conversion
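To show how one of these packages is typically used, the sketch below saves a description as spoken audio with gTTS. It assumes the `gTTS` package is installed and that network access is available, since gTTS calls an online service. The function name `text_to_speech_file` and the default file name are hypothetical.

```python
def text_to_speech_file(text: str, out_path: str = "description.mp3",
                        lang: str = "en") -> str:
    """Save `text` as spoken audio with gTTS and return the output path."""
    if not text or not text.strip():
        raise ValueError("text must not be empty")
    # Imported lazily: requires the gTTS package and a network connection.
    from gtts import gTTS
    gTTS(text=text.strip(), lang=lang).save(out_path)
    return out_path
```

The resulting MP3 file can then be played back in the web UI, for example with Streamlit's audio widget.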
Links and References
Follow these links for Google Gemini Vision Pro related content:
- Google AI Studio
- Google Gemini Vision Pro
- Google Gemini Deepmind
Versioning
- Version 1.0: Initial Release
Contributing
We welcome contributions! Please follow our Contribution Guidelines to get started.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Author
- HeavenHM
- Date: 17-12-2023