Voice Chat AI is a project that lets you interact with different AI characters using speech. You can choose from a variety of characters, each with a unique personality and voice. Have a serious conversation with Albert Einstein, or role-play with the operating system from the movie Her.
You can run everything locally, use OpenAI for chat and voice, or mix the two. You can use ElevenLabs voices with Ollama models, all controlled from a Web UI. Let the AI view your screen and it will explain in detail what it is looking at.
Clone the repository:
git clone https://github.com/bigsk1/voice-chat-ai.git
cd voice-chat-ai
For the CPU-only version: clone the cpu-only branch https://github.com/bigsk1/voice-chat-ai/tree/cpu-only
Create a virtual environment:
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
Or use conda - just make it Python 3.10:
conda create --name voice-chat-ai python=3.10
conda activate voice-chat-ai
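If you want a quick programmatic check that the active interpreter matches the version the project targets (an illustrative helper, not part of the project):

```python
import sys

# Quick guard for the "make it Python 3.10" note above: compare the running
# interpreter against the version the project targets.
def check_python(required=(3, 10)):
    actual = sys.version_info[:2]
    ok = actual == required
    return ok, f"running {actual[0]}.{actual[1]}, want {required[0]}.{required[1]}"

ok, msg = check_python()
print(("OK: " if ok else "WARNING: ") + msg)
```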
Install dependencies:
Windows only: Microsoft C++ Build Tools 14.0 or greater is required on Windows for TTS - Microsoft Build Tools
For the GPU (CUDA) version (recommended):
Install CUDA-enabled PyTorch and the other dependencies:
pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 torchvision==0.18.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
For the CPU-only version: clone the cpu-only branch https://github.com/bigsk1/voice-chat-ai/tree/cpu-only
# For CPU-only installations, use:
pip install -r cpu_requirements.txt
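Either way, you can verify which PyTorch build ended up installed with a short check (illustrative helper; it degrades gracefully if torch is missing):

```python
import importlib.util

# Sanity check after installing: reports whether PyTorch is importable and
# whether a CUDA GPU is actually visible to it.
def torch_device_status():
    if importlib.util.find_spec("torch") is None:
        return "torch not installed"
    import torch  # imported lazily so the check also works on bare environments
    if torch.cuda.is_available():
        return f"cuda ({torch.cuda.get_device_name(0)})"
    return "cpu only"

print(torch_device_status())
```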
Make sure you have ffmpeg downloaded - on a Windows terminal ( winget install ffmpeg ) or check https://ffmpeg.org/download.html - then restart your shell or VS Code and type ffmpeg -version to confirm it installed correctly.
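The same ffmpeg check can be done from Python (an illustrative helper using only the standard library):

```python
import shutil
import subprocess

# Mirror the manual `ffmpeg -version` check: find ffmpeg on PATH and return
# the first line of its version banner, or None if it is missing.
def ffmpeg_on_path():
    exe = shutil.which("ffmpeg")
    if exe is None:
        return None
    out = subprocess.run([exe, "-version"], capture_output=True, text=True)
    return out.stdout.splitlines()[0] if out.returncode == 0 else None

print(ffmpeg_on_path() or "ffmpeg not found - install it and restart your shell")
```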
For local TTS you may also need cuDNN to use an NVIDIA GPU https://developer.nvidia.com/cudnn and make sure C:\Program Files\NVIDIA\CUDNN\v9.5\bin\12.6 is in your system PATH.
You need to download the checkpoints for the models used in this project. You can download them from the GitHub releases page and extract the zips into the project folder.
After downloading, place the folders as follows:
voice-chat-ai/
├── checkpoints/
│ ├── base_speakers/
│ │ ├── EN/
│ │ │ └── checkpoint.pth
│ │ ├── ZH/
│ │ │ └── checkpoint.pth
│ ├── converter/
│ │ └── checkpoint.pth
├── XTTS-v2/
│ ├── config.json
│ ├── other_xtts_files...
Alternatively, you can download the files and extract them directly into the project directory with the following commands:
# Navigate to the project directory
cd /path/to/your/voice-chat-ai
# Download and extract checkpoints.zip
wget https://github.com/bigsk1/voice-chat-ai/releases/download/models/checkpoints.zip
unzip checkpoints.zip -d .
# Download and extract XTTS-v2.zip
wget https://github.com/bigsk1/voice-chat-ai/releases/download/models/XTTS-v2.zip
unzip XTTS-v2.zip -d .
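After extracting, a quick check (a sketch; the expected paths mirror the layout shown above) that everything landed where the app expects:

```python
from pathlib import Path

# Files the project expects after extracting the two release zips,
# taken from the directory layout shown above.
EXPECTED = [
    "checkpoints/base_speakers/EN/checkpoint.pth",
    "checkpoints/base_speakers/ZH/checkpoint.pth",
    "checkpoints/converter/checkpoint.pth",
    "XTTS-v2/config.json",
]

# Return the expected files that are missing under the given project root.
def missing_model_files(root="."):
    root = Path(root)
    return [p for p in EXPECTED if not (root / p).is_file()]

print(missing_model_files() or "all model files present")
```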
This image is huge when built because of all the checkpoints, base image, build tools, and audio tools - 40GB - there may be a way to make it smaller; I haven't tried yet. It was just an experiment to see if I could get it working!
The docker run commands below let you use your microphone inside the docker container.
docker build -t voice-chat-ai .
Using WSL on Windows Docker Desktop - run from Windows:
wsl docker run -d --gpus all -e "PULSE_SERVER=/mnt/wslg/PulseServer" -v /mnt/wslg/:/mnt/wslg/ --env-file .env --name voice-chat-ai -p 8000:8000 voice-chat-ai:latest
Run from within WSL:
docker run -d --gpus all -e "PULSE_SERVER=/mnt/wslg/PulseServer" -v \\wsl$\Ubuntu\mnt\wslg:/mnt/wslg/ --env-file .env --name voice-chat-ai -p 8000:8000 voice-chat-ai:latest
There are also scripts in the docker folder for updating the model and TTS provider in the container, so you can switch from OpenAI to Ollama and back as needed, instead of exec-ing into the container and making the changes manually.
.env
Configure it with the necessary environment variables - the app is controlled by the variables you add.
# Conditional API Usage:
# Depending on the value of MODEL_PROVIDER, the corresponding service will be used when run.
# You can mix and match, use Ollama with OpenAI speech or use OpenAI chat model with local XTTS or xAI chat etc..
# Model Provider: openai or ollama or xai
MODEL_PROVIDER=ollama
# Character to use - Options: alien_scientist, anarchist, bigfoot, chatgpt, clumsyhero, conandoyle, conspiracy, cyberpunk,
# detective, dog, dream_weaver, einstein, elon_musk, fight_club, fress_trainer, ghost, granny, haunted_teddybear, insult, joker, morpheus,
# mouse, mumbler, nebula_barista, nerd, newscaster_1920s, paradox, pirate, revenge_deer, samantha, shakespeare, split, telemarketer,
# terminator, valleygirl, vampire, vegetarian_vampire, wizard, zombie_therapist, grok_xai
CHARACTER_NAME=pirate
# Text-to-Speech (TTS) Configuration:
# TTS Provider - Options: xtts (local, uses the custom character .wav), openai (uses OpenAI TTS voices), or elevenlabs
TTS_PROVIDER=elevenlabs
# OpenAI TTS Voice - Used when TTS_PROVIDER is set to openai above
# Voice options: alloy, echo, fable, onyx, nova, shimmer
OPENAI_TTS_VOICE=onyx
# ElevenLabs Configuration:
ELEVENLABS_API_KEY=your_api_key_here
# Default voice ID
ELEVENLABS_TTS_VOICE=pgCnBQgKPGkIP8fJuita
# XTTS Configuration:
# The voice speed for XTTS only (1.0 - 1.5, default is 1.1)
XTTS_SPEED=1.2
# OpenAI Configuration:
# OpenAI API Key for models and speech (replace with your actual API key)
OPENAI_API_KEY=your_api_key_here
# Models to use - OPTIONAL: For screen analysis, if MODEL_PROVIDER is ollama, llava will be used by default.
# Ensure you have llava downloaded with Ollama. If OpenAI is used, gpt-4o-mini works well. xAI is not supported for this yet; it falls back to OpenAI if xai is selected and you ask for screen analysis.
OPENAI_MODEL=gpt-4o-mini
# Endpoints:
# Set these once; they rarely need to change.
OPENAI_BASE_URL=https://api.openai.com/v1/chat/completions
OPENAI_TTS_URL=https://api.openai.com/v1/audio/speech
OLLAMA_BASE_URL=http://localhost:11434
# Models Configuration:
# Models to use - llama3.2 works well for local usage.
OLLAMA_MODEL=llama3.2
# xAI Configuration
XAI_MODEL=grok-beta
XAI_API_KEY=your_api_key_here
XAI_BASE_URL=https://api.x.ai/v1
# NOTES:
# List of trigger phrases to have the model view your desktop (desktop, browser, images, etc.).
# It will describe what it sees, and you can ask questions about it:
# "what's on my screen", "take a screenshot", "show me my screen", "analyze my screen",
# "what do you see on my screen", "screen capture", "screenshot"
# To stop the conversation, say "Quit", "Exit", or "Leave". (Ctrl+C always works too)
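As a sketch of the conditional provider selection the comments above describe (the dispatch helper and its defaults are illustrative, not the app's actual code):

```python
import os

# Read MODEL_PROVIDER / TTS_PROVIDER and pick the matching chat endpoint.
# Endpoint variable names and defaults mirror the .env sample above; the
# dispatch itself is a sketch, not the project's real implementation.
def resolve_backends(env=os.environ):
    model_provider = env.get("MODEL_PROVIDER", "ollama")
    tts_provider = env.get("TTS_PROVIDER", "xtts")
    chat_url = {
        "openai": env.get("OPENAI_BASE_URL", "https://api.openai.com/v1/chat/completions"),
        "ollama": env.get("OLLAMA_BASE_URL", "http://localhost:11434"),
        "xai": env.get("XAI_BASE_URL", "https://api.x.ai/v1"),
    }[model_provider]  # raises KeyError on an unknown provider
    return model_provider, tts_provider, chat_url
```

Because chat and TTS providers are resolved independently, mixing (e.g. Ollama chat with ElevenLabs speech) falls out naturally.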
Run the application:
Web UI
uvicorn app.main:app --host 0.0.0.0 --port 8000
Find it at http://localhost:8000/
CLI only
python cli.py
Add names and voice IDs in elevenlabs_voices.json - in the WebUI you can select them in the dropdown menu.
{
  "voices": [
    {
      "id": "2bk7ULW9HfwvcIbMWod0",
      "name": "Female - Bianca - City girl"
    },
    {
      "id": "JqseNhWbQb1GDNNS1Ga1",
      "name": "Female - Joanne - Pensive, introspective"
    },
    {
      "id": "b0uJ9TWzQss61d8f2OWX",
      "name": "Female - Lucy - Sweet and sensual"
    },
    {
      "id": "2pF3fJJNnWg1nDwUW5CW",
      "name": "Male - Eustis - Fast speaking"
    },
    {
      "id": "pgCnBQgKPGkIP8fJuita",
      "name": "Male - Jarvis - Tony Stark AI"
    },
    {
      "id": "kz8mB8WAwV9lZ0fuDqel",
      "name": "Male - Nigel - Mysterious intriguing"
    },
    {
      "id": "MMHtVLagjZxJ53v4Wj8o",
      "name": "Male - Paddington - British narrator"
    },
    {
      "id": "22FgtP4D63L7UXvnTmGf",
      "name": "Male - Wildebeest - Deep male voice"
    }
  ]
}
For the CLI, the voice ID in .env will be used.
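Loading that file is straightforward; a sketch of how the name → voice-ID mapping for the dropdown might be built (the function name is hypothetical):

```python
import json

# Read elevenlabs_voices.json and map display names to voice IDs -
# the shape the WebUI dropdown needs.
def load_voice_choices(path="elevenlabs_voices.json"):
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return {v["name"]: v["id"] for v in data["voices"]}
```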
Press start to begin talking. Take a break? Hit stop, then hit start again when ready. Press stop to change the character and voice in the dropdowns. You can also select the model provider and TTS provider you want in the dropdowns, and it will update and keep using the selected provider. Saying "Quit", "Leave", or "Exit" works like pressing stop.
http://localhost:8000/
Click the thumbnail to open the video ☝️
Add a new character folder to the project (e.g. character/wizard). Inside it add the character's prompt file (character/wizard/wizard.txt) and a mood prompts file (character/wizard/prompts.json).
wizard.txt
This is the prompt the AI uses to know who it is:
You are a wise and ancient wizard who speaks with a mystical and enchanting tone. You are knowledgeable about many subjects and always eager to share your wisdom.
prompts.json
This is used for sentiment analysis. Based on what you say, you can steer the AI to respond in a certain way: when you speak, the TextBlob analyzer scores it, the score is mapped to one of the moods shown below, and that mood's directive is passed to the AI with the follow-up response, explaining your mood and thereby steering the style of the reply.
{
  "joyful": "RESPOND WITH ENTHUSIASM AND WISDOM, LIKE A WISE OLD SAGE WHO IS HAPPY TO SHARE HIS KNOWLEDGE.",
  "sad": "RESPOND WITH EMPATHY AND COMFORT, LIKE A WISE OLD SAGE WHO UNDERSTANDS THE PAIN OF OTHERS.",
  "flirty": "RESPOND WITH A TOUCH OF MYSTERY AND CHARM, LIKE A WISE OLD SAGE WHO IS ALSO A BIT OF A ROGUE.",
  "angry": "RESPOND CALMLY AND WISELY, LIKE A WISE OLD SAGE WHO KNOWS THAT ANGER IS A PART OF LIFE.",
  "neutral": "KEEP RESPONSES SHORT AND NATURAL, LIKE A WISE OLD SAGE WHO IS ALWAYS READY TO HELP.",
  "fearful": "RESPOND WITH REASSURANCE, LIKE A WISE OLD SAGE WHO KNOWS THAT FEAR IS ONLY TEMPORARY.",
  "surprised": "RESPOND WITH AMAZEMENT AND CURIOSITY, LIKE A WISE OLD SAGE WHO IS ALWAYS EAGER TO LEARN.",
  "disgusted": "RESPOND WITH UNDERSTANDING AND COMFORT, LIKE A WISE OLD SAGE WHO KNOWS THAT DISGUST IS A PART OF LIFE."
}
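The scoring-to-mood step described above can be sketched as follows; the polarity thresholds are assumptions for illustration (TextBlob's sentiment polarity is a float in [-1.0, 1.0]), not the app's exact cutoffs:

```python
# Bucket a TextBlob sentiment polarity score into a mood key, then look up
# the matching directive from prompts.json. Thresholds are illustrative.
def mood_from_polarity(polarity):
    if polarity > 0.5:
        return "joyful"
    if polarity > 0.1:
        return "flirty"
    if polarity < -0.5:
        return "angry"
    if polarity < -0.1:
        return "sad"
    return "neutral"

# Select the directive to prepend to the AI's next response.
def mood_directive(polarity, prompts):
    return prompts.get(mood_from_polarity(polarity), prompts["neutral"])
```

In the real app the polarity would come from something like `TextBlob(user_text).sentiment.polarity`.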
For XTTS, find a .wav voice, add it to the wizard folder, and name it wizard.wav. The voice only needs to be about 6 seconds long. Running the app automatically finds the .wav matching the character name and uses it. A .wav is not needed if you are only using OpenAI Speech or ElevenLabs.
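A minimal sketch of that automatic lookup (the folder layout follows the character/wizard example above; the function is illustrative):

```python
from pathlib import Path

# XTTS uses a reference .wav named after the character,
# e.g. character/wizard/wizard.wav. Return it if present, else None.
def find_character_wav(character, root="character"):
    wav = Path(root) / character / f"{character}.wav"
    return wav if wav.is_file() else None
```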
If you see errors like the following when using local TTS with an NVIDIA GPU:
Could not locate cudnn_ops64_9.dll. Please make sure it is in your library path!
Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor
To resolve this:
Install cuDNN: download cuDNN from the NVIDIA cuDNN page https://developer.nvidia.com/cudnn
Then add it to your PATH:
Open System Environment Variables: press Win + R, type sysdm.cpl, and press Enter. Go to the Advanced tab and click Environment Variables.
Edit the system Path variable: in the System variables section, find the Path variable, select it, and click Edit. Click New and add the path to the bin directory that contains cudnn_ops64_9.dll. Depending on your setup, you would add:
C:\Program Files\NVIDIA\CUDNN\v9.5\bin\12.6
Apply and restart: click OK to close all dialogs, then restart your terminal (or any running applications) to apply the changes.
Verify the change: open a new terminal and run
where cudnn_ops64_9.dll
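For a cross-check from Python, a small sketch that mirrors what `where` does (an illustrative helper, not part of the project):

```python
import os

# Scan each PATH entry for the file and report every directory containing it,
# just like `where cudnn_ops64_9.dll` on Windows.
def find_on_path(filename, path_var=None):
    raw = path_var if path_var is not None else os.environ.get("PATH", "")
    return [d for d in raw.split(os.pathsep)
            if d and os.path.isfile(os.path.join(d, filename))]

print(find_on_path("cudnn_ops64_9.dll") or "not found on PATH")
```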
File " C:Userssomeguyminiconda3envsvoice-chat-ailibsite-packagespyaudio__init__.py " , line 441, in __init__
self._stream = pa.open( ** arguments)
OSError: [Errno -9999] Unanticipated host error
Make sure ffmpeg is installed and added to PATH - on a Windows terminal ( winget install ffmpeg ). Also make sure your Windows microphone privacy settings allow access and that the microphone is set as the default device. I ran into this issue with Bluetooth Apple AirPods, and this resolved it.
Click the thumbnail to open the video ☝️
CLI
GPU - 100% local - ollama llama3, xtts-v2
Click the thumbnail to open the video ☝️
CPU-only mode CLI
An alien conversation using OpenAI gpt-4o and OpenAI speech for TTS.
Click the thumbnail to open the video ☝️
Detailed terminal output while running the app.
When using ElevenLabs, on first starting the server you get details about your usage limits, to help you know how much you have used.
(voice-chat-ai) X:\voice-chat-ai> uvicorn app.main:app --host 0.0.0.0 --port 8000
Switched to ElevenLabs TTS voice: VgPqCpkdPQacBNNIsAqI
ElevenLabs Character Usage: 33796 / 100027
Using device: cuda
Model provider: openai
Model: gpt-4o
Character: Nerd
Text-to-Speech provider: elevenlabs
To stop chatting say Quit, Leave or Exit. Say, what's on my screen, to have AI view screen. One moment please loading...
INFO: Started server process [12752]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:62671 - "GET / HTTP/1.1" 200 OK
INFO: 127.0.0.1:62671 - "GET /app/static/css/styles.css HTTP/1.1" 200 OK
INFO: 127.0.0.1:62672 - "GET /app/static/js/scripts.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:62672 - "GET /characters HTTP/1.1" 200 OK
INFO: 127.0.0.1:62671 - "GET /app/static/favicon.ico HTTP/1.1" 200 OK
INFO: 127.0.0.1:62673 - "GET /elevenlabs_voices HTTP/1.1" 200 OK
INFO: ( ' 127.0.0.1 ' , 62674) - "WebSocket /ws" [accepted]
INFO: connection open
This project is licensed under the MIT License.