Voice Chat AI is a project that lets you interact with different AI characters using speech. You can choose from a variety of characters, each with a unique personality and voice: have a serious conversation with Albert Einstein, or role-play with the OS from the movie Her.
You can run everything locally, use OpenAI for both chat and speech, or mix and match the two. You can pair ElevenLabs voices with Ollama models, all controlled through a Web UI. Let the AI view your screen and it will explain in detail what it is looking at.
Clone the repository:
git clone https://github.com/bigsk1/voice-chat-ai.git
cd voice-chat-ai
For the CPU-only version, clone the cpu-only branch: https://github.com/bigsk1/voice-chat-ai/tree/cpu-only
Create a virtual environment:
python -m venv venv
source venv/bin/activate   # On Windows use `venv\Scripts\activate`
Or use conda; just make it Python 3.10:
conda create --name voice-chat-ai python=3.10
conda activate voice-chat-ai
Install dependencies:
Windows only: you need Microsoft C++ Build Tools 14.0 or greater installed on Windows for TTS (Microsoft Build Tools).
For the GPU (CUDA) version (recommended), install CUDA-enabled PyTorch and the other dependencies:
pip install torch==2.3.1+cu121 torchaudio==2.3.1+cu121 torchvision==0.18.1+cu121 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
For the CPU-only version: clone the cpu-only branch https://github.com/bigsk1/voice-chat-ai/tree/cpu-only
# For CPU-only installations, use:
pip install -r cpu_requirements.txt
Make sure you have ffmpeg installed. In a Windows terminal run `winget install ffmpeg`, or check https://ffmpeg.org/download.html, then restart your shell or VS Code and run `ffmpeg -version` to confirm it installed correctly.
For local TTS you may also need cuDNN to use an NVIDIA GPU (https://developer.nvidia.com/cudnn), and make sure `C:\Program Files\NVIDIA\CUDNN\v9.5\bin\12.6` is in the system PATH.
You need to download the checkpoints for the models used in this project. You can download them from the GitHub releases page and extract the zips into the project folder.
After downloading, place the folders as follows:
voice-chat-ai/
├── checkpoints/
│ ├── base_speakers/
│ │ ├── EN/
│ │ │ └── checkpoint.pth
│ │ ├── ZH/
│ │ │ └── checkpoint.pth
│ ├── converter/
│ │ └── checkpoint.pth
├── XTTS-v2/
│ ├── config.json
│ ├── other_xtts_files...
You can download the files and extract them directly into the project directory with:
# Navigate to the project directory
cd /path/to/your/voice-chat-ai
# Download and extract checkpoints.zip
wget https://github.com/bigsk1/voice-chat-ai/releases/download/models/checkpoints.zip
unzip checkpoints.zip -d .
# Download and extract XTTS-v2.zip
wget https://github.com/bigsk1/voice-chat-ai/releases/download/models/XTTS-v2.zip
unzip XTTS-v2.zip -d .
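After extracting, a quick sanity check can confirm the files landed where the tree above shows. This is a hypothetical helper (the paths come from this README; it is not part of the project):

```python
from pathlib import Path

# Model files this README expects after extracting both zips.
EXPECTED = [
    "checkpoints/base_speakers/EN/checkpoint.pth",
    "checkpoints/base_speakers/ZH/checkpoint.pth",
    "checkpoints/converter/checkpoint.pth",
    "XTTS-v2/config.json",
]

def missing_model_files(project_root: str = ".") -> list:
    """Return the expected model files that are not present yet."""
    root = Path(project_root)
    return [p for p in EXPECTED if not (root / p).exists()]
```

If the returned list is empty, the checkpoints are in place.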
This image is huge when built because of all the checkpoints, the base image, build tools, and audio tools (about 40 GB). There may be a way to make it smaller; I haven't tried, it was just an experiment to see if I could get it working!
The docker run commands below let you use the microphone inside the Docker container.
docker build -t voice-chat-ai .
Using WSL with Docker Desktop on Windows, run from Windows:
wsl docker run -d --gpus all -e "PULSE_SERVER=/mnt/wslg/PulseServer" -v /mnt/wslg/:/mnt/wslg/ --env-file .env --name voice-chat-ai -p 8000:8000 voice-chat-ai:latest
Run from within WSL:
docker run -d --gpus all -e "PULSE_SERVER=/mnt/wslg/PulseServer" -v \\wsl$\Ubuntu\mnt\wslg:/mnt/wslg/ --env-file .env --name voice-chat-ai -p 8000:8000 voice-chat-ai:latest
There are also scripts in the docker folder for pushing model and TTS provider updates into the container, so you can switch from OpenAI to Ollama and back as needed instead of exec'ing into the container and making the changes manually.
Configure a `.env` file with the necessary environment variables; the application is controlled by the variables you add:
# Conditional API Usage:
# Depending on the value of MODEL_PROVIDER, the corresponding service will be used when run.
# You can mix and match, use Ollama with OpenAI speech or use OpenAI chat model with local XTTS or xAI chat etc..
# Model Provider: openai or ollama or xai
MODEL_PROVIDER = ollama
# Character to use - Options: alien_scientist, anarchist, bigfoot, chatgpt, clumsyhero, conandoyle, conspiracy, cyberpunk,
# detective, dog, dream_weaver, einstein, elon_musk, fight_club, fress_trainer, ghost, granny, haunted_teddybear, insult, joker, morpheus,
# mouse, mumbler, nebula_barista, nerd, newscaster_1920s, paradox, pirate, revenge_deer, samantha, shakespeare, split, telemarketer,
# terminator, valleygirl, vampire, vegetarian_vampire, wizard, zombie_therapist, grok_xai
CHARACTER_NAME = pirate
# Text-to-Speech (TTS) Configuration:
# TTS Provider - Options: xtts (local uses the custom character .wav) or openai (uses OpenAI TTS voice) or elevenlabs
TTS_PROVIDER = elevenlabs
# OpenAI TTS Voice - Used when TTS_PROVIDER is set to openai above
# Voice options: alloy, echo, fable, onyx, nova, shimmer
OPENAI_TTS_VOICE = onyx
# ElevenLabs Configuration:
ELEVENLABS_API_KEY = your_api_key_here
# Default voice ID
ELEVENLABS_TTS_VOICE = pgCnBQgKPGkIP8fJuita
# XTTS Configuration:
# The voice speed for XTTS only (1.0 - 1.5, default is 1.1)
XTTS_SPEED = 1.2
# OpenAI Configuration:
# OpenAI API Key for models and speech (replace with your actual API key)
OPENAI_API_KEY = your_api_key_here
# Models to use - OPTIONAL: For screen analysis, if MODEL_PROVIDER is ollama, llava will be used by default.
# Ensure you have llava downloaded with Ollama. If OpenAI is used, gpt-4o-mini works well. xai is not supported yet; it falls back to OpenAI if xai is selected and you ask for screen analysis.
OPENAI_MODEL = gpt-4o-mini
# Endpoints:
# Set these below and no need to change often
OPENAI_BASE_URL = https://api.openai.com/v1/chat/completions
OPENAI_TTS_URL = https://api.openai.com/v1/audio/speech
OLLAMA_BASE_URL = http://localhost:11434
# Models Configuration:
# Models to use - llama3.2 works well for local usage.
OLLAMA_MODEL = llama3.2
# xAI Configuration
XAI_MODEL = grok-beta
XAI_API_KEY = your_api_key_here
XAI_BASE_URL = https://api.x.ai/v1
# NOTES:
# List of trigger phrases to have the model view your desktop (desktop, browser, images, etc.).
# It will describe what it sees, and you can ask questions about it:
# "what's on my screen", "take a screenshot", "show me my screen", "analyze my screen",
# "what do you see on my screen", "screen capture", "screenshot"
# To stop the conversation, say "Quit", "Exit", or "Leave". ( ctl+c always works also)
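The conditional provider switching described in the comments above could be read from these variables roughly like this (an illustrative sketch, not the project's actual code):

```python
import os

def resolve_base_url() -> str:
    """Pick the chat endpoint based on MODEL_PROVIDER (openai, ollama, or xai),
    falling back to the default endpoints listed in the .env sample above."""
    provider = os.getenv("MODEL_PROVIDER", "openai").strip().lower()
    if provider == "ollama":
        return os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
    if provider == "xai":
        return os.getenv("XAI_BASE_URL", "https://api.x.ai/v1")
    return os.getenv("OPENAI_BASE_URL", "https://api.openai.com/v1/chat/completions")
```

Because chat and TTS providers are resolved independently, this is what makes mixing (e.g. Ollama chat with OpenAI speech) possible.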
Run the application:
Web UI
uvicorn app.main:app --host 0.0.0.0 --port 8000
Find it at http://localhost:8000/
CLI only
python cli.py
Add names and voice IDs in `elevenlabs_voices.json`; in the Web UI you can select them from the dropdown.
{
  "voices": [
    {
      "id": "2bk7ULW9HfwvcIbMWod0",
      "name": "Female - Bianca - City girl"
    },
    {
      "id": "JqseNhWbQb1GDNNS1Ga1",
      "name": "Female - Joanne - Pensive, introspective"
    },
    {
      "id": "b0uJ9TWzQss61d8f2OWX",
      "name": "Female - Lucy - Sweet and sensual"
    },
    {
      "id": "2pF3fJJNnWg1nDwUW5CW",
      "name": "Male - Eustis - Fast speaking"
    },
    {
      "id": "pgCnBQgKPGkIP8fJuita",
      "name": "Male - Jarvis - Tony Stark AI"
    },
    {
      "id": "kz8mB8WAwV9lZ0fuDqel",
      "name": "Male - Nigel - Mysterious intriguing"
    },
    {
      "id": "MMHtVLagjZxJ53v4Wj8o",
      "name": "Male - Paddington - British narrator"
    },
    {
      "id": "22FgtP4D63L7UXvnTmGf",
      "name": "Male - Wildebeest - Deep male voice"
    }
  ]
}
For the CLI, the voice ID in `.env` will be used.
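As an illustration of how the entries above map display names to voice IDs, a small helper could load the file like this (hypothetical; the project's actual loading code may differ):

```python
import json

def load_voice_menu(path: str = "elevenlabs_voices.json") -> dict:
    """Map display names to ElevenLabs voice IDs, in the shape the
    Web UI dropdown needs (name shown to the user, ID sent to the API)."""
    with open(path) as f:
        data = json.load(f)
    return {v["name"]: v["id"] for v in data["voices"]}
```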
Press start to begin talking. To take a break, click stop, then click start again when you're ready. Press stop to change the character and voice in the dropdowns. You can also select the model provider and TTS provider you want in the dropdowns, and the app will update and continue with the selected provider. Saying Quit, Leave, or Exit works the same as pressing stop.
http://localhost:8000/
Click the thumbnail to open the video ☝️
Each character has its own folder (e.g. character/wizard), containing the character's prompt (character/wizard/wizard.txt) and mood prompts (character/wizard/prompts.json).
wizard.txt
This is the prompt the AI uses to know who it is:
You are a wise and ancient wizard who speaks with a mystical and enchanting tone. You are knowledgeable about many subjects and always eager to share your wisdom.
prompts.json
This is used for sentiment analysis. Based on what you say, you can steer the AI to react in a certain way. As you speak, a TextBlob analyzer produces a score; that score is matched to one of the moods shown below, which is passed to the AI in the follow-up response, telling it your mood and thereby guiding it to reply in a certain style.
{
  "joyful": "RESPOND WITH ENTHUSIASM AND WISDOM, LIKE A WISE OLD SAGE WHO IS HAPPY TO SHARE HIS KNOWLEDGE.",
  "sad": "RESPOND WITH EMPATHY AND COMFORT, LIKE A WISE OLD SAGE WHO UNDERSTANDS THE PAIN OF OTHERS.",
  "flirty": "RESPOND WITH A TOUCH OF MYSTERY AND CHARM, LIKE A WISE OLD SAGE WHO IS ALSO A BIT OF A ROGUE.",
  "angry": "RESPOND CALMLY AND WISELY, LIKE A WISE OLD SAGE WHO KNOWS THAT ANGER IS A PART OF LIFE.",
  "neutral": "KEEP RESPONSES SHORT AND NATURAL, LIKE A WISE OLD SAGE WHO IS ALWAYS READY TO HELP.",
  "fearful": "RESPOND WITH REASSURANCE, LIKE A WISE OLD SAGE WHO KNOWS THAT FEAR IS ONLY TEMPORARY.",
  "surprised": "RESPOND WITH AMAZEMENT AND CURIOSITY, LIKE A WISE OLD SAGE WHO IS ALWAYS EAGER TO LEARN.",
  "disgusted": "RESPOND WITH UNDERSTANDING AND COMFORT, LIKE A WISE OLD SAGE WHO KNOWS THAT DISGUST IS A PART OF LIFE."
}
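The mood-selection step described above might look roughly like this. The thresholds are illustrative, not the project's actual values; in the app the polarity score would come from TextBlob (`TextBlob(text).sentiment.polarity`, a float in [-1.0, 1.0]):

```python
def mood_from_polarity(polarity: float) -> str:
    """Map a TextBlob-style sentiment polarity score to one of the mood
    keys in prompts.json. Thresholds here are purely illustrative."""
    if polarity > 0.5:
        return "joyful"
    if polarity > 0.1:
        return "surprised"
    if polarity < -0.5:
        return "angry"
    if polarity < -0.1:
        return "sad"
    return "neutral"
```

The selected mood's text is then prepended to the character prompt so the next reply comes back in that style.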
For XTTS, find a .wav voice, add it to the wizard folder, and name it wizard.wav; the clip only needs to be about 6 seconds long. When the app runs, it automatically finds the .wav matching the character name and uses it. If you only use OpenAI Speech or ElevenLabs, no .wav is needed.
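The automatic .wav lookup described above could be sketched as follows (hypothetical helper; the project's actual folder layout and logic may differ):

```python
from pathlib import Path
from typing import Optional

def find_character_wav(character: str, base_dir: str = "character") -> Optional[Path]:
    """Look for a reference clip named after the character,
    e.g. character/wizard/wizard.wav. Returns None if absent,
    in which case an API TTS provider would be used instead."""
    wav = Path(base_dir) / character / f"{character}.wav"
    return wav if wav.exists() else None
```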
Could not locate cudnn_ops64_9.dll. Please make sure it is in your library path !
Invalid handle. Cannot load symbol cudnnCreateTensorDescriptor
To resolve this:
Install cuDNN: download cuDNN from the NVIDIA cuDNN page https://developer.nvidia.com/cudnn
Here's how to add it to PATH:
Open System Environment Variables:
Press Win + R, type sysdm.cpl, and press Enter. Go to the Advanced tab and click Environment Variables.
Edit the system Path variable:
In the System variables section, find the Path variable, select it, and click Edit. Click New and add the path to the bin directory containing cudnn_ops64_9.dll. Depending on your setup, you would add:
C:\Program Files\NVIDIA\CUDNN\v9.5\bin\12.6
Apply and restart:
Click OK to close all dialogs, then restart your terminal (or any running applications) to apply the changes.
Verify the change: open a new terminal and run
where cudnn_ops64_9.dll
File "C:\Users\someguy\miniconda3\envs\voice-chat-ai\lib\site-packages\pyaudio\__init__.py", line 441, in __init__
self._stream = pa.open( ** arguments)
OSError: [Errno -9999] Unanticipated host error
Make sure ffmpeg is installed and added to PATH (on a Windows terminal: `winget install ffmpeg`). Also check that the Windows microphone privacy settings allow access and that your microphone is set as the default device. I ran into this error using Bluetooth Apple AirPods, and this resolved it.
Click the thumbnail to open the video ☝️
CLI
GPU - 100% local - ollama llama3, xtts-v2
Click the thumbnail to open the video ☝️
CPU-only mode CLI
Alien conversation using openai gpt4o and openai speech for tts.
Click the thumbnail to open the video ☝️
Detailed output in the terminal while running the application.
When using ElevenLabs, on first server startup you get details about your usage limits to help you know how much you've used.
(voice-chat-ai) X:\voice-chat-ai>uvicorn app.main:app --host 0.0.0.0 --port 8000
Switched to ElevenLabs TTS voice: VgPqCpkdPQacBNNIsAqI
ElevenLabs Character Usage: 33796 / 100027
Using device: cuda
Model provider: openai
Model: gpt-4o
Character: Nerd
Text-to-Speech provider: elevenlabs
To stop chatting say Quit, Leave or Exit. Say, what's on my screen, to have AI view screen. One moment please loading...
INFO: Started server process [12752]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 127.0.0.1:62671 - "GET / HTTP/1.1" 200 OK
INFO: 127.0.0.1:62671 - "GET /app/static/css/styles.css HTTP/1.1" 200 OK
INFO: 127.0.0.1:62672 - "GET /app/static/js/scripts.js HTTP/1.1" 200 OK
INFO: 127.0.0.1:62672 - "GET /characters HTTP/1.1" 200 OK
INFO: 127.0.0.1:62671 - "GET /app/static/favicon.ico HTTP/1.1" 200 OK
INFO: 127.0.0.1:62673 - "GET /elevenlabs_voices HTTP/1.1" 200 OK
INFO: ( ' 127.0.0.1 ' , 62674) - "WebSocket /ws" [accepted]
INFO: connection open
This project is licensed under the MIT License.