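VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild

TL;DR

VoiceCraft is a token-infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on in-the-wild data, including audiobooks, internet videos, and podcasts.

To clone or edit an unseen voice, VoiceCraft needs only a few seconds of reference audio.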

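How to run inference

There are four ways (besides running Gradio in Colab):

In Google Colab, with more flexible inference than the Gradio UI. See QuickStart Colab.
With docker. See QuickStart Docker.
Without docker. See Environment setup. If you choose this option, you can also run Gradio locally.
As a standalone script that you can easily integrate into other projects. See QuickStart Command Line.

Once you are inside the docker image, or have installed all dependencies, check out inference_tts.ipynb.

If you want to do model development such as training or finetuning, I recommend following the Environment setup and Training sections below.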

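News

04/22/2024: The 330M/830M TTS-enhanced models are here; load them through gradio_app.py or inference_tts.ipynb! The Replicate demo is up, major thanks to @chenxwh!

04/11/2024: VoiceCraft Gradio is now available on HuggingFace Spaces! Major thanks to @zuev-stepan, @Sewell, @pgsoar, @Ph0rk0z.

04/05/2024: I finetuned giga330M with the TTS objective on gigaspeech and 1/5 of librilight. The weights are here. Make sure the maximal prompt + generation length is <= 16 seconds (due to our limited compute, we had to drop utterances longer than 16 seconds from the training data). Even stronger models are forthcoming, stay tuned!

03/28/2024: Model weights for giga330M and giga830M are up on HuggingFace here!

TODO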

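QuickStart Colab

The easiest way to try speech editing or TTS inference with VoiceCraft is Google Colab. Instructions for running it are in the Colab notebook itself.

Try speech editing
Try TTS inference

QuickStart Command Line

To use VoiceCraft as a standalone script, check out tts_demo.py and speech_editing_demo.py. Be sure to set up your environment first. Without arguments, the scripts run the standard demo arguments used as examples elsewhere in this repository. You can use command-line arguments to specify your own input audio, target transcript, and inference hyperparameters. Run the help command for more information: python3 tts_demo.py -h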
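
If you want to call the demo script from another Python program, the help command above can be wrapped as in the following minimal sketch. The only flag it relies on is the `-h` quoted above; the function names `build_help_command` and `show_help` are hypothetical, and the sketch assumes it is run from the repository root.

```python
import subprocess
import sys
from pathlib import Path


def build_help_command(script: str = "tts_demo.py") -> list:
    """Return the argv list for the help command quoted above."""
    return [sys.executable, script, "-h"]


def show_help(script: str = "tts_demo.py") -> str:
    """Run `<script> -h` and return its output, or a hint if the script is missing."""
    if not Path(script).is_file():
        return f"{script} not found; run this from the repository root"
    result = subprocess.run(
        build_help_command(script), capture_output=True, text=True, check=False
    )
    # argparse normally prints help to stdout; fall back to stderr on errors.
    return result.stdout or result.stderr


if __name__ == "__main__":
    print(show_help())
```

Using `sys.executable` rather than a hard-coded `python3` keeps the call inside whichever environment (for example the voicecraft conda environment) is currently active.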

QuickStart Docker

To try TTS inference with VoiceCraft, you can also use docker. Thanks to @ubergarm and @jayc88 for making this happen.

Tested on Linux and Windows; it should work on any host with docker installed.

# 1. clone the repo in a directory on a drive with plenty of free space
git clone git@github.com:jasonppy/VoiceCraft.git
cd VoiceCraft

# 2. assumes you have docker installed with the nvidia container-toolkit (windows has this built into the driver)
# https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/1.13.5/install-guide.html
# sudo apt-get install -y nvidia-container-toolkit-base || yay -Syu nvidia-container-toolkit || echo etc...

# 3. First build the docker image
docker build --tag "voicecraft" .

# 4. Try to start an existing container otherwise create a new one passing in all GPUs
./start-jupyter.sh  # linux
start-jupyter.bat   # windows

# 5. now open a webpage on the host box to the URL shown at the bottom of:
docker logs jupyter

# 6. optionally look inside from another terminal
docker exec -it jupyter /bin/bash
export USER=(your_linux_username_used_above)
export HOME=/home/$USER
sudo apt-get update

# 7. confirm video card(s) are visible inside container
nvidia-smi

# 8. Now in browser, open inference_tts.ipynb and work through one cell at a time
echo GOOD LUCK
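
Step 7 can also be checked from Python, for example in the first notebook cell. The following is a minimal sketch that shells out to `nvidia-smi -L` (which prints one line per GPU); the helper name `list_visible_gpus` is hypothetical and not part of the repository.

```python
import shutil
import subprocess


def list_visible_gpus() -> list:
    """Return one line per GPU reported by `nvidia-smi -L`, or [] if none are visible."""
    if shutil.which("nvidia-smi") is None:
        return []  # driver tools are not on PATH in this environment
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=False
    )
    if result.returncode != 0:
        return []  # nvidia-smi exists but could not talk to a driver
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_visible_gpus()
    print(f"{len(gpus)} GPU(s) visible")
    for gpu in gpus:
        print(gpu)
```

An empty list inside the container usually means the container was started without GPU access, so it is worth checking before working through the notebook.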

Environment setup

conda create -n voicecraft python=3.9.16
conda activate voicecraft

pip install -e git+https://github.com/facebookresearch/audiocraft.git@c5157b5bf14bf83449c17ea1eeb66c19fb4bc7f0#egg=audiocraft
pip install xformers==0.0.22
pip install torchaudio==2.0.2 torch==2.0.1 # this assumes your system is compatible with CUDA 11.7, otherwise checkout https://pytorch.org/get-started/previous-versions/#v201
apt-get install ffmpeg # if you don't already have ffmpeg installed
apt-get install espeak-ng # backend for the phonemizer installed below
pip install tensorboard==2.16.2
pip install phonemizer==3.2.1
pip install datasets==2.16.0
pip install torchmetrics==0.11.1
pip install huggingface_hub==0.22.2
# install MFA for getting forced-alignment, this could take a few minutes
conda install -c conda-forge montreal-forced-aligner=2.2.17 openfst=1.8.2 kaldi=5.5.1068
# install MFA english dictionary and model
mfa model download dictionary english_us_arpa
mfa model download acoustic english_us_arpa
# pip install huggingface_hub
# conda install pocl # the above gives a warning about installing pocl, not sure if it is really needed

# to run ipynb
conda install -n voicecraft ipykernel --no-deps --force-reinstall

If you encounter version issues when running things, check environment.yml for the exact versions.
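
Before comparing against environment.yml by hand, it may help to check the pip pins from the setup commands above. The following is a minimal sketch using the standard-library `importlib.metadata`; the `PINNED` table is copied from those commands, and the helper name `find_mismatches` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip commands in the environment setup above.
PINNED = {
    "torch": "2.0.1",
    "torchaudio": "2.0.2",
    "xformers": "0.0.22",
    "tensorboard": "2.16.2",
    "phonemizer": "3.2.1",
    "datasets": "2.16.0",
    "torchmetrics": "0.11.1",
    "huggingface_hub": "0.22.2",
}


def find_mismatches(pinned: dict) -> dict:
    """Map each package whose installed version differs from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = version(name)
        except PackageNotFoundError:
            found = None  # package is not installed at all
        # Builds may carry a local suffix such as "+cu117"; compare the public part only.
        if found is None or found.split("+")[0] != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

The sketch only covers the pip packages; the conda-installed tools (MFA, openfst, kaldi) are not checked here.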

Inference Examples

Check out inference_speech_editing.ipynb and inference_tts.ipynb

Gradio

Run in Colab

Run locally

After environment setup, install the additional dependencies:

apt-get install -y espeak espeak-data libespeak1 libespeak-dev
apt-get install -y festival*
apt-get install -y build-essential
apt-get install -y flac libasound2-dev libsndfile1-dev vorbis-tools
apt-get install -y libxml2-dev libxslt-dev zlib1g-dev
pip install -r gradio_requirements.txt

Run the gradio server from the terminal or from gradio_app.ipynb:

python gradio_app.py

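It is ready to use on the default URL.

How to use it

(optional) Select models
Load models
Transcribe
(optional) Tweak some parameters
Run
(optional) Rerun part by part in Long TTS mode

Some features

Smart transcript: write only what you want to generate

TTS mode: zero-shot TTS

Edit mode: speech editing

Long TTS mode: easy TTS on long texts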

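Training

To train a VoiceCraft model, you need to prepare the following parts:

utterances and their transcripts
the utterances encoded into codes using e.g. Encodec
the transcripts converted into phoneme sequences, plus a phoneme set (we named it vocab.txt)
a manifest (i.e. metadata)

Steps 1, 2, and 3 are handled in ./data/phonemize_encodec_encode_hf.py, where

Gigaspeech is downloaded through HuggingFace. Note that you need to sign an agreement in order to download the dataset (it needs your auth token).
Phoneme sequences and Encodec codes are also extracted by the script.

An example run: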

conda activate voicecraft
export CUDA_VISIBLE_DEVICES=0
cd ./data
python phonemize_encodec_encode_hf.py \
    --dataset_size xs \
    --download_to path/to/store_huggingface_downloads \
    --save_dir path/to/store_extracted_codes_and_phonemes \
    --encodec_model_path path/to/encodec_model \
    --mega_batch_size 120 \
    --batch_size 32 \
    --max_len 30000

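where encodec_model_path is available here. This model was trained on Gigaspeech XL; it has 56M parameters and 4 codebooks, each with 2048 codes. Details are described in our paper. If you encounter OOM during extraction, try decreasing batch_size and/or max_len. The extracted codes, phonemes, and vocab.txt will be stored at path/to/store_extracted_codes_and_phonemes/${dataset_size}/{encodec_16khz_4codebooks,phonemes,vocab.txt}.

As for the manifest, please download train.txt and validation.txt from here, and put them under path/to/store_extracted_codes_and_phonemes/manifest/. If you want to use our pretrained VoiceCraft model, please also download vocab.txt from here (so that the phoneme-to-token mapping is the same).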
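
Before launching training, it can be useful to confirm that the files described above are in place. The following is a minimal sketch that only encodes the paths named in the two paragraphs above; the function names are hypothetical, and the training scripts may expect additional files not listed here.

```python
from pathlib import Path


def expected_training_inputs(save_dir: str, dataset_size: str) -> list:
    """List the paths the text above says should exist before training starts."""
    root = Path(save_dir)
    extracted = root / dataset_size
    return [
        extracted / "encodec_16khz_4codebooks",  # extracted Encodec codes
        extracted / "phonemes",                  # extracted phoneme sequences
        extracted / "vocab.txt",                 # phoneme set
        root / "manifest" / "train.txt",
        root / "manifest" / "validation.txt",
    ]


def missing_training_inputs(save_dir: str, dataset_size: str) -> list:
    """Return the expected paths that do not exist yet."""
    return [p for p in expected_training_inputs(save_dir, dataset_size) if not p.exists()]


if __name__ == "__main__":
    for path in missing_training_inputs("path/to/store_extracted_codes_and_phonemes", "xs"):
        print(f"missing: {path}")
```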

Now you are ready to start training!

conda activate voicecraft
cd ./z_scripts
bash e830M.sh

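The procedure for preparing your own custom dataset is the same. Make sure that if

Finetuning

You also need to do steps 1-4 as in Training. If you finetune a pretrained model, I recommend using AdamW for optimization, for better stability. Check out the script ./z_scripts/e830M_ft.sh.

If your dataset introduces new phonemes that do not exist in the giga checkpoint (which is very likely), make sure you combine the original phonemes with the phonemes from your data when constructing the vocab. You also need to adjust --text_vocab_size and --text_pad_token so that the former is greater than or equal to your vocab size and the latter has the same value as --text_vocab_size (i.e. --text_pad_token is always the last token). Also, since the text embedding now has a different size, make sure you modify the weight-loading part so that it does not crash (you can either skip loading text_embedding, or load only the existing part and randomly initialize the new part).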
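
The vocab-merging rule above can be sketched as follows. This is an illustration under assumptions, not the repository's own code: the function names are hypothetical, phonemes are treated as plain strings, and original phonemes are kept at their original indices on the assumption that this lets the existing text-embedding rows be reused.

```python
def merge_phoneme_vocab(original: list, new: list) -> list:
    """Keep original phonemes at their original indices and append unseen ones."""
    merged = list(original)
    seen = set(original)
    for phoneme in new:
        if phoneme not in seen:
            merged.append(phoneme)
            seen.add(phoneme)
    return merged


def text_vocab_flags(vocab: list, text_vocab_size: int = None) -> dict:
    """Pick --text_vocab_size >= len(vocab) and set --text_pad_token to the same value."""
    size = len(vocab) if text_vocab_size is None else text_vocab_size
    if size < len(vocab):
        raise ValueError("text_vocab_size must be >= the number of phonemes")
    return {"--text_vocab_size": size, "--text_pad_token": size}


if __name__ == "__main__":
    original = ["AH0", "B", "K"]             # stand-in for the pretrained vocab
    from_my_data = ["K", "ZH", "B", "OY1"]   # stand-in for phonemes in a new dataset
    merged = merge_phoneme_vocab(original, from_my_data)
    print(merged)                    # ['AH0', 'B', 'K', 'ZH', 'OY1']
    print(text_vocab_flags(merged))  # {'--text_vocab_size': 5, '--text_pad_token': 5}
```

With this ordering, the first len(original) embedding rows correspond to the pretrained phonemes, and only the appended rows need random initialization.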

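License

The codebase is under CC BY-NC-SA 4.0 (LICENSE-CODE), and the model weights are under the Coqui Public Model License 1.0.0 (LICENSE-MODEL). Note that we use some code from other repositories that are under different licenses: ./models/codebooks_patterns.py is under the MIT license; ./models/modules, ./steps/optim.py, and data/tokenizer.py are under the Apache License, Version 2.0; the phonemizer we use is under the GNU 3.0 License.

Acknowledgement

We thank Feiteng for his VALL-E reproduction, and we thank the Audiocraft team for open-sourcing Encodec.

Citation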

@article{peng2024voicecraft,
  author    = {Peng, Puyuan and Huang, Po-Yao and Mohamed, Abdelrahman and Harwath, David},
  title     = {VoiceCraft: Zero-Shot Speech Editing and Text-to-Speech in the Wild},
  journal   = {arXiv},
  year      = {2024},
}