An open source implementation of Microsoft's VALL-E X zero-shot TTS model.
We release our trained model to the public for research or application usage.
VALL-E X is an amazing multilingual text-to-speech (TTS) model proposed by Microsoft. While Microsoft initially published the model in a research paper, they did not release any code or pretrained models. Recognizing the potential and value of this technology, our team took on the challenge of reproducing the results and training our own model. We are glad to share our trained VALL-E X model with the community, allowing everyone to experience the power of next-generation TTS!
More details about the model are presented in the model card.
2023.09.10
2023.08.30
2023.08.23
2023.08.20
2023.08.14
git clone https://github.com/Plachtaa/VALL-E-X.git
cd VALL-E-X
pip install -r requirements.txt
Note: If you want to make a voice prompt, you need to install ffmpeg and add its folder to the PATH environment variable.
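To confirm that ffmpeg is visible on PATH before making prompts, a quick check like the following may help. This is a minimal sketch using only the Python standard library; the helper name `ffmpeg_available` is hypothetical and not part of VALL-E X.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable can be found on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    if ffmpeg_available():
        print("ffmpeg found at:", shutil.which("ffmpeg"))
    else:
        print("ffmpeg not found; install it and add its folder to PATH")
```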
When you run the program for the first time, it will automatically download the corresponding model.
If the download fails and reports an error, please follow the steps below to manually download the model.
(Please pay attention to the capitalization of folders)
Check whether there is a checkpoints folder in the installation directory.
If not, manually create a checkpoints folder (./checkpoints/) in the installation directory.
Check whether there is a vallex-checkpoint.pt file in the checkpoints folder.
If not, please manually download the vallex-checkpoint.pt file from here and put it in the checkpoints folder.
Check whether there is a whisper folder in the installation directory.
If not, manually create a whisper folder (./whisper/) in the installation directory.
Check whether there is a medium.pt file in the whisper folder.
If not, please manually download the medium.pt file from here and put it in the whisper folder.
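The manual checks above can be sketched as a small script. It only creates the two folders and reports which files are still missing; it does not download anything, and the helper name `missing_model_files` is an assumption made for illustration rather than part of the repository.

```python
from pathlib import Path

# Paths taken from the manual-download steps; names are case-sensitive.
REQUIRED_FILES = (
    Path("checkpoints") / "vallex-checkpoint.pt",
    Path("whisper") / "medium.pt",
)


def missing_model_files(install_dir="."):
    """Create the expected folders if absent and list files still missing."""
    root = Path(install_dir)
    missing = []
    for rel in REQUIRED_FILES:
        target = root / rel
        # Make sure ./checkpoints/ and ./whisper/ exist.
        target.parent.mkdir(parents=True, exist_ok=True)
        if not target.is_file():
            missing.append(target)
    return missing


if __name__ == "__main__":
    for path in missing_model_files("."):
        print("missing, please download manually:", path)
```

Running it from the installation directory prints one line per missing file, or nothing if both files are in place.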
Not ready to set up the environment on your local machine just yet? No problem! We've got you covered with our online demos. You can try out VALL-E X directly on Hugging Face or Google Colab, experiencing the model's capabilities hassle-free!
VALL-E X comes packed with cutting-edge functionalities:
Multilingual TTS: Speak in three languages - English, Chinese, and Japanese - with natural and expressive speech synthesis.
Zero-shot Voice Cloning: Enroll a short 3 to 10 second recording of an unseen speaker, and watch VALL-E X create personalized, high-quality speech that sounds just like them!
Explore our demo page for a lot more examples!
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
from IPython.display import Audio
# download and load all models
preload_models()
# generate audio from text
text_prompt = """
Hello, my name is Nose. And uh, and I like hamburger. Hahaha... But I also have other interests such as playing tactic toast.
"""
audio_array = generate_audio(text_prompt)
# save audio to disk
write_wav("vallex_generation.wav", SAMPLE_RATE, audio_array)
# play text in notebook
Audio(audio_array, rate=SAMPLE_RATE)
text_prompt = """
チュソクは私のお気に入りの祭りです。 私は数日間休んで、友人や家族との時間を過ごすことができます。
"""
audio_array = generate_audio(text_prompt)
Note: VALL-E X controls accent perfectly even when synthesizing code-switched text. However, you need to manually mark the language of each sentence, since our g2p tool is rule-based.
text_prompt = """
[EN]The Thirty Years' War was a devastating conflict that had a profound impact on Europe.[EN]
[ZH]这是历史的开始。 如果您想听更多,请继续。[ZH]
"""
audio_array = generate_audio(text_prompt, language='mix')
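Writing the language markers by hand is error-prone for longer texts, so a small helper can build the tagged string. This is a sketch under assumptions: `tag_mixed_text` is a hypothetical helper, and only the `[EN]` and `[ZH]` markers appear in the example above, so any other tag would need to be checked against the project.

```python
def tag_mixed_text(segments):
    """Wrap each (language, sentence) pair in matching [XX]...[XX] markers."""
    lines = []
    for lang, sentence in segments:
        tag = f"[{lang.upper()}]"
        lines.append(f"{tag}{sentence.strip()}{tag}")
    return "\n".join(lines)


text_prompt = tag_mixed_text([
    ("en", "The Thirty Years' War was a devastating conflict."),
    ("zh", "这是历史的开始。"),
])
print(text_prompt)
```

The resulting string can then be passed to `generate_audio(text_prompt, language='mix')` as in the example above.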
VALL-E X provides dozens of speaker voices that you can use directly for inference! Browse all voices in the code
VALL-E X tries to match the tone, pitch, emotion and prosody of a given preset. The model also attempts to preserve music, ambient noise, etc.
text_prompt = """
I am an innocent boy with a smoky voice. It is a great honor for me to speak at the United Nations today.
"""
audio_array = generate_audio(text_prompt, prompt="dingzhen")
VALL-E X supports voice cloning! You can make a voice prompt with any person, character or even your own voice, and use it like other voice presets.
To make a voice prompt, you need to provide a speech recording 3 to 10 seconds long, as well as the transcript of the speech.
You can also leave the transcript blank to let the Whisper model generate the transcript.
VALL-E X tries to match the tone, pitch, emotion and prosody of a given prompt. The model also attempts to preserve music, ambient noise, etc.
from utils.prompt_making import make_prompt
### Use given transcript
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav",
transcript="Just, what was that? Paimon thought we were gonna get eaten.")
### Alternatively, use whisper
make_prompt(name="paimon", audio_prompt_path="paimon_prompt.wav")
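Since the prompt recording should be 3 to 10 seconds long, it may be worth checking the duration before calling `make_prompt`. The sketch below uses only the standard-library `wave` module, so it assumes an uncompressed PCM WAV file; the helper names are hypothetical and not part of VALL-E X.

```python
import wave

MIN_SECONDS, MAX_SECONDS = 3.0, 10.0  # range recommended for voice prompts


def wav_duration_seconds(path):
    """Return the duration of an uncompressed PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_prompt_length(path):
    """Return True if the recording falls inside the recommended range."""
    return MIN_SECONDS <= wav_duration_seconds(path) <= MAX_SECONDS
```

For example, `is_valid_prompt_length("paimon_prompt.wav")` could be checked first, and `make_prompt` called only when it returns True.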
Now let's try out the prompt we've just made!
from utils.generation import SAMPLE_RATE, generate_audio, preload_models
from scipy.io.wavfile import write as write_wav
# download and load all models
preload_models()
text_prompt = """
Hey, Traveler, Listen to this, This machine has taken my voice, and now it can talk just like me!
"""
audio_array = generate_audio(text_prompt, prompt="paimon")
write_wav("paimon_cloned.wav", SAMPLE_RATE, audio_array)
Not comfortable with code? No problem! We've also created a user-friendly graphical interface for VALL-E X. It allows you to interact with the model effortlessly, making voice cloning and multilingual speech synthesis a breeze.
You can launch the UI with the following command:
python -X utf8 launch-ui.py
VALL-E X works well on both CPU and GPU (pytorch 2.0+, CUDA 11.7 and CUDA 12.0).
6 GB of GPU VRAM is enough to run VALL-E X without offloading.
VALL-E X is similar to Bark, VALL-E and AudioLM, which generate audio in a GPT-style manner by predicting audio tokens quantized by EnCodec.
Compared to Bark:
Language | Status |
---|---|
English (en) | ✅ |
Japanese (ja) | ✅ |
Chinese, simplified (zh) | ✅ |
The model is downloaded with wget to the directory ./checkpoints/ when you run the program for the first time.
.bat scripts for non-Python users
If you find VALL-E X interesting and useful, give us a star on GitHub! It encourages us to keep improving the model and adding exciting features.
VALL-E X is licensed under the MIT License.
Have questions or need assistance? Feel free to open an issue or join our Discord.
Happy voice cloning!