Recently, a new speech synthesis model called Kokoro was released on the Hugging Face platform and has attracted widespread attention. With only 82 million parameters and less than 100 hours of audio data, it achieves results comparable to far larger models, placing it among the best in the TTS field. Its efficient training process and ease of use make it a notable development in speech synthesis. This article covers the Kokoro model's performance, training process, usage, and limitations in detail.
Amid the rapid development of artificial intelligence, speech synthesis technology is receiving increasing attention. The latest such model, named Kokoro, was recently released on the Hugging Face platform. With 82 million parameters, it marks an important milestone in the field of speech synthesis.
Kokoro v0.19 ranked first on the TTS (text-to-speech) leaderboard in the weeks leading up to its release, outperforming models with far more parameters. In a single-voice setting, it achieved results comparable to the 467M-parameter XTTS v2 and the 1.2B-parameter MetaVoice while using less than 100 hours of audio data. This suggests that strong TTS performance may require fewer parameters, less compute, and less data than previously assumed.
In terms of usage, users only need to run a few lines of code in Google Colab to load the model and a voice pack and generate high-quality audio. Kokoro currently supports American and British English and provides multiple voice packs to choose from.
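As a rough illustration of that workflow, the sketch below follows the usage pattern shown on the model card at release (clone the repo, install espeak-ng and the Python dependencies, then build the model and load a voice pack). Function names such as `build_model` and `generate`, the checkpoint filename, and the `af` voice name are taken from the repository's example and may change between versions, so treat this as an assumption-laden sketch rather than a definitive recipe:

```python
# Assumed setup in a Colab cell (per the model card at release):
#   !git lfs install && git clone https://huggingface.co/hexgrad/Kokoro-82M
#   %cd Kokoro-82M
#   !apt-get -qq -y install espeak-ng
#   !pip install -q phonemizer torch transformers scipy munch
import torch
from models import build_model   # helper module shipped inside the model repo
from kokoro import generate      # inference helper from the same repo

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = build_model('kokoro-v0_19.pth', device)

# Each voice pack is a small .pt tensor; 'af' is the default American-English voice.
voicepack = torch.load('voices/af.pt', weights_only=True).to(device)

text = "Kokoro is an 82 million parameter text-to-speech model."
# First letter of the voice name selects the language ('a' = American English).
audio, phonemes = generate(model, text, voicepack, lang='a')

# The output is a 24 kHz waveform; save it to disk (or play it with IPython.display.Audio).
from scipy.io.wavfile import write
write('kokoro_out.wav', 24000, audio)
```

Because this depends on downloading the model repository and a GPU-capable environment, it is best run directly in Colab as the model card suggests.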
Kokoro was trained on an A100 80GB vRAM instance rented from Vast.ai, keeping training costs relatively low. The entire model was trained in fewer than 20 epochs on less than 100 hours of audio data. The training set consists of public-domain audio along with audio under other open licenses, ensuring data compliance.
Although Kokoro performs well at speech synthesis, it currently cannot support voice cloning due to limitations in its training data and architecture. In addition, its training data consists mainly of long-form readings and narration rather than dialogue, so conversational speech is a relative weakness.
Model: https://huggingface.co/hexgrad/Kokoro-82M
Demo: https://huggingface.co/spaces/hexgrad/Kokoro-TTS
Highlights:
Kokoro-82M is a newly released speech synthesis model with 82 million parameters that supports a variety of voice packs.
The model delivers excellent TTS performance, once ranking first on the leaderboard, despite being trained on less than 100 hours of audio data.
Kokoro is trained on openly licensed data to ensure compliance, but it currently has some functional limitations, such as the lack of voice cloning.
All in all, the Kokoro model shows impressive potential in speech synthesis, and its efficient training and strong performance merit attention. Although some limitations remain, Kokoro is likely to find broader application scenarios as the technology continues to develop.