Geely Automobile has made a breakthrough in speech synthesis. Its independently developed HAM-TTS large model, "Xingrui", outperforms the industry benchmark VALL-E and has attracted widespread attention. The editor of Downcodes explains the core advantages and likely impact of this technology.
The large AI model, named Xingrui, has achieved significant improvements in key indicators such as pronunciation accuracy, naturalness and speaker similarity.
The HAM-TTS model uses hierarchical acoustic modeling for token-based zero-shot text-to-speech, which greatly improves the user interaction experience in the smart cockpit. At the same scale of 400 million parameters, the character error rate of the HAM-TTS model is 1.5% lower than that of VALL-E; for the full model with 800 million parameters, the character error rate is 2.3% lower. In style consistency, pitch consistency and overall score, the HAM-TTS model improved by 10%.
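For readers unfamiliar with the metric, character error rate is conventionally defined as the character-level edit distance between a reference transcript and the recognized transcript of the synthesized audio, divided by the length of the reference. The following is a minimal sketch of that standard definition only. The article does not describe the evaluation pipeline used for HAM-TTS, so the function names and the sample strings here are illustrative assumptions, not the authors' code.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of character
    substitutions, insertions and deletions that turn
    `reference` into `hypothesis`."""
    # prev[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference character
                curr[j - 1] + 1,     # insert a hypothesis character
                prev[j - 1] + cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: the transcript of the synthesized audio
# contains one extra space compared with the 15-character input text.
cer = character_error_rate("turn left ahead", "turn left a head")
print(f"CER: {cer:.2%}")  # one edit over 15 reference characters
```

A lower value means the synthesized speech is transcribed back to text that is closer to the input, which is why the metric is used as a proxy for pronunciation accuracy.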
The advantages of the Xingrui model lie not only in its performance indicators but also in its practicality. It keeps the speaker's voice stable across scenarios such as avatar linkage, voice navigation and news broadcasts, and it adjusts tone, intonation, pauses and emotion to suit the situation. Notably, the model can switch seamlessly between languages, including dialects and foreign languages, and can reproduce a voice from only 3 seconds of sample audio, far less than the 10 seconds or more usually required in the industry.
The Geely team improved model performance by introducing hierarchical acoustic modeling. To address inaccurate pronunciation, they introduced a latent variable sequence predictor and a text aligner, which match text and audio more accurately and make the synthesized speech more natural and fluent.
This breakthrough demonstrates Geely's R&D strength in intelligent technology and reflects its ambition in AI. Geely's Xingrui AI large model system has expanded in several directions, including multi-modal large models and large language models, laying the foundation for smart car technology. At the same time, Geely's total cloud computing power has increased from 81 petaflops last year to 102 petaflops, demonstrating its continued investment in technology.
Following its initial success in electrification, Geely's breakthrough in intelligent technology offers new ideas and possibilities for the future of the automobile industry. It changes how traditional automobile manufacturers are perceived and indicates that intelligence will become a key area of competition in the industry.
Paper address: https://arxiv.org/pdf/2403.05989
The success of Geely's "Xingrui" marks the rise of China's automobile industry in artificial intelligence, and its technological breakthroughs will strongly influence the direction of intelligent vehicle development. We look forward to more innovations from Geely in this field!