In real-time voice communication, changing a speaker's timbre without affecting semantics and prosody has long been a technical challenge. Today, the editor of Downcodes introduces a breakthrough technology, StreamVC, which changes the speaker's voice timbre in real time while preserving the content and rhythm of the speech. It is suitable for mobile platforms and opens up new possibilities for real-time communication and voice anonymization. StreamVC's low latency, high-quality speech synthesis, and pitch stability give it significant advantages in the field of real-time communications.
In the world of real-time communication, whether it is a phone call or a video conference, the voice is an important tool for expressing ourselves. But have you ever wondered what would happen if we could change the timbre of a speaker's voice in real time without affecting the content and rhythm of what is said? StreamVC makes this possible.
StreamVC is a voice conversion solution that matches the timbre of a target voice while preserving the content and prosody of the source speech. Unlike traditional methods, StreamVC generates the output waveform with low latency relative to the input signal, even on mobile platforms. This makes it suitable for real-time communication scenarios such as phone calls and video conferencing, as well as for voice anonymization in those scenarios.
Technical Highlights:
Real-time: StreamVC runs inference on mobile devices with a latency of 70.8 milliseconds.
High-quality speech synthesis: It uses the architecture and training strategy of the SoundStream neural audio codec to achieve lightweight, high-quality speech synthesis.
Pitch stability: By introducing whitened fundamental frequency (f0) information, it improves pitch consistency without leaking the source speaker's timbre information.
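To make the idea of "whitened" f0 more concrete, here is a minimal sketch. It assumes that whitening means normalizing the voiced f0 values of an utterance to zero mean and unit variance; the function name whiten_f0, the convention that unvoiced frames carry f0 = 0, and the eps constant are illustrative assumptions, not details taken from StreamVC itself.

```python
from statistics import fmean, pstdev


def whiten_f0(f0_hz, eps=1e-8):
    """Normalize voiced f0 values to zero mean and unit variance.

    Assumption for this sketch: unvoiced frames are marked with f0 == 0
    and are mapped to 0.0 in the output.
    """
    voiced = [f for f in f0_hz if f > 0]
    if not voiced:
        return [0.0] * len(f0_hz)
    mean = fmean(voiced)
    std = pstdev(voiced)
    # eps guards against division by zero for a perfectly flat contour.
    return [(f - mean) / (std + eps) if f > 0 else 0.0 for f in f0_hz]


# Same melodic contour spoken in a low and a high register (values in Hz).
low_voice = [100.0, 110.0, 0.0, 120.0, 110.0]
high_voice = [200.0, 220.0, 0.0, 240.0, 220.0]

low_whitened = whiten_f0(low_voice)
high_whitened = whiten_f0(high_voice)
```

In this toy example both contours whiten to nearly the same sequence: the shape of the intonation survives, while the absolute pitch range, which is a strong cue to speaker identity, is removed. A real system would probably also need a separate voicing flag, since 0.0 here is ambiguous between "unvoiced" and "exactly at the mean".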
The design of StreamVC is inspired by Soft-VC and SoundStream. It uses discrete speech units extracted by the HuBERT model as prediction targets for the content encoder network. The architecture and training strategy of the content encoder and decoder are adapted from the SoundStream neural audio codec to achieve high-quality causal audio synthesis.
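The phrase "discrete units as prediction targets" can be illustrated with a small sketch. It assumes a cross-entropy objective in the style of Soft-VC, where the encoder emits one logit per unit and the discrete HuBERT unit for that frame serves as the label. The three-unit vocabulary and the numbers below are made up for illustration; the actual loss and vocabulary size used by StreamVC are not stated in this article.

```python
import math


def softmax(logits):
    """Turn raw scores into a probability distribution (a 'soft' unit)."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_unit):
    """Loss for one frame, given the index of its discrete target unit."""
    return -math.log(softmax(logits)[target_unit])


# Hypothetical encoder output for one frame over a toy 3-unit vocabulary.
frame_logits = [2.0, 1.0, 0.1]
target_unit = 0  # discrete unit assigned to this frame (illustrative)

soft_unit = softmax(frame_logits)  # full distribution over units
hard_unit = max(range(len(soft_unit)), key=soft_unit.__getitem__)  # argmax
loss = cross_entropy(frame_logits, target_unit)
```

The discrete unit is used only as a training label. The distribution in soft_unit keeps more information about how the frame sounds than the single index in hard_unit, which is the motivation Soft-VC gives for working with soft rather than hard units.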
StreamVC was compared with existing technologies on multiple benchmarks covering naturalness, intelligibility, speaker similarity, and pitch consistency. Experimental results show that StreamVC performs well at preserving the pitch of the source speech and is comparable to fine-tuned models in terms of speaker similarity.
StreamVC shows that efficient, low-latency voice conversion on mobile devices is entirely feasible. HuBERT-derived soft speech units can be learned by a streamable causal convolutional neural network, and injecting whitened f0 information into the decoder is crucial for high-quality output.
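Why a causal convolution is "streamable" can be shown with a short sketch. A causal filter looks only at current and past samples, so audio can be processed chunk by chunk as long as the last few samples are carried over between chunks. The class and function names, the kernel values, and the chunk size below are illustrative assumptions and do not reflect StreamVC's actual layers.

```python
def causal_conv1d(x, kernel):
    """Offline reference: y[t] = sum_k kernel[k] * x[t - k], zeros before t = 0."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            if t - k >= 0:
                acc += w * x[t - k]
        out.append(acc)
    return out


class StreamingCausalConv1d:
    """Same filter, applied chunk by chunk with a small history buffer."""

    def __init__(self, kernel):
        self.kernel = list(kernel)
        # Past samples carried across chunks; starts as zero padding.
        self.history = [0.0] * (len(self.kernel) - 1)

    def process(self, chunk):
        pad = len(self.history)
        buf = self.history + list(chunk)
        out = [
            sum(w * buf[t - k] for k, w in enumerate(self.kernel))
            for t in range(pad, len(buf))
        ]
        if pad:
            self.history = buf[-pad:]  # keep only what the next chunk needs
        return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [0.5, 0.3, 0.2]

offline = causal_conv1d(signal, kernel)

stream = StreamingCausalConv1d(kernel)
streamed = []
for start in range(0, len(signal), 2):  # feed two samples at a time
    streamed.extend(stream.process(signal[start:start + 2]))
```

Here streamed matches offline, even though the streaming version never sees more than one chunk plus two remembered samples. Because no output depends on future input, the latency is bounded by the chunk size and compute time rather than by the length of the utterance.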
Paper address: https://arxiv.org/pdf/2401.03078
The emergence of StreamVC brings new possibilities to real-time voice communication. Its low-latency, high-quality voice conversion should help voice technology reach more fields. I believe StreamVC will play a greater role in areas such as voice anonymization and voice special effects, and I look forward to more innovative applications built on it!