Downcodes editor reports: Johns Hopkins University and Tencent AI Lab have jointly developed EzAudio, a breakthrough text-to-audio generation model whose efficient, high-quality audio generation marks a significant advance at the intersection of artificial intelligence and audio technology. EzAudio operates in an audio waveform latent space and, combined with advanced techniques such as AdaLN-SOLA, surpasses existing open-source models in both objective and subjective evaluations. The model's code, dataset, and checkpoints have been made publicly available to encourage further research and application.
EzAudio operates in the latent space of audio waveforms rather than on traditional spectrograms, an innovation that lets it work at high temporal resolution without requiring an additional neural vocoder.
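To make the idea concrete, here is a minimal Python/PyTorch sketch of diffusion in a 1D waveform latent space. The module names (WaveformVAE, LatentDenoiser), shapes, and the crude denoising loop are illustrative assumptions, not EzAudio's actual components or sampler.

```python
# Hypothetical sketch: diffusion in a 1D waveform latent space.
# All names and shapes are illustrative assumptions, not EzAudio's API.
import torch
import torch.nn as nn

class WaveformVAE(nn.Module):
    """Toy stand-in for a waveform autoencoder: 1D convs map audio <-> latents
    directly, so no spectrogram stage or separate neural vocoder is needed."""
    def __init__(self, channels=8, stride=64):
        super().__init__()
        self.encode = nn.Conv1d(1, channels, kernel_size=stride, stride=stride)
        self.decode = nn.ConvTranspose1d(channels, 1, kernel_size=stride, stride=stride)

class LatentDenoiser(nn.Module):
    """Toy stand-in for the diffusion model that operates on latents."""
    def __init__(self, channels=8):
        super().__init__()
        self.net = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
    def forward(self, z, t):
        return self.net(z)  # predicts noise at step t (text conditioning omitted)

vae, denoiser = WaveformVAE(), LatentDenoiser()
z = torch.randn(1, 8, 250)        # start from noise in the latent space
for t in reversed(range(10)):     # crude denoising loop (real samplers differ)
    z = z - 0.1 * denoiser(z, t)
audio = vae.decode(z)             # decode latents straight back to a waveform
print(audio.shape)                # (1, 1, 16000): ~1 s at 16 kHz with stride 64
```

The point of the sketch is the last line: because the latents encode the waveform itself, decoding yields audio directly, whereas a spectrogram-based pipeline would still need a vocoder.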
EzAudio's architecture, called EzAudio-DiT (Diffusion Transformer), incorporates several technical innovations to improve performance and efficiency, including AdaLN-SOLA, a new adaptive layer normalization scheme; long skip connections; and advanced position encoding such as RoPE (rotary position embedding).
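For readers unfamiliar with these building blocks, the sketch below shows generic versions of two of them: adaptive layer normalization as used in standard DiT blocks, and rotary position embedding. It is a simplified illustration only; EzAudio's AdaLN-SOLA is a variant of AdaLN whose exact formulation this sketch does not reproduce.

```python
# Illustrative sketch of generic AdaLN and RoPE
# (not EzAudio's exact AdaLN-SOLA formulation).
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm: scale and shift are regressed from a conditioning
    vector (e.g. the diffusion timestep embedding) rather than being fixed
    learned parameters."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)
    def forward(self, x, cond):
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

def rope(x):
    """Rotary position embedding: rotate feature pairs by position-dependent
    angles so attention scores become relative-position aware."""
    b, t, d = x.shape
    half = d // 2
    freqs = torch.outer(torch.arange(t), 1.0 / 10000 ** (torch.arange(half) / half))
    cos, sin = freqs.cos(), freqs.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

x, cond = torch.randn(1, 250, 64), torch.randn(1, 128)
print(AdaLN(64, 128)(x, cond).shape, rope(x).shape)  # both (1, 250, 64)
```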
According to the researchers, the audio samples generated by EzAudio are realistic enough that the model outperforms existing open-source models in both objective and subjective evaluations.
The AI audio generation market is growing rapidly. Well-known companies such as ElevenLabs recently launched an iOS app for text-to-speech, signaling strong consumer interest in AI audio tools, while technology giants such as Microsoft and Google are increasing their investment in AI voice technology.
According to Gartner's predictions, 40% of generative AI solutions will be multimodal by 2027, combining text, image, and audio capabilities, which suggests that high-quality audio generation models like EzAudio will continue to play an important role in the field of AI.
The EzAudio team has made their code, datasets, and model checkpoints publicly available, emphasizing transparency and encouraging further research in this area.
The researchers believe EzAudio may have applications beyond sound-effect generation, extending to areas such as speech and music production. As the technology matures, it is expected to see wide use in industries such as entertainment, media, assistive services, and virtual assistants.
Demo: https://huggingface.co/spaces/OpenSound/EzAudio
Project page: https://github.com/haidog-yaqub/EzAudio?tab=readme-ov-file
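For readers who want to try the hosted demo programmatically, one option is the gradio_client library, as in the sketch below. The endpoint name and argument list here are assumptions about the Space's interface; run client.view_api() first to see its actual signature.

```python
# Hedged sketch: calling the Hugging Face demo Space programmatically.
# The endpoint name and arguments are assumptions about this Space's API;
# client.view_api() prints the real ones.
from gradio_client import Client

client = Client("OpenSound/EzAudio")
client.view_api()  # inspect the Space's actual endpoints and parameters
result = client.predict(
    "a dog barking in the rain",  # text prompt (assumed first argument)
    api_name="/generate_audio",   # assumed endpoint name; verify via view_api()
)
print(result)  # typically a path to the generated audio file
```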
Highlights:
EzAudio is a new text-to-audio generation model from Johns Hopkins University in collaboration with Tencent AI Lab, marking a major advance in audio technology.
Through its innovative architecture, the model generates audio samples that surpass existing open-source models in quality, and it has broad application potential.
As the technology develops, questions of ethical and responsible use are coming to the fore, and EzAudio's publicly released research code provides ample opportunity to examine its risks and benefits.
EzAudio's open-source release and strong performance give it significant advantages in the field of AI audio generation, and its application prospects are broad, though attention must also be paid to its ethical and social impact. The Downcodes editor will continue to follow the progress and applications of this technology.