The editor of Downcodes reports: Wuhan University, the China Mobile Jiutian Artificial Intelligence Team, and Duke Kunshan University have jointly open-sourced VoxBlink2, a large-scale audio-visual speaker recognition dataset built from YouTube data. The dataset comprises more than 110,000 hours of audio-visual material, covering nearly 10 million high-quality audio clips from more than 110,000 YouTube users. Unprecedented in scale, it offers a valuable resource for speaker recognition research, and its release aims to advance the training of large voiceprint models and drive progress in the field.
VoxBlink2 contains 9,904,382 high-quality audio clips and their corresponding video clips from 111,284 YouTube users, and is currently the largest publicly available audio-visual speaker recognition dataset. Its release aims to enrich the open-source speech corpus and support the training of large voiceprint models.
The VoxBlink2 dataset was mined through the following pipeline:
Candidate preparation: Collect multilingual keyword lists, retrieve user videos, and select the first minute of each video for processing.
Face extraction & detection: Extract video frames at a high frame rate, detect faces with MobileNet, and keep only video tracks that contain a single speaker (see the sketch after this list).
Face recognition: A pre-trained face recognizer verifies identity frame by frame to ensure that the audio and video clips come from the same person.
Active speaker detection: Using lip-movement sequences together with the audio, a multi-modal active speaker detector outputs vocal segments, and overlap detection removes segments containing multiple speakers.
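To make the screening step concrete, below is a minimal sketch of how per-frame face counting can isolate single-speaker video runs. It assumes OpenCV for frame decoding; `detect_faces` is a hypothetical stand-in for the MobileNet detector described above, not code released by the authors.

```python
from typing import Callable, List, Optional, Tuple

import cv2
import numpy as np

Box = Tuple[int, int, int, int]  # x, y, width, height of a detected face

def single_speaker_runs(
    video_path: str,
    detect_faces: Callable[[np.ndarray], List[Box]],
) -> List[Tuple[int, int]]:
    """Return (start_frame, end_frame) runs in which exactly one face appears."""
    cap = cv2.VideoCapture(video_path)
    runs: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        one_face = len(detect_faces(frame)) == 1
        if one_face and run_start is None:
            run_start = frame_idx                # a single-face run begins
        elif not one_face and run_start is not None:
            runs.append((run_start, frame_idx))  # run ends on this frame
            run_start = None
        frame_idx += 1
    if run_start is not None:                    # close a run at video end
        runs.append((run_start, frame_idx))
    cap.release()
    return runs
```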
To further improve data purity, a bypass step using an in-set face recognizer was also introduced; through coarse face extraction, face verification, face sampling, and training, it raised accuracy from 72% to 92%.
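As an illustration of the verification idea behind that bypass, the sketch below compares candidate face embeddings against an enrolled template for the channel owner using cosine similarity. The embedding source and the 0.6 threshold are assumptions for illustration, not values from the paper.

```python
from typing import Dict, List

import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_identity(
    clip_embeddings: Dict[str, np.ndarray],
    owner_template: np.ndarray,
    threshold: float = 0.6,  # illustrative value; tune on held-out data
) -> List[str]:
    """Keep clip ids whose face embedding matches the enrolled owner."""
    return [
        clip_id
        for clip_id, emb in clip_embeddings.items()
        if cosine(emb, owner_template) >= threshold
    ]
```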
Alongside the dataset, the team open-sourced voiceprint models of different sizes, including ResNet-based 2D convolutional models and an ECAPA-TDNN-based temporal model, as well as the very large ResNet293 with a Simple Attention Module. After post-processing, these models achieve an EER of 0.17% and a minDCF of 0.006 on the Vox1-O set.
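For readers unfamiliar with the metric, the sketch below shows the standard way an EER such as the 0.17% above is computed from verification trial scores; this is generic evaluation logic, not code released with VoxBlink2.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER from trial scores; labels are 1 (same speaker) or 0 (different).

    Sweeps the decision threshold across the sorted scores and returns the
    point where false-rejection and false-acceptance rates cross.
    """
    order = np.argsort(scores)
    labels = labels[order].astype(float)
    n_target = labels.sum()
    n_nontarget = len(labels) - n_target
    # With the threshold just above scores[i], trials 0..i are rejected:
    frr = np.cumsum(labels) / n_target                         # targets rejected
    far = (n_nontarget - np.cumsum(1 - labels)) / n_nontarget  # nontargets accepted
    idx = int(np.argmin(np.abs(frr - far)))
    return float((frr[idx] + far[idx]) / 2)
```

For example, `equal_error_rate(np.array([0.9, 0.2, 0.7, 0.4]), np.array([1, 0, 1, 0]))` returns 0.0, since the same-speaker and different-speaker scores are perfectly separated.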
Dataset website: https://VoxBlink2.github.io
Dataset download scripts: https://github.com/VoxBlink2/ScriptsForVoxBlink2
Meta files and models: https://drive.google.com/drive/folders/1lzumPsnl5yEaMP9g2bFbSKINLZ-QRJVP
Paper: https://arxiv.org/abs/2407.11510
In short, the open-sourcing of the VoxBlink2 dataset provides a powerful boost for speaker recognition and voiceprint research, and we look forward to it playing a greater role in future applications. The editor of Downcodes will continue to follow the dataset's development and applications.