Official Implementation of TSELM: Target Speaker Extraction using Discrete Tokens and Language Models.
The model class is defined in exp/tselm/model.py. Note that for training, the mixed audio is clipped to length 48080 (3.005s x 16kHz) and the reference speech is clipped to length 64080 (4.005s x 16kHz), respectively.
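For reference, this clipping step can be sketched as below; the helper name and the random-crop/zero-pad behavior are illustrative assumptions, not necessarily what the training code does:

```python
import torch

MIX_LEN = 48080   # 3.005 s at 16 kHz, as noted above
REF_LEN = 64080   # 4.005 s at 16 kHz, as noted above

def clip_or_pad(wav: torch.Tensor, target_len: int) -> torch.Tensor:
    """Crop a waveform to target_len samples (zero-pad if shorter).

    Illustrative helper only; the repository may crop differently
    (e.g. with a fixed offset instead of a random one).
    """
    if wav.shape[-1] >= target_len:
        start = torch.randint(0, wav.shape[-1] - target_len + 1, (1,)).item()
        return wav[..., start:start + target_len]
    return torch.nn.functional.pad(wav, (0, target_len - wav.shape[-1]))

# mix = clip_or_pad(mix, MIX_LEN)
# ref = clip_or_pad(ref, REF_LEN)
```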
We propose TSELM, a novel target speaker extraction network that leverages discrete tokens and language models. TSELM utilizes multiple discretized layers from WavLM as input tokens and incorporates cross-attention mechanisms to integrate target speaker information. Language models are employed to capture the sequence dependencies, while a scalable HiFi-GAN is used to reconstruct the audio from the tokens. By applying a cross-entropy loss, TSELM models the probability distribution of output tokens, thus converting the complex regression problem of audio generation into a classification task. Experimental results show that TSELM achieves excellent results in speech quality and comparable results in speech intelligibility.
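To make the pipeline concrete, below is a toy, hedged sketch of the forward pass described above. Every submodule is a stand-in (a plain embedding, vanilla attention and Transformer layers) rather than the frozen WavLM + k-means tokenizer, trained language model, and scalable HiFi-GAN used in the actual model:

```python
import torch
import torch.nn as nn

class ToyTSELM(nn.Module):
    """Toy sketch of the TSELM pipeline: discrete input tokens, cross-attention
    to inject target-speaker information, a language model over the fused
    sequence, and per-frame token logits trained with cross-entropy."""

    def __init__(self, vocab_size: int = 1000, dim: int = 256):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, dim)      # stands in for WavLM + k-means tokens
        self.cross_attn = nn.MultiheadAttention(dim, 4, batch_first=True)
        self.lm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, 4, batch_first=True), num_layers=2
        )
        self.head = nn.Linear(dim, vocab_size)               # per-frame token logits

    def forward(self, mix_tokens, ref_tokens):
        mix = self.token_emb(mix_tokens)                      # (B, T_mix, D)
        ref = self.token_emb(ref_tokens)                      # (B, T_ref, D)
        fused, _ = self.cross_attn(mix, ref, ref)             # inject target-speaker info
        return self.head(self.lm(fused))                      # (B, T_mix, vocab); in the real model,
                                                              # predicted tokens are decoded by HiFi-GAN

# Toy usage: cross-entropy over discrete target tokens.
model = ToyTSELM()
mix_tokens = torch.randint(0, 1000, (2, 150))
ref_tokens = torch.randint(0, 1000, (2, 200))
target_tokens = torch.randint(0, 1000, (2, 150))
loss = nn.functional.cross_entropy(
    model(mix_tokens, ref_tokens).transpose(1, 2), target_tokens
)
```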
Install the dependencies listed in requirements.txt.

Before running experiments, we need to download the following frozen pretrained models:
Name | Link | Result |
---|---|---|
WavLM Large | https://huggingface.co/microsoft/wavlm-large/tree/main | wavlm-large |
Kmeans | Download Kmeans Checkpoint | kmeans_ckpt |
Scalable HiFiGAN | Download HiFiGAN Checkpoint | hifigan-wavlm-l1-3-7-18-23-k1000-LibriTTS |
Note that for WavLM Large, it is recommended to clone the whole repository or download the entire directory. The Kmeans and Scalable HiFiGAN checkpoints need to be extracted after downloading.
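For example, the WavLM Large directory can be fetched programmatically with the Hugging Face Hub client (the local path below is a placeholder; the Kmeans and HiFiGAN archives come from the links in the table above and are not covered here):

```python
from huggingface_hub import snapshot_download

# Download the whole microsoft/wavlm-large repository, as recommended above.
wavlm_dir = snapshot_download(
    repo_id="microsoft/wavlm-large",
    local_dir="./pretrained/wavlm-large",  # placeholder destination
)
print("WavLM Large downloaded to", wavlm_dir)
```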
The training config is specified using the hyperpyyaml package, which essentially works by reflection: objects referenced in the YAML are instantiated when the config is loaded. The config for training TSELM-L can be found in config/tselm_l.yaml. Before training, you need to specify the paths to the frozen pretrained models and other training details in this config. Details can be found in config/tselm_l.yaml and config/README.md.
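To illustrate what the reflection means in practice, a hyperpyyaml config can be loaded into live Python objects roughly as follows (a minimal sketch; the exact keys in config/tselm_l.yaml may differ):

```python
from hyperpyyaml import load_hyperpyyaml

# Loading the config instantiates any !new:/!name: references in the YAML,
# so entries such as the model can come back as constructed Python objects.
with open("./config/tselm_l.yaml") as f:
    cfg = load_hyperpyyaml(f)

model = cfg["model"]  # the "model" field is also required by inference.py (see below)
```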
After configuration, you can run:

```shell
## Train the model using the config
python train.py --config_path ./config/tselm_l.yaml --log ./log --ckpt_path ./ckpt/tselm_l
```
- `--config_path` specifies the path to the config file.
- `--log` specifies the log output directory. All logs will be put here.
- `--ckpt_path` specifies the checkpoint directory. Training can be resumed using the same checkpoint path.

After training, the best model will be saved in the checkpoint directory.
To run inference with our model on the Libri2Mix test set, for example, you can run:

```shell
## Generate output audio on libri2mix testset
python inference.py -scp <path_to_libri2mix_test_scp_folder> \
    -config ./config/tselm_l.yaml \
    -ckpt <path_to_ckpt> \
    --output <path_to_output_folder> \
    -gpus cuda:0 cuda:1 cuda:2 cuda:3 \
    -proc 8
```
- `-scp` specifies the path to the Libri2Mix test set folder containing `aux_s1.scp`, `s1.scp`, and `mix_clean.scp`.
- `-config` specifies the config. This config needs to have the `model` field.
- `-ckpt` specifies the model checkpoint.
- `--output` specifies the output directory. The output audio will be written to this folder, with the same names as in the .scp files.
- `-gpus` specifies the available GPUs to run inference on.
- `-proc` specifies the total number of processes used to run inference in parallel. The processes are divided equally across the provided GPUs, and the data is split equally across the processes (see the sketch below).
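For intuition, here is a minimal sketch of that splitting scheme (round-robin GPU assignment and an even split of the utterance list); it only illustrates the behavior described above and is not the actual logic in inference.py:

```python
# Minimal sketch: assign worker processes to the given GPUs round-robin
# and give each worker an (almost) equal slice of the utterance list.
gpus = ["cuda:0", "cuda:1", "cuda:2", "cuda:3"]   # from -gpus
n_proc = 8                                        # from -proc
utt_ids = [f"utt{i:04d}" for i in range(1000)]    # placeholder for entries read from the .scp files

def split_evenly(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    k, r = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + k + (1 if i < r else 0)
        chunks.append(items[start:end])
        start = end
    return chunks

for rank, chunk in enumerate(split_evenly(utt_ids, n_proc)):
    device = gpus[rank % len(gpus)]   # processes are divided equally over the GPUs
    print(f"process {rank}: {len(chunk)} utterances on {device}")
```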
Our TSELM-L checkpoint can be downloaded here. You can run inference on the Libri2Mix test set by substituting `-ckpt` with the path to this checkpoint.
Note that you still need to download the pretrained models and add the corresponding checkpoint folder to config/tselm_l.yaml.