Code for ACL 2024 paper "Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?".
Listen to ComSpeech's translated speech
ComSpeech is a general composite S2ST model architecture, which can seamlessly integrate any pretrained S2TT and TTS models into a direct S2ST model.
ComSpeech surpasses previous two-pass models like UnitY and Translatotron 2 in both translation quality and decoding speed.
With our proposed training strategy ComSpeech-ZS, we achieve performance comparable to supervised training without using any parallel speech data.
We also have some other projects on speech-to-speech translation that you might be interested in:
StreamSpeech (ACL 2024): An "All in One" seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.
NAST-S2x (ACL 2024): A fast and end-to-end simultaneous speech-to-text/speech translation model.
DASpeech (NeurIPS 2023): A non-autoregressive two-pass direct speech-to-speech translation model with high-quality translations and fast decoding speed.
CTC-S2UT (ACL 2024 Findings): A non-autoregressive textless speech-to-speech translation model with up to 26.81× decoding speedup.
python==3.8, torch==2.1.2
Install fairseq:
cd fairseq
pip install -e .
Download the CoVoST 2 Fr/De/Es-En and CVSS-C X-En (21 languages in total) datasets and place them in the data/ directory.
Download our released data manifests from Huggingface, and also place them in the data/ directory. The directory should look like the following:
data
├── comspeech
│   ├── cvss_de_en
│   ├── cvss_es_en
│   ├── cvss_fr_en
│   └── cvss_x_en
├── covost2
│   └── fr
│       ├── clips
│       ├── dev.tsv
│       ├── invalidated.tsv
│       ├── other.tsv
│       ├── test.tsv
│       ├── train.tsv
│       └── validated.tsv
└── cvss-c
    └── fr-en
        └── mfa.tar.gz
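Before running feature extraction, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the `EXPECTED` list are hypothetical, and the list only covers the Fr-En entries shown in the tree.

```python
from pathlib import Path

# Relative paths copied from the directory tree above (Fr-En only).
# This list is an illustration; extend it for De-En / Es-En as needed.
EXPECTED = [
    "comspeech/cvss_fr_en",
    "covost2/fr/clips",
    "covost2/fr/train.tsv",
    "covost2/fr/dev.tsv",
    "covost2/fr/test.tsv",
    "cvss-c/fr-en/mfa.tar.gz",
]


def missing_entries(data_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries("data")
    if missing:
        print("Missing entries under data/:")
        for rel in missing:
            print("  " + rel)
    else:
        print("data/ layout looks complete for Fr-En.")
```

The check only tests for existence; it does not validate file contents.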
Extract fbank features for the source speech.
for src_lang in fr de es; do
    python ComSpeech/data_preparation/extract_src_features.py \
        --cvss-data-root data/cvss-c/ \
        --covost-data-root data/covost2/ \
        --output-root data/cvss-c/${src_lang}-en/src \
        --src-lang $src_lang
done
Extract mel-spectrogram, duration, pitch, and energy information for the target speech.
for src_lang in ar ca cy de es et fa fr id it ja lv mn nl pt ru sl sv-SE ta tr zh-CN; do
    mkdir -p data/cvss-c/${src_lang}-en/mfa_align
    tar -xzvf data/cvss-c/${src_lang}-en/mfa.tar.gz -C data/cvss-c/${src_lang}-en/mfa_align/
    python ComSpeech/data_preparation/extract_tgt_features.py \
        --audio-manifest-root data/cvss-c/${src_lang}-en/ \
        --output-root data/cvss-c/${src_lang}-en/tts \
        --textgrid-dir data/cvss-c/${src_lang}-en/mfa_align/speaker/
done
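The duration targets mentioned above come from the forced-alignment (MFA) TextGrid files. As an illustration of the general idea, the sketch below converts phone intervals in seconds into per-phone frame counts. It is not the code in extract_tgt_features.py: the function name is hypothetical, and the sample rate (22050) and hop length (256) are assumed values that may differ from the repository's settings.

```python
def intervals_to_durations(intervals, sample_rate=22050, hop_length=256):
    """Convert (start_sec, end_sec) phone intervals into frame counts.

    Each boundary is rounded to a frame index first, so for contiguous
    intervals the durations sum to the frame index of the last boundary
    minus that of the first.
    NOTE: sample_rate and hop_length are assumed defaults for illustration.
    """
    frames_per_second = sample_rate / hop_length
    durations = []
    for start, end in intervals:
        start_frame = int(round(start * frames_per_second))
        end_frame = int(round(end * frames_per_second))
        durations.append(end_frame - start_frame)
    return durations


# Three contiguous phone intervals (made-up timings).
phones = [(0.00, 0.08), (0.08, 0.21), (0.21, 0.50)]
print(intervals_to_durations(phones))
```

Rounding boundaries rather than individual lengths avoids accumulating drift between the phone durations and the mel-spectrogram length.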
Replace the paths in the files in the data/comspeech/ directory.
python ComSpeech/data_preparation/fill_data.py
Note
The following scripts use 4 RTX 3090 GPUs by default. You can adjust --update-freq, --max-tokens-st, --max-tokens, and --batch-size-tts depending on your available GPUs.
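In fairseq, the number of tokens contributing to one optimizer update is roughly the number of GPUs times --max-tokens times --update-freq, so a common way to adapt to fewer GPUs is to raise --update-freq until that product matches the reference setup. The sketch below shows the arithmetic. The function names are hypothetical, and the reference values (max tokens 10000, update frequency 2) are placeholders, not the values in the provided scripts; whether --max-tokens-st and --batch-size-tts should be scaled the same way is an assumption.

```python
def tokens_per_update(num_gpus, max_tokens, update_freq):
    """Approximate upper bound on tokens seen per optimizer update."""
    return num_gpus * max_tokens * update_freq


def matching_update_freq(ref_gpus, ref_max_tokens, ref_update_freq,
                         my_gpus, my_max_tokens):
    """Smallest update_freq whose token budget reaches the reference budget."""
    target = tokens_per_update(ref_gpus, ref_max_tokens, ref_update_freq)
    per_step = my_gpus * my_max_tokens
    return -(-target // per_step)  # integer ceiling division


# Placeholder reference: 4 GPUs, max_tokens=10000, update_freq=2.
print(matching_update_freq(4, 10000, 2, my_gpus=1, my_max_tokens=10000))
print(matching_update_freq(4, 10000, 2, my_gpus=2, my_max_tokens=5000))
```

Keeping the product constant keeps the effective batch size close to the reference, which usually means the learning-rate settings can stay unchanged.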
In the supervised learning scenario, we first use the S2TT data and TTS data to pretrain the S2TT and TTS models respectively, and then finetune the entire model using the S2ST data. The following scripts are an example on the CVSS Fr-En dataset. For the De-En and Es-En directions, you only need to change the source language in the scripts.
Pretrain the S2TT model, and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en/checkpoint_best.pt.
bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.sh
Pretrain the TTS model, and the best checkpoint will be saved at ComSpeech/checkpoints/tts.fastspeech2.cvss-fr-en/checkpoint_best.pt.
bash ComSpeech/train_scripts/tts/train.tts.fastspeech2.cvss-fr-en.sh
Finetune the entire model using the S2ST data, and the checkpoints will be saved at ComSpeech/checkpoints/s2st.fr-en.comspeech.
bash ComSpeech/train_scripts/s2st/train.s2st.fr-en.comspeech.sh
Average the 5 best checkpoints and test the results on the test set.
bash ComSpeech/test_scripts/generate.fr-en.comspeech.sh
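Checkpoint averaging takes the element-wise mean of the parameters of several checkpoints. The sketch below illustrates the idea only; it is not the code run by the script above. The function names are hypothetical, and it assumes that "best" means lowest validation loss, which this README does not state. The averaging function works on any values that support `+` and `/`, such as floats, NumPy arrays, or floating-point torch tensors.

```python
def top_k_checkpoints(val_losses, k=5):
    """Return the k checkpoint names with the lowest validation loss.

    ASSUMPTION: lower loss is better; the real selection criterion may differ.
    """
    return sorted(val_losses, key=val_losses.get)[:k]


def average_state_dicts(state_dicts):
    """Element-wise mean of parameter dictionaries with identical keys."""
    if not state_dicts:
        raise ValueError("need at least one checkpoint")
    keys = set(state_dicts[0])
    for sd in state_dicts[1:]:
        if set(sd) != keys:
            raise ValueError("checkpoints have different parameter names")
    n = len(state_dicts)
    return {k: sum(sd[k] for sd in state_dicts) / n for k in state_dicts[0]}


# Toy example with scalar "parameters".
ckpts = [{"w": 1.0, "b": 0.0}, {"w": 3.0, "b": 1.0}]
print(average_state_dicts(ckpts))
```

With real checkpoints, each dictionary would be the model state loaded from a checkpoint file rather than a hand-written dict.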
Note
To run inference, you need to download the pretrained HiFi-GAN vocoder from this link and place it in the hifi-gan/ directory.
In the zero-shot learning scenario, we first pretrain the S2TT model using CVSS Fr/De/Es-En S2TT data, and pretrain the TTS model using CVSS X-En TTS (X∉{Fr,De,Es}) data. Then, we finetune the entire model in two stages using these two parts of the data.
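The split between the two data sources can be made concrete with the 21 CVSS source languages used in the feature-extraction loop. The sketch below is illustrative only (the variable names are hypothetical): it lists the language pairs whose English target speech serves as TTS data, which excludes the three directions used for S2TT.

```python
# The 21 CVSS-C source languages, as listed in the extraction loop.
CVSS_SOURCE_LANGS = (
    "ar ca cy de es et fa fr id it ja lv mn nl pt ru sl sv-SE ta tr zh-CN"
).split()

# Directions used for S2TT pretraining and for evaluation.
S2TT_LANGS = ["fr", "de", "es"]

# TTS pretraining uses English speech from the remaining X-En pairs only,
# so no source speech is paired with target speech for Fr/De/Es-En.
tts_langs = [lang for lang in CVSS_SOURCE_LANGS if lang not in S2TT_LANGS]

print(len(CVSS_SOURCE_LANGS), len(tts_langs))
print(tts_langs)
```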
Pretrain the S2TT model, and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en/checkpoint_best.pt.
bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.sh
Pretrain the TTS model, and the best checkpoint will be saved at ComSpeech/checkpoints/tts.fastspeech2.cvss-x-en/checkpoint_best.pt (note: this checkpoint is used for experiments on all language pairs in the zero-shot learning scenario).
bash ComSpeech/train_scripts/tts/train.tts.fastspeech2.cvss-x-en.sh
Finetune the S2TT model and the vocabulary adaptor using S2TT data (stage 1), and the best checkpoint will be saved at ComSpeech/checkpoints/st.cvss.fr-en.ctc/checkpoint_best.pt.
bash ComSpeech/train_scripts/st/train.st.cvss.fr-en.ctc.sh
Finetune the entire model using both S2TT and TTS data (stage 2), and the checkpoints will be saved at ComSpeech/checkpoints/s2st.fr-en.comspeech-zs.
bash ComSpeech/train_scripts/s2st/train.s2st.fr-en.comspeech-zs.sh
Average the 5 best checkpoints and test the results on the test set.
bash ComSpeech/test_scripts/generate.fr-en.comspeech-zs.sh
We have released the checkpoints for each of the above steps. You can download them from HuggingFace.
Supervised learning scenario (ComSpeech):

Directions | S2TT Pretrain | TTS Pretrain | ComSpeech |
---|---|---|---|
Fr-En | [download] | [download] | [download] |
De-En | [download] | [download] | [download] |
Es-En | [download] | [download] | [download] |

Zero-shot learning scenario (ComSpeech-ZS):

Directions | S2TT Pretrain | TTS Pretrain | 1-stage Finetune | 2-stage Finetune |
---|---|---|---|---|
Fr-En | [download] | [download] | [download] | [download] |
De-En | [download] | [download] | [download] | [download] |
Es-En | [download] | [download] | [download] | [download] |
If you have any questions, please feel free to submit an issue or contact [email protected].
If our work is useful for you, please cite as:
@inproceedings{fang-etal-2024-can,
    title = {Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?},
    author = {Fang, Qingkai and Zhang, Shaolei and Ma, Zhengrui and Zhang, Min and Feng, Yang},
    booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics},
    year = {2024},
}