Implementation of Spear-TTS - multi-speaker text-to-speech attention network, in PyTorch
The text-to-semantic module built here will be used by SoundStorm for conditioning.
Thanks to Stability for their generous sponsorship to work on and open source cutting-edge artificial intelligence research.
Thanks to Lucas Newman for completing the backtranslation portion, as well as beam search decoding!
Thanks to Lucas Newman for completing the final text-to-semantic transformer training code!
$ pip install spear-tts-pytorch
import torch
from audiolm_pytorch import HubertWithKmeans
from spear_tts_pytorch import (
    TextToSemantic,
    SemanticToTextDatasetGenerator,
    GeneratedAudioTextDataset,
    MockDataset
)
# hubert with kmeans, for deriving semantic tokens from audio

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert_base_ls960.pt',
    kmeans_path = './hubert_base_ls960_L9_km500.bin'
)
# text-to-semantic transformer

model = TextToSemantic(
    wav2vec = wav2vec,
    dim = 512,
    num_text_token_ids = 256,
    heads = 8,
    target_kv_heads = 2,  # grouped query attention, for memory efficient decoding (see the sketch after this example)
    source_depth = 1,
    target_depth = 1
)
# generate pseudo-labelled audio-text pairs from a (mock) audio dataset

ds = MockDataset(10)

dataset_generator = SemanticToTextDatasetGenerator(
    model = model,
    dataset = ds,
    folder = './output_folder'
)

dataset_generator(max_length = 2)
# the generated pairs can then be loaded back as a dataset

generated_dataset = GeneratedAudioTextDataset(
    folder = './output_folder'
)

assert len(generated_dataset) == 10
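The target_kv_heads setting above enables grouped-query attention, where several query heads share a single key / value head so that the key / value cache shrinks proportionally during decoding. The following is a minimal sketch of the idea in plain PyTorch (torch >= 2.0 for scaled_dot_product_attention); it is purely illustrative and not the library's internal implementation.

import torch
from torch import nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    def __init__(self, dim, heads = 8, kv_heads = 2, dim_head = 64):
        super().__init__()
        assert heads % kv_heads == 0
        self.heads = heads
        self.kv_heads = kv_heads
        self.dim_head = dim_head
        self.to_q = nn.Linear(dim, heads * dim_head, bias = False)
        self.to_kv = nn.Linear(dim, 2 * kv_heads * dim_head, bias = False)
        self.to_out = nn.Linear(heads * dim_head, dim, bias = False)

    def forward(self, x):
        b, n, _ = x.shape

        q = self.to_q(x).view(b, n, self.heads, self.dim_head).transpose(1, 2)

        k, v = self.to_kv(x).chunk(2, dim = -1)
        k = k.view(b, n, self.kv_heads, self.dim_head).transpose(1, 2)
        v = v.view(b, n, self.kv_heads, self.dim_head).transpose(1, 2)

        # each group of (heads // kv_heads) query heads shares one kv head,
        # cutting kv cache memory by the same factor at decoding time
        k = k.repeat_interleave(self.heads // self.kv_heads, dim = 1)
        v = v.repeat_interleave(self.heads // self.kv_heads, dim = 1)

        out = F.scaled_dot_product_attention(q, k, v, is_causal = True)
        out = out.transpose(1, 2).reshape(b, n, -1)
        return self.to_out(out)

With heads = 8 and target_kv_heads = 2 as in the usage example, each group of 4 query heads shares one key / value head, a 4x reduction in key / value cache memory.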
add eos logic + generate, and hook up end-to-end generation in SoundStorm
add the first speech-to-speech pretraining step, with reconstruction of 60% deleted tokens
add dropouts for this project, as it is low-resource
add total flexibility over which layers of the encoder / decoder to freeze during training
add a step for training on a small speech -> text corpus and generating a pseudo-labelled dataset + finetuning (thanks to @lucasnewman)
add the final step of finetuning on text -> speech + the pseudo-labelled dataset
figure out the best way to store and manage the pseudo-labelled generated dataset
batched beam search decoding
allow for using rotary positions in decoder + flash attention, give Tri another citation
integrate speculative decoding with some improvisation - done in the same model using an early exit strategy (see the first sketch after this list)
add cached key / values to start with + single / grouped key values; make sure flash attention can support a specialized causal mask before flash attention 2 lands in PyTorch core (see the second sketch after this list)
polish the audio-text generation workflow
concatenating the real audio-text dataset with the generated one -> or being able to convert the real audio-text dataset to the generated format
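For the speculative decoding item above, here is a conceptual sketch of the greedy variant: draft a few tokens cheaply from an early exit of the same model, then verify them all with a single full-depth forward pass. The model.logits interface taking an early_exit_layer argument is hypothetical, used only for illustration (it is not this library's API), and batch size 1 is assumed for clarity.

import torch

@torch.no_grad()
def speculative_step(model, tokens, gamma = 4, early_exit_layer = 2):
    # 1. draft `gamma` tokens cheaply from the shallow early-exit head
    draft = tokens
    for _ in range(gamma):
        logits = model.logits(draft, early_exit_layer = early_exit_layer)  # hypothetical interface
        draft = torch.cat((draft, logits[:, -1].argmax(-1, keepdim = True)), dim = -1)

    # 2. verify all drafted tokens with one full-depth forward pass
    full_logits = model.logits(draft)
    verified = full_logits[:, -(gamma + 1):-1].argmax(-1)  # full model's picks at the drafted positions
    proposed = draft[:, -gamma:]

    # 3. accept the longest matching prefix, then append one token from the full model
    matches = (verified == proposed).long().cumprod(dim = -1).sum().item()
    accepted = draft[:, :tokens.shape[-1] + matches]
    next_token = full_logits[:, tokens.shape[-1] + matches - 1].argmax(-1, keepdim = True)
    return torch.cat((accepted, next_token), dim = -1)

At least one token is emitted per full-depth pass, and the output matches greedy decoding of the full model exactly; the speedup comes from the drafted tokens that pass verification.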
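Likewise, for the cached key / value item, below is a minimal sketch of incremental decoding with a key / value cache, where after the prompt is processed only the newest token is fed through the network each step. The model.step(inp, cache = cache) interface returning logits plus an updated cache is hypothetical, for illustration only.

import torch

@torch.no_grad()
def decode_with_kv_cache(model, prompt_ids, max_length = 256, eos_id = 0):
    tokens, cache = prompt_ids, None
    for _ in range(max_length):
        # the first step feeds the full prompt to populate the cache;
        # afterwards only the newest token needs a forward pass
        inp = tokens if cache is None else tokens[:, -1:]
        logits, cache = model.step(inp, cache = cache)  # hypothetical interface
        next_token = logits[:, -1].argmax(dim = -1, keepdim = True)
        tokens = torch.cat((tokens, next_token), dim = -1)
        if (next_token == eos_id).all():
            break
    return tokens

Combined with grouped key / value heads, the cache holds kv_heads rather than heads sets of keys and values per layer.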
@misc{kharitonov2023speak,
    title = {Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision},
    author = {Eugene Kharitonov and Damien Vincent and Zalán Borsos and Raphaël Marinier and Sertan Girgin and Olivier Pietquin and Matt Sharifi and Marco Tagliasacchi and Neil Zeghidour},
    year = {2023},
    eprint = {2302.03540},
    archivePrefix = {arXiv},
    primaryClass = {cs.SD}
}
@inproceedings{dao2022flashattention,
    title = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year = {2022}
}
@misc{shi2023enhance,
    title = {Enhance audio generation controllability through representation similarity regularization},
    author = {Yangyang Shi and Gael Le Lan and Varun Nagaraja and Zhaoheng Ni and Xinhao Mei and Ernie Chang and Forrest Iandola and Yang Liu and Vikas Chandra},
    year = {2023},
    eprint = {2309.08773},
    archivePrefix = {arXiv},
    primaryClass = {cs.SD}
}
@article{Ainslie2023GQATG,
    title = {GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints},
    author = {Joshua Ainslie and James Lee-Thorp and Michiel de Jong and Yury Zemlyanskiy and Federico Lebr{\'o}n and Sumit K. Sanghai},
    journal = {ArXiv},
    year = {2023},
    volume = {abs/2305.13245},
    url = {https://api.semanticscholar.org/CorpusID:258833177}
}
@inproceedings{Leviathan2022FastIF,
    title = {Fast Inference from Transformers via Speculative Decoding},
    author = {Yaniv Leviathan and Matan Kalman and Y. Matias},
    booktitle = {International Conference on Machine Learning},
    year = {2022},
    url = {https://api.semanticscholar.org/CorpusID:254096365}
}