
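NaturalSpeech 2 - Pytorch (wip)

Implementation of NaturalSpeech 2, a zero-shot speech and singing synthesizer, in Pytorch

NaturalSpeech 2 is a TTS system that combines a neural audio codec with continuous latent vectors and a latent diffusion model with non-autoregressive generation to enable natural, zero-shot text-to-speech synthesis.

This repository will use denoising diffusion rather than a score-based SDE, and may also offer an elucidated version. It will also include improvements to the attention / transformer components wherever applicable.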
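
As a rough illustration of the denoising-diffusion formulation mentioned above, the sketch below applies the standard DDPM forward (noising) step to a toy latent vector. The linear beta schedule, its endpoints, and the helper names (`linear_beta_schedule`, `cumulative_alphas`, `q_sample`) are assumptions made for illustration and are not taken from this repository.

```python
import math
import random

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances (the endpoints are assumed values)."""
    if timesteps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(latent, t, alpha_bars, rng):
    """Noise a clean latent to timestep t: sqrt(a_bar) * x0 + sqrt(1 - a_bar) * eps."""
    a_bar = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in latent]
    noised = [
        math.sqrt(a_bar) * x + math.sqrt(1.0 - a_bar) * e
        for x, e in zip(latent, noise)
    ]
    return noised, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
latent = [0.5, -1.0, 2.0]          # stand-in for one codec latent frame
noised, noise = q_sample(latent, 999, alpha_bars, rng)
```

At the last timestep almost no signal remains, so `noised` is close to pure Gaussian noise; a denoising network is trained to undo this process step by step.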

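Appreciation

Stability and Huggingface for their generous sponsorship to work on and open source cutting-edge artificial intelligence research
Huggingface for their amazing accelerate library
Manmay for submitting the initial code for the phoneme, pitch, duration, and speech prompt encoders, as well as the multilingual phonemizer and phoneme aligner!
Manmay for wiring up the complete end-to-end conditioning of the diffusion network!
You? If you are an aspiring ML / AI engineer or work in the TTS field and would like to contribute to open sourcing the state of the art, jump right in!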

Install

$ pip install naturalspeech2-pytorch

Usage

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
).cuda()

# mock raw audio data

raw_audio = torch.randn(4, 327680).cuda()

loss = diffusion(raw_audio)
loss.backward()

# do the above in a loop for a lot of raw audio data...
# then you can sample from your generative model as so

generated_audio = diffusion.sample(length = 1024) # (1, 327680)
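
The comment in the example pairs `length = 1024` with 327680 output samples, which works out to 320 audio samples per latent frame. The sketch below only does that arithmetic; the helper names are hypothetical, the factor of 320 is inferred from the numbers above rather than read from the library, and the 24 kHz sample rate is an assumption.

```python
# samples per latent frame, inferred from length = 1024 -> 327680 samples (assumption)
SAMPLES_PER_FRAME = 327680 // 1024

def frames_to_samples(num_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Number of raw audio samples covered by `num_frames` latent frames."""
    return num_frames * samples_per_frame

def samples_to_frames(num_samples, samples_per_frame=SAMPLES_PER_FRAME):
    """Number of whole latent frames in `num_samples`; a trailing partial frame is dropped."""
    return num_samples // samples_per_frame

def frames_to_seconds(num_frames, sample_rate=24000, samples_per_frame=SAMPLES_PER_FRAME):
    """Duration in seconds, assuming a 24 kHz sample rate."""
    return frames_to_samples(num_frames, samples_per_frame) / sample_rate

num_samples = frames_to_samples(1024)
duration = frames_to_seconds(1024)
```

Under these assumptions, choosing `length` when sampling amounts to choosing the duration of the generated clip.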

With conditioning

ex.

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2,
    SpeechPromptEncoder
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6,
    dim_prompt = 512,
    cond_drop_prob = 0.25,                  # dropout prompt conditioning with this probability, for classifier free guidance
    condition_on_prompt = True
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
)

# mock raw audio data

raw_audio = torch.randn(4, 327680)
prompt = torch.randn(4, 32768)               # they randomly excised a range on the audio for the prompt during training, eventually will take care of this auto-magically

text = torch.randint(0, 100, (4, 100))
text_lens = torch.tensor([100, 50, 80, 100])

# forwards and backwards

loss = diffusion(
    audio = raw_audio,
    text = text,
    text_lens = text_lens,
    prompt = prompt
)

loss.backward()

# after much training

generated_audio = diffusion.sample(
    length = 1024,
    text = text,
    prompt = prompt
) # (1, 327680)
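
`cond_drop_prob` drops the prompt conditioning for a fraction of training examples so that one network learns both a conditional and an unconditional prediction, which classifier-free guidance then combines at sampling time. Below is a minimal sketch of those two pieces using plain Python lists. The function names and the guidance scale are hypothetical, and the library's own sampling code may differ.

```python
import random

def keep_conditioning_mask(batch_size, cond_drop_prob, rng):
    """True where an example keeps its prompt conditioning during training."""
    return [rng.random() >= cond_drop_prob for _ in range(batch_size)]

def guided_prediction(cond_pred, uncond_pred, guidance_scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), elementwise."""
    return [u + guidance_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

rng = random.Random(0)
mask = keep_conditioning_mask(10000, 0.25, rng)
keep_fraction = sum(mask) / len(mask)      # close to 1 - cond_drop_prob

cond = [1.0, 2.0, 3.0]                     # toy conditional model output
uncond = [0.5, 2.0, 2.0]                   # toy unconditional model output
same_as_cond = guided_prediction(cond, uncond, 1.0)
amplified = guided_prediction(cond, uncond, 3.0)
```

A scale of 1 reproduces the conditional prediction, while larger scales push the output further in the direction the conditioning suggests.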

Or, if you want a Trainer class to take care of the training and sampling loop, simply do

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()
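
With `train_batch_size = 16` and `gradient_accumulate_every = 2`, gradients from two batches are presumably combined before each optimizer step, for an effective batch size of 32. The sketch below shows the general technique on a one-parameter least-squares model in plain Python. It is not the `Trainer` implementation, and every name in it is hypothetical.

```python
def grad_mse(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, micro_batches, lr):
    """One optimizer step built from several equally sized micro-batches."""
    accum = len(micro_batches)
    grad = 0.0
    for batch in micro_batches:
        # scale each micro-batch gradient so the total is a mean over all examples
        grad += grad_mse(w, batch) / accum
    return w - lr * grad, grad

data = [(float(x), 3.0 * x) for x in range(1, 33)]   # 32 examples of y = 3x
micro_batches = [data[:16], data[16:]]               # 2 micro-batches of 16
effective_batch_size = len(micro_batches) * len(micro_batches[0])

w_accum, g_accum = accumulated_step(0.0, micro_batches, lr=1e-3)
w_full = 0.0 - 1e-3 * grad_mse(0.0, data)            # single step on all 32 examples
```

The accumulated update matches a single step on the full set of 32 examples, which is why accumulation is a common way to reach a larger batch size within a fixed memory budget.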

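Todo

complete perceiver, then cross attention conditioning on the ddpm side
add classifier free guidance, even if not in the paper
complete duration / pitch prediction during training - thanks to Manmay
make sure the pyworld way of computing pitch can also work
consult a PhD student in the TTS field about pyworld usage
also offer direct summation conditioning using the Spear-TTS text-to-semantic module, if available
add self-conditioning on the ddpm side
take care of automatic slicing of audio for the prompt, being aware of the minimum audio segment allowed by the codec model
make sure curtail_from_left works for Encodec, figure out what they are doing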
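
The prompt-slicing item above, together with the usage comment about randomly excising a range of the audio for the prompt, can be sketched as follows. This is a hypothetical helper, not the repository's code: the function name, the 320-sample frame size (inferred from the `length = 1024` and 327680-sample figures in the usage example), and the choice to align the slice to whole frames are all assumptions.

```python
import random

def random_prompt_slice(num_samples, prompt_len, frame_size=320, rng=random):
    """Pick a random [start, end) sample range for the prompt, aligned to whole frames."""
    # round the prompt length down to a whole number of frames, but keep at least one frame
    prompt_len = max(frame_size, (prompt_len // frame_size) * frame_size)
    if prompt_len > num_samples:
        raise ValueError("audio is shorter than the requested prompt")
    max_start_frame = (num_samples - prompt_len) // frame_size
    start = rng.randint(0, max_start_frame) * frame_size
    return start, start + prompt_len

rng = random.Random(0)
start, end = random_prompt_slice(327680, 32768, rng=rng)
prompt_samples = end - start   # 32768 rounded down to a multiple of 320
```

Aligning to frame boundaries is one way to respect a minimum segment length imposed by the codec; the actual constraint depends on the codec model in use.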

Citations

@inproceedings{Shen2023NaturalSpeech2L,
    title   = {NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers},
    author  = {Kai Shen and Zeqian Ju and Xu Tan and Yanqing Liu and Yichong Leng and Lei He and Tao Qin and Sheng Zhao and Jiang Bian},
    year    = {2023}
}

@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}

@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

@article{Salimans2022ProgressiveDF,
    title   = {Progressive Distillation for Fast Sampling of Diffusion Models},
    author  = {Tim Salimans and Jonathan Ho},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.00512}
}

@inproceedings{Hang2023EfficientDT,
    title   = {Efficient Diffusion Training via Min-SNR Weighting Strategy},
    author  = {Tiankai Hang and Shuyang Gu and Chen Li and Jianmin Bao and Dong Chen and Han Hu and Xin Geng and Baining Guo},
    year    = {2023}
}

@article{Alayrac2022FlamingoAV,
    title   = {Flamingo: a Visual Language Model for Few-Shot Learning},
    author  = {Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2204.14198}
}

@article{Badlani2021OneTA,
    title   = {One TTS Alignment to Rule Them All},
    author  = {Rohan Badlani and Adrian Lancucki and Kevin J. Shih and Rafael Valle and Wei Ping and Bryan Catanzaro},
    journal = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year    = {2021},
    pages   = {6092-6096},
    url     = {https://api.semanticscholar.org/CorpusID:237277973}
}