
Natural Speech 2 - Pytorch (wip)

Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch

NaturalSpeech 2 is a TTS system that leverages a neural audio codec with continuous latent vectors and a latent diffusion model with non-autoregressive generation to enable natural and zero-shot text-to-speech synthesis.

This repository will use denoising diffusion rather than score-based SDE, and may potentially offer an elucidated version as well. It will also offer improvements for the attention / transformer components wherever applicable.
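
To make the denoising-diffusion wording above concrete, here is a minimal sketch of the standard forward noising step, in plain Python. The linear beta schedule, its endpoints, and every function name are assumptions chosen for illustration; they are not taken from this repository, which may use a different schedule and parameterization.

```python
import math
import random

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances, one per timestep (assumed schedule)."""
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Draw x_t from N(sqrt(alpha_bar_t) * x0, (1 - alpha_bar_t) * I)."""
    a = alpha_bars[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

betas = linear_beta_schedule(1000)
alpha_bars = cumulative_alphas(betas)

rng = random.Random(0)
latent = [0.5, -1.0, 0.25, 2.0]  # stand-in for one codec latent frame
noisy, noise = q_sample(latent, t=500, alpha_bars=alpha_bars, rng=rng)
```

During training, a network is asked to recover the noise (or an equivalent target) from `noisy` and `t`. In this repository the thing being noised would be the codec latents rather than raw audio.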

Appreciation

Stability and Huggingface for their generous sponsorships to work on and open source cutting-edge artificial intelligence research
Huggingface for the amazing accelerate library
Manmay for submitting the initial code for the phoneme, pitch, duration, and speech prompt encoders, as well as the multilingual phonemizer and phoneme aligner!
Manmay for wiring up the complete end-to-end conditioning of the diffusion network!
You? If you are an aspiring ML / AI engineer or work in the TTS field and would like to contribute to open sourcing state-of-the-art, jump right in!

Install

$ pip install naturalspeech2-pytorch

Usage

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
).cuda()

# mock raw audio data

raw_audio = torch.randn(4, 327680).cuda()

loss = diffusion(raw_audio)
loss.backward()

# do the above in a loop for a lot of raw audio data...
# then you can sample from your generative model as so

generated_audio = diffusion.sample(length = 1024) # (1, 327680)
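
The shape comment in the example above implies that `length = 1024` latent frames decode to 327680 audio samples, i.e. 320 samples per frame. The sketch below turns that arithmetic into two small helpers. The helper names are hypothetical, the ratio is only inferred from the example, and the 24 kHz sample rate is an assumption; check both against the codec you actually use.

```python
import math

SAMPLES_PER_FRAME = 327680 // 1024  # 320, inferred from the example (assumption)
SAMPLE_RATE = 24_000                # assumed sample rate, adjust for your codec

def frames_to_samples(num_frames, samples_per_frame=SAMPLES_PER_FRAME):
    """Number of raw audio samples produced by `num_frames` latent frames."""
    return num_frames * samples_per_frame

def seconds_to_frames(seconds, sample_rate=SAMPLE_RATE,
                      samples_per_frame=SAMPLES_PER_FRAME):
    """Smallest frame count whose decoded audio covers at least `seconds`."""
    total_samples = math.ceil(seconds * sample_rate)
    return math.ceil(total_samples / samples_per_frame)

num_samples = frames_to_samples(1024)    # matches the (1, 327680) comment
frames_for_2s = seconds_to_frames(2.0)   # value to pass as `length` for ~2 s
```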

With conditioning

ex.

import torch
from naturalspeech2_pytorch import (
    EncodecWrapper,
    Model,
    NaturalSpeech2,
    SpeechPromptEncoder
)

# use encodec as an example

codec = EncodecWrapper()

model = Model(
    dim = 128,
    depth = 6,
    dim_prompt = 512,
    cond_drop_prob = 0.25,                  # dropout prompt conditioning with this probability, for classifier free guidance
    condition_on_prompt = True
)

# natural speech diffusion model

diffusion = NaturalSpeech2(
    model = model,
    codec = codec,
    timesteps = 1000
)

# mock raw audio data

raw_audio = torch.randn(4, 327680)
prompt = torch.randn(4, 32768)               # they randomly excised a range on the audio for the prompt during training, eventually will take care of this auto-magically

text = torch.randint(0, 100, (4, 100))
text_lens = torch.tensor([100, 50, 80, 100])

# forwards and backwards

loss = diffusion(
    audio = raw_audio,
    text = text,
    text_lens = text_lens,
    prompt = prompt
)

loss.backward()

# after much training

generated_audio = diffusion.sample(
    length = 1024,
    text = text,
    prompt = prompt
) # (1, 327680)
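
The `cond_drop_prob` comment in the example above refers to classifier free guidance. As a rough illustration of the general technique, and not of this repository's internals, the sketch below shows the two halves: randomly dropping the conditioning per batch element during training, and combining conditional and unconditional predictions at sampling time. The function names and the `cond_scale` parameter are hypothetical.

```python
import random

def keep_prompt_mask(batch_size, cond_drop_prob, rng):
    """Per-example flags; True means the prompt conditioning is kept."""
    return [rng.random() >= cond_drop_prob for _ in range(batch_size)]

def guided_prediction(cond_pred, uncond_pred, cond_scale):
    """Extrapolate from the unconditional toward the conditional prediction.

    cond_scale = 0 gives the unconditional output, 1 the conditional output,
    and values above 1 push further in the conditional direction.
    """
    return [u + cond_scale * (c - u) for c, u in zip(cond_pred, uncond_pred)]

rng = random.Random(0)
keep = keep_prompt_mask(batch_size=4, cond_drop_prob=0.25, rng=rng)

cond = [1.0, 2.0, 3.0]     # toy conditional prediction
uncond = [0.5, 1.0, 1.5]   # toy unconditional prediction
guided = guided_prediction(cond, uncond, cond_scale=3.0)
```

With a drop probability of 0.25, roughly one in four training examples sees no prompt, which is what lets a single network produce both predictions at sampling time.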

Or if you want a Trainer class to take care of the training and sampling loop, just simply do

from naturalspeech2_pytorch import Trainer

trainer = Trainer(
    diffusion_model = diffusion,     # diffusion model + codec from above
    folder = '/path/to/speech',
    train_batch_size = 16,
    gradient_accumulate_every = 2,
)

trainer.train()
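
With `train_batch_size = 16` and `gradient_accumulate_every = 2`, each optimizer step would see an effective batch of 32 examples, assuming the Trainer accumulates gradients in the usual way. The toy sketch below (plain Python, a one-parameter least-squares model, hypothetical names) shows why averaging gradients over equal-sized micro-batches matches one large batch. It does not reproduce the Trainer's actual code.

```python
def grad_mse(w, batch):
    """Gradient of mean((w * x - y)^2) with respect to w over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Sum micro-batch gradients, each scaled by 1 / number of steps."""
    steps = len(micro_batches)
    total = 0.0
    for batch in micro_batches:
        total += grad_mse(w, batch) / steps
    return total

train_batch_size = 16
gradient_accumulate_every = 2
effective_batch_size = train_batch_size * gradient_accumulate_every

# toy data for y = 3x, split into equal-sized micro-batches
data = [(float(i), 3.0 * i) for i in range(effective_batch_size)]
micro_batches = [
    data[i:i + train_batch_size]
    for i in range(0, len(data), train_batch_size)
]

w = 0.0
full_batch_grad = grad_mse(w, data)
accum_grad = accumulated_grad(w, micro_batches)
```

The equivalence holds exactly only when the micro-batches have equal size and the loss is a mean over examples.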

Todo

complete perceiver then cross attention conditioning on ddpm side
add classifier free guidance, even if not in paper
complete duration / pitch prediction during training - thanks to Manmay
make sure pyworld way of computing pitch can also work
consult phd student in TTS field about pyworld usage
also offer direct summation conditioning using spear-tts text-to-semantic module, if available
add self-conditioning on ddpm side
take care of automatic slicing of audio for prompt, being aware of minimal audio segment as allowed by the codec model
make sure curtail_from_left works for encodec, figure out what they are doing
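
The prompt-slicing item above could look something like the following sketch, which cuts a random window out of a clip while keeping both the window length and its start aligned to a minimal segment size. Everything here is hypothetical: the function name, the use of plain lists instead of tensors, and the segment size of 320 samples are assumptions for illustration, and the repository may end up doing this differently.

```python
import random

def excise_prompt(audio, prompt_len, min_segment, rng):
    """Cut a random contiguous window out of `audio` to use as the prompt.

    `prompt_len` is rounded down to a multiple of `min_segment`, and the
    window start is also a multiple of `min_segment`, so the prompt covers
    whole codec segments. Returns (prompt, remainder).
    """
    prompt_len = (prompt_len // min_segment) * min_segment
    if prompt_len <= 0 or prompt_len >= len(audio):
        raise ValueError("prompt must span at least one segment and be shorter than the audio")
    max_start = (len(audio) - prompt_len) // min_segment
    start = rng.randint(0, max_start) * min_segment
    prompt = audio[start:start + prompt_len]
    remainder = audio[:start] + audio[start + prompt_len:]
    return prompt, remainder

rng = random.Random(0)
audio = list(range(3200))  # stand-in for 3200 raw samples
prompt, remainder = excise_prompt(audio, prompt_len=1000, min_segment=320, rng=rng)
```

Whether training then uses the remainder or the full clip as the diffusion target is left to the caller in this sketch.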

Citations

@inproceedings{Shen2023NaturalSpeech2L,
    title   = {NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers},
    author  = {Kai Shen and Zeqian Ju and Xu Tan and Yanqing Liu and Yichong Leng and Lei He and Tao Qin and Sheng Zhao and Jiang Bian},
    year    = {2023}
}

@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}

@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

@article{Salimans2022ProgressiveDF,
    title   = {Progressive Distillation for Fast Sampling of Diffusion Models},
    author  = {Tim Salimans and Jonathan Ho},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2202.00512}
}

@inproceedings{Hang2023EfficientDT,
    title   = {Efficient Diffusion Training via Min-SNR Weighting Strategy},
    author  = {Tiankai Hang and Shuyang Gu and Chen Li and Jianmin Bao and Dong Chen and Han Hu and Xin Geng and Baining Guo},
    year    = {2023}
}

@article{Alayrac2022FlamingoAV,
    title   = {Flamingo: a Visual Language Model for Few-Shot Learning},
    author  = {Jean-Baptiste Alayrac and Jeff Donahue and Pauline Luc and Antoine Miech and Iain Barr and Yana Hasson and Karel Lenc and Arthur Mensch and Katie Millican and Malcolm Reynolds and Roman Ring and Eliza Rutherford and Serkan Cabi and Tengda Han and Zhitao Gong and Sina Samangooei and Marianne Monteiro and Jacob Menick and Sebastian Borgeaud and Andy Brock and Aida Nematzadeh and Sahand Sharifzadeh and Mikolaj Binkowski and Ricardo Barreira and Oriol Vinyals and Andrew Zisserman and Karen Simonyan},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2204.14198}
}

@article{Badlani2021OneTA,
    title   = {One TTS Alignment to Rule Them All},
    author  = {Rohan Badlani and Adrian Lancucki and Kevin J. Shih and Rafael Valle and Wei Ping and Bryan Catanzaro},
    journal = {ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
    year    = {2021},
    pages   = {6092-6096},
    url     = {https://api.semanticscholar.org/CorpusID:237277973}
}