musiclm pytorch Télécharger - musiclm pytorch Téléchargement du code source

MusicLM - Pytorch

Implémentation de MusicLM, le nouveau modèle SOTA de Google pour la génération de musique à l'aide de réseaux d'attention, dans Pytorch.

Ils utilisent essentiellement AudioLM conditionné en texte, mais étonnamment avec les intégrations d'un modèle appris contrastif texte-audio nommé MuLan. MuLan est ce qui sera construit dans ce référentiel, avec AudioLM modifié à partir de l'autre référentiel pour répondre aux besoins de génération de musique ici.

Veuillez nous rejoindre si vous souhaitez aider à la réplication avec la communauté LAION

Qu'est-ce que l'IA par Louis Bouchard

Appréciation

Stability.ai pour son généreux parrainage en faveur du travail et de la recherche de pointe en intelligence artificielle open source
? Huggingface pour leur bibliothèque de formation accélérée

Usage

 $ pip install musiclm-pytorch

Usage

MuLaN doit d'abord être formé

 import torch
from musiclm_pytorch import MuLaN , AudioSpectrogramTransformer , TextTransformer

audio_transformer = AudioSpectrogramTransformer (
    dim = 512 ,
    depth = 6 ,
    heads = 8 ,
    dim_head = 64 ,
    spec_n_fft = 128 ,
    spec_win_length = 24 ,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer (
    dim = 512 ,
    depth = 6 ,
    heads = 8 ,
    dim_head = 64
)

mulan = MuLaN (
    audio_transformer = audio_transformer ,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch . randn ( 2 , 1024 )
texts = torch . randint ( 0 , 20000 , ( 2 , 256 ))

loss = mulan ( wavs , texts )
loss . backward ()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan . get_audio_latents ( wavs )  # during training

embeds = mulan . get_text_latents ( texts )  # during inference

Pour obtenir les intégrations de conditionnement pour les trois transformateurs qui font partie d' AudioLM , vous devez utiliser le MuLaNEmbedQuantizer comme suit.

 from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer (
    mulan = mulan ,                          # pass in trained mulan from above
    conditioning_dims = ( 1024 , 1024 , 1024 ), # say all three transformers have model dimensions of 1024
    namespaces = ( 'semantic' , 'coarse' , 'fine' )
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch . randn ( 2 , 1024 )
conds = quantizer ( wavs = wavs , namespace = 'semantic' ) # (2, 8, 1024) - 8 is number of quantizers

Pour entraîner (ou affiner) les trois transformateurs qui font partie d' AudioLM , vous suivez simplement les instructions sur audiolm-pytorch pour la formation, mais transmettez l'instance MulanEmbedQuantizer aux classes de formation sous le mot-clé audio_conditioner

ex. SemanticTransformerTrainer

 import torch
from audiolm_pytorch import HubertWithKmeans , SemanticTransformer , SemanticTransformerTrainer

wav2vec = HubertWithKmeans (
    checkpoint_path = './hubert/hubert_base_ls960.pt' ,
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer (
    num_semantic_tokens = wav2vec . codebook_size ,
    dim = 1024 ,
    depth = 6 ,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformers)
). cuda ()

trainer = SemanticTransformerTrainer (
    transformer = semantic_transformer ,
    wav2vec = wav2vec ,
    audio_conditioner = quantizer ,   # pass in the MulanEmbedQuantizer instance above
    folder = '/path/to/audio/files' ,
    batch_size = 1 ,
    data_max_length = 320 * 32 ,
    num_train_steps = 1
)

trainer . train ()

Après une longue formation sur les trois transformateurs (sémantique, grossier, fin), vous transmettrez vos AudioLM et MuLaN affinés ou formés à partir de zéro, enveloppés dans MuLaNEmbedQuantizer au MusicLM

 # you need the trained AudioLM (audio_lm) from above
# with the MulanEmbedQuantizer (mulan_embed_quantizer)

from musiclm_pytorch import MusicLM

musiclm = MusicLM (
    audio_lm = audio_lm ,                 # `AudioLM` from https://github.com/lucidrains/audiolm-pytorch
    mulan_embed_quantizer = quantizer    # the `MuLaNEmbedQuantizer` from above
)

music = musiclm ( 'the crystalline sounds of the piano in a ballroom' , num_samples = 4 ) # sample 4 and pick the top match with mulan

Faire

Mulan semble utiliser l'apprentissage contrastif découplé, propose-le en option
envelopper mulan avec mulan wrapper et quantifier la sortie, projeter aux dimensions audiolm
modifier l'audiolm pour accepter les intégrations de conditionnement, prendre éventuellement en charge différentes dimensions via une projection séparée
audiolm et mulan vont dans musiclm et génèrent, filtrent avec mulan
donner un biais de position dynamique à l'attention personnelle dans l'AST
implémenter MusicLM en générant plusieurs échantillons et en sélectionnant la meilleure correspondance avec MuLaN
prend en charge l'audio de longueur variable avec masquage dans le transformateur audio
ajouter une version de Mulan pour ouvrir le clip
définir tous les hyperparamètres appropriés du spectrogramme

Citations

 @inproceedings { Agostinelli2023MusicLMGM ,
    title     = { MusicLM: Generating Music From Text } ,
    author    = { Andrea Agostinelli and Timo I. Denk and Zal{'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank } ,
    year      = { 2023 }
}

 @article { Huang2022MuLanAJ ,
    title   = { MuLan: A Joint Embedding of Music Audio and Natural Language } ,
    author  = { Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2208.12415 }
}

 @misc { https://doi.org/10.48550/arxiv.2302.01327 ,
    doi     = { 10.48550/ARXIV.2302.01327 } ,
    url     = { https://arxiv.org/abs/2302.01327 } ,
    author  = { Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil } ,
    title   = { Dual PatchNorm } ,
    publisher = { arXiv } ,
    year    = { 2023 } ,
    copyright = { Creative Commons Attribution 4.0 International }
}

 @article { Liu2022PatchDropoutEV ,
    title   = { PatchDropout: Economizing Vision Transformers Using Patch Dropout } ,
    author  = { Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2208.07220 }
}

 @misc { liu2021swin ,
    title   = { Swin Transformer V2: Scaling Up Capacity and Resolution } ,
    author  = { Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo } ,
    year    = { 2021 } ,
    eprint  = { 2111.09883 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { gilmer2023intriguing
    title  = { Intriguing Properties of Transformer Training Instabilities } ,
    author = { Justin Gilmer, Andrea Schioppa, and Jeremy Cohen } ,
    year   = { 2023 } ,
    status = { to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams }
}

 @inproceedings { Shukor2022EfficientVP ,
    title   = { Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment } ,
    author  = { Mustafa Shukor and Guillaume Couairon and Matthieu Cord } ,
    booktitle = { British Machine Vision Conference } ,
    year    = { 2022 }
}

 @inproceedings { Zhai2023SigmoidLF ,
    title   = { Sigmoid Loss for Language Image Pre-Training } ,
    author  = { Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer } ,
    year    = { 2023 }
}