musiclm pytorch Unduh - musiclm pytorch Unduh kode sumber

MusikLM - Pytorch

Implementasi MusicLM, model SOTA baru Google untuk pembuatan musik menggunakan jaringan perhatian, di Pytorch.

Mereka pada dasarnya menggunakan AudioLM yang dikondisikan teks, namun yang mengejutkan adalah dengan penyematan dari model pembelajaran kontrastif teks-audio bernama MuLan. MuLan inilah yang akan dibangun dalam repositori ini, dengan AudioLM dimodifikasi dari repositori lain untuk mendukung kebutuhan generasi musik di sini.

Silakan bergabung jika Anda tertarik membantu replikasi bersama komunitas LAION

Apa itu AI oleh Louis Bouchard

Apresiasi

Stability.ai atas sponsor yang murah hati untuk bekerja dan membuka sumber penelitian kecerdasan buatan yang canggih
? Huggingface untuk perpustakaan pelatihan akselerasi mereka

Penggunaan

 $ pip install musiclm-pytorch

Penggunaan

MuLaN perlu dilatih terlebih dahulu

 import torch
from musiclm_pytorch import MuLaN , AudioSpectrogramTransformer , TextTransformer

audio_transformer = AudioSpectrogramTransformer (
    dim = 512 ,
    depth = 6 ,
    heads = 8 ,
    dim_head = 64 ,
    spec_n_fft = 128 ,
    spec_win_length = 24 ,
    spec_aug_stretch_factor = 0.8
)

text_transformer = TextTransformer (
    dim = 512 ,
    depth = 6 ,
    heads = 8 ,
    dim_head = 64
)

mulan = MuLaN (
    audio_transformer = audio_transformer ,
    text_transformer = text_transformer
)

# get a ton of <sound, text> pairs and train

wavs = torch . randn ( 2 , 1024 )
texts = torch . randint ( 0 , 20000 , ( 2 , 256 ))

loss = mulan ( wavs , texts )
loss . backward ()

# after much training, you can embed sounds and text into a joint embedding space
# for conditioning the audio LM

embeds = mulan . get_audio_latents ( wavs )  # during training

embeds = mulan . get_text_latents ( texts )  # during inference

Untuk mendapatkan penyematan pengkondisian untuk tiga transformator yang merupakan bagian dari AudioLM , Anda harus menggunakan MuLaNEmbedQuantizer juga.

 from musiclm_pytorch import MuLaNEmbedQuantizer

# setup the quantizer with the namespaced conditioning embeddings, unique per quantizer as well as namespace (per transformer)

quantizer = MuLaNEmbedQuantizer (
    mulan = mulan ,                          # pass in trained mulan from above
    conditioning_dims = ( 1024 , 1024 , 1024 ), # say all three transformers have model dimensions of 1024
    namespaces = ( 'semantic' , 'coarse' , 'fine' )
)

# now say you want the conditioning embeddings for semantic transformer

wavs = torch . randn ( 2 , 1024 )
conds = quantizer ( wavs = wavs , namespace = 'semantic' ) # (2, 8, 1024) - 8 is number of quantizers

Untuk melatih (atau menyempurnakan) tiga trafo yang merupakan bagian dari AudioLM , cukup ikuti instruksi di audiolm-pytorch untuk pelatihan, namun teruskan instance MulanEmbedQuantizer ke kelas pelatihan dengan kata kunci audio_conditioner

mantan. SemanticTransformerTrainer

 import torch
from audiolm_pytorch import HubertWithKmeans , SemanticTransformer , SemanticTransformerTrainer

wav2vec = HubertWithKmeans (
    checkpoint_path = './hubert/hubert_base_ls960.pt' ,
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer (
    num_semantic_tokens = wav2vec . codebook_size ,
    dim = 1024 ,
    depth = 6 ,
    audio_text_condition = True      # this must be set to True (same for CoarseTransformer and FineTransformers)
). cuda ()

trainer = SemanticTransformerTrainer (
    transformer = semantic_transformer ,
    wav2vec = wav2vec ,
    audio_conditioner = quantizer ,   # pass in the MulanEmbedQuantizer instance above
    folder = '/path/to/audio/files' ,
    batch_size = 1 ,
    data_max_length = 320 * 32 ,
    num_train_steps = 1
)

trainer . train ()

Setelah banyak pelatihan pada ketiga transformator (semantik, kasar, halus), Anda akan meneruskan AudioLM dan MuLaN yang telah disempurnakan atau dilatih dari awal yang dibungkus dalam MuLaNEmbedQuantizer ke MusicLM

 # you need the trained AudioLM (audio_lm) from above
# with the MulanEmbedQuantizer (mulan_embed_quantizer)

from musiclm_pytorch import MusicLM

musiclm = MusicLM (
    audio_lm = audio_lm ,                 # `AudioLM` from https://github.com/lucidrains/audiolm-pytorch
    mulan_embed_quantizer = quantizer    # the `MuLaNEmbedQuantizer` from above
)

music = musiclm ( 'the crystalline sounds of the piano in a ballroom' , num_samples = 4 ) # sample 4 and pick the top match with mulan

Semua yang harus dilakukan

mulan tampaknya menggunakan pembelajaran kontrastif yang dipisahkan, menawarkan itu sebagai pilihan
bungkus mulan dengan pembungkus mulan dan kuantisasi hasilnya, proyeksikan ke dimensi audiolm
memodifikasi audiolm untuk menerima penyematan pengkondisian, secara opsional menangani dimensi yang berbeda melalui proyeksi terpisah
audiolm dan mulan masuk ke musiklm dan menghasilkan, memfilter dengan mulan
memberikan bias posisi dinamis terhadap perhatian diri di AST
mengimplementasikan MusicLM yang menghasilkan banyak sampel dan memilih yang paling cocok dengan MuLaN
mendukung audio berdurasi variabel dengan masking pada transformator audio
tambahkan versi mulan ke klip terbuka
atur semua hyperparameter spektogram yang tepat

Kutipan

 @inproceedings { Agostinelli2023MusicLMGM ,
    title     = { MusicLM: Generating Music From Text } ,
    author    = { Andrea Agostinelli and Timo I. Denk and Zal{'a}n Borsos and Jesse Engel and Mauro Verzetti and Antoine Caillon and Qingqing Huang and Aren Jansen and Adam Roberts and Marco Tagliasacchi and Matthew Sharifi and Neil Zeghidour and C. Frank } ,
    year      = { 2023 }
}

 @article { Huang2022MuLanAJ ,
    title   = { MuLan: A Joint Embedding of Music Audio and Natural Language } ,
    author  = { Qingqing Huang and Aren Jansen and Joonseok Lee and Ravi Ganti and Judith Yue Li and Daniel P. W. Ellis } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2208.12415 }
}

 @misc { https://doi.org/10.48550/arxiv.2302.01327 ,
    doi     = { 10.48550/ARXIV.2302.01327 } ,
    url     = { https://arxiv.org/abs/2302.01327 } ,
    author  = { Kumar, Manoj and Dehghani, Mostafa and Houlsby, Neil } ,
    title   = { Dual PatchNorm } ,
    publisher = { arXiv } ,
    year    = { 2023 } ,
    copyright = { Creative Commons Attribution 4.0 International }
}

 @article { Liu2022PatchDropoutEV ,
    title   = { PatchDropout: Economizing Vision Transformers Using Patch Dropout } ,
    author  = { Yue Liu and Christos Matsoukas and Fredrik Strand and Hossein Azizpour and Kevin Smith } ,
    journal = { ArXiv } ,
    year    = { 2022 } ,
    volume  = { abs/2208.07220 }
}

 @misc { liu2021swin ,
    title   = { Swin Transformer V2: Scaling Up Capacity and Resolution } ,
    author  = { Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo } ,
    year    = { 2021 } ,
    eprint  = { 2111.09883 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CV }
}

 @misc { gilmer2023intriguing
    title  = { Intriguing Properties of Transformer Training Instabilities } ,
    author = { Justin Gilmer, Andrea Schioppa, and Jeremy Cohen } ,
    year   = { 2023 } ,
    status = { to be published - one attention stabilization technique is circulating within Google Brain, being used by multiple teams }
}

 @inproceedings { Shukor2022EfficientVP ,
    title   = { Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment } ,
    author  = { Mustafa Shukor and Guillaume Couairon and Matthieu Cord } ,
    booktitle = { British Machine Vision Conference } ,
    year    = { 2022 }
}

 @inproceedings { Zhai2023SigmoidLF ,
    title   = { Sigmoid Loss for Language Image Pre-Training } ,
    author  = { Xiaohua Zhai and Basil Mustafa and Alexander Kolesnikov and Lucas Beyer } ,
    year    = { 2023 }
}