
DALL-E 2 - Pytorch

Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch.

Yannic Kilcher summary | AssemblyAI explainer

The main novelty seems to be an extra layer of indirection with the prior network (whether it is an autoregressive transformer or a diffusion network), which predicts an image embedding based on the text embedding from CLIP. Specifically, this repository will only build out the diffusion prior network, as it is the best performing variant (but which incidentally involves a causal transformer as the denoising network).

This model is SOTA for text-to-image for now.

Please join if you are interested in helping out with the replication with the LAION community | Yannic Interview

As of May 23, 2022, it is no longer SOTA. SOTA will be here. The Jax versions, as well as the text-to-video project, will be shifted towards the Imagen architecture, as it is much simpler.

Status

A research group has used the code in this repository to train a functional diffusion prior for their CLIP generations. They will share their work once they release their preprint. This, together with Katherine's own experiments, validates OpenAI's finding that the extra prior increases the variety of generations.
The decoder is now verified to work for unconditional generation on my experimental setup for Oxford flowers. Two researchers have also confirmed that the Decoder is working for them.

Ongoing at 21k steps

Justin Pinkney successfully trained the diffusion prior in the repository for his CLIP to Stylegan2 text-to-image application
Romain has scaled up training to 800 GPUs with the available scripts without any issues

Pre-Trained Models

LAION is training prior models. Checkpoints are available on Huggingface, and the training statistics are available on WANDB.
Decoder - in-progress test run
Decoder - another test run with sparse attention
DALL-E 2 - DALL-E 2 Laion repository

Appreciation

This library would not have gotten to this working state without the help of

Zion for the distributed training code for the diffusion prior
Aidan for the distributed training code for the decoder as well as the dataloaders
Kumar for working on the initial diffusion training script
Romain for the pull request reviews and project management
He Cao and xiankgx for the Q&A and for identifying critical bugs
Marunine for identifying issues with resizing of the low resolution conditioner, when training the upsampler, in addition to various other bug fixes
MalumaDev for proposing the use of the pixel shuffle upsampler to fix checkerboard artifacts
Katherine for her advice
Stability AI for the generous sponsorship
Huggingface, and in particular Sylvain, for the Accelerate library
Alex for einops, an indispensable tool for tensor manipulation

... and many others. Thank you!

Install

$ pip install dalle2-pytorch

Usage

Training DALLE-2 is a three-step process, with the training of CLIP being the most important

To train CLIP, you can either use the x-clip package, or join the LAION discord, where a lot of replication efforts are already underway.

This repository will demonstrate integration with x-clip for starters

import torch
from dalle2_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# train

loss = clip(
    text,
    images,
    return_loss = True              # needs to be set to True to return contrastive loss
)

loss.backward()

# do the above with as many texts and images as possible in a loop
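To give a feel for what a contrastive loss measures, here is a minimal pure-Python sketch of the plain symmetric InfoNCE objective used by the original CLIP. It is not the x-clip implementation and it ignores the FILIP, DCL, CLOOB and self-supervised options set above; the function names and the temperature of 0.07 are assumptions made for illustration.

```python
import math

def l2_normalize(vec):
    # scale a vector to unit length so dot products become cosine similarities
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def cross_entropy_rows(sim, temperature):
    # mean over rows i of -log softmax(sim[i] / temperature)[i]
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(sim)

def contrastive_loss(text_embeds, image_embeds, temperature=0.07):
    # row i of each input is assumed to describe the same text/image pair
    texts = [l2_normalize(t) for t in text_embeds]
    images = [l2_normalize(i) for i in image_embeds]
    sim = [[sum(a * b for a, b in zip(t, im)) for im in images] for t in texts]
    sim_t = [list(col) for col in zip(*sim)]
    # average the text-to-image and image-to-text directions
    return 0.5 * (cross_entropy_rows(sim, temperature) + cross_entropy_rows(sim_t, temperature))

matched = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is near zero when each text embedding lines up with its own image embedding, and large when the pairs are swapped.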

Then, you will need to train the decoder, which learns to generate images based on the image embedding coming from the trained CLIP above

import torch
from dalle2_pytorch import Unet, Decoder, CLIP

# trained clip from step 1

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# unet for the decoder

unet = Unet(
    dim = 128,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
).cuda()

# decoder, which contains the unet and clip

decoder = Decoder(
    unet = unet,
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

# mock images (get a lot of this)

images = torch.randn(4, 3, 256, 256).cuda()

# feed images into decoder

loss = decoder(images)
loss.backward()

# do the above for many many many many steps
# then it will learn to generate images based on the CLIP image embeddings
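The `image_cond_drop_prob` and `text_cond_drop_prob` arguments appear to control how often the conditioning is dropped during training, which is what later makes classifier-free guidance possible. The sketch below illustrates that idea in plain Python under this assumption; `keep_masks`, `apply_mask` and the null embedding are hypothetical names, not part of the library.

```python
import random

def keep_masks(batch_size, image_cond_drop_prob=0.1, text_cond_drop_prob=0.5, rng=None):
    # draw one keep/drop decision per sample and per conditioning signal
    rng = rng or random.Random()
    image_keep = [rng.random() >= image_cond_drop_prob for _ in range(batch_size)]
    text_keep = [rng.random() >= text_cond_drop_prob for _ in range(batch_size)]
    return image_keep, text_keep

def apply_mask(embeds, keep, null_embed):
    # replace dropped conditioning with a placeholder "null" embedding
    return [embed if kept else null_embed for embed, kept in zip(embeds, keep)]

image_keep, text_keep = keep_masks(4, rng=random.Random(0))
masked = apply_mask([[1.0], [2.0]], [True, False], [0.0])
```

With a drop probability of 0.1, roughly one sample in ten is trained without its image embedding, so the same network also learns an unconditional prediction.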

Finally, the main contribution of the paper. The repository offers the diffusion prior network. It takes the CLIP text embeddings and tries to generate the CLIP image embeddings. Again, you will need the trained CLIP from the first step

import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP

# get trained CLIP from step one

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
).cuda()

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# feed text and images into diffusion prior network

loss = diffusion_prior(text, images)
loss.backward()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings
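During training, a diffusion model such as the prior is shown noised versions of its target (here, the CLIP image embedding) and learns to undo the noise. The sketch below shows the standard DDPM forward noising step in plain Python. The linear beta schedule and its endpoints are the common DDPM defaults and are an assumption; the library may use a different schedule, and the function names are illustrative only.

```python
import math
import random

def linear_beta_schedule(timesteps, beta_start=1e-4, beta_end=0.02):
    # per-step noise variances, increasing linearly (assumed schedule)
    step = (beta_end - beta_start) / (timesteps - 1)
    return [beta_start + i * step for i in range(timesteps)]

def cumulative_alphas(betas):
    # running product of (1 - beta_t), i.e. how much signal survives up to step t
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out

def q_sample(x0, t, alphas_cumprod, rng):
    # x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise
    a = alphas_cumprod[t]
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    return xt, noise

alphas_cumprod = cumulative_alphas(linear_beta_schedule(1000))
image_embed = [0.5, -0.25, 1.0, 0.0]   # stand-in for a CLIP image embedding
noised, noise = q_sample(image_embed, 500, alphas_cumprod, random.Random(0))
```

The denoising network is then trained to recover either the noise or the clean embedding from `noised`, given the timestep and the text conditioning.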

In the paper, they actually used a recently discovered technique, from Jonathan Ho himself (original author of DDPMs, the core technique used in DALL-E v2), for high resolution image synthesis.

This can easily be used within this framework as so

import torch
from dalle2_pytorch import Unet, Decoder, CLIP

# trained clip from step 1

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# 2 unets for the decoder (a la cascading DDPM)

unet1 = Unet(
    dim = 32,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
).cuda()

unet2 = Unet(
    dim = 32,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

# decoder, which contains the unet(s) and clip

decoder = Decoder(
    clip = clip,
    unet = (unet1, unet2),            # insert both unets in order of low resolution to highest resolution (you can have as many stages as you want here)
    image_sizes = (256, 512),         # resolutions, 256 for first unet, 512 for second. these must be unique and in ascending order (matches with the unets passed in)
    timesteps = 1000,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

# mock images (get a lot of this)

images = torch.randn(4, 3, 512, 512).cuda()

# feed images into decoder, specifying which unet you want to train
# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme

loss = decoder(images, unet_number = 1)
loss.backward()

loss = decoder(images, unet_number = 2)
loss.backward()

# do the above for many steps for both unets
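The comments above state that `image_sizes` must be unique, ascending, and matched one-to-one with the unets. A small sketch of that rule follows, also listing which resolution each later stage would be conditioned on in a cascaded setup. `plan_cascade` is a hypothetical helper written for illustration, not a function of the library.

```python
def plan_cascade(num_unets, image_sizes):
    """Hypothetical helper: check a cascade config and list its stages."""
    sizes = tuple(image_sizes)
    if len(sizes) != num_unets:
        raise ValueError("need exactly one image size per unet")
    if any(nxt <= cur for cur, nxt in zip(sizes, sizes[1:])):
        raise ValueError("image_sizes must be unique and in ascending order")
    stages = []
    for index, size in enumerate(sizes):
        stages.append({
            "unet_number": index + 1,   # 1-based, as in decoder(images, unet_number = ...)
            "image_size": size,
            # in a cascade, later stages upsample the previous stage's output
            "lowres_cond_size": sizes[index - 1] if index > 0 else None,
        })
    return stages

stages = plan_cascade(2, (256, 512))
```

Because each stage only depends on images at its own and the previous resolution, the stages can be trained independently, as the example above does with `unet_number`.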

Finally, to generate the DALL-E2 images from text, insert the trained DiffusionPrior as well as the Decoder (which wraps CLIP, the causal transformer, and the unet(s))

from dalle2_pytorch import DALLE2

dalle2 = DALLE2(
    prior = diffusion_prior,
    decoder = decoder
)

# send the text as a string if you want to use the simple tokenizer from DALLE v1
# or you can do it as token ids, if you have your own tokenizer

texts = ['glistening morning dew on a flower petal']
images = dalle2(texts) # (1, 3, 256, 256)

That's it!

Let's see the whole script below

import torch
from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder, CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# train

loss = clip(
    text,
    images,
    return_loss = True
)

loss.backward()

# do above for many steps ...

# prior networks (with transformer)

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 1000,
    sample_timesteps = 64,
    cond_drop_prob = 0.2
).cuda()

loss = diffusion_prior(text, images)
loss.backward()

# do above for many steps ...

# decoder (with unet)

unet1 = Unet(
    dim = 128,
    image_embed_dim = 512,
    text_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8),
    cond_on_text_encodings = True    # set to True for any unets that need to be conditioned on text encodings
).cuda()

unet2 = Unet(
    dim = 16,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

for unet_number in (1, 2):
    loss = decoder(images, text = text, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss.backward()

# do above for many steps

dalle2 = DALLE2(
    prior = diffusion_prior,
    decoder = decoder
)

images = dalle2(
    ['cute puppy chasing after a squirrel'],
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
)

# save your image (in this example, of size 256x256)

Everything in this readme should run without error
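The `cond_scale` argument is described as the classifier-free guidance strength. As a rough illustration of what such a scale does, the sketch below combines a conditional and an unconditional prediction in the usual classifier-free guidance way. `guided_prediction` is a hypothetical name, and the real sampling code operates on tensors inside the denoising loop rather than on plain lists.

```python
def guided_prediction(cond_pred, uncond_pred, cond_scale=1.0):
    # move from the unconditional prediction towards (and past) the conditional one
    return [u + (c - u) * cond_scale for c, u in zip(cond_pred, uncond_pred)]

conditional = [1.0, 2.0]
unconditional = [0.0, 0.0]

plain = guided_prediction(conditional, unconditional, cond_scale=1.0)
stronger = guided_prediction(conditional, unconditional, cond_scale=2.0)
```

A scale of 1 returns the conditional prediction unchanged, while values above 1 push the result further in the direction the condition suggests.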

You can also train the decoder on images larger (say 512x512) than the size at which CLIP was trained (256x256). The images will be resized to the CLIP image resolution to compute the image embeddings

For the layperson, no worries: training will all be automated into a CLI tool, at least for small scale training.

Training on Preprocessed CLIP Embeddings

It is likely, when scaling up, that you would first preprocess your images and text into their corresponding embeddings before training the prior network. You can do so easily by simply passing in image_embed, text_embed, and optionally text_encodings

Working example below

import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP

# get trained CLIP from step one

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
).cuda()

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
).cuda()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# precompute the text and image embeddings
# here using the diffusion prior class, but could be done with CLIP alone

clip_image_embeds = diffusion_prior.clip.embed_image(images).image_embed
clip_text_embeds = diffusion_prior.clip.embed_text(text).text_embed

# feed text and images into diffusion prior network

loss = diffusion_prior(
    text_embed = clip_text_embeds,
    image_embed = clip_image_embeds
)

loss.backward()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

You can also go completely CLIP-less, in which case you will need to pass image_embed_dim into the DiffusionPrior on initialization

import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

# diffusion prior network, which wraps the network (with transformer) above; no CLIP is passed in

diffusion_prior = DiffusionPrior(
    net = prior_network,
    image_embed_dim = 512,               # this needs to be set
    timesteps = 100,
    cond_drop_prob = 0.2,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
).cuda()

# mock precomputed text and image embeddings
# in practice these come from your own preprocessing step, since no CLIP is available here

clip_image_embeds = torch.randn(4, 512).cuda()
clip_text_embeds = torch.randn(4, 512).cuda()

# feed the text and image embeddings into the diffusion prior network

loss = diffusion_prior(
    text_embed = clip_text_embeds,
    image_embed = clip_image_embeds
)

loss.backward()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

OpenAI CLIP

Although there is a possibility that they are using an unreleased, more powerful CLIP, you can use one of the released ones if you do not wish to train your own CLIP from scratch. This will also allow the community to more quickly validate the conclusions of the paper.

To use a pretrained OpenAI CLIP, simply import OpenAIClipAdapter and pass it into the DiffusionPrior or Decoder like so

import torch
from dalle2_pytorch import DALLE2, DiffusionPriorNetwork, DiffusionPrior, Unet, Decoder, OpenAIClipAdapter

# openai pretrained clip - defaults to ViT-B/32

clip = OpenAIClipAdapter()

# mock data

text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# prior networks (with transformer)

prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
).cuda()

loss = diffusion_prior(text, images)
loss.backward()

# do above for many steps ...

# decoder (with unet)

unet1 = Unet(
    dim = 128,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8),
    text_embed_dim = 512,
    cond_on_text_encodings = True  # set to True for any unets that need to be conditioned on text encodings (ex. first unet in cascade)
).cuda()

unet2 = Unet(
    dim = 16,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

decoder = Decoder(
    unet = (unet1, unet2),
    image_sizes = (128, 256),
    clip = clip,
    timesteps = 1000,
    sample_timesteps = (250, 27),
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

for unet_number in (1, 2):
    loss = decoder(images, text = text, unet_number = unet_number) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss.backward()

# do above for many steps

dalle2 = DALLE2(
    prior = diffusion_prior,
    decoder = decoder
)

images = dalle2(
    ['a butterfly trying to escape a tornado'],
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
)

# save your image (in this example, of size 256x256)
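The decoder above is trained with `timesteps = 1000` but sampled with `sample_timesteps = (250, 27)`, presumably one value per unet, so that generation visits far fewer steps than training. The sketch below shows one simple way to pick such a reduced set of timesteps by spacing them evenly. The spacing rule and the name `sampling_schedule` are assumptions for illustration; the library may choose its sampling steps differently.

```python
def sampling_schedule(total_timesteps, sample_timesteps):
    # pick `sample_timesteps` evenly spaced steps from the training range,
    # ordered from most noisy (T - 1) down to least noisy (0)
    if not 0 < sample_timesteps <= total_timesteps:
        raise ValueError("sample_timesteps must be between 1 and total_timesteps")
    if sample_timesteps == 1:
        return [total_timesteps - 1]
    last = total_timesteps - 1
    # integer arithmetic keeps the endpoints exact and the steps distinct
    steps = [(i * last) // (sample_timesteps - 1) for i in range(sample_timesteps)]
    return steps[::-1]

small = sampling_schedule(10, 4)
second_unet = sampling_schedule(1000, 27)
```

Fewer sampling steps make generation faster, usually at some cost in fidelity, which is why the upsampling unet in the example can get away with a much smaller number than the base unet.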