Téléchargement du code source DALLE2 pytorch - Téléchargement du code source DALLE2 pytorch

DALL-E 2 - Pytorch

Implémentation de DALL-E 2, le réseau neuronal de synthèse texte-image mis à jour d'OpenAI, dans Pytorch.

Yannic Kilcher résumé | Explication AssemblyAI

La principale nouveauté semble être une couche supplémentaire d'indirection avec le réseau précédent (qu'il s'agisse d'un transformateur autorégressif ou d'un réseau de diffusion), qui prédit une intégration d'image basée sur l'intégration de texte de CLIP. Concrètement, ce référentiel ne construira que le réseau a priori de diffusion, car c'est la variante la plus performante (mais qui implique accessoirement un transformateur causal comme le réseau débruiteur ?)

Ce modèle est SOTA pour le texte en image pour l'instant.

Veuillez nous rejoindre si vous souhaitez aider à la réplication avec la communauté LAION | Entretien avec Yannic

Depuis le 23/05/22, ce n'est plus SOTA. SOTA sera là. Les versions Jax ainsi que le projet texte-vidéo seront déplacés vers l'architecture Imagen, car elle est beaucoup plus simple.

Statut

Un groupe de recherche a utilisé le code de ce référentiel pour former une diffusion fonctionnelle préalable pour leurs générations CLIP. Partagera leur travail une fois qu'ils auront publié leur prépublication. Ceci, ainsi que les propres expériences de Katherine, valident les conclusions d'OpenAI selon lesquelles le précédent supplémentaire augmente la variété des générations.
Il est maintenant vérifié que le décodeur fonctionne pour la génération inconditionnelle sur ma configuration expérimentale pour les fleurs d'Oxford. 2 chercheurs ont également confirmé que Decoder fonctionne pour eux.

en cours à 21 000 étapes

Justin Pinkney a formé avec succès la diffusion préalable dans le référentiel pour son application de conversion texte-image CLIP to Stylegan2
Romain a étendu la formation à 800 GPU avec les scripts disponibles sans aucun problème

Modèles pré-entraînés

LAION forme des modèles antérieurs. Les points de contrôle sont disponibles sur ?huggingface et les statistiques d'entraînement sont disponibles sur ?WANDB.
Décodeur - Test en cours ?
Décodeur - Un autre test avec peu d'attention
DALL-E2 ? - Référentiel DALL-E 2 Laion

Appréciation

Cette bibliothèque ne serait pas parvenue à cet état de fonctionnement sans l'aide de

Zion pour le code de formation distribué pour la diffusion préalable
Aidan pour le code de formation distribué pour le décodeur ainsi que les chargeurs de données
Kumar pour avoir travaillé sur le script de formation à la diffusion initiale
Romain pour les revues de pull request et la gestion de projet
He Cao et xiankgx pour les questions et réponses et pour l'identification des bugs critiques
Marunine pour identifier les problèmes de redimensionnement du conditionneur basse résolution, lors de la formation du suréchantillonneur, en plus de diverses autres corrections de bugs
MalumaDev pour avoir proposé l'utilisation d'un suréchantillonneur de mélange de pixels pour corriger les artefacts du damier
Katherine pour ses conseils
Stability AI pour le généreux parrainage
? Huggingface et notamment Sylvain pour la bibliothèque Accelerate
Alex pour einops, outil indispensable pour la manipulation tensorielle

... et bien d'autres. Merci!

Installer

$ pip install dalle2-pytorch

Usage

Entraîner DALLE-2 est un processus en 3 étapes, la formation de CLIP étant la plus importante

Pour entraîner CLIP, vous pouvez soit utiliser le package x-clip, soit rejoindre le discord LAION, où de nombreux efforts de réplication sont déjà en cours.

Ce référentiel démontrera l'intégration avec x-clip pour les débutants

 import torch
from dalle2_pytorch import CLIP

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 1 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 1 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8 ,
    use_all_token_embeds = True ,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True ,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True ,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True ,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr' ,             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False ,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05 ,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
). cuda ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# train

loss = clip (
    text ,
    images ,
    return_loss = True              # needs to be set to True to return contrastive loss
)

loss . backward ()

# do the above with as many texts and images as possible in a loop

Ensuite, vous devrez former le décodeur, qui apprend à générer des images basées sur l'intégration d'images provenant du CLIP formé ci-dessus.

 import torch
from dalle2_pytorch import Unet , Decoder , CLIP

# trained clip from step 1

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 1 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 1 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8
). cuda ()

# unet for the decoder

unet = Unet (
    dim = 128 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 )
). cuda ()

# decoder, which contains the unet and clip

decoder = Decoder (
    unet = unet ,
    clip = clip ,
    timesteps = 100 ,
    image_cond_drop_prob = 0.1 ,
    text_cond_drop_prob = 0.5
). cuda ()

# mock images (get a lot of this)

images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# feed images into decoder

loss = decoder ( images )
loss . backward ()

# do the above for many many many many steps
# then it will learn to generate images based on the CLIP image embeddings

Enfin, la principale contribution de l’article. Le référentiel propose le réseau préalable de diffusion. Il prend les intégrations de texte CLIP et essaie de générer les intégrations d'images CLIP. Encore une fois, vous aurez besoin du CLIP formé dès la première étape

 import torch
from dalle2_pytorch import DiffusionPriorNetwork , DiffusionPrior , CLIP

# get trained CLIP from step one

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 6 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 6 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8 ,
). cuda ()

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork (
    dim = 512 ,
    depth = 6 ,
    dim_head = 64 ,
    heads = 8
). cuda ()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior (
    net = prior_network ,
    clip = clip ,
    timesteps = 100 ,
    cond_drop_prob = 0.2
). cuda ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# feed text and images into diffusion prior network

loss = diffusion_prior ( text , images )
loss . backward ()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

Dans cet article, ils ont en fait utilisé une technique récemment découverte, par Jonathan Ho lui-même (auteur original des DDPM, la technique de base utilisée dans DALL-E v2) pour la synthèse d'images haute résolution.

Cela peut facilement être utilisé dans ce cadre ainsi

 import torch
from dalle2_pytorch import Unet , Decoder , CLIP

# trained clip from step 1

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 6 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 6 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8
). cuda ()

# 2 unets for the decoder (a la cascading DDPM)

unet1 = Unet (
    dim = 32 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 )
). cuda ()

unet2 = Unet (
    dim = 32 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 , 16 )
). cuda ()

# decoder, which contains the unet(s) and clip

decoder = Decoder (
    clip = clip ,
    unet = ( unet1 , unet2 ),            # insert both unets in order of low resolution to highest resolution (you can have as many stages as you want here)
    image_sizes = ( 256 , 512 ),         # resolutions, 256 for first unet, 512 for second. these must be unique and in ascending order (matches with the unets passed in)
    timesteps = 1000 ,
    image_cond_drop_prob = 0.1 ,
    text_cond_drop_prob = 0.5
). cuda ()

# mock images (get a lot of this)

images = torch . randn ( 4 , 3 , 512 , 512 ). cuda ()

# feed images into decoder, specifying which unet you want to train
# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme

loss = decoder ( images , unet_number = 1 )
loss . backward ()

loss = decoder ( images , unet_number = 2 )
loss . backward ()

# do the above for many steps for both unets

Enfin, générer les images DALL-E2 à partir du texte. Insérez le DiffusionPrior formé ainsi que le Decoder (qui enveloppe CLIP , le transformateur causal et une ou plusieurs unités)

 from dalle2_pytorch import DALLE2

dalle2 = DALLE2 (
    prior = diffusion_prior ,
    decoder = decoder
)

# send the text as a string if you want to use the simple tokenizer from DALLE v1
# or you can do it as token ids, if you have your own tokenizer

texts = [ 'glistening morning dew on a flower petal' ]
images = dalle2 ( texts ) # (1, 3, 256, 256)

C'est ça!

Voyons l'intégralité du script ci-dessous

 import torch
from dalle2_pytorch import DALLE2 , DiffusionPriorNetwork , DiffusionPrior , Unet , Decoder , CLIP

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 6 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 6 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8
). cuda ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# train

loss = clip (
    text ,
    images ,
    return_loss = True
)

loss . backward ()

# do above for many steps ...

# prior networks (with transformer)

prior_network = DiffusionPriorNetwork (
    dim = 512 ,
    depth = 6 ,
    dim_head = 64 ,
    heads = 8
). cuda ()

diffusion_prior = DiffusionPrior (
    net = prior_network ,
    clip = clip ,
    timesteps = 1000 ,
    sample_timesteps = 64 ,
    cond_drop_prob = 0.2
). cuda ()

loss = diffusion_prior ( text , images )
loss . backward ()

# do above for many steps ...

# decoder (with unet)

unet1 = Unet (
    dim = 128 ,
    image_embed_dim = 512 ,
    text_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 ),
    cond_on_text_encodings = True    # set to True for any unets that need to be conditioned on text encodings
). cuda ()

unet2 = Unet (
    dim = 16 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 , 16 )
). cuda ()

decoder = Decoder (
    unet = ( unet1 , unet2 ),
    image_sizes = ( 128 , 256 ),
    clip = clip ,
    timesteps = 100 ,
    image_cond_drop_prob = 0.1 ,
    text_cond_drop_prob = 0.5
). cuda ()

for unet_number in ( 1 , 2 ):
    loss = decoder ( images , text = text , unet_number = unet_number ) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss . backward ()

# do above for many steps

dalle2 = DALLE2 (
    prior = diffusion_prior ,
    decoder = decoder
)

images = dalle2 (
    [ 'cute puppy chasing after a squirrel' ],
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
)

# save your image (in this example, of size 256x256)

Tout dans ce fichier Lisez-moi devrait fonctionner sans erreur

Vous pouvez également entraîner le décodeur sur des images d'une taille supérieure à la taille (disons 512 x 512) à laquelle CLIP a été formé (256 x 256). Les images seront redimensionnées à la résolution d'image CLIP pour les intégrations d'images

Pour le profane, ne vous inquiétez pas, la formation sera entièrement automatisée dans un outil CLI, du moins pour les formations à petite échelle.

Formation sur les intégrations CLIP prétraitées

Il est probable, lors de la mise à l'échelle, que vous prétraitiez d'abord vos images et votre texte dans les intégrations correspondantes avant de former le réseau précédent. Vous pouvez le faire facilement en passant simplement image_embed , text_embed et éventuellement text_encodings

Exemple de travail ci-dessous

 import torch
from dalle2_pytorch import DiffusionPriorNetwork , DiffusionPrior , CLIP

# get trained CLIP from step one

clip = CLIP (
    dim_text = 512 ,
    dim_image = 512 ,
    dim_latent = 512 ,
    num_text_tokens = 49408 ,
    text_enc_depth = 6 ,
    text_seq_len = 256 ,
    text_heads = 8 ,
    visual_enc_depth = 6 ,
    visual_image_size = 256 ,
    visual_patch_size = 32 ,
    visual_heads = 8 ,
). cuda ()

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork (
    dim = 512 ,
    depth = 6 ,
    dim_head = 64 ,
    heads = 8
). cuda ()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior (
    net = prior_network ,
    clip = clip ,
    timesteps = 100 ,
    cond_drop_prob = 0.2 ,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
). cuda ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# precompute the text and image embeddings
# here using the diffusion prior class, but could be done with CLIP alone

clip_image_embeds = diffusion_prior . clip . embed_image ( images ). image_embed
clip_text_embeds = diffusion_prior . clip . embed_text ( text ). text_embed

# feed text and images into diffusion prior network

loss = diffusion_prior (
    text_embed = clip_text_embeds ,
    image_embed = clip_image_embeds
)

loss . backward ()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

Vous pouvez également utiliser complètement CLIP -less, auquel cas vous devrez transmettre le image_embed_dim dans le DiffusionPrior lors de l'initialisation.

 import torch
from dalle2_pytorch import DiffusionPriorNetwork , DiffusionPrior

# setup prior network, which contains an autoregressive transformer

prior_network = DiffusionPriorNetwork (
    dim = 512 ,
    depth = 6 ,
    dim_head = 64 ,
    heads = 8
). cuda ()

# diffusion prior network, which contains the CLIP and network (with transformer) above

diffusion_prior = DiffusionPrior (
    net = prior_network ,
    image_embed_dim = 512 ,               # this needs to be set
    timesteps = 100 ,
    cond_drop_prob = 0.2 ,
    condition_on_text_encodings = False  # this probably should be true, but just to get Laion started
). cuda ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# precompute the text and image embeddings
# here using the diffusion prior class, but could be done with CLIP alone

clip_image_embeds = torch . randn ( 4 , 512 ). cuda ()
clip_text_embeds = torch . randn ( 4 , 512 ). cuda ()

# feed text and images into diffusion prior network

loss = diffusion_prior (
    text_embed = clip_text_embeds ,
    image_embed = clip_image_embeds
)

loss . backward ()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

CLIP OpenAI

Bien qu'il soit possible qu'ils utilisent un CLIP inédit et plus puissant, vous pouvez utiliser l'un de ceux publiés, si vous ne souhaitez pas former votre propre CLIP à partir de zéro. Cela permettra également à la communauté de valider plus rapidement les conclusions du document.

Pour utiliser un CLIP OpenAI pré-entraîné, importez simplement OpenAIClipAdapter et transmettez-le dans DiffusionPrior ou Decoder comme ceci

 import torch
from dalle2_pytorch import DALLE2 , DiffusionPriorNetwork , DiffusionPrior , Unet , Decoder , OpenAIClipAdapter

# openai pretrained clip - defaults to ViT-B/32

clip = OpenAIClipAdapter ()

# mock data

text = torch . randint ( 0 , 49408 , ( 4 , 256 )). cuda ()
images = torch . randn ( 4 , 3 , 256 , 256 ). cuda ()

# prior networks (with transformer)

prior_network = DiffusionPriorNetwork (
    dim = 512 ,
    depth = 6 ,
    dim_head = 64 ,
    heads = 8
). cuda ()

diffusion_prior = DiffusionPrior (
    net = prior_network ,
    clip = clip ,
    timesteps = 100 ,
    cond_drop_prob = 0.2
). cuda ()

loss = diffusion_prior ( text , images )
loss . backward ()

# do above for many steps ...

# decoder (with unet)

unet1 = Unet (
    dim = 128 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 ),
    text_embed_dim = 512 ,
    cond_on_text_encodings = True  # set to True for any unets that need to be conditioned on text encodings (ex. first unet in cascade)
). cuda ()

unet2 = Unet (
    dim = 16 ,
    image_embed_dim = 512 ,
    cond_dim = 128 ,
    channels = 3 ,
    dim_mults = ( 1 , 2 , 4 , 8 , 16 )
). cuda ()

decoder = Decoder (
    unet = ( unet1 , unet2 ),
    image_sizes = ( 128 , 256 ),
    clip = clip ,
    timesteps = 1000 ,
    sample_timesteps = ( 250 , 27 ),
    image_cond_drop_prob = 0.1 ,
    text_cond_drop_prob = 0.5
). cuda ()

for unet_number in ( 1 , 2 ):
    loss = decoder ( images , text = text , unet_number = unet_number ) # this can optionally be decoder(images, text) if you wish to condition on the text encodings as well, though it was hinted in the paper it didn't do much
    loss . backward ()

# do above for many steps

dalle2 = DALLE2 (
    prior = diffusion_prior ,
    decoder = decoder
)

images = dalle2 (
    [ 'a butterfly trying to escape a tornado' ],
    cond_scale = 2. # classifier free guidance strength (> 1 would strengthen the condition)
)

# save your image (in this example, of size 256x256)