Download sinkhorn transformer - download do código fonte sinkhorn transformer

Transformador Sinkhorn

Esta é uma reprodução do trabalho descrito em Sparse Sinkhorn Attention, com melhorias adicionais.

Inclui uma rede de classificação parametrizada, usando normalização sinkhorn para amostrar uma matriz de permutação que combina os grupos de chaves mais relevantes com os grupos de consultas.

Este trabalho também traz redes reversíveis e feed forward chunking (conceitos introduzidos no Reformer) para gerar ainda mais economia de memória.

204 mil tokens (fins de demonstração)

Instalar

$ pip install sinkhorn_transformer

Usar

Um modelo de linguagem baseado no Sinkhorn Transformer

 import torch
from sinkhorn_transformer import SinkhornTransformerLM

model = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 1024 ,
    heads = 8 ,
    depth = 12 ,
    max_seq_len = 8192 ,
    bucket_size = 128 ,        # size of the buckets
    causal = False ,           # auto-regressive or not
    n_sortcut = 2 ,            # use sortcut to reduce memory complexity to linear
    n_top_buckets = 2 ,        # sort specified number of key/value buckets to one query bucket. paper is at 1, defaults to 2
    ff_chunks = 10 ,           # feedforward chunking, from Reformer paper
    reversible = True ,        # make network reversible, from Reformer paper
    emb_dropout = 0.1 ,        # embedding dropout
    ff_dropout = 0.1 ,         # feedforward dropout
    attn_dropout = 0.1 ,       # post attention dropout
    attn_layer_dropout = 0.1 , # post attention layer dropout
    layer_dropout = 0.1 ,      # add layer dropout, from 'Reducing Transformer Depth on Demand' paper
    weight_tie = True ,        # tie layer parameters, from Albert paper
    emb_dim = 128 ,            # embedding factorization, from Albert paper
    dim_head = 64 ,            # be able to fix the dimension of each head, making it independent of the embedding dimension and the number of heads
    ff_glu = True ,            # use GLU in feedforward, from paper 'GLU Variants Improve Transformer'
    n_local_attn_heads = 2 ,   # replace N heads with local attention, suggested to work well from Routing Transformer paper
    pkm_layers = ( 4 , 7 ),       # specify layers to use product key memory. paper shows 1 or 2 modules near the middle of the transformer is best
    pkm_num_keys = 128 ,       # defaults to 128, but can be increased to 256 or 512 as memory allows
)

x = torch . randint ( 0 , 20000 , ( 1 , 2048 ))
model ( x ) # (1, 2048, 20000)

Um transformador Sinkhorn simples, camadas de atenção sinkhorn

 import torch
from sinkhorn_transformer import SinkhornTransformer

model = SinkhornTransformer (
    dim = 1024 ,
    heads = 8 ,
    depth = 12 ,
    bucket_size = 128
)

x = torch . randn ( 1 , 2048 , 1024 )
model ( x ) # (1, 2048, 1024)

Transformador codificador/decodificador Sinkhorn

 import torch
from sinkhorn_transformer import SinkhornTransformerLM

DE_SEQ_LEN = 4096
EN_SEQ_LEN = 4096

enc = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 512 ,
    depth = 6 ,
    heads = 8 ,
    bucket_size = 128 ,
    max_seq_len = DE_SEQ_LEN ,
    reversible = True ,
    return_embeddings = True
). cuda ()

dec = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 512 ,
    depth = 6 ,
    causal = True ,
    bucket_size = 128 ,
    max_seq_len = EN_SEQ_LEN ,
    receives_context = True ,
    context_bucket_size = 128 ,  # context key / values can be bucketed differently
    reversible = True
). cuda ()

x = torch . randint ( 0 , 20000 , ( 1 , DE_SEQ_LEN )). cuda ()
y = torch . randint ( 0 , 20000 , ( 1 , EN_SEQ_LEN )). cuda ()

x_mask = torch . ones_like ( x ). bool (). cuda ()
y_mask = torch . ones_like ( y ). bool (). cuda ()

context = enc ( x , input_mask = x_mask )
dec ( y , context = context , input_mask = y_mask , context_mask = x_mask ) # (1, 4096, 20000)

Preenchimento automático

Por padrão, o modelo reclamará se receber uma entrada que não seja um múltiplo do tamanho do intervalo. Para evitar ter que fazer os mesmos cálculos de preenchimento todas as vezes, você pode usar a classe auxiliar Autopadder . Ele cuidará da input_mask para você também, se for fornecido. Chaves/valores contextuais e máscara também são suportados.

 import torch
from sinkhorn_transformer import SinkhornTransformerLM
from sinkhorn_transformer import Autopadder

model = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 1024 ,
    heads = 8 ,
    depth = 12 ,
    max_seq_len = 2048 ,
    bucket_size = 128 ,
    causal = True
)

model = Autopadder ( model , pad_left = True ) # autopadder will fetch the bucket size and autopad input

x = torch . randint ( 0 , 20000 , ( 1 , 1117 )) # odd sequence length
model ( x ) # (1, 1117, 20000)

Chifre

Este repositório divergiu do artigo e agora está usando a atenção no lugar da amostragem original de rede de classificação + gumbel sinkhorn. Ainda não encontrei uma diferença notável no desempenho, e o novo esquema me permite generalizar a rede para comprimentos de sequência flexíveis. Se você quiser experimentar o Sinkhorn, use as configurações a seguir, que funcionam apenas para redes não causais.

 import torch
from sinkhorn_transformer import SinkhornTransformerLM

model = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 1024 ,
    heads = 8 ,
    depth = 12 ,
    bucket_size = 128 ,
    max_seq_len = 8192 ,
    use_simple_sort_net = True , # turn off attention sort net
    sinkhorn_iter = 7 ,          # number of sinkhorn iterations - default is set at reported best in paper
    n_sortcut = 2 ,              # use sortcut to reduce complexity to linear time
    temperature = 0.75 ,         # gumbel temperature - default is set at reported best in paper
    non_permutative = False ,    # allow buckets of keys to be sorted to queries more than once
)

x = torch . randint ( 0 , 20000 , ( 1 , 8192 ))
model ( x ) # (1, 8192, 20000)

Memória da chave do produto

Para ver os benefícios do uso do PKM, a taxa de aprendizado dos valores deve ser definida mais alta do que o restante dos parâmetros. (Recomendado para ser 1e-2 )

Você pode seguir as instruções aqui para configurá-lo corretamente https://github.com/lucidrains/product-key-memory#learning-rates

Problemas

Decodificação e comprimentos de sequência

Sinkhorn, quando treinado em sequências de comprimento fixo, parece ter problemas para decodificar sequências do zero, principalmente devido ao fato de que a rede de classificação tem problemas para generalizar quando os baldes estão parcialmente preenchidos com tokens de preenchimento.

Felizmente, acho que encontrei uma solução simples. Durante o treinamento, para redes causais, trunque aleatoriamente as sequências e force a generalização da rede de classificação. Forneci um sinalizador ( randomly_truncate_sequence ) para a instância AutoregressiveWrapper para facilitar isso.

 import torch
from sinkhorn_transformer import SinkhornTransformerLM , AutoregressiveWrapper

model = SinkhornTransformerLM (
    num_tokens = 20000 ,
    dim = 1024 ,
    heads = 8 ,
    depth = 12 ,
    bucket_size = 75 ,
    max_seq_len = 8192 ,
    causal = True
)

model = AutoregressiveWrapper ( model )

x = torch . randint ( 0 , 20000 , ( 1 , 8192 ))
loss = model ( x , return_loss = True , randomly_truncate_sequence = True ) # (1, 8192, 20000)

Estou aberto a sugestões se alguém encontrar uma solução melhor.

Rede de classificação causal

Há um problema potencial com a rede de classificação causal, onde a decisão de quais intervalos de chave/valor das classificações anteriores para um intervalo depende apenas do primeiro token e não do restante (devido ao esquema de agrupamento e à prevenção de vazamento de futuros para passado).

Tentei aliviar esse problema girando metade das cabeças para a esquerda pelo tamanho do balde - 1, promovendo assim o último token como o primeiro. Essa também é a razão pela qual o AutoregressiveWrapper usa como padrão o preenchimento esquerdo durante o treinamento, para sempre garantir que o último token da sequência tenha uma palavra a dizer sobre o que recuperar.

Se alguém encontrou uma solução mais limpa, por favor me avise nos problemas.

Alternativas

Transformador de roteamento - https://github.com/lucidrains/routing-transformer
Reformador - https://github.com/lucidrains/reformer-pytorch

Citações

 @misc { tay2020sparse ,
    title   = { Sparse Sinkhorn Attention } ,
    author  = { Yi Tay and Dara Bahri and Liu Yang and Donald Metzler and Da-Cheng Juan } ,
    year    = { 2020 } ,
    url.    = { https://arxiv.org/abs/2002.11296 }
}

 @inproceedings { kitaev2020reformer ,
    title       = { Reformer: The Efficient Transformer } ,
    author      = { Nikita Kitaev and Lukasz Kaiser and Anselm Levskaya } ,
    booktitle   = { International Conference on Learning Representations } ,
    year        = { 2020 } ,
    url         = { https://openreview.net/forum?id=rkgNKkHtvB }
}

 @misc { lan2019albert ,
    title       = { ALBERT: A Lite BERT for Self-supervised Learning of Language Representations } ,
    author      = { Zhenzhong Lan and Mingda Chen and Sebastian Goodman and Kevin Gimpel and Piyush Sharma and Radu Soricut } ,
    year        = { 2019 } ,
    url         = { https://arxiv.org/abs/1909.11942 }
}

 @misc { shazeer2020glu ,
    title   = { GLU Variants Improve Transformer } ,
    author  = { Noam Shazeer } ,
    year    = { 2020 } ,
    url     = { https://arxiv.org/abs/2002.05202 }
}

 @misc { roy*2020efficient ,
    title   = { Efficient Content-Based Sparse Attention with Routing Transformers } ,
    author  = { Aurko Roy* and Mohammad Taghi Saffar* and David Grangier and Ashish Vaswani } ,
    year    = { 2020 } ,
    url     = { https://openreview.net/forum?id=B1gjs6EtDr }
}

 @inproceedings { fan2020reducing ,
    title     = { Reducing Transformer Depth on Demand with Structured Dropout } ,
    author    = { Angela Fan and Edouard Grave and Armand Joulin } ,
    booktitle = { International Conference on Learning Representations } ,
    year      = { 2020 } ,
    url       = { https://openreview.net/forum?id=SylO2yStDr }
}

 @misc { lample2019large ,
    title   = { Large Memory Layers with Product Keys } ,
    author  = { Guillaume Lample and Alexandre Sablayrolles and Marc'Aurelio Ranzato and Ludovic Denoyer and Hervé Jégou } ,
    year    = { 2019 } ,
    eprint  = { 1907.05242 } ,
    archivePrefix = { arXiv }
}

 @misc { bhojanapalli2020lowrank ,
    title   = { Low-Rank Bottleneck in Multi-head Attention Models } ,
    author  = { Srinadh Bhojanapalli and Chulhee Yun and Ankit Singh Rawat and Sashank J. Reddi and Sanjiv Kumar } ,
    year    = { 2020 } ,
    eprint  = { 2002.07028 }
}