Implementation of RETRO, DeepMind's retrieval-based attention network, in Pytorch. This implementation deviates slightly from the paper, using rotary embeddings for relative positional encoding, as well as the Faiss library in place of ScaNN.
This library leverages autofaiss for building the index and calculating the k-nearest neighbors for all chunks.
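For reference, here is a minimal sketch of what that autofaiss step looks like, assuming the chunk embeddings have already been computed (build_index is autofaiss' main entrypoint; the parameter values here are illustrative):

import numpy as np
from autofaiss import build_index

embeddings = np.random.rand(1000, 768).astype('float32')  # mock BERT chunk embeddings

index, index_infos = build_index(
    embeddings,
    save_on_disk = False,
    metric_type = 'ip',                # inner product similarity
    max_index_memory_usage = '100m',   # mirrors the TrainingWrapper settings below
    current_memory_available = '1G'
)

_, neighbor_indices = index.search(embeddings[:10], k = 2)  # knn for the first 10 chunks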
See also Jay Alammar's explanatory blog post.
The selling point of this retrieval approach is reaching GPT-3-level performance with 10x fewer parameters. More research in this direction is definitely worthwhile.
It also includes the functionality needed to scale the retrieval transformer to 1000 layers, if the claims of the DeepNet paper are to be believed.
Update: someone on Reddit has gifted me a Gold Award. Not sure what it is, but thank you!
Update: DeepNorm has been validated at scale in Tsinghua's 130B model. It is now recommended that you train with use_deepnet set to True.
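For the curious, below is a minimal sketch of the DeepNorm residual this flag turns on, following the formulas in the DeepNet paper for a decoder with N layers (an illustration under those assumptions, not the library's exact code):

import torch
from torch import nn

class DeepNormResidual(nn.Module):
    def __init__(self, dim, fn, depth):
        super().__init__()
        self.fn = fn                       # attention or feedforward sublayer
        self.alpha = (2 * depth) ** 0.25   # residual scale, (2N)^(1/4) for a decoder
        self.norm = nn.LayerNorm(dim)      # post-normalization

    def forward(self, x):
        # the paper also scales select sublayer weights by beta = (8N)^(-1/4) at init
        return self.norm(x * self.alpha + self.fn(x))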
$ pip install retro-pytorch
import torch
from retro_pytorch import RETRO

retro = RETRO(
    chunk_size = 64,                        # the chunk size that is indexed and retrieved (needed for proper relative positions as well as causal chunked cross attention)
    max_seq_len = 2048,                     # max sequence length
    enc_dim = 896,                          # encoder model dim
    enc_depth = 2,                          # encoder depth
    dec_dim = 796,                          # decoder model dim
    dec_depth = 12,                         # decoder depth
    dec_cross_attn_layers = (3, 6, 9, 12),  # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                              # attention heads
    dim_head = 64,                          # dimension per head
    dec_attn_dropout = 0.25,                # decoder attention dropout
    dec_ff_dropout = 0.25,                  # decoder feedforward dropout
    use_deepnet = True                      # turn on post-normalization with DeepNet residual scaling and initialization, for scaling to 1000 layers
)

seq = torch.randint(0, 20000, (2, 2048 + 1))          # plus one since it is split into input and labels for training
retrieved = torch.randint(0, 20000, (2, 32, 2, 128))  # retrieved tokens - (batch, num chunks, num retrieved neighbors, retrieved chunk with continuation)

loss = retro(seq, retrieved, return_loss = True)
loss.backward()

# do above for many steps
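A minimal sketch of what "many steps" could look like with mock data and a plain AdamW optimizer (the TrainingWrapper below provides a properly configured one):

from torch.optim import AdamW

optim = AdamW(retro.parameters(), lr = 3e-4, weight_decay = 0.01)

for _ in range(100):
    seq = torch.randint(0, 20000, (2, 2048 + 1))
    retrieved = torch.randint(0, 20000, (2, 32, 2, 128))

    loss = retro(seq, retrieved, return_loss = True)
    loss.backward()

    optim.step()
    optim.zero_grad()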
The aim of the TrainingWrapper is to process a folder of text documents into the memmapped numpy arrays necessary to begin training RETRO.
import torch
from retro_pytorch import RETRO, TrainingWrapper

# instantiate RETRO, fit it into the TrainingWrapper with correct settings

retro = RETRO(
    max_seq_len = 2048,                    # max sequence length
    enc_dim = 896,                         # encoder model dim
    enc_depth = 3,                         # encoder depth
    dec_dim = 768,                         # decoder model dim
    dec_depth = 12,                        # decoder depth
    dec_cross_attn_layers = (1, 3, 6, 9),  # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                             # attention heads
    dim_head = 64,                         # dimension per head
    dec_attn_dropout = 0.25,               # decoder attention dropout
    dec_ff_dropout = 0.25                  # decoder feedforward dropout
).cuda()
wrapper = TrainingWrapper(
    retro = retro,                                # the RETRO instance above
    knn = 2,                                      # knn (2 in paper was sufficient)
    chunk_size = 64,                              # chunk size (64 in paper)
    documents_path = './text_folder',             # path to folder of text
    glob = '**/*.txt',                            # text glob
    chunks_memmap_path = './train.chunks.dat',    # path to chunks
    seqs_memmap_path = './train.seq.dat',         # path to sequence data
    doc_ids_memmap_path = './train.doc_ids.dat',  # path to document ids per chunk (used for filtering neighbors belonging to same document)
    max_chunks = 1_000_000,                       # maximum cap to chunks
    max_seqs = 100_000,                           # maximum number of sequences
    knn_extra_neighbors = 100,                    # num extra neighbors to fetch
    max_index_memory_usage = '100m',
    current_memory_available = '1G'
)
# get the dataloader and optimizer (AdamW with all the correct settings)

train_dl = iter(wrapper.get_dataloader(batch_size = 2, shuffle = True))
optim = wrapper.get_optimizer(lr = 3e-4, wd = 0.01)

# now do your training
# ex. one gradient step

seq, retrieved = map(lambda t: t.cuda(), next(train_dl))

# seq       - (2, 2049)       - 1 extra token since split by seq[:, :-1], seq[:, 1:]
# retrieved - (2, 32, 2, 128) - 128 since chunk + continuation, each 64 tokens

loss = retro(
    seq,
    retrieved,
    return_loss = True
)

# one gradient step

loss.backward()
optim.step()
optim.zero_grad()

# do above for many steps, then ...
# topk sampling with retrieval at chunk boundaries

sampled = wrapper.generate(filter_thres = 0.9, temperature = 1.0)  # (1, <2049) terminates early if all <eos>

# or you can generate with a prompt, knn retrieval for initial chunks all taken care of

prompt = torch.randint(0, 1000, (1, 128))  # start with two chunks worth of sequence
sampled = wrapper.generate(prompt, filter_thres = 0.9, temperature = 1.0)  # (1, <2049) terminates early if all <eos>
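As a rough guide to what filter_thres does, here is a sketch of the kind of top-k filtering commonly paired with such a parameter (the exact rule inside the library is an assumption; under this reading, filter_thres = 0.9 keeps the top 10% of logits):

import torch

def top_k_filter(logits, filter_thres = 0.9):
    k = max(1, int((1 - filter_thres) * logits.shape[-1]))  # number of logits kept
    val, ind = logits.topk(k, dim = -1)
    filtered = torch.full_like(logits, float('-inf'))
    return filtered.scatter_(-1, ind, val)

logits = torch.randn(1, 20000)
probs = (top_k_filter(logits) / 1.0).softmax(dim = -1)  # temperature = 1.0
sample = torch.multinomial(probs, 1)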
To force a reprocessing of the training data, run your script with the REPROCESS=1 environment flag:
$ REPROCESS=1 python train.py
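A hedged sketch of how such a flag can be consulted inside train.py (the library's actual check may differ):

import os

force_reprocess = os.environ.get('REPROCESS', '') == '1'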
The RETRODataset class accepts paths to a number of memmapped numpy arrays containing the chunks, the index of the first chunk of each training sequence (for the RETRO decoder), and the pre-calculated indices of the k-nearest neighbors per chunk.
You can use this to easily assemble the data for RETRO training, if you do not wish to use the TrainingWrapper above. All the functions needed to create the necessary memmapped data are covered in the sections that follow.
import torch
import numpy as np
from torch.utils.data import DataLoader
from retro_pytorch import RETRO, RETRODataset

# mock data constants

NUM_CHUNKS = 1000
CHUNK_SIZE = 64
NUM_SEQS = 100
NUM_NEIGHBORS = 2

def save_memmap(path, tensor):
    f = np.memmap(path, dtype = tensor.dtype, mode = 'w+', shape = tensor.shape)
    f[:] = tensor
    del f

# generate mock chunk data

save_memmap(
    './train.chunks.dat',
    np.int32(np.random.randint(0, 8192, size = (NUM_CHUNKS, CHUNK_SIZE + 1)))
)

# generate nearest neighbors for each chunk

save_memmap(
    './train.chunks.knn.dat',
    np.int32(np.random.randint(0, 1000, size = (NUM_CHUNKS, NUM_NEIGHBORS)))
)

# generate seq data

save_memmap(
    './train.seq.dat',
    np.int32(np.random.randint(0, 128, size = (NUM_SEQS,)))
)

# instantiate dataset class
# which constructs the sequence and neighbors from memmapped chunk and neighbor information

train_ds = RETRODataset(
    num_sequences = NUM_SEQS,
    num_chunks = NUM_CHUNKS,
    num_neighbors = NUM_NEIGHBORS,
    chunk_size = CHUNK_SIZE,
    seq_len = 2048,
    chunk_memmap_path = './train.chunks.dat',
    chunk_nn_memmap_path = './train.chunks.knn.dat',
    seq_memmap_path = './train.seq.dat'
)

train_dl = iter(DataLoader(train_ds, batch_size = 2))
# one forwards and backwards

retro = RETRO(
    max_seq_len = 2048,                    # max sequence length
    enc_dim = 896,                         # encoder model dim
    enc_depth = 3,                         # encoder depth
    dec_dim = 768,                         # decoder model dim
    dec_depth = 12,                        # decoder depth
    dec_cross_attn_layers = (1, 3, 6, 9),  # decoder cross attention layers (with causal chunk cross attention)
    heads = 8,                             # attention heads
    dim_head = 64,                         # dimension per head
    dec_attn_dropout = 0.25,               # decoder attention dropout
    dec_ff_dropout = 0.25                  # decoder feedforward dropout
).cuda()

seq, retrieved = map(lambda t: t.cuda(), next(train_dl))

# seq       - (2, 2049)       - 1 extra token since split by seq[:, :-1], seq[:, 1:]
# retrieved - (2, 32, 2, 128) - 128 since chunk + continuation, each 64 tokens

loss = retro(
    seq,
    retrieved,
    return_loss = True
)

loss.backward()
This repository uses the default tokenizer (sentencepiece) for the cased version of BERT. Embeddings are fetched from vanilla BERT, and can be either the masked mean pooled representation or the CLS token.
Ex. masked mean pooled representation
from retro_pytorch.retrieval import bert_embed, tokenize

ids = tokenize([
    'hello world',
    'foo bar'
])

embeds = bert_embed(ids)  # (2, 768) - 768 is hidden dimension of BERT
Ex. CLS token representation
from retro_pytorch.retrieval import bert_embed, tokenize

ids = tokenize([
    'hello world',
    'foo bar'
])

embeds = bert_embed(ids, return_cls_repr = True)  # (2, 768)
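For intuition, here is what masked mean pooling amounts to, sketched with a Hugging Face BERT (the bert-base-cased checkpoint is an assumption, not necessarily the one the library loads; the CLS representation is simply the hidden state at position 0):

import torch
from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
model = BertModel.from_pretrained('bert-base-cased')

batch = tokenizer(['hello world', 'foo bar'], padding = True, return_tensors = 'pt')
hidden = model(**batch).last_hidden_state  # (2, seq_len, 768)

mask = batch.attention_mask.bool()         # True at real (non-pad) tokens
masked = hidden.masked_fill(~mask[..., None], 0.)
embeds = masked.sum(dim = 1) / mask.sum(dim = 1, keepdim = True)  # (2, 768) mean pooled

cls_embeds = hidden[:, 0]                  # (2, 768) the CLS alternative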
Use text_folder_to_chunks_ to create the chunks and the chunk start indices (used for calculating the ranges of the autoregressive training sequences).
from retro_pytorch.retrieval import text_folder_to_chunks_

stats = text_folder_to_chunks_(
    folder = './text_folder',
    glob = '**/*.txt',
    chunks_memmap_path = './train.chunks.dat',
    seqs_memmap_path = './train.seq.dat',
    doc_ids_memmap_path = './train.doc_ids.dat',  # document ids are needed for filtering out neighbors belonging to same document appropriately during computation of nearest neighbors
    chunk_size = 64,
    seq_len = 2048,
    max_chunks = 1_000_000,
    max_seqs = 100_000
)

# {'chunks': <number of chunks>, 'docs': <number of documents>, 'seqs': <number of sequences>}
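A hedged sketch of how a training sequence relates to those memmapped arrays, assuming the layouts from the mock-data example earlier (each chunk row stores chunk_size + 1 tokens; the exact shapes and dtypes are assumptions):

import numpy as np

chunk_size, seq_len = 64, 2048
chunks_per_seq = seq_len // chunk_size  # 32

chunks = np.memmap('./train.chunks.dat', dtype = np.int32, mode = 'r').reshape(-1, chunk_size + 1)
seq_starts = np.memmap('./train.seq.dat', dtype = np.int32, mode = 'r')

begin = seq_starts[0]                                          # first chunk index of sequence 0
seq = chunks[begin:(begin + chunks_per_seq), :-1].reshape(-1)  # (2048,) token ids

# the extra label token comes from the last chunk's continuation slot -> (2049,)
seq_with_label = np.concatenate([seq, chunks[begin + chunks_per_seq - 1, -1:]])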
You can turn your memmapped chunks numpy array into embeddings and a faiss index with a single command.
from retro_pytorch.retrieval import chunks_to_index_and_embed

index, embeddings = chunks_to_index_and_embed(
    num_chunks = 1000,
    chunk_size = 64,
    chunk_memmap_path = './train.chunks.dat'
)

query_vector = embeddings[:1]                   # use first embedding as query
_, indices = index.search(query_vector, k = 2)  # fetch 2 neighbors, first indices should be self

neighbor_embeddings = embeddings[indices]       # (1, 2, 768)
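Since the first hit is typically the query chunk itself, one simple pattern is to over-fetch by one and drop the first column (the precalculation command below additionally filters out same-document neighbors):

_, indices = index.search(query_vector, k = 3)  # fetch one extra neighbor
neighbor_indices = indices[:, 1:]               # drop the self match - (1, 2)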
You can also directly calculate the nearest-neighbor file necessary for training with the chunks_to_precalculated_knn_ command.
from retro_pytorch.retrieval import chunks_to_precalculated_knn_

chunks_to_precalculated_knn_(
    num_chunks = 1000,
    chunk_size = 64,
    chunk_memmap_path = './train.chunks.dat',     # path to main chunks dataset
    doc_ids_memmap_path = './train.doc_ids.dat',  # path to document ids created by text_folder_to_chunks_, used for filtering out neighbors that belong to the same document
    num_nearest_neighbors = 2,                    # number of nearest neighbors you'd like to use
    num_extra_neighbors = 10                      # fetch 10 extra neighbors, in the case that fetched neighbors are frequently from same document (filtered out)
)

# nearest neighbor info saved to ./train.chunks.knn.dat
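To illustrate why extra neighbors are fetched, here is a toy sketch of same-document filtering (the function and variable names are hypothetical, not the library's API):

import numpy as np

def filter_same_document(query_chunk, candidates, doc_ids, k = 2):
    # keep only candidates whose source document differs from the query chunk's
    kept = [c for c in candidates if doc_ids[c] != doc_ids[query_chunk]]
    return np.array(kept[:k])

doc_ids = np.array([0, 0, 1, 1, 2])                  # document id per chunk
candidates = np.array([0, 1, 2, 4])                  # knn for chunk 0, nearest first
print(filter_same_document(0, candidates, doc_ids))  # [2 4]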
@misc{borgeaud2022improving,
    title         = {Improving language models by retrieving from trillions of tokens},
    author        = {Sebastian Borgeaud and Arthur Mensch and Jordan Hoffmann and Trevor Cai and Eliza Rutherford and Katie Millican and George van den Driessche and Jean-Baptiste Lespiau and Bogdan Damoc and Aidan Clark and Diego de Las Casas and Aurelia Guy and Jacob Menick and Roman Ring and Tom Hennigan and Saffron Huang and Loren Maggiore and Chris Jones and Albin Cassirer and Andy Brock and Michela Paganini and Geoffrey Irving and Oriol Vinyals and Simon Osindero and Karen Simonyan and Jack W. Rae and Erich Elsen and Laurent Sifre},
    year          = {2022},
    eprint        = {2112.04426},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}

@misc{su2021roformer,
    title         = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author        = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year          = {2021},
    eprint        = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}

@article{Wang2022DeepNetST,
    title   = {DeepNet: Scaling Transformers to 1,000 Layers},
    author  = {Hongyu Wang and Shuming Ma and Li Dong and Shaohan Huang and Dongdong Zhang and Furu Wei},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2203.00555}
}

@misc{zhang2021sparse,
    title         = {Sparse Attention with Linear Units},
    author        = {Biao Zhang and Ivan Titov and Rico Sennrich},
    year          = {2021},
    eprint        = {2104.07012},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}
I consider always the adult life to be the continuous retrieval of childhood. - Umberto Eco