local attention 다운로드 - local attention 소스코드 다운로드

지역적 관심

언어 모델링을 위한 믿을 수 없을 만큼 강력한 기준을 설정하는 로컬 창 주의 구현입니다. 변압기는 이전 레이어의 결과를 통합하기 위해 전 세계적인 관심을 위해 예약된 상위 레이어와 함께 하위 레이어에 로컬 주의가 필요하다는 것이 분명해지고 있습니다. 이 저장소를 사용하면 로컬 창 주의를 즉시 쉽게 사용할 수 있습니다.

이 코드는 희소 장거리 관심의 다양한 구현과 함께 이미 여러 저장소에서 전투 테스트를 거쳤습니다.

설치하다

$ pip install local-attention

용법

 import torch
from local_attention import LocalAttention

q = torch . randn ( 2 , 8 , 2048 , 64 )
k = torch . randn ( 2 , 8 , 2048 , 64 )
v = torch . randn ( 2 , 8 , 2048 , 64 )

attn = LocalAttention (
    dim = 64 ,                # dimension of each head (you need to pass this in for relative positional encoding)
    window_size = 512 ,       # window size. 512 is optimal, but 256 or 128 yields good enough results
    causal = True ,           # auto-regressive or not
    look_backward = 1 ,       # each window looks at the window before
    look_forward = 0 ,        # for non-auto-regressive case, will default to 1, so each window looks at the window before and after it
    dropout = 0.1 ,           # post-attention dropout
    exact_windowsize = False # if this is set to true, in the causal setting, each query will see at maximum the number of keys equal to the window size
)

mask = torch . ones ( 2 , 2048 ). bool ()
out = attn ( q , k , v , mask = mask ) # (2, 8, 2048, 64)

또한 이 라이브러리를 사용하면 공유 쿼리/키 공간(Reformer 아키텍처) 설정 시 로컬 주의를 끌 수 있습니다. 키의 정규화는 물론 토큰 자체의 마스킹도 처리됩니다.

 import torch
from local_attention import LocalAttention

qk = torch . randn ( 2 , 8 , 2048 , 64 )
v  = torch . randn ( 2 , 8 , 2048 , 64 )

attn = LocalAttention (
    dim = 64 ,
    window_size = 512 ,
    shared_qk = True ,
    causal = True
)

mask = torch . ones ( 2 , 2048 ). bool ()
out = attn ( qk , qk , v , mask = mask ) # (2, 8, 2048, 64)

모듈이 쿼리/키/값과 마스크를 자동으로 채우도록 하려면 autopad 키워드를 True 로 설정하기만 하면 됩니다.

 import torch
from local_attention import LocalAttention

q = torch . randn ( 8 , 2057 , 64 )
k = torch . randn ( 8 , 2057 , 64 )
v = torch . randn ( 8 , 2057 , 64 )

attn = LocalAttention (
    window_size = 512 ,
    causal = True ,
    autopad = True      # auto pads both inputs and mask, then truncates output appropriately
)

mask = torch . ones ( 1 , 2057 ). bool ()
out = attn ( q , k , v , mask = mask ) # (8, 2057, 64)

로컬 어텐션 트랜스포머

완전한 지역적 관심 변환기

 import torch
from local_attention import LocalTransformer

model = LocalTransformer (
    num_tokens = 256 ,
    dim = 512 ,
    depth = 6 ,
    max_seq_len = 8192 ,
    causal = True ,
    local_attn_window_size = 256
). cuda ()

x = torch . randint ( 0 , 256 , ( 1 , 8192 )). cuda ()

logits = model ( x ) # (1, 8192, 256)

4096의 Enwik8

창 크기 256, 룩백 1, 총 수용 필드 512

$ python train.py

소환

 @inproceedings { rae-razavi-2020-transformers ,
    title   = " Do Transformers Need Deep Long-Range Memory? " ,
    author  = " Rae, Jack  and Razavi, Ali " ,
    booktitle = " Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics " ,
    month   = jul,
    year    = " 2020 " ,
    address = " Online " ,
    publisher = " Association for Computational Linguistics " ,
    url     = " https://www.aclweb.org/anthology/2020.acl-main.672 "
}

 @misc { roy*2020efficient ,
    title   = { Efficient Content-Based Sparse Attention with Routing Transformers } ,
    author  = { Aurko Roy* and Mohammad Taghi Saffar* and David Grangier and Ashish Vaswani } ,
    year    = { 2020 } ,
    url     = { https://arxiv.org/pdf/2003.05997.pdf }
}

 @misc { beltagy2020longformer ,
    title   = { Longformer: The Long-Document Transformer } ,
    author  = { Iz Beltagy and Matthew E. Peters and Arman Cohan } ,
    year    = { 2020 } ,
    eprint  = { 2004.05150 } ,
    archivePrefix = { arXiv } ,
    primaryClass = { cs.CL }
}

 @inproceedings { Sun2022ALT ,
    title     = { A Length-Extrapolatable Transformer } ,
    author    = { Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei } ,
    year      = { 2022 }
}

 @article { Bondarenko2023QuantizableTR ,
    title   = { Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing } ,
    author  = { Yelysei Bondarenko and Markus Nagel and Tijmen Blankevoort } ,
    journal = { ArXiv } ,
    year    = { 2023 } ,
    volume  = { abs/2306.12929 } ,
    url     = { https://api.semanticscholar.org/CorpusID:259224568 }
}