Version 0.3.1

Flash Attention - Jax

An implementation of Flash Attention in Jax. It will likely not be as performant as the official CUDA version, because Jax offers no fine-grained control over memory management. It is intended for educational purposes, and to see how clever the XLA compiler is (or is not).
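
To make the core idea concrete, here is a minimal sketch of attention computed over key/value chunks with running softmax statistics, in the spirit of the two papers listed under Citations. It is not this library's code: `reference_attention`, `chunked_attention`, and `chunk_size` are illustrative names, the inputs are assumed to be single-head `(seq, dim)` arrays, and only the keys and values are chunked (the queries are processed all at once).

```python
import jax
import jax.numpy as jnp
from jax import lax, random


def reference_attention(q, k, v):
    """Plain softmax attention that materializes the full (n, n) score matrix."""
    scores = (q @ k.T) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def chunked_attention(q, k, v, chunk_size=128):
    """Attention over key/value chunks using running softmax statistics.

    Illustrative sketch only: q, k, v have shape (n, d) and n must be
    divisible by chunk_size.
    """
    n, d = q.shape
    scale = 1.0 / jnp.sqrt(d)
    k_chunks = k.reshape(-1, chunk_size, d)
    v_chunks = v.reshape(-1, chunk_size, d)

    def step(carry, kv):
        acc, row_sum, row_max = carry
        k_c, v_c = kv
        scores = (q @ k_c.T) * scale                  # (n, chunk_size)
        new_max = jnp.maximum(row_max, scores.max(axis=-1))
        correction = jnp.exp(row_max - new_max)       # rescales the old statistics
        p = jnp.exp(scores - new_max[:, None])
        acc = acc * correction[:, None] + p @ v_c     # unnormalized output
        row_sum = row_sum * correction + p.sum(axis=-1)
        return (acc, row_sum, new_max), None

    init = (jnp.zeros((n, d)), jnp.zeros((n,)), jnp.full((n,), -jnp.inf))
    (acc, row_sum, _), _ = lax.scan(step, init, (k_chunks, v_chunks))
    return acc / row_sum[:, None]


q_key, k_key, v_key = random.split(random.PRNGKey(0), 3)
q = random.normal(q_key, (256, 16))
k = random.normal(k_key, (256, 16))
v = random.normal(v_key, (256, 16))

out = chunked_attention(q, k, v, chunk_size=64)
print(jnp.abs(out - reference_attention(q, k, v)).max())
```

In this sketch the largest intermediate array has shape `(n, chunk_size)` rather than `(n, n)`, which is where the memory saving comes from. Whether XLA then schedules the work as efficiently as a hand-written CUDA kernel is exactly the open question the paragraph above raises.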

Install

$ pip install flash-attention-jax

Usage

from jax import random
from flash_attention_jax import flash_attention

rng_key = random.PRNGKey(42)
# Split the key so q, k, v and the mask are drawn independently
q_key, k_key, v_key, mask_key = random.split(rng_key, 4)

q = random.normal(q_key, (1, 2, 131072, 512))  # (batch, heads, seq, dim)
k = random.normal(k_key, (1, 2, 131072, 512))
v = random.normal(v_key, (1, 2, 131072, 512))
mask = random.randint(mask_key, (1, 131072), 0, 2)  # (batch, seq)

out, _ = flash_attention(q, k, v, mask)

out.shape  # (1, 2, 131072, 512) - (batch, heads, seq, dim)

Quick sanity check

from flash_attention_jax import plain_attention, flash_attention, value_and_grad_difference

diff, (dq_diff, dk_diff, dv_diff) = value_and_grad_difference(
    plain_attention,
    flash_attention,
    seed = 42
)

print('shows differences between normal and flash attention for output, dq, dk, dv')
print(f'o: {diff}')       # < 1e-4
print(f'dq: {dq_diff}')   # < 1e-6
print(f'dk: {dk_diff}')   # < 1e-6
print(f'dv: {dv_diff}')   # < 1e-6
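
For readers curious how a check of this kind could be put together, the following is a rough sketch built on `jax.grad`. It does not reproduce the internals of `value_and_grad_difference`; the helper `output_and_grad_difference`, its parameters, and the two toy attention functions are hypothetical stand-ins chosen so the example runs without the library.

```python
import jax
import jax.numpy as jnp
from jax import random


def attention_a(q, k, v):
    """Softmax attention using jax.nn.softmax."""
    scores = (q @ k.T) / jnp.sqrt(q.shape[-1])
    return jax.nn.softmax(scores, axis=-1) @ v


def attention_b(q, k, v):
    """The same math, written with an explicit max-subtracted exponential."""
    scores = (q @ k.T) / jnp.sqrt(q.shape[-1])
    weights = jnp.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights @ v) / weights.sum(axis=-1, keepdims=True)


def output_and_grad_difference(fn_a, fn_b, seed=42, seq_len=64, dim=16):
    """Hypothetical helper: max absolute difference in output and in dq, dk, dv."""
    q_key, k_key, v_key = random.split(random.PRNGKey(seed), 3)
    q = random.normal(q_key, (seq_len, dim))
    k = random.normal(k_key, (seq_len, dim))
    v = random.normal(v_key, (seq_len, dim))

    def run(fn):
        out = fn(q, k, v)
        # Use the summed output as a scalar loss so gradients are defined.
        grads = jax.grad(lambda q, k, v: fn(q, k, v).sum(), argnums=(0, 1, 2))(q, k, v)
        return out, grads

    out_a, grads_a = run(fn_a)
    out_b, grads_b = run(fn_b)
    out_diff = jnp.abs(out_a - out_b).max()
    grad_diffs = tuple(jnp.abs(ga - gb).max() for ga, gb in zip(grads_a, grads_b))
    return out_diff, grad_diffs


diff, (dq_diff, dk_diff, dv_diff) = output_and_grad_difference(attention_a, attention_b)
print(diff, dq_diff, dk_diff, dv_diff)
```

Comparing gradients as well as outputs matters here because an attention variant with a custom backward pass can match the forward result while still producing wrong gradients.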

Autoregressive Flash Attention - GPT-like decoder attention

from jax import random
from flash_attention_jax import causal_flash_attention

rng_key = random.PRNGKey(42)
# Split the key so q, k and v are drawn independently
q_key, k_key, v_key = random.split(rng_key, 3)

q = random.normal(q_key, (131072, 512))  # (seq, dim)
k = random.normal(k_key, (131072, 512))
v = random.normal(v_key, (131072, 512))

out, _ = causal_flash_attention(q, k, v)

out.shape  # (131072, 512)
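
As a point of comparison, causal (autoregressive) attention means that position i may only attend to positions up to and including i. The sketch below is a plain, quadratic-memory reference for `(seq, dim)` inputs; `plain_causal_attention` is an illustrative name and is not part of this library's API.

```python
import jax
import jax.numpy as jnp
from jax import random


def plain_causal_attention(q, k, v):
    """Reference causal attention for (seq, dim) inputs; materializes an (n, n) matrix."""
    n, d = q.shape
    scores = (q @ k.T) / jnp.sqrt(d)
    # Lower-triangular mask: row i keeps columns 0..i and hides future positions.
    causal_mask = jnp.tril(jnp.ones((n, n), dtype=bool))
    scores = jnp.where(causal_mask, scores, -jnp.inf)
    return jax.nn.softmax(scores, axis=-1) @ v


q_key, k_key, v_key = random.split(random.PRNGKey(0), 3)
q = random.normal(q_key, (8, 4))
k = random.normal(k_key, (8, 4))
v = random.normal(v_key, (8, 4))

out = plain_causal_attention(q, k, v)
print(out.shape)
```

A small reference like this is only practical for short sequences, but it gives something to compare a memory-efficient causal implementation against.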

Todo

Support leading dimensions (such as batch and heads) for the causal flash attention variant
Figure out the issue with jit and static argnums
Comment with references to the paper's algorithms and explanations
Make sure it can work with one-headed key / values, as in PaLM
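
To illustrate the last item, "one-headed key / values" refers to the arrangement used in PaLM where every query head shares a single key/value head. The sketch below shows that arrangement for ordinary (non-flash) attention using `jnp.einsum`; `multi_query_attention` and the shapes chosen are assumptions for illustration, not this library's API.

```python
import jax
import jax.numpy as jnp
from jax import random


def multi_query_attention(q, k, v):
    """Attention where all query heads share one key/value head.

    q: (batch, heads, seq, dim); k, v: (batch, seq, dim).
    """
    scale = q.shape[-1] ** -0.5
    # k has no head axis, so the einsum reuses it for every query head.
    scores = jnp.einsum('bhid,bjd->bhij', q, k) * scale
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum('bhij,bjd->bhid', weights, v)


q_key, k_key, v_key = random.split(random.PRNGKey(0), 3)
q = random.normal(q_key, (2, 4, 16, 8))  # (batch, heads, seq, dim)
k = random.normal(k_key, (2, 16, 8))     # (batch, seq, dim)
v = random.normal(v_key, (2, 16, 8))

out = multi_query_attention(q, k, v)
print(out.shape)
```

Letting the einsum share the key/value head avoids copying `k` and `v` once per query head, which is the main appeal of this layout.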

Citations

@article{Dao2022FlashAttentionFA,
    title   = {FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness},
    author  = {Tri Dao and Daniel Y. Fu and Stefano Ermon and Atri Rudra and Christopher R{\'e}},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2205.14135}
}

@article{Rabe2021SelfattentionDN,
    title   = {Self-attention Does Not Need $O(n^2)$ Memory},
    author  = {Markus N. Rabe and Charles Staats},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2112.05682}
}