Experiments around a simple idea for introducing multiple hierarchical predictive coding models within a GPT. It is that simple; it may not work. But then again, progress in deep learning has been built on simple ideas. Worth a shot.
So far, the idea has passed the litmus test of a research friend. I will bring it to completion within the next week or so. If it does not work out, I will leave the negative experimental results as well as the repository around, and maybe some PhD student can build upon them.
Update: I think it is working?
StabilityAI for the sponsorship to carry out this independent research
🤗 Huggingface for their accelerate library
$ pip install simple-hierarchical-transformer
Three hierarchies, all servicing the prediction of the next token
import torch
from simple_hierarchical_transformer import HierarchicalTransformer

model = HierarchicalTransformer(
    num_tokens = 20000,            # number of tokens
    dim = 512,                     # model dimensions
    depth = 6,                     # depth
    dim_head = 64,                 # dimension per attention head
    heads = 8,                     # attention heads
    seq_len = 2048,                # sequence lengths
    hierarchies = (1, 2, 8),       # hierarchies - here we have 1x (like in a regular transformer), then 2x and 8x compressed hierarchical tokens that undergo their own transformer blocks. information is pooled into one hierarchy at each layer
    window_sizes = (32, 64, None)  # local attention window sizes - the idea is that the higher hierarchies can pass distant information to the local one. None stands for full receptive field. Setting 0 would turn off attention at this hierarchy altogether (while token shift will still be in effect in each layer)
)

ids = torch.randint(0, 20000, (1, 2048))

loss, _ = model(ids, return_loss = True)
loss.backward()

# after much training

logits = model(ids)
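For completeness, here is a minimal greedy decoding sketch built only on the forward call shown above. It is not part of the library's API; it assumes the logits come back with shape (batch, seq, num_tokens) and that prompts shorter than the configured seq_len are accepted.

# hypothetical greedy decoding loop - assumptions noted above

prompt = ids[:, :16]

with torch.no_grad():
    for _ in range(64):
        logits = model(prompt)                       # (1, prompt length, 20000), assumed
        next_token = logits[:, -1].argmax(dim = -1)  # most likely next token
        prompt = torch.cat((prompt, next_token[:, None]), dim = -1)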
By not specifying hierarchies and window_sizes, you essentially default to a regular autoregressive transformer attending across the full sequence length.
# non-hierarchical transformer

model = HierarchicalTransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    dim_head = 64,
    heads = 8,
    seq_len = 2048,
    hierarchies = 1,     # implied 1 if not set
    window_sizes = None  # implied None (full sequence length) if not set
)
Now for something more complex. Experiments show that as you compress up the hierarchies, you need larger model dimensions for the appropriate capacity.
model = HierarchicalTransformer(
    num_tokens = 256,
    dim = (128, 256, 512, 1024),
    depth = 8,
    seq_len = 1024,
    use_flash_attn = True,
    ff_mult = (2, 2, 4, 4),
    dim_head = (16, 32, 64, 64),
    heads = (2, 4, 8, 8),
    hierarchies = (1, 2, 4, 16),
    hierarchical_stride = (1, 1, 1, 8),  # this determines the stride when compressing, and when concatting the hierarchical tokens to the fine tokens, the past tokens will be repeated this many times. causality is not violated, as it uses the trick from hourglass transformers where the sequence is shifted by (compression factor - 1). recommend sticking with 1 except for highly compressed hierarchies, as otherwise it becomes very uncompetitive with the baseline and generations look off
    window_sizes = (16, 32, 64, None)
).cuda()
# hierarchies
# 1x - dim 128 - attention (2 heads, 16 dim, receptive field 16)
# 2x - dim 256 - attention (4 heads, 32 dim, receptive field 32)
# 4x - dim 512 - attention (8 heads, 64 dim, receptive field 64)
# 16x - dim 1024 - attention (8 heads, 64 dim, receptive field of all)
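The causality trick mentioned in the hierarchical_stride comment can be pictured with a small sketch. This is my own illustration, not the repository's actual compressor (which may use dsconv or other pooling), and the helper name and arguments are made up: shift the fine tokens right by (compression factor - 1) before pooling, so that a compressed token at position t only summarizes fine tokens at or before position t.

import torch
import torch.nn.functional as F

def causal_compress(tokens, factor, stride = None):
    # tokens: (batch, seq, dim) fine-token features
    # pad (factor - 1) positions on the left, i.e. shift the sequence right,
    # so each pooled token only covers fine tokens that precede or coincide with it
    stride = stride if stride is not None else factor
    tokens = F.pad(tokens, (0, 0, factor - 1, 0), value = 0.)
    windows = tokens.unfold(1, factor, stride)  # (batch, num windows, dim, factor)
    return windows.mean(dim = -1)               # mean-pool each causal window

x = torch.randn(1, 2048, 512)
compressed = causal_compress(x, factor = 8)     # (1, 256, 512)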
Branch out into two parallel paths, one for hierarchical tokens, the other for plain fine tokens.
Show that local attention over fine + hierarchical tokens can come close to the full-attention baseline
Simple dsconv seems enough to merge for one hierarchy
Auto-set the window size to half the maximum sequence length for the fine and all hierarchies
Figure out the effect of just pooling all fine + hierarchical tokens before the cross-entropy loss - not much of a difference
Complete the ability to add any number of hierarchies, and to specify which hierarchy pools the information from the others for prediction
Fully customizable dimensions across hierarchies, as higher hierarchies require larger model dimensions
Add prophet losses for the hierarchical branches
Allow repeating hierarchy tokens for the fine tokens in the future, as position may matter less as one goes up the hierarchy. Not a priority though, get things working first - implemented as hierarchical_stride
Allow some layers to rely only on token shift, with no attention (a minimal token-shift sketch follows this list)
Random projections + vq, as done in the universal speech model paper out of Brain - for hierarchical predictive coding
Allow specifying which hierarchy receives information from the others during the merge, perhaps with a specialized attention designed via masking, but the differing model dimensions across hierarchies need to be accounted for
Build out a simple local attention block, for use across all hierarchies
Add flash attention to the local attention library
Figure out whether attention can be shared across hierarchies
Do a clean wandb report showing 2x compression on character-level enwik8 without much loss
Try a self-attention-based compressor for hierarchies of 4x or above
Build a small autoencoder at the very start of the network, using the token embeddings as input, then use intermediate feature maps for each parallel hierarchical network
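Token shift, referenced a few times above, is cheap enough to show in full. Below is a minimal sketch of the usual formulation (as popularized by RWKV, cited below); the version inside this repository may differ in details. Half of each position's features are swapped for the previous position's features, giving every token a free glimpse of its immediate past.

import torch
import torch.nn.functional as F

def token_shift(x):
    # x: (batch, seq, dim)
    # keep the first half of the features as-is and replace the second half
    # with the previous timestep's features (zeros at the first position)
    x, x_shifted = x.chunk(2, dim = -1)
    x_shifted = F.pad(x_shifted, (0, 0, 1, -1), value = 0.)  # shift right along the sequence
    return torch.cat((x, x_shifted), dim = -1)

x = torch.randn(1, 1024, 512)
out = token_shift(x)  # (1, 1024, 512)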
The closest idea to this one is hourglass transformers.
And my renewed interest in hierarchical approaches came from reading this article.
@article { Nawrot2021HierarchicalTA ,
title = { Hierarchical Transformers Are More Efficient Language Models } ,
author = { Piotr Nawrot and Szymon Tworkowski and Michal Tyrolski and Lukasz Kaiser and Yuhuai Wu and Christian Szegedy and Henryk Michalewski } ,
journal = { ArXiv } ,
year = { 2021 } ,
volume = { abs/2110.13711 }
}
@inproceedings { dao2022flashattention ,
title = { Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness } ,
author = { Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher } ,
booktitle = { Advances in Neural Information Processing Systems } ,
year = { 2022 }
}
@misc { su2021roformer ,
title = { RoFormer: Enhanced Transformer with Rotary Position Embedding } ,
author = { Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu } ,
year = { 2021 } ,
eprint = { 2104.09864 } ,
archivePrefix = { arXiv } ,
primaryClass = { cs.CL }
}
@inproceedings { Sun2022ALT ,
title = { A Length-Extrapolatable Transformer } ,
author = { Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei } ,
year = { 2022 }
}
@software { peng_bo_2021_5196578 ,
author = { PENG Bo } ,
title = { BlinkDL/RWKV-LM: 0.01 } ,
month = { aug } ,
year = { 2021 } ,
publisher = { Zenodo } ,
version = { 0.01 } ,
doi = { 10.5281/zenodo.5196578 } ,
url = { https://doi.org/10.5281/zenodo.5196578 }
}
@article { Piergiovanni2023Mirasol3BAM ,
title = { Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities } ,
author = { A. J. Piergiovanni and Isaac Noble and Dahun Kim and Michael S. Ryoo and Victor Gomes and Anelia Angelova } ,
journal = { ArXiv } ,
year = { 2023 } ,
volume = { abs/2311.05698 } ,
url = { https://api.semanticscholar.org/CorpusID:265129010 }
}