Experiments around a simple idea of introducing multiple hierarchical predictive coding models within a GPT. It is that simple; it may not work. But then again, deep learning progress has been built on a bedrock of simple ideas. Worth a shot.
So far, the idea has passed the litmus test of a research friend. It will be brought to completion over the next week or so. If it does not work out, I will leave the negative experimental results as well as the repository around, and maybe some PhD student can build upon it.
Update: I think it is working?
- StabilityAI for the sponsorship to carry out this independent research
- 🤗 Huggingface for their accelerate library
$ pip install simple-hierarchical-transformer
Three hierarchies, all of them serving the prediction of the next token:
import torch
from simple_hierarchical_transformer import HierarchicalTransformer

model = HierarchicalTransformer(
    num_tokens = 20000,           # number of tokens
    dim = 512,                    # model dimensions
    depth = 6,                    # depth
    dim_head = 64,                # dimension per attention head
    heads = 8,                    # attention heads
    seq_len = 2048,               # sequence lengths
    hierarchies = (1, 2, 8),      # hierarchies - here we have 1x (like in a regular transformer), then 2x and 8x compressed hierarchical tokens that undergo their own transformer blocks. information is pooled into one hierarchy at each layer
    window_sizes = (32, 64, None) # local attention window sizes - the idea is that the higher hierarchies can pass distant information to the local one. None stands for full receptive field. Setting 0 would turn off attention at this hierarchy altogether (while token shift will still be in effect in each layer)
)

ids = torch.randint(0, 20000, (1, 2048))

loss, _ = model(ids, return_loss = True)
loss.backward()

# after much training

logits = model(ids)
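Once trained, the forward call without `return_loss` returns per-position logits, so decoding can be done with an ordinary autoregressive loop. Below is a minimal greedy-decoding sketch around that forward API; it assumes there is no built-in generation helper and that the forward pass accepts sequences shorter than `seq_len`, and it simply appends the argmax of the last position at each step.

```python
import torch

prompt = torch.randint(0, 20000, (1, 32))  # hypothetical prompt token ids

model.eval()
with torch.no_grad():
    out = prompt
    for _ in range(64):
        logits = model(out[:, -2048:])                              # crop context to seq_len
        next_token = logits[:, -1].argmax(dim = -1, keepdim = True) # greedy pick at last position
        out = torch.cat((out, next_token), dim = -1)
```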
By not specifying `hierarchies` and `window_sizes`, you basically default to a regular autoregressive transformer with attention across the full sequence length.
# non-hierarchical transformer

model = HierarchicalTransformer(
    num_tokens = 20000,
    dim = 512,
    depth = 8,
    dim_head = 64,
    heads = 8,
    seq_len = 2048,
    hierarchies = 1,    # implied 1 if not set
    window_sizes = None # implied None (full sequence length) if not set
)
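For completeness, here is a minimal training-loop sketch around the `return_loss = True` API shown earlier. The optimizer, learning rate, step count, and random data are illustrative placeholders, not anything prescribed by the repository.

```python
import torch
from torch.optim import Adam

optim = Adam(model.parameters(), lr = 3e-4)

for _ in range(100):
    ids = torch.randint(0, 20000, (1, 2048))   # substitute real token ids here
    loss, _ = model(ids, return_loss = True)
    loss.backward()
    optim.step()
    optim.zero_grad()
```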
Now for something more complex. Experiments show that as you compress up the hierarchies, you need larger model dimensions for the appropriate capacity.
model = HierarchicalTransformer(
    num_tokens = 256,
    dim = (128, 256, 512, 1024),
    depth = 8,
    seq_len = 1024,
    use_flash_attn = True,
    ff_mult = (2, 2, 4, 4),
    dim_head = (16, 32, 64, 64),
    heads = (2, 4, 8, 8),
    hierarchies = (1, 2, 4, 16),
    hierarchical_stride = (1, 1, 1, 8), # this would determine the stride when compressing, and when concatting the hierarchical tokens to the fine tokens, the past tokens will be repeated this amount of time. causality is not violated as using the trick from hourglass transformers where sequence is shifted by compression factor - 1. recommend sticking with 1 except for highly compressed hierarchies, as it becomes very uncompetitive with baseline and generations look off
    window_sizes = (16, 32, 64, None)
).cuda()

# hierarchies
# 1x  - dim 128  - attention (2 heads, 16 dim, receptive field 16)
# 2x  - dim 256  - attention (4 heads, 32 dim, receptive field 32)
# 4x  - dim 512  - attention (8 heads, 64 dim, receptive field 64)
# 16x - dim 1024 - attention (8 heads, 64 dim, receptive field of all)
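The `hierarchical_stride` comment above refers to the hourglass-transformer trick of shifting the sequence by (compression factor - 1) before pooling so that causality is preserved. The snippet below is only a rough illustration of that idea, not the repository's actual compression code; the function name and the choice of average pooling are assumptions.

```python
import torch
import torch.nn.functional as F

def causal_compress(tokens, factor):
    # tokens: (batch, seq, dim), with seq divisible by factor
    # shift right by (factor - 1) positions so the j-th pooled token only
    # summarizes fine tokens at positions <= j * factor, i.e. no future leakage
    # when it is later concatenated back to the fine tokens
    tokens = F.pad(tokens, (0, 0, factor - 1, 0), value = 0.)
    tokens = tokens.transpose(1, 2)                  # (batch, dim, seq + factor - 1)
    pooled = F.avg_pool1d(tokens, kernel_size = factor, stride = factor)
    return pooled.transpose(1, 2)                    # (batch, seq // factor, dim)

coarse = causal_compress(torch.randn(1, 1024, 512), factor = 4)  # (1, 256, 512)
```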
- branch out to two parallel paths, one for hierarchical tokens, the other for plain fine tokens
- show that local attention in fine + hierarchical tokens can come close to the full-attention baseline
- a simple dsconv seems enough to merge for 1 hierarchy
- auto-set the window size to half of the max sequence length for the fine and all hierarchies
- figure out the effect of just pooling all fine + hierarchical tokens before the cross entropy loss - not much of a difference
- complete the ability to add any number of hierarchies, and designate which hierarchy will pool the information from the others for prediction
- fully customizable dimensions across hierarchies, as higher hierarchies require greater model dimensions
- add prophet losses for the hierarchical branch
- allow for repeating hierarchy tokens for fine tokens in the future, as position may matter less as one goes up the hierarchy. but not a priority, get things working first - implemented as hierarchical_stride
- allow for some layers to rely only on token shift, with no attention (a rough sketch of token shift follows this list)
- random projections + vq, as was done in the universal speech model paper from Brain - for hierarchical predictive coding
- allow for specifying which hierarchy receives information from the others during the merge; maybe design a specialized attention with masking, but need to account for the different model dimensions across hierarchies
- build out a simple local attention block, for use across all hierarchies
- add flash attention to the local attention library
- figure out whether attention can be shared across hierarchies
- do a clean wandb report showing 2x compression without much loss on character-level enwik8
- try a self-attention-based compressor for hierarchies of 4x or above
- build a small autoencoder at the very beginning of the network, using token embeddings as input, then use the intermediate feature maps for each parallel hierarchical network
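As referenced in the token-shift item above, token shift (popularized by RWKV) lets a layer mix in the immediately preceding token without any attention. Below is a minimal sketch of one common formulation; it illustrates the general technique rather than this repository's exact implementation, and it assumes an even feature dimension.

```python
import torch
import torch.nn.functional as F

def token_shift(x):
    # x: (batch, seq, dim) with an even dim
    # half the feature channels keep the current token, the other half are
    # shifted back by one position so each token sees its predecessor
    x, x_shifted = x.chunk(2, dim = -1)
    x_shifted = F.pad(x_shifted, (0, 0, 1, -1), value = 0.)   # shift along the sequence
    return torch.cat((x, x_shifted), dim = -1)

out = token_shift(torch.randn(1, 1024, 512))   # same shape as the input
```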
The closest idea would be hourglass transformers.
And my renewed interest in hierarchical approaches came from reading this.
@article{Nawrot2021HierarchicalTA,
    title   = {Hierarchical Transformers Are More Efficient Language Models},
    author  = {Piotr Nawrot and Szymon Tworkowski and Michal Tyrolski and Lukasz Kaiser and Yuhuai Wu and Christian Szegedy and Henryk Michalewski},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2110.13711}
}

@inproceedings{dao2022flashattention,
    title     = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author    = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2022}
}

@misc{su2021roformer,
    title         = {RoFormer: Enhanced Transformer with Rotary Position Embedding},
    author        = {Jianlin Su and Yu Lu and Shengfeng Pan and Bo Wen and Yunfeng Liu},
    year          = {2021},
    eprint        = {2104.09864},
    archivePrefix = {arXiv},
    primaryClass  = {cs.CL}
}

@inproceedings{Sun2022ALT,
    title  = {A Length-Extrapolatable Transformer},
    author = {Yutao Sun and Li Dong and Barun Patra and Shuming Ma and Shaohan Huang and Alon Benhaim and Vishrav Chaudhary and Xia Song and Furu Wei},
    year   = {2022}
}

@software{peng_bo_2021_5196578,
    author    = {PENG Bo},
    title     = {BlinkDL/RWKV-LM: 0.01},
    month     = {aug},
    year      = {2021},
    publisher = {Zenodo},
    version   = {0.01},
    doi       = {10.5281/zenodo.5196578},
    url       = {https://doi.org/10.5281/zenodo.5196578}
}

@article{Piergiovanni2023Mirasol3BAM,
    title   = {Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities},
    author  = {A. J. Piergiovanni and Isaac Noble and Dahun Kim and Michael S. Ryoo and Victor Gomes and Anelia Angelova},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2311.05698},
    url     = {https://api.semanticscholar.org/CorpusID:265129010}
}