magvit2 pytorch
0.4.9
Implementation of MagViT2 from Language Model Beats Diffusion - Tokenizer is Key to Visual Generation, in Pytorch. It currently holds SOTA for video generation / understanding.
The Lookup Free Quantizer proposed in the paper can be found in a separate repository. It should probably be explored for all other modalities as well, starting with audio.
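The LFQ module itself lives outside this repository; the following is a minimal sketch of using it standalone, assuming the LFQ interface exposed by the vector-quantize-pytorch package (the constructor arguments and the (quantized, indices, aux_loss) return signature are assumptions based on that package, not part of this repo).

import torch
from vector_quantize_pytorch import LFQ  # assumed external dependency

quantizer = LFQ(
    codebook_size = 2 ** 16,    # implicit codebook (no embedding table), size must be a power of 2
    dim = 16,                   # feature dimension, log2(codebook_size) when no projection is used
    entropy_loss_weight = 0.1,  # weight on the entropy auxiliary loss described in the paper
    diversity_gamma = 1.        # balances per-sample vs. batch codebook entropy
)

image_feats = torch.randn(1, 16, 32, 32)  # (batch, dim, height, width) feature map

quantized, indices, entropy_aux_loss = quantizer(image_feats)

assert quantized.shape == image_feats.shape  # quantized features keep the input shape
assert indices.shape == (1, 32, 32)          # one integer code id per spatial position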
Please join in if you are interested in replicating the tokenizer proposed in this paper out in the open.
Update: Tencent has used the code in this repository and open sourced a working model.
Thanks to StabilityAI for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.
Thanks to Louis Serrano for sharing some early initial runs, validating that the overall architecture converges with finite scalar quantization.
You? If you are a talented research engineer / scientist, feel free to contribute to cutting edge open source science!
$ pip install magvit2-pytorch
import torch

from magvit2_pytorch import (
    VideoTokenizer,
    VideoTokenizerTrainer
)
tokenizer = VideoTokenizer(
    image_size = 128,
    init_dim = 64,
    max_dim = 512,
    codebook_size = 1024,
    layers = (
        'residual',
        'compress_space',
        ('consecutive_residual', 2),
        'compress_space',
        ('consecutive_residual', 2),
        'linear_attend_space',
        'compress_space',
        ('consecutive_residual', 2),
        'attend_space',
        'compress_time',
        ('consecutive_residual', 2),
        'compress_time',
        ('consecutive_residual', 2),
        'attend_time',
    )
)
trainer = VideoTokenizerTrainer(
    tokenizer,
    dataset_folder = '/path/to/a/lot/of/media', # folder of either videos or images, depending on setting below
    dataset_type = 'videos',                    # 'videos' or 'images', prior papers have shown pretraining on images to be effective for video synthesis
    batch_size = 4,
    grad_accum_every = 8,
    learning_rate = 2e-5,
    num_train_steps = 1_000_000
)

trainer.train()
# after a lot of training ...
# can use the EMA of the tokenizer

ema_tokenizer = trainer.ema_tokenizer

# mock video

video = torch.randn(1, 3, 17, 128, 128)

# tokenizing video to discrete codes

codes = ema_tokenizer.tokenize(video) # (1, 9, 16, 16) <- in this example, time downsampled by 4x and space downsampled by 8x. flatten token ids for (non)-autoregressive training

# sanity check

decoded_video = ema_tokenizer.decode_from_code_indices(codes)

assert torch.allclose(
    decoded_video,
    ema_tokenizer(video, return_recon = True)
)
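For downstream (non-)autoregressive training, the discrete codes can simply be flattened into one token sequence and reshaped back to the feature-map layout before decoding. A minimal sketch using einops, assuming the (1, 9, 16, 16) code shape from the comment above (the actual shape depends on the chosen layers):

from einops import rearrange

# flatten (batch, frames, height, width) code indices into a single token sequence
token_ids = rearrange(codes, 'b t h w -> b (t h w)')

# ... train an autoregressive or MaskGit-style model on token_ids ...

# restore the feature map layout, then decode back to video
codes_restored = rearrange(token_ids, 'b (t h w) -> b t h w', t = 9, h = 16, w = 16)
video_recon = ema_tokenizer.decode_from_code_indices(codes_restored)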
To track your experiments with Weights & Biases, set use_wandb_tracking = True on the VideoTokenizerTrainer, and then use the .trackers context manager.
trainer = VideoTokenizerTrainer(
    use_wandb_tracking = True,
    ...
)

with trainer.trackers(project_name = 'magvit2', run_name = 'baseline'):
    trainer.train()
Todo:
Magvit2 Tokenizer
decode_from_codebook_indices should be able to accept flattened ids, reshape them to the correct feature map dimensions, and decode back to video
improvise an RQ Video Transformer, since residual LFQ now actually makes sense
MaskGit
@misc{yu2023language,
    title   = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
    year    = {2023},
    eprint  = {2310.05737},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}
@inproceedings{dao2022flashattention,
    title     = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author    = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year      = {2022}
}
@article{Zhang2021TokenST,
    title   = {Token Shift Transformer for Video Classification},
    author  = {Hao Zhang and Y. Hao and Chong-Wah Ngo},
    journal = {Proceedings of the 29th ACM International Conference on Multimedia},
    year    = {2021}
}
@inproceedings{Arora2023ZoologyMA,
    title  = {Zoology: Measuring and Improving Recall in Efficient Language Models},
    author = {Simran Arora and Sabri Eyuboglu and Aman Timalsina and Isys Johnson and Michael Poli and James Zou and Atri Rudra and Christopher R{\'e}},
    year   = {2023},
    url    = {https://api.semanticscholar.org/CorpusID:266149332}
}