Version 0.4.9

MagViT2 - Pytorch

Implementation of MagViT2 from "Language Model Beats Diffusion - Tokenizer is Key to Visual Generation" in Pytorch. It currently holds SOTA for video generation / understanding.

The Lookup Free Quantizer proposed in the paper can be found in a separate repository. It should probably be explored for all other modalities, starting with audio.
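The quantizer itself lives in the separate repository, so what follows is only a minimal pure-Python sketch of the idea as the paper describes it: each latent dimension is quantized to -1 or +1 by its sign, and the token id is the integer formed from those sign bits. The function names `lfq_quantize`, `lfq_index` and `lfq_decode` are hypothetical and are not part of `magvit2_pytorch` or the quantizer repository. The auxiliary training losses are left out.

```python
# Illustrative sketch of lookup-free quantization (LFQ); not the library's code.

def lfq_quantize(latent):
    """Quantize each latent dimension to -1 or +1 by its sign (no codebook lookup)."""
    return [1.0 if value > 0 else -1.0 for value in latent]

def lfq_index(latent):
    """Token id: bit i is set when latent dimension i is positive."""
    return sum(1 << i for i, value in enumerate(latent) if value > 0)

def lfq_decode(index, num_dims):
    """Recover the -1 / +1 code vector from a token id."""
    return [1.0 if (index >> i) & 1 else -1.0 for i in range(num_dims)]

# a codebook of size 1024 corresponds to 10 binary dimensions (2 ** 10 == 1024)
codebook_size = 1024
num_dims = codebook_size.bit_length() - 1

# mock latent vector with one value per binary dimension
latent = [0.3, -1.2, 0.7, -0.1, 2.0, -0.5, 0.0, 1.1, -3.0, 0.4]

quantized = lfq_quantize(latent)
index = lfq_index(latent)

# the token id and the quantized vector carry the same information
assert lfq_decode(index, num_dims) == quantized
assert 0 <= index < codebook_size
```

Because the code vector is read directly off the sign bits, no nearest-neighbour search over a learned codebook is needed, which is where the "lookup free" name comes from.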

Please join if you are interested in replicating the tokenizer proposed in this paper out in the open.

Update: Tencent has used the code in this repository and open sourced a working model.

Appreciation

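StabilityAI and Huggingface for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.
Louis Serrano for sharing some early initial runs, validating that the overall architecture converges with finite scalar quantization.
You? If you are a talented research engineer / scientist, feel free to contribute to cutting edge open source science!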

Install

$ pip install magvit2-pytorch

Usage

import torch

from magvit2_pytorch import (
    VideoTokenizer,
    VideoTokenizerTrainer
)

tokenizer = VideoTokenizer(
    image_size = 128,
    init_dim = 64,
    max_dim = 512,
    codebook_size = 1024,
    layers = (
        'residual',
        'compress_space',
        ('consecutive_residual', 2),
        'compress_space',
        ('consecutive_residual', 2),
        'linear_attend_space',
        'compress_space',
        ('consecutive_residual', 2),
        'attend_space',
        'compress_time',
        ('consecutive_residual', 2),
        'compress_time',
        ('consecutive_residual', 2),
        'attend_time',
    )
)

trainer = VideoTokenizerTrainer(
    tokenizer,
    dataset_folder = '/path/to/a/lot/of/media',     # folder of either videos or images, depending on setting below
    dataset_type = 'videos',                        # 'videos' or 'images', prior papers have shown pretraining on images to be effective for video synthesis
    batch_size = 4,
    grad_accum_every = 8,
    learning_rate = 2e-5,
    num_train_steps = 1_000_000
)

trainer.train()

# after a lot of training ...
# can use the EMA of the tokenizer

ema_tokenizer = trainer.ema_tokenizer

# mock video

video = torch.randn(1, 3, 17, 128, 128)

# tokenizing video to discrete codes

codes = ema_tokenizer.tokenize(video) # (1, 9, 16, 16) <- in this example, time downsampled by 4x and space downsampled by 8x. flatten token ids for (non)-autoregressive training

# sanity check

decoded_video = ema_tokenizer.decode_from_code_indices(codes)

assert torch.allclose(
    decoded_video,
    ema_tokenizer(video, return_recon = True)
)
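The comment above mentions flattening the token ids for (non)-autoregressive training. As a rough sketch of what that flattening means, the pure-Python helpers below map a `(time, height, width)` grid position to a position in the flattened sequence and back, using the `(9, 16, 16)` grid from the example. Row-major ordering is an assumption here, and `flatten_position` and `unflatten_position` are hypothetical names, not part of `magvit2_pytorch`.

```python
from math import prod

# token grid from the example above, without the batch dimension
grid_shape = (9, 16, 16)  # (time, height, width)

def flatten_position(t, h, w, shape = grid_shape):
    """Row-major position of grid cell (t, h, w) in the flattened sequence."""
    _, height, width = shape
    return (t * height + h) * width + w

def unflatten_position(index, shape = grid_shape):
    """Inverse of flatten_position: recover (t, h, w) from a sequence position."""
    _, height, width = shape
    t, rest = divmod(index, height * width)
    h, w = divmod(rest, width)
    return t, h, w

# number of tokens a sequence model would see per video
sequence_length = prod(grid_shape)

# the mapping is a bijection over the whole grid
assert all(
    flatten_position(*unflatten_position(i)) == i
    for i in range(sequence_length)
)
```

On a real tensor, `codes.reshape(codes.shape[0], -1)` produces the same row-major ordering while keeping the batch dimension.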

To track your experiments on Weights & Biases, set use_wandb_tracking = True on the VideoTokenizerTrainer, and then use the .trackers context manager.

trainer = VideoTokenizerTrainer(
    use_wandb_tracking = True,
    ...
)

with trainer.trackers(project_name = 'magvit2', run_name = 'baseline'):
    trainer.train()

Todo

Citations

@misc{yu2023language,
    title   = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
    year    = {2023},
    eprint  = {2310.05737},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

@article{Zhang2021TokenST,
    title   = {Token Shift Transformer for Video Classification},
    author  = {Hao Zhang and Y. Hao and Chong-Wah Ngo},
    journal = {Proceedings of the 29th ACM International Conference on Multimedia},
    year    = {2021}
}

@inproceedings{Arora2023ZoologyMA,
    title   = {Zoology: Measuring and Improving Recall in Efficient Language Models},
    author  = {Simran Arora and Sabri Eyuboglu and Aman Timalsina and Isys Johnson and Michael Poli and James Zou and Atri Rudra and Christopher R{\'e}},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:266149332}
}