magvit2-pytorch 0.4.9

MagViT2 - Pytorch

Implementation of MagViT2 from Language Model Beats Diffusion - Tokenizer is Key to Visual Generation, in Pytorch. It currently holds SOTA for video generation / understanding.

The Lookup Free Quantizer proposed in the paper can be found in a separate repository. It should probably be explored for all other modalities, starting with audio.
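As a rough illustration of the lookup free quantization idea from the paper, the sketch below quantizes each latent dimension to -1 or +1 by its sign and reads the resulting bits as an integer token id. The function name `lfq_quantize` and the sample latent are hypothetical and only for illustration; this is not the code from the separate repository, which also handles batching, training losses, and gradients.

```python
import math

def lfq_quantize(latent):
    """Sign-quantize a latent vector and return (quantized vector, token id).

    Hypothetical helper for illustration only.
    """
    # One bit per dimension: 1 if the value is positive, else 0.
    bits = [1 if x > 0 else 0 for x in latent]
    # The quantized vector lives in {-1, +1}^d, so no codebook lookup is needed.
    quantized = [1.0 if b else -1.0 for b in bits]
    # Read the bits as a binary number; dimension i contributes 2**i.
    index = sum(b << i for i, b in enumerate(bits))
    return quantized, index

codebook_size = 1024
# A codebook of size 1024 corresponds to log2(1024) = 10 binary dimensions.
num_dims = int(math.log2(codebook_size))

# Made-up latent with num_dims entries.
latent = [0.3, -1.2, 0.8, -0.1, 2.0, -0.5, 0.05, -3.0, 1.1, -0.7]
quantized, index = lfq_quantize(latent)
print(num_dims, index)
```

Because every token id is just a bit pattern, the codebook size can grow by adding dimensions rather than by storing more embedding vectors.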

Please join if you are interested in replicating the tokenizer proposed in this paper out in the open.

Update: Tencent has used the code in this repository and open sourced a working model.

Appreciation

StabilityAI and Huggingface for the generous sponsorship, as well as my other sponsors, for affording me the independence to open source artificial intelligence.
Louis Serrano for sharing some early initial runs, validating that the overall architecture converges with finite scalar quantization.
You? If you are a talented research engineer / scientist, feel free to contribute to cutting edge open source science!

Install

$ pip install magvit2-pytorch

Usage

import torch  # needed for the mock video tensor below

from magvit2_pytorch import (
    VideoTokenizer,
    VideoTokenizerTrainer
)

tokenizer = VideoTokenizer(
    image_size = 128,
    init_dim = 64,
    max_dim = 512,
    codebook_size = 1024,
    layers = (
        'residual',
        'compress_space',
        ('consecutive_residual', 2),
        'compress_space',
        ('consecutive_residual', 2),
        'linear_attend_space',
        'compress_space',
        ('consecutive_residual', 2),
        'attend_space',
        'compress_time',
        ('consecutive_residual', 2),
        'compress_time',
        ('consecutive_residual', 2),
        'attend_time',
    )
)

trainer = VideoTokenizerTrainer(
    tokenizer,
    dataset_folder = '/path/to/a/lot/of/media',     # folder of either videos or images, depending on setting below
    dataset_type = 'videos',                        # 'videos' or 'images', prior papers have shown pretraining on images to be effective for video synthesis
    batch_size = 4,
    grad_accum_every = 8,
    learning_rate = 2e-5,
    num_train_steps = 1_000_000
)

trainer.train()

# after a lot of training ...
# can use the EMA of the tokenizer

ema_tokenizer = trainer.ema_tokenizer

# mock video

video = torch.randn(1, 3, 17, 128, 128)

# tokenizing video to discrete codes

codes = ema_tokenizer.tokenize(video) # (1, 9, 16, 16) <- in this example, time downsampled by 4x and space downsampled by 8x. flatten token ids for (non)-autoregressive training

# sanity check

decoded_video = ema_tokenizer.decode_from_code_indices(codes)

assert torch.allclose(
    decoded_video,
    ema_tokenizer(video, return_recon = True)
)
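The downsampling factors quoted in the comment above can be read off the layers tuple. The sketch below assumes that every 'compress_space' or 'compress_time' entry halves the corresponding axis; that is an assumption made for illustration, not something confirmed by the library source. The helpers `downsample_factors` and `flatten_codes` are hypothetical and not part of magvit2_pytorch. The second helper shows one possible way (time-major raster order, also an assumption) to flatten a grid of token ids into a single sequence for (non)-autoregressive training.

```python
layers = (
    'residual',
    'compress_space',
    ('consecutive_residual', 2),
    'compress_space',
    ('consecutive_residual', 2),
    'linear_attend_space',
    'compress_space',
    ('consecutive_residual', 2),
    'attend_space',
    'compress_time',
    ('consecutive_residual', 2),
    'compress_time',
    ('consecutive_residual', 2),
    'attend_time',
)

def downsample_factors(layers):
    """Return (space_factor, time_factor), assuming each compress layer halves its axis."""
    space, time = 1, 1
    for layer in layers:
        # Entries are either a plain name or a (name, argument) tuple.
        name = layer[0] if isinstance(layer, tuple) else layer
        if name == 'compress_space':
            space *= 2
        elif name == 'compress_time':
            time *= 2
    return space, time

def flatten_codes(grid):
    """Flatten a nested [frame][row][column] grid of token ids into one list."""
    return [code for frame in grid for row in frame for code in row]

space_factor, time_factor = downsample_factors(layers)
image_size = 128
tokens_per_side = image_size // space_factor

# Tiny made-up grid: 2 frames of 2 x 2 token ids.
toy_grid = [[[0, 1], [2, 3]], [[4, 5], [6, 7]]]
sequence = flatten_codes(toy_grid)

print(space_factor, time_factor, tokens_per_side, sequence)
```

With three 'compress_space' layers and two 'compress_time' layers this gives 8x spatial and 4x temporal downsampling, so a 128 x 128 frame maps to a 16 x 16 grid of token ids.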

To track your experiments on Weights & Biases, set use_wandb_tracking = True on the VideoTokenizerTrainer, and then use the .trackers context manager.

trainer = VideoTokenizerTrainer(
    use_wandb_tracking = True,
    ...
)

with trainer.trackers(project_name = 'magvit2', run_name = 'baseline'):
    trainer.train()

Todo

Citations

@misc{yu2023language,
    title   = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
    year    = {2023},
    eprint  = {2310.05737},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

@article{Zhang2021TokenST,
    title   = {Token Shift Transformer for Video Classification},
    author  = {Hao Zhang and Y. Hao and Chong-Wah Ngo},
    journal = {Proceedings of the 29th ACM International Conference on Multimedia},
    year    = {2021}
}

@inproceedings{Arora2023ZoologyMA,
    title   = {Zoology: Measuring and Improving Recall in Efficient Language Models},
    author  = {Simran Arora and Sabri Eyuboglu and Aman Timalsina and Isys Johnson and Michael Poli and James Zou and Atri Rudra and Christopher R{\'e}},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:266149332}
}