⚡️Pyramid Flow⚡️

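[Paper] [Project Page] [miniFLUX Model] [SD3 Model ⚡️] [Demo]

This is the official repository for Pyramid Flow, a training-efficient autoregressive video generation method based on flow matching. Trained only on open-source datasets, it generates high-quality 10-second videos at 768p resolution and 24 FPS, and naturally supports image-to-video generation.

10s, 768p, 24fps	5s, 768p, 24fps	Image-to-video
fireworks.mp4	trailer.mp4	sunday.mp4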

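News

2024.11.13 We release the 768p miniFLUX checkpoint (up to 10s).
We have switched the model structure from SD3 to a mini FLUX to fix human structure issues. Please try our 1024p image checkpoint, 384p video checkpoint (up to 5s) and 768p video checkpoint (up to 10s). The new miniFLUX model shows great improvement in human structure and motion stability.
2024.10.29 ⚡️⚡️⚡️ We release the training code for the VAE, the fine-tuning code for the DiT, and new model checkpoints with the FLUX structure trained from scratch.
2024.10.13 Multi-GPU inference and CPU offloading are supported. Inference now fits in less than 8GB of GPU memory and runs much faster on multiple GPUs.
2024.10.11 The Hugging Face demo is available. Thanks to @multimodalart for the commit!
2024.10.10 We release the technical report, project page and model checkpoint of Pyramid Flow.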

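Introduction

Existing video diffusion models operate at full resolution, spending a lot of computation on very noisy latents. By contrast, our method harnesses the flexibility of flow matching (Lipman et al., 2023; Liu et al., 2023; Albergo & Vanden-Eijnden, 2023) to interpolate between latents of different resolutions and noise levels, allowing visual content to be generated and decompressed simultaneously with better computational efficiency. The entire framework is optimized end to end with a single DiT (Peebles & Xie, 2023), and generates high-quality 10-second videos at 768p resolution and 24 FPS within 20.7k A100 GPU training hours.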
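As a toy illustration of interpolating between latents of different resolutions, the sketch below upsamples a coarse, noisier latent and moves along a straight line towards a finer, cleaner one, which is the linear path commonly used in flow matching. It is not the repository's training code: the 1-D lists, the nearest-neighbour upsampling, and all function names are assumptions chosen for brevity.

```python
def upsample_nearest(latent, factor=2):
    """Nearest-neighbour upsampling of a 1-D list of floats."""
    return [value for value in latent for _ in range(factor)]


def interpolate(start, end, t):
    """Point on the straight path from `start` (t=0) to `end` (t=1)."""
    return [(1.0 - t) * s + t * e for s, e in zip(start, end)]


def target_velocity(start, end):
    """Constant velocity along that straight path (the usual regression target)."""
    return [e - s for s, e in zip(start, end)]


low_res_noisy = [0.5, -1.0]              # coarse, noisier latent (toy values)
high_res_clean = [0.2, 0.4, -0.6, -0.8]  # finer, cleaner latent (toy values)

start = upsample_nearest(low_res_noisy)  # bring both ends to one resolution
midpoint = interpolate(start, high_res_clean, t=0.5)
velocity = target_velocity(start, high_res_clean)
print(midpoint, velocity)
```

Because the early, noisiest part of the path starts from a low-resolution latent, less computation is spent at full resolution, which is the efficiency argument made above.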

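Installation

We recommend setting up the environment with conda. The codebase currently uses Python 3.8.10 and PyTorch 2.1.2 (guide), and we are actively working to support a wider range of versions.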

git clone https://github.com/jy0205/Pyramid-Flow
cd Pyramid-Flow

# create env using conda
conda create -n pyramid python==3.8.10
conda activate pyramid
pip install -r requirements.txt

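Then, download the model from Hugging Face (there are two variants: miniFLUX or SD3). The miniFLUX models support 1024p image, 384p and 768p video generation, and the SD3-based models support 768p and 384p video generation. The 384p checkpoint generates 5-second videos at 24 FPS, while the 768p checkpoint generates videos of up to 10 seconds at 24 FPS.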

from huggingface_hub import snapshot_download

model_path = 'PATH'   # The local directory to save downloaded checkpoint
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')

Inference

1. Quick start with Gradio

To get started, first install Gradio, set your model path at #L36, and then run on your local machine:

python app.py

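The Gradio demo will open in a browser. Thanks to @tpc2233 for the commit; see #48 for details.

Alternatively, try it out on the Hugging Face Space created by @multimodalart. Due to GPU limits, this online demo can only generate 25 frames (exported at 8 FPS or 24 FPS). Duplicate the Space to generate longer videos.

Quick start on Google Colab

To quickly try Pyramid Flow on Google Colab, run the code below: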

# Setup
!git clone https://github.com/jy0205/Pyramid-Flow
%cd Pyramid-Flow
!pip install -r requirements.txt
!pip install gradio

# This code downloads miniFLUX
from huggingface_hub import snapshot_download

model_path = '/content/Pyramid-Flow'
snapshot_download("rain1011/pyramid-flow-miniflux", local_dir=model_path, local_dir_use_symlinks=False, repo_type='model')

# Start
!python app.py

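2. Inference code

To use our model, please follow the inference code in video_generation_demo.ipynb at this link. We strongly recommend trying the latest pyramid-miniflux, which shows great improvement in human structure and motion stability. To use it, set the parameter model_name to pyramid_flux. We further simplify the process into the following two steps. First, load the downloaded model: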

import torch
from PIL import Image
from pyramid_dit import PyramidDiTForVideoGeneration
from diffusers.utils import load_image, export_to_video

torch.cuda.set_device(0)
model_dtype, torch_dtype = 'bf16', torch.bfloat16   # Use bf16 (fp16 is not supported yet)

model = PyramidDiTForVideoGeneration(
    'PATH',                                         # The downloaded checkpoint dir
    model_dtype,                                    # Positional arguments must come before keyword arguments
    model_name="pyramid_flux",
    model_variant='diffusion_transformer_768p',
)

model.vae.enable_tiling()
# model.vae.to("cuda")
# model.dit.to("cuda")
# model.text_encoder.to("cuda")

# If you are not using sequential offloading below, uncomment the lines above ^
model.enable_sequential_cpu_offload()

Then, you can try text-to-video generation on your own prompts. Note that the 384p version only supports 5s for now (set temp up to 16)!

prompt = "A movie trailer featuring the adventures of the 30 year old space man wearing a red wool knitted motorcycle helmet, blue sky, salt desert, cinematic style, shot on 35mm film, vivid colors"

# used for 384p model variant
# width = 640
# height = 384

# used for 768p model variant
width = 1280
height = 768

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate(
        prompt=prompt,
        num_inference_steps=[20, 20, 20],
        video_num_inference_steps=[10, 10, 10],
        height=height,
        width=width,
        temp=16,                    # temp=16: 5s, temp=31: 10s
        guidance_scale=7.0,         # The guidance for the first frame, set it to 7 for 384p variant
        video_guidance_scale=5.0,   # The guidance for the other video latent
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./text_to_video_sample.mp4", fps=24)
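The temp comment in the call above gives two reference points: temp=16 for about 5 seconds and temp=31 for about 10 seconds at 24 FPS. As a rough sketch, both are consistent with the first temporal latent producing 1 frame and each later latent producing 8 frames. That formula is an assumption inferred from the numbers quoted here, not something this README states, and frames_for_temp and duration_seconds are hypothetical helper names.

```python
def frames_for_temp(temp: int, frames_per_latent: int = 8) -> int:
    """Estimated number of output frames for a given `temp`.

    Assumption: the first latent yields 1 frame and every later latent
    yields `frames_per_latent` frames. This is inferred, not documented.
    """
    if temp < 1:
        raise ValueError("temp must be a positive integer")
    return (temp - 1) * frames_per_latent + 1


def duration_seconds(temp: int, fps: int = 24) -> float:
    """Estimated clip length in seconds when exported at `fps`."""
    return frames_for_temp(temp) / fps


for temp in (16, 31):
    print(temp, frames_for_temp(temp), round(duration_seconds(temp), 2))
```

Under this assumption, temp=16 gives 121 frames and temp=31 gives 241 frames, which is just over 5 and 10 seconds at 24 FPS.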

As an autoregressive model, our model also supports (text-conditioned) image-to-video generation:

# used for 384p model variant
# width = 640
# height = 384

# used for 768p model variant
width = 1280
height = 768

image = Image.open('assets/the_great_wall.jpg').convert("RGB").resize((width, height))
prompt = "FPV flying over the Great Wall"

with torch.no_grad(), torch.cuda.amp.autocast(enabled=True, dtype=torch_dtype):
    frames = model.generate_i2v(
        prompt=prompt,
        input_image=image,
        num_inference_steps=[10, 10, 10],
        temp=16,
        video_guidance_scale=4.0,
        output_type="pil",
        save_memory=True,           # If you have enough GPU memory, set it to `False` to improve vae decoding speed
    )

export_to_video(frames, "./image_to_video_sample.mp4", fps=24)

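CPU offloading

We also support two types of CPU offloading to reduce GPU memory requirements. Note that both trade speed for memory.

Adding a cpu_offloading=True parameter to the generate function allows inference with less than 12GB of GPU memory. This feature was contributed by @Ednaordinary; see #23 for details.
Calling model.enable_sequential_cpu_offload() before the above procedure allows inference with less than 8GB of GPU memory. This feature was contributed by @rodjjo; see #75 for details.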
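The two memory figures above can be turned into a simple rule of thumb for which offloading mode to try first. The sketch below is a hypothetical helper, not part of the repository. It treats the quoted 12GB and 8GB figures as upper bounds on peak usage, and it does not say when offloading can be skipped altogether, because the memory footprint without offloading is not given here.

```python
from typing import Optional

# Peak-memory figures quoted in the list above, in GB.
CPU_OFFLOADING_GB = 12      # generate(..., cpu_offloading=True)
SEQUENTIAL_OFFLOAD_GB = 8   # model.enable_sequential_cpu_offload()


def pick_offload_mode(gpu_memory_gb: float) -> Optional[str]:
    """Return the lower-overhead offloading mode that is expected to fit.

    Hypothetical helper: the thresholds are upper bounds taken from the
    text, so this is a rule of thumb, not a guarantee.
    """
    if gpu_memory_gb >= CPU_OFFLOADING_GB:
        return "cpu_offloading"
    if gpu_memory_gb >= SEQUENTIAL_OFFLOAD_GB:
        return "sequential_cpu_offload"
    return None  # even sequential offloading may not fit


for gb in (24, 10, 6):
    print(gb, pick_offload_mode(gb))
```

On a machine with a GPU, the available memory could be read with torch.cuda.mem_get_info() and converted to GB before calling the helper.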

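MPS backend

Thanks to @niw, Apple Silicon users (e.g. a MacBook Pro with M2 and 24GB) can also try our model using the MPS backend! See #113 for details.

3. Multi-GPU inference

For users with multiple GPUs, we provide an inference script that uses sequence parallelism to save memory on each GPU. This also brings a big speedup: generating a 5s, 768p, 24fps video takes only 2.5 minutes on 4 A100 GPUs (vs. 5.5 minutes on a single A100 GPU). Run it on 2 GPUs with the following command: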

CUDA_VISIBLE_DEVICES=0,1 sh scripts/inference_multigpu.sh

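It currently supports 2 or 4 GPUs (for the SD3 version), with more configurations available in the original script. You can also launch a multi-GPU Gradio demo created by @tpc2233; see #59 for details.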
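For reference, the timings quoted for this script (5.5 minutes on one A100 versus 2.5 minutes on four) can be converted into a speedup and a parallel efficiency. This is plain arithmetic on the two figures given in this section, not a new measurement.

```python
single_gpu_minutes = 5.5   # one A100, from the text
multi_gpu_minutes = 2.5    # four A100s, from the text
num_gpus = 4

speedup = single_gpu_minutes / multi_gpu_minutes
efficiency = speedup / num_gpus

print(f"speedup: {speedup:.1f}x, parallel efficiency: {efficiency:.0%}")
```

This works out to a 2.2x speedup, or 55% parallel efficiency, on top of the per-GPU memory savings.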

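Spoiler: We didn't even use sequence parallelism in training, thanks to our efficient pyramid flow design.

4. Usage tips

The guidance_scale parameter controls the visual quality. For text-to-video generation, we suggest a guidance within [7, 9] for the 768p checkpoint and 7 for the 384p checkpoint.
The video_guidance_scale parameter controls the motion. A larger value increases the dynamic degree and mitigates the degradation of autoregressive generation, while a smaller value stabilizes the video.
For 10-second video generation, we recommend a guidance scale of 7 and a video guidance scale of 5.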
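The guidance_scale ranges from the tips above can be encoded as a small lookup that clamps a requested value into the suggested range. RECOMMENDED_GUIDANCE and clamp_guidance are hypothetical names used only for illustration. The ranges come from the tips, and values outside them are not forbidden by the model.

```python
# Suggested text-to-video guidance_scale ranges, taken from the tips above.
RECOMMENDED_GUIDANCE = {
    "768p": (7.0, 9.0),
    "384p": (7.0, 7.0),
}


def clamp_guidance(resolution: str, guidance_scale: float) -> float:
    """Clamp `guidance_scale` into the suggested range for `resolution`."""
    if resolution not in RECOMMENDED_GUIDANCE:
        raise ValueError("resolution must be '384p' or '768p'")
    low, high = RECOMMENDED_GUIDANCE[resolution]
    return min(max(guidance_scale, low), high)


print(clamp_guidance("768p", 12.0))
print(clamp_guidance("384p", 5.0))
```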

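Training

1. Training the VAE

Training the VAE requires at least 8 A100 GPUs. Please refer to this document. This is a MAGVIT-v2 like continuous 3D VAE, which should be quite flexible. Feel free to build your own video generative model on this part of the VAE training code.

2. Fine-tuning the DiT

Fine-tuning the DiT requires at least 8 A100 GPUs. Please refer to this document. We provide instructions for both the autoregressive and non-autoregressive versions of Pyramid Flow. The former is more research oriented, and the latter is more stable (but less efficient without the temporal pyramid).

Gallery

The following video examples are generated at 5s, 768p, 24fps. For more results, please visit our project page.

tokyo.mp4	eiffel.mp4
waves.mp4	rail.mp4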

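Comparison

On VBench (Huang et al., 2024), our method surpasses all the compared open-source baselines. Even with only public video data, it achieves performance comparable to commercial models such as Kling (Kuaishou, 2024) and Gen-3 Alpha (Runway, 2024), especially in the quality score (84.74 vs. 84.11 for Gen-3) and motion smoothness.

Figure: VBench results

We conducted an additional user study with 20+ participants. As can be seen, our method is preferred over open-source models such as Open-Sora and CogVideoX-2B, especially in terms of motion smoothness.

Figure: user study results

Acknowledgement

We are grateful for the following awesome projects when implementing Pyramid Flow:

SD3 Medium and Flux 1.0: State-of-the-art image generation models based on flow matching.
Diffusion Forcing and GameNGen: Next-token prediction meets full-sequence diffusion.
WebVid-10M, OpenVid-1M and Open-Sora Plan: Large-scale datasets for text-to-video generation.
CogVideoX: An open-source text-to-video generation model that shares many training details.
Video-LLaMA2: An open-source video LLM used for our video recaptioning.

Citation

If it helps your research, please consider giving this repository a star and citing Pyramid Flow in your publications.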

@article{jin2024pyramidal,
  title={Pyramidal Flow Matching for Efficient Video Generative Modeling},
  author={Jin, Yang and Sun, Zhicheng and Li, Ningyuan and Xu, Kun and Xu, Kun and Jiang, Hao and Zhuang, Nan and Huang, Quzhe and Song, Yang and Mu, Yadong and Lin, Zhouchen},
  journal={arXiv preprint arXiv:2410.05954},
  year={2024}
}
