Chang Liu, Rui Li, Kaidong Zhang, Yunwei Lan, Dong Liu
[Paper
] / [Project
] / [Models (Huggingface)
] / [DAVIS-Edit (HuggingFace)
] / [Models (wisemodel)
] / [DAVIS-Edit (wisemodel)
] / [Models (ModelScope)
] / [DAVIS-Edit (ModelScope)
]
StableV2V
StableV2V
(Command Lines)
StableV2V
StableV2V
StableV2V
(Gradio Demo)DAVIS-Edit
If you have any questions about this work, please feel free to start a new issue or propose a PR.
StableV2V
StableV2V
presents a novel paradigm to perform video editing in a shape-consistent manner, especially handling the editing scenarios when user prompts cause significant shape changes to the edited contents.
Besides, StableV2V
shows superior flexibility in handling a wide series of down-stream applications, considering various user prompts from different modalities.
<Back to Table of Contents>
DAVIS-Edit
to ModelScope.StableV2V
, with detailed illustrations presented in this section.DAVIS-Edit
to wisemodel.cn.DAVIS-Edit
to our HuggingFace dataset repo, and uploaded all the required model weights of StableV2V
to our HuggingFace model repo.StableV2V
DAVIS-Edit
to our HuggingFace repoStableV2V
to our HuggingFace repo<Back to Table of Contents>
StableV2V
├── LICENSE
├── README.md
├── assets
├── datasets <----- Code of datasets for training of the depth refinement network
├── models <----- Code of model definitions in different components
├── runners <----- Code of engines to run different components
├── inference.py <----- Script to inference StableV2V
├── train_completion_net.py <----- Script to train the shape-guided depth completion network
└── utils <----- Code of toolkit functions
<Back to Table of Contents>
We offer an one-click command line to install all the dependencies that the code requires.
First, create the virtual environment with conda
:
conda create -n stablev2v python=3.10
Then, you can execute the following lines to install the dependencies with pip
:
bash install_pip.sh
You can also install the dependencies with conda
, following the command line below:
bash install_conda.sh
Then, you are ready to go with conda activate stablev2v
.
Before you start the inference process, you need to prepare the model weights that StableV2V
requires.
Model | Component | Link |
---|---|---|
Paint-by-Example | PFE | Fantasy-Studio/Paint-by-Example |
InstructPix2Pix | PFE | timbrooks/instruct-pix2pix |
SD Inpaint | PFE | botp/stable-diffusion-v1-5-inpainting |
ControlNet + SD Inpaint | PFE | ControlNet models at lllyasviel
|
AnyDoor | PFE | xichenhku/AnyDoor |
RAFT | ISA | Google Drive |
MiDaS | ISA | Link |
U2-Net | ISA | Link |
Depth Refinement Network | ISA | Link |
SD v1.5 | CIG | stable-diffusion-v1-5/stable-diffusion-v1-5 |
ControlNet (depth) | CIG | lllyasviel/control_v11f1p_sd15_depth |
Ctrl-Adapter | CIG |
hanlincs/Ctrl-Adapter (i2vgenxl_depth ) |
I2VGen-XL | CIG | ali-vilab/i2vgen-xl |
Once you downloaded all the model weights, put them in the checkpoints
folder.
Note
If your network environment can get access to HuggingFace, you can directly use the HuggingFace repo ID to download the models. Otherwise we highly recommend you to prepare the model weights locally.
Specfically, make sure you modify the configuration file of AnyDoor
at models/anydoor/configs/anydoor.yaml
with the path of DINO-v2 pre-trained weights:
(at line 83)
cond_stage_config:
target: models.anydoor.ldm.modules.encoders.modules.FrozenDinoV2Encoder
weight: /path/to/dinov2_vitg14_pretrain.pth
<Back to Table of Contents>
StableV2V
(Command Lines)You may refer to the following command line to run StableV2V
:
python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --image-editor-type paint-by-example --image-editor-checkpoint-path /path/to/image/editor --source-video-frames examples/frames/bear --external-guidance examples/reference-images/raccoon.jpg --prompt "a raccoon" --outdir results
Argument | Default Setting | Required or Not | Explanation |
---|---|---|---|
Model arguments | - | - | - |
--image-editor-type |
- | Yes | Argument to define the image editor type. |
--image-editor-checkpoint-path |
- | Yes | Path of model weights for the image editor, required by PFE. |
--raft-checkpoint-path |
checkpoints/raft-things.pth |
Yes | Path of model weights for RAFT, required by ISA. |
--midas-checkpoint-path |
checkpoints/dpt_swin2_large_382.pt |
Yes | Path of model weights for MiDaS, required by ISA. |
--u2net-checkpoint-path |
checkpoints/u2net.pth |
Yes | Path of model weights for U2-Net, required by ISA to obtain the segmentation masks of video frames (will be replaced by SAM in near future) |
--stable-diffusion-checkpoint-path |
stable-diffusion-v1-5/stable-diffusion-v1-5 |
Yes | Path of model weights for SD v1.5, required by CIG. |
--controlnet-checkpoint-path |
lllyasviel/control_v11f1p_sd15_depth |
Yes | Path of model weights for ControlNet (depth) required by CIG. |
--ctrl-adapter-checkpoint-path |
hanlincs/Ctrl-Adapter |
Yes | Path of model weights for Ctrl-Adapter, required by CIG. |
--i2vgenxl-checkpoint-path |
ali-vilab/i2vgen-xl |
Yes | Path of model weights for I2VGen-XL, required by CIG. |
--completion-checkpoint-path |
checkpoints/depth-refinement/50000.ckpt |
Yes | Path of model weights for I2VGen-XL, required by CIG. |
Input Arguments | - | - | - |
--source-video-frames |
- | Yes | Path of input video frames. |
--prompt |
- | Yes | Text prompt of the edited video. |
--external-guidance |
- | Yes | External inputs for the image editors if you use Paint-by-Example , InstructPix2Pix , and AnyDoor . |
--outdir |
results |
Yes | Path of output directory. |
--edited-first-frame |
- | No | Path of customized first edited frame, where the image editor will not be used if this argument is configured. |
--input-condition |
- | No | Path of cusromzied depth maps. We directly extract depth maps from the source video frames with MiDaS if this argument is not configured |
--input-condition |
- | No | Path of cusromzied depth maps. We directly extract depth maps from the source video frames with MiDaS if this argument is not configured. |
--reference-masks |
- | No | Path of segmentation masks of the reference image, required by AnyDoor . We will automatically extract segmentation mask from the reference image if this argument is not configured. |
--image-guidance-scale |
1.0 | No | Hyper-parameter required by InstructPix2Pix. |
--kernel-size |
9 | No | Kernel size of the binary dilation operation, to make sure that the pasting processes cover the regions of edited contents. |
--dilation-iteration |
1 | No | Iteration for binary dilation operation. |
--guidance-scale |
9.0 | No | Classifier-free guidance scale. |
--mixed-precision |
bf16 | No | Precision of models in StableV2V. |
--n-sample-frames |
16 | No | Number of video frames of the edited video. |
--seed |
42 | No | Random seed. |
Note
Some specific points that you may pay additional attentions to while inferencing:
--image-editor-checkpoint-path
, the path will be automatically delievered to the corresponding editor according to your --image-editor-type
. So please do not be worried about some extra arguments in the codebase.Paint-by-Example
, InstructPix2Pix
, AnyDoor
, you are required to configure the --external-guidance
argument, which corresponds to reference image and user instruction accordingly.xformers
, which might cause artifacts in the produced results. Such issue might be fixed in the future if possible.StableV2V
To obtain the human-drawn sketches, you need to manually draw them on external devices such as a tablet, and then export the result for later uses.
Particularly, we obtain the hand-drawn sketches on the iPad application Sketchbook
. An example hand-drawn sketch might look like this:
ControlNet (scribble)
to Generate the First Edited FrameOnce you obtain the hand-drawn sketch, the next step is to get the first edited frame.
In doing so, we use ControlNet (scribble)
, where you need to prepare the model weights of ControlNet (scribble)
and SD Inpaint
in advance.
Suppose we put the previously hand-drawn sketches at inputs/hand-drawn-sketches
, you can execute the following command line by running ControlNet (scribble)
:
python scripts/inference_controlnet_inpaint.py --controlnet-checkpoint-path lllyasviel/control_v11p_sd15_scribble --stable-diffusion-checkpoint-path botp/stable-diffusion-v1-5-inpainting --prompt "an elephant" --input-mask inputs/masks/bear.png --controlnet-guidance inputs/hand-drawn-sketches/bear-elephant-sketch.png --outdir results/sketch-guided-result.png
The result might seem like:
Finally, you are ready to generate the entire edited video. We offer an example command line as follows:
python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --source-video-frames examples/frames/bear --edited-first-frame inputs/edited-first-frames/bear-elephant.png --prompt "an elephant walking on the rocks in a zoo" --outdir results
By configuring the --edited-first-frame
, the codebase will automatically skip the first-frame editing process, where we visualize the source video and the edited video below:
<Back to Table of Contents>
StableV2V
Before you inpaint the first video frame, we recommend you dilate the annotated segmentation mask (if any) using the following script:
python scripts/run_dilate_mask.py --input-folder inputs/masks/car-turn.png --output-folder inputs/dilated-masks --kernel-size 15 --iterations 1
The original (left) and dilated (right) masks might look like:
IOPaint
to Generated the First Inpainted FrameWe recommend you to use the library IOPaint
for convenient use. To install it, you can simply run:
pip install iopaint
Then, you are able to execute LaMa
through the library:
iopaint run --model=lama --image inputs/frames/car-turn/00000.jpg --mask inputs/dilated-masks/car-turn.png --output inputs/edited-first-frames/
The original and inpainted first frames might look like:
Finally, you are ready to generate the entire edited video. We offer an example command line as follows:
python inference.py --raft-checkpoint-path checkpoints/raft-things.pth --midas-checkpoint-path checkpoints/dpt_swin2_large_384.pt --u2net-checkpoint-path checkpoints/u2net.pth --stable-diffusion-checkpoint-path stable-diffusion-v1-5/stable-diffusion-v1-5 --controlnet-checkpoint-path lllyasviel/control_v11f1p_sd15_depth --i2vgenxl-checkpoint-path ali-vilab/i2vgen-xl --ctrl-adapter-checkpoint-path hanlincs/Ctrl-Adapter --completion-net-checkpoint-path checkpoints/depth-refinement/50000.ckpt --source-video-frames examples/frames/car-turn --edited-first-frame inputs/edited-first-frame/car-turn-inpainted.png --prompt "an elephant walking on the rocks in a zoo" --outdir results
By configuring the --edited-first-frame
, the codebase will automatically skip the first-frame editing process, where we visualize the source video and the edited video below:
<Back to Table of Contents>
StableV2V
(Gradio Demo)We also offer a gradio demo to try StableV2V
through interactive UI. Before you go, we recommend you to follow the instructions in this section to prepare all the required model weights locally (in the ./checkpoints
folder). Then, feel free to test it out by simply running:
python app.py
In the following figure, we illustrate the functions of different modules in our Gradio demo:
Please do not hesitate to start a new issue or propose a PR if have any further questions about the demo.
<Back to Table of Contents>
DAVIS-Edit
We construct DAVIS-Edit
following the same data structure as the one of DAVIS
, as is shown below:
DAVIS-Edit
├── Annotations <----- Official annotated masks of DAVIS
├── bear
├── blackswan
├── ...
└── train
├── JPEGImages <----- Official video frames of DAVIS
├── bear
├── blackswan
├── ...
└── train
├── ReferenceImages <----- Annotated reference images for image-based editing on DAVIS-Edit
├── bear.png
├── blackswan.png
├── ...
└── train.png
├── .gitattributes
├── README.md
├── edited_video_caption_dict_image.json <----- Annotated text descriptions for image-based editing on DAVIS-Edit
└── edited_video_caption_dict_text.json <----- Annotated text descriptions for text-based editing on DAVIS-Edit
Specifically, edited_video_caption_dict_image.json
and edited_video_caption_dict_text.json
are constructed as Python dictionary, with its keys as the names of video folders in JPEGImages
. For example in edited_video_caption_dict_text.json
:
{
"bear": {
"original": "a bear walking on rocks in a zoo",
"similar": "A panda walking on rocks in a zoo",
"changing": "A rabbit walking on rocks in a zoo"
},
# ...
}
The annotations of reference images contain two sub-folders, i.e., similar
and changing
, corresponding to the annotations for DAVIS-Edit-S
and DAVIS-Edit-C
, respectively, where the structure are constructed in the same folder name as that in JPEGImages
.
We highly recommend you to index different elements in DAVIS-Edit
through the annotation files. Particularly, you may refer to the script below:
import os
import json
from tqdm import tqdm
from PIL import Image
# TODO: Modify the configurations here to your local paths
frame_root = 'JPEGImages'
mask_root = 'Annotations'
reference_image_root = 'ReferenceImages/similar' # Or 'ReferenceImages/changing'
annotation_file_path = 'edited_video_caption_dict_text.json'
# Load the annotation file
with open(annotation_file_path, 'r') as f:
annotations = json.load(f)
# Iterate all data samples in DAVIS-Edit
for video_name in tqdm(annotations.keys()):
# Load text prompts
original_prompt = annotations[video_name]['original']
similar_prompt = annotations[video_name]['similar']
changing_prompt = annotations[video_name]['changing']
# Load reference images
reference_image = Image.open(os.path.join(reference_image_root, video_name + '.png'))
# Load video frames
video_frames = []
for path in sorted(os.listdir(os.path.join(frame_root, video_name))):
if path != 'Thumbs.db' and path != '.DS_store':
video_frames.append(Image.open(os.path.join(frame_root, path)))
# Load masks
masks = []
for path in sorted(os.listdir(os.path.join(mask_root, video_name))):
if path != 'Thumbs.db' and path != '.DS_store':
masks.append(Image.open(os.path.join(frame_root, path)))
# (add further operations that you expect in the lines below)
<Back to Table of Contents>
YouTube-VOS
DatasetWe use YouTube-VOS
to conduct the training process of our shape-guided depth refinement network.
Before you start the training process, you need to first download its source videos and annotations from this link.
Once downloaded, the data follows the structures below:
youtube-vos
├── JPEGImages <----- Path of source video frames
├── Annotations <----- Path of segmentation masks
└── meta.json <----- Annotation file for the segmentation masks
MiDaS
to Annotate the Depth MapsOnce the video frames are ready, the next step is to annotate their corresponding depth maps.
Specifically, make sure you download the MiDaS
model weights from this link.
Then, you can execute the following command lines with our automatic script:
python scripts/extract_youtube_vos_depths.py --midas-path checkpoints/dpt_swin2_large_384.pt --dataset-path data/youtube-vos/JPEGImages --outdir data/youtube-vos/DepthMaps
U2-Net
to Annotate the First-frame Shape MasksOur depth refinement network uses an additional network channel to take the first-frame shape mask as guidance, thus you need to annotate them for the YouTube-VOS
dataset.
First, make sure you download the U2-Net
model weights from this link.
Then, you execute the following command lines with our automatic script:
python scripts/extract_youtube_vos_shapes.py --video-root data/youtube-vos/JPEGImages --model-dir checkpoints/u2net.pth --outdir data/youtube-vos/FirstFrameMasks
Finally, you are ready to execute the training process with the following command line:
python train_completion_net.py --video-path data/youtube-vos/JPEGImages --shape-path data/youtube-vos/FirstFrameMasks --max-train-steps 50000 --outdir results/shape-guided-depth-refinement --checkpoint-freq 2000 --validation-freq 200
The trained model weights will be saved at results/checkpoints
, and the visualizations of intermediate results can be checked via tensorboard
, with the logs saved at results/tensorboard
.
<Back to Table of Contents>
Please refer to our project page for more results and comparisons performed by StableV2V
.
<Back to Table of Contents>
If you find this work helpful to your research, or use our testing benchmark DAVIS-Edit
, please cite our paper:
@misc{liu-etal-2024-stablev2v,
title={StableV2V: Stablizing Shape Consistency in Video-to-Video Editing},
author={Chang Liu and Rui Li and Kaidong Zhang and Yunwei Lan and Dong Liu},
year={2024},
eprint={2411.11045},
archivePrefix={arXiv},
primaryClass={cs.CV},
}
<Back to Table of Contents>
<Back to Table of Contents>
This repo is heavily modified based on Diffusers, Ctrl-Adapter, AnyDoor, and RAFT. We sincerely thanks the authors for their fantasitic implementations.
<Back to Table of Contents>