Lihe Yang1 · Bingyi Kang2† · Zilong Huang2
Zhen Zhao · Xiaogang Xu · Jiashi Feng2 · Hengshuang Zhao1*
1HKU 2TikTok
†project lead *corresponding author
This work presents Depth Anything V2. It significantly outperforms V1 in fine-grained details and robustness. Compared with SD-based models, it enjoys faster inference speed, fewer parameters, and higher depth accuracy.
We provide four models of varying scales for robust relative depth estimation:
Model | Params | Checkpoint |
---|---|---|
Depth-Anything-V2-Small | 24.8M | Download |
Depth-Anything-V2-Base | 97.5M | Download |
Depth-Anything-V2-Large | 335.3M | Download |
Depth-Anything-V2-Giant | 1.3B | Coming soon |
git clone https://github.com/DepthAnything/Depth-Anything-V2
cd Depth-Anything-V2
pip install -r requirements.txt
Download the checkpoints listed here and put them under the `checkpoints` directory.
import cv2
import torch
from depth_anything_v2.dpt import DepthAnythingV2
DEVICE = 'cuda' if torch.cuda.is_available() else 'mps' if torch.backends.mps.is_available() else 'cpu'
model_configs = {
    'vits': {'encoder': 'vits', 'features': 64, 'out_channels': [48, 96, 192, 384]},
    'vitb': {'encoder': 'vitb', 'features': 128, 'out_channels': [96, 192, 384, 768]},
    'vitl': {'encoder': 'vitl', 'features': 256, 'out_channels': [256, 512, 1024, 1024]},
    'vitg': {'encoder': 'vitg', 'features': 384, 'out_channels': [1536, 1536, 1536, 1536]}
}
encoder = 'vitl' # or 'vits', 'vitb', 'vitg'
model = DepthAnythingV2(**model_configs[encoder])
model.load_state_dict(torch.load(f'checkpoints/depth_anything_v2_{encoder}.pth', map_location='cpu'))
model = model.to(DEVICE).eval()
raw_img = cv2.imread('your/image/path')
depth = model.infer_image(raw_img) # HxW raw depth map in numpy
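The raw depth map returned by `infer_image` is a float array, so it usually has to be rescaled before it can be saved or viewed as an image. Below is a minimal sketch, assuming min-max normalization to 8 bits is acceptable for visualization. The helper name `depth_to_uint8` is ours and is not part of the repository, and the small array stands in for a real prediction.

```python
import numpy as np

def depth_to_uint8(depth: np.ndarray) -> np.ndarray:
    """Min-max normalize an HxW depth map to 0-255 (hypothetical helper)."""
    depth = depth.astype(np.float32)
    d_min, d_max = float(depth.min()), float(depth.max())
    if d_max - d_min < 1e-8:
        # Constant map: avoid division by zero and return all zeros.
        return np.zeros(depth.shape, dtype=np.uint8)
    scaled = (depth - d_min) / (d_max - d_min) * 255.0
    return scaled.astype(np.uint8)

# Stand-in for the output of model.infer_image(raw_img)
example = np.array([[0.0, 1.0], [2.0, 4.0]], dtype=np.float32)
vis = depth_to_uint8(example)
```

The resulting array can then be written to disk, for example with `cv2.imwrite('depth.png', vis)`.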
If you do not want to clone this repository, you can also load our models through Transformers. Below is a simple code snippet. Please refer to the official page for more details.
from transformers import pipeline
from PIL import Image
pipe = pipeline(task="depth-estimation", model="depth-anything/Depth-Anything-V2-Small-hf")
image = Image.open('your/image/path')
depth = pipe(image)["depth"]
python run.py \
  --encoder <vits | vitb | vitl | vitg> \
  --img-path <path> --outdir <outdir> \
  [--input-size <size>] [--pred-only] [--grayscale]
Options:
- `--img-path`: You can 1) point it to a directory containing all images of interest, 2) point it to a single image, or 3) point it to a text file listing all image paths.
- `--input-size` (optional): By default, we use an input size of `518` for model inference. You can increase the size for even more fine-grained results.
- `--pred-only` (optional): Only save the predicted depth map, without the raw image.
- `--grayscale` (optional): Save the grayscale depth map, without applying a color palette.

For example:
python run.py --encoder vitl --img-path assets/examples --outdir depth_vis
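The three forms accepted by `--img-path` can be told apart with a few filesystem checks. The sketch below shows one way to do it; `collect_image_paths` is a hypothetical helper, and the actual logic in `run.py` may differ in details such as file filtering or ordering.

```python
import glob
import os

def collect_image_paths(img_path: str) -> list:
    """Resolve an --img-path value into a list of files (hypothetical helper)."""
    if os.path.isfile(img_path):
        if img_path.endswith('.txt'):
            # A text file with one image path per line.
            with open(img_path, 'r') as f:
                return [line.strip() for line in f if line.strip()]
        # A single image.
        return [img_path]
    # A directory: collect every file beneath it, recursively.
    candidates = glob.glob(os.path.join(img_path, '**', '*'), recursive=True)
    return sorted(p for p in candidates if os.path.isfile(p))
```

Sorting the directory listing keeps the processing order deterministic across runs.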
python run_video.py \
  --encoder <vits | vitb | vitl | vitg> \
  --video-path assets/examples_video --outdir video_depth_vis \
  [--input-size <size>] [--pred-only] [--grayscale]
Our larger model has better temporal consistency on videos.
To use our gradio demo locally:
python app.py
You can also try our online demo.
Note: Compared to V1, we have made a minor modification to the DINOv2-DPT architecture (originating from this issue). In V1, we unintentionally used features from the last four layers of DINOv2 for decoding. In V2, we use intermediate features instead. Although this modification did not improve details or accuracy, we decided to follow this common practice.
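To illustrate the difference between the two choices, the sketch below compares the block indices selected by "last four layers" with those selected by evenly spaced intermediate layers. Both function names are hypothetical, the 12-block encoder is an assumed example, and the indices used by the released checkpoints may differ from this even spacing.

```python
def last_layers(depth: int, n: int = 4) -> list:
    """Indices of the final n transformer blocks (V1-style selection)."""
    return list(range(depth - n, depth))

def evenly_spaced_layers(depth: int, n: int = 4) -> list:
    """Indices of n blocks spread evenly through the encoder (illustrative)."""
    step = depth // n
    return [step * (i + 1) - 1 for i in range(n)]

print(last_layers(12))           # [8, 9, 10, 11]
print(evenly_spaced_layers(12))  # [2, 5, 8, 11]
```

With evenly spaced blocks, the decoder receives features from shallow as well as deep stages of the encoder rather than only the deepest ones.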
Please refer to metric depth estimation.
Please refer to DA-2K benchmark.
We sincerely appreciate all the community support for our Depth Anything series. Thank you very much!
We are sincerely grateful to the awesome Hugging Face team (@Pedro Cuenca, @Niels Rogge, @Merve Noyan, @Amy Roberts, et al.) for their huge efforts in supporting our models in Transformers and Apple Core ML.
We also thank the DINOv2 team for contributing such impressive models to our community.
The Depth-Anything-V2-Small model is released under the Apache-2.0 license. The Depth-Anything-V2-Base/Large/Giant models are released under the CC-BY-NC-4.0 license.
If you find this project useful, please consider citing:
@article{depth_anything_v2,
title={Depth Anything V2},
author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Zhao, Zhen and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
journal={arXiv:2406.09414},
year={2024}
}
@inproceedings{depth_anything_v1,
title={Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data},
author={Yang, Lihe and Kang, Bingyi and Huang, Zilong and Xu, Xiaogang and Feng, Jiashi and Zhao, Hengshuang},
booktitle={CVPR},
year={2024}
}