This is the official implementation of the paper CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion. Based on a diffusion model, we propose a method to generate an entire 3D scene from a scene graph, encompassing its layout and 3D geometries holistically.
Website | Arxiv
Guangyao Zhai *, Evin Pınar Örnek *, Shun-Cheng Wu, Yan Di, Federico Tombari, Nassir Navab, and Benjamin Busam. (*Equal contribution.)
NeurIPS 2023
Download the code and go to the folder.
git clone https://github.com/ymxlzgy/commonscenes
cd commonscenes
We have tested it on Ubuntu 20.04 with Python 3.8, PyTorch 1.11.0, CUDA 11.3, and PyTorch3D.
conda create -n commonscenes python=3.8
conda activate commonscenes
pip install -r requirements.txt
pip install einops omegaconf tensorboardx open3d
To install CLIP, follow this OpenAI CLIP repo:
pip install ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
Set up the additional Chamfer Distance calculation for evaluation:
cd ./extension
python setup.py install
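Once the installs finish, a quick sanity check can confirm that the packages are visible to the interpreter. The sketch below is not part of the repository: `installed_versions` is a hypothetical helper, and the distribution names are assumed to match the pip names used above (with `torch` standing for PyTorch).

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_versions(names: Iterable[str]) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    versions: Dict[str, Optional[str]] = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Assumed distribution names; adjust them if your environment differs.
    wanted = ["torch", "pytorch3d", "einops", "omegaconf", "tensorboardx", "open3d"]
    for name, version in installed_versions(wanted).items():
        print(f"{name}: {version or 'NOT FOUND'}")
```

Comparing the printed versions against the tested setup (Python 3.8, PyTorch 1.11.0, CUDA 11.3) may help narrow down installation problems.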
Download the 3D-FRONT dataset from their official site.
Preprocess the dataset following ATISS.
Download 3D-FUTURE-SDF. We processed it ourselves from the 3D-FUTURE meshes using the tools in SDFusion.
Follow this page for downloading SG-FRONT and accessing more information.
Create a folder named FRONT, and copy all files to it.
The structure should look similar to this:
FRONT
|--3D-FRONT
|--3D-FRONT_preprocessed (by ATISS)
|--threed_front.pkl (by ATISS)
|--3D-FRONT-texture
|--3D-FUTURE-model
|--3D-FUTURE-scene
|--3D-FUTURE-SDF
|--All SG-FRONT files (.json and .txt)
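Before training, it may be worth verifying that the folder matches this layout. The following is a minimal sketch, not a script shipped with the repository: `missing_front_entries` is a hypothetical helper, it assumes all entries sit directly under FRONT, and because the SG-FRONT file names are not listed here it only checks that some `.json` and `.txt` files are present.

```python
from pathlib import Path
from typing import List

# Entries taken from the tree above (assumed to sit directly under FRONT).
EXPECTED_ENTRIES = [
    "3D-FRONT",
    "3D-FRONT_preprocessed",
    "threed_front.pkl",
    "3D-FRONT-texture",
    "3D-FUTURE-model",
    "3D-FUTURE-scene",
    "3D-FUTURE-SDF",
]


def missing_front_entries(root: str) -> List[str]:
    """Return the expected entries that are absent under the FRONT folder."""
    base = Path(root)
    missing = [name for name in EXPECTED_ENTRIES if not (base / name).exists()]
    # SG-FRONT file names are unknown here, so only look for the extensions.
    for suffix in (".json", ".txt"):
        if not any(base.glob(f"*{suffix}")):
            missing.append(f"SG-FRONT *{suffix} files")
    return missing
```

Calling `missing_front_entries("/path/to/FRONT")` returns an empty list when every entry is found, and otherwise names what still needs to be copied.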
Essential: Download the pretrained VQ-VAE model from here to the folder scripts/checkpoint.
Optional: We provide two trained CommonScenes models, available here.
To train the models, run:
cd scripts
python train_3dfront.py --exp /media/ymxlzgy/Data/graphto3d_models/balancing/all --room_type livingroom --dataset /path/to/FRONT --residual True --network_type v2_full --with_SDF True --with_CLIP True --batchSize 4 --workers 4 --loadmodel False --nepoch 10000 --large False
--room_type: rooms to train on, e.g., livingroom, diningroom, bedroom, and all. We train all rooms together in the implementation.
--network_type: the network to be trained. v1_box is Graph-to-Box, v1_full is Graph-to-3D (DeepSDF version), v2_box is the layout branch of CommonScenes, and v2_full is CommonScenes. (Note: If you want to train v1_full, the additional meshes and codes reconstructed by DeepSDF should also be downloaded from here and copied to FRONT.)
--with_SDF: set to True when training v2_full.
--with_CLIP: set to True when training v2_box or v2_full; not used in other cases.
--batch_size: the batch size for the layout branch training (written as --batchSize in the command above). (Note: the batch size for the shape branch is set in v2_full.yaml and v2_full_concat.yaml. The meaning of each batch size can be found in Supplementary Material G.1.)
--large: default is False; True means more concrete categories.
We provide three examples here: Graph-to-3D (DeepSDF version), Graph-to-Box, and CommonScenes. The recommended GPU is a single A100 for CommonScenes, though a 3090 can also train the network with a lower batch size on the shape branch.
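The relationship between the network type and the SDF/CLIP flags can be written down as a small helper. This is only a sketch under stated assumptions: `branch_flags` and `build_train_command` are hypothetical names, not functions from the repository; the flag spellings and remaining values are copied from the training command above; and treating every case not mentioned in the flag notes as False is an assumption.

```python
from typing import Dict, List

NETWORK_TYPES = ("v1_box", "v1_full", "v2_box", "v2_full")


def branch_flags(network_type: str) -> Dict[str, bool]:
    """Derive --with_SDF / --with_CLIP from the network type."""
    if network_type not in NETWORK_TYPES:
        raise ValueError(f"unknown network type: {network_type}")
    return {
        # True only for v2_full; False elsewhere is an assumption.
        "with_SDF": network_type == "v2_full",
        # True for v2_box and v2_full; not used in other cases.
        "with_CLIP": network_type in ("v2_box", "v2_full"),
    }


def build_train_command(exp: str, dataset: str, network_type: str = "v2_full",
                        room_type: str = "all", batch_size: int = 4) -> List[str]:
    """Assemble the argument list for train_3dfront.py without running it."""
    flags = branch_flags(network_type)
    return [
        "python", "train_3dfront.py",
        "--exp", exp,
        "--room_type", room_type,
        "--dataset", dataset,
        "--residual", "True",
        "--network_type", network_type,
        "--with_SDF", str(flags["with_SDF"]),
        "--with_CLIP", str(flags["with_CLIP"]),
        "--batchSize", str(batch_size),
        "--workers", "4",
        "--loadmodel", "False",
        "--nepoch", "10000",
        "--large", "False",
    ]
```

The resulting list could be passed to `subprocess.run(cmd, cwd="scripts", check=True)`; building it as a list rather than a string avoids shell quoting issues with paths.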
To evaluate the models, run:
cd scripts
python eval_3dfront.py --exp /media/ymxlzgy/Data/graphto3d_models/balancing/all --epoch 180 --visualize False --evaluate_diversity False --num_samples 5 --gen_shape False --no_stool True
--exp: where you store the models.
--gen_shape: set to True to enable the diffusion-based shape branch.
--evaluate_diversity: set to True to compute diversity. This takes a while, so it's disabled by default.
--num_samples: the number of experiment rounds when evaluating diversity.
This metric aims to evaluate scene-level fidelity. To evaluate FID/KID, you first need to download the top-down GT rendered images for retrieval methods and the SDF rendered images for generative methods, or collect renderings by modifying and running collect_gt_sdf_images.py. Note that the flag without_lamp is set to True in our experiment.
Make sure you have downloaded all the files and preprocessed 3D-FRONT. The renderings of generated scenes can be obtained inside eval_3dfront.py.
After obtaining both the ground-truth images and the renderings of the generated scenes, run compute_fid_scores_3dfront.py.
This metric aims to evaluate object-level fidelity. Please follow the implementation in PointFlow. To evaluate this, you need to store the generated scenes object by object, which can be done in eval_3dfront.py.
After obtaining the object meshes, run compute_mmd_cov_1nn.py to get the results.
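For intuition about what these object-level metrics measure, here is a minimal pure-Python sketch of MMD and COV, assuming Chamfer distance (averaged squared nearest-neighbor distance in both directions) as the point-cloud distance. It is not the code in compute_mmd_cov_1nn.py or PointFlow, it omits 1-NN accuracy, and all function names are hypothetical; it is far too slow for real point clouds and is meant only to illustrate the definitions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Cloud = Sequence[Point]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(x: Cloud, y: Cloud) -> float:
    """Symmetric Chamfer distance (mean-based variant, an assumption)."""
    x_to_y = sum(min(_sq_dist(p, q) for q in y) for p in x) / len(x)
    y_to_x = sum(min(_sq_dist(q, p) for p in x) for q in y) / len(y)
    return x_to_y + y_to_x


def mmd_and_cov(generated: List[Cloud], reference: List[Cloud]) -> Tuple[float, float]:
    """Return (MMD, COV) of a generated set against a reference set."""
    # dist[i][j] is the distance from generated shape i to reference shape j.
    dist = [[chamfer_distance(g, r) for r in reference] for g in generated]
    # MMD: for each reference shape, distance to its closest generated shape.
    mmd = sum(min(row[j] for row in dist) for j in range(len(reference))) / len(reference)
    # COV: fraction of reference shapes that are the nearest match of some generated shape.
    matched = {min(range(len(reference)), key=lambda j: row[j]) for row in dist}
    cov = len(matched) / len(reference)
    return mmd, cov
```

Lower MMD suggests generated objects lie close to the reference shapes, while higher COV suggests the generated set covers more of the reference set.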
If you find this work useful in your research, please cite
@inproceedings{
zhai2023commonscenes,
title={CommonScenes: Generating Commonsense 3D Indoor Scenes with Scene Graph Diffusion},
author={Zhai, Guangyao and {\"O}rnek, Evin P{\i}nar and Wu, Shun-Cheng and Di, Yan and Tombari, Federico and Navab, Nassir and Busam, Benjamin},
booktitle={Thirty-seventh Conference on Neural Information Processing Systems},
year={2023},
url={https://openreview.net/forum?id=1SF2tiopYJ}
}
This repository is based on Graph-to-3D and SDFusion. We thank the authors for making their code available.
Tired students finished the pipeline in busy days...