This code builds upon a pre-existing Image to BEV deep learning model, based on the paper Translating Images Into Maps. The code was written using Python 3.7, and the model was trained on the nuScenes dataset. Please refer to the repository's ReadMe for dependencies and datasets to install.
The first step is to create a folder named "translating-images-into-maps-main" and download all files into it. Then, due to large file size, the latest checkpoints of our training and the mini nuScenes dataset used for validation can be downloaded from this Google Drive. These folders should be added directly in the "translating-images-into-maps-main" directory.
Below is the list of required libraries for this repo:
opencv
numpy
pyquaternion
shapely
lmdb
nuscenes-devkit
pillow
matplotlib
torchvision
descartes
scipy
tensorboard
scikit-image
cv2
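Before running anything, it can help to check which of these libraries are importable in the current environment. The snippet below is a minimal sketch and not part of the repository: the helper name missing_modules is hypothetical, and the import names (cv2 for opencv, PIL for pillow, skimage for scikit-image, nuscenes for nuscenes-devkit) are the usual ones for these packages.

```python
import importlib.util

# Package name -> import name for the libraries listed above.
REQUIRED_MODULES = {
    "opencv": "cv2",
    "numpy": "numpy",
    "pyquaternion": "pyquaternion",
    "shapely": "shapely",
    "lmdb": "lmdb",
    "nuscenes-devkit": "nuscenes",
    "pillow": "PIL",
    "matplotlib": "matplotlib",
    "torchvision": "torchvision",
    "descartes": "descartes",
    "scipy": "scipy",
    "tensorboard": "tensorboard",
    "scikit-image": "skimage",
}


def missing_modules(modules):
    """Return the package names whose top-level module cannot be found."""
    return [pkg for pkg, mod in modules.items()
            if importlib.util.find_spec(mod) is None]


missing = missing_modules(REQUIRED_MODULES)
print("Missing libraries:", ", ".join(missing) if missing else "none")
```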
To use the functions of this repository, the following command line arguments may need to be changed:
--name: name of the experiment
--video-name: name of the video file within the video root and without extension
--savedir: directory to save experiments to
--val-interval: number of epochs between validation runs
--root: directory of the repository
--video-root: absolute directory to the video input
--nusc-version: nuscenes version (either “v1.0-mini” or “v1.0-trainval” for the full US dataset)
--train-split: training split (either “train_mini” or “train_roddick” for the full US dataset)
--val-split: validation split (either “val_mini” or “val_roddick” for the full US dataset)
--data-size: percentage of dataset to train on
--epochs: number of epochs to train for
--batch-size: batch size
--cuda-available: environment used (0 for cpu, 1 for cuda)
--iou: iou metric used (0 for iou, 1 for diou)
As for training the model, these command line arguments can be modified:
--optimizer: optimizer for gradient descent to run during training. Default: adam
--lr: learning rate. Default: 5e-5
--momentum: momentum for Stochastic gradient descent. Default: 0.9
--weight-decay: weight decay. Default: 1e-4
--lr-decay: learning rate decay. Default: 0.99
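As an illustration of how these training options and their defaults could be declared, here is a minimal argparse sketch. It is not the repository's actual parser: the argument types and the default of 0 for --iou are assumptions, while the other defaults are the ones listed above.

```python
import argparse


def build_parser():
    """Build a parser for the training options documented above (sketch)."""
    parser = argparse.ArgumentParser(description="Training options (sketch)")
    parser.add_argument("--optimizer", type=str, default="adam")
    parser.add_argument("--lr", type=float, default=5e-5)
    parser.add_argument("--momentum", type=float, default=0.9)
    parser.add_argument("--weight-decay", type=float, default=1e-4)
    parser.add_argument("--lr-decay", type=float, default=0.99)
    # Documented above as 0 for IoU and 1 for DIoU; the default is assumed.
    parser.add_argument("--iou", type=int, choices=[0, 1], default=0)
    return parser


# Override the learning rate and select DIoU, keeping every other default.
args = build_parser().parse_args(["--lr", "1e-4", "--iou", "1"])
print(args.optimizer, args.lr, args.weight_decay, args.iou)
```

Note that argparse exposes dashed flags with underscores, so --weight-decay is read as args.weight_decay.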
The NuScenes Mini and Full datasets can be found at the following locations:
NuScenes Mini:
NuScenes Full US:
As the NuScenes mini and full datasets do not have the same image input format (lmdb or png), some modifications need to be applied to the code to use one or the other:
Change the mini argument to false to use the mini dataset, as well as the args paths and splits, in the train.py, validation.py and inference.py files:
data = nuScenesMaps(
    root=args.root,
    split=args.val_split,
    grid_size=args.grid_size,
    grid_res=args.grid_res,
    classes=args.load_classes_nusc,
    dataset_size=args.data_size,
    desired_image_size=args.desired_image_size,
    mini=True,
    gt_out_size=(200, 200),
)
loader = DataLoader(
    data,
    batch_size=args.batch_size,
    shuffle=False,
    num_workers=0,
    collate_fn=src.data.collate_funcs.collate_nusc_s,
    drop_last=True,
    pin_memory=True
)
In the data_loader.py function:
# if mini:
image_input_key = pickle.dumps(id,protocol=3)
with self.images_db.begin() as txn:
    value = txn.get(key=image_input_key)
    image = Image.open(io.BytesIO(value)).convert(mode='RGB')
# else:
#     original_nusenes_dir = "/work/scitas-share/datasets/Vita/civil-459/NuScenes_full/US/samples/CAM_FRONT"
#     new_cam_path = os.path.join(original_nusenes_dir, Path(cam_path).name)
#     image = Image.open(new_cam_path).convert(mode='RGB')
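Rather than commenting lines in and out, the two branches above could be selected with a flag. The following is only a sketch under assumptions: the function names and the mini, images_db and samples_dir parameters are hypothetical, and images_db is assumed to be an opened lmdb environment as in the snippet above.

```python
import io
import os
import pickle
from pathlib import Path


def lmdb_image_key(sample_id):
    """Key under which an image is stored in the lmdb database."""
    return pickle.dumps(sample_id, protocol=3)


def full_dataset_image_path(samples_dir, cam_path):
    """Path of the image file inside the full dataset's CAM_FRONT folder."""
    return os.path.join(samples_dir, Path(cam_path).name)


def load_image(sample_id, cam_path, mini, images_db=None, samples_dir=None):
    """Load an RGB image from lmdb (mini=True) or from an image file (mini=False)."""
    from PIL import Image  # imported lazily so the helpers above work without pillow

    if mini:
        # Read the encoded image bytes from the lmdb database.
        with images_db.begin() as txn:
            value = txn.get(key=lmdb_image_key(sample_id))
        return Image.open(io.BytesIO(value)).convert(mode="RGB")
    # Otherwise open the image file of the full dataset directly.
    path = full_dataset_image_path(samples_dir, cam_path)
    return Image.open(path).convert(mode="RGB")
```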
The pretrained checkpoints can be found here:
The checkpoints need to be kept within /pretrained_models/27_04_23_11_08 from the root directory of this repository. Should you want to load them from another directory, please change the following arguments:
--savedir="pretrained_models" # Careful, this path is relative in validation.py but global in train.py
--name="27_04_23_11_08"
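To make the relative versus absolute distinction concrete, here is a small sketch of how a checkpoint directory could be resolved from these two arguments. The helper checkpoint_dir is hypothetical and is not how train.py or validation.py necessarily build the path.

```python
import os


def checkpoint_dir(root, savedir, name):
    """Resolve the checkpoint folder from the repository root, --savedir and --name."""
    # os.path.join discards `root` when `savedir` is already absolute,
    # so both relative and absolute --savedir values are handled.
    return os.path.join(root, savedir, name)


print(checkpoint_dir(".", "pretrained_models", "27_04_23_11_08"))
```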
To train on scitas, you need to launch the following script from the root directory:
sbatch job.script.sh
To train locally on cpu:
python3 train.py
Make sure to adapt the script with your command line args.
To validate a model performance on scitas:
sbatch job.validate.sh
To validate locally on cpu:
python3 validate.py
Make sure to adapt the script with your command line args.
To infer on a video on scitas:
sbatch job.evaluate.sh
To infer locally on cpu:
python3 inference.py
Make sure to adapt the script with your command line args, especially:
--batch-size // 1 for the test videos
--video-name
--video-root
This project was made in the context of the Deep Learning for Autonomous Vehicles course CIVIL-459, taught by Professor Alexandre Alahi at EPFL. We were supervised by doctoral student Yuejiang Liu. The main goal of the course's project is to develop a deep learning model that can be used onboard a Tesla autopilot system. Our group looked into the transformation from monocular camera images to bird's eye view. This can be done by using semantic segmentation to classify elements such as cars, sidewalks, pedestrians and the horizon.
During our research on monocular image to BEV deep learning models, we noticed that information concerning pedestrians was lost during segmentation, resulting in poor classification. As seen in the image below, when evaluated, the model we selected reaches a mean IoU (Intersection over Union) of 25.7% over 14 classes of objects on the nuScenes dataset. The prediction accuracy is good for drivable areas (74.5%) but quite poor for bikes, barriers and trailers. Above all, the prediction accuracy for pedestrians (9.5%) is far too low. Such a low accuracy could cause accidents if someone were to cross the road away from a pedestrian crossing.
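For readers unfamiliar with the metric, the sketch below shows how IoU can be computed between a predicted and a ground-truth binary mask, and how per-class values are averaged into a mean IoU. It is an illustration only: the function names are hypothetical, the masks are toy nested lists, and returning 0 for an empty union is an assumed convention.

```python
def mask_iou(pred, target):
    """IoU between two binary masks given as equally sized nested lists."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += bool(p) and bool(t)  # cell set in both masks
            union += bool(p) or bool(t)   # cell set in at least one mask
    # Assumed convention: IoU is 0 when both masks are empty.
    return inter / union if union else 0.0


def mean_iou(per_class_iou):
    """Average the per-class IoU values stored in a dictionary."""
    return sum(per_class_iou.values()) / len(per_class_iou)


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
print(mask_iou(pred, target))  # 2 shared cells out of 4 occupied cells
```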
More information about our research can be found on the Drive.
As the poor detection of pedestrians seemed to be the most immediate issue with the current trained model, we aimed to improve the accuracy by looking into better suited loss functions, and training the new model on the nuScenes dataset.
The model we built upon was trained using an
Another issue with
The
It uses L2 norm to minimize the distance between predicted and target boxes, and converges much faster than
[Figure panels: Horizontal Stretch, Vertical Stretch]
Moreover, DIoU loss introduces a regularization term that encourages smooth convergence.
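As defined in [1], DIoU subtracts from the IoU a penalty equal to the squared distance between the two box centers divided by the squared diagonal of the smallest box enclosing both, and the loss is 1 - DIoU. The sketch below illustrates this for two axis-aligned boxes given as (x1, y1, x2, y2). It is a plain Python illustration with hypothetical function names, not the repository's bbox_overlaps_diou.

```python
def diou(box_a, box_b):
    """Distance-IoU between two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Plain IoU: intersection area over union area.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2
    penalty = rho2 / c2 if c2 > 0 else 0.0
    return iou - penalty


def diou_loss(box_a, box_b):
    """DIoU loss, which is 0 for identical boxes."""
    return 1.0 - diou(box_a, box_b)


# Two boxes that do not overlap: IoU is 0, but the loss still depends on distance.
print(diou_loss((0, 0, 1, 1), (2, 0, 3, 1)))
print(diou_loss((0, 0, 1, 1), (5, 0, 6, 1)))
```

In this example the second pair of boxes is farther apart and gets a larger loss, whereas a plain IoU loss would give the same value for both pairs.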
As can be seen in the following image, the
After the research phase, we implemented the bbox_overlaps_diou function in the /src/utils.py file, by using the
This function is then used in the compute_multiscale_iou function of the same file. For each class, the IoU or DIoU (depending on the iou input argument) is calculated over the batch size. The function outputs a dictionary iou_dict containing the multiscale
We then used these values in train.py, where the val-interval epochs. These values were also used in validation.py, where they were used to display the losses and
We trained the model on the NuScenes dataset starting with the provided checkpoint checkpoint-008.pth.gz, once with the
Another contribution is a new visualization format that distinguishes classes better by showing all corresponding labels and IoU values. This was implemented in the visualization.py file.
Lastly, we worked to implement a mode that takes .mp4 videos as input and decomposes them into individual image frames. These frames are then evaluated by the model, and the segmentation result can be visualized in the inference.py file.
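The decomposition of a video into frames can be sketched as follows. This is an illustration under assumptions rather than the code of inference.py: the function names are hypothetical, and the reader is assumed to expose an OpenCV-style read() method returning a success flag and a frame.

```python
def iter_frames(capture):
    """Yield frames from an object with an OpenCV-style read() method."""
    while True:
        ok, frame = capture.read()
        if not ok:  # no more frames, or the video could not be read
            break
        yield frame


def read_video_frames(video_path):
    """Read all frames of a video file into a list using OpenCV."""
    import cv2  # imported here so iter_frames can be used without OpenCV

    capture = cv2.VideoCapture(video_path)
    try:
        return list(iter_frames(capture))
    finally:
        capture.release()
```

Each frame returned by OpenCV is a BGR array, so it would need to be converted to RGB before being passed to a model that expects RGB images.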
To get a preliminary idea of the training strategy for this model, we first decided to train it on the NuScenes mini dataset. Starting from checkpoint-008.pth.gz, we trained two models that differ in the metric used (IoU for one and DIoU for the other). The results obtained on a NuScenes mini batch after 10 epochs of training are presented in the table below.
After looking at these results, we observed that the pedestrian class, on which we based our hypothesis, did not show conclusive results. We therefore concluded that the mini dataset was not sufficient for our needs and decided to move our training to the full dataset on Scitas.
After training our new models (with DIoU or IoU) from checkpoint-008.pth.gz for 8 new epochs, we observed promising results. With the aim of comparing the performance of these newly trained models, we performed a validation step on the mini dataset. A visualization of the result for an image of this dataset is provided below.
Here, the
These results finally show a better performance of the
Now that we have a trained model, we can use it to predict the BEV from any input images or videos. While our ambition was to integrate our method into the course's final demo, the inferred bird's eye view maps were unfortunately not accurate enough. The figure below shows the inference result on one of the test videos provided (see test videos).
We believe this lack of performance for the inference is due to the following parameters:
Although the passage from
One option is to implement
The
Furthermore, according to research done by this paper [2], regression error for CIoU degrades faster than the rest, and will converge to
Another option is to train on datasets that are rich in crowded environments, to obtain a better representation of pedestrians and bicycles.
Finally, to truly validate our hypothesis, a validation run on the full NuScenes dataset could be conducted and the pedestrian IoUs of the two models could be compared.
[1] Zhaohui Zheng, Ping Wang, Wei Liu, Jinze Li, Rongguang Ye, Dongwei Ren (2020). Distance-IoU Loss: Faster and Better Learning for Bounding Box Regression. https://arxiv.org/pdf/1911.08287.pdf
[2] Zhaohui Zheng, Ping Wang, Dongwei Ren, Wei Liu, Rongguang Ye, Qinghua Hu, Wangmeng Zuo (2021). Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. https://arxiv.org/pdf/2005.03572.pdf