This is the final project for the Deep Learning course at NYU Courant taught by Yann LeCun and Alfredo Canziani.
The goal of the project is to train a model that uses images captured by six cameras mounted on the same car to generate a top-down view of the surrounding area. The model is evaluated on (1) its ability to detect objects (cars, trucks, bicycles, etc.) and (2) its ability to draw the road map layout.
Two datasets are provided:
Since the unlabelled dataset forms ~75% of the entire dataset, it is pivotal to leverage it effectively for pretraining. Specifically, in this project we use self-supervised learning to pretrain the models.
The dataset is organized into three levels: scene, sample and image:
For the labelled dataset, two kinds of labels are provided:
Note that the input to the model is 6 images captured by 6 different cameras positioned around the car, whereas the output is a top-down view of the entire scene. The top-down view can be obtained by combining the binary road image with the bounding boxes of surrounding objects as shown below:
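As a rough illustration of this combination step, the sketch below overlays axis-aligned box outlines on a binary road map with NumPy. This is a simplified illustration, not the project's actual rendering code; in particular, the project's bounding boxes may be rotated rather than axis-aligned, and the coordinate convention here is assumed.

```python
import numpy as np

def overlay_boxes(road_map, boxes):
    """Overlay axis-aligned bounding-box outlines on a binary road map.

    road_map: (H, W) boolean array (True = road).
    boxes: list of (x_min, y_min, x_max, y_max) in pixel coordinates.
    Returns an (H, W) uint8 image: 0 = background, 1 = road, 2 = box outline.
    """
    canvas = road_map.astype(np.uint8)
    for x0, y0, x1, y1 in boxes:
        canvas[y0, x0:x1 + 1] = 2  # top edge
        canvas[y1, x0:x1 + 1] = 2  # bottom edge
        canvas[y0:y1 + 1, x0] = 2  # left edge
        canvas[y0:y1 + 1, x1] = 2  # right edge
    return canvas
```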
Two evaluation metrics are used, one for binary road map segmentation and one for object detection:
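Both metrics report a threat score (also known as the critical success index), TS = TP / (TP + FP + FN). The sketch below shows the pixel-wise version for binary road maps; it is a minimal illustration, not the official evaluation code, and the handling of the empty case is an assumption.

```python
import numpy as np

def threat_score(pred, target):
    """Pixel-wise threat score: TP / (TP + FP + FN).

    pred, target: boolean arrays of the same shape (e.g. top-down road maps).
    """
    pred = pred.astype(bool)
    target = target.astype(bool)
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    denom = tp + fp + fn
    # If both maps are empty there is nothing to miss; score 1.0 by convention here.
    return tp / denom if denom > 0 else 1.0
```

Note that, unlike plain accuracy, the threat score ignores true negatives, so a model cannot score well by predicting "no road" everywhere.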
We achieved threat scores of 0.762 and 0.0143 on the validation set for the road segmentation and object detection tasks, respectively. Furthermore, we ranked 8th overall on the leaderboard.
The figure below shows the road segmentation pipeline:
For the decoder architecture, we experimented with two strategies: (1) up-sampling followed by 2D convolutions to increase the spatial resolution, and (2) transposed convolutions. We observed that the latter yielded better results, as shown below:
| Upsampling + Convolutions | Transposed Convolutions |
|---|---|
| 0.733 | 0.741 |
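For intuition on why the two decoder strategies are interchangeable in shape, the arithmetic below checks that both can double a feature map's spatial size. The formulas follow the standard convolution/transposed-convolution output-size conventions (as documented, e.g., in PyTorch); the specific kernel and padding values are illustrative, not necessarily those used in the project.

```python
def conv_transpose_out(size, kernel, stride, padding, output_padding=0):
    # Transposed-convolution output size: (n - 1) * s - 2p + k + op
    return (size - 1) * stride - 2 * padding + kernel + output_padding

def upsample_then_conv_out(size, scale, kernel, stride, padding):
    # Upsample by `scale`, then apply a regular convolution: (n + 2p - k) // s + 1
    up = size * scale
    return (up + 2 * padding - kernel) // stride + 1

# Both strategies can double a 16x16 feature map to 32x32:
print(conv_transpose_out(16, kernel=4, stride=2, padding=1))            # 32
print(upsample_then_conv_out(16, scale=2, kernel=3, stride=1, padding=1))  # 32
```

The difference between the two is therefore not in the output shape but in how the up-sampled values are produced: transposed convolutions learn the interpolation weights, whereas fixed up-sampling leaves all learning to the subsequent convolution.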
Next, we utilized self-supervised pretext tasks to pretrain the network on the unlabelled dataset. We used two pretext tasks: jigsaw and a pretext task of our own design, which we call stereo. Similar to the jigsaw idea, the stereo task is the following: given a randomly permuted sequence of images from the 6 cameras, the network has to predict the permutation. The intuition behind this task is that, in order to accurately perform road segmentation for the entire scene, the network needs to know which camera captures which part of the scene.

However, the stereo network could simply cheat by looking at the body of the ego car that is visible in each image: since this view of the ego car is fixed across frames, the network would learn nothing about the scene itself. To avoid this, we crop each of the 6 images to the central 150×150 region, thereby removing any view of the ego car. When both methods were pretrained with 700 permutations, the stereo pretext task yielded better results than jigsaw; however, increasing the number of jigsaw permutations to 1000 yielded the best results overall.
Note: Stereo has maximum permutations = 6! = 720.
| No Pretrain | Jigsaw (700)\* | Stereo (700)\* | Jigsaw (1000)\* |
|---|---|---|---|
| 0.741 | 0.750 | 0.753 | 0.762 |

\* represents the number of permutations
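The stereo pretext labels can be sketched as follows: enumerate all 6! = 720 camera orderings, shuffle the 6 images with one of them, and have the network classify which permutation was applied. This is an illustrative reconstruction; the function and variable names are hypothetical and not taken from the project's code.

```python
import itertools
import random

NUM_CAMERAS = 6

# All 6! = 720 possible camera orderings; the class label for a shuffled
# input is the index of its permutation in this (lexicographically ordered) list.
PERMUTATIONS = list(itertools.permutations(range(NUM_CAMERAS)))
PERM_TO_LABEL = {perm: i for i, perm in enumerate(PERMUTATIONS)}

def make_stereo_sample(images, rng=random):
    """Shuffle the 6 camera images; return (shuffled_images, class_label)."""
    perm = rng.choice(PERMUTATIONS)
    shuffled = [images[i] for i in perm]
    return shuffled, PERM_TO_LABEL[perm]

# Example with placeholder "images" (camera names standing in for tensors):
cams = ["front_left", "front", "front_right", "back_left", "back", "back_right"]
shuffled, label = make_stereo_sample(cams)
```

Because 6! is only 720, the stereo task's label space is capped, whereas jigsaw (which permutes tiles within an image) can be scaled to larger permutation sets such as the 1000 used above.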
### Visualizations
In order to verify our hypothesis that the stereo pretext task does indeed help the network learn the different camera views, we visualized the encoded features from the stereo-pretrained network with t-SNE. The figure below shows that the network clusters the feature representations of the 6 camera images in different regions of the latent space. This separation helps the road segmentation task, since the network can distinguish between the camera views and match each one to a particular region of the scene.
For the object detection task, two types of architectures were tested:
Faster R-CNN yielded the best results. For detailed results and analysis, please refer to the report and the slides. Here’s a 3-minute video of the project.
Inspiration was taken from the following repositories while developing this project:
- Faster RCNN (a)
- Faster RCNN (b)
- YOLO v3
- Jigsaw Pretext Task
- FCN