For a robot deployed in the world, it is desirable to be able to learn autonomously and improve on its initial pre-set knowledge. We formalize this as a bootstrapped self-supervised learning problem: a system is first bootstrapped with supervised training on a labeled dataset, and we seek a self-supervised training method that can subsequently improve the system beyond the supervised baseline using only unlabeled data. In this work, we leverage temporal consistency between frames in monocular video to perform this bootstrapped self-supervised training. We show that a well-trained state-of-the-art semantic segmentation network can be further improved through our method. In addition, we show that the bootstrapped self-supervised training framework can help a network learn depth estimation better than either pure supervised training or pure self-supervised training.
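As a minimal sketch of the temporal-consistency signal this abstract refers to (not the paper's actual method), a photometric loss can be computed by warping the next frame back to the current one using a predicted per-pixel displacement and penalizing the remaining intensity difference; all function names here are illustrative:

```python
import numpy as np

def warp(frame, flow):
    """Warp a grayscale frame (H, W) by an integer displacement
    field flow (H, W, 2) using nearest-neighbor sampling."""
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    yy = np.clip(ys + flow[..., 0], 0, h - 1).astype(int)
    xx = np.clip(xs + flow[..., 1], 0, w - 1).astype(int)
    return frame[yy, xx]

def photometric_loss(frame_t, frame_t1, flow):
    """Mean absolute difference between frame_t and frame_{t+1}
    warped back to time t -- the self-supervised training signal."""
    return np.abs(frame_t - warp(frame_t1, flow)).mean()

# A static scene with zero predicted motion incurs zero loss.
frame = np.random.rand(8, 8)
zero_flow = np.zeros((8, 8, 2))
loss = photometric_loss(frame, frame, zero_flow)  # → 0.0
```

In practice such losses are differentiable (bilinear rather than nearest-neighbor sampling) and the flow is induced from predicted depth and camera pose, but the core idea is this frame-to-frame consistency check.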
Self-supervised depth estimation has proven highly effective at producing high-quality depth maps given only image sequences as input. However, its performance usually drops when estimating depth on border areas or objects with thin structures due to
Image-based learning methods for autonomous vehicle perception tasks require large quantities of labelled, real data in order to properly train without overfitting, which can often be incredibly costly. While leveraging the power of simulated data ca
Training deep networks for semantic segmentation requires large amounts of labeled training data, which presents a major challenge in practice, as labeling segmentation masks is a highly labor-intensive process. To address this issue, we present a fr
In recent years, many methods demonstrated the ability of neural networks to learn depth and pose changes in a sequence of images, using only self-supervision as the training signal. Whilst the networks achieve good performance, they often over-look
Recent advances in self-supervised learning have demonstrated that it is possible to learn accurate monocular depth reconstruction from raw video data, without using any 3D ground truth for supervision. However, in robotics applications, multiple views o