ﻻ يوجد ملخص باللغة العربية
As a flexible passive 3D sensing means, unsupervised learning of depth from monocular videos is becoming an important research topic. It utilizes the photometric errors between the target view and the synthesized views from its adjacent source views as the loss instead of the difference from the ground truth. Occlusion and scene dynamics in real-world scenes still adversely affect the learning, despite significant progress made recently. In this paper, we show that deliberately manipulating photometric errors can efficiently deal with these difficulties better. We first propose an outlier masking technique that considers the occluded or dynamic pixels as statistical outliers in the photometric error map. With the outlier masking, the network learns the depth of objects that move in the opposite direction to the camera more accurately. To the best of our knowledge, such cases have not been seriously considered in the previous works, even though they pose a high risk in applications like autonomous driving. We also propose an efficient weighted multi-scale scheme to reduce the artifacts in the predicted depth maps. Extensive experiments on the KITTI dataset and additional experiments on the Cityscapes dataset have verified the proposed approachs effectiveness on depth or ego-motion estimation. Furthermore, for the first time, we evaluate the predicted depth on the regions of dynamic objects and static background separately for both supervised and unsupervised methods. The evaluation further verifies the effectiveness of our proposed technical approach and provides some interesting observations that might inspire future research in this direction.
A new unsupervised learning method of depth and ego-motion using multiple masks from monocular video is proposed in this paper. The depth estimation network and the ego-motion estimation network are trained according to the constraints of depth and e
Most of the deep-learning based depth and ego-motion networks have been designed for visible cameras. However, visible cameras heavily rely on the presence of an external light source. Therefore, it is challenging to use them under low-light conditio
Unsupervised learning of depth and ego-motion from unlabelled monocular videos has recently drawn great attention, which avoids the use of expensive ground truth in the supervised one. It achieves this by using the photometric errors between the targ
Recent visual pose estimation and tracking solutions provide notable results on popular datasets such as T-LESS and YCB. However, in the real world, we can find ambiguous objects that do not allow exact classification and detection from a single view
In this paper, we propose a Bidirectional Attention Network (BANet), an end-to-end framework for monocular depth estimation (MDE) that addresses the limitation of effectively integrating local and global information in convolutional neural networks.