No Arabic abstract
In this paper, we propose a novel iterative multi-task framework to complete the segmentation mask of an occluded vehicle and recover the appearance of its invisible parts. In particular, to improve the quality of the segmentation completion, we present two coupled discriminators and introduce an auxiliary 3D model pool for sampling authentic silhouettes as adversarial samples. In addition, we propose a two-path structure with a shared network to enhance the appearance recovery capability. By iteratively performing the segmentation completion and the appearance recovery, the results will be progressively refined. To evaluate our method, we present a dataset, the Occluded Vehicle dataset, containing synthetic and real-world occluded vehicle images. We conduct comparison experiments on this dataset and demonstrate that our model outperforms the state-of-the-art in tasks of recovering segmentation mask and appearance for occluded vehicles. Moreover, we also demonstrate that our appearance recovery approach can benefit the occluded vehicle tracking in real-world videos.
In this paper, we tackle the problem of human de-occlusion which reasons about occluded segmentation masks and invisible appearance content of humans. In particular, a two-stage framework is proposed to estimate the invisible portions and recover the content inside. For the stage of mask completion, a stacked network structure is devised to refine inaccurate masks from a general instance segmentation model and predict integrated masks simultaneously. Additionally, the guidance from human parsing and typical pose masks are leveraged to bring prior information. For the stage of content recovery, a novel parsing guided attention module is applied to isolate body parts and capture context information across multiple scales. Besides, an Amodal Human Perception dataset (AHP) is collected to settle the task of human de-occlusion. AHP has advantages of providing annotations from real-world scenes and the number of humans is comparatively larger than other amodal perception datasets. Based on this dataset, experiments demonstrate that our method performs over the state-of-the-art techniques in both tasks of mask completion and content recovery. Our AHP dataset is available at url{https://sydney0zq.github.io/ahp/}.
Fruit tree pruning and fruit thinning require a powerful vision system that can provide high resolution segmentation of the fruit trees and their branches. However, recent works only consider the dormant season, where there are minimal occlusions on the branches or fit a polynomial curve to reconstruct branch shape and hence, losing information about branch thickness. In this work, we apply two state-of-the-art supervised learning models U-Net and DeepLabv3, and a conditional Generative Adversarial Network Pix2Pix (with and without the discriminator) to segment partially occluded 2D-open-V apple trees. Binary accuracy, Mean IoU, Boundary F1 score and Occluded branch recall were used to evaluate the performances of the models. DeepLabv3 outperforms the other models at Binary accuracy, Mean IoU and Boundary F1 score, but is surpassed by Pix2Pix (without discriminator) and U-Net in Occluded branch recall. We define two difficulty indices to quantify the difficulty of the task: (1) Occlusion Difficulty Index and (2) Depth Difficulty Index. We analyze the worst 10 images in both difficulty indices by means of Branch Recall and Occluded Branch Recall. U-Net outperforms the other two models in the current metrics. On the other hand, Pix2Pix (without discriminator) provides more information on branch paths, which are not reflected by the metrics. This highlights the need for more specific metrics on recovering occluded information. Furthermore, this shows the usefulness of image-transfer networks for hallucination behind occlusions. Future work is required to further enhance the models to recover more information from occlusions such that this technology can be applied to automating agricultural tasks in a commercial environment.
The combination of a small unmanned ground vehicle (UGV) and a large unmanned carrier vehicle allows more flexibility in real applications such as rescue in dangerous scenarios. The autonomous recovery system, which is used to guide the small UGV back to the carrier vehicle, is an essential component to achieve a seamless combination of the two vehicles. This paper proposes a novel autonomous recovery framework with a low-cost monocular vision system to provide accurate positioning and attitude estimation of the UGV during navigation. First, we introduce a light-weight convolutional neural network called UGV-KPNet to detect the keypoints of the small UGV from the images captured by a monocular camera. UGV-KPNet is computationally efficient with a small number of parameters and provides pixel-level accurate keypoints detection results in real-time. Then, six degrees of freedom pose is estimated using the detected keypoints to obtain positioning and attitude information of the UGV. Besides, we are the first to create a large-scale real-world keypoints dataset of the UGV. The experimental results demonstrate that the proposed system achieves state-of-the-art performance in terms of both accuracy and speed on UGV keypoint detection, and can further boost the 6-DoF pose estimation for the UGV.
Person re-identification (re-id) suffers from a serious occlusion problem when applied to crowded public places. In this paper, we propose to retrieve a full-body person image by using a person image with occlusions. This differs significantly from the conventional person re-id problem where it is assumed that person images are detected without any occlusion. We thus call this new problem the occluded person re-identitification. To address this new problem, we propose a novel Attention Framework of Person Body (AFPB) based on deep learning, consisting of 1) an Occlusion Simulator (OS) which automatically generates artificial occlusions for full-body person images, and 2) multi-task losses that force the neural network not only to discriminate a persons identity but also to determine whether a sample is from the occluded data distribution or the full-body data distribution. Experiments on a new occluded person re-id dataset and three existing benchmarks modified to include full-body person images and occluded person images show the superiority of the proposed method.
Monocular object detection and tracking have improved drastically in recent years, but rely on a key assumption: that objects are visible to the camera. Many offline tracking approaches reason about occluded objects post-hoc, by linking together tracklets after the object re-appears, making use of reidentification (ReID). However, online tracking in embodied robotic agents (such as a self-driving vehicle) fundamentally requires object permanence, which is the ability to reason about occluded objects before they re-appear. In this work, we re-purpose tracking benchmarks and propose new metrics for the task of detecting invisible objects, focusing on the illustrative case of people. We demonstrate that current detection and tracking systems perform dramatically worse on this task. We introduce two key innovations to recover much of this performance drop. We treat occluded object detection in temporal sequences as a short-term forecasting challenge, bringing to bear tools from dynamic sequence prediction. Second, we build dynamic models that explicitly reason in 3D, making use of observations produced by state-of-the-art monocular depth estimation networks. To our knowledge, ours is the first work to demonstrate the effectiveness of monocular depth estimation for the task of tracking and detecting occluded objects. Our approach strongly improves by 11.4% over the baseline in ablations and by 5.0% over the state-of-the-art in F1 score.