No Arabic abstract
Keypoint detection and description is a commonly used building block in computer vision systems particularly for robotics and autonomous driving. Recently CNN based approaches have surpassed classical methods in a number of perception tasks. However, the majority of techniques to date have focused on standard cameras with little consideration given to fisheye cameras which are commonly used in autonomous driving. In this paper, we propose a novel training and evaluation pipeline for fisheye images. We make use of SuperPoint as our baseline which is a self-supervised keypoint detector and descriptor that has achieved state-of-the-art results on homography estimation. We introduce a fisheye adaptation pipeline to enable training on undistorted fisheye images. We evaluate the performance on the HPatches benchmark, and, by introducing a fisheye based evaluation methods for detection repeatability and descriptor matching correctness on the Oxford RobotCar datasets.
Researchers have attempted utilizing deep neural network (DNN) to learn novel local features from images inspired by its recent successes on a variety of vision tasks. However, existing DNN-based algorithms have not achieved such remarkable progress that could be partly attributed to insufficient utilization of the interactive characters between local feature detector and descriptor. To alleviate these difficulties, we emphasize two desired properties, i.e., repeatability and reliability, to simultaneously summarize the inherent and interactive characters of local feature detector and descriptor. Guided by these properties, a self-supervised framework, namely self-evolving keypoint detection and description (SEKD), is proposed to learn an advanced local feature model from unlabeled natural images. Additionally, to have performance guarantees, novel training strategies have also been dedicatedly designed to minimize the gap between the learned feature and its properties. We benchmark the proposed method on homography estimation, relative pose estimation, and structure-from-motion tasks. Extensive experimental results demonstrate that the proposed method outperforms popular hand-crafted and DNN-based methods by remarkable margins. Ablation studies also verify the effectiveness of each critical training strategy. We will release our code along with the trained model publicly.
Recent methods for people detection in overhead, fisheye images either use radially-aligned bounding boxes to represent people, assuming people always appear along image radius or require significant pre-/post-processing which radically increases computational complexity. In this work, we develop an end-to-end rotation-aware people detection method, named RAPiD, that detects people using arbitrarily-oriented bounding boxes. Our fully-convolutional neural network directly regresses the angle of each bounding box using a periodic loss function, which accounts for angle periodicities. We have also created a new dataset with spatio-temporal annotations of rotated bounding boxes, for people detection as well as other vision tasks in overhead fisheye videos. We show that our simple, yet effective method outperforms state-of-the-art results on three fisheye-image datasets. Code and dataset are available at http://vip.bu.edu/rapid .
In object detection, keypoint-based approaches often suffer a large number of incorrect object bounding boxes, arguably due to the lack of an additional look into the cropped regions. This paper presents an efficient solution which explores the visual patterns within each cropped region with minimal costs. We build our framework upon a representative one-stage keypoint-based detector named CornerNet. Our approach, named CenterNet, detects each object as a triplet, rather than a pair, of keypoints, which improves both precision and recall. Accordingly, we design two customized modules named cascade corner pooling and center pooling, which play the roles of enriching information collected by both top-left and bottom-right corners and providing more recognizable information at the central regions, respectively. On the MS-COCO dataset, CenterNet achieves an AP of 47.0%, which outperforms all existing one-stage detectors by at least 4.9%. Meanwhile, with a faster inference speed, CenterNet demonstrates quite comparable performance to the top-ranked two-stage detectors. Code is available at https://github.com/Duankaiwen/CenterNet.
Object detection has been vigorously investigated for years but fast accurate detection for real-world scenes remains a very challenging problem. Overcoming drawbacks of single-stage detectors, we take aim at precisely detecting objects for static and temporal scenes in real time. Firstly, as a dual refinement mechanism, a novel anchor-offset detection is designed, which includes an anchor refinement, a feature location refinement, and a deformable detection head. This new detection mode is able to simultaneously perform two-step regression and capture accurate object features. Based on the anchor-offset detection, a dual refinement network (DRNet) is developed for high-performance static detection, where a multi-deformable head is further designed to leverage contextual information for describing objects. As for temporal detection in videos, temporal refinement networks (TRNet) and temporal dual refinement networks (TDRNet) are developed by propagating the refinement information across time. We also propose a soft refinement strategy to temporally match object motion with the previous refinement. Our proposed methods are evaluated on PASCAL VOC, COCO, and ImageNet VID datasets. Extensive comparisons on static and temporal detection verify the superiority of DRNet, TRNet, and TDRNet. Consequently, our developed approaches run in a fairly fast speed, and in the meantime achieve a significantly enhanced detection accuracy, i.e., 84.4% mAP on VOC 2007, 83.6% mAP on VOC 2012, 69.4% mAP on VID 2017, and 42.4% AP on COCO. Ultimately, producing encouraging results, our methods are applied to online underwater object detection and grasping with an autonomous system. Codes are publicly available at https://github.com/SeanChenxy/TDRN.
We present the WoodScape fisheye semantic segmentation challenge for autonomous driving which was held as part of the CVPR 2021 Workshop on Omnidirectional Computer Vision (OmniCV). This challenge is one of the first opportunities for the research community to evaluate the semantic segmentation techniques targeted for fisheye camera perception. Due to strong radial distortion standard models dont generalize well to fisheye images and hence the deformations in the visual appearance of objects and entities needs to be encoded implicitly or as explicit knowledge. This challenge served as a medium to investigate the challenges and new methodologies to handle the complexities with perception on fisheye images. The challenge was hosted on CodaLab and used the recently released WoodScape dataset comprising of 10k samples. In this paper, we provide a summary of the competition which attracted the participation of 71 global teams and a total of 395 submissions. The top teams recorded significantly improved mean IoU and accuracy scores over the baseline PSPNet with ResNet-50 backbone. We summarize the methods of winning algorithms and analyze the failure cases. We conclude by providing future directions for the research.