No Arabic abstract
In many scenarios where cameras are applied, such as robot positioning and unmanned driving, camera calibration is one of the most important pre-work. The interactive calibration method based on the plane board is becoming popular in camera calibration field due to its repeatability and operation advantages. However, the existing methods select suggestions from a fixed dataset of pre-defined poses based on subjective experience, which leads to a certain degree of one-sidedness. Moreover, they does not give users clear instructions on how to place the board in the specified pose.
Camera calibration is an important prerequisite towards the solution of 3D computer vision problems. Traditional methods rely on static images of a calibration pattern. This raises interesting challenges towards the practical usage of event cameras, which notably require image change to produce sufficient measurements. The current standard for event camera calibration therefore consists of using flashing patterns. They have the advantage of simultaneously triggering events in all reprojected pattern feature locations, but it is difficult to construct or use such patterns in the field. We present the first dynamic event camera calibration algorithm. It calibrates directly from events captured during relative motion between camera and calibration pattern. The method is propelled by a novel feature extraction mechanism for calibration patterns, and leverages existing calibration tools before optimizing all parameters through a multi-segment continuous-time formulation. As demonstrated through our results on real data, the obtained calibration method is highly convenient and reliably calibrates from data sequences spanning less than 10 seconds.
Tracking of multiple objects is an important application in AI City geared towards solving salient problems related to safety and congestion in an urban environment. Frequent occlusion in traffic surveillance has been a major problem in this research field. In this challenge, we propose a model-based vehicle localization method, which builds a kernel at each patch of the 3D deformable vehicle model and associates them with constraints in 3D space. The proposed method utilizes shape fitness evaluation besides color information to track vehicle objects robustly and efficiently. To build 3D car models in a fully unsupervised manner, we also implement evolutionary camera self-calibration from tracking of walking humans to automatically compute camera parameters. Additionally, the segmented foreground masks which are crucial to 3D modeling and camera self-calibration are adaptively refined by multiple-kernel feedback from tracking. For object detection/classification, the state-of-the-art single shot multibox detector (SSD) is adopted to train and test on the NVIDIA AI City Dataset. To improve the accuracy on categories with only few objects, like bus, bicycle and motorcycle, we also employ the pretrained model from YOLO9000 with multi-scale testing. We combine the results from SSD and YOLO9000 based on ensemble learning. Experiments show that our proposed tracking system outperforms both state-of-the-art of tracking by segmentation and tracking by detection.
This paper presents a novel semantic-based online extrinsic calibration approach, SOIC (so, I see), for Light Detection and Ranging (LiDAR) and camera sensors. Previous online calibration methods usually need prior knowledge of rough initial values for optimization. The proposed approach removes this limitation by converting the initialization problem to a Perspective-n-Point (PnP) problem with the introduction of semantic centroids (SCs). The closed-form solution of this PnP problem has been well researched and can be found with existing PnP methods. Since the semantic centroid of the point cloud usually does not accurately match with that of the corresponding image, the accuracy of parameters are not improved even after a nonlinear refinement process. Thus, a cost function based on the constraint of the correspondence between semantic elements from both point cloud and image data is formulated. Subsequently, optimal extrinsic parameters are estimated by minimizing the cost function. We evaluate the proposed method either with GT or predicted semantics on KITTI dataset. Experimental results and comparisons with the baseline method verify the feasibility of the initialization strategy and the accuracy of the calibration approach. In addition, we release the source code at https://github.com/--/SOIC.
Single image camera calibration is the task of estimating the camera parameters from a single input image, such as the vanishing points, focal length, and horizon line. In this work, we propose Camera calibration TRansformer with Line-Classification (CTRL-C), an end-to-end neural network-based approach to single image camera calibration, which directly estimates the camera parameters from an image and a set of line segments. Our network adopts the transformer architecture to capture the global structure of an image with multi-modal inputs in an end-to-end manner. We also propose an auxiliary task of line classification to train the network to extract the global geometric information from lines effectively. Our experiments demonstrate that CTRL-C outperforms the previous state-of-the-art methods on the Google Street View and SUN360 benchmark datasets.
Most current single image camera calibration methods rely on specific image features or user input, and cannot be applied to natural images captured in uncontrolled settings. We propose directly inferring camera calibration parameters from a single image using a deep convolutional neural network. This network is trained using automatically generated samples from a large-scale panorama dataset, and considerably outperforms other methods, including recent deep learning-based approaches, in terms of standard L2 error. However, we argue that in many cases it is more important to consider how humans perceive errors in camera estimation. To this end, we conduct a large-scale human perception study where we ask users to judge the realism of 3D objects composited with and without ground truth camera calibration. Based on this study, we develop a new perceptual measure for camera calibration, and demonstrate that our deep calibration network outperforms other methods on this measure. Finally, we demonstrate the use of our calibration network for a number of applications including virtual object insertion, image retrieval and compositing.