No Arabic abstract
In this study, we propose a novel approach to predict the distances of the detected objects in an observed scene. The proposed approach modifies the recently proposed Convolutional Support Estimator Networks (CSENs). CSENs are designed to compute a direct mapping for the Support Estimation (SE) task in a representation-based classification problem. We further propose and demonstrate that representation-based methods (sparse or collaborative representation) can be used in well-designed regression problems. To the best of our knowledge, this is the first representation-based method proposed for performing a regression task by utilizing the modified CSENs; and hence, we name this novel approach as Representation-based Regression (RbR). The initial version of CSENs has a proxy mapping stage (i.e., a coarse estimation for the support set) that is required for the input. In this study, we improve the CSEN model by proposing Compressive Learning CSEN (CL-CSEN) that has the ability to jointly optimize the so-called proxy mapping stage along with convolutional layers. The experimental evaluations using the KITTI 3D Object Detection distance estimation dataset show that the proposed method can achieve a significantly improved distance estimation performance over all competing methods. Finally, the software implementations of the methods are publicly shared at https://github.com/meteahishali/CSENDistance.
Neural networks that map 3D coordinates to signed distance function (SDF) or occupancy values have enabled high-fidelity implicit representations of object shape. This paper develops a new shape model that allows synthesizing novel distance views by optimizing a continuous signed directional distance function (SDDF). Similar to deep SDF models, our SDDF formulation can represent whole categories of shapes and complete or interpolate across shapes from partial input data. Unlike an SDF, which measures distance to the nearest surface in any direction, an SDDF measures distance in a given direction. This allows training an SDDF model without 3D shape supervision, using only distance measurements, readily available from depth camera or Lidar sensors. Our model also removes post-processing steps like surface extraction or rendering by directly predicting distance at arbitrary locations and viewing directions. Unlike deep view-synthesis techniques, such as Neural Radiance Fields, which train high-capacity black-box models, our model encodes by construction the property that SDDF values decrease linearly along the viewing direction. This structure constraint not only results in dimensionality reduction but also provides analytical confidence about the accuracy of SDDF predictions, regardless of the distance to the object surface.
Computer vision researchers prefer to estimate age from face images because facial features provide useful information. However, estimating age from face images becomes challenging when people are distant from the camera or occluded. A persons gait is a unique biometric feature that can be perceived efficiently even at a distance. Thus, gait can be used to predict age when face images are not available. However, existing gait-based classification or regression methods ignore the ordinal relationship of different ages, which is an important clue for age estimation. This paper proposes an ordinal distribution regression with a global and local convolutional neural network for gait-based age estimation. Specifically, we decompose gait-based age regression into a series of binary classifications to incorporate the ordinal age information. Then, an ordinal distribution loss is proposed to consider the inner relationships among these classifications by penalizing the distribution discrepancy between the estimated value and the ground truth. In addition, our neural network comprises a global and three local sub-networks, and thus, is capable of learning the global structure and local details from the head, body, and feet. Experimental results indicate that the proposed approach outperforms state-of-the-art gait-based age estimation methods on the OULP-Age dataset.
Digital holography enables us to reconstruct objects in three-dimensional space from holograms captured by an imaging device. For the reconstruction, we need to know the depth position of the recoded object in advance. In this study, we propose depth prediction using convolutional neural network (CNN)-based regression. In the previous researches, the depth of an object was estimated through reconstructed images at different depth positions from a hologram using a certain metric that indicates the most focused depth position; however, such a depth search is time-consuming. The CNN of the proposed method can directly predict the depth position with millimeter precision from holograms.
Monocular 3D object detection is of great significance for autonomous driving but remains challenging. The core challenge is to predict the distance of objects in the absence of explicit depth information. Unlike regressing the distance as a single variable in most existing methods, we propose a novel geometry-based distance decomposition to recover the distance by its factors. The decomposition factors the distance of objects into the most representative and stable variables, i.e. the physical height and the projected visual height in the image plane. Moreover, the decomposition maintains the self-consistency between the two heights, leading to robust distance prediction when both predicted heights are inaccurate. The decomposition also enables us to trace the causes of the distance uncertainty for different scenarios. Such decomposition makes the distance prediction interpretable, accurate, and robust. Our method directly predicts 3D bounding boxes from RGB images with a compact architecture, making the training and inference simple and efficient. The experimental results show that our method achieves the state-of-the-art performance on the monocular 3D Object Detection and Birds Eye View tasks of the KITTI dataset, and can generalize to images with different camera intrinsics.
Convolutional neural network (CNN) based architectures, such as Mask R-CNN, constitute the state of the art in object detection and segmentation. Recently, these methods have been extended for model-based segmentation where the network outputs the parameters of a geometric model (e.g. an ellipse) directly. This work considers objects whose three-dimensional models can be represented as ellipsoids. We present a variant of Mask R-CNN for estimating the parameters of ellipsoidal objects by segmenting each object and accurately regressing the parameters of projection ellipses. We show that model regression is sensitive to the underlying occlusion scenario and that prediction quality for each object needs to be characterized individually for accurate 3D object estimation. We present a novel ellipse regression loss which can learn the offset parameters with their uncertainties and quantify the overall geometric quality of detection for each ellipse. These values, in turn, allow us to fuse multi-view detections to obtain 3D ellipsoid parameters in a principled fashion. The experiments on both synthetic and real datasets quantitatively demonstrate the high accuracy of our proposed method in estimating 3D objects under heavy occlusions compared to previous state-of-the-art methods.