No Arabic abstract
We present To The Point (TTP), a method for reconstructing 3D objects from a single image using 2D to 3D correspondences learned from weak supervision. We recover a 3D shape from a 2D image by first regressing the 2D positions corresponding to the 3D template vertices and then jointly estimating a rigid camera transform and non-rigid template deformation that optimally explain the 2D positions through the 3D shape projection. By relying on 3D-2D correspondences we use a simple per-sample optimization problem to replace CNN-based regression of camera pose and non-rigid deformation and thereby obtain substantially more accurate 3D reconstructions. We treat this optimization as a differentiable layer and train the whole system in an end-to-end manner. We report systematic quantitative improvements on multiple categories and provide qualitative results comprising diverse shape, pose and texture prediction examples. Project website: https://fkokkinos.github.io/to_the_point/.
Point signature, a representation describing the structural neighborhood of a point in 3D shapes, can be applied to establish correspondences between points in 3D shapes. Conventional methods apply a weight-sharing network, e.g., any kind of graph neural networks, across all neighborhoods to directly generate point signatures and gain the generalization ability by extensive training over a large amount of training samples from scratch. However, these methods lack the flexibility in rapidly adapting to unseen neighborhood structures and thus generalizes poorly on new point sets. In this paper, we propose a novel meta-learning based 3D point signature model, named 3Dmetapointsignature (MEPS) network, that is capable of learning robust point signatures in 3D shapes. By regarding each point signature learning process as a task, our method obtains an optimized model over the best performance on the distribution of all tasks, generating reliable signatures for new tasks, i.e., signatures of unseen point neighborhoods. Specifically, the MEPS consists of two modules: a base signature learner and a meta signature learner. During training, the base-learner is trained to perform specific signature learning tasks. In the meantime, the meta-learner is trained to update the base-learner with optimal parameters. During testing, the meta-learner that is learned with the distribution of all tasks can adaptively change parameters of the base-learner, accommodating to unseen local neighborhoods. We evaluate the MEPS model on two datasets, e.g., FAUST and TOSCA, for dense 3Dshape correspondence. Experimental results demonstrate that our method not only gains significant improvements over the baseline model and achieves state-of-the-art results, but also is capable of handling unseen 3D shapes.
This paper proposes GraviCap, i.e., a new approach for joint markerless 3D human motion capture and object trajectory estimation from monocular RGB videos. We focus on scenes with objects partially observed during a free flight. In contrast to existing monocular methods, we can recover scale, object trajectories as well as human bone lengths in meters and the ground planes orientation, thanks to the awareness of the gravity constraining object motions. Our objective function is parametrised by the objects initial velocity and position, gravity direction and focal length, and jointly optimised for one or several free flight episodes. The proposed human-object interaction constraints ensure geometric consistency of the 3D reconstructions and improved physical plausibility of human poses compared to the unconstrained case. We evaluate GraviCap on a new dataset with ground-truth annotations for persons and different objects undergoing free flights. In the experiments, our approach achieves state-of-the-art accuracy in 3D human motion capture on various metrics. We urge the reader to watch our supplementary video. Both the source code and the dataset are released; see http://4dqv.mpi-inf.mpg.de/GraviCap/.
Monocular 3D detection currently struggles with extremely lower detection rates compared to LiDAR-based methods. The poor accuracy is mainly caused by the absence of accurate location cues due to the ill-posed nature of monocular imagery. LiDAR point clouds, which provide precise spatial measurement, can offer beneficial information for the training of monocular methods. To make use of LiDAR point clouds, prior works project them to form depth map labels, subsequently training a dense depth estimator to extract explicit location features. This indirect and complicated way introduces intermediate products, i.e., depth map predictions, taking much computation costs as well as leading to suboptimal performances. In this paper, we propose LPCG (LiDAR point cloud guided monocular 3D object detection), which is a general framework for guiding the training of monocular 3D detectors with LiDAR point clouds. Specifically, we use LiDAR point clouds to generate pseudo labels, allowing monocular 3D detectors to benefit from easy-collected massive unlabeled data. LPCG works well under both supervised and unsupervised setups. Thanks to a general design, LPCG can be plugged into any monocular 3D detector, significantly boosting the performance. As a result, we take the first place on KITTI monocular 3D/BEV (birds-eye-view) detection benchmark with a considerable margin. The code will be made publicly available soon.
We present a novel framework named NeuralRecon for real-time 3D scene reconstruction from a monocular video. Unlike previous methods that estimate single-view depth maps separately on each key-frame and fuse them later, we propose to directly reconstruct local surfaces represented as sparse TSDF volumes for each video fragment sequentially by a neural network. A learning-based TSDF fusion module based on gated recurrent units is used to guide the network to fuse features from previous fragments. This design allows the network to capture local smoothness prior and global shape prior of 3D surfaces when sequentially reconstructing the surfaces, resulting in accurate, coherent, and real-time surface reconstruction. The experiments on ScanNet and 7-Scenes datasets show that our system outperforms state-of-the-art methods in terms of both accuracy and speed. To the best of our knowledge, this is the first learning-based system that is able to reconstruct dense coherent 3D geometry in real-time.
Monocular 3D reconstruction of articulated object categories is challenging due to the lack of training data and the inherent ill-posedness of the problem. In this work we use video self-supervision, forcing the consistency of consecutive 3D reconstructions by a motion-based cycle loss. This largely improves both optimization-based and learning-based 3D mesh reconstruction. We further introduce an interpretable model of 3D template deformations that controls a 3D surface through the displacement of a small number of local, learnable handles. We formulate this operation as a structured layer relying on mesh-laplacian regularization and show that it can be trained in an end-to-end manner. We finally introduce a per-sample numerical optimisation approach that jointly optimises over mesh displacements and cameras within a video, boosting accuracy both for training and also as test time post-processing. While relying exclusively on a small set of videos collected per category for supervision, we obtain state-of-the-art reconstructions with diverse shapes, viewpoints and textures for multiple articulated object categories.