No Arabic abstract
In this paper, we present Retargetable AR, a novel AR framework that yields an AR experience that is aware of scene contexts set in various real environments, achieving natural interaction between the virtual and real worlds. To this end, we characterize scene contexts with relationships among objects in 3D space, not with coordinates transformations. A context assumed by an AR content and a context formed by a real environment where users experience AR are represented as abstract graph representations, i.e. scene graphs. From RGB-D streams, our framework generates a volumetric map in which geometric and semantic information of a scene are integrated. Moreover, using the semantic map, we abstract scene objects as oriented bounding boxes and estimate their orientations. With such a scene representation, our framework constructs, in an online fashion, a 3D scene graph characterizing the context of a real environment for AR. The correspondence between the constructed graph and an AR scene graph denoting the context of AR content provides a semantically registered content arrangement, which facilitates natural interaction between the virtual and real worlds. We performed extensive evaluations on our prototype system through quantitative evaluation of the performance of the oriented bounding box estimation, subjective evaluation of the AR content arrangement based on constructed 3D scene graphs, and an online AR demonstration. The results of these evaluations showed the effectiveness of our framework, demonstrating that it can provide a context-aware AR experience in a variety of real scenes.
Panorama images have a much larger field-of-view thus naturally encode enriched scene context information compared to standard perspective images, which however is not well exploited in the previous scene understanding methods. In this paper, we propose a novel method for panoramic 3D scene understanding which recovers the 3D room layout and the shape, pose, position, and semantic category for each object from a single full-view panorama image. In order to fully utilize the rich context information, we design a novel graph neural network based context model to predict the relationship among objects and room layout, and a differentiable relationship-based optimization module to optimize object arrangement with well-designed objective functions on-the-fly. Realizing the existing data are either with incomplete ground truth or overly-simplified scene, we present a new synthetic dataset with good diversity in room layout and furniture placement, and realistic image quality for total panoramic 3D scene understanding. Experiments demonstrate that our method outperforms existing methods on panoramic scene understanding in terms of both geometry accuracy and object arrangement. Code is available at https://chengzhag.github.io/publication/dpc.
Scene flow estimation is the task to predict the point-wise 3D displacement vector between two consecutive frames of point clouds, which has important application in fields such as service robots and autonomous driving. Although many previous works have explored greatly on scene flow estimation based on point clouds, we point out two problems that have not been noticed or well solved before: 1) Points of adjacent frames in repetitive patterns may be wrongly associated due to similar spatial structure in their neighbourhoods; 2) Scene flow between adjacent frames of point clouds with long-distance movement may be inaccurately estimated. To solve the first problem, we propose a novel context-aware set conv layer to exploit contextual structure information of Euclidean space and learn soft aggregation weights for local point features. Our design is inspired by human perception of contextual structure information during scene understanding. We incorporate the context-aware set conv layer in a context-aware point feature pyramid module of 3D point clouds for scene flow estimation. For the second problem, we propose an explicit residual flow learning structure in the residual flow refinement layer to cope with long-distance movement. The experiments and ablation study on FlyingThings3D and KITTI scene flow datasets demonstrate the effectiveness of each proposed component and that we solve problem of ambiguous inter-frame association and long-distance movement estimation. Quantitative results on both FlyingThings3D and KITTI scene flow datasets show that our method achieves state-of-the-art performance, surpassing all other previous works to the best of our knowledge by at least 25%.
A key requirement for leveraging supervised deep learning methods is the availability of large, labeled datasets. Unfortunately, in the context of RGB-D scene understanding, very little data is available -- current datasets cover a small range of scene views and have limited semantic annotations. To address this issue, we introduce ScanNet, an RGB-D video dataset containing 2.5M views in 1513 scenes annotated with 3D camera poses, surface reconstructions, and semantic segmentations. To collect this data, we designed an easy-to-use and scalable RGB-D capture system that includes automated surface reconstruction and crowdsourced semantic annotation. We show that using this data helps achieve state-of-the-art performance on several 3D scene understanding tasks, including 3D object classification, semantic voxel labeling, and CAD model retrieval. The dataset is freely available at http://www.scan-net.org.
Controllable scene synthesis consists of generating 3D information that satisfy underlying specifications. Thereby, these specifications should be abstract, i.e. allowing easy user interaction, whilst providing enough interface for detailed control. Scene graphs are representations of a scene, composed of objects (nodes) and inter-object relationships (edges), proven to be particularly suited for this task, as they allow for semantic control on the generated content. Previous works tackling this task often rely on synthetic data, and retrieve object meshes, which naturally limits the generation capabilities. To circumvent this issue, we instead propose the first work that directly generates shapes from a scene graph in an end-to-end manner. In addition, we show that the same model supports scene modification, using the respective scene graph as interface. Leveraging Graph Convolutional Networks (GCN) we train a variational Auto-Encoder on top of the object and edge categories, as well as 3D shapes and scene layouts, allowing latter sampling of new scenes and shapes.
Segmentation-based scene text detection methods have been widely adopted for arbitrary-shaped text detection recently, since they make accurate pixel-level predictions on curved text instances and can facilitate real-time inference without time-consuming processing on anchors. However, current segmentation-based models are unable to learn the shapes of curved texts and often require complex label assignments or repeated feature aggregations for more accurate detection. In this paper, we propose RSCA: a Real-time Segmentation-based Context-Aware model for arbitrary-shaped scene text detection, which sets a strong baseline for scene text detection with two simple yet effective strategies: Local Context-Aware Upsampling and Dynamic Text-Spine Labeling, which model local spatial transformation and simplify label assignments separately. Based on these strategies, RSCA achieves state-of-the-art performance in both speed and accuracy, without complex label assignments or repeated feature aggregations. We conduct extensive experiments on multiple benchmarks to validate the effectiveness of our method. RSCA-640 reaches 83.9% F-measure at 48.3 FPS on CTW1500 dataset.