We present HoliCity, a city-scale 3D dataset with rich structural information. Currently, this dataset has 6,300 real-world panoramas of resolution $13312 \times 6656$ that are accurately aligned with the CAD model of downtown London, which covers an area of more than 20 km$^2$; the median reprojection error of the alignment of an average image is less than half a degree. This dataset aims to be an all-in-one data platform for research on learning abstracted high-level holistic 3D structures that can be derived from city CAD models, e.g., corners, lines, wireframes, planes, and cuboids, with the ultimate goal of supporting real-world applications including city-scale reconstruction, localization, mapping, and augmented reality. The accurate alignment of the 3D CAD models and panoramas also benefits low-level 3D vision tasks such as surface normal estimation, as the surface normals extracted from previous LiDAR-based datasets are often noisy. We conduct experiments to demonstrate the applications of HoliCity, such as predicting surface segmentation, normal maps, depth maps, and vanishing points, and to test the generalizability of methods trained on HoliCity and other related datasets. HoliCity is available at https://holicity.io.
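As a rough illustration of what a reprojection error measured in degrees can mean for a panorama, the sketch below converts pixels to viewing directions and takes the median angle between annotated pixels and reprojected CAD points. It assumes an equirectangular projection and a particular axis convention; neither is stated in the abstract, and the function names are hypothetical rather than part of any HoliCity tooling.

```python
import math
from statistics import median

# Panorama resolution quoted in the abstract.
WIDTH, HEIGHT = 13312, 6656


def pixel_to_bearing(u, v, width=WIDTH, height=HEIGHT):
    """Map a pixel to a unit viewing direction.

    Assumption: equirectangular layout, longitude along u, latitude along v,
    y axis pointing up. The real dataset may use a different convention.
    """
    lon = (u / width) * 2.0 * math.pi - math.pi
    lat = math.pi / 2.0 - (v / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


def angular_error_deg(a, b):
    """Angle in degrees between two unit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    dot = max(-1.0, min(1.0, dot))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(dot))


def median_reprojection_error_deg(observed_px, projected_px):
    """Median angular error over matched (u, v) pixel pairs."""
    errors = [
        angular_error_deg(pixel_to_bearing(*obs), pixel_to_bearing(*proj))
        for obs, proj in zip(observed_px, projected_px)
    ]
    return median(errors)
```

Under this convention, one degree along the horizon corresponds to `WIDTH / 360` pixels, which is about 37 pixels at the stated resolution, so half a degree is roughly 18 pixels.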
City-scale sensing holds the promise of enabling a deeper understanding of our urban environments. However, a city-scale deployment requires physical installation, power management, and communications---all challenging tasks standing between a good idea and a realized one. This indicates the need for a platform that enables easy deployment and experimentation for applications operating at city scale. To address these challenges, we present Signpost, a modular, energy-harvesting platform for city-scale sensing. Signpost simplifies deployment by eliminating the need for connection to wired infrastructure and instead harvesting energy from an integrated solar panel. The platform furnishes the key resources necessary to support multiple, pluggable sensor modules while providing fair, safe, and reliable sharing in the face of dynamic energy constraints. We deploy Signpost with several sensor modules, showing the viability of an energy-harvesting, multi-tenant, sensing system, and evaluate its ability to support sensing applications. We believe Signpost reduces the difficulty inherent in city-scale deployments, enables new experimentation, and provides improved insights into urban health.
In this paper, we provide two case studies to demonstrate how artificial intelligence can empower civil engineering. In the first case, a machine learning-assisted framework, BRAILS, is proposed for city-scale building information modeling. Building information modeling (BIM) is an efficient way of describing buildings, which is essential to architecture, engineering, and construction. Our proposed framework employs deep learning techniques to extract visual information about buildings from satellite and street view images. Further, a novel machine learning (ML)-based statistical tool, SURF, is proposed to discover the spatial patterns in building metadata. The second case focuses on the task of soft-story building classification. Soft-story buildings are a type of building prone to collapse during a moderate or severe earthquake. Hence, identifying and retrofitting such buildings is vital to current earthquake preparedness efforts. For this task, we propose an automated deep learning-based procedure for identifying soft-story buildings from street view images at a regional scale. We also create a large-scale building image database and a semi-automated image labeling approach that effectively annotates new database entries. Through extensive computational experiments, we demonstrate the effectiveness of the proposed method.
Video recognition has been advanced in recent years by benchmarks with rich annotations. However, research is still mainly limited to human action or sports recognition, focusing on a highly specific video understanding task and thus leaving a significant gap towards describing the overall content of a video. We fill this gap by presenting a large-scale Holistic Video Understanding Dataset~(HVU). HVU is organized hierarchically in a semantic taxonomy that focuses on multi-label and multi-task video understanding as a comprehensive problem that encompasses the recognition of multiple semantic aspects in the dynamic scene. HVU contains approx.~572k videos in total, with 9 million annotations for the training, validation, and test sets spanning 3142 labels. HVU encompasses semantic aspects defined on categories of scenes, objects, actions, events, attributes, and concepts, which naturally capture real-world scenarios. We demonstrate the generalization capability of HVU on three challenging tasks: (1) video classification, (2) video captioning, and (3) video clustering. In particular, for video classification, we introduce a new spatio-temporal deep neural network architecture called Holistic Appearance and Temporal Network~(HATNet) that fuses 2D and 3D architectures into one by combining intermediate representations of appearance and temporal cues. HATNet focuses on the multi-label and multi-task learning problem and is trained in an end-to-end manner. Via our experiments, we validate the idea that holistic representation learning is complementary and can play a key role in enabling many real-world applications.
Learning to estimate 3D geometry from a single frame and optical flow from consecutive frames by watching unlabeled videos with deep convolutional networks has made significant progress recently. Current state-of-the-art (SoTA) methods treat the two tasks independently. One typical assumption of existing depth estimation methods is that the scenes contain no independently moving objects, while object motion can be easily modeled using optical flow. In this paper, we propose to address the two tasks as a whole, i.e., to jointly understand per-pixel 3D geometry and motion. This eliminates the need for a static-scene assumption and enforces the inherent geometrical consistency during the learning process, yielding significantly improved results for both tasks. We call our method Every Pixel Counts++, or EPC++. Specifically, during training, given two consecutive frames from a video, we adopt three parallel networks to predict the camera motion (MotionNet), the dense depth map (DepthNet), and the per-pixel optical flow between the two frames (OptFlowNet), respectively. The three types of information are fed into a holistic 3D motion parser (HMP), and the per-pixel 3D motion of both the rigid background and moving objects is disentangled and recovered. Comprehensive experiments were conducted on datasets with different scenes, including driving scenarios (KITTI 2012 and KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D), and synthetic animation (MPI Sintel dataset). Performance on the five tasks of depth estimation, optical flow estimation, odometry, moving object segmentation, and scene flow estimation shows that our approach outperforms other SoTA methods. Code will be available at: https://github.com/chenxuluo/EPC.
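To make the link between depth, camera motion, and optical flow concrete, the following minimal sketch computes the flow that a static point would exhibit under a given camera motion and subtracts it from an observed flow. It assumes a pinhole camera with intrinsics `fx, fy, cx, cy` and a rigid transform `(R, t)` that maps points from the first camera frame to the second. These names and conventions are illustrative assumptions; this is not the authors' HMP implementation.

```python
def rigid_flow(u, v, depth, fx, fy, cx, cy, R, t):
    """Flow of a static scene point at pixel (u, v) induced by camera motion.

    R is a 3x3 rotation (nested lists) and t a length-3 translation, both
    assumed to map frame-1 camera coordinates to frame-2 camera coordinates.
    """
    # Back-project the pixel to a 3D point using its depth.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Apply the rigid camera motion.
    moved = [
        sum(R[row][col] * point[col] for col in range(3)) + t[row]
        for row in range(3)
    ]

    # Project the moved point into the second frame.
    u2 = fx * moved[0] / moved[2] + cx
    v2 = fy * moved[1] / moved[2] + cy
    return (u2 - u, v2 - v)


def residual_flow(observed, rigid):
    """Part of the observed flow not explained by camera motion.

    A large residual suggests an independently moving object, under the
    assumption that depth and camera motion are accurate.
    """
    return (observed[0] - rigid[0], observed[1] - rigid[1])
```

For example, with an identity rotation, a sideways translation of 0.1 units, a depth of 10 units, and a focal length of 100 pixels, a static point at the image center shifts by one pixel; any observed flow beyond that would be attributed to object motion in this simplified view.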
Automatic and accurate tumor segmentation on medical images is in high demand to assist physicians with diagnosis and treatment. However, it is difficult to obtain the massive amounts of annotated training data required by deep learning models, as the manual delineation process is often tedious and requires expertise. Although self-supervised learning (SSL) schemes have been widely adopted to address this problem, most SSL methods focus only on global structure information, ignoring the key distinguishing features of tumor regions: local intensity variation and a wide size distribution. In this paper, we propose Scale-Aware Restoration (SAR), an SSL method for 3D tumor segmentation. Specifically, a novel proxy task, i.e., scale discrimination, is formulated to pre-train the 3D neural network in combination with the self-restoration task. Thus, the pre-trained model learns multi-level local representations through multi-scale inputs. Moreover, an adversarial learning module is introduced to learn modality-invariant representations from multiple unlabeled source datasets. We demonstrate the effectiveness of our method on two downstream tasks: (i) brain tumor segmentation and (ii) pancreas tumor segmentation. Compared with state-of-the-art 3D SSL methods, our proposed approach significantly improves segmentation accuracy. In addition, we analyze its advantages from multiple perspectives, such as data efficiency, performance, and convergence speed.
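One plausible reading of a scale-discrimination proxy task is sketched below: cubes of different sizes are cropped from an unlabeled volume, resampled to a common size, and labeled with the index of the scale they came from, so a network can be pre-trained to predict that index. The crop sizes, the nearest-neighbour resampling, and all function names are assumptions made for illustration; the abstract does not specify how SAR builds its samples.

```python
import random


def crop_cube(volume, origin, size):
    """Cut a size^3 cube out of a nested-list volume indexed [z][y][x]."""
    z0, y0, x0 = origin
    return [
        [row[x0:x0 + size] for row in plane[y0:y0 + size]]
        for plane in volume[z0:z0 + size]
    ]


def resize_nearest(cube, out_size):
    """Resample a cube to out_size^3 with nearest-neighbour indexing."""
    n = len(cube)
    idx = [(i * n) // out_size for i in range(out_size)]
    return [[[cube[z][y][x] for x in idx] for y in idx] for z in idx]


def make_scale_sample(volume, scales=(8, 16, 32), out_size=8, rng=random):
    """Return (patch, label), where label is the index of the crop scale.

    The scale set and output size are hypothetical defaults.
    """
    label = rng.randrange(len(scales))
    size = scales[label]
    dims = (len(volume), len(volume[0]), len(volume[0][0]))
    origin = tuple(rng.randint(0, d - size) for d in dims)
    patch = resize_nearest(crop_cube(volume, origin, size), out_size)
    return patch, label
```

Because every patch has the same shape regardless of its source scale, the label cannot be read off the input size and has to be inferred from the image content, which is the point of such a proxy task.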