ترغب بنشر مسار تعليمي؟ اضغط هنا

Semantic Segmentation and Data Fusion of Microsoft Bing 3D Cities and Small UAV-based Photogrammetric Data

370   0   0.0 ( 0 )
 نشر من قبل Meida Chen
 تاريخ النشر 2020
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

With state-of-the-art sensing and photogrammetric techniques, Microsoft Bing Maps team has created over 125 highly detailed 3D cities from 11 different countries that cover hundreds of thousands of square kilometer areas. The 3D city models were created using the photogrammetric technique with high-resolution images that were captured from aircraft-mounted cameras. Such a large 3D city database has caught the attention of the US Army for creating virtual simulation environments to support military operations. However, the 3D city models do not have semantic information such as buildings, vegetation, and ground and cannot allow sophisticated user-level and system-level interaction. At I/ITSEC 2019, the authors presented a fully automated data segmentation and object information extraction framework for creating simulation terrain using UAV-based photogrammetric data. This paper discusses the next steps in extending our designed data segmentation framework for segmenting 3D city data. In this study, the authors first investigated the strengths and limitations of the existing framework when applied to the Bing data. The main differences between UAV-based and aircraft-based photogrammetric data are highlighted. The data quality issues in the aircraft-based photogrammetric data, which can negatively affect the segmentation performance, are identified. Based on the findings, a workflow was designed specifically for segmenting Bing data while considering its characteristics. In addition, since the ultimate goal is to combine the use of both small unmanned aerial vehicle (UAV) collected data and the Bing data in a virtual simulation environment, data from these two sources needed to be aligned and registered together. To this end, the authors also proposed a data registration workflow that utilized the traditional iterative closest point (ICP) with the extracted semantic information.



قيم البحث

اقرأ أيضاً

At I/ITSEC 2019, the authors presented a fully-automated workflow to segment 3D photogrammetric point-clouds/meshes and extract object information, including individual tree locations and ground materials (Chen et al., 2019). The ultimate goal is to create realistic virtual environments and provide the necessary information for simulation. We tested the generalizability of the previously proposed framework using a database created under the U.S. Armys One World Terrain (OWT) project with a variety of landscapes (i.e., various buildings styles, types of vegetation, and urban density) and different data qualities (i.e., flight altitudes and overlap between images). Although the database is considerably larger than existing databases, it remains unknown whether deep-learning algorithms have truly achieved their full potential in terms of accuracy, as sizable data sets for training and validation are currently lacking. Obtaining large annotated 3D point-cloud databases is time-consuming and labor-intensive, not only from a data annotation perspective in which the data must be manually labeled by well-trained personnel, but also from a raw data collection and processing perspective. Furthermore, it is generally difficult for segmentation models to differentiate objects, such as buildings and tree masses, and these types of scenarios do not always exist in the collected data set. Thus, the objective of this study is to investigate using synthetic photogrammetric data to substitute real-world data in training deep-learning algorithms. We have investigated methods for generating synthetic UAV-based photogrammetric data to provide a sufficiently sized database for training a deep-learning algorithm with the ability to enlarge the data size for scenarios in which deep-learning models have difficulties.
120 - Linqing Zhao , Jiwen Lu , Jie Zhou 2021
In this paper, we propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation. Existing fusion-based methods achieve remarkable performances by integrating information from multiple modalities. However, they heavily rely on the correspondence between 2D pixels and 3D points by projection and can only perform the information fusion in a fixed manner, and thus their performances cannot be easily migrated to a more realistic scenario where the collected data often lack strict pair-wise features for prediction. To address this, we employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds and utilize them to guide the fusion of two modalities to further exploit complementary information. Specifically, we employ a geometric similarity module (GSM) to directly compare the spatial coordinate distributions of pair-wise 3D neighborhoods, and a contextual similarity module (CSM) to aggregate and compare spatial contextual information of corresponding central points. The two proposed modules can effectively measure how much image features can help predictions, enabling the network to adaptively adjust the contributions of two modalities to the final prediction of each point. Experimental results on the ScanNetV2 benchmark demonstrate that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity.
Semantic segmentation of 3D meshes is an important problem for 3D scene understanding. In this paper we revisit the classic multiview representation of 3D meshes and study several techniques that make them effective for 3D semantic segmentation of me shes. Given a 3D mesh reconstructed from RGBD sensors, our method effectively chooses different virtual views of the 3D mesh and renders multiple 2D channels for training an effective 2D semantic segmentation model. Features from multiple per view predictions are finally fused on 3D mesh vertices to predict mesh semantic segmentation labels. Using the large scale indoor 3D semantic segmentation benchmark of ScanNet, we show that our virtual views enable more effective training of 2D semantic segmentation networks than previous multiview approaches. When the 2D per pixel predictions are aggregated on 3D surfaces, our virtual multiview fusion method is able to achieve significantly better 3D semantic segmentation results compared to all prior multiview approaches and competitive with recent 3D convolution approaches.
In this paper, we present Generic Object Detection (GenOD), one of the largest object detection systems deployed to a web-scale general visual search engine that can detect over 900 categories for all Microsoft Bing Visual Search queries in near real -time. It acts as a fundamental visual query understanding service that provides object-centric information and shows gains in multiple production scenarios, improving upon domain-specific models. We discuss the challenges of collecting data, training, deploying and updating such a large-scale object detection model with multiple dependencies. We discuss a data collection pipeline that reduces per-bounding box labeling cost by 81.5% and latency by 61.2% while improving on annotation quality. We show that GenOD can improve weighted average precision by over 20% compared to multiple domain-specific models. We also improve the model update agility by nearly 2 times with the proposed disjoint detector training compared to joint fine-tuning. Finally we demonstrate how GenOD benefits visual search applications by significantly improving object-level search relevance by 54.9% and user engagement by 59.9%.
Semantic segmentation is a challenging vision problem that usually necessitates the collection of large amounts of finely annotated data, which is often quite expensive to obtain. Coarsely annotated data provides an interesting alternative as it is u sually substantially more cheap. In this work, we present a method to leverage coarsely annotated data along with fine supervision to produce better segmentation results than would be obtained when training using only the fine data. We validate our approach by simulating a scarce data setting with less than 200 low resolution images from the Cityscapes dataset and show that our method substantially outperforms solely training on the fine annotation data by an average of 15.52% mIoU and outperforms the coarse mask by an average of 5.28% mIoU.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا