Drones, or UAVs, equipped with different sensors have been deployed in many settings, especially for urban traffic monitoring and last-mile delivery. They provide the ability to control different aspects of traffic given real-time observations, an important pillar for the future of transportation and smart cities. With the increasing use of such machines, many state-of-the-art object detectors, which achieve high performance on images from front-facing cameras, are being applied to UAV datasets. When applied to the high-resolution aerial images in such datasets, they fail to generalize to the wide range of object scales. To address this limitation, we propose an object detection method called Butterfly Detector that is tailored to detect objects in aerial images. We extend the concept of fields and introduce butterfly fields, a type of composite field that describes the spatial information of output features as well as the scale of the detected object. To overcome occlusions and viewing angle variations that can hinder localization, we employ a voting mechanism among related butterfly vectors pointing to the object center. We evaluate our Butterfly Detector on two publicly available UAV datasets (UAVDT and VisDrone2019) and show that it outperforms previous state-of-the-art methods while remaining real-time.
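As a rough illustration of the center-voting idea mentioned above, the sketch below accumulates votes from per-cell offset vectors that point toward an object center and returns a vote map whose peaks indicate likely centers. The array shapes, names, and threshold are assumptions for illustration, not the Butterfly Detector's actual fields or API.

```python
# Minimal sketch of a center-voting step, assuming each feature cell predicts
# an offset vector toward its object's center (illustrative only).
import numpy as np

def vote_for_centers(offsets, confidences, threshold=0.3):
    """Accumulate votes from per-cell offset vectors into a center heatmap.

    offsets:     (H, W, 2) array of (dy, dx) vectors pointing to object centers
    confidences: (H, W) array of per-cell confidence scores
    """
    H, W, _ = offsets.shape
    votes = np.zeros((H, W), dtype=np.float32)
    ys, xs = np.nonzero(confidences > threshold)
    for y, x in zip(ys, xs):
        cy = int(round(y + offsets[y, x, 0]))
        cx = int(round(x + offsets[y, x, 1]))
        if 0 <= cy < H and 0 <= cx < W:
            votes[cy, cx] += confidences[y, x]   # confidence-weighted vote
    return votes
```

Peaks in the returned vote map can then be read out (e.g., with non-maximum suppression) as candidate object centers.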
Humans perceive their surroundings in great detail even though most of our visual field is reduced to low-fidelity, color-deprived (e.g., dichromatic) input by the retina. In contrast, most deep learning architectures are computationally wasteful in that they consider every part of the input when performing an image processing task. Yet the human visual system is able to perform visual reasoning despite having only a small fovea of high visual acuity. With this in mind, we wish to understand the extent to which connectionist architectures are able to learn from and reason with low-acuity, distorted inputs. Specifically, we train autoencoders to generate full-detail images from low-detail foveations of those images and then measure their ability to reconstruct the full-detail images from the foveated inputs.
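A rough sketch of the training setup described above follows: an autoencoder is fed a "foveated" image (full detail only in a central patch, low detail elsewhere) and trained to reconstruct the original. The downsample/upsample foveation, patch size, and loss here are assumptions for illustration, not the paper's exact foveation model.

```python
# Hedged sketch: train an autoencoder to undo a simple foveation.
import torch
import torch.nn.functional as F

def foveate(img, fovea=32, factor=8):
    """img: (B, C, H, W). Keep a central fovea sharp, degrade the periphery."""
    B, C, H, W = img.shape
    low = F.interpolate(img, scale_factor=1 / factor, mode="bilinear",
                        align_corners=False)
    low = F.interpolate(low, size=(H, W), mode="bilinear", align_corners=False)
    out = low.clone()
    y0, x0 = (H - fovea) // 2, (W - fovea) // 2
    out[:, :, y0:y0 + fovea, x0:x0 + fovea] = img[:, :, y0:y0 + fovea,
                                                  x0:x0 + fovea]
    return out

def train_step(autoencoder, optimizer, batch):
    """autoencoder: any torch.nn.Module mapping images to images."""
    optimizer.zero_grad()
    recon = autoencoder(foveate(batch))
    loss = F.mse_loss(recon, batch)   # reconstruct the full-detail image
    loss.backward()
    optimizer.step()
    return loss.item()
```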
Mapping road networks is currently both expensive and labor-intensive. High-resolution aerial imagery provides a promising avenue to automatically infer a road network. Prior work uses convolutional neural networks (CNNs) to detect which pixels belong to a road (segmentation), and then uses complex post-processing heuristics to infer graph connectivity. We show that these segmentation methods have high error rates because noisy CNN outputs are difficult to correct. We propose RoadTracer, a new method to automatically construct accurate road network maps from aerial images. RoadTracer uses an iterative search process guided by a CNN-based decision function to derive the road network graph directly from the output of the CNN. We compare our approach with a segmentation method on fifteen cities, and find that at a 5% error rate, RoadTracer correctly captures 45% more junctions across these cities.
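The iterative, search-based construction described above can be sketched as a simple walk-and-backtrack loop. Here `decide(vertices, edges, v)` stands in for the CNN-based decision function and is assumed to return either ("walk", angle_in_radians) or ("stop", None); the real RoadTracer decision function, step size, and stopping rules differ.

```python
# Hedged sketch of an iterative road-tracing loop guided by a decision function.
import math

def trace_roads(start, decide, step=20.0):
    vertices = [start]                  # (x, y) tuples in image coordinates
    edges = []
    stack = [start]                     # frontier of vertices to expand
    while stack:
        v = stack[-1]
        action, angle = decide(vertices, edges, v)
        if action == "stop":
            stack.pop()                 # backtrack when no road continues here
            continue
        nxt = (v[0] + step * math.cos(angle), v[1] + step * math.sin(angle))
        vertices.append(nxt)
        edges.append((v, nxt))
        stack.append(nxt)               # keep walking from the new vertex
    return vertices, edges
```

Because each edge is produced directly by the decision function, the graph connectivity comes out of the search itself rather than from post-processing a noisy segmentation mask.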
Classical and, more recently, deep computer vision methods are optimized for visible-spectrum images, commonly encoded in grayscale or RGB colorspaces and acquired from smartphones or cameras. A less common source of images, exploited in the remote sensing field, is satellite and aerial imagery. However, the development of pattern recognition approaches for these data is relatively recent, mainly due to the limited availability of this type of imagery, which until recently was used exclusively for military purposes. Access to aerial imagery, including spectral information, has been increasing, mainly due to the low cost of drones, falling imaging satellite launch costs, and novel public datasets. Remote sensing applications usually employ computer vision techniques strictly modeled for classification tasks in closed set scenarios. However, real-world tasks rarely fit into closed set contexts, frequently presenting previously unknown classes that characterize them as open set scenarios. Focusing on this problem, this is the first paper to study and develop semantic segmentation techniques for open set scenarios applied to remote sensing images. The main contributions of this paper are: 1) a discussion of related work in open set semantic segmentation, showing evidence that these techniques can be adapted for open set remote sensing tasks; 2) the development and evaluation of a novel approach for open set semantic segmentation. Our method yielded competitive results when compared to closed set methods on the same dataset.
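To make the open set setting concrete, the sketch below shows one common generic criterion: pixels whose maximum softmax probability falls below a threshold are labeled "unknown". This is a baseline illustration of open set rejection only, not necessarily the approach proposed in the paper.

```python
# Generic open-set segmentation baseline: reject low-confidence pixels.
import numpy as np

UNKNOWN = -1

def open_set_predict(logits, threshold=0.5):
    """logits: (num_classes, H, W) per-pixel class scores."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)          # per-pixel softmax
    labels = probs.argmax(axis=0)
    labels[probs.max(axis=0) < threshold] = UNKNOWN   # mark unknown pixels
    return labels
```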
Object detection in aerial images is a challenging task for the following reasons: (1) objects are small and dense relative to the image; (2) object scale varies over a wide range; (3) the number of objects in different classes is imbalanced. Many current methods adopt a cropping strategy: splitting high-resolution images into a series of subregions (chips) and detecting on them. However, problems such as scale variation, object sparsity, and class imbalance persist when training a network on chips. In this work, three augmentation methods are introduced to relieve these problems. Specifically, we propose a scale adaptive module, which dynamically adjusts the chip size to balance object scale, narrowing the scale variation during training. In addition, we introduce mosaic augmentation to enlarge the dataset, relieving the object sparsity problem. To balance categories, we present mask resampling to paste objects into chips using panoptic segmentation. Our model achieves state-of-the-art performance on two popular aerial image datasets, VisDrone and UAVDT. Remarkably, the three methods can be independently applied to detectors, increasing performance steadily without sacrificing inference efficiency.
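A toy sketch of the scale-adaptive idea follows: choose a chip size so that a representative object in a region is rescaled toward a target size once the chip is resized to the network input. The formula, the use of the median, and all parameter names are assumptions for illustration, not the paper's exact module.

```python
# Hedged sketch: pick a chip size that normalizes object scale after resizing.
import numpy as np

def adaptive_chip_size(box_sizes, input_size=640, target_obj_size=48,
                       min_chip=320, max_chip=1536):
    """box_sizes: iterable of object sizes (pixels) inside a candidate region."""
    median = float(np.median(list(box_sizes)))
    # After resizing a chip of side `c` to `input_size`, an object of size s
    # becomes s * input_size / c; solve for c so the median hits the target.
    chip = median * input_size / target_obj_size
    return int(np.clip(chip, min_chip, max_chip))
```

For example, a region whose median object is 12 px would get a smaller chip (so objects appear larger after resizing) than a region whose median object is already 60 px, narrowing the scale range seen during training.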
Most publicly available datasets for image classification come with single labels, while images in daily life are inherently multi-labeled. Such an annotation gap makes many pre-trained single-label classification models fail in practical scenarios. This annotation issue is of even greater concern for aerial images: aerial data collected from sensors naturally cover a relatively large land area with multiple labels, while publicly available annotated aerial datasets (e.g., UCM, AID) are single-labeled. As manually annotating multi-label aerial images would be time- and labor-consuming, we propose a novel self-correction integrated domain adaptation (SCIDA) method for automatic multi-label learning. SCIDA is weakly supervised, i.e., it automatically learns a multi-label image classification model from massive, publicly available single-label images. To achieve this goal, we propose a novel Label-Wise self-Correction (LWC) module to better explore underlying label correlations. This module also makes unsupervised domain adaptation (UDA) from single- to multi-label data possible. During training, the proposed model uses only single-label information and requires no prior knowledge of multi-labeled data; at test time, it predicts labels for multi-label aerial images. In our experiments, the proposed model is trained with the single-labeled MAI-AID-s and MAI-UCM-s datasets and tested directly on our collected Multi-scene Aerial Image (MAI) dataset.
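To illustrate the single- to multi-label setting only, the sketch below takes a classifier trained on single-label source images, scores unlabeled target images with per-class sigmoids, and keeps confident classes as multi-label predictions. This is a generic thresholding illustration of the problem setup; the paper's LWC module additionally models label correlations and self-corrects the predictions, which is omitted here.

```python
# Hedged sketch: naive multi-label prediction from a single-label-trained model.
import torch

@torch.no_grad()
def multi_label_predict(model, images, threshold=0.7):
    """images: (B, C, H, W) target-domain batch; returns (B, num_classes) in {0, 1}."""
    scores = torch.sigmoid(model(images))        # per-class confidence
    return (scores > threshold).float()          # keep confident classes only
```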