Recognizing attributes of objects and their parts is important to many computer vision applications. Although great progress has been made in object-level attribute recognition, recognizing the attributes of parts remains less practical because training data for part attributes is usually scarce, especially for internet-scale applications. Furthermore, most existing part attribute recognition methods rely on part annotations, which are expensive to obtain. To address the data insufficiency problem and remove the dependence on part annotations, we introduce a novel Concept Sharing Network (CSN) for part attribute recognition. A key advantage of CSN is its ability to recognize a part attribute (a combination of part location and appearance pattern) with insufficient or even zero training data, by learning the part location and appearance pattern separately from training data that usually mix them in a single label. Extensive experiments on CUB-200-2011 [51], CelebA [35], and a newly proposed human attribute dataset demonstrate the effectiveness of CSN and its advantages over other methods, especially for attributes with few training samples. Further experiments show that CSN can also perform zero-shot part attribute recognition. The code will be made available at https://github.com/Zhaoxiangyun/Concept-Sharing-Network.
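A minimal sketch of the concept-sharing idea described above: a part attribute such as "red wing" is scored by composing a shared part-location concept ("wing") with a shared appearance concept ("red"), so that combinations unseen at training time can still be recognized. The module name, dimensions, and scoring rule below are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch: decompose a (part location, appearance) attribute into two shared
# concept pools so unseen combinations can be scored. Names are hypothetical.
import torch
import torch.nn as nn

class ConceptSharingHead(nn.Module):
    def __init__(self, feat_dim, num_parts, num_appearances, embed_dim=256):
        super().__init__()
        self.backbone_proj = nn.Linear(feat_dim, embed_dim)
        # One embedding per part-location concept and per appearance concept;
        # both pools are shared across all (part, appearance) attribute labels.
        self.part_concepts = nn.Embedding(num_parts, embed_dim)
        self.appearance_concepts = nn.Embedding(num_appearances, embed_dim)

    def forward(self, image_feat, part_idx, appearance_idx):
        # image_feat: (B, feat_dim) global feature from any CNN backbone.
        x = self.backbone_proj(image_feat)
        concept = self.part_concepts(part_idx) * self.appearance_concepts(appearance_idx)
        # Attribute score = compatibility between the image feature and the composed concept.
        return (x * concept).sum(dim=-1)
```

In this toy formulation, zero-shot recognition of a (part, appearance) pair follows directly: as long as each concept appeared in some other labeled attribute, the composed score is still defined.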
Semantic segmentation of road scenes is a key technology for scene perception in autonomous driving, and the effectiveness of deep Convolutional Neural Networks (CNNs) for this task has been demonstrated. However, state-of-the-art CNNs for semantic segmentation suffer from excessive computation and require large-scale training data. Inspired by Fine-tuning-based Transfer Learning (FTT) and feature-based knowledge distillation, we propose a new knowledge distillation method for cross-domain knowledge transfer and efficient training on insufficient data, named Spirit Distillation (SD), which allows the student network to mimic the teacher network in extracting general features, so that a compact and accurate student network can be trained for real-time semantic segmentation of road scenes. To further alleviate the shortage of data and improve the robustness of the student, we then propose an Enhanced Spirit Distillation (ESD) method, which exploits a more comprehensive general feature extraction capability by taking images from both the target and proximity domains as input. To our knowledge, this paper is a pioneering work on applying knowledge distillation to few-shot learning. Experiments on Cityscapes semantic segmentation, with prior knowledge transferred from COCO2017 and KITTI, demonstrate that our methods can train a better student network (mIOU and high-precision accuracy boost by 1.4% and 8.2% respectively, with 78.2% segmentation variance) with only 41.8% FLOPs (see Fig. 1).
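A minimal sketch of the feature-mimicking objective that this style of distillation builds on: the compact student is trained to reproduce the teacher's general feature maps (through an adaptation layer) in addition to its own segmentation loss. The teacher/student interfaces and the loss weight below are placeholders, not the paper's exact networks or hyperparameters.

```python
# Hedged sketch of feature-based distillation for segmentation. The student is assumed
# to return (features, logits); the frozen teacher exposes extract_features(). Both
# interfaces are illustrative assumptions.
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, adapt, images, labels, alpha=0.5):
    with torch.no_grad():
        t_feat = teacher.extract_features(images)     # frozen teacher pretrained on source domains
    s_feat, logits = student(images)                  # compact student: features + segmentation logits
    feat_loss = F.mse_loss(adapt(s_feat), t_feat)     # mimic the teacher's general features
    seg_loss = F.cross_entropy(logits, labels, ignore_index=255)
    return seg_loss + alpha * feat_loss
```

In an ESD-like extension, the feature-mimicking term could additionally be computed on images drawn from proximity domains, while the segmentation term uses only labeled target-domain data.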
We present a remote sensing pipeline that processes LiDAR (Light Detection And Ranging) data through machine and deep learning for archeological feature detection on big geo-spatial data platforms such as IBM PAIRS Geoscope. Today, archeologists are overwhelmed by the task of visually surveying huge amounts of (raw) LiDAR data in order to identify areas of interest for inspection on the ground. We showcase a software pipeline that yields significant savings in expert productivity while missing only a small fraction of the artifacts. Our work employs artificial neural networks in conjunction with an efficient spatial segmentation procedure based on domain knowledge. Data processing is constrained by a limited amount of training labels and by noisy LiDAR signals due to vegetation cover and the decay of ancient structures. We aim to identify geo-spatial areas with archeological artifacts in a supervised fashion, allowing the domain expert to flexibly tune parameters based on her needs.
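A minimal sketch of the tile-and-classify pattern such a pipeline implies: a LiDAR-derived elevation raster is split into fixed-size patches and each patch is scored by a small CNN for the presence of candidate features. Tile size, stride, the classifier, and the decision threshold are illustrative assumptions, not the deployed IBM PAIRS workflow.

```python
# Hedged sketch: window a LiDAR-derived raster and flag tiles worth ground inspection.
import numpy as np
import torch
import torch.nn as nn

def tile_raster(raster: np.ndarray, size: int = 128, stride: int = 64):
    """Yield (row, col, patch) windows from a 2-D elevation raster."""
    h, w = raster.shape
    for r in range(0, h - size + 1, stride):
        for c in range(0, w - size + 1, stride):
            yield r, c, raster[r:r + size, c:c + size]

classifier = nn.Sequential(                       # toy patch classifier, stands in for the trained model
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
    nn.Flatten(), nn.Linear(16, 1), nn.Sigmoid()
)

def score_tiles(raster, threshold=0.5):
    hits = []
    for r, c, patch in tile_raster(raster):
        x = torch.from_numpy(patch).float()[None, None]   # (1, 1, H, W)
        if classifier(x).item() > threshold:
            hits.append((r, c))                           # candidate area for the domain expert
    return hits
```

The threshold plays the role of the expert-tunable parameter: lowering it trades more false positives for fewer missed artifacts.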
Recognizing video events in long, complex videos with multiple sub-activities has received persistent attention recently. This task is more challenging than traditional action recognition with short, relatively homogeneous video clips. In this paper, we investigate the problem of recognizing long and complex events with varying action rhythms, which has not been considered in the literature but is a practical challenge. Our work is inspired in part by how humans identify events with varying rhythms: quickly catching the frames that contribute most to a specific event. We propose a two-stage \emph{end-to-end} framework, in which the first stage selects the most significant frames while the second stage recognizes the event using the selected frames. Our model needs only \emph{event-level labels} in the training stage, and is thus more practical when sub-activity labels are missing or difficult to obtain. The results of extensive experiments show that our model can achieve significant improvement in event recognition from long videos while maintaining high accuracy even if the test videos suffer from severe rhythm changes. This demonstrates the potential of our method for real-world video-based applications, where test and training videos can differ drastically in the rhythms of their sub-activities.
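A minimal sketch of a two-stage frame-selection recognizer in the spirit described above: the first stage scores how much each frame contributes to the event, the second stage classifies the event from the (softly) selected frames, and only event-level labels supervise both stages. The soft attention weighting is one illustrative way to keep the selection end-to-end trainable; it is an assumption, not the authors' exact mechanism.

```python
# Hedged sketch: stage 1 scores frames, stage 2 classifies the event from the weighted frames.
import torch
import torch.nn as nn

class TwoStageEventRecognizer(nn.Module):
    def __init__(self, feat_dim, num_events):
        super().__init__()
        self.frame_scorer = nn.Linear(feat_dim, 1)          # stage 1: per-frame significance
        self.classifier = nn.Linear(feat_dim, num_events)   # stage 2: event recognition

    def forward(self, frame_feats):
        # frame_feats: (B, T, D) features of T sampled frames from a long video.
        scores = self.frame_scorer(frame_feats).squeeze(-1)     # (B, T)
        weights = torch.softmax(scores, dim=1)                  # soft frame selection
        video_feat = (weights.unsqueeze(-1) * frame_feats).sum(1)
        return self.classifier(video_feat)                      # trained with event-level labels only
```

Because the selection weights are learned from the event loss alone, frames belonging to fast or slow sub-activities are re-weighted without any sub-activity supervision.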
Recent studies have shown remarkable success in face image generation. However, most existing methods only generate face images from random noise and cannot generate face images according to specific attributes. In this paper, we focus on the problem of face synthesis from attributes, which aims at generating faces with the specific characteristics corresponding to the given attributes. To this end, we propose a novel attribute-aware face image generation method with generative adversarial networks, called AFGAN. Specifically, we first propose a two-path embedding layer and a self-attention mechanism to convert the binary attribute vector into rich attribute features. Then three stacked generators produce $64 \times 64$, $128 \times 128$ and $256 \times 256$ resolution face images, respectively, by taking the attribute features as input. In addition, an image-attribute matching loss is proposed to enhance the correlation between the generated images and the input attributes. Extensive experiments on CelebA demonstrate the superiority of our AFGAN in terms of both qualitative and quantitative evaluations.
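A minimal sketch of an image-attribute matching loss of the kind mentioned above: matched (image, attribute) pairs are pulled together and mismatched pairs within the batch are pushed apart in a shared embedding space. The encoders, symmetric cross-entropy form, and temperature are illustrative placeholders, not AFGAN's exact formulation.

```python
# Hedged sketch: contrastive-style matching between generated-image embeddings and
# embeddings of their conditioning attribute vectors.
import torch
import torch.nn.functional as F

def image_attribute_matching_loss(img_embed, attr_embed, temperature=0.1):
    # img_embed, attr_embed: (B, D) embeddings of generated images and their input attributes.
    img_embed = F.normalize(img_embed, dim=-1)
    attr_embed = F.normalize(attr_embed, dim=-1)
    logits = img_embed @ attr_embed.t() / temperature          # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each image should match its own attribute vector (row-wise) and vice versa (column-wise).
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Adding such a term alongside the adversarial losses penalizes generators that produce realistic faces whose attributes drift away from the conditioning vector.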
Biometric recognition of partially captured targets is challenging, since only a few partial observations of an object are available for matching. In this area, deep learning based methods are widely applied to match partially captured objects arising from occlusions, posture variations, or objects partially out of view, in person re-identification and partial face recognition. However, most current methods either cannot identify an individual when some parts of the object are unavailable, or are specialized to certain constrained scenarios. To this end, we propose a robust general framework for arbitrary biometric matching scenarios without the limitations of alignment or input size. We introduce a feature post-processing step to handle the feature maps produced by an FCN and a dictionary learning based Spatial Feature Reconstruction (SFR) to match feature maps of different sizes. Moreover, the batch hard triplet loss function is applied to optimize the model. The applicability and effectiveness of the proposed method are demonstrated by experiments on three person re-identification datasets (Market1501, CUHK03, DukeMTMC-reID), two partial person datasets (Partial REID and Partial iLIDS), and two partial face datasets (CASIA-NIR-Distance and Partial LFW), on which our method achieves state-of-the-art performance in comparison with several competing approaches. The code is released online at https://github.com/lingxiao-he/Partial-Person-ReID.
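A minimal sketch of the batch-hard triplet loss mentioned above: for each anchor in the batch, the hardest positive (same identity, farthest) and hardest negative (different identity, closest) are mined and separated by a margin. The sketch uses Euclidean distances on global embeddings for simplicity; it does not reproduce the spatial feature reconstruction step that the paper applies the loss to.

```python
# Hedged sketch of batch-hard triplet mining within a mini-batch of identity-labeled embeddings.
import torch
import torch.nn.functional as F

def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    # embeddings: (B, D) L2 features, labels: (B,) identity labels within the batch.
    dist = torch.cdist(embeddings, embeddings, p=2)                      # (B, B) pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)                    # positive mask (includes self)
    pos_dist = (dist * same.float()).max(dim=1).values                   # hardest positive per anchor
    neg_dist = dist.masked_fill(same, float('inf')).min(dim=1).values    # hardest negative per anchor
    return F.relu(pos_dist - neg_dist + margin).mean()
```

Mining the hardest pairs inside each batch keeps the loss informative late in training, when random triplets would mostly already satisfy the margin.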