Plant species identification in the wild is a difficult problem, in part because of the high variability of the input data, but also because of complications induced by the long-tail effects of the dataset distribution. Inspired by the most recent fine-grained visual classification approaches, which use attention to mitigate the effects of data variability, we explore the idea of using object detection as a form of attention. We introduce a bottom-up approach based on detecting plant organs and fusing the predictions of a variable number of organ-based species classifiers. We also curate a new, publicly available dataset with a long-tail distribution for evaluating plant organ detection and organ-based species identification.
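The fusion of a variable number of organ-based predictions can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the rule below (averaging log-probabilities, i.e. a normalized geometric mean) is an assumption, and the function name `fuse_organ_predictions` and the species labels are hypothetical.

```python
import math

def fuse_organ_predictions(organ_probs):
    """Fuse per-organ species distributions by averaging log-probabilities.

    organ_probs: list of dicts mapping species name -> probability, one dict
    per detected organ (the list length may differ from image to image).
    Returns a normalized dict of fused probabilities.
    NOTE: this fusion rule is an assumption, not the rule from the paper.
    """
    if not organ_probs:
        raise ValueError("at least one organ prediction is required")
    species = sorted(organ_probs[0])
    eps = 1e-12  # avoids log(0) when a classifier assigns zero probability
    # Mean log-probability per species (log of the geometric mean).
    scores = {
        s: sum(math.log(p[s] + eps) for p in organ_probs) / len(organ_probs)
        for s in species
    }
    # Softmax over the scores to renormalize them into a distribution.
    top = max(scores.values())
    exp_scores = {s: math.exp(v - top) for s, v in scores.items()}
    total = sum(exp_scores.values())
    return {s: v / total for s, v in exp_scores.items()}

# Two detected organs (say, a leaf and a flower) with hypothetical outputs.
leaf = {"species_a": 0.6, "species_b": 0.4}
flower = {"species_a": 0.3, "species_b": 0.7}
fused = fuse_organ_predictions([leaf, flower])
best = max(fused, key=fused.get)
```

Averaging rather than summing the log-probabilities keeps the fused score comparable between images with one detected organ and images with many.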
Automatic plant classification is a challenging fine-grained problem because of the wide biodiversity of existing plant species. Powerful deep learning architectures have been used to improve classification performance on this problem, but they usually yield models that depend heavily on a large training dataset and are not scalable. In this paper, we propose a novel method based on a two-view leaf image representation and a hierarchical classification strategy for fine-grained recognition of plant species. It uses the botanical taxonomy as the basis for a coarse-to-fine strategy that identifies first the plant genus and then the species. The two-view representation provides complementary global and local features of leaf images. A deep metric based on Siamese convolutional neural networks is used to reduce the dependence on a large number of training samples and to make the method scalable to new plant species. Experimental results on two challenging fine-grained datasets of leaf images (LifeCLEF 2015 and LeafSnap) show the effectiveness of the proposed method, which achieved recognition accuracies of 0.87 and 0.96, respectively.
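The coarse-to-fine strategy can be sketched as a two-step nearest-prototype search in an embedding space. This is only an illustration under stated assumptions: the embeddings are taken as given (in the paper they would come from the Siamese networks), Euclidean distance stands in for the learned metric, and the names `classify_coarse_to_fine`, `genus_a`, `species_a1`, and so on are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify_coarse_to_fine(query, prototypes):
    """prototypes: {genus: {species: embedding}}. Returns (genus, species).

    ASSUMPTION: one prototype embedding per species; the genus is
    represented by the centroid of its species prototypes.
    """
    # Coarse step: choose the genus whose centroid is closest to the query.
    genus = min(
        prototypes,
        key=lambda g: euclidean(query, mean_vector(list(prototypes[g].values()))),
    )
    # Fine step: compare the query only against species of that genus.
    species = min(
        prototypes[genus],
        key=lambda s: euclidean(query, prototypes[genus][s]),
    )
    return genus, species

prototypes = {
    "genus_a": {"species_a1": [0.0, 0.0], "species_a2": [0.0, 1.0]},
    "genus_b": {"species_b1": [5.0, 5.0], "species_b2": [6.0, 5.0]},
}
result = classify_coarse_to_fine([0.1, 0.9], prototypes)
```

In a setup like this, supporting a new species only requires adding its prototype to the dictionary, which is one way to read the scalability claim made above.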
Fine-grained visual classification aims to recognize images belonging to multiple sub-categories within the same category. It is a challenging task because of the inherently subtle variations among easily confused categories. Most existing methods take only an individual image as input, which may limit the ability of models to recognize contrastive cues from different images. In this paper, we propose an effective method called progressive co-attention network (PCA-Net) to tackle this problem. Specifically, we calculate the channel-wise similarity by encouraging interaction between the feature channels within same-category image pairs to capture the common discriminative features. Considering that complementary information is also crucial for recognition, we erase the prominent areas enhanced by the channel interaction to force the network to focus on other discriminative regions. The proposed model has achieved competitive results on three fine-grained visual classification benchmark datasets: CUB-200-2011, Stanford Cars, and FGVC Aircraft.
Fine-grained image classification aims to recognize hundreds of subcategories belonging to the same basic-level category, such as 200 subcategories of birds, which is highly challenging due to large variance within the same subcategory and small variance among different subcategories. Existing methods generally first locate the objects or parts and then determine which subcategory the image belongs to. However, they have two main limitations: (1) They rely on object or part annotations, which are highly labor-intensive to produce. (2) They ignore the spatial relationships between the object and its parts as well as among the parts, both of which are significantly helpful for finding discriminative parts. Therefore, this paper proposes the object-part attention model (OPAM) for weakly supervised fine-grained image classification, and the main novelties are: (1) The object-part attention model integrates two levels of attention: object-level attention localizes the objects in images, and part-level attention selects discriminative parts of the object. Both are jointly employed to learn multi-view and multi-scale features so that they reinforce each other. (2) The object-part spatial constraint model combines two spatial constraints: the object spatial constraint ensures that the selected parts are highly representative, and the part spatial constraint eliminates redundancy and enhances the discrimination of the selected parts. Both are jointly employed to exploit the subtle and local differences for distinguishing the subcategories. Importantly, neither object nor part annotations are used in our proposed approach, which avoids the heavy labor cost of labeling. Compared with more than 10 state-of-the-art methods on 4 widely used datasets, our OPAM approach achieves the best performance.
In fine-grained visual classification, objects usually share a similar geometric structure but differ in local appearance and pose. Therefore, localizing and extracting discriminative local features plays a crucial role in accurate category prediction. Existing works either attend to a limited number of object parts or train isolated networks for localization and classification. In this paper, we propose the Weakly Supervised Bilinear Attention Network (WS-BAN) to solve these issues. It jointly generates a set of attention maps (region-of-interest maps) to indicate the locations of object parts and extracts sequential part features by Bilinear Attention Pooling (BAP). In addition, we propose attention regularization and attention dropout to weakly supervise the generation of attention maps. WS-BAN can be trained end-to-end and achieves state-of-the-art performance on multiple fine-grained classification datasets, including CUB-200-2011, Stanford Cars and FGVC-Aircraft, which demonstrates its effectiveness.
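One common reading of Bilinear Attention Pooling is that each attention map is multiplied element-wise with every feature channel and the result is pooled into one part feature per attention map. The sketch below follows that reading; the use of average pooling, the flattened list layout, and the function name `bilinear_attention_pooling` are assumptions, since the abstract does not give these details.

```python
def bilinear_attention_pooling(features, attention):
    """Pool one part feature per attention map.

    features:  C lists of length P (one flattened H*W map per channel).
    attention: M lists of length P (one flattened map per object part).
    Returns M part features, each a list of C values.
    ASSUMPTION: average pooling over the P spatial positions.
    """
    num_positions = len(features[0])
    parts = []
    for att_map in attention:
        # Weight every channel by this attention map, then average-pool.
        part = [
            sum(a * f for a, f in zip(att_map, channel)) / num_positions
            for channel in features
        ]
        parts.append(part)
    return parts

# Toy input: C=2 channels, M=2 attention maps, P=4 positions (a 2x2 map).
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]
attention = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]
parts = bilinear_attention_pooling(features, attention)
# parts == [[0.25, 0.0], [1.75, 0.25]]
```

Concatenating the M part features in order would give the sequential part representation that feeds the classifier.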
Age estimation from a single face image is an essential task in human-computer interaction and computer vision, with a wide range of practical applications. The accuracy of existing methods for age estimation from face images in the wild is relatively low, because they take into account only global features and neglect the fine-grained features of age-sensitive areas. Inspired by fine-grained categorization and the visual attention mechanism, we propose a novel method based on our attention long short-term memory (AL) network for fine-grained age estimation in the wild. This method combines residual networks (ResNets) or residual networks of residual networks (RoR) with LSTM units to construct AL-ResNets or AL-RoR networks that extract local features of age-sensitive regions, which effectively improves the age estimation accuracy. First, a ResNets or RoR model pretrained on the ImageNet dataset is selected as the basic model and is then fine-tuned on the IMDB-WIKI-101 dataset for age estimation. Then, we fine-tune the ResNets or RoR model on the target age datasets to extract the global features of face images. To extract the local features of age-sensitive regions, an LSTM unit is then introduced to obtain the coordinates of the age-sensitive region automatically. Finally, age group classification is conducted directly on the Adience dataset, and age-regression experiments are performed with the Deep EXpectation algorithm (DEX) on the MORPH Album 2, FG-NET and 15/16LAP datasets. By combining the global and the local features, we obtain our final prediction results. Experimental results illustrate the effectiveness and robustness of the proposed AL-ResNets and AL-RoR for age estimation in the wild, where they achieve state-of-the-art performance and outperform other convolutional neural networks.
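The age-regression step via DEX can be illustrated briefly: DEX treats age estimation as classification over discrete ages and returns the expected value of the softmax distribution. The sketch below assumes 101 bins for ages 0 to 100; the function names and the example logits are hypothetical and not taken from the paper.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def expected_age(logits, ages):
    """Expected value of the age under the softmax distribution."""
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, ages))

ages = list(range(0, 101))  # ASSUMPTION: one class per year, 0 to 100

# Equal scores for every class: the estimate is the mean age, about 50.
uniform_estimate = expected_age([0.0] * 101, ages)

# Scores tied between ages 30 and 31, negligible elsewhere: about 30.5.
peaked_logits = [-1000.0] * 101
peaked_logits[30] = 0.0
peaked_logits[31] = 0.0
peaked_estimate = expected_age(peaked_logits, ages)
```

Taking the expectation rather than the most probable class lets the estimate fall between bins, as in the second example.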