In object recognition, the Fisher vector (FV) representation is one of the state-of-the-art image representations, but it comes at the expense of dense, high-dimensional features and increased computation time. A simplification of the FV is therefore attractive, so we propose the Sparse Fisher Vector (SFV). By incorporating a locality strategy, we accelerate the Fisher coding step in image categorization, which is computed from a collection of local descriptors. Together with the pooling step, we explore the relationship between coding and pooling to give a theoretical explanation of the SFV. Experiments on benchmark datasets show that the SFV yields a several-fold speedup compared with the FV while maintaining categorization performance. In addition, we demonstrate how the SFV preserves consistency in the representation of similar local features.
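A minimal numpy sketch of the locality idea suggested by this abstract: each local descriptor contributes Fisher gradient statistics only to its k nearest Gaussian components instead of all of them. The GMM layout, function name, and normalization below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def sparse_fisher_vector(descriptors, means, covs, priors, k=5):
    """Sketch of a locality-constrained (sparse) Fisher vector.

    descriptors: (N, D) local descriptors
    means, covs: (K, D) diagonal-covariance GMM parameters
    priors:      (K,)   mixture weights
    Only the k nearest Gaussians per descriptor receive non-zero
    posteriors, which is where the speedup would come from.
    """
    N, D = descriptors.shape
    K = means.shape[0]
    fv = np.zeros((K, D))

    for x in descriptors:
        # distances to all component means (cheap relative to full posteriors)
        dists = np.sum((x - means) ** 2 / covs, axis=1)
        nearest = np.argsort(dists)[:k]

        # posteriors restricted to the k nearest components
        log_p = -0.5 * dists[nearest] + np.log(priors[nearest])
        post = np.exp(log_p - log_p.max())
        post /= post.sum()

        # first-order Fisher gradient statistics for the selected components only
        fv[nearest] += post[:, None] * (x - means[nearest]) / np.sqrt(covs[nearest])

    fv /= N * np.sqrt(priors)[:, None]
    return fv.ravel()
```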
Fine-grained image categorization is challenging due to the subtle inter-class differences. We posit that exploiting the rich relationships between channels can help capture such differences, since different channels correspond to different semantics. In this paper, we propose a channel interaction network (CIN), which models the channel-wise interplay both within an image and across images. For a single image, a self-channel interaction (SCI) module is proposed to explore channel-wise correlation within the image. This allows the model to learn complementary features from the correlated channels, yielding stronger fine-grained features. Furthermore, given an image pair, we introduce a contrastive channel interaction (CCI) module to model the cross-sample channel interaction with a metric learning framework, allowing the CIN to distinguish the subtle visual differences between images. Our model can be trained efficiently in an end-to-end fashion without the need for multi-stage training and testing. Finally, comprehensive experiments are conducted on three publicly available benchmarks, where the proposed method consistently outperforms state-of-the-art approaches such as DFL-CNN (Wang, Morariu, and Davis 2018) and NTS (Yang et al. 2018).
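A rough numpy sketch of what a self-channel interaction step could look like, based only on the description above: a channel-wise correlation matrix re-weights each channel with complementary information from the others. The shapes and the specific weighting scheme are assumptions for illustration, not the published SCI module.

```python
import numpy as np

def self_channel_interaction(feat):
    """Sketch of an SCI-style operation on a single feature map.

    feat: (C, H, W) convolutional feature map.
    Returns a map of the same shape in which each channel is
    augmented with information from its correlated channels.
    """
    C, H, W = feat.shape
    X = feat.reshape(C, H * W)                # flatten spatial dimensions

    # channel-wise correlation (bilinear) matrix
    corr = X @ X.T                            # (C, C)
    # row-normalize to obtain interaction weights
    weights = corr / (np.abs(corr).sum(axis=1, keepdims=True) + 1e-8)

    # mix complementary channels back into the original feature map
    out = X + weights @ X                     # (C, H*W)
    return out.reshape(C, H, W)
```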
Fisher Vectors and related orderless visual statistics have demonstrated excellent performance in object detection, sometimes superior to established approaches such as the Deformable Part Models. However, it remains unclear how these models can capture complex appearance variations using visual codebooks of limited sizes and coarse geometric information. In this work, we propose to interpret Fisher-Vector-based object detectors as part-based models. Through the use of several visualizations and experiments, we show that this is a useful insight to explain the good performance of the model. Furthermore, we reveal for the first time several interesting properties of the FV, including its ability to work well using only a small subset of input patches and visual words. Finally, we discuss the relation of the FV and DPM detectors, pointing out differences and commonalities between them.
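The part-based reading can be made concrete with a small sketch: because the FV of a window is an average of per-patch encodings, a linear classifier's window score decomposes into per-patch contributions, which is what makes patch-level visualizations possible. The function below is an illustrative decomposition under that assumption, not the authors' code.

```python
import numpy as np

def patch_contributions(patch_encodings, w, b=0.0):
    """Decompose a linear FV detection score into per-patch contributions.

    patch_encodings: (P, D) Fisher encodings of the P patches in a window
    w, b:            linear classifier weights and bias on the pooled FV
    With average pooling, score = w @ mean(encodings) + b, so each patch
    contributes w @ enc_i / P; large contributions behave like 'parts'.
    """
    P = patch_encodings.shape[0]
    contrib = patch_encodings @ w / P         # (P,) per-patch scores
    score = contrib.sum() + b
    return score, contrib
```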
The challenges of high intra-class variance yet low inter-class variation in fine-grained visual categorization become more severe with few labeled samples, i.e., fine-grained categorization under the few-shot setting (FGFS). High-order features are usually developed to uncover subtle differences between sub-categories in FGFS, but they are less effective in handling the high intra-class variance. In this paper, we propose a Target-Oriented Alignment Network (TOAN) to investigate the fine-grained relation between the target query image and the support classes. The features of each support image are transformed to match those of the query in the embedding feature space, which explicitly reduces the disparity within each category. Moreover, unlike existing FGFS approaches that devise high-order features over the global image with little explicit consideration of discriminative parts, we generate discriminative fine-grained features by integrating compositional concept representations into global second-order pooling. Extensive experiments are conducted on four fine-grained benchmarks to demonstrate the effectiveness of TOAN compared with state-of-the-art models.
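As a hedged illustration of the two ingredients named above, the sketch below aligns a support feature toward a query with a closed-form least-squares map (standing in for the learned alignment) and then applies global second-order pooling. Shapes and the alignment choice are assumptions, not the TOAN architecture.

```python
import numpy as np

def align_and_pool(support, query):
    """Sketch of target-oriented alignment followed by second-order pooling.

    support, query: (C, M) feature maps flattened over spatial positions.
    """
    # closed-form linear alignment: find A minimizing ||A @ support - query||
    A, *_ = np.linalg.lstsq(support.T, query.T, rcond=None)
    aligned = A.T @ support                           # (C, M)

    # global second-order (covariance-style) pooling of the aligned feature
    pooled = aligned @ aligned.T / aligned.shape[1]   # (C, C)
    return pooled
```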
In this paper, we examine the problem of image registration from a new perspective, where images are given by sparse approximations in parametric dictionaries of geometric functions. We propose a registration algorithm that looks for an estimate of the global transformation between sparse images by examining the set of relative geometric transformations between the respective features. We provide a theoretical analysis of our registration algorithm and derive performance guarantees based on two novel and important properties of redundant dictionaries, namely robust linear independence and transformation inconsistency. We present several illustrations and insights about the importance of these dictionary properties and show that common properties such as coherence or the restricted isometry property fail to provide sufficient information in registration problems. We finally show, with illustrative experiments on simple visual objects and handwritten digit images, that our algorithm outperforms baseline competitor methods in terms of transformation-invariant distance computation and classification.
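A toy sketch of the underlying idea, restricted to 2-D translations for simplicity: each pair of corresponding atoms in the two sparse decompositions votes for a relative transformation, and a robust consensus over those votes estimates the global transformation. The matching-by-index simplification and the function name are assumptions for illustration only.

```python
import numpy as np

def estimate_translation(atoms_a, atoms_b):
    """Toy sketch of registration from sparse decompositions.

    atoms_a, atoms_b: (K, 2) positions of the K most prominent atoms of
    the two images (same dictionary, matched by index for simplicity).
    """
    relative = atoms_b - atoms_a              # (K, 2) per-atom translation votes
    # robust consensus over the per-atom votes
    return np.median(relative, axis=0)
```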
In this work, we evaluate the use of superpixel pooling layers in deep network architectures for semantic segmentation. Superpixel pooling is a flexible and efficient replacement for other pooling strategies that incorporates spatial prior information. We propose a simple and efficient GPU implementation of the layer and explore several designs for integrating it into existing network architectures. We provide experimental results on the IBSR and Cityscapes datasets, demonstrating that superpixel pooling can be leveraged to consistently increase network accuracy with minimal computational overhead. Source code is available at https://github.com/bermanmaxim/superpixPool
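A minimal CPU sketch of superpixel average pooling, i.e., averaging the feature vectors over the pixels belonging to each superpixel. This is only meant to illustrate the operation; the efficient GPU layer is the one provided in the repository linked above.

```python
import numpy as np

def superpixel_avg_pool(feat, labels):
    """Average-pool a feature map over superpixels (illustrative CPU version).

    feat:   (C, H, W) feature map
    labels: (H, W)    integer superpixel assignment per pixel
    Returns (C, S) pooled features, one column per superpixel.
    """
    C = feat.shape[0]
    flat_feat = feat.reshape(C, -1)           # (C, H*W)
    flat_lab = labels.ravel()                 # (H*W,)
    S = flat_lab.max() + 1

    pooled = np.zeros((C, S))
    counts = np.bincount(flat_lab, minlength=S)
    for c in range(C):
        # sum of channel-c features within each superpixel
        pooled[c] = np.bincount(flat_lab, weights=flat_feat[c], minlength=S)
    return pooled / np.maximum(counts, 1)
```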