No Arabic abstract
We propose to jointly learn multi-view geometry and warping between views of the same object instances for robust cross-view object detection. What makes multi-view object instance detection difficult are strong changes in viewpoint, lighting conditions, high similarity of neighbouring objects, and strong variability in scale. By turning object detection and instance re-identification in different views into a joint learning task, we are able to incorporate both image appearance and geometric soft constraints into a single, multi-view detection process that is learnable end-to-end. We validate our method on a new, large data set of street-level panoramas of urban objects and show superior performance compared to various baselines. Our contribution is threefold: a large-scale, publicly available data set for multi-view instance detection and re-identification; an annotation tool custom-tailored for multi-view instance detection; and a novel, holistic multi-view instance detection and re-identification method that jointly models geometry and appearance across views.
In this article, we consider the problem of few-shot learning for classification. We assume a network trained for base categories with a large number of training examples, and we aim to add novel categories to it that have only a few, e.g., one or five, training examples. This is a challenging scenario because: 1) high performance is required in both the base and novel categories; and 2) training the network for the new categories with a few training examples can contaminate the feature space trained well for the base categories. To address these challenges, we propose two geometric constraints to fine-tune the network with a few training examples. The first constraint enables features of the novel categories to cluster near the category weights, and the second maintains the weights of the novel categories far from the weights of the base categories. By applying the proposed constraints, we extract discriminative features for the novel categories while preserving the feature space learned for the base categories. Using public data sets for few-shot learning that are subsets of ImageNet, we demonstrate that the proposed method outperforms prevalent methods by a large margin.
In recent years, we have witnessed a surge of interest in multi-view representation learning, which is concerned with the problem of learning representations of multi-view data. When facing multiple views that are highly related but sightly different from each other, most of existing multi-view methods might fail to fully integrate multi-view information. Besides, correlations between features from multiple views always vary seriously, which makes multi-view representation challenging. Therefore, how to learn appropriate embedding from multi-view information is still an open problem but challenging. To handle this issue, this paper proposes a novel multi-view learning method, named Multi-view Low-rank Preserving Embedding (MvLPE). It integrates different views into one centroid view by minimizing the disagreement term, based on distance or similarity matrix among instances, between the centroid view and each view meanwhile maintaining low-rank reconstruction relations among samples for each view, which could make more full use of compatible and complementary information from multi-view features. Unlike existing methods with additive parameters, the proposed method could automatically allocate a suitable weight for each view in multi-view information fusion. However, MvLPE couldnt be directly solved, which makes the proposed MvLPE difficult to obtain an analytic solution. To this end, we approximate this solution based on stationary hypothesis and normalization post-processing to efficiently obtain the optimal solution. Furthermore, an iterative alternating strategy is provided to solve this multi-view representation problem. The experiments on six benchmark datasets demonstrate that the proposed method outperforms its counterparts while achieving very competitive performance.
Multi-typed objects Multi-view Multi-instance Multi-label Learning (M4L) deals with interconnected multi-typed objects (or bags) that are made of diverse instances, represented with heterogeneous feature views and annotated with a set of non-exclusive but semantically related labels. M4L is more general and powerful than the typical Multi-view Multi-instance Multi-label Learning (M3L), which only accommodates single-typed bags and lacks the power to jointly model the naturally interconnected multi-typed objects in the physical world. To combat with this novel and challenging learning task, we develop a joint matrix factorization based solution (M4L-JMF). Particularly, M4L-JMF firstly encodes the diverse attributes and multiple inter(intra)-associations among multi-typed bags into respective data matrices, and then jointly factorizes these matrices into low-rank ones to explore the composite latent representation of each bag and its instances (if any). In addition, it incorporates a dispatch and aggregation term to distribute the labels of bags to individual instances and reversely aggregate the labels of instances to their affiliated bags in a coherent manner. Experimental results on benchmark datasets show that M4L-JMF achieves significantly better results than simple adaptions of existing M3L solutions on this novel problem.
Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views. However, existing graph learning methods mostly focus on the multi-view consistency issue, yet often neglect the inconsistency across multiple views, which makes them vulnerable to possibly low-quality or noisy datasets. To overcome this limitation, we propose a new multi-view graph learning framework, which for the first time simultaneously and explicitly models multi-view consistency and multi-view inconsistency in a unified objective function, through which the consistent and inconsistent parts of each single-view graph as well as the unified graph that fuses the consistent parts can be iteratively learned. Though optimizing the objective function is NP-hard, we design a highly efficient optimization algorithm which is able to obtain an approximate solution with linear time complexity in the number of edges in the unified graph. Furthermore, our multi-view graph learning approach can be applied to both similarity graphs and dissimilarity graphs, which lead to two graph fusion-based variants in our framework. Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.
Recent advances in semi-supervised learning methods rely on estimating the categories of unlabeled data using a model trained on the labeled data (pseudo-labeling) and using the unlabeled data for various consistency-based regularization. In this work, we propose to explicitly leverage the structure of the data manifold based on a Manifold Graph constructed over the image instances within the feature space. Specifically, we propose an architecture based on graph networks that jointly optimizes feature extraction, graph connectivity, and feature propagation and aggregation to unlabeled data in an end-to-end manner. Further, we present a novel Prototype Generator for producing a diverse set of prototypes that compactly represent each category, which supports feature propagation. To evaluate our method, we first contribute a strong baseline that combines two consistency-based regularizers that already achieves state-of-the-art results especially with fewer labels. We then show that when combined with these regularizers, the proposed method facilitates the propagation of information from generated prototypes to image data to further improve results. We provide extensive qualitative and quantitative experimental results on semi-supervised benchmarks demonstrating the improvements arising from our design and show that our method achieves state-of-the-art performance when compared with existing methods using a single model and comparable with ensemble methods. Specifically, we achieve error rates of 3.35% on SVHN, 8.27% on CIFAR-10, and 33.83% on CIFAR-100. With much fewer labels, we surpass the state of the arts by significant margins of 41% relative error decrease on average.