RGB-infrared person re-identification is a challenging task due to intra-class variations and cross-modality discrepancy. Existing works mainly focus on learning modality-shared global representations by aligning image styles or feature distributions across modalities, while local features from body parts and relationships between person images are largely neglected. In this paper, we propose a Dual-level (i.e., local and global) Feature Fusion (DF^2) module that learns attention for discriminative features in a local-to-global manner. In particular, the attention for a local feature is determined locally, i.e., by applying a learned transformation function to the feature itself. Meanwhile, to further mine the relationships between global features from person images, we propose an Affinities Modeling (AM) module to obtain the optimal intra- and inter-modality image matching. Specifically, AM employs intra-class compactness and inter-class separability in the sample similarities as supervisory information to model the affinities between intra- and inter-modality samples. Experimental results show that our proposed method outperforms state-of-the-art methods by large margins on two widely used cross-modality re-ID datasets, SYSU-MM01 and RegDB.
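The idea of attention that is "determined locally" can be illustrated with a minimal sketch. The abstract above does not specify the transformation, so the linear-plus-sigmoid scoring function, the weights `w` and `b`, and the weighted-average fusion below are all assumptions made for illustration, not the DF^2 module itself.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def local_attention(part, w, b):
    """Score one part feature using only that feature: sigmoid(w . part + b).
    w and b stand in for learned parameters (hypothetical values here)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, part)) + b)

def fuse_parts(parts, w, b):
    """Attention-weighted average of part features, giving one global vector."""
    scores = [local_attention(p, w, b) for p in parts]
    total = sum(scores)
    dim = len(parts[0])
    fused = [sum(s * p[d] for s, p in zip(scores, parts)) / total
             for d in range(dim)]
    return fused, scores

# Toy input: two 2-D part features and hand-picked (not learned) parameters.
parts = [[1.0, 0.0], [0.0, 1.0]]
fused, scores = fuse_parts(parts, w=[2.0, -2.0], b=0.0)
```

With these toy weights the first part receives the larger score, so the fused vector leans towards it.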
RGB-Infrared (IR) person re-identification is very challenging due to the large cross-modality variations between RGB and IR images. The key solution is to learn aligned features to bridge the RGB and IR modalities. However, due to the lack of correspondence labels between every pair of RGB and IR images, most methods try to alleviate the variations with set-level alignment by reducing the distance between the entire RGB and IR sets. This set-level alignment, however, may lead to misalignment of some instances, which limits the performance of RGB-IR Re-ID. Different from existing methods, in this paper, we propose to generate cross-modality paired images and perform both global set-level and fine-grained instance-level alignments. Our proposed method enjoys several merits. First, our method can perform set-level alignment by disentangling modality-specific and modality-invariant features. Compared with conventional methods, ours can explicitly remove the modality-specific features, so the modality variation can be better reduced. Second, given cross-modality unpaired images of a person, our method can generate cross-modality paired images from exchanged images. With them, we can directly perform instance-level alignment by minimizing the distance between every pair of images. Extensive experimental results on two standard benchmarks demonstrate that the proposed model performs favourably against state-of-the-art methods. In particular, on the SYSU-MM01 dataset, our model achieves gains of 9.2% and 7.7% in terms of Rank-1 and mAP, respectively. Code is available at https://github.com/wangguanan/JSIA-ReID.
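The difference between set-level and instance-level alignment described in the abstract above can be seen in a small toy example. This is a sketch under simplifying assumptions: set-level alignment is approximated by the distance between the two set means, and `rgb[i]` and `ir[i]` are assumed to be a generated pair of the same person. The function names and data are hypothetical and are not taken from the linked repository.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_vector(features):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(features)
    return [sum(col) / n for col in zip(*features)]

def set_level_distance(rgb_feats, ir_feats):
    """Distance between set means: a simple stand-in for set-level alignment."""
    return euclidean(mean_vector(rgb_feats), mean_vector(ir_feats))

def instance_level_distance(rgb_feats, ir_feats):
    """Mean distance over pairs; rgb_feats[i] and ir_feats[i] are assumed paired."""
    pairs = list(zip(rgb_feats, ir_feats))
    return sum(euclidean(r, i) for r, i in pairs) / len(pairs)

# Toy features for two persons; the IR features are swapped relative to RGB.
rgb = [[0.0, 1.0], [1.0, 0.0]]
ir = [[1.0, 0.0], [0.0, 1.0]]
set_gap = set_level_distance(rgb, ir)        # the two sets look aligned
pair_gap = instance_level_distance(rgb, ir)  # each pair is far apart
```

Here the set means coincide, so the set-level gap is zero even though every individual pair is misaligned, which is the failure mode that motivates instance-level alignment.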
RGB-Infrared (IR) cross-modality person re-identification (re-ID), which aims to search for an IR image in an RGB gallery or vice versa, is a challenging task due to the large discrepancy between the IR and RGB modalities. Existing methods typically address this challenge by aligning feature distributions or image styles across modalities, whereas the very useful similarities among gallery samples of the same modality (i.e., intra-modality sample similarities) are largely neglected. This paper presents a novel similarity inference metric (SIM) that exploits the intra-modality sample similarities to circumvent the cross-modality discrepancy, targeting optimal cross-modality image matching. SIM works by successive similarity graph reasoning and mutual nearest-neighbor reasoning, which mine cross-modality sample similarities by leveraging intra-modality sample similarities from two different perspectives. Extensive experiments over two cross-modality re-ID datasets (SYSU-MM01 and RegDB) show that SIM achieves significant accuracy improvements with little extra training compared with the state of the art.
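As a rough illustration of the mutual nearest-neighbor idea mentioned in the abstract above, the sketch below keeps a query-gallery pair only when each sample is among the other's k nearest neighbors. It is a generic mutual k-NN check on an assumed distance matrix, not the SIM procedure; the function names and toy distances are hypothetical.

```python
def k_nearest(distances, k):
    """Indices of the k smallest entries in a list of distances."""
    order = sorted(range(len(distances)), key=lambda j: distances[j])
    return set(order[:k])

def mutual_nearest_neighbors(dist, k):
    """dist[q][g] is the distance from query q to gallery sample g.
    Returns (q, g) pairs where g is in q's k-NN and q is in g's k-NN."""
    n_q, n_g = len(dist), len(dist[0])
    q_nn = [k_nearest(dist[q], k) for q in range(n_q)]
    g_nn = [k_nearest([dist[q][g] for q in range(n_q)], k) for g in range(n_g)]
    return [(q, g) for q in range(n_q) for g in sorted(q_nn[q]) if q in g_nn[g]]

# Toy distances: 2 queries (rows) against 3 gallery samples (columns).
dist = [[0.1, 0.9, 0.5],
        [0.2, 0.3, 0.8]]
pairs = mutual_nearest_neighbors(dist, k=1)
```

Both queries pick gallery sample 0 as their nearest neighbor, but sample 0 picks only query 0 in return, so `(0, 0)` is the single mutual pair.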
RGB-Infrared (IR) person re-identification aims to retrieve a person of interest across heterogeneous cameras, and it suffers from the large image modality discrepancy caused by different sensing wavelength ranges. Existing work usually minimizes this discrepancy by aligning the domain distributions of global features, while neglecting the intra-modality structural relations between semantic parts. This can cause the network to focus excessively on local cues without considering long-range body part dependencies, leading to meaningless region representations. In this paper, we propose a graph-enabled distribution matching solution, dubbed Geometry-Guided Dual-Alignment (G2DA) learning, for RGB-IR ReID. It jointly encourages cross-modal consistency between part semantics and structural relations for fine-grained modality alignment by solving a graph matching task within a multi-scale skeleton graph that embeds human topology information. Specifically, we propose to build a semantic-aligned complete graph into which all cross-modality images can be mapped via a pose-adaptive graph construction mechanism. This graph represents the extracted whole-part features as nodes and expresses the node-wise similarities with the associated edges. To achieve graph-based dual-alignment learning, an Optimal Transport (OT) based structured metric is further introduced to simultaneously measure point-wise relations and group-wise structural similarities across modalities. By minimizing the cost of an inter-modality transport plan, G2DA can learn a consistent and discriminative feature subspace for cross-modality image retrieval. Furthermore, we propose a Message Fusion Attention (MFA) mechanism to adaptively reweight the information flow of semantic propagation, effectively strengthening the discriminability of the extracted semantic features.
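To make "the cost of an inter-modality transport plan" concrete, here is a minimal sketch of entropy-regularised optimal transport solved with Sinkhorn iterations. It covers only the generic point-wise OT cost between two small sets of nodes; the structured metric with group-wise similarities described in the abstract above is not reproduced. The cost matrix, the uniform node weights, and the parameters `eps` and `n_iters` are assumptions chosen for illustration.

```python
import math

def sinkhorn(cost, a, b, eps=0.1, n_iters=200):
    """Entropy-regularised transport plan between weight vectors a and b."""
    n, m = len(a), len(b)
    # Gibbs kernel derived from the cost matrix.
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

def transport_cost(cost, plan):
    """Total cost of moving mass according to the plan."""
    return sum(c * p for c_row, p_row in zip(cost, plan)
               for c, p in zip(c_row, p_row))

# Toy case: two RGB nodes and two IR nodes; matching nodes have zero cost.
cost = [[0.0, 1.0],
        [1.0, 0.0]]
weights = [0.5, 0.5]
plan = sinkhorn(cost, weights, weights)
total = transport_cost(cost, plan)
```

In this toy case almost all mass moves along the zero-cost diagonal, so the total transport cost is close to zero; a training objective would minimise such a cost with respect to the features that define the cost matrix.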
This paper focuses on the cross-modality visible-infrared person re-identification (VI Re-ID) task, which aims to match person samples between the visible and infrared modalities. To reduce the discrepancy between features of different modalities, most existing works use constraints based on the Euclidean metric. Since a Euclidean-based distance metric cannot effectively measure the internal angles between the embedded vectors, these methods cannot learn an angularly discriminative feature embedding. Because the most important factor affecting an embedding-based classification task is whether the feature space is angularly discriminative, in this paper we propose a new loss function called the Enumerate Angular Triplet (EAT) loss. In addition, motivated by knowledge distillation, we present a new Cross-Modality Knowledge Distillation (CMKD) loss to narrow the gap between features of different modalities before feature embedding. Experimental results on the RegDB and SYSU-MM01 datasets show that the proposed method outperforms other state-of-the-art methods.
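The contrast between Euclidean and angular constraints can be sketched with a plain triplet loss computed on angles. The abstract above does not give the EAT formulation, so the following is a generic angular triplet loss with an assumed margin, shown next to its Euclidean counterpart; it is not the EAT loss itself, and the toy vectors are hypothetical.

```python
import math

def angle(u, v):
    """Angle in radians between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    cos = max(-1.0, min(1.0, dot / (norm_u * norm_v)))  # clamp for acos
    return math.acos(cos)

def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def triplet_loss(anchor, positive, negative, distance, margin=0.3):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)

# The positive shares the anchor's direction but has a larger norm;
# the negative is closer in Euclidean distance but 45 degrees away.
anchor, positive, negative = [1.0, 0.0], [3.0, 0.0], [1.0, 1.0]
angular = triplet_loss(anchor, positive, negative, angle)
euclid = triplet_loss(anchor, positive, negative, euclidean)
```

The angular version treats this triplet as already satisfied, while the Euclidean version still penalises it because of the difference in vector norms.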
RGB-Infrared person re-identification (RGB-IR Re-ID) aims to match persons across heterogeneous images captured by visible and thermal cameras, which is of great significance for surveillance systems under poor lighting conditions. Facing complex variances that include both conventional single-modality and additional inter-modality discrepancies, most existing RGB-IR Re-ID methods impose constraints at the image level, the feature level, or a hybrid of both. Although hybrid constraints perform better, they are usually implemented with heavy network architectures. In fact, previous efforts serve mainly as pioneering works in the new cross-modal Re-ID area and leave considerable room for improvement. This can be mainly attributed to: (1) the lack of abundant person image pairs from different modalities for training, and (2) the scarcity of salient modality-invariant features, especially in coarse representations, for effective matching. To address these issues, a novel Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global in a cascading manner, which results in a unified representation containing rich and enhanced semantic features. Furthermore, a marginal exponential centre (MeCen) loss is introduced to jointly eliminate mixed variances from intra- and inter-modal examples. Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning. Extensive experiments demonstrate that the proposed method outperforms all state-of-the-art methods by a large margin.