No Arabic abstract
Person re-identification (Re-ID) in real-world scenarios usually suffers from various degradation factors, e.g., low-resolution, weak illumination, blurring and adverse weather. On the one hand, these degradations lead to severe discriminative information loss, which significantly obstructs identity representation learning; on the other hand, the feature mismatch problem caused by low-level visual variations greatly reduces retrieval performance. An intuitive solution to this problem is to utilize low-level image restoration methods to improve the image quality. However, existing restoration methods cannot directly serve to real-world Re-ID due to various limitations, e.g., the requirements of reference samples, domain gap between synthesis and reality, and incompatibility between low-level and high-level methods. In this paper, to solve the above problem, we propose a degradation invariance learning framework for real-world person Re-ID. By introducing a self-supervised disentangled representation learning strategy, our method is able to simultaneously extract identity-related robust features and remove real-world degradations without extra supervision. We use low-resolution images as the main demonstration, and experiments show that our approach is able to achieve state-of-the-art performance on several Re-ID benchmarks. In addition, our framework can be easily extended to other real-world degradation factors, such as weak illumination, with only a few modifications.
Person re-identification has seen significant advancement in recent years. However, the ability of learned models to generalize to unknown target domains still remains limited. One possible reason for this is the lack of large-scale and diverse source training data, since manually labeling such a dataset is very expensive and privacy sensitive. To address this, we propose to automatically synthesize a large-scale person re-identification dataset following a set-up similar to real surveillance but with virtual environments, and then use the synthesized person images to train a generalizable person re-identification model. Specifically, we design a method to generate a large number of random UV texture maps and use them to create different 3D clothing models. Then, an automatic code is developed to randomly generate various different 3D characters with diverse clothes, races and attributes. Next, we simulate a number of different virtual environments using Unity3D, with customized camera networks similar to real surveillance systems, and import multiple 3D characters at the same time, with various movements and interactions along different paths through the camera networks. As a result, we obtain a virtual dataset, called RandPerson, with 1,801,816 person images of 8,000 identities. By training person re-identification models on these synthesized person images, we demonstrate, for the first time, that models trained on virtual data can generalize well to unseen target images, surpassing the models trained on various real-world datasets, including CUHK03, Market-1501, DukeMTMC-reID, and almost MSMT17. The RandPerson dataset is available at https://github.com/VideoObjectSearch/RandPerson.
Unsupervised person re-identification (re-ID) has become an important topic due to its potential to resolve the scalability problem of supervised re-ID models. However, existing methods simply utilize pseudo labels from clustering for supervision and thus have not yet fully explored the semantic information in data itself, which limits representation capabilities of learned models. To address this problem, we design a pretext task for unsupervised re-ID by learning visual consistency from still images and temporal consistency during training process, such that the clustering network can separate the images into semantic clusters automatically. Specifically, the pretext task learns semantically meaningful representations by maximizing the agreement between two encoded views of the same image via a consistency loss in latent space. Meanwhile, we optimize the model by grouping the two encoded views into same cluster, thus enhancing the visual consistency between views. Experiments on Market-1501, DukeMTMC-reID and MSMT17 datasets demonstrate that our proposed approach outperforms the state-of-the-art methods by large margins.
We address the person re-identification problem by effectively exploiting a globally discriminative feature representation from a sequence of tracked human regions/patches. This is in contrast to previous person re-id works, which rely on either single frame based person to person patch matching, or graph based sequence to sequence matching. We show that a progressive/sequential fusion framework based on long short term memory (LSTM) network aggregates the frame-wise human region representation at each time stamp and yields a sequence level human feature representation. Since LSTM nodes can remember and propagate previously accumulated good features and forget newly input inferior ones, even with simple hand-crafted features, the proposed recurrent feature aggregation network (RFA-Net) is effective in generating highly discriminative sequence level human representations. Extensive experimental results on two person re-identification benchmarks demonstrate that the proposed method performs favorably against state-of-the-art person re-identification methods.
Deep learning technology promotes the rapid development of person re-identifica-tion (re-ID). However, some challenges are still existing in the open-world. First, the existing re-ID research usually assumes only one factor variable (view, clothing, pedestrian pose, pedestrian occlusion, image resolution, RGB/IR modality) changing, ignoring the complexity of multi-factor variables in the open-world. Second, the existing re-ID methods are over depend on clothing color and other apparent features of pedestrian, which are easily disguised or changed. In addition, the lack of benchmark datasets containing multi-factor variables is also hindering the practically application of re-ID in the open-world. In this paper, we propose a low-cost and high-efficiency method to solve shortcomings of the existing re-ID research, such as unreliable feature selection, low efficiency of feature extraction, single research variable, etc. Our approach based on pose estimation model improved by group convolution to obtain the continuous key points of pedestrian, and utilize dynamic time warping (DTW) to measure the similarity of features between different pedestrians. At the same time, to verify the effectiveness of our method, we provide a miniature dataset which is closer to the real world and includes pedestrian changing clothes and cross-modality factor variables fusion. Extensive experiments are conducted and the results show that our method achieves Rank-1: 60.9%, Rank-5: 78.1%, and mAP: 49.2% on this dataset, which exceeds most existing state-of-art re-ID models.
Person re-identification (re-ID) under various occlusions has been a long-standing challenge as person images with different types of occlusions often suffer from misalignment in image matching and ranking. Most existing methods tackle this challenge by aligning spatial features of body parts according to external semantic cues or feature similarities but this alignment approach is complicated and sensitive to noises. We design DRL-Net, a disentangled representation learning network that handles occluded re-ID without requiring strict person image alignment or any additional supervision. Leveraging transformer architectures, DRL-Net achieves alignment-free re-ID via global reasoning of local features of occluded person images. It measures image similarity by automatically disentangling the representation of undefined semantic components, e.g., human body parts or obstacles, under the guidance of semantic preference object queries in the transformer. In addition, we design a decorrelation constraint in the transformer decoder and impose it over object queries for better focus on different semantic components. To better eliminate interference from occlusions, we design a contrast feature learning technique (CFL) for better separation of occlusion features and discriminative ID features. Extensive experiments over occluded and holistic re-ID benchmarks (Occluded-DukeMTMC, Market1501 and DukeMTMC) show that the DRL-Net achieves superior re-ID performance consistently and outperforms the state-of-the-art by large margins for Occluded-DukeMTMC.