ترغب بنشر مسار تعليمي؟ اضغط هنا

Combined Depth Space based Architecture Search For Person Re-identification

137   0   0.0 ( 0 )
 نشر من قبل Hanjun Li
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Most works on person re-identification (ReID) take advantage of large backbone networks such as ResNet, which are designed for image classification instead of ReID, for feature extraction. However, these backbones may not be computationally efficient or the most suitable architectures for ReID. In this work, we aim to design a lightweight and suitable network for ReID. We propose a novel search space called Combined Depth Space (CDS), based on which we search for an efficient network architecture, which we call CDNet, via a differentiable architecture search algorithm. Through the use of the combined basic building blocks in CDS, CDNet tends to focus on combined pattern information that is typically found in images of pedestrians. We then propose a low-cost search strategy named the Top-k Sample Search strategy to make full use of the search space and avoid trapping in local optimal result. Furthermore, an effective Fine-grained Balance Neck (FBLNeck), which is removable at the inference time, is presented to balance the effects of triplet loss and softmax loss during the training process. Extensive experiments show that our CDNet (~1.8M parameters) has comparable performance with state-of-the-art lightweight networks.

قيم البحث

اقرأ أيضاً

Recently, the Transformer module has been transplanted from natural language processing to computer vision. This paper applies the Transformer to video-based person re-identification, where the key issue is to extract the discriminative information f rom a tracklet. We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting, arguably due to a large number of attention parameters and insufficient training data. To solve this problem, we propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains with the perception-constrained Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID, and LS-VID, especially when the training and testing data are from different domains. More importantly, our research sheds light on the application of the Transformer on highly-structured visual data.
255 - Jinjie You , Ancong Wu , Xiang Li 2016
Most existing person re-identification (re-id) models focus on matching still person images across disjoint camera views. Since only limited information can be exploited from still images, it is hard (if not impossible) to overcome the occlusion, pos e and camera-view change, and lighting variation problems. In comparison, video-based re-id methods can utilize extra space-time information, which contains much more rich cues for matching to overcome the mentioned problems. However, we find that when using video-based representation, some inter-class difference can be much more obscure than the one when using still-image based representation, because different people could not only have similar appearance but also have similar motions and actions which are hard to align. To solve this problem, we propose a top-push distance learning model (TDL), in which we integrate a top-push constrain for matching video features of persons. The top-push constraint enforces the optimization on top-rank matching in re-id, so as to make the matching model more effective towards selecting more discriminative features to distinguish different persons. Our experiments show that the proposed video-based re-id framework outperforms the state-of-the-art video-based re-id methods.
It is prohibitively expensive to annotate a large-scale video-based person re-identification (re-ID) dataset, which makes fully supervised methods inapplicable to real-world deployment. How to maximally reduce the annotation cost while retaining the re-ID performance becomes an interesting problem. In this paper, we address this problem by integrating an active learning scheme into a deep learning framework. Noticing that the truly matched tracklet-pairs, also denoted as true positives (TP), are the most informative samples for our re-ID model, we propose a sampling criterion to choose the most TP-likely tracklet-pairs for annotation. A view-aware sampling strategy considering view-specific biases is designed to facilitate candidate selection, followed by an adaptive resampling step to leave out the selected candidates that are unnecessary to annotate. Our method learns the re-ID model and updates the annotation set iteratively. The re-ID model is supervised by the tracklets pesudo labels that are initialized by treating each tracklet as a distinct class. With the gained annotations of the actively selected candidates, the tracklets pesudo labels are updated by label merging and further used to re-train our re-ID model. While being simple, the proposed method demonstrates its effectiveness on three video-based person re-ID datasets. Experimental results show that less than 3% pairwise annotations are needed for our method to reach comparable performance with the fully-supervised setting.
Person re-identification (re-id) aims to match the same person from images taken across multiple cameras. Most existing person re-id methods generally require a large amount of identity labeled data to act as discriminative guideline for representati on learning. Difficulty in manually collecting identity labeled data leads to poor adaptability in practical scenarios. To overcome this problem, we propose an unsupervised center-based clustering approach capable of progressively learning and exploiting the underlying re-id discriminative information from temporal continuity within a camera. We call our framework Temporal Continuity based Unsupervised Learning (TCUL). Specifically, TCUL simultaneously does center based clustering of unlabeled (target) dataset and fine-tunes a convolutional neural network (CNN) pre-trained on irrelevant labeled (source) dataset to enhance discriminative capability of the CNN for the target dataset. Furthermore, it exploits temporally continuous nature of images within-camera jointly with spatial similarity of feature maps across-cameras to generate reliable pseudo-labels for training a re-identification model. As the training progresses, number of reliable samples keep on growing adaptively which in turn boosts representation ability of the CNN. Extensive experiments on three large-scale person re-id benchmark datasets are conducted to compare our framework with state-of-the-art techniques, which demonstrate superiority of TCUL over existing methods.
In this work, we present a deep convolutional pyramid person matching network (PPMN) with specially designed Pyramid Matching Module to address the problem of person re-identification. The architecture takes a pair of RGB images as input, and outputs a similiarity value indicating whether the two input images represent the same person or not. Based on deep convolutional neural networks, our approach first learns the discriminative semantic representation with the semantic-component-aware features for persons and then employs the Pyramid Matching Module to match the common semantic-components of persons, which is robust to the variation of spatial scales and misalignment of locations posed by viewpoint changes. The above two processes are jointly optimized via a unified end-to-end deep learning scheme. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach against the state-of-the-art approaches, especially on the rank-1 recognition rate.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا