Do you want to publish a course? Click here

Multi-Task Learning via Co-Attentive Sharing for Pedestrian Attribute Recognition

87   0   0.0 ( 0 )
 Added by Haitian Zeng
 Publication date 2020
and research's language is English




Ask ChatGPT about the research

Learning to predict multiple attributes of a pedestrian is a multi-task learning problem. To share feature representation between two individual task networks, conventional methods like Cross-Stitch and Sluice network learn a linear combination of features or feature subspaces. However, linear combination rules out the complex interdependency between channels. Moreover, spatial information exchanging is less-considered. In this paper, we propose a novel Co-Attentive Sharing (CAS) module which extracts discriminative channels and spatial regions for more effective feature sharing in multi-task learning. The module consists of three branches, which leverage different channels for between-task feature fusing, attention generation and task-specific feature enhancing, respectively. Experiments on two pedestrian attribute recognition datasets show that our module outperforms the conventional sharing units and achieves superior results compared to the state-of-the-art approaches using many metrics.



rate research

Read More

In this paper, we first tackle the problem of pedestrian attribute recognition by video-based approach. The challenge mainly lies in spatial and temporal modeling and how to integrating them for effective and dynamic pedestrian representation. To solve this problem, a novel multi-task model based on the conventional neural network and temporal attention strategy is proposed. Since publicly available dataset is rare, two new large-scale video datasets with expanded attribute definition are presented, on which the effectiveness of both video-based pedestrian attribute recognition methods and the proposed new network architecture is well demonstrated. The two datasets are published on http://irip.buaa.edu.cn/mars_duke_attributes/index.html.
In this paper, we aim to improve the dataset foundation for pedestrian attribute recognition in real surveillance scenarios. Recognition of human attributes, such as gender, and clothes types, has great prospects in real applications. However, the development of suitable benchmark datasets for attribute recognition remains lagged behind. Existing human attribute datasets are collected from various sources or an integration of pedestrian re-identification datasets. Such heterogeneous collection poses a big challenge on developing high quality fine-grained attribute recognition algorithms. Furthermore, human attribute recognition are generally severely affected by environmental or contextual factors, such as viewpoints, occlusions and body parts, while existing attribute datasets barely care about them. To tackle these problems, we build a Richly Annotated Pedestrian (RAP) dataset from real multi-camera surveillance scenarios with long term collection, where data samples are annotated with not only fine-grained human attributes but also environmental and contextual factors. RAP has in total 41,585 pedestrian samples, each of which is annotated with 72 attributes as well as viewpoints, occlusions, body parts information. To our knowledge, the RAP dataset is the largest pedestrian attribute dataset, which is expected to greatly promote the study of large-scale attribute recognition systems. Furthermore, we empirically analyze the effects of different environmental and contextual factors on pedestrian attribute recognition. Experimental results demonstrate that viewpoints, occlusions and body parts information could assist attribute recognition a lot in real applications.
While recent studies on pedestrian attribute recognition have shown remarkable progress in leveraging complicated networks and attention mechanisms, most of them neglect the inter-image relations and an important prior: spatial consistency and semantic consistency of attributes under surveillance scenarios. The spatial locations of the same attribute should be consistent between different pedestrian images, eg, the ``hat attribute and the ``boots attribute are always located at the top and bottom of the picture respectively. In addition, the inherent semantic feature of the ``hat attribute should be consistent, whether it is a baseball cap, beret, or helmet. To fully exploit inter-image relations and aggregate human prior in the model learning process, we construct a Spatial and Semantic Consistency (SSC) framework that consists of two complementary regularizations to achieve spatial and semantic consistency for each attribute. Specifically, we first propose a spatial consistency regularization to focus on reliable and stable attribute-related regions. Based on the precise attribute locations, we further propose a semantic consistency regularization to extract intrinsic and discriminative semantic features. We conduct extensive experiments on popular benchmarks including PA100K, RAP, and PETA. Results show that the proposed method performs favorably against state-of-the-art methods without increasing parameters.
Multi-task learning is an effective learning strategy for deep-learning-based facial expression recognition tasks. However, most existing methods take into limited consideration the feature selection, when transferring information between different tasks, which may lead to task interference when training the multi-task networks. To address this problem, we propose a novel selective feature-sharing method, and establish a multi-task network for facial expression recognition and facial expression synthesis. The proposed method can effectively transfer beneficial features between different tasks, while filtering out useless and harmful information. Moreover, we employ the facial expression synthesis task to enlarge and balance the training dataset to further enhance the generalization ability of the proposed method. Experimental results show that the proposed method achieves state-of-the-art performance on those commonly used facial expression recognition benchmarks, which makes it a potential solution to real-world facial expression recognition problems.
Pedestrian Attribute Recognition (PAR) has aroused extensive attention due to its important role in video surveillance scenarios. In most cases, the existence of a particular attribute is strongly related to a partial region. Recent works design complicated modules, e.g., attention mechanism and proposal of body parts to localize the attribute corresponding region. These works further prove that localization of attribute specific regions precisely will help in improving performance. However, these part-information-based methods are still not accurate as well as increasing model complexity which makes it hard to deploy on realistic applications. In this paper, we propose a Deep Template Matching based method to capture body parts features with less computation. Further, we also proposed an auxiliary supervision method that use human pose keypoints to guide the learning toward discriminative local cues. Extensive experiments show that the proposed method outperforms and has lower computational complexity, compared with the state-of-the-art approaches on large-scale pedestrian attribute datasets, including PETA, PA-100K, RAP, and RAPv2 zs.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا