Shuffle Transformer with Feature Alignment for Video Face Parsing

Posted by Zilong Huang
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





This is a short technical report introducing the solution of Team TCParser for the Short-video Face Parsing Track of the 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. In this paper, we introduce a strong backbone, a cross-window based Shuffle Transformer, for producing accurate face parsing representations. To further obtain finer segmentation results, especially on the edges, we introduce a Feature Alignment Aggregation (FAA) module, which effectively relieves the feature misalignment issue caused by multi-resolution feature aggregation. Benefiting from the stronger backbone and better feature aggregation, the proposed method achieves a score of 86.9519% in the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge, ranking first place.
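The report does not include code, so the following is a minimal sketch of one common flow-based way to align multi-resolution features before aggregation, in the spirit of what the FAA module is described as doing: a small convolution predicts a 2-D offset field that warps the upsampled coarse feature onto the fine one. All module and parameter names here are hypothetical, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlignmentAggregation(nn.Module):
    """Hypothetical sketch of flow-based alignment for multi-resolution fusion.

    A small conv head predicts a 2-D offset field from the concatenated
    fine and (upsampled) coarse features; the coarse feature is warped
    with grid_sample before being aggregated with the fine one.
    """

    def __init__(self, channels):
        super().__init__()
        self.flow = nn.Conv2d(channels * 2, 2, kernel_size=3, padding=1)

    def forward(self, fine, coarse):
        # fine: (B, C, H, W); coarse: (B, C, h, w) with h < H, w < W
        B, C, H, W = fine.shape
        coarse_up = F.interpolate(coarse, size=(H, W), mode="bilinear",
                                  align_corners=False)
        offset = self.flow(torch.cat([fine, coarse_up], dim=1))  # (B, 2, H, W)

        # Identity sampling grid in normalized [-1, 1] coordinates.
        ys, xs = torch.meshgrid(
            torch.linspace(-1, 1, H, device=fine.device),
            torch.linspace(-1, 1, W, device=fine.device),
            indexing="ij")
        base = torch.stack((xs, ys), dim=-1).unsqueeze(0)  # (1, H, W, 2)
        grid = base + offset.permute(0, 2, 3, 1)  # offsets shift the grid

        aligned = F.grid_sample(coarse_up, grid, mode="bilinear",
                                align_corners=False)
        return fine + aligned  # simple additive aggregation
```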




Read also

In this paper, we present a deep learning based image feature extraction method designed specifically for face images. To train the feature extraction model, we construct a large-scale photo-realistic face image dataset with ground-truth correspondence between multi-view face images, which are synthesized from real photographs via an inverse rendering procedure. The deep face feature (DFF) is trained using correspondence between face images rendered from different views. Using the trained DFF model, we can extract a feature vector for each pixel of a face image, which distinguishes different facial regions and is shown to be more effective than general-purpose feature descriptors for face-related tasks such as matching and alignment. Based on the DFF, we develop a robust face alignment method that iteratively updates landmarks, pose and 3D shape. Extensive experiments demonstrate that our method achieves state-of-the-art results for face alignment on highly unconstrained face images.
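As a toy illustration of how such per-pixel features can be used for matching, the sketch below finds, for each pixel of one face image, the most similar pixel of another under cosine similarity. The function name and shapes are assumptions for illustration; the paper's matching and alignment procedure is more involved.

```python
import torch
import torch.nn.functional as F

def dense_feature_match(feat_a, feat_b):
    """Match every pixel of image A to its nearest pixel in image B.

    feat_a, feat_b: (C, H, W) per-pixel feature maps from a trained
    extractor. Returns (H, W) flat indices into image B. Note the
    similarity matrix is O((H*W)^2), so this suits small crops only.
    """
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, -1), dim=0)  # (C, H*W), unit columns
    b = F.normalize(feat_b.reshape(C, -1), dim=0)
    sim = a.t() @ b                 # (H*W, H*W) cosine similarities
    return sim.argmax(dim=1).view(H, W)
```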
Very recently, window-based Transformers, which compute self-attention within non-overlapping local windows, have demonstrated promising results on image classification, semantic segmentation, and object detection. However, less study has been devoted to the cross-window connection, which is the key element for improving representation ability. In this work, we revisit the spatial shuffle as an efficient way to build connections among windows. As a result, we propose a new vision transformer, named Shuffle Transformer, which is highly efficient and easy to implement by modifying two lines of code. Furthermore, a depth-wise convolution is introduced to complement the spatial shuffle and enhance neighbor-window connections. The proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification, object detection, and semantic segmentation. Code will be released for reproduction.
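A minimal sketch of the spatial-shuffle idea, written as a window-partition helper: with `shuffle=False` each window covers a contiguous block of pixels, while with `shuffle=True` the window index and the in-window index swap roles, so each window gathers tokens from distant grid cells, a spatial analogue of channel shuffle. The exact factorization in the released code may differ.

```python
import torch

def window_partition(x, ws, shuffle=False):
    """Partition a (B, H, W, C) feature map into ws x ws token windows.

    shuffle=False: each window holds ws x ws contiguous pixels.
    shuffle=True:  the outer/inner factors of H and W are swapped, so a
    window collects one pixel from each of the ws x ws strided cells,
    building long-range connections across windows.
    """
    B, H, W, C = x.shape
    if shuffle:
        x = x.view(B, ws, H // ws, ws, W // ws, C)
        x = x.permute(0, 2, 4, 1, 3, 5)  # grid indices first, window second
    else:
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5)
    return x.reshape(-1, ws * ws, C)  # (B * num_windows, ws * ws, C)
```

Swapping `(H // ws, ws)` for `(ws, H // ws)` in the reshape is the kind of two-line change the abstract appears to refer to; window attention then runs unchanged on the shuffled windows.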
This is a short technical report introducing the solution of Team Rat for the Short-video Face Parsing Track of the 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. In this report, we propose an Edge-Aware Network (EANet) that uses edge information to refine the segmentation edges. To further sharpen the edge results, we introduce an edge attention loss that computes cross-entropy only on edge pixels; it effectively reduces classification errors around edges and yields smoother boundaries. Benefiting from the edge information and the edge attention loss, the proposed EANet achieves 86.16% accuracy in the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge, ranking third place.
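A minimal sketch of a cross-entropy restricted to edge pixels, assuming a precomputed boolean edge mask (for example from dilated label boundaries); the paper's exact formulation may differ.

```python
import torch
import torch.nn.functional as F

def edge_attention_loss(logits, target, edge_mask):
    """Cross-entropy averaged over edge pixels only.

    logits:    (B, num_classes, H, W) raw network outputs
    target:    (B, H, W) integer class labels
    edge_mask: (B, H, W) boolean mask, True on boundary pixels
    """
    per_pixel = F.cross_entropy(logits, target, reduction="none")  # (B, H, W)
    edge = edge_mask.float()
    return (per_pixel * edge).sum() / edge.sum().clamp(min=1.0)
```

In training this would typically be added, with some weight, to the ordinary full-image cross-entropy rather than replace it.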
Face parsing aims to predict pixel-wise labels for the facial components of a target face in an image. Existing approaches usually crop the target face from the input image with respect to a bounding box calculated during pre-processing, and thus can only parse inner facial Regions of Interest (RoIs). Peripheral regions like hair are ignored, and nearby faces that are partially included in the bounding box can cause distractions. Moreover, these methods are only trained and evaluated on near-frontal portrait images, so their performance on in-the-wild cases remains unexplored. To address these issues, this paper makes three contributions. First, we introduce the iBugMask dataset for face parsing in the wild, which consists of 21,866 training images and 1,000 testing images. The training images are obtained by augmenting an existing dataset with large face poses. The testing images are manually annotated with 11 facial regions, and there are large variations in sizes, poses, expressions and backgrounds. Second, we propose the RoI Tanh-polar transform, which warps the whole image to a Tanh-polar representation with a fixed ratio between the face area and the context, guided by the target bounding box. The new representation contains all the information in the original image and allows for rotation equivariance in convolutional neural networks (CNNs). Third, we propose a hybrid residual representation learning block, coined HybridBlock, that contains convolutional layers in both the Tanh-polar space and the Tanh-Cartesian space, allowing for receptive fields of different shapes in CNNs. Through extensive experiments, we show that the proposed method improves the state of the art for face parsing in the wild and does not require facial landmarks for alignment.
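The snippet below is a hypothetical, simplified stand-in for such a warp, not the paper's exact RoI Tanh-polar transform: output rows index the angle, columns index a tanh-compressed radius around the target box, so the face keeps most of the resolution while the surrounding context is squeezed toward the outer columns.

```python
import math

import torch
import torch.nn.functional as F

def roi_tanh_polar_warp(img, bbox, out_h=256, out_w=256):
    """Warp an image to a polar grid with artanh radial stretching.

    img:  (B, C, H, W) input batch
    bbox: (cx, cy, w, h) target face box in pixels
    A point (r, theta) of the output samples the input at
    center + artanh(r) * (half box size) * (cos theta, sin theta).
    """
    B, C, H, W = img.shape
    cx, cy, bw, bh = bbox

    theta = torch.linspace(0.0, 2.0 * math.pi, out_h)  # one angle per row
    r = torch.linspace(0.0, 1.0, out_w + 1)[:-1]       # stay below r = 1
    radius = torch.atanh(r).unsqueeze(0)               # (1, out_w)
    tt = theta.unsqueeze(1)                            # (out_h, 1)
    xs = cx + radius * (bw / 2.0) * torch.cos(tt)      # (out_h, out_w)
    ys = cy + radius * (bh / 2.0) * torch.sin(tt)

    # Normalize to [-1, 1] for grid_sample; points falling outside the
    # image are zero-padded by the default padding mode.
    grid = torch.stack((2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1), dim=-1)
    grid = grid.unsqueeze(0).expand(B, -1, -1, -1)
    return F.grid_sample(img, grid, mode="bilinear", align_corners=True)
```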
With the advancement of IoT and artificial intelligence technologies, and the need for rapid application growth in fields such as security entrance control and financial trading, facial information processing has become an important means of achieving identity authentication and information security. In this paper, we propose a multi-feature fusion algorithm based on integral histograms and a real-time-updating particle filter tracking module. First, edge and colour features are extracted, and weighting methods combine the colour histogram and edge features to describe the face; the fusion of colour and edge features is made adaptive through fusion coefficients to improve tracking reliability. Then, the integral histogram is integrated into the particle filtering algorithm to simplify the computation for each particle. Finally, the tracking window size is adjusted in real time according to the change in the average distance from the particle centre to the edges of the current and initial models, which reduces drift and achieves stable tracking under significant changes in target scale. The results show that the algorithm improves video tracking accuracy, simplifies per-particle computation, improves speed, and has good anti-interference ability and robustness.
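As a sketch of the adaptive cue fusion, the snippet below weights colour and edge similarity scores by how discriminative each cue is across the particle set; the adaptive rule shown is one plausible choice for the fusion coefficient, not necessarily the paper's.

```python
import numpy as np

def fuse_particle_weights(colour_sims, edge_sims):
    """Adaptively fuse colour and edge cues into particle weights.

    colour_sims, edge_sims: per-particle similarity scores against the
    reference colour histogram and edge model (e.g. Bhattacharyya
    coefficients). A cue whose scores vary more across particles is
    treated as more discriminative and receives a larger coefficient.
    """
    colour_sims = np.asarray(colour_sims, dtype=float)
    edge_sims = np.asarray(edge_sims, dtype=float)

    s_c, s_e = colour_sims.std(), edge_sims.std()
    alpha = s_c / (s_c + s_e + 1e-12)  # adaptive fusion coefficient

    weights = alpha * colour_sims + (1.0 - alpha) * edge_sims
    return weights / weights.sum()     # normalized particle weights
```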