In recent years, facial behavior analysis has shifted from laboratory-controlled conditions to challenging in-the-wild conditions, owing to the superior performance of deep learning based approaches in many real-world applications. However, the performance of deep learning approaches relies on the amount of training data. One of the major problems with data acquisition is the need to annotate large amounts of training data. Labeling huge training sets demands extensive human effort with strong domain expertise in facial expressions or action units, which is difficult to obtain in real-time environments. Moreover, the labeling process is highly vulnerable to the ambiguity of expressions or action units, especially their intensities, due to the bias induced by domain experts. There is therefore an imperative need to address the problem of facial behavior analysis with weak annotations. In this paper, we provide a comprehensive review of weakly supervised learning (WSL) approaches for facial behavior analysis with both categorical and dimensional labels, along with the associated challenges and potential research directions. First, we introduce the various types of weak annotations in the context of facial behavior analysis and the challenges they pose. We then systematically review the existing state-of-the-art approaches and provide a taxonomy of these approaches along with their insights and limitations. In addition, we summarize the datasets widely used in the reviewed literature, the performance of these approaches, and the evaluation protocols. Finally, we discuss the remaining challenges and opportunities, together with potential research directions for applying facial behavior analysis with weak labels in real-life situations.
Affective behavior analysis is an important part of human-computer interaction. Existing multi-task affective behavior recognition methods suffer from incompletely labeled datasets. To tackle this problem, this paper presents a semi-supervised model with a mean teacher framework to leverage additional unlabeled data. Specifically, a multi-task model is proposed to learn three different kinds of facial affective representations simultaneously. The proposed model is then instantiated as student and teacher networks. When training with unlabeled data, the teacher network predicts pseudo labels for training the student network, allowing it to learn from unlabeled data. Experimental results show that our proposed method achieves much better performance than the baseline model, ranking 4th in both competition tracks 1 and 2 and 6th in track 3, which verifies that the proposed network can effectively learn from incomplete datasets.
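A minimal PyTorch sketch of the mean-teacher pseudo-labeling step described above; the classification loss, EMA decay, and single-task simplification are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def ema_update(teacher, student, decay=0.99):
        # Teacher weights track an exponential moving average of the student's.
        with torch.no_grad():
            for t, s in zip(teacher.parameters(), student.parameters()):
                t.mul_(decay).add_(s, alpha=1 - decay)

    def semi_supervised_step(student, teacher, x_labeled, y, x_unlabeled, optimizer):
        # Supervised loss on the labeled batch.
        loss = F.cross_entropy(student(x_labeled), y)
        # Teacher predicts pseudo labels for the unlabeled batch.
        with torch.no_grad():
            pseudo = teacher(x_unlabeled).argmax(dim=1)
        loss = loss + F.cross_entropy(student(x_unlabeled), pseudo)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_update(teacher, student)
        return loss.item()

In the multi-task setting of the paper, the same pattern would simply be repeated per head (expression, action unit, valence-arousal), summing the per-task losses.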
Reinforcement learning (RL) is a powerful framework for learning to take actions to solve tasks. However, in many settings, an agent must winnow down the inconceivably large space of all possible tasks to the single task that it is currently being asked to solve. Can we instead constrain the space of tasks to those that are semantically meaningful? In this work, we introduce a framework for using weak supervision to automatically disentangle this semantically meaningful subspace of tasks from the enormous space of nonsensical chaff tasks. We show that this learned subspace enables efficient exploration and provides a representation that captures distance between states. On a variety of challenging, vision-based continuous control problems, our approach leads to substantial performance gains, particularly as the complexity of the environment grows.
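One common way to realize this kind of weak supervision, sketched below as an assumption rather than the paper's exact objective, is pairwise factor comparison: for two observations, a binary label says which one has the larger value of each semantic factor, and each coordinate of a learned latent space is trained to predict its own comparison, yielding a disentangled task subspace.

    import torch
    import torch.nn.functional as F

    def factor_comparison_loss(encoder, obs_a, obs_b, labels):
        # encoder maps observations to a K-dim latent; labels[:, k] is 1
        # if semantic factor k is larger in obs_a than in obs_b, else 0.
        z_a, z_b = encoder(obs_a), encoder(obs_b)
        # Each latent coordinate serves as the logit for its own comparison,
        # tying one dimension to one semantic factor.
        logits = z_a - z_b
        return F.binary_cross_entropy_with_logits(logits, labels.float())

Distances in the resulting latent space then approximate semantically meaningful distances between states, which is what enables the efficient exploration claimed above.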
Phrase grounding, the problem of associating image regions to caption words, is a crucial component of vision-language tasks. We show that phrase grounding can be learned by optimizing word-region attention to maximize a lower bound on mutual information between images and caption words. Given pairs of images and captions, we maximize compatibility of the attention-weighted regions and the words in the corresponding caption, compared to non-corresponding pairs of images and captions. A key idea is to construct effective negative captions for learning through language model guided word substitutions. Training with our negatives yields a $\sim 10\%$ absolute gain in accuracy over randomly-sampled negatives from the training data. Our weakly supervised phrase grounding model trained on COCO-Captions shows a healthy gain of $5.7\%$ to achieve $76.7\%$ accuracy on the Flickr30K Entities benchmark.
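A compact sketch of the contrastive objective described above, assuming precomputed region and word features; the dot-product compatibility and mean pooling are simplifying assumptions, not the paper's exact scoring function.

    import torch
    import torch.nn.functional as F

    def grounding_score(regions, words):
        # regions: (R, D) region features; words: (T, D) word features.
        attn = F.softmax(words @ regions.t(), dim=-1)   # (T, R) word-region attention
        attended = attn @ regions                       # (T, D) attention-weighted regions
        return (attended * words).sum(-1).mean()        # mean word-image compatibility

    def info_nce_loss(regions, pos_words, neg_words_list):
        # Positive caption vs. negative captions (e.g., produced by the
        # language-model-guided word substitutions described above).
        scores = torch.stack([grounding_score(regions, pos_words)] +
                             [grounding_score(regions, w) for w in neg_words_list])
        return F.cross_entropy(scores.unsqueeze(0),
                               torch.zeros(1, dtype=torch.long))

Maximizing this InfoNCE-style objective is what yields the lower bound on mutual information between images and caption words; the learned attention map itself provides the groundings at test time.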
The COVID-19 pandemic and social distancing call for reliable human face recognition in a variety of abnormal situations. However, no prior research has studied the influence of the glass factor on facial recognition systems. This paper provides a comprehensive study of the glass factor in two steps: data collection and accuracy testing. Data collection involves capturing human face images under different conditions, such as clear glass, glass covered with water, and glass covered with mist. On the collected data, an existing state-of-the-art face detection and recognition system built upon MTCNN and Inception V1 deep nets is tested for further analysis. Experimental data shows that 1) the system is robust for classification when comparing real-time images, and 2) it fails at determining whether two images show the same person when comparing a real-time disturbed image with frontal ones.
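For readers who want to reproduce this kind of test, here is a minimal verification sketch using the facenet-pytorch package, whose MTCNN and InceptionResnetV1 models are a close analogue of the nets named above (the exact nets, file names, and distance threshold here are assumptions).

    import torch
    from PIL import Image
    from facenet_pytorch import MTCNN, InceptionResnetV1

    mtcnn = MTCNN(image_size=160)                              # face detection + alignment
    resnet = InceptionResnetV1(pretrained='vggface2').eval()   # embedding network

    def embed(path):
        face = mtcnn(Image.open(path))        # aligned face crop, or None if no face found
        with torch.no_grad():
            return resnet(face.unsqueeze(0))  # (1, 512) face embedding

    # Verification: a small embedding distance suggests the same identity.
    d = (embed('frontal.jpg') - embed('through_glass.jpg')).norm().item()
    print('same person' if d < 1.0 else 'different person')   # threshold is illustrative

The failure mode reported above corresponds to the distance between the disturbed image and the frontal one exceeding any reasonable threshold, even when both embeddings are computed successfully.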
We address the problem of distance metric learning in visual similarity search, defined as learning an image embedding model that projects images into a Euclidean space where semantically and visually similar images are closer and dissimilar images are farther from one another. We present a weakly supervised adaptive triplet loss (ATL) capable of capturing fine-grained semantic similarity, which encourages the learned image embedding models to generalize well on cross-domain data. The method uses weakly labeled product-description data to implicitly determine fine-grained semantic classes, avoiding the need to annotate large amounts of training data. We evaluate on the Amazon fashion retrieval benchmark and the DeepFashion in-shop retrieval data. The method boosts the performance of the triplet loss baseline by 10.6% on cross-domain data and outperforms the state-of-the-art model on all evaluation metrics.
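A sketch of one plausible form of such an adaptive triplet loss: the margin shrinks as the semantic similarity between the positive's and the negative's fine-grained classes grows, so semantically close classes are separated more gently. The specific adaptation rule and base margin are assumptions, not the paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def adaptive_triplet_loss(anchor, positive, negative, sem_sim, base_margin=0.2):
        # sem_sim in [0, 1]: semantic similarity between the positive's and the
        # negative's fine-grained classes, derived from weakly labeled
        # product-description data.
        d_pos = F.pairwise_distance(anchor, positive)
        d_neg = F.pairwise_distance(anchor, negative)
        margin = base_margin * (1.0 - sem_sim)       # closer classes, smaller margin
        return F.relu(d_pos - d_neg + margin).mean()

Compared with a fixed-margin triplet loss, letting the weak labels modulate the margin is what allows the embedding to encode fine-grained similarity rather than a flat same/different split.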