We present a novel facial expression recognition network, called Distract your Attention Network (DAN). Our method is based on two key observations. Firstly, multiple classes share inherently similar underlying facial appearance, and their differences can be subtle. Secondly, facial expressions exhibit themselves through multiple facial regions simultaneously, and recognition requires a holistic approach that encodes high-order interactions among local features. To address these issues, we propose DAN with three key components: Feature Clustering Network (FCN), Multi-head cross Attention Network (MAN), and Attention Fusion Network (AFN). The FCN extracts robust features by adopting a large-margin learning objective to maximize class separability. In addition, the MAN instantiates a number of attention heads to simultaneously attend to multiple facial areas and build attention maps on these regions. Further, the AFN distracts these attentions to multiple locations before fusing the attention maps into a comprehensive one. Extensive experiments on three public datasets (AffectNet, RAF-DB, and SFEW 2.0) verified that the proposed method consistently achieves state-of-the-art facial expression recognition performance. Code will be made available at https://github.com/yaoing/DAN.
117 - Tao Wang, Li Yuan, Yunpeng Chen 2021
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result. Though effective, translating the full feature map can be costly due to redundant computation on some areas, such as the background. In this work, we encapsulate the idea of reducing spatial redundancy into a novel poll and pool (PnP) sampling module, with which we build an end-to-end PnP-DETR architecture that adaptively allocates its computation spatially to be more efficient. Concretely, the PnP module abstracts the image feature map into fine foreground object feature vectors and a small number of coarse background contextual feature vectors. The transformer models information interaction within the fine-coarse feature space and translates the features into the detection result. Moreover, the PnP-augmented model can instantly achieve various desired trade-offs between performance and computation with a single model by varying the sampled feature length, without requiring multiple models to be trained as existing methods do. It thus offers greater flexibility for deployment in diverse scenarios with varying computation constraints. We further validate the generalizability of the PnP module on panoptic segmentation and the recent transformer-based image recognition model ViT, and show consistent efficiency gains. We believe our method makes a step towards efficient visual analysis with transformers, wherein spatial redundancy is commonly observed. Code will be available at https://github.com/twangnh/pnp-detr.
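The split into fine and coarse feature vectors described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `poll_and_pool`, the assumption that per-location foreground scores are already given, and the use of a single softmax-weighted average as the only coarse vector are all simplifications introduced here for illustration.

```python
import math


def poll_and_pool(features, scores, num_fine):
    """Split feature vectors into fine (kept) and coarse (pooled) sets.

    Hypothetical simplification: scores are assumed to be given, and all
    remaining locations are pooled into one coarse vector.

    features: list of equal-length lists of floats, one per spatial location
    scores:   list of floats, higher means more likely foreground
    num_fine: number of top-scoring locations kept as fine features
    """
    # "Poll": rank locations by score and keep the top num_fine as-is.
    order = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    fine_idx, rest_idx = order[:num_fine], order[num_fine:]
    fine = [features[i] for i in fine_idx]
    if not rest_idx:
        return fine, []

    # "Pool": softmax-weighted average of the remaining locations.
    top = max(scores[i] for i in rest_idx)
    weights = [math.exp(scores[i] - top) for i in rest_idx]
    total = sum(weights)
    dim = len(features[0])
    coarse = [
        sum(w * features[i][d] for w, i in zip(weights, rest_idx)) / total
        for d in range(dim)
    ]
    return fine, [coarse]


features = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]
scores = [0.9, 0.1, 0.1, 0.5]
fine, coarse = poll_and_pool(features, scores, num_fine=2)
```

Varying `num_fine` changes the length of the sequence passed on to the transformer, which is the knob the abstract refers to when it mentions trading performance against computation with a single model.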
We study the tagging of Higgs exotic decay signals using different types of deep neural networks (DNNs), focusing on the $W^\pm h$ associated production channel followed by the Higgs decaying into $n$ $b$-quarks with $n=4$, 6 and 8. All the Higgs decay products are collected into a fat-jet, to which we apply further selection using the DNNs. Three kinds of DNNs are considered, namely convolutional neural network (CNN), recursive neural network (RecNN) and particle flow network (PFN). The PFN can achieve the best performance because its structure allows it to incorporate more information in addition to the four-momenta of the jet constituents, such as particle ID and track parameters. Using the PFN as an example, we verify that it can serve as an efficient tagger even though it is trained on a different event topology with a different $b$-multiplicity from the actual signal. The projected sensitivities to the branching ratio of the Higgs decaying into $n$ $b$-quarks at the HL-LHC are 10%, 3% and 1% for $n=4$, 6 and 8, respectively.
120 - Tao Wang 2021
We obtain lower bounds for ergodic convergence rates, including spectral gaps and convergence rates in strong ergodicity, for time-changed symmetric Lévy processes by using harmonic functions and reversible measures. As direct applications, explicit sufficient conditions for exponential and strong ergodicity are given. Some examples are also presented.
Semantic segmentation is an important task in computer vision, from which important usage scenarios such as autonomous driving and scene parsing are derived. Because of our interest in the task of video semantic segmentation, we participated in this competition. In this report, we briefly introduce the solutions of team BetterThing for the ICCV2021 - Video Scene Parsing in the Wild Challenge. A Transformer is used as the backbone for extracting video frame features, and the final result is the aggregation of the outputs of two Transformer models, SWIN and VOLO. This solution achieves 57.3% mIoU, ranking 3rd in the Video Scene Parsing in the Wild Challenge.
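The report does not say here how the two model outputs are aggregated. As a hedged illustration only, the sketch below assumes a plain average of per-pixel class probabilities followed by an argmax, and then computes mean intersection-over-union (mIoU) in its standard form. The function names and the toy inputs are hypothetical.

```python
def ensemble_labels(probs_a, probs_b):
    """Average per-pixel class probabilities from two models, then argmax.

    Assumption: simple averaging; the actual aggregation rule is not stated.
    """
    labels = []
    for pa, pb in zip(probs_a, probs_b):
        avg = [(x + y) / 2 for x, y in zip(pa, pb)]
        labels.append(max(range(len(avg)), key=avg.__getitem__))
    return labels


def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that occur in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union:  # skip classes absent from both pred and target
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: three pixels, two classes.
probs_a = [[0.6, 0.4], [0.2, 0.8], [0.7, 0.3]]
probs_b = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
pred = ensemble_labels(probs_a, probs_b)
score = mean_iou(pred, target=[0, 1, 0], num_classes=2)
```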
123 - He Liu, Tao Wang, Yidong Li 2021
In recent years, powered by the discriminative representations learned via graph neural network (GNN) models, deep graph matching methods have made great progress in the task of matching semantic features. However, these methods usually rely on heuristically generated graph patterns, which may introduce unreliable relationships that hurt the matching performance. In this paper, we propose a joint graph learning and matching network, named GLAM, to explore reliable graph structures for boosting graph matching. GLAM adopts a pure attention-based framework for both graph learning and graph matching. Specifically, it employs two types of attention mechanisms, self-attention and cross-attention, for the task. The self-attention discovers the relationships between features and further updates the feature representations over the learnt structures; the cross-attention computes cross-graph correlations between the two feature sets to be matched for feature reconstruction. Moreover, the final matching solution is directly derived from the output of the cross-attention layer, without employing a specific matching decision module. The proposed method is evaluated on three popular visual matching benchmarks (Pascal VOC, Willow Object and SPair-71k), and it outperforms previous state-of-the-art graph matching methods by significant margins on all benchmarks. Furthermore, the graph patterns learnt by our model are shown to remarkably enhance previous deep graph matching methods when their handcrafted graph structures are replaced with the learnt ones.
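The idea of reading a matching directly off cross-attention can be sketched as follows. This is an illustrative simplification and not GLAM itself: it uses scaled dot-product attention without learned projections, and it picks the highest-weight target per source feature without enforcing a one-to-one assignment. The function name `cross_attention_match` is hypothetical.

```python
import math


def cross_attention_match(feats_a, feats_b):
    """Return cross-attention weights from set A to set B and a greedy match.

    Assumptions: raw features serve as both queries and keys, and each row
    of the attention matrix is matched to its highest-weight column.
    """
    scale = math.sqrt(len(feats_a[0]))
    attention = []
    for query in feats_a:
        # Scaled dot-product similarity to every feature in the other set.
        logits = [sum(x * y for x, y in zip(query, key)) / scale for key in feats_b]
        # Numerically stable softmax over the other set.
        top = max(logits)
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        attention.append([e / total for e in exps])
    matches = [max(range(len(row)), key=row.__getitem__) for row in attention]
    return attention, matches


feats_a = [[1.0, 0.0], [0.0, 1.0]]
feats_b = [[0.0, 2.0], [2.0, 0.0]]
attention, matches = cross_attention_match(feats_a, feats_b)
```

Here `matches[i]` is the index in `feats_b` assigned to feature `i` of `feats_a`; no separate matching decision step is applied to the attention weights.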
433 - Xu Shi, Jintao Wang, Guozhi Chen 2021
Reconfigurable intelligent surface (RIS) has been recognized as a potential technology for beyond-5G systems and has attracted tremendous research attention. However, channel estimation in RIS-aided systems is still a critical challenge due to the excessive number of parameters in the cascaded channel. The existing compressive sensing (CS)-based RIS estimation schemes exploit only incomplete sparsity, which leads to redundant pilot consumption. In this paper, we exploit the specific triple-structured sparsity of the cascaded channel, i.e., the common column sparsity, the structured row sparsity after offset compensation, and the common offsets among all users. A novel multi-user joint estimation algorithm is then proposed. Simulation results show that our approach can significantly reduce pilot overhead in both ULA and UPA scenarios.
Magnetic graphene nanoribbons (GNRs) have become promising candidates for future applications, including quantum technologies. Here, we characterize magnetic states hosted by chiral graphene nanoribbons (chGNRs). The substitution of a hydrogen atom at the chGNR edge by a ketone group effectively adds one $p_z$ electron to the $\pi$-electron network, thus producing an unpaired $\pi$ radical. A closely related scenario occurs for regular ketone-functionalized chGNRs in which one oxygen atom is missing. Two such radical states can interact via exchange coupling, and we study those interactions as a function of their relative position, which includes a remarkable dependence on the chirality, as well as on the nature of the surrounding GNR, i.e., with or without ketone functionalization. In addition, we determine the parameters with which this type of system with oxygen heteroatoms can be adequately described within the widely used mean-field Hubbard model. Altogether, we provide new insights for both theoretically modeling and devising GNR-based nanostructures with tunable magnetic properties.
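For orientation, the mean-field Hubbard model is commonly written in the following textbook form. This is a generic nearest-neighbour version stated as an assumption; the specific parametrization determined for the oxygen heteroatoms (for example, site-dependent energies or longer-range hoppings) is not given in this abstract and is not reproduced here.

```latex
H_{\mathrm{MFH}} = -t \sum_{\langle i,j \rangle, \sigma}
    \left( c^{\dagger}_{i\sigma} c_{j\sigma} + \mathrm{h.c.} \right)
  + U \sum_{i} \left( n_{i\uparrow} \langle n_{i\downarrow} \rangle
    + \langle n_{i\uparrow} \rangle n_{i\downarrow}
    - \langle n_{i\uparrow} \rangle \langle n_{i\downarrow} \rangle \right)
```

Here $t$ is the hopping amplitude between nearest-neighbour $p_z$ orbitals $\langle i,j \rangle$, $U$ is the on-site Coulomb repulsion, $n_{i\sigma} = c^{\dagger}_{i\sigma} c_{i\sigma}$ is the occupation operator for spin $\sigma$ at site $i$, and the averages $\langle n_{i\sigma} \rangle$ are determined self-consistently.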
The use of multi-modal data, such as the combination of whole slide images (WSIs) and gene expression data, for survival analysis can lead to more accurate survival predictions. Previous multi-modal survival models are not able to efficiently extract the intrinsic information within each modality. Moreover, although experimental results show that WSIs provide more effective information than gene expression data, previous methods regard the information from different modalities as similarly important, so they cannot flexibly utilize the potential connection between the modalities. To address these problems, we propose a new asymmetrical multi-modal method, termed AMMASurv. Specifically, we design an asymmetrical multi-modal attention mechanism (AMMA) in the Transformer encoder for multi-modal data to enable more flexible multi-modal information fusion for survival prediction. Unlike previous work, AMMASurv can effectively utilize the intrinsic information within every modality and flexibly adapt to modalities of different importance. Extensive experiments are conducted to validate the effectiveness of the proposed model. Encouraging results demonstrate the superiority of our method over other state-of-the-art methods.
This paper presents Self-correcting Encoding (Secoco), a framework that effectively deals with input noise for robust neural machine translation by introducing self-correcting predictors. Different from previous robust approaches, Secoco enables NMT to explicitly correct noisy inputs and delete specific errors simultaneously with the translation decoding process. Secoco is able to achieve significant improvements over strong baselines on two real-world test sets and a benchmark WMT dataset with good interpretability. We will make our code and dataset publicly available soon.