ﻻ يوجد ملخص باللغة العربية
Image classification models have achieved satisfactory performance on many datasets, sometimes even better than human. However, The model attention is unclear since the lack of interpretability. This paper investigates the fidelity and interpretability of model attention. We propose an Explainable Attribute-based Multi-task (EAT) framework to concentrate the model attention on the discriminative image area and make the attention interpretable. We introduce attributes prediction to the multi-task learning network, helping the network to concentrate attention on the foreground objects. We generate attribute-based textual explanations for the network and ground the attributes on the image to show visual explanations. The multi-model explanation can not only improve user trust but also help to find the weakness of network and dataset. Our framework can be generalized to any basic model. We perform experiments on three datasets and five basic models. Results indicate that the EAT framework can give multi-modal explanations that interpret the network decision. The performance of several recognition approaches is improved by guiding network attention.
Recently, many plug-and-play self-attention modules are proposed to enhance the model generalization by exploiting the internal information of deep convolutional neural networks (CNNs). Previous works lay an emphasis on the design of attention module
Click-through rate (CTR) prediction is a critical problem in web search, recommendation systems and online advertisement displaying. Learning good feature interactions is essential to reflect users preferences to items. Many CTR prediction models bas
Explaining the decision of a multi-modal decision-maker requires to determine the evidence from both modalities. Recent advances in XAI provide explanations for models trained on still images. However, when it comes to modeling multiple sensory modal
Self-attention (SA) network has shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that
Attention networks have successfully boosted the performance in various vision problems. Previous works lay emphasis on designing a new attention module and individually plug them into the networks. Our paper proposes a novel-and-simple framework tha