Concise Explanations of Neural Networks using Adversarial Training

182 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Prasad Chalasani

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية الاحصاء الرياضي

والبحث باللغة English

تأليف Prasad Chalasani - Jiefeng Chen - Amrita Roy Chowdhury

التعلم الآلي التعلم الالي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We show new connections between adversarial learning and explainability for deep neural networks (DNNs). One form of explanation of the output of a neural network model in terms of its input features, is a vector of feature-attributions. Two desirable characteristics of an attribution-based explanation are: (1) $textit{sparseness}$: the attributions of irrelevant or weakly relevant features should be negligible, thus resulting in $textit{concise}$ explanations in terms of the significant features, and (2) $textit{stability}$: it should not vary significantly within a small local neighborhood of the input. Our first contribution is a theoretical exploration of how these two properties (when using attributions based on Integrated Gradients, or IG) are related to adversarial training, for a class of 1-layer networks (which includes logistic regression models for binary and multi-class classification); for these networks we show that (a) adversarial training using an $ell_infty$-bounded adversary produces models with sparse attribution vectors, and (b) natural model-training while encouraging stable explanations (via an extra term in the loss function), is equivalent to adversarial training. Our second contribution is an empirical verification of phenomenon (a), which we show, somewhat surprisingly, occurs $textit{not only}$ $textit{in 1-layer networks}$, $textit{but also DNNs}$ $textit{trained on }$ $textit{standard image datasets}$, and extends beyond IG-based attributions, to those based on DeepSHAP: adversarial training with $ell_infty$-bounded perturbations yields significantly sparser attribution vectors, with little degradation in performance on natural test data, compared to natural training. Moreover, the sparseness of the attribution vectors is significantly better than that achievable via $ell_1$-regularized natural training.

قيم البحث

142 - Donald Loveland , Shusen Liu , Bhavya Kailkhura 2021

Graph neural network (GNN) explanations have largely been facilitated through post-hoc introspection. While this has been deemed successful, many post-hoc explanation methods have been shown to fail in capturing a models learned representation. Due t o this problem, it is worthwhile to consider how one might train a model so that it is more amenable to post-hoc analysis. Given the success of adversarial training in the computer vision domain to train models with more reliable representations, we propose a similar training paradigm for GNNs and analyze the respective impact on a models explanations. In instances without ground truth labels, we also determine how well an explanation method is utilizing a models learned representation through a new metric and demonstrate adversarial training can help better extract domain-relevant insights in chemistry.

التعلم الآلي

Adversarial Image Generation and Training for Deep Neural Networks

89 - Hai Shu , Ronghua Shi , Hongtu Zhu 2020

Deep neural networks (DNNs) have achieved great success in image classification, but they may be very vulnerable to adversarial attacks with small perturbations to images. Moreover, the adversarial training based on adversarial image samples has been shown to improve the robustness and generalization of DNNs. The aim of this paper is to develop a novel framework based on information-geometry sensitivity analysis and the particle swarm optimization to improve two aspects of adversarial image generation and training for DNNs. The first one is customized generation of adversarial examples. It can design adversarial attacks from options of the number of perturbed pixels, the misclassification probability, and the targeted incorrect class, and hence it is more flexible and effective to locate vulnerable pixels and also enjoys certain adversarial universality. The other is targeted adversarial training. DNN models can be improved in training with the adversarial information using a manifold-based influence measure effective in vulnerable image/pixel detection as well as allowing for targeted attacks, thereby exhibiting an enhanced adversarial defense in testing.

التعلم الآلي التعلم الالي

Training Generative Adversarial Networks from Incomplete Observations using Factorised Discriminators

105 - Daniel Stoller , Sebastian Ewert , Simon Dixon 2019

Generative adversarial networks (GANs) have shown great success in applications such as image generation and inpainting. However, they typically require large datasets, which are often not available, especially in the context of prediction tasks such as image segmentation that require labels. Therefore, methods such as the CycleGAN use more easily available unlabelled data, but do not offer a way to leverage additional labelled data for improved performance. To address this shortcoming, we show how to factorise the joint data distribution into a set of lower-dimensional distributions along with their dependencies. This allows splitting the discriminator in a GAN into multiple sub-discriminators that can be independently trained from incomplete observations. Their outputs can be combined to estimate the density ratio between the joint real and the generator distribution, which enables training generators as in the original GAN framework. We apply our method to image generation, image segmentation and audio source separation, and obtain improved performance over a standard GAN when additional incomplete training examples are available. For the Cityscapes segmentation task in particular, our method also improves accuracy by an absolute 14.9% over CycleGAN while using only 25 additional paired examples.

التعلم الآلي التعلم الالي

Improving adversarial robustness of deep neural networks by using semantic information

163 - Lina Wang , Rui Tang , Yawei Yue 2020

The vulnerability of deep neural networks (DNNs) to adversarial attack, which is an attack that can mislead state-of-the-art classifiers into making an incorrect classification with high confidence by deliberately perturbing the original inputs, rais es concerns about the robustness of DNNs to such attacks. Adversarial training, which is the main heuristic method for improving adversarial robustness and the first line of defense against adversarial attacks, requires many sample-by-sample calculations to increase training size and is usually insufficiently strong for an entire network. This paper provides a new perspective on the issue of adversarial robustness, one that shifts the focus from the network as a whole to the critical part of the region close to the decision boundary corresponding to a given class. From this perspective, we propose a method to generate a single but image-agnostic adversarial perturbation that carries the semantic information implying the directions to the fragile parts on the decision boundary and causes inputs to be misclassified as a specified target. We call the adversarial training based on such perturbations region adversarial training (RAT), which resembles classical adversarial training but is distinguished in that it reinforces the semantic information missing in the relevant regions. Experimental results on the MNIST and CIFAR-10 datasets show that this approach greatly improves adversarial robustness even using a very small dataset from the training data; moreover, it can defend against FGSM adversarial attacks that have a completely different pattern from the model seen during retraining.

التعلم الآلي التعلم الالي

On Completeness-aware Concept-Based Explanations in Deep Neural Networks

110 - Chih-Kuan Yeh , Been Kim , Sercan O. Arik 2019

Human explanations of high-level decisions are often expressed in terms of key concepts the decisions are based on. In this paper, we study such concept-based explainability for Deep Neural Networks (DNNs). First, we define the notion of completeness , which quantifies how sufficient a particular set of concepts is in explaining a models prediction behavior based on the assumption that complete concept scores are sufficient statistics of the model prediction. Next, we propose a concept discovery method that aims to infer a complete set of concepts that are additionally encouraged to be interpretable, which addresses the limitations of existing methods on concept explanations. To define an importance score for each discovered concept, we adapt game-theoretic notions to aggregate over sets and propose ConceptSHAP. Via proposed metrics and user studies, on a synthetic dataset with apriori-known concept explanations, as well as on real-world image and language datasets, we validate the effectiveness of our method in finding concepts that are both complete in explaining the decisions and interpretable. (The code is released at https://github.com/chihkuanyeh/concept_exp)

التعلم الآلي التعلم الالي