Do you want to publish a course? Click here

Interpreting Undesirable Pixels for Image Classification on Black-Box Models

104   0   0.0 ( 0 )
 Added by Sin-Han Kang
 Publication date 2019
and research's language is English




Ask ChatGPT about the research

In an effort to interpret black-box models, researches for developing explanation methods have proceeded in recent years. Most studies have tried to identify input pixels that are crucial to the prediction of a classifier. While this approach is meaningful to analyse the characteristic of blackbox models, it is also important to investigate pixels that interfere with the prediction. To tackle this issue, in this paper, we propose an explanation method that visualizes undesirable regions to classify an image as a target class. To be specific, we divide the concept of undesirable regions into two terms: (1) factors for a target class, which hinder that black-box models identify intrinsic characteristics of a target class and (2) factors for non-target classes that are important regions for an image to be classified as other classes. We visualize such undesirable regions on heatmaps to qualitatively validate the proposed method. Furthermore, we present an evaluation metric to provide quantitative results on ImageNet.



rate research

Read More

Deep neural networks are vulnerable to adversarial attacks. White-box adversarial attacks can fool neural networks with small adversarial perturbations, especially for large size images. However, keeping successful adversarial perturbations imperceptible is especially challenging for transfer-based black-box adversarial attacks. Often such adversarial examples can be easily spotted due to their unpleasantly poor visual qualities, which compromises the threat of adversarial attacks in practice. In this study, to improve the image quality of black-box adversarial examples perceptually, we propose structure-aware adversarial attacks by generating adversarial images based on psychological perceptual models. Specifically, we allow higher perturbations on perceptually insignificant regions, while assigning lower or no perturbation on visually sensitive regions. In addition to the proposed spatial-constrained adversarial perturbations, we also propose a novel structure-aware frequency adversarial attack method in the discrete cosine transform (DCT) domain. Since the proposed attacks are independent of the gradient estimation, they can be directly incorporated with existing gradient-based attacks. Experimental results show that, with the comparable attack success rate (ASR), the proposed methods can produce adversarial examples with considerably improved visual quality for free. With the comparable perceptual quality, the proposed approaches achieve higher attack success rates: particularly for the frequency structure-aware attacks, the average ASR improves more than 10% over the baseline attacks.
Increasing use of ML technologies in privacy-sensitive domains such as medical diagnoses, lifestyle predictions, and business decisions highlights the need to better understand if these ML technologies are introducing leakages of sensitive and proprietary training data. In this paper, we focus on one kind of model inversion attacks, where the adversary knows non-sensitive attributes about instances in the training data and aims to infer the value of a sensitive attribute unknown to the adversary, using oracle access to the target classification model. We devise two novel model inversion attribute inference attacks -- confidence modeling-based attack and confidence score-based attack, and also extend our attack to the case where some of the other (non-sensitive) attributes are unknown to the adversary. Furthermore, while previous work uses accuracy as the metric to evaluate the effectiveness of attribute inference attacks, we find that accuracy is not informative when the sensitive attribute distribution is unbalanced. We identify two metrics that are better for evaluating attribute inference attacks, namely G-mean and Matthews correlation coefficient (MCC). We evaluate our attacks on two types of machine learning models, decision tree and deep neural network, trained with two real datasets. Experimental results show that our newly proposed attacks significantly outperform the state-of-the-art attacks. Moreover, we empirically show that specific groups in the training dataset (grouped by attributes, e.g., gender, race) could be more vulnerable to model inversion attacks. We also demonstrate that our attacks performances are not impacted significantly when some of the other (non-sensitive) attributes are also unknown to the adversary.
We study the task of replicating the functionality of black-box neural models, for which we only know the output class probabilities provided for a set of input images. We assume back-propagation through the black-box model is not possible and its training images are not available, e.g. the model could be exposed only through an API. In this context, we present a teacher-student framework that can distill the black-box (teacher) model into a student model with minimal accuracy loss. To generate useful data samples for training the student, our framework (i) learns to generate images on a proxy data set (with images and classes different from those used to train the black-box) and (ii) applies an evolutionary strategy to make sure that each generated data sample exhibits a high response for a specific class when given as input to the black box. Our framework is compared with several baseline and state-of-the-art methods on three benchmark data sets. The empirical evidence indicates that our model is superior to the considered baselines. Although our method does not back-propagate through the black-box network, it generally surpasses state-of-the-art methods that regard the teacher as a glass-box model. Our code is available at: https://github.com/antoniobarbalau/black-box-ripper.
Channel Pruning has been long studied to compress CNNs for efficient image classification. Prior works implement channel pruning in an unexplainable manner, which tends to reduce the final classification errors while failing to consider the internal influence of each channel. In this paper, we conduct channel pruning in a white box. Through deep visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification. Inspired by this, we choose to preserve channels contributing to most categories. Specifically, to model the contribution of each channel to differentiating categories, we develop a class-wise mask for each channel, implemented in a dynamic training manner w.r.t. the input images category. On the basis of the learned class-wise mask, we perform a global voting mechanism to remove channels with less category discrimination. Lastly, a fine-tuning process is conducted to recover the performance of the pruned model. To our best knowledge, it is the first time that CNN interpretability theory is considered to guide channel pruning. Extensive experiments on representative image classification tasks demonstrate the superiority of our White-Box over many state-of-the-arts. For instance, on CIFAR-10, it reduces 65.23% FLOPs with even 0.62% accuracy improvement for ResNet-110. On ILSVRC-2012, White-Box achieves a 45.6% FLOPs reduction with only a small loss of 0.83% in the top-1 accuracy for ResNet-50.
Deep learning algorithms are growing in popularity in the field of exoplanetary science due to their ability to model highly non-linear relations and solve interesting problems in a data-driven manner. Several works have attempted to perform fast retrievals of atmospheric parameters with the use of machine learning algorithms like deep neural networks (DNNs). Yet, despite their high predictive power, DNNs are also infamous for being black boxes. It is their apparent lack of explainability that makes the astrophysics community reluctant to adopt them. What are their predictions based on? How confident should we be in them? When are they wrong and how wrong can they be? In this work, we present a number of general evaluation methodologies that can be applied to any trained model and answer questions like these. In particular, we train three different popular DNN architectures to retrieve atmospheric parameters from exoplanet spectra and show that all three achieve good predictive performance. We then present an extensive analysis of the predictions of DNNs, which can inform us - among other things - of the credibility limits for atmospheric parameters for a given instrument and model. Finally, we perform a perturbation-based sensitivity analysis to identify to which features of the spectrum the outcome of the retrieval is most sensitive. We conclude that for different molecules, the wavelength ranges to which the DNNs predictions are most sensitive, indeed coincide with their characteristic absorption regions. The methodologies presented in this work help to improve the evaluation of DNNs and to grant interpretability to their predictions.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا