No Arabic abstract
It is widely known that convolutional neural networks (CNNs) are vulnerable to adversarial examples: images with imperceptible perturbations crafted to fool classifiers. However, interpretability of these perturbations is less explored in the literature. This work aims to better understand the roles of adversarial perturbations and provide visual explanations from pixel, image and network perspectives. We show that adversaries have a promotion-suppression effect (PSE) on neurons activations and can be primarily categorized into three types: i) suppression-dominated perturbations that mainly reduce the classification score of the true label, ii) promotion-dominated perturbations that focus on boosting the confidence of the target label, and iii) balanced perturbations that play a dual role in suppression and promotion. We also provide image-level interpretability of adversarial examples. This links PSE of pixel-level perturbations to class-specific discriminative image regions localized by class activation mapping (Zhou et al. 2016). Further, we examine the adversarial effect through network dissection (Bau et al. 2017), which offers concept-level interpretability of hidden units. We show that there exists a tight connection between the units sensitivity to adversarial attacks and their interpretability on semantic concepts. Lastly, we provide some new insights from our interpretation to improve the adversarial robustness of networks.
There has been a rise in the use of Machine Learning as a Service (MLaaS) Vision APIs as they offer multiple services including pre-built models and algorithms, which otherwise take a huge amount of resources if built from scratch. As these APIs get deployed for high-stakes applications, its very important that they are robust to different manipulations. Recent works have only focused on typical adversarial attacks when evaluating the robustness of vision APIs. We propose two new aspects of adversarial image generation methods and evaluate them on the robustness of Google Cloud Vision APIs optical character recognition service and object detection APIs deployed in real-world settings such as sightengine.com, picpurify.com, Google Cloud Vision API, and Microsoft Azures Computer Vision API. Specifically, we go beyond the conventional small-noise adversarial attacks and introduce secret embedding and transparent adversarial examples as a simpler way to evaluate robustness. These methods are so straightforward that even non-specialists can craft such attacks. As a result, they pose a serious threat where APIs are used for high-stakes applications. Our transparent adversarial examples successfully evade state-of-the art object detections APIs such as Azure Cloud Vision (attack success rate 52%) and Google Cloud Vision (attack success rate 36%). 90% of the images have a secret embedded text that successfully fools the vision of time-limited humans but is detected by Google Cloud Vision APIs optical character recognition. Complementing to current research, our results provide simple but unconventional methods on robustness evaluation.
Fooling people with highly realistic fake images generated with Deepfake or GANs brings a great social disturbance to our society. Many methods have been proposed to detect fake images, but they are vulnerable to adversarial perturbations -- intentionally designed noises that can lead to the wrong prediction. Existing methods of attacking fake image detectors usually generate adversarial perturbations to perturb almost the entire image. This is redundant and increases the perceptibility of perturbations. In this paper, we propose a novel method to disrupt the fake image detection by determining key pixels to a fake image detector and attacking only the key pixels, which results in the $L_0$ and the $L_2$ norms of adversarial perturbations much less than those of existing works. Experiments on two public datasets with three fake image detectors indicate that our proposed method achieves state-of-the-art performance in both white-box and black-box attacks.
With the increasing attentions of deep learning models, attacks are also upcoming for such models. For example, an attacker may carefully construct images in specific ways (also referred to as adversarial examples) aiming to mislead the deep learning models to output incorrect classification results. Similarly, many efforts are proposed to detect and mitigate adversarial examples, usually for certain dedicated attacks. In this paper, we propose a novel digital watermark based method to generate adversarial examples for deep learning models. Specifically, partial main features of the watermark image are embedded into the host image invisibly, aiming to tamper and damage the recognition capabilities of the deep learning models. We devise an efficient mechanism to select host images and watermark images, and utilize the improved discrete wavelet transform (DWT) based Patchwork watermarking algorithm and the modified discrete cosine transform (DCT) based Patchwork watermarking algorithm. The experimental results showed that our scheme is able to generate a large number of adversarial examples efficiently. In addition, we find that using the extracted features of the image as the watermark images, can increase the success rate of an attack under certain conditions with minimal changes to the host image. To ensure repeatability, reproducibility, and code sharing, the source code is available on GitHub
This paper investigates the visual quality of the adversarial examples. Recent papers propose to smooth the perturbations to get rid of high frequency artefacts. In this work, smoothing has a different meaning as it perceptually shapes the perturbation according to the visual content of the image to be attacked. The perturbation becomes locally smooth on the flat areas of the input image, but it may be noisy on its textured areas and sharp across its edges. This operation relies on Laplacian smoothing, well-known in graph signal processing, which we integrate in the attack pipeline. We benchmark several attacks with and without smoothing under a white-box scenario and evaluate their transferability. Despite the additional constraint of smoothness, our attack has the same probability of success at lower distortion.
Deep neural networks (DNNs) have demonstrated impressive performance on a wide array of tasks, but they are usually considered opaque since internal structure and learned parameters are not interpretable. In this paper, we re-examine the internal representations of DNNs using adversarial images, which are generated by an ensemble-optimization algorithm. We find that: (1) the neurons in DNNs do not truly detect semantic objects/parts, but respond to objects/parts only as recurrent discriminative patches; (2) deep visual representations are not robust distributed codes of visual concepts because the representations of adversarial images are largely not consistent with those of real images, although they have similar visual appearance, both of which are different from previous findings. To further improve the interpretability of DNNs, we propose an adversarial training scheme with a consistent loss such that the neurons are endowed with human-interpretable concepts. The induced interpretable representations enable us to trace eventual outcomes back to influential neurons. Therefore, human users can know how the models make predictions, as well as when and why they make errors.