No Arabic abstract
Deep neural networks (DNNs) are under threat from adversarial example attacks. The adversary can easily change the outputs of DNNs by adding small well-designed perturbations to inputs. Adversarial example detection is a fundamental work for robust DNNs-based service. Adversarial examples show the difference between humans and DNNs in image recognition. From a human-centric perspective, image features could be divided into dominant features that are comprehensible to humans, and recessive features that are incomprehensible to humans, yet are exploited by DNNs. In this paper, we reveal that imperceptible adversarial examples are the product of recessive features misleading neural networks, and an adversarial attack is essentially a kind of method to enrich these recessive features in the image. The imperceptibility of the adversarial examples indicates that the perturbations enrich recessive features, yet hardly affect dominant features. Therefore, adversarial examples are sensitive to filtering off recessive features, while benign examples are immune to such operation. Inspired by this idea, we propose a label-only adversarial detection approach that is referred to as feature-filter. Feature-filter utilizes discrete cosine transform to approximately separate recessive features from dominant features, and gets a mutant image that is filtered off recessive features. By only comparing DNNs prediction labels on the input and its mutant, feature-filter can real-time detect imperceptible adversarial examples at high accuracy and few false positives.
Though deep neural network has hit a huge success in recent studies and applica- tions, it still remains vulnerable to adversarial perturbations which are imperceptible to humans. To address this problem, we propose a novel network called ReabsNet to achieve high classification accuracy in the face of various attacks. The approach is to augment an existing classification network with a guardian network to detect if a sample is natural or has been adversarially perturbed. Critically, instead of simply rejecting adversarial examples, we revise them to get their true labels. We exploit the observation that a sample containing adversarial perturbations has a possibility of returning to its true class after revision. We demonstrate that our ReabsNet outperforms the state-of-the-art defense method under various adversarial attacks.
Deep neural networks are vulnerable to adversarial examples, which dramatically alter model output using small input changes. We propose Neural Fingerprinting, a simple, yet effective method to detect adversarial examples by verifying whether model behavior is consistent with a set of secret fingerprints, inspired by the use of biometric and cryptographic signatures. The benefits of our method are that 1) it is fast, 2) it is prohibitively expensive for an attacker to reverse-engineer which fingerprints were used, and 3) it does not assume knowledge of the adversary. In this work, we pose a formal framework to analyze fingerprints under various threat models, and characterize Neural Fingerprinting for linear models. For complex neural networks, we empirically demonstrate that Neural Fingerprinting significantly improves on state-of-the-art detection mechanisms by detecting the strongest known adversarial attacks with 98-100% AUC-ROC scores on the MNIST, CIFAR-10 and MiniImagenet (20 classes) datasets. In particular, the detection accuracy of Neural Fingerprinting generalizes well to unseen test-data under various black- and whitebox threat models, and is robust over a wide range of hyperparameters and choices of fingerprints.
We present DeClaW, a system for detecting, classifying, and warning of adversarial inputs presented to a classification neural network. In contrast to current state-of-the-art methods that, given an input, detect whether an input is clean or adversarial, we aim to also identify the types of adversarial attack (e.g., PGD, Carlini-Wagner or clean). To achieve this, we extract statistical profiles, which we term as anomaly feature vectors, from a set of latent features. Preliminary findings suggest that AFVs can help distinguish among several types of adversarial attacks (e.g., PGD versus Carlini-Wagner) with close to 93% accuracy on the CIFAR-10 dataset. The results open the door to using AFV-based methods for exploring not only adversarial attack detection but also classification of the attack type and then design of attack-specific mitigation strategies.
Deep neural networks (DNNs) have been shown to be vulnerable against adversarial examples (AEs), which are maliciously designed to cause dramatic model output errors. In this work, we reveal that normal examples (NEs) are insensitive to the fluctuations occurring at the highly-curved region of the decision boundary, while AEs typically designed over one single domain (mostly spatial domain) exhibit exorbitant sensitivity on such fluctuations. This phenomenon motivates us to design another classifier (called dual classifier) with transformed decision boundary, which can be collaboratively used with the original classifier (called primal classifier) to detect AEs, by virtue of the sensitivity inconsistency. When comparing with the state-of-the-art algorithms based on Local Intrinsic Dimensionality (LID), Mahalanobis Distance (MD), and Feature Squeezing (FS), our proposed Sensitivity Inconsistency Detector (SID) achieves improved AE detection performance and superior generalization capabilities, especially in the challenging cases where the adversarial perturbation levels are small. Intensive experimental results on ResNet and VGG validate the superiority of the proposed SID.
Deep convolutional neural networks are susceptible to adversarial attacks. They can be easily deceived to give an incorrect output by adding a tiny perturbation to the input. This presents a great challenge in making CNNs robust against such attacks. An influx of new defense techniques have been proposed to this end. In this paper, we show that latent features in certain robust models are surprisingly susceptible to adversarial attacks. On top of this, we introduce a unified $ell_infty$-norm white-box attack algorithm which harnesses latent features in its gradient descent steps, namely LAFEAT. We show that not only is it computationally much more efficient for successful attacks, but it is also a stronger adversary than the current state-of-the-art across a wide range of defense mechanisms. This suggests that model robustness could be contingent on the effective use of the defenders hidden components, and it should no longer be viewed from a holistic perspective.