Evaluating adversarial robustness amounts to finding the minimum perturbation needed to have an input sample misclassified. The inherent complexity of the underlying optimization requires current gradient-based attacks to be carefully tuned, initialized, and possibly executed for many computationally demanding iterations, even if specialized to a given perturbation model. In this work, we overcome these limitations by proposing a fast minimum-norm (FMN) attack that works with different $\ell_p$-norm perturbation models ($p=0, 1, 2, \infty$), is robust to hyperparameter choices, does not require adversarial starting points, and converges within a few lightweight steps. It works by iteratively finding the sample misclassified with maximum confidence within an $\ell_p$-norm constraint of size $\epsilon$, while adapting $\epsilon$ to minimize the distance of the current sample to the decision boundary. Extensive experiments show that FMN significantly outperforms existing attacks in terms of convergence speed and computation time, while reporting comparable or even smaller perturbation sizes.
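The alternation described above (a step toward misclassification inside an $\ell_p$ ball of size $\epsilon$, followed by an adjustment of $\epsilon$) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a toy linear classifier, handles only $p=2$, and uses a simplified multiplicative rule for $\epsilon$. The names `min_norm_attack_l2`, `margin_and_grad`, `eps0`, `alpha`, and `gamma` are hypothetical and chosen for illustration only.

```python
import numpy as np

def margin_and_grad(W, b, x, y):
    """Margin z_y - max_{j != y} z_j of a linear classifier and its gradient in x.
    The margin is negative exactly when x is misclassified."""
    z = W @ x + b
    z_other = z.copy()
    z_other[y] = -np.inf
    j = int(np.argmax(z_other))
    return z[y] - z[j], W[y] - W[j]

def min_norm_attack_l2(W, b, x0, y, steps=200, eps0=0.1, alpha=0.5, gamma=0.05):
    """Simplified minimum-norm search (assumption: l2 only, fixed step sizes)."""
    delta = np.zeros_like(x0)
    eps = eps0
    best, best_norm = None, np.inf
    for _ in range(steps):
        loss, grad = margin_and_grad(W, b, x0 + delta, y)
        norm = np.linalg.norm(delta)
        if loss < 0:
            # Adversarial: remember the smallest perturbation, then shrink eps.
            if norm < best_norm:
                best, best_norm = delta.copy(), norm
            eps = min(eps, norm) * (1.0 - gamma)
        else:
            # Not adversarial: enlarge the constraint.
            eps = eps * (1.0 + gamma)
        # Step that lowers the margin, i.e. moves toward misclassification.
        g = grad / (np.linalg.norm(grad) + 1e-12)
        delta = delta - alpha * g
        # Project back onto the l2 ball of radius eps.
        n = np.linalg.norm(delta)
        if n > eps:
            delta = delta * (eps / n)
    return best, best_norm

# Toy two-class problem: the decision boundary is x[0] = 0, so the true
# minimum l2 distance from x0 = (1.0, 0.5) to the boundary is 1.
W = np.array([[1.0, 0.0], [-1.0, 0.0]])
b = np.zeros(2)
x0 = np.array([1.0, 0.5])
best, best_norm = min_norm_attack_l2(W, b, x0, y=0)
print(best_norm)
```

In this sketch $\epsilon$ oscillates around the true boundary distance, so the smallest adversarial perturbation found stays within a factor $1+\gamma$ of it. A decaying schedule for `alpha` and `gamma` would tighten the result further; that refinement is left out here for brevity.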
Adversarial attacks aim to confound machine learning systems, while remaining virtually imperceptible to humans. Attacks on image classification systems are typically gauged in terms of $p$-norm distortions in the pixel feature space. We perform a behavioral study, demonstrating that the pixel $p$-norm for any $0 \le p \le \infty$, and several alternative measures including earth mover's distance, structural similarity index, and deep net embedding, do not fit human perception. Our result has the potential to improve the understanding of adversarial attack and defense strategies.
Deep Neural Networks (DNNs) can be easily fooled by Adversarial Examples (AEs) whose difference from the original samples is imperceptible to human eyes. To keep the difference imperceptible, existing attacks bound the adversarial perturbations by the $\ell_\infty$ norm, which then serves as the standard for aligning different attacks for a fair comparison. However, when investigating attack transferability, i.e., the capability of AEs crafted on one surrogate DNN to fool other black-box DNNs, we find that using only the $\ell_\infty$ norm is not sufficient to measure the attack strength, according to our comprehensive experiments concerning 7 transfer-based attacks, 4 white-box surrogate models, and 9 black-box victim models. Specifically, we find that the $\ell_2$ norm greatly affects the transferability in $\ell_\infty$ attacks. Since AEs with larger perturbations naturally bring about better transferability, we advocate that the strength of all attacks should be measured by both the widely used $\ell_\infty$ norm and the $\ell_2$ norm. Although our conclusion and advocacy are intuitive, they are necessary for the community, because common evaluations (bounding only the $\ell_\infty$ norm) allow tricky enhancements of attack transferability by increasing the attack strength ($\ell_2$ norm), as shown by our simple counter-example method, and the good transferability of several existing methods may be due to their large $\ell_2$ distances.
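The point that an $\ell_\infty$ bound alone leaves the $\ell_2$ norm largely unconstrained can be seen in a small numerical sketch. The helper `perturbation_norms`, the image shape, and the budget `8/255` are assumptions made for illustration and are not taken from the paper's experiments.

```python
import numpy as np

def perturbation_norms(x, x_adv):
    """Return the l_inf and l_2 norms of the perturbation x_adv - x."""
    delta = (x_adv - x).ravel()
    return {"linf": float(np.max(np.abs(delta))),
            "l2": float(np.linalg.norm(delta))}

eps = 8 / 255                      # assumed l_inf budget
x = np.zeros((3, 4, 4))            # hypothetical tiny "image" with 48 values

# Dense perturbation: every value moved by eps.
dense = perturbation_norms(x, x + eps)

# Sparse perturbation: a single value moved by eps.
x_sparse = x.copy()
x_sparse[0, 0, 0] += eps
sparse = perturbation_norms(x, x_sparse)

print(dense, sparse)
```

Both perturbations satisfy the same $\ell_\infty$ bound, yet the dense one has an $\ell_2$ norm larger by a factor of $\sqrt{48}$, which is why reporting both norms gives a fuller picture of attack strength.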
Reliable evaluation of adversarial defenses is a challenging task, currently limited to an expert who manually crafts attacks that exploit the defense's inner workings, or to approaches based on an ensemble of fixed attacks, none of which may be effective for the specific defense at hand. Our key observation is that custom attacks are composed from a set of reusable building blocks, such as fine-tuning relevant attack parameters, network transformations, and custom loss functions. Based on this observation, we present an extensible framework that defines a search space over these reusable building blocks and automatically discovers an effective attack on a given model with an unknown defense by searching over suitable combinations of these blocks. We evaluated our framework on 23 adversarial defenses and showed that it outperforms AutoAttack, the current state-of-the-art tool for reliable evaluation of adversarial defenses: our discovered attacks are either stronger, producing 3.0%-50.8% additional adversarial examples (10 cases), or are typically 2x faster while enjoying similar adversarial robustness (13 cases).
Adversarial examples are a challenging open problem for deep neural networks. We propose in this paper to add a penalization term that forces the decision function to be flat in some regions of the input space, such that it becomes, at least locally, less sensitive to attacks. Our proposition is theoretically motivated and shows on a first set of carefully conducted experiments that it behaves as expected when used alone, and seems promising when coupled with adversarial training.
There has been an ongoing cycle where stronger defenses against adversarial attacks are subsequently broken by a more advanced defense-aware attack. We present a new approach towards ending this cycle where we deflect adversarial attacks by causing the attacker to produce an input that semantically resembles the attack's target class. To this end, we first propose a stronger defense based on Capsule Networks that combines three detection mechanisms to achieve state-of-the-art detection performance on both standard and defense-aware attacks. We then show that undetected attacks against our defense often perceptually resemble the adversarial target class by performing a human study where participants are asked to label images produced by the attack. These attack images can no longer be called adversarial because our network classifies them the same way as humans do.