No Arabic abstract
This paper investigates the theory of robustness against adversarial attacks. We focus on randomized classifiers (emph{i.e.} classifiers that output random variables) and provide a thorough analysis of their behavior through the lens of statistical learning theory and information theory. To this aim, we introduce a new notion of robustness for randomized classifiers, enforcing local Lipschitzness using probability metrics. Equipped with this definition, we make two new contributions. The first one consists in devising a new upper bound on the adversarial generalization gap of randomized classifiers. More precisely, we devise bounds on the generalization gap and the adversarial gap (emph{i.e.} the gap between the risk and the worst-case risk under attack) of randomized classifiers. The second contribution presents a yet simple but efficient noise injection method to design robust randomized classifiers. We show that our results are applicable to a wide range of machine learning models under mild hypotheses. We further corroborate our findings with experimental results using deep neural networks on standard image datasets, namely CIFAR-10 and CIFAR-100. All robust models we trained models can simultaneously achieve state-of-the-art accuracy (over $0.82$ clean accuracy on CIFAR-10) and enjoy emph{guaranteed} robust accuracy bounds ($0.45$ against $ell_2$ adversaries with magnitude $0.5$ on CIFAR-10).
The goal of this paper is to analyze an intriguing phenomenon recently discovered in deep networks, namely their instability to adversarial perturbations (Szegedy et. al., 2014). We provide a theoretical framework for analyzing the robustness of classifiers to adversarial perturbations, and show fundamental upper bounds on the robustness of classifiers. Specifically, we establish a general upper bound on the robustness of classifiers to adversarial perturbations, and then illustrate the obtained upper bound on the families of linear and quadratic classifiers. In both cases, our upper bound depends on a distinguishability measure that captures the notion of difficulty of the classification task. Our results for both classes imply that in tasks involving small distinguishability, no classifier in the considered set will be robust to adversarial perturbations, even if a good accuracy is achieved. Our theoretical framework moreover suggests that the phenomenon of adversarial instability is due to the low flexibility of classifiers, compared to the difficulty of the classification task (captured by the distinguishability). Moreover, we show the existence of a clear distinction between the robustness of a classifier to random noise and its robustness to adversarial perturbations. Specifically, the former is shown to be larger than the latter by a factor that is proportional to sqrt{d} (with d being the signal dimension) for linear classifiers. This result gives a theoretical explanation for the discrepancy between the two robustness properties in high dimensional problems, which was empirically observed in the context of neural networks. To the best of our knowledge, our results provide the first theoretical work that addresses the phenomenon of adversarial instability recently observed for deep networks. Our analysis is complemented by experimental results on controlled and real-world data.
Several recent works have shown that state-of-the-art classifiers are vulnerable to worst-case (i.e., adversarial) perturbations of the datapoints. On the other hand, it has been empirically observed that these same classifiers are relatively robust to random noise. In this paper, we propose to study a textit{semi-random} noise regime that generalizes both the random and worst-case noise regimes. We propose the first quantitative analysis of the robustness of nonlinear classifiers in this general noise regime. We establish precise theoretical bounds on the robustness of classifiers in this general regime, which depend on the curvature of the classifiers decision boundary. Our bounds confirm and quantify the empirical observations that classifiers satisfying curvature constraints are robust to random noise. Moreover, we quantify the robustness of classifiers in terms of the subspace dimension in the semi-random noise regime, and show that our bounds remarkably interpolate between the worst-case and random noise regimes. We perform experiments and show that the derived bounds provide very accurate estimates when applied to various state-of-the-art deep neural networks and datasets. This result suggests bounds on the curvature of the classifiers decision boundaries that we support experimentally, and more generally offers important insights onto the geometry of high dimensional classification problems.
This short note highlights some links between two lines of research within the emerging topic of trustworthy machine learning: differential privacy and robustness to adversarial examples. By abstracting the definitions of both notions, we show that they build upon the same theoretical ground and hence results obtained so far in one domain can be transferred to the other. More precisely, our analysis is based on two key elements: probabilistic mappings (also called randomized algorithms in the differential privacy community), and the Renyi divergence which subsumes a large family of divergences. We first generalize the definition of robustness against adversarial examples to encompass probabilistic mappings. Then we observe that Renyi-differential privacy (a generalization of differential privacy recently proposed in~cite{Mironov2017RenyiDP}) and our definition of robustness share several similarities. We finally discuss how can both communities benefit from this connection to transfer technical tools from one research field to the other.
Despite the remarkable success of deep neural networks, significant concerns have emerged about their robustness to adversarial perturbations to inputs. While most attacks aim to ensure that these are imperceptible, physical perturbation attacks typically aim for being unsuspicious, even if perceptible. However, there is no universal notion of what it means for adversarial examples to be unsuspicious. We propose an approach for modeling suspiciousness by leveraging cognitive salience. Specifically, we split an image into foreground (salient region) and background (the rest), and allow significantly larger adversarial perturbations in the background, while ensuring that cognitive salience of background remains low. We describe how to compute the resulting non-salience-preserving dual-perturbation attacks on classifiers. We then experimentally demonstrate that our attacks indeed do not significantly change perceptual salience of the background, but are highly effective against classifiers robust to conventional attacks. Furthermore, we show that adversarial training with dual-perturbation attacks yields classifiers that are more robust to these than state-of-the-art robust learning approaches, and comparable in terms of robustness to conventional attacks.
Deep Neural Networks, despite their great success in diverse domains, are provably sensitive to small perturbations on correctly classified examples and lead to erroneous predictions. Recently, it was proposed that this behavior can be combatted by optimizing the worst case loss function over all possible substitutions of training examples. However, this can be prone to weighing unlikely substitutions higher, limiting the accuracy gain. In this paper, we study adversarial robustness through randomized perturbations, which has two immediate advantages: (1) by ensuring that substitution likelihood is weighted by the proximity to the original word, we circumvent optimizing the worst case guarantees and achieve performance gains; and (2) the calibrated randomness imparts differentially-private model training, which additionally improves robustness against adversarial attacks on the model outputs. Our approach uses a novel density-based mechanism based on truncated Gumbel noise, which ensures training on substitutions of both rare and dense words in the vocabulary while maintaining semantic similarity for model robustness.