We analyze the properties of adversarial training for learning adversarially robust halfspaces in the presence of agnostic label noise. Denoting by $\mathsf{OPT}_{p,r}$ the best robust classification error achieved by a halfspace that is robust to perturbations of $\ell_{p}$ balls of radius $r$, we show that adversarial training on the standard binary cross-entropy loss yields adversarially robust halfspaces up to (robust) classification error $\tilde{O}(\sqrt{\mathsf{OPT}_{2,r}})$ for $p=2$, and $\tilde{O}(d^{1/4} \sqrt{\mathsf{OPT}_{\infty, r}} + d^{1/2} \mathsf{OPT}_{\infty,r})$ when $p=\infty$. Our results hold for distributions satisfying anti-concentration properties enjoyed by log-concave isotropic distributions, among others. We additionally show that if one instead uses a nonconvex sigmoidal loss, adversarial training yields halfspaces with an improved robust classification error of $O(\mathsf{OPT}_{2,r})$ for $p=2$, and $O(d^{1/4}\mathsf{OPT}_{\infty, r})$ when $p=\infty$. To the best of our knowledge, this is the first work to show that adversarial training provably yields robust classifiers in the presence of noise.
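For a halfspace, the inner maximization of adversarial training has a closed form: the worst-case $\ell_p$ perturbation of radius $r$ shrinks the margin $y\langle w, x\rangle$ by $r\|w\|_q$ ($q$ the dual norm), so adversarial training reduces to gradient descent on a robust margin loss. The NumPy sketch below illustrates this for the cross-entropy (logistic) loss; the function names `robust_logistic_grad` and `adv_train_halfspace`, the step size, and the iteration count are illustrative choices, not the paper's algorithm or constants.

```python
import numpy as np

def robust_logistic_grad(w, X, y, r, p=2):
    """Gradient of the robust cross-entropy (logistic) loss for a halfspace.

    The worst-case l_p perturbation of radius r shrinks the margin y<w,x> by
    r*||w||_q (q the dual norm: l2 for p=2, l1 for p=inf), so the robust loss
    is log(1 + exp(-(y<w,x> - r*||w||_q))) and its gradient is closed form.
    """
    q_norm = np.linalg.norm(w, 2) if p == 2 else np.linalg.norm(w, 1)
    dual_grad = w / (q_norm + 1e-12) if p == 2 else np.sign(w)
    margins = y * (X @ w) - r * q_norm            # robust margins per sample
    s = 1.0 / (1.0 + np.exp(margins))             # sigmoid(-robust margin)
    # gradient of the robust loss: -s_i * (y_i x_i - r * d||w||_q/dw), averaged
    return -(X * (s * y)[:, None] - np.outer(s, r * dual_grad)).mean(axis=0)

def adv_train_halfspace(X, y, r, p=2, lr=0.1, steps=2000):
    """Plain gradient descent on the robust logistic loss (illustrative only)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * robust_logistic_grad(w, X, y, r, p)
    return w
```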
Deep networks are well known to be fragile to adversarial attacks. We conduct an empirical analysis of deep representations under the state-of-the-art PGD attack, and find that the attack causes the internal representation to shift closer to the false class. Motivated by this observation, we propose to regularize the representation space under attack with metric learning to produce more robust classifiers. By carefully sampling examples for metric learning, our learned representation not only increases robustness, but also detects previously unseen adversarial samples. Quantitative experiments show improvements over prior work of up to 4% in robust accuracy and up to 6% in detection efficiency measured by the Area Under the Curve (AUC) score. The code of our work is available at https://github.com/columbia/Metric_Learning_Adversarial_Robustness.
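A minimal PyTorch sketch of the idea of regularizing the attacked representation with a triplet loss: pull the adversarial representation toward a clean sample of its true class and away from a sample of the false (attacked) class. The `model.features` hook, the choice of negatives `x_neg`, and the weights are assumptions for illustration, not the authors' released code (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def metric_regularized_loss(model, x_nat, x_adv, y, x_neg, margin=1.0, lam=1.0):
    """Cross-entropy on adversarial inputs plus a triplet regularizer on the
    penultimate representation. `model.features` is an assumed hook returning
    that representation; x_neg holds examples from the class the attack pushes
    x_adv toward (the "false" class)."""
    ce = F.cross_entropy(model(x_adv), y)

    anchor = model.features(x_adv)     # representation under attack
    positive = model.features(x_nat)   # clean example of the same true class
    negative = model.features(x_neg)   # example from the false class
    triplet = F.triplet_margin_loss(anchor, positive, negative, margin=margin)
    return ce + lam * triplet
```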
We study \emph{online} active learning of homogeneous halfspaces in $\mathbb{R}^d$ with adversarial noise, where the overall probability of a noisy label is constrained to be at most $\nu$. Our main contribution is a Perceptron-like online active learning algorithm that runs in polynomial time, and under the conditions that the marginal distribution is isotropic log-concave and $\nu = \Omega(\epsilon)$, where $\epsilon \in (0, 1)$ is the target error rate, our algorithm PAC learns the underlying halfspace with near-optimal label complexity of $\tilde{O}\big(d \cdot \mathrm{polylog}(\frac{1}{\epsilon})\big)$ and sample complexity of $\tilde{O}\big(\frac{d}{\epsilon}\big)$. Prior to this work, existing online algorithms designed to tolerate adversarial noise are subject to either label complexity polynomial in $\frac{1}{\epsilon}$, or suboptimal noise tolerance, or restrictive marginal distributions. With the additional prior knowledge that the underlying halfspace is $s$-sparse, we obtain an attribute-efficient label complexity of $\tilde{O}\big( s \cdot \mathrm{polylog}(d, \frac{1}{\epsilon}) \big)$ and sample complexity of $\tilde{O}\big(\frac{s}{\epsilon} \cdot \mathrm{polylog}(d) \big)$. As an immediate corollary, we show that under the agnostic model, where no assumption is made on the noise rate $\nu$, our active learner achieves an error rate of $O(\mathrm{OPT}) + \epsilon$ with the same running time and label and sample complexity, where $\mathrm{OPT}$ is the best possible error rate achievable by any homogeneous halfspace.
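To make the "Perceptron-like online active learning" flavor concrete, here is a hedged NumPy sketch: labels are queried only for points that fall inside a band around the current halfspace, and the weight vector is updated on queried mistakes. The band width `b`, step size `lr`, and normalization are placeholders; the paper's algorithm adapts these across epochs to obtain the stated label and sample complexity.

```python
import numpy as np

def active_perceptron(stream, query_label, d, b=0.1, lr=0.05):
    """Margin-based active Perceptron sketch (illustrative only).

    `stream` yields unlabeled points x in R^d; `query_label(x)` returns a
    (possibly noisy) label in {-1, +1}. Labels are requested only when the
    point lies in the uncertainty band |<w, x>| <= b.
    """
    w = np.zeros(d)
    w[0] = 1.0                              # arbitrary unit initialization
    for x in stream:
        if abs(w @ x) <= b:                 # uncertain region: ask for the label
            y = query_label(x)
            if y * (w @ x) <= 0:            # mistake: Perceptron-style update
                w += lr * y * x
                w /= np.linalg.norm(w)      # keep the homogeneous halfspace normalized
    return w
```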
Today's state-of-the-art image classifiers fail to correctly classify carefully manipulated adversarial images. In this work, we develop a new, localized adversarial attack that generates adversarial examples by imperceptibly altering the backgrounds of normal images. We first use this attack to highlight the unnecessary sensitivity of neural networks to changes in the background of an image, then use it as part of a new training technique: localized adversarial training. By including locally adversarial images in the training set, we are able to create a classifier that suffers less loss than a non-adversarially trained counterpart model on both natural and adversarial inputs. The evaluation of our localized adversarial training algorithm on the MNIST and CIFAR-10 datasets shows reduced accuracy loss on natural images and increased robustness against adversarial inputs.
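A minimal PyTorch sketch of a masked, PGD-style attack confined to background pixels, in the spirit of the localized attack described above; the mask convention, step sizes, and budget are illustrative assumptions, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def localized_pgd(model, x, y, mask, eps=8/255, alpha=2/255, steps=10):
    """PGD-style attack restricted to a binary `mask` (1 = background pixel,
    0 = object), shaped to broadcast against x, e.g. (B, 1, H, W). Only the
    background is perturbed; hyperparameters are illustrative."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign() * mask        # step only on background
            x_adv = x + (x_adv - x).clamp(-eps, eps) * mask   # project to eps-ball, keep object intact
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```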
Training convolutional neural networks (CNNs) with a strict Lipschitz constraint under the $\ell_{2}$ norm is useful for provable adversarial robustness, interpretable gradients, and stable training. While 1-Lipschitz CNNs can be designed by enforcing a 1-Lipschitz constraint on each layer, training such networks requires each layer to have an orthogonal Jacobian matrix (for all inputs) to prevent gradients from vanishing during backpropagation. A layer with this property is said to be Gradient Norm Preserving (GNP). To construct expressive GNP activation functions, we first prove that the Jacobian of any GNP piecewise linear function is only allowed to change via Householder transformations for the function to be continuous. Building on this result, we introduce a class of nonlinear GNP activations with learnable Householder transformations, called Householder ($\mathrm{HH}$) activations. A Householder activation parameterized by the vector $\mathbf{v}$ outputs $(\mathbf{I} - 2\mathbf{v}\mathbf{v}^{T})\mathbf{z}$ for its input $\mathbf{z}$ if $\mathbf{v}^{T}\mathbf{z} \leq 0$; otherwise it outputs $\mathbf{z}$. Existing GNP activations such as $\mathrm{MaxMin}$ can be viewed as special cases of $\mathrm{HH}$ activations for certain settings of these transformations. Thus, networks with $\mathrm{HH}$ activations have higher expressive power than those with $\mathrm{MaxMin}$ activations. Although networks with $\mathrm{HH}$ activations have nontrivial provable robustness against adversarial attacks, we further boost their robustness by (i) introducing a certificate regularization and (ii) relaxing the orthogonalization of the last layer of the network. Our experiments on CIFAR-10 and CIFAR-100 show that our regularized networks with $\mathrm{HH}$ activations lead to significant improvements in both standard and provable robust accuracy over prior work (gains of 3.65% and 4.46% on CIFAR-100, respectively).
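Since the abstract gives the activation in closed form, a short PyTorch sketch of the vector-valued case may help. In the paper the transformation is applied with learnable $\mathbf{v}$ inside CNN layers; the `householder_activation` function below is only a minimal illustration of the formula.

```python
import torch

def householder_activation(z, v):
    """Householder (HH) activation as defined above: for unit vector v,
    output (I - 2 v v^T) z when v^T z <= 0, and z otherwise. Both branches
    have an orthogonal Jacobian, so the map is gradient norm preserving.
    Shapes: z is (batch, d), v is (d,)."""
    v = v / v.norm()                               # keep the reflection orthogonal
    proj = z @ v                                   # v^T z per example, shape (batch,)
    reflected = z - 2.0 * proj.unsqueeze(1) * v    # (I - 2 v v^T) z
    return torch.where(proj.unsqueeze(1) <= 0, reflected, z)
```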
In this technical report, we evaluate the adversarial robustness of a very recent method called Geometry-aware Instance-reweighted Adversarial Training (GAIRAT) [7]. GAIRAT reports state-of-the-art results on defenses against adversarial attacks on the CIFAR-10 dataset. We find that a network trained with this method, while showing an improvement over regular adversarial training (AT), biases the model towards certain samples by re-scaling the loss. This leads the model to be susceptible to attacks that scale the logits. The original model shows an accuracy of 59% under AutoAttack when trained with additional data with pseudo-labels. We provide an analysis that shows the opposite. In particular, we craft a PGD attack that multiplies the logits by a positive scalar and decreases the GAIRAT accuracy from 55% to 44% when trained solely on CIFAR-10. In this report, we rigorously evaluate the model and provide insights into the reasons behind the vulnerability of GAIRAT to this adversarial attack. The code to reproduce our evaluation is made available at https://github.com/giuxhub/GAIRAT-LSA
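A hedged PyTorch sketch of the logit-scaling idea: run a standard PGD attack, but multiply the logits by a positive scalar before the cross-entropy loss so that margin-dependent loss re-weighting no longer dampens the attack gradient. The function name and hyperparameters below are illustrative; the exact evaluation code is in the linked repository.

```python
import torch
import torch.nn.functional as F

def logit_scaling_pgd(model, x, y, scale=10.0, eps=8/255, alpha=2/255, steps=20):
    """PGD with logits multiplied by a positive scalar before the loss
    (illustrative sketch; `scale`, `eps`, `alpha`, `steps` are placeholders)."""
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(scale * model(x_adv), y)   # scaled logits
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project to the eps-ball
            x_adv = x_adv.clamp(0, 1)
    return x_adv.detach()
```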