This note considers softmax parameter estimation when little or no labeled training data is available, but a priori information about the relative geometry of class label log-odds boundaries is available. It is shown that data-free softmax model synthesis corresponds to solving a linear system of parameter equations, wherein desired dominant class log-odds boundaries are encoded via convex polytopes that decompose the input feature space. When solvable, the linear equations yield closed-form softmax parameter solution families from the class boundary polytope specifications alone. This allows softmax parameter learning to be implemented without expensive brute-force data sampling and numerical optimization. The linear equations can also be adapted to constrained maximum likelihood estimation in data-sparse settings. Since solutions may fail to exist for the linear parameter equations derived from certain polytope specifications, it is also shown that there exist probabilistic classification problems over m convexly separable classes for which the log-odds boundaries cannot be learned using an m-class softmax model.
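As an illustration of the core idea, the minimal sketch below encodes two desired pairwise log-odds boundaries as linear equality constraints on softmax weights and biases and solves the resulting (underdetermined) system by least squares, returning one member of a solution family. The specific hyperplanes, the unit scale fixed on each constraint, and the three-class, two-feature setup are illustrative assumptions, not the paper's polytope-based formulation.

```python
# Minimal sketch (not the paper's exact formulation): recover softmax
# parameters whose pairwise log-odds boundaries match prescribed hyperplanes.
import numpy as np

d, m = 2, 3                                  # feature dimension, number of classes
# Desired class-(i, j) boundaries: a_ij . x + c_ij = 0 (illustrative choices)
boundaries = {
    (0, 1): (np.array([1.0, 0.0]), 0.0),     # first coordinate = 0 separates classes 0 and 1
    (1, 2): (np.array([0.0, 1.0]), -1.0),    # second coordinate = 1 separates classes 1 and 2
}

# Unknowns: theta = [w_0, b_0, ..., w_{m-1}, b_{m-1}], length m * (d + 1)
n_par = m * (d + 1)
rows, rhs = [], []
for (i, j), (a, c) in boundaries.items():
    # Softmax log-odds boundary: (w_i - w_j) . x + (b_i - b_j) = 0.
    # Enforce w_i - w_j = a and b_i - b_j = c (scale fixed to 1 for this sketch).
    for k in range(d):
        row = np.zeros(n_par)
        row[i * (d + 1) + k] = 1.0
        row[j * (d + 1) + k] = -1.0
        rows.append(row); rhs.append(a[k])
    row = np.zeros(n_par)
    row[i * (d + 1) + d] = 1.0
    row[j * (d + 1) + d] = -1.0
    rows.append(row); rhs.append(c)

A, r = np.vstack(rows), np.array(rhs)
theta, *_ = np.linalg.lstsq(A, r, rcond=None)    # one member of the solution family
W = theta.reshape(m, d + 1)[:, :d]
b = theta.reshape(m, d + 1)[:, d]

x = np.array([0.0, 0.3])                         # a point on the class-0/1 boundary
logits = W @ x + b
print(np.isclose(logits[0], logits[1]))          # True: equal log-odds on the boundary
```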
We address two shortcomings of online travel time estimation methods for congested urban traffic. The first concerns the determination of the number of mixture modes, which can change dynamically, both within a day and from day to day. The second is the widespread use of Gaussian probability densities as mixture components. Gaussian densities fail to capture the positive skew of travel time distributions, so large numbers of components are needed for reasonable fitting accuracy, and they assign positive probability to negative travel times. To address these issues, this paper derives a mixture distribution with Gamma component densities, which are asymmetric and supported on the positive numbers. We use sparse estimation techniques to ensure parsimonious models and propose a generalization of Gamma mixture densities based on Mittag-Leffler functions, which provides enhanced fitting flexibility and improved parsimony. To accommodate within-day variability and allow for online implementation of the proposed methodology (i.e., fast computations on streaming travel time data), we introduce a recursive algorithm that efficiently updates the fitted distribution whenever new data become available. Experimental results on real-world travel time data illustrate the efficacy of the proposed methods.
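The following is a minimal sketch of fitting a Gamma mixture to positively skewed travel times, using batch EM with a method-of-moments M-step. The synthetic data, two-component setup, and update rule are illustrative assumptions; the paper's sparse estimation, recursive online updating, and Mittag-Leffler generalization are not reproduced here.

```python
# Simplified batch EM for a two-component Gamma mixture (method-of-moments M-step).
import numpy as np
from scipy.stats import gamma

rng = np.random.default_rng(0)
# Synthetic "travel times": two positively skewed regimes (free-flow, congested)
x = np.concatenate([rng.gamma(4.0, 2.0, 700), rng.gamma(9.0, 3.0, 300)])

K = 2
pi = np.full(K, 1.0 / K)                     # mixture weights
shape = np.array([2.0, 8.0])                 # initial Gamma shapes
scale = np.array([1.0, 4.0])                 # initial Gamma scales

for _ in range(200):
    # E-step: responsibility of each component for each observation
    dens = np.stack([pi[k] * gamma.pdf(x, a=shape[k], scale=scale[k]) for k in range(K)])
    resp = dens / dens.sum(axis=0, keepdims=True)
    # M-step: weighted method-of-moments updates (shape = mean^2/var, scale = var/mean)
    for k in range(K):
        w = resp[k] / resp[k].sum()
        mu = np.sum(w * x)
        var = np.sum(w * (x - mu) ** 2)
        shape[k], scale[k] = mu ** 2 / var, var / mu
    pi = resp.mean(axis=1)

print("weights:", np.round(pi, 3), "shapes:", np.round(shape, 2), "scales:", np.round(scale, 2))
```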
The Gumbel-Max trick is the basis of many relaxed gradient estimators. These estimators are easy to implement and have low variance, but the goal of scaling them to large combinatorial distributions is still outstanding. Working within the perturbation model framework, we introduce stochastic softmax tricks, which generalize the Gumbel-Softmax trick to combinatorial spaces. Our framework provides a unified perspective on existing relaxed estimators for perturbation models, and it contains many novel relaxations. We design structured relaxations for subset selection, spanning trees, arborescences, and other combinatorial structures. When compared to less structured baselines, we find that stochastic softmax tricks can be used to train latent variable models that perform better and discover more latent structure.
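The sketch below shows the unstructured case that this framework generalizes: a Gumbel-Softmax relaxation of a single categorical variable, where Gumbel-perturbed logits pass through a temperature-controlled softmax instead of an argmax. The temperature and logits are illustrative; the structured relaxations for subsets, spanning trees, and arborescences are not shown.

```python
# Minimal Gumbel-Softmax relaxation of a categorical sample.
import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    """Return a relaxed one-hot vector by perturbing logits with Gumbel noise."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())                                 # numerically stable softmax
    return y / y.sum()

logits = np.array([1.0, 2.0, 0.5])
soft_sample = gumbel_softmax(logits)     # differentiable relaxation
hard_sample = np.argmax(soft_sample)     # Gumbel-Max sample from the same perturbation
print(soft_sample, hard_sample)
```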
This paper proposes a fast and accurate method for sparse regression in the presence of missing data. The underlying statistical model encapsulates the low-dimensional structure of the incomplete data matrix and the sparsity of the regression coefficients, and the proposed algorithm jointly learns the low-dimensional structure of the data and a linear regressor with sparse coefficients. The proposed stochastic optimization method, Sparse Linear Regression with Missing Data (SLRM), performs an alternating minimization procedure and scales well with the problem size. Large deviation inequalities shed light on the impact of the various problem-dependent parameters on the expected squared loss of the learned regressor. Extensive simulations on both synthetic and real datasets show that SLRM performs better than competing algorithms in a variety of contexts.
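As a rough sketch of the alternating idea (not the paper's SLRM algorithm, which is stochastic), the code below alternates a rank-r reconstruction of the incomplete design matrix with a lasso fit of the regression coefficients. The synthetic approximately low-rank data, masking rate, rank, and regularization strength are all illustrative assumptions.

```python
# Simplified batch alternation: low-rank imputation of missing entries + sparse regression.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d, r = 200, 50, 5
U, V = rng.normal(size=(n, r)), rng.normal(size=(r, d))
X_true = U @ V + 0.1 * rng.normal(size=(n, d))       # approximately low-rank design
beta_true = np.zeros(d); beta_true[:5] = 1.0         # sparse coefficients
y = X_true @ beta_true + 0.01 * rng.normal(size=n)

mask = rng.uniform(size=(n, d)) < 0.7                # observed entries
X_obs = np.where(mask, X_true, 0.0)

X_hat = X_obs.copy()
for _ in range(10):
    # (a) low-rank step: project the current imputation onto rank r
    Uk, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
    X_low = (Uk[:, :r] * s[:r]) @ Vt[:r]
    X_hat = np.where(mask, X_obs, X_low)             # keep observed entries fixed
    # (b) sparse regression step on the completed matrix
    model = Lasso(alpha=0.01).fit(X_hat, y)

# Compare with the true support {0, ..., 4}
print("five largest |coefficients| at indices:", np.argsort(-np.abs(model.coef_))[:5])
```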
We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.
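A minimal sketch of self-training in the Gaussian model mentioned above, under illustrative dimensions, noise level, and perturbation radius: a linear classifier estimated from a few labels pseudo-labels a large unlabeled pool, and the re-estimated classifier is evaluated for standard and $\ell_\infty$-robust accuracy (for a linear classifier, robust correctness reduces to a margin condition involving $\|w\|_1$).

```python
# Self-training sketch in the two-class Gaussian model x ~ N(y * mu, sigma^2 I), y in {-1, +1}.
import numpy as np

rng = np.random.default_rng(0)
d, sigma, eps = 100, 1.0, 0.05                   # illustrative settings
mu = 2.0 * np.ones(d) / np.sqrt(d)               # class mean direction

def sample(n):
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * mu + sigma * rng.normal(size=(n, d))
    return x, y

x_lab, y_lab = sample(20)                        # scarce labels
x_unl, _ = sample(5000)                          # plentiful unlabeled data
x_test, y_test = sample(2000)

w_sup = (y_lab[:, None] * x_lab).mean(axis=0)    # supervised estimate of mu
pseudo = np.sign(x_unl @ w_sup)                  # pseudo-label the unlabeled pool
w_self = (pseudo[:, None] * x_unl).mean(axis=0)  # self-trained estimate

for name, w in [("labels only", w_sup), ("self-trained", w_self)]:
    std_acc = np.mean(np.sign(x_test @ w) == y_test)
    # An l_inf perturbation of size eps shifts a linear score by at most eps * ||w||_1.
    rob_acc = np.mean(y_test * (x_test @ w) > eps * np.abs(w).sum())
    print(f"{name}: standard {std_acc:.3f}, robust {rob_acc:.3f}")
```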
Model Stealing (MS) attacks allow an adversary with black-box access to a Machine Learning model to replicate its functionality, compromising the confidentiality of the model. Such attacks train a clone model by using the predictions of the target model for different inputs. The effectiveness of such attacks relies heavily on the availability of data suitable for querying the target model. Existing attacks either assume partial access to the dataset of the target model or the availability of an alternate dataset with semantic similarities. This paper proposes MAZE -- a data-free model stealing attack using zeroth-order gradient estimation. In contrast to prior works, MAZE does not require any data and instead creates synthetic data using a generative model. Inspired by recent works in data-free Knowledge Distillation (KD), we train the generative model with a disagreement objective to produce inputs that maximize the disagreement between the clone and the target model. However, unlike the white-box setting of KD, where gradient information is available, training a generator for model stealing requires black-box optimization, since it involves querying the target model under attack. MAZE relies on zeroth-order gradient estimation to perform this optimization, enabling a highly accurate MS attack. Our evaluation on four datasets shows that MAZE provides a normalized clone accuracy in the range of 0.91x to 0.99x and outperforms even recent attacks that rely on partial data (JBDA, clone accuracy 0.13x to 0.69x) or surrogate data (KnockoffNets, clone accuracy 0.52x to 0.97x). We also study an extension of MAZE in the partial-data setting and develop MAZE-PD, which generates synthetic data closer to the target distribution. MAZE-PD further improves the clone accuracy (0.97x to 1.0x) and reduces the queries required for the attack by 2x-24x.
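The sketch below isolates the kind of zeroth-order gradient estimator MAZE relies on to optimize the generator through the black-box target: forward differences of the objective along random directions. The quadratic toy objective stands in for the clone/target disagreement loss; the number of directions, smoothing parameter, and step size are illustrative assumptions.

```python
# Zeroth-order (forward-difference) gradient estimation for a black-box objective.
import numpy as np

rng = np.random.default_rng(0)

def black_box_loss(theta):
    """Toy stand-in for an objective that can only be queried, not differentiated."""
    return np.sum((theta - 3.0) ** 2)

def zeroth_order_grad(loss, theta, n_dirs=50, mu=1e-3):
    """Estimate the gradient from forward differences along random Gaussian directions."""
    base = loss(theta)
    grad = np.zeros_like(theta)
    for _ in range(n_dirs):
        u = rng.normal(size=theta.shape)
        grad += (loss(theta + mu * u) - base) / mu * u
    return grad / n_dirs

theta = np.zeros(10)                       # e.g. generator parameters
for _ in range(200):
    theta -= 0.05 * zeroth_order_grad(black_box_loss, theta)

print(black_box_loss(theta))               # close to 0: theta has moved near the optimum
```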