No Arabic abstract
Modern machine learning models are often so complex that they achieve vanishing classification error on the training set. Max-margin linear classifiers are among the simplest classification methods that have zero training error (with linearly separable data). Despite their simplicity, their high-dimensional behavior is not yet completely understood. We assume to be given i.i.d. data $(y_i,{boldsymbol x}_i)$, $ile n$ with ${boldsymbol x}_isim {sf N}(0,{boldsymbol Sigma})$ a $p$-dimensional feature vector, and $y_i in{+1,-1}$ a label whose distribution depends on a linear combination of the covariates $langle{boldsymboltheta}_*,{boldsymbol x}_irangle$. We consider the proportional asymptotics $n,ptoinfty$ with $p/nto psi$, and derive exact expressions for the limiting prediction error. Our asymptotic results match simulations already when $n,p$ are of the order of a few hundreds. We explore several choices for $({boldsymbol theta}_*,{boldsymbol Sigma})$, and show that the resulting generalization curve (test error error as a function of the overparametrization $psi=p/n$) is qualitatively different, depending on this choice. In particular we consider a specific structure of $({boldsymbol theta}_*,{boldsymbolSigma})$ that captures the behavior of nonlinear random feature models or, equivalently, two-layers neural networks with random first layer weights. In this case, we aim at classifying data $(y_i,{boldsymbol x}_i)$ with ${boldsymbol x}_iin{mathbb R}^d$ but we do so by first embedding them a $p$ dimensional feature space via ${boldsymbol x}_imapstosigma({boldsymbol W}{boldsymbol x}_i)$ and then finding a max-margin classifier in this space. We derive exact formulas in the proportional asymptotics $p,n,dtoinfty$ with $p/dtopsi_1$, $n/dtopsi_2$ and observe that the test error is minimized in the highly overparametrized regime $psi_1gg 0$.
Consider the classical supervised learning problem: we are given data $(y_i,{boldsymbol x}_i)$, $ile n$, with $y_i$ a response and ${boldsymbol x}_iin {mathcal X}$ a covariates vector, and try to learn a model $f:{mathcal X}to{mathbb R}$ to predict future responses. Random features methods map the covariates vector ${boldsymbol x}_i$ to a point ${boldsymbol phi}({boldsymbol x}_i)$ in a higher dimensional space ${mathbb R}^N$, via a random featurization map ${boldsymbol phi}$. We study the use of random features methods in conjunction with ridge regression in the feature space ${mathbb R}^N$. This can be viewed as a finite-dimensional approximation of kernel ridge regression (KRR), or as a stylized model for neural networks in the so called lazy training regime. We define a class of problems satisfying certain spectral conditions on the underlying kernels, and a hypercontractivity assumption on the associated eigenfunctions. These conditions are verified by classical high-dimensional examples. Under these conditions, we prove a sharp characterization of the error of random features ridge regression. In particular, we address two fundamental questions: $(1)$~What is the generalization error of KRR? $(2)$~How big $N$ should be for the random features approximation to achieve the same error as KRR? In this setting, we prove that KRR is well approximated by a projection onto the top $ell$ eigenfunctions of the kernel, where $ell$ depends on the sample size $n$. We show that the test error of random features ridge regression is dominated by its approximation error and is larger than the error of KRR as long as $Nle n^{1-delta}$ for some $delta>0$. We characterize this gap. For $Nge n^{1+delta}$, random features achieve the same error as the corresponding KRR, and further increasing $N$ does not lead to a significant change in test error.
The foundational concept of Max-Margin in machine learning is ill-posed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the Max-Margin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for Max-Margin beyond the binary setting. We finally address these limitations by correcting the concept of Max-Margin and introducing the Restricted-Max-Margin, where the maximization of the loss-augmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss.
This paper explores the information-theoretic limitations of graph property testing in zero-field Ising models. Instead of learning the entire graph structure, sometimes testing a basic graph property such as connectivity, cycle presence or maximum clique size is a more relevant and attainable objective. Since property testing is more fundamental than graph recovery, any necessary conditions for property testing imply corresponding conditions for graph recovery, while custom property tests can be statistically and/or computationally more efficient than graph recovery based algorithms. Understanding the statistical complexity of property testing requires the distinction of ferromagnetic (i.e., positive interactions only) and general Ising models. Using combinatorial constructs such as graph packing and strong monotonicity, we characterize how target properties affect the corresponding minimax upper and lower bounds within the realm of ferromagnets. On the other hand, by studying the detection of an antiferromagnetic (i.e., negative interactions only) Curie-Weiss model buried in Rademacher noise, we show that property testing is strictly more challenging over general Ising models. In terms of methodological development, we propose two types of correlation based tests: computationally efficient screening for ferromagnets, and score type tests for general models, including a fast cycle presence test. Our correlation screening tests match the information-theoretic bounds for property testing in ferromagnets.
We consider the problem of cost sensitive multiclass classification, where we would like to increase the sensitivity of an important class at the expense of a less important one. We adopt an {em apportioned margin} framework to address this problem, which enables an efficient margin shift between classes that share the same boundary. The decision boundary between all pairs of classes divides the margin between them in accordance to a given prioritization vector, which yields a tighter error bound for the important classes while also reducing the overall out-of-sample error. In addition to demonstrating an efficient implementation of our framework, we derive generalization bounds, demonstrate Fisher consistency, adapt the framework to Mercers kernel and to neural networks, and report promising empirical results on all accounts.
We study high-dimensional linear models with error-in-variables. Such models are motivated by various applications in econometrics, finance and genetics. These models are challenging because of the need to account for measurement errors to avoid non-vanishing biases in addition to handle the high dimensionality of the parameters. A recent growing literature has proposed various estimators that achieve good rates of convergence. Our main contribution complements this literature with the construction of simultaneous confidence regions for the parameters of interest in such high-dimensional linear models with error-in-variables. These confidence regions are based on the construction of moment conditions that have an additional orthogonal property with respect to nuisance parameters. We provide a construction that requires us to estimate an additional high-dimensional linear model with error-in-variables for each component of interest. We use a multiplier bootstrap to compute critical values for simultaneous confidence intervals for a subset $S$ of the components. We show its validity despite of possible model selection mistakes, and allowing for the cardinality of $S$ to be larger than the sample size. We apply and discuss the implications of our results to two examples and conduct Monte Carlo simulations to illustrate the performance of the proposed procedure.