
On the proliferation of support vectors in high dimensions

Published by: Daniel Hsu
Publication date: 2020
Research field: Informatics Engineering
Paper language: English





The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.
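The phenomenon is easy to observe numerically. Below is a minimal sketch (not code from the paper) that fits a near-hard-margin linear SVM to isotropic Gaussian data with a planted linear labeling rule, and counts support vectors as the dimension grows past the sample size. It assumes numpy and scikit-learn are available; all variable names and constants are illustrative.

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 50  # number of training examples

for d in [2, 50, 500, 5000]:  # increasing ambient dimension
    X = rng.standard_normal((n, d))   # isotropic Gaussian features
    w = rng.standard_normal(d)        # planted linear classifier
    y = np.sign(X @ w)                # labels from a linear rule
    clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C mimics hard margin
    print(f"d={d:5d}  support vectors: {len(clf.support_)}/{n}")

For small d only a few points sit on the margin, while for d much larger than n the count typically climbs to all n points, which is exactly the proliferation regime the paper characterizes.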




Read also

Robust methods, though ubiquitous in practice, are yet to be fully understood in the context of regularized estimation and high dimensions. Even simple questions become challenging very quickly. For example, classical statistical theory identifies equivalence between model-averaged and composite quantile estimation. However, little to nothing is known about such equivalence between methods that encourage sparsity. This paper provides a toolbox to further study robustness in these settings and focuses on prediction. In particular, we study optimally weighted model-averaged as well as composite $l_1$-regularized estimation. Optimal weights are determined by minimizing the asymptotic mean squared error. This approach incorporates the effects of regularization, without the assumption of perfect selection that is often made in practice. Such weights are then optimal for prediction quality. Through an extensive simulation study, we show that no single method systematically outperforms others. We find, however, that model-averaged and composite quantile estimators often outperform least-squares methods, even in the case of Gaussian model noise. A real-data application demonstrates the methods' practical use through the reconstruction of compressed audio signals.
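As a concrete, simplified illustration of the estimators being compared, the sketch below fits $l_1$-regularized quantile regressions at several quantile levels and averages them with uniform weights; the paper instead derives weights that minimize the asymptotic mean squared error, which is not replicated here. It assumes scikit-learn >= 1.0, and all names and constants are illustrative.

import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(1)
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.5, 1.0]            # sparse ground truth
y = X @ beta + rng.standard_normal(n)  # Gaussian model noise

taus = [0.25, 0.5, 0.75]                        # quantile levels
weights = np.full(len(taus), 1.0 / len(taus))   # uniform, not the optimal weights
coefs = [QuantileRegressor(quantile=t, alpha=0.01, solver="highs").fit(X, y).coef_
         for t in taus]
beta_avg = np.average(coefs, axis=0, weights=weights)  # model-averaged estimate
print(np.round(beta_avg, 2))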
297 - Xinjia Chen, 2011
In this article, we derive a new generalization of Chebyshev inequality for random vectors. We demonstrate that the new generalization is much less conservative than the classical generalization.
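The abstract does not state the new bound, but the classical inequality it improves on is easy to check numerically: for a random vector $X$ with mean $\mu$ and covariance $\Sigma$, $P(\|X - \mu\| \ge \epsilon) \le \mathrm{tr}(\Sigma)/\epsilon^2$. The Monte Carlo sketch below (illustrative, not from the paper) shows how loose that classical bound can be, which is precisely the conservativeness the paper targets.

import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 0.5]])
X = rng.multivariate_normal(mu, Sigma, size=200_000)

for eps in [2.0, 3.0, 4.0]:
    empirical = np.mean(np.linalg.norm(X - mu, axis=1) >= eps)
    bound = np.trace(Sigma) / eps**2  # classical multivariate Chebyshev bound
    print(f"eps={eps}: P(||X - mu|| >= eps) = {empirical:.4f} <= {bound:.4f}")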
57 - Ji Xu, Daniel Hsu, 2019
We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p > n$, the estimator is the least $\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p, n, N \to \infty$ with $p/N \to \alpha$ and $n/N \to \beta$, for some constants $\alpha \in [0,1]$ and $\beta \in (0,1)$. In this average-case setting, the prediction error exhibits a double descent shape as a function of $p$. We also establish conditions under which the minimum risk is achieved in the interpolating ($p > n$) regime.
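A minimal simulation of this estimator (illustrative, not the paper's analysis) is short to write: order the uncorrelated features by variance, keep the first $p$, and take the minimum-norm least squares fit, which numpy's pseudoinverse yields in both the $p \le n$ and $p > n$ regimes. All constants below are arbitrary.

import numpy as np

rng = np.random.default_rng(3)
N, n, n_test = 400, 100, 2000
variances = np.linspace(2.0, 0.1, N)        # decreasing feature variances
beta = rng.standard_normal(N) / np.sqrt(N)  # true coefficients
X = rng.standard_normal((n, N)) * np.sqrt(variances)
Xt = rng.standard_normal((n_test, N)) * np.sqrt(variances)
y = X @ beta + 0.5 * rng.standard_normal(n)
yt = Xt @ beta + 0.5 * rng.standard_normal(n_test)

for p in [20, 60, 100, 140, 300]:           # spans p <= n and p > n
    b = np.linalg.pinv(X[:, :p]) @ y        # min-norm least squares on top-p features
    err = np.mean((Xt[:, :p] @ b - yt) ** 2)
    print(f"p={p:3d}  test MSE = {err:.3f}")

Since the features are uncorrelated, the $p \le n$ fits coincide with principal component regression, as the abstract notes, and scanning $p$ across the interpolation threshold typically traces the double descent shape.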
In high-dimensional linear regression, would increasing effect sizes always improve model selection, while maintaining all the other conditions unchanged (especially fixing the sparsity of regression coefficients)? In this paper, we answer this question in the \textit{negative} in the regime of linear sparsity for the Lasso method, by introducing a new notion we term effect size heterogeneity. Roughly speaking, a regression coefficient vector has high effect size heterogeneity if its nonzero entries have significantly different magnitudes. From the viewpoint of this new measure, we prove that the false and true positive rates achieve the optimal trade-off uniformly along the Lasso path when this measure is maximal in a certain sense, and the worst trade-off is achieved when it is minimal in the sense that all nonzero effect sizes are roughly equal. Moreover, we demonstrate that the first false selection occurs much earlier when effect size heterogeneity is minimal than when it is maximal. The underlying cause of these two phenomena is, metaphorically speaking, the competition among variables with effect sizes of the same magnitude in entering the model. Taken together, our findings suggest that effect size heterogeneity should serve as an important complementary measure to the sparsity of regression coefficients in the analysis of high-dimensional regression problems. Our proofs use techniques from approximate message passing theory as well as a novel technique for estimating the rank of the first false variable.
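The headline contrast can be probed with a small simulation (a qualitative sketch only; the paper's results are asymptotic in the linear-sparsity regime, so small instances are noisy). Using scikit-learn's lasso_path, the snippet below counts how many true variables are active when the first false variable enters the path, for equal versus strongly heterogeneous effect sizes; names and constants are illustrative.

import numpy as np
from sklearn.linear_model import lasso_path

def true_before_first_false(X, y, support):
    # Walk the Lasso path from large to small penalty; report how many
    # true variables are active when the first false variable enters.
    _, coefs, _ = lasso_path(X, y, n_alphas=300)
    for k in range(coefs.shape[1]):
        active = coefs[:, k] != 0
        if np.any(active & ~support):
            return int(np.sum(active & support))
    return int(np.sum(support))

rng = np.random.default_rng(4)
n, p, s = 250, 500, 20
X = rng.standard_normal((n, p))

for label, sizes in [("equal effects        ", np.full(s, 3.0)),
                     ("heterogeneous effects", 3.0 * 0.7 ** np.arange(s))]:
    beta = np.zeros(p)
    beta[:s] = sizes
    y = X @ beta + rng.standard_normal(n)
    print(label, "-> true discoveries before the first false one:",
          true_before_first_false(X, y, beta != 0))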
Interpolators -- estimators that achieve zero training error -- have attracted growing attention in machine learning, mainly because state-of-the-art neural networks appear to be models of this type. In this paper, we study minimum $\ell_2$ norm (``ridgeless'') interpolation in high-dimensional least squares regression. We consider two different models for the feature distribution: a linear model, where the feature vectors $x_i \in \mathbb{R}^p$ are obtained by applying a linear transform to a vector of i.i.d. entries, $x_i = \Sigma^{1/2} z_i$ (with $z_i \in \mathbb{R}^p$); and a nonlinear model, where the feature vectors are obtained by passing the input through a random one-layer neural network, $x_i = \varphi(W z_i)$ (with $z_i \in \mathbb{R}^d$, $W \in \mathbb{R}^{p \times d}$ a matrix of i.i.d. entries, and $\varphi$ an activation function acting componentwise on $W z_i$). We recover -- in a precise quantitative way -- several phenomena that have been observed in large-scale neural networks and kernel machines, including the double descent behavior of the prediction risk and the potential benefits of overparametrization.
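For the nonlinear model, a compact random-features simulation (an illustrative sketch, not the paper's calculation) already exhibits the behavior the abstract describes: the test risk of the minimum-norm fit typically peaks near $p = n$ and falls again as $p$ grows. Here $\varphi = \tanh$ and the target function is arbitrary; all names and constants are illustrative.

import numpy as np

rng = np.random.default_rng(5)
n, n_test, d = 100, 2000, 30
Z = rng.standard_normal((n, d))              # training inputs
Zt = rng.standard_normal((n_test, d))        # test inputs
f = lambda z: np.sin(z[:, 0]) + z[:, 1]      # arbitrary target function
y = f(Z) + 0.3 * rng.standard_normal(n)
yt = f(Zt)                                    # noiseless test targets

for p in [30, 90, 110, 400, 1600]:           # number of random features
    W = rng.standard_normal((p, d)) / np.sqrt(d)  # i.i.d. first-layer weights
    X, Xt = np.tanh(Z @ W.T), np.tanh(Zt @ W.T)   # x = varphi(W z)
    b = np.linalg.pinv(X) @ y                # min-norm least squares fit
    risk = np.mean((Xt @ b - yt) ** 2)
    print(f"p={p:4d}  test risk = {risk:.3f}")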
