Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?

221 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Dominic Richards

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Dominic Richards - Edgar Dobriban - Patrick Rebeschini

نظرية الإحصاء التعلم الآلي التحسين والتحكم

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Modern methods for learning from data depend on many tuning parameters, such as the stepsize for optimization methods, and the regularization strength for regularized learning methods. Since performance can depend strongly on these parameters, it is important to develop comparisons between emph{classes of methods}, not just for particularly tuned ones. Here, we take aim to compare classes of estimators via the relative performance of the emph{best method in the class}. This allows us to rigorously quantify the tuning sensitivity of learning algorithms. As an illustration, we investigate the statistical estimation performance of ridge regression with a uniform grid of regularization parameters, and of gradient descent iterates with a fixed stepsize, in the standard linear model with a random isotropic ground truth parameter. (1) For orthogonal designs, we find the emph{exact minimax optimal classes of estimators}, showing they are equal to gradient descent with a polynomially decaying learning rate. We find the exact suboptimalities of ridge regression and gradient descent with a fixed stepsize, showing that they decay as either $1/k$ or $1/k^2$ for specific ranges of $k$ estimators. (2) For general designs with a large number of non-zero eigenvalues, we find that gradient descent outperforms ridge regression when the eigenvalues decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we find that ridge regression outperforms gradient descent. Our results highlight the importance of tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our case, it can be outperformed by gradient descent when both are restricted to being tuned over a finite regularization grid.

قيم البحث

205 - Lam M. Nguyen , Nam H. Nguyen , Dzung T. Phan 2018

In this paper, we consider a general stochastic optimization problem which is often at the core of supervised learning, such as deep learning and linear classification. We consider a standard stochastic gradient descent (SGD) method with a fixed, lar ge step size and propose a novel assumption on the objective function, under which this method has the improved convergence rates (to a neighborhood of the optimal solutions). We then empirically demonstrate that these assumptions hold for logistic regression and standard deep neural networks on classical data sets. Thus our analysis helps to explain when efficient behavior can be expected from the SGD method in training classification models and deep neural networks.

التعلم الالي التعلم الآلي التحسين والتحكم

When Does More Regularization Imply Fewer Degrees of Freedom? Sufficient Conditions and Counter Examples from Lasso and Ridge Regression

390 - Shachar Kaufman , Saharon Rosset 2013

Regularization aims to improve prediction performance of a given statistical modeling approach by moving to a second approach which achieves worse training error but is expected to have fewer degrees of freedom, i.e., better agreement between trainin g and prediction error. We show here, however, that this expected behavior does not hold in general. In fact, counter examples are given that show regularization can increase the degrees of freedom in simple situations, including lasso and ridge regression, which are the most common regularization approaches in use. In such situations, the regularization increases both training error and degrees of freedom, and is thus inherently without merit. On the other hand, two important regularization scenarios are described where the expected reduction in degrees of freedom is indeed guaranteed: (a) all symmetric linear smoothers, and (b) linear regression versus convex constrained linear regression (as in the constrained variant of ridge regression and lasso).

نظرية الإحصاء التعلم الالي نظرية الإحصاء

Asymptotics of Ridge Regression in Convolutional Models

108 - Mojtaba Sahraee-Ardakan , Tung Mai , Anup Rao 2021

Understanding generalization and estimation error of estimators for simple models such as linear and generalized linear models has attracted a lot of attention recently. This is in part due to an interesting observation made in machine learning commu nity that highly over-parameterized neural networks achieve zero training error, and yet they are able to generalize well over the test samples. This phenomenon is captured by the so called double descent curve, where the generalization error starts decreasing again after the interpolation threshold. A series of recent works tried to explain such phenomenon for simple models. In this work, we analyze the asymptotics of estimation error in ridge estimators for convolutional linear models. These convolutional inverse problems, also known as deconvolution, naturally arise in different fields such as seismology, imaging, and acoustics among others. Our results hold for a large class of input distributions that include i.i.d. features as a special case. We derive exact formulae for estimation error of ridge estimators that hold in a certain high-dimensional regime. We show the double descent phenomenon in our experiments for convolutional models and show that our theoretical results match the experiments.

التعلم الالي التعلم الآلي

A New Perspective on Boosting in Linear Regression via Subgradient Optimization and Relatives

347 - Robert M. Freund , Paul Grigas , Rahul Mazumder 2015

In this paper we analyze boosting algorithms in linear regression from a new perspective: that of modern first-order methods in convex optimization. We show that classic boosting algorithms in linear regression, namely the incremental forward stagewi se algorithm (FS$_varepsilon$) and least squares boosting (LS-Boost($varepsilon$)), can be viewed as subgradient descent to minimize the loss function defined as the maximum absolute correlation between the features and residuals. We also propose a modification of FS$_varepsilon$ that yields an algorithm for the Lasso, and that may be easily extended to an algorithm that computes the Lasso path for different values of the regularization parameter. Furthermore, we show that these new algorithms for the Lasso may also be interpreted as the same master algorithm (subgradient descent), applied to a regularized version of the maximum absolute correlation loss function. We derive novel, comprehensive computational guarantees for several boosting algorithms in linear regression (including LS-Boost($varepsilon$) and FS$_varepsilon$) by using techniques of modern first-order methods in convex optimization. Our computational guarantees inform us about the statistical properties of boosting algorithms. In particular they provide, for the first time, a precise theoretical description of the amount of data-fidelity and regularization imparted by running a boosting algorithm with a prespecified learning rate for a fixed but arbitrary number of iterations, for any dataset.

نظرية الإحصاء التعلم الآلي التحسين والتحكم

Asymptotic properties of bridge estimators in sparse high-dimensional regression models

317 - Jian Huang , Joel L. Horowitz , Shuangge Ma 2008

We study the asymptotic properties of bridge estimators in sparse, high-dimensional, linear regression models when the number of covariates may increase to infinity with the sample size. We are particularly interested in the use of bridge estimators to distinguish between covariates whose coefficients are zero and covariates whose coefficients are nonzero. We show that under appropriate conditions, bridge estimators correctly select covariates with nonzero coefficients with probability converging to one and that the estimators of nonzero coefficients have the same asymptotic distribution that they would have if the zero coefficients were known in advance. Thus, bridge estimators have an oracle property in the sense of Fan and Li [J. Amer. Statist. Assoc. 96 (2001) 1348--1360] and Fan and Peng [Ann. Statist. 32 (2004) 928--961]. In general, the oracle property holds only if the number of covariates is smaller than the sample size. However, under a partial orthogonality condition in which the covariates of the zero coefficients are uncorrelated or weakly correlated with the covariates of nonzero coefficients, we show that marginal bridge estimators can correctly distinguish between covariates with nonzero and zero coefficients with probability converging to one even when the number of covariates is greater than the sample size.

نظرية الإحصاء نظرية الإحصاء