We study the statistical properties of the iterates generated by gradient descent, applied to the fundamental problem of least squares regression. We take a continuous-time view, i.e., consider infinitesimal step sizes in gradient descent, in which case the iterates form a trajectory called gradient flow. Our primary focus is to compare the risk of gradient flow to that of ridge regression. Under the calibration $t = 1/\lambda$---where $t$ is the time parameter in gradient flow, and $\lambda$ the tuning parameter in ridge regression---we prove that the risk of gradient flow is no more than 1.69 times that of ridge, along the entire path (for all $t \geq 0$). This holds in finite samples, under very weak assumptions on the data model (in particular, with no assumptions on the features $X$). We prove that the same relative risk bound holds for prediction risk, in an average sense over the underlying signal $\beta_0$. Finally, we examine limiting risk expressions (under standard Marchenko-Pastur asymptotics), and give supporting numerical experiments.
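For concreteness, the comparison described above can be reproduced in a minimal simulation sketch (not the paper's code): for a fixed design $X$, both estimators are spectral shrinkers of the least squares fit, so their exact estimation risks can be computed in closed form and compared along a grid of times $t$ under the calibration $t = 1/\lambda$. The Gaussian design, signal $\beta_0$, noise level, and problem dimensions below are illustrative assumptions.

```python
# Minimal sketch: estimation risk of gradient flow vs. ridge under t = 1/lambda.
# The data model (Gaussian X, beta_0, sigma) is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n, p, sigma = 100, 20, 1.0
X = rng.standard_normal((n, p))
beta0 = rng.standard_normal(p) / np.sqrt(p)

# Eigendecomposition of X^T X / n; both estimators shrink along these directions.
evals, V = np.linalg.eigh(X.T @ X / n)

def risk(shrinkage):
    """Exact estimation risk E||beta_hat - beta_0||^2, conditional on X,
    for i.i.d. noise with variance sigma^2, when shrinkage[i] multiplies the
    i-th eigen-direction of the least squares fit (bias^2 + variance)."""
    b = V.T @ beta0
    bias2 = np.sum(((1 - shrinkage) * b) ** 2)
    var = sigma**2 / n * np.sum(shrinkage**2 / np.maximum(evals, 1e-12))
    return bias2 + var

for t in [0.1, 1.0, 10.0, 100.0]:
    lam = 1.0 / t                      # calibration t = 1/lambda
    gf = 1 - np.exp(-evals * t)        # gradient flow shrinkage at time t
    ridge = evals / (evals + lam)      # ridge shrinkage at tuning lambda
    print(f"t = {t:6.1f}: risk(GF) = {risk(gf):.4f}, "
          f"risk(ridge) = {risk(ridge):.4f}, ratio = {risk(gf) / risk(ridge):.3f}")
```

In this setup the printed ratio stays bounded along the path, in line with the relative risk bound stated in the abstract.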