While second-order optimizers such as natural gradient descent (NGD) often speed up optimization, their effect on generalization has been called into question. This work presents a more nuanced view of how the \textit{implicit bias} of first- and second-order methods affects the comparison of their generalization properties. We provide an exact asymptotic bias-variance decomposition of the generalization error of overparameterized ridgeless regression under a general class of preconditioners $\boldsymbol{P}$, and consider the inverse population Fisher information matrix (used in NGD) as a particular example. We determine the optimal $\boldsymbol{P}$ for both the bias and the variance, and find that the relative generalization performance of different optimizers depends on the label noise and the shape of the signal (true parameters): when the labels are noisy, the model is misspecified, or the signal is misaligned with the features, NGD can achieve lower risk; conversely, GD generalizes better than NGD under clean labels, a well-specified model, or an aligned signal. Based on this analysis, we discuss several approaches to manage the bias-variance tradeoff, and the potential benefit of interpolating between GD and NGD. We then extend our analysis to regression in reproducing kernel Hilbert spaces and demonstrate that preconditioned GD can decrease the population risk faster than GD. Lastly, we empirically compare the generalization error of first- and second-order optimizers in neural network experiments, and observe robust trends matching our theoretical analysis.
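To make the role of the preconditioner concrete, the following is a minimal sketch (not the paper's code) of the implicit bias being compared: preconditioned GD from zero initialization on ridgeless least squares converges to the interpolator $\boldsymbol{P}\boldsymbol{X}^\top(\boldsymbol{X}\boldsymbol{P}\boldsymbol{X}^\top)^{-1}\boldsymbol{y}$, i.e. the minimizer of $\boldsymbol{w}^\top\boldsymbol{P}^{-1}\boldsymbol{w}$ subject to interpolation. The sketch assumes Gaussian features with a known diagonal population covariance, so the population Fisher is that covariance and the NGD preconditioner is its inverse; all variable names and the specific signal/noise settings are illustrative choices, not taken from the paper.

```python
# Illustrative sketch: implicit bias of GD vs. NGD in overparameterized ridgeless regression.
# Assumes x ~ N(0, Sigma) with Sigma known and diagonal, so the population Fisher is Sigma
# and the NGD preconditioner is P = Sigma^{-1}. Settings below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 400                        # n samples, d features (overparameterized: d > n)

eigvals = 1.0 / np.arange(1, d + 1)    # decaying eigenvalues of the population covariance
Sigma = np.diag(eigvals)

def interpolator(X, y, P):
    """Limit of preconditioned GD from zero init: argmin w^T P^{-1} w s.t. Xw = y."""
    G = X @ P @ X.T                    # n x n Gram matrix under preconditioner P
    return P @ X.T @ np.linalg.solve(G, y)

def excess_risk(w_hat, w_star, Sigma):
    """Population excess risk E[(x^T (w_hat - w_star))^2] for x ~ N(0, Sigma)."""
    diff = w_hat - w_star
    return float(diff @ Sigma @ diff)

for signal, noise_std in [("aligned", 0.0), ("misaligned", 0.5)]:
    # "Aligned" signal lives on the leading eigendirections of Sigma,
    # "misaligned" on the trailing ones; noise_std controls label noise.
    w_star = np.zeros(d)
    if signal == "aligned":
        w_star[:20] = 1.0
    else:
        w_star[-20:] = 1.0
    w_star /= np.sqrt(w_star @ Sigma @ w_star)          # normalize signal strength

    X = rng.standard_normal((n, d)) * np.sqrt(eigvals)  # rows x_i ~ N(0, Sigma)
    y = X @ w_star + noise_std * rng.standard_normal(n)

    w_gd = interpolator(X, y, np.eye(d))                # GD: min-norm interpolator (P = I)
    w_ngd = interpolator(X, y, np.diag(1.0 / eigvals))  # NGD: P = Sigma^{-1}

    print(f"{signal:>10s}, noise={noise_std}: "
          f"risk(GD)={excess_risk(w_gd, w_star, Sigma):.3f}, "
          f"risk(NGD)={excess_risk(w_ngd, w_star, Sigma):.3f}")
```

Running the sketch with different signal shapes and noise levels gives a quick empirical feel for the tradeoff described in the abstract; the exact asymptotic comparison is the subject of the paper's analysis.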