Variance Based Samples Weighting for Supervised Deep Learning


Published by: Paul Novello

Publication date: 2021

Research field: Mathematical Statistics, Informatics Engineering

Paper language: English

Author: Paul Novello

Topics: Machine Learning, Statistics Theory


In the context of supervised learning of a function by a Neural Network (NN), we claim and empirically justify that a NN yields better results when the distribution of the data set focuses on regions where the function to learn is steeper. We first translate this assumption into a mathematically workable form using a Taylor expansion. Theoretical derivations then allow us to construct a methodology that we call Variance Based Samples Weighting (VBSW). VBSW uses the local variance of the labels to weight the training points. This methodology is general, scalable, cost-effective, and significantly increases the performance of a large class of NNs on various classification and regression tasks involving image, text and multivariate data. We highlight its benefits with experiments involving NNs ranging from shallow linear NNs to ResNet and BERT.
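The weighting idea can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how neighbourhoods are built or how the weights are normalised, so the brute-force nearest-neighbour search, the parameter `k`, the `1 + ...` rescaling, and the function names below are hypothetical choices, not the authors' implementation.

```python
import numpy as np

def local_label_variance_weights(X, y, k=10):
    """Weight each sample by the variance of the labels of its k nearest neighbours.

    Assumption: scalar labels and Euclidean neighbourhoods.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Pairwise squared Euclidean distances (O(n^2) memory, fine for a small demo).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k closest points to each sample (the sample itself included).
    neighbours = np.argsort(sq_dists, axis=1)[:, :k]
    # Variance of the labels inside each neighbourhood.
    local_var = y[neighbours].var(axis=1)
    # Assumed normalisation: shift by 1 so flat regions keep a non-zero weight,
    # then rescale so that the weights average to 1.
    weights = 1.0 + local_var / (local_var.mean() + 1e-12)
    return weights / weights.mean()

def weighted_mse(pred, y, weights):
    """Mean squared error in which each sample's error is scaled by its weight."""
    return np.mean(weights * (np.asarray(pred) - np.asarray(y)) ** 2)
```

On a one-dimensional target such as `tanh(20 * x)`, samples near `x = 0`, where the function is steep, receive larger weights than samples in the flat tails, which is the behaviour the abstract describes.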


132 - Miguel del Alamo 2021

We consider ill-posed inverse problems where the forward operator $T$ is unknown, and instead we have access to training data consisting of functions $f_i$ and their noisy images $Tf_i$. This is a practically relevant and challenging problem which current methods are able to solve only under strong assumptions on the training set. Here we propose a new method that requires minimal assumptions on the data, and prove reconstruction rates that depend on the number of training points and the noise level. We show that, in the regime of many training data, the method is minimax optimal. The proposed method employs a type of convolutional neural network (U-nets) and empirical risk minimization in order to fit the unknown operator. In a nutshell, our approach is based on two ideas: the first is to relate U-nets to multiscale decompositions such as wavelets, thereby linking them to the existing theory, and the second is to use the hierarchical structure of U-nets and the low number of parameters of convolutional neural nets to prove entropy bounds that are practically useful. A significant difference from existing work on neural networks in nonparametric statistics is that we use them to approximate operators and not functions, which we argue is mathematically more natural and technically more convenient.

Topics: Machine Learning, Statistics Theory

Metric learning by Similarity Network for Deep Semi-Supervised Learning

186 - Sanyou Wu, Xingdong Feng, Fan Zhou 2020

Deep semi-supervised learning has been widely implemented in the real world due to the rapid development of deep learning. Recently, attention has shifted to approaches such as Mean-Teacher that penalize the inconsistency between two perturbed input sets. Although these methods may achieve positive results, they ignore the relationship information between data instances. To solve this problem, we propose a novel method named Metric Learning by Similarity Network (MLSN), which aims to learn a distance metric adaptively on different domains. By co-training with the classification network, the similarity network can learn more information about pairwise relationships and performs better on some empirical tasks than state-of-the-art methods.

Topics: Machine Learning

On the consistency of supervised learning with missing values

129 - Julie Josse, Nicolas Prost (CMAP 2019

In many application settings, the data have missing entries which make analysis challenging. An abundant literature addresses missing values in an inferential framework: estimating parameters and their variance from incomplete tables. Here, we consider supervised-learning settings: predicting a target when missing values appear in both training and testing data. We show the consistency of two approaches in prediction. A striking result is that the widely used method of imputing with a constant, such as the mean, prior to learning is consistent when missing values are not informative. This contrasts with inferential settings, where mean imputation is criticized for distorting the distribution of the data. That such a simple approach can be consistent is important in practice. We also show that a predictor suited for complete observations can predict optimally on incomplete data, through multiple imputation. Finally, to compare imputation with learning directly with a model that accounts for missing values, we further analyze decision trees. These can naturally tackle empirical risk minimization with missing values, due to their ability to handle the half-discrete nature of incomplete variables. After comparing different missing-value strategies in trees, both theoretically and empirically, we recommend using the "missing incorporated in attribute" method, as it can handle both non-informative and informative missing values.
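Constant imputation prior to learning, as discussed in this abstract, can be illustrated as follows. This is a minimal sketch assuming missing entries are encoded as `NaN` in a NumPy array; the helper names are hypothetical. The fill values are computed on the training data only and then reused unchanged on the test data, so both sets are imputed with the same constants.

```python
import numpy as np

def fit_mean_imputer(X_train):
    """Return the per-column mean of the observed (non-NaN) training entries."""
    return np.nanmean(np.asarray(X_train, dtype=float), axis=0)

def apply_imputer(X, fill_values):
    """Replace every NaN in X by the fill value of its column."""
    X = np.asarray(X, dtype=float)
    return np.where(np.isnan(X), fill_values, X)

# Toy data: the second feature is missing in the first training row.
X_train = np.array([[1.0, np.nan],
                    [3.0, 4.0]])
X_test = np.array([[np.nan, np.nan]])

means = fit_mean_imputer(X_train)              # column means over observed entries
X_train_filled = apply_imputer(X_train, means)
X_test_filled = apply_imputer(X_test, means)   # test data uses the training means
```

Any learner that expects complete inputs can then be trained on `X_train_filled` and evaluated on `X_test_filled`.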

Topics: Machine Learning, Statistics Theory

Variance reduction for Random Coordinate Descent-Langevin Monte Carlo

146 - Zhiyan Ding, Qin Li 2020

Sampling from a log-concave distribution function is one core problem that has wide applications in Bayesian statistics and machine learning. While most gradient-free methods have slow convergence rates, the Langevin Monte Carlo (LMC) method, which provides fast convergence, requires the computation of gradients. In practice one uses finite-differencing approximations as surrogates, and the method is expensive in high dimensions. A natural strategy to reduce computational cost in each iteration is to utilize random gradient approximations, such as random coordinate descent (RCD) or simultaneous perturbation stochastic approximation (SPSA). We show by a counter-example that blindly applying RCD does not achieve the goal in the most general setting. The high variance induced by the randomness means a larger number of iterations is needed, and this cancels out the saving in each iteration. We then introduce a new variance reduction approach, termed Randomized Coordinates Averaging Descent (RCAD), and incorporate it with both overdamped and underdamped LMC. The methods are termed RCAD-O-LMC and RCAD-U-LMC respectively. The methods still sit in the random gradient approximation framework, and thus the computational cost in each iteration is low. However, by employing RCAD, the variance is reduced, so the methods converge within the same number of iterations as the classical overdamped and underdamped LMC. This leads to a computational saving overall.
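The baseline that this abstract criticizes, overdamped LMC driven by a random-coordinate finite-difference gradient, can be sketched as below. This is not the RCAD method, whose details the abstract does not give. The sketch assumes a target density proportional to `exp(-f(x))`, the standard update `x - h * grad + sqrt(2h) * noise`, and a central finite difference along one uniformly chosen coordinate, scaled by the dimension so that the estimate is unbiased. Function names, `eps`, and `h` are hypothetical.

```python
import numpy as np

def rcd_gradient(f, x, rng, eps=1e-5):
    """Random-coordinate surrogate for the gradient of f at x.

    Only one partial derivative is evaluated (two calls to f); multiplying
    by the dimension d makes the expectation equal to the full gradient.
    """
    d = x.size
    r = rng.integers(d)                 # coordinate chosen uniformly at random
    e = np.zeros(d)
    e[r] = 1.0
    partial = (f(x + eps * e) - f(x - eps * e)) / (2.0 * eps)
    return d * partial * e

def lmc_step(f, x, h, rng):
    """One overdamped Langevin step using the random-coordinate gradient."""
    noise = rng.standard_normal(x.size)
    return x - h * rcd_gradient(f, x, rng) + np.sqrt(2.0 * h) * noise

def neg_log_gaussian(x):
    """f(x) = ||x||^2 / 2, the negative log-density of a standard Gaussian."""
    return 0.5 * np.dot(x, x)
```

Each step costs two evaluations of `f` instead of `2d` for a full finite-difference gradient, but the surrogate has a single non-zero entry inflated by `d`, which is the source of the high variance the abstract refers to.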

Topics: Machine Learning, Statistics Theory

Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee

245 - Jincheng Bai, Qifan Song, Guang Cheng 2020

Sparse deep learning aims to address the challenge of huge storage consumption by deep neural networks, and to recover the sparse structure of target functions. Although tremendous empirical successes have been achieved, most sparse deep learning algorithms lack theoretical support. On the other hand, another line of work has proposed theoretical frameworks that are computationally infeasible. In this paper, we train sparse deep neural networks with a fully Bayesian treatment under spike-and-slab priors, and develop a set of computationally efficient variational inferences via continuous relaxation of the Bernoulli distribution. The variational posterior contraction rate is provided, which justifies the consistency of the proposed variational Bayes method. Notably, our empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of accomplishing consistent variable selection by training a sparse multi-layer neural network.
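A continuous relaxation of a Bernoulli variable can be illustrated with the binary Concrete (Gumbel-softmax style) construction. The abstract does not state which relaxation the authors use, so this choice, the function name, and the clipping constant are assumptions made for illustration only.

```python
import numpy as np

def relaxed_bernoulli_sample(p, temperature, rng, size=None):
    """Draw relaxed Bernoulli(p) samples with values in [0, 1].

    Binary Concrete construction: add logistic noise to the logit of p and
    squash with a sigmoid. As the temperature approaches 0 the samples
    concentrate near 0 and 1, and the fraction near 1 approaches p.
    """
    p = np.clip(p, 1e-7, 1 - 1e-7)
    # Clip the uniform draws so the logs below stay finite.
    u = np.clip(rng.uniform(size=size), 1e-7, 1 - 1e-7)
    logits = np.log(p) - np.log1p(-p)
    logistic_noise = np.log(u) - np.log1p(-u)
    return 1.0 / (1.0 + np.exp(-(logits + logistic_noise) / temperature))
```

In a spike-and-slab setting, such a sample could act as a soft inclusion gate multiplying a network weight, so the inclusion probability `p` can be optimised with gradient-based methods instead of being treated as a hard 0/1 choice.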

Topics: Machine Learning, Statistics Theory