Universal Approximation in Dropout Neural Networks

148 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jaron Sanders

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Oxana A. Manita - Mark A. Peletier - Jacobus W. Portegies

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We prove two universal approximation theorems for a range of dropout neural networks. These are feed-forward neural networks in which each edge is given a random ${0,1}$-valued filter, that have two modes of operation: in the first each edge output is multiplied by its random filter, resulting in a random output, while in the second each edge output is multiplied by the expectation of its filter, leading to a deterministic output. It is common to use the random mode during training and the deterministic mode during testing and prediction. Both theorems are of the following form: Given a function to approximate and a threshold $varepsilon>0$, there exists a dropout network that is $varepsilon$-close in probability and in $L^q$. The first theorem applies to dropout networks in the random mode. It assumes little on the activation function, applies to a wide class of networks, and can even be applied to approximation schemes other than neural networks. The core is an algebraic property that shows that deterministic networks can be exactly matched in expectation by random networks. The second theorem makes stronger assumptions and gives a stronger result. Given a function to approximate, it provides existence of a network that approximates in both modes simultaneously. Proof components are a recursive replacement of edges by independent copies, and a special first-layer replacement that couples the resulting larger network to the input. The functions to be approximated are assumed to be elements of general normed spaces, and the approximations are measured in the corresponding norms. The networks are constructed explicitly. Because of the different methods of proof, the two results give independent insight into the approximation properties of random dropout networks. With this, we establish that dropout neural networks broadly satisfy a universal-approximation property.

قيم البحث

101 - Joachim Sicking , Maram Akila , Tim Wirtz 2020

Monte Carlo (MC) dropout is one of the state-of-the-art approaches for uncertainty estimation in neural networks (NNs). It has been interpreted as approximately performing Bayesian inference. Based on previous work on the approximation of Gaussian pr ocesses by wide and deep neural networks with random weights, we study the limiting distribution of wide untrained NNs under dropout more rigorously and prove that they as well converge to Gaussian processes for fixed sets of weights and biases. We sketch an argument that this property might also hold for infinitely wide feed-forward networks that are trained with (full-batch) gradient descent. The theory is contrasted by an empirical analysis in which we find correlations and non-Gaussian behaviour for the pre-activations of finite width NNs. We therefore investigate how (strongly) correlated pre-activations can induce non-Gaussian behavior in NNs with strongly correlated weights.

التعلم الآلي نظرية الإحصاء التعلم الالي

On Universal Approximation by Neural Networks with Uniform Guarantees on Approximation of Infinite Dimensional Maps

126 - William H. Guss , Ruslan Salakhutdinov 2019

The study of universal approximation of arbitrary functions $f: mathcal{X} to mathcal{Y}$ by neural networks has a rich and thorough history dating back to Kolmogorov (1957). In the case of learning finite dimensional maps, many authors have shown va rious forms of the universality of both fixed depth and fixed width neural networks. However, in many cases, these classical results fail to extend to the recent use of approximations of neural networks with infinitely many units for functional data analysis, dynamical systems identification, and other applications where either $mathcal{X}$ or $mathcal{Y}$ become infinite dimensional. Two questions naturally arise: which infinite dimensional analogues of neural networks are sufficient to approximate any map $f: mathcal{X} to mathcal{Y}$, and when do the finite approximations to these analogues used in practice approximate $f$ uniformly over its infinite dimensional domain $mathcal{X}$? In this paper, we answer the open question of universal approximation of nonlinear operators when $mathcal{X}$ and $mathcal{Y}$ are both infinite dimensional. We show that for a large class of different infinite analogues of neural networks, any continuous map can be approximated arbitrarily closely with some mild topological conditions on $mathcal{X}$. Additionally, we provide the first lower-bound on the minimal number of input and output units required by a finite approximation to an infinite neural network to guarantee that it can uniformly approximate any nonlinear operator using samples from its inputs and outputs.

التعلم الآلي تحليل وظيفي التعلم الالي

Beyond Dropout: Feature Map Distortion to Regularize Deep Neural Networks

114 - Yehui Tang , Yunhe Wang , Yixing Xu 2020

Deep neural networks often consist of a great number of trainable parameters for extracting powerful features from given datasets. On one hand, massive trainable parameters significantly enhance the performance of these deep networks. On the other ha nd, they bring the problem of over-fitting. To this end, dropout based methods disable some elements in the output feature maps during the training phase for reducing the co-adaptation of neurons. Although the generalization ability of the resulting models can be enhanced by these approaches, the conventional binary dropout is not the optimal solution. Therefore, we investigate the empirical Rademacher complexity related to intermediate layers of deep neural networks and propose a feature distortion method (Disout) for addressing the aforementioned problem. In the training period, randomly selected elements in the feature maps will be replaced with specific values by exploiting the generalization error bound. The superiority of the proposed feature map distortion for producing deep neural network with higher testing performance is analyzed and demonstrated on several benchmark image datasets.

التعلم الآلي الرؤية الحاسوبية وتمييز الأنماط التعلم الالي

Asymptotic convergence rate of Dropout on shallow linear neural networks

101 - Albert Senen-Cerda , Jaron Sanders 2020

We analyze the convergence rate of gradient flows on objective functions induced by Dropout and Dropconnect, when applying them to shallow linear Neural Networks (NNs) - which can also be viewed as doing matrix factorization using a particular regula rizer. Dropout algorithms such as these are thus regularization techniques that use 0,1-valued random variables to filter weights during training in order to avoid coadaptation of features. By leveraging a recent result on nonconvex optimization and conducting a careful analysis of the set of minimizers as well as the Hessian of the loss function, we are able to obtain (i) a local convergence proof of the gradient flow and (ii) a bound on the convergence rate that depends on the data, the dropout probability, and the width of the NN. Finally, we compare this theoretical bound to numerical simulations, which are in qualitative agreement with the convergence bound and match it when starting sufficiently close to a minimizer.

التعلم الآلي التحسين والتحكم التعلم الالي

The universal approximation theorem for complex-valued neural networks

104 - Felix Voigtlaender 2020

We generalize the classical universal approximation theorem for neural networks to the case of complex-valued neural networks. Precisely, we consider feedforward networks with a complex activation function $sigma : mathbb{C} to mathbb{C}$ in which ea ch neuron performs the operation $mathbb{C}^N to mathbb{C}, z mapsto sigma(b + w^T z)$ with weights $w in mathbb{C}^N$ and a bias $b in mathbb{C}$, and with $sigma$ applied componentwise. We completely characterize those activation functions $sigma$ for which the associated complex networks have the universal approximation property, meaning that they can uniformly approximate any continuous function on any compact subset of $mathbb{C}^d$ arbitrarily well. Unlike the classical case of real networks, the set of good activation functions which give rise to networks with the universal approximation property differs significantly depending on whether one considers deep networks or shallow networks: For deep networks with at least two hidden layers, the universal approximation property holds as long as $sigma$ is neither a polynomial, a holomorphic function, or an antiholomorphic function. Shallow networks, on the other hand, are universal if and only if the real part or the imaginary part of $sigma$ is not a polyharmonic function.

تحليل وظيفي التعلم الآلي التعلم الالي