
Overparameterization of deep ResNet: zero loss and mean-field analysis

Published by: Zhiyan Ding
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





Finding parameters in a deep neural network (NN) that fit training data is a nonconvex optimization problem, yet a basic first-order optimization method (gradient descent) finds a global solution with perfect fit in many practical situations. We examine this phenomenon for Residual Neural Networks (ResNets) with smooth activation functions in a limiting regime in which both the number of layers (depth) and the number of neurons in each layer (width) go to infinity. First, we use a mean-field-limit argument to prove that, in the large-NN limit, gradient descent for parameter training becomes a partial differential equation (PDE) that characterizes a gradient flow for a probability distribution. Next, we show that the solution of the PDE converges, in training time, to a zero-loss solution. Together, these results imply that training the ResNet also yields near-zero loss, provided the ResNet is large enough. We give estimates of the depth and width needed to reduce the loss below a given threshold, with high probability.
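To make the setting concrete, here is a minimal PyTorch sketch (not the paper's construction) of the regime the abstract describes: a ResNet with a smooth (tanh) activation whose residual updates are scaled by 1/depth, trained by plain full-batch gradient descent. The depth, width, learning rate, and toy data below are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch: deep, wide ResNet with smooth activation, residual updates scaled by
# 1/depth to mimic the continuous-depth limit, trained by plain gradient descent.
torch.manual_seed(0)
depth, width, d_in = 20, 256, 10          # illustrative "large depth and width" values

class ResBlock(nn.Module):
    def __init__(self, width, depth):
        super().__init__()
        self.lin = nn.Linear(width, width)
        self.scale = 1.0 / depth           # 1/L scaling: depth -> infinity gives the limiting dynamics

    def forward(self, x):
        return x + self.scale * torch.tanh(self.lin(x))

model = nn.Sequential(
    nn.Linear(d_in, width),
    *[ResBlock(width, depth) for _ in range(depth)],
    nn.Linear(width, 1),
)

X = torch.randn(64, d_in)                  # toy training data
y = torch.randn(64, 1)

opt = torch.optim.SGD(model.parameters(), lr=0.1)   # basic first-order method
for step in range(1000):
    loss = ((model(X) - y) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    if step % 200 == 0:
        print(step, loss.item())           # loss should fall toward zero when the net is large enough
```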




Read also

76 - Liang Chen, Lesley Tan 2021
In this paper, we investigate data-driven parameterized modeling of insertion loss for transmission lines with respect to design parameters. We first show that direct application of neural networks can lead to non-physical models with negative insertion loss. To mitigate this problem, we propose two deep learning solutions. One solution is to add a regularization term, representing the passivity condition, to the final loss function to penalize non-physical negative values of the insertion loss. In the second method, a third-order polynomial expression, which ensures positiveness, is first defined to approximate the insertion loss, and the DeepONet neural network structure, recently proposed for function and system modeling, is then employed to predict the coefficients of the polynomial expression. The experimental results on an open-sourced SI/PI database of a PCB design show that both methods ensure positiveness of the insertion loss. Furthermore, both methods achieve similar prediction accuracy, while the polynomial-based DeepONet method trains faster than the direct DeepONet-based method.
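As an illustration of the first (penalty-based) idea, the following sketch adds a hinge-style passivity penalty to an ordinary regression loss. The network shape, the sign convention (insertion loss taken non-negative for a passive line), the penalty weight, and the toy data are assumptions, not the authors' formulation.

```python
import torch
import torch.nn as nn

# Illustrative sketch: an MLP maps design parameters + frequency to insertion loss,
# and a hinge-style penalty discourages sign-violating (non-passive) predictions.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(5, 64), nn.ReLU(),
                      nn.Linear(64, 64), nn.ReLU(),
                      nn.Linear(64, 1))

def training_loss(x, il_true, lam=10.0):
    il_pred = model(x)
    data_loss = ((il_pred - il_true) ** 2).mean()
    passivity_penalty = torch.relu(-il_pred).pow(2).mean()   # nonzero only when prediction < 0
    return data_loss + lam * passivity_penalty

x = torch.rand(256, 5)         # toy design parameters and frequency points
il_true = torch.rand(256, 1)   # toy non-negative insertion-loss targets
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(500):
    loss = training_loss(x, il_true)
    opt.zero_grad()
    loss.backward()
    opt.step()
print(loss.item())
```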
Sampling algorithms based on discretizations of Stochastic Differential Equations (SDEs) compose a rich and popular subset of MCMC methods. This work provides a general framework for the non-asymptotic analysis of sampling error in 2-Wasserstein distance, which also leads to a bound on mixing time. The method applies to any consistent discretization of contractive SDEs. When applied to the Langevin Monte Carlo algorithm, it establishes $\tilde{\mathcal{O}}\left(\frac{\sqrt{d}}{\epsilon}\right)$ mixing time, without warm start, under the common log-smooth and log-strongly-convex conditions, plus a growth condition on the third-order derivative of the potential of the target measure at infinity. This bound improves the best previously known $\tilde{\mathcal{O}}\left(\frac{d}{\epsilon}\right)$ result and is optimal (in terms of order) in both dimension $d$ and accuracy tolerance $\epsilon$ for target measures satisfying the aforementioned assumptions. Our theoretical analysis is further validated by numerical experiments.
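For concreteness, here is a minimal sketch of the Langevin Monte Carlo algorithm analyzed in this abstract: the Euler-Maruyama discretization of the overdamped Langevin SDE, applied to a standard Gaussian target (log-smooth and log-strongly convex). The step size and iteration count are illustrative, and no warm start is used.

```python
import numpy as np

# Langevin Monte Carlo: Euler-Maruyama discretization of
# dX_t = -grad U(X_t) dt + sqrt(2) dW_t, with Gaussian potential U(x) = |x|^2 / 2.
rng = np.random.default_rng(0)
d, h, n_steps = 10, 0.05, 5000           # dimension, step size, iterations (illustrative)

def grad_U(x):
    return x                              # gradient of the standard Gaussian potential

x = rng.standard_normal(d)                # arbitrary initialization, no warm start
samples = []
for _ in range(n_steps):
    x = x - h * grad_U(x) + np.sqrt(2 * h) * rng.standard_normal(d)
    samples.append(x.copy())

print(np.mean(samples, axis=0))           # should be close to the target mean (zero)
```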
In this paper, we present a deep autoencoder based energy method (DAEM) for the bending, vibration, and buckling analysis of Kirchhoff plates. The DAEM exploits the higher-order continuity of its approximation and integrates a deep autoencoder with the minimum total potential energy principle in one framework, yielding an unsupervised feature learning method. The DAEM is a specific type of feedforward deep neural network (DNN) and can also serve as a function approximator. With robust feature extraction capacity, the DAEM can more efficiently identify patterns behind the whole energy system, such as the field variables, natural frequency, and critical buckling load factor studied in this paper. The objective is to minimize the total potential energy. The DAEM performs unsupervised learning on randomly generated points inside the physical domain, so that the total potential energy is minimized at all points. For vibration and buckling analysis, the loss function is constructed based on Rayleigh's principle, and the fundamental frequency and the critical buckling load are extracted. A scaled hyperbolic tangent activation function for the underlying mechanical model is presented, which meets the continuity requirement and alleviates the vanishing/exploding gradient problems in bending analysis. The DAEM can be easily implemented; we employ the PyTorch library and the L-BFGS optimizer. A comprehensive study of the DAEM configuration is performed for several numerical examples with various geometries, load conditions, and boundary conditions.
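A minimal sketch of the unsupervised energy-minimization loop this abstract describes, assuming a plain feedforward network in place of the autoencoder, a clamped unit-square plate with a simple boundary ansatz, and illustrative values of the flexural rigidity D, Poisson ratio nu, and load q; only the bending case is sketched.

```python
import torch

# Sketch: a small network approximates the plate deflection w(x, y); L-BFGS minimizes
# a Monte Carlo estimate of the Kirchhoff bending potential energy on random points.
torch.manual_seed(0)
D, nu, q = 1.0, 0.3, 1.0                           # assumed material constants and load

net = torch.nn.Sequential(
    torch.nn.Linear(2, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1),
)

def deflection(xy):
    x, y = xy[:, :1], xy[:, 1:]
    return x * (1 - x) * y * (1 - y) * net(xy)     # w = 0 on the boundary by construction

def total_potential_energy(xy):
    w = deflection(xy)
    g = torch.autograd.grad(w.sum(), xy, create_graph=True)[0]
    w_x, w_y = g[:, :1], g[:, 1:]
    gx = torch.autograd.grad(w_x.sum(), xy, create_graph=True)[0]   # [w_xx, w_xy]
    gy = torch.autograd.grad(w_y.sum(), xy, create_graph=True)[0]   # [w_yx, w_yy]
    w_xx, w_xy, w_yy = gx[:, :1], gx[:, 1:], gy[:, 1:]
    bending = 0.5 * D * ((w_xx + w_yy) ** 2 - 2 * (1 - nu) * (w_xx * w_yy - w_xy ** 2))
    return (bending - q * w).mean()                 # Monte Carlo average over the unit square

pts = torch.rand(4096, 2, requires_grad=True)       # random collocation points, sampled once
opt = torch.optim.LBFGS(net.parameters(), max_iter=200)

def closure():
    opt.zero_grad()
    energy = total_potential_energy(pts)
    energy.backward()
    return energy

opt.step(closure)
print("final potential energy:", total_potential_energy(pts).item())
```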
Batch Normalization (BatchNorm) is an extremely useful component of modern neural network architectures, enabling optimization using higher learning rates and achieving faster convergence. In this paper, we use mean-field theory to analytically quantify the impact of BatchNorm on the geometry of the loss landscape for multi-layer networks consisting of fully-connected and convolutional layers. We show that it has a flattening effect on the loss landscape, as quantified by the maximum eigenvalue of the Fisher Information Matrix. These findings are then used to justify the use of larger learning rates for networks that use BatchNorm, and we provide quantitative characterization of the maximal allowable learning rate to ensure convergence. Experiments support our theoretically predicted maximum learning rate, and furthermore suggest that networks with smaller values of the BatchNorm parameter achieve lower loss after the same number of epochs of training.
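As a rough illustration, the sketch below estimates the top curvature of the loss landscape via power iteration on Hessian-vector products (the loss Hessian is used here as a stand-in for the Fisher Information Matrix studied in the paper) for a small fully-connected network with and without BatchNorm; the architecture and data are toy assumptions.

```python
import torch
import torch.nn as nn

# Estimate the largest eigenvalue of the loss Hessian by power iteration on
# Hessian-vector products, for a toy net with and without BatchNorm.
torch.manual_seed(0)
X, y = torch.randn(256, 20), torch.randint(0, 2, (256,))

def make_net(use_bn):
    layers = [nn.Linear(20, 64)]
    if use_bn:
        layers.append(nn.BatchNorm1d(64))
    layers += [nn.ReLU(), nn.Linear(64, 2)]
    return nn.Sequential(*layers)

def top_hessian_eig(net, n_iter=50):
    params = [p for p in net.parameters() if p.requires_grad]
    loss = nn.functional.cross_entropy(net(X), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    v = [torch.randn_like(p) for p in params]
    eig = 0.0
    for _ in range(n_iter):
        # Hessian-vector product via a second differentiation pass
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)
        eig = sum((h * vi).sum() for h, vi in zip(hv, v)).item()   # Rayleigh quotient
        norm = torch.sqrt(sum((h ** 2).sum() for h in hv))
        v = [h / norm for h in hv]                                 # re-normalize the iterate
    return eig

for use_bn in (False, True):
    print("BatchNorm" if use_bn else "plain", top_hessian_eig(make_net(use_bn)))
```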
116 - Zhiyan Ding, Qin Li 2019
Ensemble Kalman Sampler (EKS) is a method for finding approximately i.i.d. samples from a target distribution. As of today, why the algorithm works and how it converges is mostly unknown. The continuous version of the algorithm is a set of coupled stochastic differential equations (SDEs). In this paper, we prove the well-posedness of the SDE system and justify that its mean-field limit is a Fokker-Planck equation whose long-time equilibrium is the target distribution. We further demonstrate that the convergence rate is near-optimal ($J^{-1/2}$, with $J$ being the number of particles). These results, combined with the in-time convergence of the Fokker-Planck equation to its equilibrium, justify the validity of EKS and provide its convergence rate as a sampling method.
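A minimal discretized sketch of an EKS-style particle system for a 2-D Gaussian target: each particle takes an Euler-Maruyama step of the coupled SDEs, preconditioned by the empirical ensemble covariance. The step size, ensemble size, and omission of finite-ensemble correction terms are simplifying assumptions, not the authors' exact scheme.

```python
import numpy as np

# EKS-style update: drift -C * grad_Phi(theta_j), diffusion sqrt(2 C) dW_j,
# where C is the empirical covariance of the ensemble.
rng = np.random.default_rng(0)
J, d, h, n_steps = 200, 2, 0.02, 2000             # ensemble size, dimension, step size, steps
Sigma_inv = np.linalg.inv(np.array([[2.0, 0.5], [0.5, 1.0]]))
mean = np.array([1.0, -1.0])

def grad_Phi(theta):                               # gradient of the Gaussian potential, row-wise
    return (theta - mean) @ Sigma_inv

theta = rng.standard_normal((J, d))                # initial ensemble
for _ in range(n_steps):
    C = np.cov(theta, rowvar=False)                # ensemble covariance (symmetric preconditioner)
    sqrtC = np.linalg.cholesky(C + 1e-8 * np.eye(d))
    noise = rng.standard_normal((J, d)) @ sqrtC.T  # rows ~ N(0, C)
    theta = theta - h * grad_Phi(theta) @ C + np.sqrt(2 * h) * noise

print(theta.mean(axis=0))                          # should approach the target mean
print(np.cov(theta, rowvar=False))                 # should approach the target covariance
```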
