GPU Asynchronous Stochastic Gradient Descent to Speed Up Neural Network Training

204 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Tom Paine

تاريخ النشر 2013

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Thomas Paine - Hailin Jin - Jianchao Yang

الرؤية الحاسوبية وتمييز الأنماط النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

The ability to train large-scale neural networks has resulted in state-of-the-art performance in many areas of computer vision. These results have largely come from computational break throughs of two forms: model parallelism, e.g. GPU accelerated training, which has seen quick adoption in computer vision circles, and data parallelism, e.g. A-SGD, whose large scale has been used mostly in industry. We report early experiments with a system that makes use of both model parallelism and data parallelism, we call GPU A-SGD. We show using GPU A-SGD it is possible to speed up training of large convolutional neural networks useful for computer vision. We believe GPU A-SGD will make it possible to train larger networks on larger training sets in a reasonable amount of time.

قيم البحث

109 - Yunfei Teng , Wenbo Gao , Francois Chalus 2019

We consider distributed optimization under communication constraints for training deep learning models. We propose a new algorithm, whose parameter updates rely on two forces: a regular gradient step, and a corrective direction dictated by the curren tly best-performing worker (leader). Our method differs from the parameter-averaging scheme EASGD in a number of ways: (i) our objective formulation does not change the location of stationary points compared to the original optimization problem; (ii) we avoid convergence decelerations caused by pulling local workers descending to different local minima to each other (i.e. to the average of their parameters); (iii) our update by design breaks the curse of symmetry (the phenomenon of being trapped in poorly generalizing sub-optimal solutions in symmetric non-convex landscapes); and (iv) our approach is more communication efficient since it broadcasts only parameters of the leader rather than all workers. We provide theoretical analysis of the batch version of the proposed algorithm, which we call Leader Gradient Descent (LGD), and its stochastic variant (LSGD). Finally, we implement an asynchronous version of our algorithm and extend it to the multi-leader setting, where we form groups of workers, each represented by its own local leader (the best performer in a group), and update each worker with a corrective direction comprised of two attractive forces: one to the local, and one to the global leader (the best performer among all workers). The multi-leader setting is well-aligned with current hardware architecture, where local workers forming a group lie within a single computational node and different groups correspond to different nodes. For training convolutional neural networks, we empirically demonstrate that our approach compares favorably to state-of-the-art baselines.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التحسين والتحكم

Parle: parallelizing stochastic gradient descent

301 - Pratik Chaudhari , Carlo Baldassi , Riccardo Zecchina 2017

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error rates that are nearly state-of-the-art on several bench marks including CIFAR-10 and CIFAR-100, without introducing any additional hyper-parameters. We exploit the phenomenon of flat minima that has been shown to lead to improved generalization error for deep networks. Parle requires very infrequent communication with the parameter server and instead performs more computation on each client, which makes it well-suited to both single-machine, multi-GPU settings and distributed implementations.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

Byzantine-Resilient Non-Convex Stochastic Gradient Descent

111 - Zeyuan Allen-Zhu , Faeze Ebrahimian , Jerry Li 2020

We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $alpha$-fraction of the machin es are $textit{Byzantine}$, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging $textit{non-convex}$ case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية بنى وهياكل البيانات والخوارزميات

Optimizing Deep Neural Networks through Neuroevolution with Stochastic Gradient Descent

263 - Haichao Zhang , Kuangrong Hao , Lei Gao 2020

Deep neural networks (DNNs) have achieved remarkable success in computer vision; however, training DNNs for satisfactory performance remains challenging and suffers from sensitivity to empirical selections of an optimization algorithm for training. S tochastic gradient descent (SGD) is dominant in training a DNN by adjusting neural network weights to minimize the DNNs loss function. As an alternative approach, neuroevolution is more in line with an evolutionary process and provides some key capabilities that are often unavailable in SGD, such as the heuristic black-box search strategy based on individual collaboration in neuroevolution. This paper proposes a novel approach that combines the merits of both neuroevolution and SGD, enabling evolutionary search, parallel exploration, and an effective probe for optimal DNNs. A hierarchical cluster-based suppression algorithm is also developed to overcome similar weight updates among individuals for improving population diversity. We implement the proposed approach in four representative DNNs based on four publicly-available datasets. Experiment results demonstrate that the four DNNs optimized by the proposed approach all outperform corresponding ones optimized by only SGD on all datasets. The performance of DNNs optimized by the proposed approach also outperforms state-of-the-art deep networks. This work also presents a meaningful attempt for pursuing artificial general intelligence.

الرؤية الحاسوبية وتمييز الأنماط

Parallel and distributed asynchronous adaptive stochastic gradient methods