
Slow and Stale Gradients Can Win the Race

Posted by Sanghamitra Dutta
Publication date: 2020
Research language: English





Distributed Stochastic Gradient Descent (SGD), when run in a synchronous manner, suffers from delays in runtime as it waits for the slowest workers (stragglers). Asynchronous methods can alleviate stragglers, but cause gradient staleness that can adversely affect the convergence error. In this work, we present a novel theoretical characterization of the speedup offered by asynchronous methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time). The main novelty in our work is that our runtime analysis considers random straggling delays, which helps us design and compare distributed SGD algorithms that strike a balance between straggling and staleness. We also provide a new error convergence analysis of asynchronous SGD variants without bounded or exponential delay assumptions. Finally, based on our theoretical characterization of the error-runtime trade-off, we propose a method of gradually varying synchronicity in distributed SGD and demonstrate its performance on the CIFAR-10 dataset.
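The error-runtime trade-off can be made concrete with a small simulation. The sketch below, in Python/NumPy, runs K-sync SGD on a toy least-squares problem and gradually increases the synchronicity level K over training (few workers early for speed, all workers later for accuracy). The exponential straggling model, the schedule function schedule_K, and all hyperparameters are illustrative assumptions rather than the paper's exact algorithm.

import numpy as np

# Minimal sketch: gradually varying synchronicity in distributed SGD on a toy
# least-squares problem. Straggling model and schedule are illustrative only.

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 20))                      # toy design matrix
b = A @ rng.normal(size=20) + 0.1 * rng.normal(size=1000)

def grad(w, idx):
    """Mini-batch gradient of 0.5*||A w - b||^2 over the rows in idx."""
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ w - bi) / len(idx)

def schedule_K(step, total_steps, num_workers):
    """Synchronicity schedule: wait for few workers early (fast but noisy),
    for all workers late (slow but accurate)."""
    frac = (step + 1) / total_steps
    return max(1, int(np.ceil(frac * num_workers)))

P, T, lr = 8, 200, 0.05                              # workers, steps, learning rate
w = np.zeros(20)
wallclock = 0.0
for t in range(T):
    K = schedule_K(t, T, P)
    delays = rng.exponential(scale=1.0, size=P)      # simulated straggling delays
    wallclock += np.sort(delays)[K - 1]              # wait only for the K fastest workers
    batches = [rng.integers(0, len(A), size=32) for _ in range(K)]
    g = np.mean([grad(w, idx) for idx in batches], axis=0)
    w -= lr * g
print("final loss:", 0.5 * np.mean((A @ w - b) ** 2), "simulated wallclock:", wallclock)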




Read also

We investigate the effects of gradual heating on the evolution of turbulent molecular clouds of mass $2\times 10^6$ M$_\odot$ and virial parameters ranging between $0.7$ and $1.2$. This gradual heating represents the energy output from processes such as winds from massive stars or feedback from High Mass X-ray binaries (HMXBs), contrasting the impulsive energy injection from supernovae (SNe). For stars with a mass high enough that their lifetime is shorter than the life of the cloud, we include a SN feedback prescription. Including both effects, we investigate the interplay between slow and fast forms of feedback and their effectiveness at triggering/suppressing star formation. We find that SN feedback can carve low density chimneys in the gas, offering a path of least resistance for the energy to escape. Once this occurs the more stable, but less energetic, gradual feedback is able to keep the chimneys open. By funneling the hot destructive gas away from the centre of the cloud, chimneys can have a positive effect on both the efficiency and duration of star formation. Moreover, the critical factor is the number of high mass stars and SNe (and any subsequent HMXBs) active within the free-fall time of each cloud. This can vary from cloud to cloud due to the stochasticity of SN delay times and in HMXB formation. However, the defining factor in our simulations is the efficiency of the cooling, which can alter the Jeans mass required for sink particle formation, along with the number of massive stars in the cloud.
Network pruning is a method for reducing test-time computational resource requirements with minimal performance degradation. Conventional wisdom of pruning algorithms suggests that: (1) Pruning methods exploit information from training data to find good subnetworks; (2) The architecture of the pruned network is crucial for good performance. In this paper, we conduct sanity checks for the above beliefs on several recent unstructured pruning methods and surprisingly find that: (1) A set of methods which aims to find good subnetworks of the randomly-initialized network (which we call initial tickets), hardly exploits any information from the training data; (2) For the pruned networks obtained by these methods, randomly changing the preserved weights in each layer, while keeping the total number of preserved weights unchanged per layer, does not affect the final performance. These findings inspire us to choose a series of simple \emph{data-independent} prune ratios for each layer, and randomly prune each layer accordingly to get a subnetwork (which we call random tickets). Experimental results show that our zero-shot random tickets outperform or attain a similar performance compared to existing initial tickets. In addition, we identify one existing pruning method that passes our sanity checks. We hybridize the ratios in our random ticket with this method and propose a new method called hybrid tickets, which achieves further improvement. (Our code is publicly available at https://github.com/JingtongSu/sanity-checking-pruning)
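The random-ticket idea described above is simple enough to sketch directly. The code below prunes each layer with a fixed, data-independent keep ratio by sampling a uniformly random binary mask, with no reference to the training data; the layer shapes and per-layer ratios are illustrative assumptions, not the paper's settings.

import numpy as np

# Minimal "random ticket" sketch: per-layer random masks at fixed keep ratios.

rng = np.random.default_rng(0)

def random_mask(shape, keep_ratio, rng):
    """Return a boolean mask keeping keep_ratio of the entries, chosen uniformly at random."""
    n = int(np.prod(shape))
    k = int(round(keep_ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=k, replace=False)] = True
    return flat.reshape(shape)

layer_shapes = [(784, 300), (300, 100), (100, 10)]   # toy 3-layer MLP
keep_ratios = [0.1, 0.2, 0.5]                        # chosen up front, independent of data

masks = [random_mask(s, r, rng) for s, r in zip(layer_shapes, keep_ratios)]
weights = [rng.normal(size=s) * m for s, m in zip(layer_shapes, masks)]
# Training would then proceed with the masks held fixed, zeroing the pruned weights.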
In this paper, we propose \texttt{FedGLOMO}, the first (first-order) FL algorithm that achieves the optimal iteration complexity (i.e., matching the known lower bound) on smooth non-convex objectives -- without using the clients' full gradients in each round. Our key algorithmic idea that enables attaining this optimal complexity is applying judicious momentum terms that promote variance reduction in both the local updates at the clients, and the global update at the server. Our algorithm is also provably optimal even with compressed communication between the clients and the server, which is an important consideration in the practical deployment of FL algorithms. Our experiments illustrate the intrinsic variance reduction effect of \texttt{FedGLOMO}, which implicitly suppresses client-drift in heterogeneous data distribution settings and promotes communication-efficiency. As a prequel to \texttt{FedGLOMO}, we propose \texttt{FedLOMO}, which applies momentum only in the local client updates. We establish that \texttt{FedLOMO} enjoys improved convergence rates under common non-convex settings compared to prior work, and with fewer assumptions.
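To make the two-level momentum idea concrete, here is a schematic federated training loop with one momentum buffer inside the client updates and another at the server. The toy quadratic client objectives, the coefficients, and the update rules are illustrative assumptions and do not reproduce the exact \texttt{FedGLOMO}/\texttt{FedLOMO} updates or their variance-reduction corrections.

import numpy as np

# Schematic two-level momentum for federated learning: clients run local steps
# with a local momentum buffer, the server applies a global momentum step to the
# averaged client deltas. Names and coefficients are assumptions for illustration.

rng = np.random.default_rng(0)
dim, num_clients = 10, 5
targets = [rng.normal(size=dim) for _ in range(num_clients)]   # heterogeneous client data

def client_grad(w, c):
    """Gradient of client c's toy objective 0.5 * ||w - target_c||^2."""
    return w - targets[c]

w_global = np.zeros(dim)
server_momentum = np.zeros(dim)
beta_local, beta_global = 0.9, 0.9
lr_local, lr_global = 0.1, 1.0

for _round in range(50):
    deltas = []
    for c in range(num_clients):
        w, m = w_global.copy(), np.zeros(dim)
        for _ in range(5):                                     # local steps with momentum
            m = beta_local * m + (1 - beta_local) * client_grad(w, c)
            w -= lr_local * m
        deltas.append(w - w_global)                            # client's model delta
    avg_delta = np.mean(deltas, axis=0)
    server_momentum = beta_global * server_momentum + (1 - beta_global) * avg_delta
    w_global += lr_global * server_momentum                    # global momentum step

print("distance to average target:", np.linalg.norm(w_global - np.mean(targets, axis=0)))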
Accurate predictions of reactive mixing are critical for many Earth and environmental science problems. To investigate mixing dynamics over time under different scenarios, a high-fidelity, finite-element-based numerical model is built to solve the fast, irreversible bimolecular reaction-diffusion equations to simulate a range of reactive-mixing scenarios. A total of 2,315 simulations are performed using different sets of model input parameters comprising various spatial scales of vortex structures in the velocity field, time-scales associated with velocity oscillations, the perturbation parameter for the vortex-based velocity, anisotropic dispersion contrast, and molecular diffusion. Outputs comprise concentration profiles of the reactants and products. The inputs and outputs of these simulations are concatenated into feature and label matrices, respectively, to train 20 different machine learning (ML) emulators to approximate system behavior. The 20 ML emulators, based on linear methods, Bayesian methods, ensemble learning methods, and the multilayer perceptron (MLP), are compared to assess these models. The ML emulators are specifically trained to classify the state of mixing and predict three quantities of interest (QoIs) characterizing species production, decay, and degree of mixing. Linear classifiers and regressors fail to reproduce the QoIs; however, ensemble methods (classifiers and regressors) and the MLP accurately classify the state of reactive mixing and the QoIs. Among ensemble methods, random forest and decision-tree-based AdaBoost faithfully predict the QoIs. At run time, trained ML emulators are $\approx 10^5$ times faster than the high-fidelity numerical simulations. The speed and accuracy of the ensemble and MLP models facilitate uncertainty quantification, which usually requires 1,000s of model runs, to estimate the uncertainty bounds on the QoIs.
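The emulator step in the workflow above amounts to fitting a fast surrogate on the simulation inputs and outputs. Below is a minimal sketch using scikit-learn's RandomForestRegressor on a synthetic stand-in for the feature/label matrices; the placeholder data and the single quantity of interest are assumptions for illustration, not the actual simulation ensemble.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Minimal emulator sketch: stack simulation input parameters into a feature
# matrix X and one quantity of interest into a label vector y, then fit an
# ensemble regressor as a fast surrogate. Synthetic data stands in for the
# actual 2,315-simulation ensemble.

rng = np.random.default_rng(0)
X = rng.uniform(size=(2315, 5))                 # 5 input parameters per simulation
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2]         # placeholder quantity of interest

train, test = slice(0, 2000), slice(2000, None)
emulator = RandomForestRegressor(n_estimators=200, random_state=0)
emulator.fit(X[train], y[train])
print("held-out R^2:", emulator.score(X[test], y[test]))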
Decentralized optimization techniques are increasingly being used to learn machine learning models from data distributed over multiple locations without gathering the data at any one location. Unfortunately, methods that are designed for faultless networks typically fail in the presence of node failures. In particular, Byzantine failures---corresponding to the scenario in which faulty/compromised nodes are allowed to arbitrarily deviate from an agreed-upon protocol---are the hardest to safeguard against in decentralized settings. This paper introduces a Byzantine-resilient decentralized gradient descent (BRIDGE) method for decentralized learning that, when compared to existing works, is more efficient and scalable in higher-dimensional settings and that is deployable in networks having topologies that go beyond the star topology. The main contributions of this work include theoretical analysis of BRIDGE for strongly convex learning objectives and numerical experiments demonstrating the efficacy of BRIDGE for both convex and nonconvex learning tasks.
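One way such resilience is typically achieved in decentralized gradient descent is by screening neighbors' iterates before combining them. The sketch below uses a coordinate-wise trimmed mean as the screening rule on a toy problem; the rule, step size, and objective are illustrative assumptions and not necessarily the exact BRIDGE update.

import numpy as np

# Illustrative Byzantine-resilient aggregation step: a node screens received
# iterates with a coordinate-wise trimmed mean, then takes a local gradient step.

def trimmed_mean(vectors, num_trim):
    """Coordinate-wise mean after discarding the num_trim largest and smallest entries per coordinate."""
    V = np.sort(np.stack(vectors), axis=0)
    return V[num_trim: V.shape[0] - num_trim].mean(axis=0)

rng = np.random.default_rng(0)
dim = 5
honest = [rng.normal(size=dim) for _ in range(8)]        # iterates from honest neighbors
byzantine = [rng.normal(scale=100.0, size=dim)]          # one arbitrarily corrupted iterate

w_i = rng.normal(size=dim)                               # this node's own iterate
screened = trimmed_mean(honest + byzantine + [w_i], num_trim=1)

grad = w_i - np.ones(dim)                                # gradient of 0.5 * ||w - 1||^2
w_i = screened - 0.1 * grad                              # screened consensus plus local gradient step
print("updated iterate:", w_i)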
