Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, given a time constraint, which configuration achieves the lowest cost? Or, given a cost budget, which configuration leads to the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. With the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the best choice of VMs to improve throughput and lower cost while satisfying user constraints. With PyTorch and EC2, we show Srift's choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.
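The VM-selection problem this abstract poses can be stated concretely. Below is a minimal sketch, not Srift's actual model or API: it assumes a list of candidate configurations with hypothetical predicted throughputs and hourly prices, and picks the cheapest one that still meets a deadline.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Config:
    name: str             # instance type and count (illustrative)
    iters_per_sec: float  # predicted training throughput
    usd_per_hour: float   # total hourly price of the cluster

def cheapest_within_deadline(configs: List[Config], total_iters: int,
                             deadline_hours: float) -> Optional[Config]:
    """Pick the lowest-cost configuration that meets the time constraint."""
    def hours(c: Config) -> float:
        return total_iters / (c.iters_per_sec * 3600)
    feasible = [c for c in configs if hours(c) <= deadline_hours]
    # Job cost = runtime in hours * hourly price of the cluster.
    return min(feasible, key=lambda c: hours(c) * c.usd_per_hour, default=None)

# Hypothetical candidates; throughputs and prices are made up for illustration.
candidates = [
    Config("8x p3.2xlarge",           iters_per_sec=20.0, usd_per_hour=24.5),
    Config("1x p3.16xlarge",          iters_per_sec=28.0, usd_per_hour=24.5),
    Config("4x g4dn.12xlarge (spot)", iters_per_sec=12.0, usd_per_hour=4.7),
]
print(cheapest_within_deadline(candidates, total_iters=1_000_000,
                               deadline_hours=24.0))
```

With these illustrative numbers, the slower spot-instance cluster wins: it still finishes inside the 24-hour deadline (about 23 hours) at roughly a third of the cost of either on-demand option, mirroring the abstract's point about spot instances lowering cost per iteration.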
Cloud GPU servers have become the de facto way for deep learning practitioners to train complex models on large-scale datasets. However, it is challenging to determine the appropriate cluster configuration---e.g., server type and number---for different …
Distributed training techniques have been widely deployed for large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate interconnection bandwidth between instances, traditional static …
Data parallelism is effective at speeding up training. However, when the memory of a single device cannot hold the whole model, data parallelism is no longer applicable. Another option is to split the …
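To make the contrast concrete, here is a minimal PyTorch sketch of splitting a model's layers across two devices when it does not fit on one. The layer sizes and device names are illustrative assumptions, not taken from the paper above, and the code assumes two CUDA GPUs are available.

```python
import torch
import torch.nn as nn

class SplitModel(nn.Module):
    """A model too large for one device, split across cuda:0 and cuda:1."""
    def __init__(self):
        super().__init__()
        # First half of the layers lives on GPU 0, second half on GPU 1.
        self.part1 = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(4096, 10)).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.part1(x.to("cuda:0"))
        # Activations cross the device boundary here; this transfer is the
        # communication cost model splitting pays, instead of the gradient
        # all-reduce that data parallelism pays.
        return self.part2(x.to("cuda:1"))

model = SplitModel()
out = model(torch.randn(32, 1024))  # output lives on cuda:1
```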
Stochastic gradient descent (SGD) is an inherently sequential training algorithm---computing the gradient at batch $i$ depends on the model parameters learned from batch $i-1$. Prior approaches that break this dependence do not honor it (e.g., sum the …)
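The sequential dependence is easy to see in a worked sketch. The following plain-NumPy loop (a hypothetical toy problem, not any paper's code) makes explicit that the update for batch $i$ reads the parameters produced by batch $i-1$:

```python
import numpy as np

def sgd(theta, batches, grad_fn, lr=0.1):
    for batch in batches:                            # batch i ...
        theta = theta - lr * grad_fn(theta, batch)   # ... reads theta from batch i-1
    return theta

# Toy least-squares problem on random data.
rng = np.random.default_rng(0)
batches = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(10)]
# Gradient of (1/n) * ||X @ theta - y||^2 with respect to theta.
grad_fn = lambda th, b: 2 * b[0].T @ (b[0] @ th - b[1]) / len(b[1])
theta = sgd(np.zeros(3), batches, grad_fn)
```

Because each iteration consumes the `theta` the previous iteration produced, the loop body cannot simply be reordered or run independently in parallel without changing what SGD computes.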
Scale of data and scale of computation infrastructure together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we focus …