Srift: Swift and Thrift Cloud-Based Distributed Training


Abstract in English

Cost-efficiency and training time are primary concerns in cloud-based distributed training today. With many VM configurations to choose from, given a time constraint, what configuration achieves the lowest cost? Or, given a cost budget, which configuration leads to the highest throughput? We present a comprehensive throughput and cost-efficiency study across a wide array of instance choices in the cloud. With the insights from this study, we build Srift, a system that combines runtime instrumentation and learned performance models to accurately predict training performance and find the best choice of VMs to improve throughput and lower cost while satisfying user constraints. With Pytorch and EC2, we show Srifts choices of VM instances can lead to up to 2x better throughput and 1.6x lower cost per iteration compared to baseline choices across various DNN models in real-world scenarios, leveraging heterogeneous setups and spot instances.

Download