Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters

95 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yang You

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Zhengda Bian - Shenggui Li - Wei Wang

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Efficient GPU resource scheduling is essential to maximize resource utilization and save training costs for the increasing amount of deep learning workloads in shared GPU clusters. Existing GPU schedulers largely rely on static policies to leverage the performance characteristics of deep learning jobs. However, they can hardly reach optimal efficiency due to the lack of elasticity. To address the problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job based on the training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that can continuously optimize the scheduling decisions. We evaluate the effectiveness of ONES with 64 GPUs on TACCs Longhorn supercomputers. The results show that ONES can outperform the prior deep learning schedulers with a significantly shorter average job completion time.

قيم البحث

73 - Kaixin Zhang , Hongzhi Wang , Tongxin Li 2021

Recently, deep learning has been an area of intense researching. However, as a kind of computing intensive task, deep learning highly relies on the scale of GPU memory, which is usually prohibitive and scarce. Although there are some extensive works have been proposed for dynamic GPU memory management, they are hard to be applied to systems with multiple dynamic workloads, such as in-database machine learning system. In this paper, we demonstrated TENSILE, a method of managing GPU memory in tensor granularity to reduce the GPU memory peak, with taking the multiple dynamic workloads into consideration. As far as we know, TENSILE is the first method which is designed to manage multiple workloads GPU memory using. We implement TENSILE on a deep learning framework built by ourselves, and evaluated its performance. The experiment results show that TENSILE can save more GPU memory with less extra time overhead than prior works in both single and multiple dynamic workloads scenarios.

النظم الموزعة والتوازية والحوسبة العنقودية الذكاء الاصطناعي قواعد البيانات

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

299 - Qinghao Hu , Peng Sun , Shengen Yan 2021

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial ben efits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

111 - Myeongjae Jeon , Shivaram Venkataraman , Amar Phanishayee 2019

With widespread advances in machine learning, a number of large enterprises are beginning to incorporate machine learning models across a number of products. These models are typically trained on shared, multi-tenant GPU clusters. Similar to existing cluster computing workloads, scheduling frameworks aim to provide features like high efficiency, resource isolation, fair sharing across users, etc. However Deep Neural Network (DNN) based workloads, predominantly trained on GPUs, differ in two significant ways from traditional big data analytics workloads. First, from a cluster utilization perspective, GPUs represent a monolithic resource that cannot be shared at a fine granularity across users. Second, from a workload perspective, deep learning frameworks require gang scheduling reducing the flexibility of scheduling and making the jobs themselves inelastic to failures at runtime. In this paper we present a detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in a large enterprise. By correlating scheduler logs with logs from individual jobs, we study three distinct issues that affect cluster utilization for DNN training workloads on multi-tenant clusters: (1) the effect of gang scheduling and locality constraints on queuing, (2) the effect of locality on GPU utilization, and (3) failures during training. Based on our experience running a large-scale operation, we provide design guidelines pertaining to next-generation cluster schedulers for DNN training workloads.

النظم الموزعة والتوازية والحوسبة العنقودية

Efficient Orchestration of Host and Remote Shared Memory for Memory Intensive Workloads

161 - Juhyun Bae , Gong Su , Arun Iyengar 2020

Since very few contributions to the development of an unified memory orchestration framework for efficient management of both host and remote idle memory have been made, we present Valet, an efficient approach to orchestration of host and remote shar ed memory for improving performance of memory intensive workloads. The paper makes three original contributions. First, we redesign the data flow in the critical path by introducing a host-coordinated memory pool that works as a local cache to reduce the latency in the critical path of the host and remote memory orchestration. Second, Valet utilizes unused local memory across containers by managing local memory via Valet host-coordinated memory pool, which allows containers to dynamically expand and shrink their memory allocations according to the workload demands. Third, Valet provides an efficient remote memory reclaiming technique on remote peers, based on two optimizations: (1) an activity-based victim selection scheme to allow the least-active-chunk of data to be selected for serving the eviction requests and (2) a migration protocol to move the least-active-chunk of data to less-memory-pressured remote node. As a result, Valet can effectively reduce the performance impact and migration overhead on local nodes. Our extensive experiments on both NoSQL systems and Machine Learning (ML) workloads show that Valet outperforms existing representative remote paging systems with up to 226X throughput improvement and up to 98% latency decrease over conventional OS swap facility for big data and ML workloads, and by up to 5.5X throughput improvement and up to 78.4% latency decrease over the state-of-the-art remote paging systems. Valet is open sourced at https://github.com/git-disl/Valet.

النظم الموزعة والتوازية والحوسبة العنقودية

Effective Elastic Scaling of Deep Learning Workloads

131 - Vaibhav Saxena , K. R. Jayaram , Saurav Basu 2020

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when running on multiple nodes. We design a fast dynamic programming based optimizer to solve this problem in real-time to determine jobs that can be scaled up/down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs. We demonstrate empirically that our elastic scaling algorithm can complete up to $approx 2 times$ as many jobs as compared to a strong baseline algorithm that also scales the number of GPUs but does not change the batch size. We also demonstrate that the average completion time with our algorithm is up to $approx 10 times$ faster than that of the baseline.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي