ترغب بنشر مسار تعليمي؟ اضغط هنا

Breaking the Computation and Communication Abstraction Barrier in Distributed Machine Learning Workloads

169   0   0.0 ( 0 )
 نشر من قبل Abhinav Jangda
 تاريخ النشر 2021
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Recent trend towards increasing large machine learning models require both training and inference tasks to be distributed. Considering the huge cost of training these models, it is imperative to unlock optimizations in computation and communication to obtain best performance. However, current logical separation between computation and communication kernels in deep learning frameworks misses the optimization opportunities across such barrier. Breaking this abstraction with a holistic consideration can provide many optimizations to provide performance improvements in distributed workloads. Manually applying these optimizations needs modifications in underlying computation and communication libraries for each scenario, which is time consuming and error-prone. Therefore, we present CoCoNeT, with a DSL to express a program with both computation and communication. CoCoNeT contains several machine learning aware transformations to optimize a program and a compiler to generate high performance kernels. Providing both computation and communication as first class constructs allows users to work on a high-level abstraction and apply powerful optimizations, such as fusion or overlapping of communication and computation. CoCoNeT enables us to optimize data-, model-and pipeline-parallel workloads in large language models with only a few lines of code. Experiments show CoCoNeT significantly outperforms state-of-the-art distributed machine learning implementations.



قيم البحث

اقرأ أيضاً

109 - Malte S. Kurz 2021
This paper explores serverless cloud computing for double machine learning. Being based on repeated cross-fitting, double machine learning is particularly well suited to exploit the high level of parallelism achievable with serverless computing. It a llows to get fast on-demand estimations without additional cloud maintenance effort. We provide a prototype Python implementation texttt{DoubleML-Serverless} for the estimation of double machine learning models with the serverless computing platform AWS Lambda and demonstrate its utility with a case study analyzing estimation times and costs.
Machine learning (ML) tasks are becoming ubiquitous in todays network applications. Federated learning has emerged recently as a technique for training ML models at the network edge by leveraging processing capabilities across the nodes that collect the data. There are several challenges with employing conventional federated learning in contemporary networks, due to the significant heterogeneity in compute and communication capabilities that exist across devices. To address this, we advocate a new learning paradigm called fog learning which will intelligently distribute ML model training across the continuum of nodes from edge devices to cloud servers. Fog learning enhances federated learning along three major dimensions: network, heterogeneity, and proximity. It considers a multi-layer hybrid learning framework consisting of heterogeneous devices with various proximities. It accounts for the topology structures of the local networks among the heterogeneous nodes at each network layer, orchestrating them for collaborative/cooperative learning through device-to-device (D2D) communications. This migrates from star network topologies used for parameter transfers in federated learning to more distributed topologies at scale. We discuss several open research directions to realizing fog learning.
The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations utilize expensive resources effectively, and to share said resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of Deep Learning (DL) jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency, when running on multiple nodes. We design a fast dynamic programming based optimizer to solve this problem in real-time to determine jobs that can be scaled up/down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs. We demonstrate empirically that our elastic scaling algorithm can complete up to $approx 2 times$ as many jobs as compared to a strong baseline algorithm that also scales the number of GPUs but does not change the batch size. We also demonstrate that the average completion time with our algorithm is up to $approx 10 times$ faster than that of the baseline.
Machine Learning applications on HPC systems have been gaining popularity in recent years. The upcoming large scale systems will offer tremendous parallelism for training through GPUs. However, another heavy aspect of Machine Learning is I/O, and thi s can potentially be a performance bottleneck. TensorFlow, one of the most popular Deep-Learning platforms, now offers a new profiler interface and allows instrumentation of TensorFlow operations. However, the current profiler only enables analysis at the TensorFlow platform level and does not provide system-level information. In this paper, we extend TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that performs instrumentation through Darshan. We use the same Darshan shared instrumentation library and implement a runtime attachment without using a system preload. We can extract Darshan profiling data structures during TensorFlow execution to enable analysis through the TensorFlow profiler. We visualize the performance results through TensorBoard, the web-based TensorFlow visualization tool. At the same time, we do not alter Darshans existing implementation. We illustrate tf-Darshan by performing two case studies on ImageNet image and Malware classification. We show that by guiding optimization using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by selecting data for staging on fast tier storage. We also show that Darshan has the potential of being used as a runtime library for profiling and providing information for future optimization.
Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimization of resource scheduling and management can bring significant financial ben efits. Achieving this goal requires a deep understanding of the job features and user behaviors. We present a comprehensive study about the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We uncover some interesting conclusions from the perspectives of clusters, jobs and users, which can facilitate the cluster system designs. Second, we introduce a general-purpose framework, which manages resources based on historical data. As case studies, we design: a Quasi-Shortest-Service-First scheduling service, which can minimize the cluster-wide average job completion time by up to 6.5x; and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13%.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا