Deep Learning Training in Facebook Data Centers: Design of Scale-up and Scale-out Systems

125 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Maxim Naumov

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Maxim Naumov - John Kim - Dheevatsa Mudigere

النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Large-scale training is important to ensure high performance and accuracy of machine-learning models. At Facebook we use many different models, including computer vision, video and language models. However, in this paper we focus on the deep learning recommendation models (DLRMs), which are responsible for more than 50% of the training demand in our data centers. Recommendation models present unique challenges in training because they exercise not only compute but also memory capacity as well as memory and network bandwidth. As model size and complexity increase, efficiently scaling training becomes a challenge. To address it we design Zion - Facebooks next-generation large-memory training platform that consists of both CPUs and accelerators. Also, we discuss the design requirements of future scale-out training systems.

قيم البحث

85 - Srinivas Sridharan , Karthikeyan Vaidyanathan , Dhiraj Kalamkar 2018

The exponential growth in use of large deep neural networks has accelerated the need for training these deep neural networks in hours or even minutes. This can only be achieved through scalable and efficient distributed training, since a single node/ card cannot satisfy the compute, memory, and I/O requirements of todays state-of-the-art deep neural networks. However, scaling synchronous Stochastic Gradient Descent (SGD) is still a challenging problem and requires continued research/development. This entails innovations spanning algorithms, frameworks, communication libraries, and system design. In this paper, we describe the philosophy, design, and implementation of Intel Machine Learning Scalability Library (MLSL) and present proof-points demonstrating scaling DL training on 100s to 1000s of nodes across Cloud and HPC systems.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Distributed Training Large-Scale Deep Architectures

119 - Shang-Xuan Zou , Chun-Yen Chen , Jui-Lin Wu 2017

Scale of data and scale of computation infrastructures together enable the current deep learning renaissance. However, training large-scale deep architectures demands both algorithmic improvement and careful system configuration. In this paper, we fo cus on employing the system approach to speed up large-scale training. Via lessons learned from our routine benchmarking effort, we first identify bottlenecks and overheads that hinter data parallelism. We then devise guidelines that help practitioners to configure an effective system and fine-tune parameters to achieve desired speedup. Specifically, we develop a procedure for setting minibatch size and choosing computation algorithms. We also derive lemmas for determining the quantity of key components such as the number of GPUs and parameter servers. Experiments and examples show that these guidelines help effectively speed up large-scale deep learning training.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي التعلم الالي

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

63 - Jongsoo Park , Maxim Naumov , Protonu Basu 2018

The application of deep learning techniques resulted in remarkable improvement of machine learning models. In this paper provides detailed characterizations of deep learning models used in many Facebook social network services. We present computation al characteristics of our models, describe high performance optimizations targeting existing systems, point out their limitations and make suggestions for the future general-purpose/accelerated inference hardware. Also, we highlight the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers.

التعلم الآلي التعلم الالي

SLIDE : In Defense of Smart Algorithms over Hardware Acceleration for Large-Scale Deep Learning Systems

108 - Beidi Chen , Tharun Medini , James Farwell 2019

Deep Learning (DL) algorithms are the central focus of modern machine learning systems. As data volumes keep growing, it has become customary to train large neural networks with hundreds of millions of parameters to maintain enough capacity to memori ze these volumes and obtain state-of-the-art accuracy. To get around the costly computations associated with large models and data, the community is increasingly investing in specialized hardware for model training. However, specialized hardware is expensive and hard to generalize to a multitude of tasks. The progress on the algorithmic front has failed to demonstrate a direct advantage over powerful hardware such as NVIDIA-V100 GPUs. This paper provides an exception. We propose SLIDE (Sub-LInear Deep learning Engine) that uniquely blends smart randomized algorithms, with multi-core parallelism and workload optimization. Using just a CPU, SLIDE drastically reduces the computations during both training and inference outperforming an optimized implementation of Tensorflow (TF) on the best available GPU. Our evaluations on industry-scale recommendation datasets, with large fully connected architectures, show that training with SLIDE on a 44 core CPU is more than 3.5 times (1 hour vs. 3.5 hours) faster than the same network trained using TF on Tesla V100 at any given accuracy level. On the same CPU hardware, SLIDE is over 10x faster than TF. We provide codes and scripts for reproducibility.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

H2O-Cloud: A Resource and Quality of Service-Aware Task Scheduling Framework for Warehouse-Scale Data Centers -- A Hierarchical Hybrid DRL (Deep Reinforcement Learning) based Approach

89 - Mingxi Cheng , Ji Li , Paul Bogdan 2019

Cloud computing has attracted both end-users and Cloud Service Providers (CSPs) in recent years. Improving resource utilization rate (RUtR), such as CPU and memory usages on servers, while maintaining Quality-of-Service (QoS) is one key challenge fac ed by CSPs with warehouse-scale data centers. Prior works proposed various algorithms to reduce energy cost or to improve RUtR, which either lack the fine-grained task scheduling capabilities, or fail to take a comprehensive system model into consideration. This article presents H2O-Cloud, a Hierarchical and Hybrid Online task scheduling framework for warehouse-scale CSPs, to improve resource usage effectiveness while maintaining QoS. H2O-Cloud is highly scalable and considers comprehensive information such as various workload scenarios, cloud platform configurations, user request information and dynamic pricing model. The hierarchy and hybridity of the framework, combined with its deep reinforcement learning (DRL) engines, enable H2O-Cloud to efficiently start on-the-go scheduling and learning in an unpredictable environment without pre-training. Our experiments confirm the high efficiency of the proposed H2O-Cloud when compared to baseline approaches, in terms of energy and cost while maintaining QoS. Compared with a state-of-the-art DRL-based algorithm, H2O-Cloud achieves up to 201.17% energy cost efficiency improvement, 47.88% energy efficiency improvement and 551.76% reward rate improvement.

النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات