Swift for TensorFlow: A portable, flexible platform for deep learning

81 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Brennan Saeta

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Brennan Saeta - Denys Shabalin - Marc Rasi

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Swift for TensorFlow is a deep learning platform that scales from mobile devices to clusters of hardware accelerators in data centers. It combines a language-integrated automatic differentiation system and multiple Tensor implementations within a modern ahead-of-time compiled language oriented around mutable value semantics. The resulting platform has been validated through use in over 30 deep learning models and has been employed across data center and mobile applications.

قيم البحث

206 - Noam Shazeer , Youlong Cheng , Niki Parmar 2018

Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers fr om problems including the inability to train very large models (due to memory constraints), high latency, and inefficiency at small batch sizes. All of these can be solved by more general distribution strategies (model-parallelism). Unfortunately, efficient model-parallel algorithms tend to be complicated to discover, describe, and to implement, particularly on large clusters. We introduce Mesh-TensorFlow, a language for specifying a general class of distributed tensor computations. Where data-parallelism can be viewed as splitting tensors and operations along the batch dimension, in Mesh-TensorFlow, the user can specify any tensor-dimensions to be split across any dimensions of a multi-dimensional mesh of processors. A Mesh-TensorFlow graph compiles into a SPMD program consisting of parallel operations coupled with collective communication primitives such as Allreduce. We use Mesh-TensorFlow to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model. Using TPU meshes of up to 512 cores, we train Transformer models with up to 5 billion parameters, surpassing state of the art results on WMT14 English-to-French translation task and the one-billion-word language modeling benchmark. Mesh-Tensorflow is available at https://github.com/tensorflow/mesh .

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

A Hardware-Software Blueprint for Flexible Deep Learning Specialization

134 - Thierry Moreau , Tianqi Chen , Luis Vega 2018

Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operator s, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA which implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code-generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is integrated and open-sourced into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية التعلم الالي

FfDL : A Flexible Multi-tenant Deep Learning Platform

190 - K. R. Jayaram , Vinod Muthusamy , Parijat Dube 2019

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and acc urate. As a result, large scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.

النظم الموزعة والتوازية والحوسبة العنقودية التعلم الآلي

Flexible Clustered Federated Learning for Client-Level Data Distribution Shift

158 - Moming Duan , Duo Liu , Xinyuan Ji 2021

Federated Learning (FL) enables the multiple participating devices to collaboratively contribute to a global neural network model while keeping the training data locally. Unlike the centralized training setting, the non-IID, imbalanced (statistical h eterogeneity) and distribution shifted training data of FL is distributed in the federated network, which will increase the divergences between the local models and the global model, further degrading performance. In this paper, we propose a flexible clustered federated learning (CFL) framework named FlexCFL, in which we 1) group the training of clients based on the similarities between the clients optimization directions for lower training divergence; 2) implement an efficient newcomer device cold start mechanism for framework scalability and practicality; 3) flexibly migrate clients to meet the challenge of client-level data distribution shift. FlexCFL can achieve improvements by dividing joint optimization into groups of sub-optimization and can strike a balance between accuracy and communication efficiency in the distribution shift environment. The convergence and complexity are analyzed to demonstrate the efficiency of FlexCFL. We also evaluate FlexCFL on several open datasets and made comparisons with related CFL frameworks. The results show that FlexCFL can significantly improve absolute test accuracy by +10.6% on FEMNIST compared to FedAvg, +3.5% on FashionMNIST compared to FedProx, +8.4% on MNIST compared to FeSEM. The experiment results show that FlexCFL is also communication efficient in the distribution shift environment.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية

Resource Management for Blockchain-enabled Federated Learning: A Deep Reinforcement Learning Approach

140 - Nguyen Quang Hieu , Tran The Anh , Nguyen Cong Luong 2020

Blockchain-enabled Federated Learning (BFL) enables mobile devices to collaboratively train neural network models required by a Machine Learning Model Owner (MLMO) while keeping data on the mobile devices. Then, the model updates are stored in the bl ockchain in a decentralized and reliable manner. However, the issue of BFL is that the mobile devices have energy and CPU constraints that may reduce the system lifetime and training efficiency. The other issue is that the training latency may increase due to the blockchain mining process. To address these issues, the MLMO needs to (i) decide how much data and energy that the mobile devices use for the training and (ii) determine the block generation rate to minimize the system latency, energy consumption, and incentive cost while achieving the target accuracy for the model. Under the uncertainty of the BFL environment, it is challenging for the MLMO to determine the optimal decisions. We propose to use the Deep Reinforcement Learning (DRL) to derive the optimal decisions for the MLMO.

التعلم الآلي النظم الموزعة والتوازية والحوسبة العنقودية بنية الشبكات والإنترنت