Flexible and Scalable Deep Learning with MMLSpark


Published by Mark Hamilton

Publication date: 2018

Research field: Informatics Engineering

Language: English

Authors: Mark Hamilton, Sudarshan Raghunathan, Akshaya Annavajhala

Distributed, Parallel, and Cluster Computing; Machine Learning



In this work we detail a novel open-source library, MMLSpark, that combines the flexible deep learning library Cognitive Toolkit with the distributed computing framework Apache Spark. To achieve this, we have contributed Java language bindings to the Cognitive Toolkit and added several new components to the Spark ecosystem. We also integrate the popular image processing library OpenCV with Spark, and present a tool that automatically generates PySpark wrappers from any SparkML estimator; we use this tool to expose all of this work to the PySpark ecosystem. Finally, we provide a large library of tools for working and developing within the Spark ecosystem. We apply this work to the automated classification of snow leopards in camera trap images, and provide an end-to-end solution for the non-profit conservation organization the Snow Leopard Trust.


K. R. Jayaram, Vinod Muthusamy, Parijat Dube, 2019

Deep learning (DL) is becoming increasingly popular in several application domains and has made several new application features involving computer vision, speech recognition and synthesis, self-driving automobiles, drug design, etc. feasible and accurate. As a result, large-scale on-premise and cloud-hosted deep learning platforms have become essential infrastructure in many organizations. These systems accept, schedule, manage and execute DL training jobs at scale. This paper describes the design, implementation and our experiences with FfDL, a DL platform used at IBM. We describe how our design balances dependability with scalability, elasticity, flexibility and efficiency. We examine FfDL qualitatively through a retrospective look at the lessons learned from building, operating, and supporting FfDL; and quantitatively through a detailed empirical evaluation of FfDL, including the overheads introduced by the platform for various deep learning models, the load and performance observed in a real case study using FfDL within our organization, the frequency of various faults observed including unanticipated faults, and experiments demonstrating the benefits of various scheduling policies. FfDL has been open-sourced.

Distributed, Parallel, and Cluster Computing; Machine Learning

Toward Scalable Machine Learning and Data Mining: the Bioinformatics Case

Faraz Faghri, Sayed Hadi Hashemi, Mohammad Babaeizadeh, 2017

In an effort to overcome the data deluge in computational biology and bioinformatics and to facilitate bioinformatics research in the era of big data, we identify some of the most influential algorithms that have been widely used in the bioinformatics community. These top data mining and machine learning algorithms cover classification, clustering, regression, graphical model-based learning, and dimensionality reduction. The goal of this study is to guide the focus of scalable computing experts in applying new storage and scalable computation designs to the bioinformatics algorithms that most merit their attention, following the engineering maxim "optimize the common case."

Distributed, Parallel, and Cluster Computing; Machine Learning

BSMBench: a flexible and scalable supercomputer benchmark from computational particle physics

Ed Bennett, Luigi Del Debbio, Kirk Jordan, 2014

Lattice Quantum ChromoDynamics (QCD), and by extension its parent field, Lattice Gauge Theory (LGT), make up a significant fraction of supercomputing cycles worldwide. As such, it would be irresponsible not to evaluate machines' suitability for such applications. To this end, a benchmark has been developed to assess the performance of LGT applications on modern HPC platforms. Distinct from previous QCD-based benchmarks, it can probe the behaviour of a variety of theories, which makes it possible to vary the ratio of demands between on-node computations and inter-node communications. The results of testing this benchmark on various recent HPC platforms are presented, and directions for future development are discussed.

Distributed, Parallel, and Cluster Computing; High Energy Physics - Lattice

Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters

Shaohuai Shi, Xianhao Zhou, Shutao Song, 2020

Distributed training techniques have been widely deployed in large-scale deep neural network (DNN) training on dense-GPU clusters. However, on public cloud clusters, due to the moderate inter-connection bandwidth between instances, traditional state-of-the-art distributed training systems cannot scale well in training large-scale models. In this paper, we propose a new computing and communication efficient top-k sparsification communication library for distributed training. To further improve the system scalability, we optimize I/O by proposing a simple yet efficient multi-level data caching mechanism and optimize the update operation by introducing a novel parallel tensor operator. Experimental results on a 16-node Tencent Cloud cluster (each node with 8 Nvidia Tesla V100 GPUs) show that our system is 25%-40% faster than existing state-of-the-art systems on CNNs and Transformer. We finally break the record on DAWNBench on training ResNet-50 to 93% top-5 accuracy on ImageNet.
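For readers unfamiliar with the term, top-k sparsification generally means sending only the k largest-magnitude gradient entries (as index/value pairs) instead of the full dense gradient. The following is a minimal pure-Python sketch of that general idea, not the library described in the abstract. The function names `top_k_sparsify` and `densify` are hypothetical, and returning the dropped entries as a residual for later accumulation is an assumption based on common practice, not something the abstract states.

```python
from typing import List, Tuple


def top_k_sparsify(grad: List[float], k: int) -> Tuple[List[int], List[float], List[float]]:
    """Keep the k largest-magnitude entries of grad.

    Returns (indices, values, residual). The residual holds the entries
    that were NOT selected (assumption: a worker could add it to the next
    step's gradient so that small updates are not lost).
    """
    if not 0 <= k <= len(grad):
        raise ValueError("k must be between 0 and len(grad)")
    # Rank positions by absolute value, largest first.
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    indices = sorted(order[:k])
    values = [grad[i] for i in indices]
    residual = list(grad)
    for i in indices:
        residual[i] = 0.0
    return indices, values, residual


def densify(indices: List[int], values: List[float], size: int) -> List[float]:
    """Rebuild a dense vector from the (index, value) pairs a peer received."""
    dense = [0.0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense


if __name__ == "__main__":
    gradient = [0.1, -2.0, 0.5, 3.0, -0.2]
    idx, vals, rest = top_k_sparsify(gradient, k=2)
    print(idx, vals)                        # only these pairs would be communicated
    print(densify(idx, vals, len(gradient)))
    print(rest)                             # kept locally in this sketch
```

With k much smaller than the gradient length, each worker communicates roughly 2k numbers rather than the whole vector, which is why the approach is attractive when inter-instance bandwidth is limited.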

Distributed, Parallel, and Cluster Computing; Artificial Intelligence

Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models

Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, 2021

Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook and are the single largest AI application in terms of infrastructure demand in its data centers. In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs. We introduce a high-performance scalable software stack based on PyTorch and pair it with the new evolution of the Zion platform, namely ZionEX. We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in terms of time to solution over previous systems. We achieve this by (i) designing the ZionEX platform with a dedicated scale-out network, provisioned with high bandwidth, optimal topology and efficient transport; (ii) implementing an optimized PyTorch-based training stack supporting both model and data parallelism; (iii) developing sharding algorithms capable of hierarchical partitioning of the embedding tables along row and column dimensions and load balancing them across multiple workers; (iv) adding high-performance core operators while retaining flexibility to support optimizers with fully deterministic updates; and (v) leveraging reduced-precision communications, a multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we develop and briefly comment on distributed data ingestion and other supporting services that are required for robust and efficient end-to-end training in production environments.

Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Machine Learning