
Achieving 100X faster simulations of complex biological phenomena by coupling ML to HPC ensembles

Posted by Anda Trifan
Publication date: 2021
Research field: Informatics Engineering
Paper language: English





The use of ML methods to dynamically steer ensemble-based simulations promises significant improvements in the performance of scientific applications. We present DeepDriveMD, a tool for a range of prototypical ML-driven HPC simulation scenarios, and use it to quantify improvements in the scientific performance of ML-driven ensemble-based applications. We discuss its design and characterize its performance. Motivated by the potential for further scientific improvements and applicability to more sophisticated physical systems, we extend the design of DeepDriveMD to support stream-based communication between simulations and learning methods. The extended design demonstrates a 100x speedup in folding proteins and performs 1.6x more simulations per unit time, improving resource utilization relative to the sequential framework. Experiments are performed on leadership-class platforms, at scales of up to O(1000) nodes, and for production workloads. We establish DeepDriveMD as a high-performance framework for ML-driven HPC simulation scenarios that supports diverse simulation and ML back-ends and enables new scientific insights by improving the length- and time-scales accessed.
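The core loop behind this class of applications (run an ensemble of simulations, train a learning method on the aggregated data, and use it to select the next round of starting points) can be sketched in a few lines. The toy random-walk simulation and PCA-based latent model below are illustrative stand-ins rather than the actual DeepDriveMD API; in production, each step maps to concurrent MD and training tasks across HPC nodes.

```python
# Minimal sketch of an ML-driven ensemble steering loop in the spirit of
# DeepDriveMD. The "simulation" and latent model are toy stand-ins, not
# the framework's actual back-ends.
import numpy as np

rng = np.random.default_rng(0)

def run_simulation(start, n_steps=100):
    """Toy MD stand-in: a random walk from a starting conformation."""
    steps = rng.normal(scale=0.1, size=(n_steps, start.size))
    return start + np.cumsum(steps, axis=0)

def train_latent_model(frames, n_components=2):
    """Toy latent model: PCA via SVD (stands in for an autoencoder)."""
    mean = frames.mean(axis=0)
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(frames, mean, basis):
    """Frames poorly captured by the latent model are treated as novel."""
    latent = (frames - mean) @ basis.T
    recon = latent @ basis + mean
    return np.linalg.norm(frames - recon, axis=1)

# Ensemble of starting conformations (one per simulation task).
starts = rng.normal(size=(8, 30))
for iteration in range(5):
    # 1. Run the simulation ensemble (concurrent MD tasks in production).
    frames = np.vstack([run_simulation(s) for s in starts])
    # 2. (Re)train the learning method on the aggregated trajectory data.
    mean, basis = train_latent_model(frames)
    # 3. Steer: restart the ensemble from the most "novel" frames so far.
    errors = reconstruction_error(frames, mean, basis)
    starts = frames[np.argsort(errors)[-len(starts):]]
    print(f"iteration {iteration}: max reconstruction error {errors.max():.3f}")
```

The stream-based extension described in the abstract lets steps 1 and 2 overlap rather than alternate, which is consistent with the improved resource utilization reported above.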




Read also

With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows, heterogeneous machine architectures, parallel file systems, and batch scheduling, care must be taken to facilitate this analysis in a high performance computing (HPC) environment. In this paper, we present Merlin, a workflow framework to enable large ML-friendly ensembles of scientific HPC simulations. By augmenting traditional HPC with distributed compute technologies, Merlin aims to lower the barrier for scientific subject matter experts to incorporate ML into their analysis. In addition to its design, we describe some example applications that Merlin has enabled on leadership-class HPC resources, such as the ML-augmented optimization of nuclear fusion experiments and the calibration of infectious disease models to study the progression of and possible mitigation strategies for COVID-19.
Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized. It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique for increasing HPC system utilization is to colocate multiple applications on the same server. When applications share critical resources, however, contention on shared resources may lead to reduced application performance. In this paper, we show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters, and then exploiting the model to determine an optimized mix of colocated applications. This paper presents a new intelligent resource manager and makes the following contributions: (1) a new machine learning model to predict the performance degradation of colocated applications based on hardware counters and (2) an intelligent scheduling scheme deployed on an existing resource manager to enable application co-scheduling with minimum performance degradation. Our results show that our approach achieves performance improvements of 7% (avg) and 12% (max) compared to the standard policy commonly used by existing job managers.
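As a rough illustration of the approach above, the sketch below trains a generic regressor to map the combined hardware-counter profiles of two applications to a predicted slowdown, then picks the least harmful pairing from a small job queue. The counter set, synthetic training data, and pairing policy are placeholders, not the paper's actual features, model, or scheduler.

```python
# Hedged sketch of ML-guided colocation: learn counters -> slowdown, then
# choose the application pairing with the smallest predicted degradation.
import itertools
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)

# Synthetic training data: concatenated counter vectors of an application
# pair (e.g., cache misses, memory bandwidth, IPC) -> measured slowdown.
n_pairs, n_counters = 200, 6
X = rng.random((n_pairs, 2 * n_counters))
y = 1.0 + 0.02 * X[:, :n_counters].sum(axis=1) * X[:, n_counters:].sum(axis=1)

model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Given a queue of jobs with measured counter profiles, evaluate every
# pairing and colocate the one with the lowest predicted slowdown.
jobs = {name: rng.random(n_counters) for name in ["A", "B", "C", "D"]}
best = min(
    itertools.combinations(jobs, 2),
    key=lambda p: model.predict(np.hstack([jobs[p[0]], jobs[p[1]]])[None])[0],
)
print("colocate:", best)
```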
Model parameter synchronization across GPUs introduces high overheads for data-parallel training at scale. Existing parameter synchronization protocols cannot effectively leverage available network resources in the face of ever increasing hardware heterogeneity. To address this, we propose Blink, a collective communication library that dynamically generates optimal communication primitives by packing spanning trees. We propose techniques to minimize the number of trees generated and extend Blink to leverage heterogeneous communication channels for faster data transfers. Evaluations show that compared to the state-of-the-art (NCCL), Blink can achieve up to 8x faster model synchronization, and reduce end-to-end training time for image classification tasks by up to 40%.
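The multi-tree idea behind Blink can be shown with a toy broadcast: split the buffer into chunks and route each chunk along a different spanning tree, so more links carry data at the same time. The 4-GPU topology and hand-picked edge-disjoint trees below are illustrative assumptions, not Blink's actual tree-packing algorithm.

```python
# Toy multi-tree broadcast: each chunk of the buffer travels down its own
# spanning tree rooted at GPU 0, keeping more links busy concurrently.
import numpy as np

N_GPUS = 4
buffer = np.arange(16.0)  # data to broadcast from GPU 0

# Two edge-disjoint spanning trees over a fully connected 4-GPU topology,
# written as (sender, receiver) edges listed root-outward.
trees = [
    [(0, 1), (1, 2), (2, 3)],  # chain 0 -> 1 -> 2 -> 3
    [(0, 3), (3, 1), (0, 2)],  # uses the remaining links of the clique
]

received = {g: np.zeros_like(buffer) for g in range(N_GPUS)}
received[0] = buffer.copy()

# Assign one contiguous chunk of indices to each tree and forward it.
chunks = np.array_split(np.arange(buffer.size), len(trees))
for tree, idx in zip(trees, chunks):
    for sender, receiver in tree:
        received[receiver][idx] = received[sender][idx]

assert all(np.array_equal(received[g], buffer) for g in range(N_GPUS))
print(f"broadcast of {buffer.size} elements completed via {len(trees)} trees")
```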
Pradeep Dogga, 2021
A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query their logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a machine-learning model can assist developers in distributed systems debugging. We present Revelio, a debugging assistant which takes user reports and system logs as input, and outputs debugging queries that developers can use to find a bug's root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time. Our developer study confirms the utility of Revelio.
The plethora of complex artificial intelligence (AI) algorithms and available high performance computing (HPC) power stimulates the expeditious development of AI components with heterogeneous designs. Consequently, the need for cross-stack performance benchmarking of AI-HPC systems emerges rapidly. The de facto HPC benchmark LINPACK cannot reflect AI computing power and I/O performance without a representative workload. The current popular AI benchmarks like MLPerf have a fixed problem size and therefore limited scalability. To address these issues, we propose an end-to-end benchmark suite utilizing automated machine learning (AutoML), which not only represents real AI scenarios, but is also auto-adaptively scalable to various scales of machines. We implement the algorithms in a highly parallel and flexible way to ensure the efficiency and optimization potential on diverse systems with customizable configurations. We utilize operations per second (OPS), measured in an analytical and systematic approach, as the major metric to quantify AI performance. We perform evaluations on various systems to ensure the benchmark's stability and scalability, from 4 nodes with 32 NVIDIA Tesla T4 (56.1 Tera-OPS measured) up to 512 nodes with 4096 Huawei Ascend 910 (194.53 Peta-OPS measured), and the results show near-linear weak scalability. With a flexible workload and a single metric, our benchmark can easily scale to and rank AI-HPC systems.
