MANA for MPI: MPI-Agnostic Network-Agnostic Transparent Checkpointing

Published by: Rohan Garg

Publication date: 2019

Research field: Informatics Engineering

Language: English

Authors: Rohan Garg, Gregory Price, Gene Cooperman

Subjects: Distributed, Parallel, and Cluster Computing; Operating Systems

Abstract

Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel split-process approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.
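As a rough illustration of the split-process idea described in the abstract, the following Python toy model keeps application state (called here an "upper half") separate from a stand-in for the MPI library and network (a "lower half"). Only the upper half is saved at checkpoint time, and a fresh lower half is created at restart. All class and function names are hypothetical and are not part of MANA, which operates on real process memory rather than Python objects.

```python
import json


class LowerHalf:
    """Hypothetical stand-in for the MPI library and network state (never checkpointed)."""

    def __init__(self, implementation, network):
        self.implementation = implementation
        self.network = network

    def send(self, payload):
        # A real lower half would call into the MPI library here.
        return f"{self.implementation}/{self.network}: sent {payload}"


class UpperHalf:
    """Hypothetical stand-in for the application state (the only part checkpointed)."""

    def __init__(self, step=0, data=None):
        self.step = step
        self.data = list(data) if data is not None else []

    def compute(self, lower):
        # Advance the computation, then communicate through whichever
        # lower half is currently attached.
        self.step += 1
        self.data.append(self.step * self.step)
        return lower.send(self.data[-1])


def checkpoint(upper):
    # Serialize application state only; the lower half is simply dropped.
    return json.dumps({"step": upper.step, "data": upper.data})


def restart(image, implementation, network):
    # Restore application state and pair it with a freshly created lower half,
    # which may use a different MPI implementation and network.
    state = json.loads(image)
    return UpperHalf(state["step"], state["data"]), LowerHalf(implementation, network)


upper = UpperHalf()
lower = LowerHalf("mpi_impl_A", "tcp")
upper.compute(lower)
upper.compute(lower)
image = checkpoint(upper)

restored, new_lower = restart(image, "mpi_impl_B", "infiniband")
message = restored.compute(new_lower)
print(message)
```

Because the checkpoint image in this sketch contains no reference to the lower half, the restored computation continues from step 3 regardless of which implementation and network names are passed to `restart`.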

198 - Prashant Singh Chouhan 2021

Checkpoint/restart (C/R) provides fault-tolerant computing capability, enables long-running applications, and provides scheduling flexibility for computing centers to support diverse workloads with different priorities. It is therefore vital to get transparent C/R capability working at NERSC. MANA, by Garg et al., is a transparent checkpointing tool that has been selected due to its MPI-agnostic and network-agnostic approach. However, originally written as a proof-of-concept code, MANA was not ready to use with NERSC's diverse production workloads, which are dominated by MPI and hybrid MPI+OpenMP applications. In this talk, we present ongoing work at NERSC to enable MANA for NERSC's production workloads, including fixing bugs that were exposed by the top applications at NERSC, adding new features to address system changes, evaluating C/R overhead at scale, etc. The lessons learned from making MANA production-ready for HPC applications will be useful for C/R tool developers, supercomputing centers and HPC end-users alike.

Distributed, Parallel, and Cluster Computing

Asynchronous MPI for the Masses

243 - Markus Wittmann , Georg Hager , Thomas Zeiser 2013

We present a simple library which equips MPI implementations with truly asynchronous non-blocking point-to-point operations, and which is independent of the underlying communication infrastructure. It utilizes the MPI profiling interface (PMPI) and the MPI_THREAD_MULTIPLE thread compatibility level, and works with curre

Distributed, Parallel, and Cluster Computing; Performance

Legio: Fault Resiliency for Embarrassingly Parallel MPI Applications

75 - Roberto Rocco , Davide Gadioli , Gianluca Palermo 2021

As HPC machines grow in size, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this is becoming increasingly constraining. ULFM (the User Level Fault Mitigation library) provides a way to recover from a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library is able to expose resiliency features to the application transparently, thus removing any integration effort. Upon a fault, the failed nodes are discarded and execution continues with only the non-failed ones. A hierarchical implementation of the solution is also proposed to reduce the overhead of the repair process when scaling to a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and that it does not limit the scalability properties of MPI. Moreover, we also integrated the solution in real-world applications and injected faults to further demonstrate its robustness.

Distributed, Parallel, and Cluster Computing; Performance

Big Data Staging with MPI-IO for Interactive X-ray Science

94 - Justin M. Wozniak , Hemant Sharma , Timothy G. Armstrong and Michael Wilde 2020

New techniques in X-ray scattering science experiments produce large data sets that can require millions of high-performance processing hours per week of computation for analysis. In such applications, data is typically moved from X-ray detectors to a large parallel file system shared by all nodes of a petascale supercomputer and then is read repeatedly as different science application tasks proceed. However, this straightforward implementation causes significant contention in the file system. We propose an alternative approach in which data is instead staged into and cached in compute node memory for extended periods, during which time various processing tasks may efficiently access it. We describe here such a big data staging framework, based on MPI-IO and the Swift parallel scripting language. We discuss a range of large-scale data management issues involved in X-ray scattering science and measure the performance benefits of the new staging framework for high-energy diffraction microscopy, an important emerging application in data-intensive X-ray scattering. We show that our framework accelerates scientific processing turnaround from three months to under 10 minutes, and that our I/O technique reduces input overheads by a factor of 5 on 8K Blue Gene/Q nodes.

Distributed, Parallel, and Cluster Computing

Reinit++: Evaluating the Performance of Global-Restart Recovery Methods For MPI Fault Tolerance

192 - Giorgis Georgakoudis , Luanzheng Guo , Ignacio Laguna 2021

Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest checkpoint. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6x, or ULFM, up to 3x, and that it scales excellently as the number of MPI processes grows.

Distributed, Parallel, and Cluster Computing
