No Arabic abstract
We present a simple library which equips MPI implementations with truly asynchronous non-blocking point-to-point operations, and which is independent of the underlying communication infrastructure. It utilizes the MPI profiling interface (PMPI) and the MPI_THREAD_MULTIPLE thread compatibility level, and works with curre
Due to the increasing size of HPC machines, the fault presence is becoming an eventuality that applications must face. Natively, MPI provides no support for the execution past the detection of a fault, and this is becoming more and more constraining. With the introduction of ULFM (User Level Fault Mitigation library), it has been provided with a possible way to overtake a fault during the application execution at the cost of code modifications. ULFM is intrusive in the application and requires also a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency in an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library is capable to expose resiliency features to the application in a transparent manner thus removing any integration effort. Upon fault, the failed nodes are discarded and the execution continues only with the non-failed ones. A hierarchical implementation of the solution has been also proposed to reduce the overhead of the repair process when scaling towards a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and it does not limit the scalability properties of MPI. Moreover, we also integrated the solution in real-world applications to further prove its robustness by injecting faults.
Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time does often not yield a satisfactory prediction of parallel runtime due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI-OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.
Transparently checkpointing MPI for fault tolerance and load balancing is a long-standing problem in HPC. The problem has been complicated by the need to provide checkpoint-restart services for all combinations of an MPI implementation over all network interconnects. This work presents MANA (MPI-Agnostic Network-Agnostic transparent checkpointing), a single code base which supports all MPI implementation and interconnect combinations. The agnostic properties imply that one can checkpoint an MPI application under one MPI implementation and perhaps over TCP, and then restart under a second MPI implementation over InfiniBand on a cluster with a different number of CPU cores per node. This technique is based on a novel split-process approach, which enables two separate programs to co-exist within a single process with a single address space. This work overcomes the limitations of the two most widely adopted transparent checkpointing solutions, BLCR and DMTCP/InfiniBand, which require separate modifications to each MPI implementation and/or underlying network API. The runtime overhead is found to be insignificant both for checkpoint-restart within a single host, and when comparing a local MPI computation that was migrated to a remote cluster against an ordinary MPI computation running natively on that same remote cluster.
Scaling supercomputers comes with an increase in failure rates due to the increasing number of hardware components. In standard practice, applications are made resilient through checkpointing data and restarting execution after a failure occurs to resume from the latest check-point. However, re-deploying an application incurs overhead by tearing down and re-instating execution, and possibly limiting checkpointing retrieval from slow permanent storage. In this paper we present Reinit++, a new design and implementation of the Reinit approach for global-restart recovery, which avoids application re-deployment. We extensively evaluate Reinit++ contrasted with the leading MPI fault-tolerance approach of ULFM, implementing global-restart recovery, and the typical practice of restarting an application to derive new insight on performance. Experimentation with three different HPC proxy applications made resilient to withstand process and node failures shows that Reinit++ recovers much faster than restarting, up to 6x, or ULFM, up to 3x, and that it scales excellently as the number of MPI processes grows.
New techniques in X-ray scattering science experiments produce large data sets that can require millions of high-performance processing hours per week of computation for analysis. In such applications, data is typically moved from X-ray detectors to a large parallel file system shared by all nodes of a petascale supercomputer and then is read repeatedly as different science application tasks proceed. However, this straightforward implementation causes significant contention in the file system. We propose an alternative approach in which data is instead staged into and cached in compute node memory for extended periods, during which time various processing tasks may efficiently access it. We describe here such a big data staging framework, based on MPI-IO and the Swift parallel scripting language. We discuss a range of large-scale data management issues involved in X-ray scattering science and measure the performance benefits of the new staging framework for high-energy diffraction microscopy, an important emerging application in data-intensive X-ray scattering. We show that our framework accelerates scientific processing turnaround from three months to under 10 minutes, and that our I/O technique reduces input overheads by a factor of 5 on 8K Blue Gene/Q nodes.