In this paper we share our experience in transforming a parallel code for a Computational Fluid Dynamics (CFD) problem into a parallel version for the RedisDG workflow engine. This system is able to accommodate heterogeneous and highly dynamic environments, thanks to opportunistic scheduling strategies. We show how to move to the field of HPC as a Service in order to use heterogeneous platforms. Through the CFD use case, we explain how to transform the parallel code, and we highlight the challenges of unfolding the task graph dynamically in order to improve the overall performance (in a broad sense) of the workflow engine. In particular, we discuss the impact of this dynamic feature on the workflow engine. This paper argues that new models for High Performance Computing are possible, provided we rethink our approaches in light of the potential of new paradigms such as cloud and edge computing.
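To make the idea of dynamic task-graph unfolding concrete, the following minimal Python sketch (an illustration only, not the RedisDG API; the queue names and the three-task graph are assumptions) publishes a task to a Redis work queue only once all of its predecessors have reported completion, so the graph unfolds opportunistically as results arrive.

```python
import json
import redis  # redis-py client; a local Redis server is assumed

# Hypothetical three-task CFD graph: task -> predecessors (illustration only)
GRAPH = {"mesh": set(), "solve": {"mesh"}, "post": {"solve"}}

r = redis.Redis(host="localhost", port=6379)
submitted, finished = set(), set()

def unfold():
    """Push every not-yet-submitted task whose predecessors have all finished."""
    for task, preds in GRAPH.items():
        if task not in submitted and preds <= finished:
            r.lpush("todo", json.dumps({"task": task}))
            submitted.add(task)

unfold()                              # seed the queue with the root tasks
while finished != set(GRAPH):
    _, msg = r.brpop("results")       # workers are assumed to push completions here
    finished.add(json.loads(msg)["task"])
    unfold()                          # unfold the graph dynamically as results arrive
```

Publishing only the tasks that have become runnable keeps the engine independent of when and where workers pick them up, which is the property that opportunistic scheduling on heterogeneous, dynamic platforms relies on.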
This paper aims to create a transition path from file-based IO to streaming-based workflows for scientific applications in an HPC environment. By using the openPMD-api, traditional workflows limited by filesystem bottlenecks can be overcome and flexibly extended for in situ analysis. The openPMD-api is a library for the description of scientific data according to the Open Standard for Particle-Mesh Data (openPMD). Its approach to the recent challenges posed by hardware heterogeneity lies in decoupling the description of data in domain sciences, such as plasma physics simulations, from concrete implementations in hardware and IO. The streaming backend is provided by the ADIOS2 framework, developed at Oak Ridge National Laboratory. This paper surveys two openPMD-based loosely coupled setups to demonstrate flexible applicability and to evaluate performance. In loose coupling, as opposed to tight coupling, two (or more) applications are executed separately, e.g. in individual MPI contexts, yet cooperate by exchanging data. This way, a streaming-based workflow allows for standalone codes instead of tightly coupled plugins, using a unified streaming-aware API and leveraging the high-speed communication infrastructure available in modern compute clusters for massive data exchange. We identify new challenges in resource allocation and in the need for strategies for flexible data distribution, demonstrating their influence on efficiency and scaling on the Summit compute system. The presented setups show the potential for a more flexible use of compute resources brought by streaming IO, as well as the ability to increase throughput by avoiding filesystem bottlenecks.
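The writer side of such a loosely coupled setup can be sketched with the openPMD-api Python bindings as below; the mesh and component names, array sizes, and loop length are illustrative assumptions, and a recent library version is assumed. Opening the series with the .sst filename extension selects the ADIOS2 SST streaming engine, so a separately launched reader opening the same series receives each iteration over the network instead of from the filesystem.

```python
import numpy as np
import openpmd_api as io  # openPMD-api Python bindings

# ".sst" selects the ADIOS2 SST streaming engine instead of a file-based backend.
series = io.Series("simData.sst", io.Access.create)

for step in range(10):
    it = series.write_iterations()[step]          # streaming-aware iteration access
    field = it.meshes["E"]["x"]                   # illustrative mesh/component names
    data = np.random.rand(64, 64)                 # stand-in for simulation output
    field.reset_dataset(io.Dataset(data.dtype, data.shape))
    field.store_chunk(data)                       # register the chunk for output
    it.close()                                    # flush and publish this step to readers

series.close()                                    # end the stream
```

A reader would open the same series with read access and iterate over the published steps, which is what allows the analysis code to remain a standalone application in its own MPI context.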
Hardware-aware design and optimization are crucial in exploiting emerging architectures for PDE-based computational fluid dynamics applications. In this work, we study optimizations aimed at accelerating OpenFOAM-based applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI to provide parallel multi-processor functionality, which scales well on homogeneous systems but does not fully utilize the potential per-node performance on hybrid heterogeneous platforms. In our study, we use two OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov iterative methods. We propose a number of optimizations of the dominant kernel of the Krylov solver, aimed at accelerating the overall execution of the applications on modern GPU-accelerated heterogeneous platforms. Experimental results show that the proposed hybrid implementation significantly outperforms the state-of-the-art implementation.
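The dominant kernel of such Krylov solvers is the sparse matrix-vector product. As a hedged sketch of the general idea of offloading it to a GPU (not the paper's OpenFOAM implementation), the following example runs an unpreconditioned conjugate gradient iteration with CuPy, assuming a CUDA-capable device and the cupy package; the 1D Laplacian test system is an illustrative stand-in.

```python
import scipy.sparse as sp
import cupy as cp
import cupyx.scipy.sparse as cusp

def cg(A, b, tol=1e-8, max_iter=1000):
    """Unpreconditioned conjugate gradient; A @ p (SpMV) executes on the GPU."""
    x = cp.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs_old = cp.dot(r, r)
    for _ in range(max_iter):
        Ap = A @ p                      # dominant kernel: sparse matrix-vector product
        alpha = rs_old / cp.dot(p, Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = cp.dot(r, r)
        if cp.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

# Illustrative SPD system (1D Laplacian) built on the host, then moved to the GPU
n = 1000
A_host = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
A = cusp.csr_matrix(A_host)             # transfer the sparse matrix to device memory
x = cg(A, cp.ones(n))
```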
Data centres that use consumer-grade disk drives and distributed peer-to-peer systems are unreliable environments in which to archive data without enough redundancy. Most redundancy schemes are not completely effective at providing high availability, durability and integrity in the long term. We propose alpha entanglement codes, a mechanism that creates a virtual layer of highly interconnected storage devices to propagate redundant information across a large-scale storage system. Our motivation is to design flexible and practical erasure codes with high fault tolerance to improve data durability and availability even in catastrophic scenarios. By flexible and practical, we mean code settings that can be adapted to future requirements and practical implementations with reasonable trade-offs between security, resource usage and performance. The codes have three parameters. Alpha increases storage overhead linearly but increases the number of possible paths to recover data exponentially. Two other parameters increase fault tolerance even further without the need for additional storage. As a result, an entangled storage system can provide high availability and durability and offer additional integrity: it is more difficult to modify data undetectably. We evaluate how several redundancy schemes perform in unreliable environments and show that alpha entanglement codes are flexible and practical codes. Remarkably, they excel at code locality, hence they reduce repair costs and become less dependent on storage locations with poor availability. Our solution outperforms Reed-Solomon codes in many disaster recovery scenarios.
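As a toy illustration of the entanglement principle (a single-strand chain only, not the full alpha code with its three parameters), the sketch below XORs each new data block with the previous parity, so redundant information propagates along the chain and a lost data block can be rebuilt from its two neighbouring parities.

```python
import os

def entangle(blocks):
    """Single-strand entanglement: p_i = d_i XOR p_{i-1}, with p_0 all zeros."""
    parities = [bytes(len(blocks[0]))]            # all-zero bootstrap parity
    for d in blocks:
        p_prev = parities[-1]
        parities.append(bytes(a ^ b for a, b in zip(d, p_prev)))
    return parities[1:]                           # drop the bootstrap parity

def repair(i, parities):
    """Recover a lost data block d_i from the parities around it: d_i = p_i XOR p_{i-1}."""
    p_prev = parities[i - 1] if i > 0 else bytes(len(parities[0]))
    return bytes(a ^ b for a, b in zip(parities[i], p_prev))

data = [os.urandom(16) for _ in range(5)]         # illustrative 16-byte blocks
par = entangle(data)
assert repair(3, par) == data[3]                  # d_3 rebuilt without reading d_3 itself
```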
Traditional heterogeneous parallel algorithms, designed for heterogeneous clusters of workstations, are based on the assumption that the absolute speed of the processors does not depend on the size of the computational task. This assumption has proved inaccurate for modern and prospective highly heterogeneous HPC platforms. A new class of algorithms based on the functional performance model (FPM), which represents the speed of a processor by a function of problem size, has recently been proposed. However, these algorithms cannot be employed in self-adaptable applications because of the very high cost of constructing the functional performance model. This paper presents a new class of parallel algorithms for highly heterogeneous HPC platforms. Like traditional FPM-based algorithms, these algorithms assume that the speed of the processors is characterized by speed functions rather than speed constants. Unlike the traditional algorithms, they do not assume the speed functions to be given. Instead, they estimate the speed functions of the processors for different problem sizes during their execution. These algorithms do not construct the full speed function for each processor but rather build and use partial estimates sufficient for optimal distribution of computations with a given accuracy. The low execution cost of distributing computations between heterogeneous processors in these algorithms makes them suitable for employment in self-adaptable applications. Experiments with parallel matrix multiplication applications based on this approach are performed on local and global heterogeneous computational clusters. The results show that the execution time of optimal matrix distribution between processors is less, by orders of magnitude, than the total execution time of the optimized application.
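A simplified sketch of this runtime estimation idea (not the paper's actual algorithm; the local stand-in workers, matrix sizes, and number of refinement steps are assumptions) measures each processor's speed on its current share of a matrix multiplication and redistributes rows proportionally to the latest partial speed estimates.

```python
import time
import numpy as np

def measure_speed(worker, rows, cols):
    """Benchmark a worker on its current share: speed = rows processed per second."""
    a = np.random.rand(rows, cols)
    b = np.random.rand(cols, cols)
    t0 = time.perf_counter()
    worker(a, b)
    return rows / (time.perf_counter() - t0)

def distribute(total_rows, speeds):
    """Split rows proportionally to the latest partial speed estimates."""
    shares = [int(total_rows * s / sum(speeds)) for s in speeds]
    shares[0] += total_rows - sum(shares)          # give the rounding remainder away
    return shares

# Local stand-in "workers"; real ones would be remote heterogeneous processes
workers = [np.matmul, np.matmul]
n, cols = 4096, 512
shares = [n // len(workers)] * len(workers)        # start from a homogeneous split
for _ in range(3):                                 # refine the partial estimates at runtime
    speeds = [measure_speed(w, r, cols) for w, r in zip(workers, shares)]
    shares = distribute(n, speeds)
```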
RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems) is a recently started project funded within the H2020 FETHPC programme, which is expressly targeted at exploring new High-Performance Computing (HPC) technologies. RECIPE aims at introducing a hierarchical runtime resource management infrastructure to optimize energy efficiency and minimize the occurrence of thermal hotspots, while enforcing the time constraints imposed by the applications and ensuring reliability for both time-critical and throughput-oriented computations that run on deeply heterogeneous accelerator-based systems. This paper presents a detailed overview of RECIPE, identifying the fundamental challenges as well as the key innovations addressed by the project. In particular, the need for predictive reliability approaches to maximize hardware lifetime and guarantee application performance is identified as the key concern for RECIPE. It is addressed via hierarchical resource management of the heterogeneous architectural components of the system, driven by estimates of application latency and hardware reliability, obtained respectively through timing analysis and through modelling of thermal properties and the mean time to failure of subsystems. We show the impact of prediction accuracy on the overheads imposed by the checkpointing policy, as well as a possible application to a weather forecasting use case.
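To illustrate how reliability prediction accuracy can feed into checkpointing overhead (a generic sketch based on the first-order Young/Daly approximation, not RECIPE's actual policy; the checkpoint cost and MTTF values are assumptions), the example below derives the checkpoint interval from a predicted mean time to failure and evaluates the resulting overhead against the true value.

```python
import math

def optimal_interval(checkpoint_cost, mttf):
    """First-order Young/Daly approximation of the optimal checkpoint interval."""
    return math.sqrt(2.0 * checkpoint_cost * mttf)

def expected_overhead(interval, checkpoint_cost, mttf):
    """First-order overhead fraction: checkpointing cost plus expected re-computation."""
    return checkpoint_cost / interval + interval / (2.0 * mttf)

C, true_mttf = 60.0, 24 * 3600.0            # illustrative: 60 s checkpoints, 24 h true MTTF
for error in (0.5, 1.0, 2.0):               # predicted MTTF = error * true MTTF
    interval = optimal_interval(C, error * true_mttf)
    print(error, round(expected_overhead(interval, C, true_mttf), 4))
```

An over- or underestimated MTTF shifts the interval away from its optimum and inflates the overhead, which is the effect that accurate reliability prediction is meant to avoid.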