Characterizing Impacts of Storage Faults on HPC Applications: A Methodology and Insights

Added by Dingwen Tao

Publication date 2021

Fields Informatics Engineering

Language English

Authors Bo Fang - Daoce Wang - Sian Jin

Distributed Parallel and Cluster Computing

In recent years, the increasing complexity of scientific simulations and the emerging demand for training heavy artificial intelligence models have required massive and fast data access, which pushes high-performance computing (HPC) platforms to adopt more advanced storage infrastructures such as solid-state disks (SSDs). While SSDs offer high-performance I/O, the reliability challenges that HPC applications face under SSD-related failures remain unclear, in particular for failures that result in data corruption. The goal of this paper is to understand the impact of SSD-related faults on the behavior of complex HPC applications. To this end, we propose FFIS, a FUSE-based fault injection framework that systematically introduces storage faults into the application layer to model errors originating from SSDs. FFIS can plant different I/O-related faults into the data returned from the underlying file system, which enables investigation of the error resilience characteristics of scientific file formats. We demonstrate the use of FFIS with three representative real HPC applications, showing how each application reacts to data corruption, and provide insights into the error resilience of the HDF5 file format, which is widely adopted by HPC applications.
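The idea of planting a fault into data returned by a read can be illustrated with a minimal sketch. This is not FFIS's implementation: the abstract does not specify its fault models or interfaces, so the single-bit-flip model and the names `flip_bit` and `maybe_corrupt` are assumptions made for illustration only. A FUSE-based tool would presumably apply something similar inside its read handler before handing the buffer back to the application.

```python
import random


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    if not 0 <= bit_index < len(data) * 8:
        raise ValueError("bit_index out of range")
    corrupted = bytearray(data)
    # Locate the byte, then toggle the chosen bit inside it.
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def maybe_corrupt(data: bytes, fault_probability: float, rng: random.Random):
    """Model a faulty read: with the given probability, flip one random bit.

    Returns (buffer, injected), where `injected` records whether a fault
    was planted. The single-bit-flip model is an assumption, not taken
    from the paper.
    """
    if data and rng.random() < fault_probability:
        return flip_bit(data, rng.randrange(len(data) * 8)), True
    return data, False


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so an experiment can be repeated
    original = b"example payload read from storage"
    returned, injected = maybe_corrupt(original, 0.5, rng)
    print("fault injected:", injected)
    print("buffers differ:", returned != original)
```

Seeding the random generator keeps an injection campaign repeatable, which matters when comparing how an application reacts to the same corruption across runs.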

Optimal Storage under Unsynchronized Mobile Byzantine Faults

Silvia Bonomi, Antonella Del Pozzo 2017

In this paper we prove lower and matching upper bounds for the number of servers required to implement a regular shared register that tolerates unsynchronized Mobile Byzantine failures. We consider the strongest model of Mobile Byzantine failures to date: agents are moved arbitrarily by an omniscient adversary from one server to another in order to make their computation deviate in an unforeseen manner. When a server is infected by a Byzantine agent, it behaves arbitrarily until the adversary decides to move the agent to another server. Previous approaches considered asynchronous servers with synchronous mobile Byzantine agents (yielding impossibility results), and synchronous servers with synchronous mobile Byzantine agents (yielding optimal solutions for regular register implementation, even in the case where server and agent periods are decoupled). We consider the remaining open case of synchronous servers with unsynchronized agents, which can move at their own pace and change their pace during the execution of the protocol. Most of our findings relate to lower bounds and to characterizing the model parameters that make the problem solvable. It turns out that unsynchronized mobile Byzantine agent movements require completely new proof arguments, which can be of independent interest when studying other problems in this model. Additionally, we propose a generic server-based algorithm that emulates a regular register in this model and that is tight with respect to the number of mobile Byzantine agents that can be tolerated. Our emulation spans two awareness models: servers with and without self-diagnosis mechanisms. In the first case, servers are aware that the mobile Byzantine agent has left, and hence they can stop running the protocol until they recover a correct state, while in the second case servers are not aware of their faulty state and continue to run the protocol using an incorrect local state.

Distributed Parallel and Cluster Computing

Blockchain Systems, Technologies and Applications: A Methodology Perspective

Bin Cao, Zixin Wang, Long Zhang 2021

In the past decade, blockchain has shown promise for building trust without any powerful third party in a secure, decentralized and scalable manner. However, due to its wide application and future development from cryptocurrency to the Internet of Things, blockchain is an extremely complex system that integrates mathematics, finance, computer science, communication and network engineering, etc. As a result, it is a challenge for engineers, experts and researchers to fully understand the blockchain process in a systematic, top-down view. First, this article introduces how blockchain works, the research activities and challenges, and illustrates the roadmap involving the classic methodologies with typical blockchain use cases and topics. Second, it discusses in detail how to adopt stochastic processes, game theory, optimization, machine learning and cryptography to study the blockchain running process and to design blockchain protocols and algorithms. Moreover, the advantages and limitations of these methods are summarized as a guide for future work. Finally, some remaining problems from technical, commercial and political views are discussed as open issues. The main findings of this article provide an overview, from a methodology perspective, for studying theoretical models to understand blockchain fundamentals, designing network services for blockchain-based mechanisms and algorithms, and applying blockchain to the Internet of Things, etc.

Distributed Parallel and Cluster Computing Cryptography and Security

Design and Optimization of OpenFOAM-based CFD Applications for Hybrid and Heterogeneous HPC Platforms

Amani AlOnazi, David Keyes, Alexey Lastovetsky 2015

Hardware-aware design and optimization is crucial in exploiting emerging architectures for PDE-based computational fluid dynamics applications. In this work, we study optimizations aimed at acceleration of OpenFOAM-based applications on emerging hybrid heterogeneous platforms. OpenFOAM uses MPI to provide parallel multi-processor functionality, which scales well on homogeneous systems but does not fully utilize the potential per-node performance on hybrid heterogeneous platforms. In our study, we use two OpenFOAM applications, icoFoam and laplacianFoam, both based on Krylov iterative methods. We propose a number of optimizations of the dominant kernel of the Krylov solver, aimed at acceleration of the overall execution of the applications on modern GPU-accelerated heterogeneous platforms. Experimental results show that the proposed hybrid implementation significantly outperforms the state-of-the-art implementation.

Distributed Parallel and Cluster Computing

Posit NPB: Assessing the Precision Improvement in HPC Scientific Applications

Steven W. D. Chien, Ivy B. Peng, Stefano Markidis 2019

Floating-point operations can significantly impact the accuracy and performance of scientific applications on large-scale parallel systems. Recently, an emerging floating-point format called Posit has attracted attention as an alternative to the standard IEEE floating-point formats because it could enable higher precision than IEEE formats using the same number of bits. In this work, we first explore the feasibility of Posit encoding in representative HPC applications by providing a 32-bit Posit NAS Parallel Benchmark (NPB) suite. Then, we evaluate the accuracy improvement in different HPC kernels compared to the IEEE 754 format. Our results indicate that using Posit encoding improves precision by 0.6 to 1.4 decimal digits for all tested kernels and proxy-applications. We also quantify the overhead of the current software implementation of Posit encoding as 4x-19x that of the IEEE 754 hardware implementation. Our study highlights the potential of hardware implementations of Posit to benefit a broad range of HPC applications.

Distributed Parallel and Cluster Computing

Build and Execution Environment (BEE): an Encapsulated Environment Enabling HPC Applications Running Everywhere

Jieyang Chen, Qiang Guan, Xin Liang 2017

Variations in High Performance Computing (HPC) system software configurations mean that applications are typically configured and built for specific HPC environments. Building applications can require a significant investment of time and effort for application users and requires application users to have additional technical knowledge. Linux container technologies such as Docker and Charliecloud bring great benefits to the application development, build and deployment processes. While cloud platforms already widely support containers, HPC systems still have non-uniform support of container technologies. In this work, we propose a unified runtime framework -- Build and Execution Environment (BEE) across both HPC and cloud platforms that allows users to run their containerized HPC applications across all supported platforms without modification. We design four BEE backends for four different classes of HPC or cloud platform so that together they cover the majority of mainstream computing platforms for HPC users. Evaluations show that BEE provides an easy-to-use unified user interface, execution environment, and comparable performance.

Distributed Parallel and Cluster Computing