Error-bounded lossy compression is becoming increasingly important to today's extreme-scale HPC applications because of the ever-increasing volume of data they generate. It has been widely used for in-situ visualization, data stream intensity reduction, storage reduction, I/O performance improvement, checkpoint/restart acceleration, memory footprint reduction, and more. Although many works have optimized compression ratio, quality, and performance for different error-bounded lossy compressors, no existing work attempts to systematically understand the impact of lossy compression errors on HPC applications due to error propagation. In this paper, we propose and develop a lossy compression fault injection tool called LCFI. To the best of our knowledge, this is the first fault injection tool that helps both lossy compressor developers and users systematically and comprehensively understand the impact of lossy compression errors on HPC programs. The contributions of this work are threefold: (1) We propose an efficient approach to inject lossy compression errors according to a statistical analysis of compression errors for different state-of-the-art compressors. (2) We build a fault injector that is highly applicable, customizable, and easy to use in generating top-down comprehensive results, and we demonstrate the use of LCFI. (3) We evaluate LCFI on four representative HPC benchmarks with different abstracted fault models and make several observations about error propagation and its impact on program outputs.
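To make the idea of injecting lossy compression errors concrete, the following is a minimal sketch and not LCFI itself. It assumes an absolute error bound and two illustrative error models (uniform, and a clipped Gaussian with sigma set to one third of the bound); the abstract does not say which distributions LCFI uses. The names `inject_bounded_error`, `toy_kernel`, and `max_abs_diff` are hypothetical, and the averaging kernel merely stands in for an HPC program whose output is compared with and without injected errors.

```python
import random


def inject_bounded_error(values, error_bound, distribution="uniform", seed=None):
    """Return a copy of `values` with each element perturbed by at most `error_bound`.

    The two distributions are assumptions made for illustration only.
    """
    if error_bound < 0:
        raise ValueError("error_bound must be non-negative")
    rng = random.Random(seed)
    perturbed = []
    for v in values:
        if distribution == "uniform":
            err = rng.uniform(-error_bound, error_bound)
        elif distribution == "normal":
            # Assumed model: Gaussian with sigma = bound / 3, clipped to the bound.
            err = rng.gauss(0.0, error_bound / 3.0)
            err = max(-error_bound, min(error_bound, err))
        else:
            raise ValueError("unknown distribution: " + distribution)
        perturbed.append(v + err)
    return perturbed


def toy_kernel(values, steps=10):
    """Hypothetical stand-in for an HPC kernel: repeated periodic 3-point averaging."""
    current = list(values)
    n = len(current)
    for _ in range(steps):
        current = [
            (current[i - 1] + current[i] + current[(i + 1) % n]) / 3.0
            for i in range(n)
        ]
    return current


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(a, b))


if __name__ == "__main__":
    clean = [float(i) for i in range(8)]
    noisy = inject_bounded_error(clean, error_bound=1e-3, seed=42)
    print("input deviation :", max_abs_diff(clean, noisy))
    print("output deviation:", max_abs_diff(toy_kernel(clean), toy_kernel(noisy)))
```

Comparing the input deviation with the output deviation is one simple way to observe whether a program damps or amplifies the injected error. An averaging kernel like this one can only damp it, whereas real applications may behave differently.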
Error-bounded lossy compression is a critical technique for significantly reducing scientific data volumes. With ever-emerging heterogeneous high-performance computing (HPC) architecture, GPU-accelerated error-bounded compressors (such as cuSZ+ and c
As machine learning (ML) has seen increasing adoption in safety-critical domains (e.g., autonomous vehicles), the reliability of ML systems has also grown in importance. While prior studies have proposed techniques to enable efficient error-resilienc
The vast majority of hardware architectures use a carefully timed reference signal to clock their computational logic. However, standard distribution solutions are not fault-tolerant. In this work, we present a simple grid structure as a more reliabl
We present a new algorithm, Fractional Decomposition Tree (FDT), for finding a feasible solution for an integer program (IP) where all variables are binary. FDT runs in polynomial time and is guaranteed to find a feasible integer solution provided the
Extreme-scale cosmological simulations have been widely used by today's researchers and scientists on leadership supercomputers. A new generation of error-bounded lossy compressors has been used in workflows to reduce storage requirements and minimize