Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Erasure coding for fault oblivious linear system solvers

599 0 0.0 ( 0 )

Download Cite

Added by David Gleich

Publication date 2014

fields Informatics Engineering

and research's language is English

Authors David F. Gleich - Ananth Grama - Yao Zhu

Numerical Analysis Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Dealing with hardware and software faults is an important problem as parallel and distributed systems scale to millions of processing cores and wide area networks. Traditional methods for dealing with faults include checkpoint-restart, active replicas, and deterministic replay. Each of these techniques has associated resource overheads and constraints. In this paper, we propose an alternate approach to dealing with faults, based on input augmentation. This approach, which is an algorithmic analog of erasure coded storage, applies a minimally modified algorithm on the augmented input to produce an augmented output. The execution of such an algorithm proceeds completely oblivious to faults in the system. In the event of one or more faults, the real solution is recovered using a rapid reconstruction method from the augmented output. We demonstrate this approach on the problem of solving sparse linear systems using a conjugate gradient solver. We present input augmentation and output recovery techniques. Through detailed experiments, we show that our approach can be made oblivious to a large number of faults with low computational overhead. Specifically, we demonstrate cases where a single fault can be corrected with less than 10% overhead in time, and even in extreme cases (fault rates of 20%), our approach is able to compute a solution with reasonable overhead. These results represent a significant improvement over the state of the art.

rate research

Sparsity Exploiting Erasure Coding for Resilient Storage and Efficient I/O Access in Delta based Versioning Systems

383 - J. Harshan , Frederique Oggier , Anwitaman Datta 2014

In this paper we study the problem of storing reliably an archive of versioned data. Specifically, we focus on systems where the differences (deltas) between subseque

Information Theory Distributed Parallel and Cluster Computing Information Theory

Erasure Coding for Distributed Storage: An Overview

152 - S. B. Balaji , M. Nikhil Krishnan , Myna Vajha 2018

In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node. Efficient repair calls for erasure codes that in the face of node failure, are efficient in terms of minimizing the amount of repair data transferred over the network, the amount of data accessed at a helper node as well as the number of helper nodes contacted. Coding theory has evolved to handle these challenges by introducing two new classes of erasure codes, namely regenerating codes and locally recoverable codes as well as by coming up with novel ways to repair the ubiquitous Reed-Solomon code. This survey provides an overview of the efforts in this direction that have taken place over the past decade.

Information Theory Information Theory

Fault-tolerant Coding for Quantum Communication

93 - Matthias Christandl , Alexander Muller-Hermes 2020

Designing encoding and decoding circuits to reliably send messages over many uses of a noisy channel is a central problem in communication theory. When studying the optimal transmission rates achievable with asymptotically vanishing error it is usually assumed that these circuits can be implemented using noise-free gates. While this assumption is satisfied for classical machines in many scenarios, it is not expected to be satisfied in the near term future for quantum machines where decoherence leads to faults in the quantum gates. As a result, fundamental questions regarding the practical relevance of quantum channel coding remain open. By combining techniques from fault-tolerant quantum computation with techniques from quantum communication, we initiate the study of these questions. We introduce fault-tolera

Quantum Physics Information Theory Mathematical Physics

Parallel Tensor Compression for Large-Scale Scientific Data

130 - Woody Austin , Grey Ballard , Tamara G. Kolda 2015

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 variables per grid point for 128 time steps yields 8~TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.

Numerical Analysis Distributed Parallel and Cluster Computing

A space-time parallel solver for the three-dimensional heat equation

422 - Robert Speck , Daniel Ruprecht , Matthew Emmett 2013

The paper presents a combination of the time-parallel parallel full approximation scheme in space and time (PFASST) with a parallel multigrid method (PMG) in space, resulting in a mesh-based solver for the three-dimensional heat equation with a uniquely high degree of efficient concurrency. Parallel scaling tests are reported on the Cray XE6 machine Monte Rosa on up to 16,384 cores and on the IBM Blue Gene/Q system JUQUEEN on up to 65,536 cores. The efficacy of the combined spatial- and temporal parallelization is shown by demonstrating that using PFASST in addition to PMG significantly extends the strong-scaling limit. Implications of using spatial coarsening strategies in PFASSTs multi-level hierarchy in large-scale parallel simulations are discussed.

Numerical Analysis Distributed Parallel and Cluster Computing Numerical Analysis

comments

Fetching comments

Al-Andalus University for Medical Sciences

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Erasure coding for fault oblivious linear system solvers

Ask ChatGPT about the research

No Arabic abstract

Read More