Erasure coding for fault oblivious linear system solvers

197 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل David Gleich

تاريخ النشر 2014

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف David F. Gleich - Ananth Grama - Yao Zhu

التحليل العددي النظم الموزعة والتوازية والحوسبة العنقودية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Dealing with hardware and software faults is an important problem as parallel and distributed systems scale to millions of processing cores and wide area networks. Traditional methods for dealing with faults include checkpoint-restart, active replicas, and deterministic replay. Each of these techniques has associated resource overheads and constraints. In this paper, we propose an alternate approach to dealing with faults, based on input augmentation. This approach, which is an algorithmic analog of erasure coded storage, applies a minimally modified algorithm on the augmented input to produce an augmented output. The execution of such an algorithm proceeds completely oblivious to faults in the system. In the event of one or more faults, the real solution is recovered using a rapid reconstruction method from the augmented output. We demonstrate this approach on the problem of solving sparse linear systems using a conjugate gradient solver. We present input augmentation and output recovery techniques. Through detailed experiments, we show that our approach can be made oblivious to a large number of faults with low computational overhead. Specifically, we demonstrate cases where a single fault can be corrected with less than 10% overhead in time, and even in extreme cases (fault rates of 20%), our approach is able to compute a solution with reasonable overhead. These results represent a significant improvement over the state of the art.

قيم البحث

128 - J. Harshan , Frederique Oggier , Anwitaman Datta 2014

In this paper we study the problem of storing reliably an archive of versioned data. Specifically, we focus on systems where the differences (deltas) between subseque

نظرية المعلومات النظم الموزعة والتوازية والحوسبة العنقودية نظرية المعلومات

Erasure Coding for Distributed Storage: An Overview

152 - S. B. Balaji , M. Nikhil Krishnan , Myna Vajha 2018

In a distributed storage system, code symbols are dispersed across space in nodes or storage units as opposed to time. In settings such as that of a large data center, an important consideration is the efficient repair of a failed node. Efficient rep air calls for erasure codes that in the face of node failure, are efficient in terms of minimizing the amount of repair data transferred over the network, the amount of data accessed at a helper node as well as the number of helper nodes contacted. Coding theory has evolved to handle these challenges by introducing two new classes of erasure codes, namely regenerating codes and locally recoverable codes as well as by coming up with novel ways to repair the ubiquitous Reed-Solomon code. This survey provides an overview of the efforts in this direction that have taken place over the past decade.

نظرية المعلومات نظرية المعلومات

Fault-tolerant Coding for Quantum Communication

93 - Matthias Christandl , Alexander Muller-Hermes 2020

Designing encoding and decoding circuits to reliably send messages over many uses of a noisy channel is a central problem in communication theory. When studying the optimal transmission rates achievable with asymptotically vanishing error it is usual ly assumed that these circuits can be implemented using noise-free gates. While this assumption is satisfied for classical machines in many scenarios, it is not expected to be satisfied in the near term future for quantum machines where decoherence leads to faults in the quantum gates. As a result, fundamental questions regarding the practical relevance of quantum channel coding remain open. By combining techniques from fault-tolerant quantum computation with techniques from quantum communication, we initiate the study of these questions. We introduce fault-tolera

فيزياء الكم نظرية المعلومات الفيزياء الرياضية

Parallel Tensor Compression for Large-Scale Scientific Data

130 - Woody Austin , Grey Ballard , Tamara G. Kolda 2015

As parallel computing trends towards the exascale, scientific data produced by high-fidelity simulations are growing increasingly massive. For instance, a simulation on a three-dimensional spatial grid with 512 points per dimension that tracks 64 var iables per grid point for 128 time steps yields 8~TB of data, assuming double precision. By viewing the data as a dense five-way tensor, we can compute a Tucker decomposition to find inherent low-dimensional multilinear structure, achieving compression ratios of up to 5000 on real-world data sets with negligible loss in accuracy. So that we can operate on such massive data, we present the first-ever distributed-memory parallel implementation for the Tucker decomposition, whose key computations correspond to parallel linear algebra operations, albeit with nonstandard data layouts. Our approach specifies a data distribution for tensors that avoids any tensor data redistribution, either locally or in parallel. We provide accompanying analysis of the computation and communication costs of the algorithms. To demonstrate the compression and accuracy of the method, we apply our approach to real-world data sets from combustion science simulations. We also provide detailed performance results, including parallel performance in both weak and strong scaling experiments.

التحليل العددي النظم الموزعة والتوازية والحوسبة العنقودية

A space-time parallel solver for the three-dimensional heat equation

134 - Robert Speck , Daniel Ruprecht , Matthew Emmett 2013

The paper presents a combination of the time-parallel parallel full approximation scheme in space and time (PFASST) with a parallel multigrid method (PMG) in space, resulting in a mesh-based solver for the three-dimensional heat equation with a uniqu ely high degree of efficient concurrency. Parallel scaling tests are reported on the Cray XE6 machine Monte Rosa on up to 16,384 cores and on the IBM Blue Gene/Q system JUQUEEN on up to 65,536 cores. The efficacy of the combined spatial- and temporal parallelization is shown by demonstrating that using PFASST in addition to PMG significantly extends the strong-scaling limit. Implications of using spatial coarsening strategies in PFASSTs multi-level hierarchy in large-scale parallel simulations are discussed.

التحليل العددي النظم الموزعة والتوازية والحوسبة العنقودية التحليل العددي

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

الجامعة السورية الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Erasure coding for fault oblivious linear system solvers

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً