Building a fault tolerant application using the GASPI communication layer

338 0 0.0 ( 0 )

Download Cite

Added by Faisal Shahzad

Publication date 2015

fields Informatics Engineering

and research's language is English

Authors Faisal Shahzad - Moritz Kreutzer - Thomas Zeiser

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures, the User-Level Failure Mitigation (ULFM) proposal being currently the most promising approach is still in its prototype phase. In our work we use GASPI, which is a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend applications to allow integrate a low cost fault detection mechanism and, if necessary, recover the application on the fly. The aspects of process management, the restoration of groups and the recovery mechanism is presented in detail. We use a sparse matrix vector multiplication based application to perform the analysis of the overhead introduced by such modifications. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of reasonably acceptable order and shows good scalability.

rate research

Building a fault-tolerant quantum computer using concatenated cat codes

202 - Christopher Chamberland , Kyungjoo Noh , Patricio Arrangoiz-Arriola 2020

We present a comprehensive architectural analysis for a fault-tolerant quantum computer based on cat codes concatenated with outer quantum error-correcting codes. For the physical hardware, we propose a system of acoustic resonators coupled to superconducting circuits with a two-dimensional layout. Using estimated near-term physical parameters for electro-acoustic systems, we perform a detailed error analysis of measurements and gates, including CNOT and Toffoli gates. Having built a realistic noise model, we numerically simulate quantum error correction when the outer code is either a repetition code or a thin rectangular surface code. Our next step toward universal fault-tolerant quantum computation is a protocol for fault-tolerant Toffoli magic state preparation that significantly improves upon the fidelity of physical Toffoli gates at very low qubit cost. To achieve even lower overheads, we devise a new magic-state distillation protocol for Toffoli states. Combining these results together, we obtain realistic full-resource estimates of the physical error rates and overheads needed to run useful fault-tolerant quantum algorithms. We find that with around 1,000 superconducting circuit components, one could construct a fault-tolerant quantum computer that can run circuits which are intractable for classical supercomputers. Hardware with 32,000 superconducting circuit components, in turn, could simulate the Hubbard model in a regime beyond the reach of classical computing.

Quantum Physics

Fault Tolerant Frequent Pattern Mining

216 - Sameh Shohdy , Abhinav Vishnu , Gagan Agrawal 2016

FP-Growth algorithm is a Frequent Pattern Min- ing (FPM) algorithm that has been extensively used to study correlations and patterns in large scale datasets. While several researchers have designed distributed memory FP-Growth algorithms, it is pivotal to consider fault tolerant FP-Growth, which can address the increasing fault rates in large scale systems. In this work, we propose a novel parallel, algorithm-level fault-tolerant FP-Growth algorithm. We leverage algorithmic properties and MPI advanced features to guarantee an O(1) space complexity, achieved by using the dataset memory space itself for checkpointing. We also propose a recovery algorithm that can use in-memory and disk-based checkpointing, though in many cases the recovery can be completed without any disk access, and incurring no memory overhead for checkpointing. We evaluate our FT algorithm on a large scale InfiniBand cluster with several large datasets using up to 2K cores. Our evaluation demonstrates excellent efficiency for checkpointing and recovery in comparison to the disk-based approach. We have also observed 20x average speed-up in comparison to Spark, establishing that a well designed algorithm can easily outperform a solution based on a general fault-tolerant programming model.

Distributed Parallel and Cluster Computing

Fault-tolerant Coding for Quantum Communication

93 - Matthias Christandl , Alexander Muller-Hermes 2020

Designing encoding and decoding circuits to reliably send messages over many uses of a noisy channel is a central problem in communication theory. When studying the optimal transmission rates achievable with asymptotically vanishing error it is usually assumed that these circuits can be implemented using noise-free gates. While this assumption is satisfied for classical machines in many scenarios, it is not expected to be satisfied in the near term future for quantum machines where decoherence leads to faults in the quantum gates. As a result, fundamental questions regarding the practical relevance of quantum channel coding remain open. By combining techniques from fault-tolerant quantum computation with techniques from quantum communication, we initiate the study of these questions. We introduce fault-tolera

Quantum Physics Information Theory Mathematical Physics

Qubit metrology for building a fault-tolerant quantum computer

150 - John M. Martinis 2015

Recent progress in quantum information has led to the start of several large national and industrial efforts to build a quantum computer. Researchers are now working to overcome many scientific and technological challenges. The programs biggest obstacle, a potential showstopper for the entire effort, is the need for high-fidelity qubit operations in a scalable architecture. This challenge arises from the fundamental fragility of quantum information, which can only be overcome with quantum error correction. In a fault-tolerant quantum computer the qubits and their logic interactions must have errors below a threshold: scaling up with more and more qubits then brings the net error probability down to appropriate levels ~ $10^{-18}$ needed for running complex algorithms. Reducing error requires solving problems in physics, control, materials and fabrication, which differ for every implementation. I explain here the common key driver for continued improvement - the metrology of qubit errors.

Quantum Physics

Hermes: a Fast, Fault-Tolerant and Linearizable Replication Protocol

215 - A. Katsarakis 2020

Todays datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is accomplished with the help of a reliable replication protocol that is responsible for maintaining the replicas strongly-consistent even when faults occur. Strong consistency is preferred to weaker consistency models that cannot guarantee an intuitive behavior for the clients. Furthermore, to accommodate high demand at real-time latencies, datastores must deliver high throughput and low latency. This work introduces Hermes, a broadcast-based reliable replication protocol for in-memory datastores that provides both high throughput and low latency by enabling local reads and fully-concurrent fast writes at all replicas. Hermes couples logical timestamps with cache-coherence-inspired invalidations to guarantee linearizability, avoid write serialization at a centralized ordering point, resolve write conflicts locally at each replica (hence ensuring that writes never abort) and provide fault-tolerance via replayable writes. Our implementation of Hermes over an RDMA-enabled reliable datastore with five replicas shows that Hermes consistently achieves higher throughput than state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write ratios while also significantly reducing tail latency. At 5% writes, the tail latency of Hermes is 3.6X lower than that of CRAQ and ZAB.

Distributed Parallel and Cluster Computing