Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Dynamic Fault Tolerance Through Resource Pooling

257 0 0.0 ( 0 )

Download Cite

Added by Christian M. Fuchs

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Christian M. Fuchs - Nadia M. Murillo - Aske Plaat

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions, due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft can not be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT-approach based upon only COTS components, commodity processor cores, library IP, and standard software. This approach facilitates fault detection, isolation, and recovery in software, and utilizes fault-coverage techniques across the embedded stack within an multiprocessor system-on-chip (MPSoC). This allows our FPGA-based proof-of-concept implementation to deliver strong fault-coverage even for missions with a long duration, but also to adapt to varying performance requirements during the mission. The operator of a spacecraft utilizing this approach can define performance profiles, which allow an on-board computer (OBC) to trade between processing capacity, fault coverage, and energy consumption using simple heuristics. The software-side FT approach developed also offers advantages if deployed aboard larger spacecraft through spare resource pooling, enabling an OBC to more efficiently handle permanent faults. This FT approach in part mimics a critical biological systemss way of tolerating and adjusting to failures, enabling graceful ageing of an MPSoC.

rate research

Bringing Fault-Tolerant GigaHertz-Computing to Space: A Multi-Stage Software-Side Fault-Tolerance Approach for Miniaturized Spacecraft

75 - Christian M. Fuchs , Todor Stefanov , Nadia Murillo 2017

Modern embedded technology is a driving factor in satellite miniaturization, contributing to a massive boom in satellite launches and a rapidly evolving new space industry. Miniaturized satellites, however, suffer from low reliability, as traditional hardware-based fault-tolerance (FT) concepts are ineffective for on-board computers (OBCs) utilizing modern systems-on-a-chip (SoC). Therefore, larger satellites continue to rely on proven processors with large feature sizes. Software-based concepts have largely been ignored by the space industry as they were researched only in theory, and have not yet reached the level of maturity necessary for implementation. We present the first integral, real-world solution to enable fault-tolerant general-purpose computing with modern multiprocessor-SoCs (MPSoCs) for spaceflight, thereby enabling their use in future high-priority space missions. The presented multi-stage approach consists of three FT stages, combining coarse-grained thread-level distributed self-validation, FPGA reconfiguration, and mixed criticality to assure long-term FT and excellent scalability for both resource constrained and critical high-priority space missions. Early benchmark results indicate a drastic performance increase over state-of-the-art radiation-hard OBC designs and considerably lower software- and hardware development costs. This approach was developed for a 4-year European Space Agency (ESA) project, and we are implementing a tiled MPSoC prototype jointly with two industrial partners.

Distributed Parallel and Cluster Computing Operating Systems

Revisiting Fast Practical Byzantine Fault Tolerance

148 - Ittai Abraham , Guy Gueta , Dahlia Malkhi 2017

In this note, we observe a safety violation in Zyzzyva and a liveness violation in FaB. To demonstrate these issues, we require relatively simple scenarios, involving only four replicas, and one or two view changes. In all of them, the problem is manifested already in the first log slot.

Distributed Parallel and Cluster Computing

A Fault-Tolerance Shim for Serverless Computing

204 - Vikram Sreekanti , Chenggang Wu , Saurav Chhatrapati 2020

Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would like atomic visibility of the updates made by a FaaS application. In this paper, we present AFT, an atomic fault tolerance shim for serverless applications. AFT interposes between a commodity FaaS platform and storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. AFT supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that aft introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.

Distributed Parallel and Cluster Computing Databases

Approximate Byzantine Fault-Tolerance in Distributed Optimization

94 - Shuo Liu , Nirupam Gupta , Nitin H. Vaidya 2021

This paper considers the problem of Byzantine fault-tolerance in distributed multi-agent optimization. In this problem, each agent has a local cost function, and in the fault-free case, the goal is to design a distributed algorithm that allows all the agents to find a minimum point of all the agents aggregate cost function. We consider a scenario where some agents might be Byzantine faulty that renders the original goal of computing a minimum point of all the agents aggregate cost vacuous. A more reasonable objective for an algorithm in this scenario is to allow all the non-faulty agents to compute the minimum point of only the non-faulty agents aggregate cost. Prior work shows that if there are up to $f$ (out of $n$) Byzantine agents then a minimum point of the non-faulty agents aggregate cost can be computed exactly if and only if the non-faulty agents costs satisfy a certain redundancy property called $2f$-redundancy. However, $2f$-redundancy is an ideal property that can be satisfied only in systems free from noise or uncertainties, which can make the goal of exact fault-tolerance unachievable in some applications. Thus, we introduce the notion of $(f,epsilon)$-resilience, a generalization of exact fault-tolerance wherein the objective is to find an approximate minimum point of the non-faulty aggregate cost, with $epsilon$ accuracy. This approximate fault-tolerance can be achieved under a weaker condition that is easier to satisfy in practice, compared to $2f$-redundancy. We obtain necessary and sufficient conditions for achieving $(f,epsilon)$-resilience characterizing the correlation between relaxation in redundancy and approximation in resilience. In case when the agents cost functions are differentiable, we obtain conditions for $(f,epsilon)$-resilience of the distributed gradient-descent method when equipped with robust gradient aggregation.

Distributed Parallel and Cluster Computing

Tycoon: an Implementation of a Distributed, Market-based Resource Allocation System

162 - Kevin Lai , Lars Rasmusson , Eytan Adar 2004

Distributed clusters like the Grid and PlanetLab enable the same statistical multiplexing efficiency gains for computing as the Internet provides for networking. One major challenge is allocating resources in an economically efficient and low-latency way. A common solution is proportional share, where users each get resources in proportion to their pre-defined weight. However, this does not allow users to differentiate the value of their jobs. This leads to economic inefficiency. In contrast, systems that require reservations impose a high latency (typically minutes to hours) to acquire resources. We present Tycoon, a market based distributed resource allocation system based on proportional share. The key advantages of Tycoon are that it allows users to differentiate the value of their jobs, its resource acquisition latency is limited only by communication delays, and it imposes no manual bidding overhead on users. We present experimental results using a prototype implementation of our design.

Distributed Parallel and Cluster Computing Operating Systems

comments

Fetching comments

Kalamoon Private University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Dynamic Fault Tolerance Through Resource Pooling

Ask ChatGPT about the research

No Arabic abstract

Read More