The robustness of distributed optimization is an emerging field of study, motivated by applications such as distributed machine learning, distributed sensing, and swarm robotics. As distributed systems rapidly grow in scale, resilient distributed optimization algorithms are needed to mitigate system failures, communication issues, and even malicious attacks. This survey investigates the current state of fault-tolerance research in distributed optimization and aims to provide an overview of existing studies on both fault-tolerant distributed optimization theory and applicable algorithms.
This paper considers the problem of Byzantine fault-tolerance in distributed multi-agent optimization. In this problem, each agent has a local cost function, and in the fault-free case the goal is to design a distributed algorithm that allows all the agents to find a minimum point of all the agents' aggregate cost function. We consider a scenario where some agents might be Byzantine faulty, which renders the original goal of computing a minimum point of all the agents' aggregate cost vacuous. A more reasonable objective for an algorithm in this scenario is to allow all the non-faulty agents to compute a minimum point of only the non-faulty agents' aggregate cost. Prior work shows that if up to $f$ (out of $n$) agents are Byzantine faulty, then a minimum point of the non-faulty agents' aggregate cost can be computed exactly if and only if the non-faulty agents' costs satisfy a certain redundancy property called $2f$-redundancy. However, $2f$-redundancy is an ideal property that can be satisfied only in systems free from noise or uncertainties, which can make the goal of exact fault-tolerance unachievable in some applications. Thus, we introduce the notion of $(f,\epsilon)$-resilience, a generalization of exact fault-tolerance wherein the objective is to find an approximate minimum point of the non-faulty aggregate cost, with $\epsilon$ accuracy. This approximate fault-tolerance can be achieved under a weaker condition that is easier to satisfy in practice than $2f$-redundancy. We obtain necessary and sufficient conditions for achieving $(f,\epsilon)$-resilience, characterizing the correlation between the relaxation in redundancy and the approximation in resilience. For the case when the agents' cost functions are differentiable, we obtain conditions for $(f,\epsilon)$-resilience of the distributed gradient-descent method when equipped with robust gradient aggregation.
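The abstract does not specify which robust aggregation rule is used; as a concrete illustration, the following is a minimal sketch of gradient descent equipped with one widely studied rule, the coordinate-wise trimmed mean, which tolerates up to $f$ Byzantine agents out of $n > 2f$. All names here are illustrative, not the paper's implementation.

```python
import numpy as np

def trimmed_mean(gradients: np.ndarray, f: int) -> np.ndarray:
    """Coordinate-wise trimmed mean: per coordinate, drop the f largest
    and f smallest reported values, then average the remaining n - 2f.

    gradients: (n, d) array with one row per agent's reported gradient.
    f: maximum number of Byzantine agents tolerated (requires n > 2f).
    """
    n, _ = gradients.shape
    assert n > 2 * f, "trimmed mean requires n > 2f agents"
    sorted_grads = np.sort(gradients, axis=0)   # sort each coordinate independently
    return sorted_grads[f:n - f].mean(axis=0)   # average the middle n - 2f values

def robust_gradient_step(x, local_grads, f, lr=0.1):
    """One gradient-descent step driven by the robust aggregate rather
    than the plain average, which a single Byzantine agent could skew."""
    return x - lr * trimmed_mean(np.asarray(local_grads), f)
```

Because up to $f$ reported gradients may be arbitrary, the trimming guarantees the aggregate stays within the range of honest values in every coordinate, which is the kind of property the paper's $(f,\epsilon)$-resilience analysis quantifies.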
Machine learning (ML) provides us with numerous opportunities, allowing ML systems to adapt to new situations and contexts. At the same time, this adaptability raises uncertainties concerning the run-time product quality or dependability of these systems, such as their reliability and security. Systems can be tested and monitored, but this does not provide protection against faults and failures in the adapted ML systems themselves. We studied software designs that aim to introduce fault tolerance in ML systems so that possible problems in the ML components of the systems can be avoided. The research was conducted as a case study, and its data was collected through five semi-structured interviews with experienced software architects. We present a conceptualisation of the misbehaviour of ML systems, the perceived role of fault tolerance, and the designs used. Common patterns for incorporating ML components into designs in a fault-tolerant fashion have started to emerge. ML models are, for example, guarded by monitoring the inputs and their distribution, and by enforcing business rules on acceptable outputs. Multiple specialised ML models are used to adapt to variations and changes in the surrounding world, and simpler fall-over techniques such as default outputs are put in place to keep systems up and running in the face of problems. However, the general role of these patterns is not widely acknowledged. This is mainly due to the relative immaturity of using ML as part of a complete software system: the field still lacks established frameworks and practices, beyond training, for implementing, operating, and maintaining software that utilises ML. ML software engineering needs further analysis and development on all fronts.
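As an illustration of the guard patterns described above (input monitoring, business rules on outputs, and a default fall-over value), the following sketch wraps a generic model behind simple range checks. The thresholds, the `default_output` fallback, and the scalar `predict` interface are hypothetical placeholders, not designs taken from the interviews.

```python
import numpy as np

class GuardedModel:
    """Wraps an ML model with input and output guards plus a fall-over
    default, mirroring the patterns reported in the case study."""

    def __init__(self, model, input_range=(0.0, 1.0),
                 output_range=(0.0, 100.0), default_output=0.0):
        self.model = model                    # any object with a predict() method
        self.input_range = input_range        # crude stand-in for distribution monitoring
        self.output_range = output_range      # business rule on acceptable outputs
        self.default_output = default_output  # fall-over value used on guard failure

    def predict(self, x):
        lo, hi = self.input_range
        if not np.all((x >= lo) & (x <= hi)):   # input guard: reject out-of-range inputs
            return self.default_output
        y = self.model.predict(x)               # assumed to return a scalar here
        out_lo, out_hi = self.output_range
        if not (out_lo <= y <= out_hi):          # output guard: enforce business rule
            return self.default_output
        return y
```

A production design would replace the range checks with real distribution monitoring and domain-specific rules, but the control flow, guard, predict, guard, fall back, is the pattern the interviewees described.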
In recent years, data and computing resources have typically been distributed across the devices of end users, various regions, or organizations. Because of laws or regulations, these distributed data and computing resources cannot be directly shared among different regions or organizations for machine learning tasks. Federated learning (FL) emerges as an efficient approach to exploit distributed data and computing resources, so as to collaboratively train machine learning models while obeying laws and regulations and ensuring data security and data privacy. In this paper, we provide a comprehensive survey of existing work on federated learning. We propose a functional architecture of federated learning systems and a taxonomy of related techniques. Furthermore, we discuss the distributed training, data communication, and security of FL systems. Finally, we analyze their limitations and propose future research directions.
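The survey covers many techniques; as a minimal, self-contained example of the core idea, here is a sketch of federated averaging (FedAvg), the canonical FL training loop in which clients exchange model weights rather than raw data. The least-squares `local_update` is a stand-in for each client's real local training.

```python
import numpy as np

def local_update(weights, data, lr=0.01, epochs=1):
    """Client-side step: train on local data only; here a single
    least-squares gradient step per epoch as a toy stand-in."""
    X, y = data
    for _ in range(epochs):
        grad = X.T @ (X @ weights - y) / len(y)
        weights = weights - lr * grad
    return weights

def federated_averaging(global_w, client_data, rounds=10):
    """Server-side loop: each round, every client trains locally and the
    server averages the returned weights, weighted by local data size.
    Raw data never leaves the clients; only weights are communicated."""
    for _ in range(rounds):
        updates, sizes = [], []
        for data in client_data:
            updates.append(local_update(global_w.copy(), data))
            sizes.append(len(data[1]))
        total = sum(sizes)
        global_w = sum(w * (s / total) for w, s in zip(updates, sizes))
    return global_w
```

This baseline is what most of the distributed-training, communication, and security techniques in the survey build on or harden.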
Serverless computing has grown in popularity in recent years, with an increasing number of applications being built on Functions-as-a-Service (FaaS) platforms. By default, FaaS platforms support retry-based fault tolerance, but this is insufficient for programs that modify shared state, as they can unwittingly persist partial sets of updates in case of failures. To address this challenge, we would like atomic visibility of the updates made by a FaaS application. In this paper, we present AFT, an atomic fault tolerance shim for serverless applications. AFT interposes between a commodity FaaS platform and a storage engine and ensures atomic visibility of updates by enforcing the read atomic isolation guarantee. AFT supports new protocols to guarantee read atomic isolation in the serverless setting. We demonstrate that AFT introduces minimal overhead relative to existing storage engines and scales smoothly to thousands of requests per second, while preventing a significant number of consistency anomalies.
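AFT's actual protocols are more involved than this, but the toy in-memory shim below illustrates the core guarantee: a function's buffered writes become visible all at once at commit, or not at all, so a mid-execution failure never persists a partial update set. The class and its single-node design are illustrative assumptions, not AFT's implementation.

```python
import threading
import uuid

class ReadAtomicShimSketch:
    """Toy model of atomic visibility: writes are buffered per request
    and installed atomically on commit. (Real read atomic isolation also
    requires readers to see consistent snapshots across keys, which this
    sketch omits, and AFT layers its protocols over a storage engine.)"""

    def __init__(self):
        self._committed = {}        # key -> last committed value
        self._lock = threading.Lock()

    def begin(self):
        return {"id": uuid.uuid4().hex, "writes": {}}

    def write(self, txn, key, value):
        txn["writes"][key] = value  # buffer only; nothing visible yet

    def read(self, txn, key):
        if key in txn["writes"]:    # read-your-writes within the request
            return txn["writes"][key]
        with self._lock:
            return self._committed.get(key)

    def commit(self, txn):
        # Install the whole write set atomically; a crash before this
        # point leaves no partial updates behind.
        with self._lock:
            self._committed.update(txn["writes"])
```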
Miniaturized satellites are currently not considered suitable for critical, high-priority, and complex multi-phased missions due to their low reliability. As hardware-side fault tolerance (FT) solutions designed for larger spacecraft cannot be adopted aboard very small satellites due to budget, energy, and size constraints, we developed a hybrid FT approach based only upon COTS components, commodity processor cores, library IP, and standard software. This approach facilitates fault detection, isolation, and recovery in software, and utilizes fault-coverage techniques across the embedded stack within a multiprocessor system-on-chip (MPSoC). This allows our FPGA-based proof-of-concept implementation to deliver strong fault coverage even for long-duration missions, and to adapt to varying performance requirements during the mission. The operator of a spacecraft utilizing this approach can define performance profiles, which allow an on-board computer (OBC) to trade between processing capacity, fault coverage, and energy consumption using simple heuristics. The software-side FT approach also offers advantages if deployed aboard larger spacecraft, through spare-resource pooling that enables an OBC to handle permanent faults more efficiently. This FT approach in part mimics the way critical biological systems tolerate and adjust to failures, enabling graceful ageing of an MPSoC.
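The notion of operator-defined performance profiles can be pictured with a small heuristic like the following; the concrete profiles, core counts, replication degrees, and power figures are invented for illustration and do not come from the paper.

```python
from dataclasses import dataclass

@dataclass
class Profile:
    name: str
    active_cores: int   # processing capacity
    replication: int    # fault coverage, e.g. degree of core redundancy
    power_mw: int       # energy consumption

# Hypothetical operator-defined profiles trading capacity vs. coverage vs. power.
PROFILES = [
    Profile("high-performance", active_cores=8, replication=1, power_mw=900),
    Profile("balanced",         active_cores=4, replication=2, power_mw=600),
    Profile("high-assurance",   active_cores=2, replication=3, power_mw=450),
]

def select_profile(phase_critical: bool, battery_fraction: float) -> Profile:
    """Simple heuristic in the spirit of the abstract: prefer fault
    coverage during critical mission phases, prefer throughput otherwise,
    and always respect a toy energy budget scaled by battery level."""
    budget_mw = battery_fraction * 1000          # illustrative budget model
    candidates = [p for p in PROFILES if p.power_mw <= budget_mw]
    if not candidates:                           # nothing fits: take the cheapest
        return min(PROFILES, key=lambda p: p.power_mw)
    key = (lambda p: p.replication) if phase_critical else (lambda p: p.active_cores)
    return max(candidates, key=key)
```

An OBC running such a heuristic would, for example, switch to the high-assurance profile before a critical burn and back to high-performance for payload processing, which is the kind of in-mission adaptation the abstract describes.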