No Arabic abstract
Consider a complete communication network of $n$ nodes, where the nodes receive a common clock pulse. We study the synchronous $c$-counting problem: given any starting state and up to $f$ faulty nodes with arbitrary behaviour, the task is to eventually have all correct nodes labeling the pulses with increasing values modulo $c$ in agreement. Thus, we are considering algorithms that are self-stabilising despite Byzantine failures. In this work, we give new algorithms for the synchronous counting problem that (1) are deterministic, (2) have optimal resilience, (3) have a linear stabilisation time in $f$ (asymptotically optimal), (4) use a small number of states, and consequently, (5) communicate a small number of bits per round. Prior algorithms either resort to randomisation, use a large number of states and need high communication bandwidth, or have suboptimal resilience. In particular, we achieve an exponential improvement in both state complexity and message size for deterministic algorithms. Moreover, we present two complementary approaches for reducing the number of bits communicated during and after stabilisation.
The celebrated result of Fischer, Lynch and Paterson is the fundamental lower bound for asynchronous fault tolerant computation: any 1-crash resilient asynchronous agreement protocol must have some (possibly measure zero) probability of not terminating. In 1994, Ben-Or, Kelmer and Rabin published a proof-sketch of a lesser known lower bound for asynchronous fault tolerant computation with optimal resilience against a Byzantine adversary: if $nle 4t$ then any t-resilient asynchronous verifiable secret sharing protocol must have some non-zero probability of not terminating. Our main contribution is to revisit this lower bound and provide a rigorous and more general proof. Our second contribution is to show how to avoid this lower bound. We provide a protocol with optimal resilience that is almost surely terminating for a strong common coin functionality. Using this new primitive we provide an almost surely terminating protocol with optimal resilience for asynchronous Byzantine agreement that has a new fair validity property. To the best of our knowledge this is the first asynchronous Byzantine agreement with fair validity in the information theoretic setting.
We investigate the minimal number of failures that can partition a system where processes communicate both through shared memory and by message passing. We prove that this number precisely captures the resilience that can be achieved by algorithms that implement a variety of shared objects, like registers and atomic snapshots, and solve common tasks, like randomized consensus, approximate agreement and renaming. This has implications for the m&m-model and for the hybrid, cluster-based model.
Consider a fully-connected synchronous distributed system consisting of $n$ nodes, where up to $f$ nodes may be faulty and every node starts in an arbitrary initial state. In the synchronous $C$-counting problem, all nodes need to eventually agree on a counter that is increased by one modulo $C$ in each round for given $C>1$. In the self-stabilising firing squad problem, the task is to eventually guarantee that all non-faulty nodes have simultaneous responses to external inputs: if a subset of the correct nodes receive an external go signal as input, then all correct nodes should agree on a round (in the not-too-distant future) in which to jointly output a fire signal. Moreover, no node should generate a fire signal without some correct node having previously received a go signal as input. We present a framework reducing both tasks to binary consensus at very small cost. For example, we obtain a deterministic algorithm for self-stabilising Byzantine firing squads with optimal resilience $f<n/3$, asymptotically optimal stabilisation and response time $O(f)$, and message size $O(log f)$. As our framework does not restrict the type of consensus routines used, we also obtain efficient randomised solutions, and it is straightforward to adapt our framework for other types of permanent faults.
The power of networks manifests itself in a highly non-linear amplification of a number of effects, and their weakness - in propagation of cascading failures. The potential systemic risk effects can be either exacerbated or mitigated, depending on the resilience characteristics of the network. The goals of this paper are to study some characteristics of network amplification and resilience. We simulate random Erdos-Renyi networks and measure amplification by varying node capacity, transaction volume, and expected failure rates. We discover that network throughput scales almost quadratically with respect to the node capacity and that the effects of excessive network load and random and irreparable node faults are equivalent and almost perfectly anticorrelated. This knowledge can be used by capacity planners to determine optimal reliability requirements that maximize the optimal operational regions.
We study the problem of optimizing a non-convex loss function (with saddle points) in a distributed framework in the presence of Byzantine machines. We consider a standard distributed setting with one central machine (parameter server) communicating with many worker machines. Our proposed algorithm is a variant of the celebrated cubic-regularized Newton method of Nesterov and Polyak cite{nest}, which avoids saddle points efficiently and converges to local minima. Furthermore, our algorithm resists the presence of Byzantine machines, which may create emph{fake local minima} near the saddle points of the loss function, also known as saddle-point attack. We robustify the cubic-regularized Newton algorithm such that it avoids the saddle points and the fake local minimas efficiently. Furthermore, being a second order algorithm, the iteration complexity is much lower than its first order counterparts, and thus our algorithm communicates little with the parameter server. We obtain theoretical guarantees for our proposed scheme under several settings including approximate (sub-sampled) gradients and Hessians. Moreover, we validate our theoretical findings with experiments using standard datasets and several types of Byzantine attacks.