Distributed replication systems based on the replicated state machine model have become ubiquitous as the foundation of modern database systems. To ensure availability in the presence of faults, these systems must be able to replace failed nodes with healthy ones via dynamic reconfiguration. MongoDB is a document-oriented database with a distributed replication mechanism derived from the Raft protocol. In this paper, we present MongoRaftReconfig, a novel dynamic reconfiguration protocol for the MongoDB replication system. MongoRaftReconfig utilizes a logless approach to managing configuration state and decouples the processing of configuration changes from the main database operation log. The protocol's design was influenced by engineering constraints faced when attempting to redesign an unsafe, legacy reconfiguration mechanism that existed previously in MongoDB. We provide a safety proof of MongoRaftReconfig, along with a formal specification in TLA+. To our knowledge, this is the first published safety proof and formal specification of a reconfiguration protocol for a Raft-based system. We also present results from model checking its safety properties on finite protocol instances. Finally, we discuss the conceptual novelties of MongoRaftReconfig and how it can be understood as an optimized and generalized version of the single-server reconfiguration algorithm of Raft, and we present an experimental evaluation of how its optimizations can provide performance benefits for reconfigurations.
Highly-available datastores are widely deployed for online applications. However, many online applications are not content with the simple data access interface currently provided by highly-available datastores. Distributed transaction support is demanded by applications such as the large-scale online payment services of Alipay or Paypal. Current solutions for distributed transactions can spend more than half of the whole transaction processing time in distributed commit. An efficient atomic commit protocol is highly desirable. This paper presents the HACommit protocol, a logless one-phase commit protocol for highly-available systems. HACommit has transaction participants vote for a commit before the client decides to commit or abort the transaction; in comparison, the state-of-the-art practice for distributed commit is to have the client decide before participants vote. The change enables the removal of both the participant logging and the coordinator logging steps in the distributed commit process; it also makes it possible for the transaction data to become visible to other transactions within one communication roundtrip time (i.e., one phase) after the client initiates the transaction commit. In an evaluation with extensive experiments, HACommit outperforms recent atomic commit solutions for highly-available datastores under different workloads. In the best case, HACommit can commit in one-fifth of the time 2PC does.
In this paper, we study systems of distributed entities that can actively modify their communication network. This gives rise to distributed algorithms that, apart from communication, can also exploit network reconfiguration in order to carry out a given task. At the same time, the distributed task itself may now require global reconfiguration from a given initial network $G_s$ to a target network $G_f$ from a family of networks having some good properties, like small diameter. With reasonably powerful computational entities, there is a straightforward algorithm that transforms any $G_s$ into a spanning clique in $O(\log n)$ time. The algorithm can then compute any global function on inputs and reconfigure to any target network in one round. We argue that such a strategy is impractical for real applications. In real dynamic networks there is a cost associated with creating and maintaining connections. To formally capture such costs, we define three edge-complexity measures: the \emph{total edge activations}, the \emph{maximum activated edges per round}, and the \emph{maximum activated degree of a node}. The clique formation strategy highlighted above maximizes all of them. We aim at improved algorithms that achieve (poly)log$(n)$ time while minimizing the edge-complexity for the general task of transforming any $G_s$ into a $G_f$ of diameter (poly)log$(n)$. We give three distributed algorithms. The first runs in $O(\log n)$ time, with at most $2n$ active edges per round, an optimal total of $O(n\log n)$ edge activations, a maximum degree of $n-1$, and a target network of diameter 2. The second achieves bounded degree by paying an additional logarithmic factor in time and in total edge activations and gives a target network of diameter $O(\log n)$. Our third algorithm shows that if we slightly increase the maximum degree to polylog$(n)$ then we can achieve a running time of $o(\log^2 n)$.
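As a rough illustration of the three edge-complexity measures named above, the following Python sketch computes them for a given reconfiguration schedule. The function name `edge_complexity`, the schedule format, and the exact counting conventions are assumptions made for illustration (only edges activated by the algorithm are counted, and the per-round quantities are measured at the end of each round); the abstract itself does not fix these details.

```python
from collections import Counter


def edge_complexity(schedule):
    """Return (total activations, max activated edges per round, max activated degree).

    `schedule` is a hypothetical format: a list of rounds, each a pair
    (activated_edges, deactivated_edges), where an edge is a pair of node ids.
    Edges of the initial network are assumed not to count toward the measures.
    """
    active = set()              # edges activated by the algorithm and still active
    total_activations = 0
    max_active_edges = 0
    max_activated_degree = 0

    for activated, deactivated in schedule:
        activated = {frozenset(e) for e in activated}
        deactivated = {frozenset(e) for e in deactivated}

        # Count only edges that were not already active.
        total_activations += len(activated - active)
        active |= activated
        active -= deactivated

        # Per-round measures, taken at the end of the round.
        max_active_edges = max(max_active_edges, len(active))
        degree = Counter(node for edge in active for node in edge)
        if degree:
            max_activated_degree = max(max_activated_degree, max(degree.values()))

    return total_activations, max_active_edges, max_activated_degree


# Two rounds on nodes 0..3: three activations in total, one deactivation.
schedule = [
    ([(0, 2), (1, 3)], []),
    ([(0, 3)], [(1, 3)]),
]
print(edge_complexity(schedule))  # (3, 2, 2)
```

Under these conventions, the clique formation strategy would drive all three numbers to their maximum, which is the cost the three proposed algorithms aim to reduce.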
A decentralized blockchain is a distributed ledger that is often used as a platform for exchanging goods and services. This ledger is maintained by a network of nodes that obeys a set of rules, called a consensus protocol, which helps to resolve inconsistencies among local copies of a blockchain. In this paper, we build a mathematical framework for the consensus protocol designer that specifies (a) the measurement of a resource which nodes strategically invest in and compete for in order to win the right to build new blocks in the blockchain; and (b) a payoff function for their efforts. Thus, the equilibrium of an associated stochastic differential game can be implemented by selecting nodes in proportion to this specified resource and penalizing dishonest nodes through its loss. This induced game can be further analyzed using mean field games. The problem can be broken down into two coupled PDEs: an individual node's optimal control path is solved using a Hamilton-Jacobi-Bellman equation, and the evolution of the state distribution is characterized by a Fokker-Planck equation. We develop numerical methods to compute the mean field equilibrium for both steady states at the infinite time horizon and evolutionary dynamics. As an example, we show how the mean field equilibrium can be applied to the Bitcoin blockchain mechanism design. We demonstrate that a blockchain can be viewed as a mechanism that operates in a decentralized setup and propagates properties of the mean field equilibrium over time, such as the underlying security of the blockchain.
Today's datacenter applications are underpinned by datastores that are responsible for providing availability, consistency, and performance. For high availability in the presence of failures, these datastores replicate data across several nodes. This is accomplished with the help of a reliable replication protocol that is responsible for keeping the replicas strongly consistent even when faults occur. Strong consistency is preferred to weaker consistency models that cannot guarantee intuitive behavior for the clients. Furthermore, to accommodate high demand at real-time latencies, datastores must deliver high throughput and low latency. This work introduces Hermes, a broadcast-based reliable replication protocol for in-memory datastores that provides both high throughput and low latency by enabling local reads and fully-concurrent fast writes at all replicas. Hermes couples logical timestamps with cache-coherence-inspired invalidations to guarantee linearizability, avoid write serialization at a centralized ordering point, resolve write conflicts locally at each replica (hence ensuring that writes never abort), and provide fault tolerance via replayable writes. Our implementation of Hermes over an RDMA-enabled reliable datastore with five replicas shows that Hermes consistently achieves higher throughput than state-of-the-art RDMA-based reliable protocols (ZAB and CRAQ) across all write ratios while also significantly reducing tail latency. At 5% writes, the tail latency of Hermes is 3.6X lower than that of CRAQ and ZAB.
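The idea of resolving write conflicts locally with logical timestamps can be sketched in a few lines of Python. This is a minimal illustration, not Hermes itself: the `Timestamp` layout (a version number with the writer's node id as tie-breaker), the `Replica` class, and its method names are all assumptions, and the sketch leaves out invalidations, acknowledgements, and write replay entirely.

```python
from dataclasses import dataclass, field


@dataclass(order=True, frozen=True)
class Timestamp:
    """Hypothetical logical timestamp: compared by version, then by node id."""
    version: int
    node_id: int


@dataclass
class Replica:
    node_id: int
    store: dict = field(default_factory=dict)  # key -> (Timestamp, value)

    def _current(self, key):
        return self.store.get(key, (Timestamp(0, 0), None))

    def local_write(self, key, value):
        # Any replica may start a write; it picks the next version itself.
        current_ts, _ = self._current(key)
        ts = Timestamp(current_ts.version + 1, self.node_id)
        self.store[key] = (ts, value)
        return key, ts, value  # message to broadcast to the other replicas

    def on_remote_write(self, key, ts, value):
        # Conflict resolution is purely local: the higher timestamp wins,
        # so concurrent writes are ordered deterministically and none aborts.
        current_ts, _ = self._current(key)
        if ts > current_ts:
            self.store[key] = (ts, value)


# Two replicas write the same key concurrently, then exchange messages.
a, b = Replica(node_id=1), Replica(node_id=2)
msg_a = a.local_write("x", "from-a")
msg_b = b.local_write("x", "from-b")
a.on_remote_write(*msg_b)
b.on_remote_write(*msg_a)
print(a.store["x"] == b.store["x"])  # True
```

Because both replicas apply the same total order on timestamps, they converge on the same value without consulting a central ordering point.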
The concurrency features of the Go language have proven versatile in the development of a number of concurrent systems. However, correctness methods that address the challenges of debugging Go concurrency have not received much attention. In this work, we present an automatic dynamic tracing mechanism that efficiently captures and helps analyze the whole-program concurrency model. Using an enhancement to the built-in tracer package of Go and a framework that collects dynamic traces from application execution, we enable thorough post-mortem analysis for concurrency debugging. We present preliminary results on the effectiveness and scalability (up to more than 2K goroutines) of our proposed dynamic tracing for concurrency debugging. We discuss future directions for exploiting dynamic tracing to accelerate the exposure of concurrency bugs.