No Arabic abstract
Atomicity or strong consistency is one of the fundamental, most intuitive, and hardest to provide primitives in distributed shared memory emulations. To ensure survivability, scalability, and availability of a storage service in the presence of failures, traditional approaches for atomic memory emulation, in message passing environments, replicate the objects across multiple servers. Compared to replication based algorithms, erasure code-based atomic memory algorithms has much lower storage and communication costs, but usually, they are harder to design. The difficulty of designing atomic memory algorithms further grows, when the set of servers may be changed to ensure survivability of the service over software and hardware upgrades, while avoiding service interruptions. Atomic memory algorithms for performing server reconfiguration, in the replicated systems, are very few, complex, and are still part of an active area of research; reconfigurations of erasure-code based algorithms are non-existent. In this work, we present ARES, an algorithmic framework that allows reconfiguration of the underlying servers, and is particularly suitable for erasure-code based algorithms emulating atomic objects. ARES introduces new configurations while keeping the service available. To use with ARES we also propose a new, and to our knowledge, the first two-round erasure code based algorithm TREAS, for emulating multi-writer, multi-reader (MWMR) atomic objects in asynchronous, message-passing environments, with near-optimal communication and storage costs. Our algorithms can tolerate crash failures of any client and some fraction of servers, and yet, guarantee safety and liveness property. Moreover, by bringing together the advantages of ARES and TREAS, we propose an optimized algorithm where new configurations can be installed without the objects values passing through the reconfiguration clients.
Motivated by emerging applications to the edge computing paradigm, we introduce a two-layer erasure-coded fault-tolerant distributed storage system offering atomic access for read and write operations. In edge computing, clients interact with an edge-layer of servers that is geographically near; the edge-layer in turn interacts with a back-end layer of servers. The edge-layer provides low latency access and temporary storage for client operations, and uses the back-end layer for persistent storage. Our algorithm, termed Layered Data Storage (LDS) algorithm, offers several features suitable for edge-computing systems, works under asynchronous message-passing environments, supports multiple readers and writers, and can tolerate $f_1 < n_1/2$ and $f_2 < n_2/3$ crash failures in the two layers having $n_1$ and $n_2$ servers, respectively. We use a class of erasure codes known as regenerating codes for storage of data in the back-end layer. The choice of regenerating codes, instead of popular choices like Reed-Solomon codes, not only optimizes the cost of back-end storage, but also helps in optimizing communication cost of read operations, when the value needs to be recreated all the way from the back-end. The two-layer architecture permits a modular implementation of atomicity and erasure-code protocols; the implementation of erasure-codes is mostly limited to interaction between the two layers. We prove liveness and atomicity of LDS, and also compute performance costs associated with read and write operations. Further, in a multi-object system running $N$ independent instances of LDS, where only a small fraction of the objects undergo concurrent accesses at any point during the execution, the overall storage cost is dominated by that of persistent storage in the back-end layer, and is given by $Theta(N)$.
Erasure codes are an integral part of many distributed storage systems aimed at Big Data, since they provide high fault-tolerance for low overheads. However, traditional erasure codes are inefficient on reading stored data in degraded environments (when nodes might be unavailable), and on replenishing lost data (vital for long term resilience). Consequently, novel codes optimized to cope with distributed storage system nuances are vigorously being researched. In this paper, we take an engineering alternative, exploring the use of simple and mature techniques -juxtaposing a standard erasure code with RAID-4 like parity. We carry out an analytical study to determine the efficacy of this approach over traditional as well as some novel codes. We build upon this study to design CORE, a general storage primitive that we integrate into HDFS. We benchmark this implementation in a proprietary cluster and in EC2. Our experiments show that compared to traditional erasure codes, CORE uses 50% less bandwidth and is up to 75% faster while recovering a single failed node, while the gains are respectively 15% and 60% for double node failures.
Modern data stores achieve scalability by partitioning data into shards and fault-tolerance by replicating each shard across several servers. A key component of such systems is a Transaction Certification Service (TCS), which atomically commits a transaction spanning multiple shards. Existing TCS protocols require 2f+1 crash-stop replicas per shard to tolerate f failures. In this paper we present atomic commit protocols that require only f+1 replicas and reconfigure the system upon failures using an external reconfiguration service. We furthermore rigorously prove that these protocols correctly implement a recently proposed TCS specification. We present protocols in two different models--the standard asynchronous message-passing model and a model with Remote Direct Memory Access (RDMA), which allows a machine to access the memory of another machine over the network without involving the latters CPU. Our protocols are inspired by a recent FARM system for RDMA-based transaction processing. Our work codifies the core ideas of FARM as distributed TCS protocols, rigorously proves them correct and highlights the trade-offs required by the use of RDMA.
To achieve reliability in distributed storage systems, data has usually been replicated across different nodes. However the increasing volume of data to be stored has motivated the introduction of erasure codes, a storage efficient alternative to replication, particularly suited for archival in data centers, where old datasets (rarely accessed) can be erasure encoded, while replicas are maintained only for the latest data. Many recent works consider the design of new storage-centric erasure codes for improved repairability. In contrast, this paper addresses the migration from replication to encoding: traditionally erasure coding is an atomic operation in that a single node with the whole object encodes and uploads all the encoded pieces. Although large datasets can be concurrently archived by distributing individual object encodings among different nodes, the network and computing capacity of individual nodes constrain the archival process due to such atomicity. We propose a new pipelined coding strategy that distributes the network and computing load of single-object encodings among different nodes, which also speeds up multiple object archival. We further present RapidRAID codes, an explicit family of pipelined erasure codes which provides fast archival without compromising either data reliability or storage overheads. Finally, we provide a real implementation of RapidRAID codes and benchmark its performance using both a cluster of 50 nodes and a set of Amazon EC2 instances. Experiments show that RapidRAID codes reduce a single objects coding time by up to 90%, while when multiple objects are encoded concurrently, the reduction is up to 20%.
This paper proposes the first implementation of an atomic storage tolerant to mobile Byzantine agents. Our implementation is designed for the round-based synchronous model where the set of Byzantine nodes changes from round to round. In this model we explore the feasibility of multi-writer multi-reader atomic register prone to various mobile Byzantine behaviors. We prove upper and lower bounds for solving the atomic storage in all the explored models. Our results, significantly different from the static case, advocate for a deeper study of the main building blocks of distributed computing while the system is prone to mobile Byzantine failures.