No Arabic abstract
Classical erasure codes, e.g. Reed-Solomon codes, have been acknowledged as an efficient alternative to plain replication to reduce the storage overhead in reliable distributed storage systems. Yet, such codes experience high overhead during the maintenance process. In this paper we propose a novel erasure-coded framework especially tailored for networked storage systems. Our approach relies on the use of random codes coupled with a clustered placement strategy, enabling the maintenance of a failed machine at the granularity of multiple files. Our repair protocol leverages network coding techniques to reduce by half the amount of data transferred during maintenance, as several files can be repaired simultaneously. This approach, as formally proven and demonstrated by our evaluation on a public experimental testbed, enables to dramatically decrease the bandwidth overhead during the maintenance process, as well as the time to repair a failure. In addition, the implementation is made as simple as possible, aiming at a deployment into practical systems.
Distributed storage systems provide reliable access to data through redundancy spread over individually unreliable nodes. Application scenarios include data centers, peer-to-peer storage systems, and storage in wireless networks. Storing data using an erasure code, in fragments spread across nodes, requires less redundancy than simple replication for the same level of reliability. However, since fragments must be periodically replaced as nodes fail, a key question is how to generate encoded fragments in a distributed way while transferring as little data as possible across the network. For an erasure coded system, a common practice to repair from a node failure is for a new node to download subsets of data stored at a number of surviving nodes, reconstruct a lost coded block using the downloaded data, and store it at the new node. We show that this procedure is sub-optimal. We introduce the notion of regenerating codes, which allow a new node to download emph{functions} of the stored data from the surviving nodes. We show that regenerating codes can significantly reduce the repair bandwidth. Further, we show that there is a fundamental tradeoff between storage and repair bandwidth which we theoretically characterize using flow arguments on an appropriately constructed graph. By invoking constructive results in network coding, we introduce regenerating codes that can achieve any point in this optimal tradeoff.
Erasure coding techniques are getting integrated in networked distributed storage systems as a way to provide fault-tolerance at the cost of less storage overhead than traditional replication. Redundancy is maintained over time through repair mechanisms, which may entail large network resource overheads. In recent years, several novel codes tailor-made for distributed storage have been proposed to optimize storage overhead and repair, such as Regenerating Codes that minimize the per repair traffic, or Self-Repairing Codes which minimize the number of nodes contacted per repair. Existing studies of these coding techniques are however predominantly theoretical, under the simplifying assumption that only one object is stored. They ignore many practical issues that real systems must address, such as data placement, de/correlation of multiple stored objects, or the competition for limited network resources when multiple objects are repaired simultaneously. This paper empirically studies the repair performance of these novel storage centric codes with respect to classical erasure codes by simulating realistic scenarios and exploring the interplay of code parameters, failure characteristics and data placement with respect to the trade-offs of bandwidth usage and speed of repairs.
We present Kaleidoscope an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that accounts for path redundancy and uncertainty in measurements, and, 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.
Erasure codes are increasingly being studied in the context of implementing atomic memory objects in large scale asynchronous distributed storage systems. When compared with the traditional replication based schemes, erasure codes have the potential of significantly lowering storage and communication costs while simultaneously guaranteeing the desired resiliency levels. In this work, we propose the Storage-Optimized Data-Atomic (SODA) algorithm for implementing atomic memory objects in the multi-writer multi-reader setting. SODA uses Maximum Distance Separable (MDS) codes, and is specifically designed to optimize the total storage cost for a given fault-tolerance requirement. For tolerating $f$ server crashes in an $n$-server system, SODA uses an $[n, k]$ MDS code with $k=n-f$, and incurs a total storage cost of $frac{n}{n-f}$. SODA is designed under the assumption of reliable point-to-point communication channels. The communication cost of a write and a read operation are respectively given by $O(f^2)$ and $frac{n}{n-f}(delta_w+1)$, where $delta_w$ denotes the number of writes that are concurrent with the particular read. In comparison with the recent CASGC algorithm, which also uses MDS codes, SODA offers lower storage cost while pays more on the communication cost. We also present a modification of SODA, called SODA$_{text{err}}$, to handle the case where some of the servers can return erroneous coded elements during a read operation. Specifically, in order to tolerate $f$ server failures and $e$ error-prone coded elements, the SODA$_{text{err}}$ algorithm uses an $[n, k]$ MDS code such that $k=n-2e-f$. SODA$_{text{err}}$ also guarantees liveness and atomicity, while maintaining an optimized total storage cost of $frac{n}{n-f-2e}$.
In todays enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missing SLOs (Service Level Objectives). Typical storage system applications such as web or VDI (Virtual Desktop Infrastructure) follow a repetitive high/low workload pattern that can be learned and forecasted. We propose a priority-based background scheduler that learns this repetitive pattern and allows storage systems to maintain peak performance and in turn meet service level objectives (SLOs) while supporting a number of data services. When foreground IO demand intensifies, system resources are dedicated to service foreground IO requests and any background processing that can be deferred are recorded to be processed in future idle cycles as long as forecast shows that storage pool has remaining capacity. The smart background scheduler adopts a resource partitioning model that allows both foreground and background IO to execute together as long as foreground IOs are not impacted where the scheduler harness any free cycle to clear background debt. Using traces from VDI application, we show how our technique surpasses a method that statically limit the deferred background debt and improve SLO violations from 54.6% when using a fixed background debt watermark to merely a 6.2% if dynamically set by our smart background scheduler.