No Arabic abstract
Due to its high performance and decreasing cost per bit, flash is becoming the main storage medium in datacenters for hot data. However, flash endurance is a perpetual problem, and due to technology trends, subsequent generations of flash devices exhibit progressively shorter lifetimes before they experience uncorrectable bit errors. In this paper we propose extending flash lifetime by allowing devices to expose higher bit error rates. To do so, we present DIRECT, a novel set of policies that leverages latent redundancy in distributed storage systems to recover from bit corruption errors with minimal performance and recovery overhead. In doing so, DIRECT can significantly extend the lifetime of flash devices by effectively utilizing these devices even after they begin exposing bit errors. We implemented DIRECT on two real-world storage systems: ZippyDB, a distributed key-value store backed by RocksDB, and HDFS, a distributed file system. When tested on production traces at Facebook, DIRECT reduces application-visible error rates in ZippyDB by more than 10^2 and recovery time by more than 10^4. DIRECT also allows HDFS to tolerate a 10^4--10^5 higher bit error rate without experiencing application-visible errors.
We present Kaleidoscope an innovative system that supports live forensics for application performance problems caused by either individual component failures or resource contention issues in large-scale distributed storage systems. The design of Kaleidoscope is driven by our study of I/O failures observed in a peta-scale storage system anonymized as PetaStore. Kaleidoscope is built on three key features: 1) using temporal and spatial differential observability for end-to-end performance monitoring of I/O requests, 2) modeling the health of storage components as a stochastic process using domain-guided functions that accounts for path redundancy and uncertainty in measurements, and, 3) observing differences in reliability and performance metrics between similar types of healthy and unhealthy components to attribute the most likely root causes. We deployed Kaleidoscope on PetaStore and our evaluation shows that Kaleidoscope can run live forensics at 5-minute intervals and pinpoint the root causes of 95.8% of real-world performance issues, with negligible monitoring overhead.
Erasure codes are increasingly being studied in the context of implementing atomic memory objects in large scale asynchronous distributed storage systems. When compared with the traditional replication based schemes, erasure codes have the potential of significantly lowering storage and communication costs while simultaneously guaranteeing the desired resiliency levels. In this work, we propose the Storage-Optimized Data-Atomic (SODA) algorithm for implementing atomic memory objects in the multi-writer multi-reader setting. SODA uses Maximum Distance Separable (MDS) codes, and is specifically designed to optimize the total storage cost for a given fault-tolerance requirement. For tolerating $f$ server crashes in an $n$-server system, SODA uses an $[n, k]$ MDS code with $k=n-f$, and incurs a total storage cost of $frac{n}{n-f}$. SODA is designed under the assumption of reliable point-to-point communication channels. The communication cost of a write and a read operation are respectively given by $O(f^2)$ and $frac{n}{n-f}(delta_w+1)$, where $delta_w$ denotes the number of writes that are concurrent with the particular read. In comparison with the recent CASGC algorithm, which also uses MDS codes, SODA offers lower storage cost while pays more on the communication cost. We also present a modification of SODA, called SODA$_{text{err}}$, to handle the case where some of the servers can return erroneous coded elements during a read operation. Specifically, in order to tolerate $f$ server failures and $e$ error-prone coded elements, the SODA$_{text{err}}$ algorithm uses an $[n, k]$ MDS code such that $k=n-2e-f$. SODA$_{text{err}}$ also guarantees liveness and atomicity, while maintaining an optimized total storage cost of $frac{n}{n-f-2e}$.
Classical erasure codes, e.g. Reed-Solomon codes, have been acknowledged as an efficient alternative to plain replication to reduce the storage overhead in reliable distributed storage systems. Yet, such codes experience high overhead during the maintenance process. In this paper we propose a novel erasure-coded framework especially tailored for networked storage systems. Our approach relies on the use of random codes coupled with a clustered placement strategy, enabling the maintenance of a failed machine at the granularity of multiple files. Our repair protocol leverages network coding techniques to reduce by half the amount of data transferred during maintenance, as several files can be repaired simultaneously. This approach, as formally proven and demonstrated by our evaluation on a public experimental testbed, enables to dramatically decrease the bandwidth overhead during the maintenance process, as well as the time to repair a failure. In addition, the implementation is made as simple as possible, aiming at a deployment into practical systems.
In todays enterprise storage systems, supported data services such as snapshot delete or drive rebuild can cause tremendous performance interference if executed inline along with heavy foreground IO, often leading to missing SLOs (Service Level Objectives). Typical storage system applications such as web or VDI (Virtual Desktop Infrastructure) follow a repetitive high/low workload pattern that can be learned and forecasted. We propose a priority-based background scheduler that learns this repetitive pattern and allows storage systems to maintain peak performance and in turn meet service level objectives (SLOs) while supporting a number of data services. When foreground IO demand intensifies, system resources are dedicated to service foreground IO requests and any background processing that can be deferred are recorded to be processed in future idle cycles as long as forecast shows that storage pool has remaining capacity. The smart background scheduler adopts a resource partitioning model that allows both foreground and background IO to execute together as long as foreground IOs are not impacted where the scheduler harness any free cycle to clear background debt. Using traces from VDI application, we show how our technique surpasses a method that statically limit the deferred background debt and improve SLO violations from 54.6% when using a fixed background debt watermark to merely a 6.2% if dynamically set by our smart background scheduler.
The start of data taking at the Large Hadron Collider will herald a new era in data volumes and distributed processing in particle physics. Data volumes of hundreds of Terabytes will be shipped to Tier-2 centres for analysis by the LHC experiments using the Worldwide LHC Computing Grid (WLCG). In many countries Tier-2 centres are distributed between a number of institutes, e.g., the geographically spread Tier-2s of GridPP in the UK. This presents a number of challenges for experiments to utilise these centres efficaciously, as CPU and storage resources may be sub-divided and exposed in smaller units than the experiment would ideally want to work with. In addition, unhelpful mismatches between storage and CPU at the individual centres may be seen, which make efficient exploitation of a Tier-2s resources difficult. One method of addressing this is to unify the storage across a distributed Tier-2, presenting the centres aggregated storage as a single system. This greatly simplifies data management for the VO, which then can access a greater amount of data across the Tier-2. However, such an approach will lead to scenarios where analysis jobs on one sites batch system must access data hosted on another site. We investigate this situation using the Glasgow and Edinburgh clusters, which are part of the ScotGrid distributed Tier-2. In particular we look at how to mitigate the problems associated with ``distant data access and discuss the security implications of having LAN access protocols traverse the WAN between centres.