Today's storage systems expose abstractions that are either too low-level (e.g., key-value stores, raw block stores), forcing developers to reinvent the wheel, or too high-level (e.g., relational databases, Git), lacking the generality to support many classes of applications. In this work, we propose and implement a general distributed data storage system with rich semantics, called UStore. UStore delivers three key properties, namely immutability, sharing, and security, which unify and add value to many classes of today's applications and open the door for new ones. By keeping these core properties within the storage layer, UStore reduces application development effort while offering high performance out of the box. The storage embraces current hardware trends as key enablers. It is built around a data structure similar to that of Git, a popular source-code versioning system, but it also synthesizes many designs from distributed systems and databases. Our current implementation of UStore outperforms general in-memory key-value storage systems, especially for version-scan operations. We port and evaluate four applications on top of UStore: a Git-like application, a collaborative data science application, a transaction management application, and a blockchain application. We demonstrate that UStore enables faster development and that UStore-backed applications can outperform existing implementations.
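To make the versioning model concrete, the sketch below illustrates the Git-like data structure the abstract alludes to: each write creates an immutable, content-addressed node that links to the previous version of the same key, so a version scan reduces to walking a hash chain. The names (`UNode`, `UStoreSketch`) and the in-memory layout are hypothetical illustrations, not UStore's actual API.

```python
# A minimal sketch, assuming a Git-like model: immutable nodes addressed
# by the hash of (key, value, parent-version hash). Illustrative only.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class UNode:
    key: str
    value: bytes
    parent: str | None  # hash of the previous version; None for the first

def node_hash(node: UNode) -> str:
    h = hashlib.sha256()
    h.update(node.key.encode())
    h.update(node.value)
    h.update((node.parent or "").encode())
    return h.hexdigest()

class UStoreSketch:
    def __init__(self):
        self._nodes: dict[str, UNode] = {}  # content-addressed node store
        self._heads: dict[str, str] = {}    # key -> hash of latest version

    def put(self, key: str, value: bytes) -> str:
        node = UNode(key, value, self._heads.get(key))
        digest = node_hash(node)
        self._nodes[digest] = node          # immutable: never overwritten
        self._heads[key] = digest
        return digest

    def scan_versions(self, key: str):
        """Walk the version chain from newest to oldest (version scan)."""
        digest = self._heads.get(key)
        while digest is not None:
            node = self._nodes[digest]
            yield digest, node.value
            digest = node.parent
```

Because every node embeds its parent's hash, the history is tamper-evident, which is one way the immutability and security properties can reinforce each other.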
A benchmark study of modern distributed databases is an important source of information for selecting the right technology for managing data in cloud-edge paradigms. Making the right decision requires an extensive experimental study across a variety of hardware infrastructures. While most state-of-the-art studies have investigated only the response time and scalability of distributed databases, other metrics (e.g., energy, bandwidth, and storage consumption) are essential to fully understand the resource consumption of these databases. Moreover, existing studies have explored response time and scalability only in private or public clouds. Hence, there is a paucity of evaluations of these databases deployed in a hybrid cloud, i.e., the seamless integration of public and private clouds. To address these research gaps, in this paper we investigate the energy, bandwidth, and storage consumption of the most widely used distributed databases. For this purpose, we evaluate four open-source databases (Cassandra, MongoDB, Redis, and MySQL) on a hybrid cloud spanning a local OpenStack deployment and Microsoft Azure, and on a variety of edge computing nodes including a Raspberry Pi, a cluster of Raspberry Pis, and low- and high-power servers. Our extensive experimental results reveal several helpful insights for the deployment selection of modern distributed databases in edge-cloud environments.
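As a minimal sketch of the measurement harness such a study requires, the snippet below samples bandwidth and storage consumption around a database workload using psutil; energy measurement would need an external power meter or CPU-specific counters (e.g., RAPL), which are outside this sketch. `run_workload` and `DATA_DIR` are hypothetical placeholders for the benchmark driver and the database's data directory.

```python
# A minimal sketch, assuming psutil is available and the database under
# test stores its data under DATA_DIR. Illustrative, not the paper's code.
import time
import psutil

DATA_DIR = "/var/lib/db"  # hypothetical database data directory

def run_workload():
    ...  # drive the database under test (e.g., a YCSB-style client)

net_before = psutil.net_io_counters()
disk_before = psutil.disk_usage(DATA_DIR).used
t0 = time.time()

run_workload()

elapsed = time.time() - t0
net_after = psutil.net_io_counters()
sent = net_after.bytes_sent - net_before.bytes_sent
recv = net_after.bytes_recv - net_before.bytes_recv
grown = psutil.disk_usage(DATA_DIR).used - disk_before

print(f"bandwidth: {(sent + recv) / elapsed / 1e6:.2f} MB/s")
print(f"storage growth: {grown / 1e6:.2f} MB")
```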
Recent research on locally repairable codes has mainly been confined to repairing failed nodes within each repair group. In the extreme case where an entire repair group fails, however, the local code stored in that group must be recovered as a whole. In this paper, we propose local codes with cooperative repair, constructed from minimum storage regenerating (MSR) codes, to repair failed groups. Specifically, the proposed codes build a mutual interleaving structure among the parity symbols: the parity symbols of each local code, called distributed local parity, are generated from the parity symbols of the MSR codes in its two adjacent local codes. Exploiting this structure, failed local groups can be repaired cooperatively by their adjacent local groups with lower repair locality, and we derive the minimum distance of the proposed codes. Theoretical analysis and simulation experiments show that, compared with codes with local regeneration (such as MSR-local codes and MBR-local codes), the proposed local codes with cooperative repair reduce bandwidth overhead and repair locality in the case of local group failures.
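The toy sketch below illustrates the interleaving idea under a loud simplification: plain XOR stands in for the MSR-code arithmetic, which the paper actually defines over a finite field. It shows how each group's distributed local parity can be derived from its two neighbors' parities, so a failed group's parity is regenerable from the adjacent groups alone; all names and structure here are illustrative.

```python
# A toy sketch of interleaved "distributed local parity", with XOR as a
# stand-in for MSR-code encoding. Illustrative only, not the paper's code.
from functools import reduce

def xor(*blocks: bytes) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def local_parity(group: list[bytes]) -> bytes:
    """Stand-in for the parity an MSR code would compute within one group."""
    return xor(*group)

def distributed_parities(groups: list[list[bytes]]) -> list[bytes]:
    """Group i's extra parity is built from its two neighbors' parities."""
    p = [local_parity(g) for g in groups]
    n = len(groups)
    return [xor(p[(i - 1) % n], p[(i + 1) % n]) for i in range(n)]

def repair_group_parity(i: int, groups: list[list[bytes]]) -> bytes:
    """Rebuild failed group i's distributed parity from its two neighbors."""
    n = len(groups)
    return xor(local_parity(groups[(i - 1) % n]),
               local_parity(groups[(i + 1) % n]))
```

The cooperative flavor is visible even in this toy form: repair of group i contacts only groups i-1 and i+1, rather than all surviving groups.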
We present SLASH (Sketched LocAlity Sensitive Hashing), an MPI (Message Passing Interface) based distributed system for approximate similarity search over terabyte-scale datasets. SLASH provides a multi-node implementation of the popular LSH (locality sensitive hashing) algorithm, which is generally implemented on a single machine. We show how to augment the LSH algorithm with heavy-hitter sketches to provably solve the (high) similarity search problem without a single distance computation. Overall, we mathematically show that, under realistic data assumptions, we can identify the near neighbor of a given query $q$ using only a sub-linear ($\ll O(n)$) number of simple sketch aggregation operations. To make such a system practical, we offer a novel design and sketching solution that reduces the inter-machine communication overhead exponentially. In a direct comparison on comparable hardware, SLASH is more than 10000x faster than the popular LSH package in PySpark, a widely adopted distributed implementation of the LSH algorithm for large datasets that is deployed in commercial platforms. Finally, we show how our system scales to the terabyte-scale Criteo dataset with more than 4 billion samples. SLASH can index this 2.3-terabyte dataset over 20 nodes in under an hour, with query times of a fraction of a millisecond. To the best of our knowledge, there is no open-source system that can index and perform similarity search on Criteo with a commodity cluster.
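The sketch below captures the query path the abstract describes, under simplifying assumptions: signed-random-projection LSH over L tables, and a plain counter standing in for the heavy-hitter sketch that SLASH uses to keep communication low. The near neighbor is reported as the item that collides with the query in the most tables, with no distance computation; all parameter names are illustrative.

```python
# A minimal single-machine sketch of "LSH + heavy hitters": the answer is
# the heaviest collider across tables. A Counter stands in for the sketch.
from collections import Counter
import numpy as np

rng = np.random.default_rng(0)
L, K, DIM = 32, 8, 64                      # tables, sign bits per table, dims
planes = rng.standard_normal((L, K, DIM))  # random hyperplanes (SRP-LSH)

def lsh_key(x: np.ndarray, t: int) -> bytes:
    """K sign bits of table t's projections of x, used as the bucket key."""
    return ((planes[t] @ x) > 0).tobytes()

tables = [{} for _ in range(L)]            # per table: bucket key -> item ids

def index_item(item_id: int, x: np.ndarray) -> None:
    for t in range(L):
        tables[t].setdefault(lsh_key(x, t), []).append(item_id)

def query(q: np.ndarray):
    """Return the item colliding with q in the most tables (heavy hitter)."""
    counts = Counter()
    for t in range(L):
        counts.update(tables[t].get(lsh_key(q, t), []))
    return counts.most_common(1)[0][0] if counts else None
```

In a distributed setting, each node would aggregate and exchange compact sketches of these collision counts instead of raw bucket contents, which is where the communication savings come from.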
In recent years, emerging hardware storage technologies have focused on divergent goals: better performance or lower cost per bit of storage. Correspondingly, data systems that employ these new technologies are optimized either to be fast (but expensive) or cheap (but slow). We take a different approach: by combining multiple tiers of fast and low-cost storage technologies within the same system, we can achieve a Pareto-efficient balance between performance and cost per bit. This paper presents the design and implementation of PrismDB, a novel log-structured merge-tree-based key-value store that exploits a full spectrum of heterogeneous storage technologies (from 3D XPoint to QLC NAND). We introduce the notion of read-awareness to log-structured merge trees, which allows hot objects to be pinned to faster storage, achieving better tiering and hot-cold separation of objects. Compared to the standard use of RocksDB on flash in datacenters today, PrismDB's average throughput on heterogeneous storage is 2.3$\times$ higher and its tail latency is more than an order of magnitude better, using hardware that is half the cost.
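A minimal sketch of the read-awareness idea follows, with hypothetical names and thresholds rather than PrismDB's actual policy: a small saturating clock counter tracks per-key access recency, and compaction routes hot keys to the fast tier and cold keys to the cheap tier.

```python
# A minimal sketch, assuming a per-key clock counter and two tiers stood
# in by dicts. Names and thresholds are illustrative, not PrismDB's.
from collections import defaultdict

CLOCK_MAX = 3                # saturating per-key access counter
HOT_THRESHOLD = 2            # counter value at which a key counts as hot

clock = defaultdict(int)     # key -> access counter

def on_read(key: str) -> None:
    clock[key] = min(clock[key] + 1, CLOCK_MAX)

def decay() -> None:
    """Periodically age counters so once-hot keys can cool down."""
    for key in list(clock):
        clock[key] //= 2

def place_on_compaction(kv_pairs, fast_tier: dict, slow_tier: dict) -> None:
    """Route each merged key-value pair to a tier by its hotness."""
    for key, value in kv_pairs:
        if clock[key] >= HOT_THRESHOLD:
            fast_tier[key] = value   # pin hot object to fast storage
        else:
            slow_tier[key] = value   # demote cold object to cheap storage
```

Piggybacking placement on compaction is the natural hook in an LSM tree, since compaction already rewrites objects and can redirect them at no extra I/O cost.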
Redundant storage maintains the performance of distributed systems under various forms of uncertainty. This paper considers uncertainty in node access and download service. We consider two access models: in one, a user can access each node with a fixed probability, and in the other, a user can access a random fixed-size subset of nodes. We also consider two download service models. In the first (small file) model, the randomness associated with the file size is negligible. In the second (large file) model, randomness is associated with both the file size and the system's operations. We focus on the service rate of the system. For a fixed redundancy level, the system's service rate is determined by the allocation of coded chunks over the storage nodes. We consider quasi-uniform allocations, where coded content is spread uniformly over a subset of nodes, and ask how large this subset (the spreading) should be. We show that in the small file model, concentrating the coded content on a minimum-size subset is universally optimal. For the large file model, the optimal spreading depends on the system parameters. These conclusions hold for both access models.
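The small Monte-Carlo sketch below, under illustrative parameters, frames the spreading question for the fixed-probability access model in the small file regime: a unit file with storage budget `budget` is spread evenly over `m` nodes, each independently accessible with probability `p`, and recovery succeeds when the accessed coded content sums to at least 1.

```python
# A minimal sketch of the quasi-uniform allocation trade-off; parameters
# (budget=2, p=0.5) are illustrative, not drawn from the paper.
import math
import random

def recovery_prob(m=4, budget=2.0, p=0.5, trials=100_000):
    """Estimate P(recovery): budget spread evenly over m nodes, each
    independently accessible with probability p (small file model)."""
    need = math.ceil(m / budget)  # accessed nodes needed to gather content >= 1
    hits = sum(sum(random.random() < p for _ in range(m)) >= need
               for _ in range(trials))
    return hits / trials

# Same budget, increasingly wide spreading:
for m in (2, 4, 8):
    print(f"m={m}: {recovery_prob(m=m):.3f}")
```

For these example values the estimated recovery probability decreases as the spreading widens, which is consistent with the small-file conclusion that minimal spreading is optimal.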