Enabling Efficient RDMA-based Synchronous Mirroring of Persistent Memory Transactions

Added by Arash Tavakkol

Publication date: 2018

Field: Informatics Engineering

Language: English

Authors: Arash Tavakkol, Aasheesh Kolli, Stanko Novakovic

Distributed Parallel and Cluster Computing

Synchronous Mirroring (SM) is a standard approach to building highly-available and fault-tolerant enterprise storage systems. SM ensures strong data consistency by maintaining multiple exact data replicas and synchronously propagating every update to all of them. Such strong consistency provides fault tolerance guarantees and a simple programming model coveted by enterprise system designers. For current storage devices, SM comes at modest performance overheads. This is because performing both local and remote updates simultaneously is only marginally slower than performing just local updates, due to the relatively slow performance of accesses to storage in today's systems. However, emerging persistent memory and ultra-low-latency network technologies necessitate a careful re-evaluation of the existing SM techniques, as these technologies present fundamentally different latency characteristics compared to their traditional counterparts. In addition, existing low-latency network technologies, such as Remote Direct Memory Access (RDMA), provide limited ordering guarantees and do not provide the durability guarantees necessary for SM. To evaluate the performance implications of RDMA-based SM, we develop a rigorous testing framework that is based on emulated persistent memory. Our testing framework makes use of two different tools: (i) a configurable microbenchmark and (ii) a modified version of the WHISPER benchmark suite, which comprises a set of common cloud applications. Using this framework, we find that recently proposed RDMA primitives, such as remote commit, provide correctness guarantees, but do not take full advantage of the asynchronous nature of RDMA hardware. To this end, we propose new primitives enabling efficient and correct SM over RDMA, and use these primitives to develop two new techniques delivering high-performance SM of persistent memories.
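The core idea of synchronous mirroring described above (propagate every update to all replicas and acknowledge only once all of them hold it) can be illustrated with a minimal in-memory sketch. This is not the paper's technique: it models neither RDMA nor persistent memory, and the names `Replica` and `SynchronousMirror` are hypothetical, chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


class Replica:
    """Hypothetical stand-in for one storage node (in-memory only)."""

    def __init__(self):
        self.store = {}

    def write(self, key, value):
        self.store[key] = value
        return True  # acknowledgement


class SynchronousMirror:
    """Issues every update to all replicas in parallel and returns
    only after every replica has acknowledged it."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._pool = ThreadPoolExecutor(max_workers=len(self.replicas))

    def write(self, key, value):
        # Local and remote updates are started simultaneously ...
        futures = [self._pool.submit(r.write, key, value) for r in self.replicas]
        # ... but the caller blocks until all of them complete, so the
        # observed latency is that of the slowest replica.
        if not all(f.result() for f in futures):
            raise IOError("a replica failed to acknowledge the update")

    def close(self):
        self._pool.shutdown(wait=True)


mirror = SynchronousMirror([Replica(), Replica(), Replica()])
mirror.write("x", 1)
mirror.write("x", 2)
mirror.close()
```

Because `write` blocks on the slowest acknowledgement, the cost of mirroring is hidden when the local update itself is slow and becomes visible when it is fast, which is the effect the abstract attributes to persistent memory.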

Consistent RDMA-Friendly Hashing on Remote Persistent Memory

Xinxin Liu, Yu Hua, Rong Bai (2021)

Coalescing RDMA and Persistent Memory (PM) delivers high end-to-end performance for networked storage systems, which requires rethinking the design of efficient hash structures. In general, existing hashing schemes separately optimize RDMA and PM, thus only partially addressing the problems of RDMA Access Amplification and High-Overhead PM Consistency. In order to address these problems, we propose continuity hashing, a one-stone-two-birds design that optimizes both RDMA and PM. Continuity hashing leverages a fine-grained contiguous shared region, called SBuckets, to provide standby positions for the two neighbouring buckets in case of hash collisions. In continuity hashing, a remote read needs only a single RDMA read to directly fetch the home bucket and the neighbouring SBuckets, which contain all the positions that may hold a key-value item, thus alleviating RDMA access amplification. Continuity hashing further leverages indicators that can be atomically modified to support log-free PM consistency for all the write operations. Evaluation results demonstrate that, compared with state-of-the-art schemes, continuity hashing achieves high throughput (i.e., 1.45X to 2.43X improvement), low latency (about 1.7X speedup) and the smallest number of PM writes with various workloads, while maintaining acceptable load factors of about 70%.

Distributed Parallel and Cluster Computing

Crafty: Efficient, HTM-Compatible Persistent Transactions

Kaan Genç, Ohio State University (2020)

Byte-addressable persistent memory, such as Intel/Micron 3D XPoint, is an emerging technology that bridges the gap between volatile memory and persistent storage. Data in persistent memory survives crashes and restarts; however, it is challenging to ensure that this data is consistent after failures. Existing approaches incur significant performance costs to ensure crash consistency. This paper introduces Crafty, a new approach for ensuring consistency and atomicity on persistent memory operations using commodity hardware with existing hardware transactional memory (HTM) capabilities, while incurring low overhead. Crafty employs a novel technique called nondestructive undo logging that leverages commodity HTM to control persist ordering. Our evaluation shows that Crafty outperforms state-of-the-art prior work under low contention, and performs competitively under high contention.

Programming Languages

RDMAbox : Optimizing RDMA for Memory Intensive Workloads

Juhyun Bae, Ling Liu, Yanzhao Wu (2021)

We present RDMAbox, a set of low level RDMA optimizations that provide better performance than previous approaches. The optimizations are packaged in easy-to-use kernel and user space libraries for applications and systems in data centers. We demonstrate the flexibility and effectiveness of RDMAbox by implementing a kernel remote paging system and a user space file system using RDMAbox. RDMAbox employs two optimization techniques. First, we suggest RDMA request merging and chaining to further reduce the total number of I/O operations to the RDMA NIC. The I/O merge queue at the same time functions as a traffic regulator to enforce admission control and avoid overloading the NIC. Second, we propose Adaptive Polling to achieve higher efficiency of polling Work Completion than existing busy polling while maintaining the low CPU overhead of event trigger. Our implementation of a remote paging system with RDMAbox outperforms existing representative solutions with up to 4× throughput improvement and up to 83% decrease in average tail latency in big data workloads, and up to 83% reduction in completion time in machine learning workloads. Our implementation of a user space file system based on RDMAbox achieves up to 5.9× higher throughput over existing representative solutions.

Distributed Parallel and Cluster Computing

Efficient Synchronous Byzantine Consensus

Ittai Abraham, Srinivas Devadas, Danny Dolev (2017)

We present new protocols for Byzantine state machine replication and Byzantine agreement in the synchronous and authenticated setting. The celebrated PBFT state machine replication protocol tolerates $f$ Byzantine faults in an asynchronous setting using $3f+1$ replicas, and has since been studied or deployed by numerous works. In this work, we improve the Byzantine fault tolerance threshold to $n=2f+1$ by utilizing a relaxed synchrony assumption. We present a synchronous state machine replication protocol that commits a decision every 3 rounds in the common case. The key challenge is to ensure quorum intersection at one honest replica. Our solution is to rely on the synchrony assumption to form a post-commit quorum of size $2f+1$, which intersects at $f+1$ replicas with any pre-commit quorums of size $f+1$. Our protocol also solves synchronous authenticated Byzantine agreement in expected 8 rounds. The best previous solution (Katz and Koo, 2006) requires expected 24 rounds. Our protocols may be applied to build Byzantine fault tolerant systems or improve cryptographic protocols such as cryptocurrencies when synchrony can be assumed.
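The quorum-intersection claim in this abstract follows from a pigeonhole bound: two subsets of sizes a and b drawn from n replicas share at least a + b - n members. The short check below works through that arithmetic for n = 2f + 1. The function names are hypothetical and the sketch covers only the counting argument, not the protocol itself.

```python
def min_quorum_overlap(n: int, size_a: int, size_b: int) -> int:
    """Smallest possible overlap of two subsets of n replicas (pigeonhole bound)."""
    return max(0, size_a + size_b - n)


def honest_in_overlap(f: int) -> int:
    """Guaranteed number of honest replicas shared by a post-commit quorum
    of size 2f+1 and a pre-commit quorum of size f+1 when n = 2f+1."""
    n = 2 * f + 1
    overlap = min_quorum_overlap(n, 2 * f + 1, f + 1)
    # At most f of the overlapping replicas can be Byzantine.
    return overlap - f


for f in range(1, 6):
    n = 2 * f + 1
    print(f, n, min_quorum_overlap(n, 2 * f + 1, f + 1), honest_in_overlap(f))
```

For every f the overlap is f + 1, so after discounting up to f Byzantine replicas exactly one honest replica is guaranteed to be in both quorums, which is the "quorum intersection at one honest replica" the abstract refers to.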

Distributed Parallel and Cluster Computing Cryptography and Security

A Flat-Combining-Based Persistent Stack for Non-Volatile Memory

Matan Rusanovsky, Ohad Ben-Baruch, Danny Hendler (2020)

Flat combining (FC) is a synchronization paradigm in which a single thread, holding a global lock, collects requests by multiple threads for accessing a concurrent data structure and applies their combined requests to it. Although FC is sequential, it significantly reduces synchronization overheads and cache invalidations and thus often provides better performance than that of lock-free implementations. The recent emergence of non-volatile memory (NVM) technologies increases the interest in the development of persistent (a.k.a. durable or recoverable) objects. These are objects that are able to recover from system failures and ensure consistency by retaining their state in NVM and fixing it, if required, upon recovery. Of particular interest are detectable objects that, in addition to ensuring consistency, allow recovery code to infer if a failed operation took effect before the crash and, if it did, obtain its response. In this work, we present the first FC-based persistent object. Specifically, we introduce a detectable FC-based implementation of a concurrent LIFO stack object. Our empirical evaluation establishes that thanks to the usage of flat combining, the novel stack algorithm requires a much smaller number of costly persistence instructions than competing algorithms and is therefore able to significantly outperform them.
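The flat-combining paradigm described in the first sentence of this abstract can be sketched as follows. This is a simplified, volatile illustration only: it has no persistence or detectability, it uses a small lock around the publication list where real FC implementations typically use per-thread publication records, and every name in it is hypothetical rather than taken from the paper.

```python
import threading


class _Request:
    """One published operation waiting to be applied by a combiner."""

    def __init__(self, op, value=None):
        self.op = op
        self.value = value
        self.result = None
        self.done = threading.Event()


class FlatCombiningStack:
    def __init__(self):
        self._items = []                         # the sequential LIFO stack
        self._combiner_lock = threading.Lock()   # global lock held by the combiner
        self._publish_lock = threading.Lock()    # guards the publication list
        self._pending = []                       # published, not yet applied

    def push(self, value):
        self._submit(_Request("push", value))

    def pop(self):
        """Returns the top element, or None if the stack is empty."""
        return self._submit(_Request("pop"))

    def _submit(self, req):
        # Publish the request, then either become the combiner or wait.
        with self._publish_lock:
            self._pending.append(req)
        while not req.done.is_set():
            if self._combiner_lock.acquire(blocking=False):
                try:
                    self._combine()
                finally:
                    self._combiner_lock.release()
            else:
                req.done.wait(timeout=0.001)
        return req.result

    def _combine(self):
        # The single combiner collects all published requests and applies
        # them one after another to the sequential stack.
        with self._publish_lock:
            batch, self._pending = self._pending, []
        for req in batch:
            if req.op == "push":
                self._items.append(req.value)
            else:
                req.result = self._items.pop() if self._items else None
            req.done.set()
```

Only the thread holding `_combiner_lock` ever touches `_items`, so the stack itself needs no further synchronization; other threads simply wait for their request to be marked done.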

Distributed Parallel and Cluster Computing Operating Systems