RDMAbox : Optimizing RDMA for Memory Intensive Workloads

101 0 0.0 ( 0 )

Download Cite

Added by Juhyun Bae

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Juhyun Bae - Ling Liu - Yanzhao Wu

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We present RDMAbox, a set of low level RDMA optimizations that provide better performance than previous approaches. The optimizations are packaged in easy-to-use kernel and user space libraries for applications and systems in data center. We demonstrate the flexibility and effectiveness of RDMAbox by implementing a kernel remote paging system and a user space file system using RDMAbox. RDMAbox employs two optimization techniques. First, we suggest RDMA request merging and chaining to further reduce the total number of I/O operations to the RDMA NIC. The I/O merge queue at the same time functions as a traffic regulator to enforce admission control and avoid overloading the NIC. Second, we propose Adaptive Polling to achieve higher efficiency of polling Work Completion than existing busy polling while maintaining the low CPU overhead of event trigger. Our implementation of a remote paging system with RDMAbox outperforms existing representative solutions with up to 4? throughput improvement and up to 83% decrease in average tail latency in bigdata workloads, and up to 83% reduction in completion time in machine learning workloads. Our implementation of a user space file system based on RDMAbox achieves up to 5.9? higher throughput over existing representative solutions.

rate research

Efficient Orchestration of Host and Remote Shared Memory for Memory Intensive Workloads

161 - Juhyun Bae , Gong Su , Arun Iyengar 2020

Since very few contributions to the development of an unified memory orchestration framework for efficient management of both host and remote idle memory have been made, we present Valet, an efficient approach to orchestration of host and remote shared memory for improving performance of memory intensive workloads. The paper makes three original contributions. First, we redesign the data flow in the critical path by introducing a host-coordinated memory pool that works as a local cache to reduce the latency in the critical path of the host and remote memory orchestration. Second, Valet utilizes unused local memory across containers by managing local memory via Valet host-coordinated memory pool, which allows containers to dynamically expand and shrink their memory allocations according to the workload demands. Third, Valet provides an efficient remote memory reclaiming technique on remote peers, based on two optimizations: (1) an activity-based victim selection scheme to allow the least-active-chunk of data to be selected for serving the eviction requests and (2) a migration protocol to move the least-active-chunk of data to less-memory-pressured remote node. As a result, Valet can effectively reduce the performance impact and migration overhead on local nodes. Our extensive experiments on both NoSQL systems and Machine Learning (ML) workloads show that Valet outperforms existing representative remote paging systems with up to 226X throughput improvement and up to 98% latency decrease over conventional OS swap facility for big data and ML workloads, and by up to 5.5X throughput improvement and up to 78.4% latency decrease over the state-of-the-art remote paging systems. Valet is open sourced at https://github.com/git-disl/Valet.

Distributed Parallel and Cluster Computing

Consistent RDMA-Friendly Hashing on Remote Persistent Memory

80 - Xinxin Liu , Yu Hua , Rong Bai 2021

Coalescing RDMA and Persistent Memory (PM) delivers high end-to-end performance for networked storage systems, which requires rethinking the design of efficient hash structures. In general, existing hashing schemes separately optimize RDMA and PM, thus partially addressing the problems of RDMA Access Amplification and High-Overhead PM Consistency. In order to address these problems, we propose a continuity hashing, which is a one-stone-two-birds design to optimize both RDMA and PM. The continuity hashing leverages a fine-grained contiguous shared region, called SBuckets, to provide standby positions for the neighbouring two buckets in case of hash collisions. In the continuity hashing, remote read only needs a single RDMA read to directly fetch the home bucket and the neighbouring SBuckets, which contain all the positions of maintaining a key-value item, thus alleviating RDMA access amplification. Continuity hashing further leverages indicators that can be atomically modified to support log-free PM consistency for all the write operations. Evaluation results demonstrate that compared with state-of-the-art schemes, continuity hashing achieves high throughput (i.e., 1.45X -- 2.43X improvement), low latency (about 1.7X speedup) and the smallest number of PM writes with various workloads, while has acceptable load factors of about 70%.

Distributed Parallel and Cluster Computing

MKPipe: A Compiler Framework for Optimizing Multi-Kernel Workloads in OpenCL for FPGA

76 - Ji Liu , Abdullah-Al Kafi , Xipeng Shen 2020

OpenCL for FPGA enables developers to design FPGAs using a programming model similar for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or solely depend on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the tradeoffs of these optimizations methods. To enable more efficient overlapping between kernel execution, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations upon individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernels achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.

Distributed Parallel and Cluster Computing Performance

Enabling Efficient RDMA-based Synchronous Mirroring of Persistent Memory Transactions

138 - Arash Tavakkol , Aasheesh Kolli , Stanko Novakovic 2018

Synchronous Mirroring (SM) is a standard approach to building highly-available and fault-tolerant enterprise storage systems. SM ensures strong data consistency by maintaining multiple exact data replicas and synchronously propagating every update to all of them. Such strong consistency provides fault tolerance guarantees and a simple programming model coveted by enterprise system designers. For current storage devices, SM comes at modest performance overheads. This is because performing both local and remote updates simultaneously is only marginally slower than performing just local updates, due to the relatively slow performance of accesses to storage in todays systems. However, emerging persistent memory and ultra-low-latency network technologies necessitate a careful re-evaluation of the existing SM techniques, as these technologies present fundamentally different latency characteristics compared than their traditional counterparts. In addition to that, existing low-latency network technologies, such as Remote Direct Memory Access (RDMA), provide limited ordering guarantees and do not provide durability guarantees necessary for SM. To evaluate the performance implications of RDMA-based SM, we develop a rigorous testing framework that is based on emulated persistent memory. Our testing framework makes use of two different tools: (i) a configurable microbenchmark and (ii) a modified version of the WHISPER benchmark suite, which comprises a set of common cloud applications. Using this framework, we find that recently proposed RDMA primitives, such as remote commit, provide correctness guarantees, but do not take full advantage of the asynchronous nature of RDMA hardware. To this end, we propose new primitives enabling efficient and correct SM over RDMA, and use these primitives to develop two new techniques delivering high-performance SM of persistent memories.

Distributed Parallel and Cluster Computing

CloudChain: A Cloud Blockchain Using Shared Memory Consensus and RDMA

81 - Minghui Xu , Shuo Liu , Dongxiao Yu 2021

Blockchain technologies can enable secure computing environments among mistrusting parties. Permissioned blockchains are particularly enlightened by companies, enterprises, and government agencies due to their efficiency, customizability, and governance-friendly features. Obviously, seamlessly fusing blockchain and cloud computing can significantly benefit permissioned blockchains; nevertheless, most blockchains implemented on clouds are originally designed for loosely-coupled networks where nodes communicate asynchronously, failing to take advantages of the closely-coupled nature of cloud servers. In this paper, we propose an innovative cloud-oriented blockchain -- CloudChain, which is a modularized three-layer system composed of the network layer, consensus layer, and blockchain layer. CloudChain is based on a shared-memory model where nodes communicate synchronously by direct memory accesses. We realize the shared-memory model with the Remote Direct Memory Access technology, based on which we propose a shared-memory consensus algorithm to ensure presistence and liveness, the two crucial blockchain security properties countering Byzantine nodes. We also implement a CloudChain prototype based on a RoCEv2-based testbed to experimentally validate our design, and the results verify the feasibility and efficiency of CloudChain.

Distributed Parallel and Cluster Computing