A General Technique for Non-blocking Trees

134 0 0.0 ( 0 )

Download Cite

Added by Trevor Brown

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Trevor Brown - Faith Ellen - Eric Ruppert

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We describe a general technique for obtaining provably correct, non-blocking implementations of a large class of tree data structures where pointers are directed from parents to children. Updates are permitted to modify any contiguous portion of the tree atomically. Our non-blocking algorithms make use of the LLX, SCX and VLX primitives, which are multi-word generalizations of the standard LL, SC and VL primitives and have been implemented from single-word CAS. To illustrate our technique, we describe how it can be used in a fairly straightforward way to obtain a non-blocking implementation of a chromatic tree, which is a relaxed variant of a red-black tree. The height of the tree at any time is $O(c+ log n)$, where $n$ is the number of keys and $c$ is the number of updates in progress. We provide an experimental performance analysis which demonstrates that our Java implementation of a chromatic tree rivals, and often significantly outperforms, other leading concurrent dictionaries.

rate research

Persistent Non-Blocking Binary Search Trees Supporting Wait-Free Range Queries

92 - Panagiota Fatourou , Eric Ruppert 2018

This paper presents the first implementation of a search tree data structure in an asynchronous shared-memory system that provides a wait-free algorithm for executing range queries on the tree, in addition to non-blocking algorithms for Insert, Delete and Find, using single-word Compare-and-Swap (CAS). The implementation is linearizable and tolerates any number of crash failures. Insert and Delete operations that operate on different parts of the tree run fully in parallel (without any interference with one another). We employ a lightweight helping mechanism, where each Insert, Delete and Find operation helps only update operations that affect the local neighbourhood of the leaf it arrives at. Similarly, a Scan helps only those updates taking place on nodes of the part of the tree it traverses, and therefore Scans operating on different parts of the tree do not interfere with one another. Our implementation works in a dynamic system where the number of processes may change over time. The implementation builds upon the non-blocking binary search tree implementation presented by Ellen et al. (in PODC 2010) by applying a simple mechanism to make the tree persistent.

Distributed Parallel and Cluster Computing

Pragmatic Primitives for Non-blocking Data Structures

96 - Trevor Brown , Faith Ellen , Eric Ruppert 2017

We define a new set of primitive operations that greatly simplify the implementation of non-blocking data structures in asynchronous shared-memory systems. The new operations operate on a set of Data-records, each of which contains multiple fields. The operations are generalizations of the well-known load-link (LL) and store-conditional (SC) operations called LLX and SCX. The LLX operation takes a snapshot of one Data-record. An SCX operation by a process $p$ succeeds only if no Data-record in a specified set has been changed since $p$ last performed an LLX on it. If successful, the SCX atomically updates one specific field of a Data-record in the set and prevents any future changes to some specified subset of those Data-records. We provide a provably correct implementation of these new primitives from single-word compare-and-swap. As a simple example, we show how to implement a non-blocking multiset data structure in a straightforward way using LLX and SCX.

Distributed Parallel and Cluster Computing

Integrating Blocking and Non-Blocking MPI Primitives with Task-Based Programming Models

74 - Kevin Sala 2019

In this paper we present the Task-Aware MPI library (TAMPI) that integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both programmability and performance of hybrid applications. The first API allows to pause and resume the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks. When an MPI operation executed inside a task blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation is completed, the paused task is put again on the runtime systems ready queue, so eventually it will be scheduled again and its execution will be resumed. The second API defers the release of dependencies associated with a task completion until some external events are fulfilled. This API is composed only of two functions, one to bind external events to a running task and another function to notify about the completion of external events previously bound. TAMPI leverages this API to bind non-blocking MPI operations with tasks, deferring the release of their task dependencies until both task execution and all its bound MPI operations are completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.

Distributed Parallel and Cluster Computing

Analysis and Evaluation of Non-Blocking Interpolation Search Trees

78 - Aleksandar Prokopec , Trevor Brown , Dan Alistarh 2020

We start by summarizing the recently proposed implementation of the first non-blocking concurrent interpolation search tree (C-IST) data structure. We then analyze the individual operations of the C-IST, and show that they are correct and linearizable. We furthermore show that lookup (and several other non-destructive operations) are wait-free, and that the insert and delete operations are lock-free. We continue by showing that the C-IST has the following properties. For arbitrary key distributions, this data structure ensures worst-case $O(log n + p)$ amortized time for search, insertion and deletion traversals. When the input key distributions are smooth, lookups run in expected $O(log log n + p)$ time, and insertion and deletion run in expected amortized $O(log log n + p)$ time, where $p$ is a bound on the number of threads. Finally, we present an extended experimental evaluation of the non-blocking IST performance.

Data Structures and Algorithms

A Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication

296 - Christie L. Alappat , Georg Hager , Olaf Schenk 2019

The symmetric sparse matrix-vector multiplication (SymmSpMV) is an important building block for many numerical linear algebra kernel operations or graph traversal applications. Parallelizing SymmSpMV on todays multicore platforms with up to 100 cores is difficult due to the need to manage conflicting updates on the result vector. Coloring approaches can be used to solve this problem without data duplication, but existing coloring algorithms do not take load balancing and deep memory hierarchies into account, hampering scalability and full-chip performance. In this work, we propose the recursive algebraic coloring engine (RACE), a novel coloring algorithm and open-source library implementation, which eliminates the shortcomings of previous coloring methods in terms of hardware efficiency and parallelization overhead. We describe the level construction, distance-k coloring, and load balancing steps in RACE, use it to parallelize SymmSpMV, and compare its performance on 31 sparse matrices with other state-of-the-art coloring techniques and Intel MKL on two modern multicore processors. RACE outperforms all other approaches substantially and behaves in accordance with the Roofline model. Outliers are discussed and analyzed in detail. While we focus on SymmSpMV in this paper, our algorithm and software is applicable to any sparse matrix operation with data dependencies that can be resolved by distance-k coloring.

Distributed Parallel and Cluster Computing Performance