Integrating Blocking and Non-Blocking MPI Primitives with Task-Based Programming Models

75 0 0.0 ( 0 )

Download Cite

Added by Kevin Sala

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Kevin Sala

Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In this paper we present the Task-Aware MPI library (TAMPI) that integrates both blocking and non-blocking MPI primitives with task-based programming models. The TAMPI library leverages two new runtime APIs to improve both programmability and performance of hybrid applications. The first API allows to pause and resume the execution of a task depending on external events. This API is used to improve the interoperability between blocking MPI communication primitives and tasks. When an MPI operation executed inside a task blocks, the task running is paused so that the runtime system can schedule a new task on the core that became idle. Once the blocked MPI operation is completed, the paused task is put again on the runtime systems ready queue, so eventually it will be scheduled again and its execution will be resumed. The second API defers the release of dependencies associated with a task completion until some external events are fulfilled. This API is composed only of two functions, one to bind external events to a running task and another function to notify about the completion of external events previously bound. TAMPI leverages this API to bind non-blocking MPI operations with tasks, deferring the release of their task dependencies until both task execution and all its bound MPI operations are completed. Our experiments reveal that the enhanced features of TAMPI not only simplify the development of hybrid MPI+OpenMP applications that use blocking or non-blocking MPI primitives but they also naturally overlap computation and communication phases, which improves application performance and scalability by removing artificial dependencies across communication tasks.

rate research

Pragmatic Primitives for Non-blocking Data Structures

96 - Trevor Brown , Faith Ellen , Eric Ruppert 2017

We define a new set of primitive operations that greatly simplify the implementation of non-blocking data structures in asynchronous shared-memory systems. The new operations operate on a set of Data-records, each of which contains multiple fields. The operations are generalizations of the well-known load-link (LL) and store-conditional (SC) operations called LLX and SCX. The LLX operation takes a snapshot of one Data-record. An SCX operation by a process $p$ succeeds only if no Data-record in a specified set has been changed since $p$ last performed an LLX on it. If successful, the SCX atomically updates one specific field of a Data-record in the set and prevents any future changes to some specified subset of those Data-records. We provide a provably correct implementation of these new primitives from single-word compare-and-swap. As a simple example, we show how to implement a non-blocking multiset data structure in a straightforward way using LLX and SCX.

Distributed Parallel and Cluster Computing

A General Technique for Non-blocking Trees

133 - Trevor Brown , Faith Ellen , Eric Ruppert 2017

We describe a general technique for obtaining provably correct, non-blocking implementations of a large class of tree data structures where pointers are directed from parents to children. Updates are permitted to modify any contiguous portion of the tree atomically. Our non-blocking algorithms make use of the LLX, SCX and VLX primitives, which are multi-word generalizations of the standard LL, SC and VL primitives and have been implemented from single-word CAS. To illustrate our technique, we describe how it can be used in a fairly straightforward way to obtain a non-blocking implementation of a chromatic tree, which is a relaxed variant of a red-black tree. The height of the tree at any time is $O(c+ log n)$, where $n$ is the number of keys and $c$ is the number of updates in progress. We provide an experimental performance analysis which demonstrates that our Java implementation of a chromatic tree rivals, and often significantly outperforms, other leading concurrent dictionaries.

Distributed Parallel and Cluster Computing

GPU-aware Communication with UCX in Parallel Programming Models: Charm++, MPI, and Python

81 - Jaemin Choi , Zane Fink , Sam White 2021

As an increasing number of leadership-class systems embrace GPU accelerators in the race towards exascale, efficient communication of GPU data is becoming one of the most critical components of high-performance computing. For developers of parallel programming models, implementing support for GPU-aware communication using native APIs for GPUs such as CUDA can be a daunting task as it requires considerable effort with little guarantee of performance. In this work, we demonstrate the capability of the Unified Communication X (UCX) framework to compose a GPU-aware communication layer that serves multiple parallel programming models of the Charm++ ecosystem: Charm++, Adaptive MPI (AMPI), and Charm4py. We demonstrate the performance impact of our designs with microbenchmarks adapted from the OSU benchmark suite, obtaining improvements in latency of up to 10.2x, 11.7x, and 17.4x in Charm++, AMPI, and Charm4py, respectively. We also observe increases in bandwidth of up to 9.6x in Charm++, 10x in AMPI, and 10.5x in Charm4py. We show the potential impact of our designs on real-world applications by evaluating a proxy application for the Jacobi iterative method, improving the communication performance by up to 12.4x in Charm++, 12.8x in AMPI, and 19.7x in Charm4py.

Distributed Parallel and Cluster Computing

Persistent Non-Blocking Binary Search Trees Supporting Wait-Free Range Queries

92 - Panagiota Fatourou , Eric Ruppert 2018

This paper presents the first implementation of a search tree data structure in an asynchronous shared-memory system that provides a wait-free algorithm for executing range queries on the tree, in addition to non-blocking algorithms for Insert, Delete and Find, using single-word Compare-and-Swap (CAS). The implementation is linearizable and tolerates any number of crash failures. Insert and Delete operations that operate on different parts of the tree run fully in parallel (without any interference with one another). We employ a lightweight helping mechanism, where each Insert, Delete and Find operation helps only update operations that affect the local neighbourhood of the leaf it arrives at. Similarly, a Scan helps only those updates taking place on nodes of the part of the tree it traverses, and therefore Scans operating on different parts of the tree do not interfere with one another. Our implementation works in a dynamic system where the number of processes may change over time. The implementation builds upon the non-blocking binary search tree implementation presented by Ellen et al. (in PODC 2010) by applying a simple mechanism to make the tree persistent.

Distributed Parallel and Cluster Computing

Exact blocking formulas for spin and gauge models

651 - Yuzhi Liu 2013

Using the example of the two-dimensional (2D) Ising model, we show that in contrast to what can be done in configuration space, the tensor renormalization group (TRG) formulation allows one to write exact, compact, and manifestly local blocking formulas and exact coarse grained expressions for the partition function. We argue that similar results should hold for most models studied by lattice gauge theorists. We provide exact blocking formulas for several 2D spin models (the O(2) and O(3) sigma models and the SU(2) principal chiral model) and for the 3D gauge theories with groups Z_2, U(1) and SU(2). We briefly discuss generalizations to other groups, higher dimensions and practical implementations.

High Energy Physics - Lattice Statistical Mechanics High Energy Physics - Theory