New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability

175 0 0.0 ( 0 )

Download Cite

Added by Vincent Richard Pascuzzi

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Vincent R. Pascuzzi - Mehdi Goli

Distributed Parallel and Cluster Computing Mathematical Software High Energy Physics - Experiment

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

High-performance computing (HPC) is a major driver accelerating scientific research and discovery, from quantum simulations to medical therapeutics. The growing number of new HPC systems coming online are being furnished with various hardware components, engineered by competing industry entities, each having their own architectures and platforms to be supported. While the increasing availability of these resources is in many cases pivotal to successful science, even the largest collaborations lack the computational expertise required for maximal exploitation of current hardware capabilities. The need to maintain multiple platform-specific codebases further complicates matters, potentially adding a constraint on the number of machines that can be utilized. Fortunately, numerous programming models are under development that aim to facilitate software solutions for heterogeneous computing. In this paper, we leverage the SYCL programming model to demonstrate cross-platform performance portability across heterogeneous resources. We detail our NVIDIA and AMD random number generator extensions to the oneMKL open-source interfaces library. Performance portability is measured relative to platform-specific baseline applications executed on four major hardware platforms using two different compilers supporting SYCL. The utility of our extensions are exemplified in a real-world setting via a high-energy physics simulation application. We show the performance of implementations that capitalize on SYCL interoperability are at par with native implementations, attesting to the cross-platform performance portability of a SYCL-based approach to scientific codes.

rate research

Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

94 - E. Calore , A. Gabbana , J. Kraus 2017

An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as accelerators could usually be programmed using specific programming languages threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directive clauses to mark regions of existing C, C++ or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper we address precisely this issue, using as a test-bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also asses the performance impact associated to portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the- art architectures.

Distributed Parallel and Cluster Computing

A Metric for Performance Portability

75 - S. J. Pennycook , J. D. Sewall , V. W. Lee 2016

The term performance portability has been informally used in computing to refer to a variety of notions which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of performance on these platforms. However, there has been a noticeable lack of consensus on the precise meaning of the term, and authors conclusions regarding their success (or failure) to achieve performance portability have thus been subjective. Comparing one approach to performance portability with another has generally been marked with vague claims and verbose, qualitative explanation of the comparison. This paper presents a concise definition for performance portability, along with a simple metric that accurately captures the performance and portability of an application across different platforms. The utility of this metric is then demonstrated with a retroactive application to previous work.

Performance

Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance

209 - Elliott Slaughter , Wei Wu , Yuankun Fu 2019

We present Task Bench, a parameterized benchmark designed to explore the performance of parallel and distributed programming systems under a variety of application scenarios. Task Bench lowers the barrier to benchmarking multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Benchs parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. We conduct a comprehensive study with implementations of Task Bench in 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. We introduce a novel metric, minimum effective task granularity to study the baseline runtime overhead of each system. We show that when running at scale, 100 {mu}s is the smallest granularity that even the most efficient systems can reliably support with current technologies. We also study each systems scalability, ability to hide communication and mitigate load imbalance.

Distributed Parallel and Cluster Computing

ArborX: A Performance Portable Geometric Search Library

79 - D. Lebrun-Grandie , A. Prokopenko , B. Turcksin 2019

Searching for geometric objects that are close in space is a fundamental component of many applications. The performance of search algorithms comes to the forefront as the size of a problem increases both in terms of total object count as well as in the total number of search queries performed. Scientific applications requiring modern leadership-class supercomputers also pose an additional requirement of performance portability, i.e. being able to efficiently utilize a variety of hardware architectures. In this paper, we introduce a new open-source C++ search library, ArborX, which we have designed for modern supercomputing architectures. We examine scalable search algorithms with a focus on performance, including a highly efficient parallel bounding volume hierarchy implementation, and propose a flexible interface making it easy to integrate with existing applications. We demonstrate the performance portability of ArborX on multi-core CPUs and GPUs, and compare it to the state-of-the-art libraries such as Boost.Geometry.Index and nanoflann.

Distributed Parallel and Cluster Computing Mathematical Software

GHOST: Building blocks for high performance sparse linear algebra on heterogeneous systems

71 - Moritz Kreutzer , Jonas Thies , Melven Rohrig-Zollner 2015

While many of the architectural details of future exascale-class high performance computer systems are still a matter of intense research, there appears to be a general consensus that they will be strongly heterogeneous, featuring standard as well as accelerated resources. Today, such resources are available as multicore processors, graphics processing units (GPUs), and other accelerators such as the Intel Xeon Phi. Any software infrastructure that claims usefulness for such environments must be able to meet their inherent challenges: massive multi-level parallelism, topology, asynchronicity, and abstraction. The General, Hybrid, and Optimized Sparse Toolkit (GHOST) is a collection of building blocks that targets algorithms dealing with sparse matrix representations on current and future large-scale systems. It implements the MPI+X paradigm, has a pure C interface, and provides hybrid-parallel numerical kernels, intelligent resource management, and truly heterogeneous parallelism for multicore CPUs, Nvidia GPUs, and the Intel Xeon Phi. We describe the details of its design with respect to the challenges posed by modern heterogeneous supercomputers and recent algorithmic developments. Implementation details which are indispensable for achieving high efficiency are pointed out and their necessity is justified by performance measurements or predictions based on performance models. The library code and several applications are available as open source. We also provide instructions on how to make use of GHOST in existing software packages, together with a case study which demonstrates the applicability and performance of GHOST as a component within a larger software stack.

Distributed Parallel and Cluster Computing Mathematical Software

comments

Fetching comments

Peninsula Private University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Achieving near native runtime performance and cross-platform performance portability for random number generation through SYCL interoperability

Ask ChatGPT about the research

No Arabic abstract

Read More