Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

A Study of Mixed Precision Strategies for GMRES on GPUs

150 0 0.0 ( 0 )

Download Cite

Added by Jennifer Loe

Publication date 2021

fields Informatics Engineering

and research's language is English

Authors Jennifer A. Loe - Christian A. Glusa - Ichitaro Yamazaki

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of mixed precision strategies for accelerating this kernel on an NVIDIA V$100$ GPU with a Power 9 CPU. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when mixed precision GMRES will be effective and to choose parameters for a mixed precision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of mixed precision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

rate research

Experimental Evaluation of Multiprecision Strategies for GMRES on GPUs

110 - Jennifer A. Loe , Christian A. Glusa , Ichitaro Yamazaki 2021

Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for multiprecision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of multiprecision strategies for accelerating this kernel on GPUs. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when multiprecision GMRES will be effective and to choose parameters for a multiprecision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of multiprecision approaches and demonstrate even further improvements are possible by optimizing low-level kernels.

Numerical Analysis Mathematical Software Numerical Analysis

Accelerating an Iterative Eigensolver for Nuclear Structure Configuration Interaction Calculations on GPUs using OpenACC

183 - Pieter Maris , Chao Yang , Dossay Oryspayev 2021

To accelerate the solution of large eigenvalue problems arising from many-body calculations in nuclear physics on distributed-memory parallel systems equipped with general-purpose Graphic Processing Units (GPUs), we modified a previously developed hybrid MPI/OpenMP implementation of an eigensolver written in FORTRAN 90 by using an OpenACC directives based programming model. Such an approach requires making minimal changes to the original code and enables a smooth migration of large-scale nuclear structure simulations from a distributed-memory many-core CPU system to a distributed GPU system. However, in order to make the OpenACC based eigensolver run efficiently on GPUs, we need to take into account the architectural differences between a many-core CPU and a GPU device. Consequently, the optimal way to insert OpenACC directives may be different from the original way of inserting OpenMP directives. We point out these differences in the implementation of sparse matrix-matrix multiplications (SpMM), which constitutes the main cost of the eigensolver, as well as other differences in the preconditioning step and dense linear algebra operations. We compare the performance of the OpenACC based implementation executed on multiple GPUs with the performance on distributed-memory many-core CPUs, and demonstrate significant speedup achieved on GPUs compared to the on-node performance of a many-core CPU. We also show that the overall performance improvement of the eigensolver on multiple GPUs is more modest due to the communication overhead among different MPI ranks.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

Multithreaded Filtering Preconditioner for Diffusion Equation on Structured Grid

153 - Abhinav Aggarwal , Shivam Kakkar , Pawan Kumar 2019

A parallel and nested version of a frequency filtering preconditioner is proposed for linear systems corresponding to diffusion equation on a structured grid. The proposed preconditioner is found to be robust with respect to jumps in the diffusion coefficients. The storage requirement for the preconditioner is O(N),where N is number of rows of matrix, hence, a fairly large problem of size more than 42 million unknowns has been solved on a quad core machine with 64GB RAM. The parallelism is achieved using twisted factorization and SIMD operations. The preconditioner achieves a speedup of 3.3 times on a quad core processor clocked at 4.2 GHz, and compared to a well known algebraic multigrid method, it is significantly faster in both setup and solve times for diffusion equations with jumps.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation

444 - Moritz Kreutzer , Georg Hager , Gerhard Wellein 2011

Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the nVidia Fermi class of GPGPUs. A new padded jagged diagonals storage (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

75 - Michael Lass , Stephan Mohr , Hendrik Wiebeler 2017

We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to be able to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows to distribute the calculations over n nodes with only little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprised of 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

comments

Fetching comments

University of Mosul

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

A Study of Mixed Precision Strategies for GMRES on GPUs

Ask ChatGPT about the research

No Arabic abstract

Read More