
GPU Algorithms for Efficient Exascale Discretizations

Added by: Tzanio Kolev
Publication date: 2021
Language: English





In this paper we describe the research and development activities in the Center for Efficient Exascale Discretizations (CEED) within the US Exascale Computing Project, targeting state-of-the-art high-order finite-element algorithms for high-order applications on GPU-accelerated platforms. We discuss the GPU developments in several components of the CEED software stack, including the libCEED, MAGMA, MFEM, libParanumal, and Nek projects. We report performance and capability improvements in several CEED-enabled applications on both NVIDIA and AMD GPU systems.
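As a brief illustration of how an application targets these backends, the sketch below selects a GPU backend at runtime through the libCEED Python bindings. The resource strings follow libCEED's backend naming ("/gpu/cuda" for NVIDIA, "/gpu/hip" for AMD); the exact binding method names are assumptions to verify against the libceed package documentation.

```python
# Minimal sketch: runtime backend selection with the libCEED Python
# bindings. The resource string picks the implementation: "/cpu/self"
# (reference CPU), "/gpu/cuda" (NVIDIA), "/gpu/hip" (AMD). Method names
# are assumed from the libceed package; check its documentation.
import libceed

ceed = libceed.Ceed("/gpu/cuda")  # swap in "/gpu/hip" on AMD systems

# Vectors created through this Ceed object are managed by the chosen
# backend, so the same application code runs on either vendor's GPUs.
n = 1024
x = ceed.Vector(n)
x.set_value(1.0)
```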

Related research

Efficient exploitation of exascale architectures requires a rethinking of the numerical algorithms used in many large-scale applications. These architectures favor algorithms that expose ultra-fine-grain parallelism and maximize the ratio of floating point operations to energy-intensive data movement. One of the few viable approaches to achieve high efficiency in the area of PDE discretizations on unstructured grids is to use matrix-free/partially-assembled high-order finite element methods, since these methods can increase the accuracy and/or lower the computational time due to reduced data motion. In this paper we provide an overview of the research and development activities in the Center for Efficient Exascale Discretizations (CEED), a co-design center in the Exascale Computing Project that is focused on the development of next-generation discretization software and algorithms to enable a wide range of finite element applications to run efficiently on future hardware. CEED is a research partnership involving more than 30 computational scientists from two US national labs and five universities, including members of the Nek5000, MFEM, MAGMA and PETSc projects. We discuss the CEED co-design activities based on targeted benchmarks, miniapps and discretization libraries and our work on performance optimizations for large-scale GPU architectures. We also provide a broad overview of research and development activities in areas such as unstructured adaptive mesh refinement algorithms, matrix-free linear solvers, high-order data visualization, and list examples of collaborations with several ECP and external applications.
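To make the data-motion argument concrete, the NumPy sketch below applies a 2D tensor-product element mass operator matrix-free via sum factorization: only a small 1D basis matrix and pointwise quadrature data are stored, never the assembled element matrix. Names, sizes, and the random placeholder data are illustrative, not CEED code.

```python
# Minimal sketch of the matrix-free "partial assembly" idea for one
# tensor-product element: the mass operator is applied with small dense
# 1D contractions (sum factorization) instead of an assembled matrix.
# B is the 1D basis-at-quadrature matrix (q x p); W holds quadrature
# weight times Jacobian determinant at each 2D quadrature point.
import numpy as np

p, q = 4, 5                  # 1D dofs per element, 1D quadrature points
B = np.random.rand(q, p)     # placeholder 1D basis values
W = np.random.rand(q, q)     # placeholder weight * |J| at quad points

def mass_apply(u):
    """Apply the element mass operator to u (p x p coefficients)."""
    Uq = B @ u @ B.T         # interpolate to quadrature points: (B kron B) u
    Uq *= W                  # pointwise scaling at quadrature points
    return B.T @ Uq @ B      # apply the transpose interpolation

u = np.random.rand(p, p)
v = mass_apply(u)            # O(p^2 q) work; stores O(pq + q^2) data
                             # vs. O(p^4) for the assembled element matrix
```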
Hierarchical $\mathcal{H}^2$-matrices are asymptotically optimal representations for the discretizations of non-local operators such as those arising in integral equations or from kernel functions. Their $O(N)$ complexity in both memory and operator application makes them particularly suited for large-scale problems. As a result, there is a need for software that provides support for distributed operations on these matrices to allow large-scale problems to be represented. In this paper, we present high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the $\mathcal{H}^2$ format. The algorithms are a new module of H2Opus, a performance-oriented package that supports a broad variety of $\mathcal{H}^2$-matrix operations on CPUs and GPUs. Performance in the distributed GPU setting is achieved by marshaling the tree data of the hierarchical matrix representation to allow batched kernels to be executed on the individual GPUs. MPI is used for inter-process communication. We optimize the communication data volume and hide much of the communication cost with local compute phases of the algorithms. Results show near-ideal scalability up to 1024 NVIDIA V100 GPUs on Summit, with performance exceeding 2.3 Tflop/s/GPU for the matrix-vector multiplication, and 670 Gflop/s/GPU for matrix compression, which involves batched QR and SVD operations. We illustrate the flexibility and efficiency of the library by solving a 2D variable diffusivity integral fractional diffusion problem with an algebraic multigrid-preconditioned Krylov solver and demonstrate scalability up to problems with 16M degrees of freedom on 64 GPUs.
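The communication hiding mentioned above is a standard overlap pattern: post non-blocking exchanges, perform the purely local part of the product, then finish communication and apply the coupling blocks. The mpi4py sketch below shows the shape of this pattern with dense placeholder blocks; the names (A_local, A_remote, the pairwise exchange) are illustrative, not the H2Opus API.

```python
# Minimal sketch of overlapping communication with local compute in a
# distributed matrix-vector product. Dense blocks stand in for the
# hierarchical matrix structure.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n = 1000

x_local = np.random.rand(n)
x_remote = np.empty(n)
A_local = np.random.rand(n, n)    # block acting on locally owned data
A_remote = np.random.rand(n, n)   # block acting on a neighbor's data

peer = rank ^ 1                   # toy pairwise exchange pattern
if peer < size:
    reqs = [comm.Isend(x_local, dest=peer),
            comm.Irecv(x_remote, source=peer)]
    y = A_local @ x_local         # local compute overlaps communication
    MPI.Request.Waitall(reqs)
    y += A_remote @ x_remote      # coupling part once data has arrived
else:
    y = A_local @ x_local         # unpaired rank: local part only
```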
M. Andrecut (2008)
Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimized versions, based on CUBLAS (NVIDIA), are substantially faster (up to 12 times) than the CPU optimized versions based on CBLAS (GNU Scientific Library).
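A minimal NumPy sketch of the GS-PCA idea, NIPALS power iterations with an explicit Gram-Schmidt re-orthogonalization of each new loading vector against the previously computed ones, is given below; it illustrates the algorithm, not the paper's GPU implementation.

```python
# NIPALS-style PCA with Gram-Schmidt re-orthogonalization of loadings,
# which prevents the loss of orthogonality of plain NIPALS.
import numpy as np

def gs_pca(X, k, iters=500, tol=1e-12):
    """Return scores T (n x k) and orthonormal loadings P (m x k)."""
    X = X - X.mean(axis=0)                # center the data
    n, m = X.shape
    T, P = np.zeros((n, k)), np.zeros((m, k))
    R = X.copy()                          # deflated residual matrix
    for j in range(k):
        t = R[:, 0].copy()                # initial score vector
        for _ in range(iters):
            p = R.T @ t                   # loading estimate
            p -= P[:, :j] @ (P[:, :j].T @ p)  # Gram-Schmidt step
            p /= np.linalg.norm(p)
            t_new = R @ p                 # updated score vector
            if np.linalg.norm(t_new - t) < tol * np.linalg.norm(t_new):
                t = t_new
                break
            t = t_new
        T[:, j], P[:, j] = t, p
        R -= np.outer(t, p)               # deflate before next component

    return T, P

# Usage: the leading loadings match an SVD of centered X up to sign.
X = np.random.rand(200, 50)
T, P = gs_pca(X, 5)
```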
A parallel and nested version of a frequency filtering preconditioner is proposed for linear systems corresponding to the diffusion equation on a structured grid. The proposed preconditioner is found to be robust with respect to jumps in the diffusion coefficients. The storage requirement for the preconditioner is O(N), where N is the number of rows of the matrix; hence, a fairly large problem with more than 42 million unknowns has been solved on a quad core machine with 64GB RAM. The parallelism is achieved using twisted factorization and SIMD operations. The preconditioner achieves a speedup of 3.3 times on a quad core processor clocked at 4.2 GHz, and compared to a well known algebraic multigrid method, it is significantly faster in both setup and solve times for diffusion equations with jumps.
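The twisted-factorization parallelism can be illustrated on a single tridiagonal system: elimination proceeds from both ends simultaneously and meets at a "twist" row, so the two sweeps are independent and can run in parallel. The sketch below is a serial NumPy illustration of that structure, not the paper's preconditioner code.

```python
# Twisted ("burn at both ends") factorization solve for a tridiagonal
# system T x = b: a downward LU sweep and an upward UL sweep meet at
# twist row k; the two sweeps have no data dependence on each other.
import numpy as np

def twisted_solve(l, a, u, b, k=None):
    """T has sub-diagonal l[1:], diagonal a, super-diagonal u[:-1]."""
    l, a, u = (np.asarray(v, float) for v in (l, a, u))
    n = len(a)
    k = n // 2 if k is None else k
    p, q, x = np.empty(n), np.empty(n), np.empty(n)
    y, z = np.asarray(b, float).copy(), np.asarray(b, float).copy()

    p[0] = a[0]                     # downward LU sweep (rows 0 .. k-1)
    for i in range(1, k):
        p[i] = a[i] - l[i] * u[i - 1] / p[i - 1]
        y[i] -= l[i] / p[i - 1] * y[i - 1]

    q[n - 1] = a[n - 1]             # upward UL sweep (rows n-1 .. k+1)
    for i in range(n - 2, k, -1):
        q[i] = a[i] - u[i] * l[i + 1] / q[i + 1]
        z[i] -= u[i] / q[i + 1] * z[i + 1]

    pivot, rhs = a[k], float(b[k])  # combine both sweeps at the twist row
    if k > 0:
        pivot -= l[k] * u[k - 1] / p[k - 1]
        rhs -= l[k] / p[k - 1] * y[k - 1]
    if k < n - 1:
        pivot -= u[k] * l[k + 1] / q[k + 1]
        rhs -= u[k] / q[k + 1] * z[k + 1]
    x[k] = rhs / pivot

    for i in range(k - 1, -1, -1):  # back-substitute, twist row outward
        x[i] = (y[i] - u[i] * x[i + 1]) / p[i]
    for i in range(k + 1, n):
        x[i] = (z[i] - l[i] * x[i - 1]) / q[i]
    return x

# Quick check against a dense solve:
n = 7
l = np.r_[0.0, -np.ones(n - 1)]; u = np.r_[-np.ones(n - 1), 0.0]
a = 2.0 * np.ones(n); b = np.random.rand(n)
T = np.diag(a) + np.diag(l[1:], -1) + np.diag(u[:-1], 1)
assert np.allclose(twisted_solve(l, a, u, b), np.linalg.solve(T, b))
```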
Support for lower precision computation is becoming more common in accelerator hardware due to lower power usage, reduced data movement and increased computational performance. However, computational science and engineering (CSE) problems require double precision accuracy in several domains. This conflict between hardware trends and application needs has resulted in a need for mixed precision strategies at the linear algebra algorithms level if we want to exploit the hardware to its full potential while meeting the accuracy requirements. In this paper, we focus on preconditioned sparse iterative linear solvers, a key kernel in several CSE applications. We present a study of mixed precision strategies for accelerating this kernel on an NVIDIA V100 GPU with a Power9 CPU. We seek the best methods for incorporating multiple precisions into the GMRES linear solver; these include iterative refinement and parallelizable preconditioners. Our work presents strategies to determine when mixed precision GMRES will be effective and to choose parameters for a mixed precision iterative refinement solver to achieve better performance. We use an implementation that is based on the Trilinos library and employs Kokkos Kernels for performance portability of linear algebra kernels. Performance results demonstrate the promise of mixed precision approaches and show that even further improvements are possible by optimizing low-level kernels.
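The iterative refinement idea can be sketched in a few lines: run the expensive inner solve in float32 while accumulating residuals and updates in float64, so the final answer recovers double precision accuracy. The NumPy/SciPy sketch below uses a dense LU factorization as the inner solver in place of the paper's preconditioned GMRES; it illustrates the precision strategy, not the Trilinos/Kokkos implementation.

```python
# Mixed precision iterative refinement: factor and solve in float32,
# compute residuals and apply corrections in float64.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, max_iter=10, tol=1e-12):
    A32 = A.astype(np.float32)
    lu = lu_factor(A32)                         # factor once, low precision
    x = np.zeros_like(b)
    for _ in range(max_iter):
        r = b - A @ x                           # residual in float64
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        d = lu_solve(lu, r.astype(np.float32))  # correction in float32
        x += d.astype(np.float64)               # update in float64
    return x

# Usage on a well-conditioned test problem:
A = np.random.rand(500, 500) + 500 * np.eye(500)
b = np.random.rand(500)
x = mixed_precision_solve(A, b)
assert np.linalg.norm(b - A @ x) <= 1e-10 * np.linalg.norm(b)
```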
