New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Chemora: A PDE Solving Framework for Modern HPC Architectures

183 0 0.0 ( 0 )

Download Cite

Added by Erik Schnetter

Publication date 2014

fields Informatics Engineering

and research's language is English

Authors Erik Schnetter - Marek Blazewicz - Steven R. Brandt

Mathematical Software Distributed Parallel and Cluster Computing

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Modern HPC architectures consist of heterogeneous multi-core, many-node systems with deep memory hierarchies. Modern applications employ ever more advanced discretisation methods to study multi-physics problems. Developing such applications that explore cutting-edge physics on cutting-edge HPC systems has become a complex task that requires significant HPC knowledge and experience. Unfortunately, this combined knowledge is currently out of reach for all but a few groups of application developers. Chemora is a framework for solving systems of Partial Differential Equations (PDEs) that targets modern HPC architectures. Chemora is based on Cactus, which sees prominent usage in the computational relativistic astrophysics community. In Chemora, PDEs are expressed either in a high-level LaTeX-like language or in Mathematica. Discretisation stencils are defined separately from equations, and can include Finite Differences, Discontinuous Galerkin Finite Elements (DGFE), Adaptive Mesh Refinement (AMR), and multi-block systems. We use Chemora in the Einstein Toolkit to implement the Einstein Equations on CPUs and on accelerators, and study astrophysical systems such as black hole binaries, neutron stars, and core-collapse supernovae.

rate research

A unified sparse matrix data format for efficient general sparse matrix-vector multiply on modern processors with wide SIMD units

156 - Moritz Kreutzer , Georg Hager , Gerhard Wellein 2013

Sparse matrix-vector multiplication (spMVM) is the most time-consuming kernel in many numerical algorithms and has been studied extensively on all modern processor and accelerator architectures. However, the optimal sparse matrix data storage format is highly hardware-specific, which could become an obstacle when using heterogeneous systems. Also, it is as yet unclear how the wide single instruction multiple data (SIMD) units in current multi- and many-core processors should be used most efficiently if there is no structure in the sparsity pattern of the matrix. We suggest SELL-C-sigma, a variant of Sliced ELLPACK, as a SIMD-friendly data format which combines long-standing ideas from General Purpose Graphics Processing Units (GPGPUs) and vector computer programming. We discuss the advantages of SELL-C-sigma compared to established formats like Compressed Row Storage (CRS) and ELLPACK and show its suitability on a variety of hardware platforms (Intel Sandy Bridge, Intel Xeon Phi and Nvidia Tesla K20) for a wide range of test matrices from different application areas. Using appropriate performance models we develop deep insight into the data transfer properties of the SELL-C-sigma spMVM kernel. SELL-C-sigma comes with two tuning parameters whose performance impact across the range of test matrices is studied and for which reasonable choices are proposed. This leads to a hardware-independent (catch-all) sparse matrix format, which achieves very high efficiency for all test matrices across all hardware platforms.

Mathematical Software Distributed Parallel and Cluster Computing

A Framework for General Sparse Matrix-Matrix Multiplication on GPUs and Heterogeneous Processors

300 - Weifeng Liu , Brian Vinter 2015

General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. The source code of this work is available at https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR

Mathematical Software Distributed Parallel and Cluster Computing Numerical Analysis

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems

141 - Daniel J. Magee , Anthony S. Walker , 2018

Applications that exploit the architectural details of high-performance computing (HPC) systems have become increasingly invaluable in academia and industry over the past two decades. The most important hardware development of the last decade in HPC has been the General Purpose Graphics Processing Unit (GPGPU), a class of massively parallel devices that now contributes the majority of computational power in the top 500 supercomputers. As these systems grow, small costs such as latency---due to the fixed cost of memory accesses and communication---accumulate in a large simulation and become a significant barrier to performance. The swept time-space decomposition rule is a communication-avoiding technique for time-stepping stencil update formulas that attempts to reduce latency costs. This work extends the swept rule by targeting heterogeneous, CPU/GPU architectures representing current and future HPC systems. We compare our approach to a naive decomposition scheme with two test equations using an MPI+CUDA pattern on 40 processes over two nodes containing one GPU. The swept rule produces a factor of 1.9 to 23 speedup for the heat equation and a factor of 1.1 to 2.0 speedup for the Euler equations, using the same processors and work distribution, and with the best possible configurations. These results show the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel computing systems that incur substantial latency costs.

Mathematical Software Distributed Parallel and Cluster Computing Computational Physics

OpenSBLI: A framework for the automated derivation and parallel execution of finite difference solvers on a range of computer architectures

316 - Christian T. Jacobs , Satya P. Jammy , Neil D. Sandham 2016

Exascale computing will feature novel and potentially disruptive hardware architectures. Exploiting these to their full potential is non-trivial. Numerical modelling frameworks involving finite difference methods are currently limited by the static nature of the hand-coded discretisation schemes and repeatedly may have to be re-written to run efficiently on new hardware. In contrast, OpenSBLI uses code generation to derive the models code from a high-level specification. Users focus on the equations to solve, whilst not concerning themselves with the detailed implementation. Source-to-source translation is used to tailor the code and enable its execution on a variety of hardware.

Mathematical Software Symbolic Computation Software Engineering

Solving the Bethe-Salpeter equation on massively parallel architectures

114 - Xiao Zhang 2020

The last ten years have witnessed fast spreading of massively parallel computing clusters, from leading supercomputing facilities down to the average university computing center. Many companies in the private sector have undergone a similar evolution. In this scenario, the seamless integration of software and middleware libraries is a key ingredient to ensure portability of scientific codes and guarantees them an extended lifetime. In this work, we describe the integration of the ChASE library, a modern parallel eigensolver, into an existing legacy code for the first-principles computation of optical properties of materials via solution of the Bethe-Salpeter equation for the optical polarization function. Our numerical tests show that, as a result of integrating ChASE and parallelizing the reading routine, the code experiences a remarkable speedup and greatly improved scaling behavior on both multi- and many-core architectures. We demonstrate that such a modernized BSE code will, by fully exploiting parallel computing architectures and file systems, enable domain scientists to accurately study complex material systems that were not accessible before.

Computational Engineering Distributed Parallel and Cluster Computing Computational Physics

comments

Fetching comments

Arab Academy for Science and Technology and Maritime Transport

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Chemora: A PDE Solving Framework for Modern HPC Architectures

Ask ChatGPT about the research

No Arabic abstract

Read More