We focus on implementing and optimizing a sixth-order finite-difference solver for simulating compressible fluids on a GPU using third-order Runge-Kutta integration. Graphics processing units perform well on data-parallel tasks, which makes them an attractive platform for fluid simulation. However, high-order stencil computation is memory-intensive with respect to both main memory and the caches of the GPU. We present two approaches for simulating compressible fluids using 55-point and 19-point stencils. We reduce the memory-bandwidth and cache-size requirements of our methods by using cache blocking and by decomposing a latency-bound kernel into several bandwidth-bound kernels. Our fastest implementation is bandwidth-bound and integrates $343$ million grid points per second on a Tesla K40t GPU, achieving a $3.6\times$ speedup over a comparable hydrodynamics solver benchmarked on two Intel Xeon E5-2690v3 processors. Our alternative GPU implementation is latency-bound and achieves a rate of $168$ million updates per second.
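To illustrate the numerical ingredients named above, here is a minimal NumPy sketch (ours, not the paper's GPU code) of a sixth-order central finite difference combined with the Shu-Osher SSP variant of third-order Runge-Kutta, applied to linear advection on a periodic grid; the names `ddx6` and `rk3_step` are our own.

```python
import numpy as np

def ddx6(f, h):
    """Sixth-order central first derivative on a periodic grid."""
    return (45.0 * (np.roll(f, -1) - np.roll(f, 1))
            - 9.0 * (np.roll(f, -2) - np.roll(f, 2))
            +       (np.roll(f, -3) - np.roll(f, 3))) / (60.0 * h)

def rk3_step(u, dt, rhs):
    """One Shu-Osher SSP third-order Runge-Kutta step."""
    u1 = u + dt * rhs(u)
    u2 = 0.75 * u + 0.25 * (u1 + dt * rhs(u1))
    return u / 3.0 + 2.0 / 3.0 * (u2 + dt * rhs(u2))

# Model problem: advect u_t + c u_x = 0 on a periodic grid.
N, c = 64, 1.0
h = 2.0 * np.pi / N
x = np.arange(N) * h
u = np.sin(x)
dt = 0.2 * h
for _ in range(50):
    u = rk3_step(u, dt, lambda v: -c * ddx6(v, h))
err = np.max(np.abs(u - np.sin(x - c * 50 * dt)))
```

A production solver replaces the `np.roll` calls with an explicit stencil loop, which is where the cache-blocking and kernel-decomposition strategies discussed above come into play.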
In this paper, our goal is to efficiently solve the Vlasov equation on GPUs. A semi-Lagrangian discontinuous Galerkin scheme is used for the discretization. Such kinetic computations are extremely expensive due to the high-dimensional phase space. The SLDG code, which is publicly available under the MIT license, abstracts the number of dimensions and uses a shared codebase for both GPU- and CPU-based simulations. We investigate the performance of the implementation on a range of Tesla (V100, Titan V, K80) and consumer (GTX 1080 Ti) GPUs. Our implementation is typically able to achieve a performance of approximately 470 GB/s on a single GPU and 1600 GB/s on four V100 GPUs connected via NVLink. This results in a speedup of about a factor of ten (comparing a single GPU with a dual-socket Intel Xeon Gold node) and approximately a factor of 35 (comparing a single node with and without GPUs). In addition, we investigate the effect of single-precision computation on the performance of the SLDG code and demonstrate that a template-based, dimension-independent implementation can achieve good performance regardless of the dimensionality of the problem.
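The semi-Lagrangian idea, in its simplest one-dimensional form, traces each grid point back along the characteristic and interpolates the old solution at the departure point, so the time step is not restricted by a CFL condition. A minimal sketch of that step (using linear interpolation rather than the paper's discontinuous Galerkin representation; all names are ours):

```python
import numpy as np

def semi_lagrangian_step(u, x, a, dt, L):
    """One constant-coefficient semi-Lagrangian step for u_t + a u_x = 0
    on a periodic grid of length L, using linear interpolation."""
    xd = (x - a * dt) % L        # departure points of the characteristics
    xp = np.append(x, L)         # close the periodic interval for np.interp
    up = np.append(u, u[0])
    return np.interp(xd, xp, up)

N, L, a = 256, 2.0 * np.pi, 1.0
h = L / N
x = np.arange(N) * h
u = np.sin(x)
dt = 1.5 * h                     # CFL number 1.5: fine for semi-Lagrangian
for _ in range(20):
    u = semi_lagrangian_step(u, x, a, dt, L)
err = np.max(np.abs(u - np.sin(x - a * 20 * dt)))
```

The update is embarrassingly parallel over grid points, which is why the scheme maps so well onto GPUs; a DG variant replaces the linear interpolation with an element-local polynomial evaluation.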
High-fidelity flow simulation around complex geometries at high Reynolds number ($Re$) remains very challenging and demands increasingly powerful HPC systems. However, the development of HPC systems built on traditional CPU architectures faces bottlenecks due to high power consumption and technical difficulties, and heterogeneous architectures have emerged as a promising way forward. GPU acceleration has already been applied to low-order CFD solvers on structured grids and to high-order solvers on unstructured meshes. High-order finite-difference methods on structured grids possess many advantages, e.g. high efficiency, robustness, and low storage; however, the strong data dependence among points in a high-order finite-difference scheme still limits their application on GPU platforms. In the present work, we propose a set of hardware-aware techniques to optimize the efficiency of data transfer between CPU and GPU and of communication between GPUs. An in-house multi-block structured CFD solver with high-order finite-difference methods on curvilinear coordinates is ported to the GPU platform and achieves satisfying performance, with a maximum speedup of around 2000x over a single CPU core. This work provides an efficient way to apply GPU computing to CFD simulation with certain high-order finite-difference methods on current heterogeneous computers. Tests show that significant acceleration can be achieved on different GPUs.
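On curvilinear grids, high-order finite differences are applied in a uniform computational coordinate $\xi$ and converted to physical derivatives through the grid metric, $df/dx = (df/d\xi)/(dx/d\xi)$. A minimal one-dimensional NumPy sketch of this metric-term approach (ours, not the solver described above; interior points only, ignoring the boundary closures a real solver needs):

```python
import numpy as np

def d_dxi6(f, dxi):
    """Sixth-order central derivative in the computational coordinate;
    the three points at each boundary are left as NaN (a real solver
    would use one-sided closure stencils there)."""
    d = np.full_like(f, np.nan)
    d[3:-3] = (45.0 * (f[4:-2] - f[2:-4])
               - 9.0 * (f[5:-1] - f[1:-5])
               +        (f[6:]  - f[:-6])) / (60.0 * dxi)
    return d

N = 201
xi = np.linspace(0.0, 1.0, N)            # uniform computational grid
dxi = xi[1] - xi[0]
x = xi**2 + xi                           # smooth, monotone mapping (illustrative)
f = np.sin(x)
dfdx = d_dxi6(f, dxi) / d_dxi6(x, dxi)   # chain rule with the discrete metric
err = np.max(np.abs(dfdx[3:-3] - np.cos(x[3:-3])))
```

The wide stencil in $\xi$ is exactly the "strong dependence among points" the abstract mentions: each output needs six neighbors, which drives the halo-exchange and CPU-GPU transfer costs on multi-block grids.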
Restricted solid-on-solid surface growth models can be mapped onto binary lattice gases. We show that efficient simulation algorithms can be realized on GPUs with either CUDA or OpenCL programming. We consider a deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1 dimensions, related to the Asymmetric Simple Exclusion Process, and show that for system sizes that fit into the shared memory of the GPU one can achieve the maximum parallelization speedup of ~100x on a Quadro FX 5800 graphics card with respect to a single 2.67 GHz CPU core. This permits us to study the effect of quenched columnar disorder, which requires extremely long simulation times. We compare the CUDA realization with an OpenCL implementation designed for processor clusters via MPI. A two-lane traffic model with randomized turning points is also realized, and its dynamical behavior has been investigated.
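The mapping referred to above identifies the $\pm 1$ height steps of the surface with particles and holes of a lattice gas, so a deposition event at a local minimum corresponds to a particle hop in the ASEP. A minimal serial Monte Carlo sketch of the deposition/evaporation ("single-step") update rule (the GPU versions parallelize exactly this rule; the parameters here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64                      # lattice size (even, periodic)
p, q = 1.0, 0.0             # deposition / evaporation probabilities (KPZ limit)
h = np.zeros(N, dtype=int)
h[1::2] = 1                 # checkerboard start: all height steps are +/-1

for _ in range(10_000):
    i = rng.integers(N)     # pick a random site
    left, right = h[(i - 1) % N], h[(i + 1) % N]
    if h[i] < left and h[i] < right and rng.random() < p:
        h[i] += 2           # deposit at a local minimum (= particle hop in ASEP)
    elif h[i] > left and h[i] > right and rng.random() < q:
        h[i] -= 2           # evaporate from a local maximum (= reverse hop)

steps = np.diff(np.append(h, h[0]))   # the binary lattice-gas configuration
```

Because each update touches only nearest neighbors, independent sites can be updated concurrently by GPU threads, subject to a sublattice or domain-decomposition scheme that avoids neighboring sites being updated simultaneously.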
Recently, a 4th-order asymptotic-preserving multiderivative implicit-explicit (IMEX) scheme was developed (Schutz and Seal 2020, arXiv:2001.08268). This scheme is based on 4th-order Hermite interpolation in time and uses an operator-splitting approach that converges to the underlying quadrature if iterated sufficiently often. Hermite schemes have been used in astrophysics for decades, particularly for N-body calculations, but not in a form suitable for solving stiff equations. In this work, we extend the scheme presented in Schutz and Seal 2020 to higher orders. Such high-order schemes offer advantages when one aims to find high-precision solutions to systems of differential equations containing stiff terms, which occur throughout the physical sciences. We begin by deriving Hermite schemes of arbitrary order and discussing the stability of these formulas. Afterwards, we demonstrate how the method of Schutz and Seal 2020 generalises in a straightforward manner to any of these schemes, and prove convergence properties of the resulting IMEX schemes. We then present results for methods ranging from 6th to 12th order and explore a selection of test problems, including linear and nonlinear ordinary differential equations and Burgers' equation. To our knowledge, this is the first time that Hermite time-stepping methods have been applied to partial differential equations. We close by discussing some benefits of these schemes, such as their potential for parallelism and low memory usage, as well as limitations and potential drawbacks.
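The building block of such schemes is a quadrature rule that uses derivative values as well as function values at the nodes. The simplest member of the family is the two-point cubic Hermite rule (the "corrected trapezoidal" rule), which is exact for cubics and hence fourth-order accurate. A short sketch verifying that order numerically (names and test problem are ours):

```python
import numpy as np

def hermite_quad(f, df, a, b, n):
    """Composite two-point Hermite (corrected trapezoidal) rule:
    per subinterval,  h/2 (f_L + f_R) + h^2/12 (f'_L - f'_R),
    which is exact for cubic polynomials."""
    t = np.linspace(a, b, n + 1)
    h = (b - a) / n
    fl, fr = f(t[:-1]), f(t[1:])
    dl, dr = df(t[:-1]), df(t[1:])
    return np.sum(h / 2.0 * (fl + fr) + h**2 / 12.0 * (dl - dr))

exact = np.e - 1.0                                            # int_0^1 exp(t) dt
e8  = abs(hermite_quad(np.exp, np.exp, 0.0, 1.0, 8)  - exact)
e16 = abs(hermite_quad(np.exp, np.exp, 0.0, 1.0, 16) - exact)
ratio = e8 / e16                                              # ~ 2**4 = 16
```

Higher-order members of the family add higher derivatives at the endpoints; embedding such a rule in an iterated IMEX splitting is the step the paper develops for stiff problems.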
Stencil computations are widely used in HPC applications, and many HPC platforms today use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order stencils on GPUs have been well studied in the literature, not all of the proposed enhancements work well for high-order stencils, such as those used for seismic modeling. Furthermore, coping with boundary conditions often requires different computational logic, which complicates efficient exploitation of the thread-level parallelism on GPUs. In this paper, we study high-order stencils and their unique characteristics on GPUs. We manually crafted a collection of CUDA implementations of a 25-point seismic modeling stencil and its associated boundary conditions, and we evaluated their code shapes, memory-hierarchy usage, data-fetching patterns, and other performance attributes. We conducted an empirical evaluation of these stencils using several mature and emerging tools and discuss our quantitative findings. Among our implementations, we achieve twice the performance of a proprietary code developed in C and mapped to GPUs using OpenACC. Additionally, several of our implementations exhibit excellent performance portability.
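For concreteness, one common shape of a 25-point stencil is the eighth-order 3D Laplacian: the center point plus four neighbors on each side along every axis ($1 + 3 \times 8 = 25$). A minimal NumPy reference version on a periodic grid (our sketch, not the paper's CUDA kernels, which tile this computation over thread blocks and stage data through shared memory or registers):

```python
import numpy as np

# Eighth-order central-difference coefficients for the second derivative.
C = (-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0)

def laplacian25(u, h):
    """25-point eighth-order Laplacian on a periodic 3D grid."""
    out = 3.0 * C[0] * u                      # center coefficient, once per axis
    for axis in range(3):
        for k in range(1, 5):                 # four neighbors on each side
            out += C[k] * (np.roll(u,  k, axis=axis)
                         + np.roll(u, -k, axis=axis))
    return out / h**2

N = 32
h = 2.0 * np.pi / N
x = np.arange(N) * h
X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
u = np.sin(X) + np.sin(Y) + np.sin(Z)
err = np.max(np.abs(laplacian25(u, h) - (-u)))   # exact Laplacian is -u here
```

The deep reach of the stencil (radius 4 per axis) is what distinguishes the high-order case: each output needs a much larger halo than a 7-point stencil, which reshapes the cache and shared-memory trade-offs the paper evaluates.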