No Arabic abstract
Computational fluid dynamics (CFD) requires a vast amount of compute cycles on contemporary large-scale parallel computers. Hence, performance optimization is a pivotal activity in this field of computational science. Not only does it reduce the time to solution, but it also allows to minimize the energy consumption. In this work we study performance optimizations for an MPI-parallel lattice Boltzmann-based flow solver that uses a sparse lattice representation with indirect addressing. First we describe how this indirect addressing can be minimized in order to increase the single-core and chip-level performance. Second, the communication overhead is reduced via appropriate partitioning, but maintaining the single core performance improvements. Both optimizations allow to run the solver at an operating point with minimal energy consumption.
GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as accelerators could usually be programmed using specific programming languages threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directive clauses to mark regions of existing C, C++ or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper we address precisely this issue, using as a test-bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also asses the performance impact associated to portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the- art architectures.
We revisit the classical stability versus accuracy dilemma for the lattice Boltzmann methods (LBM). Our goal is a stable method of second-order accuracy for fluid dynamics based on the lattice Bhatnager--Gross--Krook method (LBGK). The LBGK scheme can be recognised as a discrete dynamical system generated by free-flight and entropic involution. In this framework the stability and accuracy analysis are more natural. We find the necessary and sufficient conditions for second-order accurate fluid dynamics modelling. In particular, it is proven that in order to guarantee second-order accuracy the distribution should belong to a distinguished surface -- the invariant film (up to second-order in the time step). This surface is the trajectory of the (quasi)equilibrium distribution surface under free-flight. The main instability mechanisms are identified. The simplest recipes for stabilisation add no artificial dissipation (up to second-order) and provide second-order accuracy of the method. Two other prescriptions add some artificial dissipation locally and prevent the system from loss of positivity and local blow-up. Demonstration of the proposed stable LBGK schemes are provided by the numerical simulation of a 1D shock tube and the unsteady 2D-flow around a square-cylinder up to Reynolds number $mathcal{O}(10000)$.
In this paper, we first present a unified framework for the modelling of generalized lattice Boltzmann method (GLBM). We then conduct a comparison of the four popular analysis methods (Chapman-Enskog analysis, Maxwell iteration, direct Taylor expansion and recurrence equations approaches) that have been used to obtain the macroscopic Navier-Stokes equations and nonlinear convection-diffusion equations from the GLBM, and show that from mathematical point of view, these four analysis methods are equivalent to each other. Finally, we give some elements that are needed in the implementation of the GLBM, and also find that some available LB models can be obtained from this GLBM.
High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid-dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit efficiently the different parallel and vector options of the various accelerators, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.