ﻻ يوجد ملخص باللغة العربية
GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent years and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.
Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, b
Heterogeneous programming has started becoming the norm in order to achieve better performance by running portions of code on the most appropriate hardware resource. Currently, significant engineering efforts are undertaken in order to enable existin
The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance
Computational fluid dynamics (CFD) requires a vast amount of compute cycles on contemporary large-scale parallel computers. Hence, performance optimization is a pivotal activity in this field of computational science. Not only does it reduce the time
The term performance portability has been informally used in computing to refer to a variety of notions which generally include: 1) the ability to run one application across multiple hardware platforms; and 2) achieving some notional level of perform