High-performance computing systems are increasingly based on accelerators. Applications targeting these systems often follow a host-driven approach, in which the host offloads almost all compute-intensive sections of the code onto accelerators; this approach exploits the computational resources of the host CPUs only marginally, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels concurrently and in a balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid dynamics. Our goal is to develop a single program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts that allow the code to efficiently exploit the parallel and vector features of the various accelerators, and that match the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of the workload between host and accelerator, and the best overall performance level achievable. We test the performance of our codes and their scaling properties on HPC clusters incorporating different accelerators: Intel Xeon Phi many-core processors, NVIDIA GPUs and AMD GPUs.
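To give a flavor of the partitioning models referenced above, consider the simplest case of a purely memory-bound LBM kernel: the time on each device is roughly data volume divided by sustained memory bandwidth, so host and accelerator finish together when the accelerator receives a bandwidth-proportional share of the lattice. A minimal sketch in C, with purely hypothetical bandwidth figures (the paper's actual models also cover compute-bound kernels):

    /* Minimal sketch (not the paper's actual model): balance a
     * memory-bound kernel by splitting the lattice in proportion to
     * the sustained memory bandwidth of each device. */
    #include <stdio.h>

    /* b_host, b_acc: sustained memory bandwidths (GB/s); returns the
     * fraction of lattice sites to assign to the accelerator */
    double offload_fraction(double b_host, double b_acc) {
        return b_acc / (b_host + b_acc);
    }

    int main(void) {
        /* hypothetical figures: 100 GB/s host, 500 GB/s accelerator */
        printf("accelerator share: %.2f\n", offload_fraction(100.0, 500.0));
        return 0;
    }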
We present a simple, parallel and distributed algorithm for setting up and partitioning a sparse representation of a regular discretized simulation domain. The method scales to large numbers of processes even for complex geometries, and ensures load balance between the domains, reasonable communication interfaces, and good data locality within each domain. Applied to a list-based lattice Boltzmann flow solver, this scheme achieves similar or even higher solver performance than widely used graph-partitioning tools such as METIS and PT-SCOTCH.
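A minimal sketch of the two basic ingredients, assuming a flag array marking the fluid cells of a regular grid: building the compact fluid-cell list (the sparse representation) and cutting it into near-equal contiguous chunks, one per process. The locality-preserving cell ordering and the set-up of communication interfaces provided by the paper's scheme are omitted here:

    /* Illustrative sketch only: sparse fluid-cell list plus a balanced
     * contiguous partition of that list across processes. */
    #include <stdlib.h>

    /* collect grid indices of fluid cells; sets *nfluid, returns the list */
    long *fluid_list(const char *flag, long ncells, long *nfluid) {
        long *list = malloc(ncells * sizeof *list);
        *nfluid = 0;
        for (long c = 0; c < ncells; c++)
            if (flag[c]) list[(*nfluid)++] = c;
        return list;
    }

    /* index of the first list entry owned by 'rank' out of 'nproc';
     * chunk sizes differ by at most one cell, ensuring load balance */
    long chunk_start(long nfluid, int nproc, int rank) {
        return (long)rank * nfluid / nproc;
    }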
This paper describes a massively parallel code for a state-of-the-art thermal lattice Boltzmann method. Our code has been carefully optimized for performance on a single GPU and for good scaling behavior on large numbers of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and by experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally, we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops, as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
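In the spirit of the performance models mentioned above (though not taken from the paper), a back-of-envelope check tells whether halo communication between nodes can be fully hidden behind bulk computation; all figures below are hypothetical placeholders:

    /* Rough overlap model: a halo exchange can be hidden behind bulk
     * computation when t_bulk >= t_halo.  Every number is a placeholder. */
    #include <stdio.h>

    int main(void) {
        long   bulk = 2048L * 2048L;      /* lattice sites per GPU        */
        long   halo = 2 * 2048L;          /* sites on the two borders     */
        double flop_per_site = 6500.0;    /* assumed cost of one update   */
        double f_node = 1.0e12;           /* sustained Flop/s per GPU     */
        double bytes_per_site = 37 * 8.0; /* e.g. 37 double populations   */
        double b_net = 5.0e9;             /* network bandwidth, B/s       */

        double t_bulk = bulk * flop_per_site / f_node;
        double t_halo = halo * bytes_per_site / b_net;
        printf("bulk %.3g s, halo %.3g s -> %s\n", t_bulk, t_halo,
               t_bulk >= t_halo ? "communication can be hidden"
                                : "communication-bound");
        return 0;
    }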
Hydrodynamic interactions in systems composed of self-propelled particles, such as swimming microorganisms, and passive tracers have a significant impact on the tracer dynamics compared to the equivalent dry sample. However, such interactions are often difficult to take into account in simulations due to their computational cost. Here, we perform a systematic investigation of swimmer-tracer interactions using an efficient force/counter-force based lattice-Boltzmann (LB) algorithm [J. de Graaf et al., J. Chem. Phys. 144, 134106 (2016)] in order to validate its ability to capture the relevant low-Reynolds-number physics. We show that the LB algorithm reproduces far-field theoretical results well, both in a system with periodic boundary conditions and in a spherical cavity with no-slip walls, for which we derive expressions here. The force-lattice coupling of the LB algorithm leads to a smearing out of the flow field, which strongly perturbs the tracer trajectories at close swimmer-tracer separations, and we analyze how this effect can be accurately captured using a simple renormalized hydrodynamic theory. Finally, we show that care must be taken when using LB algorithms to simulate systems of self-propelled particles, since their finite momentum transport time can lead to significant deviations from theoretical predictions based on Stokes flow. These insights should prove relevant to the future study of large-scale microswimmer suspensions using such methods.
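For reference, the far-field behavior discussed above is that of a point force dipole, the leading-order swimmer singularity in unbounded Stokes flow; the textbook expression reads

    \[
      \mathbf{u}(\mathbf{r}) \simeq \frac{\kappa}{8\pi\mu\, r^{2}}
      \left[\, 3\,(\hat{\mathbf{r}}\cdot\hat{\mathbf{e}})^{2} - 1 \,\right] \hat{\mathbf{r}},
    \]

with $\kappa$ the dipole strength, $\hat{\mathbf{e}}$ the swimming direction, $\mu$ the dynamic viscosity and $\hat{\mathbf{r}} = \mathbf{r}/r$. The corresponding expressions under periodic boundary conditions and in a spherical no-slip cavity, which the paper derives, are not reproduced here.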
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only through specific languages, threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directives that mark regions of existing C, C++ or Fortran code to be run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the costs of this portable approach in terms of computing efficiency. In this paper we address precisely this issue, using as a test bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm written in CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
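As an illustration of the directive-based approach described above (a sketch, not the paper's production kernel), the following C fragment offloads a streaming step for a small D2Q9 lattice with an OpenACC pragma; when the pragma is ignored by a non-OpenACC compiler, the same code runs serially on the host:

    /* Illustrative OpenACC sketch: one propagate (streaming) step.
     * The D2Q9 velocity set and lattice sizes are hypothetical choices. */
    #include <stdlib.h>

    #define NX   64
    #define NY   64
    #define NPOP 9

    /* SoA-style layout, one contiguous plane per population: favours
     * coalesced accesses on GPUs and vectorization on CPUs */
    static inline size_t idx(int x, int y, int p) {
        return ((size_t)p * NX + (size_t)x) * NY + (size_t)y;
    }

    void propagate(const double *restrict f_src, double *restrict f_dst) {
        const int cx[NPOP] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
        const int cy[NPOP] = {0, 0, 1, 0, -1, 1, 1, -1, -1};
        #pragma acc parallel loop collapse(2) \
            copyin(f_src[0:NPOP*NX*NY]) copy(f_dst[0:NPOP*NX*NY])
        for (int x = 1; x < NX - 1; x++)
            for (int y = 1; y < NY - 1; y++)
                for (int p = 0; p < NPOP; p++)
                    f_dst[idx(x, y, p)] = f_src[idx(x - cx[p], y - cy[p], p)];
    }

    int main(void) {
        double *a = calloc(NPOP * NX * NY, sizeof *a);
        double *b = calloc(NPOP * NX * NY, sizeof *b);
        propagate(a, b);   /* one streaming step; collide step omitted */
        free(a); free(b);
        return 0;
    }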
We study numerically the effect of thermal fluctuations and of variable fluid-substrate interactions on the spontaneous dewetting of thin liquid films. To this aim, we use a recently developed lattice Boltzmann method for thin liquid film flows, equipped with a properly devised stochastic term. While it is known that thermal fluctuations yield shorter rupture times, we show that this is a general feature of hydrophilic substrates, irrespective of the contact angle $\theta$. The ratio between deterministic and stochastic rupture times, though, decreases with $\theta$. Finally, we discuss the case of fluctuating thin film dewetting on chemically patterned substrates and its dependence on the form of the wettability gradients.
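The LB scheme referenced above targets thin-film hydrodynamics in the lubrication regime; the fluctuating thin-film equation, in the form commonly written in the literature (not copied from this paper), reads

    \[
      \partial_t h = \nabla \cdot \big( M(h)\, \nabla p \big)
                   + \nabla \cdot \Big( \sqrt{2 k_B T\, M(h)}\; \boldsymbol{\mathcal{N}} \Big),
      \qquad M(h) = \frac{h^{3}}{3\mu},
      \qquad p = -\sigma \nabla^{2} h - \Pi(h),
    \]

where $h$ is the film height, $\boldsymbol{\mathcal{N}}$ a spatio-temporal Gaussian white noise, and $\Pi(h)$ the disjoining pressure, whose strength sets the contact angle $\theta$ and, on patterned substrates, the wettability gradients.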