Do you want to publish a course? Click here

Measuring and comparing the scaling behaviour of a high-performance CFD code on different supercomputing infrastructures

394   0   0.0 ( 0 )
 Added by Ralf-Peter Mundani
 Publication date 2018
and research's language is English




Ask ChatGPT about the research

Parallel code design is a challenging task especially when addressing petascale systems for massive parallel processing (MPP), i.e. parallel computations on several hundreds of thousands of cores. An in-house computational fluid dynamics code, developed by our group, was designed for such high-fidelity runs in order to exhibit excellent scalability values. Basis for this code is an adaptive hierarchical data structure together with an efficient communication and (numerical) computation scheme that supports MPP. For a detailled scalability analysis, we performed several experiments on two of Germanys national supercomputers up to 140,000 processes. In this paper, we will show the results of those experiments and discuss any bottlenecks that could be observed while solving engineering-based problems such as porous media flows or thermal comfort assessments for problem sizes up to several hundred billion degrees of freedom.

rate research

Read More

This paper investigates the multi-GPU performance of a 3D buoyancy driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions affects the strong scaling performance significantly for the GPU. Without proper performance optimizations, it is shown that 1D domain decomposition scales poorly on multiple GPUs due to the noncontiguous memory access. The performance using whatever decompositions can be benefited from a series of performance optimizations in the paper. Since the buoyancy driven cavity code is latency-bounded on the clusters examined, a series of optimizations both agnostic and tailored to the platforms are designed to reduce the latency cost and improve memory throughput between hosts and devices efficiently. First, the parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different data based on the stencil sizes for different variables further reduces the communication overhead. These two optimizations are general enough to be beneficial to stencil computations having ghost changes on all of the clusters tested. Third, GPUDirect is used to improve the communication on clusters which have the hardware and software support for direct communication between GPUs without staging CPUs memory. Finally, overlapping the communication and computations is shown to be not efficient on multi-GPUs if only using MPI or MPI+OpenACC. Although we believe our implementation has revealed enough overlap, the actual running does not utilize the overlap well due to a lack of asynchronous progression.
The Python package fluidsim is introduced in this article as an extensible framework for Computational Fluid Mechanics (CFD) solvers. It is developed as a part of FluidDyn project (Augier et al., 2018), an effort to promote open-source and open-science collaboration within fluid mechanics community and intended for both educational as well as research purposes. Solvers in fluidsim are scalable, High-Performance Computing (HPC) codes which are powered under the hood by the rich, scientific Python ecosystem and the Application Programming Interfaces (API) provided by fluiddyn and fluidfft packages (Mohanan et al., 2018). The present article describes the design aspects of fluidsim, viz. use of Python as the main language; focus on the ease of use, reuse and maintenance of the code without compromising performance. The implementation details including optimization methods, modular organization of features and object-oriented approach of using classes to implement solvers are also briefly explained. Currently, fluidsim includes solvers for a variety of physical problems using different numerical methods (including finite-difference methods). However, this metapaper shall dwell only on the implementation and performance of its pseudo-spectral solvers, in particular the two- and three-dimensional Navier-Stokes solvers. We investigate the performance and scalability of fluidsim in a state of the art HPC cluster. Three similar pseudo-spectral CFD codes based on Python (Dedalus, SpectralDNS) and Fortran (NS3D) are presented and qualitatively and quantitatively compared to fluidsim. The source code is hosted at Bitbucket as a Mercurial repository bitbucket.org/fluiddyn/fluidsim and the documentation generated using Sphinx can be read online at fluidsim.readthedocs.io.
The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku: the current worlds fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.
Using a realistic molecular catalyst system, we conduct scaling studies of ab initio molecular dynamics simulations using the CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures. We explore using process placement and affinity to gain additional performance improvements. We also use statistical methods to understand performance changes in spite of the variability in runtime for each molecular dynamics timestep. We found ideal conditions for CPU runs included at least four MPI ranks per node, bound evenly across each socket, and fully utilizing processing cores with one OpenMP thread per core, no benefit was shown from reserving cores for the system. The CPU-only simulations scaled at 70% or more of the ideal scaling up to 10 compute nodes, after which the returns began to diminish more quickly. Simulations on a single 40-core node with two NVIDIA V100 GPUs for acceleration achieved over 3.7x speedup compared to the fastest single 36-core node CPU-only version, and showed 13% speedup over the fastest time we achieved across five CPU-only nodes.
The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details varies between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to make reliable predictions of how well a numerical method would perform on a given computer architecture, before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort as it both informs the developer on what kind of optimisations are beneficial, and when the maximum expected performance has been reached and optimisation work should stop. We show how discretization of a wave equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating point performance of a computer with respect to algorithmic choices. A first principles analysis of operational intensity for key time-stepping finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا