Measuring and comparing the scaling behaviour of a high-performance CFD code on different supercomputing infrastructures

Added by Ralf-Peter Mundani

Publication date 2018

fields Informatics Engineering Physics

Research language: English

Authors Jérôme Frisch RWTH Aachen University - Aachen

Performance Computational Physics

Parallel code design is a challenging task, especially when addressing petascale systems for massively parallel processing (MPP), i.e. parallel computations on several hundreds of thousands of cores. An in-house computational fluid dynamics code, developed by our group, was designed for such high-fidelity runs in order to exhibit excellent scalability values. The basis for this code is an adaptive hierarchical data structure together with an efficient communication and (numerical) computation scheme that supports MPP. For a detailed scalability analysis, we performed several experiments on two of Germany's national supercomputers with up to 140,000 processes. In this paper, we show the results of those experiments and discuss the bottlenecks that could be observed while solving engineering-based problems such as porous media flows or thermal comfort assessments for problem sizes up to several hundred billion degrees of freedom.
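Scaling experiments of the kind described in the abstract are commonly summarized as speedup and parallel efficiency relative to the smallest run. The following is a minimal sketch of that bookkeeping, not the authors' analysis; the function name `strong_scaling_summary` and all timings are hypothetical and do not come from the paper.

```python
def strong_scaling_summary(timings):
    """Compute speedup and parallel efficiency for a strong scaling series.

    timings: dict mapping process count -> wall-clock time (same problem size).
    Returns a dict mapping process count -> (speedup, efficiency), both
    measured relative to the run with the fewest processes.
    """
    base_procs = min(timings)
    base_time = timings[base_procs]
    summary = {}
    for procs in sorted(timings):
        speedup = base_time / timings[procs]
        # Ideal speedup equals the growth in process count.
        ideal = procs / base_procs
        summary[procs] = (speedup, speedup / ideal)
    return summary


# Hypothetical wall-clock times in seconds (illustrative only).
example_timings = {1: 100.0, 2: 50.0, 4: 25.0, 8: 20.0}
summary = strong_scaling_summary(example_timings)
for procs, (speedup, efficiency) in summary.items():
    print(f"{procs:>3} processes: speedup {speedup:.2f}, efficiency {efficiency:.1%}")
```

An efficiency that drops as the process count grows, as in the last hypothetical entry, is the usual sign of a communication or load-balancing bottleneck.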

Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms

Weicheng Xue, Christopher J. Roy, 2020

This paper investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions significantly affects strong scaling performance on the GPU. Without proper performance optimizations, 1D domain decomposition is shown to scale poorly on multiple GPUs due to noncontiguous memory access. Whatever the decomposition, performance benefits from the series of optimizations presented in the paper. Since the buoyancy-driven cavity code is latency-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-tailored, are designed to reduce the latency cost and improve memory throughput between hosts and devices. First, the parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves overall performance by about a factor of 2. Second, transferring different amounts of data based on the stencil sizes of different variables further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations with ghost-cell exchanges on all of the clusters tested. Third, GPUDirect is used to improve communication on clusters that have the hardware and software support for direct communication between GPUs without staging through CPU memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes enough overlap, the actual runs do not exploit it well due to a lack of asynchronous progression.
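The packing/unpacking idea mentioned in the abstract can be illustrated on the CPU: a halo face that is strided in memory is gathered into one contiguous buffer before it is sent, and scattered into the ghost layer on arrival. The sketch below is a plain-Python illustration under assumed conventions (row-major storage, a face of constant `k`), not the paper's OpenACC implementation; all function names are hypothetical.

```python
def flat_index(i, j, k, ny, nz):
    """Row-major offset of element (i, j, k) in a flat nx*ny*nz array."""
    return (i * ny + j) * nz + k


def pack_k_face(field, nx, ny, nz, k):
    """Gather the face of constant k (stride nz in memory) into a contiguous list."""
    return [field[flat_index(i, j, k, ny, nz)]
            for i in range(nx) for j in range(ny)]


def unpack_k_face(field, buffer, nx, ny, nz, k):
    """Scatter a contiguous buffer back into the face of constant k, in place."""
    values = iter(buffer)
    for i in range(nx):
        for j in range(ny):
            field[flat_index(i, j, k, ny, nz)] = next(values)


# Tiny illustrative grid: the sender's last k-layer fills the receiver's first.
nx, ny, nz = 2, 2, 3
sender = list(range(nx * ny * nz))
receiver = [0] * (nx * ny * nz)

send_buffer = pack_k_face(sender, nx, ny, nz, k=nz - 1)  # one contiguous message
unpack_k_face(receiver, send_buffer, nx, ny, nz, k=0)
print(send_buffer)
print(receiver)
```

Sending one packed buffer replaces many small strided transfers with a single message, which matters most when the code is latency-bound.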

Distributed Parallel and Cluster Computing Performance

FluidSim: modular, object-oriented Python package for high-performance CFD simulations

Ashwin Vishnu Mohanan, Cyrille Bonamy, Miguel Calpe Linares, 2018

The Python package fluidsim is introduced in this article as an extensible framework for Computational Fluid Dynamics (CFD) solvers. It is developed as part of the FluidDyn project (Augier et al., 2018), an effort to promote open-source and open-science collaboration within the fluid mechanics community, and is intended for both educational and research purposes. Solvers in fluidsim are scalable, High-Performance Computing (HPC) codes which are powered under the hood by the rich scientific Python ecosystem and the Application Programming Interfaces (APIs) provided by the fluiddyn and fluidfft packages (Mohanan et al., 2018). The present article describes the design aspects of fluidsim, viz. the use of Python as the main language and a focus on ease of use, reuse and maintenance of the code without compromising performance. The implementation details, including optimization methods, the modular organization of features and the object-oriented approach of using classes to implement solvers, are also briefly explained. Currently, fluidsim includes solvers for a variety of physical problems using different numerical methods (including finite-difference methods). However, this metapaper dwells only on the implementation and performance of its pseudo-spectral solvers, in particular the two- and three-dimensional Navier-Stokes solvers. We investigate the performance and scalability of fluidsim on a state-of-the-art HPC cluster. Three similar pseudo-spectral CFD codes based on Python (Dedalus, SpectralDNS) and Fortran (NS3D) are presented and compared qualitatively and quantitatively to fluidsim. The source code is hosted at Bitbucket as a Mercurial repository, bitbucket.org/fluiddyn/fluidsim, and the documentation generated using Sphinx can be read online at fluidsim.readthedocs.io.

Computational Engineering Computational Physics Fluid Dynamics

Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms

Benjamin Michalowicz, Eric Raut, Yan Kang, 2021

The development of the A64FX processor by Fujitsu has been a massive innovation in vectorized processors and led to Fugaku, currently the world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications with different compilers, and how these applications scale on the different A64FX processors on clusters at Stony Brook University and RIKEN.

Performance

Performance Analysis of CP2K Code for Ab Initio Molecular Dynamics

Dewi Yokelson, Nikolay V. Tkachenko, Robert Robey, 2021

Using a realistic molecular catalyst system, we conduct scaling studies of ab initio molecular dynamics simulations using the CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures. We explore using process placement and affinity to gain additional performance improvements. We also use statistical methods to understand performance changes in spite of the variability in runtime for each molecular dynamics timestep. We found that ideal conditions for CPU runs included at least four MPI ranks per node, bound evenly across each socket, and fully utilizing processing cores with one OpenMP thread per core; no benefit was shown from reserving cores for the system. The CPU-only simulations scaled at 70% or more of ideal scaling up to 10 compute nodes, after which the returns began to diminish more quickly. Simulations on a single 40-core node with two NVIDIA V100 GPUs for acceleration achieved over 3.7x speedup compared to the fastest single 36-core node CPU-only version, and showed a 13% speedup over the fastest time we achieved across five CPU-only nodes.

Performance Distributed Parallel and Cluster Computing

Performance prediction of finite-difference solvers for different computer architectures

Mathias Louboutin, Michael Lange, Felix Herrmann, 2016

The life-cycle of a partial differential equation (PDE) solver is often characterized by three development phases: the development of a stable numerical discretization, the development of a correct (verified) implementation, and the optimization of the implementation for different computer architectures. Often it is only after significant time and effort has been invested that the performance bottlenecks of a PDE solver are fully understood, and the precise details vary between different computer architectures. One way to mitigate this issue is to establish a reliable performance model that allows a numerical analyst to predict how well a numerical method would perform on a given computer architecture before embarking upon potentially long and expensive implementation and optimization phases. The availability of a reliable performance model also saves developer effort, as it informs the developer both about what kinds of optimizations are beneficial and about when the maximum expected performance has been reached and optimization work should stop. We show how the discretization of a wave equation can be theoretically studied to understand the performance limitations of the method on modern computer architectures. We focus on the roofline model, now broadly used in the high-performance computing community, which considers the achievable performance in terms of the peak memory bandwidth and peak floating-point performance of a computer with respect to algorithmic choices. A first-principles analysis of operational intensity for key time-stepping finite-difference algorithms is presented. With this information available at the time of algorithm design, the expected performance on target computer systems can be used as a driver for algorithm design.
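In its basic form, the roofline model mentioned above bounds attainable performance by the smaller of the peak floating-point rate and the product of operational intensity and peak memory bandwidth. The sketch below shows that bound in a few lines; the function names and the machine and kernel numbers are hypothetical and are not taken from the paper.

```python
def roofline_gflops(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """Attainable performance in GFLOP/s under the basic roofline model.

    operational_intensity: FLOPs performed per byte moved to or from memory.
    peak_gflops: peak floating-point rate of the machine in GFLOP/s.
    peak_bandwidth_gbs: peak memory bandwidth of the machine in GB/s.
    """
    return min(peak_gflops, operational_intensity * peak_bandwidth_gbs)


def ridge_point(peak_gflops, peak_bandwidth_gbs):
    """Operational intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_gflops / peak_bandwidth_gbs


# Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
PEAK_GFLOPS = 1000.0
PEAK_BANDWIDTH_GBS = 100.0

# Hypothetical stencil kernel with 0.5 FLOP per byte of memory traffic.
bound = roofline_gflops(0.5, PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
ridge = ridge_point(PEAK_GFLOPS, PEAK_BANDWIDTH_GBS)
regime = "memory-bound" if 0.5 < ridge else "compute-bound"
print(f"attainable: {bound} GFLOP/s ({regime}), ridge point: {ridge} FLOP/byte")
```

A kernel whose operational intensity lies below the ridge point cannot reach peak floating-point performance no matter how well its arithmetic is optimized, which is why the estimate is useful at algorithm design time.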

Performance
