sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems



Publication date 2020

Fields Informatics Engineering

Research language English

Authors Steven W. D. Chien, Jonas Nylund, Gabriel Bengtsson

Distributed Parallel and Cluster Computing


Large-scale simulations of plasmas are essential for advancing our understanding of fusion devices, space, and astrophysical systems. Particle-in-Cell (PIC) codes have demonstrated their success in simulating numerous plasma phenomena on HPC systems. Today, flagship supercomputers feature multiple GPUs per compute node to achieve unprecedented computing power at high power efficiency. PIC codes require new algorithm design and implementation to exploit such accelerated platforms. In this work, we design and optimize a three-dimensional implicit PIC code, called sputniPIC, to run on a general multi-GPU compute node. We introduce a particle decomposition data layout, in contrast to the domain decomposition used in CPU-based implementations, so that particle batches can be used to overlap communication and computation on GPUs. sputniPIC also natively supports different precision representations to achieve a speedup on hardware that supports reduced precision. We validate sputniPIC through the well-known GEM challenge and provide a performance analysis. We test sputniPIC on three multi-GPU platforms and report a 200-800x performance improvement over the sputniPIC CPU OpenMP version. We show that reduced precision could further improve performance by 45% to 80% on the three platforms. Because of these performance improvements, on a single node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC simulations that were previously possible only on clusters.
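The abstract does not spell out how particles are split into batches, so the following is only a minimal sketch of the general idea, under stated assumptions: the particle range is divided evenly across GPUs, each share is cut into fixed-size batches, and a double-buffered schedule copies one batch while the previous one is being computed. The names `Batch`, `decompose_particles` and `pipeline_schedule` are hypothetical and are not taken from sputniPIC.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Batch:
    gpu: int    # index of the device this batch is assigned to (assumed layout)
    start: int  # first particle index, inclusive
    stop: int   # last particle index, exclusive


def decompose_particles(n_particles: int, n_gpus: int, batch_size: int) -> List[Batch]:
    """Split particles evenly across GPUs, then cut each share into batches."""
    batches = []
    base, extra = divmod(n_particles, n_gpus)
    offset = 0
    for gpu in range(n_gpus):
        # The first `extra` GPUs take one additional particle each.
        end = offset + base + (1 if gpu < extra else 0)
        for start in range(offset, end, batch_size):
            batches.append(Batch(gpu, start, min(start + batch_size, end)))
        offset = end
    return batches


def pipeline_schedule(n_batches: int) -> List[Tuple[Optional[int], Optional[int]]]:
    """Return (copy, compute) pairs: step t copies batch t while batch t-1 computes."""
    steps = []
    for t in range(n_batches + 1):
        copy = t if t < n_batches else None
        compute = t - 1 if t >= 1 else None
        steps.append((copy, compute))
    return steps


if __name__ == "__main__":
    for b in decompose_particles(10, 2, 3):
        print(b)
    print(pipeline_schedule(3))
```

In a real GPU code the two halves of each schedule step would run concurrently on separate streams; here the schedule is only listed, not executed.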


Multi-GPU Performance Optimization of a CFD Code using OpenACC on Different Platforms

Weicheng Xue, Christopher J. Roy, 2020

This paper investigates the multi-GPU performance of a 3D buoyancy-driven cavity solver using MPI and OpenACC directives on different platforms. The paper shows that decomposing the total problem in different dimensions significantly affects strong-scaling performance on the GPU. Without proper performance optimizations, 1D domain decomposition is shown to scale poorly on multiple GPUs because of noncontiguous memory access. Whatever the decomposition, performance benefits from the series of optimizations presented in the paper. Since the buoyancy-driven cavity code is latency-bound on the clusters examined, a series of optimizations, both platform-agnostic and platform-specific, is designed to reduce the latency cost and improve memory throughput between hosts and devices. First, the parallel message packing/unpacking strategy developed for noncontiguous data movement between hosts and devices improves the overall performance by about a factor of 2. Second, transferring different data based on the stencil sizes of different variables further reduces the communication overhead. These two optimizations are general enough to benefit stencil computations with ghost exchanges on all of the clusters tested. Third, GPUDirect is used to improve communication on clusters that have the hardware and software support for direct communication between GPUs without staging through CPU memory. Finally, overlapping communication and computation is shown to be inefficient on multiple GPUs when using only MPI or MPI+OpenACC. Although we believe our implementation exposes enough overlap, the actual execution does not exploit it well because of a lack of asynchronous progression.
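To illustrate why message packing matters for noncontiguous data, here is a small serial sketch, assuming a 3D field stored in a flat array with x as the fastest-varying index. Under that assumed layout, a plane of constant x is strided in memory, so it is gathered into a contiguous buffer before transfer and scattered back on the receiving side. The function names and the layout are illustrative assumptions; the paper's strategy runs in parallel on the GPU and is not reproduced here.

```python
from typing import List


def flat_index(i: int, j: int, k: int, nx: int, ny: int) -> int:
    """Flat offset of cell (i, j, k) when x varies fastest (assumed layout)."""
    return i + nx * (j + ny * k)


def pack_x_face(field: List[float], nx: int, ny: int, nz: int, i: int) -> List[float]:
    """Gather the strided y-z plane at x-index i into a contiguous buffer."""
    return [field[flat_index(i, j, k, nx, ny)] for k in range(nz) for j in range(ny)]


def unpack_x_face(field: List[float], nx: int, ny: int, nz: int, i: int,
                  buf: List[float]) -> None:
    """Scatter a contiguous buffer back into the y-z plane at x-index i."""
    values = iter(buf)
    for k in range(nz):
        for j in range(ny):
            field[flat_index(i, j, k, nx, ny)] = next(values)


if __name__ == "__main__":
    nx, ny, nz = 3, 2, 2
    sender = [float(v) for v in range(nx * ny * nz)]
    receiver = [0.0] * (nx * ny * nz)
    buf = pack_x_face(sender, nx, ny, nz, i=2)      # last interior plane
    unpack_x_face(receiver, nx, ny, nz, i=0, buf=buf)  # neighbour's ghost plane
    print(buf)
```

One contiguous buffer means a single transfer per face instead of one small transfer per strided element, which is the kind of latency saving the abstract refers to.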

Distributed Parallel and Cluster Computing Performance

SMILEI: a collaborative, open-source, multi-purpose particle-in-cell code for plasma simulation

J. Derouillat, A. Beck, F. Perez, 2017

SMILEI is a collaborative, open-source, object-oriented (C++) particle-in-cell code. To benefit from the latest advances in high-performance computing (HPC), SMILEI is co-developed by both physicists and HPC experts. The code's structure, capabilities, parallelization strategy and performance are discussed. Additional modules (e.g. to treat ionization or collisions), benchmarks and physics highlights are also presented. Multi-purpose and evolving, SMILEI is applied today to a wide range of physics studies, from relativistic laser-plasma interaction to astrophysical plasmas.

Plasma Physics

MGSim + MGMark: A Framework for Multi-GPU System Research

Yifan Sun, Trinayan Baruah, Saiful A. Mojumder, 2018

The rapidly growing popularity and scale of data-parallel workloads demand a corresponding increase in the raw computational power of GPUs (Graphics Processing Units). As single-GPU systems struggle to satisfy the performance demands, multi-GPU systems have begun to dominate the high-performance computing world. The advent of such systems raises a number of design challenges, including the GPU microarchitecture, multi-GPU interconnect fabrics, runtime libraries and associated programming models. The research community currently lacks a publicly available and comprehensive multi-GPU simulation framework and benchmark suite to evaluate multi-GPU system design solutions. In this work, we present MGSim, a cycle-accurate, extensively validated, multi-GPU simulator, based on AMD's Graphics Core Next 3 (GCN3) instruction set architecture. We complement MGSim with MGMark, a suite of multi-GPU workloads that explores multi-GPU collaborative execution patterns. Our simulator is scalable and comes with built-in support for multi-threaded execution to enable fast and efficient simulations. In terms of performance accuracy, MGSim differs by 5.5% on average from actual GPU hardware. We also achieve a $3.5\times$ and a $2.5\times$ average speedup in function emulation and architectural simulation with 4 CPU cores, while delivering the same accuracy as the serial simulation. We illustrate the novel simulation capabilities provided by our simulator through a case study exploring programming models based on a unified multi-GPU system (U-MGPU) and a discrete multi-GPU system (D-MGPU) that both utilize unified memory space and cross-GPU memory access. We evaluate the design implications from our case study, suggesting that D-MGPU is an attractive programming model for future multi-GPU systems.
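As a reading aid for the speedup figures quoted above, parallel efficiency can be taken as speedup divided by core count. The snippet below is a small worked calculation using only the numbers in the abstract (3.5x and 2.5x on 4 cores); the helper name `parallel_efficiency` is introduced here for illustration.

```python
def parallel_efficiency(speedup: float, n_cores: int) -> float:
    """Fraction of ideal linear speedup achieved on n_cores."""
    if n_cores <= 0:
        raise ValueError("n_cores must be positive")
    return speedup / n_cores


if __name__ == "__main__":
    # Figures from the abstract: 4 CPU cores in both cases.
    print(parallel_efficiency(3.5, 4))  # function emulation -> 0.875
    print(parallel_efficiency(2.5, 4))  # architectural simulation -> 0.625
```

So, on this simple measure, function emulation reaches 87.5% of ideal scaling and architectural simulation 62.5%.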

Distributed Parallel and Cluster Computing Hardware Architecture

Multi-Node Multi-GPU Diffeomorphic Image Registration for Large-Scale Imaging Problems

Malte Brunn, Naveen Himthani, George Biros, 2020

We present a Gauss-Newton-Krylov solver for large deformation diffeomorphic image registration. We extend the publicly available CLAIRE library to multi-node multi-GPU (graphics processing unit) systems and introduce novel algorithmic modifications that significantly improve performance. Our contributions comprise (i) a new preconditioner for the reduced-space Gauss-Newton Hessian system, (ii) a highly optimized multi-node multi-GPU implementation exploiting device direct communication for the main computational kernels (interpolation, high-order finite difference operators and fast Fourier transform), and (iii) a comparison with state-of-the-art CPU and GPU implementations. We solve a $256^3$-resolution image registration problem in five seconds on a single NVIDIA Tesla V100, with a performance speedup of 70% compared to the state of the art. In our largest run, we register $2048^3$-resolution images (25 B unknowns; approximately $152\times$ larger than the largest problem solved in state-of-the-art GPU implementations) on 64 nodes with 256 GPUs on TACC's Longhorn system.

Distributed Parallel and Cluster Computing Optimization and Control

Verification of a Fully Implicit Particle-in-Cell Method for the $v_\parallel$ Formalism of Electromagnetic Gyrokinetics in the XGC Code

Benjamin J. Sturdevant, S. Ku, L. Chacon, 2021

A fully implicit particle-in-cell method for handling the $v_\parallel$-formalism of electromagnetic gyrokinetics has been implemented in XGC. By choosing the $v_\parallel$-formalism, we avoid introducing the non-physical skin terms in Ampère's law, which are responsible for the well-known "cancellation problem" in the $p_\parallel$-formalism. The $v_\parallel$-formalism, however, is known to suffer from a numerical instability when explicit time integration schemes are used, due to the appearance of a time derivative in the particle equations of motion from the inductive component of the electric field. Here, using the conventional $\delta f$ scheme, we demonstrate that our implicitly discretized algorithm can provide numerically stable simulation results with accurate dispersive properties. We verify the algorithm using a test case for shear Alfvén wave propagation in addition to a case demonstrating the ITG-KBM transition. The ITG-KBM transition case is compared to results obtained from other $\delta f$ gyrokinetic codes/schemes, whose verification has already been archived in the literature.

Plasma Physics