Porting HEP Parameterized Calorimeter Simulation Code to GPUs

Added by Charles Leggett

Publication date: 2021

Fields: Informatics Engineering

Research language: English

Authors Zhihua Dong - Heather Gray - Charles Leggett

High Energy Physics - Experiment Distributed Parallel and Cluster Computing

High Energy Physics (HEP) experiments, such as those at the Large Hadron Collider (LHC), traditionally consume large amounts of CPU cycles for detector simulation and data analysis, but rarely use compute accelerators such as GPUs. As the LHC is upgraded to allow for higher luminosity, resulting in much higher data rates, relying purely on CPUs may not provide enough computing power to support the simulation and data analysis needs. As a proof of concept, we investigate the feasibility of porting a HEP parameterized calorimeter simulation code to GPUs. We have chosen to use FastCaloSim, the ATLAS fast parameterized calorimeter simulation. While FastCaloSim is fast enough that it does not impose a bottleneck in detector simulation overall, significant speed-ups in the processing of large samples can be achieved from GPU parallelization at both the particle (intra-event) and event levels. This is especially beneficial in the conditions expected at the high-luminosity LHC, where extremely high per-event particle multiplicities will result from the many simultaneous proton-proton collisions. We report our experience with porting FastCaloSim to NVIDIA GPUs using CUDA. A preliminary Kokkos implementation of FastCaloSim for portability to other parallel architectures is also described.
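The two levels of parallelism mentioned in the abstract can be pictured with a toy sketch. This is not FastCaloSim's algorithm: the function `deposit_energy`, the flat cell grid and the sample hits are all hypothetical, and NumPy on a CPU stands in for a GPU kernel. It only shows how hits from many particles and many events can be accumulated in one batched pass, with `np.add.at` playing the role an atomic add would play on a GPU.

```python
import numpy as np

def deposit_energy(event_ids, cell_ids, energies, n_events, n_cells):
    """Sum hit energies into an (n_events, n_cells) grid in one batched pass."""
    grid = np.zeros((n_events, n_cells))
    # np.add.at accumulates correctly when several hits land in the same
    # (event, cell) slot, which a plain fancy-indexed += would not do.
    np.add.at(grid, (event_ids, cell_ids), energies)
    return grid

# Hypothetical toy data: 5 hits spread over 2 events and 4 cells.
event_ids = np.array([0, 0, 0, 1, 1])
cell_ids = np.array([2, 2, 3, 0, 2])
energies = np.array([1.0, 0.5, 2.0, 4.0, 0.25])

grid = deposit_energy(event_ids, cell_ids, energies, n_events=2, n_cells=4)
```

Batching over the hit index corresponds loosely to particle-level (intra-event) parallelism, and carrying an event index alongside each hit corresponds to event-level parallelism.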

Preparing Ginkgo for AMD GPUs -- A Testimonial on Porting CUDA Code to HIP

53 - Yuhsiang M. Tsai 2020

With AMD reinforcing their ambition in the scientific high performance computing ecosystem, we extend the hardware scope of the Ginkgo linear algebra package to feature a HIP backend for AMD GPUs. In this paper, we report and discuss the porting effort from CUDA, the extension of the HIP framework to add missing features such as cooperative groups, the performance price of compiling HIP code for AMD architectures, and the design of a library providing native backends for NVIDIA and AMD GPUs while minimizing code duplication by using a shared code base.

Mathematical Software

Porting WarpX to GPU-accelerated platforms

65 - A. Myers , A. Almgren , L. D. Amorim 2021

WarpX is a general-purpose electromagnetic particle-in-cell code that was originally designed to run on many-core CPU architectures. We describe the strategy followed to allow WarpX to use the GPU-accelerated nodes on OLCF's Summit supercomputer, a strategy we believe will extend to the upcoming machines Frontier and Aurora. We summarize the challenges encountered and the lessons learned, and give current performance results on a series of relevant benchmark problems.

Computational Physics Distributed Parallel and Cluster Computing Accelerator Physics

Accelerating QDP++ using GPUs

346 - Frank Winter 2011

Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture for controlling and making use of the compute power of GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain. QDP++ is a C++ vector class library suited for quantum field theory; it provides vector data types and expressions and forms the basis of the lattice QCD software suite Chroma. In this work, offloading QDP++ expression evaluation to a GPU was successfully implemented by leveraging the ET technique and using Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API for CUDA kernel arguments were used to build the bridge between the host and device memory domains. This makes it possible to accelerate on a GPU those Chroma routines that are typically not subject to special optimisation. As an application example, a smearing routine was accelerated to execute on a GPU. A significant speed-up compared to normal CPU execution was measured.

High Energy Physics - Lattice Distributed Parallel and Cluster Computing Programming Languages

Accelerating QDP++/Chroma on GPUs

478 - Frank Winter 2011

Extensions to the C++ implementation of the QCD Data Parallel Interface are provided, enabling acceleration of expression evaluation on NVIDIA GPUs. Single expressions are off-loaded to the device memory and execution domain, leveraging the Portable Expression Template Engine and Just-in-Time compilation techniques. Memory management is automated by a software cache that controls the GPU's memory. Interoperability with existing Krylov space solvers is demonstrated, and special attention is paid to Chroma readiness. Non-kernel routines in lattice QCD calculations, which are typically not subject to hand-tuned optimisation, are accelerated, which can reduce the effects otherwise suffered from Amdahl's Law.
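The reference to Amdahl's Law can be made concrete with a short sketch. The formula is the standard one: if a fraction p of the runtime is accelerated by a factor s, the overall speed-up is 1 / ((1 - p) + p / s). The function name and the sample fractions below are illustrative assumptions, not figures from the paper.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speed-up when `accelerated_fraction` of the runtime runs `factor` times faster."""
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must lie in [0, 1]")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    serial_part = 1.0 - accelerated_fraction
    return 1.0 / (serial_part + accelerated_fraction / factor)

# Illustrative numbers only: accelerating just the solver kernels (say 80% of
# the runtime) versus also accelerating non-kernel routines (say 95% in total),
# both by an assumed factor of 10.
kernels_only = amdahl_speedup(0.80, 10.0)
with_non_kernel = amdahl_speedup(0.95, 10.0)
```

Under these assumed numbers the overall speed-up rises from roughly 3.6 to roughly 6.9, which is why accelerating routines outside the hand-tuned kernels matters.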

High Energy Physics - Lattice Distributed Parallel and Cluster Computing

Porting DDalphaAMG solver to K computer

54 - Ken-Ichi Ishikawa , Issaku Kanamori 2018

We port the Domain-Decomposed-alpha-AMG (DDalphaAMG) solver to the K computer. Each node has 8 cores and 16 GB of memory, with a theoretical peak of 128 GFlops (82,944 nodes in total). Its features, as many as 256 registers per core and a byte/Flop ratio as large as 0.5, require different tuning from other machines. To use more registers, we change some of the data structures and rewrite matrix-vector operations with intrinsics. Performance improves by more than a factor of two for twelve solves including the setup. The efficiency after optimization is still about 5%, lower than the 22% of a previously tuned mixed-precision solver for the K computer. The throughput is, however, more than two times better for a physical point configuration.
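The hardware figures quoted in this abstract imply a few derived quantities, worked out in the sketch below. It assumes that "efficiency" means sustained performance divided by theoretical peak, and that the byte/Flop ratio means memory bandwidth divided by peak floating-point rate; both are common conventions, but the abstract does not define them. The variable and function names are illustrative.

```python
# Figures quoted in the abstract.
PEAK_GFLOPS_PER_NODE = 128.0
CORES_PER_NODE = 8
NODES = 82_944
BYTES_PER_FLOP = 0.5

def sustained_gflops(efficiency, peak_gflops=PEAK_GFLOPS_PER_NODE):
    """Sustained per-node rate, assuming efficiency = sustained / peak."""
    return efficiency * peak_gflops

peak_per_core_gflops = PEAK_GFLOPS_PER_NODE / CORES_PER_NODE   # per-core peak
system_peak_pflops = PEAK_GFLOPS_PER_NODE * NODES / 1.0e6      # full-machine peak
# Assumed reading of the byte/Flop ratio: bandwidth = ratio * peak rate.
implied_bandwidth_gb_per_s = BYTES_PER_FLOP * PEAK_GFLOPS_PER_NODE

ddalphaamg_gflops = sustained_gflops(0.05)        # about 5% efficiency
mixed_precision_gflops = sustained_gflops(0.22)   # 22% efficiency
```

Under these assumptions, the per-core peak is 16 GFlops, the full machine peaks at about 10.6 PFlops, the implied memory bandwidth is 64 GB/s per node, and the two efficiencies correspond to roughly 6.4 and 28 GFlops sustained per node.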

High Energy Physics - Lattice Computational Physics
