Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture for controlling and exploiting the compute power of GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain. QDP++ is a C++ vector class library suited for quantum field theory; it provides vector data types and expressions and forms the basis of the lattice QCD software suite Chroma. In this work, QDP++ expression evaluation was successfully accelerated on a GPU by leveraging the ET technique and Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API for CUDA kernel arguments were used to bridge the host and device memory domains. This makes it possible to accelerate on a GPU those Chroma routines that are typically not subject to special optimisation. As an application example, a smearing routine was accelerated to execute on a GPU, and a significant speed-up over normal CPU execution was measured.
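For readers unfamiliar with the ET technique, the following is a minimal host-only sketch of the general idea: arithmetic operators build a lightweight expression tree at compile time, and the tree is evaluated in a single loop on assignment. It is not QDP++ or PETE code. The names `Expr`, `Vec`, `Add`, `Mul` and `demo` are hypothetical and chosen for illustration only, and the sketch does not attempt the JIT compilation or GPU off-loading described in the abstract.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CRTP base class: tags a type as a node of an expression tree.
// (Hypothetical name; not part of QDP++ or PETE.)
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Leaf node: a vector that owns its data. Assumed stand-in for a field type.
class Vec : public Expr<Vec> {
public:
    explicit Vec(std::size_t n, double v = 0.0) : data_(n, v) {}
    double operator[](std::size_t i) const { return data_[i]; }
    double& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return data_.size(); }

    // Assigning an expression walks the whole tree once per element,
    // so no temporary vectors are created for intermediate results.
    template <typename E>
    Vec& operator=(const Expr<E>& e) {
        const E& expr = e.self();
        assert(expr.size() == size());
        for (std::size_t i = 0; i < size(); ++i) data_[i] = expr[i];
        return *this;
    }

private:
    std::vector<double> data_;
};

// Interior node: element-wise sum of two sub-expressions.
// Operands are held by reference, so a node must be consumed within the
// full expression that created it.
template <typename L, typename R>
class Add : public Expr<Add<L, R>> {
public:
    Add(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] + r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    const L& l_;
    const R& r_;
};

// Interior node: element-wise product of two sub-expressions.
template <typename L, typename R>
class Mul : public Expr<Mul<L, R>> {
public:
    Mul(const L& l, const R& r) : l_(l), r_(r) {}
    double operator[](std::size_t i) const { return l_[i] * r_[i]; }
    std::size_t size() const { return l_.size(); }
private:
    const L& l_;
    const R& r_;
};

// Operators only build tree nodes; no arithmetic happens here.
template <typename L, typename R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Add<L, R>(l.self(), r.self());
}

template <typename L, typename R>
Mul<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return Mul<L, R>(l.self(), r.self());
}

// Usage: the right-hand side has type Add<Vec, Mul<Vec, Vec>> and is
// evaluated in one loop inside Vec::operator=.
double demo() {
    Vec a(4, 1.0), b(4, 2.0), c(4, 3.0), r(4);
    r = a + b * c;
    return r[0];
}
```

Because the full structure of the expression is encoded in its type, a framework can in principle inspect that type and generate specialised evaluation code for it, which is the property the abstract relies on when it combines ET with JIT compilation.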
Extensions to the C++ implementation of the QCD Data Parallel Interface are provided enabling acceleration of expression evaluation on NVIDIA GPUs. Single expressions are off-loaded to the device memory and execution domain leveraging the Portable Ex
We accelerate many-flavor lattice QCD simulations using multiple GPUs. Multiple pseudo-fermion fields are introduced additively and independently for each flavor in the many-flavor HMC algorithm. Using the independence of each pseudo-fermion field an
The priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known applications such as Dijkstra's shortest path algorithm, Prim's minimum spanning tree, Huffman encoding, and the branch-and-bound algorithm. Howeve
Stencil computations are widely used in HPC applications. Today, many HPC platforms use GPUs as accelerators. As a result, understanding how to perform stencil computations fast on GPUs is important. While implementation strategies for low-order sten
A number of stochastic methods developed for the calculation of fermion loops are investigated and compared, in particular with respect to their efficiency when implemented on Graphics Processing Units (GPUs). We assess the performance of the various