
Scaling Lattice QCD beyond 100 GPUs

Posted by Ronald Babich
Publication date: 2011
Research field: Physics
Paper language: English





Over the past five years, graphics processing units (GPUs) have had a transformational effect on numerical lattice quantum chromodynamics (LQCD) calculations in nuclear and particle physics. While GPUs have been applied with great success to the post-Monte Carlo analysis phase which accounts for a substantial fraction of the workload in a typical LQCD calculation, the initial Monte Carlo gauge field generation phase requires capability-level supercomputing, corresponding to O(100) GPUs or more. Such strong scaling has not been previously achieved. In this contribution, we demonstrate that using a multi-dimensional parallelization strategy and a domain-decomposed preconditioner allows us to scale into this regime. We present results for two popular discretizations of the Dirac operator, Wilson-clover and improved staggered, employing up to 256 GPUs on the Edge cluster at Lawrence Livermore National Laboratory.
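The benefit of the multi-dimensional parallelization strategy can be made concrete with a toy surface-to-volume calculation. The sketch below (Python; the lattice and GPU-grid sizes are illustrative assumptions, not the paper's actual run parameters) counts the halo sites each GPU must exchange per application of the Dirac operator under a time-only versus a four-dimensional decomposition.

```python
def halo_cost(global_dims, grid):
    """For a 4D lattice split across a grid of GPUs, return the local
    volume per GPU and the number of halo (surface) sites that must be
    exchanged each time the Dirac operator is applied.  Partitioning in
    several dimensions keeps the surface-to-volume ratio manageable as
    the GPU count grows toward O(100)."""
    local = [g // p for g, p in zip(global_dims, grid)]
    volume = 1
    for d in local:
        volume *= d
    # Two faces per partitioned dimension; each face has volume/d sites.
    surface = sum(2 * volume // d for d, p in zip(local, grid) if p > 1)
    return volume, surface

# Illustrative 32^3 x 256 lattice on 256 GPUs:
dims = (32, 32, 32, 256)
v_t, s_t = halo_cost(dims, (1, 1, 1, 256))  # slice only in time
v_4, s_4 = halo_cost(dims, (2, 2, 4, 16))   # multi-dimensional split
```

With the time-only split every local site sits on a boundary, while the 4D split exchanges roughly a third as many halo sites for the same local volume.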




Read also

Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high-performance code. As a result, porting of applications to GPUs is typically limited to time-dominant algorithms and routines, leaving the remainder unaccelerated, which can open a serious Amdahl's law issue. The lattice QCD application Chroma allows us to explore a different porting strategy. The layered structure of the software architecture logically separates the data-parallel layer from the application layer. The QCD Data-Parallel software layer provides data types and expressions with stencil-like operations suitable for lattice field theory, and Chroma implements algorithms in terms of this high-level interface. Thus, by porting the low-level layer, one can effectively move the whole application in one swing to a different platform. The QDP-JIT/PTX library, the reimplementation of the low-level layer, provides a framework for lattice QCD calculations for the CUDA architecture. The complete software interface is supported, and thus applications can be run unaltered on GPU-based parallel computers. This reimplementation was possible due to the availability of a JIT compiler (part of the NVIDIA Linux kernel driver) which translates an assembly-like language (PTX) to GPU code. The expression template technique is used to build PTX code generators, and a software cache manages the GPU memory. This reimplementation allows us to deploy an efficient implementation of the full gauge-generation program with dynamical fermions on large-scale GPU-based machines such as Titan and Blue Waters, which accelerates the algorithm by more than an order of magnitude.
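As a rough illustration of the expression-template idea behind QDP-JIT/PTX, the toy sketch below builds an expression tree from overloaded operators and emits a single fused loop for the whole expression. The class names and the Python "kernel" string are purely illustrative stand-ins for the real C++ expression templates and generated PTX; only the technique, not the interface, is taken from the abstract.

```python
class Expr:
    """Toy expression-template sketch: operators build a tree instead of
    evaluating immediately, so a whole statement can later be compiled
    into one fused kernel with no intermediate temporaries."""
    def __add__(self, other): return Op("+", self, other)
    def __mul__(self, other): return Op("*", self, other)

class Field(Expr):
    """A named lattice field; codegen() emits its per-site access."""
    def __init__(self, name): self.name = name
    def codegen(self): return f"{self.name}[i]"

class Op(Expr):
    """An interior node of the expression tree."""
    def __init__(self, sym, l, r): self.sym, self.l, self.r = sym, l, r
    def codegen(self): return f"({self.l.codegen()} {self.sym} {self.r.codegen()})"

def fuse(target, expr):
    # One loop (one GPU kernel) for the entire right-hand side.
    return f"for i in range(V): {target.name}[i] = {expr.codegen()}"

a, x, y = Field("a"), Field("x"), Field("y")
kernel = fuse(y, a * x + y)   # an axpy-like statement, fused into one loop
```

The point is that `a * x + y` never allocates temporaries: evaluation is deferred until `fuse` walks the tree, mirroring how the real library lowers a QDP++ expression to one PTX kernel per statement.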
167 - Huey-Wen Lin 2012
Study of the hadronic matrix elements can provide not only tests of the QCD sector of the Standard Model (in comparing with existing experiments) but also reliable low-energy hadronic quantities applicable to a wide range of beyond-the-Standard Model scenarios where experiments or theoretical calculations are limited or difficult. On the QCD side, progress has been made in the notoriously difficult problem of addressing gluonic structure inside the nucleon, reaching higher-$Q^2$ region of the form factors, and providing a complete picture of the proton spin. However, even further study and improvement of systematic uncertainties are needed. There are also proposed calculations of higher-order operators in the neutron electric dipole moment Lagrangian, which would be useful when combined with effective theory to probe BSM. Lattice isovector tensor and scalar charges can be combined with upcoming neutron beta-decay measurements of the Fierz interference term and neutrino asymmetry parameter to probe new interactions in the effective theory, revealing the scale of potential new TeV particles. Finally, I revisit the systematic uncertainties in recent calculations of $g_A$ and review prospects for future calculations.
Recently Arm introduced a new instruction set called Scalable Vector Extension (SVE), which supports vector lengths up to 2048 bits. While SVE hardware will not be generally available until about 2021, we believe that future SVE-based architectures will have great potential for Lattice QCD. In this contribution we discuss key aspects of SVE and describe how we implemented SVE in the Grid Lattice QCD framework.
110 - Y. Nakamura, G. Schierholz 2018
The axion is a hypothetical elementary particle postulated by the Peccei-Quinn theory to resolve the strong CP problem in QCD. If axions exist and have low mass, they are a candidate for dark matter as well. So far our knowledge of the properties of the QCD axion rests on semi-classical arguments and effective theory. In this work we perform, for the first time, a fully dynamical investigation of the Peccei-Quinn theory, focussing on the axion mass, by simulating the theory on the lattice. The results of the simulation are found to be in conflict with present axion phenomenology.
Modern graphics hardware is designed for highly parallel numerical tasks and promises significant cost and performance benefits for many scientific applications. One such application is lattice quantum chromodynamics (lattice QCD), where the main computational challenge is to efficiently solve the discretized Dirac equation in the presence of an SU(3) gauge field. Using NVIDIA's CUDA platform we have implemented a Wilson-Dirac sparse matrix-vector product that performs at up to 40 Gflops, 135 Gflops and 212 Gflops for double, single and half precision respectively on NVIDIA's GeForce GTX 280 GPU. We have developed a new mixed-precision approach for Krylov solvers using reliable updates, which allows for full double-precision accuracy while using only single- or half-precision arithmetic for the bulk of the computation. The resulting BiCGstab and CG solvers run in excess of 100 Gflops and, in terms of iterations until convergence, perform better than the usual defect-correction approach for mixed precision.
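A minimal sketch of the reliable-updates idea, in Python with NumPy standing in for the GPU: the bulk matrix-vector work runs in single precision, while the solution and a periodically recomputed true residual are kept in double precision. The function name, the update threshold `delta`, and the stopping test are illustrative choices, not the actual solver described in the abstract.

```python
import numpy as np

def cg_reliable(A, b, tol=1e-8, delta=0.1, max_iter=500):
    """Mixed-precision CG sketch with reliable updates: the matvec (the
    bulk of the work) runs in float32, while the solution is accumulated
    in float64 and the true residual is recomputed in float64 whenever
    the iterated residual has shrunk by a factor delta since the last
    update, restoring double-precision accuracy."""
    A32 = A.astype(np.float32)
    x = np.zeros_like(b)                 # solution, kept in float64
    r = b.copy()                         # true residual (float64)
    r32 = r.astype(np.float32)           # iterated residual (float32)
    p = r32.copy()
    rr = float(r32 @ r32)
    r_max = np.sqrt(rr)                  # residual norm at last update
    for k in range(max_iter):
        Ap = A32 @ p                     # low-precision matvec
        alpha = rr / float(p @ Ap)
        x += alpha * p.astype(np.float64)
        r32 -= np.float32(alpha) * Ap
        rr_new = float(r32 @ r32)
        if np.sqrt(rr_new) < delta * r_max:   # reliable-update trigger
            r = b - A @ x                     # recompute in full precision
            r32 = r.astype(np.float32)
            rr_new = float(r32 @ r32)
            r_max = np.sqrt(rr_new)
        if np.sqrt(rr_new) < tol * np.linalg.norm(b):
            break
        beta = rr_new / rr
        p = r32 + np.float32(beta) * p
        rr = rr_new
    return x, k + 1
```

Without the periodic recomputation, rounding in the float32 residual recursion would stall convergence well above double-precision accuracy; the reliable update replaces the drifted residual at little extra cost.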