Gauge Field Generation on Large-Scale GPU-Enabled Systems

Published by: Frank Winter

Publication date: 2012

Research field: Informatics Engineering

Language: English

Author: Frank Winter

High Energy Physics - Lattice; Distributed, Parallel, and Cluster Computing; Computational Physics

Abstract

Over the past years GPUs have been successfully applied to the task of inverting the fermion matrix in lattice QCD calculations. Even strong scaling to capability-level supercomputers, corresponding to O(100) GPUs or more, has been achieved. However, strong scaling a whole gauge field generation algorithm to this regime requires significantly more functionality than just having the matrix inverter utilize the GPUs, and has not yet been accomplished. This contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled parallel systems, to help strong-scale the whole Hybrid Monte Carlo algorithm to this regime. Initial results are shown for gauge field generation with Chroma simulating pure Wilson fermions on OLCF TitanDev.

Deepak Narayanan, Mohammad Shoeybi, Jared Casper (2021)

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at https://github.com/nvidia/megatron-lm.

Computation and Language; Distributed, Parallel, and Cluster Computing

Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters

Tomas Ekeberg, Stefan Engblom (2014)

The classical method of determining the atomic structure of complex molecules by analyzing diffraction patterns is currently undergoing drastic developments. Modern techniques for producing extremely bright and coherent X-ray lasers allow a beam of streaming particles to be intercepted and hit by an ultrashort high energy X-ray beam. Through machine learning methods the data thus collected can be transformed into a three-dimensional volumetric intensity map of the particle itself. The computational complexity associated with this problem is very high such that clusters of data parallel accelerators are required. We have implemented a distributed and highly efficient algorithm for inversion of large collections of diffraction patterns targeting clusters of hundreds of GPUs. With the expected enormous amount of diffraction data to be produced in the foreseeable future, this is the required scale to approach real time processing of data at the beam site. Using both real and synthetic data we look at the scaling properties of the application and discuss the overall computational viability of this exciting and novel imaging technique.

Biomolecules; Distributed, Parallel, and Cluster Computing; Machine Learning

DD-$\alpha$AMG on QPACE 3

Peter Georg, Daniel Richtmann, Tilo Wettig (2017)

We describe our experience porting the Regensburg implementation of the DD-$\alpha$AMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.

High Energy Physics - Lattice; Distributed, Parallel, and Cluster Computing; Computational Physics

Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

Moritz Kreutzer, Georg Hager, Gerhard Wellein (2014)

The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. All optimizations are guided by a performance analysis and modelling process that indicates how the computational bottlenecks change with each optimization step. Finally we use the optimized node-level KPM with a hybrid-parallel framework to perform large scale heterogeneous electronic structure calculations for novel topological materials on a petascale-class Cray XC30 system.

Computational Engineering, Finance, and Science; Mesoscale and Nanoscale Physics; Distributed, Parallel, and Cluster Computing

Parallel implementation of a lattice-gauge-theory code: studying quark confinement on PC clusters

Attilio Cucchieri, Tereza Mendes, Gonzalo Travieso (2003)

We consider the implementation of a parallel Monte Carlo code for high-performance simulations on PC clusters with MPI. We carry out tests of speedup and efficiency. The code is used for numerical simulations of pure SU(2) lattice gauge theory at very large lattice volumes, in order to study the infrared behavior of gluon and ghost propagators. This problem is directly related to the confinement of quarks and gluons in the physics of strong interactions.

High Energy Physics - Lattice; Distributed, Parallel, and Cluster Computing
