Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Gauge Field Generation on Large-Scale GPU-Enabled Systems

516 0 0.0 ( 0 )

Download Cite

Added by Frank Winter

Publication date 2012

fields Informatics Engineering

and research's language is English

Authors Frank Winter

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Over the past years GPUs have been successfully applied to the task of inverting the fermion matrix in lattice QCD calculations. Even strong scaling to capability-level supercomputers, corresponding to O(100) GPUs or more has been achieved. However strong scaling a whole gauge field generation algorithm to this regim requires significantly more functionality than just having the matrix inverter utilizing the GPUs and has not yet been accomplished. This contribution extends QDP-JIT, the migration of SciDAC QDP++ to GPU-enabled parallel systems, to help to strong scale the whole Hybrid Monte-Carlo to this regime. Initial results are shown for gauge field generation with Chroma simulating pure Wilson fermions on OLCF TitanDev.

rate research

Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM

91 - Deepak Narayanan , Mohammad Shoeybi , Jared Casper 2021

Large language models have led to state-of-the-art accuracies across a range of tasks. However, training these models efficiently is challenging for two reasons: a) GPU memory capacity is limited, making it impossible to fit large models on even a multi-GPU server, and b) the number of compute operations required to train these models can result in unrealistically long training times. Consequently, new methods of model parallelism such as tensor and pipeline parallelism have been proposed. Unfortunately, naive usage of these methods leads to fundamental scaling issues at thousands of GPUs, e.g., due to expensive cross-node communication or devices spending significant time waiting on other devices to make progress. In this paper, we show how different types of parallelism methods (tensor, pipeline, and data parallelism) can be composed to scale to thousands of GPUs and models with trillions of parameters. We survey techniques for pipeline parallelism and propose a novel interleaved pipeline parallelism schedule that can improve throughput by 10+% with memory footprint comparable to existing approaches. We quantitatively study the trade-offs between tensor, pipeline, and data parallelism, and provide intuition as to how to configure distributed training of a large model. Our approach allows us to perform training iterations on a model with 1 trillion parameters at 502 petaFLOP/s on 3072 GPUs with achieved per-GPU throughput of 52% of theoretical peak. Our code is open sourced at https://github.com/nvidia/megatron-lm.

Computation and Language Distributed Parallel and Cluster Computing

Machine learning for ultrafast X-ray diffraction patterns on large-scale GPU clusters

510 - Tomas Ekeberg , Stefan Engblom , 2014

The classical method of determining the atomic structure of complex molecules by analyzing diffraction patterns is currently undergoing drastic developments. Modern techniques for producing extremely bright and coherent X-ray lasers allow a beam of streaming particles to be intercepted and hit by an ultrashort high energy X-ray beam. Through machine learning methods the data thus collected can be transformed into a three-dimensional volumetric intensity map of the particle itself. The computational complexity associated with this problem is very high such that clusters of data parallel accelerators are required. We have implemented a distributed and highly efficient algorithm for inversion of large collections of diffraction patterns targeting clusters of hundreds of GPUs. With the expected enormous amount of diffraction data to be produced in the foreseeable future, this is the required scale to approach real time processing of data at the beam site. Using both real and synthetic data we look at the scaling properties of the application and discuss the overall computational viability of this exciting and novel imaging technique.

Biomolecules Distributed Parallel and Cluster Computing Machine Learning

DD-$alpha$AMG on QPACE 3

171 - Peter Georg , Daniel Richtmann , Tilo Wettig 2017

We describe our experience porting the Regensburg implementation of the DD-$alpha$AMG solver from QPACE 2 to QPACE 3. We first review how the code was ported from the first generation Intel Xeon Phi processor (Knights Corner) to its successor (Knights Landing). We then describe the modifications in the communication library necessitated by the switch from InfiniBand to Omni-Path. Finally, we present the performance of the code on a single processor as well as the scaling on many nodes, where in both cases the speedup factor is close to the theoretical expectations.

High Energy Physics - Lattice Distributed Parallel and Cluster Computing Computational Physics

Performance Engineering of the Kernel Polynomial Method on Large-Scale CPU-GPU Systems

441 - Moritz Kreutzer , Georg Hager , Gerhard Wellein 2014

The Kernel Polynomial Method (KPM) is a well-established scheme in quantum physics and quantum chemistry to determine the eigenvalue density and spectral properties of large sparse matrices. In this work we demonstrate the high optimization potential and feasibility of peta-scale heterogeneous CPU-GPU implementations of the KPM. At the node level we show that it is possible to decouple the sparse matrix problem posed by KPM from main memory bandwidth both on CPU and GPU. To alleviate the effects of scattered data access we combine loosely coupled outer iterations with tightly coupled block sparse matrix multiple vector operations, which enables pure data streaming. All optimizations are guided by a performance analysis and modelling process that indicates how the computational bottlenecks change with each optimization step. Finally we use the optimized node-level KPM with a hybrid-parallel framework to perform large scale heterogeneous electronic structure calculations for novel topological materials on a petascale-class Cray XC30 system.

Computational Engineering Mesoscale and Nanoscale Physics Distributed Parallel and Cluster Computing

Parallel implementation of a lattice-gauge-theory code: studying quark confinement on PC clusters

110 - Attilio Cucchieri , Tereza Mendes , Gonzalo Travieso 2003

We consider the implementation of a parallel Monte Carlo code for high-performance simulations on PC clusters with MPI. We carry out tests of speedup and efficiency. The code is used for numerical simulations of pure SU(2) lattice gauge theory at very large lattice volumes, in order to study the infrared behavior of gluon and ghost propagators. This problem is directly related to the confinement of quarks and gluons in the physics of strong interactions.

High Energy Physics - Lattice Distributed Parallel and Cluster Computing

comments

Fetching comments

Sham Private University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Gauge Field Generation on Large-Scale GPU-Enabled Systems

Ask ChatGPT about the research

No Arabic abstract

Read More