Opening the Black Box: Performance Estimation during Code Generation for GPUs


Added by Dominik Ernst

Publication date 2021

fields Informatics Engineering

Research language: English

Authors Dominik Ernst

Performance

Abstract

Automatic code generation is frequently used to create implementations of algorithms specifically tuned to particular hardware and application parameters. The code generation process involves the selection of adequate code transformations, tuning parameters, and parallelization strategies. To cover the huge search space, code generation frameworks may apply time-intensive autotuning, exploit scenario-specific performance models, or treat performance as an intangible black box that must be described via machine learning. This paper addresses the selection problem by identifying the relevant performance-defining mechanisms through a performance model coupled with an analytic hardware metric estimator. This enables a quick exploration of large configuration spaces to identify highly efficient candidates with high accuracy. Our current approach targets memory-intensive GPGPU applications and focuses on the correct modeling of data transfer volumes to all levels of the memory hierarchy. We show how our method can be coupled to the pystencils stencil code generator, which is used to generate kernels for a range-four 3D25pt stencil and a complex two-phase fluid solver based on the Lattice Boltzmann Method. For both, it delivers a ranking that can be used to select the best-performing candidate. The method is not limited to stencil kernels, but can be integrated into any code generator that can generate the required address expressions.
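The idea of ranking candidates by modeled data transfer volume can be illustrated with a small sketch. This is not the paper's estimator: it is a simplified toy model that assumes a star-shaped range-four 3D25pt stencil on a row-major grid of 8-byte values, 32-byte memory sectors, and perfect reuse within a thread block (every distinct sector is loaded exactly once). The function names `stencil_offsets` and `load_bytes_per_update`, the block shapes, and the grid size are all hypothetical choices made for illustration.

```python
from itertools import product


def stencil_offsets(radius=4):
    """Offsets of a 3D star stencil: the centre plus `radius` points
    in both directions along each axis (25 points for radius 4)."""
    offsets = {(0, 0, 0)}
    for axis in range(3):
        for dist in range(1, radius + 1):
            for sign in (-1, 1):
                off = [0, 0, 0]
                off[axis] = sign * dist
                offsets.add(tuple(off))
    return sorted(offsets)


def load_bytes_per_update(block, grid, elem_bytes=8, sector_bytes=32, radius=4):
    """Estimate the bytes loaded per lattice update for one thread block.

    Assumption (toy model): each distinct sector touched by the block is
    transferred exactly once, i.e. perfect reuse inside the block.
    """
    bx, by, bz = block
    nx, ny, nz = grid
    # Place the block `radius` cells inside the grid so all accesses are in bounds.
    assert bx + 2 * radius <= nx and by + 2 * radius <= ny and bz + 2 * radius <= nz
    sectors = set()
    for tz, ty, tx in product(range(bz), range(by), range(bx)):
        for dx, dy, dz in stencil_offsets(radius):
            x, y, z = radius + tx + dx, radius + ty + dy, radius + tz + dz
            # Row-major address expression, x is the fastest-running index.
            addr = ((z * ny + y) * nx + x) * elem_bytes
            sectors.add(addr // sector_bytes)
    return len(sectors) * sector_bytes / (bx * by * bz)


if __name__ == "__main__":
    grid = (256, 64, 64)
    candidates = [(128, 1, 1), (32, 2, 2), (1, 1, 1)]
    # Rank candidate thread block shapes by estimated load volume (lower is better).
    for block in sorted(candidates, key=lambda b: load_bytes_per_update(b, grid)):
        print(block, load_bytes_per_update(block, grid))
```

In this toy model, block shapes that extend in several dimensions share more neighbouring loads, so they receive a lower estimated volume per update. A real estimator would also have to account for cache capacities, stores, and the different levels of the memory hierarchy.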


Performance Analysis of CP2K Code for Ab Initio Molecular Dynamics

273 - Dewi Yokelson , Nikolay V. Tkachenko , Robert Robey 2021

Using a realistic molecular catalyst system, we conduct scaling studies of ab initio molecular dynamics simulations using the CP2K code on both Intel Xeon CPU and NVIDIA V100 GPU architectures. We explore using process placement and affinity to gain additional performance improvements. We also use statistical methods to understand performance changes in spite of the variability in runtime for each molecular dynamics timestep. We found that ideal conditions for CPU runs included at least four MPI ranks per node, bound evenly across each socket, and fully utilizing processing cores with one OpenMP thread per core; no benefit was shown from reserving cores for the system. The CPU-only simulations scaled at 70% or more of the ideal scaling up to 10 compute nodes, after which the returns began to diminish more quickly. Simulations on a single 40-core node with two NVIDIA V100 GPUs for acceleration achieved over 3.7x speedup compared to the fastest single 36-core node CPU-only version, and showed a 13% speedup over the fastest time we achieved across five CPU-only nodes.

Performance Distributed Parallel and Cluster Computing

Measuring and comparing the scaling behaviour of a high-performance CFD code on different supercomputing infrastructures

393 - Jérôme Frisch (RWTH Aachen University, Aachen) 2018

Parallel code design is a challenging task, especially when addressing petascale systems for massive parallel processing (MPP), i.e. parallel computations on several hundreds of thousands of cores. An in-house computational fluid dynamics code, developed by our group, was designed for such high-fidelity runs in order to exhibit excellent scalability values. The basis for this code is an adaptive hierarchical data structure together with an efficient communication and (numerical) computation scheme that supports MPP. For a detailed scalability analysis, we performed several experiments on two of Germany's national supercomputers with up to 140,000 processes. In this paper, we show the results of those experiments and discuss any bottlenecks that could be observed while solving engineering-based problems such as porous media flows or thermal comfort assessments for problem sizes up to several hundred billion degrees of freedom.

Performance Computational Physics

Opening the Black Box of Deep Neural Networks in Physical Layer Communication

105 - Jun Liu , Kai Mei , Dongtang Ma 2021

Deep Neural Network (DNN)-based physical layer techniques are attracting considerable interest due to their potential to enhance communication systems. However, most studies in the physical layer have tended to focus on the application of DNN models to wireless communication problems rather than on a theoretical understanding of how a DNN works in a communication system. In this letter, we aim to quantitatively analyse why DNNs can achieve performance comparable to traditional techniques in the physical layer, and their cost in terms of computational complexity. We further investigate and experimentally validate how information flows in a DNN-based communication system using information-theoretic concepts.

Signal Processing Information Theory Machine Learning

Accelerating Sparse Approximate Matrix Multiplication on GPUs

256 - Xiaoyan Liu , Yi Liu , Ming Dun 2021

Although matrix multiplication plays a vital role in computational linear algebra, there are few efficient solutions for matrix multiplication of near-sparse matrices. Sparse Approximate Matrix Multiply (SpAMM) is one of the algorithms that fill the performance gap neglected by traditional optimizations for dense/sparse matrix multiplication. However, existing SpAMM algorithms fail to exploit the performance potential of GPUs for acceleration. In this paper, we present cuSpAMM, the first parallel SpAMM algorithm optimized for multiple GPUs. Several performance optimizations are proposed, including algorithm re-design to adapt to the thread parallelism, blocking strategies for memory access optimization, and acceleration with the tensor core. In addition, we scale cuSpAMM to run on multiple GPUs with an effective load balance scheme. We evaluate cuSpAMM on both synthesized and real-world datasets on multiple GPUs. The experimental results show that cuSpAMM achieves significant performance speedup compared to the vendor-optimized cuBLAS and cuSPARSE libraries.

Performance

Gaia FGK Benchmark stars: Opening the black box of stellar element abundance determination

65 - P. Jofre , U. Heiter , C. C. Worley 2016

Gaia and its complementary spectroscopic surveys combined will yield the most comprehensive database of kinematic and chemical information of stars in the Milky Way. The Gaia FGK benchmark stars play a central role in this matter as they are calibration pillars for the atmospheric parameters and chemical abundances for various surveys. The spectroscopic analyses of the benchmark stars are done by combining different methods, and the results will be affected by the systematic uncertainties inherent in each method. In this paper we explore some of these systematic uncertainties. We determined line abundances of Ca, Cr, Mn and Co for four benchmark stars using six different methods. We changed the default input parameters of the different codes in a systematic way and found in some cases significant differences between the results. Since there is no consensus on the correct values for many of these default parameters, we urge the community to raise discussions towards standard input parameters that could alleviate the difference in abundances obtained by different methods. In this work we provide quantitative estimates of uncertainties in elemental abundances due to the effect of differing technical assumptions in spectrum modelling.

Solar and Stellar Astrophysics Astrophysics of Galaxies