AVX-512 extension to OpenQCD 1.6

Added by Jarno Rantaharju

Publication date 2018

fields Informatics Engineering

Research language: English

Authors Ed Bennett - Mark Dawson - Michele Mesiti

High Energy Physics - Lattice Distributed Parallel and Cluster Computing

Abstract

We publish an extension of openQCD-1.6 with AVX-512 vector instructions using Intel intrinsics. Recent Intel processors support extended instruction sets with operations on 512-bit wide vectors, increasing both the capacity for floating point operations and register memory. Optimal use of the new capabilities requires reorganising data and floating point operations into these wider vector units. We report on the implementation and performance of the AVX-512 OpenQCD extension on clusters using Intel Knights Landing and Xeon Scalable (Skylake) CPUs. In complete HMC trajectories with physically relevant parameters we observe a performance increase of 5% to 10%.
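As a rough illustration of what reorganising data for 512-bit vectors can mean, the sketch below updates a complex vector, y += a * x, with real and imaginary parts held in separate arrays so that eight doubles fill one register. It is not taken from openQCD: the function name `caxpy_split` and the split layout are assumptions made for this example, and a scalar loop is used when the compiler does not target AVX-512.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

/* Hypothetical helper, not part of openQCD: y += a * x for complex vectors
   whose real and imaginary parts are stored in separate arrays. */
void caxpy_split(size_t n, double a_re, double a_im,
                 const double *x_re, const double *x_im,
                 double *y_re, double *y_im)
{
    size_t i = 0;
    assert(n == 0 || (x_re && x_im && y_re && y_im));
#if defined(__AVX512F__)
    /* Broadcast the scalar coefficient into all eight lanes. */
    const __m512d are = _mm512_set1_pd(a_re);
    const __m512d aim = _mm512_set1_pd(a_im);
    for (; i + 8 <= n; i += 8) {
        __m512d xr = _mm512_loadu_pd(x_re + i);
        __m512d xi = _mm512_loadu_pd(x_im + i);
        __m512d yr = _mm512_loadu_pd(y_re + i);
        __m512d yi = _mm512_loadu_pd(y_im + i);
        yr = _mm512_fmadd_pd(are, xr, yr);   /* yr += a_re * x_re */
        yr = _mm512_fnmadd_pd(aim, xi, yr);  /* yr -= a_im * x_im */
        yi = _mm512_fmadd_pd(are, xi, yi);   /* yi += a_re * x_im */
        yi = _mm512_fmadd_pd(aim, xr, yi);   /* yi += a_im * x_re */
        _mm512_storeu_pd(y_re + i, yr);
        _mm512_storeu_pd(y_im + i, yi);
    }
#endif
    /* Scalar path: handles the remainder, or everything without AVX-512. */
    for (; i < n; ++i) {
        double xr = x_re[i], xi = x_im[i];
        y_re[i] += a_re * xr - a_im * xi;
        y_im[i] += a_re * xi + a_im * xr;
    }
}
```

With GCC or Clang the vector path is compiled in only when AVX-512 is enabled (for example with `-mavx512f`); otherwise the same function runs the scalar loop and gives the same result for this kind of input.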

Practical Implementation of Lattice QCD Simulation on SIMD Machines with Intel AVX-512

Issaku Kanamori, Hideo Matsufuru, 2018

We investigate the implementation of lattice Quantum Chromodynamics (QCD) code on the Intel AVX-512 architecture. The most time-consuming part of numerical simulations of lattice QCD is the solver for a linear equation with a large sparse matrix that represents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of AVX-512, such as using intrinsics and manual prefetching, for the matrix multiplication. Based on experience on the Oakforest-PACS system, a large-scale cluster composed of Intel Xeon Phi Knights Landing processors, we discuss performance tuning that exploits AVX-512 and code design for the SIMD architecture and massively parallel machines. We observe that the same code runs efficiently on an Intel Xeon Skylake-SP machine.

High Energy Physics - Lattice Computational Physics

SIMD Vectorization for the Lennard-Jones Potential with AVX2 and AVX-512 instructions

Hiroshi Watanabe, Koh M. Nakagawa, 2018

This work describes the SIMD vectorization of the force calculation of the Lennard-Jones potential with Intel AVX2 and AVX-512 instruction sets. Since the force-calculation kernel of the molecular dynamics method involves indirect access to memory, the data layout is one of the most important factors in vectorization. We find that the Array of Structures (AoS) with padding exhibits better performance than Structure of Arrays (SoA) with appropriate vectorization and optimizations. In particular, AoS with 512-bit width exhibits the best performance among the architectures. While the difference in performance between AoS and SoA is significant for the vectorization with AVX2, that with AVX-512 is minor. The effect of other optimization techniques, such as software pipelining together with vectorization, is also discussed. We present results for benchmarks on three CPU architectures: Intel Haswell (HSW), Knights Landing (KNL), and Skylake (SKL). The performance gains from vectorization are about 42% on HSW compared with the code optimized without vectorization. On KNL, the hand-vectorized codes exhibit 34% better performance than the codes vectorized automatically by the Intel compiler. On SKL, the code vectorized with AVX2 exhibits slightly better performance than that vectorized with AVX-512.
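The two data layouts compared in this abstract can be sketched in plain C as follows. This is only an illustration under assumed names (`atom_aos`, `atoms_soa`, `dist2_aos`, `dist2_soa`, `lj_force_over_r` are hypothetical, not from the paper), and it uses reduced units with epsilon = sigma = 1. In the padded AoS layout each atom fills a 32-byte slot, so the coordinates of an indirectly addressed neighbour can be fetched with one contiguous load, whereas SoA needs one access per coordinate array.

```c
#include <assert.h>
#include <stddef.h>

/* AoS with padding: one atom occupies 4 doubles (32 bytes), so all of its
   coordinates sit in one contiguous, vector-sized slot. */
typedef struct { double x, y, z, pad; } atom_aos;

/* SoA: each coordinate lives in its own array. */
typedef struct { const double *x, *y, *z; } atoms_soa;

/* Squared distance between atoms i and j, AoS layout. */
double dist2_aos(const atom_aos *atoms, size_t i, size_t j)
{
    double dx = atoms[j].x - atoms[i].x;
    double dy = atoms[j].y - atoms[i].y;
    double dz = atoms[j].z - atoms[i].z;
    return dx * dx + dy * dy + dz * dz;
}

/* Squared distance between atoms i and j, SoA layout. */
double dist2_soa(const atoms_soa *s, size_t i, size_t j)
{
    double dx = s->x[j] - s->x[i];
    double dy = s->y[j] - s->y[i];
    double dz = s->z[j] - s->z[i];
    return dx * dx + dy * dy + dz * dz;
}

/* Lennard-Jones force divided by r, in reduced units (epsilon = sigma = 1):
   F/r = 24 * (2 / r^14 - 1 / r^8), evaluated from r^2 to avoid a sqrt. */
double lj_force_over_r(double r2)
{
    assert(r2 > 0.0);
    double inv2 = 1.0 / r2;
    double inv6 = inv2 * inv2 * inv2;
    return 24.0 * (2.0 * inv6 * inv6 - inv6) * inv2;
}
```

Multiplying `lj_force_over_r(r2)` by the components of the separation vector gives the force vector on the pair; a vectorized kernel would apply the same arithmetic to several neighbours at once.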

Mathematical Software Computational Engineering

An extension to Lüscher's finite volume method above inelastic threshold (formalism)

Noriyoshi Ishii for HAL-QCD Collaboration, 2011

An extension of Lüscher's finite volume method above inelastic thresholds is proposed. It is achieved by extending the procedure recently proposed by the HAL-QCD Collaboration for a single-channel system. Focusing on the asymptotic behavior of the equal-time Nambu-Bethe-Salpeter (NBS) wave functions near spatial infinity, a coupled-channel extension of the effective Schrödinger equation is constructed by introducing an energy-independent interaction kernel. Because the NBS wave functions contain the information of the T-matrix at long distance, the S-matrix can be obtained by solving the coupled-channel effective Schrödinger equation in the infinite volume.

High Energy Physics - Lattice Nuclear Theory

Extension of a new method for locating critical temperatures

P. Sawicki, 1997

We investigate a recently proposed method for locating critical temperatures and introduce some modifications which allow us to formulate an exact criterion for any self-dual model. We apply the modified method to the Ashkin-Teller model and show that the exact result for the critical temperature is reproduced. We also test a two-layer Ising model for the presence of possible self-duality.

High Energy Physics - Lattice

Accelerating QDP++ using GPUs

Frank Winter, 2011

Graphics Processing Units (GPUs) are becoming increasingly important as target architectures in scientific High Performance Computing (HPC). NVIDIA established CUDA as a parallel computing architecture for controlling and making use of the compute power of GPUs. CUDA provides sufficient support for C++ language elements to enable the Expression Template (ET) technique in the device memory domain. QDP++ is a C++ vector class library suited for quantum field theory which provides vector data types and expressions and forms the basis of the lattice QCD software suite Chroma. In this work, QDP++ expression evaluation was successfully accelerated on a GPU by leveraging the ET technique and using Just-In-Time (JIT) compilation. The Portable Expression Template Engine (PETE) and the C API for CUDA kernel arguments were used to build the bridge between host and device memory domains. This makes it possible to accelerate on a GPU those Chroma routines which are typically not subject to special optimisation. As an application example, a smearing routine was accelerated to execute on a GPU. A significant speed-up compared to normal CPU execution was measured.

High Energy Physics - Lattice Distributed Parallel and Cluster Computing Programming Languages
