Do you want to publish a course? Click here

A performance evaluation of CCS QCD Benchmark on the COMA (Intel(R) Xeon Phi$^{TM}$, KNC) system

107   0   0.0 ( 0 )
 Added by Ken-Ichi Ishikawa
 Publication date 2016
  fields Physics
and research's language is English




Ask ChatGPT about the research

The most computationally demanding part of Lattice QCD simulations is solving quark propagators. Quark propagators are typically obtained with a linear equation solver utilizing HPC machines. The CCS QCD Benchmark is a benchmark program solving the Wilson-Clover quark propagator, and is developed at the Center for Computational Sciences (CCS), University of Tsukuba. We optimized the benchmark program for a Intel XeonPhi (Knights Corner, KNC) system named COMA (PACS-IX) at CCS Tsukuba under the Intel Parallel Computing Center program. A single precision BiCGStab solver with the overlapped Restricted Additive Schwarz (RAS) preconditioner was implemented using SIMD intrinsics, OpenMP and MPI in the offload mode. With the reverse-offloading technique, we could reduce the communication and offloading overheads. We observed a performance of $sim 200$ GFlops sustained for the Wilson-Clover hopping matrix multiplication on the lattice sizes larger than $24^3times 32$ on a sinlge card of the COMA system. A good weak scaling perofmace was observed on the local lattice sizes larger than $24^3times 32$.



rate research

Read More

We review our work done to optimize the staggered conjugate gradient (CG) algorithm in the MILC code for use with the Intel Knights Landing (KNL) architecture. KNL is the second gener- ation Intel Xeon Phi processor. It is capable of massive thread parallelism, data parallelism, and high on-board memory bandwidth and is being adopted in supercomputing centers for scientific research. The CG solver consumes the majority of time in production running, so we have spent most of our effort on it. We compare performance of an MPI+OpenMP baseline version of the MILC code with a version incorporating the QPhiX staggered CG solver, for both one-node and multi-node runs.
We give an overview of QPACE 2, which is a custom-designed supercomputer based on Intel Xeon Phi processors, developed in a collaboration of Regensburg University and Eurotech. We give some general recommendations for how to write high-performance code for the Xeon Phi and then discuss our implementation of a domain-decomposition-based solver and present a number of benchmarks.
We investigate implementation of lattice Quantum Chromodynamics (QCD) code on the Intel AVX-512 architecture. The most time consuming part of the numerical simulations of lattice QCD is a solver of linear equation for a large sparse matrix that represents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of AVX-512, such as using intrinsics and manual prefetching, for the matrix multiplication. Based on experience on the Oakforest-PACS system, a large scale cluster composed of Intel Xeon Phi Knights Landing, we discuss the performance tuning exploiting AVX-512 and code design on the SIMD architecture and massively parallel machines. We observe that the same code runs efficiently on an Intel Xeon Skylake-SP machine.
Load balancing is a widely accepted technique for performance optimization of scientific applications on parallel architectures. Indeed, balanced applications do not waste processor cycles on waiting at points of synchronization and data exchange, maximizing this way the utilization of processors. In this paper, we challenge the universality of the load-balancing approach to optimization of the performance of parallel applications. First, we formulate conditions that should be satisfied by the performance profile of an application in order for the application to achieve its best performance via load balancing. Then we use a real-life scientific application, MPDATA, to demonstrate that its performance profile on a modern parallel architecture, Intel Xeon Phi, significantly deviates from these conditions. Based on this observation, we propose a method of performance optimization of scientific applications through load imbalancing. We also propose an algorithm that finds the optimal, possibly imbalanced, configuration of a data parallel application on a set of homogeneous processors. This algorithm uses functional performance models of the application to find the partitioning that minimizes its computation time but not necessarily balances the load of the processors. We show how to apply this algorithm to optimization of MPDATA on Intel Xeon Phi. Experimental results demonstrate that the performance of this carefully optimized load-balanced application can be further improved by 15% using the proposed load-imbalancing optimization.
In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computational and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional way of storing the discretised derivatives to global arrays for solution advancement is found to be inefficient in terms of energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal both with respect to energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2 and an energy saving of ~2 are observed for the high compute intensive algorithms compared to the memory intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed and the GPU has an energy saving of ~5 compared to the same algorithm on a CPU node.
comments
Fetching comments Fetching comments
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا