Motivated by the computational demands of our research and by budgetary constraints common to many research institutions, we built a ``poor man's supercomputer'', a cluster of PC nodes which together can perform parallel calculations at a fraction of the price of a commercial supercomputer. We describe the construction, cost, and performance of our cluster.
QPACE is a novel parallel computer which has been developed primarily for lattice QCD simulations. The compute power is provided by the IBM PowerXCell 8i processor, an enhanced version of the Cell processor used in the PlayStation 3. The QPACE nodes are interconnected by a custom, application-optimized 3-dimensional torus network implemented on an FPGA. To achieve the very high packaging density of 26 TFlops per rack, a new water-cooling concept has been developed and successfully realized. In this paper we give an overview of the architecture and highlight some important technical details of the system. Furthermore, we provide initial performance results and report on the installation of 8 QPACE racks providing an aggregate peak performance of 200 TFlops.
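To make the torus interconnect concrete, the following minimal sketch (hypothetical code, not QPACE firmware; the coordinate type and function name are illustrative) shows how a node's six nearest neighbours are obtained on a 3-dimensional torus, i.e. with periodic wrap-around in every direction.

// Neighbour lookup on a 3-dimensional torus: one step along an axis,
// wrapping periodically at the boundaries (illustrative sketch only).
#include <cstdio>
#include <initializer_list>

struct Coord { int x, y, z; };

// Neighbour of c one step along axis (0=x, 1=y, 2=z) in direction dir = +1/-1,
// on an Lx * Ly * Lz torus.
Coord torus_neighbour(Coord c, int axis, int dir, int Lx, int Ly, int Lz) {
  int L[3] = {Lx, Ly, Lz};
  int v[3] = {c.x, c.y, c.z};
  v[axis] = (v[axis] + dir + L[axis]) % L[axis];   // periodic wrap-around
  return {v[0], v[1], v[2]};
}

int main() {
  Coord c{0, 0, 0};
  for (int axis = 0; axis < 3; ++axis)
    for (int dir : {+1, -1}) {
      Coord n = torus_neighbour(c, axis, dir, 4, 4, 8);
      std::printf("axis %d dir %+d -> (%d,%d,%d)\n", axis, dir, n.x, n.y, n.z);
    }
  return 0;
}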
In these proceedings we discuss the motivation, implementation details, and performance of a new physics code base called Grid. It is intended to be similar in spirit to QDP++ \cite{QDP}, but more performant and more general. Our approach is to engineer the basic type system to be consistently fast, rather than to bolt on a few optimised routines, and we attempt to write all our optimised routines directly in the Grid framework. It is hoped this will deliver best known practice performance across the next generation of supercomputers, which will pose programming challenges to traditional scalar codes. We illustrate the programming patterns used to implement our goals, and the advances in productivity that have been enabled by using new features in C++11.
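One of the programming patterns underlying data-parallel layers of this kind is the expression template: whole-lattice expressions are captured as types and evaluated in a single fused loop over sites, with no intermediate temporaries. The sketch below is a minimal illustration of that idea only; the names (Lattice, AddOp) and the scalar site type are placeholders, not Grid's actual containers, which carry tensors of SIMD-vectorized types.

// Minimal expression-template sketch: r = x + y is evaluated in one fused
// loop over sites rather than via a temporary lattice.
#include <vector>
#include <cstddef>

template <class E> struct Expr {
  const E& self() const { return static_cast<const E&>(*this); }
};

struct Lattice : Expr<Lattice> {
  std::vector<double> site;
  explicit Lattice(std::size_t n, double v = 0.0) : site(n, v) {}
  double operator[](std::size_t i) const { return site[i]; }
  // Assigning from any expression triggers a single loop over all sites.
  template <class E> Lattice& operator=(const Expr<E>& e) {
    for (std::size_t i = 0; i < site.size(); ++i) site[i] = e.self()[i];
    return *this;
  }
};

template <class A, class B> struct AddOp : Expr<AddOp<A, B>> {
  const A& a; const B& b;
  AddOp(const A& aa, const B& bb) : a(aa), b(bb) {}
  double operator[](std::size_t i) const { return a[i] + b[i]; }
};

template <class A, class B>
AddOp<A, B> operator+(const Expr<A>& a, const Expr<B>& b) {
  return AddOp<A, B>(a.self(), b.self());
}

int main() {
  Lattice x(16, 1.0), y(16, 2.0), r(16);
  r = x + y;   // one pass over the lattice, no temporary created
  return 0;
}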
Our progress in computing the spectrum of excited baryons and mesons in lattice QCD is described. Sets of spatially-extended hadron operators with a variety of different momenta are used. A new method of stochastically estimating the low-lying effects of quark propagation is utilized, which allows reliable determinations of temporal correlations of both single-hadron and multi-hadron operators. The method is tested on the isoscalar mesons in the scalar, pseudoscalar, and vector channels, and on the two-pion system of total isospin $I=0,1,2$.
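The isoscalar channels mentioned above involve quark-line-disconnected diagrams, whose evaluation requires stochastic estimates of traces involving the inverse Dirac operator. The toy sketch below illustrates only the generic noise-vector trace estimator, $\mathrm{Tr}\,A \approx \frac{1}{N}\sum_s \eta_s^\dagger A \eta_s$ with $Z_2$ noise; the explicit matrix is a stand-in so the example is self-contained, and none of this represents the specific dilution scheme of this work, where $M^{-1}\eta$ would be obtained from an iterative solver.

// Stochastic trace estimation with Z2 noise vectors (illustrative only).
#include <vector>
#include <random>
#include <iostream>

int main() {
  const int n = 64, nsrc = 200;
  // Toy matrix A standing in for the operator whose trace is needed.
  std::vector<double> A(n * n, 0.0);
  std::mt19937 rng(12345);
  std::uniform_real_distribution<double> u(-0.05, 0.05);
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
      A[i * n + j] = (i == j ? 1.0 : 0.0) + u(rng);

  std::bernoulli_distribution coin(0.5);
  double est = 0.0;
  for (int s = 0; s < nsrc; ++s) {
    std::vector<double> eta(n);
    for (int i = 0; i < n; ++i) eta[i] = coin(rng) ? 1.0 : -1.0;  // Z2 noise
    // Accumulate eta^T A eta; its average over noise sources estimates Tr(A).
    double acc = 0.0;
    for (int i = 0; i < n; ++i) {
      double Ae = 0.0;
      for (int j = 0; j < n; ++j) Ae += A[i * n + j] * eta[j];
      acc += eta[i] * Ae;
    }
    est += acc;
  }
  std::cout << "stochastic Tr(A) ~ " << est / nsrc << "\n";
  return 0;
}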
We study the performance of QCD simulations with dynamical Wilson fermions by combining the Hybrid Monte Carlo algorithm with parallel tempering on $10^4$ and $12^4$ lattices. In order to compare tempered with standard simulations, covariance matrices between sub-ensembles have to be formulated and evaluated using the general properties of autocorrelations of the parallel tempering algorithm. We find that making the hopping parameter $\kappa$ dynamical does not lead to a substantial improvement. We point out possible reasons for this observation and discuss more suitable ways of applying parallel tempering to QCD.
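For context, the tempering step exchanges gauge configurations between sub-ensembles that differ in $\kappa$, accepting each swap with Metropolis probability $\min(1, e^{-\Delta S})$, where $\Delta S$ is the change in total action under the exchange. The sketch below is a minimal illustration of that swap step with a toy stand-in action, not the simulation code used here; the function and variable names are placeholders.

// Replica-exchange (parallel tempering) swap step, illustrative sketch.
#include <vector>
#include <random>
#include <cmath>
#include <cstddef>
#include <utility>

// Toy stand-in for the full gauge + fermion action of a configuration at a
// given kappa; a real code would evaluate the Wilson-fermion action here.
double action(const std::vector<double>& cfg, double kappa) {
  double s = 0.0;
  for (double x : cfg) s += x * x;
  return kappa * s;
}

void tempering_swaps(std::vector<std::vector<double>>& cfgs,
                     const std::vector<double>& kappas, std::mt19937& rng) {
  std::uniform_real_distribution<double> u(0.0, 1.0);
  for (std::size_t i = 0; i + 1 < cfgs.size(); ++i) {
    // Change in total action if configurations i and i+1 are exchanged.
    double dS = action(cfgs[i + 1], kappas[i]) + action(cfgs[i], kappas[i + 1])
              - action(cfgs[i], kappas[i]) - action(cfgs[i + 1], kappas[i + 1]);
    if (u(rng) < std::exp(-dS))          // Metropolis accept/reject for the swap
      std::swap(cfgs[i], cfgs[i + 1]);   // configurations change sub-ensemble
  }
}

int main() {
  std::mt19937 rng(7);
  std::vector<double> kappas = {0.156, 0.157, 0.158};
  std::vector<std::vector<double>> cfgs(3, std::vector<double>(10, 1.0));
  // ... an HMC trajectory per sub-ensemble would be run here ...
  tempering_swaps(cfgs, kappas, rng);
  return 0;
}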
Computing platforms equipped with accelerators like GPUs have proven to provide great computational power. However, exploiting such platforms for existing scientific applications is not a trivial task. Current GPU programming frameworks such as CUDA C/C++ require low-level programming from the developer in order to achieve high-performance code. As a result, porting of applications to GPUs is typically limited to time-dominant algorithms and routines, leaving the remainder unaccelerated, which can open a serious Amdahl's law issue. The lattice QCD application Chroma allows us to explore a different porting strategy. The layered structure of the software architecture logically separates the data-parallel layer from the application layer. The QCD Data-Parallel software layer provides data types and expressions with stencil-like operations suitable for lattice field theory, and Chroma implements algorithms in terms of this high-level interface. Thus, by porting the low-level layer, one can effectively move the whole application in one step to a different platform. The QDP-JIT/PTX library, a reimplementation of the low-level layer, provides a framework for lattice QCD calculations on the CUDA architecture. The complete software interface is supported, and thus applications can run unaltered on GPU-based parallel computers. This reimplementation was made possible by the availability of a JIT compiler (part of the NVIDIA Linux kernel driver) which translates an assembly-like language (PTX) into GPU code. The expression template technique is used to build PTX code generators, and a software cache manages the GPU memory. This reimplementation allows us to deploy an efficient implementation of the full gauge-generation program with dynamical fermions on large-scale GPU-based machines such as Titan and Blue Waters, accelerating the algorithm by more than an order of magnitude.
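The code-generation idea can be illustrated with a small sketch: instead of evaluating an expression immediately, each expression node emits the instructions that would compute it, and the assembled kernel text is then handed to a JIT compiler. The sketch below is not the QDP-JIT/PTX internals; the node types, the emit() method, and the emitted mnemonics are illustrative pseudo-PTX, and a named helper add() stands in for the overloaded arithmetic operators of the real library.

// Expression templates as code generators: nodes emit kernel text
// (pseudo-PTX) instead of computing values (illustrative sketch only).
#include <string>
#include <iostream>

struct FieldRef {                        // a lattice field bound to a register name
  std::string reg;
  std::string emit(std::string&, int&) const { return reg; }
};

template <class A, class B>
struct AddNode {
  A a; B b;                              // children stored by value
  std::string emit(std::string& out, int& tmp) const {
    std::string ra = a.emit(out, tmp);
    std::string rb = b.emit(out, tmp);
    std::string dst = "%f" + std::to_string(++tmp);
    out += "  add.f64 " + dst + ", " + ra + ", " + rb + ";\n";
    return dst;
  }
};

template <class A, class B>
AddNode<A, B> add(const A& a, const B& b) { return AddNode<A, B>{a, b}; }

int main() {
  FieldRef x{"%fx"}, y{"%fy"}, z{"%fz"};
  auto expr = add(add(x, y), z);         // expression tree encoded in the type
  std::string body; int tmp = 0;
  std::string result = expr.emit(body, tmp);   // generate the kernel body as text
  std::cout << "// result in " << result << "\n" << body;
  // In QDP-JIT/PTX the generated PTX kernel is compiled by the JIT in the
  // NVIDIA driver and launched once per whole-lattice expression.
  return 0;
}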