ترغب بنشر مسار تعليمي؟ اضغط هنا

Implementation of the conjugate gradient algorithm on FPGA devices

68   0   0.0 ( 0 )
 نشر من قبل Piotr Korcyl
 تاريخ النشر 2018
  مجال البحث فيزياء
والبحث باللغة English




اسأل ChatGPT حول البحث

Results of porting parts of the Lattice Quantum Chromodynamics code to modern FPGA devices are presented. A single-node, double precision implementation of the Conjugate Gradient algorithm is used to invert numerically the Dirac-Wilson operator on a 4-dimensional grid on a Xilinx Zynq evaluation board. The code is divided into two software/hardware parts in such a way that the entire multiplication by the Dirac operator is performed in programmable logic, and the rest of the algorithm runs on the ARM cores. Optimized data blocks are used to efficiently use data movement infrastructure allowing to reach intervals of 1 clock cycle. We show that the FPGA implementation can offer a comparable performance compared to that obtained using Intel Xeon Phi KNL.



قيم البحث

اقرأ أيضاً

We review our work done to optimize the staggered conjugate gradient (CG) algorithm in the MILC code for use with the Intel Knights Landing (KNL) architecture. KNL is the second gener- ation Intel Xeon Phi processor. It is capable of massive thread p arallelism, data parallelism, and high on-board memory bandwidth and is being adopted in supercomputing centers for scientific research. The CG solver consumes the majority of time in production running, so we have spent most of our effort on it. We compare performance of an MPI+OpenMP baseline version of the MILC code with a version incorporating the QPhiX staggered CG solver, for both one-node and multi-node runs.
We investigate implementation of lattice Quantum Chromodynamics (QCD) code on the Intel AVX-512 architecture. The most time consuming part of the numerical simulations of lattice QCD is a solver of linear equation for a large sparse matrix that repre sents the strong interaction among quarks. To establish widely applicable prescriptions, we examine rather general methods for the SIMD architecture of AVX-512, such as using intrinsics and manual prefetching, for the matrix multiplication. Based on experience on the Oakforest-PACS system, a large scale cluster composed of Intel Xeon Phi Knights Landing, we discuss the performance tuning exploiting AVX-512 and code design on the SIMD architecture and massively parallel machines. We observe that the same code runs efficiently on an Intel Xeon Skylake-SP machine.
139 - Alex Bombrun 2011
The ESA space astrometry mission Gaia, planned to be launched in 2013, has been designed to make angular measurements on a global scale with micro-arcsecond accuracy. A key component of the data processing for Gaia is the astrometric core solution, w hich must implement an efficient and accurate numerical algorithm to solve the resulting, extremely large least-squares problem. The Astrometric Global Iterative Solution (AGIS) is a framework that allows to implement a range of different iterative solution schemes suitable for a scanning astrometric satellite. In order to find a computationally efficient and numerically accurate iteration scheme for the astrometric solution, compatible with the AGIS framework, we study an adaptation of the classical conjugate gradient (CG) algorithm, and compare it to the so-called simple iteration (SI) scheme that was previously known to converge for this problem, although very slowly. The different schemes are implemented within a software test bed for AGIS known as AGISLab, which allows to define, simulate and study scaled astrometric core solutions. After successful testing in AGISLab, the CG scheme has been implemented also in AGIS. The two algorithms CG and SI eventually converge to identical solutions, to within the numerical noise (of the order of 0.00001 micro-arcsec). These solutions are independent of the starting values (initial star catalogue), and we conclude that they are equivalent to a rigorous least-squares estimation of the astrometric parameters. The CG scheme converges up to a factor four faster than SI in the tested cases, and in particular spatially correlated truncation errors are much more efficiently damped out with the CG scheme.
We consider three mathematically equivalent variants of the conjugate gradient (CG) algorithm and how they perform in finite precision arithmetic. It was shown in [{em Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences}, Lin.~A lg.~Appl., 113 (1989), pp.~7-63] that under certain conditions the convergence of a slightly perturbed CG computation is like that of exact CG for a matrix with many eigenvalues distributed throughout tiny intervals about the eigenvalues of the given matrix, the size of the intervals being determined by how closely these conditions are satisfied. We determine to what extent each of these variants satisfies the desired conditions, using a set of test problems and show that there is significant correlation between how well these conditions are satisfied and how well the finite precision computation converges before reaching its ultimately attainable accuracy. We show that for problems where the width of the intervals containing the eigenvalues of the associated exact CG matrix makes a significant difference in the behavior of exact CG, the different CG variants behave differently in finite precision arithmetic. For problems where the interval width makes little difference or where the convergence of exact CG is essentially governed by the upper bound based on the square root of the condition number of the matrix, the different CG variants converge similarly in finite precision arithmetic until the ultimate level of accuracy is achieved, although this ultimate level of accuracy may be different for the different variants. This points to the need for testing new CG variants on problems that are especially sensitive to rounding errors.
In this paper we describe a single-node, double precision Field Programmable Gate Array (FPGA) implementation of the Conjugate Gradient algorithm in the context of Lattice Quantum Chromodynamics. As a benchmark of our proposal we invert numerically t he Dirac-Wilson operator on a 4-dimensional grid on three Xilinx hardware solutions: Zynq Ultrascale+ evaluation board, the Alveo U250 accelerator and the largest device available on the market, the VU13P device. In our implementation we separate software/hardware parts in such a way that the entire multiplication by the Dirac operator is performed in hardware, and the rest of the algorithm runs on the host. We find out that the FPGA implementation can offer a performance comparable with that obtained using current CPU or Intels many core Xeon Phi accelerators. A possible multiple node FPGA-based system is discussed and we argue that power-efficient High Performance Computing (HPC) systems can be implemented using FPGA devices only.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا