ﻻ يوجد ملخص باللغة العربية
Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip performance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy to solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for lowest energy to solution at minimal performance degradation. Then we extrapolate to the MPI-parallel level and quantify the energy-saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non-negligible. In our setup we could achieve energy savings of 35% in this case, compared to a naive approach. We also demonstrate that a simple non-reflective reduction of the clock speed leaves most of the energy saving potential unused.
Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, b
This paper has been withdrawn by the author
With the advent of increasingly complex hardware in real-time embedded systems (processors with performance enhancing features such as pipelines, cache hierarchy, multiple cores), many processors now have a set-associative L2 cache. Thus, there is a
The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance
More and more massive parallel codes running on several hundreds of thousands of cores enter the computational science and engineering domain, allowing high-fidelity computations on up to trillions of unknowns for very detailed analyses of the underl