The development of the A64FX processor by Fujitsu has been a major innovation in vectorized processors and led to Fugaku, currently the world's fastest supercomputer. We use a variety of tools to analyze the behavior and performance of several OpenMP applications built with different compilers, and how these applications scale on the A64FX processors in clusters at Stony Brook University and RIKEN.
The development of the A64FX processor by Fujitsu has driven major innovation in high-performance computing and led to the birth of Fugaku, currently the world's fastest supercomputer. A variety of tools are used to analyze the run times and performance of several applications and, in particular, how these applications scale on the A64FX processor. We examine the performance and behavior of applications through OpenMP scaling, and how their performance differs across compilers, on the new Ookami cluster at Stony Brook University as well as on the Fugaku supercomputer at RIKEN in Japan.
Parallel code design is a challenging task, especially when addressing petascale systems for massive parallel processing (MPP), i.e., parallel computations on several hundreds of thousands of cores. An in-house computational fluid dynamics code, developed by our group, was designed for such high-fidelity runs in order to exhibit excellent scalability. The basis for this code is an adaptive hierarchical data structure together with an efficient communication and (numerical) computation scheme that supports MPP. For a detailed scalability analysis, we performed several experiments on two of Germany's national supercomputers with up to 140,000 processes. In this paper, we show the results of those experiments and discuss the bottlenecks observed while solving engineering-based problems such as porous media flows or thermal comfort assessments for problem sizes of up to several hundred billion degrees of freedom.
The A64FX CPU is arguably the most powerful Arm-based processor design to date. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. A good understanding of its performance features is of paramount importance for developers who wish to leverage its full potential. We present an architectural analysis of the A64FX used in the Fujitsu FX1000 supercomputer at a level of detail that allows for the construction of Execution-Cache-Memory (ECM) performance models for steady-state loops. In the process we identify architectural peculiarities that point to viable generic optimization strategies. After validating the model using simple streaming loops we apply the insight gained to sparse matrix-vector multiplication (SpMV) and the domain wall (DW) kernel from quantum chromodynamics (QCD). For SpMV we show why the CRS matrix storage format is not a good practical choice on this architecture and how the SELL-C-sigma format can achieve bandwidth saturation. For the DW kernel we provide a cache-reuse analysis and show how an appropriate choice of data layout for complex arrays can realize memory-bandwidth saturation in this case as well. A comparison with state-of-the-art high-end Intel Cascade Lake AP and Nvidia V100 systems puts the capabilities of the A64FX into perspective. We also explore the potential for power optimizations using the tuning knobs provided by the Fugaku system, achieving energy savings of about 31% for SpMV and 18% for DW.
The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format with suitable code optimizations can achieve bandwidth saturation for SpMV.
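The contrast between the CRS and SELL-C-sigma storage formats mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`spmv_crs`, `crs_to_sell`, `spmv_sell`), the nested-list layout, and the padding convention (value 0.0, column 0) are assumptions chosen for readability. The sketch assumes the usual description of SELL-C-sigma: rows are sorted by length within windows of `sigma` rows, grouped into chunks of `C` rows, padded to the longest row of each chunk, and stored column-major within the chunk so that `C` rows advance in lockstep.

```python
def spmv_crs(row_ptr, col_idx, values, x):
    """Compute y = A @ x with A stored in CRS (compressed row storage)."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # The inner trip count differs from row to row; short rows are
        # commonly cited as a poor fit for wide SIMD units (assumption here).
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def crs_to_sell(row_ptr, col_idx, values, C, sigma):
    """Convert CRS data into a list of SELL-C-sigma-style chunks."""
    n = len(row_ptr) - 1
    lengths = [row_ptr[r + 1] - row_ptr[r] for r in range(n)]
    # Sort rows by descending length, but only within windows of sigma rows.
    perm = []
    for start in range(0, n, sigma):
        window = list(range(start, min(start + sigma, n)))
        window.sort(key=lambda r: -lengths[r])
        perm.extend(window)
    chunks = []
    for start in range(0, n, C):
        rows = perm[start:start + C]
        width = max(lengths[r] for r in rows)
        # cols[j][i] / vals[j][i]: j-th nonzero of the i-th row in the chunk.
        # Unused slots keep the padding entries (column 0, value 0.0).
        cols = [[0] * C for _ in range(width)]
        vals = [[0.0] * C for _ in range(width)]
        for i, r in enumerate(rows):
            for j in range(lengths[r]):
                k = row_ptr[r] + j
                cols[j][i] = col_idx[k]
                vals[j][i] = values[k]
        chunks.append((rows, cols, vals))
    return chunks


def spmv_sell(chunks, x, n_rows, C):
    """Compute y = A @ x from the chunked layout built by crs_to_sell."""
    y = [0.0] * n_rows
    for rows, cols, vals in chunks:
        acc = [0.0] * C
        for j in range(len(cols)):
            # All C lanes do the same work here; this loop maps to SIMD lanes.
            for i in range(C):
                acc[i] += vals[j][i] * x[cols[j][i]]
        for i, r in enumerate(rows):
            y[r] = acc[i]  # undo the row permutation when writing results
    return y
```

Both kernels should return the same result vector for the same matrix. The chunked version trades extra padding for a regular inner loop, and a larger `sigma` generally reduces that padding at the cost of a wider row permutation.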
This paper describes two basic queueing models of service platforms in the digital sharing economy, based on two different policies for platform matching information. We show that the two queueing models of service platforms can be expressed as level-independent quasi birth-and-death (QBD) processes. Using the proposed QBD processes, we provide a detailed analysis of the two queueing models, including the system stability, the average stationary numbers of seekers and of idle owners, the expected sojourn time of an arriving seeker, and the expected profits for both the service platform and each owner. Finally, numerical examples are used to verify our theoretical results and to demonstrate how the performance measures of service platforms are influenced by some key system parameters. We believe that the methodology and results developed in this paper can not only be applied to develop a broad class of queueing models of service platforms, but will also open a series of promising research directions on performance evaluation, optimal control, and queueing games of service platforms and the digital sharing economy.
Benjamin Michalowicz, Eric Raut, Yan Kang (2021). "Comparing the behavior of OpenMP Implementations with various Applications on two different Fujitsu A64FX platforms".