Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis

96 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Markus Wittmann

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Markus Wittmann - Viktor Haag - Thomas Zeiser

الأداء

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Lattice Boltzmann methods (LBM) are an important part of current computational fluid dynamics (CFD). They allow easy implementations and boundary handling. However, competitive time to solution not only depends on the choice of a reasonable method, but also on an efficient implementation on modern hardware. Hence, performance optimization has a long history in the lattice Boltzmann community. A variety of options exists regarding the implementation with direct impact on the solver performance. Experimenting and evaluating each option often is hard as the kernel itself is typically embedded in a larger code base. With our suite of lattice Boltzmann kernels we provide the infrastructure for such endeavors. Already included are several kernels ranging from simple to fully optimized implementations. Although these kernels are not fully functional CFD solvers, they are equipped with a solid verification method. The kernels may act as an reference for performance comparisons and as a blue print for optimization strategies. In this paper we give an overview of already available kernels, establish a performance model for each kernel, and show a comparison of implementations and recent architectures.

قيم البحث

134 - Johannes Habich , Christian Feichtinger , Harald Kostler andn Georg Hager 2011

GPUs offer several times the floating point performance and memory bandwidth of current standard two socket CPU servers, e.g. NVIDIA C2070 vs. Intel Xeon Westmere X5650. The lattice Boltzmann method has been established as a flow solver in recent yea rs and was one of the first flow solvers to be successfully ported and that performs well on GPUs. We demonstrate advanced optimization strategies for a D3Q19 lattice Boltzmann based incompressible flow solver for GPGPUs and CPUs based on NVIDIA CUDA and OpenCL. Since the implemented algorithm is limited by memory bandwidth, we concentrate on improving memory access. Basic data layout issues for optimal data access are explained and discussed. Furthermore, the algorithmic steps are rearranged to improve scattered access of the GPU memory. The importance of occupancy is discussed as well as optimization strategies to improve overall concurrency. We arrive at a well-optimized GPU kernel, which is integrated into a larger framework that can handle single phase fluid flow simulations as well as particle-laden flows. Our 3D LBM GPU implementation reaches up to 650 MLUPS in single precision and 290 MLUPS in double precision on an NVIDIA Tesla C2070.

الأداء

Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations

324 - Markus Wittmann , Georg Hager , Thomas Zeiser 2013

Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip perf ormance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy to solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for lowest energy to solution at minimal performance degradation. Then we extrapolate to the MPI-parallel level and quantify the energy-saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non-negligible. In our setup we could achieve energy savings of 35% in this case, compared to a naive approach. We also demonstrate that a simple non-reflective reduction of the clock speed leaves most of the energy saving potential unused.

الأداء النظم الموزعة والتوازية والحوسبة العنقودية

Software Performance Analysis

84 - Michel R. Dagenais (Dept. of Computer Engineering , Ecolen Polytechnique , Montreal 2005

The key to speeding up applications is often understanding where the elapsed time is spent, and why. This document reviews in depth the full array of performance analysis tools and techniques available on Linux for this task, from the traditional too ls like gcov and gprof, to the more advanced tools still under development like oprofile and the Linux Trace Toolkit. The focus is more on the underlying data collection and processing algorithms, and their overhead and precision, than on the cosmetic details of the graphical user interface frontends.

الأداء أنظمة التشغيل

Breakdown of a Benchmark Score Without Internal Analysis of Benchmarking Program

406 - Naoki Matagawa , Kazuyuki Shudo 2016

A breakdown of a benchmark score is how much each aspect of the system performance affects the score. Existing methods require internal analysis on the benchmarking program and then involve the following problems: (1) require a certain amount of labo r for code analysis, profiling, simulation, and so on and (2) require the benchmarking program itself. In this paper, we present a method for breaking down a benchmark score without internal analysis of the benchmarking program. The method utilizes regression analysis of benchmark scores on a number of systems. Experimental results with 3 benchmarks on 15 Android smartphones showed that our method could break down those benchmark scores even though there is room for improvement in accuracy.

الأداء

Performance Modeling of Streaming Kernels and Sparse Matrix-Vector Multiplication on A64FX

86 - Christie L. Alappat , Jan Laukemann , Thomas Gruber 2020

The A64FX CPU powers the current number one supercomputer on the Top500 list. Although it is a traditional cache-based multicore processor, its peak performance and memory bandwidth rival accelerator devices. Generating efficient code for such a new architecture requires a good understanding of its performance features. Using these features, we construct the Execution-Cache-Memory (ECM) performance model for the A64FX processor in the FX700 supercomputer and validate it using streaming loops. We also identify architectural peculiarities and derive optimization hints. Applying the ECM model to sparse matrix-vector multiplication (SpMV), we motivate why the CRS matrix storage format is inappropriate and how the SELL-C-sigma format with suitable code optimizations can achieve bandwidth saturation for SpMV.

الأداء هندسة العتاد النظم الموزعة والتوازية والحوسبة العنقودية

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

الأكاديمية العربية للعلوم والتكنولوجيا والنقل البحري

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Lattice Boltzmann Benchmark Kernels as a Testbed for Performance Analysis

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً