أوراق بحثية, رسائل ماجستير ودكتوراه منشورة من قبل Thomas Zeiser

Short Note on Costs of Floating Point Operations on current x86-64 Architectures: Denormals, Overflow, Underflow, and Division by Zero

64 - Markus Wittmann , Thomas Zeiser , Georg Hager 2015

Simple floating point operations like addition or multiplication on normalized floating point values can be computed by current AMD and Intel processors in three to five cycles. This is different for denormalized numbers, which appear when an underfl ow occurs and the value can no longer be represented as a normalized floating-point value. Here the costs are about two magnitudes higher.

الأداء

Building a fault tolerant application using the GASPI communication layer

89 - Faisal Shahzad , Moritz Kreutzer , Thomas Zeiser 2015

It is commonly agreed that highly parallel software on Exascale computers will suffer from many more runtime failures due to the decreasing trend in the mean time to failures (MTTF). Therefore, it is not surprising that a lot of research is going on in the area of fault tolerance and fault mitigation. Applications should survive a failure and/or be able to recover with minimal cost. MPI is not yet very mature in handling failures, the User-Level Failure Mitigation (ULFM) proposal being currently the most promising approach is still in its prototype phase. In our work we use GASPI, which is a relatively new communication library based on the PGAS model. It provides the missing features to allow the design of fault-tolerant applications. Instead of introducing algorithm-based fault tolerance in its true sense, we demonstrate how we can build on (existing) clever checkpointing and extend applications to allow integrate a low cost fault detection mechanism and, if necessary, recover the application on the fly. The aspects of process management, the restoration of groups and the recovery mechanism is presented in detail. We use a sparse matrix vector multiplication based application to perform the analysis of the overhead introduced by such modifications. Our fault detection mechanism causes no overhead in failure-free cases, whereas in case of failure(s), the failure detection and recovery cost is of reasonably acceptable order and shows good scalability.

النظم الموزعة والتوازية والحوسبة العنقودية

Chip-level and multi-node analysis of energy-optimized lattice-Boltzmann CFD simulations

271 - Markus Wittmann , Georg Hager , Thomas Zeiser 2013

Memory-bound algorithms show complex performance and energy consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) on an Intel Sandy Bridge cluster as a prototype scenario to investigate if and how single-chip perf ormance and power characteristics can be generalized to the highly parallel case. First we perform an analysis of a sparse-lattice LBM implementation for complex geometries. Using a single-core performance model, we predict the intra-chip saturation characteristics and the optimal operating point in terms of energy to solution as a function of implementation details, clock frequency, vectorization, and number of active cores per chip. We show that high single-core performance and a correct choice of the number of active cores per chip are the essential optimizations for lowest energy to solution at minimal performance degradation. Then we extrapolate to the MPI-parallel level and quantify the energy-saving potential of various optimizations and execution modes, where we find these guidelines to be even more important, especially when communication overhead is non-negligible. In our setup we could achieve energy savings of 35% in this case, compared to a naive approach. We also demonstrate that a simple non-reflective reduction of the clock speed leaves most of the energy saving potential unused.

الأداء النظم الموزعة والتوازية والحوسبة العنقودية

Asynchronous MPI for the Masses

179 - Markus Wittmann , Georg Hager , Thomas Zeiser 2013

We present a simple library which equips MPI implementations with truly asynchronous non-blocking point-to-point operations, and which is independent of the underlying communication infrastructure. It utilizes the MPI profiling interface (PMPI) and t he MPI_THREAD_MULTIPLE thread compatibility level, and works with curre

النظم الموزعة والتوازية والحوسبة العنقودية الأداء

Domain decomposition and locality optimization for large-scale lattice Boltzmann simulations

73 - Markus Wittmann , Thomas Zeiser , Georg Hager 2011

We present a simple, parallel and distributed algorithm for setting up and partitioning a sparse representation of a regular discretized simulation domain. This method is scalable for a large number of processes even for complex geometries and ensure s load balance between the domains, reasonable communication interfaces, and good data locality within the domain. Applying this scheme to a list-based lattice Boltzmann flow solver can achieve similar or even higher flow solver performance than widely used standard graph partition based tools such as METIS and PT-SCOTCH.

النظم الموزعة والتوازية والحوسبة العنقودية

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد