
Understanding performance variability in standard and pipelined parallel Krylov solvers

Published by Matthew Knepley
Publication date: 2021
Research field: Informatics Engineering
Language: English





In this work, we collect data from runs of Krylov subspace methods and pipelined Krylov algorithms in an effort to understand and model the impact of machine noise and other sources of variability on performance. We find large variability in Krylov iteration times between compute nodes for standard methods that is reduced in pipelined algorithms, directly supporting earlier conjecture, as well as large variation between the statistical distributions of runtimes across iterations. Based on these results, we improve upon a previously introduced nondeterministic performance model by allowing per-iteration behavior to fluctuate over time. We present our data from runs of various Krylov algorithms across multiple platforms, as well as our updated non-stationary model, which provides good agreement with observations. We also suggest how it can be used as a predictive tool.
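To make the modeling idea concrete, the following is a minimal sketch (in Python, with hypothetical parameters; not the authors' actual model) of a non-stationary runtime simulation: each node's per-iteration time is deterministic work plus noise whose scale drifts over the run, the blocking global reduction in a standard method makes every iteration pay the slowest node's time, and pipelining is idealized as hiding the reduction behind local work.

```python
# Toy non-stationary noise model for parallel Krylov iterations.
# All parameters are hypothetical; this only illustrates the idea.
import numpy as np

rng = np.random.default_rng(0)

def iteration_times(n_nodes, n_iters, t_compute=1.0):
    """Per-node, per-iteration times: fixed work plus exponential noise
    whose scale drifts across the run (the non-stationary part)."""
    scale = 0.02 + 0.05 * (1.0 + np.sin(np.linspace(0.0, 3.0 * np.pi, n_iters)))
    return t_compute + rng.exponential(scale[None, :], size=(n_nodes, n_iters))

def standard_runtime(times, t_reduce=0.1):
    # A blocking all-reduce each iteration: every node waits for the
    # slowest one, so each iteration costs the max across nodes.
    return np.sum(times.max(axis=0) + t_reduce)

def pipelined_runtime(times, t_reduce=0.1):
    # Idealized pipelining: the reduction overlaps local work, so only
    # the part not hidden by computation is paid.
    return np.sum(np.maximum(times.max(axis=0), t_reduce))

times = iteration_times(n_nodes=64, n_iters=200)
print("standard :", standard_runtime(times))
print("pipelined:", pipelined_runtime(times))
```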


Read also

Exascale computing will feature novel and potentially disruptive hardware architectures. Exploiting these to their full potential is non-trivial. Numerical modelling frameworks involving finite difference methods are currently limited by the static nature of their hand-coded discretisation schemes and may repeatedly have to be re-written to run efficiently on new hardware. In contrast, OpenSBLI uses code generation to derive the model's code from a high-level specification. Users focus on the equations to solve, without concerning themselves with the detailed implementation. Source-to-source translation is used to tailor the code and enable its execution on a variety of hardware.
Important computational physics problems are often large-scale in nature, and it is highly desirable to have robust and high-performing computational frameworks that can quickly address them. However, it is no trivial task to determine whether a computational framework is performing efficiently or is scalable. The aim of this paper is to present various strategies for better understanding the performance of any parallel computational framework for solving PDEs. Important performance issues that negatively impact time-to-solution are discussed, and we propose a performance spectrum analysis that can enhance one's understanding of these critical performance issues. As proof of concept, we examine commonly used finite element simulation packages and software and apply the performance spectrum to quickly analyze performance and scalability across various hardware platforms, software implementations, and numerical discretizations. It is shown that the proposed performance spectrum is a versatile performance model that is not only extendable to more complex PDEs, such as the hydrostatic ice sheet flow equations, but also useful for understanding hardware performance in a massively parallel computing environment. Potential applications and future extensions of this work are also discussed.
The parallel strong-scaling of Krylov iterative methods is largely determined by the number of global reductions required at each iteration. The GMRES and Krylov-Schur algorithms employ the Arnoldi algorithm for nonsymmetric matrices. The underlying orthogonalization scheme is left-looking and processes one column at a time; thus, at least one global reduction is required per iteration. The traditional algorithm for generating the orthogonal Krylov basis vectors for the Krylov-Schur algorithm is classical Gram-Schmidt applied twice with reorthogonalization (CGS2), requiring three global reductions per step. A new variant of CGS2 that requires only one reduction per iteration is applied to the Arnoldi-QR iteration. Strong-scaling results are presented for finding eigenvalue pairs of nonsymmetric matrices. A preliminary attempt to derive a similar algorithm (one reduction per Arnoldi iteration with a robust orthogonalization scheme) was presented by Hernandez et al. (2007). Unlike our approach, their method is not forward stable for eigenvalues.
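For reference, here is a short sketch of the textbook CGS2 kernel, the three-reduction baseline mentioned above (the one-reduction variant itself is not reproduced here). In a distributed setting, each product with Q^T and the final norm require a global reduction, which is where the three reductions per step come from.

```python
# Illustrative classical Gram-Schmidt with reorthogonalization (CGS2);
# the textbook three-reduction form, not the paper's one-reduction variant.
import numpy as np

def cgs2(Q, w):
    """Orthogonalize w against the orthonormal columns of Q, twice."""
    h = Q.T @ w                 # first projection     (global reduction 1)
    w = w - Q @ h
    c = Q.T @ w                 # reorthogonalization  (global reduction 2)
    w = w - Q @ c
    beta = np.linalg.norm(w)    # normalization        (global reduction 3)
    return w / beta, h + c, beta

# Tiny usage example on a random orthonormal basis.
rng = np.random.default_rng(1)
Q, _ = np.linalg.qr(rng.standard_normal((100, 5)))
q, h, beta = cgs2(Q, rng.standard_normal(100))
print(np.abs(Q.T @ q).max())    # ~1e-16: q is orthogonal to range(Q)
```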
We present a simple mathematical framework and API for parallel mesh and data distribution, load balancing, and overlap generation. It relies on viewing the mesh as a Hasse diagram, abstracting away information such as cell shape, dimension, and coordinates. The high level of abstraction makes our interface both concise and powerful, as the same algorithm applies to any representable mesh, such as hybrid meshes, meshes embedded in a higher dimension, and overlapped meshes in parallel. We present evidence, both theoretical and experimental, that the algorithms are scalable and efficient. A working implementation can be found in the latest release of the PETSc libraries.
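To illustrate the abstraction, here is a toy sketch (hypothetical Python names, mirroring the idea behind PETSc's DMPlex rather than its actual API): every mesh entity, whether cell, edge, or vertex, is just a "point", and the only stored structure is the covering relation between points, from which traversals such as support and closure are derived without reference to shape or dimension.

```python
# Two triangles sharing an edge, encoded purely as a Hasse diagram:
# cells 0-1, edges 2-6, vertices 7-10. No shapes or coordinates stored.
cone = {
    0: [2, 3, 4],                       # cell 0 is covered by edges 2, 3, 4
    1: [4, 5, 6],                       # cell 1 shares edge 4 with cell 0
    2: [7, 8], 3: [8, 9], 4: [9, 7],    # edges are covered by vertices
    5: [9, 10], 6: [10, 7],
}

def support(p):
    """Inverse of cone: all points whose cone contains p."""
    return [q for q, covered in cone.items() if p in covered]

def closure(p, seen=None):
    """Transitive closure downward: p plus everything beneath it."""
    seen = set() if seen is None else seen
    if p not in seen:
        seen.add(p)
        for q in cone.get(p, []):
            closure(q, seen)
    return seen

print(support(4))            # [0, 1]: edge 4 bounds both cells
print(sorted(closure(0)))    # [0, 2, 3, 4, 7, 8, 9]: cell, edges, vertices
```

The same two maps can describe a tetrahedral, hexahedral, or hybrid mesh; only the diagram's contents change, which is what makes dimension-independent algorithms possible.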
Shiqing Fan, Yi Rong, Chen Meng (2020)
It is a challenging task to train large DNN models on sophisticated GPU platforms with diversified interconnect capabilities. Recently, pipelined training has been proposed as an effective approach for improving device utilization. However, several tricky issues remain to be addressed: improving computing efficiency while ensuring convergence, and reducing memory usage without incurring additional computing costs. We propose DAPPLE, a synchronous training framework which combines data parallelism and pipeline parallelism for large DNN models. It features a novel parallelization strategy planner to solve the partition and placement problems, and explores the optimal hybrid strategy of data and pipeline parallelism. We also propose a new runtime scheduling algorithm to reduce device memory usage, which is orthogonal to the re-computation approach and does not come at the expense of training throughput. Experiments show that the DAPPLE planner consistently outperforms strategies generated by PipeDream's planner by up to 3.23x under synchronous training scenarios, and the DAPPLE runtime outperforms GPipe with a 1.6x speedup in training throughput while reducing memory consumption by 12%.
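As a rough illustration of why pipelined training helps (a toy schedule count, not DAPPLE's planner or runtime): splitting a model into stages and streaming microbatches through them overlaps the stages' work, so a simple fill-and-drain schedule finishes in stages + microbatches − 1 steps rather than stages × microbatches.

```python
# Toy fill-and-drain pipeline schedule (illustrative; hypothetical sizes).
stages, microbatches = 4, 8

# Step at which stage s begins microbatch m in the simplest schedule.
for s in range(stages):
    starts = [m + s for m in range(microbatches)]
    print(f"stage {s} starts microbatches at steps {starts}")

print("pipelined steps :", stages + microbatches - 1)   # 11
print("sequential steps:", stages * microbatches)       # 32
```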