
A Note on Parallel Algorithmic Speedup Bounds

Published by: Neil J. Gunther
Publication date: 2011
Research field: Informatics engineering
Research language: English
Author: Neil J. Gunther





A parallel program can be represented as a directed acyclic graph. An important performance bound is the time to execute the critical path through the graph. We show how this performance metric is related to Amdahl speedup and the degree of average parallelism. These bounds formally exclude superlinear performance.
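A minimal sketch (not taken from the paper) of the quantities involved: for a weighted task DAG, the work T_1 is the sum of task times, the span T_inf is the critical-path time, and the speedup is bounded by min(p, T_1/T_inf), i.e. by both the processor count and the average parallelism, which rules out superlinear speedup. The graph and durations below are made up for illustration.

```python
# Minimal sketch (not from the paper): work/span bounds for a task DAG.
# Node weights are task times; edges are precedence constraints.
# The graph and durations are made-up illustration values.

from functools import lru_cache

tasks = {"a": 1.0, "b": 3.0, "c": 2.0, "d": 4.0, "e": 1.0}   # task -> duration
preds = {"a": [], "b": ["a"], "c": ["a"], "d": ["b", "c"], "e": ["d"]}

@lru_cache(maxsize=None)
def finish(t):
    """Earliest finish time of task t with unlimited processors."""
    return tasks[t] + max((finish(p) for p in preds[t]), default=0.0)

work = sum(tasks.values())              # T_1: total sequential time
span = max(finish(t) for t in tasks)    # T_inf: critical-path time
avg_parallelism = work / span           # pi = T_1 / T_inf

for p in (1, 2, 4, 8):
    lower_bound_Tp = max(work / p, span)        # T_p >= max(T_1/p, T_inf)
    speedup_bound = min(p, avg_parallelism)     # S(p) <= min(p, pi): never superlinear
    print(f"p={p}: T_p >= {lower_bound_Tp:.2f}, S(p) <= {speedup_bound:.2f}")
```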




Read also

We use activity networks (task graphs) to model parallel programs and consider series-parallel extensions of these networks. Our motivation is two-fold: the benefits of series-parallel activity networks and the modelling of programming constructs, such as those imposed by current parallel computing environments. Series-parallelisation adds precedence constraints to an activity network, usually increasing its makespan (execution time). The slowdown ratio describes how additional constraints affect the makespan. We disprove an existing conjecture positing a bound of two on the slowdown when workload is not considered. Where workload is known, we conjecture that 4/3 slowdown is always achievable, and prove our conjecture for small networks using max-plus algebra. We analyse a polynomial-time algorithm showing that achieving 4/3 slowdown is in exp-APX. Finally, we discuss the implications of our results.
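As a rough illustration (not from the paper, and with made-up durations), the sketch below computes the makespan of the classic non-series-parallel "N" network as its critical-path length, then adds one precedence edge to make it series-parallel and reports the resulting slowdown ratio. The added edge is just one possible series-parallelisation, not necessarily the best one.

```python
# Illustrative sketch only: makespan of an activity network as its
# critical-path length, and the slowdown ratio after adding precedence
# edges to make the network series-parallel. Durations are made up.

from functools import lru_cache

dur = {"a": 1.0, "b": 3.0, "c": 3.0, "d": 1.0}

def makespan(preds):
    @lru_cache(maxsize=None)
    def finish(t):
        return dur[t] + max((finish(p) for p in preds[t]), default=0.0)
    return max(finish(t) for t in dur)

# The "N" network a->c, a->d, b->d is not series-parallel.
original = {"a": (), "b": (), "c": ("a",), "d": ("a", "b")}
# Adding b->c turns it into (a || b) followed by (c || d): series-parallel.
series_parallel = {"a": (), "b": (), "c": ("a", "b"), "d": ("a", "b")}

m0, m1 = makespan(original), makespan(series_parallel)
print(f"original makespan = {m0}, series-parallel makespan = {m1}, "
      f"slowdown = {m1 / m0:.2f}")
```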
Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Recent research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for performance and energy on heterogeneous processors results in a large number of Pareto-optimal solutions (workload distributions) even in the simple case of linear performance and energy profiles. We then study performance and energy profiles of real-life data-parallel applications and find that their shapes are non-linear, complex and non-smooth. We therefore propose an efficient and exact global optimisation algorithm, which takes as input the most general discrete performance and dynamic energy profiles of the heterogeneous processors and solves the bi-objective optimisation problem. The algorithm is also used as a building block to solve the bi-objective optimisation problem for performance and total energy. We also propose a novel methodology to build discrete dynamic energy profiles of individual computing devices, which are input to the algorithm. The methodology is based purely on system-level measurements and addresses the fundamental challenge of accurate component-level energy modelling of a hybrid data-parallel application running on a heterogeneous platform integrating CPUs and accelerators. We experimentally validate the proposed method using two data-parallel applications, matrix multiplication and 2D fast Fourier transform (2D-FFT).
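A minimal brute-force sketch (not the paper's algorithm) of why heterogeneity yields many Pareto-optimal workload distributions: with assumed discrete time and dynamic-energy profiles for two devices, every split of the workload is evaluated and only the non-dominated (time, energy) points are kept. The profile numbers are invented for illustration.

```python
# Brute-force sketch: Pareto-optimal workload distributions over two devices
# given discrete time and dynamic-energy profiles. Profile values are made up.

# time_p[d][x] / energy_p[d][x]: time and energy for device d to process x units
time_p   = {"cpu": {0: 0, 1: 4, 2: 8, 3: 12, 4: 16}, "gpu": {0: 0, 1: 3, 2: 5, 3: 8, 4: 12}}
energy_p = {"cpu": {0: 0, 1: 2, 2: 4, 3: 6, 4: 8},   "gpu": {0: 0, 1: 5, 2: 9, 3: 14, 4: 20}}

WORKLOAD = 4
candidates = []
for x_cpu in range(WORKLOAD + 1):
    x_gpu = WORKLOAD - x_cpu
    t = max(time_p["cpu"][x_cpu], time_p["gpu"][x_gpu])     # devices run in parallel
    e = energy_p["cpu"][x_cpu] + energy_p["gpu"][x_gpu]     # total dynamic energy
    candidates.append(((x_cpu, x_gpu), t, e))

# Keep the points not dominated in both objectives by any other distribution.
pareto = [c for c in candidates
          if not any(o[1] <= c[1] and o[2] <= c[2] and (o[1], o[2]) != (c[1], c[2])
                     for o in candidates)]
for dist, t, e in sorted(pareto, key=lambda c: c[1]):
    print(f"distribution cpu/gpu = {dist}: time = {t}, energy = {e}")
```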
Due to the increasing size of HPC machines, faults are becoming an eventuality that applications must face. Natively, MPI provides no support for continuing execution after a fault is detected, and this is becoming more and more constraining. The introduction of ULFM (the User Level Fault Mitigation library) offers a way to survive a fault during application execution, at the cost of code modifications. ULFM is intrusive in the application and also requires a deep understanding of its recovery procedures. In this paper we propose Legio, a framework that lowers the complexity of introducing resiliency into an embarrassingly parallel MPI application. By hiding ULFM behind the MPI calls, the library exposes resiliency features to the application transparently, removing any integration effort. Upon a fault, the failed nodes are discarded and the execution continues only with the non-failed ones. A hierarchical implementation of the solution is also proposed to reduce the overhead of the repair process when scaling towards a large number of nodes. We evaluated our solutions on the Marconi100 cluster at CINECA, showing that the overhead introduced by the library is negligible and does not limit the scalability of MPI. Moreover, we also integrated the solution into real-world applications and injected faults to further demonstrate its robustness.
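The following is only a conceptual analogue in plain Python with concurrent.futures, not MPI, ULFM, or Legio: it illustrates the general idea of an embarrassingly parallel run that discards work lost to simulated failures and continues with whatever completes, transparently to the task code.

```python
# Conceptual analogue only: drop failed work items and continue with the rest.
# This is not MPI/ULFM/Legio; failures here are simulated task exceptions.

from concurrent.futures import ProcessPoolExecutor
import random

def work_item(i):
    # Simulated embarrassingly parallel task; a real fault would be a node crash.
    if random.random() < 0.2:
        raise RuntimeError(f"simulated fault on task {i}")
    return i * i

if __name__ == "__main__":
    results, failed = {}, []
    with ProcessPoolExecutor(max_workers=4) as pool:
        futures = {pool.submit(work_item, i): i for i in range(16)}
        for fut, i in futures.items():
            try:
                results[i] = fut.result()
            except Exception:
                failed.append(i)          # discard the failed task and carry on
    print(f"completed {len(results)} tasks, discarded {len(failed)} failed tasks")
```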
The FFT of three-dimensional (3D) input data is an important computational kernel of numerical simulations and is widely used in High Performance Computing (HPC) codes running on a large number of processors. The performance of many scientific applications, such as Molecular Dynamics simulations, depends on the underlying 3D parallel FFT library being used. In this paper, we present C-DAC's three-dimensional Fast Fourier Transform (CROFT) library, which implements a three-dimensional parallel FFT using pencil decomposition. To exploit the hyperthreading capabilities of processor cores without affecting performance, CROFT is designed to use multithreading along with MPI. The CROFT implementation has an innovative feature of overlapping compute and memory I/O with MPI communication using multithreading. As opposed to other 3D FFT implementations, CROFT uses only two threads, where one thread is dedicated to communication so that it can be effectively overlapped with computation. Thus, depending on the number of processes used, CROFT achieves a performance improvement of about 51% to 42% compared to the FFTW3 library.
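A minimal single-process sketch (not the CROFT or FFTW3 implementation) of the property that pencil decomposition exploits: a 3D FFT factors into 1D FFTs applied along one axis at a time, so each stage can operate on "pencils" of the array, with data redistributed between stages in a real distributed-memory run.

```python
# Single-process sketch of the separability behind pencil decomposition:
# a 3D FFT equals 1D FFTs applied along each axis in turn.

import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((8, 8, 8)) + 1j * rng.standard_normal((8, 8, 8))

# Stage-by-stage 1D FFTs along z, then y, then x (pencils along each axis).
stagewise = np.fft.fft(data, axis=2)
stagewise = np.fft.fft(stagewise, axis=1)
stagewise = np.fft.fft(stagewise, axis=0)

# Same result as a direct 3D transform.
assert np.allclose(stagewise, np.fft.fftn(data))
print("stage-wise 1D FFTs match np.fft.fftn")
```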
Analytic, first-principles performance modeling of distributed-memory parallel codes is notoriously imprecise. Even for applications with extremely regular and homogeneous compute-communicate phases, simply adding communication time to computation time often does not yield a satisfactory prediction of parallel runtime, due to deviations from the expected simple lockstep pattern caused by system noise, variations in communication time, and inherent load imbalance. In this paper, we highlight the specific cases of provoked and spontaneous desynchronization of memory-bound, bulk-synchronous pure MPI and hybrid MPI+OpenMP programs. Using simple microbenchmarks, we observe that although desynchronization can introduce increased waiting time per process, it does not necessarily cause lower resource utilization but can lead to an increase in available bandwidth per core. In the case of significant communication overhead, even natural noise can shove the system into a state of automatic overlap of communication and computation, improving the overall time to solution. The saturation point, i.e., the number of processes per memory domain required to achieve full memory bandwidth, is pivotal in the dynamics of this process and the emerging stable wave pattern. We also demonstrate how hybrid MPI-OpenMP programming can prevent desirable desynchronization by eliminating the bandwidth bottleneck among processes. A Chebyshev filter diagonalization application is used to demonstrate some of the observed effects in a realistic setting.
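A toy model (with assumed bandwidth numbers, not measurements from the paper) of the saturation point: if each memory-bound process can draw a fixed bandwidth from a shared memory domain, the domain saturates once enough processes share it, and adding more processes only lowers the per-core share.

```python
# Toy model of memory-bandwidth saturation on a shared memory domain.
# DOMAIN_BW and CORE_BW are assumed values, not measured figures.

import math

DOMAIN_BW = 100.0   # GB/s available per memory domain (assumed)
CORE_BW   = 18.0    # GB/s a single memory-bound process can draw (assumed)

saturation_point = math.ceil(DOMAIN_BW / CORE_BW)   # processes needed to saturate

for n in range(1, 9):
    per_core = min(CORE_BW, DOMAIN_BW / n)   # contention above saturation
    total    = per_core * n
    tag = "  <- saturated" if n >= saturation_point else ""
    print(f"{n} processes: {per_core:5.1f} GB/s per core, {total:5.1f} GB/s total{tag}")

print(f"saturation point: {saturation_point} processes per domain")
```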