Hierarchical Parallel Matrix Multiplication on Large-Scale Distributed Memory Platforms


Published by Alexey Lastovetsky

Publication date: 2013

Research field: Informatics Engineering

Paper language: English

Authors: Jean-Noel Quintin - Khalid Hasanov - Alexey Lastovetsky

Distributed, Parallel, and Cluster Computing


Abstract

Matrix multiplication is a very important computation kernel, both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a non-square number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate a reduction in communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.
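To make the abstract's description more concrete, the following is a minimal serial sketch of a SUMMA-style computation on a two-dimensional processor grid, together with the index arithmetic for grouping that grid into a two-level hierarchy. It is not the authors' implementation: there is no real communication, and the names `summa_simulated`, `two_level_coords`, and `matmul_reference`, as well as the even block distribution, are assumptions made for illustration.

```python
def matmul_reference(A, B):
    """Plain triple-loop product, used only to check the simulation."""
    return [[sum(A[i][t] * B[t][j] for t in range(len(B)))
             for j in range(len(B[0]))]
            for i in range(len(A))]


def summa_simulated(A, B, grid_rows, grid_cols, panel):
    """Serially simulate a SUMMA-style product on a grid_rows x grid_cols grid.

    Assumption: C is split into equal blocks, one per simulated processor.
    """
    n, k, m = len(A), len(B), len(B[0])
    if n % grid_rows or m % grid_cols:
        raise ValueError("matrix dimensions must divide evenly over the grid")
    br, bc = n // grid_rows, m // grid_cols
    C = [[0] * m for _ in range(n)]
    for start in range(0, k, panel):
        stop = min(start + panel, k)
        # In a real run, the panel A[:, start:stop] would be broadcast along
        # grid rows and the panel B[start:stop, :] along grid columns.
        for pi in range(grid_rows):
            for pj in range(grid_cols):
                # Simulated processor (pi, pj) updates only its own block of C.
                for i in range(pi * br, (pi + 1) * br):
                    for j in range(pj * bc, (pj + 1) * bc):
                        C[i][j] += sum(A[i][t] * B[t][j]
                                       for t in range(start, stop))
    return C


def two_level_coords(i, j, group_rows, group_cols, grid_rows, grid_cols):
    """Map grid position (i, j) to (group index, position inside the group).

    Hypothetical helper: the grid is cut into group_rows x group_cols groups.
    """
    inner_r = grid_rows // group_rows
    inner_c = grid_cols // group_cols
    return (i // inner_r, j // inner_c), (i % inner_r, j % inner_c)
```

Calling `summa_simulated(A, B, 2, 4, panel=4)` runs on a non-square 2 x 4 grid, which is the case Cannon's algorithm does not cover. In a hierarchical variant, a broadcast could first go between groups and then inside each group; `two_level_coords` only shows how a processor would locate itself in such a grouping.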


61 - Anuj Sharma, Syed Mohammed Arshad Zaidi 2020

Graphs and their traversal are becoming increasingly significant, as they are applicable to various areas of mathematics, science, and technology. Problems in fields as varied as biochemistry (genomics), electrical engineering (communication networks), and computer science (algorithms and computation) can be modeled as graph problems. Real-world scenarios, including communities, their interconnections, and related properties, can be studied using graphs. Fast, scalable, low-cost execution of parallel graph algorithms is therefore very important. In this implementation of parallel breadth-first search of graphs, we implemented the parallel BFS algorithm with 1-D partitioning of the graph as described in [2] and reduced execution time by optimizing communication for local buffers.
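As an illustration of the general idea, here is a minimal serial sketch of level-synchronous BFS with 1-D vertex partitioning. It is not the implementation described in the abstract: the cyclic ownership rule (`v % num_parts`), the function name `bfs_1d_partitioned`, and the per-destination "outbox" lists standing in for communication buffers are all assumptions.

```python
def bfs_1d_partitioned(adjacency, source, num_parts):
    """Level-synchronous BFS; vertex v is owned by part v % num_parts.

    Returns the BFS level of every vertex (-1 if unreachable).
    """
    def owner(v):
        return v % num_parts  # assumed cyclic 1-D distribution

    level = [-1] * len(adjacency)
    level[source] = 0
    frontier = [[] for _ in range(num_parts)]
    frontier[owner(source)].append(source)
    depth = 0
    while any(frontier):
        # Each part scans its local frontier and buffers the neighbours it
        # finds, one buffer per destination owner.
        outboxes = [[[] for _ in range(num_parts)] for _ in range(num_parts)]
        for p in range(num_parts):
            for u in frontier[p]:
                for v in adjacency[u]:
                    outboxes[p][owner(v)].append(v)
        # Simulated all-to-all exchange: each part reads the buffers
        # addressed to it and marks vertices it has not seen before.
        depth += 1
        frontier = [[] for _ in range(num_parts)]
        for q in range(num_parts):
            for p in range(num_parts):
                for v in outboxes[p][q]:
                    if level[v] == -1:
                        level[v] = depth
                        frontier[q].append(v)
    return level
```

In a distributed run, the size of the per-destination buffers is what determines the communication volume at each level, which is why buffer handling is a natural place to optimize.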

Distributed, Parallel, and Cluster Computing

Straggler Mitigation through Unequal Error Protection for Distributed Approximate Matrix Multiplication

122 - Busra Tegin, Eduin E. Hernandez, Stefano Rini 2021

Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or poor channel conditions, giving rise to the straggler problem. As a remedy to this problem, we employ Unequal Error Protection (UEP) codes to obtain an approximation of the matrix product in the distributed computation setting, providing higher protection for the blocks with a higher effect on the final result. We characterize the performance of the proposed approach from a theoretical perspective by bounding the expected reconstruction error for matrices with uncorrelated entries. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN) for an image classification task, in the evaluation of the gradients. Our numerical experiments show that it is indeed possible to obtain significant improvements in the overall time required to achieve DNN training convergence by producing approximations of matrix products using UEP codes in the presence of stragglers.
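The principle of protecting high-impact blocks more strongly can be illustrated with a deliberately simplified stand-in. The sketch below uses plain replication rather than the UEP codes of the paper: the product is written as a sum of column-row terms, each term is ranked by a norm product, and more important terms get more replicas. The function names, the importance measure, and the replica rule are all assumptions.

```python
import math


def term_importance(A, B):
    """Norm product ||A[:, k]|| * ||B[k, :]|| for each column-row term k."""
    scores = []
    for k in range(len(B)):
        col_norm = math.sqrt(sum(row[k] ** 2 for row in A))
        row_norm = math.sqrt(sum(x ** 2 for x in B[k]))
        scores.append(col_norm * row_norm)
    return scores


def assign_replicas(importance, max_replicas):
    """Most important term gets max_replicas copies, decreasing to 1."""
    order = sorted(range(len(importance)), key=lambda k: -importance[k])
    replicas = [0] * len(importance)
    for rank, k in enumerate(order):
        replicas[k] = max(1, max_replicas - rank)
    return replicas


def approximate_product(A, B, replicas, finished):
    """Sum the terms for which at least one replica (k, r) finished in time.

    `finished` is a set of (term, replica) pairs; everything else straggled.
    """
    n, m = len(A), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for k in range(len(B)):
        if any((k, r) in finished for r in range(replicas[k])):
            for i in range(n):
                for j in range(m):
                    C[i][j] += A[i][k] * B[k][j]
    return C
```

With this scheme, a straggling worker is more likely to cost a low-importance term than a high-importance one, so the approximation error stays small. Real UEP codes achieve a similar effect through coding rather than replication.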

Distributed, Parallel, and Cluster Computing - Information Theory

Toward Interlanguage Parallel Scripting for Distributed-Memory Scientific Computing

494 - Justin M. Wozniak, Timothy G. Armstrong, Ketan C. Maheshwari 2021

Scripting languages such as Python and R have been widely adopted as tools for the productive development of scientific software because of the power and expressiveness of the languages and available libraries. However, deploying scripted applications on large-scale parallel computer systems such as the IBM Blue Gene/Q or Cray XE6 is a challenge because of issues including operating system limitations, interoperability challenges, and parallel filesystem overheads due to the small file system accesses common in scripted approaches. We present here a new approach to these problems in which the Swift scripting system is used to integrate high-level scripts written in Python, R, and Tcl with native code developed in C, C++, and Fortran, by linking Swift to the library interfaces to the script interpreters. In this approach, Swift handles data management, movement, and marshaling among distributed-memory processes without direct user manipulation of low-level communication libraries such as MPI. We present a technique to efficiently launch scripted applications on large-scale supercomputers using a hierarchical programming model.

Distributed, Parallel, and Cluster Computing

Distributed-memory $\mathcal{H}$-matrix Algebra I: Data Distribution and Matrix-vector Multiplication

152 - Yingzhou Li, Jack Poulson, Lexing Ying 2020

We introduce a data distribution scheme for $\mathcal{H}$-matrices and a distributed-memory algorithm for $\mathcal{H}$-matrix-vector multiplication. Our data distribution scheme avoids an expensive $\Omega(P^2)$ scheduling procedure used in previous work, where $P$ is the number of processes, while data balancing is well-preserved. Based on the data distribution, our distributed-memory algorithm evenly distributes all computations among $P$ processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is $O\Big(\frac{N \log N}{P} + \alpha \log P + \beta \log^2 P \Big)$ for $\mathcal{H}$-matrices under the weak admissibility condition, where $N$ is the matrix size, $\alpha$ denotes the latency, and $\beta$ denotes the inverse bandwidth. Numerically, our algorithm is applied to address both two- and three-dimensional problems of various sizes among various numbers of processes. On thousands of processes, good parallel efficiency is still observed.
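To see how the three terms of the stated complexity trade off, the expression can be evaluated directly. This is only a rough sketch: the asymptotic bound hides constant factors, so setting every constant to 1 and using base-2 logarithms are assumptions, as is the function name `hmatvec_cost_model`.

```python
import math


def hmatvec_cost_model(n, p, alpha, beta):
    """Evaluate N log N / P + alpha log P + beta log^2 P.

    Assumptions: all hidden constants are 1 and logarithms are base 2.
    """
    compute = n * math.log2(n) / p
    latency = alpha * math.log2(p)
    bandwidth = beta * math.log2(p) ** 2
    return compute + latency + bandwidth


if __name__ == "__main__":
    # The compute term shrinks with P while the communication terms grow.
    for p in (4, 64, 1024):
        print(p, hmatvec_cost_model(2 ** 20, p, alpha=100.0, beta=10.0))
```

Under these assumptions the communication terms grow only polylogarithmically in $P$, which is consistent with the abstract's observation of good parallel efficiency on thousands of processes.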

Numerical Analysis - Distributed, Parallel, and Cluster Computing

Algebrisation in Distributed Graph Algorithms: Fast Matrix Multiplication in the Congested Clique

457 - Petteri Kaski, Janne H. Korhonen, Christoph Lenzen 2014

While algebrisation constitutes a powerful technique in the design and analysis of centralised algorithms, to date there have been hardly any applications of algebraic techniques in the context of distributed graph algorithms. This work is a case study that demonstrates the potential of algebrisation in the distributed context. We will focus on distributed graph algorithms in the congested clique model; the graph problems that we will consider include, e.g., the triangle detection problem and the all-pairs shortest path problem (APSP). There is plenty of prior work on combinatorial algorithms in the congested clique model: for example, Dolev et al. (DISC 2012) gave an algorithm for triangle detection with a running time of $\tilde O(n^{1/3})$, and Nanongkai (STOC 2014) gave an approximation algorithm for APSP with a running time of $\tilde O(n^{1/2})$. In this work, we will use algebraic techniques -- in particular, algorithms based on fast matrix multiplication -- to solve both triangle detection and the unweighted APSP in time $O(n^{0.15715})$; for weighted APSP, we give a $(1+o(1))$-approximation with this running time, as well as an exact $\tilde O(n^{1/3})$ solution.
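The algebraic link between triangles and matrix multiplication can be shown with a small centralised sketch. It uses the standard identity that, for a simple undirected graph with adjacency matrix $A$, the number of triangles equals $\mathrm{trace}(A^3)/6$. This is not the congested clique algorithm of the paper, and it uses ordinary cubic-time multiplication rather than fast matrix multiplication; the function names are assumptions.

```python
def mat_mul(X, Y):
    """Ordinary cubic-time product of two square matrices (lists of lists)."""
    n = len(X)
    return [[sum(X[i][t] * Y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(adj):
    """Count triangles in a simple undirected graph via trace(A^3) / 6."""
    n = len(adj)
    squared = mat_mul(adj, adj)
    # trace(A^3) = sum over i, j of (A^2)[i][j] * A[j][i]; this avoids
    # forming the full cube.
    trace = sum(squared[i][j] * adj[j][i] for i in range(n) for j in range(n))
    # Each triangle is counted once per starting vertex and direction.
    return trace // 6


def has_triangle(adj):
    """Triangle detection: true if the graph contains at least one triangle."""
    return count_triangles(adj) > 0
```

Because the only expensive step is a matrix product, any faster multiplication routine, centralised or distributed, speeds up triangle detection accordingly.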

Distributed, Parallel, and Cluster Computing - Data Structures and Algorithms
