ﻻ يوجد ملخص باللغة العربية
Matrix multiplication is a very important computation kernel both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon algorithm which dates back to 1969 was the first efficient algorithm for parallel matrix multiplication providing theoretically optimal communication cost. However this algorithm requires a square number of processors. In the mid 1990s, the SUMMA algorithm was introduced. SUMMA overcomes the shortcomings of Cannon algorithm as it can be used on a non-square number of processors as well. Since then the number of processors in HPC platforms has increased by two orders of magnitude making the contribution of communication in the overall execution time more significant. Therefore, the state of the art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate the reduction of communication cost up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.
Graphs and their traversal is becoming significant as it is applicable to various areas of mathematics, science and technology. Various problems in fields as varied as biochemistry (genomics), electrical engineering (communication networks), computer
Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for the computations at the agents is affected by the availability of local resources and/or po
Scripting languages such as Python and R have been widely adopted as tools for the productive development of scientific software because of the power and expressiveness of the languages and available libraries. However, deploying scripted application
We introduce a data distribution scheme for $mathcal{H}$-matrices and a distributed-memory algorithm for $mathcal{H}$-matrix-vector multiplication. Our data distribution scheme avoids an expensive $Omega(P^2)$ scheduling procedure used in previous wo
While algebrisation constitutes a powerful technique in the design and analysis of centralised algorithms, to date there have been hardly any applications of algebraic techniques in the context of distributed graph algorithms. This work is a case stu