We introduce a data distribution scheme for $\mathcal{H}$-matrices and a distributed-memory algorithm for $\mathcal{H}$-matrix-vector multiplication. Our data distribution scheme avoids an expensive $\Omega(P^2)$ scheduling procedure used in previous work, where $P$ is the number of processes, while keeping the data well balanced. Based on this data distribution, our distributed-memory algorithm evenly distributes all computations among the $P$ processes and adopts a novel tree-communication algorithm to reduce the latency cost. The overall complexity of our algorithm is $O\Big(\frac{N \log N}{P} + \alpha \log P + \beta \log^2 P \Big)$ for $\mathcal{H}$-matrices under the weak admissibility condition, where $N$ is the matrix size, $\alpha$ denotes the latency, and $\beta$ denotes the inverse bandwidth. Numerically, we apply our algorithm to two- and three-dimensional problems of various sizes on various numbers of processes. Good parallel efficiency is still observed on thousands of processes.
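To make the three terms of the complexity bound above concrete, the following minimal sketch evaluates them separately. It is only an illustration: the function name `hmatvec_cost_terms` is hypothetical, the constants hidden by the big-O notation are set to 1, and base-2 logarithms are assumed.

```python
import math


def hmatvec_cost_terms(N, P, alpha, beta):
    """Return the (compute, latency, bandwidth) terms of
    N log N / P + alpha log P + beta log^2 P.

    Assumptions (not from the abstract): all hidden constants are 1
    and logarithms are taken base 2.
    """
    compute = N * math.log2(N) / P        # work shared evenly by P processes
    latency = alpha * math.log2(P)        # tree communication: log P messages
    bandwidth = beta * math.log2(P) ** 2  # communicated volume: log^2 P words
    return compute, latency, bandwidth


if __name__ == "__main__":
    for P in (4, 64, 1024):
        print(P, hmatvec_cost_terms(N=2**20, P=P, alpha=1.0, beta=0.5))
```

Under these assumptions, the compute term shrinks in proportion to $1/P$ while the two communication terms grow only polylogarithmically in $P$.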
Inverting large full-rank matrices is a cumbersome operation that appears in many algorithms and applications in numerical analysis, linear algebra, optimization, machine learning, and engineering. It raises numerical stability and complexity issues and has a high expected computation time. We address the latter issue by proposing an algorithm that uses a black-box least squares optimization solver as a subroutine to estimate the inverse (and pseudoinverse) of real nonsingular matrices by estimating its columns. Because the columns are estimated separately, the algorithm can also be performed in a distributed manner, so the estimate can be obtained much faster and can be made robust to \textit{stragglers}. Furthermore, we assume a centralized network with no message passing between the computing nodes, and we do not require a matrix factorization (e.g., LU, SVD, or QR decomposition) beforehand.
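The column-wise idea in the abstract above can be sketched as follows: column $i$ of the inverse is the minimizer of $\|Ax - e_i\|^2$, where $e_i$ is the $i$-th standard basis vector. This is a minimal sketch, not the authors' implementation. Plain gradient descent stands in for the black-box least squares solver, and the function names, step size, and iteration count are all assumptions chosen for a small, well-conditioned example.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def lstsq_gd(A, b, step, iters):
    """Stand-in black-box solver (assumption): minimise 0.5 * ||A x - b||^2
    by gradient descent. Converges only if step < 2 / lambda_max(A^T A)."""
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iters):
        residual = [p - q for p, q in zip(matvec(A, x), b)]
        grad = matvec(At, residual)  # gradient is A^T (A x - b)
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x


def estimate_inverse(A, step=0.1, iters=500):
    """Estimate inv(A) one column at a time. The n solves are independent,
    so each could be handed to a different worker."""
    n = len(A)
    columns = []
    for i in range(n):
        e_i = [1.0 if j == i else 0.0 for j in range(n)]
        columns.append(lstsq_gd(A, e_i, step, iters))
    return transpose(columns)  # stored columns become matrix columns


if __name__ == "__main__":
    A = [[2.0, 1.0], [1.0, 3.0]]
    for row in estimate_inverse(A):
        print(row)
```

The loop over `i` has no dependencies between iterations, which is why no message passing between computing nodes is needed in this sketch.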
Sparse matrix-vector multiplication (SpMV) is an important kernel in scientific and engineering applications. Previous optimizations are specific to a sparse matrix format and leave the choice of the best format to application programmers. In this work, we develop an auto-tuning framework to bridge the gap between the specific optimized kernels and their general-purpose use. We propose an SpMV auto-tuner (SMAT) that provides programmers with a unified interface based on compressed sparse row (CSR) by implicitly choosing the best format and the fastest implementation for any input sparse matrix at runtime. SMAT leverages a data mining model, formulated from a set of performance parameters extracted from 2373 matrices in the UF sparse matrix collection, to quickly search for the best combination. The experiments show that SMAT achieves a maximum performance of 75 GFLOP/s in single precision and 33 GFLOP/s in double precision on Intel, and 41 GFLOP/s in single precision and 34 GFLOP/s in double precision on AMD. Compared with the sparse functions in the MKL library, SMAT runs more than 3 times faster.
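For readers unfamiliar with the CSR format on which the unified interface above is based, here is a minimal reference SpMV kernel. It is a plain, untuned sketch for illustration only. It is not SMAT's interface or implementation, and the function and argument names are assumptions.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    row_ptr[i]:row_ptr[i + 1] delimits the nonzeros of row i,
    col_idx[k] is the column of the k-th nonzero, values[k] its value.
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 3, 0],
    #      [4, 0, 5]]
    row_ptr = [0, 2, 3, 5]
    col_idx = [0, 2, 1, 0, 2]
    values = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0]))
```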
Large-scale machine learning and data mining methods routinely distribute computations across multiple agents to parallelize processing. The time required for computation at the agents is affected by the availability of local resources, giving rise to the straggler problem, in which the computation results are held back by unresponsive agents. For this problem, linear coding of the matrix sub-blocks can be used to introduce resilience toward straggling. The Parameter Server (PS) utilizes a channel code and distributes the matrices to the workers for multiplication. It then produces an approximation to the desired matrix multiplication using the results of the computations received by a given deadline. In this paper, we propose to employ Unequal Error Protection (UEP) codes to alleviate the straggler problem. The resiliency level of each sub-block is chosen according to its norm, as blocks with larger norms have a greater effect on the result of the matrix multiplication. We validate the effectiveness of our scheme both theoretically and through numerical evaluations. We derive a theoretical characterization of the performance of UEP using random linear codes and compare it with the case of equal error protection. We also apply the proposed coding strategy to the computation of the back-propagation step in the training of a Deep Neural Network (DNN), for which we investigate the fundamental trade-off between precision and the time required for the computations.
Matrix multiplication is a very important computation kernel, both in its own right as a building block of many scientific applications and as a popular representative for other scientific applications. Cannon's algorithm, which dates back to 1969, was the first efficient algorithm for parallel matrix multiplication, providing theoretically optimal communication cost. However, this algorithm requires a square number of processors. In the mid-1990s, the SUMMA algorithm was introduced. SUMMA overcomes this shortcoming of Cannon's algorithm, as it can also be used on a non-square number of processors. Since then, the number of processors in HPC platforms has increased by two orders of magnitude, making the contribution of communication to the overall execution time more significant. Therefore, the state-of-the-art parallel matrix multiplication algorithms should be revisited to reduce the communication cost further. This paper introduces a new parallel matrix multiplication algorithm, Hierarchical SUMMA (HSUMMA), which is a redesign of SUMMA. Our algorithm reduces the communication cost of SUMMA by introducing a two-level virtual hierarchy into the two-dimensional arrangement of processors. Experiments on an IBM BlueGene-P demonstrate a reduction in communication cost of up to 2.08 times on 2048 cores and up to 5.89 times on 16384 cores.
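As background for the abstract above, the following is a serial simulation of the basic (single-level) SUMMA update pattern. It does not implement HSUMMA's two-level hierarchy and performs no real communication. The function name, the grid parameters `pr` and `pc`, and the panel width `nb` are assumptions for illustration, and the matrix dimensions are assumed to be divisible by the grid dimensions.

```python
def summa_simulated(A, B, pr, pc, nb):
    """Simulate SUMMA for C = A @ B on a pr x pc process grid.

    In each step, a column panel of A (width nb) would be broadcast along
    process rows and a row panel of B (height nb) along process columns;
    every process then adds the panel product to its own block of C.
    Here all "processes" are visited in a serial loop.
    """
    n, m, l = len(A), len(B), len(B[0])  # A is n x m, B is m x l
    assert n % pr == 0 and l % pc == 0, "sketch assumes even block sizes"
    rb, cb = n // pr, l // pc            # local block of C is rb x cb
    C = [[0.0] * l for _ in range(n)]
    for k0 in range(0, m, nb):           # one broadcast step per panel
        panel = range(k0, min(k0 + nb, m))
        for pi in range(pr):
            for pj in range(pc):
                # Process (pi, pj) updates only the block of C it owns.
                for i in range(pi * rb, (pi + 1) * rb):
                    for j in range(pj * cb, (pj + 1) * cb):
                        C[i][j] += sum(A[i][k] * B[k][j] for k in panel)
    return C


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8]]
    B = [[1, 0, 2], [0, 1, 0], [3, 0, 1], [0, 2, 0]]
    print(summa_simulated(A, B, pr=2, pc=3, nb=2))
```

The example uses a 2 x 3 grid, which illustrates that the pattern does not need a square number of processes.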
We consider the problem of designing codes with flexible rate (referred to as rateless codes) for private distributed matrix-matrix multiplication. A master server owns two private matrices $\mathbf{A}$ and $\mathbf{B}$ and hires worker nodes to help compute their product. The matrices should remain information-theoretically private from the workers. Codes with fixed rate require the master to assign tasks to the workers and then wait for a predetermined number of workers to finish their assigned tasks. The size of the tasks, and hence the rate of the scheme, depends on the number of workers that the master waits for. We design a rateless private matrix-matrix multiplication scheme, called RPM3. In contrast to fixed-rate schemes, our scheme fixes the size of the tasks and allows the master to send multiple tasks to the workers. The master keeps sending tasks and receiving results until it can decode the product, which makes the scheme flexible and adaptive to heterogeneous environments. Despite resulting in a smaller rate than known straggler-tolerant schemes, RPM3 provides a smaller mean waiting time for the master by leveraging the heterogeneity of the workers. The waiting time is studied under two different models for the workers' service time. We provide upper bounds on the mean waiting time under both models. In addition, we provide lower bounds on the mean waiting time under the worker-dependent fixed service time model.