Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication

Added by Grzegorz Kwasniewski

Publication date: 2019

Field: Informatics Engineering

Language: English

Authors: Grzegorz Kwasniewski

We propose COSMA: a parallel matrix-matrix multiplication (MMM) algorithm that is near communication-optimal for all combinations of matrix dimensions, processor counts, and memory sizes. The key idea behind COSMA is to derive an optimal (up to a factor of 0.03% for 10 MB of fast memory) sequential schedule and then parallelize it, preserving I/O optimality. To achieve this, we use the red-blue pebble game to precisely model MMM dependencies and derive constructive and tight sequential and parallel I/O lower bound proofs. Compared to 2D or 3D algorithms, which fix the processor decomposition upfront and then map it to the matrix dimensions, COSMA reduces communication volume by up to $\sqrt{3}$ times. COSMA outperforms the established ScaLAPACK, CARMA, and CTF algorithms in all scenarios by up to 12.8x (2.2x on average), achieving up to 88% of Piz Daint's peak performance. Our work does not require any hand tuning and is maintained as an open source implementation.
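
To make the idea of an I/O-efficient sequential schedule concrete, here is a minimal sketch that counts the words moved between slow and fast memory for a simple tiled schedule. The schedule, the function name `tiled_mmm_io`, and its parameters are assumptions chosen for illustration; this is not COSMA's actual schedule or API. It keeps an a x a tile of C resident in fast memory and streams one slice of A and one slice of B per step of the inner dimension, which gives roughly 2mnk/a loads plus mn stores.

```python
import math


def tiled_mmm_io(m, n, k, fast_mem_words):
    """Count slow<->fast memory traffic for an illustrative tiled schedule.

    Assumed schedule (not taken from the paper): an a x a tile of C stays in
    fast memory while, for each of the k inner steps, a slices of A and B
    (at most a words each) are loaded and used for a rank-1 update.
    """
    # Largest a such that a*a (C tile) + 2*a (A and B slices) fits.
    a = math.isqrt(fast_mem_words + 1) - 1
    if a < 1:
        raise ValueError("fast memory too small for even a 1 x 1 tile")

    loads = 0
    for i in range(0, m, a):
        rows = min(a, m - i)          # height of this C tile
        for j in range(0, n, a):
            cols = min(a, n - j)      # width of this C tile
            # Each inner step loads `rows` words of A and `cols` words of B.
            loads += (rows + cols) * k

    stores = m * n                    # every entry of C is written back once
    return a, loads, stores


if __name__ == "__main__":
    a, loads, stores = tiled_mmm_io(8, 8, 8, 8)
    print(f"tile side {a}, loads {loads}, stores {stores}")
```

With more fast memory the tile side grows roughly like the square root of the memory size, so the load count shrinks by the same factor.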

On the Parallel I/O Optimality of Linear Algebra Kernels: Near-Optimal Matrix Factorizations

Grzegorz Kwasniewski, Marko Kabic, Tal Ben-Nun, 2021

Matrix factorizations are among the most important building blocks of scientific computing. State-of-the-art libraries, however, are not communication-optimal, underutilizing current parallel architectures. We present novel algorithms for Cholesky and LU factorizations that utilize an asymptotically communication-optimal 2.5D decomposition. We first establish a theoretical framework for deriving parallel I/O lower bounds for linear algebra kernels, and then utilize its insights to derive Cholesky and LU schedules, both communicating $N^3/(P\sqrt{M})$ elements per processor, where $M$ is the local memory size. The empirical results match our theoretical analysis: our implementations communicate significantly less than Intel MKL, SLATE, and the asymptotically communication-optimal CANDMC and CAPITAL libraries. Our code outperforms these state-of-the-art libraries in almost all tested scenarios, with matrix sizes ranging from 2,048 to 262,144 on up to 512 CPU nodes of the Piz Daint supercomputer, decreasing the time-to-solution by up to three times. Our code is ScaLAPACK-compatible and available as an open-source library.
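
The per-processor communication term quoted in the abstract can be evaluated directly. The sketch below only computes the leading-order expression N^3/(P*sqrt(M)); the function name and the example sizes are assumptions for illustration, and any constant factors or lower-order terms from the paper are omitted.

```python
import math


def per_processor_comm(n, p, m):
    """Leading-order communication volume N^3 / (P * sqrt(M)), in elements.

    n: matrix dimension, p: number of processors, m: local memory size
    in elements. Constants and lower-order terms are not modeled.
    """
    return n**3 / (p * math.sqrt(m))


if __name__ == "__main__":
    base = per_processor_comm(1024, 64, 2**20)
    more_memory = per_processor_comm(1024, 64, 2**22)
    # Quadrupling the local memory halves the leading-order volume.
    print(base, more_memory)
```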

Distributed, Parallel, and Cluster Computing; Computational Complexity; Performance

Matrix multiplication algorithms from group orbits

Joshua A. Grochow, Cristopher Moore, 2016

We show how to construct highly symmetric algorithms for matrix multiplication. In particular, we consider algorithms which decompose the matrix multiplication tensor into a sum of rank-1 tensors, where the decomposition itself consists of orbits under some finite group action. We show how to use the representation theory of the corresponding group to derive simple constraints on the decomposition, which we solve by hand for n=2,3,4,5, recovering Strassen's algorithm (in a particularly symmetric form) and new algorithms for larger n. While these new algorithms do not improve the known upper bounds on tensor rank or the matrix multiplication exponent, they are beautiful in their own right, and we point out modifications of this idea that could plausibly lead to further improvements. Our constructions also suggest further patterns that could be mined for new algorithms, including a tantalizing connection with lattices. In particular, using lattices we give the most transparent proof to date of Strassen's algorithm; the same proof works for all n, to yield a decomposition with $n^3 - n + 1$ terms.
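
For reference, the n=2 case corresponds to Strassen's algorithm, which uses 7 products (matching $n^3 - n + 1$ at n=2) instead of 8. The sketch below uses the standard textbook form of Strassen's formulas, not the symmetric form derived in the paper; the function name `strassen_2x2` is chosen here for illustration.

```python
def strassen_2x2(a, b):
    """Multiply two 2x2 matrices with 7 multiplications (textbook Strassen)."""
    (a11, a12), (a21, a22) = a
    (b11, b12), (b21, b22) = b

    # The seven products; each is one rank-1 term of the decomposition.
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)

    # Recombine the products into the four entries of the result.
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


if __name__ == "__main__":
    print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Because the formulas never rely on commutativity of the entries, the same scheme applies recursively when the entries are themselves matrix blocks.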

Computational Complexity; Data Structures and Algorithms; Algebraic Geometry

Inner Rank and Lower Bounds for Matrix Multiplication

Joel Friedman, 2017

We develop a notion of inner rank as a tool for obtaining lower bounds on the rank of matrix multiplication tensors. We use it to give a short proof that the border rank (and therefore rank) of the tensor associated with $n \times n$ matrix multiplication over an arbitrary field is at least $2n^2-n+1$. While inner rank does not provide improvements to currently known lower bounds, we argue that this notion merits further study.
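
As a quick worked evaluation of the bound $2n^2-n+1$ for small $n$ (arithmetic added here for illustration, not taken from the paper):

```latex
\begin{align*}
n = 2:&\quad 2\cdot 2^2 - 2 + 1 = 7,\\
n = 3:&\quad 2\cdot 3^2 - 3 + 1 = 16,\\
n = 4:&\quad 2\cdot 4^2 - 4 + 1 = 29.
\end{align*}
```

For $n = 2$ the bound equals 7, the number of multiplications used by Strassen's algorithm, so the bound is tight in that case.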

Computational Complexity; Data Structures and Algorithms

SpArch: Efficient Architecture for Sparse Matrix Multiplication

Zhekai Zhang, Hanrui Wang, Song Han, 2020

Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a ubiquitous task in various engineering and scientific applications. However, inner product based SpGEMM introduces redundant input fetches for mismatched nonzero operands, while the outer product based approach suffers from poor output locality due to numerous partial product matrices. Inefficiency in the reuse of either input or output data leads to extensive and expensive DRAM access. To address this problem, this paper proposes an efficient sparse matrix multiplication accelerator architecture, SpArch, which jointly optimizes the data locality for both input and output matrices. We first design a highly parallelized streaming-based merger to pipeline the multiply and merge stages of partial matrices so that partial matrices are merged on chip immediately after being produced. We then propose a condensed matrix representation that reduces the number of partial matrices by three orders of magnitude and thus reduces DRAM access by 5.4x. We further develop a Huffman tree scheduler to improve the scalability of the merger for larger sparse matrices, which reduces the DRAM access by another 1.8x. We also resolve the increased input matrix read induced by the new representation using a row prefetcher with a near-optimal buffer replacement policy, further reducing the DRAM access by 1.5x. Evaluated on 20 benchmarks, SpArch reduces the total DRAM access by 2.8x over the previous state-of-the-art. On average, SpArch achieves 4x, 19x, 18x, 17x, 1285x speedup and 6x, 164x, 435x, 307x, 62x energy savings over OuterSPACE, MKL, cuSPARSE, CUSP, and ARM Armadillo, respectively.
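
The outer-product formulation mentioned in the abstract can be sketched in software as follows. This is a plain illustration of the multiply and merge stages and of why many partial matrices arise; it does not model SpArch's hardware merger, condensed representation, or scheduler. The data layout (`a_cols`, `b_rows` as dictionaries of nonzeros) and the function name are assumptions made for this example.

```python
def spgemm_outer(a_cols, b_rows):
    """Outer-product SpGEMM on dictionary-based sparse inputs.

    a_cols[k]: list of (row_index, value) nonzeros in column k of A.
    b_rows[k]: list of (col_index, value) nonzeros in row k of B.
    Returns the merged result as {(i, j): value} and the number of
    partial matrices that had to be merged.
    """
    # Multiply stage: column k of A times row k of B yields one partial matrix.
    partials = []
    for k in sorted(a_cols.keys() & b_rows.keys()):
        partial = {
            (i, j): a_val * b_val
            for i, a_val in a_cols[k]
            for j, b_val in b_rows[k]
        }
        partials.append(partial)

    # Merge stage: sum entries that land on the same output coordinate.
    result = {}
    for partial in partials:
        for coord, value in partial.items():
            result[coord] = result.get(coord, 0) + value
    return result, len(partials)


if __name__ == "__main__":
    # A = [[1, 0], [2, 3]], B = [[4, 5], [0, 6]]
    a_cols = {0: [(0, 1), (1, 2)], 1: [(1, 3)]}
    b_rows = {0: [(0, 4), (1, 5)], 1: [(1, 6)]}
    print(spgemm_outer(a_cols, b_rows))
```

Every input nonzero is read exactly once, but each inner index contributes its own partial matrix, which is the output-locality cost the abstract describes.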

Hardware Architecture; Distributed, Parallel, and Cluster Computing

Optimal Sampling Algorithms for Block Matrix Multiplication

Chengmei Niu, Hanyu Li, 2021

In this paper, we investigate randomized algorithms for block matrix multiplication from a random sampling perspective. Based on the A-optimal design criterion, the optimal sampling probabilities and sampling block sizes are obtained. To make the block sizes more practical, two modified block sizes with lower computational cost are provided. For the second one, a two-step algorithm is also devised. Moreover, probabilistic error bounds for the proposed algorithms are given. Extensive numerical results show that our methods outperform the existing one in the literature.
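
The general idea of sampling blocks to approximate a matrix product can be sketched as below. Note the swap: this uses the classic norm-proportional sampling probabilities, not the A-optimal probabilities or block sizes derived in the paper, which the abstract does not spell out. All names (`sampled_block_matmul`, `block_size`, `num_samples`) are assumptions for illustration.

```python
import math
import random


def matmul(a, b):
    """Dense product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def frobenius(mat):
    """Frobenius norm of a list-of-lists matrix."""
    return math.sqrt(sum(v * v for row in mat for v in row))


def sampled_block_matmul(a, b, block_size, num_samples, rng):
    """Unbiased estimate of A @ B from randomly sampled inner-dimension blocks.

    A is split into column blocks and B into matching row blocks. Block t is
    drawn with probability p_t proportional to ||A_t||_F * ||B_t||_F (an
    assumed, classic choice), and each draw contributes A_t B_t / (c * p_t).
    """
    rows, cols = len(a), len(b[0])
    blocks = [
        ([row[s:s + block_size] for row in a], b[s:s + block_size])
        for s in range(0, len(b), block_size)
    ]
    weights = [frobenius(a_blk) * frobenius(b_blk) for a_blk, b_blk in blocks]
    total = sum(weights)
    estimate = [[0.0] * cols for _ in range(rows)]
    if total == 0:
        return estimate  # the exact product is the zero matrix
    probs = [w / total for w in weights]

    for t in rng.choices(range(len(blocks)), weights=probs, k=num_samples):
        product = matmul(*blocks[t])
        scale = 1.0 / (num_samples * probs[t])
        for i in range(rows):
            for j in range(cols):
                estimate[i][j] += scale * product[i][j]
    return estimate


if __name__ == "__main__":
    a = [[1.0, 2.0, 0.0, 1.0], [0.0, 1.0, 3.0, 2.0]]
    b = [[1.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 1.0]]
    print(sampled_block_matmul(a, b, 2, 50, random.Random(0)))
    print(matmul(a, b))
```

Dividing each sampled product by its probability keeps the estimator unbiased; the choice of probabilities and block sizes only affects its variance, which is the quantity the paper's criterion targets.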

Numerical Analysis
