
Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

 Added by Jianyu Huang
 Publication date 2018
Language: English





Conventional GPU implementations of Strassen's algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, squarish matrices due to the extra memory overhead, and their use is limited by the considerable workspace they require. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11x speedup with a crossover point as small as 1,536 compared to cublasSgemm on an NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19x speedup with a crossover point at 7,680.
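For orientation, the sketch below shows the textbook 1-level Strassen decomposition that such GPU primitives specialize: seven half-sized products replace the eight of a blocked multiplication, with submatrix additions feeding each product. This is a plain host-side C++ illustration under stated assumptions; the explicit temporaries and the naive reference GEMM exist only for readability, whereas the paper's kernels fuse those additions into the GEMM pipeline and avoid the extra workspace.

#include <cstddef>
#include <vector>

// Column-major n x n matrix stored contiguously.
using Mat = std::vector<float>;

// Elementwise C = A + sB * B (sB is +1 or -1 in Strassen's additions).
static Mat add(const Mat& A, const Mat& B, float sB) {
    Mat C(A.size());
    for (std::size_t i = 0; i < A.size(); ++i) C[i] = A[i] + sB * B[i];
    return C;
}

// Naive reference GEMM, C = A * B, for h x h column-major matrices.
static Mat gemm(const Mat& A, const Mat& B, std::size_t h) {
    Mat C(h * h, 0.0f);
    for (std::size_t j = 0; j < h; ++j)
        for (std::size_t p = 0; p < h; ++p)
            for (std::size_t i = 0; i < h; ++i)
                C[i + j * h] += A[i + p * h] * B[p + j * h];
    return C;
}

// Copy out quadrant (bi, bj) of an n x n matrix as an (n/2) x (n/2) matrix.
static Mat quad(const Mat& A, std::size_t n, std::size_t bi, std::size_t bj) {
    std::size_t h = n / 2;
    Mat Q(h * h);
    for (std::size_t j = 0; j < h; ++j)
        for (std::size_t i = 0; i < h; ++i)
            Q[i + j * h] = A[(i + bi * h) + (j + bj * h) * n];
    return Q;
}

// 1-level Strassen for even n: seven half-sized GEMMs instead of eight.
Mat strassen_1level(const Mat& A, const Mat& B, std::size_t n) {
    std::size_t h = n / 2;
    Mat A11 = quad(A, n, 0, 0), A12 = quad(A, n, 0, 1),
        A21 = quad(A, n, 1, 0), A22 = quad(A, n, 1, 1);
    Mat B11 = quad(B, n, 0, 0), B12 = quad(B, n, 0, 1),
        B21 = quad(B, n, 1, 0), B22 = quad(B, n, 1, 1);

    Mat M1 = gemm(add(A11, A22,  1), add(B11, B22,  1), h);
    Mat M2 = gemm(add(A21, A22,  1), B11, h);
    Mat M3 = gemm(A11, add(B12, B22, -1), h);
    Mat M4 = gemm(A22, add(B21, B11, -1), h);
    Mat M5 = gemm(add(A11, A12,  1), B22, h);
    Mat M6 = gemm(add(A21, A11, -1), add(B11, B12,  1), h);
    Mat M7 = gemm(add(A12, A22, -1), add(B21, B22,  1), h);

    // Recombine the quadrants of C from the seven products.
    Mat C(n * n);
    for (std::size_t j = 0; j < h; ++j)
        for (std::size_t i = 0; i < h; ++i) {
            std::size_t q = i + j * h;
            C[i       +  j      * n] = M1[q] + M4[q] - M5[q] + M7[q]; // C11
            C[i       + (j + h) * n] = M3[q] + M5[q];                 // C12
            C[(i + h) +  j      * n] = M2[q] + M4[q];                 // C21
            C[(i + h) + (j + h) * n] = M1[q] - M2[q] + M3[q] + M6[q]; // C22
        }
    return C;
}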



Related research

We dispel with street wisdom regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel(R) Xeon Phi(TM) coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.
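To make the plug-compatibility point concrete, a wrapper with exactly the standard (Fortran-style) DGEMM argument list might look like the hypothetical C++ sketch below: no workspace argument is exposed, and small or awkward shapes fall back to the conventional routine. The names dgemm_strassen_ and dstrassen_internal and the crossover constant are illustrative assumptions rather than the paper's interface; only dgemm_ refers to the standard BLAS routine.

// Standard column-major BLAS DGEMM (Fortran symbol), assumed available at link time.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* A, const int* lda,
                       const double* B, const int* ldb,
                       const double* beta, double* C, const int* ldc);

// Hypothetical internal Strassen path. A real implementation would perform the
// 1- or 2-level decomposition inside the packing buffers a high-performance
// GEMM already owns (hence no workspace parameter); here it simply forwards to
// DGEMM so the sketch compiles and links as a complete skeleton.
static void dstrassen_internal(const char* ta, const char* tb,
                               const int* m, const int* n, const int* k,
                               const double* alpha, const double* A, const int* lda,
                               const double* B, const int* ldb,
                               const double* beta, double* C, const int* ldc) {
    dgemm_(ta, tb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}

// Drop-in replacement: identical argument list to DGEMM, no workspace passed in.
extern "C" void dgemm_strassen_(const char* transa, const char* transb,
                                const int* m, const int* n, const int* k,
                                const double* alpha, const double* A, const int* lda,
                                const double* B, const int* ldb,
                                const double* beta, double* C, const int* ldc) {
    const int kCrossover = 512;  // illustrative threshold, not a tuned value
    if (*m < kCrossover || *n < kCrossover || *k < kCrossover) {
        dgemm_(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
        return;
    }
    dstrassen_internal(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc);
}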
Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassen's algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassen's algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the memory movement overhead that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3x speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.
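As a rough illustration of the matrix-centric view described above, the hypothetical helper below builds a scatter vector: for the tensor modes grouped into one dimension of the implicit matrix, it maps each matrix index to a memory offset, so a GEMM (or Strassen) kernel can pack tensor blocks directly from general-stride storage without an explicit transposition. The mode ordering and the helper's name are assumptions for illustration, not the Block-Scatter-Matrix implementation itself.

#include <cstddef>
#include <vector>

// Build the scatter vector for the tensor modes grouped into one matrix
// dimension: entry idx is the memory offset of that row (or column) index.
// The first listed mode is taken to vary fastest (an assumed convention).
std::vector<std::size_t> scatter_vector(const std::vector<std::size_t>& dims,
                                        const std::vector<std::size_t>& strides) {
    std::size_t total = 1;
    for (std::size_t d : dims) total *= d;
    std::vector<std::size_t> scat(total);
    for (std::size_t idx = 0; idx < total; ++idx) {
        std::size_t rem = idx, offset = 0;
        for (std::size_t m = 0; m < dims.size(); ++m) {
            offset += (rem % dims[m]) * strides[m];  // mode index times its stride
            rem /= dims[m];
        }
        scat[idx] = offset;
    }
    return scat;
}

With row and column scatter vectors rscat and cscat built this way, the implicit matrix entry (i, j) lives at offset rscat[i] + cscat[j] in the tensor's storage, which is what lets the contraction be treated as a GEMM with an implicit layout transformation.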
Practical aperture synthesis imaging algorithms work by iterating between estimating the sky brightness distribution and comparing a prediction based on this estimate with the measured data (visibilities). Accuracy in the latter step is crucial but is made difficult by the irregular and non-planar sampling of data by the telescope. In this work we present a GPU implementation of 3D de-gridding which accurately deals with these two difficulties and is designed for distributed operation. We address the load-balancing issues caused by the large variation in the number of visibilities that need to be computed. Using CUDA and NVIDIA GPUs we measure performance up to 1.2 billion visibilities per second.
In 1969, Strassen shocked the world by showing that two $n \times n$ matrices could be multiplied in time asymptotically less than $O(n^3)$. While the recursive construction in his algorithm is very clear, the key gain was made by showing that $2 \times 2$ matrix multiplication could be performed with only 7 multiplications instead of 8. The latter construction was arrived at by a process of elimination and appears to come out of thin air. Here, we give the simplest and most transparent proof of Strassen's algorithm that we are aware of, using only a simple unitary 2-design and a few easy lines of calculation. Moreover, using basic facts from the representation theory of finite groups, we use 2-designs coming from group orbits to generalize our construction to all $n$ (although the resulting algorithms aren't optimal for $n \geq 3$).
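For context, the asymptotic gain referred to above follows from the recurrence induced by using seven rather than eight half-sized products:

$$T(n) = 7\,T(n/2) + O(n^2) \quad\Longrightarrow\quad T(n) = O\!\left(n^{\log_2 7}\right) \approx O\!\left(n^{2.807}\right),$$

whereas eight recursive products give $T(n) = 8\,T(n/2) + O(n^2)$, i.e. $O(n^{\log_2 8}) = O(n^3)$.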
We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.
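For readers unfamiliar with the building block named above, here is a serial, host-side C++ sketch of one sweep of one-sided Jacobi: each pair of columns is rotated so that the pair becomes orthogonal, and once the sweeps converge the column norms are the singular values and the normalized columns give the left singular vectors. The batched, memory-hierarchy-aware GPU kernels in the paper are not reflected here; the function name and the convergence measure are assumptions for illustration.

#include <algorithm>
#include <cmath>
#include <vector>

// One sweep of one-sided Jacobi on a column-major m x n matrix A.
// Returns the largest |a_p . a_q| / (||a_p|| ||a_q||) encountered, so the
// caller can repeat sweeps until this falls below a chosen tolerance.
double jacobi_sweep(std::vector<double>& A, int m, int n) {
    double off = 0.0;
    for (int p = 0; p < n - 1; ++p) {
        for (int q = p + 1; q < n; ++q) {
            // Gram entries of the column pair (p, q).
            double alpha = 0.0, beta = 0.0, gamma = 0.0;
            for (int i = 0; i < m; ++i) {
                double ap = A[i + p * m], aq = A[i + q * m];
                alpha += ap * ap;
                beta  += aq * aq;
                gamma += ap * aq;
            }
            double denom = std::sqrt(alpha * beta);
            if (denom > 0.0) off = std::max(off, std::fabs(gamma) / denom);
            if (gamma == 0.0) continue;  // columns already orthogonal

            // Rotation that zeroes the off-diagonal Gram entry gamma.
            double zeta = (beta - alpha) / (2.0 * gamma);
            double t = (zeta >= 0.0 ? 1.0 : -1.0) /
                       (std::fabs(zeta) + std::sqrt(1.0 + zeta * zeta));
            double c = 1.0 / std::sqrt(1.0 + t * t);
            double s = c * t;

            // Apply the rotation to the two columns.
            for (int i = 0; i < m; ++i) {
                double ap = A[i + p * m], aq = A[i + q * m];
                A[i + p * m] = c * ap - s * aq;
                A[i + q * m] = s * ap + c * aq;
            }
        }
    }
    return off;
}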
