Strassens Algorithm for Tensor Contraction

104 0 0.0 ( 0 )

Download Cite

Added by Jianyu Huang

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Jianyu Huang - Devin A. Matthews - Robert A. van de Geijn

Mathematical Software

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassens algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed up tensor contraction with Strassens algorithm. By adopting a Block-Scatter-Matrix format, a novel matrix-centric tensor layout, we can conceptually view TC as GEMM for a general stride storage, with an implicit tensor-to-matrix transformation. This insight enables us to tailor a recent state-of-the-art implementation of Strassens algorithm to TC, avoiding explicit transpositions (permutations) and extra workspace, and reducing the overhead of memory movement that is incurred. Performance benefits are demonstrated with a performance model as well as in practice on modern single core, multicore, and distributed memory parallel architectures, achieving up to 1.3x speedup. The resulting implementations can serve as a drop-in replacement for various applications with significant speedup.

rate research

Implementing Strassens Algorithm with BLIS

68 - Jianyu Huang , Tyler M. Smith , Greg M. Henry 2016

We dispel with street wisdom regarding the practical implementation of Strassens algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is relatively small (a shape of importance for libraries like LAPACK). Conventional wisdom: it inherently requires substantial workspace. Our implementation requires no workspace beyond buffers already incorporated into conventional high-performance DGEMM implementations. Conventional wisdom: a Strassen DGEMM interface must pass in workspace. Our implementation requires no such workspace and can be plug-compatible with the standard DGEMM interface. Conventional wisdom: it is hard to demonstrate speedup on multi-core architectures. Our implementation demonstrates speedup over conventional DGEMM even on an Intel(R) Xeon Phi(TM) coprocessor utilizing 240 threads. We show how a distributed memory matrix-matrix multiplication also benefits from these advances.

Mathematical Software

Implementing Strassens Algorithm with CUTLASS on NVIDIA Volta GPUs

133 - Jianyu Huang , Chenhan D. Yu , Robert A. van de Geijn 2018

Conventional GPU implementations of Strassens algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, squarish matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11x speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19x speedup with a crossover point at 7,680.

Mathematical Software

Designing Strassens algorithm

185 - Joshua A. Grochow , Cristopher Moore 2017

In 1969, Strassen shocked the world by showing that two n x n matrices could be multiplied in time asymptotically less than $O(n^3)$. While the recursive construction in his algorithm is very clear, the key gain was made by showing that 2 x 2 matrix multiplication could be performed with only 7 multiplications instead of 8. The latter construction was arrived at by a process of elimination and appears to come out of thin air. Here, we give the simplest and most transparent proof of Strassens algorithm that we are aware of, using only a simple unitary 2-design and a few easy lines of calculation. Moreover, using basic facts from the representation theory of finite groups, we use 2-designs coming from group orbits to generalize our construction to all n (although the resulting algorithms arent optimal for n at least 3).

Data Structures and Algorithms Computational Complexity Symbolic Computation

Standards for Graph Algorithm Primitives

334 - Tim Mattson 2014

It is our view that the state of the art in constructing a large collection of graph algorithms in terms of linear algebraic operations is mature enough to support the emergence of a standard set of primitive building blocks. This paper is a position paper defining the problem and announcing our intention to launch an open effort to define this standard.

Mathematical Software Discrete Mathematics Data Structures and Algorithms

Parameterization of tensor network contraction

126 - Bryan OGorman 2019

We present a conceptually clear and algorithmically useful framework for parameterizing the costs of tensor network contraction. Our framework is completely general, applying to tensor networks with arbitrary bond dimensions, open legs, and hyperedges. The fundamental objects of our framework are rooted and unrooted contraction trees, which represent classes of contraction orders. Properties of a contraction tree correspond directly and precisely to the time and space costs of tensor network contraction. The properties of rooted contraction trees give the costs of parallelized contraction algorithms. We show how contraction trees relate to existing tree-like objects in the graph theory literature, bringing to bear a wide range of graph algorithms and tools to tensor network contraction. Independent of tensor networks, we show that the edge congestion of a graph is almost equal to the branchwidth of its line graph.

Data Structures and Algorithms Quantum Physics