Do you want to publish a course? Click here

Faster arbitrary-precision dot product and matrix multiplication

75   0   0.0 ( 0 )
 Added by Fredrik Johansson
 Publication date 2019
and research's language is English




Ask ChatGPT about the research

We present algorithms for real and complex dot product and matrix multiplication in arbitrary-precision floating-point and ball arithmetic. A low-overhead dot product is implemented on the level of GMP limb arrays; it is about twice as fast as previous code in MPFR and Arb at precision up to several hundred bits. Up to 128 bits, it is 3-4 times as fast, costing 20-30 cycles per term for floating-point evaluation and 40-50 cycles per term for balls. We handle large matrix multiplications even more efficiently via blocks of scaled integer matrices. The new methods are implemented in Arb and significantly speed up polynomial operations and linear algebra.



rate research

Read More

As the ratio between the rate of computation and rate with which data can be retrieved from various layers of memory continues to deteriorate, a question arises: Will the current best algorithms for computing matrix-matrix multiplication on future CPUs continue to be (near) optimal? This paper provides compelling analytical and empirical evidence that the answer is no. The analytical results guide us to a new family of algorithms of which the current state-of-the-art Gotos algorithm is but one member. The empirical results, on architectures that were custom built to reduce the amount of bandwidth to main memory, show that under different circumstances, different and particular members of the family become more superior. Thus, this family will likely start playing a prominent role going forward.
287 - Weifeng Liu , Brian Vinter 2015
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. The source code of this work is available at https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR
Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace and modifications to the standard BLAS interface. We propose a code generator framework to automatically implement a large family of FMM algorithms suitable for multiplications of arbitrary matrix sizes and shapes. By representing FMM with a triple of matrices [U,V,W] that capture the linear combinations of submatrices that are formed, we can use the Kronecker product to define a multi-level representation of Strassen-like algorithms. Incorporating the matrix additions that must be performed for Strassen-like algorithms into the inherent packing and micro-kernel operations inside GEMM avoids extra workspace and reduces the cost of memory movement. Adopting the same loop structures as high-performance GEMM implementations allows parallelization of all FMM algorithms with simple but efficient data parallelism without the overhead of task parallelism. We present a simple performance model for general FMM algorithms and compare actual performance of 20+ FMM algorithms to modeled predictions. Our implementations demonstrate a performance benefit over conventional GEMM on single core and multi-core systems. This study shows that Strassen-like fast matrix multiplication can be incorporated into libraries for practical use.
Quaternion symmetry is ubiquitous in the physical sciences. As such, much work has been afforded over the years to the development of efficient schemes to exploit this symmetry using real and complex linear algebra. Recent years have also seen many advances in the formal theoretical development of explicitly quaternion linear algebra with promising applications in image processing and machine learning. Despite these advances, there do not currently exist optimized software implementations of quaternion linear algebra. The leverage of optimized linear algebra software is crucial in the achievement of high levels of performance on modern computing architectures, and thus provides a central tool in the development of high-performance scientific software. In this work, a case will be made for the efficacy of high-performance quaternion linear algebra software for appropriate problems. In this pursuit, an optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix operation. The results of this work pave the path for further development of high-performance quaternion linear algebra software which will improve the performance of the next generation of applicable scientific applications.
120 - Brice Boyer 2009
We propose several new schedules for Strassen-Winograds matrix multiplication algorithm, they reduce the extra memory allocation requirements by three different means: by introducing a few pre-additions, by overwriting the input matrices, or by using a first recursive level of classical multiplication. In particular, we show two fully in-place schedules: one having the same number of operations, if the input matrices can be overwritten; the other one, slightly increasing the constant of the leading term of the complexity, if the input matrices are read-only. Many of these schedules have been found by an implementation of an exhaustive search algorithm based on a pebble game.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا