The BLAS API of BLASFEO: optimizing performance for small matrices

73 0 0.0 ( 0 )

Download Cite

Added by Gianluca Frison

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Gianluca Frison - Tommaso Sartor - Andrea Zanelli

Mathematical Software

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an API which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary. This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this paper investigates the benefits in scientific programming languages such as Octave, SciPy and Julia.

rate research

BLASFEO: basic linear algebra subroutines for embedded optimization

118 - Gianluca Frison , Dimitris Kouzoupis , Tommaso Sartor 2017

BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization. A key difference with respect to existing high-performance implementations of BLAS is that the computational performance is optimized for small to medium scale matrices, i.e., for sizes up to a few hundred. BLASFEO comes with three different implementations: a high-performance implementation aiming at providing the highest performance for matrices fitting in cache, a reference implementation providing portability and embeddability and optimized for very small matrices, and a wrapper to standard BLAS and LAPACK providing high-performance on large matrices. The three implementations of BLASFEO together provide high-performance dense linear algebra routines for matrices ranging from very small to large. Compared to both open-source and proprietary highly-tuned BLAS libraries, for matrices of size up to about one hundred the high-performance implementation of BLASFEO is about 20-30% faster than the corresponding level 3 BLAS routines and 2-3 times faster than the corresponding LAPACK routines.

Mathematical Software

Towards an Efficient Use of the BLAS Library for Multilinear Tensor Contractions

177 - Edoardo Di Napoli 2013

Mathematical operators whose transformation rules constitute the building blocks of a multi-linear algebra are widely used in physics and engineering applications where they are very often represented as tensors. In the last century, thanks to the advances in tensor calculus, it was possible to uncover new research fields and make remarkable progress in the existing ones, from electromagnetism to the dynamics of fluids and from the mechanics of rigid bodies to quantum mechanics of many atoms. By now, the formal mathematical and geometrical properties of tensors are well defined and understood; conversely, in the context of scientific and high-performance computing, many tensor- related problems are still open. In this paper, we address the problem of efficiently computing contractions among two tensors of arbitrary dimension by using kernels from the highly optimized BLAS library. In particular, we establish precise conditions to determine if and when GEMM, the kernel for matrix products, can be used. Such conditions take into consideration both the nature of the operation and the storage scheme of the tensors, and induce a classification of the contractions into three groups. For each group, we provide a recipe to guide the users towards the most effective use of BLAS.

Mathematical Software Discrete Mathematics

BLISlab: A Sandbox for Optimizing GEMM

160 - Jianyu Huang , Robert A. van de Geijn 2016

Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course yet in practice important enough that its implementation on computers continues to be an active research topic. This note describes a set of exercises that use this operation to illustrate how high performance can be attained on modern CPUs with hierarchical memories (multiple caches). It does so by building on the insights that underly the BLAS-like Library Instantiation Software (BLIS) framework by exposing a simplified sandbox that mimics the implementation in BLIS. As such, it also becomes a vehicle for the crowd sourcing of the optimization of BLIS. We call this set of exercises BLISlab.

Mathematical Software

COFFEE: an Optimizing Compiler for Finite Element Local Assembly

181 - Fabio Luporini , Ana Lucia Varbanescu , Florian Rathgeber 2014

The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their op- timization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COF- FEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.

Mathematical Software Computational Engineering Performance

Efficient recursive least squares solver for rank-deficient matrices

101 - Ruben Staub 2021

Updating a linear least squares solution can be critical for near real-time signalprocessing applications. The Greville algorithm proposes a simple formula for updating the pseudoinverse of a matrix A $in$ R nxm with rank r. In this paper, we explicitly derive a similar formula by maintaining a general rank factorization, which we call rank-Greville. Based on this formula, we implemented a recursive least squares algorithm exploiting the rank-deficiency of A, achieving the update of the minimum-norm least-squares solution in O(mr) operations and, therefore, solving the linear least-squares problem from scratch in O(nmr) operations. We empirically confirmed that this algorithm displays a better asymptotic time complexity than LAPACK solvers for rank-deficient matrices. The numerical stability of rank-Greville was found to be comparable to Cholesky-based solvers. Nonetheless, our implementation supports exact numerical representations of rationals, due to its remarkable algebraic simplicity.

Mathematical Software