No Arabic abstract
Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course yet in practice important enough that its implementation on computers continues to be an active research topic. This note describes a set of exercises that use this operation to illustrate how high performance can be attained on modern CPUs with hierarchical memories (multiple caches). It does so by building on the insights that underly the BLAS-like Library Instantiation Software (BLIS) framework by exposing a simplified sandbox that mimics the implementation in BLIS. As such, it also becomes a vehicle for the crowd sourcing of the optimization of BLIS. We call this set of exercises BLISlab.
The numerical solution of partial differential equations using the finite element method is one of the key applications of high performance computing. Local assembly is its characteristic operation. This entails the execution of a problem-specific kernel to numerically evaluate an integral for each element in the discretized problem domain. Since the domain size can be huge, executing efficient kernels is fundamental. Their op- timization is, however, a challenging issue. Even though affine loop nests are generally present, the short trip counts and the complexity of mathematical expressions make it hard to determine a single or unique sequence of successful transformations. Therefore, we present the design and systematic evaluation of COF- FEE, a domain-specific compiler for local assembly kernels. COFFEE manipulates abstract syntax trees generated from a high-level domain-specific language for PDEs by introducing domain-aware composable optimizations aimed at improving instruction-level parallelism, especially SIMD vectorization, and register locality. It then generates C code including vector intrinsics. Experiments using a range of finite-element forms of increasing complexity show that significant performance improvement is achieved.
BLASFEO is a dense linear algebra library providing high-performance implementations of BLAS- and LAPACK-like routines for use in embedded optimization and other applications targeting relatively small matrices. BLASFEO defines an API which uses a packed matrix format as its native format. This format is analogous to the internal memory buffers of optimized BLAS, but it is exposed to the user and it removes the packing cost from the routine call. For matrices fitting in cache, BLASFEO outperforms optimized BLAS implementations, both open-source and proprietary. This paper investigates the addition of a standard BLAS API to the BLASFEO framework, and proposes an implementation switching between two or more algorithms optimized for different matrix sizes. Thanks to the modular assembly framework in BLASFEO, tailored linear algebra kernels with mixed column- and panel-major arguments are easily developed. This BLAS API has lower performance than the BLASFEO API, but it nonetheless outperforms optimized BLAS and especially LAPACK libraries for matrices fitting in cache. Therefore, it can boost a wide range of applications, where standard BLAS and LAPACK libraries are employed and the matrix size is moderate. In particular, this paper investigates the benefits in scientific programming languages such as Octave, SciPy and Julia.
Based on a dynamical model describing how stationary, powerful and self-collimated jets are being launched from a magnetized disk, we build a consistent disk+jet microquasar picture. Our disk is a new type of disk solution called the Jet Emitting Disk (JED), and whose characteristics are directly constrained by the presence of a jet. We assume a one-temperature plasma with thermal particles only. By solving the radiative equilibrium of the disk, we obtain three branches of solutions, a hot and a cold ones (both thermally stable), and an intermediate one, thermally unstable. The hot solution possess the global observed characteristics of what has been often called a corona located above the inner disk region. We present this new disk solution, and how the radiative equilibrium is computed. We discuss the richness of the solution, and show the ability of the model to reproduce an observed spectral energy distribution of XTE J1118+480 with reasonable parameters. We finally outline some perspectives of the model.
We introduce a graphical user interface for constructing arbitrary tensor networks and specifying common operations like contractions or splitting, denoted GuiTeNet. Tensors are represented as nodes with attached legs, corresponding to the ordered dimensions of the tensor. GuiTeNet visualizes the current network, and instantly generates Python/NumPy source code for the hitherto sequence of user actions. Support for additional programming languages is planned for the future. We discuss the elementary operations on tensor networks used by GuiTeNet, together with high-level optimization strategies. The software runs directly in web browsers and is available online at http://guitenet.org.
Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose LightSeq, a highly efficient inference library for models in the Transformer family. LightSeq includes a series of GPU optimization techniques to to streamline the computation of neural layers and to reduce memory footprint. LightSeq can easily import models trained using PyTorch and Tensorflow. Experimental results on machine translation benchmarks show that LightSeq achieves up to 14x speedup compared with TensorFlow and 1.4x compared with FasterTransformer, a concurrent CUDA implementation. The code is available at https://github.com/bytedance/lightseq.