We present a rejection method based on a recursive covering of the probability density function with equal tiles. The approach works for any probability density that is pointwise computable or given as tabular data. By implicitly constructing piecewise constant majorizing and minorizing functions that lie arbitrarily close to the density, random variates can be generated with an arbitrarily small probability of ever evaluating the density, making the method extremely fast. The method handles probability densities with discontinuities (jumps and poles) without user intervention. The setup time is short, largely independent of the shape of the probability density, and linear in the table size. Recently formulated requirements for a general and automatic non-uniform random number generator are exceeded. We give benchmarks against a similar rejection method and against a transformation method.
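To make the table-based rejection idea concrete, here is a minimal Python sketch; it is not the authors' recursive equal-tile construction, but a simplified variant with equal-width tiles that assumes the density is monotone on each tile, so endpoint values yield valid majorizing and minorizing constants. The squeeze test accepts most samples without ever evaluating the density.

    import math
    import random

    def make_tables(f, a, b, n):
        # Piecewise constant hat (majorizing) and squeeze (minorizing)
        # tables over [a, b]; valid when f is monotone on each tile,
        # so the endpoint values bound f from above and below.
        w = (b - a) / n
        edges = [a + i * w for i in range(n + 1)]
        vals = [f(x) for x in edges]
        hat = [max(vals[i], vals[i + 1]) for i in range(n)]
        squeeze = [min(vals[i], vals[i + 1]) for i in range(n)]
        return edges, hat, squeeze

    def sample(f, edges, hat, squeeze):
        n = len(hat)
        while True:
            i = random.choices(range(n), weights=hat)[0]  # tile by hat mass
            x = random.uniform(edges[i], edges[i + 1])
            u = random.uniform(0.0, hat[i])
            if u <= squeeze[i]:   # fast path: accept without evaluating f
                return x
            if u <= f(x):         # slow path: one density evaluation
                return x

    # Example: (unnormalized) half-normal density truncated to [0, 3]
    f = lambda x: math.exp(-0.5 * x * x)
    tables = make_tables(f, 0.0, 3.0, 256)
    xs = [sample(f, *tables) for _ in range(10)]

As the table size n grows, the gap between hat and squeeze shrinks, so the fraction of draws that reach the slow path vanishes; this is the sense in which generation becomes independent of the density's evaluation cost.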
Many one-line transformation methods for generating, for example, Lévy alpha-stable random numbers, which generalize Gaussian ones, and Mittag-Leffler random numbers, which generalize exponential ones, are very fast and satisfactory for most purposes. However, for the class of decreasing probability densities, fast rejection implementations such as the Ziggurat of Marsaglia and Tsang promise a significant further speed-up, provided they can be complemented with a method that samples the tails of the infinite support. This requires the fast generation of random numbers greater or smaller than a given value. We present a method to achieve this, and more generally to generate random numbers within any arbitrary interval. We demonstrate the method by examining the properties of the transformation maps of the distributions mentioned above, as examples of stable and geometric stable random numbers used in the stochastic solution of the space-time fractional diffusion equation.
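The core idea of interval-restricted sampling can be sketched in a few lines of Python, assuming a distribution given by its CDF and inverse CDF (the exponential is used here for concreteness, standing in for the stable and Mittag-Leffler transformation maps studied in the paper): restricting the uniform input of the inverse-CDF map to (F(lo), F(hi)) restricts the output to (lo, hi).

    import math
    import random

    def sample_in_interval(inv_cdf, cdf, lo, hi):
        # Conditional sampling via the inverse-CDF map: a uniform draw
        # from (cdf(lo), cdf(hi)) maps to a variate in (lo, hi).
        u = random.uniform(cdf(lo), cdf(hi))
        return inv_cdf(u)

    # Exponential distribution with rate 1
    cdf = lambda x: 1.0 - math.exp(-x)
    inv_cdf = lambda u: -math.log(1.0 - u)

    tail = sample_in_interval(inv_cdf, cdf, 5.0, math.inf)   # X > 5
    body = sample_in_interval(inv_cdf, cdf, 0.5, 2.0)        # 0.5 < X < 2

Tail sampling for a Ziggurat is the special case hi = infinity, where cdf(hi) = 1.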
High-order optimization methods, including Newton's method and its variants as well as alternating minimization methods, dominate the optimization algorithms for tensor decompositions and tensor networks. These tensor methods are used for data analysis and for the simulation of quantum systems. In this work, we introduce AutoHOOT, the first automatic differentiation (AD) framework targeting high-order optimization for tensor computations. AutoHOOT takes tensor computation expressions as input and generates optimized derivative expressions. In particular, AutoHOOT contains a new explicit Jacobian/Hessian expression generation kernel whose outputs preserve the granularity of the input tensors and are easy to optimize. The expressions are then optimized by both traditional compiler optimization techniques and tensor-algebra-specific transformations. Experimental results show that AutoHOOT achieves competitive CPU and GPU performance for both tensor decomposition and tensor network applications compared to existing AD software and to tensor computation libraries with manually written kernels. The tensor methods generated by AutoHOOT also parallelize well, and we demonstrate good scalability on a distributed-memory supercomputer.
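To illustrate the kind of explicit derivative expression such a framework derives (this NumPy sketch is ours, not AutoHOOT's API), consider the gradient of the CP decomposition residual with respect to one factor matrix, which reduces to Gram-matrix products and an MTTKRP einsum:

    import numpy as np

    def cp_grad_A(T, A, B, C):
        # Gradient of ||T - [[A, B, C]]||_F^2 with respect to A:
        # dL/dA = 2 * (A (B^T B * C^T C) - MTTKRP(T; B, C))
        gram = (B.T @ B) * (C.T @ C)                  # Hadamard of Gram matrices
        mttkrp = np.einsum('ijk,jr,kr->ir', T, B, C)  # matricized tensor times Khatri-Rao
        return 2.0 * (A @ gram - mttkrp)

    rng = np.random.default_rng(0)
    I, J, K, R = 8, 9, 10, 4
    T = rng.standard_normal((I, J, K))
    A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))
    A -= 0.01 * cp_grad_A(T, A, B, C)                 # one gradient step

Note that the gradient stays at the granularity of whole tensor contractions (a Gram product and an einsum) rather than scalar operations, which is what makes such expressions amenable to further algebraic optimization.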
In the prequel to this paper, we presented a systematic framework for processing spline spaces. In this paper, we build on the results of that framework and provide a code generation pipeline that automatically generates efficient implementations of spline spaces. We decompose the final algorithm from Part I and translate the resulting components into LLVM-IR, a low-level language that can be compiled to various targets and architectures. Our design exposes a handful of tunable parameters; this is one of the avenues that gives a practitioner the flexibility to target many different computational architectures and to tune performance on them. We also provide an evaluation of the effect of the different parameters on performance.
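As a toy illustration of the IR-emission step (using llvmlite, which is our assumption for the sketch and not necessarily the authors' toolchain), the following Python snippet builds a trivial scalar kernel as LLVM-IR that LLVM can then compile for any supported target:

    from llvmlite import ir

    # Emit: double axpy(double a, double x, double y) { return a*x + y; }
    module = ir.Module(name="spline_kernels")
    dbl = ir.DoubleType()
    fnty = ir.FunctionType(dbl, (dbl, dbl, dbl))
    fn = ir.Function(module, fnty, name="axpy")
    a, x, y = fn.args
    builder = ir.IRBuilder(fn.append_basic_block(name="entry"))
    builder.ret(builder.fadd(builder.fmul(a, x), y))

    print(module)   # textual LLVM-IR, ready for LLVM's code generators

Emitting IR rather than source code in a fixed language is what allows one pipeline, with a few tuning parameters, to retarget the same spline evaluation algorithm to many architectures.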
The level of abstraction at which application experts reason about linear algebra computations does not match the level of abstraction used by developers of high-performance numerical linear algebra libraries. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter is expressed by the kernels of the BLAS and LAPACK libraries. Unfortunately, translating a high-level computation into an efficient sequence of kernels is a far-from-trivial task that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, their translation algorithms are too simplistic and thus lead to a suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages and exploit the efficiency of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. Across 25 application problems, the code generated by Linnea consistently outperforms Matlab, Julia, Eigen, and Armadillo, with speedups up to and exceeding 10x.
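The gap such a generator closes can be seen on a generalized least-squares expression, b = (X^T M^{-1} X)^{-1} X^T M^{-1} y with M symmetric positive definite (our example, not taken from the paper). A naive translation forms explicit inverses; a kernel-oriented translation, sketched below in Python using SciPy's LAPACK wrappers, factors M once and solves instead:

    import numpy as np
    from scipy.linalg import cho_factor, cho_solve, solve

    def gls(X, M, y):
        # Naive: inv(X.T @ inv(M) @ X) @ X.T @ inv(M) @ y  -- two explicit inverses
        # Better: one Cholesky factorization (POTRF) plus triangular solves (POTRS)
        c = cho_factor(M)
        MinvX = cho_solve(c, X)
        Minvy = cho_solve(c, y)
        return solve(X.T @ MinvX, X.T @ Minvy, assume_a='sym')

    rng = np.random.default_rng(0)
    X = rng.standard_normal((100, 5))
    Q = rng.standard_normal((100, 100))
    M = Q @ Q.T + 100 * np.eye(100)   # SPD covariance
    y = rng.standard_normal(100)
    b = gls(X, M, y)

Choosing such factorizations and kernel sequences automatically, from the high-level expression alone, is precisely the translation task described above.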
Lattice Boltzmann methods are a popular mesoscopic alternative to macroscopic computational fluid dynamics solvers. Many variants have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, or non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different methods and provides a generic development environment for new schemes as well. A high-level domain-specific language allows the user to formulate, extend, and test various lattice Boltzmann schemes. The method specification is represented in a symbolic intermediate representation. Transformations that operate on this intermediate representation optimize and parallelize the method, yielding highly efficient lattice Boltzmann compute kernels not only for single- and two-relaxation-time schemes but also for multi-relaxation-time, cumulant, and entropically stabilized methods. An integration into the HPC framework waLBerla makes massively parallel, distributed simulations possible, which is demonstrated through scaling experiments on the SuperMUC-NG supercomputing system.
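For readers unfamiliar with the numerics involved, the following plain NumPy sketch shows a single-relaxation-time (BGK) D2Q9 collide-and-stream step with periodic boundaries, i.e., the kind of update lbmpy derives symbolically and compiles into optimized kernels; it is our illustration, not lbmpy's API:

    import numpy as np

    # D2Q9 lattice: discrete velocities and weights
    c = np.array([(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
                  (1, 1), (-1, 1), (-1, -1), (1, -1)])
    w = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

    def equilibrium(rho, ux, uy):
        # Second-order Maxwellian expansion used by the BGK collision
        cu = 3.0 * (c[:, 0, None, None] * ux + c[:, 1, None, None] * uy)
        usq = 1.5 * (ux**2 + uy**2)
        return w[:, None, None] * rho * (1 + cu + 0.5 * cu**2 - usq)

    def bgk_step(f, omega):
        # One collide-and-stream update of the populations f[q, x, y]
        rho = f.sum(axis=0)
        ux = (f * c[:, 0, None, None]).sum(axis=0) / rho
        uy = (f * c[:, 1, None, None]).sum(axis=0) / rho
        f = f + omega * (equilibrium(rho, ux, uy) - f)  # SRT collision
        for q in range(9):                               # periodic streaming
            f[q] = np.roll(np.roll(f[q], c[q, 0], axis=0), c[q, 1], axis=1)
        return f

    f = equilibrium(np.ones((32, 32)), np.zeros((32, 32)), np.zeros((32, 32)))
    f = bgk_step(f, omega=1.6)

Replacing the SRT collision above with a multi-relaxation-time, cumulant, or entropically stabilized operator changes only the symbolic collision expression, which is exactly the degree of freedom the package's intermediate representation exposes.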