Lattice Boltzmann methods are a popular mesoscopic alternative to macroscopic computational fluid dynamics solvers. Many variants have been developed that vary in complexity, accuracy, and computational cost. Extensions are available to simulate multi-phase, multi-component, turbulent, or non-Newtonian flows. In this work we present lbmpy, a code generation package that supports a wide variety of different methods and also provides a generic development environment for new schemes. A high-level domain-specific language allows the user to formulate, extend, and test various lattice Boltzmann schemes. The method specification is represented in a symbolic intermediate representation. Transformations that operate on this intermediate representation optimize and parallelize the method, yielding highly efficient lattice Boltzmann compute kernels not only for single- and two-relaxation-time schemes but also for multi-relaxation-time, cumulant, and entropically stabilized methods. An integration into the HPC framework waLBerla makes massively parallel, distributed simulations possible, which is demonstrated through scaling experiments on the SuperMUC-NG supercomputing system.
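To make the underlying update rule concrete, the following is a minimal hand-written NumPy sketch of a single-relaxation-time (BGK) collide-and-stream step on the D2Q9 stencil, the simplest of the schemes named above. This is only an illustrative reference implementation of the textbook method, not lbmpy's API or its generated code, which is derived from a symbolic method description and heavily optimized.

```python
import numpy as np

# D2Q9 stencil: discrete velocities and quadrature weights
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def bgk_step(f, omega):
    """One SRT/BGK collide-and-stream step on a periodic domain.
    f has shape (9, nx, ny); omega is the relaxation rate 1/tau."""
    rho = f.sum(axis=0)                        # density (zeroth moment)
    u = np.einsum('qa,qxy->axy', C, f) / rho   # velocity (first moment)
    cu = np.einsum('qa,axy->qxy', C, u)        # c_q . u per direction
    usq = (u ** 2).sum(axis=0)
    # second-order Maxwellian equilibrium distribution
    feq = W[:, None, None] * rho * (1 + 3 * cu + 4.5 * cu ** 2 - 1.5 * usq)
    f += omega * (feq - f)                     # BGK collision
    for q in range(9):                         # streaming along c_q
        f[q] = np.roll(f[q], shift=tuple(C[q]), axis=(0, 1))
    return f
```

A code generator's job is to turn exactly this kind of specification, written symbolically, into vectorized, parallel kernels without the interpreter overhead and temporary arrays of the sketch above.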
In the first part of this work, we presented a systematic framework for processing spline spaces. In this paper, we build on that framework and provide a code generation pipeline that automatically generates efficient implementations of spline spaces. We decompose the final algorithm from Part I and translate the resulting components into LLVM-IR (a low-level language that can be compiled to various targets/architectures). Our design exposes a handful of parameters for a practitioner to tune; this is one of the avenues that gives us the flexibility to target many different computational architectures and to tune performance on them. We also provide an evaluation of the effect of the different parameters on performance.
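As a point of reference for what such a pipeline ultimately emits, the sketch below evaluates a one-dimensional spline space spanned by shifts of the uniform cubic B-spline, the prototypical compact piecewise-polynomial basis. The function names and the periodic boundary handling are illustrative assumptions; the pipeline described here would compile an equivalent (branch-free, architecture-tuned) evaluator to LLVM-IR rather than interpret it in Python.

```python
import numpy as np

def bspline3(t):
    """Uniform cubic B-spline: a compact piecewise polynomial
    supported on [-2, 2]."""
    t = abs(t)
    if t < 1:
        return (4 - 6 * t**2 + 3 * t**3) / 6
    if t < 2:
        return (2 - t)**3 / 6
    return 0.0

def evaluate(c, x):
    """Evaluate the spline with coefficients c at real coordinate x by
    summing only the four basis functions whose support contains x
    (periodic boundary for simplicity)."""
    i0 = int(np.floor(x))
    return sum(c[(i0 + k) % len(c)] * bspline3(x - (i0 + k))
               for k in range(-1, 3))
```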
We describe a new parallel implementation, mplrs, of the vertex enumeration code lrs that uses the MPI parallel environment and can be run on a network of computers. The implementation is built around a C wrapper that reuses the existing lrs code with only minor modifications. mplrs was derived from the earlier parallel implementation plrs, written by G. Roumanis in C++; plrs uses the Boost library and runs on a shared-memory machine. In developing mplrs we discovered a method of balancing the parallel tree search, called budgeting, that greatly improves parallelization beyond the bottleneck previously encountered at around 32 cores. This method can be readily adapted for use in other reverse search enumeration codes. We also report some preliminary computational results comparing parallel and sequential codes for vertex/facet enumeration problems for convex polyhedra. The problems chosen span the range from simple to highly degenerate polytopes. For most problems tested, the results clearly show the advantage of using the parallel implementation mplrs of the reverse-search-based code lrs, even when as few as 8 cores are available. For some problems, almost linear speedup was observed up to 1200 cores, the largest number of cores tested.
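The budgeting idea can be stated compactly: a worker explores the reverse search tree depth-first but only up to a fixed node budget, and returns the roots of any unexplored subtrees to the master, which re-queues them as fresh jobs. The Python sketch below is a schematic illustration of this load-balancing scheme under assumed names (`children`, `budget`); mplrs itself implements it in C on top of MPI.

```python
from collections import deque

def budgeted_search(root, children, budget):
    """Explore at most `budget` nodes of the subtree at `root`.
    Returns (visited, leftovers); leftovers are unexplored subtree
    roots handed back to the master for redistribution."""
    visited, leftovers, stack = [], [], [root]
    while stack:
        node = stack.pop()
        if budget == 0:
            leftovers.append(node)   # ship this subtree back unexplored
            continue
        budget -= 1
        visited.append(node)
        stack.extend(children(node))
    return visited, leftovers

def master(root, children, budget=5000):
    """Farm out subtrees until the job queue drains; leftovers from
    each budgeted search become new jobs, keeping all workers busy."""
    queue, output = deque([root]), []
    while queue:
        job = queue.popleft()        # in mplrs, dispatched to an MPI worker
        visited, rest = budgeted_search(job, children, budget)
        output.extend(visited)
        queue.extend(rest)
    return output
```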
This paper presents a 55-line code written in Python for 2D and 3D topology optimization (TO) based on FEniCS, an open-source finite element computing platform equipped with various finite element tools and solvers. PETSc is used as the linear algebra back-end, which results in significantly less computational time than standard Python libraries. The code is designed around the popular solid isotropic material with penalization (SIMP) methodology. Extensions to multiple load cases, different boundary conditions, and the incorporation of passive elements are also presented. This makes it the most compact implementation of SIMP-based topology optimization for both 2D and 3D problems. By using a Euclidean distance matrix to vectorize the computation of the weight matrix for the filter, we achieve a substantial reduction in computational time and enable the code to work with complex ground-structure configurations. We also present the code's extension to large-scale topology optimization problems with support for parallel computation on complex structural configurations, which can help students and researchers gain novel insights into the TO problem with dense meshes. The complete code is provided in Appendix A and is also available at https://github.com/iitrabhi/topo-fenics.
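The vectorized filter construction mentioned above is easy to sketch: with all element centroids in one array, SciPy's cdist yields the full Euclidean distance matrix, from which the standard linear ("cone") filter weights of SIMP follow in two array operations. The variable names here are illustrative, not those of the 55-line code.

```python
import numpy as np
from scipy.spatial.distance import cdist

def filter_weights(centers, rmin):
    """Density-filter weight matrix used in SIMP:
    W[i, j] = max(0, rmin - ||x_i - x_j||), row-normalized.
    `centers` is an (n_elements, dim) array of element centroids."""
    D = cdist(centers, centers)        # Euclidean distance matrix
    W = np.maximum(0.0, rmin - D)      # linear (cone) filter kernel
    return W / W.sum(axis=1, keepdims=True)

# Example: filter a density field on a small 2D grid of centroids.
xs, ys = np.meshgrid(np.arange(10) + 0.5, np.arange(5) + 0.5)
centers = np.column_stack([xs.ravel(), ys.ravel()])
H = filter_weights(centers, rmin=1.5)
rho = np.random.rand(centers.shape[0])
rho_filtered = H @ rho                 # smoothed densities
```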
The level of abstraction at which application experts reason about linear algebra computations and the level of abstraction used by developers of high-performance numerical linear algebra libraries do not match. The former is conveniently captured by high-level languages and libraries such as Matlab and Eigen, while the latter is expressed by the kernels included in the BLAS and LAPACK libraries. Unfortunately, the translation from a high-level computation to an efficient sequence of kernels is a far-from-trivial task that requires extensive knowledge of both linear algebra and high-performance computing. Internally, almost all high-level languages and libraries use efficient kernels; however, the translation algorithms are too simplistic and thus lead to suboptimal use of said kernels, with significant performance losses. In order to both achieve the productivity that comes with high-level languages and exploit the efficiency of low-level kernels, we are developing Linnea, a code generator for linear algebra problems. As input, Linnea takes a high-level description of a linear algebra problem and produces as output an efficient sequence of calls to high-performance kernels. In 25 application problems, the code generated by Linnea always outperforms Matlab, Julia, Eigen, and Armadillo, with speedups up to and exceeding 10x.
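The gap such a generator closes can be seen in a small example. For the least-squares-style expression x = (AᵀA)⁻¹Aᵀb, a literal translation forms an explicit inverse, whereas a kernel-oriented sequence builds the Gram matrix and applies a Cholesky factorization with triangular solves. The sketch below uses SciPy's LAPACK wrappers to contrast the two; it illustrates the kind of mapping Linnea searches for automatically and is not Linnea's actual output.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(0)
A = rng.standard_normal((2000, 200))
b = rng.standard_normal(2000)

# Naive, "Matlab-style" translation: forms an explicit inverse.
x_naive = np.linalg.inv(A.T @ A) @ (A.T @ b)

# Kernel-oriented sequence: Gram matrix, then Cholesky factorization
# plus triangular solves (LAPACK POTRF/POTRS underneath).
G = A.T @ A
x_fast = cho_solve(cho_factor(G), A.T @ b)

assert np.allclose(x_naive, x_fast)   # same result, fewer flops
```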
Interpolation is a fundamental technique in scientific computing and is at the heart of many scientific visualization techniques. There is usually a trade-off between the approximation capabilities of an interpolation scheme and its evaluation efficiency. For many applications, it is important for a user to be able to navigate their data in real time; in practice, evaluation efficiency (or speed) outweighs incremental improvements in reconstruction fidelity. In this two-part work, we first analyze, from a general standpoint, the use of compact piecewise-polynomial basis functions to efficiently interpolate data that is sampled on a lattice. In the sequel, we detail how we generate efficient implementations via automatic code generation on both CPU and GPU architectures. Specifically, in this paper, we propose a general framework that can produce a fast evaluation scheme by analyzing the algebro-geometric structure of the convolution sum for a given lattice and basis function combination. We demonstrate the utility and generality of our framework by providing fast implementations of various box splines on the body-centered cubic and face-centered cubic lattices, as well as some non-separable box splines on the Cartesian lattice. We also provide fast implementations for certain Voronoi splines that have not yet appeared in the literature. Finally, we demonstrate that this framework may also be used for non-Cartesian lattices in 4D.
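For reference, the convolution sum at the center of this framework has the standard form (notation as commonly used in the spline literature, not necessarily the paper's):

```latex
f(\mathbf{x}) \;=\; \sum_{\mathbf{n} \in \mathbb{Z}^d} c[\mathbf{n}]\,
\varphi\!\left(\mathbf{x} - L\mathbf{n}\right)
```

where L is a generating matrix of the lattice, c[n] are the sampled coefficients, and φ is the (box-spline or Voronoi-spline) basis function. Because φ has compact support, evaluating f(x) reduces to a finite sum over the lattice sites whose shifted supports contain x, and it is the geometry of those overlapping supports that the analysis exploits.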