No Arabic abstract
One of the greatest efforts of computational scientists is to translate the mathematical model describing a class of physical phenomena into large and complex codes. Many of these codes face the difficulty of implementing the mathematical operations in the model in terms of low level optimized kernels offering both performance and portability. Legacy codes suffer from the additional curse of rigid design choices based on outdated performance metrics (e.g. minimization of memory footprint). Using a representative code from the Materials Science community, we propose a methodology to restructure the most expensive operations in terms of an optimized combination of dense linear algebra kernels. The resulting algorithm guarantees an increased performance and an extended life span of this code enabling larger scale simulations.
In this paper we focus on the integration of high-performance numerical libraries in ab initio codes and the portability of performance and scalability. The target of our work is FLEUR, a software for electronic structure calculations developed in the Forschungszentrum Julich over the course of two decades. The presented work follows up on a previous effort to modernize legacy code by re-engineering and rewriting it in terms of highly optimized libraries. We illustrate how this initial effort to get efficient and portable shared-memory code enables fast porting of the code to emerging heterogeneous architectures. More specifically, we port the code to nodes equipped with multiple GPUs. We divide our study in two parts. First, we show considerable speedups attained by minor and relatively straightforward code changes to off-load parts of the computation to the GPUs. Then, we identify further possible improvements to achieve even higher performance and scalability. On a system consisting of 16-cores and 2 GPUs, we observe speedups of up to 5x with respect to our optimized shared-memory code, which in turn means between 7.5x and 12.5x speedup with respect to the original FLEUR code.
The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an evergrowing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using largescale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
We present a successful deployment of high-fidelity Large-Eddy Simulation (LES) technologies based on spectral/hp element methods to industrial flow problems, which are characterized by high Reynolds numbers and complex geometries. In particular, we describe the numerical methods, software development and steps that were required to perform the implicit LES of a real automotive car, namely the Elemental Rp1 model. To the best of the authors knowledge, this simulation represents the first fifth-order accurate transient LES of an entire real car geometry. Moreover, this constitutes a key milestone towards considerably expanding the computational design envelope currently allowed in industry, where steady-state modelling remains the standard. To this end, a number of novel developments had to be made in order to overcome obstacles in mesh generation and solver technology to achieve this simulation, which we detail in this paper. The main objective is to present to the industrial and applied mathematics community, a viable pathway to translate academic developments into industrial tools, that can substantially advance the analysis and design capabilities of high-end engineering stakeholders. The novel developments and results were achieved using the academic-driven open-source framework Nektar++.
Additive Runge-Kutta methods designed for preserving highly accurate solutions in mixed-precision computation were proposed and analyzed in [8]. These specially designed methods use reduced precision or the implicit computations and full precision for the explicit computations. We develop a FORTRAN code to solve a nonlinear system of ordinary differential equations using the mixed precision additive Runge-Kutta (MP-ARK) methods on IBM POWER9 and Intel x86_64 chips. The convergence, accuracy, runtime, and energy consumption of these methods is explored. We show that these MP-ARK methods efficiently produce accurate solutions with significant reductions in runtime (and by extension energy consumption).
The serverless computing model strengthens the cloud computing tendency to abstract resource management. Serverless platforms are responsible for deploying and scaling the developers applications. Serverless also incorporated the pay-as-you-go billing model, which only considers the time spent processing client requests. Such a decision created a natural incentive for improving the platforms efficient resource usage. This search for efficiency can lead to the cold start problem, which represents a delay to execute serverless applications. Among the solutions proposed to deal with the cold start, those based on the snapshot method stand out. Despite the rich exploration of the technique, there is a lack of research that evaluates the solutions trade-offs. In this direction, this work compares two solutions to mitigate the cold start: Prebaking and SEUSS. We analyzed the solutions performance with functions of different levels of complexity: NoOp, a function that renders Markdown to HTML, and a function that loads 41 MB of dependencies. Preliminary results indicated that Prebaking showed a 33% and 25% superior performance to startup the NoOp and Markdown functions, respectively. Further analysis also revealed that Prebakings warmup mechanism reduced the Markdown first request processing time by 69%.