ترغب بنشر مسار تعليمي؟ اضغط هنا

The use of composable abstractions allows the application of new and established algorithms to a wide range of problems while automatically inheriting the benefits of well-known performance optimisations. This work highlights the composition of the P ETSc DMPlex domain topology abstraction with the Firedrake automated finite element system to create a PDE solving environment that combines expressiveness, flexibility and high performance. We describe how Firedrake utilises DMPlex to provide the indirection maps required for finite element assembly, while supporting various mesh input formats and runtime domain decomposition. In particular, we describe how DMPlex and its accompanying data structures allow the generic creation of user-defined discretisations, while utilising data layout optimisations that improve cache coherency and ensure overlapped communication during assembly computation.
We present a simple mathematical framework and API for parallel mesh and data distribution, load balancing, and overlap generation. It relies on viewing the mesh as a Hasse diagram, abstracting away information such as cell shape, dimension, and coor dinates. The high level of abstraction makes our interface both concise and powerful, as the same algorithm applies to any representable mesh, such as hybrid meshes, meshes embedded in higher dimension, and overlapped meshes in parallel. We present evidence, both theoretical and experimental, that the algorithms are scalable and efficient. A working implementation can be found in the latest release of the PETSc libraries.
256 - Bruno Grenet 2015
In this paper, we report on an implementation in the free software Mathemagix of lacunary factorization algorithms, distributed as a library called Lacunaryx. These algorithms take as input a polynomial in sparse representation, that is as a list of nonzero monomials, and an integer $d$, and compute its irreducible degree-$le d$ factors. The complexity of these algorithms is polynomial in the sparse size of the input polynomial and $d$.
Study of general purpose computation by GPU (Graphics Processing Unit) can improve the image processing capability of micro-computer system. This paper studies the parallelism of the different stages of decimation in time radix 2 FFT algorithm, desig ns the butterfly and scramble kernels and implements 2D FFT on GPU. The experiment result demonstrates the validity and advantage over general CPU, especially in the condition of large input size. The approach can also be generalized to other transforms alike.
We give a Descartes-like bound on the number of positive solutions to a system of fewnomials that holds when its exponent vectors are not in convex position and a sign condition is satisfied. This was discovered while developing algorithms and softwa re for computing the Gale transform of a fewnomial system, which is our main goal. This software is a component of a package we are developing for Khovanskii-Rolle continuation, which is a numerical algorithm to compute the real solutions to a system of fewnomials.
This paper shows an application of the theory of sorting networks to facilitate the synthesis of optimized general purpose sorting libraries. Standard sorting libraries are often based on combinations of the classic Quicksort algorithm with insertion sort applied as the base case for small fixed numbers of inputs. Unrolling the code for the base case by ignoring loop conditions eliminates branching and results in code which is equivalent to a sorting network. This enables the application of further program transformations based on sorting network optimizations, and eventually the synthesis of code from sorting networks. We show that if considering the number of comparisons and swaps then theory predicts no real advantage of this approach. However, significant speed-ups are obtained when taking advantage of instruction level parallelism and non-branching conditional assignment instructions, both of which are common in modern CPU architectures. We provide empirical evidence that using code synthesized from efficient sorting networks as the base case for Quicksort libraries results in significant real-world speed-ups.
122 - Weifeng Liu , Brian Vinter 2015
Sparse matrix-vector multiplication (SpMV) is a central building block for scientific software and graph applications. Recently, heterogeneous processors composed of different types of cores attracted much attention because of their flexible core con figuration and high energy efficiency. In this paper, we propose a compressed sparse row (CSR) format based SpMV algorithm utilizing both types of cores in a CPU-GPU heterogeneous processor. We first speculatively execute segmented sum operations on the GPU part of a heterogeneous processor and generate a possibly incorrect results. Then the CPU part of the same chip is triggered to re-arrange the predicted partial sums for a correct resulting vector. On three heterogeneous processors from Intel, AMD and nVidia, using 20 sparse matrices as a benchmark suite, the experimental results show that our method obtains significant performance improvement over the best existing CSR-based SpMV algorithms. The source code of this work is downloadable at https://github.com/bhSPARSE/Benchmark_SpMV_using_CSR
212 - Weifeng Liu , Brian Vinter 2015
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an effici ent parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. The source code of this work is available at https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR
Spectral deferred corrections (SDC) is an iterative approach for constructing higher- order accurate numerical approximations of ordinary differential equations. SDC starts with an initial approximation of the solution defined at a set of Gaussian or spectral collocation nodes over a time interval and uses an iterative application of lower-order time discretizations applied to a correction equation to improve the solution at these nodes. Each deferred correction sweep increases the formal order of accuracy of the method up to the limit inherent in the accuracy defined by the collocation points. In this paper, we demonstrate that SDC is well suited to recovering from soft (transient) hardware faults in the data. A strategy where extra correction iterations are used to recover from soft errors and provide algorithmic resilience is proposed. Specifically, in this approach the iteration is continued until the residual (a measure of the error in the approximation) is small relative to the residual on the first correction iteration and changes slowly between successive iterations. We demonstrate the effectiveness of this strategy for both canonical test problems and a comprehen- sive situation involving a mature scientific application code that solves the reacting Navier-Stokes equations for combustion research.
Automatic and adaptive approximation, optimization, or integration of functions in a cone with guarantee of accuracy is a relatively new paradigm. Our purpose is to create an open-source MATLAB package, Guaranteed Automatic Integration Library (GAIL) , following the philosophy of reproducible research and sustainable practices of robust scientific software development. For our conviction that true scholarship in computational sciences are characterized by reliable reproducibility, we employ the best practices in mathematical research and software engineering known to us and available in MATLAB. This document describes the key features of functions in GAIL, which includes one-dimensional function approximation and minimization using linear splines, one-dimensional numerical integration using trapezoidal rule, and last but not least, mean estimation and multidimensional integration by Monte Carlo methods or Quasi Monte Carlo methods.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا