New community

Subscribe to the gold package and get unlimited access to Shamra Academy

A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW

187 0 0.0 ( 0 )

Download Cite

Added by Edoardo Di Napoli

Publication date 2013

fields Informatics Engineering

and research's language is English

Authors Mario Berljafa Department ofn Mathematics

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In one of the most important methods in Density Functional Theory - the Full-Potential Linearized Augmented Plane Wave (FLAPW) method - dense generalized eigenproblems are organized in long sequences. Moreover each eigenproblem is strongly correlated to the next one in the sequence. We propose a novel approach which exploits such correlation through the use of an eigensolver based on subspace iteration and accelerated with Chebyshev polynomials. The resulting solver, parallelized using the Elemental library framework, achieves excellent scalability and is competitive with current dense parallel eigensolvers.

rate research

Parallel GPU Implementation of Iterative PCA Algorithms

416 - M. Andrecut 2008

Principal component analysis (PCA) is a key statistical technique for multivariate data analysis. For large data sets the common approach to PCA computation is based on the standard NIPALS-PCA algorithm, which unfortunately suffers from loss of orthogonality, and therefore its applicability is usually limited to the estimation of the first few components. Here we present an algorithm based on Gram-Schmidt orthogonalization (called GS-PCA), which eliminates this shortcoming of NIPALS-PCA. Also, we discuss the GPU (Graphics Processing Unit) parallel implementation of both NIPALS-PCA and GS-PCA algorithms. The numerical results show that the GPU parallel optimize

Quantitative Methods Mathematical Software Computational Physics

Programming Parallel Dense Matrix Factorizations with Look-Ahead and OpenMP

157 - Sandra Catalan , Adrian Castello , Francisco D. Igual 2018

We investigate a parallelization strategy for dense matrix factorization (DMF) algorithms, using OpenMP, that departs from the legacy (or conventional) solution, which simply extracts concurrency from a multithreaded version of BLAS. This approach is also different from the more sophisticated runtime-assisted implementations, which decompose the operation into tasks and identify dependencies via directives and runtime support. Instead, our strategy attains high performance by explicitly embedding a static look-ahead technique into the DMF code, in order to overcome the performance bottleneck of the panel factorization, and realizing the trailing update via a cache-aware multi-threaded implementation of the BLAS. Although the parallel algorithms are specified with a highlevel of abstraction, the actual implementation can be easily derived from them, paving the road to deriving a high performance implementation of a considerable fraction of LAPACK functionality on any multicore platform with an OpenMP-like runtime.

Distributed Parallel and Cluster Computing Mathematical Software

Fast Strassen-based $A^t A$ Parallel Multiplication

313 - Viviana Arrigoni , Annalisa Massini 2019

Matrix multiplication $A^t A$ appears as intermediate operation during the solution of a wide set of problems. In this paper, we propose a new cache-oblivious algorithm for the $A^t A$ multiplication. Our algorithm, A$scriptstyle mathsf{T}$A, calls classical Strassens algorithm as sub-routine, decreasing the computational cost %(expressed in number of performed products) of the conventional $A^t A$ multiplication to $frac{2}{7}n^{log_2 7}$. It works for generic rectangular matrices and exploits the peculiar symmetry of the resulting product matrix for sparing memory. We used the MPI paradigm to implement A$scriptstyle mathsf{T}$A in parallel, and we tested its performances on a small subset of nodes of the Galileo cluster. Experiments highlight good scalability and speed-up, also thanks to minimal number of exchanged messages in the designed communication system. Parallel overhead and inherently sequential time fraction are negligible in the tested configurations.

Distributed Parallel and Cluster Computing Mathematical Software

Sparse matrix-vector multiplication on GPGPU clusters: A new storage format and a scalable implementation

159 - Moritz Kreutzer , Georg Hager , Gerhard Wellein 2011

Sparse matrix-vector multiplication (spMVM) is the dominant operation in many sparse solvers. We investigate performance properties of spMVM with matrices of various sparsity patterns on the nVidia Fermi class of GPGPUs. A new padded jagged diagonals storage (pJDS) format is proposed which may substantially reduce the memory overhead intrinsic to the widespread ELLPACK-R scheme. In our test scenarios the pJDS format cuts the overall spMVM memory footprint on the GPGPU by up to 70%, and achieves 95% to 130% of the ELLPACK-R performance. Using a suitable performance model we identify performance bottlenecks on the node level that invalidate some types of matrix structures for efficient multi-GPGPU parallelization. For appropriate sparsity patterns we extend previous work on distributed-memory parallel spMVM to demonstrate a scalable hybrid MPI-GPGPU code, achieving efficient overlap of communication and computation.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

A Massively Parallel Algorithm for the Approximate Calculation of Inverse p-th Roots of Large Sparse Matrices

75 - Michael Lass , Stephan Mohr , Hendrik Wiebeler 2017

We present the submatrix method, a highly parallelizable method for the approximate calculation of inverse p-th roots of large sparse symmetric matrices which are required in different scientific applications. We follow the idea of Approximate Computing, allowing imprecision in the final result in order to be able to utilize the sparsity of the input matrix and to allow massively parallel execution. For an n x n matrix, the proposed algorithm allows to distribute the calculations over n nodes with only little communication overhead. The approximate result matrix exhibits the same sparsity pattern as the input matrix, allowing for efficient reuse of allocated data structures. We evaluate the algorithm with respect to the error that it introduces into calculated results, as well as its performance and scalability. We demonstrate that the error is relatively limited for well-conditioned matrices and that results are still valuable for error-resilient applications like preconditioning even for ill-conditioned matrices. We discuss the execution time and scaling of the algorithm on a theoretical level and present a distributed implementation of the algorithm using MPI and OpenMP. We demonstrate the scalability of this implementation by running it on a high-performance compute cluster comprised of 1024 CPU cores, showing a speedup of 665x compared to single-threaded execution.

Distributed Parallel and Cluster Computing Mathematical Software Numerical Analysis

comments

Fetching comments

Syrian International University for Science and Technology

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

A Parallel and Scalable Iterative Solver for Sequences of Dense Eigenproblems Arising in FLAPW

Ask ChatGPT about the research

No Arabic abstract

Read More