ترغب بنشر مسار تعليمي؟ اضغط هنا

Batched QR and SVD Algorithms on GPUs with Applications in Hierarchical Matrix Compression

118   0   0.0 ( 0 )
 نشر من قبل Wajih Halim Boukaram
 تاريخ النشر 2017
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

We present high performance implementations of the QR and the singular value decomposition of a batch of small matrices hosted on the GPU with applications in the compression of hierarchical matrices. The one-sided Jacobi algorithm is used for its simplicity and inherent parallelism as a building block for the SVD of low rank blocks using randomized methods. We implement multiple kernels based on the level of the GPU memory hierarchy in which the matrices can reside and show substantial speedups against streamed cuSOLVER SVDs. The resulting batched routine is a key component of hierarchical matrix compression, opening up opportunities to perform H-matrix arithmetic efficiently on GPUs.

قيم البحث

اقرأ أيضاً

Hierarchical matrices are space and time efficient representations of dense matrices that exploit the low rank structure of matrix blocks at different levels of granularity. The hierarchically low rank block partitioning produces representations that can be stored and operated on in near-linear complexity instead of the usual polynomial complexity of dense matrices. In this paper, we present high performance implementations of matrix vector multiplication and compression operations for the $mathcal{H}^2$ variant of hierarchical matrices on GPUs. This variant exploits, in addition to the hierarchical block partitioning, hierarchical bases for the block representations and results in a scheme that requires only $O(n)$ storage and $O(n)$ complexity for the mat-vec and compression kernels. These two operations are at the core of algebraic operations for hierarchical matrices, the mat-vec being a ubiquitous operation in numerical algorithms while compression/recompression represents a key building block for other algebraic operations, which require periodic recompression during execution. The difficulties in developing efficient GPU algorithms come primarily from the irregular tree data structures that underlie the hierarchical representations, and the key to performance is to recast the computations on flattened trees in ways that allow batched linear algebra operations to be performed. This requires marshaling the irregularly laid out data in a way that allows them to be used by the batched routines. Marshaling operations only involve pointer arithmetic with no data movement and as a result have minimal overhead. Our numerical results on covariance matrices from 2D and 3D problems from spatial statistics show the high efficiency our routines achieve---over 550GB/s for the bandwidth-limited mat-vec and over 850GFLOPS/s in sustained performance for the compression on the P100 Pascal GPU.
218 - Weifeng Liu , Brian Vinter 2015
General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method (AMG), breadth first search and shortest path problem. Compared to other sparse BLAS routines, an effici ent parallel SpGEMM implementation has to handle extra irregularity from three aspects: (1) the number of nonzero entries in the resulting sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the resulting sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. In this work we propose a framework for SpGEMM on GPUs and emerging CPU-GPU heterogeneous processors. This framework particularly focuses on the above three problems. Memory pre-allocation for the resulting matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art CPU and GPU SpGEMM methods, our approach delivers excellent absolute performance and relative speedups on various benchmarks multiplying matrices with diverse sparsity structures. Furthermore, on heterogeneous processors, our SpGEMM approach achieves higher throughput by using re-allocatable shared virtual memory. The source code of this work is available at https://github.com/bhSPARSE/Benchmark_SpGEMM_using_CSR
In addition to hardware wall-time restrictions commonly seen in high-performance computing systems, it is likely that future systems will also be constrained by energy budgets. In the present work, finite difference algorithms of varying computationa l and memory intensity are evaluated with respect to both energy efficiency and runtime on an Intel Ivy Bridge CPU node, an Intel Xeon Phi Knights Landing processor, and an NVIDIA Tesla K40c GPU. The conventional way of storing the discretised derivatives to global arrays for solution advancement is found to be inefficient in terms of energy consumption and runtime. In contrast, a class of algorithms in which the discretised derivatives are evaluated on-the-fly or stored as thread-/process-local variables (yielding high compute intensity) is optimal both with respect to energy consumption and runtime. On all three hardware architectures considered, a speed-up of ~2 and an energy saving of ~2 are observed for the high compute intensive algorithms compared to the memory intensive algorithm. The energy consumption is found to be proportional to runtime, irrespective of the power consumed and the GPU has an energy saving of ~5 compared to the same algorithm on a CPU node.
Conventional GPU implementations of Strassens algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively la rge, squarish matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen primitives for GPUs that can be composed to generate a family of Strassen algorithms. Our algorithms utilize both the memory and thread hierarchies on GPUs, reusing shared memory and register files inherited from GEMM, fusing additional operations, and avoiding extra workspace. We further exploit intra- and inter-kernel parallelism by batching, streaming, and employing atomic operations. We also develop a performance model for NVIDIA Volta GPUs to select the appropriate blocking parameters and predict the performance for GEMM and Strassen. Overall, our 1-level Strassen can achieve up to 1.11x speedup with a crossover point as small as 1,536 compared to cublasSgemm on a NVIDIA Tesla V100 GPU. With additional workspace, our 2-level Strassen can achieve 1.19x speedup with a crossover point at 7,680.
Nonsmooth Nonnegative Matrix Factorization (nsNMF) is capable of producing more localized, less overlapped feature representations than other variants of NMF while keeping satisfactory fit to data. However, nsNMF as well as other existing NMF methods is incompetent to learn hierarchical features of complex data due to its shallow structure. To fill this gap, we propose a deep nsNMF method coined by the fact that it possesses a deeper architecture compared with standard nsNMF. The deep nsNMF not only gives parts-based features due to the nonnegativity constraints, but also creates higher-level, more abstract features by combing lower-level ones. The in-depth description of how deep architecture can help to efficiently discover abstract features in dnsNMF is presented. And we also show that the deep nsNMF has close relationship with the deep autoencoder, suggesting that the proposed model inherits the major advantages from both deep learning and NMF. Extensive experiments demonstrate the standout performance of the proposed method in clustering analysis.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا