Research papers and master's and doctoral theses published by Guangming Tan

SMAT: An Input Adaptive Sparse Matrix-Vector Multiplication Auto-Tuner

136 - Jiajia Li, Xiuxia Zhang, Guangming Tan 2012

Sparse matrix-vector multiplication (SpMV) is an important kernel in scientific and engineering applications. Previous optimizations are specific to a sparse matrix format and leave the choice of the best format to application programmers. In this work we develop an auto-tuning framework to bridge the gap between the specific optimized kernels and their general-purpose use. We propose an SpMV auto-tuner (SMAT) that provides a unified interface based on compressed sparse row (CSR) to programmers by implicitly choosing the best format and the fastest implementation for any input sparse matrix at runtime. SMAT leverages a data mining model, formulated from a set of performance parameters extracted from 2373 matrices in the UF sparse matrix collection, to quickly search for the best combination. The experiments show that SMAT achieves a maximum performance of 75 GFLOP/s in single precision and 33 GFLOP/s in double precision on Intel, and 41 GFLOP/s in single precision and 34 GFLOP/s in double precision on AMD. Compared with the sparse functions in the MKL library, SMAT runs more than 3 times faster.
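To make the CSR-based interface mentioned in the abstract concrete, here is a minimal sketch of a reference SpMV kernel over a CSR matrix. It is not SMAT's implementation and does no format selection or tuning; the function name `csr_spmv` and the argument names `row_ptr`, `col_idx`, and `values` are assumptions chosen for illustration.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form.

    row_ptr[i]..row_ptr[i + 1] delimits the nonzeros of row i;
    col_idx[k] and values[k] give the column and value of nonzero k.
    (Illustrative names, not taken from the paper.)
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        acc = 0.0
        # Walk only the stored nonzeros of row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y[i] = acc
    return y


# Example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]

y = csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

An auto-tuner of the kind the abstract describes would sit behind an interface like this and dispatch to a different storage format or kernel depending on the input matrix; that selection logic is not shown here.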

Mathematical Software; Distributed, Parallel, and Cluster Computing

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

157 - Huiwei Lv, Guangming Tan, Mingyu Chen 2012

For the parallel breadth-first search (BFS) algorithm on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper we substantially reduce the communication cost in distributed BFS by compressing and sieving the messages. First, we leverage a bitmap compression algorithm to reduce the size of messages before communication. Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages. Experiments on a 6,144-core SMP cluster show that our algorithm outperforms the baseline implementation in Graph500 by 2.2 times, reduces its communication time by 79.0%, and achieves a performance rate of 12.1 GTEPS (billion edge visits per second).
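The abstract does not say which bitmap compression algorithm is used, so the following is only a sketch of the underlying idea: a set of vertex IDs in a message can be packed as one bit per vertex instead of one integer per vertex. This is a plain, uncompressed bitmap, and the function names `encode_bitmap` and `decode_bitmap` are hypothetical. The cross directory sieve is not sketched here because the abstract gives no details of it.

```python
def encode_bitmap(vertex_ids, num_vertices):
    """Pack a collection of vertex IDs into a bitmap, one bit per vertex.

    Assumes every ID is in range(num_vertices). Duplicate IDs collapse
    into a single set bit.
    """
    bitmap = bytearray((num_vertices + 7) // 8)
    for v in vertex_ids:
        bitmap[v >> 3] |= 1 << (v & 7)
    return bytes(bitmap)


def decode_bitmap(bitmap):
    """Recover the sorted list of vertex IDs whose bits are set."""
    ids = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (1 << bit):
                ids.append(byte_index * 8 + bit)
    return ids


frontier = [3, 8, 9, 63, 8]           # IDs to send, with one duplicate
message = encode_bitmap(frontier, 64)  # 64 vertices fit in 8 bytes
received = decode_bitmap(message)
```

A bitmap like this only pays off when the set of IDs is dense relative to the vertex range; for sparse sets a list of IDs is smaller, which is one reason real implementations apply a further compression step.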

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms
