Deriving Correct High-Performance Algorithms

68 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Devangi Parikh

تاريخ النشر 2017

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Devangi N. Parikh - Maggie E. Myers - Robert A. van de Geijn

البرمجيات الرياضية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Dijkstra observed that verifying correctness of a program is difficult and conjectured that derivation of a program hand-in-hand with its proof of correctness was the answer. We illustrate this goal-oriented approach by applying it to the domain of dense linear algebra libraries for distributed memory parallel computers. We show that algorithms that underlie the implementation of most functionality for this domain can be systematically derived to be correct. The benefit is that an entire family of algorithms for an operation is discovered so that the best algorithm for a given architecture can be chosen. This approach is very practical: Ideas inspired by it have been used to rewrite the dense linear algebra software stack starting below the Basic Linear Algebra Subprograms (BLAS) and reaching up through the Elemental distributed memory library, and every level in between. The paper demonstrates how formal methods and rigorous mathematical techniques for correctness impact HPC.

قيم البحث

76 - Xiaohui Wang , Ying Xiong , Yang Wei 2020

Transformer, BERT and their variants have achieved great success in natural language processing. Since Transformer models are huge in size, serving these models is a challenge for real industrial applications. In this paper, we propose LightSeq, a hi ghly efficient inference library for models in the Transformer family. LightSeq includes a series of GPU optimization techniques to to streamline the computation of neural layers and to reduce memory footprint. LightSeq can easily import models trained using PyTorch and Tensorflow. Experimental results on machine translation benchmarks show that LightSeq achieves up to 14x speedup compared with TensorFlow and 1.4x compared with FasterTransformer, a concurrent CUDA implementation. The code is available at https://github.com/bytedance/lightseq.

البرمجيات الرياضية التعلم الآلي

COMET: A Domain-Specific Compilation of High-Performance Computational Chemistry

49 - Erdal Mutlu , Ruiqin Tian , Bin Ren 2021

The computational power increases over the past decades havegreatly enhanced the ability to simulate chemical reactions andunderstand ever more complex transformations. Tensor contractions are the fundamental computational building block of these sim ulations. These simulations have often been tied to one platform and restricted in generality by the interface provided to the user. The expanding prevalence of accelerators and researcher demands necessitate a more general approach which is not tied to specific hardware or requires contortion of algorithms to specific hardware platforms. In this paper we present COMET, a domain-specific programming language and compiler infrastructure for tensor contractions targeting heterogeneous accelerators. We present a system of progressive lowering through multiple layers of abstraction and optimization that achieves up to 1.98X speedup for 30 tensor contractions commonly used in computational chemistry and beyond.

البرمجيات الرياضية

On the Efficacy and High-Performance Implementation of Quaternion Matrix Multiplication

60 - David Williams-Young , Xiaosong Li 2019

Quaternion symmetry is ubiquitous in the physical sciences. As such, much work has been afforded over the years to the development of efficient schemes to exploit this symmetry using real and complex linear algebra. Recent years have also seen many a dvances in the formal theoretical development of explicitly quaternion linear algebra with promising applications in image processing and machine learning. Despite these advances, there do not currently exist optimized software implementations of quaternion linear algebra. The leverage of optimized linear algebra software is crucial in the achievement of high levels of performance on modern computing architectures, and thus provides a central tool in the development of high-performance scientific software. In this work, a case will be made for the efficacy of high-performance quaternion linear algebra software for appropriate problems. In this pursuit, an optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix operation. The results of this work pave the path for further development of high-performance quaternion linear algebra software which will improve the performance of the next generation of applicable scientific applications.

البرمجيات الرياضية

CUDACLAW: A high-performance programmable GPU framework for the solution of hyperbolic PDEs

114 - H. Gorune Ohannessian , George Turkiyyah , Aron Ahmadia 2018

We present cudaclaw, a CUDA-based high performance data-parallel framework for the solution of multidimensional hyperbolic partial differential equation (PDE) systems, equations describing wave motion. cudaclaw allows computational scientists to solv e such systems on GPUs without being burdened by the need to write CUDA code, worry about thread and block details, data layout, and data movement between the different levels of the memory hierarchy. The user defines the set of PDEs to be solved via a CUDA- independent serial Riemann solver and the framework takes care of orchestrating the computations and data transfers to maximize arithmetic throughput. cudaclaw treats the different spatial dimensions separately to allow suitable block sizes and dimensions to be used in the different directions, and includes a number of optimizations to minimize access to global memory.

البرمجيات الرياضية التحليل العددي التحليل العددي

Accelerating Scientific Computations with Mixed Precision Algorithms

120 - Marc Baboulin , Alfredo Buttari , Jack Dongarra 2008

On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse lin ear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.

البرمجيات الرياضية

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

الجامعة الإسلامية في لبنان

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Deriving Correct High-Performance Algorithms

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً