Accelerating GPU-Based Out-of-Core Stencil Computation with On-the-Fly Compression

Published by Jingcheng Shen

Publication date: 2021

Research field: Informatics Engineering

Paper language: English

Authors: Jingcheng Shen, Yifan Wu, Masao Okita

Distributed, Parallel, and Cluster Computing

Abstract

Stencil computation is an important class of scientific applications that can be efficiently executed by graphics processing units (GPUs). The out-of-core approach helps run large-scale stencil codes that process data larger than the limited capacity of GPU memory. However, the performance of GPU-based out-of-core stencil computation is always limited by the data transfer between the CPU and GPU. Many optimizations have been explored to reduce such data transfer, but the use of on-the-fly compression techniques has not been studied sufficiently. In this study, we propose a method that accelerates GPU-based out-of-core stencil computation with on-the-fly compression. We introduce a novel data compression approach that solves the data dependency between two contiguous decomposed data blocks. We also modify a widely used GPU-based compression library to support pipelining that overlaps CPU/GPU data transfer with GPU computation. Experimental results show that the proposed method achieved a speedup of 1.2x compared with the method without compression. Moreover, although the precision loss introduced by compression increased with the number of time steps, the precision loss was trivial up to 4,320 time steps, demonstrating the usefulness of the proposed method.
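To make the block decomposition concrete, here is a minimal CPU-only sketch of one out-of-core time step on a 1D 3-point stencil. It is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the "transfer" is only a compress/decompress round trip, and lossless `zlib` stands in for the GPU compression library, whereas the abstract reports a precision loss that implies lossy compression.

```python
import struct
import zlib


def stencil_step(u):
    """One 3-point averaging step; the two end cells are held fixed."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def pack(values):
    """Stand-in for compression before a CPU-to-GPU copy (lossless zlib, an assumption)."""
    return zlib.compress(struct.pack(f"{len(values)}d", *values))


def unpack(blob, count):
    """Stand-in for decompression on the device side."""
    return list(struct.unpack(f"{count}d", zlib.decompress(blob)))


def out_of_core_step(u, block_size):
    """Advance the whole domain one step, one block plus halo at a time."""
    n = len(u)
    out = list(u)
    for start in range(1, n - 1, block_size):
        stop = min(start + block_size, n - 1)
        # Each block depends on one neighbouring cell per side (the halo),
        # which is the data dependency between contiguous blocks.
        chunk = u[start - 1:stop + 1]
        blob = pack(chunk)                    # "host": compress before transfer
        on_device = unpack(blob, len(chunk))  # "device": decompress after transfer
        updated = stencil_step(on_device)
        out[start:stop] = updated[1:-1]       # keep only the block's own cells
    return out


grid = [float((i * i) % 7) for i in range(20)]
print(out_of_core_step(grid, block_size=4) == stencil_step(grid))
```

Because the stand-in compressor is lossless, the blocked result matches the whole-domain result exactly. With a lossy compressor the two would differ slightly, and that difference would accumulate over time steps.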

Vitor Hugo Mickus Rodrigues, Lucas Cavalcante, Maelso Bruno Pereira (2019)

The growth of data to be processed in the Oil & Gas industry matches the requirements imposed by evolving algorithms based on stencil computations, such as Full Waveform Inversion and Reverse Time Migration. Graphical processing units (GPUs) are an attractive architectural target for stencil computations because of their high degree of data parallelism. However, the rapid architectural and technological progression makes it difficult for even the most proficient programmers to remain up-to-date with the technological advances at a micro-architectural level. In this work, we present an extension for Devito, an open-source compiler designed to produce highly optimized finite difference kernels for use in inversion methods. We embed it with the Oxford Parallel Domain Specific Language (OP-DSL) in order to enable automatic code generation for GPU architectures from a high-level representation. We aim to enable users coding at a symbolic representation level to effortlessly benefit from the processing capacity of GPU architectures. The implemented backend is evaluated on an NVIDIA GTX Titan Z and on an NVIDIA Tesla V100 in terms of operational intensity through the roofline model for varying space-order discretization levels of 3D acoustic isotropic wave propagation stencil kernels with and without symbolic optimizations. It achieves approximately 63% of the V100's peak performance and 24% of the Titan Z's peak performance for stencil kernels over grids with 256 points. Our study reveals that improving memory usage should be the most efficient strategy for leveraging the performance of the implemented solution on the evaluated architectures.
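The roofline model mentioned above bounds a kernel's attainable performance by the smaller of the machine's peak compute rate and its memory bandwidth multiplied by the kernel's operational intensity. The sketch below shows that formula only. The function name and the numbers are hypothetical and are not measurements from the paper.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_per_s, flops_per_byte):
    """Roofline bound: min(compute roof, bandwidth * operational intensity)."""
    return min(peak_gflops, bandwidth_gb_per_s * flops_per_byte)


# Hypothetical machine: 100 GFLOP/s peak, 10 GB/s memory bandwidth.
low_intensity = attainable_gflops(100.0, 10.0, 2.0)    # memory-bound kernel
high_intensity = attainable_gflops(100.0, 10.0, 50.0)  # compute-bound kernel
print(low_intensity, high_intensity)
```

A kernel whose operational intensity falls left of the ridge point (here 10 FLOPs per byte) is memory-bound, which is consistent with the abstract's conclusion that memory usage is the most promising place to optimize.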

Distributed, Parallel, and Cluster Computing

An Efficient Vectorization Scheme for Stencil Computation

Kun Li, Liang Yuan, Yunquan Zhang (2021)

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data locality, respectively. In this paper, the downsides of existing vectorization schemes are analyzed. Briefly, they either incur data alignment conflicts or hurt the data locality when integrated with tiling. We then propose a novel transpose layout that preserves the data locality for tiling and simultaneously reduces the data reorganization overhead for vectorization. To further improve the data reuse at the register level, a time loop unroll-and-jam strategy is designed to perform multistep stencil computation along the time dimension. Experimental results on AVX-2 and AVX-512 CPUs show that our approach obtains competitive performance.
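As a rough illustration of performing several time steps in a single sweep, the sketch below fuses two steps of a 1D 3-point stencil and keeps the intermediate first-step values in three local variables that slide along the grid. It is a scalar sketch in the spirit of unrolling along the time dimension, not the paper's transpose layout or vectorized scheme, and all names are hypothetical.

```python
def step(u):
    """Reference: one 3-point averaging step with fixed end cells."""
    out = list(u)
    for i in range(1, len(u) - 1):
        out[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return out


def two_steps_fused(u):
    """Two time steps in one sweep, reusing first-step values held in locals."""
    n = len(u)
    out = list(u)
    if n < 3:
        return out

    def first(j):
        # Value of cell j after the first step; boundary cells stay fixed.
        if j == 0 or j == n - 1:
            return u[j]
        return (u[j - 1] + u[j] + u[j + 1]) / 3.0

    left, mid = first(0), first(1)
    for i in range(1, n - 1):
        right = first(i + 1)           # the only new first-step value needed
        out[i] = (left + mid + right) / 3.0
        left, mid = mid, right         # slide the window; no intermediate array
    return out


grid = [float((3 * i) % 5) for i in range(12)]
print(two_steps_fused(grid) == step(step(grid)))
```

The fused sweep never stores the intermediate grid, so each input value is read once for two time steps, which is the kind of reuse the unroll-and-jam strategy targets.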

Distributed, Parallel, and Cluster Computing

GAMER with out-of-core computation

Hsi-Yu Schive, Yu-Chih Tsai (2010)

GAMER is a GPU-accelerated Adaptive-MEsh-Refinement code for astrophysical simulations. In this work, two further extensions of the code are reported. First, we have implemented the MUSCL-Hancock method with Roe's Riemann solver for the hydrodynamic evolution, by which the accuracy, overall performance and the GPU versus CPU speed-up factor are improved. Second, we have implemented out-of-core computation, which utilizes the large storage space of multiple hard disks as additional run-time virtual memory and permits an extremely large problem to be solved on a relatively small GPU cluster. The communication overhead associated with the data transfer between the parallel hard disks and the main memory is carefully reduced by overlapping it with the CPU/GPU computations.

Instrumentation and Methods for Astrophysics

APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores

Boyuan Feng, Yuke Wang, Tong Geng (2021)

Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
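The general idea of building multi-bit arithmetic from 1-bit operations can be sketched with a bit-serial dot product: split each operand vector into bit planes, AND every pair of planes, count the set bits, and weight each count by the matching power of two. This is a generic illustration for unsigned values only, with hypothetical function names. It is not the APNN-TC algorithm and says nothing about how that framework maps work to Tensor Cores.

```python
def bit_planes(values, bits):
    """Pack bit k of every element into one integer per plane (element e -> bit e)."""
    planes = []
    for k in range(bits):
        plane = 0
        for e, v in enumerate(values):
            plane |= ((v >> k) & 1) << e
        planes.append(plane)
    return planes


def bit_serial_dot(weights, acts, w_bits, a_bits):
    """Unsigned dot product using only AND, popcount, and shifts."""
    w_planes = bit_planes(weights, w_bits)
    a_planes = bit_planes(acts, a_bits)
    total = 0
    for i, wp in enumerate(w_planes):
        for j, ap in enumerate(a_planes):
            # popcount(wp & ap) is a 1-bit dot product of plane i and plane j.
            total += bin(wp & ap).count("1") << (i + j)
    return total


weights = [1, 0, 1, 1]  # 1-bit weights
acts = [3, 2, 1, 0]     # 2-bit activations
print(bit_serial_dot(weights, acts, w_bits=1, a_bits=2))
```

For the sample vectors the result equals the ordinary dot product, 1*3 + 0*2 + 1*1 + 1*0 = 4. Signed encodings would need a different plane combination (for example XOR-based), which this sketch does not cover.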

Distributed, Parallel, and Cluster Computing; Artificial Intelligence; Hardware Architecture

On-the-Fly Computation of Bisimilarity Distances

Giorgio Bacci, Giovanni Bacci, Kim G. Larsen (2017)

We propose a distance between continuous-time Markov chains (CTMCs) and study the problem of computing it by comparing three different algorithmic methodologies: iterative, linear program, and on-the-fly. In a work presented at FoSSaCS'12, Chen et al. characterized the bisimilarity distance of Desharnais et al. between discrete-time Markov chains as an optimal solution of a linear program that can be solved by using the ellipsoid method. Inspired by their result, we propose a novel linear program characterization to compute the distance in the continuous-time setting. Unlike previous proposals, ours has a number of constraints that is bounded by a polynomial in the size of the CTMC. This, in particular, proves that the distance we propose can be computed in polynomial time. Despite its theoretical importance, the proposed linear program characterization turns out to be inefficient in practice. Nevertheless, driven by the encouraging results of our previous work presented at TACAS'13, we propose an efficient on-the-fly algorithm, which, unlike the other mentioned solutions, computes the distances between two given states while avoiding an exhaustive exploration of the state space. This technique works by successively refining over-approximations of the target distances using a greedy strategy, which ensures that the state space is further explored only when the current approximations are improved. Tests performed on a consistent set of (pseudo)randomly generated CTMCs show that our algorithm improves, on average, the efficiency of the corresponding iterative and linear program methods by orders of magnitude.

Logic in Computer Science