
An Efficient Vectorization Scheme for Stencil Computation

Added by Kun Li
Publication date: 2021
Language: English





Stencil computation is one of the most important kernels in many scientific and engineering applications. A large body of work has focused on vectorization and tiling techniques, which aim to exploit in-core data parallelism and data locality, respectively. In this paper, we analyze the downsides of existing vectorization schemes: they either incur data alignment conflicts or hurt data locality when integrated with tiling. We then propose a novel transpose layout that preserves data locality for tiling while reducing the data reorganization overhead of vectorization. To further improve data reuse at the register level, we design a time-loop unroll-and-jam strategy that performs multistep stencil computation along the time dimension. Experimental results on AVX-2 and AVX-512 CPUs show that our approach achieves competitive performance.
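As a concrete, heavily simplified illustration of the unroll-and-jam idea along the time dimension, the sketch below fuses two sweeps of a 1D 3-point Jacobi stencil so that the intermediate time step is consumed while it is still hot in registers or cache. This is only a minimal sketch of the general technique, not the paper's implementation: the transpose layout, tiling, and AVX-2/AVX-512 intrinsics are omitted, and the coefficients, array names, and boundary handling are illustrative assumptions.

    #include <stddef.h>

    /* One fused pass = two Jacobi time steps of
     *   u_new[i] = c0*u[i-1] + c1*u[i] + c2*u[i+1].
     * `a` holds step t, `b` receives step t+2, `tmp` is scratch for step t+1.
     * Boundary values a[0] and a[n-1] are held fixed (Dirichlet); n >= 4 assumed. */
    static void jacobi3_two_steps(const double *a, double *tmp, double *b,
                                  size_t n, double c0, double c1, double c2)
    {
        /* prologue: boundary cells of the intermediate step, plus tmp[1] */
        tmp[0]     = a[0];
        tmp[n - 1] = a[n - 1];
        tmp[1]     = c0 * a[0] + c1 * a[1] + c2 * a[2];
        b[0]       = a[0];
        b[n - 1]   = a[n - 1];

        /* fused (jammed) loop: producing tmp[i] immediately enables b[i-1],
         * so each intermediate value is reused right after it is computed */
        for (size_t i = 2; i < n - 1; ++i) {
            tmp[i]   = c0 * a[i - 1]   + c1 * a[i]       + c2 * a[i + 1];   /* step t+1 */
            b[i - 1] = c0 * tmp[i - 2] + c1 * tmp[i - 1] + c2 * tmp[i];     /* step t+2 */
        }

        /* epilogue: last interior point of step t+2 */
        b[n - 2] = c0 * tmp[n - 3] + c1 * tmp[n - 2] + c2 * tmp[n - 1];
    }

Driving the outer time loop with this kernel (stepping t by 2 and swapping a and b after each call) performs two time steps per pass over the data; the paper generalizes this register-level reuse with its transpose layout and vector intrinsics.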



Related research

We present FPDetect, a low overhead approach for detecting logical errors and soft errors affecting stencil computations without generating false positives. We develop an offline analysis that tightly estimates the number of floating-point bits preserved across stencil applications. This estimate rigorously bounds the values expected in the data space of the computation. Violations of this bound can be attributed with certainty to errors. FPDetect helps synthesize error detectors customized for user-specified levels of accuracy and coverage. FPDetect also enables overhead reduction techniques based on deploying these detectors coarsely in space and time. Experimental evaluations demonstrate the practicality of our approach.
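Stripped of FPDetect's offline analysis, the core check is that every value in the computation's data space must stay inside a rigorously derived bound, so any violation can be attributed to an error. The toy detector below only illustrates that final check; the derivation of the bound, the accuracy/coverage tuning, and the coarse spatial and temporal deployment described in the abstract are not reproduced, and the function name and signature are illustrative assumptions.

    #include <math.h>
    #include <stdbool.h>
    #include <stddef.h>

    /* Flag an error if any value in the stencil's data space violates a
     * precomputed bound (here a simple magnitude bound, or a non-finite value).
     * In FPDetect the bound comes from an offline floating-point analysis;
     * here it is just a parameter. */
    static bool violates_bound(const double *u, size_t n, double bound)
    {
        for (size_t i = 0; i < n; ++i)
            if (!isfinite(u[i]) || fabs(u[i]) > bound)
                return true;   /* attributable to a logical or soft error */
        return false;
    }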
Wenxiu Ding, Wei Sun, Zheng Yan (2021)
Cloud computing offers resource-constrained users large-volume data storage and energy-consuming, complicated computation. However, owing to the lack of full trust in the cloud, cloud users prefer privacy-preserving outsourced data computation with correctness verification. Cryptography-based schemes introduce high computational costs to both the cloud and its users for verifiable computation with privacy preservation, which makes it difficult to support complicated computations in practice. Intel Software Guard Extensions (SGX), a trusted execution environment, is widely researched in various fields (such as secure data analytics and computation) and is regarded as a promising way to achieve efficient outsourced data computation with privacy preservation in the cloud. However, we find two types of threats against computation with SGX: the Disarranging Data-Related Code threat and the Output Tampering and Misrouting threat. In this paper, we describe these threats using formal methods and successfully mount both attacks on an enclave program built with the Rust SGX SDK to demonstrate their impact on the correctness of computations in SGX enclaves. As a countermeasure, we propose an efficient and secure scheme that resists these threats and realizes verifiable computation for Intel SGX. We prove the security of the proposed scheme and show its efficiency and correctness through theoretical analysis and extensive experiments. Furthermore, we compare its performance with that of several cryptography-based schemes to demonstrate its high efficiency.
Stencil computation is an important class of scientific applications that can be efficiently executed by graphics processing units (GPUs). An out-of-core approach helps run large-scale stencil codes that process data larger than the limited capacity of GPU memory. However, the performance of GPU-based out-of-core stencil computation is always limited by the data transfer between the CPU and the GPU. Many optimizations have been explored to reduce such data transfer, but the use of on-the-fly compression techniques remains insufficiently studied. In this study, we propose a method that accelerates GPU-based out-of-core stencil computation with on-the-fly compression. We introduce a novel data compression approach that resolves the data dependency between two contiguous decomposed data blocks. We also modify a widely used GPU-based compression library to support pipelining that overlaps CPU/GPU data transfer with GPU computation. Experimental results show that the proposed method achieved a speedup of 1.2x compared with the method without compression. Moreover, although the precision loss introduced by compression increased with the number of time steps, it remained negligible up to 4,320 time steps, demonstrating the usefulness of the proposed method.
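To see where the dependency between two contiguous decomposed blocks comes from, consider the sketch below: each block shipped to the GPU must carry a halo of neighbouring cells wide enough for the stencil radius (and the number of fused time steps), so contiguous blocks overlap. This is only an assumed, simplified 1D illustration of that dependency; the compression pipeline and the CPU/GPU transfer overlap described in the abstract are not shown, and all names are hypothetical.

    #include <string.h>
    #include <stddef.h>

    /* Copy block `blk` (of `nblocks` equal blocks) of a 1D domain of length n
     * into `staging`, together with `halo` extra cells on each side where
     * available. The halo region is exactly the data shared with neighbouring
     * blocks, i.e. the dependency between contiguous decomposed blocks.
     * Returns the number of cells copied. Assumes n is divisible by nblocks. */
    static size_t pack_block_with_halo(const double *domain, size_t n,
                                       size_t nblocks, size_t blk, size_t halo,
                                       double *staging)
    {
        size_t blk_len = n / nblocks;
        size_t begin   = blk * blk_len;
        size_t end     = begin + blk_len;

        size_t lo = (begin > halo) ? begin - halo : 0;   /* clamp at domain edges */
        size_t hi = (end + halo < n) ? end + halo : n;

        memcpy(staging, domain + lo, (hi - lo) * sizeof *domain);
        return hi - lo;
    }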
Bogdan Nicolae (2008)
This paper addresses the problem of efficiently storing and accessing massive data blocks in a large-scale distributed environment, while providing efficient fine-grain access to data subsets. This issue is crucial in the context of applications in the field of databases, data mining and multimedia. We propose a data sharing service based on distributed, RAM-based storage of data, while leveraging a DHT-based, natively parallel metadata management scheme. As opposed to the most commonly used grid storage infrastructures that provide mechanisms for explicit data localization and transfer, we provide a transparent access model, where data are accessed through global identifiers. Our proposal has been validated through a prototype implementation whose preliminary evaluation provides promising results.
Many applications from the geosciences require simulations of seismic waves in porous media. Biot's theory of poroelasticity describes the coupling between solid and fluid phases and introduces a stiff source term, thereby increasing computational cost and motivating efficient methods that utilise high-performance computing. We present a novel realisation of the discontinuous Galerkin scheme with Arbitrary DERivative time stepping (ADER-DG) that copes with stiff source terms. To integrate this source term with a reasonable time step size, we use an element-local space-time predictor, which needs to solve medium-sized linear systems (with 1,000 to 10,000 unknowns) in each element update, i.e., billions of times. We present a novel block-wise back-substitution algorithm for solving these systems efficiently. In comparison to LU decomposition, we reduce the number of floating-point operations by a factor of up to 25. The block-wise back-substitution is mapped to a sequence of small matrix-matrix multiplications, for which code generators are available to generate highly optimised code. We verify the new solver thoroughly on problems of increasing complexity. We demonstrate high-order convergence for 3D problems. We verify the correct treatment of point sources, material interfaces and traction-free boundary conditions. In addition, we compare against a finite difference code for a newly defined layer-over-half-space problem. We find that extremely high accuracy is required to resolve the slow P-wave at a free surface, while solid particle velocities are not affected by coarser resolutions. By using a clustered local time stepping scheme, we reduce time to solution by a factor of 6 to 10 compared to global time stepping. We conclude our study with a scaling and performance analysis, demonstrating our implementation's efficiency and its potential for extreme-scale simulations.
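To make the idea of block-wise back-substitution concrete, a plain, single-right-hand-side version is sketched below: for each block row, the already-computed solution blocks are eliminated through small dense block products before the diagonal block is solved. This is only a generic illustration under the simplifying assumption (of this sketch, not the paper) that each diagonal block is itself upper triangular; the paper's algorithm exploits the specific structure of the ADER-DG space-time predictor and maps the updates to generated matrix-matrix multiplication kernels, none of which is reproduced here.

    #include <stdlib.h>
    #include <string.h>

    /* Solve U x = b, where U is block upper triangular with nb block rows of
     * size m. Block (I,J) is stored row-major at U[(I*nb + J)*m*m].
     * Simplifying assumption for this sketch only: each diagonal block is
     * itself upper triangular, so it can be solved by scalar back-substitution. */
    static void block_back_substitution(int nb, int m, const double *U,
                                        const double *b, double *x)
    {
        size_t n = (size_t)nb * m;
        double *r = malloc(n * sizeof *r);   /* running right-hand side */
        if (!r) return;
        memcpy(r, b, n * sizeof *r);

        for (int I = nb - 1; I >= 0; --I) {
            /* eliminate already-solved blocks: r_I -= U_{I,J} * x_J
             * (small dense products; with several right-hand sides these become
             * the matrix-matrix multiplications mentioned in the abstract) */
            for (int J = I + 1; J < nb; ++J) {
                const double *Uij = U + ((size_t)I * nb + J) * m * m;
                for (int p = 0; p < m; ++p)
                    for (int q = 0; q < m; ++q)
                        r[(size_t)I * m + p] -= Uij[p * m + q] * x[(size_t)J * m + q];
            }
            /* solve the (assumed upper-triangular) diagonal block U_{I,I} x_I = r_I */
            const double *Uii = U + ((size_t)I * nb + I) * m * m;
            for (int p = m - 1; p >= 0; --p) {
                double s = r[(size_t)I * m + p];
                for (int q = p + 1; q < m; ++q)
                    s -= Uii[p * m + q] * x[(size_t)I * m + q];
                x[(size_t)I * m + p] = s / Uii[p * m + p];
            }
        }
        free(r);
    }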
