Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations

Added by Kun Li

Publication date: 2021

Field: Informatics Engineering

Language: English

Authors: Kun Li, Liang Yuan, Yunquan Zhang

Distributed Parallel and Cluster Computing

Abstract

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization techniques, aiming at exploiting the in-core data parallelism. However, existing schemes either incur data alignment conflicts or hurt the data locality when integrated with tiling. In this paper, a novel transpose layout is devised to preserve the data locality for tiling in the data space and reduce the data reorganization overhead for vectorization simultaneously. We then propose an approach of temporal computation folding designed to further reduce the redundancy of arithmetic calculations by exploiting the register reuse, alleviating the increased register pressure, and deducing generalization with a linear regression model. Experimental results on the AVX-2 and AVX-512 CPUs show that our approach achieves competitive performance.
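To make the terms in the abstract concrete, the following is a minimal sketch of a 1D stencil sweep and of one simple way two consecutive time steps of a linear stencil can be combined into a single wider sweep. It is an illustration of the general idea only, not the authors' transpose layout or their temporal computation folding scheme. The function names `stencil_step` and `fold_weights`, the weights, and the boundary handling are assumptions made for this example.

```python
def stencil_step(u, w):
    """Apply one sweep of a 1D linear stencil with odd-length weights w.

    Cells that lack a full neighbourhood (the first and last len(w)//2
    cells) are copied unchanged. This boundary rule is an assumption.
    """
    r = len(w) // 2
    out = list(u)
    for i in range(r, len(u) - r):
        out[i] = sum(w[k] * u[i - r + k] for k in range(len(w)))
    return out


def fold_weights(w1, w2):
    """Convolve two weight vectors.

    For a linear stencil, sweeping with w1 and then with w2 gives the same
    values, away from the boundaries, as one sweep with the returned weights.
    """
    out = [0.0] * (len(w1) + len(w2) - 1)
    for i, a in enumerate(w1):
        for j, b in enumerate(w2):
            out[i + j] += a * b
    return out


if __name__ == "__main__":
    u = [0.0, 1.0, 4.0, 9.0, 16.0, 25.0, 36.0, 49.0]
    w = [0.25, 0.5, 0.25]  # hypothetical 3-point weights

    two_sweeps = stencil_step(stencil_step(u, w), w)
    one_sweep = stencil_step(u, fold_weights(w, w))  # 5-point stencil

    # Compare only cells whose full two-step neighbourhood lies in the grid.
    inner = range(2, len(u) - 2)
    print(all(abs(two_sweeps[i] - one_sweep[i]) < 1e-12 for i in inner))
```

In this toy case the combined sweep needs five multiplications per point where two separate 3-point sweeps need six, which hints at why merging time steps can remove redundant arithmetic. The trade-off is a wider neighbourhood per update.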

To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations

Maciej Besta, Michal Podstawski, Linus Groner (2020)

We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the number of locks, atomics, and reads/writes used. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines and distributed-memory systems.

Distributed Parallel and Cluster Computing; Data Structures and Algorithms

Optimal redundancy in computations from random oracles

George Barmpalias, Andrew Lewis-Pye (2016)

A classic result in algorithmic information theory is that every infinite binary sequence is computable from a Martin-Loef random infinite binary sequence. Proved independently by Kucera and Gacs, this result answered a question by Charles Bennett and has seen numerous applications in the last 30 years. The optimal redundancy in such a coding process has, however, remained unknown. If the computation of the first n bits of a sequence requires n + g(n) bits of the random oracle, then g is the redundancy of the computation. Kucera implicitly achieved redundancy n log n while Gacs used a more elaborate block-coding procedure which achieved redundancy sqrt(n) log n. Different approaches to coding such as the one by Merkle and Mihailovic have not improved this redundancy bound. In this paper we devise a new coding method that achieves optimal logarithmic redundancy. Our redundancy bound is exponentially smaller than the previously best known bound and is known to be the best possible. It follows that redundancy r log n in computation from a random oracle is possible for every stream, if and only if r > 1.

Computational Complexity

An Efficient Vectorization Scheme for Stencil Computation

Kun Li, Liang Yuan, Yunquan Zhang (2021)

Stencil computation is one of the most important kernels in various scientific and engineering applications. A variety of work has focused on vectorization and tiling techniques, aiming at exploiting the in-core data parallelism and data locality respectively. In this paper, the downsides of existing vectorization schemes are analyzed. Briefly, they either incur data alignment conflicts or hurt the data locality when integrated with tiling. Then we propose a novel transpose layout to preserve the data locality for tiling and reduce the data reorganization overhead for vectorization simultaneously. To further improve the data reuse at the register level, a time loop unroll-and-jam strategy is designed to perform multistep stencil computation along the time dimension. Experimental results on the AVX-2 and AVX-512 CPUs show that our approach obtains a competitive performance.

Distributed Parallel and Cluster Computing

The CORE Storage Primitive: Cross-Object Redundancy for Efficient Data Repair & Access in Erasure Coded Storage

Kyumars Sheykh Esmaili, Lluis Pamies-Juarez, Anwitaman Datta (2013)

Erasure codes are an integral part of many distributed storage systems aimed at Big Data, since they provide high fault-tolerance for low overheads. However, traditional erasure codes are inefficient on reading stored data in degraded environments (when nodes might be unavailable), and on replenishing lost data (vital for long term resilience). Consequently, novel codes optimized to cope with distributed storage system nuances are vigorously being researched. In this paper, we take an engineering alternative, exploring the use of simple and mature techniques: juxtaposing a standard erasure code with RAID-4-like parity. We carry out an analytical study to determine the efficacy of this approach over traditional as well as some novel codes. We build upon this study to design CORE, a general storage primitive that we integrate into HDFS. We benchmark this implementation in a proprietary cluster and in EC2. Our experiments show that compared to traditional erasure codes, CORE uses 50% less bandwidth and is up to 75% faster while recovering a single failed node, while the gains are respectively 15% and 60% for double node failures.

Distributed Parallel and Cluster Computing

Secure multiparty computations in floating-point arithmetic

Chuan Guo, Awni Hannun, Brian Knott (2020)

Secure multiparty computations enable the distribution of so-called shares of sensitive data to multiple parties such that the multiple parties can effectively process the data while being unable to glean much information about the data (at least not without collusion among all parties to put back together all the shares). Thus, the parties may conspire to send all their processed results to a trusted third party (perhaps the data provider) at the conclusion of the computations, with only the trusted third party being able to view the final results. Secure multiparty computations for privacy-preserving machine-learning turn out to be possible using solely standard floating-point arithmetic, at least with a carefully controlled leakage of information less than the loss of accuracy due to roundoff, all backed by rigorous mathematical proofs of worst-case bounds on information loss and numerical stability in finite-precision arithmetic. Numerical examples illustrate the high performance attained on commodity off-the-shelf hardware for generalized linear models, including ordinary linear least-squares regression, binary and multinomial logistic regression, probit regression, and Poisson regression.

Cryptography and Security; Information Theory; Machine Learning
