
Parallelization of Weighted Sequence Comparison by using EBWT

Posted by Shashank Srikant
Publication date: 2010
Research field: Informatics Engineering
Paper language: English
Author: Shashank Srikant





The Extended Burrows-Wheeler Transform (EBWT) can be used to compute a distance between two sequences. An existing implementation of this algorithm takes a considerable amount of time even for small sequences. In this paper, we give a parallel implementation of the algorithm using the NVIDIA Compute Unified Device Architecture (CUDA). We obtain, on average, a 2x improvement in performance.
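
As a rough illustration of the kind of measure involved, the following Python sketch sorts the conjugates (rotations) of two sequences together and scores how strongly they interleave in the sorted order. The plain lexicographic ordering and the interleaving score are simplifications chosen for illustration, not the exact EBWT dissimilarity or the CUDA kernels used in the paper.

# Illustrative, sequential stand-in for an EBWT-style dissimilarity.
# The exact measure and its CUDA parallelization may differ.

def conjugates(s):
    """All rotations (conjugates) of the string s."""
    return [s[i:] + s[:i] for i in range(len(s))]

def ebwt_style_distance(u, v):
    """Sort the conjugates of u and v together and score how strongly
    the two sequences are intermixed in the sorted order (more mixing
    is read here as more similarity, hence a lower distance)."""
    tagged = [(c, "u") for c in conjugates(u)] + [(c, "v") for c in conjugates(v)]
    tagged.sort(key=lambda t: t[0])            # lexicographic order, as a simplification
    labels = [tag for _, tag in tagged]
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    return 1.0 - changes / (len(labels) - 1)

if __name__ == "__main__":
    print(ebwt_style_distance("ACGTACGT", "ACGTACGA"))  # similar pair, lower value
    print(ebwt_style_distance("ACGTACGT", "TTTTTTTT"))  # dissimilar pair, higher value

The paper's contribution is offloading this per-pair work to the GPU with CUDA; the sketch above only shows the kind of sequential computation that is being parallelized.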




Read also

Berenger Bramas (2020)
In this paper, we present several improvements in the parallelization of the in-place merge algorithm, which merges two contiguous sorted arrays into one with O(T) space complexity (where T is the number of threads). The approach divides the two arrays into as many pairs of partitions as there are threads available, such that each thread can later merge a pair of partitions independently of the others. We extend the existing method by proposing a new algorithm to find the median of two partitions. Additionally, we provide a new strategy to divide the input arrays in which we minimize data movement, but at the cost of making this stage sequential. Finally, we provide the so-called linear shifting algorithm, which swaps two partitions in-place with contiguous data access. We emphasize that our approach is straightforward to implement and that it can also be used for external (out-of-place) merging. The results demonstrate that it provides a significant speedup compared to sequential execution when the size of the arrays is greater than a thousand elements.
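
A minimal sketch of the partitioning idea is given below, assuming the standard split-point search on two sorted arrays; the in-place merging, the O(T) space bound, the median-finding variant, and the linear shifting step from the paper are not reproduced, and the function names are illustrative.

# Divide two sorted arrays into independent pairs of partitions, one
# pair per worker. Each pair could be merged by a separate thread; here
# the merges run sequentially and out of place, unlike the in-place
# scheme described in the paper.
from heapq import merge

def split_point(a, b, k):
    """Return (i, j) with i + j == k such that every element of
    a[:i] and b[:j] is <= every element of a[i:] and b[j:]."""
    lo, hi = max(0, k - len(b)), min(k, len(a))
    while lo < hi:
        i = (lo + hi) // 2
        j = k - i
        if b[j - 1] > a[i]:
            lo = i + 1          # the split still needs more elements from a
        else:
            hi = i
    return lo, k - lo

def merge_in_partitions(a, b, workers):
    """Merge sorted a and b by merging `workers` independent partition pairs."""
    n = len(a) + len(b)
    cuts = [split_point(a, b, t * n // workers) for t in range(workers + 1)]
    out = []
    for (i0, j0), (i1, j1) in zip(cuts, cuts[1:]):
        out.extend(merge(a[i0:i1], b[j0:j1]))   # each pair is independent work
    return out

if __name__ == "__main__":
    a, b = list(range(0, 40, 3)), list(range(1, 50, 4))
    assert merge_in_partitions(a, b, workers=4) == sorted(a + b)
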
Maximum weight matching is one of the most fundamental combinatorial optimization problems with a wide range of applications in data mining and bioinformatics. Developing distributed weighted matching algorithms is challenging due to the sequential nature of efficient algorithms for this problem. In this paper, we develop a simple distributed algorithm for the problem on general graphs with an approximation guarantee of $2+\varepsilon$ that (nearly) matches that of the sequential greedy algorithm. A key advantage of this algorithm is that it can be easily implemented in only two rounds of computation in modern parallel computation frameworks such as MapReduce. We also demonstrate the efficiency of our algorithm in practice on various graphs (some with half a trillion edges) by achieving objective values always close to what is achievable in the centralized setting.
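
For context, the sequential greedy baseline mentioned in this abstract can be sketched as follows, assuming edges are given as (weight, u, v) triples; the distributed two-round $(2+\varepsilon)$-approximation that is the paper's actual contribution is not reproduced here.

# Classic greedy 2-approximation for maximum weight matching: scan edges
# in non-increasing weight order and keep an edge whenever both of its
# endpoints are still unmatched.

def greedy_matching(edges):
    """edges: iterable of (weight, u, v) triples. Returns the matched edges."""
    matched_vertices = set()
    matching = []
    for w, u, v in sorted(edges, reverse=True):
        if u not in matched_vertices and v not in matched_vertices:
            matching.append((w, u, v))
            matched_vertices.update((u, v))
    return matching

if __name__ == "__main__":
    edges = [(5, "a", "b"), (4, "b", "c"), (3, "c", "d"), (6, "a", "d")]
    print(greedy_matching(edges))   # [(6, 'a', 'd'), (4, 'b', 'c')]
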
The state-of-the-art deep learning algorithms rely on distributed training systems to tackle the increasing sizes of models and training data sets. The minibatch stochastic gradient descent (SGD) algorithm requires workers to halt forward/back propagations, to wait for gradients aggregated from all workers, and to receive weight updates before the next batch of tasks. This synchronous execution model exposes the overheads of gradient/weight communication among a large number of workers in a distributed training system. We propose a new SGD algorithm, DaSGD (Local SGD with Delayed Averaging), which parallelizes SGD and forward/back propagations to hide 100% of the communication overhead. By adjusting the gradient update scheme, this algorithm uses hardware resources more efficiently and reduces the reliance on low-latency and high-throughput interconnects. The theoretical analysis and the experimental results show its convergence rate is $O(1/\sqrt{K})$, the same as SGD. The performance evaluation demonstrates that it enables a linear performance scale-up with the cluster size.
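
The delayed-averaging idea can be illustrated with a toy, single-process simulation on a quadratic objective; the 0.5 mixing coefficient, the hyperparameters, and the loss below are arbitrary illustrative choices, not the DaSGD update rule from the paper.

# Toy simulation of local SGD with a delayed (stale) model average:
# workers keep taking local steps and fold in the global average from
# the previous round instead of waiting for a fresh one, which is how
# communication can overlap with computation.
import random

def local_sgd_delayed_avg(workers=4, rounds=50, local_steps=5, lr=0.1, target=3.0):
    x = [random.uniform(-10, 10) for _ in range(workers)]   # per-worker model
    delayed_avg = sum(x) / workers                           # average "in flight"
    for _ in range(rounds):
        for w in range(workers):
            for _ in range(local_steps):
                grad = 2 * (x[w] - target) + random.gauss(0, 0.1)  # noisy gradient of (x - target)^2
                x[w] -= lr * grad
        fresh_avg = sum(x) / workers
        # The average computed this round is only applied next round.
        x = [0.5 * (xi + delayed_avg) for xi in x]   # arbitrary 0.5 mixing, for illustration
        delayed_avg = fresh_avg
    return sum(x) / workers

if __name__ == "__main__":
    print(local_sgd_delayed_avg())   # ends up close to the target value 3.0
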
We present our experience with the modernization of the GR-MHD code BHAC, aimed at improving its novel hybrid (MPI+OpenMP) parallelization scheme. In doing so, we showcase the use of performance profiling tools usable on x86 (Intel-based) architectures. Our performance characterization and threading analysis provided guidance in improving the concurrency and thus the efficiency of the OpenMP parallel regions. We assess scaling and communication patterns in order to identify and alleviate MPI bottlenecks, with both runtime switches and precise code interventions. The performance of the optimized version of BHAC improved by $\sim 28\%$, making it viable for scaling on several hundreds of supercomputer nodes. We finally test whether porting such optimizations to different hardware is likewise beneficial on the new architecture by running on ARM A64FX vector nodes.