
Efficient Architecture-Aware Acceleration of BWA-MEM for Multicore Systems

Added by Md Vasimuddin
Publication date: 2019
Language: English





Innovations in Next-Generation Sequencing are enabling the generation of DNA sequence data at ever faster rates and at very low cost. Large sequencing centers typically employ hundreds of such systems. Such high-throughput, low-cost data generation underscores the need for commensurate acceleration of the downstream computational analysis of the sequencing data. A fundamental step in downstream analysis is mapping the reads to a long reference DNA sequence, such as a reference human genome. Sequence mapping is a compute-intensive step that accounts for more than 30% of the overall time of the GATK workflow. BWA-MEM is one of the most widely used tools for sequence mapping and has tens of thousands of users. In this work, we focus on accelerating BWA-MEM through an efficient architecture-aware implementation, while maintaining identical output. The volume of data requires a distributed computing environment, usually built from multicore processors. Since the application is easily parallelized for distributed-memory systems, we focus on performance improvements on a single-socket multicore processor. BWA-MEM's run time is dominated by three kernels, collectively responsible for more than 85% of the overall compute time. We improved the performance of these kernels by 1) improving cache reuse, 2) simplifying the algorithms, 3) replacing many small, fragmented memory allocations with a few large contiguous ones, 4) software prefetching, and 5) using SIMD wherever applicable, together with a massive reorganization of the source code to enable these improvements. As a result, we achieved nearly 2x, 183x, and 8x speedups on the three kernels, respectively, resulting in up to 3.5x and 2.4x speedups in end-to-end compute time over the original BWA-MEM on a single thread and a single socket, respectively, of an Intel Xeon Skylake processor. To the best of our knowledge, this is the highest reported speedup over BWA-MEM.
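The kernel-level techniques enumerated above are general C-level optimizations. The sketch below is not taken from the BWA-MEM source; it is a minimal, hypothetical illustration of techniques 3) and 4), replacing many small allocations with a few contiguous blocks and hiding the latency of data-dependent loads with GCC/Clang's __builtin_prefetch.

```c
/* Illustrative sketch (not the actual BWA-MEM source) of two of the
 * techniques above: one large contiguous allocation instead of many
 * tiny malloc() calls, and software prefetching in an irregular-access
 * scan. All names are hypothetical. Compile with gcc or clang. */
#include <stdio.h>
#include <stdlib.h>

#define N 1000000

int main(void) {
    /* Technique 3: one contiguous block per array instead of N tiny
     * mallocs. This cuts allocator overhead and keeps neighboring
     * elements on the same cache lines, improving spatial locality. */
    int    *vals = malloc(N * sizeof *vals);
    size_t *idx  = malloc(N * sizeof *idx);
    if (!vals || !idx) return 1;

    for (size_t i = 0; i < N; i++) {
        vals[i] = (int)(i & 0xff);
        idx[i]  = (i * 7919) % N;   /* pseudo-random access pattern */
    }

    /* Technique 4: software prefetching. The load vals[idx[i]] is
     * data-dependent and cache-unfriendly; issuing a prefetch a few
     * iterations ahead hides part of the memory latency. */
    long long sum = 0;
    const size_t DIST = 16;         /* prefetch distance, tuned empirically */
    for (size_t i = 0; i < N; i++) {
        if (i + DIST < N)
            __builtin_prefetch(&vals[idx[i + DIST]], 0 /*read*/, 1);
        sum += vals[idx[i]];
    }

    printf("sum = %lld\n", sum);
    free(vals);
    free(idx);
    return 0;
}
```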




Read More

In cloud computing systems, assigning a job to multiple servers and waiting for the earliest copy to finish is an effective method to combat the variability in the response times of individual servers. Although adding redundant replicas always reduces service time, the total computing time spent per job may be higher, thus increasing waiting time in the queue. The total time spent per job is also proportional to the cost of computing resources. We analyze how different redundancy strategies, e.g., the number of replicas and the times at which they are issued and canceled, affect latency and computing cost. We obtain the insight that the log-concavity of the service time distribution is a key factor in determining whether adding redundancy reduces latency and cost. If the service distribution is log-convex, then adding maximum redundancy reduces both latency and cost; if it is log-concave, then having fewer replicas and canceling the redundant requests early is more effective.
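To make the criterion concrete (standard definitions; the notation here is ours, not the paper's): a service time $S$ with tail distribution $\bar F(x) = \Pr[S > x]$ is log-concave (log-convex) if $\log \bar F(x)$ is concave (convex) in $x$:

$$ \bar F(x) = \Pr[S > x], \qquad S \ \text{log-concave} \iff \log \bar F(x) \ \text{concave}. $$

The pure exponential, with $\log \bar F(x) = -\mu x$ linear, sits on the boundary. A shifted exponential, $\bar F(x) = e^{-\mu (x - \Delta)}$ for $x \ge \Delta$, is log-concave: every replica repeats the deterministic setup work $\Delta$, so redundancy tends to waste capacity. A hyperexponential mixture is log-convex: a lucky fast replica can more than repay the duplicated work.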
We present an efficient realization of Generalized Givens Rotation (GGR) based QR factorization that achieves 3-100x better performance in terms of Gflops/watt over state-of-the-art realizations on multicores and General Purpose Graphics Processing Units (GPGPUs). GGR is an improvement over the classical Givens Rotation (GR) operation that can annihilate multiple elements of rows and columns of an input matrix simultaneously. GGR requires 33% fewer multiplications than GR. For a custom implementation of GGR, we identify macro operations in GGR and realize them on a Reconfigurable Data-path (RDP) tightly coupled to the pipeline of a Processing Element (PE). In the PE, GGR attains a speed-up of 1.1x over the Modified Householder Transform (MHT) presented in the literature. For a parallel realization of GGR, we use REDEFINE, a scalable massively parallel Coarse-grained Reconfigurable Architecture, and show that the speed-up attained is commensurate with the hardware resources in REDEFINE. GGR also outperforms General Matrix Multiplication (gemm) by 10% in terms of Gflops/watt, which is counter-intuitive.
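For context, the classical GR that GGR generalizes annihilates a single element at a time (standard textbook form; the multi-element GGR construction itself is defined in the work above):

$$ G = \begin{pmatrix} c & s \\ -s & c \end{pmatrix}, \qquad c = \frac{a}{r},\quad s = \frac{b}{r},\quad r = \sqrt{a^2 + b^2}, \qquad G \begin{pmatrix} a \\ b \end{pmatrix} = \begin{pmatrix} r \\ 0 \end{pmatrix}. $$

Applying such rotations to successive row pairs zeroes the subdiagonal of a column one entry at a time; per the abstract, GGR's saving comes from annihilating several of these entries simultaneously.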
Energy proportionality is a key design goal pursued by the architects of modern multicore CPUs. One of its implications is that optimizing an application for performance also optimizes it for energy. In this work, we show that energy proportionality does not hold true for multicore CPUs. This finding creates the opportunity for bi-objective optimization of applications for performance and energy. We propose and study the first application-level method for bi-objective optimization of multithreaded data-parallel applications for performance and energy. The method uses two decision variables: the number of identical multithreaded kernels (threadgroups) executing the application and the number of threads in each threadgroup, with the workload always partitioned equally between the threadgroups. We experimentally demonstrate the efficiency of the method using four highly optimized multithreaded data-parallel applications: 2D fast Fourier transforms based on FFTW and Intel MKL, and dense matrix-matrix multiplication using OpenBLAS and Intel MKL. Four modern multicore CPUs are used in the experiments. The experiments show that optimization for performance alone increases dynamic energy consumption by up to 89%, while optimization for dynamic energy alone degrades performance by up to 49%. By solving the bi-objective optimization problem, the method determines up to 11 globally Pareto-optimal solutions. Finally, we propose a qualitative dynamic energy model employing performance monitoring counters as parameters, which we use to explain the discovered energy nonproportionality and the Pareto-optimal solutions determined by our method. The model shows that the energy nonproportionality in our case is due to the activity of the data translation lookaside buffer (dTLB), which is disproportionately energy-expensive.
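The search over the two decision variables reduces to filtering measured (time, energy) points down to their Pareto front. The C sketch below is a hypothetical illustration with made-up measurements, not the paper's tool:

```c
/* Illustrative sketch (hypothetical data): given measured (time, energy)
 * pairs for each (threadgroups, threads-per-group) configuration, keep
 * only the Pareto-optimal ones, i.e. those not dominated by any
 * configuration that is both faster and less energy-hungry. */
#include <stdio.h>

typedef struct {
    int groups, threads;   /* decision variables: g threadgroups x t threads */
    double time_s;         /* measured execution time (s)  */
    double energy_j;       /* measured dynamic energy (J)  */
} Config;

/* a dominates b if it is no worse in both objectives and better in one */
int dominates(const Config *a, const Config *b) {
    return a->time_s <= b->time_s && a->energy_j <= b->energy_j &&
           (a->time_s < b->time_s || a->energy_j < b->energy_j);
}

int main(void) {
    Config cfgs[] = {           /* made-up measurements for illustration */
        { 1, 28, 4.1, 310.0},
        { 2, 14, 3.6, 350.0},
        { 4,  7, 3.9, 290.0},
        { 7,  4, 5.0, 280.0},
        {14,  2, 6.8, 330.0},
    };
    int n = sizeof cfgs / sizeof cfgs[0];

    for (int i = 0; i < n; i++) {
        int dominated = 0;
        for (int j = 0; j < n && !dominated; j++)
            dominated = (j != i) && dominates(&cfgs[j], &cfgs[i]);
        if (!dominated)
            printf("Pareto-optimal: g=%d t=%d time=%.1fs energy=%.0fJ\n",
                   cfgs[i].groups, cfgs[i].threads,
                   cfgs[i].time_s, cfgs[i].energy_j);
    }
    return 0;
}
```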
We present the Multiscale Coupling Library and Environment, MUSCLE 2. This multiscale, component-based execution environment has simple-to-use Java, C++, C, Python, and Fortran APIs and is compatible with MPI, OpenMP, and threaded codes. We demonstrate its local and distributed computing capabilities and compare its performance to MUSCLE 1, file copy, MPI, MPWide, and GridFTP. The local throughput of MPI is about two times higher, so very tightly coupled code should use MPI as a single submodel of MUSCLE 2; the distributed performance of GridFTP is lower, especially for small messages. We test the performance of a canal system model with MUSCLE 2, where it introduces an overhead as small as 5% compared to MPI.
Master-worker distributed computing systems use task replication to mitigate the effect of slow workers, known as stragglers. Tasks are grouped into batches and assigned to one or more workers for execution. We first consider the case when the batches do not overlap and, using results from majorization theory, show that, for a general class of workers' service time distributions, a balanced assignment of batches to workers minimizes the average job compute time. We next show that this balanced assignment of non-overlapping batches achieves a lower average job compute time than the overlapping schemes proposed in the literature. Furthermore, we derive the optimum redundancy level as a function of the service time distribution at the workers. We show that the redundancy level that minimizes average job compute time is not necessarily the same as the redundancy level that maximizes the predictability of job compute time, and thus there exists a trade-off between optimizing the two metrics. Finally, by running experiments on Google cluster traces, we observe that redundancy can reduce the compute time of the jobs in Google clusters by an order of magnitude, and that the optimum level of redundancy depends on the distribution of tasks' service times.
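The majorization step can be summarized as follows (our notation; the Schur-convexity property is what the paper establishes for its class of service time distributions). For batch-size vectors $\mathbf{b} = (b_1, \dots, b_n)$ with $\sum_i b_i = m$, the balanced vector is majorized by every other assignment:

$$ \bar{\mathbf{b}} = \Big( \tfrac{m}{n}, \dots, \tfrac{m}{n} \Big) \preceq \mathbf{b}. $$

Hence, if the expected job compute time $\mathbb{E}[T(\mathbf{b})]$ is Schur-convex in $\mathbf{b}$, then $\mathbb{E}[T(\bar{\mathbf{b}})] \le \mathbb{E}[T(\mathbf{b})]$ for every assignment $\mathbf{b}$, which is why the balanced assignment is optimal.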