Extreme-Scale De Novo Genome Assembly


Added by Aydin Buluc

Publication date 2017

Fields: Informatics Engineering

Language: English

Authors: Evangelos Georganas, Steven Hofmeyr, Rob Egan

Distributed Parallel and Cluster Computing


De novo whole genome assembly reconstructs genomic sequence from short, overlapping, and potentially erroneous DNA segments and is one of the most important computations in modern genomics. This work presents HipMer, a high-quality end-to-end de novo assembler designed for extreme-scale analysis via efficient parallelization of the Meraculous code. Genome assembly software has many components, each of which stresses a different part of a computer system. This chapter explains in detail the computational challenges involved in each step of the HipMer pipeline, the key distributed data structures, and the communication costs. We present performance results for assembling the human genome and the large hexaploid wheat genome on large supercomputers using up to tens of thousands of cores.


Extreme Scale De Novo Metagenome Assembly

Evangelos Georganas, Rob Egan, Steven Hofmeyr (2018)

Metagenome assembly is the process of transforming a set of short, overlapping, and potentially erroneous DNA segments from environmental samples into an accurate representation of the underlying microbiomes' genomes. State-of-the-art tools require big shared-memory machines and cannot handle contemporary metagenome datasets that exceed terabytes in size. In this paper, we introduce the MetaHipMer pipeline, a high-quality and high-performance metagenome assembler that employs an iterative de Bruijn graph approach. MetaHipMer leverages a specialized scaffolding algorithm that produces long scaffolds and accommodates the idiosyncrasies of metagenomes. MetaHipMer is end-to-end parallelized using the Unified Parallel C language and therefore can run seamlessly on shared- and distributed-memory systems. Experimental results show that MetaHipMer matches or outperforms the state-of-the-art tools in terms of accuracy. Moreover, MetaHipMer scales efficiently to large concurrencies and is able to assemble previously intractable grand challenge metagenomes. We demonstrate the unprecedented capability of MetaHipMer by computing the first full assembly of the Twitchell Wetlands dataset, which consists of 7.5 billion reads and is 2.6 TBytes in size.
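As a rough illustration of the de Bruijn graph idea mentioned in this abstract, the following single-machine Python sketch builds a graph from k-mers and walks one unambiguous path to recover a contig. It is not MetaHipMer's implementation: the function names `build_de_bruijn` and `extend_contig`, the toy reads, and the choice of k = 4 are all assumptions made for illustration, and the sketch uses a single k rather than an iterative scheme over several k values.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Each k-mer contributes one edge: prefix (k-1)-mer -> suffix (k-1)-mer.
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk forward from `start` while the path is unambiguous (hypothetical helper)."""
    in_degree = defaultdict(int)
    for successors in graph.values():
        for s in successors:
            in_degree[s] += 1

    contig = start
    node = start
    visited = {start}
    # Continue only while the current node has exactly one successor.
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        # Stop at a merge point or if we would loop back onto the path.
        if in_degree[nxt] != 1 or nxt in visited:
            break
        contig += nxt[-1]
        visited.add(nxt)
        node = nxt
    return contig


# Toy, error-free reads that overlap by four bases (illustrative data only).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = build_de_bruijn(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)
```

A distributed assembler would partition the k-mer table across nodes; this sketch keeps everything in one dictionary purely to show the graph construction and traversal steps.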

Distributed Parallel and Cluster Computing Genomics

Parallel String Graph Construction and Transitive Reduction for De Novo Genome Assembly

Giulia Guidi, Oguz Selvitopi, Marquita Ellis (2020)

One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates a consensus. Despite the many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the state-of-the-art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
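The transitive reduction step described in this abstract can be sketched in a few lines of Python. This is a simplified, single-machine analogue and not the diBELLA 2D algorithm: it treats the overlap graph as a plain directed acyclic graph, ignores overlap lengths and read orientation, and uses a hypothetical function name `transitive_reduction`. The two-hop set it computes plays the role of squaring the adjacency matrix over a boolean semiring.

```python
def transitive_reduction(edges):
    """Remove every edge u->w for which some u->v and v->w also exist.

    Assumes the overlap graph is acyclic (an illustrative simplification).
    """
    succ = {}
    for u, w in edges:
        succ.setdefault(u, set()).add(w)

    # Two-hop reachability: the boolean-semiring "square" of the adjacency relation.
    two_hop = {
        u: {w for v in vs for w in succ.get(v, ())}
        for u, vs in succ.items()
    }

    # Keep only edges that are not implied by a two-hop path.
    return sorted(
        (u, w) for u, vs in succ.items() for w in vs if w not in two_hop[u]
    )


# Toy overlap graph: r1 overlaps r2 and r3, r2 overlaps r3, r3 overlaps r4.
overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")]
string_graph = transitive_reduction(overlaps)
print(string_graph)
```

Here the edge r1->r3 is dropped because it is implied by r1->r2 and r2->r3, leaving a simple chain of reads.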

Distributed Parallel and Cluster Computing Genomics

Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species

Keith R. Bradnam (2013)

Background - The process of generating raw genome sequence data continues to become cheaper, faster, and more accurate. However, assembly of such data into high-quality, finished genome sequences remains challenging. Many genome assembly tools are available, but they differ greatly in terms of their performance (speed, scalability, hardware requirements, acceptance of newer read technologies) and in their final output (composition of assembled sequence). More importantly, it remains largely unclear how best to assess the quality of assembled genome sequences. The Assemblathon competitions are intended to assess current state-of-the-art methods in genome assembly. Results - In Assemblathon 2, we provided a variety of sequence data to be assembled for three vertebrate species (a bird, a fish, and a snake). This resulted in a total of 43 submitted assemblies from 21 participating teams. We evaluated these assemblies using a combination of optical map data, Fosmid sequences, and several statistical methods. From over 100 different metrics, we chose ten key measures by which to assess the overall quality of the assemblies. Conclusions - Many current genome assemblers produced useful assemblies, containing a significant representation of their genes, regulatory sequences, and overall genome structure. However, the high degree of variability between the entries suggests that there is still much room for improvement in the field of genome assembly and that approaches which work well in assembling the genome of one species may not necessarily work well for another.

Genomics

Towards de novo RNA 3D structure prediction

Sandro Bottaro, Francesco Di Palma, Giovanni Bussi (2015)

RNA is a fundamental class of biomolecules that mediate a large variety of molecular processes within the cell. Computational algorithms can be of great help in understanding the RNA structure-function relationship. One of the main challenges in this field is the development of structure-prediction algorithms, which aim to predict the three-dimensional (3D) native fold from knowledge of the sequence alone. In a recent paper, we introduced a scoring function for RNA structure prediction. Here, we analyze the performance of the method in detail, underline its strengths and shortcomings, and discuss the results with respect to state-of-the-art techniques. These observations provide a starting point for improving current methodologies, thus paving the way to more accurate approaches for RNA 3D structure prediction.

Biomolecules Biological Physics Chemical Physics

Multiscale modelling of de novo anaerobic granulation

A. Tenore, F. Russo, M.R. Mattei (2021)

A multiscale mathematical model is presented to describe the de novo granulation and the evolution of multispecies granular biofilms within a continuous reactor. The granule is modelled as a spherical free boundary domain with radial symmetry. The equation which governs the free boundary is derived from global mass balance considerations and takes into account the growth of sessile biomass and the exchange fluxes with the bulk liquid. Starting from a vanishing initial value, the expansion of the free boundary is initiated by the attachment process, which depends on the microbial species concentrations within the bulk liquid and their specific attachment velocity. Nonlinear hyperbolic PDEs model the growth of the sessile microbial species, while quasi-linear parabolic PDEs govern the dynamics of substrates and invading species within the granular biofilm. Nonlinear ODEs govern the evolution of soluble substrates and planktonic biomass within the bulk liquid. The model is applied to an anaerobic granular-based system and solved numerically to test its qualitative behaviour and explore the main aspects of de novo anaerobic granulation: ecology, biomass distribution, relative abundance, dimensional evolution of the granules and soluble substrates and planktonic biomass dynamics within the reactor. The numerical results confirm that the model accurately describes the ecology and the concentrically-layered structure of anaerobic granules observed experimentally, and is able to predict the effects of some significant factors, such as influent wastewater composition, granulation properties of planktonic biomass, biomass density and hydrodynamic and shear stress conditions, on the process performance.

Populations and Evolution Biological Physics