
Faster and More Accurate Sequence Alignment with SNAP

Added by Matei Zaharia
Publication date: 2011
Research language: English





We present the Scalable Nucleotide Alignment Program (SNAP), a new short- and long-read aligner that is both more accurate (i.e., aligns more reads with fewer errors) and 10-100x faster than state-of-the-art tools such as BWA. Unlike recent aligners based on the Burrows-Wheeler transform, SNAP uses a simple hash index of short seed sequences from the genome, similar to BLAST's. However, SNAP greatly reduces the number and cost of local alignment checks performed through several measures: it uses longer seeds to reduce the false positive locations considered, leverages larger memory capacities to speed index lookup, and excludes most candidate locations without fully computing their edit distance to the read. The result is an algorithm that scales well for reads from one hundred to thousands of bases long and provides a rich error model that can match classes of mutations (e.g., longer indels) that today's fast aligners ignore. We calculate that SNAP can align a dataset with 30x coverage of a human genome in less than an hour for a cost of $2 on Amazon EC2, with higher accuracy than BWA. Finally, we describe ongoing work to further improve SNAP.
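To make the seed-and-check idea concrete, here is a minimal Python sketch: a hash index maps fixed-length seeds of the genome to their positions, and candidate locations are abandoned as soon as a bounded edit-distance computation proves they cannot beat the best hit found so far. Function names and parameters (seed_len, max_dist) are illustrative and do not reflect SNAP's actual implementation.

```python
from collections import defaultdict

def build_seed_index(genome, seed_len=20):
    """Hash every fixed-length seed in the genome to its start positions."""
    index = defaultdict(list)
    for i in range(len(genome) - seed_len + 1):
        index[genome[i:i + seed_len]].append(i)
    return index

def bounded_edit_distance(a, b, limit):
    """Levenshtein distance, abandoned early (returns None) once it must exceed `limit`."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        if min(cur) > limit:            # every cell in this row already exceeds the bound
            return None
        prev = cur
    return prev[-1] if prev[-1] <= limit else None

def align_read(read, genome, index, seed_len=20, max_dist=8):
    """Look up a few seeds of the read, then score candidate locations,
    rejecting any that cannot improve on the best hit found so far."""
    best_pos, best_dist = None, max_dist + 1
    for off in range(0, len(read) - seed_len + 1, seed_len):
        for pos in index.get(read[off:off + seed_len], ()):
            start = max(pos - off, 0)
            d = bounded_edit_distance(read, genome[start:start + len(read)], best_dist - 1)
            if d is not None and d < best_dist:
                best_pos, best_dist = start, d
    return best_pos, best_dist          # (None, max_dist + 1) means the read was not aligned
```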



Related Research

The Oxford Nanopore MinION sequencer is currently the smallest sequencing device available. While it can produce very long reads (reads of up to 100 kbp have been reported), it is prone to high sequencing error rates of up to 30%. Since most of these errors are insertions or deletions, it is very difficult to adapt popular seed-based algorithms designed for aligning data sets with much lower error rates. Base calling of MinION reads is typically done using hidden Markov models. In this paper, we propose to represent each sequencing read by an ensemble of sequences sampled from such a probabilistic model. This approach can improve the sensitivity and false positive rate of seeding an alignment compared to using a single representative base-call sequence for each read.
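As a rough illustration of the ensemble idea, the sketch below samples several plausible base-call sequences for a read and takes the union of their seeds, so a seed corrupted in one sample can still be recovered from another. The sampler here is a stand-in; in the paper the samples come from the base caller's hidden Markov model, and all names are illustrative.

```python
import random

def sample_base_calls(read_model, n_samples=10):
    """Placeholder sampler: draw several plausible base-call sequences for one read.
    Here we simply perturb a single call to keep the sketch self-contained; in practice
    the samples would be drawn from the base caller's probabilistic model."""
    call, error_rate = read_model          # (representative sequence, per-base error rate)
    alphabet = "ACGT"
    samples = []
    for _ in range(n_samples):
        s = [c if random.random() > error_rate else random.choice(alphabet) for c in call]
        samples.append("".join(s))
    return samples

def ensemble_seeds(read_model, seed_len=12):
    """Union of seeds over all sampled base calls: a seed present in any sample can
    trigger a candidate alignment location, which improves seeding sensitivity."""
    seeds = set()
    for s in sample_base_calls(read_model):
        for i in range(len(s) - seed_len + 1):
            seeds.add(s[i:i + seed_len])
    return seeds
```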
1. Joint Species Distribution Models (JSDMs) explain spatial variation in community composition by contributions of the environment, biotic associations, and possibly spatially structured residual covariance. They show great promise as a general analytical framework for community ecology and macroecology, but current JSDMs, even when approximated by latent variables, scale poorly on large datasets, limiting their usefulness for currently emerging big (e.g., metabarcoding and metagenomics) community datasets. 2. Here, we present a novel, more scalable JSDM (sjSDM) that circumvents the need to use latent variables by using a Monte-Carlo integration of the joint JSDM likelihood and allows flexible elastic net regularization on all model components. We implemented sjSDM in PyTorch, a modern machine learning framework that can make use of CPU and GPU calculations. Using simulated communities with known species-species associations and different numbers of species and sites, we compare sjSDM with state-of-the-art JSDM implementations to determine computational runtimes and the accuracy of the inferred species-species and species-environment associations. 3. We find that sjSDM is orders of magnitude faster than existing JSDM algorithms (even when run on the CPU) and can be scaled to very large datasets. Despite the dramatically improved speed, sjSDM produces more accurate estimates of species association structures than alternative JSDM implementations. We demonstrate the applicability of sjSDM to big community data using an eDNA case study with thousands of fungal operational taxonomic units (OTUs). 4. Our sjSDM approach makes the application of JSDMs to large community datasets with hundreds or thousands of species possible, substantially extending the applicability of JSDMs in ecology. We provide our method in an R package to facilitate its use in practical data analysis.
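The central trick, Monte-Carlo integration of the joint likelihood over correlated residuals, can be sketched in a few lines of PyTorch. The shapes, names, and the sigmoid link below are illustrative simplifications for the sketch and are not the sjSDM package API.

```python
import torch

def mc_joint_log_lik(X, Y, beta, sigma_factor, n_mc=100):
    """Monte-Carlo approximation of a joint species-distribution likelihood.
    X: sites x predictors, Y: sites x species (0/1), beta: predictors x species,
    sigma_factor: species x latent dims, so the residual covariance is
    sigma_factor @ sigma_factor.T.  A sigmoid link stands in for the probit link here."""
    mu = X @ beta                                     # environmental contribution
    eps = torch.randn(n_mc, X.shape[0], sigma_factor.shape[1])
    noise = eps @ sigma_factor.T                      # correlated residual draws
    probs = torch.sigmoid(mu + noise)                 # occurrence probability per MC draw
    lik = probs * Y + (1 - probs) * (1 - Y)           # per-draw Bernoulli likelihoods
    return torch.log(lik.prod(dim=2).mean(dim=0) + 1e-10).sum()

def elastic_net(params, alpha=0.5, lam=1e-3):
    """Elastic-net penalty applied to all model components."""
    return lam * sum(alpha * p.abs().sum() + (1 - alpha) * (p ** 2).sum() for p in params)
```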
In this work, we first show that on the widely used LibriSpeech benchmark, our transformer-based context-dependent connectionist temporal classification (CTC) system produces state-of-the-art results. We then show that using wordpieces as modeling units combined with CTC training, we can greatly simplify the engineering pipeline compared to conventional frame-based cross-entropy training by excluding all the GMM bootstrapping, decision tree building, and forced alignment steps, while still achieving very competitive word error rates. Additionally, using wordpieces as modeling units can significantly improve runtime efficiency since we can use a larger stride without losing accuracy. We further confirm these findings on two internal VideoASR datasets: German, which is similar to English as a fusional language, and Turkish, which is an agglutinative language.
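A minimal sketch of the training objective with wordpiece output units, using PyTorch's CTC loss; the dimensions and vocabulary size are placeholders, and the acoustic model is faked with random activations. The point is only that this objective needs no GMM bootstrapping, decision-tree building, or forced alignment.

```python
import torch
import torch.nn as nn

# Illustrative sizes: frames after striding, batch size, wordpiece vocabulary (incl. blank).
T, N, C = 50, 4, 1000
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=2)  # stand-in encoder output
targets = torch.randint(1, C, (N, 12), dtype=torch.long)                 # wordpiece id sequences
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), 12, dtype=torch.long)

ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs, targets, input_lengths, target_lengths)
loss.backward()   # CTC marginalizes over alignments, so no frame-level labels are required
```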
Dendrograms are a way to represent evolutionary relationships between organisms. Nowadays, these are inferred by comparing gene or protein sequences, taking into account their differences and similarities. The choice of genetic material used for the sequence alignments (all genes or particular sets of genes) results in distinct inferred dendrograms. In this work, we evaluate differences between dendrograms reconstructed with different methodologies and obtained for different sets of organisms chosen at random from a much larger set. A statistical analysis is performed in order to estimate the fluctuation between the results obtained from the different methodologies. This analysis permits us to validate a systematic approach, based on the comparison of the organisms' metabolic networks, for inferring dendrograms. It has the advantage of allowing the comparison of organisms that are very far apart in the evolutionary tree even if they have no known ortholog gene in common.
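One simple way to quantify agreement between dendrograms built from different data sources (for example, sequence-based versus metabolic-network-based distances) is to correlate their cophenetic distances. The sketch below assumes precomputed square distance matrices over the same organisms and is not the authors' exact statistical procedure.

```python
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform
from scipy.stats import pearsonr

def dendrogram_agreement(dist_a, dist_b, method="average"):
    """Build dendrograms from two symmetric distance matrices (zero diagonal) over the
    same organisms and compare them via the correlation of their cophenetic distances."""
    link_a = linkage(squareform(dist_a), method=method)
    link_b = linkage(squareform(dist_b), method=method)
    r, _ = pearsonr(cophenet(link_a), cophenet(link_b))
    return r   # close to 1 means the two dendrograms imply very similar groupings
```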
We propose a simple tractable pair hidden Markov model for pairwise sequence alignment that accounts for the presence of short tandem repeats. Using the framework of gain functions, we design several optimization criteria for decoding this model and describe the resulting decoding algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms specialized for our model. We compare the accuracy of individual decoding algorithms on simulated data and find our approach superior to the classical three-state pair HMM in simulations.
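For reference, a compact Viterbi decoder for the classical three-state pair HMM (the baseline mentioned above) might look like the sketch below; the transition and emission parameters are illustrative, and the repeat-aware model from the paper would add further states.

```python
import numpy as np

LOG0 = -np.inf
EPS = 1e-12   # near-zero probability for the (disallowed) direct X<->Y transitions

def viterbi_pair_hmm(x, y, p_match=0.9, gap_open=0.05, gap_ext=0.4):
    """Log-probability of the best alignment of x and y under a three-state pair HMM
    (M = match/mismatch, X = gap in y, Y = gap in x); parameters are illustrative."""
    trans = np.log(np.array([
        [1 - 2 * gap_open,  gap_open, gap_open],   # from M
        [1 - gap_ext - EPS, gap_ext,  EPS],        # from X
        [1 - gap_ext - EPS, EPS,      gap_ext]]))  # from Y
    emit_gap = np.log(0.25)                        # uniform single-symbol emission in gap states
    def emit(a, b):
        return np.log(p_match if a == b else (1 - p_match) / 3)
    n, m = len(x), len(y)
    V = np.full((3, n + 1, m + 1), LOG0)
    V[0, 0, 0] = 0.0                               # start in the match state
    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:                    # M consumes one symbol from each sequence
                V[0, i, j] = emit(x[i-1], y[j-1]) + max(V[k, i-1, j-1] + trans[k, 0] for k in range(3))
            if i > 0:                              # X consumes a symbol of x only
                V[1, i, j] = emit_gap + max(V[k, i-1, j] + trans[k, 1] for k in range(3))
            if j > 0:                              # Y consumes a symbol of y only
                V[2, i, j] = emit_gap + max(V[k, i, j-1] + trans[k, 2] for k in range(3))
    return max(V[k, n, m] for k in range(3))
```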