Optimal Seed Solver: Optimizing Seed Selection in Read Mapping

140 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Hongyi Xin

تاريخ النشر 2015

مجال البحث الهندسة المعلوماتية علم الأحياء

والبحث باللغة English

تأليف Hongyi Xin - Richard Zhu - Sunny Nahar

الهندسة الحاسوبية، المالية،العلوم الجينوم

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Motivation: Optimizing seed selection is an important problem in read mapping. The number of non-overlapping seeds a mapper selects determines the sensitivity of the mapper while the total frequency of all selected seeds determines the speed of the mapper. Modern seed-and-extend mappers usually select seeds with either an equal and fixed-length scheme or with an inflexible placement scheme, both of which limit the potential of the mapper to select less frequent seeds to speed up the mapping process. Therefore, it is crucial to develop a new algorithm that can adjust both the individual seed length and the seed placement, as well as derive less frequent seeds. Results: We present the Optimal Seed Solver (OSS), a dynamic programming algorithm that discovers the least frequently-occurring set of x seeds in an L-bp read in $O(x times L)$ operations on average and in $O(x times L^{2})$ operations in the worst case. We compared OSS against four state-of-the-art seed selection schemes and observed that OSS provides a 3-fold reduction of average seed frequency over the best previous seed selection optimizations.

قيم البحث

88 - Jeremie S Kim , Damla Senol , Hongyi Xin 2017

Motivation: Seed filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. Read mappers 1) quickly generate possible mapping locations (i.e., seeds) for each read, 2) extract reference sequences at each of the mapping locations, and then 3) check similarity between each read and its associated reference sequences with a computationally expensive dynamic programming algorithm (alignment) to determine the origin of the read. Location filters come into play before alignment, discarding seed locations that alignment would have deemed a poor match. The ideal location filter would discard all poor matching locations prior to alignment such that there is no wasted computation on poor alignments. Results: We propose a novel filtering algorithm, GRIM-Filter, optimized to exploit emerging 3D-stacked memory systems that integrate computation within a stacked logic layer, enabling processing-in-memory (PIM). GRIM-Filter quickly filters locations by 1) introducing a new representation of coarse-grained segments of the reference genome and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for 5% error acceptance rates, GRIM-Filter eliminates 5.59x-6.41x more false negatives and exhibits end-to-end speedups of 1.81x-3.65x compared to mappers employing the best previous filtering algorithm.

الجينوم

GRIM-Filter: Fast Seed Location Filtering in DNA Read Mapping Using Processing-in-Memory Technologies

69 - Jeremie S. Kim , Damla Senol Cali , Hongyi Xin 2017

Motivation: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. Results: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x--6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x--3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. Availability: The code is available online at: https://github.com/CMU-SAFARI/GRIM

الجينوم الهندسة الحاسوبية، المالية،العلوم

Seed balancing selection: Coexistence in seedbank models

255 - Maria Emilia Caballero , Adrian Gonzalez Casanova , Jose Luisn Perez 2021

We consider a population constituted by two types of individuals; each of them can produce offspring in two different islands (as a particular case the islands can be interpreted as active or dormant individuals). We model the evolution of the popula tion of each type using a two-type Feller diffusion with immigration, and we study the frequency of one of the types, in each island, when the total population size in each island is forced to be constant at a dense set of times. This leads to the solution of a SDE which we call the asymmetric two-island frequency process. We derive properties of this process and obtain a large population limit when the total size of each island tends to infinity. Additionally, we compute the fluctuations of the process around its deterministic limit. We establish conditions under which the asymmetric two-island frequency process has a moment dual. The dual is a continuous-time two-dimensional Markov chain that can be interpreted in terms of mutation, branching, pairwise branching, coalescence, and a novel mixed selection-migration term. Also, we conduct a stability analysis of the limiting deterministic dynamical system and present some numerical results to study fixation and a new form of balancing selection. When restricting to the seedbank model, we observe that some combinations of the parameters lead to balancing selection. Besides finding yet another way in which genetic reservoirs increase the genetic variability, we find that if a population that sustains a seedbank competes with one that does not, the seed producers will have a selective advantage if they reproduce faster, but will not have a selective disadvantage if they reproduce slower: their worst case scenario is balancing selection.

السكان والتطور الاحتمالات

Genealogy of a Wright Fisher model with strong seed bank component

162 - Jochen Blath , Bjarki Eldon , Adrian Gonzalez Casanova 2014

We investigate the behaviour of the genealogy of a Wright-Fisher population model under the influence of a strong seed-bank effect. More precisely, we consider a simple seed-bank age distribution with two atoms, leading to either classical or long ge nealogical jumps (the latter modeling the effect of seed-dormancy). We assume that the length of these long jumps scales like a power $N^beta$ of the original population size $N$, thus giving rise to a `strong seed-bank effect. For a certain range of $beta$, we prove that the ancestral process of a sample of $n$ individuals converges under a non-classical time-scaling to Kingmans $n-$coalescent. Further, for a wider range of parameters, we analyze the time to the most recent common ancestor of two individuals analytically and by simulation.

الاحتمالات السكان والتطور

Stable Prediction via Leveraging Seed Variable

203 - Kun Kuang , Bo Li , Peng Cui 2020

In this paper, we focus on the problem of stable prediction across unknown test data, where the test distribution is agnostic and might be totally different from the training one. In such a case, previous machine learning methods might exploit subtly spurious correlations in training data induced by non-causal variables for prediction. Those spurious correlations are changeable across data, leading to instability of prediction across data. By assuming the relationships between causal variables and response variable are invariant across data, to address this problem, we propose a conditional independence test based algorithm to separate those causal variables with a seed variable as priori, and adopt them for stable prediction. By assuming the independence between causal and non-causal variables, we show, both theoretically and with empirical experiments, that our algorithm can precisely separate causal and non-causal variables for stable prediction across test data. Extensive experiments on both synthetic and real-world datasets demonstrate that our algorithm outperforms state-of-the-art methods for stable prediction.

التعلم الآلي الذكاء الاصطناعي التعلم الالي

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة المأمون الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Optimal Seed Solver: Optimizing Seed Selection in Read Mapping

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً