ترغب بنشر مسار تعليمي؟ اضغط هنا

Genetic Sequence Matching Using D4M Big Data Approaches

268   0   0.0 ( 0 )
 نشر من قبل Stephanie Dodson
 تاريخ النشر 2014
  مجال البحث علم الأحياء
والبحث باللغة English




اسأل ChatGPT حول البحث

Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.



قيم البحث

اقرأ أيضاً

The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we pre sent D$^{4}$RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D$^{4}$RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
We propose a simple tractable pair hidden Markov model for pairwise sequence alignment that accounts for the presence of short tandem repeats. Using the framework of gain functions, we design several optimization criteria for decoding this model and describe the resulting decoding algorithms, ranging from the traditional Viterbi and posterior decoding to block-based decoding algorithms specialized for our model. We compare the accuracy of individual decoding algorithms on simulated data and find our approach superior to the classical three-state pair HMM in simulations.
PURPOSE: The popularity of germline genetic panel testing has led to a vast accumulation of variant-level data. Variant names are not always consistent across laboratories and not easily mappable to public variant databases such as ClinVar. A tool th at can automate the process of variants harmonization and mapping is needed to help clinicians ensure their variant interpretations are accurate. METHODS: We present a Python-based tool, Ask2Me VarHarmonizer, that incorporates data cleaning, name harmonization, and a four-attempt mapping to ClinVar procedure. We applied this tool to map variants from a pilot dataset collected from 11 clinical practices. Mapping results were evaluated with and without the transcript information. RESULTS: Using Ask2Me VarHarmonizer, 4728 out of 6027 variant entries (78%) were successfully mapped to ClinVar, corresponding to 3699 mappable unique variants. With the addition of 1099 unique unmappable variants, a total of 4798 unique variants were eventually identified. 427 (9%) of these had multiple names, of which 343 (7%) had multiple names within-practice. 99% mapping consistency was observed with and without transcript information. CONCLUSION: Ask2Me VarHarmonizer aggregates and structures variant data, harmonizes names, and maps variants to ClinVar. Performing harmonization removes the ambiguity and redundancy of variants from different sources.
When analyzing the genome, researchers have discovered that proteins bind to DNA based on certain patterns of the DNA sequence known as motifs. However, it is difficult to manually construct motifs due to their complexity. Recently, externally learne d memory models have proven to be effective methods for reasoning over inputs and supporting sets. In this work, we present memory matching networks (MMN) for classifying DNA sequences as protein binding sites. Our model learns a memory bank of encoded motifs, which are dynamic memory modules, and then matches a new test sequence to each of the motifs to classify the sequence as a binding or nonbinding site.
Data from discovery proteomic and phosphoproteomic experiments typically include missing values that correspond to proteins that have not been identified in the analyzed sample. Replacing the missing values with random numbers, a process known as imp utation, avoids apparent infinite fold-change values. However, the procedure comes at a cost: Imputing a large number of missing values has the potential to significantly impact the results of the subsequent differential expression analysis. We propose a method that identifies differentially expressed proteins by ranking their observed changes with respect to the changes observed for other proteins. Missing values are taken into account by this method directly, without the need to impute them. We illustrate the performance of the new method on two distinct datasets and show that it is robust to missing values and, at the same time, provides results that are otherwise similar to those obtained with edgeR which is a state-of-art differential expression analysis method. The new method for the differential expression analysis of proteomic data is available as an easy to use Python package.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا