No Arabic abstract
The decreasing costs and increasing speed and accuracy of DNA sample collection, preparation, and sequencing has rapidly produced an enormous volume of genetic data. However, fast and accurate analysis of the samples remains a bottleneck. Here we present D$^{4}$RAGenS, a genetic sequence identification algorithm that exhibits the Big Data handling and computational power of the Dynamic Distributed Dimensional Data Model (D4M). The method leverages linear algebra and statistical properties to increase computational performance while retaining accuracy by subsampling the data. Two run modes, Fast and Wise, yield speed and precision tradeoffs, with applications in biodefense and medical diagnostics. The D$^{4}$RAGenS analysis algorithm is tested over several datasets, including three utilized for the Defense Threat Reduction Agency (DTRA) metagenomic algorithm contest.
Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.
Recent chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally-relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone modfications at the boundaries.
Human associated microbial communities exert tremendous influence over human health and disease. With modern metagenomic sequencing methods it is possible to follow the relative abundance of microbes in a community over time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer the interactions between species from sequence data. Any algorithm for inferring species interactions must overcome three obstacles: 1) a correlation between the abundances of two species does not imply that those species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of sequencing reads into operational taxonomic units, bias inferences of species interactions. Here we introduce an approach, Learning Interactions from MIcrobial Time Series (LIMITS), that overcomes these obstacles. LIMITS uses sparse linear regression with boostrap aggregation to infer a discrete-time Lotka-Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut microbiomes of two individuals and found that the interaction networks varied significantly between individuals. Furthermore, we found that the interaction networks of the two individuals are dominated by distinct keystone species, Bacteroides fragilis and Bacteroided stercosis, that have a disproportionate influence on the structure of the gut microbiome even though they are only found in moderate abundance. Based on our results, we hypothesize that the abundances of certain keystone species may be responsible for individuality in the human gut microbiome.
Motivation: We introduce TRONCO (TRanslational ONCOlogy), an open-source R package that implements the state-of-the-art algorithms for the inference of cancer progression models from (epi)genomic mutational profiles. TRONCO can be used to extract population-level models describing the trends of accumulation of alterations in a cohort of cross-sectional samples, e.g., retrieved from publicly available databases, and individual-level models that reveal the clonal evolutionary history in single cancer patients, when multiple samples, e.g., multiple biopsies or single-cell sequencing data, are available. The resulting models can provide key hints in uncovering the evolutionary trajectories of cancer, especially for precision medicine or personalized therapy. Availability: TRONCO is released under the GPL license, it is hosted in the Software section at http://bimib.disco.unimib.it/ and archived also at bioconductor.org. Contact:
[email protected]
PURPOSE: The popularity of germline genetic panel testing has led to a vast accumulation of variant-level data. Variant names are not always consistent across laboratories and not easily mappable to public variant databases such as ClinVar. A tool that can automate the process of variants harmonization and mapping is needed to help clinicians ensure their variant interpretations are accurate. METHODS: We present a Python-based tool, Ask2Me VarHarmonizer, that incorporates data cleaning, name harmonization, and a four-attempt mapping to ClinVar procedure. We applied this tool to map variants from a pilot dataset collected from 11 clinical practices. Mapping results were evaluated with and without the transcript information. RESULTS: Using Ask2Me VarHarmonizer, 4728 out of 6027 variant entries (78%) were successfully mapped to ClinVar, corresponding to 3699 mappable unique variants. With the addition of 1099 unique unmappable variants, a total of 4798 unique variants were eventually identified. 427 (9%) of these had multiple names, of which 343 (7%) had multiple names within-practice. 99% mapping consistency was observed with and without transcript information. CONCLUSION: Ask2Me VarHarmonizer aggregates and structures variant data, harmonizes names, and maps variants to ClinVar. Performing harmonization removes the ambiguity and redundancy of variants from different sources.