No Arabic abstract
PURPOSE: The popularity of germline genetic panel testing has led to a vast accumulation of variant-level data. Variant names are not always consistent across laboratories and not easily mappable to public variant databases such as ClinVar. A tool that can automate the process of variants harmonization and mapping is needed to help clinicians ensure their variant interpretations are accurate. METHODS: We present a Python-based tool, Ask2Me VarHarmonizer, that incorporates data cleaning, name harmonization, and a four-attempt mapping to ClinVar procedure. We applied this tool to map variants from a pilot dataset collected from 11 clinical practices. Mapping results were evaluated with and without the transcript information. RESULTS: Using Ask2Me VarHarmonizer, 4728 out of 6027 variant entries (78%) were successfully mapped to ClinVar, corresponding to 3699 mappable unique variants. With the addition of 1099 unique unmappable variants, a total of 4798 unique variants were eventually identified. 427 (9%) of these had multiple names, of which 343 (7%) had multiple names within-practice. 99% mapping consistency was observed with and without transcript information. CONCLUSION: Ask2Me VarHarmonizer aggregates and structures variant data, harmonizes names, and maps variants to ClinVar. Performing harmonization removes the ambiguity and redundancy of variants from different sources.
Although somatic mutations are the main contributor to cancer, underlying germline alterations may increase the risk of cancer, mold the somatic alteration landscape and cooperate with acquired mutations to promote the tumor onset and/or maintenance. Therefore, both tumor genome and germline sequence data have to be analyzed to have a more complete picture of the overall genetic foundation of the disease. To reinforce such notion we quantitatively assess the bias of restricting the analysis to somatic mutation data using mutational data from well-known cancer genes which displays both types of alterations, inherited and somatically acquired mutations.
Comparative transcriptomics has gained increasing popularity in genomic research thanks to the development of high-throughput technologies including microarray and next-generation RNA sequencing that have generated numerous transcriptomic data. An important question is to understand the conservation and differentiation of biological processes in different species. We propose a testing-based method TROM (Transcriptome Overlap Measure) for comparing transcriptomes within or between different species, and provide a different perspective to interpret transcriptomic similarity in contrast to traditional correlation analyses. Specifically, the TROM method focuses on identifying associated genes that capture molecular characteristics of biological samples, and subsequently comparing the biological samples by testing the overlap of their associated genes. We use simulation and real data studies to demonstrate that TROM is more powerful in identifying similar transcriptomes and more robust to stochastic gene expression noise than Pearson and Spearman correlations. We apply TROM to compare the developmental stages of six Drosophila species, C. elegans, S. purpuratus, D. rerio and mouse liver, and find interesting correspondence patterns that imply conserved gene expression programs in the development of these species. The TROM method is available as an R package on CRAN (http://cran.r-project.org/) with manuals and source codes available at http://www.stat.ucla.edu/ jingyi.li/software-and-data/trom.html.
Motivation: We introduce TRONCO (TRanslational ONCOlogy), an open-source R package that implements the state-of-the-art algorithms for the inference of cancer progression models from (epi)genomic mutational profiles. TRONCO can be used to extract population-level models describing the trends of accumulation of alterations in a cohort of cross-sectional samples, e.g., retrieved from publicly available databases, and individual-level models that reveal the clonal evolutionary history in single cancer patients, when multiple samples, e.g., multiple biopsies or single-cell sequencing data, are available. The resulting models can provide key hints in uncovering the evolutionary trajectories of cancer, especially for precision medicine or personalized therapy. Availability: TRONCO is released under the GPL license, it is hosted in the Software section at http://bimib.disco.unimib.it/ and archived also at bioconductor.org. Contact:
[email protected]
Aggregating transcriptomics data across hospitals can increase sensitivity and robustness of differential expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, if class labels are inhomogeneously distributed between cohorts, their accuracy may drop. Flimma (https://exbio.wzw.tum.de/flimma/) addresses this issue by implementing the state-of-the-art workflow limma voom in a privacy-preserving manner, i.e. patient data never leaves its source site. Flimma results are identical to those generated by limma voom on combined datasets even in imbalanced scenarios where meta-analysis approaches fail.
Recent technological advances in Next Generation Sequencing tools have led to increasing speeds of DNA sample collection, preparation, and sequencing. One instrument can produce over 600 Gb of genetic sequence data in a single run. This creates new opportunities to efficiently handle the increasing workload. We propose a new method of fast genetic sequence analysis using the Dynamic Distributed Dimensional Data Model (D4M) - an associative array environment for MATLAB developed at MIT Lincoln Laboratory. Based on mathematical and statistical properties, the method leverages big data techniques and the implementation of an Apache Acculumo database to accelerate computations one-hundred fold over other methods. Comparisons of the D4M method with the current gold-standard for sequence analysis, BLAST, show the two are comparable in the alignments they find. This paper will present an overview of the D4M genetic sequence algorithm and statistical comparisons with BLAST.