
Identifying cancer subtypes in glioblastoma by combining genomic, transcriptomic and epigenomic data

Published by Richard Savage
Publication date: 2013
Language: English





We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structure. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$) but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow-up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that shows a highly distinctive data signature and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
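
The abstract reports log-rank p-values corrected for multiple hypothesis tests. As a rough illustration of what such a figure involves, the sketch below implements a two-group log-rank test followed by a Bonferroni correction. It is not the authors' code: the function names are hypothetical, the paper compares 8 subtypes rather than 2 groups, and the abstract does not say which correction was used, so Bonferroni is an assumption made here for simplicity.

```python
import math


def logrank_two_groups(times_a, events_a, times_b, events_b):
    """Two-group log-rank test. Returns (chi-square statistic, p-value).

    times_*  : follow-up time for each patient
    events_* : 1 if the event (e.g. recurrence) was observed, 0 if censored
    """
    # Pool the data, tagging each record with its group (0 = a, 1 = b).
    data = [(t, e, 0) for t, e in zip(times_a, events_a)]
    data += [(t, e, 1) for t, e in zip(times_b, events_b)]

    # Only time points with at least one observed event contribute.
    event_times = sorted({t for t, e, _ in data if e})

    obs_minus_exp = 0.0  # sum over times of (observed - expected) events in group a
    variance = 0.0       # sum of hypergeometric variances

    for t in event_times:
        n_a = sum(1 for ti, _, g in data if ti >= t and g == 0)  # at risk in a
        n_b = sum(1 for ti, _, g in data if ti >= t and g == 1)  # at risk in b
        n = n_a + n_b
        d = sum(1 for ti, e, _ in data if ti == t and e)              # events at t
        d_a = sum(1 for ti, e, g in data if ti == t and e and g == 0)  # events in a
        obs_minus_exp += d_a - d * n_a / n
        if n > 1:
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)

    if variance == 0:
        return 0.0, 1.0

    stat = obs_minus_exp ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def bonferroni(p_value, n_tests):
    """Bonferroni correction (an assumed choice, not stated in the abstract)."""
    return min(1.0, p_value * n_tests)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    stat, p = logrank_two_groups([1, 2, 3, 4, 5], [1, 1, 1, 1, 1],
                                 [10, 11, 12, 13, 14], [1, 1, 1, 1, 1])
    # Four outcomes are studied in the abstract: death, new tumour events,
    # progression and recurrence.
    print(stat, p, bonferroni(p, 4))
```

A comparison across all 8 consensus subtypes would need the multi-group generalisation of the statistic, which is compared against a chi-square distribution with 7 degrees of freedom.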




Read also

Intercellular heterogeneity serves as both a confounding factor in studying individual clones and an information source in characterizing heterogeneous tissues, such as blood and tumor systems. Due to inevitable sequencing errors and other sample preparation artifacts such as PCR errors, systematic efforts to characterize intercellular genomic heterogeneity must effectively distinguish genuine clonal sequences from fake derivatives. We developed a novel approach (SIGH) for identifying significant genuine clonal sequences directly from mixed sequencing reads that can improve genomic analyses in many biological contexts. This method offers several attractive features: (1) it automatically estimates the error rate from raw sequence reads and identifies genuine clonal sequences; (2) it is robust to the wide variation in error rates across experimental conditions; (3) it is supported by a well-grounded statistical framework that exploits probabilistic characteristics of sequencing errors; (4) its unbiased strategy allows detecting rare clones regardless of their relative abundance; and (5) it estimates constituent proportions in each sample. Extensive realistic simulation studies show that our method can reliably estimate the error rates and faithfully distinguish the genuine clones from fake derivatives, paving the way for follow-up analysis that is otherwise ruined by the often dominant fake clones.
The availability of genomic data is often essential to progress in biomedical research, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. As a result, several initiatives have been launched to experiment with synthetic genomic data, e.g., using generative models to learn the underlying distribution of the real data and generate artificial datasets that preserve its salient characteristics without exposing it. This paper provides the first evaluation of the utility and the privacy protection of six state-of-the-art models for generating synthetic genomic data. We assess the performance of the synthetic data on several common tasks, such as allele population statistics and linkage disequilibrium. We then measure privacy through the lens of membership inference attacks, i.e., inferring whether a record was part of the training data. Our experiments show that no single approach to generate synthetic genomic data yields both high utility and strong privacy across the board. Also, the size and nature of the training dataset matter. Moreover, while some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Looking forward, our techniques can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild and serve as a benchmark for future work.
Motivation: Predictive modelling of gene expression is a powerful framework for the in silico exploration of transcriptional regulatory interactions through the integration of high-throughput -omics data. A major limitation of previous approaches is their inability to handle conditional and synergistic interactions that emerge when collectively analysing genes subject to different regulatory mechanisms. This limitation reduces overall predictive power and thus the reliability of downstream biological inference. Results: We introduce an analytical modelling framework (TREEOME: tree of models of expression) that integrates epigenetic and transcriptomic data by separating genes into putative regulatory classes. Current predictive modelling approaches have found both DNA methylation and histone modification epigenetic data to provide little or no improvement in accuracy of prediction of transcript abundance despite, for example, distinct anti-correlation between mRNA levels and promoter-localised DNA methylation. To improve on this, in TREEOME we evaluate four possible methods of formulating gene-level DNA methylation metrics, which provide a foundation for identifying gene-level methylation events and subsequent differential analysis, whereas most previous techniques operate at the level of individual CpG dinucleotides. We demonstrate TREEOME by integrating gene-level DNA methylation (bisulfite-seq) and histone modification (ChIP-seq) data to accurately predict genome-wide mRNA transcript abundance (RNA-seq) for H1-hESC and GM12878 cell lines. Availability: TREEOME is implemented using open-source software and made available as a pre-configured bootable reference environment. All scripts and data presented in this study are available online at http://sourceforge.net/projects/budden2015treeome/.
Motivation: As cancer researchers have come to appreciate the importance of intratumor heterogeneity, much attention has focused on the challenges of accurately profiling heterogeneity in individual patients. Experimental technologies for directly profiling genomes of single cells are rapidly improving, but they are still impractical for large-scale sampling. Bulk genomic assays remain the standard for population-scale studies, but conflate the influences of mixtures of genetically distinct tumor, stromal, and infiltrating immune cells. Many computational approaches have been developed to deconvolute these mixed samples and reconstruct the genomics of genetically homogeneous clonal subpopulations. All such methods, however, are limited to reconstructing only coarse approximations to a few major subpopulations. In prior work, we showed that one can improve deconvolution of genomic data by leveraging substructure in cellular mixtures through a strategy called simplicial complex inference. This strategy, however, is also limited by the difficulty of inferring mixture structure from sparse, noisy assays. Results: We improve on past work by introducing enhancements to automate learning of substructured genomic mixtures, with specific emphasis on genome-wide copy number variation (CNV) data. We introduce methods for dimensionality estimation to better decompose mixture model substructure; fuzzy clustering to better identify substructure in sparse, noisy data; and automated model inference methods for other key model parameters. We show that these improvements lead to more accurate inference of cell populations and mixture proportions in simulated scenarios. We further demonstrate their effectiveness in identifying mixture substructure in real tumor CNV data. Availability: Source code is available at http://www.cs.cmu.edu/~russells/software/WSCUnmix.zip
Identifying subgroups and properties of cancer biopsy samples is a crucial step towards obtaining precise diagnoses and being able to perform personalized treatment of cancer patients. Recent data collections provide a comprehensive characterization of cancer cell data, including genetic data on copy number alterations (CNAs). We explore the potential to capture information contained in cancer genomic information using a novel topology-based approach that encodes each cancer sample as a persistence diagram of topological features, i.e., high-dimensional voids represented in the data. We find that this technique has the potential to extract meaningful low-dimensional representations in cancer somatic genetic data and demonstrate the viability of some applications on finding substructures in cancer data as well as comparing similarity of cancer types.