The Kolmogorov-Arnold stochasticity parameter technique is applied for the first time to the study of cancer genome sequencing, to reveal mutations. Using data generated by next-generation sequencing technologies, we have analyzed the exome sequences of brain tumor patients with matched tumor and normal blood samples. We show that mutations contained in sequencing data can be revealed using this technique, providing a new methodology for identifying subsequences of a given length that contain mutations: the value of the stochasticity parameter for such subsequences differs from that of subsequences without mutations. A potential application of this technique is to simplify the search for segments carrying mutations, speeding up genomic research and accelerating its adoption in clinical diagnostics. Moreover, the prediction, based purely on the value of the Kolmogorov function, of a mutation belonging to a family of mutations frequent in numerous types of cancer indicates that this marker may recognize genomic sequences present in extremely low abundance and can be used to reveal new types of mutations.
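To make the quantity concrete: for a subsequence of length $n$ with empirical CDF $F_n$ and reference CDF $F$, the stochasticity parameter is $\lambda_n = \sqrt{n}\,\sup_x |F_n(x) - F(x)|$, with limiting law given by Kolmogorov's distribution $\Phi(\lambda) = \sum_{k=-\infty}^{\infty} (-1)^k e^{-2k^2\lambda^2}$. The sketch below computes $\lambda_n$ for a numerically encoded DNA subsequence; the nucleotide encoding and the uniform reference distribution are illustrative assumptions, not the paper's exact choices.

```python
import numpy as np
from scipy.stats import kstest

# Illustrative nucleotide encoding and a uniform reference CDF on [1, 4];
# both are assumptions for this sketch, not the paper's exact choices.
ENCODE = {"A": 1.0, "C": 2.0, "G": 3.0, "T": 4.0}

def stochasticity_parameter(subseq):
    """lambda_n = sqrt(n) * sup_x |F_n(x) - F(x)| for an encoded subsequence."""
    x = np.array([ENCODE[b] for b in subseq])
    # One-sample KS statistic D_n against Uniform(loc=1, scale=3), i.e. [1, 4]
    d_n = kstest(x, "uniform", args=(1.0, 3.0)).statistic
    return np.sqrt(len(x)) * d_n

def kolmogorov_cdf(lam, terms=100):
    """Phi(lambda) = sum over k of (-1)^k * exp(-2 k^2 lambda^2), truncated."""
    k = np.arange(-terms, terms + 1)
    return float(np.sum((-1.0) ** k * np.exp(-2.0 * k * k * lam * lam)))

lam = stochasticity_parameter("ACGTACGTGGTACGAT")
print(lam, kolmogorov_cdf(lam))  # parameter and its limiting CDF value
```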
The phenotypic consequences of individual mutations are modulated by the wild-type genetic background in which they occur. Although such background dependence is widely observed, we do not know whether general patterns across species and traits exist, nor do we understand the mechanisms underlying it. We also lack knowledge of how mutations interact with genetic background to influence gene expression, and how this in turn mediates mutant phenotypes. Furthermore, how genetic background influences patterns of epistasis remains unclear. To investigate the genetic basis and genomic consequences of genetic background dependence of the scalloped[E3] allele on the Drosophila melanogaster wing, we generated multiple novel genome-level datasets from a mapping-by-introgression experiment and a tagged RNA gene expression dataset. In addition, we used whole-genome re-sequencing of the parental lines, two commonly used laboratory strains, to predict polymorphic transcription factor binding sites for SD. We integrated these data with previously published genomic datasets from expression microarrays and a modifier mutation screen. By searching for genes showing a congruent signal across multiple datasets, we identified a robust set of candidate loci contributing to the background-dependent effects of mutations in sd. We also show that the majority of background-dependent modifiers previously reported are caused by higher-order epistasis, not quantitative non-complementation. These findings provide a useful foundation for more detailed investigations of genetic background dependence in this system, and this approach is likely to prove useful in exploring the genetic basis of other traits as well.
We present a nonparametric Bayesian method for disease subtype discovery in multi-dimensional cancer data. Our method can simultaneously analyse a wide range of data types, allowing for both agreement and disagreement between their underlying clustering structures. It includes feature selection and infers the most likely number of disease subtypes, given the data. We apply the method to 277 glioblastoma samples from The Cancer Genome Atlas, for which there are gene expression, copy number variation, methylation and microRNA data. We identify 8 distinct consensus subtypes and study their prognostic value for death, new tumour events, progression and recurrence. The consensus subtypes are prognostic of tumour recurrence (log-rank p-value of $3.6 \times 10^{-4}$ after correction for multiple hypothesis tests). This is driven principally by the methylation data (log-rank p-value of $2.0 \times 10^{-3}$), but the effect is strengthened by the other 3 data types, demonstrating the value of integrating multiple data types. Of particular note is a subtype of 47 patients characterised by very low levels of methylation. This subtype has very low rates of tumour recurrence and no new events in 10 years of follow-up. We also identify a small gene expression subtype of 6 patients that shows particularly poor survival outcomes. Additionally, we note a consensus subtype that shows a highly distinctive data signature, and suggest that it is therefore a biologically distinct subtype of glioblastoma. The code is available from https://sites.google.com/site/multipledatafusion/
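For reference, the multi-group survival comparison reported above corresponds to a standard log-rank test across subtypes. A minimal sketch using the lifelines library on toy data; the column names and simulated values are illustrative assumptions, not the TCGA data schema.

```python
import numpy as np
import pandas as pd
from lifelines.statistics import multivariate_logrank_test

rng = np.random.default_rng(0)
n = 277  # matches the number of glioblastoma samples in the abstract
toy = pd.DataFrame({
    "months_to_recurrence": rng.exponential(scale=24.0, size=n),
    "recurred": rng.integers(0, 2, size=n),   # 1 = recurrence observed
    "subtype": rng.integers(1, 9, size=n),    # 8 consensus subtypes
})

# Multi-group log-rank test of recurrence curves across the subtypes
res = multivariate_logrank_test(
    toy["months_to_recurrence"], toy["subtype"], toy["recurred"]
)
print(res.p_value)
```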
Much evolutionary information is stored in the fluctuations of protein length distributions. The genome size and non-coding DNA content can be calculated from the protein length distributions alone, so there is an intrinsic relationship between coding DNA size and non-coding DNA size. Based on the correlations and quasi-periodicity of protein length distributions, life can be classified into three domains. Strong evidence is found to support the ordered structure of protein length distributions.
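A minimal sketch of the quantities involved: the empirical protein length distribution of a proteome, and its autocorrelation, whose peaks indicate quasi-periodicity. The autocorrelation statistic here is a generic choice for illustration and is an assumption, not necessarily the measure used in the paper.

```python
import numpy as np

def length_distribution(lengths, max_len=2000):
    """Empirical protein length distribution (lengths in amino acids)."""
    hist = np.bincount(np.asarray(lengths, dtype=int), minlength=max_len + 1)
    return hist / hist.sum()

def autocorrelation(p):
    """Autocorrelation of the mean-removed distribution; a peak at lag d
    suggests a quasi-period of roughly d residues."""
    q = p - p.mean()
    ac = np.correlate(q, q, mode="full")[len(q) - 1:]
    return ac / ac[0]

# Toy proteome with a built-in period of ~40 aa, for illustration only
rng = np.random.default_rng(2)
lengths = np.concatenate(
    [rng.normal(120 + 40 * i, 5, 300).astype(int) for i in range(5)]
)
p = length_distribution(lengths, max_len=400)
print(int(np.argmax(autocorrelation(p)[10:200])) + 10)  # dominant lag (~40)
```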
BACOM is a statistically principled and unsupervised method that detects copy number deletion types (homozygous versus heterozygous), estimates the normal cell fraction, and recovers cancer-specific copy number profiles, using allele-specific copy number signals. In a subsequent analysis of the TCGA ovarian cancer dataset, the average normal cell fraction estimated by BACOM was found to be higher than expected. In this letter, we first discuss the advantages of BACOM in relation to alternative approaches. Then, we show that this elevated estimate of the normal cell fraction is the combined result of inaccurate signal modeling and normalization. Lastly, we describe an allele-specific signal modeling and normalization scheme that can enhance BACOM applications in many biological contexts. An open-source MATLAB program implementing our extended method is publicly available.
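The core mixture idea can be sketched as follows: for a deletion segment, the observed copy number is $\mathrm{CN}_{obs} = 2\alpha + (1-\alpha)\,\mathrm{CN}_{tumor}$, where $\alpha$ is the normal cell fraction and $\mathrm{CN}_{tumor}$ is 0 (homozygous deletion) or 1 (heterozygous deletion). Solving for $\alpha$ gives a point estimate. This is an illustrative simplification of the mixture idea only; BACOM's actual model is statistical and allele-specific.

```python
def normal_fraction(observed_cn, tumor_cn):
    """Solve observed_cn = 2*alpha + (1 - alpha)*tumor_cn for alpha.
    tumor_cn = 0 for a homozygous deletion, 1 for a heterozygous one.
    Illustrative simplification, not BACOM's full statistical model."""
    return (observed_cn - tumor_cn) / (2.0 - tumor_cn)

print(normal_fraction(1.4, 1))  # heterozygous deletion segment -> alpha = 0.4
print(normal_fraction(0.8, 0))  # homozygous deletion segment  -> alpha = 0.4
```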
The availability of genomic data is often essential to progress in biomedical research, personalized medicine, drug development, etc. However, its extreme sensitivity makes it problematic, if not outright impossible, to publish or share it. As a result, several initiatives have been launched to experiment with synthetic genomic data, e.g., using generative models to learn the underlying distribution of the real data and generate artificial datasets that preserve its salient characteristics without exposing it. This paper provides the first evaluation of the utility and the privacy protection of six state-of-the-art models for generating synthetic genomic data. We assess the performance of the synthetic data on several common tasks, such as allele population statistics and linkage disequilibrium. We then measure privacy through the lens of membership inference attacks, i.e., inferring whether a record was part of the training data. Our experiments show that no single approach to generate synthetic genomic data yields both high utility and strong privacy across the board. Also, the size and nature of the training dataset matter. Moreover, while some combinations of datasets and models produce synthetic data with distributions close to the real data, there often are target data points that are vulnerable to membership inference. Looking forward, our techniques can be used by practitioners to assess the risks of deploying synthetic genomic data in the wild and serve as a benchmark for future work.
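As an illustration of the membership inference threat model, a simple distance-based attack flags a target record as a training-set member when a synthetic record lies unusually close to it. This generic sketch (Hamming distance over 0/1/2 genotype encodings, with an arbitrary threshold) is an assumption for illustration; the attacks evaluated in the paper may differ in detail.

```python
import numpy as np

def nearest_hamming(target, synthetic):
    """Hamming distance from a target genotype vector (0/1/2 per SNP)
    to its nearest synthetic record."""
    return int(np.min(np.sum(synthetic != target, axis=1)))

def infer_membership(target, synthetic, threshold):
    """Flag the target as a likely training member if some synthetic
    record is unusually close to it (generic attack sketch)."""
    return nearest_hamming(target, synthetic) <= threshold

rng = np.random.default_rng(1)
synthetic = rng.integers(0, 3, size=(1000, 200))  # 1000 records x 200 SNPs
member = synthetic[0].copy()     # pretend the generator leaked this record
non_member = rng.integers(0, 3, size=200)
print(infer_membership(member, synthetic, threshold=5))      # True
print(infer_membership(non_member, synthetic, threshold=5))  # almost surely False
```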