No Arabic abstract
Motivation : Molecular signatures for diagnosis or prognosis estimated from large-scale gene expression data often lack robustness and stability, rendering their biological interpretation challenging. Increasing the signatures interpretability and stability across perturbations of a given dataset and, if possible, across datasets, is urgently needed to ease the discovery of important biological processes and, eventually, new drug targets. Results : We propose a new method to construct signatures with increased stability and easier interpretability. The method uses a gene network as side interpretation and enforces a large connectivity among the genes in the signature, leading to signatures typically made of genes clustered in a few subnetworks. It combines the recently proposed graph Lasso procedure with a stability selection procedure. We evaluate its relevance for the estimation of a prognostic signature in breast cancer, and highlight in particular the increase in interpretability and stability of the signature.
Tissue heterogeneity is a major confounding factor in studying individual populations that cannot be resolved directly by global profiling. Experimental solutions to mitigate tissue heterogeneity are expensive, time consuming, inapplicable to existing data, and may alter the original gene expression patterns. Here we ask whether it is possible to deconvolute two-source mixed expressions (estimating both proportions and cell-specific profiles) from two or more heterogeneous samples without requiring any prior knowledge. Supported by a well-grounded mathematical framework, we argue that both constituent proportions and cell-specific expressions can be estimated in a completely unsupervised mode when cell-specific marker genes exist, which do not have to be known a priori, for each of constituent cell types. We demonstrate the performance of unsupervised deconvolution on both simulation and real gene expression data, together with perspective discussions.
Graph theoretical analyses of nervous systems usually omit the aspect of connection polarity, due to data insufficiency. The chemical synapse network of Caenorhabditis elegans is a well-reconstructed directed network, but the signs of its connections are yet to be elucidated. Here, we present the gene expression-based sign prediction of the ionotropic chemical synapse connectome of C. elegans (3,638 connections and 20,589 synapses total), incorporating available presynaptic neurotransmitter and postsynaptic receptor gene expression data for three major neurotransmitter systems. We made predictions for more than two-thirds of these chemical synapses and observed an excitatory-inhibitory (E:I) ratio close to 4:1 which was found similar to that observed in many real-world networks. Our open source tool (http://EleganSign.linkgroup.hu) is simple but efficient in predicting polarities by integrating neuronal connectome and gene expression data.
Aggregating transcriptomics data across hospitals can increase sensitivity and robustness of differential expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, if class labels are inhomogeneously distributed between cohorts, their accuracy may drop. Flimma (https://exbio.wzw.tum.de/flimma/) addresses this issue by implementing the state-of-the-art workflow limma voom in a privacy-preserving manner, i.e. patient data never leaves its source site. Flimma results are identical to those generated by limma voom on combined datasets even in imbalanced scenarios where meta-analysis approaches fail.
Motivation: Histone modifications are among the most important factors that control gene regulation. Computational methods that predict gene expression from histone modification signals are highly desirable for understanding their combinatorial effects in gene regulation. This knowledge can help in developing epigenetic drugs for diseases like cancer. Previous studies for quantifying the relationship between histone modifications and gene expression levels either failed to capture combinatorial effects or relied on multiple methods that separate predictions and combinatorial analysis. This paper develops a unified discriminative framework using a deep convolutional neural network to classify gene expression using histone modification data as input. Our system, called DeepChrome, allows automatic extraction of complex interactions among important features. To simultaneously visualize the combinatorial interactions among histone modifications, we propose a novel optimization-based technique that generates feature pattern maps from the learnt deep model. This provides an intuitive description of underlying epigenetic mechanisms that regulate genes. Results: We show that DeepChrome outperforms state-of-the-art models like Support Vector Machines and Random Forests for gene expression classification task on 56 different cell-types from REMC database. The output of our visualization technique not only validates the previous observations but also allows novel insights about combinatorial interactions among histone modification marks, some of which have recently been observed by experimental studies.
When dealing with large scale gene expression studies, observations are commonly contaminated by unwanted variation factors such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g., when the goal is to cluster the samples or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state of the art corrections.