
Gene expression modelling across multiple cell-lines with MapReduce

Published by David Budden
Publication date: 2015
Language: English





With the wealth of high-throughput sequencing data generated by recent large-scale consortia, predictive gene expression modelling has become an important tool for integrative analysis of transcriptomic and epigenetic data. However, sequencing data-sets are characteristically large, and previous modelling frameworks are typically inefficient and unable to leverage multi-core or distributed processing architectures. In this study, we detail an efficient and parallelised MapReduce implementation of gene expression modelling. We leverage the computational efficiency of this framework to provide an integrative analysis of over fifty histone modification data-sets across a variety of cancerous and non-cancerous cell-lines. Our results demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage-, tissue- and karyotype-invariant, and that models trained on matched epigenetic/transcriptomic data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity.
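The abstract does not say how the model is fitted, so the following is only a minimal sketch of how a regression-style expression model could be cast in map/reduce form. It assumes an ordinary least-squares model (an assumption, not the authors' stated method) and uses synthetic stand-in data; the names `map_chunk`, `reduce_stats` and `fit_mapreduce` are hypothetical. The idea is that each chunk of genes contributes additive sufficient statistics, which can be computed independently and then summed.

```python
from functools import reduce

import numpy as np


def map_chunk(chunk):
    """Map step: least-squares sufficient statistics for one chunk of genes."""
    X, y = chunk
    X1 = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    return X1.T @ X1, X1.T @ y


def reduce_stats(a, b):
    """Reduce step: sufficient statistics simply add across chunks."""
    return a[0] + b[0], a[1] + b[1]


def fit_mapreduce(chunks):
    """Combine per-chunk statistics and solve the normal equations."""
    xtx, xty = reduce(reduce_stats, map(map_chunk, chunks))
    return np.linalg.solve(xtx, xty)


# Synthetic stand-in data: 1000 "genes", 3 hypothetical epigenetic features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([0.5, 1.0, -2.0, 0.3])  # intercept followed by 3 weights
y = true_w[0] + X @ true_w[1:] + 0.01 * rng.normal(size=1000)

# Split the genes into four chunks, as separate workers might receive them.
chunks = [(X[i:i + 250], y[i:i + 250]) for i in range(0, 1000, 250)]
w = fit_mapreduce(chunks)
```

Because the map step touches each chunk independently, the built-in `map` could be swapped for a process pool or a cluster framework without changing the reduce step; the fitted weights match a single-machine fit on the full data up to floating-point error.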




Read also

Complex biological functions are carried out by the interaction of genes and proteins. Uncovering the gene regulation network behind a function is one of the central themes in biology. Typically, it involves extensive experiments of genetics, biochemistry and molecular biology. In this paper, we show that much of the inference task can be accomplished by a deep neural network (DNN), a form of machine learning or artificial intelligence. Specifically, the DNN learns from the dynamics of the gene expression. The learnt DNN behaves like an accurate simulator of the system, on which one can perform in-silico experiments to reveal the underlying gene network. We demonstrate the method with two examples: biochemical adaptation and the gap-gene patterning in fruit fly embryogenesis. In the first example, the DNN can successfully find the two basic network motifs for adaptation - the negative feedback and the incoherent feed-forward. In the second and much more complex example, the DNN can accurately predict behaviors of essentially all the mutants. Furthermore, the regulation network it uncovers is strikingly similar to the one inferred from experiments. In doing so, we develop methods for deciphering the gene regulation network hidden in the DNN black box. Our interpretable DNN approach should have broad applications in genotype-phenotype mapping.
In many situations, the gene expression signature is a unique marker of the biological state. We study the modification of the gene expression distribution function when the biological state of a system experiences a change. This change may be the result of a selective pressure, as in the Long Term Evolution Experiment with E. coli populations, or the progression to Alzheimer's disease in aged brains, or the progression from a normal tissue to the cancer state. The first two cases seem to belong to a class of transitions where the initial and final states are relatively close to each other, and the distribution function for the differential expressions is short ranged, with a tail of only a few dozen strongly varying genes. In the latter case, cancer, the initial and final states are far apart and separated by a low-fitness barrier. The distribution function shows a very heavy tail, with thousands of silenced and over-expressed genes. We characterize the biological states by means of their principal component representations, and the expression distribution functions by their maximal and minimal differential expression values and the exponents of the Pareto laws describing the tails.
Microarray techniques are widely used in gene expression analysis. These techniques are based on discovering submatrices of genes that share similar expression patterns across a set of experimental conditions under a coherence constraint. These submatrices are called biclusters, and the extraction process is called biclustering. In this paper we present a novel binary particle swarm optimization model for the gene expression biclustering problem. We apply the binary particle swarm optimization algorithm with a proposed measure, called Discretized Column-based Measure (DCM), as a novel cost function for evaluating biclusters, where biological relevance, MSR and the size of the bicluster are considered as evaluation metrics for our results. Results are compared to the existing algorithms and they show the validity of our proposed approach.
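The MSR named above as an evaluation metric is, in the biclustering literature, the mean squared residue of Cheng and Church. As a minimal sketch (the function name `mean_squared_residue` and the toy matrices are illustrative assumptions, and the DCM cost function is not reproduced because the abstract does not define it), it can be computed as follows.

```python
import numpy as np


def mean_squared_residue(sub):
    """Mean squared residue (Cheng and Church) of a bicluster submatrix.

    Rows are genes, columns are conditions. A lower value means the
    submatrix is closer to a perfectly additive (coherent) pattern.
    """
    sub = np.asarray(sub, dtype=float)
    row_means = sub.mean(axis=1, keepdims=True)
    col_means = sub.mean(axis=0, keepdims=True)
    overall_mean = sub.mean()
    residue = sub - row_means - col_means + overall_mean
    return float((residue ** 2).mean())


# A perfectly additive pattern: every row is the first row plus a constant.
additive = np.array([[1, 2, 3],
                     [2, 3, 4],
                     [5, 6, 7]])

# A pattern with no additive coherence.
incoherent = np.array([[1, 0],
                       [0, 1]])

msr_additive = mean_squared_residue(additive)
msr_incoherent = mean_squared_residue(incoherent)
```

The additive matrix has a residue of zero in every cell, so its MSR is zero, while the incoherent matrix has a residue of plus or minus 0.5 in every cell, giving an MSR of 0.25.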
The bulk of stochastic gene expression models in the literature do not have an explicit description of the age of a cell within a generation and hence they cannot capture events such as cell division and DNA replication. Instead, many models incorporate the cell cycle implicitly by assuming that dilution due to cell division can be described by an effective decay reaction with first-order kinetics. If it is further assumed that protein production occurs in bursts then the stationary protein distribution is a negative binomial. Here we seek to understand how accurate these implicit models are when compared with more detailed models of stochastic gene expression. We derive the exact stationary solution of the chemical master equation describing bursty protein dynamics, binomial partitioning at mitosis, age-dependent transcription dynamics including replication, and random interdivision times sampled from Erlang or more general distributions; the solution is different for single lineage and population snapshot settings. We show that protein distributions are well approximated by the solution of implicit models (a negative binomial) when the mean number of mRNAs produced per cycle is low and the cell cycle length variability is large. When these conditions are not met, the distributions are either almost bimodal or else display very flat regions near the mode and cannot be described by implicit models. We also show that for genes with low transcription rates, the size of protein noise has a strong dependence on the replication time; it is almost independent of cell cycle variability for lineage measurements and increases with cell cycle variability for population snapshot measurements. In contrast, for large transcription rates, the size of protein noise is independent of replication time and increases with cell cycle variability for both lineage and population measurements.
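For reference, the negative binomial of the implicit model can be written out in its standard form. The notation is assumed here and does not come from the abstract: $a$ is the mean number of bursts per effective protein lifetime, and $b$ is the mean burst size, with burst sizes taken to be geometrically distributed.

```latex
P(n) = \frac{\Gamma(a+n)}{\Gamma(a)\, n!}
       \left(\frac{b}{1+b}\right)^{n}
       \left(\frac{1}{1+b}\right)^{a},
\qquad n = 0, 1, 2, \dots

\langle n \rangle = a b,
\qquad
\frac{\sigma_n^{2}}{\langle n \rangle} = 1 + b .
```

Under these assumptions the Fano factor depends only on the burst size, which is one reason this form is a common baseline against which more detailed cell-cycle models are compared.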
Olga Zolotareva, 2020
Aggregating transcriptomics data across hospitals can increase sensitivity and robustness of differential expression analyses, yielding deeper clinical insights. As data exchange is often restricted by privacy legislation, meta-analyses are frequently employed to pool local results. However, if class labels are inhomogeneously distributed between cohorts, their accuracy may drop. Flimma (https://exbio.wzw.tum.de/flimma/) addresses this issue by implementing the state-of-the-art workflow limma voom in a privacy-preserving manner, i.e. patient data never leaves its source site. Flimma results are identical to those generated by limma voom on combined datasets even in imbalanced scenarios where meta-analysis approaches fail.