
Network modelling of topological domains using Hi-C data

Published by Y. X. Rachel Wang
Publication date: 2017
Research language: English





Chromosome conformation capture experiments such as Hi-C are used to map the three-dimensional spatial organization of genomes. One specific feature of the 3D organization is known as topologically associating domains (TADs), which are densely interacting, contiguous chromatin regions playing important roles in regulating gene expression. A few algorithms have been proposed to detect TADs. In particular, the structure of Hi-C data naturally inspires application of community detection methods. However, one of the drawbacks of community detection is that most methods take exchangeability of the nodes in the network for granted; whereas the nodes in this case, i.e. the positions on the chromosomes, are not exchangeable. We propose a network model for detecting TADs using Hi-C data that takes into account this non-exchangeability. In addition, our model explicitly makes use of cell-type specific CTCF binding sites as biological covariates and can be used to identify conserved TADs across multiple cell types. The model leads to a likelihood objective that can be efficiently optimized via relaxation. We also prove that when suitably initialized, this model finds the underlying TAD structure with high probability. Using simulated data, we show the advantages of our method and the caveats of popular community detection methods, such as spectral clustering, in this application. Applying our method to real Hi-C data, we demonstrate the domains identified have desirable epigenetic features and compare them across different cell types.
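The non-exchangeability point can be illustrated with a small sketch. This is not the network model proposed in the paper (it has no likelihood, no relaxation step, and no CTCF covariates); it is a simple dynamic program, assumed here purely for illustration, that partitions genomic bins into contiguous blocks so that within-block contacts exceed the average off-diagonal contact level. The function name `contiguous_domains` and the toy matrix are hypothetical.

```python
import numpy as np

def contiguous_domains(contact):
    """Split bins 0..n-1 into contiguous blocks (illustrative only).

    Each block is scored by the sum of (contact - mean off-diagonal contact)
    over all pairs of bins inside it; dynamic programming finds the
    partition into contiguous blocks with the largest total score.
    Assumes a symmetric matrix with at least two bins.
    """
    C = np.asarray(contact, dtype=float)
    n = C.shape[0]
    iu = np.triu_indices(n, k=1)
    mu = C[iu].mean()                      # baseline contact level
    D = np.zeros((n, n))
    D[iu] = C[iu] - mu                     # centred strict upper triangle

    # 2D prefix sums: P[i, j] = sum of D[:i, :j]
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = D.cumsum(axis=0).cumsum(axis=1)

    def block_score(i, j):
        # Sum of D over the square block covering bins i..j-1
        return P[j, j] - P[i, j] - P[j, i] + P[i, i]

    best = np.full(n + 1, -np.inf)         # best[j]: best score for bins 0..j-1
    best[0] = 0.0
    back = np.zeros(n + 1, dtype=int)      # back[j]: start of the last block
    for j in range(1, n + 1):
        for i in range(j):
            s = best[i] + block_score(i, j)
            if s > best[j]:
                best[j], back[j] = s, i

    # Trace back the block boundaries as half-open intervals (start, end)
    domains, j = [], n
    while j > 0:
        i = int(back[j])
        domains.append((i, j))
        j = i
    return domains[::-1]

# Toy contact map: three contiguous blocks of sizes 4, 3 and 5
sizes = [4, 3, 5]
labels = np.repeat(np.arange(len(sizes)), sizes)
toy = np.where(labels[:, None] == labels[None, :], 1.0, 0.1)
domains = contiguous_domains(toy)
print(domains)
```

Because the blocks are constrained to be intervals of consecutive bins, the order of positions along the chromosome is built into the search, which is exactly the information a label-permutation-invariant community detection method would ignore.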




Read also

Recent chromosome conformation capture experiments have led to the discovery of dense, contiguous, megabase-sized topological domains that are similar across cell types and conserved across species. These domains are strongly correlated with a number of chromatin markers and have since been included in a number of analyses. However, functionally relevant domains may exist at multiple length scales. We introduce a new and efficient algorithm that is able to capture persistent domains across various resolutions by adjusting a single scale parameter. The identified novel domains are substantially different from domains reported previously and are highly enriched for insulating factor CTCF binding and histone modifications at the boundaries.
In this paper we propose network methodology to infer prognostic cancer biomarkers based on the epigenetic pattern DNA methylation. Epigenetic processes such as DNA methylation reflect environmental risk factors, and are increasingly recognised for their fundamental role in diseases such as cancer. DNA methylation is a gene-regulatory pattern, and hence provides a means by which to assess genomic regulatory interactions. Network models are a natural way to represent and analyse groups of such interactions. The utility of network models also increases as the quantity of data and number of variables increase, making them increasingly relevant to large-scale genomic studies. We propose methodology to infer prognostic genomic networks from a DNA methylation-based measure of genomic interaction and association. We then show how to identify prognostic biomarkers from such networks, which we term 'network community oncomarkers'. We illustrate the power of our proposed methodology in the context of a large publicly available breast cancer dataset.
Sicheng Hao, Rui Wang, Yu Zhang, 2018
Alzheimer's disease is the most common cause of dementia. It is the fifth-leading cause of death among elderly people. With high genetic heritability (79%), finding disease-causal genes is a crucial step in finding treatments for AD. Following the International Genomics of Alzheimer's Project (IGAP), many disease-associated genes have been identified; however, we do not have enough knowledge about how those disease-associated genes affect gene expression and disease-related pathways. We integrated GWAS summary data from IGAP and five different expression-level datasets using the TWAS method and identified 15 disease-causal genes under strict multiple testing (alpha < 0.05), 4 of which are newly identified; we identified an additional 29 potential disease-causal genes under false discovery rate control (alpha < 0.05), 21 of which are newly identified. Many of the genes we identified are also associated with autoimmune disorders.
Recent technological breakthroughs in spatial molecular profiling have enabled the comprehensive molecular characterization of single cells while preserving spatial information. This provides new opportunities to delineate how cells from different origins form tissues with distinctive structures and functions. One immediate question in the analysis of spatial molecular profiling data is how to identify spatially variable genes. Most of the current methods build upon the geostatistical model with a Gaussian process that relies on selecting ad hoc kernels to account for spatial expression patterns. To overcome this potential challenge and capture more types of spatial patterns, we introduce a Bayesian approach to identify spatially variable genes via an Ising model. The key idea is to use the energy interaction parameter of the Ising model to characterize spatial expression patterns. We use auxiliary variable Markov chain Monte Carlo algorithms to sample from the posterior distribution with an intractable normalizing constant in the Ising model. Simulation results show that our energy-based modeling approach leads to higher accuracy in detecting spatially variable genes than kernel-based methods. Applying our method to two real spatial transcriptomics datasets, we discovered novel spatial patterns that shed light on the biological mechanisms. The proposed method presents a new perspective for analyzing spatial transcriptomics data.
When dealing with large-scale gene expression studies, observations are commonly contaminated by unwanted variation factors such as platforms or batches. Not taking this unwanted variation into account when analyzing the data can lead to spurious associations and to missing important signals. When the analysis is unsupervised, e.g., when the goal is to cluster the samples or to build a corrected version of the dataset - as opposed to the study of an observed factor of interest - taking unwanted variation into account can become a difficult task. The unwanted variation factors may be correlated with the unobserved factor of interest, so that correcting for the former can remove the latter if not done carefully. We show how negative control genes and replicate samples can be used to estimate unwanted variation in gene expression, and discuss how this information can be used to correct the expression data or build estimators for unsupervised problems. The proposed methods are then evaluated on three gene expression datasets. They generally manage to remove unwanted variation without losing the signal of interest and compare favorably to state-of-the-art corrections.