Do you want to publish a course? Click here

With the wealth of high-throughput sequencing data generated by recent large-scale consortia, predictive gene expression modelling has become an important tool for integrative analysis of transcriptomic and epigenetic data. However, sequencing data-sets are characteristically large, and previously modelling frameworks are typically inefficient and unable to leverage multi-core or distributed processing architectures. In this study, we detail an efficient and parallelised MapReduce implementation of gene expression modelling. We leverage the computational efficiency of this framework to provide an integrative analysis of over fifty histone modification data-sets across a variety of cancerous and non-cancerous cell-lines. Our results demonstrate that the genome-wide relationships between histone modifications and mRNA transcription are lineage, tissue and karyotype-invariant, and that models trained on matched epigenetic/transcriptomic data from non-cancerous cell-lines are able to predict cancerous expression with equivalent genome-wide fidelity.
120 - Ulisse Ferrari 2015
Maximum entropy models provide the least constrained probability distributions that reproduce statistical properties of experimental datasets. In this work we characterize the learning dynamics that maximizes the log-likelihood in the case of large but finite datasets. We first show how the steepest descent dynamics is not optimal as it is slowed down by the inhomogeneous curvature of the model parameters space. We then provide a way for rectifying this space which relies only on dataset properties and does not require large computational efforts. We conclude by solving the long-time limit of the parameters dynamics including the randomness generated by the systematic use of Gibbs sampling. In this stochastic framework, rather than converging to a fixed point, the dynamics reaches a stationary distribution, which for the rectified dynamics reproduces the posterior distribution of the parameters. We sum up all these insights in a rectified Data-Driven algorithm that is fast and by sampling from the parameters posterior avoids both under- and over-fitting along all the directions of the parameters space. Through the learning of pairwise Ising models from the recording of a large population of retina neurons, we show how our algorithm outperforms the steepest descent method.
The assembly of virus capsids from free coat proteins proceeds by a complicated cascade of association and dissociation steps, the great majority of which cannot be directly experimentally observed. This has made capsid assembly a rich field for computational models to attempt to fill the gaps in what is experimentally observable. Nonetheless, accurate simulation predictions depend on accurate models and there are substantial obstacles to model inference for such systems. Here, we describe progress in learning parameters for capsid assembly systems, particularly kinetic rate constants of coat-coat interactions, by computationally fitting simulations to experimental data. We previously developed an approach to learn rate parameters of coat-coat interactions by minimizing the deviation between real and simulated light scattering data monitoring bulk capsid assembly in vitro. This is a difficult data-fitting problem, however, because of the high computational cost of simulating assembly trajectories, the stochastic noise inherent to the models, and the limited and noisy data available for fitting. Here we show that a newer classes of methods, based on derivative-free optimization (DFO), can more quickly and precisely learn physical parameters from static light scattering data. We further explore how the advantages of the approaches might be affected by alternative data sources through simulation of a model of time-resolved mass spectrometry data, an alternative technology for monitoring bulk capsid assembly that can be expected to provide much richer data. The results show that advances in both the data and the algorithms can improve model inference, with rich data leading to high-quality fits for all methods, but DFO methods showing substantial advantages over less informative data sources better representative of the current experimental practice.
BioNetGen is an open-source software package for rule-based modeling of complex biochemical systems. Version 2.2 of the software introduces numerous new features for both model specification and simulation. Here, we report on these additions, discussing how they facilitate the construction, simulation, and analysis of larger and more complex models than previously possible.
The exploration of epidemic dynamics on dynamically evolving (adaptive) networks poses nontrivial challenges to the modeler, such as the determination of a small number of informative statistics of the detailed network state (that is, a few good observables) that usefully summarize the overall (macroscopic, systems level) behavior. Trying to obtain reduced, small size, accurate models in terms of these few statistical observables - that is, coarse-graining the full network epidemic model to a small but useful macroscopic one - is even more daunting. Here we describe a data-based approach to solving the first challenge: the detection of a few informative collective observables of the detailed epidemic dynamics. This will be accomplished through Diffusion Maps, a recently developed data-mining technique. We illustrate the approach through simulations of a simple mathematical model of epidemics on a network: a model known to exhibit complex temporal dynamics. We will discuss potential extensions of the approach, as well as possible shortcomings.
Biological processes involve a variety of spatial and temporal scales. A holistic understanding of many biological processes therefore requires multi-scale models which capture the relevant properties on all these scales. In this manuscript we review mathematical modelling approaches used to describe the individual spatial scales and how they are integrated into holistic models. We discuss the relation between spatial and temporal scales and the implication of that on multi-scale modelling. Based upon this overview over state-of-the-art modelling approaches, we formulate key challenges in mathematical and computational modelling of biological multi-scale and multi-physics processes. In particular, we considered the availability of analysis tools for multi-scale models and model-based multi-scale data integration. We provide a compact review of methods for model-based data integration and model-based hypothesis testing. Furthermore, novel approaches and recent trends are discussed, including computation time reduction using reduced order and surrogate models, which contribute to the solution of inference problems. We conclude the manuscript by providing a few ideas for the development of tailored multi-scale inference methods.
RNA-Seq and gene expression microarrays provide comprehensive profiles of gene activity, but lack of reproducibility has hindered their application. A key challenge in the data analysis is the normalization of gene expression levels, which is currently performed following the implicit assumption that most genes are not differentially expressed. Here, we present a mathematical approach to normalization that makes no assumption of this sort. We have found that variation in gene expression is much larger than currently believed, and that it can be measured with available assays. Our results also explain, at least partially, the reproducibility problems encountered in transcriptomics studies. We expect that this improvement in detection will help efforts to realize the full potential of gene expression profiling, especially in analyses of cellular processes involving complex modulations of gene expression.
Microbiome-based stratification of healthy individuals into compositional categories, referred to as community types, holds promise for drastically improving personalized medicine. Despite this potential, the existence of community types and the degree of their distinctness have been highly debated. Here we adopted a dynamic systems approach and found that heterogeneity in the interspecific interactions or the presence of strongly interacting species is sufficient to explain community types, independent of the topology of the underlying ecological network. By controlling the presence or absence of these strongly interacting species we can steer the microbial ecosystem to any desired community type. This open-loop control strategy still holds even when the community types are not distinct but appear as dense regions within a continuous gradient. This finding can be used to develop viable therapeutic strategies for shifting the microbial composition to a healthy configuration
The recent explosion of genomic data has underscored the need for interpretable and comprehensive analyses that can capture complex phylogenetic relationships within and across species. Recombination, reassortment and horizontal gene transfer constitute examples of pervasive biological phenomena that cannot be captured by tree-like representations. Starting from hundreds of genomes, we are interested in the reconstruction of potential evolutionary histories leading to the observed data. Ancestral recombination graphs represent potential histories that explicitly accommodate recombination and mutation events across orthologous genomes. However, they are computationally costly to reconstruct, usually being infeasible for more than few tens of genomes. Recently, Topological Data Analysis (TDA) methods have been proposed as robust and scalable methods that can capture the genetic scale and frequency of recombination. We build upon previous TDA developments for detecting and quantifying recombination, and present a novel framework that can be applied to hundreds of genomes and can be interpreted in terms of minimal histories of mutation and recombination events, quantifying the scales and identifying the genomic locations of recombinations. We implement this framework in a software package, called TARGet, and apply it to several examples, including small migration between different populations, human recombination, and horizontal evolution in finches inhabiting the Galapagos Islands.
Diverse classes of proteins function through large-scale conformational changes; sophisticated enhanced sampling methods have been proposed to generate these macromolecular transition paths. As such paths are curves in a high-dimensional space, they have been difficult to compare quantitatively, a prerequisite to, for instance, assess the quality of different sampling algorithms. The Path Similarity Analysis (PSA) approach alleviates these difficulties by utilizing the full information in 3N-dimensional trajectories in configuration space. PSA employs the Hausdorff or Frechet path metrics---adopted from computational geometry---enabling us to quantify path (dis)similarity, while the new concept of a Hausdorff-pair map permits the extraction of atomic-scale determinants responsible for path differences. Combined with clustering techniques, PSA facilitates the comparison of many paths, including collections of transition ensembles. We use the closed-to-open transition of the enzyme adenylate kinase (AdK)---a commonly used testbed for the assessment enhanced sampling algorithms---to examine multiple microsecond equilibrium molecular dynamics (MD) transitions of AdK in its substrate-free form alongside transition ensembles from the MD-based dynamic importance sampling (DIMS-MD) and targeted MD (TMD) methods, and a geometrical targeting algorithm (FRODA). A Hausdorff pairs analysis of these ensembles revealed, for instance, that differences in DIMS-MD and FRODA paths were mediated by a set of conserved salt bridges whose charge-charge interactions are fully modeled in DIMS-MD but not in FRODA. We also demonstrate how existing trajectory analysis methods relying on pre-defined collective variables, such as native contacts or geometric quantities, can be used synergistically with PSA, as well as the application of PSA to more complex systems such as membrane transporter proteins.
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا