No Arabic abstract
Equation learning methods present a promising tool to aid scientists in the modeling process for biological data. Previous equation learning studies have demonstrated that these methods can infer models from rich datasets, however, the performance of these methods in the presence of common challenges from biological data has not been thoroughly explored. We present an equation learning methodology comprised of data denoising, equation learning, model selection and post-processing steps that infers a dynamical systems model from noisy spatiotemporal data. The performance of this methodology is thoroughly investigated in the face of several common challenges presented by biological data, namely, sparse data sampling, large noise levels, and heterogeneity between datasets. We find that this methodology can accurately infer the correct underlying equation and predict unobserved system dynamics from a small number of time samples when the data is sampled over a time interval exhibiting both linear and nonlinear dynamics. Our findings suggest that equation learning methods can be used for model discovery and selection in many areas of biology when an informative dataset is used. We focus on glioblastoma multiforme modeling as a case study in this work to highlight how these results are informative for data-driven modeling-based tumor invasion predictions.
Comparative transcriptomics has gained increasing popularity in genomic research thanks to the development of high-throughput technologies including microarray and next-generation RNA sequencing that have generated numerous transcriptomic data. An important question is to understand the conservation and differentiation of biological processes in different species. We propose a testing-based method TROM (Transcriptome Overlap Measure) for comparing transcriptomes within or between different species, and provide a different perspective to interpret transcriptomic similarity in contrast to traditional correlation analyses. Specifically, the TROM method focuses on identifying associated genes that capture molecular characteristics of biological samples, and subsequently comparing the biological samples by testing the overlap of their associated genes. We use simulation and real data studies to demonstrate that TROM is more powerful in identifying similar transcriptomes and more robust to stochastic gene expression noise than Pearson and Spearman correlations. We apply TROM to compare the developmental stages of six Drosophila species, C. elegans, S. purpuratus, D. rerio and mouse liver, and find interesting correspondence patterns that imply conserved gene expression programs in the development of these species. The TROM method is available as an R package on CRAN (http://cran.r-project.org/) with manuals and source codes available at http://www.stat.ucla.edu/ jingyi.li/software-and-data/trom.html.
Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionary-related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has been also assessed in terms of their ability in predicting mutational effects and generating in silico functional sequences. Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA. As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.
Modeling biological rhythms helps understand the complex principles behind the physical and psychological abnormalities of human bodies, to plan life schedules, and avoid persisting fatigue and mood and sleep alterations due to the desynchronization of those rhythms. The first step in modeling biological rhythms is to identify their characteristics, such as cyclic periods, phase, and amplitude. However, human rhythms are susceptible to external events, which cause irregular fluctuations in waveforms and affect the characterization of each rhythm. In this paper, we present our exploratory work towards developing a computational framework for automated discovery and modeling of human rhythms. We first identify cyclic periods in time series data using three different methods and test their performance on both synthetic data and real fine-grained biological data. We observe consistent periods are detected by all three methods. We then model inner cycles within each period through identifying change points to observe fluctuations in biological data that may inform the impact of external events on human rhythms. The results provide initial insights into the design of a computational framework for discovering and modeling human rhythms.
It is basic question in biology and other fields to identify the char- acteristic properties that on one hand are shared by structures from a particular realm, like gene regulation, protein-protein interaction or neu- ral networks or foodwebs, and that on the other hand distinguish them from other structures. We introduce and apply a general method, based on the spectrum of the normalized graph Laplacian, that yields repre- sentations, the spectral plots, that allow us to find and visualize such properties systematically. We present such visualizations for a wide range of biological networks and compare them with those for networks derived from theoretical schemes. The differences that we find are quite striking and suggest that the search for universal properties of biological networks should be complemented by an understanding of more specific features of biological organization principles at different scales.
We investigate methods for learning partial differential equation (PDE) models from spatiotemporal data under biologically realistic levels and forms of noise. Recent progress in learning PDEs from data have used sparse regression to select candidate terms from a denoised set of data, including approximated partial derivatives. We analyze the performance in utilizing previous methods to denoise data for the task of discovering the governing system of partial differential equations (PDEs). We also develop a novel methodology that uses artificial neural networks (ANNs) to denoise data and approximate partial derivatives. We test the methodology on three PDE models for biological transport, i.e., the advection-diffusion, classical Fisher-KPP, and nonlinear Fisher-KPP equations. We show that the ANN methodology outperforms previous denoising methods, including finite differences and polynomial regression splines, in the ability to accurately approximate partial derivatives and learn the correct PDE model.