Research papers and master's and doctoral theses published by Claudia Solis-Lemus

CARlasso: An R package for the estimation of sparse microbial networks with predictors

116 - Yunyi Shen, Claudia Solis-Lemus 2021

Microbiome data analyses require statistical tools that can simultaneously decode microbes' reactions to the environment and interactions among microbes. We introduce CARlasso, the first user-friendly, open-source, and publicly available R package to fit a chain graph model for the inference of sparse microbial networks that represent both interactions among nodes and effects of a set of predictors. Unlike in standard regression approaches, the edges represent the correct conditional structure among responses and predictors, which allows the incorporation of prior knowledge from controlled experiments. In addition, CARlasso 1) enforces sparsity in the network via LASSO; 2) allows for an adaptive extension that applies different shrinkage to different edges; 3) is computationally inexpensive through an efficient Gibbs sampling algorithm, so it can handle small and big data equally well; 4) allows for continuous, binary, counting and compositional responses via proper hierarchical structure; and 5) has a similar syntax to lm for ease of use. The package also supports Bayesian graphical LASSO and several of its hierarchical models, as well as lower-level one-step sampling functions of the CAR-LASSO model for users to extend.

Statistics Applications

The Effect of the Prior and the Experimental Design on the Inference of the Precision Matrix in Gaussian Chain Graph Models

335 - Yunyi Shen, Claudia Solis-Lemus 2021

Here, we investigate whether (and how) experimental design could aid in the estimation of the precision matrix in a Gaussian chain graph model, especially the interplay between the design, the effect of the experiment and prior knowledge about the effect. We approximate the marginal posterior precision of the precision matrix via Laplace approximation under different priors: a flat prior, the conjugate prior Normal-Wishart, the unconfounded prior Normal-Matrix Generalized Inverse Gaussian (MGIG) and a general independent prior. We show that the approximated posterior precision is not a function of the design matrix for the cases of the Normal-Wishart and flat prior, but it is for the cases of the Normal-MGIG and the general independent prior. However, for the Normal-MGIG and the general independent prior, we find a sharp upper bound on the approximated posterior precision that does not involve the design matrix, which translates into a bound on the information that could be extracted from a given experiment. We confirm the theoretical findings via a simulation study comparing the Stein's loss difference between random versus no experiment (design matrix equal to zero). Our findings provide practical advice for domain scientists conducting experiments to decode the relationships between a multidimensional response and a set of predictors.

Methodology, Statistics Theory

Bayesian Conditional Auto-Regressive LASSO Models to Learn Sparse Microbial Networks with Predictors

154 - Yunyi Shen, Claudia Solis-Lemus 2020

Microbiome data analyses require statistical models that can simultaneously decode microbes' reactions to the environment and interactions among microbes. While a multiresponse linear regression model seems like a straightforward solution, we argue that treating it as a graphical model is flawed given that the regression coefficient matrix does not encode the conditional dependence structure between response and predictor nodes because it does not represent the adjacency matrix. This observation is especially important in biological settings when we have prior knowledge on the edges from specific experimental interventions that can only be properly encoded under a conditional dependence model. Here, we propose a chain graph model with two sets of nodes (predictors and responses) whose solution yields a graph with edges that indeed represent conditional dependence and thus agrees with the experimenter's intuition on the average behavior of nodes under treatment. The solution to our model is sparse via Bayesian LASSO and is also guaranteed to be the sparse solution to a Conditional Auto-Regressive (CAR) model. In addition, we propose an adaptive extension so that different shrinkage can be applied to different edges to incorporate edge-specific prior knowledge. Our model is computationally inexpensive through an efficient Gibbs sampling algorithm and can account for binary, counting, and compositional responses via appropriate hierarchical structure. We apply our model to a human gut and a soil microbial compositional dataset, and we highlight that CAR-LASSO can estimate biologically meaningful network structures in the data. The CAR-LASSO software is available as an R package at https://github.com/YunyiShen/CAR-LASSO.

Statistics Applications, Methodology

Towards a robust out-of-the-box neural network model for genomic data

166 - Zhaoyi Zhang, Songyang Cheng, Claudia Solis-Lemus 2020

The accurate prediction of biological features from genomic data is paramount for precision medicine, sustainable agriculture and climate change research. For decades, neural network models have been widely popular in fields like computer vision, astrophysics and targeted marketing given their prediction accuracy and their robust performance under big data settings. Yet neural network models have not made a successful transition into the medical and biological world due to the ubiquitous characteristics of biological data such as modest sample sizes, sparsity, and extreme heterogeneity. Here, we investigate the robustness, generalization potential and prediction accuracy of widely used convolutional neural network and natural language processing models with a variety of heterogeneous genomic datasets. While the prospect of a robust out-of-the-box neural network model is out of reach, we identify certain model characteristics that translate well across datasets and could serve as a baseline model for translational researchers.

Genomics

WI Fast Stats: a collection of web apps for the visualization and analysis of WI Fast Plants data

89 - Yizhou Liu, Claudia Solis-Lemus 2020

WI Fast Stats is the first and only dedicated tool tailored to the WI Fast Plants educational objectives. WI Fast Stats is an integrated animated web page with a collection of R-developed web apps that provide Data Visualization and Data Analysis tools for WI Fast Plants data. WI Fast Stats is a user-friendly, easy-to-use interface that will render Data Science accessible to K-16 teachers and students currently using WI Fast Plants lesson plans. Users do not need a strong programming or mathematical background to use WI Fast Stats, as the web apps are simple to use, well documented, and freely available.

Quantitative Biology

On the Identifiability of Phylogenetic Networks under a Pseudolikelihood model

134 - Claudia Solis-Lemus, Arrigo Coen, Cecile Ane 2020

The Tree of Life is the graphical structure that represents the evolutionary process from single-cell organisms at the origin of life to the vast biodiversity we see today. Reconstructing this tree from genomic sequences is challenging due to the variety of biological forces that shape the signal in the data, and many of those processes, like incomplete lineage sorting and hybridization, can produce confounding information. Here, we present the mathematical version of the identifiability proofs of phylogenetic networks under the pseudolikelihood model in SNaQ. We establish that the ability to detect different hybridization events depends on the number of nodes on the hybridization blob, with small blobs (corresponding to closely related species) being the hardest to detect. Our work focuses on level-1 networks, but raises attention to the importance of identifiability studies on phylogenetic inference methods for broader classes of networks.

Populations and Evolution, Statistics Theory
