Microbiome data analyses require statistical models that can simultaneously decode microbes' reactions to the environment and interactions among microbes. While a multiresponse linear regression model seems like a straightforward solution, we argue that treating it as a graphical model is flawed: the regression coefficient matrix is not the adjacency matrix, so it does not encode the conditional dependence structure between response and predictor nodes. This observation is especially important in biological settings, where prior knowledge about edges from specific experimental interventions can only be properly encoded under a conditional dependence model. Here, we propose a chain graph model with two sets of nodes (predictors and responses) whose solution yields a graph with edges that do represent conditional dependence and thus agrees with the experimenter's intuition about the average behavior of nodes under treatment. The solution to our model is sparse via Bayesian LASSO and is also guaranteed to be the sparse solution to a Conditional Auto-Regressive (CAR) model. In addition, we propose an adaptive extension so that different shrinkage can be applied to different edges to incorporate edge-specific prior knowledge. Our model is computationally inexpensive thanks to an efficient Gibbs sampling algorithm and can account for binary, count, and compositional responses via an appropriate hierarchical structure. We apply our model to human gut and soil microbial compositional datasets and show that CAR-LASSO can estimate biologically meaningful network structures in the data. The CAR-LASSO software is available as an R package at https://github.com/YunyiShen/CAR-LASSO.
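The gap between regression coefficients and conditional edges can be illustrated with a small numerical sketch. It assumes a chain graph parameterization in which the responses given the predictors are Gaussian with mean Omega^{-1} B^T x and covariance Omega^{-1}; this formula, the function name, and the toy values of B and Omega are assumptions for illustration and are not stated in the abstract. Under that assumption, the marginal regression coefficient matrix is B Omega^{-1}, which can be nonzero where B is zero.

```python
import numpy as np


def marginal_regression_coefficients(B, Omega):
    """Regression coefficients implied by the assumed chain graph model.

    B     : (p, k) conditional predictor-to-response edge weights.
    Omega : (k, k) precision matrix among the responses.
    Under the assumed model E[Y | x] = Omega^{-1} B^T x, so the
    coefficient matrix of a multiresponse regression is B @ Omega^{-1}.
    """
    return B @ np.linalg.inv(Omega)


# Toy example (hypothetical values): one predictor, two responses.
# The predictor has a conditional edge to response 0 only.
B = np.array([[1.0, 0.0]])
# The two responses are conditionally dependent on each other.
Omega = np.array([[1.0, -0.5],
                  [-0.5, 1.0]])

coef = marginal_regression_coefficients(B, Omega)
print("conditional edges B:     ", B)
print("regression coefficients: ", coef)
```

In this toy case the regression coefficient for the second response is nonzero even though the predictor has no conditional edge to it, because the effect propagates through the dependence between the two responses.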
Microbiome data analyses require statistical tools that can simultaneously decode microbes' reactions to the environment and interactions among microbes. We introduce CARlasso, the first user-friendly, open-source, and publicly available R package to fit a chain graph model for the inference of sparse microbial networks that represent both interactions among nodes and effects of a set of predictors. Unlike in standard regression approaches, the edges represent the correct conditional structure among responses and predictors, which allows the incorporation of prior knowledge from controlled experiments. In addition, CARlasso 1) enforces sparsity in the network via LASSO; 2) allows for an adaptive extension that applies different shrinkage to different edges; 3) is computationally inexpensive thanks to an efficient Gibbs sampling algorithm, so it can handle both small and large datasets; 4) allows for continuous, binary, count, and compositional responses via a proper hierarchical structure; and 5) has a syntax similar to lm for ease of use. The package also supports Bayesian graphical LASSO and several of its hierarchical models, as well as lower-level one-step sampling functions of the CAR-LASSO model for users to extend.
We propose Robust Lasso-Zero, an extension of the Lasso-Zero methodology [Descloux and Sardy, 2018], initially introduced for sparse linear models, to the sparse corruptions problem. We give theoretical guarantees on the sign recovery of the parameters for a slightly simplified version of the estimator, called Thresholded Justice Pursuit. The use of Robust Lasso-Zero is showcased for variable selection with missing values in the covariates. The method requires neither a model for the covariates nor an estimate of their covariance matrix or of the noise variance, and it has the great advantage of handling missing-not-at-random values without specifying a parametric model. Numerical experiments and a medical application underline the relevance of Robust Lasso-Zero in such a context, where few competitors are available. The method is easy to use and implemented in the R library lass0.
Bayesian optimization (BO) is a powerful approach for optimizing black-box, expensive-to-evaluate functions. To enable a flexible trade-off between cost and accuracy, many applications allow the function to be evaluated at different fidelities. To reduce the optimization cost while maximizing the benefit-cost ratio, we propose Batch Multi-fidelity Bayesian Optimization with Deep Auto-Regressive Networks (BMBO-DARN). We use a set of Bayesian neural networks to construct a fully auto-regressive model, which is expressive enough to capture strong yet complex relationships across all the fidelities, so as to improve the surrogate learning and optimization performance. Furthermore, to enhance the quality and diversity of queries, we develop a simple yet efficient batch querying method that avoids any combinatorial search over the fidelities. We propose a batch acquisition function based on the Max-value Entropy Search (MES) principle, which penalizes highly correlated queries and encourages diversity. We use posterior samples and moment matching to compute the acquisition function efficiently and conduct alternating optimization over every fidelity-input pair, which guarantees an improvement at each step. We demonstrate the advantage of our approach on four real-world hyperparameter optimization applications.
Estimation of the long-term health effects of air pollution is a challenging task, especially when modelling small-area disease incidence data in an ecological study design. The challenge comes from the unobserved underlying spatial correlation structure in these data, which is accounted for using random effects modelled by a globally smooth conditional autoregressive model. These smooth random effects confound the effects of air pollution, which are also globally smooth. To avoid this collinearity, a Bayesian localised conditional autoregressive model is developed for the random effects. This localised model is spatially flexible, in the sense that it can model step changes in the random effects surface while still capturing areas of spatial smoothness in the study region. This methodological development allows us to improve the estimation of the covariate effects compared to traditional conditional autoregressive models. These results are established using a simulation study, and are then illustrated with our motivating study on air pollution and respiratory ill health in Greater Glasgow, Scotland in 2010. The model shows substantial health effects of particulate matter air pollution and income deprivation, effects that have been consistently attenuated by the currently available globally smooth models.
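To make the notion of a globally smooth conditional autoregressive prior concrete, the following is a minimal sketch of one common formulation, the proper CAR prior with precision matrix tau * (D - rho * W). This particular form, the function names, and the four-area toy map are assumptions for illustration; the abstract does not specify the prior, and the sketch does not implement the localised model proposed there.

```python
import numpy as np


def proper_car_precision(W, rho, tau):
    """Precision matrix tau * (D - rho * W) of a proper CAR prior.

    W   : symmetric 0/1 neighbourhood matrix of the areas.
    rho : spatial dependence parameter, assumed to lie in [0, 1).
    tau : overall precision, assumed positive.
    """
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))  # number of neighbours on the diagonal
    return tau * (D - rho * W)


def conditional_mean(Q, phi, i):
    """E[phi_i | all other phi] for a zero-mean Gaussian with precision Q."""
    others = np.arange(len(phi)) != i
    return -(Q[i, others] @ phi[others]) / Q[i, i]


# Hypothetical toy map: four areas on a line, 0 - 1 - 2 - 3.
W = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]])
Q = proper_car_precision(W, rho=0.9, tau=1.0)

phi = np.array([1.0, 0.0, 3.0, 2.0])
m1 = conditional_mean(Q, phi, 1)
print("conditional mean of area 1 given its neighbours:", m1)
```

Under this prior each random effect is pulled toward the average of its neighbours everywhere on the map with the same strength, which is the sense in which the surface is globally smooth.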
Studying the neurological, genetic and evolutionary basis of human vocal communication mechanisms is an important field of neuroscience. In the absence of high-quality data on humans, mouse vocalization experiments in laboratory settings have proven useful in providing valuable insights into mammalian vocal development and evolution, especially the impact of certain genetic mutations. Data sets from mouse vocalization experiments usually consist of categorical syllable sequences along with continuous inter-syllable interval times for mice of different genotypes vocalizing under various contexts. Few statistical models have considered inference for both transition probabilities and inter-state intervals. The latter is of particular importance, as increased inter-state intervals can indicate possible vocal impairment. In this paper, we propose a class of novel Markov renewal mixed models that capture the stochastic dynamics of both state transitions and inter-state interval times. Specifically, we model the transition dynamics and the inter-state intervals using Dirichlet and gamma mixtures, respectively, allowing the mixture probabilities in both cases to vary flexibly with fixed covariate effects as well as random individual-specific effects. We apply our model to analyze the impact of a mutation in the Foxp2 gene on mouse vocal behavior. We find that genotypes and social contexts significantly affect the inter-state interval times but that, compared to previous analyses, the influences of genotype and social context on the syllable transition dynamics are weaker.
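The data structure described above, a categorical syllable sequence paired with continuous inter-syllable intervals, can be sketched with a deliberately simplified Markov renewal simulation. The sketch assumes a fixed transition matrix and a single gamma distribution per current syllable; it omits the Dirichlet and gamma mixtures and the covariate and random effects of the proposed model. The function name, parameter values, and the choice to let the interval depend only on the current state are hypothetical.

```python
import random


def simulate_markov_renewal(P, gamma_params, n_steps, start=0, seed=0):
    """Simulate (state, next_state, interval) triples.

    P            : row-stochastic transition matrix as a list of lists.
    gamma_params : one (shape, scale) pair per state for the waiting time
                   before the next syllable (a simplifying assumption).
    """
    rng = random.Random(seed)
    states = list(range(len(P)))
    state = start
    path = []
    for _ in range(n_steps):
        # Draw the next syllable from the current row of P.
        nxt = rng.choices(states, weights=P[state], k=1)[0]
        # Draw the inter-syllable interval from the current state's gamma.
        shape, scale = gamma_params[state]
        gap = rng.gammavariate(shape, scale)
        path.append((state, nxt, gap))
        state = nxt
    return path


# Hypothetical three-syllable example.
P = [[0.6, 0.3, 0.1],
     [0.2, 0.5, 0.3],
     [0.3, 0.3, 0.4]]
gamma_params = [(2.0, 0.05), (2.0, 0.10), (3.0, 0.10)]

path = simulate_markov_renewal(P, gamma_params, n_steps=10, seed=1)
for state, nxt, gap in path:
    print(f"{state} -> {nxt} after {gap:.3f} s")
```

Each simulated record carries both pieces of information the abstract says are rarely modelled together: which syllable follows which, and how long the pause between them lasts.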