This paper arises from collaborative research whose aim was to model clinical assessments of upper limb function after stroke using 3D kinematic data. We present a new nonlinear mixed-effects scalar-on-function regression model with a Gaussian process prior, focusing on variable selection from a large number of candidates that include both scalar and functional variables. We develop a novel variable selection algorithm, functional least angle regression (fLARS). Because they are essential to this algorithm, we study the representation of functional variables by different methods and the correlation between a scalar variable and a group of mixed scalar and functional variables. We also propose two new stopping rules for practical use. The algorithm can perform variable selection when the number of variables exceeds the sample size, and it is efficient and accurate for both variable selection and parameter estimation. Our comprehensive simulation study showed that the method is superior to other existing variable selection methods. When the algorithm was applied to the 3D kinematic movement data, the use of the nonlinear random-effects model and the functional variables significantly improved the prediction accuracy for the clinical assessment.
The ability to generate samples of the random effects from their conditional distributions is fundamental for inference in mixed-effects models. Random walk Metropolis is widely used for such sampling, but it can converge slowly for medium-dimensional problems or when the joint structure of the target distributions is complex. We propose a Metropolis-Hastings (MH) algorithm based on a multidimensional Gaussian proposal that takes into account the joint conditional distribution of the random effects and requires no tuning, in contrast with more sophisticated samplers such as the Metropolis Adjusted Langevin Algorithm or the No-U-Turn Sampler, which involve costly tuning runs or intensive computation. The proposal distribution is obtained automatically from a Laplace approximation of the original model. We show that this approximation is equivalent to linearizing the model in the case of continuous data. Numerical experiments on real data highlight the very good performance of the proposed method for continuous data models.
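As a rough illustration of the idea (not the authors' implementation), the sketch below runs an independence Metropolis-Hastings sampler for a single scalar random effect, using a Gaussian proposal centred at the conditional mode with variance taken from a Laplace approximation. The structural model f(eta) = exp(eta), the values of SIGMA and OMEGA, and all function names are assumptions made for the example.

```python
import math
import random

SIGMA = 0.5  # residual standard deviation (assumed value)
OMEGA = 1.0  # random-effect standard deviation (assumed value)


def log_post(eta, y):
    """Unnormalised log conditional density of eta given observations y."""
    m = math.exp(eta)  # hypothetical structural model f(eta) = exp(eta)
    loglik = -sum((yj - m) ** 2 for yj in y) / (2 * SIGMA ** 2)
    logprior = -eta ** 2 / (2 * OMEGA ** 2)
    return loglik + logprior


def hessian(eta, y):
    """Second derivative of log_post with respect to eta."""
    m = math.exp(eta)
    return sum(yj * m - 2 * m * m for yj in y) / SIGMA ** 2 - 1 / OMEGA ** 2


def laplace_approx(y, iters=100):
    """Undamped Newton search for the mode; returns (mode, variance)."""
    eta = math.log(sum(y) / len(y))  # start near the mode; assumes mean(y) > 0
    for _ in range(iters):
        m = math.exp(eta)
        grad = sum((yj - m) * m for yj in y) / SIGMA ** 2 - eta / OMEGA ** 2
        step = grad / hessian(eta, y)
        eta -= step
        if abs(step) < 1e-10:
            break
    return eta, -1.0 / hessian(eta, y)


def log_q(eta, mode, var):
    """Unnormalised log density of the Gaussian proposal."""
    return -(eta - mode) ** 2 / (2 * var)


def independence_mh(y, n_iter=5000, seed=0):
    """Independence MH with the Laplace Gaussian as a fixed proposal."""
    rng = random.Random(seed)
    mode, var = laplace_approx(y)
    sd = math.sqrt(var)
    eta, samples, accepted = mode, [], 0
    for _ in range(n_iter):
        prop = rng.gauss(mode, sd)
        # Target ratio times reverse/forward proposal ratio, on the log scale.
        log_ratio = (log_post(prop, y) - log_post(eta, y)
                     + log_q(eta, mode, var) - log_q(prop, mode, var))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            eta = prop
            accepted += 1
        samples.append(eta)
    return samples, accepted / n_iter
```

Calling `independence_mh([2.1, 1.8, 2.3, 1.9])` returns the chain and its acceptance rate. Because the proposal does not depend on the current state, there is no step size to tune, which is the property the abstract emphasises.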
Microorganisms play critical roles in human health and disease. It is well known that microbes live in diverse communities in which they interact synergistically or antagonistically. Multivariate statistical models are therefore preferred for estimating microbial associations with clinical covariates. Such models allow one to estimate and exploit complex interdependencies among multiple taxa, yielding more powerful tests of exposure or treatment effects than taxon-specific univariate analyses. In addition, the analysis of microbial count data requires special attention because the data commonly exhibit zero inflation. To meet these needs, we developed a Bayesian variable selection model for multivariate count data with excess zeros that incorporates information on the covariance structure of the outcomes (counts for multiple taxa) while estimating associations with the mean levels of these outcomes. Although a great deal of effort has gone into zero-inflated models for longitudinal data, little attention has been given to high-dimensional multivariate zero-inflated data modeled via a general correlation structure. Through simulation, we compared the performance of the proposed method to that of existing univariate approaches for both the binary and count parts of the model. When outcomes were correlated, the proposed variable selection method maintained type I error while boosting the ability to identify true associations in the binary component of the model. For the count part of the model, in some scenarios the univariate method had higher power than the multivariate approach, but this higher power came at the cost of a highly inflated false discovery rate not observed with the proposed multivariate method. We applied the approach to oral microbiome data from the Pediatric HIV/AIDS Cohort Oral Health Study and identified five species (of 44) associated with HIV infection.
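To make the "binary part" and "count part" of a zero-inflated model concrete, here is a minimal sketch of a univariate zero-inflated Poisson log-probability. The abstract does not state the count distribution, so the Poisson choice, the function name, and the parameter names (`pi` for the structural-zero probability, `lam` for the count mean) are assumptions. The proposed multivariate model is not reproduced here.

```python
import math


def zip_logpmf(y, pi, lam):
    """Log-probability of count y under a zero-inflated Poisson.

    pi  : probability of a structural zero (binary part), 0 <= pi < 1
    lam : Poisson mean of the count part, lam > 0
    """
    if not (0.0 <= pi < 1.0) or lam <= 0.0:
        raise ValueError("need 0 <= pi < 1 and lam > 0")
    if y == 0:
        # A zero arises either structurally or from the Poisson component.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    # Positive counts can only come from the Poisson component.
    return math.log(1.0 - pi) - lam + y * math.log(lam) - math.lgamma(y + 1)
```

With `pi = 0.3` and `lam = 2.0`, the probability of a zero is `0.3 + 0.7 * exp(-2)`, which is larger than the plain Poisson value `exp(-2)`. That excess is what "zero inflation" refers to.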
Accelerated degradation tests are used to provide accurate estimates of the lifetime properties of highly reliable products within a relatively short testing time. There, data from tests at high levels of stress (e.g., temperature, voltage, or vibration) are extrapolated, through a physically meaningful model, to obtain estimates of lifetime quantiles under normal use conditions. In this work, we consider repeated measures accelerated degradation tests with multiple stress variables, where the degradation paths are assumed to follow a linear mixed effects model, a common choice when repeated measures are taken. We derive optimal experimental designs that minimize the asymptotic variance of the estimated median failure time under normal use conditions, both when the time points for measurements are fixed in advance and when they are also to be optimized.
Instability in model selection is a major concern with data sets containing a large number of covariates. This paper deals with variable selection methodology for high-dimensional problems where the response variable can be right censored. We focus on new stable variable selection methods based on the bootstrap for two methodologies: the Cox proportional hazards model and survival trees. For the Cox model, we investigate bootstrapping applied to two variable selection techniques: the stepwise algorithm based on the AIC criterion and the L1 penalization of the Lasso. For survival trees, we review two methodologies: bootstrap node-level stabilization and random survival forests. We apply these approaches to two real data sets. We compare the methods on the prediction error rate based on the Harrell concordance index and on the relevance of the interpretation of the corresponding selected models. The aim is to find a compromise between good prediction performance and ease of interpretation for clinicians. Results suggest that, with a small number of individuals, bootstrapping adapted to L1 penalization in the Cox model or bootstrap node-level stabilization in survival trees gives a good alternative to the random survival forest methodology, which is known to give the smallest prediction error rate but is difficult for non-statisticians to interpret. From a clinical perspective, the complementarity between the methods based on the Cox model and those based on survival trees would make it possible to build reliable models that are easy for the clinician to interpret.
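For readers unfamiliar with the evaluation metric, the following is a minimal sketch of Harrell's concordance index for right-censored data. The function name and argument layout are hypothetical, and pairs with tied survival times are simply skipped, which is one of several conventions in use. The prediction error rate mentioned above would then be one minus this value.

```python
def harrell_c_index(times, events, risk):
    """Harrell's C for right-censored data.

    times  : observed follow-up times
    events : 1 if the event was observed, 0 if censored
    risk   : predicted risk scores (higher means earlier expected failure)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when subject i is observed to fail
            # strictly before subject j's follow-up time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable
```

A value of 1.0 means every comparable pair is ordered correctly, 0.5 corresponds to random ordering, and 0.0 means every pair is ordered the wrong way round.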
Mass cytometry technology enables the simultaneous measurement of over 40 proteins on single cells. This has helped immunologists to increase their understanding of the heterogeneity, complexity, and lineage relationships of white blood cells. Current statistical methods often collapse the rich single-cell data into summary statistics before proceeding with downstream analysis, discarding the information in these multivariate datasets. In this article, our aim is to demonstrate the use of statistical analyses on the raw, uncompressed data, thereby improving replicability and exposing multivariate patterns and their associated uncertainty profiles. We show that multivariate generative models are a valid alternative to univariate hypothesis testing. We propose two models: a multivariate Poisson log-normal mixed model and a logistic linear mixed model. We show that these models are complementary and that either model can account for different confounders. We use Hamiltonian Monte Carlo to provide Bayesian uncertainty quantification. Applied to a recent pregnancy study, our models successfully reproduce key findings while quantifying increased overall protein-to-protein correlations between the first and third trimesters.