No Arabic abstract
Metagenomics sequencing is routinely applied to quantify bacterial abundances in microbiome studies, where the bacterial composition is estimated based on the sequencing read counts. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which tend to result in inaccurate estimates of bacterial abundance and diversity. This paper takes a multi-sample approach to the estimation of bacterial abundances in order to borrow information across samples and across species. Empirical results from real data sets suggest that the composition matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient and Euclidean projection onto simplex space is developed. The theoretical upper bounds and the minimax lower bounds of the estimation errors, measured by the Kullback-Leibler divergence and the Frobenius norm, are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to an analysis of a human gut microbiome dataset.
In spatial statistics, it is often assumed that the spatial field of interest is stationary and its covariance has a simple parametric form, but these assumptions are not appropriate in many applications. Given replicate observations of a Gaussian spatial field, we propose nonstationary and nonparametric Bayesian inference on the spatial dependence. Instead of estimating the quadratic (in the number of spatial locations) entries of the covariance matrix, the idea is to infer a near-linear number of nonzero entries in a sparse Cholesky factor of the precision matrix. Our prior assumptions are motivated by recent results on the exponential decay of the entries of this Cholesky factor for Matern-type covariances under a specific ordering scheme. Our methods are highly scalable and parallelizable. We conduct numerical comparisons and apply our methodology to climate-model output, enabling statistical emulation of an expensive physical model.
In many health domains such as substance-use, outcomes are often counts with an excessive number of zeros (EZ) - count data having zero counts at a rate significantly higher than that expected of a standard count distribution (e.g., Poisson). However, an important gap exists in sample size estimation methodology for planning sequential multiple assignment randomized trials (SMARTs) for comparing dynamic treatment regimens (DTRs) using longitudinal count data. DTRs, also known as treatment algorithms or adaptive interventions, mimic the individualized and evolving nature of patient care through the specification of decision rules guiding the type, timing and modality of delivery, and dosage of treatments to address the unique and changing needs of individuals. To close this gap, we develop a Monte Carlo-based approach to sample size estimation. A SMART for engaging alcohol and cocaine-dependent patients in treatment is used as motivation.
In microbiome studies, one of the ways of studying bacterial abundances is to estimate bacterial composition based on the sequencing read counts. Various transformations are then applied to such compositional data for downstream statistical analysis, among which the centered log-ratio (clr) transformation is most commonly used. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which makes clr transformation infeasible. This paper proposes a multi-sample approach to estimation of the clr matrix directly in order to borrow information across samples and across species. Empirical results from real datasets suggest that the clr matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient is developed. Theoretical upper bounds of the estimation errors and of its corresponding singular subspace errors are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is analyzed on Gut Microbiome dataset and the American Gut project.
We consider the problem of estimating parameters of stochastic differential equations (SDEs) with discrete-time observations that are either completely or partially observed. The transition density between two observations is generally unknown. We propose an importance sampling approach with an auxiliary parameter when the transition density is unknown. We embed the auxiliary importance sampler in a penalized maximum likelihood framework which produces more accurate and computationally efficient parameter estimates. Simulation studies in three different models illustrate promising improvements of the new penalized simulated maximum likelihood method. The new procedure is designed for the challenging case when some state variables are unobserved and moreover, observed states are sparse over time, which commonly arises in ecological studies. We apply this new approach to two epidemics of chronic wasting disease in mule deer.
To analyse a very large data set containing lengthy variables, we adopt a sequential estimation idea and propose a parallel divide-and-conquer method. We conduct several conventional sequential estimation procedures separately, and properly integrate their results while maintaining the desired statistical properties. Additionally, using a criterion from the statistical experiment design, we adopt an adaptive sample selection, together with an adaptive shrinkage estimation method, to simultaneously accelerate the estimation procedure and identify the effective variables. We confirm the cogency of our methods through theoretical justifications and numerical results derived from synthesized data sets. We then apply the proposed method to three real data sets, including those pertaining to appliance energy use and particulate matter concentration.