The odds ratio (OR) is a widely used measure of effect size in observational research. ORs reflect the statistical association between a binary outcome, such as the presence of a health condition, and a binary predictor, such as exposure to a pollutant. Statistical significance and interval estimates are often computed for the logarithm of the OR, ln(OR), and depend on the asymptotic standard error of ln(OR). For a sample of size N, the standard error can be written as sigma divided by the square root of N, where sigma is the population standard deviation of ln(OR). The ratio of ln(OR) to sigma is a standardized effect size. Unlike correlation, which is another familiar standardized statistic, the standardized ln(OR) cannot reach values of minus one or one. We find that its maximum possible value is given by the Laplace Limit Constant (LLC=0.6627...), which appears as a condition in solutions to Kepler's equation -- one of the central equations in celestial mechanics. The standardized ln(OR) is bounded between minus LLC and LLC, reaching its maximum at ln(OR)~4.7987. This range has implications for the analysis of epidemiological associations, affecting the behavior of a reasonable prior distribution for the standardized ln(OR).
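The bound quoted above can be checked numerically with a minimal sketch. It rests on two assumptions that the abstract does not spell out: sigma^2 is taken to be the sum of reciprocal cell probabilities of the 2x2 table (the usual large-sample formula for ln(OR)), and the maximum is assumed to be attained on a balanced table with p11 = p22 and p12 = p21. On such a table the standardized ln(OR) reduces to u/cosh(u) with u = ln(OR)/4, whose maximum satisfies u*tanh(u) = 1. All function names are illustrative.

```python
import math

def standardized_log_or(p11, p12, p21, p22):
    """Return ln(OR)/sigma for a 2x2 table of cell probabilities (summing to 1).

    Assumption: sigma^2 = 1/p11 + 1/p12 + 1/p21 + 1/p22.
    """
    log_or = math.log(p11 * p22 / (p12 * p21))
    sigma = math.sqrt(1 / p11 + 1 / p12 + 1 / p21 + 1 / p22)
    return log_or / sigma

def balanced_table(log_or):
    """Balanced table (p11 = p22, p12 = p21) with the requested ln(OR)."""
    t = 1.0 / (1.0 + math.exp(-log_or / 2.0))  # t / (1 - t) = exp(ln(OR) / 2)
    return t / 2, (1 - t) / 2, (1 - t) / 2, t / 2

def solve_u(lo=1.0, hi=2.0, iters=100):
    """Bisection for u * tanh(u) = 1, the stationarity condition of u / cosh(u)."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mid * math.tanh(mid) < 1:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

u_star = solve_u()
log_or_star = 4 * u_star  # ln(OR) at which the maximum is attained
max_standardized = standardized_log_or(*balanced_table(log_or_star))
print(log_or_star, max_standardized)
```

Under these assumptions the two printed values agree with the figures quoted in the abstract, ln(OR)~4.7987 and LLC=0.6627....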
The odds ratio is used in health and social surveys where the odds of a certain event are to be compared between two populations. It is defined using logistic regression, and requires that survey data be accompanied by their weights. A nonparametric estimation method that incorporates survey weights and auxiliary information may improve the precision of the odds ratio estimator. It consists of $B$-spline calibration, which can handle the nonlinear structure of the parameter. The variance is estimated through linearization. Implementation is possible with standard survey software. The gain in precision depends on the data, as shown in two examples.
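To illustrate why survey weights matter for this parameter, here is a minimal sketch of a weighted plug-in odds ratio for the special case of a single binary covariate. It is not the $B$-spline calibration estimator described above. The function name and the toy data are hypothetical.

```python
def weighted_odds_ratio(y, x, w):
    """Weighted cross-product odds ratio for binary outcome y and binary covariate x.

    Each unit contributes its survey weight w[i] to the cell (y[i], x[i]).
    """
    cells = {(1, 1): 0.0, (1, 0): 0.0, (0, 1): 0.0, (0, 0): 0.0}
    for yi, xi, wi in zip(y, x, w):
        cells[(yi, xi)] += wi
    # Odds of y = 1 when x = 1, divided by odds of y = 1 when x = 0.
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])

# Hypothetical toy sample: outcomes, covariate values, survey weights.
y = [1, 1, 0, 0, 1, 0]
x = [1, 1, 1, 0, 0, 0]
w = [2.0, 1.0, 1.0, 3.0, 1.0, 2.0]

or_weighted = weighted_odds_ratio(y, x, w)
or_unweighted = weighted_odds_ratio(y, x, [1.0] * len(y))
print(or_weighted, or_unweighted)
```

On this toy sample the weighted and unweighted estimates differ markedly, which is the reason the weights cannot be dropped.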
Let X_1, ..., X_n be independent and identically distributed random vectors with a log-concave (Lebesgue) density f. We first prove that, with probability one, there exists a unique maximum likelihood estimator of f. The use of this estimator is attractive because, unlike kernel density estimation, the method is fully automatic, with no smoothing parameters to choose. Although the existence proof is non-constructive, we are able to reformulate the issue of computation in terms of a non-differentiable convex optimisation problem, and thus combine techniques of computational geometry with Shor's r-algorithm to produce a sequence that converges to the maximum likelihood estimate. For the moderate or large sample sizes in our simulations, the maximum likelihood estimator is shown to provide an improvement in performance compared with kernel-based methods, even when we allow the use of a theoretical, optimal fixed bandwidth for the kernel estimator that would not be available in practice. We also present a real data clustering example, which shows that our methodology can be used in conjunction with the Expectation--Maximisation (EM) algorithm to fit finite mixtures of log-concave densities. An R version of the algorithm is available in the package LogConcDEAD -- Log-Concave Density Estimation in Arbitrary Dimensions.
In microbiome studies, one way of studying bacterial abundances is to estimate the bacterial composition from sequencing read counts. Various transformations are then applied to such compositional data for downstream statistical analysis, among which the centered log-ratio (clr) transformation is the most commonly used. Due to limited sequencing depth and DNA dropouts, many rare bacterial taxa might not be captured in the final sequencing reads, which results in many zero counts. Naive composition estimation using count normalization leads to many zero proportions, which makes the clr transformation infeasible. This paper proposes a multi-sample approach that estimates the clr matrix directly in order to borrow information across samples and across species. Empirical results from real datasets suggest that the clr matrix over multiple samples is approximately low rank, which motivates a regularized maximum likelihood estimation with a nuclear norm penalty. An efficient optimization algorithm using the generalized accelerated proximal gradient method is developed. Theoretical upper bounds on the estimation error and on the corresponding singular subspace errors are established. Simulation studies demonstrate that the proposed estimator outperforms the naive estimators. The method is applied to the Gut Microbiome dataset and the American Gut Project.
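The obstacle described above can be made concrete with a minimal sketch of the clr transformation applied to naively normalized counts. It only illustrates why zero counts make clr infeasible and does not implement the regularized estimator proposed in the paper. The helper names and the read counts are hypothetical.

```python
import math

def naive_composition(counts):
    """Count normalization: divide each taxon's read count by the sample total."""
    total = sum(counts)
    return [c / total for c in counts]

def clr(composition):
    """Centered log-ratio: log of each proportion minus the mean log proportion."""
    if any(p <= 0 for p in composition):
        raise ValueError("clr is undefined when any proportion is zero")
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical read counts for three taxa in one sample.
clr_ok = clr(naive_composition([120, 30, 50]))  # components sum to zero

# A taxon with zero reads yields a zero proportion, so clr cannot be computed.
try:
    clr(naive_composition([120, 0, 80]))
    zero_breaks_clr = False
except ValueError:
    zero_breaks_clr = True

print(clr_ok, zero_breaks_clr)
```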
Additive, multiplicative, and odds ratio neutral models for interactions have long been advocated, and remain controversial, in epidemiology. We show here that these commonly advocated models are biased, leading to spurious interactions and to missed true interactions.
The odds ratio (OR) is a measure of effect size commonly used in observational research. The OR reflects the statistical association between a binary outcome, such as the presence of a health condition, and a binary predictor, such as exposure to a pollutant. Statistical inference and interval estimation for the OR are often performed on the logarithmic scale, due to the asymptotic convergence of log(OR) to a normal distribution. Here, we propose a new normalized measure of effect size, $\gamma'$, and derive its asymptotic distribution. We show that the new statistic, based on the $\gamma'$ distribution, is more powerful than the traditional one for testing the hypothesis $H_0$: log(OR)=0. The new normalized effect size is termed "gamma prime" in the spirit of $D'$, a normalized measure of genetic linkage disequilibrium, which ranges from -1 to 1 for a pair of genetic loci. The normalization constant for $\gamma'$ is based on the maximum range of the standardized effect size, for which we establish a peculiar connection to the Laplace Limit Constant. Furthermore, while standardized effects are of little value on their own, we propose a powerful application in which standardized effects are employed as an intermediate step in an approximate, yet accurate, posterior inference for raw effect size measures, such as log(OR) and $\gamma'$.
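For reference, the traditional log-scale inference that the abstract compares against can be sketched as a Wald test and interval for log(OR) from a 2x2 table of counts. This is the standard large-sample procedure, not the proposed $\gamma'$ statistic. The function name and the counts are hypothetical.

```python
import math

def log_or_wald(n11, n12, n21, n22, z_crit=1.959964):
    """Wald test of H0: log(OR) = 0 and a 95% interval, computed on the log scale."""
    log_or = math.log(n11 * n22 / (n12 * n21))
    se = math.sqrt(1 / n11 + 1 / n12 + 1 / n21 + 1 / n22)  # asymptotic standard error
    z = log_or / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    ci_log = (log_or - z_crit * se, log_or + z_crit * se)
    return {
        "or": math.exp(log_or),
        "log_or": log_or,
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci_or": (math.exp(ci_log[0]), math.exp(ci_log[1])),  # back-transformed interval
    }

# Hypothetical counts: exposed cases, exposed non-cases, unexposed cases, unexposed non-cases.
res = log_or_wald(30, 70, 15, 85)
print(res)
```

The interval is built symmetrically around log(OR) and then exponentiated, which is why the resulting interval for the OR itself is asymmetric.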