We present the $U$-Statistic Permutation (USP) test of independence in the context of discrete data displayed in a contingency table. Either Pearson's chi-squared test of independence or the $G$-test is typically used for this task, but we argue that these tests have serious deficiencies, both in terms of their inability to control the size of the test and in terms of their power properties. By contrast, the USP test is guaranteed to control the size of the test at the nominal level for all sample sizes, has no issues with small (or zero) cell counts, and is able to detect distributions that violate independence in only a minimal way. The test statistic is derived from a $U$-statistic estimator of a natural population measure of dependence, and we prove that this is the unique minimum variance unbiased estimator of this population quantity. The practical utility of the USP test is demonstrated on both simulated data, where its power can be dramatically greater than that of Pearson's test and the $G$-test, and on real data. The USP test is implemented in the R package USP.
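The following is a minimal Python sketch of the general recipe the abstract describes: estimate a dependence measure of the form $\sum_{i,j} (p_{ij} - p_{i\cdot} p_{\cdot j})^2$ from a contingency table and calibrate it by permutation. It uses a simple plug-in estimator rather than the paper's exact $U$-statistic, and the function names are illustrative, not the API of the R package USP.

    import numpy as np

    def dependence_stat(x, y, levels_x, levels_y):
        """Plug-in estimate of sum_{ij} (p_ij - p_i. p_.j)^2 from paired labels."""
        n = len(x)
        table = np.zeros((levels_x, levels_y))
        for a, b in zip(x, y):
            table[a, b] += 1
        p = table / n
        row = p.sum(axis=1, keepdims=True)
        col = p.sum(axis=0, keepdims=True)
        return np.sum((p - row * col) ** 2)

    def perm_independence_test(x, y, levels_x, levels_y, B=999, seed=0):
        """Monte Carlo permutation p-value: permute y to break any dependence."""
        rng = np.random.default_rng(seed)
        t_obs = dependence_stat(x, y, levels_x, levels_y)
        exceed = 0
        for _ in range(B):
            t_perm = dependence_stat(x, rng.permutation(y), levels_x, levels_y)
            exceed += (t_perm >= t_obs)
        return (1 + exceed) / (B + 1)  # this form yields a valid permutation p-value

The `(1 + exceed) / (B + 1)` form of the Monte Carlo p-value is what delivers exact size control at the nominal level for any sample size, which is the guarantee emphasized in the abstract.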
We propose a general new method, the conditional permutation test, for testing the conditional independence of variables $X$ and $Y$ given a potentially high-dimensional random vector $Z$ that may contain confounding factors. The proposed test permutes entries of $X$ non-uniformly, so as to respect the existing dependence between $X$ and $Z$ and thus account for the presence of these confounders. Like the conditional randomization test of Candès et al. (2018), our test relies on the availability of an approximation to the distribution of $X \mid Z$. Whereas the test of Candès et al. (2018) uses this estimate to draw new $X$ values, our test uses this approximation to design an appropriate non-uniform distribution on permutations of the $X$ values already seen in the true data. We provide an efficient Markov chain Monte Carlo sampler for the implementation of our method, and establish bounds on the Type I error in terms of the error in the approximation of the conditional distribution of $X \mid Z$, finding that, for the worst-case test statistic, the inflation in Type I error of the conditional permutation test is no larger than that of the conditional randomization test. We validate these theoretical results with experiments on simulated data and on the Capital Bikeshare data set.
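As a rough illustration of the permutation mechanism, the sketch below assumes a fitted Gaussian working model for $X \mid Z$ with conditional means `mu` and a common standard deviation `sigma`, and runs a simple pairwise-swap Metropolis sampler whose stationary distribution weights each permutation by the likelihood of the rearranged $X$ values under that model. It is a simplified stand-in for the paper's sampler, and all names are hypothetical.

    import numpy as np

    def cpt_sample_permutation(x, mu, sigma, n_steps=2000, rng=None):
        """Sample a permutation of x approximately from the non-uniform law
        proportional to prod_i phi((x[pi(i)] - mu[i]) / sigma), via pairwise swaps."""
        rng = np.random.default_rng() if rng is None else rng
        n = len(x)
        perm = rng.permutation(n)          # start from a uniform random permutation
        for _ in range(n_steps):
            i, j = rng.integers(0, n, size=2)
            # log-likelihood change if the entries assigned to slots i and j are swapped
            cur = -((x[perm[i]] - mu[i]) ** 2 + (x[perm[j]] - mu[j]) ** 2)
            new = -((x[perm[j]] - mu[i]) ** 2 + (x[perm[i]] - mu[j]) ** 2)
            if np.log(rng.uniform()) < (new - cur) / (2 * sigma ** 2):
                perm[i], perm[j] = perm[j], perm[i]
        return x[perm]

A p-value is then obtained by comparing any chosen statistic $T(X, Y, Z)$ on the observed data with its values on many such resampled copies of $X$.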
We consider settings in which the data of interest correspond to pairs of ordered times, e.g., the birth times of the first and second child, the times at which a new user creates an account and makes the first purchase on a website, and the entry and survival times of patients in a clinical trial. In these settings, the two times are not independent (the second occurs after the first), yet it is still of interest to determine whether there exists significant dependence \emph{beyond} their ordering in time. We refer to this notion as quasi-(in)dependence. For instance, in a clinical trial, to avoid biased selection, we might wish to verify that recruitment times are quasi-independent of survival times, where dependencies might arise due to seasonal effects. In this paper, we propose a nonparametric statistical test of quasi-independence. Our test considers a potentially infinite space of alternatives, making it suitable for complex data where the nature of the possible quasi-dependence is not known in advance. Standard parametric approaches, such as the classical conditional Kendall's tau and log-rank tests, are recovered as special cases. The test applies in the right-censored setting: an essential feature in clinical trials, where patients can withdraw from the study. We provide an asymptotic analysis of our test statistic, and demonstrate in experiments that our test obtains better power than existing approaches, while being more computationally efficient.
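For concreteness, here is a small sketch of the classical conditional Kendall's tau that the abstract recovers as a special case, written for uncensored ordered pairs $x_i \le y_i$; right censoring, which the proposed test handles, is ignored here for brevity.

    import numpy as np

    def conditional_kendall_tau(x, y):
        """Classical conditional Kendall's tau for quasi-independence of ordered
        pairs x <= y: average concordance sign over 'comparable' pairs, i.e.
        pairs (i, j) with max(x_i, x_j) <= min(y_i, y_j)."""
        n = len(x)
        num, den = 0.0, 0
        for i in range(n):
            for j in range(i + 1, n):
                if max(x[i], x[j]) <= min(y[i], y[j]):   # comparable pair
                    num += np.sign((x[i] - x[j]) * (y[i] - y[j]))
                    den += 1
        return num / den if den > 0 else 0.0

A pair contributes only if it is comparable, i.e. both orderings of its entries are possible given the constraint $x \le y$; under quasi-independence the concordance signs of comparable pairs average to zero, which is the null behaviour a test based on this statistic exploits.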
We propose a new method for dimension reduction in regression using the first two inverse moments. We develop corresponding weighted chi-squared tests for the dimension of the regression. The proposed method considers linear combinations of Sliced Inverse Regression (SIR) and a method based on a new candidate matrix designed to recover the entire inverse second-moment subspace. The optimal combination may be selected based on the p-values derived from the dimension tests. Theoretically, the proposed method, like Sliced Average Variance Estimation (SAVE), is more capable of recovering the complete central dimension reduction subspace than SIR and Principal Hessian Directions (pHd). Therefore it can substitute for SIR, pHd, SAVE, or any linear combination of them at a theoretical level. A simulation study indicates that the proposed method may have consistently greater power than SIR, pHd, and SAVE.
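As background, the sketch below implements standard SIR, one ingredient that the proposed method combines; it is not the combined estimator or its weighted chi-squared dimension test. SAVE and pHd follow the same template with different candidate matrices.

    import numpy as np

    def sir_directions(X, y, n_slices=10, n_dirs=2):
        """Standard Sliced Inverse Regression: eigen-decompose the covariance of
        slice means of the whitened predictors."""
        n, p = X.shape
        Xc = X - X.mean(axis=0)
        L = np.linalg.cholesky(np.cov(X, rowvar=False))
        W = np.linalg.inv(L).T                 # whitening matrix: cov(Xc @ W) = I
        Z = Xc @ W
        # slice the response into roughly equal-sized groups by quantiles
        edges = np.quantile(y, np.linspace(0, 1, n_slices + 1))
        labels = np.clip(np.searchsorted(edges, y, side="right") - 1, 0, n_slices - 1)
        M = np.zeros((p, p))
        for h in range(n_slices):
            idx = labels == h
            if idx.sum() == 0:
                continue
            m_h = Z[idx].mean(axis=0)
            M += idx.mean() * np.outer(m_h, m_h)   # weight by slice proportion
        evals, evecs = np.linalg.eigh(M)
        order = np.argsort(evals)[::-1][:n_dirs]
        return W @ evecs[:, order]             # back-transform to the original scale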
Aggregating multiple effects is often encountered in large-scale data analysis, where the fraction of significant effects is generally small. Many existing methods cannot handle this effectively because they lack computational accuracy for small p-values. The Cauchy combination test (abbreviated as CCT) (J Am Statist Assoc, 2020, 115(529):393-402) is a powerful and computationally efficient test for aggregating individual $p$-values under arbitrary correlation structures. This work revisits the CCT and makes three key contributions: (i) the tail probability of the CCT can be well approximated by a standard Cauchy distribution under much more relaxed conditions placed on the individual p-values rather than on the original test statistics; (ii) these relaxed conditions are shown to be satisfied for many popular copulas formulating bivariate distributions; (iii) the power of the CCT is no less than that of the minimum-type test as the number of tests goes to infinity, under some regularity conditions. These results further broaden the theory and applications of the CCT. Simulation results verify the theoretical findings, and the performance of the CCT is further evaluated with data from a prostate cancer study.
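The combination rule itself is simple to state and compute; the short function below is a direct implementation of the Cauchy combination statistic and its standard Cauchy tail approximation (the helper name is ours).

    import numpy as np

    def cauchy_combination_test(pvals, weights=None):
        """Cauchy combination test: T = sum_i w_i * tan((0.5 - p_i) * pi), with the
        combined p-value approximated by the standard Cauchy tail probability.
        (P-values extremely close to 0 or 1 need extra numerical care.)"""
        p = np.asarray(pvals, dtype=float)
        w = np.full(p.shape, 1.0 / p.size) if weights is None else np.asarray(weights, float)
        w = w / w.sum()                        # normalize weights to sum to one
        T = np.sum(w * np.tan((0.5 - p) * np.pi))
        return 0.5 - np.arctan(T) / np.pi      # P(standard Cauchy > T)

For example, with p-values (1e-8, 0.3, 0.7) and equal weights the combined p-value is roughly 3e-8, reflecting that the statistic is dominated by the smallest p-value.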
Differential abundance tests for compositional data are essential and fundamental tasks in various biomedical applications, such as single-cell, bulk RNA-seq, and microbiome data analysis. However, despite recent developments in these fields, differential abundance analysis of compositional data remains a complicated and unsolved statistical problem, because of the compositional constraint and the prevalence of zero counts in the data. This study introduces a new differential abundance test, the robust differential abundance (RDB) test, to address these challenges. Compared with existing methods, the RDB test 1) is simple and computationally efficient, 2) is robust to prevalent zero counts in compositional datasets, 3) can take the data's compositional nature into account, and 4) has a theoretical guarantee of controlling false discoveries in a general setting. Furthermore, in the presence of observed covariates, the RDB test can work with covariate balancing techniques to remove potential confounding effects and draw reliable conclusions. Finally, we apply the new test to several numerical examples using simulated and real datasets to demonstrate its practical merits.