Two-sample tests are among the most classical topics in statistics and remain widely used, even in cutting-edge applications. There are at least two modes of inference used to justify two-sample tests. One is the usual superpopulation inference, which assumes the units are independent and identically distributed (i.i.d.) samples from some superpopulation; the other is finite population inference, which relies on the random assignment of units to different groups. When randomization is actually implemented, the latter has the advantage of avoiding distributional assumptions on the outcomes. In this paper, we focus on finite population inference for censored outcomes, which has been less explored in the literature. Moreover, we allow the censoring time to depend on treatment assignment, in which case exact permutation inference is unachievable. We find that, surprisingly, the usual logrank test can also be justified by randomization. Specifically, under a Bernoulli randomized experiment with non-informative i.i.d. censoring within each treatment arm, the logrank test is asymptotically valid for testing Fisher's null hypothesis of no treatment effect on any unit. Moreover, the asymptotic validity of the logrank test does not require any distributional assumption on the potential event times. We further extend the theory to the stratified logrank test, which is useful for randomized blocked designs and when censoring mechanisms vary across strata. In sum, the theory developed here for the logrank test under finite population inference supplements its classical theory under superpopulation inference and provides a broader justification for the logrank test.
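For readers who want to see the statistic under discussion, the following is a minimal sketch of the standard (unstratified) two-arm logrank statistic in NumPy. The function name `logrank_z` and its argument names are illustrative assumptions, not taken from the paper, and the sketch only computes the statistic; it does not reproduce the paper's randomization-based asymptotic argument.

```python
import numpy as np

def logrank_z(time, event, treated):
    """Standardized logrank statistic for two arms (hypothetical helper).

    time    : observed time (event or censoring) for each unit
    event   : True if the event was observed, False if censored
    treated : True for the treatment arm, False for control
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    treated = np.asarray(treated, dtype=bool)

    observed_minus_expected = 0.0
    variance = 0.0
    # Loop over the distinct times at which at least one event occurred.
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                 # units at risk just before t
        n1 = (at_risk & treated).sum()    # treated units at risk
        failing = event & (time == t)
        d = failing.sum()                 # events at t
        d1 = (failing & treated).sum()    # treated events at t
        observed_minus_expected += d1 - d * n1 / n
        if n > 1:
            # Hypergeometric variance of d1 given the risk set.
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    return observed_minus_expected / np.sqrt(variance)
```

Under the conditions stated in the abstract, this statistic would be compared with standard normal quantiles; a positive value indicates more events in the treated arm than expected under the null.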
In many scientific problems, researchers try to relate a response variable $Y$ to a set of potential explanatory variables $X = (X_1,\dots,X_p)$, and start by trying to identify variables that contribute to this relationship. In statistical terms, this goal can be posed as trying to identify the $X_j$'s upon which $Y$ is conditionally dependent. Sometimes it is of value to simultaneously test for each $j$, which is more commonly known as variable selection. The conditional randomization test (CRT) and model-X knockoffs are two recently proposed methods that respectively perform conditional independence testing and variable selection by, for each $X_j$, computing any test statistic on the data and assessing that test statistic's significance by comparing it to test statistics computed on synthetic variables generated using knowledge of $X$'s distribution. Our main contribution is to analyze their power in a high-dimensional linear model where the ratio of the dimension $p$ to the sample size $n$ converges to a positive constant. We give explicit expressions for the asymptotic power of the CRT, variable selection with CRT $p$-values, and model-X knockoffs, each with a test statistic based on either the marginal covariance, the least squares coefficient, or the lasso. One useful application of our analysis is the direct theoretical comparison of the asymptotic powers of variable selection with CRT $p$-values and model-X knockoffs; in the instances with independent covariates that we consider, the CRT provably dominates knockoffs. We also analyze the power gain from using unlabeled data in the CRT when limited knowledge of $X$'s distribution is available, and the power of the CRT when samples are collected retrospectively.
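The resampling idea behind the CRT can be sketched in a few lines. The example below assumes the simplest setting mentioned in the abstract, independent covariates, and further assumes each covariate is standard normal, so that the conditional distribution of $X_j$ given the other covariates is simply $N(0, 1)$. The function name `crt_pvalue` and its parameters are hypothetical, and the marginal covariance statistic is only one of the three statistics the abstract lists.

```python
import numpy as np

def crt_pvalue(X, y, j, n_resamples=500, seed=None):
    """CRT p-value for covariate j with a marginal covariance statistic.

    Assumption (not from the source): the columns of X are independent
    N(0, 1), so X_j given the remaining columns is again N(0, 1).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Test statistic on the real data: |X_j^T y| / n.
    observed = abs(X[:, j] @ y) / n
    # Synthetic copies of X_j drawn from its (assumed) conditional law.
    resampled = rng.standard_normal((n_resamples, n))
    null_stats = np.abs(resampled @ y) / n
    # The +1 terms count the observed statistic among the draws.
    return (1 + np.sum(null_stats >= observed)) / (n_resamples + 1)
```

With dependent covariates, the only change would be to draw the synthetic column from the true conditional distribution of $X_j$ given the other columns; the statistic and the $p$-value formula stay the same.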
In this paper, we address the question of comparing populations of trees. We study a statistical test based on the distance between empirical mean trees, as an analog of the two-sample z statistic for comparing two means. Despite its simplicity, the test is quite powerful at separating distributions with different means, but it does not distinguish between different populations with the same mean; a more complicated test should be applied in that setting. The performance of the test is studied via simulations on Galton-Watson branching processes. We also show an application to a real data problem in genomics.
This paper is concerned with the problem of comparing the population means of two groups of independent observations. An approximate randomization test procedure based on the test statistic of Chen & Qin (2010) is proposed. The asymptotic behavior of the test statistic as well as the randomized statistic is studied under weak conditions. In our theoretical framework, observations are not assumed to be identically distributed even within groups, no condition on the eigenstructure of the covariance is imposed, and the sample sizes of the two groups are allowed to be unbalanced. Under general conditions, all possible asymptotic distributions of the test statistic are obtained. We derive the asymptotic level and local power of the proposed test procedure. Our theoretical results show that the proposed test procedure can adapt to all possible asymptotic distributions of the test statistic and always has the correct level asymptotically. The proposed test procedure also has good power behavior. Our numerical experiments show that the proposed test procedure performs favorably compared with several alternative test procedures.
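As a rough illustration of the ingredients named in this abstract, the sketch below computes the Chen & Qin (2010) two-sample statistic and calibrates it by plain permutation of group labels. The permutation step is a stand-in chosen for simplicity; it is not claimed to be the approximate randomization scheme the paper proposes, and it would not be appropriate when observations within a group are not identically distributed. The function names are hypothetical.

```python
import numpy as np

def chen_qin_statistic(x, y):
    """Chen-Qin statistic: an unbiased estimate of ||mu_x - mu_y||^2.

    x has shape (n1, p) and y has shape (n2, p).
    """
    n1, n2 = len(x), len(y)
    sx, sy = x.sum(axis=0), y.sum(axis=0)
    # Sum over i != j of x_i^T x_j equals ||sum_i x_i||^2 - sum_i ||x_i||^2.
    within_x = (sx @ sx - np.sum(x * x)) / (n1 * (n1 - 1))
    within_y = (sy @ sy - np.sum(y * y)) / (n2 * (n2 - 1))
    cross = 2.0 * (sx @ sy) / (n1 * n2)
    return within_x + within_y - cross

def permutation_pvalue(x, y, n_permutations=500, seed=0):
    """Permutation p-value (a simple stand-in, not the paper's procedure)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    n1 = len(x)
    observed = chen_qin_statistic(x, y)
    exceed = 0
    for _ in range(n_permutations):
        idx = rng.permutation(len(pooled))
        stat = chen_qin_statistic(pooled[idx[:n1]], pooled[idx[n1:]])
        exceed += stat >= observed
    return (1 + exceed) / (n_permutations + 1)
```

Because the statistic drops the diagonal terms $X_i^\top X_i$, it avoids the bias that makes a naive squared distance between sample means unreliable when the dimension is large relative to the sample sizes.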
A new goodness-of-fit test for normality in high dimension (and in a Reproducing Kernel Hilbert Space) is proposed. It shares common ideas with the Maximum Mean Discrepancy (MMD), which it outperforms both in terms of computation time and applicability to a wider range of data. Theoretical results are derived for the Type-I and Type-II errors. They guarantee control of the Type-I error at the prescribed level and an exponentially fast decrease of the Type-II error. Experiments on synthetic and real data also illustrate the practical improvement offered by our test compared with other leading approaches in high-dimensional settings.
In this paper, a novel Bayesian nonparametric test for assessing multivariate normal models is presented. While there are extensive frequentist and graphical methods for testing multivariate normality, it is challenging to find Bayesian counterparts. The proposed approach is based on the use of the Dirichlet process and Mahalanobis distance. More precisely, the Mahalanobis distance is employed as a good technique to transform the $m$-variate problem into a univariate problem. Then the Dirichlet process is used as a prior on the distribution of the Mahalanobis distance. The concentration of the distribution of the distance between the posterior process and the chi-square distribution with $m$ degrees of freedom is compared to the concentration of the distribution of the distance between the prior process and the chi-square distribution with $m$ degrees of freedom via a relative belief ratio. The distance between the Dirichlet process and the chi-square distribution is established based on the Anderson-Darling distance. Key theoretical results of the approach are derived. The procedure is illustrated through several examples, in which the proposed approach shows excellent performance.
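The reduction from an $m$-variate problem to a univariate one can be illustrated with a short sketch. It computes squared Mahalanobis distances using the sample mean and sample covariance, which is an assumption made here for illustration; under multivariate normality these distances are approximately chi-square with $m$ degrees of freedom. The Dirichlet process prior, the Anderson-Darling distance, and the relative belief ratio from the abstract are not implemented, and the function name is hypothetical.

```python
import numpy as np

def squared_mahalanobis(x):
    """Squared Mahalanobis distance of each row of x from the sample mean.

    x has shape (n, m). The sample covariance (n - 1 denominator) is used,
    which is an illustrative choice rather than the paper's specification.
    """
    centered = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)
    # Solve cov @ z = centered^T instead of forming the inverse explicitly.
    solved = np.linalg.solve(cov, centered.T)
    # Row-wise quadratic form centered_i^T cov^{-1} centered_i.
    return np.einsum("ij,ji->i", centered, solved)
```

The resulting one-dimensional sample is what a goodness-of-fit procedure would then compare against the chi-square distribution with $m$ degrees of freedom.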