No Arabic abstract
This paper investigates the theoretical and empirical performance of Fisher-Pitman-type permutation tests for assessing the equality of unknown Poisson mixture distributions. Building on nonparametric maximum likelihood estimators (NPMLEs) of the mixing distribution, these tests are theoretically shown to be able to adapt to complicated unspecified structures of count data and also consistent against their corresponding ANOVA-type alternatives; the latter is a result in parallel to classic claims made by Robinson (Robinson, 1973). The studied methods are then applied to a single-cell RNA-seq data obtained from different cell types from brain samples of autism subjects and healthy controls; empirically, they unveil genes that are differentially expressed between autism and control subjects yet are missed using common tests. For justifying their use, rate optimality of NPMLEs is also established in settings similar to nonparametric Gaussian (Wu and Yang, 2020a) and binomial mixtures (Tian et al., 2017; Vinayak et al., 2019).
Classical two-sample permutation tests for equality of distributions have exact size in finite samples, but they fail to control size for testing equality of parameters that summarize each distribution. This paper proposes permutation tests for equality of parameters that are estimated at root-n or slower rates. Our general framework applies to both parametric and nonparametric models, with two samples or one sample split into two subsamples. Our tests have correct size asymptotically while preserving exact size in finite samples when distributions are equal. They have no loss in local-asymptotic power compared to tests that use asymptotic critical values. We propose confidence sets with correct coverage in large samples that also have exact coverage in finite samples if distributions are equal up to a transformation. We apply our theory to four commonly-used hypothesis tests of nonparametric functions evaluated at a point. Lastly, simulations show good finite sample properties of our tests.
A new family of nonparametric statistics, the r-statistics, is introduced. It consists of counting the number of records of the cumulative sum of the sample. The single-sample r-statistic is almost as powerful as Students t-statistic for Gaussian and uniformly distributed variables, and more powerful than the sign and Wilcoxon signed-rank statistics as long as the data are not too heavy-tailed. Three two-sample parametric r-statistics are proposed, one with a higher specificity but a smaller sensitivity than Mann-Whitney U-test and the other one a higher sensitivity but a smaller specificity. A nonparametric two-sample r-statistic is introduced, whose power is very close to that of Welch statistic for Gaussian or uniformly distributed variables.
Conditional density estimation generalizes regression by modeling a full density f(yjx) rather than only the expected value E(yjx). This is important for many tasks, including handling multi-modality and generating prediction intervals. Though fundamental and widely applicable, nonparametric conditional density estimators have received relatively little attention from statisticians and little or none from the machine learning community. None of that work has been applied to greater than bivariate data, presumably due to the computational difficulty of data-driven bandwidth selection. We describe the double kernel conditional density estimator and derive fast dual-tree-based algorithms for bandwidth selection using a maximum likelihood criterion. These techniques give speedups of up to 3.8 million in our experiments, and enable the first applications to previously intractable large multivariate datasets, including a redshift prediction problem from the Sloan Digital Sky Survey.
Inverse probability weighted estimators are the oldest and potentially most commonly used class of procedures for the estimation of causal effects. By adjusting for selection biases via a weighting mechanism, these procedures estimate an effect of interest by constructing a pseudo-population in which selection biases are eliminated. Despite their ease of use, these estimators require the correct specification of a model for the weighting mechanism, are known to be inefficient, and suffer from the curse of dimensionality. We propose a class of nonparametric inverse probability weighted estimators in which the weighting mechanism is estimated via undersmoothing of the highly adaptive lasso, a nonparametric regression function proven to converge at $n^{-1/3}$-rate to the true weighting mechanism. We demonstrate that our estimators are asymptotically linear with variance converging to the nonparametric efficiency bound. Unlike doubly robust estimators, our procedures require neither derivation of the efficient influence function nor specification of the conditional outcome model. Our theoretical developments have broad implications for the construction of efficient inverse probability weighted estimators in large statistical models and a variety of problem settings. We assess the practical performance of our estimators in simulation studies and demonstrate use of our proposed methodology with data from a large-scale epidemiologic study.
In this paper, we generalize the metric-based permutation test for the equality of covariance operators proposed by Pigoli et al. (2014) to the case of multiple samples of functional data. To this end, the non-parametric combination methodology of Pesarin and Salmaso (2010) is used to combine all the pairwise comparisons between samples into a global test. Different combining functions and permutation strategies are reviewed and analyzed in detail. The resulting test allows to make inference on the equality of the covariance operators of multiple groups and, if there is evidence to reject the null hypothesis, to identify the pairs of groups having different covariances. It is shown that, for some combining functions, step-down adjusting procedures are available to control for the multiple testing problem in this setting. The empirical power of this new test is then explored via simulations and compared with those of existing alternative approaches in different scenarios. Finally, the proposed methodology is applied to data from wheel running activity experiments, that used selective breeding to study the evolution of locomotor behavior in mice.