A new family of nonparametric statistics, the r-statistics, is introduced. It consists of counting the number of records of the cumulative sum of the sample. The single-sample r-statistic is almost as powerful as Student's t-statistic for Gaussian and uniformly distributed variables, and more powerful than the sign and Wilcoxon signed-rank statistics as long as the data are not too heavy-tailed. Three two-sample parametric r-statistics are proposed: one has higher specificity but lower sensitivity than the Mann-Whitney U-test, and another has higher sensitivity but lower specificity. A nonparametric two-sample r-statistic is introduced whose power is very close to that of Welch's statistic for Gaussian or uniformly distributed variables.
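The construction described above (counting records of the cumulative sum) can be sketched in a few lines. This is a hypothetical illustration, not the paper's reference implementation; in particular, the convention of taking 0 as the baseline for the first record is an assumption.

```python
import numpy as np

def r_statistic(x):
    """Count the records (new running maxima) of the cumulative sum of x.

    Sketch of the single-sample r-statistic described in the abstract:
    form S_k = x_1 + ... + x_k and count how many k have S_k strictly
    greater than all previous partial sums, with S_0 = 0 as the baseline
    (the baseline convention is an assumption, not from the abstract).
    """
    s = np.cumsum(x)
    # running_max[k] = max(0, s[0], ..., s[k-1])
    running_max = np.maximum.accumulate(np.concatenate(([0.0], s)))[:-1]
    return int(np.sum(s > running_max))
```

For example, the sample `[2.0, -1.0, 3.0]` has partial sums `[2, 1, 4]`, of which `2` and `4` are records, giving a statistic of 2.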
Intensity interferometry is a well-known method in astronomy. Recently, a related method called incoherent diffractive imaging (IDI) was proposed, which applies intensity correlations of x-ray fluorescence radiation to determine the 3D arrangement of the emitting atoms in a sample. Here we discuss inherent sources of noise affecting IDI and derive a model to estimate the dependence of the signal-to-noise ratio (SNR) on the photon counts per pixel, the temporal coherence (or number of modes), and the shape of the imaged object. Simulations in two and three dimensions were performed to validate the predictions of the model. We find that, in contrast to coherent imaging methods, higher intensities and higher detected counts do not always correspond to a larger SNR. Moreover, larger and more complex objects generally yield a poorer SNR despite the higher measured counts. The framework developed here should be a valuable guide for future experimental design.
In this paper, we show that the likelihood-ratio measure (a) is invariant with respect to the choice of dominating sigma-finite measure, (b) satisfies logical consequence properties that standard $p$-values do not, (c) respects frequentist properties, i.e., the type I error can be properly controlled, and, under mild regularity conditions, (d) can be used as an upper bound for posterior probabilities. We also discuss a generic application to testing whether the genotype frequencies of a given population are under Hardy-Weinberg equilibrium, under inbreeding restrictions, or under outbreeding restrictions.
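To make the Hardy-Weinberg application concrete, a minimal sketch of a likelihood-ratio measure for HWE under a trinomial model of genotype counts follows. The function name and the specific trinomial parameterisation are assumptions for illustration; the paper's measure is more general.

```python
import numpy as np

def hwe_likelihood_ratio(n_aa, n_ab, n_bb):
    """Likelihood ratio for Hardy-Weinberg equilibrium from genotype counts.

    Sketch under a trinomial model (an assumed setup, not the paper's exact
    formulation): the unrestricted MLE uses the observed genotype
    frequencies; under HWE the allele-frequency MLE is
    p = (2*n_aa + n_ab) / (2n), giving probabilities (p^2, 2pq, q^2).
    Returns L(restricted MLE) / L(unrestricted MLE), a value in (0, 1].
    """
    counts = np.array([n_aa, n_ab, n_bb], dtype=float)
    n = counts.sum()
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    hwe = np.array([p**2, 2 * p * q, q**2])
    obs = counts / n
    # log-likelihood ratio; terms with zero counts contribute 0
    with np.errstate(divide="ignore", invalid="ignore"):
        ll = np.where(counts > 0, counts * (np.log(hwe) - np.log(obs)), 0.0)
    return float(np.exp(ll.sum()))
```

Counts already in perfect HWE proportions (e.g. 100/200/100) give a ratio of exactly 1, while heterozygote-deficient counts give a ratio below 1.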
This paper investigates the theoretical and empirical performance of Fisher-Pitman-type permutation tests for assessing the equality of unknown Poisson mixture distributions. Building on nonparametric maximum likelihood estimators (NPMLEs) of the mixing distribution, these tests are shown to adapt to complicated, unspecified structures of count data and to be consistent against their corresponding ANOVA-type alternatives; the latter result parallels classic claims made by Robinson (1973). The studied methods are then applied to a single-cell RNA-seq data set obtained from different cell types in brain samples of autism subjects and healthy controls; empirically, they unveil genes that are differentially expressed between autism and control subjects yet are missed by common tests. To justify their use, rate optimality of the NPMLEs is also established in settings similar to nonparametric Gaussian (Wu and Yang, 2020a) and binomial mixtures (Tian et al., 2017; Vinayak et al., 2019).
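The NPMLE machinery is beyond a short sketch, but the underlying Fisher-Pitman permutation idea is simple: recompute the test statistic under random relabelings of the pooled sample. A minimal two-sample version using the difference of means as the statistic (an illustrative choice; the paper's tests use NPMLE-based statistics):

```python
import numpy as np

def fisher_pitman_pvalue(x, y, n_perm=999, rng=None):
    """Two-sample Fisher-Pitman permutation test on the difference of means.

    Pools the samples, repeatedly permutes the group labels, and reports
    the fraction of permutations whose statistic is at least as extreme as
    the observed one (with the +1 correction for a valid p-value).
    """
    rng = np.random.default_rng(rng)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = abs(np.mean(x) - np.mean(y))
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        stat = abs(np.mean(perm[:n]) - np.mean(perm[n:]))
        if stat >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)
```

Under the null of exchangeability this p-value is valid by construction, which is what lets such tests adapt to unspecified data structures.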
Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
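As a point of reference for the classification-accuracy tests mentioned above, here is a minimal sketch of that idea (not the paper's regression framework): label the two samples 0/1, measure how well a simple classifier separates them, and calibrate by permuting the labels. The nearest-centroid rule and in-sample accuracy are simplifying assumptions for illustration.

```python
import numpy as np

def classifier_two_sample_pvalue(X, Y, n_perm=499, rng=None):
    """Classifier two-sample test via a nearest-centroid rule.

    Hypothetical sketch: under H0 the labels are exchangeable, so the
    permutation p-value is valid even though the accuracy is computed
    in-sample on this degenerate classifier.
    """
    rng = np.random.default_rng(rng)
    Z = np.vstack([X, Y])
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]

    def accuracy(lab):
        c0, c1 = Z[lab == 0].mean(axis=0), Z[lab == 1].mean(axis=0)
        # predict label 1 when strictly closer to centroid c1
        pred = np.linalg.norm(Z - c1, axis=1) < np.linalg.norm(Z - c0, axis=1)
        return np.mean(pred == lab)

    obs = accuracy(labels)
    count = sum(accuracy(rng.permutation(labels)) >= obs for _ in range(n_perm))
    return (count + 1) / (n_perm + 1)
```

Such global accuracy tests illustrate the limitation the abstract points out: they report whether the distributions differ, but not where.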
Two-sample and independence tests based on the kernel statistics MMD and HSIC have shown remarkable results on i.i.d. data and stationary random processes. However, these statistics are not directly applicable to non-stationary random processes, a prevalent form of data in many scientific disciplines. In this work, we extend the application of MMD and HSIC to non-stationary settings by assuming access to independent realisations of the underlying random process. These realisations, in the form of non-stationary time series measured on the same temporal grid, can then be viewed as i.i.d. samples from a multivariate probability distribution, to which MMD and HSIC can be applied. We further show how to choose suitable kernels over these high-dimensional spaces by maximising the estimated test power with respect to the kernel hyper-parameters. In experiments on synthetic data, we demonstrate the superior performance of our proposed approaches in terms of test power when compared to current state-of-the-art functional or multivariate two-sample and independence tests. Finally, we apply our methods to a real socio-economic dataset as an example application.
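The key construction, treating each whole time series on a common grid as one high-dimensional sample, can be sketched with the standard biased squared-MMD estimator and a Gaussian RBF kernel. The fixed bandwidth here is a placeholder; the abstract's approach instead tunes kernel hyper-parameters by maximising estimated test power.

```python
import numpy as np

def gaussian_mmd2(X, Y, sigma=1.0):
    """Biased squared-MMD estimate with a Gaussian RBF kernel.

    Each row of X and Y is a whole time series measured on the same
    temporal grid, treated as one i.i.d. sample from a multivariate
    distribution, as in the construction described in the abstract.
    """
    def k(A, B):
        # pairwise squared Euclidean distances, then the RBF kernel
        sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
        return np.exp(-sq / (2 * sigma**2))

    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()
```

The estimate is 0 when the two samples coincide and grows toward the sum of the two within-sample kernel means as the samples separate; in a test it would be calibrated by permutation, as is standard for MMD.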