Global and Local Two-Sample Tests via Regression

Added by Ilmun Kim

Publication date 2018

fields Mathematical Statistics

Research language English

Authors Ilmun Kim - Ann B. Lee - Jing Lei

Methodology

Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
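The global test described in the abstract can be illustrated with a small sketch. It assumes one particular reading of the regression approach: pool both samples, label each point by the sample it came from, regress the label on the covariates, and measure how far the fitted values stray from the overall label proportion. The k-nearest-neighbour regressor, the permutation calibration, and all function names are illustrative assumptions, not details taken from the abstract. The local test is not covered.

```python
import random


def nearest_neighbours(points, k):
    """Indices of the k nearest pooled points to each point (the point itself included)."""
    result = []
    for x in points:
        order = sorted(
            range(len(points)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(x, points[i])),
        )
        result.append(order[:k])
    return result


def global_statistic(neighbours, labels):
    """Mean squared gap between the k-NN regression fit and the overall label proportion."""
    pi_hat = sum(labels) / len(labels)
    fitted = [sum(labels[i] for i in idx) / len(idx) for idx in neighbours]
    return sum((m - pi_hat) ** 2 for m in fitted) / len(fitted)


def regression_two_sample_p_value(sample_a, sample_b, k=5, n_perm=200, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    points = list(sample_a) + list(sample_b)
    labels = [0] * len(sample_a) + [1] * len(sample_b)
    # Neighbour sets depend only on the covariates, so compute them once.
    neighbours = nearest_neighbours(points, k)
    observed = global_statistic(neighbours, labels)
    exceed = 0
    for _ in range(n_perm):
        shuffled = labels[:]
        rng.shuffle(shuffled)  # break any link between labels and covariates
        if global_statistic(neighbours, shuffled) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)


if __name__ == "__main__":
    rng = random.Random(1)
    a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
    b = [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(30)]
    print(regression_two_sample_p_value(a, b))
```

Swapping the k-NN step for another regression method changes which kinds of differences the sketch is sensitive to, which is in the spirit of the abstract's remark that the framework depends on the chosen regression model.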

Kernel Two-Sample and Independence Tests for Non-Stationary Random Processes

Felix Laumann, Julius von Kugelgen, Mauricio Barahona (2020)

Two-sample and independence tests with the kernel-based MMD and HSIC have shown remarkable results on i.i.d. data and stationary random processes. However, these statistics are not directly applicable to non-stationary random processes, a prevalent form of data in many scientific disciplines. In this work, we extend the application of MMD and HSIC to non-stationary settings by assuming access to independent realisations of the underlying random process. These realisations - in the form of non-stationary time-series measured on the same temporal grid - can then be viewed as i.i.d. samples from a multivariate probability distribution, to which MMD and HSIC can be applied. We further show how to choose suitable kernels over these high-dimensional spaces by maximising the estimated test power with respect to the kernel hyper-parameters. In experiments on synthetic data, we demonstrate superior performance of our proposed approaches in terms of test power when compared to current state-of-the-art functional or multivariate two-sample and independence tests. Finally, we employ our methods on a real socio-economic dataset as an example application.
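As a rough illustration of the idea of treating each realisation as one multivariate sample, the sketch below computes the standard biased (V-statistic) estimate of the squared MMD with a Gaussian kernel, where every time series measured on the common grid is a plain vector. The function names and the fixed `bandwidth` are assumptions for illustration; the kernel selection by maximising estimated test power mentioned in the abstract is not implemented.

```python
import math


def gaussian_kernel(x, y, bandwidth):
    """Gaussian kernel between two series observed on the same temporal grid."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd_squared(sample_x, sample_y, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x, x') + mean k(y, y') - 2 * mean k(x, y)."""

    def mean_kernel(u, v):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in u for b in v)
        return total / (len(u) * len(v))

    return (
        mean_kernel(sample_x, sample_x)
        + mean_kernel(sample_y, sample_y)
        - 2 * mean_kernel(sample_x, sample_y)
    )


if __name__ == "__main__":
    # Each inner list is one independent realisation of the process.
    realisations_x = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
    realisations_y = [[5.0, 5.0, 5.0], [5.0, 5.1, 5.0]]
    print(mmd_squared(realisations_x, realisations_y))
```

In practice the estimate would be compared against a null distribution, for example one obtained by permuting realisations between the two groups.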

Methodology Applications

Two-Sample Tests for High Dimensional Means with Thresholding and Data Transformation

Song Xi Chen, Jun Li, Ping-Shou Zhong (2014)

We consider testing for two-sample means of high dimensional populations by thresholding. Two tests are investigated, which are designed for better power performance when the two population mean vectors differ only in sparsely populated coordinates. The first test is constructed by carrying out thresholding to remove the non-signal-bearing dimensions. The second test combines data transformation via the precision matrix with the thresholding. The benefits of the thresholding and the data transformations are shown by a reduced variance of the thresholding test statistics, the improved power and a wider detection region of the tests. Simulation experiments and an empirical study are performed to confirm the theoretical findings and to demonstrate the practical implementations.
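The thresholding idea in the first test can be sketched as follows: compute a squared standardized mean difference for every coordinate and sum only those that exceed a threshold, so that non-signal coordinates contribute nothing. The threshold form `2 * s * log(p)`, the default `s`, and the function name are assumptions made for illustration; the abstract does not specify them, and the precision-matrix transformation of the second test is omitted.

```python
import math
from statistics import mean, variance


def thresholded_statistic(sample_a, sample_b, s=0.5):
    """Sum of per-coordinate squared t-type statistics that exceed the threshold.

    Returns the statistic and the list of coordinates that were kept.
    """
    p = len(sample_a[0])
    threshold = 2 * s * math.log(p)  # assumed threshold level, not from the abstract
    total = 0.0
    kept = []
    for k in range(p):
        col_a = [row[k] for row in sample_a]
        col_b = [row[k] for row in sample_b]
        # Squared standard error of the difference in coordinate means.
        se_sq = variance(col_a) / len(col_a) + variance(col_b) / len(col_b)
        t_sq = (mean(col_a) - mean(col_b)) ** 2 / se_sq
        if t_sq > threshold:
            total += t_sq
            kept.append(k)
    return total, kept


if __name__ == "__main__":
    a = [[0, 0, 0, 0], [1, 1, 1, 1], [2, 2, 2, 2]]
    b = [[10, 0, 0, 0], [11, 1, 1, 1], [12, 2, 2, 2]]
    print(thresholded_statistic(a, b))
```

Only the first coordinate differs between the two toy samples, so it is the only one that should survive the threshold.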

Methodology

Inference on covariance operators via concentration inequalities: k-sample tests, classification, and clustering via Rademacher complexities

Adam B. Kashlak, John A. D. Aston, Richard Nickl (2016)

We propose a novel approach to the analysis of covariance operators making use of concentration inequalities. First, non-asymptotic confidence sets are constructed for such operators. Then, subsequent applications including a k-sample test for equality of covariance, a functional data classifier, and an expectation-maximization style clustering algorithm are derived and tested on both simulated and phoneme data.

Methodology Statistics Theory

One- and two-sample nonparametric tests for the signal-to-noise ratio based on record statistics

Damien Challet (2015)

A new family of nonparametric statistics, the r-statistics, is introduced. It consists of counting the number of records of the cumulative sum of the sample. The single-sample r-statistic is almost as powerful as Student's t-statistic for Gaussian and uniformly distributed variables, and more powerful than the sign and Wilcoxon signed-rank statistics as long as the data are not too heavy-tailed. Three two-sample parametric r-statistics are proposed, one with a higher specificity but a smaller sensitivity than the Mann-Whitney U-test and the other one a higher sensitivity but a smaller specificity. A nonparametric two-sample r-statistic is introduced, whose power is very close to that of the Welch statistic for Gaussian or uniformly distributed variables.
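The basic ingredient, counting records of the cumulative sum, can be sketched in a few lines. The convention used here is an assumption: the walk starts at 0, an upper record is a cumulative sum strictly above every earlier value, and a lower record is one strictly below every earlier value. How these counts are combined into the r-statistics is not stated in the abstract, so the sketch stops at the counts.

```python
from itertools import accumulate


def count_records(sample):
    """Count upper and lower records of the cumulative sum, starting the walk at 0."""
    upper = lower = 0
    running_max = running_min = 0
    for partial_sum in accumulate(sample):
        if partial_sum > running_max:
            running_max = partial_sum
            upper += 1
        elif partial_sum < running_min:
            running_min = partial_sum
            lower += 1
    return upper, lower


if __name__ == "__main__":
    # Cumulative sums are 1, -1, 2, 3, -2.
    print(count_records([1, -2, 3, 1, -5]))  # (3, 2)
```

A sample with a positive drift tends to produce many upper records and few lower ones, which is the intuition behind using record counts to detect a nonzero signal-to-noise ratio.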

Methodology Physics and Society General Finance

Interactive Martingale Tests for the Global Null

Boyan Duan, Aaditya Ramdas, Sivaraman Balakrishnan (2019)

Global null testing is a classical problem going back about a century to Fisher's and Stouffer's combination tests. In this work, we present simple martingale analogs of these classical tests, which are applicable in two distinct settings: (a) the online setting in which there is a possibly infinite sequence of $p$-values, and (b) the batch setting, where one uses prior knowledge to preorder the hypotheses. Through theory and simulations, we demonstrate that our martingale variants have higher power than their classical counterparts even when the preordering is only weakly informative. Finally, using a recent idea of masking $p$-values, we develop a novel interactive test for the global null that can take advantage of covariates and repeated user guidance to create a data-adaptive ordering that achieves higher detection power against structured alternatives.
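For reference, the two classical combination tests named in the abstract can be written with the standard library alone. This sketch covers only the classical batch versions for independent $p$-values, not the martingale or interactive variants; the function names are illustrative. Fisher's statistic $-2\sum_i \log p_i$ is chi-square with $2n$ degrees of freedom under the global null, and its tail has a closed form for even degrees of freedom. Stouffer's statistic $\sum_i \Phi^{-1}(1 - p_i)/\sqrt{n}$ is standard normal.

```python
import math
from statistics import NormalDist


def fisher_combination(p_values):
    """Combined p-value from Fisher's method; inputs must lie in (0, 1]."""
    n = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic divided by 2
    # Chi-square tail with 2n degrees of freedom:
    # exp(-x/2) * sum_{j=0}^{n-1} (x/2)^j / j!
    term = 1.0
    total = 1.0
    for j in range(1, n):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


def stouffer_combination(p_values):
    """Combined p-value from Stouffer's method; inputs must lie strictly in (0, 1)."""
    std_normal = NormalDist()
    z = sum(std_normal.inv_cdf(1 - p) for p in p_values) / math.sqrt(len(p_values))
    return 1 - std_normal.cdf(z)


if __name__ == "__main__":
    p_values = [0.04, 0.20, 0.11, 0.35]
    print(fisher_combination(p_values))
    print(stouffer_combination(p_values))
```

With a single input both functions return that $p$-value unchanged, which is a quick sanity check on the formulas.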

Methodology