
Two-Sample Tests for High Dimensional Means with Thresholding and Data Transformation

 Added by Jun Li
 Publication date 2014
Research language: English





We consider testing for two-sample means of high dimensional populations by thresholding. Two tests are investigated, which are designed for better power performance when the two population mean vectors differ only in sparsely populated coordinates. The first test is constructed by carrying out thresholding to remove the non-signal-bearing dimensions. The second test combines data transformation via the precision matrix with the thresholding. The benefits of the thresholding and the data transformations are shown by a reduced variance of the thresholding test statistics, improved power and a wider detection region of the tests. Simulation experiments and an empirical study are performed to confirm the theoretical findings and to demonstrate the practical implementations.
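To make the idea of thresholding concrete, here is a minimal Python sketch. It is not the authors' exact statistic: the function name `thresholded_mean_statistic`, the pooled-variance standardization, the threshold level `2 * s * log(p)` and the centering by 1 are all assumptions chosen for illustration. It only shows the general mechanism of keeping the coordinates whose squared standardized mean difference exceeds a threshold and summing over those.

```python
import numpy as np


def thresholded_mean_statistic(x, y, s=1.0):
    """Illustrative thresholded two-sample statistic (a sketch, not the paper's test).

    x: array of shape (n1, p), y: array of shape (n2, p).
    s: hypothetical tuning parameter controlling the threshold 2 * s * log(p).
    Returns the thresholded sum and the boolean mask of retained coordinates.
    """
    n1, p = x.shape
    n2 = y.shape[0]

    # Coordinate-wise difference of sample means.
    diff = x.mean(axis=0) - y.mean(axis=0)

    # Pooled coordinate-wise variance (assumes equal variances in both samples).
    pooled_var = (
        (n1 - 1) * x.var(axis=0, ddof=1) + (n2 - 1) * y.var(axis=0, ddof=1)
    ) / (n1 + n2 - 2)

    # Squared standardized differences; roughly chi-square(1) when means are equal.
    t2 = diff ** 2 / (pooled_var * (1.0 / n1 + 1.0 / n2))

    # Keep only coordinates whose squared statistic exceeds the threshold.
    threshold = 2.0 * s * np.log(p)
    keep = t2 > threshold

    # Sum the centered squared statistics over the retained coordinates.
    statistic = float(np.sum(t2[keep] - 1.0))
    return statistic, keep
```

Called on two samples whose means differ in only a few coordinates, the mask `keep` should mostly select those coordinates, so the noise from the remaining dimensions does not enter the sum. Calibrating the statistic into a test (its null mean, variance and critical value) is outside the scope of this sketch.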





72 - Kaijie Xue, Fang Yao 2019
We propose a two-sample test for high-dimensional means that requires neither distributional nor correlational assumptions, besides some weak conditions on the moments and tail properties of the elements in the random vectors. This two-sample test, based on a nontrivial extension of the one-sample central limit theorem (Chernozhukov et al., 2017), provides a practically useful procedure with rigorous theoretical guarantees on its size and power assessment. In particular, the proposed test is easy to compute and does not require the independent and identically distributed assumption: the observations are allowed to have different distributions and arbitrary correlation structures. Further desired features include weaker moment and tail conditions than existing methods, allowance for highly unequal sample sizes, consistent power behavior under fairly general alternatives, and a data dimension allowed to be exponentially high under the umbrella of such general conditions. Simulated and real data examples are used to demonstrate the favorable numerical performance over existing methods.
188 - Ilmun Kim, Ann B. Lee, Jing Lei 2018
Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data. Specifically, in the machine learning literature, there have been recent methodological developments such as classification accuracy tests. The goal of this work is to present a regression approach to comparing multivariate distributions of complex data. Depending on the chosen regression model, our framework can efficiently handle different types of variables and various structures in the data, with competitive power under many practical scenarios. Whereas previous work has been largely limited to global tests which conceal much of the local information, our approach naturally leads to a local two-sample testing framework in which we identify local differences between multivariate distributions with statistical confidence. We demonstrate the efficacy of our approach both theoretically and empirically, under some well-known parametric and nonparametric regression methods. Our proposed methods are applied to simulated data as well as a challenging astronomy data set to assess their practical usefulness.
Two-sample and independence tests with the kernel-based MMD and HSIC have shown remarkable results on i.i.d. data and stationary random processes. However, these statistics are not directly applicable to non-stationary random processes, a prevalent form of data in many scientific disciplines. In this work, we extend the application of MMD and HSIC to non-stationary settings by assuming access to independent realisations of the underlying random process. These realisations - in the form of non-stationary time-series measured on the same temporal grid - can then be viewed as i.i.d. samples from a multivariate probability distribution, to which MMD and HSIC can be applied. We further show how to choose suitable kernels over these high-dimensional spaces by maximising the estimated test power with respect to the kernel hyper-parameters. In experiments on synthetic data, we demonstrate superior performance of our proposed approaches in terms of test power when compared to current state-of-the-art functional or multivariate two-sample and independence tests. Finally, we employ our methods on a real socio-economic dataset as an example application.
155 - Song Xi Chen, Bin Guo 2014
We consider testing regression coefficients in high dimensional generalized linear models. An investigation of the test of Goeman et al. (2011) is conducted, which reveals that if the inverse of the link function is unbounded, the high dimensionality in the covariates can impose adverse impacts on the power of the test. We propose a test formation which can avoid the adverse impact of the high dimensionality. When the inverse of the link function is bounded, as in logistic or probit regression, the proposed test is as good as the test of Goeman et al. (2011). The proposed tests provide p-values for testing significance for gene-sets, as demonstrated in a case study on an acute lymphoblastic leukemia dataset.
128 - Xiuyuan Cheng, Yao Xie 2021
We present a study of kernel MMD two-sample test statistics in the manifold setting, assuming the high-dimensional observations are close to a low-dimensional manifold. We characterize the property of the test (level and power) in relation to the kernel bandwidth, the number of samples, and the intrinsic dimensionality of the manifold. Specifically, we show that when data densities are supported on a $d$-dimensional sub-manifold $\mathcal{M}$ embedded in an $m$-dimensional space, the kernel MMD two-sample test for data sampled from a pair of distributions $(p, q)$ that are Hölder with order $\beta$ is consistent and powerful when the number of samples $n$ is greater than $\delta_2(p,q)^{-2-d/\beta}$ up to a certain constant, where $\delta_2$ is the squared $\ell_2$-divergence between two distributions on the manifold. Moreover, to achieve testing consistency under this scaling of $n$, our theory suggests that the kernel bandwidth $\gamma$ scales with $n^{-1/(d+2\beta)}$. These results indicate that the kernel MMD two-sample test does not have a curse-of-dimensionality when the data lie on the low-dimensional manifold. We demonstrate the validity of our theory and the property of the MMD test for manifold data using several numerical experiments.
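For readers unfamiliar with the kernel MMD statistic mentioned in the abstract above, the following is a minimal sketch of the standard unbiased estimator of the squared MMD with a Gaussian kernel. The function names and the choice of passing `bandwidth` as a free parameter are assumptions for illustration; the sketch does not implement the manifold-specific bandwidth scaling or the test calibration discussed in that work.

```python
import numpy as np


def gaussian_kernel(a, b, bandwidth):
    """Gaussian kernel matrix exp(-||a_i - b_j||^2 / (2 * bandwidth^2))."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dist, 0.0) / (2.0 * bandwidth ** 2))


def mmd2_unbiased(x, y, bandwidth):
    """Unbiased estimate of the squared MMD between samples x (n, d) and y (m, d)."""
    n, m = x.shape[0], y.shape[0]
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)

    # Drop the diagonal terms so the within-sample averages are unbiased.
    np.fill_diagonal(kxx, 0.0)
    np.fill_diagonal(kyy, 0.0)

    return float(
        kxx.sum() / (n * (n - 1)) + kyy.sum() / (m * (m - 1)) - 2.0 * kxy.mean()
    )
```

The estimate is close to zero when both samples come from the same distribution and grows as the distributions separate. In practice a permutation procedure is a common way to turn it into a test with a controlled level.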