No Arabic abstract
Two-sample tests utilizing a similarity graph on observations are useful for high-dimensional data and non-Euclidean data due to their flexibility and good performance under a wide range of alternatives. Existing works mainly focused on sparse graphs, such as graphs with the number of edges in the order of the number of observations. However, the tests have better performance with denser graphs under many settings. In this work, we establish the theoretical ground for graph-based tests with graphs that are much denser than those in existing works.
In this paper, we study limiting laws and consistent estimation criteria for the extreme eigenvalues in a spiked covariance model of dimension $p$. Firstly, for fixed $p$, we propose a generalized estimation criterion that can consistently estimate, $k$, the number of spiked eigenvalues. Compared with the existing literature, we show that consistency can be achieved under weaker conditions on the penalty term. Next, allowing both $p$ and $k$ to diverge, we derive limiting distributions of the spiked sample eigenvalues using random matrix theory techniques. Notably, our results do not require the spiked eigenvalues to be uniformly bounded from above or tending to infinity, as have been assumed in the existing literature. Based on the above derived results, we formulate a generalized estimation criterion and show that it can consistently estimate $k$, while $k$ can be fixed or grow at an order of $k=o(n^{1/3})$. We further show that the results in our work continue to hold under a general population distribution without assuming normality. The efficacy of the proposed estimation criteria is illustrated through comparative simulation studies.
In this paper, we study Kaplan-Meier V- and U-statistics respectively defined as $theta(widehat{F}_n)=sum_{i,j}K(X_{[i:n]},X_{[j:n]})W_iW_j$ and $theta_U(widehat{F}_n)=sum_{i eq j}K(X_{[i:n]},X_{[j:n]})W_iW_j/sum_{i eq j}W_iW_j$, where $widehat{F}_n$ is the Kaplan-Meier estimator, ${W_1,ldots,W_n}$ are the Kaplan-Meier weights and $K:(0,infty)^2tomathbb R$ is a symmetric kernel. As in the canonical setting of uncensored data, we differentiate between two asymptotic behaviours for $theta(widehat{F}_n)$ and $theta_U(widehat{F}_n)$. Additionally, we derive an asymptotic canonical V-statistic representation of the Kaplan-Meier V- and U-statistics. By using this representation we study properties of the asymptotic distribution. Applications to hypothesis testing are given.
Distance correlation has become an increasingly popular tool for detecting the nonlinear dependence between a pair of potentially high-dimensional random vectors. Most existing works have explored its asymptotic distributions under the null hypothesis of independence between the two random vectors when only the sample size or the dimensionality diverges. Yet its asymptotic null distribution for the more realistic setting when both sample size and dimensionality diverge in the full range remains largely underdeveloped. In this paper, we fill such a gap and develop central limit theorems and associated rates of convergence for a rescaled test statistic based on the bias-corrected distance correlation in high dimensions under some mild regularity conditions and the null hypothesis. Our new theoretical results reveal an interesting phenomenon of blessing of dimensionality for high-dimensional distance correlation inference in the sense that the accuracy of normal approximation can increase with dimensionality. Moreover, we provide a general theory on the power analysis under the alternative hypothesis of dependence, and further justify the capability of the rescaled distance correlation in capturing the pure nonlinear dependency under moderately high dimensionality for a certain type of alternative hypothesis. The theoretical results and finite-sample performance of the rescaled statistic are illustrated with several simulation examples and a blockchain application.
Over the last two decades, many exciting variable selection methods have been developed for finding a small group of covariates that are associated with the response from a large pool. Can the discoveries from these data mining approaches be spurious due to high dimensionality and limited sample size? Can our fundamental assumptions about the exogeneity of the covariates needed for such variable selection be validated with the data? To answer these questions, we need to derive the distributions of the maximum spurious correlations given a certain number of predictors, namely, the distribution of the correlation of a response variable $Y$ with the best $s$ linear combinations of $p$ covariates $mathbf{X}$, even when $mathbf{X}$ and $Y$ are independent. When the covariance matrix of $mathbf{X}$ possesses the restricted eigenvalue property, we derive such distributions for both a finite $s$ and a diverging $s$, using Gaussian approximation and empirical process techniques. However, such a distribution depends on the unknown covariance matrix of $mathbf{X}$. Hence, we use the multiplier bootstrap procedure to approximate the unknown distributions and establish the consistency of such a simple bootstrap approach. The results are further extended to the situation where the residuals are from regularized fits. Our approach is then used to construct the upper confidence limit for the maximum spurious correlation and to test the exogeneity of the covariates. The former provides a baseline for guarding against false discoveries and the latter tests whether our fundamental assumptions for high-dimensional model selection are statistically valid. Our techniques and results are illustrated with both numerical examples and real data analysis.
We consider testing the equality of two high-dimensional covariance matrices by carrying out a multi-level thresholding procedure, which is designed to detect sparse and faint differences between the covariances. A novel U-statistic composition is developed to establish the asymptotic distribution of the thresholding statistics in conjunction with the matrix blocking and the coupling techniques. We propose a multi-thresholding test that is shown to be powerful in detecting sparse and weak differences between two covariance matrices. The test is shown to have attractive detection boundary and to attain the optimal minimax rate in the signal strength under different regimes of high dimensionality and the sparsity of the signal. Simulation studies are conducted to demonstrate the utility of the proposed test.