Testing for heteroscedasticity of the errors is a major challenge in high-dimensional regressions where the number of covariates is large compared to the sample size. Traditional procedures such as the White and the Breusch-Pagan tests typically suffer from poor size and low power. This paper proposes two new test procedures based on standard OLS residuals. Using the theory of random Haar orthogonal matrices, the asymptotic normality of both test statistics is obtained under the null when the degrees of freedom tend to infinity. This encompasses both the classical low-dimensional setting, where the number of variables is fixed while the sample size tends to infinity, and the proportional high-dimensional setting, where these dimensions grow to infinity proportionally. These procedures thus cover a wide range of dimensions in applications. To the best of our knowledge, these are the first procedures in the literature for testing heteroscedasticity that are valid for medium- and high-dimensional regressions. The superiority of the proposed tests over existing methods is demonstrated by extensive simulations and by several real data analyses.
For a multivariate linear model, Wilks' likelihood ratio test (LRT) constitutes one of the cornerstone tools. However, the computation of its quantiles under the null or the alternative requires complex analytic approximations and, more importantly, these distributional approximations are feasible only for a moderate dimension of the dependent variable, say $p \le 20$. On the other hand, assuming that the data dimension $p$ as well as the number $q$ of regression variables are fixed while the sample size $n$ grows, several asymptotic approximations have been proposed in the literature for Wilks' $\Lambda$, including the widely used chi-square approximation. In this paper, we consider necessary modifications to Wilks' test in a high-dimensional context, specifically assuming a high data dimension $p$ and a large sample size $n$. Based on recent random matrix theory, the correction we propose to Wilks' test is asymptotically Gaussian under the null, and simulations demonstrate that the corrected LRT has very satisfactory size and power, certainly in the large $p$ and large $n$ context, but also for moderately large data dimensions like $p=30$ or $p=50$. As a byproduct, we explain why the standard chi-square approximation fails for high-dimensional data. We also introduce a new procedure for the classical multiple sample significance test in MANOVA which is valid for high-dimensional data.
For high-dimensional data with a small sample size, Hotelling's $T^2$ test is not applicable for testing mean vectors because the sample covariance matrix is singular. To overcome this problem, there are three main approaches in the literature. Each of the existing approaches, however, may have serious limitations and works well only in certain situations. Motivated by this, we propose a pairwise Hotelling method for testing high-dimensional mean vectors, which, in essence, provides a good balance between the existing approaches. To make effective use of the correlation information, we construct the new test statistics as the sum of Hotelling's test statistics for the covariate pairs with strong correlations and the squared $t$ statistics for the individual covariates that have little correlation with the others. We further derive the asymptotic null distributions and power functions of the proposed Hotelling tests under some regularity conditions. Numerical results show that our new tests control the type I error rates and can achieve higher statistical power than existing methods, especially when the covariates are highly correlated. Two real data examples are also analyzed, and both demonstrate the efficacy of our pairwise Hotelling tests.
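The construction described above (Hotelling statistics for strongly correlated pairs plus squared $t$ statistics for the remaining covariates) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' procedure: it treats the one-sample problem $H_0: \mu = 0$, takes the list of pairs as a given input (the abstract does not say how pairs are selected), and the names `pairwise_hotelling_stat`, `columns` and `pairs` are hypothetical.

```python
def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Unbiased sample covariance of two equally long lists."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / (len(u) - 1)


def pairwise_hotelling_stat(columns, pairs):
    """Sum of bivariate Hotelling T^2 terms and squared t statistics.

    columns: list of covariate columns, each a list of n observations.
    pairs:   list of disjoint index pairs (j, k), assumed to be chosen
             beforehand (hypothetical input; selection rule not shown).
    """
    n = len(columns[0])
    paired = {j for pair in pairs for j in pair}
    stat = 0.0
    for j, k in pairs:
        cj, ck = columns[j], columns[k]
        mj, mk = mean(cj), mean(ck)
        sjj, skk, sjk = cov(cj, cj), cov(ck, ck), cov(cj, ck)
        det = sjj * skk - sjk * sjk
        # n * m' S^{-1} m with the closed-form inverse of a 2x2 matrix
        stat += n * (skk * mj * mj - 2.0 * sjk * mj * mk + sjj * mk * mk) / det
    for j, col in enumerate(columns):
        if j not in paired:
            # squared one-sample t statistic for an unpaired covariate
            stat += n * mean(col) ** 2 / cov(col, col)
    return stat
```

When the sample covariance of a pair is exactly zero, its Hotelling term reduces to the sum of the two squared $t$ statistics, so pairing only changes the statistic for correlated covariates.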
By studying the family of $p$-dimensional scale mixtures, this paper shows for the first time a nontrivial example where the eigenvalue distribution of the corresponding sample covariance matrix does not converge to the celebrated Marčenko-Pastur law. A different and new limit is found and characterized. The failure of the Marčenko-Pastur limit in this situation is found to be due to a strong dependence between the $p$ coordinates of the mixture. Next, we address the problem of testing whether the mixture has a spherical covariance matrix. To analyze the traditional John's-type test, we establish a novel and general CLT for linear statistics of eigenvalues of the sample covariance matrix. It is shown that John's test and its recent high-dimensional extensions both fail for high-dimensional mixtures, precisely because of the different spectral limit above. As a remedy, a new test procedure is then constructed for the sphericity hypothesis. This test is applied to identify the covariance structure in model-based clustering. It is shown that the test has much higher power than the widely used ICL and BIC criteria in detecting non-spherical component covariance matrices of a high-dimensional mixture.
To quickly approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived. Furthermore, using Frobenius norm matrix concentration inequalities, finite sample properties of the subsample estimator based on optimal subsampling probabilities are also derived. Since the optimal subsampling probabilities depend on the full data estimate, an adaptive two-step algorithm is developed. Asymptotic normality and optimality of the estimator from this adaptive algorithm are established. The proposed methods are illustrated and evaluated through numerical experiments on simulated and real datasets.
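As a rough illustration of the second step of such a two-step scheme, the sketch below computes subsampling probabilities from a pilot estimate and draws a weighted subsample. It is an assumption-laden sketch rather than the paper's algorithm: it specializes to logistic regression and assumes the L-optimality probabilities are proportional to $|y_i - p_i(\hat\beta_{\text{pilot}})| \, \|x_i\|$, a form the abstract itself does not state. The names `l_optimal_probs` and `draw_subsample` are hypothetical.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def l_optimal_probs(X, y, beta_pilot):
    """Probabilities proportional to |y_i - p_i| * ||x_i|| (assumed form)."""
    scores = []
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta_pilot, xi)))
        norm = math.sqrt(sum(v * v for v in xi))
        scores.append(abs(yi - p) * norm)
    total = sum(scores)
    return [s / total for s in scores]


def draw_subsample(probs, r, rng):
    """Draw r indices with replacement and return inverse-probability weights."""
    idx = rng.choices(range(len(probs)), weights=probs, k=r)
    weights = [1.0 / probs[i] for i in idx]
    return idx, weights
```

The returned weights would then enter a weighted likelihood fit on the subsample; that fitting step is omitted here.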
Regression models with crossed random effect errors can be very expensive to compute. The cost of both generalized least squares and Gibbs sampling can easily grow as $N^{3/2}$ (or worse) for $N$ observations. Papaspiliopoulos et al. (2020) present a collapsed Gibbs sampler that costs $O(N)$, but under an extremely stringent sampling model. We propose a backfitting algorithm to compute a generalized least squares estimate and prove that it costs $O(N)$. A critical part of the proof is in ensuring that the number of iterations required is $O(1)$, which follows from keeping a certain matrix norm below $1-\delta$ for some $\delta>0$. Our conditions are greatly relaxed compared to those for the collapsed Gibbs sampler, though still strict. Empirically, the backfitting algorithm has a norm below $1-\delta$ under conditions that are less strict than those in our assumptions. We illustrate the new algorithm on a ratings data set from Stitch Fix.
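To make the $O(N)$ cost per sweep concrete, here is a minimal backfitting sketch for the simplest crossed model, $y = \mu + a_{\text{row}} + b_{\text{col}} + \varepsilon$, with an intercept only. It is a sketch under assumptions, not the paper's algorithm: the variance ratios `lam_a` and `lam_b` (error variance over each random-effect variance) are taken as known, and the function name `backfit_crossed` is hypothetical.

```python
def backfit_crossed(rows, cols, y, lam_a, lam_b, tol=1e-10, max_iter=1000):
    """Alternate shrunken updates of mu, row effects a and column effects b.

    Every sweep passes over the N observations a fixed number of times,
    so one sweep costs O(N).
    """
    n = len(y)
    n_row, n_col = {}, {}
    for r, c in zip(rows, cols):
        n_row[r] = n_row.get(r, 0) + 1
        n_col[c] = n_col.get(c, 0) + 1
    a = {r: 0.0 for r in n_row}
    b = {c: 0.0 for c in n_col}
    mu = sum(y) / n
    sweeps = 0
    for sweeps in range(1, max_iter + 1):
        # intercept: mean of the partial residuals
        new_mu = sum(yi - a[r] - b[c] for r, c, yi in zip(rows, cols, y)) / n
        # row effects: shrunken means of partial residuals
        tot = {r: 0.0 for r in n_row}
        for r, c, yi in zip(rows, cols, y):
            tot[r] += yi - new_mu - b[c]
        new_a = {r: tot[r] / (n_row[r] + lam_a) for r in n_row}
        # column effects: same update, using the fresh row effects
        tot = {c: 0.0 for c in n_col}
        for r, c, yi in zip(rows, cols, y):
            tot[c] += yi - new_mu - new_a[r]
        new_b = {c: tot[c] / (n_col[c] + lam_b) for c in n_col}
        change = max(
            abs(new_mu - mu),
            max(abs(new_a[r] - a[r]) for r in a),
            max(abs(new_b[c] - b[c]) for c in b),
        )
        mu, a, b = new_mu, new_a, new_b
        if change < tol:
            break
    return mu, a, b, sweeps
```

The total cost is $O(N)$ only if the number of sweeps stays bounded as $N$ grows, which is the property the abstract ties to a matrix norm staying below $1-\delta$.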