No Arabic abstract
We propose a likelihood ratio test framework for testing normal mean vectors in high-dimensional data under two common scenarios: the one-sample test and the two-sample test with equal covariance matrices. We derive the test statistics under the assumption that the covariance matrices follow a diagonal matrix structure. In comparison with the diagonal Hotellings tests, our proposed test statistics display some interesting characteristics. In particular, they are a summation of the log-transformed squared $t$-statistics rather than a direct summation of those components. More importantly, to derive the asymptotic normality of our test statistics under the null and local alternative hypotheses, we do not need the requirement that the covariance matrices follow a diagonal matrix structure. As a consequence, our proposed test methods are very flexible and readily applicable in practice. Simulation studies and a real data analysis are also carried out to demonstrate the advantages of our likelihood ratio test methods.
For high-dimensional small sample size data, Hotellings T2 test is not applicable for testing mean vectors due to the singularity problem in the sample covariance matrix. To overcome the problem, there are three main approaches in the literature. Note, however, that each of the existing approaches may have serious limitations and only works well in certain situations. Inspired by this, we propose a pairwise Hotelling method for testing high-dimensional mean vectors, which, in essence, provides a good balance between the existing approaches. To effectively utilize the correlation information, we construct the new test statistics as the summation of Hotellings test statistics for the covariate pairs with strong correlations and the squared $t$ statistics for the individual covariates that have little correlation with others. We further derive the asymptotic null distributions and power functions for the proposed Hotelling tests under some regularity conditions. Numerical results show that our new tests are able to control the type I error rates, and can achieve a higher statistical power compared to existing methods, especially when the covariates are highly correlated. Two real data examples are also analyzed and they both demonstrate the efficacy of our pairwise Hotelling tests.
The complexity underlying real-world systems implies that standard statistical hypothesis testing methods may not be adequate for these peculiar applications. Specifically, we show that the likelihood-ratio tests null-distribution needs to be modified to accommodate the complexity found in multi-edge network data. When working with independent observations, the p-values of likelihood-ratio tests are approximated using a $chi^2$ distribution. However, such an approximation should not be used when dealing with multi-edge network data. This type of data is characterized by multiple correlations and competitions that make the standard approximation unsuitable. We provide a solution to the problem by providing a better approximation of the likelihood-ratio test null-distribution through a Beta distribution. Finally, we empirically show that even for a small multi-edge network, the standard $chi^2$ approximation provides erroneous results, while the proposed Beta approximation yields the correct p-value estimation.
The ratio between two probability density functions is an important component of various tasks, including selection bias correction, novelty detection and classification. Recently, several estimators of this ratio have been proposed. Most of these methods fail if the sample space is high-dimensional, and hence require a dimension reduction step, the result of which can be a significant loss of information. Here we propose a simple-to-implement, fully nonparametric density ratio estimator that expands the ratio in terms of the eigenfunctions of a kernel-based operator; these functions reflect the underlying geometry of the data (e.g., submanifold structure), often leading to better estimates without an explicit dimension reduction step. We show how our general framework can be extended to address another important problem, the estimation of a likelihood function in situations where that function cannot be well-approximated by an analytical form. One is often faced with this situation when performing statistical inference with data from the sciences, due the complexity of the data and of the processes that generated those data. We emphasize applications where using existing likelihood-free methods of inference would be challenging due to the high dimensionality of the sample space, but where our spectral series method yields a reasonable estimate of the likelihood function. We provide theoretical guarantees and illustrate the effectiveness of our proposed method with numerical experiments.
As an effective nonparametric method, empirical likelihood (EL) is appealing in combining estimating equations flexibly and adaptively for incorporating data information. To select important variables and estimating equations in the sparse high-dimensional model, we consider a penalized EL method based on robust estimating functions by applying two penalty functions for regularizing the regression parameters and the associated Lagrange multipliers simultaneously, which allows the dimensionalities of both regression parameters and estimating equations to grow exponentially with the sample size. A first inspection on the robustness of estimating equations contributing to the estimating equations selection and variable selection is discussed from both theoretical perspective and intuitive simulation results in this paper. The proposed method can improve the robustness and effectiveness when the data have underlying outliers or heavy tails in the response variables and/or covariates. The robustness of the estimator is measured via the bounded influence function, and the oracle properties are also established under some regularity conditions. Extensive simulation studies and a yeast cell data are used to evaluate the performance of the proposed method. The numerical results reveal that the robustness of sparse estimating equations selection fundamentally enhances variable selection accuracy when the data have heavy tails and/or include underlying outliers.
This study aims to evaluate the performance of power in the likelihood ratio test for changepoint detection by bootstrap sampling, and proposes a hypothesis test based on bootstrapped confidence interval lengths. Assuming i.i.d normally distributed errors, and using the bootstrap method, the changepoint sampling distribution is estimated. Furthermore, this study describes a method to estimate a data set with no changepoint to form the null sampling distribution. With the null sampling distribution, and the distribution of the estimated changepoint, critical values and power calculations can be made, over the lengths of confidence intervals.