In the multiple testing context, a challenging problem is the estimation of the proportion $\pi_0$ of true-null hypotheses. A large number of estimators of this quantity rely on identifiability assumptions that either appear to be violated on real data or may at least be relaxed. Under independence, we propose an estimator $\hat{\pi}_0$ based on density estimation using both histograms and cross-validation. Due to the strong connection between the false discovery rate (FDR) and $\pi_0$, many multiple testing procedures (MTP) designed to control the FDR may be improved by introducing an estimator of $\pi_0$. We provide an example of such an improvement (plug-in MTP) based on the procedure of Benjamini and Hochberg (BH). Asymptotic optimality results may be derived for both $\hat{\pi}_0$ and the resulting plug-in procedure. The latter ensures the desired asymptotic control of the FDR, while being more powerful than the BH procedure. We then compare our estimator of $\pi_0$ with other widespread estimators in a wide range of simulations, and the proposed estimator achieves a lower mean square error (MSE) than the other tested methods. Finally, both the asymptotic optimality results and the value of tightly estimating $\pi_0$ are confirmed empirically by results obtained with the plug-in MTP.
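The plug-in idea described in the abstract above can be illustrated with a minimal sketch. It assumes the usual plug-in form of the BH step-up rule, in which the threshold $i\alpha/m$ is replaced by $i\alpha/(m\hat{\pi}_0)$. The function name `plug_in_bh` and the example p-values are hypothetical, and the value of `pi0_hat` is supplied by hand rather than computed with the histogram and cross-validation estimator of the paper, which is not reproduced here.

```python
def plug_in_bh(p_values, alpha=0.05, pi0_hat=1.0):
    """Return the indices rejected by a plug-in BH step-up rule.

    With pi0_hat = 1.0 this is the ordinary BH procedure. pi0_hat is an
    externally supplied estimate of the proportion of true nulls
    (an assumption of this sketch, not the paper's estimator).
    """
    if not 0.0 < pi0_hat <= 1.0:
        raise ValueError("pi0_hat must lie in (0, 1]")
    m = len(p_values)
    # Indices of the hypotheses, ordered by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k = 0  # largest rank whose p-value falls below its threshold
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * pi0_hat):
            k = rank
    # Step-up rule: reject the k smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values, for illustration only.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.061, 0.074, 0.205, 0.212, 0.216]
bh = plug_in_bh(p, alpha=0.05, pi0_hat=1.0)
plug = plug_in_bh(p, alpha=0.05, pi0_hat=0.5)
print(bh, plug)
```

Because dividing by $\hat{\pi}_0 \leq 1$ only raises the thresholds, the plug-in rule rejects at least as many hypotheses as BH at the same level, which is the source of the power gain mentioned in the abstract.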
The lasso procedure is ubiquitous in the statistical and signal processing literature, and as such, is the target of substantial theoretical and applied research. While much of this research focuses on the desirable properties that lasso possesses (predictive risk consistency, sign consistency, correct model selection), all of it assumes that the tuning parameter is chosen in an oracle fashion. Yet, this is impossible in practice. Instead, data analysts must use the data twice, once to choose the tuning parameter and again to estimate the model. But only heuristics have ever justified such a procedure. To this end, we give the first definitive answer about the risk consistency of lasso when the tuning parameter is chosen via cross-validation. We show that under some restrictions on the design matrix, the lasso estimator is still risk consistent with an empirically chosen tuning parameter.
We present an elementary mathematical method to find the minimax estimator of the Bernoulli proportion $\theta$ under the squared error loss when $\theta$ belongs to a restricted parameter space of the form $\Omega = [0, \eta]$ for some pre-specified constant $0 \leq \eta \leq 1$. This problem is inspired by the problem of estimating the rate of positive COVID-19 tests. The presented results and applications should be useful material for both instructors and students when teaching point estimation in statistics or machine learning courses.
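As a point of reference for the abstract above, the following sketch covers only the unrestricted case $\eta = 1$, where the classical minimax estimator under squared error loss is $(X + \sqrt{n}/2)/(n + \sqrt{n})$ with constant risk $1/(4(1+\sqrt{n})^2)$. The restricted-space estimator derived in the paper is not reproduced here. The function names and the choice $n = 25$ are illustrative assumptions.

```python
from math import comb, sqrt


def minimax_estimate(x, n):
    """Classical minimax estimate of theta on [0, 1] from x successes in n trials."""
    return (x + sqrt(n) / 2) / (n + sqrt(n))


def risk(estimator, theta, n):
    """Exact squared-error risk: expectation over X ~ Binomial(n, theta)."""
    return sum(
        comb(n, x) * theta**x * (1 - theta) ** (n - x)
        * (estimator(x, n) - theta) ** 2
        for x in range(n + 1)
    )


n = 25  # illustrative sample size
# Risk evaluated on a grid of theta values in [0, 1].
risks = [risk(minimax_estimate, t / 10, n) for t in range(11)]
constant = 1 / (4 * (1 + sqrt(n)) ** 2)
print(max(risks), min(risks), constant)
```

The risk does not depend on $\theta$, which is the property that makes this estimator minimax on the full interval. When $\theta$ is known to lie in $[0, \eta]$ with $\eta < 1$, the worst case is taken over a smaller set, which is why the restricted problem treated in the paper calls for a separate analysis.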
In science, the most widespread statistical quantities are perhaps $p$-values. The typical advice is to reject the null hypothesis $H_0$ if the corresponding $p$-value is sufficiently small (usually smaller than 0.05). Many criticisms of $p$-values have arisen in the scientific literature. The main issue is that, in general, optimal $p$-values (based on likelihood ratio statistics) are not measures of evidence over the parameter space $\Theta$. Here, we propose an \emph{objective} measure of evidence for very general null hypotheses that satisfies logical requirements (i.e., operations on the subsets of $\Theta$) that are not met by $p$-values (e.g., it is a possibility measure). We study the proposed measure in the light of the abstract belief calculus formalism and conclude that it can be used to establish objective states of belief on the subsets of $\Theta$. Based on its properties, we strongly recommend this measure as an additional summary of significance tests. At the end of the paper we give a short list of open problems.
We propose leave-out estimators of quadratic forms designed for the study of linear models with unrestricted heteroscedasticity. Applications include analysis of variance and tests of linear restrictions in models with many regressors. An approximation algorithm is provided that enables accurate computation of the estimator in very large datasets. We study the large sample properties of our estimator allowing the number of regressors to grow in proportion to the number of observations. Consistency is established in a variety of settings where plug-in methods and estimators predicated on homoscedasticity exhibit first-order biases. For quadratic forms of increasing rank, the limiting distribution can be represented by a linear combination of normal and non-central $\chi^2$ random variables, with normality ensuing under strong identification. Standard error estimators are proposed that enable tests of linear restrictions and the construction of uniformly valid confidence intervals for quadratic forms of interest. We find in Italian social security records that leave-out estimates of a variance decomposition in a two-way fixed effects model of wage determination yield substantially different conclusions regarding the relative contribution of workers, firms, and worker-firm sorting to wage inequality than conventional methods. Monte Carlo exercises corroborate the accuracy of our asymptotic approximations, with clear evidence of non-normality emerging when worker mobility between blocks of firms is limited.
In this paper, we study the classical problem of estimating the proportion of a finite population. First, we consider a fixed sample size method and derive an explicit sample size formula which ensures a mixed criterion of absolute and relative errors. Second, we consider an inverse sampling scheme in which sampling continues until the number of units having a certain attribute reaches a threshold value or the whole population is examined. We establish a simple method to determine the threshold so that a prescribed relative precision is guaranteed. Finally, we develop a multistage sampling scheme for constructing a fixed-width confidence interval for the proportion of a finite population. Powerful computational techniques are introduced so that the fixed-width confidence interval attains a prescribed level of coverage probability.