No Arabic abstract
The present work aims at deriving theoretical guaranties on the behavior of some cross-validation procedures applied to the $k$-nearest neighbors ($k$NN) rule in the context of binary classification. Here we focus on the leave-$p$-out cross-validation (L$p$O) used to assess the performance of the $k$NN classifier. Remarkably this L$p$O estimator can be efficiently computed in this context using closed-form formulas derived by cite{CelisseMaryHuard11}. We describe a general strategy to derive moment and exponential concentration inequalities for the L$p$O estimator applied to the $k$NN classifier. Such results are obtained first by exploiting the connection between the L$p$O estimator and U-statistics, and second by making an intensive use of the generalized Efron-Stein inequality applied to the L$1$O estimator. One other important contribution is made by deriving new quantifications of the discrepancy between the L$p$O estimator and the classification error/risk of the $k$NN classifier. The optimality of these bounds is discussed by means of several lower bounds as well as simulation experiments.
The lasso procedure is ubiquitous in the statistical and signal processing literature, and as such, is the target of substantial theoretical and applied research. While much of this research focuses on the desirable properties that lasso possesses---predictive risk consistency, sign consistency, correct model selection---all of it has assumes that the tuning parameter is chosen in an oracle fashion. Yet, this is impossible in practice. Instead, data analysts must use the data twice, once to choose the tuning parameter and again to estimate the model. But only heuristics have ever justified such a procedure. To this end, we give the first definitive answer about the risk consistency of lasso when the smoothing parameter is chosen via cross-validation. We show that under some restrictions on the design matrix, the lasso estimator is still risk consistent with an empirically chosen tuning parameter.
Nonparametric estimation of mutual information is used in a wide range of scientific problems to quantify dependence between variables. The k-nearest neighbor (knn) methods are consistent, and therefore expected to work well for large sample size. These methods use geometrically regular local volume elements. This practice allows maximum localization of the volume elements, but can also induce a bias due to a poor description of the local geometry of the underlying probability measure. We introduce a new class of knn estimators that we call geometric knn estimators (g-knn), which use more complex local volume elements to better model the local geometry of the probability measures. As an example of this class of estimators, we develop a g-knn estimator of entropy and mutual information based on elliptical volume elements, capturing the local stretching and compression common to a wide range of dynamical systems attractors. A series of numerical examples in which the thickness of the underlying distribution and the sample sizes are varied suggest that local geometry is a source of problems for knn methods such as the Kraskov-St{o}gbauer-Grassberger (KSG) estimator when local geometric effects cannot be removed by global preprocessing of the data. The g-knn method performs well despite the manipulation of the local geometry. In addition, the examples suggest that the g-knn estimators can be of particular relevance to applications in which the system is large, but data size is limited.
The lasso and related sparsity inducing algorithms have been the target of substantial theoretical and applied research. Correspondingly, many results are known about their behavior for a fixed or optimally chosen tuning parameter specified up to unknown constants. In practice, however, this oracle tuning parameter is inaccessible so one must use the data to select one. Common statistical practice is to use a variant of cross-validation for this task. However, little is known about the theoretical properties of the resulting predictions with such data-dependent methods. We consider the high-dimensional setting with random design wherein the number of predictors $p$ grows with the number of observations $n$. Under typical assumptions on the data generating process, similar to those in the literature, we recover oracle rates up to a log factor when choosing the tuning parameter with cross-validation. Under weaker conditions, when the true model is not necessarily linear, we show that the lasso remains risk consistent relative to its linear oracle. We also generalize these results to the group lasso and square-root lasso and investigate the predictive and model selection performance of cross-validation via simulation.
We consider a problem of multiclass classification, where the training sample $S_n = {(X_i, Y_i)}_{i=1}^n$ is generated from the model $mathbb P(Y = m | X = x) = eta_m(x)$, $1 leq m leq M$, and $eta_1(x), dots, eta_M(x)$ are unknown $alpha$-Holder continuous functions.Given a test point $X$, our goal is to predict its label. A widely used $mathsf k$-nearest-neighbors classifier constructs estimates of $eta_1(X), dots, eta_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, dots, n_K$, compute corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$ and apply an aggregation procedure. We study an algorithm, which constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $alpha$ and to the margin and establish rates of convergence under mild assumptions.
The asymptotic optimality (a.o.) of various hyper-parameter estimators with different optimality criteria has been studied in the literature for regularized least squares regression problems. The estimators include e.g., the maximum (marginal) likelihood method, $C_p$ statistics, and generalized cross validation method, and the optimality criteria are based on e.g., the inefficiency, the expectation inefficiency and the risk. In this paper, we consider the regularized least squares regression problems with fixed number of regression parameters, choose the optimality criterion based on the risk, and study the a.o. of several cross validation (CV) based hyper-parameter estimators including the leave $k$-out CV method, generalized CV method, $r$-fold CV method and hold out CV method. We find the former three methods can be a.o. under mild assumptions, but not the last one, and we use Monte Carlo simulations to illustrate the efficacy of our findings.