On the number of variables to use in principal component regression


Abstract

We study least squares linear regression over $N$ uncorrelated Gaussian features that are selected in order of decreasing variance. When the number of selected features $p$ is at most the sample size $n$, the estimator under consideration coincides with the principal component regression estimator; when $p > n$, the estimator is the least $\ell_2$ norm solution over the selected features. We give an average-case analysis of the out-of-sample prediction error as $p, n, N \to \infty$ with $p/N \to \alpha$ and $n/N \to \beta$, for some constants $\alpha \in [0,1]$ and $\beta \in (0,1)$. In this average-case setting, the prediction error exhibits a double descent shape as a function of $p$. We also establish conditions under which the minimum risk is achieved in the interpolating ($p > n$) regime.
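The estimator described above can be illustrated with a small finite-sample simulation. The following is only a minimal sketch, not the paper's analysis: the function name `simulate_risk`, the variance spectrum $1/j$, the standard Gaussian coefficients, and the noise level are all assumptions chosen for illustration. It fits least squares on the $p$ highest-variance features (taking the least $\ell_2$ norm solution when $p > n$) and averages the out-of-sample squared error over random draws.

```python
import numpy as np


def simulate_risk(p, n=40, N=100, trials=50, noise_sd=0.5, seed=0):
    """Average out-of-sample squared error when regressing on the first p features.

    Assumptions (illustrative, not taken from the paper): feature j has
    variance 1/j, the true coefficients are i.i.d. standard Gaussian, and the
    label noise is Gaussian with standard deviation noise_sd. Requires 1 <= p <= N.
    """
    rng = np.random.default_rng(seed)
    variances = 1.0 / np.arange(1, N + 1)  # already sorted in decreasing order
    sd = np.sqrt(variances)
    risks = []
    for _ in range(trials):
        theta = rng.standard_normal(N)            # random "true" coefficients
        X = rng.standard_normal((n, N)) * sd      # uncorrelated Gaussian features
        y = X @ theta + noise_sd * rng.standard_normal(n)

        # Least squares on the p highest-variance features. For p > n the
        # system is underdetermined and lstsq returns the minimum-norm solution.
        coef, *_ = np.linalg.lstsq(X[:, :p], y, rcond=None)
        theta_hat = np.zeros(N)
        theta_hat[:p] = coef

        # With uncorrelated zero-mean features, the expected squared prediction
        # error on a fresh point has this closed form (plus the noise variance).
        diff = theta - theta_hat
        risks.append(np.sum(variances * diff**2) + noise_sd**2)
    return float(np.mean(risks))


if __name__ == "__main__":
    # Scan p across both regimes; n = 40 is the interpolation threshold here.
    for p in (5, 20, 35, 40, 45, 60, 100):
        print(p, simulate_risk(p))
```

Scanning $p$ from below $n$ to above it is one way to look for the double descent shape empirically. The finite-sample averages are noisy near $p = n$, so a sketch like this is suggestive only and does not reproduce the asymptotic result.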
