No Arabic abstract
Gaussian process regression has proven very powerful in statistics, machine learning and inverse problems. A crucial aspect of the success of this methodology, in a wide range of applications to complex and real-world problems, is hierarchical modeling and learning of hyperparameters. The purpose of this paper is to study two paradigms of learning hierarchical parameters: one is from the probabilistic Bayesian perspective, in particular, the empirical Bayes approach that has been largely used in Bayesian statistics; the other is from the deterministic and approximation theoretic view, and in particular the kernel flow algorithm that was proposed recently in the machine learning literature. Analysis of their consistency in the large data limit, as well as explicit identification of their implicit bias in parameter learning, are established in this paper for a Matern-like model on the torus. A particular technical challenge we overcome is the learning of the regularity parameter in the Matern-like field, for which consistency results have been very scarce in the spatial statistics literature. Moreover, we conduct extensive numerical experiments beyond the Matern-like model, comparing the two algorithms further. These experiments demonstrate learning of other hierarchical parameters, such as amplitude and lengthscale; they also illustrate the setting of model misspecification in which the kernel flow approach could show superior performance to the more traditional empirical Bayes approach.
We develop a Nonparametric Empirical Bayes (NEB) framework for compound estimation in the discrete linear exponential family, which includes a wide class of discrete distributions frequently arising from modern big data applications. We propose to directly estimate the Bayes shrinkage factor in the generalized Robbins formula via solving a scalable convex program, which is carefully developed based on a RKHS representation of the Steins discrepancy measure. The new NEB estimation framework is flexible for incorporating various structural constraints into the data driven rule, and provides a unified approach to compound estimation with both regular and scaled squared error losses. We develop theory to show that the class of NEB estimators enjoys strong asymptotic properties. Comprehensive simulation studies as well as analyses of real data examples are carried out to demonstrate the superiority of the NEB estimator over competing methods.
We show that polynomials do not belong to the reproducing kernel Hilbert space of infinitely differentiable translation-invariant kernels whose spectral measures have moments corresponding to a determinate moment problem. Our proof is based on relating this question to the problem of best linear estimation in continuous time one-parameter regression models with a stationary error process defined by the kernel. In particular, we show that the existence of a sequence of estimators with variances converging to $0$ implies that the regression function cannot be an element of the reproducing kernel Hilbert space. This question is then related to the determinacy of the Hamburger moment problem for the spectral measure corresponding to the kernel. In the literature it was observed that a non-vanishing constant function does not belong to the reproducing kernel Hilbert space associated with the Gaussian kernel (see Corollary 4.44 in Steinwart and Christmann, 2008). Our results provide a unifying view of this phenomenon and show that the mentioned result can be extended for arbitrary polynomials and a broad class of translation-invariant kernels.
This paper explores a class of empirical Bayes methods for level-dependent threshold selection in wavelet shrinkage. The prior considered for each wavelet coefficient is a mixture of an atom of probability at zero and a heavy-tailed density. The mixing weight, or sparsity parameter, for each level of the transform is chosen by marginal maximum likelihood. If estimation is carried out using the posterior median, this is a random thresholding procedure; the estimation can also be carried out using other thresholding rules with the same threshold. Details of the calculations needed for implementing the procedure are included. In practice, the estimates are quick to compute and there is software available. Simulations on the standard model functions show excellent performance, and applications to data drawn from various fields of application are used to explore the practical performance of the approach. By using a general result on the risk of the corresponding marginal maximum likelihood approach for a single sequence, overall bounds on the risk of the method are found subject to membership of the unknown function in one of a wide range of Besov classes, covering also the case of f of bounded variation. The rates obtained are optimal for any value of the parameter p in (0,infty], simultaneously for a wide range of loss functions, each dominating the L_q norm of the sigmath derivative, with sigmage0 and 0<qle2.
Although the operator (spectral) norm is one of the most widely used metrics for covariance estimation, comparatively little is known about the fluctuations of error in this norm. To be specific, let $hatSigma$ denote the sample covariance matrix of $n$ observations in $mathbb{R}^p$ that arise from a population matrix $Sigma$, and let $T_n=sqrt{n}|hatSigma-Sigma|_{text{op}}$. In the setting where the eigenvalues of $Sigma$ have a decay profile of the form $lambda_j(Sigma)asymp j^{-2beta}$, we analyze how well the bootstrap can approximate the distribution of $T_n$. Our main result shows that up to factors of $log(n)$, the bootstrap can approximate the distribution of $T_n$ at the dimension-free rate of $n^{-frac{beta-1/2}{6beta+4}}$, with respect to the Kolmogorov metric. Perhaps surprisingly, a result of this type appears to be new even in settings where $p< n$. More generally, we discuss the consequences of this result beyond covariance matrices and show how the bootstrap can be used to estimate the errors of sketching algorithms in randomized numerical linear algebra (RandNLA). An illustration of these ideas is also provided with a climate data example.
We consider the classical problems of estimating the mean of an $n$-dimensional normally (with identity covariance matrix) or Poisson distributed vector under the squared loss. In a Bayesian setting the optimal estimator is given by the prior-dependent conditional mean. In a frequentist setting various shrinkage methods were developed over the last century. The framework of empirical Bayes, put forth by Robbins (1956), combines Bayesian and frequentist mindsets by postulating that the parameters are independent but with an unknown prior and aims to use a fully data-driven estimator to compete with the Bayesian oracle that knows the true prior. The central figure of merit is the regret, namely, the total excess risk over the Bayes risk in the worst case (over the priors). Although this paradigm was introduced more than 60 years ago, little is known about the asymptotic scaling of the optimal regret in the nonparametric setting. We show that for the Poisson model with compactly supported and subexponential priors, the optimal regret scales as $Theta((frac{log n}{loglog n})^2)$ and $Theta(log^3 n)$, respectively, both attained by the original estimator of Robbins. For the normal mean model, the regret is shown to be at least $Omega((frac{log n}{loglog n})^2)$ and $Omega(log^2 n)$ for compactly supported and subgaussian priors, respectively, the former of which resolves the conjecture of Singh (1979) on the impossibility of achieving bounded regret; before this work, the best regret lower bound was $Omega(1)$. In addition to the empirical Bayes setting, these results are shown to hold in the compound setting where the parameters are deterministic. As a side application, the construction in this paper also leads to improved or new lower bounds for density estimation of Gaussian and Poisson mixtures.