ترغب بنشر مسار تعليمي؟ اضغط هنا

A comparative study of new cross-validated bandwidth selectors for kernel density estimation

222   0   0.0 ( 0 )
 نشر من قبل Enno Mammen
 تاريخ النشر 2012
  مجال البحث الاحصاء الرياضي
والبحث باللغة English




اسأل ChatGPT حول البحث

Recent contributions to kernel smoothing show that the performance of cross-validated bandwidth selectors improve significantly from indirectness. Indirect crossvalidation first estimates the classical cross-validated bandwidth from a more rough and difficult smoothing problem than the original one and then rescales this indirect bandwidth to become a bandwidth of the original problem. The motivation for this approach comes from the observation that classical crossvalidation tends to work better when the smoothing problem is difficult. In this paper we find that the performance of indirect crossvalidation improves theoretically and practically when the polynomial order of the indirect kernel increases, with the Gaussian kernel as limiting kernel when the polynomial order goes to infinity. These theoretical and practical results support the often proposed choice of the Gaussian kernel as indirect kernel. However, for do-validation our study shows a discrepancy between asymptotic theory and practical performance. As for indirect crossvalidation, in asymptotic theory the performance of indirect do-validation improves with increasing polynomial order of the used indirect kernel. But this theoretical improvements do not carry over to practice and the original do-validation still seems to be our preferred bandwidth selector. We also consider plug-in estimation and combinations of plug-in bandwidths and crossvalidated bandwidths. These latter bandwidths do not outperform the original do-validation estimator either.



قيم البحث

اقرأ أيضاً

In applied multivariate statistics, estimating the number of latent dimensions or the number of clusters is a fundamental and recurring problem. One common diagnostic is the scree plot, which shows the largest eigenvalues of the data matrix; the user searches for a gap or elbow in the decreasing eigenvalues; unfortunately, these patterns can hide beneath the bias of the sample eigenvalues. This methodological problem is conceptually difficult because, in many situations, there is only enough signal to detect a subset of the $k$ population dimensions/eigenvectors. In this situation, one could argue that the correct choice of $k$ is the number of detectable dimensions. We alleviate these problems with cross-validated eigenvalues. Under a large class of random graph models, without any parametric assumptions, we provide a p-value for each sample eigenvector. It tests the null hypothesis that this sample eigenvector is orthogonal to (i.e., uncorrelated with) the true latent dimensions. This approach naturally adapts to problems where some dimensions are not statistically detectable. In scenarios where all $k$ dimensions can be estimated, we prove that our procedure consistently estimates $k$. In simulations and a data example, the proposed estimator compares favorably to alternative approaches in both computational and statistical performance.
153 - Aurelie Bugeau 2007
Kernel estimation techniques, such as mean shift, suffer from one major drawback: the kernel bandwidth selection. The bandwidth can be fixed for all the data set or can vary at each points. Automatic bandwidth selection becomes a real challenge in ca se of multidimensional heterogeneous features. This paper presents a solution to this problem. It is an extension of cite{Comaniciu03a} which was based on the fundamental property of normal distributions regarding the bias of the normalized density gradient. The selection is done iteratively for each type of features, by looking for the stability of local bandwidth estimates across a predefined range of bandwidths. A pseudo balloon mean shift filtering and partitioning are introduced. The validity of the method is demonstrated in the context of color image segmentation based on a 5-dimensional space.
The covariance matrix plays a fundamental role in many modern exploratory and inferential statistical procedures, including dimensionality reduction, hypothesis testing, and regression. In low-dimensional regimes, where the number of observations far exceeds the number of variables, the optimality of the sample covariance matrix as an estimator of this parameter is well-established. High-dimensional regimes do not admit such a convenience, however. As such, a variety of estimators have been derived to overcome the shortcomings of the sample covariance matrix in these settings. Yet, the question of selecting an optimal estimator from among the plethora available remains largely unaddressed. Using the framework of cross-validated loss-based estimation, we develop the theoretical underpinnings of just such an estimator selection procedure. In particular, we propose a general class of loss functions for covariance matrix estimation and establish finite-sample risk bounds and conditions for the asymptotic optimality of the cross-validated estimator selector with respect to these loss functions. We evaluate our proposed approach via a comprehensive set of simulation experiments and demonstrate its practical benefits by application in the exploratory analysis of two single-cell transcriptome sequencing datasets. A free and open-source software implementation of the proposed methodology, the cvCovEst R package, is briefly introduced.
69 - Yang Liu , David Ruppert 2019
This paper develops a novel approach to density estimation on a network. We formulate nonparametric density estimation on a network as a nonparametric regression problem by binning. Nonparametric regression using local polynomial kernel-weighted leas t squares have been studied rigorously, and its asymptotic properties make it superior to kernel estimators such as the Nadaraya-Watson estimator. When applied to a network, the best estimator near a vertex depends on the amount of smoothness at the vertex. Often, there are no compelling reasons to assume that a density will be continuous or discontinuous at a vertex, hence a data driven approach is proposed. To estimate the density in a neighborhood of a vertex, we propose a two-step procedure. The first step of this pretest estimator fits a separate local polynomial regression on each edge using data only on that edge, and then tests for equality of the estimates at the vertex. If the null hypothesis is not rejected, then the second step re-estimates the regression function in a small neighborhood of the vertex, subject to a joint equality constraint. Since the derivative of the density may be discontinuous at the vertex, we propose a piecewise polynomial local regression estimate to model the change in slope. We study in detail the special case of local piecewise linear regression and derive the leading bias and variance terms using weighted least squares theory. We show that the proposed approach will remove the bias near a vertex that has been noted for existing methods, which typically do not allow for discontinuity at vertices. For a fixed network, the proposed method scales sub-linearly with sample size and it can be extended to regression and varying coefficient models on a network. We demonstrate the workings of the proposed model by simulation studies and apply it to a dendrite network data set.
168 - Wai Ming Tai 2020
Given a point set $Psubset mathbb{R}^d$, a kernel density estimation for Gaussian kernel is defined as $overline{mathcal{G}}_P(x) = frac{1}{left|Pright|}sum_{pin P}e^{-leftlVert x-p rightrVert^2}$ for any $xinmathbb{R}^d$. We study how to construct a small subset $Q$ of $P$ such that the kernel density estimation of $P$ can be approximated by the kernel density estimation of $Q$. This subset $Q$ is called coreset. The primary technique in this work is to construct $pm 1$ coloring on the point set $P$ by the discrepancy theory and apply this coloring algorithm recursively. Our result leverages Banaszczyks Theorem. When $d>1$ is constant, our construction gives a coreset of size $Oleft(frac{1}{varepsilon}right)$ as opposed to the best-known result of $Oleft(frac{1}{varepsilon}sqrt{logfrac{1}{varepsilon}}right)$. It is the first to give a breakthrough on the barrier of $sqrt{log}$ factor even when $d=2$.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا