Clustering by Hierarchical Nearest Neighbor Descent (H-NND)

77 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Teng Qiu

تاريخ النشر 2015

مجال البحث الاحصاء الرياضي الهندسة المعلوماتية

والبحث باللغة English

تأليف Teng Qiu - Yongjie Li

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Previously in 2014, we proposed the Nearest Descent (ND) method, capable of generating an efficient Graph, called the in-tree (IT). Due to some beautiful and effective features, this IT structure proves well suited for data clustering. Although there exist some redundant edges in IT, they usually have salient features and thus it is not hard to remove them. Subsequently, in order to prevent the seemingly redundant edges from occurring, we proposed the Nearest Neighbor Descent (NND) by adding the Neighborhood constraint on ND. Consequently, clusters automatically emerged, without the additional requirement of removing the redundant edges. However, NND proved still not perfect, since it brought in a new yet worse problem, the over-partitioning problem. Now, in this paper, we propose a method, called the Hierarchical Nearest Neighbor Descent (H-NND), which overcomes the over-partitioning problem of NND via using the hierarchical strategy. Specifically, H-NND uses ND to effectively merge the over-segmented sub-graphs or clusters that NND produces. Like ND, H-NND also generates the IT structure, in which the redundant edges once again appear. This seemingly comes back to the situation that ND faces. However, compared with ND, the redundant edges in the IT structure generated by H-NND generally become more salient, thus being much easier and more reliable to be identified even by the simplest edge-removing method which takes the edge length as the only measure. In other words, the IT structure constructed by H-NND becomes more fitted for data clustering. We prove this on several clustering datasets of varying shapes, dimensions and attributes. Besides, compared with ND, H-NND generally takes less computation time to construct the IT data structure for the input data.

قيم البحث

136 - Nikita Puchkin , Vladimir Spokoiny 2018

We consider a problem of multiclass classification, where the training sample $S_n = {(X_i, Y_i)}_{i=1}^n$ is generated from the model $mathbb P(Y = m | X = x) = eta_m(x)$, $1 leq m leq M$, and $eta_1(x), dots, eta_M(x)$ are unknown $alpha$-Holder co ntinuous functions.Given a test point $X$, our goal is to predict its label. A widely used $mathsf k$-nearest-neighbors classifier constructs estimates of $eta_1(X), dots, eta_M(X)$ and uses a plug-in rule for the prediction. However, it requires a proper choice of the smoothing parameter $mathsf k$, which may become tricky in some situations. In our solution, we fix several integers $n_1, dots, n_K$, compute corresponding $n_k$-nearest-neighbor estimates for each $m$ and each $n_k$ and apply an aggregation procedure. We study an algorithm, which constructs a convex combination of these estimates such that the aggregated estimate behaves approximately as well as an oracle choice. We also provide a non-asymptotic analysis of the procedure, prove its adaptation to the unknown smoothness parameter $alpha$ and to the margin and establish rates of convergence under mild assumptions.

التعلم الالي التعلم الآلي

Distributed Adaptive Nearest Neighbor Classifier: Algorithm and Theory

98 - Ruiqi Liu , Ganggang Xu , Zuofeng Shang 2021

When data is of an extraordinarily large size or physically stored in different locations, the distributed nearest neighbor (NN) classifier is an attractive tool for classification. We propose a novel distributed adaptive NN classifier for which the number of nearest neighbors is a tuning parameter stochastically chosen by a data-driven criterion. An early stopping rule is proposed when searching for the optimal tuning parameter, which not only speeds up the computation but also improves the finite sample performance of the proposed Algorithm. Convergence rate of excess risk of the distributed adaptive NN classifier is investigated under various sub-sample size compositions. In particular, we show that when the sub-sample sizes are sufficiently large, the proposed classifier achieves the nearly optimal convergence rate. Effectiveness of the proposed approach is demonstrated through simulation studies as well as an empirical application to a real-world dataset.

التعلم الالي التعلم الآلي المنهجية

Posterior Regularisation on Bayesian Hierarchical Mixture Clustering

157 - Weipeng Huang , Tin Lok James Ng , Nishma Laitonjam 2021

We study a recent inferential framework, named posterior regularisation, on the Bayesian hierarchical mixture clustering (BHMC) model. This framework facilitates a simple way to impose extra constraints on a Bayesian model to overcome some weakness o f the original model. It narrows the search space of the parameters of the Bayesian model through a formalism that imposes certain constraints on the features of the found solutions. In this paper, in order to enhance the separation of clusters, we apply posterior regularisation to impose max-margin constraints on the nodes at every level of the hierarchy. This paper shows how the framework integrates with BHMC and achieves the expected improvements over the original Bayesian model.

التعلم الالي الذكاء الاصطناعي التعلم الآلي

Efficient Nearest-Neighbor Query and Clustering of Planar Curves

66 - Boris Aronov , Omrit Filtser , Michael Horton 2019

We study two fundamental problems dealing with curves in the plane, namely, the nearest-neighbor problem and the center problem. Let $mathcal{C}$ be a set of $n$ polygonal curves, each of size $m$. In the nearest-neighbor problem, the goal is to cons truct a compact data structure over $mathcal{C}$, such that, given a query curve $Q$, one can efficiently find the curve in $mathcal{C}$ closest to $Q$. In the center problem, the goal is to find a curve $Q$, such that the maximum distance between $Q$ and the curves in $mathcal{C}$ is minimized. We use the well-known discrete Frechet distance function, both under~$L_infty$ and under $L_2$, to measure the distance between two curves. For the nearest-neighbor problem, despite discouraging previous results, we identify two important cases for which it is possible to obtain practical bounds, even when $m$ and $n$ are large. In these cases, either $Q$ is a line segment or $mathcal{C}$ consists of line segments, and the bounds on the size of the data structure and query time are nearly linear in the size of the input and query curve, respectively. The returned answer is either exact under $L_infty$, or approximated to within a factor of $1+varepsilon$ under~$L_2$. We also consider the variants in which the location of the input curves is only fixed up to translation, and obtain similar bounds, under $L_infty$. As for the center problem, we study the case where the center is a line segment, i.e., we seek the line segment that represents the given set as well as possible. We present near-linear time exact algorithms under $L_infty$, even when the location of the input curves is only fixed up to translation. Under $L_2$, we present a roughly $O(n^2m^3)$-time exact algorithm.

الهندسة الحسابية

Nearest Neighbor distributions: new statistical measures for cosmological clustering

171 - Arka Banerjee , Tom Abel 2020

The use of summary statistics beyond the two-point correlation function to analyze the non-Gaussian clustering on small scales is an active field of research in cosmology. In this paper, we explore a set of new summary statistics -- the $k$-Nearest N eighbor Cumulative Distribution Functions ($k{rm NN}$-${rm CDF}$). This is the empirical cumulative distribution function of distances from a set of volume-filling, Poisson distributed random points to the $k$-nearest data points, and is sensitive to all connected $N$-point correlations in the data. The $k{rm NN}$-${rm CDF}$ can be used to measure counts in cell, void probability distributions and higher $N$-point correlation functions, all using the same formalism exploiting fast searches with spatial tree data structures. We demonstrate how it can be computed efficiently from various data sets - both discrete points, and the generalization for continuous fields. We use data from a large suite of $N$-body simulations to explore the sensitivity of this new statistic to various cosmological parameters, compared to the two-point correlation function, while using the same range of scales. We demonstrate that the use of $k{rm NN}$-${rm CDF}$ improves the constraints on the cosmological parameters by more than a factor of $2$ when applied to the clustering of dark matter in the range of scales between $10h^{-1}{rm Mpc}$ and $40h^{-1}{rm Mpc}$. We also show that relative improvement is even greater when applied on the same scales to the clustering of halos in the simulations at a fixed number density, both in real space, as well as in redshift space. Since the $k{rm NN}$-${rm CDF}$ are sensitive to all higher order connected correlation functions in the data, the gains over traditional two-point analyses are expected to grow as progressively smaller scales are included in the analysis of cosmological data.

علم الكونيات والفيزياء الفلكية Nongalactic