
A description length approach to determining the number of k-means clusters

Posted by Hiromitsu Mizutani
Publication date: 2017
Paper language: English





We present an asymptotic criterion for determining the optimal number of clusters in k-means. We view k-means as data compression and propose adopting the number of clusters that minimizes the estimated description length after compression. We report two types of compression ratio, based on two ways of quantifying the description length of the compressed data. The approach also offers a way to evaluate whether clusters obtained with k-means have a hierarchical structure, by examining whether multi-stage compression can further reduce the description length. We applied our criteria to synthetic data and to empirical neuroimaging data, in order to observe their behavior across different types of data set and to assess the suitability of the two criteria for different datasets. We found that our method can offer reasonable clustering results that are useful for dimension reduction. While our numerical results revealed that the criteria depend on various aspects of the dataset, such as its dimensionality, the description length approach proposed here provides useful guidance for determining the number of clusters in a principled manner when the underlying properties of the data are unknown and must be inferred from observation.
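
A minimal sketch of the underlying idea follows; it is not the paper's exact criteria (the paper defines two specific compression-ratio variants), but a generic two-part code length built on the same principle: total description length = model cost (centroids) + index cost (cluster memberships) + residual cost (Gaussian-coded deviations from the centroids), with the k that minimizes the total selected. All names, constants, and the toy data below are illustrative assumptions.

import numpy as np
from sklearn.cluster import KMeans

def description_length(X, k, seed=0):
    # Generic two-part code length (in bits) for a k-means fit of X.
    # This is a sketch of the MDL idea, not the paper's exact formula.
    n, d = X.shape
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    resid = X - km.cluster_centers_[km.labels_]
    var = max(resid.var(), 1e-12)                   # Gaussian residual model
    data_bits = 0.5 * n * d * np.log2(2 * np.pi * np.e * var)
    index_bits = n * np.log2(k) if k > 1 else 0.0   # which cluster each point is in
    model_bits = 0.5 * k * d * np.log2(n)           # cost of the centroid parameters
    return data_bits + index_bits + model_bits

X = np.random.default_rng(0).normal(size=(500, 10))  # pure noise: expect k = 1
best_k = min(range(1, 11), key=lambda k: description_length(X, k))
print("selected k:", best_k)

On structureless noise such as this, the index and model costs outweigh any residual savings, so the criterion should favor a single cluster; on genuinely clustered data the residual term shrinks fast enough to reward larger k.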




Read also

Yuri Kornyushin (2007)
A simple, detailed model is applied to study a metallic cluster. It is assumed that the ions and delocalized electrons are distributed randomly throughout the cluster, and that the delocalized electrons are degenerate. The shape of the cluster is modeled as a spherical ball. The energy of the microscopic electrostatic field around the ions is taken into account and calculated. Within this model, the cluster is shown to be stable. The equilibrium radius of the ball and the energy of the equilibrium cluster are calculated, as is the bulk modulus of the cluster.
Chiao-Yu Yang, Eric Xia, Nhat Ho (2019)
Dirichlet process mixture models (DPMM) play a central role in Bayesian nonparametrics, with applications throughout statistics and machine learning. DPMMs are generally used in clustering problems where the number of clusters is not known in advance, and the posterior distribution is treated as providing inference for this number. Recently, however, it has been shown that the DPMM is inconsistent in inferring the true number of components in certain cases. This is an asymptotic result, and it would be desirable to understand whether it holds with finite samples and to understand the full posterior more completely. In this work, we provide a rigorous study of the posterior distribution of the number of clusters in a DPMM under different prior distributions on the parameters and constraints on the distributions of the data. We provide novel lower bounds on the ratios of probabilities between $s+1$ clusters and $s$ clusters when the prior distributions on the parameters are chosen to be Gaussian or uniform.
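
As a rough practical illustration (not the paper's theoretical analysis, which concerns the exact posterior), a truncated Dirichlet-process mixture can be fit with scikit-learn's variational BayesianGaussianMixture and the number of components it actually uses read off; the truncation level n_components=20 and the toy data are assumptions.

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Three well-separated 2-D Gaussian blobs (illustrative toy data).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(200, 2))
               for c in (-4.0, 0.0, 4.0)])

dpmm = BayesianGaussianMixture(
    n_components=20,  # truncation level for the stick-breaking prior
    weight_concentration_prior_type="dirichlet_process",
    max_iter=500,
    random_state=0,
).fit(X)

labels = dpmm.predict(X)
print("components actually used:", np.unique(labels).size)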
The number of trees T in the random forest (RF) algorithm for supervised learning has to be set by the user. It is controversial whether T should simply be set to the largest computationally manageable value or whether a smaller T may in some cases be better. While the principle underlying bagging is that more trees are better, in practice the classification error rate sometimes reaches a minimum before increasing again as the number of trees grows. The goal of this paper is four-fold: (i) providing theoretical results showing that the expected error rate may be a non-monotonous function of the number of trees and explaining under which circumstances this happens; (ii) providing theoretical results showing that such non-monotonous patterns cannot be observed for other performance measures such as the Brier score and the logarithmic loss (for classification) and the mean squared error (for regression); (iii) illustrating the extent of the problem through an application to a large number (n = 306) of datasets from the public database OpenML; (iv) finally, arguing in favor of setting T to a computationally feasible large number, depending on the convergence properties of the desired performance measure.
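
The error-versus-T behavior in point (i) can be inspected empirically. A small sketch, assuming scikit-learn and a synthetic dataset: warm_start keeps the already-grown trees and adds new ones on each fit, so the out-of-bag error can be traced as the forest grows; the tree counts and dataset parameters are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# warm_start=True reuses existing trees; each fit only grows the new ones.
rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=0)
for n_trees in (50, 100, 250, 500, 1000):
    rf.set_params(n_estimators=n_trees)
    rf.fit(X, y)
    print(f"{n_trees:4d} trees -> OOB error {1 - rf.oob_score_:.4f}")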
Our earlier Faddeev three-body study of the $K^-$-deuteron scattering length, $A_{K^-d}$, is revisited here in the light of recent developments on two fronts: (i) the improved chiral unitary approach to the theoretical description of the coupled $\bar{K}N$-related channels at low energies, and (ii) the new and improved measurement from the SIDDHARTA Collaboration of the strong-interaction energy shift and width in the lowest $K^-$-hydrogen atomic level. These two, in combination, have allowed us to produce a reliable two-body input for the three-body calculation. All available low-energy $K^-p$ observables are well reproduced, and predictions for the $\bar{K}N$ scattering lengths and amplitudes, the $(\pi\Sigma)^\circ$ invariant-mass spectra, as well as for $A_{K^-d}$, are put forward and compared with results from other sources. The findings of the present work are expected to be useful in interpreting the forthcoming data from the CLAS, HADES, LEPS and SIDDHARTA Collaborations.
We propose a novel Bayesian neural network architecture that can learn invariances from data alone by inferring a posterior distribution over different weight-sharing schemes. We show that our model outperforms other non-invariant architectures when trained on datasets that contain specific invariances. The same holds true when no data augmentation is performed.
