$t$-$k$-means: A Robust and Stable $k$-means Variant


Published by Yiming Li

Publication date: 2019

Research field: Informatics Engineering, Mathematical Statistics

Paper language: English

Authors: Yiming Li, Yang Zhang, Qingtao Tang

Machine learning

Abstract

The $k$-means algorithm is one of the most classical clustering methods and has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, the $k$-means algorithm performs relatively poorly on datasets containing heavy-tailed data or outliers. Besides, the standard $k$-means algorithm also has relatively weak stability, i.e., its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed $t$-$k$-means, as well as a fast version of it, to alleviate those problems. Theoretically, we derive $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing the main results is available at https://github.com/THUYimingLi/t-k-means.
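To illustrate why a heavy-tailed model can help, here is a minimal sketch comparing the ordinary mean (the $k$-means center update) with a location estimate obtained by the standard EM reweighting for a one-dimensional Student-$t$ model. This is an illustration under stated assumptions, not the update rule derived in the paper: the function name `t_weighted_location`, the degrees of freedom `nu`, and the fixed squared scale `scale2` are all hypothetical choices made for this example.

```python
def t_weighted_location(xs, nu=1.0, scale2=1.0, iters=50):
    """Location estimate under a 1-D Student-t model via EM-style reweighting.

    Assumptions for this sketch: the scale is fixed (scale2) and nu is given.
    """
    mu = sum(xs) / len(xs)  # start from the ordinary mean
    for _ in range(iters):
        # Points far from mu get small weights, so outliers count less.
        ws = [(nu + 1.0) / (nu + (x - mu) ** 2 / scale2) for x in xs]
        mu = sum(w * x for w, x in zip(ws, xs)) / sum(ws)
    return mu


# Five points near zero plus one gross outlier.
data = [0.0, 0.1, -0.1, 0.2, -0.2, 100.0]
plain_mean = sum(data) / len(data)         # pulled strongly toward the outlier
robust_center = t_weighted_location(data)  # stays close to the bulk of the data
```

The ordinary mean weights every point equally, so a single distant point shifts it substantially, whereas the reweighted estimate shrinks the influence of points that lie far from the current center.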

Carlo Baldassi, 2019

We present a simple heuristic algorithm for efficiently optimizing the notoriously hard minimum sum-of-squares clustering problem, usually addressed by the classical k-means heuristic and its variants. The algorithm, called recombinator-k-means, is very similar to a genetic algorithmic scheme: it uses populations of configurations that are optimized independently in parallel and then recombined in a next-iteration population batch by exploiting a variant of the k-means++ seeding algorithm. An additional reweighting mechanism ensures that the population eventually coalesces into a single solution. Extensive tests measuring optimization objective vs. computational time on synthetic and real-world data show that it is the only choice, among state-of-the-art alternatives (simple restarts, random swap, genetic algorithm with pairwise-nearest-neighbor crossover), that consistently produces good results at all time scales, outperforming competitors on large and complicated datasets. The only parameter that requires tuning is the population size. The scheme is rather general (it could be applied even to k-medians or k-medoids, for example). Our implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl.

QuicK-means: Acceleration of K-means by learning a fast transform

Luc Giffon, Valentin Emiya, Liva Ralaivola, 2019

K-means, together with the celebrated Lloyd algorithm, is more than the clustering method it was originally designed to be. It has indeed proven pivotal in helping to increase the speed of many machine learning and data analysis techniques such as indexing, nearest-neighbor search and prediction, and data compression; its beneficial use has been shown to carry over to the acceleration of kernel machines (when using the Nyström method). Here, we propose a fast extension of K-means, dubbed QuicK-means, that rests on the idea of expressing the matrix of the $K$ centroids as a product of sparse matrices, a feat made possible by recent results devoted to finding approximations of matrices as a product of sparse factors. Using such a decomposition reduces the complexity of the matrix-vector product between the factorized $K \times D$ centroid matrix $\mathbf{U}$ and any vector from $\mathcal{O}(KD)$ to $\mathcal{O}(A \log A + B)$, with $A = \min(K, D)$ and $B = \max(K, D)$, where $D$ is the dimension of the training data. This drastic computational saving has a direct impact on the assignment of a point to a cluster, meaning that it is tangible not only at prediction time but also at training time, provided the factorization procedure is performed during Lloyd's algorithm. We precisely show that resorting to a factorization step at each iteration does not impair the convergence of the optimization scheme and that, depending on the context, it may entail a reduction of the training time. Finally, we provide discussions and numerical simulations that show the versatility of our computationally efficient QuicK-means algorithm.
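The role of the matrix-vector product in the assignment step can be seen in a small sketch of the dense baseline. Since $\|x - u_k\|^2 = \|x\|^2 - 2\,u_k^\top x + \|u_k\|^2$ and $\|x\|^2$ does not depend on $k$, the nearest centroid is found from the product $\mathbf{U}x$, which costs $\mathcal{O}(KD)$ when $\mathbf{U}$ is dense. The code below shows only this dense computation, not the sparse factorization proposed in the paper; the function name `assign` and the example centroids are hypothetical.

```python
def assign(x, centroids):
    """Return the index of the centroid nearest to x (squared Euclidean distance)."""
    best_k, best_score = -1, float("inf")
    for k, u in enumerate(centroids):
        # ||u||^2 - 2 u.x ranks centroids the same way as ||x - u||^2,
        # because ||x||^2 is the same for every centroid.
        score = sum(ui * ui for ui in u) - 2.0 * sum(ui * xi for ui, xi in zip(u, x))
        if score < best_score:
            best_k, best_score = k, score
    return best_k


# Dense K x D centroid matrix (K = 3, D = 2), stored as a list of rows.
U = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
nearest = assign([4.0, 4.5], U)  # the point lies closest to the second row
```

The inner sum over `zip(u, x)` is one row of the product $\mathbf{U}x$; this is the operation whose cost a factorized $\mathbf{U}$ is meant to reduce.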

Biclustering with Alternating K-Means

Nicolas Fraiman, Zichao Li, 2020

Biclustering is the task of simultaneously clustering the rows and columns of a data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperforms other competing biclustering methods in various settings and situations.

Generalization of k-means Related Algorithms

Yiwei Li, 2019

This article briefly introduces Arthur and Vassilvitskii's work on the k-means++ algorithm and further generalizes the center initialization process. It is found that choosing the sample point most distant from its nearest center as the new center mostly has the same effect as the center initialization process in the k-means++ algorithm.
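The two initialization rules mentioned above can be sketched side by side. k-means++ draws each new center with probability proportional to the squared distance to the nearest center chosen so far, while the farthest-point rule deterministically takes the point that maximizes that distance. This is a minimal sketch, not the article's code: the function names and the toy data are hypothetical, and the sketch assumes the data contain at least `k` distinct points.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def nearest_sq_dists(points, centers):
    """Squared distance from each point to its nearest chosen center."""
    return [min(sq_dist(p, c) for c in centers) for p in points]


def kmeanspp_seeds(points, k, rng):
    """k-means++ seeding: sample new centers proportionally to squared distance."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


def farthest_point_seeds(points, k, rng):
    """Farthest-point seeding: always take the point farthest from its nearest center."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = nearest_sq_dists(points, centers)
        centers.append(points[d2.index(max(d2))])
    return centers


# Toy data: three well-separated groups of two points each.
points = [(0, 0), (0, 1), (10, 10), (10, 11), (-10, 10), (-10, 11)]
rng = random.Random(0)
seeds_pp = kmeanspp_seeds(points, 3, rng)
seeds_fp = farthest_point_seeds(points, 3, rng)
```

On well-separated data like this, both rules tend to spread the initial centers across the groups; the farthest-point rule does so deterministically after the first random pick, which also makes it more sensitive to isolated outliers.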

Deep clustering with concrete k-means

Boyan Gao, Yongxin Yang, Henry Gouk, 2019

We address the problem of simultaneously learning a k-means clustering and a deep feature representation from unlabelled data, which is of interest due to the potential of deep k-means to outperform traditional two-step feature extraction and shallow-clustering strategies. We achieve this by developing a gradient estimator for the non-differentiable k-means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concrete k-means model can be optimised with respect to the canonical k-means objective and is easily trained end-to-end without resorting to alternating optimisation. We demonstrate the efficacy of our method on standard clustering benchmarks.
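As a rough illustration of the Gumbel-Softmax idea, the sketch below draws a relaxed (soft) cluster assignment for one embedded point. It shows only the forward sampling step in plain Python; it is not the paper's gradient estimator, and in practice this would be written in an automatic-differentiation framework. Using negative squared distances to the centroids as logits, as well as the names `gumbel_softmax`, `z`, and `centroids`, are assumptions made for this example.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Draw one relaxed one-hot sample from the categorical defined by logits."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(u)) with u ~ Uniform(0, 1);
        # u is clipped away from 0 and 1 to keep the logarithms finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    m = max(noisy)
    exps = [math.exp(v - m) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical embedding z and three centroids in a 2-D feature space.
z = [1.0, 2.0]
centroids = [[1.0, 2.5], [6.0, 6.0], [-4.0, 0.0]]
logits = [-sum((zi - ci) ** 2 for zi, ci in zip(z, c)) for c in centroids]

rng = random.Random(0)
soft_assignment = gumbel_softmax(logits, temperature=0.5, rng=rng)
```

Lower temperatures push the sample toward a one-hot vector, approaching the hard assignment of k-means, while higher temperatures give smoother assignments.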
