ترغب بنشر مسار تعليمي؟ اضغط هنا

QuicK-means: Acceleration of K-means by learning a fast transform

80   0   0.0 ( 0 )
 نشر من قبل Valentin Emiya
 تاريخ النشر 2019
والبحث باللغة English




اسأل ChatGPT حول البحث

K-means -- and the celebrated Lloyd algorithm -- is more than the clustering method it was originally designed to be. It has indeed proven pivotal to help increase the speed of many machine learning and data analysis techniques such as indexing, nearest-neighbor search and prediction, data compression; its beneficial use has been shown to carry over to the acceleration of kernel machines (when using the Nystrom method). Here, we propose a fast extension of K-means, dubbed QuicK-means, that rests on the idea of expressing the matrix of the $K$ centroids as a product of sparse matrices, a feat made possible by recent results devoted to find approximations of matrices as a product of sparse factors. Using such a decomposition squashes the complexity of the matrix-vector product between the factorized $K times D$ centroid matrix $mathbf{U}$ and any vector from $mathcal{O}(K D)$ to $mathcal{O}(A log A+B)$, with $A=min (K, D)$ and $B=max (K, D)$, where $D$ is the dimension of the training data. This drastic computational saving has a direct impact in the assignment process of a point to a cluster, meaning that it is not only tangible at prediction time, but also at training time, provided the factorization procedure is performed during Lloyds algorithm. We precisely show that resorting to a factorization step at each iteration does not impair the convergence of the optimization scheme and that, depending on the context, it may entail a reduction of the training time. Finally, we provide discussions and numerical simulations that show the versatility of our computationally-efficient QuicK-means algorithm.

قيم البحث

اقرأ أيضاً

We propose a novel method to accelerate Lloyds algorithm for K-Means clustering. Unlike previous acceleration approaches that reduce computational cost per iterations or improve initialization, our approach is focused on reducing the number of iterat ions required for convergence. This is achieved by treating the assignment step and the update step of Lloyds algorithm as a fixed-point iteration, and applying Anderson acceleration, a well-established technique for accelerating fixed-point solvers. Classical Anderson acceleration utilizes m previous iterates to find an accelerated iterate, and its performance on K-Means clustering can be sensitive to choice of m and the distribution of samples. We propose a new strategy to dynamically adjust the value of m, which achieves robust and consistent speedups across different problem instances. Our method complements existing acceleration techniques, and can be combined with them to achieve state-of-the-art performance. We perform extensive experiments to evaluate the performance of the proposed method, where it outperforms other algorithms in 106 out of 120 test cases, and the mean decrease ratio of computational time is more than 33%.
$k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, $k$-means algorithm suffers from relatively poor performance on the dataset containing heavy-tailed data or outliers. Besides, standard $k$-means algorithm also has relatively weak stability, $i.e.$ its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing main results is available at url{https://github.com/THUYimingLi/t-k-means}.
89 - Carlo Baldassi 2019
We present a simple heuristic algorithm for efficiently optimizing the notoriously hard minimum sum-of-squares clustering problem, usually addressed by the classical k-means heuristic and its variants. The algorithm, called recombinator-k-means, is v ery similar to a genetic algorithmic scheme: it uses populations of configurations, that are optimized independently in parallel and then recombined in a next-iteration population batch by exploiting a variant of the k-means++ seeding algorithm. An additional reweighting mechanism ensures that the population eventually coalesces into a single solution. Extensive tests measuring optimization objective vs computational time on synthetic and real-word data show that it is the only choice, among state-of-the-art alternatives (simple restarts, random swap, genetic algorithm with pairwise-nearest-neighbor crossover), that consistently produces good results at all time scales, outperforming competitors on large and complicated datasets. The only parameter that requires tuning is the population size. The scheme is rather general (it could be applied even to k-medians or k-medoids, for example). Our implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl.
101 - Nicolas Fraiman , Zichao Li 2020
Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-d iagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
101 - Yiwei Li 2019
This article briefly introduced Arthur and Vassilvitshiis work on textbf{k-means++} algorithm and further generalized the center initialization process. It is found that choosing the most distant sample point from the nearest center as new center can mostly have the same effect as the center initialization process in the textbf{k-means++} algorithm.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا