
Near-Optimal Explainable $k$-Means for All Dimensions

Published by: Lunjia Hu
Publication date: 2021
Research field: Informatics Engineering
Language: English





Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020) introduced explainable clustering, where the cluster boundaries are axis-parallel hyperplanes and the clustering is obtained by applying a decision tree to the data. The central question here is: how much does the explainability constraint increase the value of the cost function? Given $d$-dimensional data points, we show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1 - 2/d}\,\mathrm{poly}(d\log k)$ times the minimum cost achievable by a clustering without the explainability constraint, assuming $k, d \ge 2$. Combining this with an independent work by Makarychev and Shan (ICML 2021), we get an improved bound of $k^{1 - 2/d}\,\mathrm{polylog}(k)$, which we show is optimal for every choice of $k, d \ge 2$ up to a poly-logarithmic factor in $k$. For $d = 2$ in particular, we show an $O(\log k \log\log k)$ bound, improving exponentially over the previous best bound of $\widetilde{O}(k)$.
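Explainable clustering of this kind is built from a threshold tree over reference centers. The following is a minimal, hedged sketch of that general idea in the spirit of the greedy algorithm of Dasgupta et al. (ICML 2020): fit ordinary k-means first, then recursively pick the axis-parallel cut that separates the remaining centers while sending as few points as possible to the opposite side of their assigned center. The function name and the simplified mistake count are illustrative; this is not the specific algorithm analyzed in this paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_threshold_tree(X, labels, centers, center_idx):
    """Recursively split with axis-parallel cuts until each leaf holds one center.
    Cuts are chosen greedily to minimize a simplified 'mistake' count: the number
    of points in the node sent to the opposite side of their assigned center."""
    if len(center_idx) == 1:
        return {"leaf": True, "center": int(center_idx[0])}
    best = None
    C = centers[center_idx]
    for dim in range(X.shape[1]):
        vals = np.unique(C[:, dim])
        for theta in (vals[:-1] + vals[1:]) / 2.0:   # midpoints between center coordinates
            mistakes = int(np.sum((X[:, dim] <= theta) != (centers[labels, dim] <= theta)))
            if best is None or mistakes < best[0]:
                best = (mistakes, dim, theta)
    _, dim, theta = best
    pt_left = X[:, dim] <= theta
    ct_left = centers[center_idx, dim] <= theta
    return {"leaf": False, "dim": dim, "theta": float(theta),
            "left":  build_threshold_tree(X[pt_left], labels[pt_left], centers, center_idx[ct_left]),
            "right": build_threshold_tree(X[~pt_left], labels[~pt_left], centers, center_idx[~ct_left])}

# usage: explain a reference k-means clustering of 2-D points with 5 clusters
X = np.random.default_rng(0).random((500, 2))
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
tree = build_threshold_tree(X, km.labels_, km.cluster_centers_, np.arange(5))
```

The resulting tree assigns every point to a leaf by answering at most a few "is coordinate $i$ below $\theta$?" questions, which is what makes the clustering explainable; the price is the cost ratio quantified in the abstract above.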


Read also

We construct near-optimal coresets for kernel density estimates for points in $\mathbb{R}^d$ when the kernel is positive definite. Specifically, we show a polynomial time construction for a coreset of size $O(\sqrt{d}/\varepsilon \cdot \sqrt{\log 1/\varepsilon})$, and we show a near-matching lower bound of size $\Omega(\min\{\sqrt{d}/\varepsilon, 1/\varepsilon^2\})$. When $d \geq 1/\varepsilon^2$, it is known that the size of the coreset can be $O(1/\varepsilon^2)$. The upper bound is a polynomial-in-$(1/\varepsilon)$ improvement when $d \in [3, 1/\varepsilon^2)$ and the lower bound is the first known lower bound to depend on $d$ for this problem. Moreover, the upper bound restriction that the kernel is positive definite is significant in that it applies to a wide variety of kernels, specifically those most important for machine learning. This includes kernels for information distances and the sinc kernel, which can be negative.
Carlo Baldassi, 2019
We present a simple heuristic algorithm for efficiently optimizing the notoriously hard minimum sum-of-squares clustering problem, usually addressed by the classical k-means heuristic and its variants. The algorithm, called recombinator-k-means, is very similar to a genetic algorithmic scheme: it uses populations of configurations that are optimized independently in parallel and then recombined in a next-iteration population batch by exploiting a variant of the k-means++ seeding algorithm. An additional reweighting mechanism ensures that the population eventually coalesces into a single solution. Extensive tests measuring optimization objective vs computational time on synthetic and real-world data show that it is the only choice, among state-of-the-art alternatives (simple restarts, random swap, genetic algorithm with pairwise-nearest-neighbor crossover), that consistently produces good results at all time scales, outperforming competitors on large and complicated datasets. The only parameter that requires tuning is the population size. The scheme is rather general (it could be applied even to k-medians or k-medoids, for example). Our implementation is publicly available at https://github.com/carlobaldassi/RecombinatorKMeans.jl.
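A hedged Python sketch of the recombination step described above: each generation optimizes a population of k-means runs independently, pools their centers, and seeds the next generation by k-means++-style sampling restricted to that pool. The unweighted recombination and the parameter names (`pop_size`, `generations`) are illustrative simplifications; the reweighting mechanism and full details are in the authors' Julia package.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeanspp_from_pool(pool, k, rng):
    """k-means++ style seeding restricted to candidate centers in `pool`."""
    chosen = [pool[rng.integers(len(pool))]]
    for _ in range(k - 1):
        d2 = ((pool[:, None, :] - np.asarray(chosen)[None, :, :]) ** 2).sum(-1).min(1)
        w = d2 if d2.sum() > 0 else np.ones(len(pool))      # guard against a degenerate pool
        chosen.append(pool[rng.choice(len(pool), p=w / w.sum())])
    return np.asarray(chosen)

def recombinator_kmeans(X, k, pop_size=10, generations=5, seed=0):
    rng = np.random.default_rng(seed)
    # generation 0: independent k-means++ runs
    population = [KMeans(n_clusters=k, n_init=1, random_state=int(rng.integers(2**31 - 1))).fit(X)
                  for _ in range(pop_size)]
    for _ in range(generations):
        pool = np.vstack([m.cluster_centers_ for m in population])   # pooled centers
        # recombine: seed every new run from the pool, then re-optimize with Lloyd's iterations
        population = [KMeans(n_clusters=k, init=kmeanspp_from_pool(pool, k, rng), n_init=1).fit(X)
                      for _ in range(pop_size)]
    return min(population, key=lambda m: m.inertia_)

# usage
X = np.random.default_rng(1).normal(size=(1000, 5))
best = recombinator_kmeans(X, k=8)
```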
The $k$-means algorithm is one of the most classical clustering methods, which has been widely and successfully used in signal processing. However, due to the thin-tailed property of the Gaussian distribution, the $k$-means algorithm suffers from relatively poor performance on datasets containing heavy-tailed data or outliers. Besides, the standard $k$-means algorithm also has relatively weak stability, i.e., its results have a large variance, which reduces its credibility. In this paper, we propose a robust and stable $k$-means variant, dubbed the $t$-$k$-means, as well as its fast version, to alleviate those problems. Theoretically, we derive the $t$-$k$-means and analyze its robustness and stability from the aspect of the loss function and the expression of the clustering center, respectively. Extensive experiments are also conducted, which verify the effectiveness and efficiency of the proposed method. The code for reproducing main results is available at https://github.com/THUYimingLi/t-k-means.
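The abstract does not spell out the update rules, so the sketch below only illustrates the general idea of a heavy-tailed k-means variant: it borrows the standard Student-t weight (nu + d) / (nu + squared distance), so that distant points (likely outliers) pull the centers less than in plain k-means. It should not be read as the authors' $t$-$k$-means; see their repository for the actual method.

```python
import numpy as np

def heavy_tailed_kmeans(X, k, nu=3.0, iters=50, seed=0):
    """k-means with Student-t style weights: outlying points influence centers less."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    d = X.shape[1]
    for _ in range(iters):
        dist2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)        # (n, k)
        labels = dist2.argmin(1)
        # heavy-tailed down-weighting of points far from their assigned center
        w = (nu + d) / (nu + dist2[np.arange(len(X)), labels])
        for j in range(k):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(X[mask], axis=0, weights=w[mask])
    return centers, labels

# usage: two well-separated groups plus a few gross outliers
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(5, 1, (200, 2)), rng.normal(0, 30, (10, 2))])
centers, labels = heavy_tailed_kmeans(X, k=2)
```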
M. Andrecut, 2020
We combine K-means clustering with the least-squares kernel classification method. K-means clustering is used to extract a set of representative vectors for each class. The least-squares kernel method uses these representative vectors as a training set for the classification task. We show that this combination of unsupervised and supervised learning algorithms performs very well, and we illustrate this approach using the MNIST dataset.
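A minimal sketch of this two-stage pipeline under assumed choices (RBF kernel, `m` prototypes per class, a small ridge term, all hypothetical parameters): k-means compresses each class into a few representative vectors, and a regularized least-squares kernel classifier is then fit on those representatives only.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import rbf_kernel

def fit_prototype_ls_kernel(X, y, m=10, gamma=0.05, ridge=1e-3):
    """Compress each class to m k-means prototypes, then fit a regularized
    least-squares kernel classifier on the prototypes only."""
    classes = np.unique(y)
    protos, proto_y = [], []
    for c in classes:
        km = KMeans(n_clusters=m, n_init=5, random_state=0).fit(X[y == c])
        protos.append(km.cluster_centers_)
        proto_y.append(np.full(m, c))
    P = np.vstack(protos)
    T = (np.concatenate(proto_y)[:, None] == classes[None, :]).astype(float)  # one-hot targets
    K = rbf_kernel(P, P, gamma=gamma)
    alpha = np.linalg.solve(K + ridge * np.eye(len(P)), T)                    # least-squares fit

    def predict(X_new):
        return classes[(rbf_kernel(X_new, P, gamma=gamma) @ alpha).argmax(1)]
    return predict

# usage on toy two-class data
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (300, 4)), rng.normal(2, 1, (300, 4))])
y = np.repeat([0, 1], 300)
predict = fit_prototype_ls_kernel(X, y, m=5)
accuracy = (predict(X) == y).mean()
```

Because only the prototypes enter the kernel matrix, the linear system stays small regardless of the original training-set size, which is the main computational point of the combination.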
Nicolas Fraiman, Zichao Li, 2020
Biclustering is the task of simultaneously clustering the rows and columns of the data matrix into different subgroups such that the rows and columns within a subgroup exhibit similar patterns. In this paper, we consider the case of producing block-diagonal biclusters. We provide a new formulation of the biclustering problem based on the idea of minimizing the empirical clustering risk. We develop and prove a consistency result with respect to the empirical clustering risk. Since the optimization problem is combinatorial in nature, finding the global minimum is computationally intractable. In light of this fact, we propose a simple and novel algorithm that finds a local minimum by alternating the use of an adapted version of the k-means clustering algorithm between columns and rows. We evaluate and compare the performance of our algorithm to other related biclustering methods on both simulated data and real-world gene expression data sets. The results demonstrate that our algorithm is able to detect meaningful structures in the data and outperform other competing biclustering methods in various settings and situations.
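A hedged sketch of the alternating scheme: rows are clustered using features aggregated over the current column groups, then columns are clustered using features aggregated over the new row groups, repeating until the labels stabilize. The plain k-means alternation below stands in for the adapted variant used in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def alternating_bicluster(A, k_rows, k_cols, iters=20, seed=0):
    """Alternate k-means on rows and columns of matrix A until labels stabilize."""
    rng = np.random.default_rng(seed)
    col_labels = rng.permutation(np.arange(A.shape[1]) % k_cols)   # every column group starts nonempty
    row_labels = np.zeros(A.shape[0], dtype=int)
    for _ in range(iters):
        # represent each row by its mean inside every column group, then cluster the rows
        row_feats = np.column_stack([A[:, col_labels == j].mean(1) for j in range(k_cols)])
        new_rows = KMeans(n_clusters=k_rows, n_init=5, random_state=0).fit_predict(row_feats)
        # represent each column by its mean inside every row group, then cluster the columns
        col_feats = np.column_stack([A[new_rows == i].mean(0) for i in range(k_rows)])
        new_cols = KMeans(n_clusters=k_cols, n_init=5, random_state=0).fit_predict(col_feats)
        if np.array_equal(new_rows, row_labels) and np.array_equal(new_cols, col_labels):
            break
        row_labels, col_labels = new_rows, new_cols
    return row_labels, col_labels

# usage on a matrix with a planted 2x2 block structure
rng = np.random.default_rng(4)
A = rng.normal(size=(60, 40)) + np.outer(np.repeat([0, 3], 30), np.repeat([0, 1], 20))
rows, cols = alternating_bicluster(A, k_rows=2, k_cols=2)
```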
