ترغب بنشر مسار تعليمي؟ اضغط هنا

On Variants of k-means Clustering

550   0   0.0 ( 0 )
 نشر من قبل Sayan Bandyapadhyay
 تاريخ النشر 2015
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

textit{Clustering problems} often arise in the fields like data mining, machine learning etc. to group a collection of objects into similar groups with respect to a similarity (or dissimilarity) measure. Among the clustering problems, specifically textit{$k$-means} clustering has got much attention from the researchers. Despite the fact that $k$-means is a very well studied problem its status in the plane is still an open problem. In particular, it is unknown whether it admits a PTAS in the plane. The best known approximation bound in polynomial time is $9+eps$. In this paper, we consider the following variant of $k$-means. Given a set $C$ of points in $mathcal{R}^d$ and a real $f > 0$, find a finite set $F$ of points in $mathcal{R}^d$ that minimizes the quantity $f*|F|+sum_{pin C} min_{q in F} {||p-q||}^2$. For any fixed dimension $d$, we design a local search PTAS for this problem. We also give a bi-criterion local search algorithm for $k$-means which uses $(1+eps)k$ centers and yields a solution whose cost is at most $(1+eps)$ times the cost of an optimal $k$-means solution. The algorithm runs in polynomial time for any fixed dimension. The contribution of this paper is two fold. On the one hand, we are being able to handle the square of distances in an elegant manner, which yields near optimal approximation bound. This leads us towards a better understanding of the $k$-means problem. On the other hand, our analysis of local search might also be useful for other geometric problems. This is important considering that very little is known about the local search method for geometric approximation.



قيم البحث

اقرأ أيضاً

In 2015, Driemel, Krivov{s}ija and Sohler introduced the $(k,ell)$-median problem for clustering polygonal curves under the Frechet distance. Given a set of input curves, the problem asks to find $k$ median curves of at most $ell$ vertices each that minimize the sum of Frechet distances over all input curves to their closest median curve. A major shortcoming of their algorithm is that the input curves are restricted to lie on the real line. In this paper, we present a randomized bicriteria-approximation algorithm that works for polygonal curves in $mathbb{R}^d$ and achieves approximation factor $(1+epsilon)$ with respect to the clustering costs. The algorithm has worst-case running-time linear in the number of curves, polynomial in the maximum number of vertices per curve, i.e. their complexity, and exponential in $d$, $ell$, $epsilon$ and $delta$, i.e., the failure probability. We achieve this result through a shortcutting lemma, which guarantees the existence of a polygonal curve with similar cost as an optimal median curve of complexity $ell$, but of complexity at most $2ell-2$, and whose vertices can be computed efficiently. We combine this lemma with the superset-sampling technique by Kumar et al. to derive our clustering result. In doing so, we describe and analyze a generalization of the algorithm by Ackermann et al., which may be of independent interest.
In this paper we consider two metric covering/clustering problems - textit{Minimum Cost Covering Problem} (MCC) and $k$-clustering. In the MCC problem, we are given two point sets $X$ (clients) and $Y$ (servers), and a metric on $X cup Y$. We would l ike to cover the clients by balls centered at the servers. The objective function to minimize is the sum of the $alpha$-th power of the radii of the balls. Here $alpha geq 1$ is a parameter of the problem (but not of a problem instance). MCC is closely related to the $k$-clustering problem. The main difference between $k$-clustering and MCC is that in $k$-clustering one needs to select $k$ balls to cover the clients. For any $eps > 0$, we describe quasi-polynomial time $(1 + eps)$ approximation algorithms for both of the problems. However, in case of $k$-clustering the algorithm uses $(1 + eps)k$ balls. Prior to our work, a $3^{alpha}$ and a ${c}^{alpha}$ approximation were achieved by polynomial-time algorithms for MCC and $k$-clustering, respectively, where $c > 1$ is an absolute constant. These two problems are thus interesting examples of metric covering/clustering problems that admit $(1 + eps)$-approximation (using $(1+eps)k$ balls in case of $k$-clustering), if one is willing to settle for quasi-polynomial time. In contrast, for the variant of MCC where $alpha$ is part of the input, we show under standard assumptions that no polynomial time algorithm can achieve an approximation factor better than $O(log |X|)$ for $alpha geq log |X|$.
We address the problem of simultaneously learning a k-means clustering and deep feature representation from unlabelled data, which is of interest due to the potential of deep k-means to outperform traditional two-step feature extraction and shallow-c lustering strategies. We achieve this by developing a gradient-estimator for the non-differentiable k-means objective via the Gumbel-Softmax reparameterisation trick. In contrast to previous attempts at deep clustering, our concrete k-means model can be optimised with respect to the canonical k-means objective and is easily trained end-to-end without resorting to alternating optimisation. We demonstrate the efficacy of our method on standard clustering benchmarks.
This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$ -means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k log n)$. When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.
In this paper, we show that the popular K-means clustering problem can equivalently be reformulated as a conic program of polynomial size. The arising convex optimization problem is NP-hard, but amenable to a tractable semidefinite programming (SDP) relaxation that is tighter than the current SDP relaxation schemes in the literature. In contrast to the existing schemes, our proposed SDP formulation gives rise to solutions that can be leveraged to identify the clusters. We devise a new approximation algorithm for K-means clustering that utilizes the improved formulation and empirically illustrate its superiority over the state-of-the-art solution schemes.
التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا