
The Bane of Low-Dimensionality Clustering

Published by: Arnaud de Mesmay
Publication date: 2017
Research field: Informatics engineering
Research language: English





In this paper, we give a conditional lower bound of $n^{\Omega(k)}$ on the running time for the classic k-median and k-means clustering objectives (where n is the size of the input), even in low-dimensional Euclidean space of dimension four, assuming the Exponential Time Hypothesis (ETH). We also consider k-median (and k-means) with penalties, where each point need not be assigned to a center, in which case it must pay a penalty, and extend our lower bound to at least three-dimensional Euclidean space. This stands in stark contrast to many other geometric problems, such as the traveling salesman problem or computing an independent set of unit spheres. While these problems benefit from the so-called (limited) blessing of dimensionality, as they can be solved in time $n^{O(k^{1-1/d})}$ or $2^{n^{1-1/d}}$ in d dimensions, our work shows that widely-used clustering objectives have a lower bound of $n^{\Omega(k)}$, even in dimension four. We complete the picture by considering the two-dimensional case: we show that there is no algorithm that solves the penalized version in time less than $n^{o(\sqrt{k})}$, and provide a matching upper bound of $n^{O(\sqrt{k})}$. The main tool we use to establish these lower bounds is the placement of points on the moment curve, which takes its inspiration from constructions of point sets yielding Delaunay complexes of high complexity.
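As a small illustration (not the paper's construction itself), the following Python sketch places points on the four-dimensional moment curve $\gamma(t) = (t, t^2, t^3, t^4)$, the tool named above, and evaluates the k-means objective for an arbitrary candidate set of centers; all parameter values are placeholders.

import numpy as np

def moment_curve_points(ts, d=4):
    # One point gamma(t) = (t, t^2, ..., t^d) per parameter value t.
    return np.array([[t ** j for j in range(1, d + 1)] for t in ts], dtype=float)

def k_means_cost(points, centers):
    # Sum over points of the squared distance to the nearest center.
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

points = moment_curve_points(np.linspace(0.0, 1.0, 12))    # 12 points on the curve
centers = moment_curve_points(np.array([0.2, 0.5, 0.8]))   # k = 3 candidate centers
print(k_means_cost(points, centers))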


Read also

We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde A}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only 'cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
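A minimal sketch of the Johnson-Lindenstrauss step mentioned at the end of this abstract: project the data points to roughly $O(\log k/\epsilon^2)$ dimensions with a random Gaussian matrix and run any k-means solver in the reduced space. The exact target dimension, the constants, and the use of scikit-learn's KMeans are illustrative assumptions, not the authors' procedure.

import numpy as np
from sklearn.cluster import KMeans

def jl_project(A, k, eps=0.5, seed=0):
    # Random Gaussian projection of the rows of A to m ~ log(k) / eps^2 dimensions.
    rng = np.random.default_rng(seed)
    m = max(1, int(np.ceil(np.log(max(k, 2)) / eps ** 2)))
    G = rng.normal(size=(A.shape[1], m)) / np.sqrt(m)
    return A @ G

A = np.random.default_rng(1).normal(size=(1000, 200))       # toy data matrix
labels = KMeans(n_clusters=5, n_init=10).fit_predict(jl_project(A, k=5))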
We study the classic $k$-median and $k$-means clustering objectives in the beyond-worst-case scenario. We consider three well-studied notions of structured data that aim at characterizing real-world inputs: Distribution Stability (introduced by Awasthi, Blum, and Sheffet, FOCS 2010), Spectral Separability (introduced by Kumar and Kannan, FOCS 2010), and Perturbation Resilience (introduced by Bilu and Linial, ICS 2010). We prove structural results showing that inputs satisfying at least one of these conditions are inherently local. Namely, for any such input, any local optimum is close, both in terms of structure and in terms of objective value, to the global optimum. As a corollary, we obtain that the widely-used Local Search algorithm has strong performance guarantees both for recovering the underlying optimal clustering and for obtaining a clustering of small cost. This is a significant step toward understanding the success of local search heuristics in clustering applications.
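For reference, a minimal sketch of the single-swap local search heuristic whose guarantees the result above concerns, here for $k$-median with centers restricted to input points; the random initialization and the simple stopping rule are illustrative assumptions, not taken from the paper.

import numpy as np

def k_median_cost(points, center_idx):
    # Sum over points of the distance to the nearest chosen center.
    d = np.linalg.norm(points[:, None, :] - points[center_idx][None, :, :], axis=2)
    return float(d.min(axis=1).sum())

def local_search_k_median(points, k, seed=0):
    rng = np.random.default_rng(seed)
    centers = list(rng.choice(len(points), size=k, replace=False))
    improved = True
    while improved:                      # stop at a local optimum under single swaps
        improved = False
        best = k_median_cost(points, centers)
        for i in range(k):
            for j in range(len(points)):
                if j in centers:
                    continue
                trial = centers[:i] + [j] + centers[i + 1:]
                cost = k_median_cost(points, trial)
                if cost < best:
                    centers, best, improved = trial, cost, True
    return centers

pts = np.random.default_rng(3).normal(size=(60, 2))
print(local_search_k_median(pts, k=3))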
We consider the problem of center-based clustering in low-dimensional Euclidean spaces under the perturbation stability assumption. An instance is $\alpha$-stable if the underlying optimal clustering continues to remain optimal even when all pairwise distances are arbitrarily perturbed by a factor of at most $\alpha$. Our main contribution is in presenting efficient exact algorithms for $\alpha$-stable clustering instances whose running times depend near-linearly on the size of the data set when $\alpha \ge 2 + \sqrt{3}$. For $k$-center and $k$-means problems, our algorithms also achieve polynomial dependence on the number of clusters, $k$, when $\alpha \geq 2 + \sqrt{3} + \epsilon$ for any constant $\epsilon > 0$ in any fixed dimension. For $k$-median, our algorithms have polynomial dependence on $k$ for $\alpha > 5$ in any fixed dimension; and for $\alpha \geq 2 + \sqrt{3}$ in two dimensions. Our algorithms are simple, and only require applying techniques such as local search or dynamic programming to a suitably modified metric space, combined with careful choice of data structures.
In this paper we present novel algorithms for several multidimensional data processing problems. We consider problems related to the computation of restricted clusters and of the diameter of a set of points using a new distance function. We also consider two string (1D data) processing problems, regarding an optimal encoding method and the computation of the number of occurrences of a substring within a string generated by a grammar. The algorithms have been thoroughly analyzed from a theoretical point of view and some of them have also been evaluated experimentally.
Given points $p_1, \dots, p_n$ in $\mathbb{R}^d$, how do we find a point $x$ which maximizes $\frac{1}{n} \sum_{i=1}^n e^{-\|p_i - x\|^2}$? In other words, how do we find the maximizing point, or mode, of a Gaussian kernel density estimate (KDE) centered at $p_1, \dots, p_n$? Given the power of KDEs in representing probability distributions and other continuous functions, the basic mode finding problem is widely applicable. However, it is poorly understood algorithmically. Few provable algorithms are known, so practitioners rely on heuristics like the mean-shift algorithm, which are not guaranteed to find a global optimum. We address this challenge by providing fast and provably accurate approximation algorithms for mode finding in both the low and high dimensional settings. For low dimension $d$, our main contribution is to reduce the mode finding problem to solving a small number of systems of polynomial inequalities. For high dimension $d$, we prove the first dimensionality reduction result for KDE mode finding, which allows for reduction to the low dimensional case. Our result leverages Johnson-Lindenstrauss random projection, Kirszbraun's classic extension theorem, and, perhaps surprisingly, the mean-shift heuristic for mode finding.
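Since this abstract contrasts its provable algorithms with the mean-shift heuristic, here is a minimal sketch of that heuristic for the Gaussian KDE above: each iterate moves to the average of the data points weighted by $e^{-\|p_i - x\|^2}$, which increases the KDE value but need not reach the global mode. The fixed iteration count and the starting point are illustrative assumptions.

import numpy as np

def kde_value(points, x):
    # f(x) = (1/n) * sum_i exp(-||p_i - x||^2)
    return float(np.exp(-np.sum((points - x) ** 2, axis=1)).mean())

def mean_shift(points, x0, iters=100):
    x = np.array(x0, dtype=float)
    for _ in range(iters):
        w = np.exp(-np.sum((points - x) ** 2, axis=1))   # Gaussian kernel weights
        x = (w[:, None] * points).sum(axis=0) / w.sum()  # shift to the weighted mean
    return x

pts = np.random.default_rng(2).normal(size=(500, 3))
mode = mean_shift(pts, x0=pts[0])
print(kde_value(pts, mode))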