High Dimensional Clustering with $r$-nets


Abstract in English

Clustering, a fundamental task in data science and machine learning, groups a set of objects in such a way that objects in the same cluster are closer to each other than to those in other clusters. In this paper, we consider a well-known structure, so-called $r$-nets, which rigorously captures the properties of clustering. We devise algorithms that improve the run-time of approximating $r$-nets in high-dimensional spaces with $\ell_1$ and $\ell_2$ metrics from $\tilde{O}(dn^{2-\Theta(\sqrt{\epsilon})})$ to $\tilde{O}(dn + n^{2-\alpha})$, where $\alpha = \Omega({\epsilon^{1/3}}/{\log(1/\epsilon)})$. These algorithms are also used to improve a framework that provides approximate solutions to other high dimensional distance problems. Using this framework, several important related problems can also be solved efficiently, e.g., $(1+\epsilon)$-approximate $k$th-nearest neighbor distance, $(4+\epsilon)$-approximate Min-Max clustering, $(4+\epsilon)$-approximate $k$-center clustering. In addition, we build an algorithm that $(1+\epsilon)$-approximates greedy permutations in time $\tilde{O}((dn + n^{2-\alpha}) \cdot \log{\Phi})$ where $\Phi$ is the spread of the input. This algorithm is used to $(2+\epsilon)$-approximate $k$-center with the same time complexity.
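For readers unfamiliar with the structure, the following is a minimal sketch of a naive exact $r$-net construction. It is not the subquadratic algorithm of the paper. It assumes the common definition, which the abstract itself does not spell out: a subset of the input in which any two net points are more than $r$ apart (packing) and every input point lies within distance $r$ of some net point (covering). The function names `l2`, `naive_r_net`, and `is_r_net` are hypothetical and chosen only for illustration. The greedy scan costs $O(dn^2)$ time in the worst case, the quadratic baseline against which subquadratic bounds are measured.

```python
import math


def l2(p, q):
    """Euclidean (l_2) distance between two points given as equal-length tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def naive_r_net(points, r, dist=l2):
    """Greedy r-net (assumed definition: packing > r, covering <= r).

    Scans the points once and keeps a point as a net point only if it is
    farther than r from every net point chosen so far. Worst case O(d n^2).
    """
    net = []
    for p in points:
        if all(dist(p, c) > r for c in net):
            net.append(p)
    return net


def is_r_net(points, net, r, dist=l2):
    """Check the packing and covering properties of a candidate net."""
    packing = all(
        dist(a, b) > r for i, a in enumerate(net) for b in net[i + 1:]
    )
    covering = all(any(dist(p, c) <= r for c in net) for p in points)
    return packing and covering


if __name__ == "__main__":
    sample = [(0.0, 0.0), (0.5, 0.0), (3.0, 0.0), (3.2, 0.1), (10.0, 10.0)]
    net = naive_r_net(sample, r=1.0)
    print(net, is_r_net(sample, net, r=1.0))
```

Each skipped point is within $r$ of an earlier net point, which gives covering, and each kept point is farther than $r$ from all earlier net points, which gives packing. Passing a different `dist` function (for example an $\ell_1$ distance) leaves the argument unchanged.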
