On Efficient Low Distortion Ultrametric Embedding


Abstract in English

A classic problem in unsupervised learning and data analysis is to find simpler and easy-to-visualize representations of the data that preserve its essential properties. A widely-used method to preserve the underlying hierarchical structure of the data while reducing its complexity is to find an embedding of the data into a tree or an ultrametric. The most popular algorithms for this task are the classic linkage algorithms (single, average, or complete). However, these methods on a data set of $n$ points in $Omega(log n)$ dimensions exhibit a quite prohibitive running time of $Theta(n^2)$. In this paper, we provide a new algorithm which takes as input a set of points $P$ in $mathbb{R}^d$, and for every $cge 1$, runs in time $n^{1+frac{rho}{c^2}}$ (for some universal constant $rho>1$) to output an ultrametric $Delta$ such that for any two points $u,v$ in $P$, we have $Delta(u,v)$ is within a multiplicative factor of $5c$ to the distance between $u$ and $v$ in the best ultrametric representation of $P$. Here, the best ultrametric is the ultrametric $tildeDelta$ that minimizes the maximum distance distortion with respect to the $ell_2$ distance, namely that minimizes $underset{u,v in P}{max} frac{tildeDelta(u,v)}{|u-v|_2}$. We complement the above result by showing that under popular complexity theoretic assumptions, for every constant $varepsilon>0$, no algorithm with running time $n^{2-varepsilon}$ can distinguish between inputs in $ell_infty$-metric that admit isometric embedding and those that incur a distortion of $frac{3}{2}$. Finally, we present empirical evaluation on classic machine learning datasets and show that the output of our algorithm is comparable to the output of the linkage algorithms while achieving a much faster running time.

Download