
Dual-tree $k$-means with bounded iteration runtime

Posted by Ryan Curtin
Publication date: 2016
Research field: Informatics Engineering
Paper language: English
Author: Ryan R. Curtin





k-means is a widely used clustering algorithm, but for $k$ clusters and a dataset of size $N$, each iteration of Lloyd's algorithm costs $O(kN)$ time. Although there are existing techniques to accelerate single Lloyd iterations, none of these are tailored to the case of large $k$, which is increasingly common as dataset sizes grow. We propose a dual-tree algorithm that gives exactly the same results as standard $k$-means; when using cover trees, we use adaptive analysis techniques to, under some assumptions, bound the single-iteration runtime of the algorithm as $O(N + k \log k)$. To our knowledge these are the first sub-$O(kN)$ bounds for exact Lloyd iterations. We then show that this theoretically favorable algorithm performs competitively in practice, especially for large $N$ and $k$ in low dimensions. Further, the algorithm is tree-independent, so any type of tree may be used.
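To make the baseline cost concrete, the sketch below implements one exact Lloyd iteration in plain NumPy: every point is compared against all $k$ centroids, which is the $O(kN)$ work per iteration that the dual-tree algorithm avoids by pruning whole tree nodes at once. This is a minimal illustration of the baseline only, not the paper's dual-tree method; the function name `lloyd_iteration` and the synthetic data are our own choices.

```python
# One exact Lloyd step, to show where the O(kN) per-iteration cost comes from.
# This is NOT the paper's dual-tree algorithm, which computes the same
# assignments while pruning most point-to-centroid comparisons via tree bounds.
import numpy as np

def lloyd_iteration(X, centroids):
    """Assign every point to its nearest centroid (k distance evaluations per
    point -> O(kN)), then recompute each centroid as the mean of its points."""
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignments = dists.argmin(axis=1)

    new_centroids = centroids.copy()
    for j in range(centroids.shape[0]):
        members = X[assignments == j]
        if len(members) > 0:          # keep the old centroid if a cluster empties
            new_centroids[j] = members.mean(axis=0)
    return new_centroids, assignments

# Example usage on synthetic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
centroids = X[rng.choice(len(X), size=8, replace=False)]
for _ in range(10):
    centroids, labels = lloyd_iteration(X, centroids)
```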




Read also

By a well-known result, the treewidth of k-outerplanar graphs is at most 3k-1. This paper gives, besides a rigorous proof of this fact, an algorithmic implementation of the proof, i.e. it is shown that, given a k-outerplanar graph G, a tree decomposition of G of width at most 3k-1 can be found in O(kn) time and space. Similarly, a branch decomposition of a k-outerplanar graph of width at most 2k+1 can also be obtained in O(kn) time; this algorithm is also analyzed.
Dual-tree algorithms are a widely used class of branch-and-bound algorithms. Unfortunately, developing dual-tree algorithms for use with different trees and problems is often complex and burdensome. We introduce a four-part logical split: the tree, the traversal, the point-to-point base case, and the pruning rule. We provide a meta-algorithm which allows development of dual-tree algorithms in a tree-independent manner and easy extension to entirely new types of trees. Representations are provided for five common algorithms; for k-nearest neighbor search, this leads to a novel, tighter pruning bound. The meta-algorithm also allows straightforward extensions to massively parallel settings.
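As a concrete illustration of that four-part split, the sketch below instantiates it for exact 1-nearest-neighbor search: a simple kd-style tree, a dual depth-first traversal, a point-to-point base case, and a bounding-box pruning rule. All names here (`Node`, `build_tree`, `dual_depth_first`, `base_case`, `score`) are illustrative choices, not the meta-algorithm's actual API.

```python
# A hedged sketch of the tree / traversal / base case / pruning rule split,
# instantiated for 1-nearest-neighbor search. Names are illustrative only.
import numpy as np

class Node:
    """The 'tree' component: a kd-style node with a bounding box."""
    def __init__(self, data, indices):
        self.indices = indices
        self.lo = data[indices].min(axis=0)   # bounding box, used by the pruning rule
        self.hi = data[indices].max(axis=0)
        self.left = self.right = None

def build_tree(data, indices, leaf_size=16):
    node = Node(data, indices)
    if len(indices) > leaf_size:
        dim = int(np.argmax(node.hi - node.lo))            # split widest dimension
        order = indices[np.argsort(data[indices, dim])]
        mid = len(order) // 2
        node.left = build_tree(data, order[:mid], leaf_size)
        node.right = build_tree(data, order[mid:], leaf_size)
    return node

def min_box_dist(a, b):
    """Lower bound on the distance between any point in node a and any in node b."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return float(np.sqrt((gap ** 2).sum()))

def dual_depth_first(q_node, r_node, base_case, score):
    """The 'traversal': visit node pairs, skipping any pair the score prunes."""
    if score(q_node, r_node) == np.inf:       # pruning rule fired
        return
    if q_node.left is None and r_node.left is None:
        base_case(q_node, r_node)             # point-to-point work at the leaves
        return
    q_children = [q_node] if q_node.left is None else [q_node.left, q_node.right]
    r_children = [r_node] if r_node.left is None else [r_node.left, r_node.right]
    for qc in q_children:
        for rc in r_children:
            dual_depth_first(qc, rc, base_case, score)

# Instantiate the base case and pruning rule for 1-NN search.
rng = np.random.default_rng(1)
queries = rng.normal(size=(200, 3))
refs = rng.normal(size=(500, 3))
best = np.full(len(queries), np.inf)          # current best distance per query
best_idx = np.full(len(queries), -1)

def base_case(q_node, r_node):
    for qi in q_node.indices:
        d = np.sqrt(((refs[r_node.indices] - queries[qi]) ** 2).sum(axis=1))
        j = int(np.argmin(d))
        if d[j] < best[qi]:
            best[qi], best_idx[qi] = d[j], r_node.indices[j]

def score(q_node, r_node):
    # Prune if no reference in r_node can improve any query in q_node.
    lower = min_box_dist(q_node, r_node)
    return np.inf if lower > best[q_node.indices].max() else lower

q_tree = build_tree(queries, np.arange(len(queries)))
r_tree = build_tree(refs, np.arange(len(refs)))
dual_depth_first(q_tree, r_tree, base_case, score)

# Sanity check against brute force: the pruning never discards a true neighbor.
brute = np.sqrt(((queries[:, None, :] - refs[None, :, :]) ** 2).sum(axis=2)).min(axis=1)
assert np.allclose(best, brute)
```

Swapping the base case and score functions (while keeping the tree and traversal) is what lets the same skeleton express other dual-tree algorithms.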
We show how to approximate a data matrix $\mathbf{A}$ with a much smaller sketch $\mathbf{\tilde{A}}$ that can be used to solve a general class of constrained k-rank approximation problems to within $(1+\epsilon)$ error. Importantly, this class of problems includes $k$-means clustering and unconstrained low rank approximation (i.e. principal component analysis). By reducing data points to just $O(k)$ dimensions, our methods generically accelerate any exact, approximate, or heuristic algorithm for these ubiquitous problems. For $k$-means dimensionality reduction, we provide $(1+\epsilon)$ relative error results for many common sketching techniques, including random row projection, column selection, and approximate SVD. For approximate principal component analysis, we give a simple alternative to known algorithms that has applications in the streaming setting. Additionally, we extend recent work on column-based matrix reconstruction, giving column subsets that not only `cover' a good subspace for $\mathbf{A}$, but can be used directly to compute this subspace. Finally, for $k$-means clustering, we show how to achieve a $(9+\epsilon)$ approximation by Johnson-Lindenstrauss projecting data points to just $O(\log k/\epsilon^2)$ dimensions. This gives the first result that leverages the specific structure of $k$-means to achieve dimension independent of input size and sublinear in $k$.
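The sketch below illustrates the Johnson-Lindenstrauss route mentioned at the end of that abstract: project the points to roughly $O(\log k/\epsilon^2)$ dimensions with a Gaussian random matrix, cluster in the projected space with any k-means routine, and reuse the resulting assignment in the original space. The constant hidden in `target_dim` and the tiny Lloyd loop are placeholder choices, not the paper's exact construction.

```python
# Project -> cluster -> lift the assignment back. A hedged illustration of the
# JL-based dimensionality reduction for k-means, with placeholder constants.
import numpy as np

def jl_project(X, target_dim, rng):
    """Gaussian random projection: a standard Johnson-Lindenstrauss embedding."""
    G = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    return X @ G

def lloyd(X, k, iters, rng):
    """A plain k-means routine standing in for 'any k-means algorithm'."""
    C = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 50))                           # original high-dimensional data
k, eps = 20, 0.5
target_dim = max(1, int(np.ceil(np.log(k) / eps**2)))     # ~O(log k / eps^2); constant omitted

labels = lloyd(jl_project(X, target_dim, rng), k, iters=20, rng=rng)

# Evaluate the clustering cost in the ORIGINAL space using the induced assignment.
cost = sum(((X[labels == j] - X[labels == j].mean(axis=0)) ** 2).sum()
           for j in range(k) if np.any(labels == j))
print(f"projected to {target_dim} dims, k-means cost in original space: {cost:.1f}")
```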
This paper considers $k$-means clustering in the presence of noise. It is known that $k$-means clustering is highly sensitive to noise, and thus noise should be removed to obtain a quality solution. A popular formulation of this problem is called $k$-means clustering with outliers. The goal of $k$-means clustering with outliers is to discard up to a specified number $z$ of points as noise/outliers and then find a $k$-means solution on the remaining data. The problem has received significant attention, yet current algorithms with theoretical guarantees suffer from either high running time or inherent loss in the solution quality. The main contribution of this paper is two-fold. Firstly, we develop a simple greedy algorithm that has provably strong worst case guarantees. The greedy algorithm adds a simple preprocessing step to remove noise, which can be combined with any $k$-means clustering algorithm. This algorithm gives the first pseudo-approximation-preserving reduction from $k$-means with outliers to $k$-means without outliers. Secondly, we show how to construct a coreset of size $O(k \log n)$. When combined with our greedy algorithm, we obtain a scalable, near linear time algorithm. The theoretical contributions are verified experimentally by demonstrating that the algorithm quickly removes noise and obtains a high-quality clustering.
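The sketch below shows the preprocess-then-cluster pattern that abstract describes: discard up to $z$ suspected outliers, then run any ordinary k-means algorithm on what remains. The specific rule used here, dropping the $z$ points farthest from a rough clustering, is a generic heuristic for illustration only, not the paper's greedy algorithm or its $O(k \log n)$ coreset.

```python
# Preprocess (remove suspected noise), then cluster the remaining points with an
# off-the-shelf k-means routine. The removal rule below is a generic heuristic,
# NOT the paper's specific greedy algorithm.
import numpy as np
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(3)
inliers = rng.normal(size=(1000, 2))
outliers = rng.uniform(-20, 20, size=(30, 2))   # z = 30 planted noise points
X = np.vstack([inliers, outliers])
k, z = 5, 30

# Step 1 (preprocessing): rough clustering, then discard the z points farthest
# from their assigned rough centers.
rough_centers, rough_labels = kmeans2(X, k, iter=5, minit='points')
dist = np.linalg.norm(X - rough_centers[rough_labels], axis=1)
keep = np.argsort(dist)[: len(X) - z]

# Step 2: run any k-means algorithm on the cleaned data.
centers, labels = kmeans2(X[keep], k, iter=25, minit='points')
print("final centers:\n", centers)
```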
For a clustered graph, i.e., a graph whose vertex set is recursively partitioned into clusters, the C-Planarity Testing problem asks whether it is possible to find a planar embedding of the graph and a representation of each cluster as a region homeomorphic to a closed disk such that 1. the subgraph induced by each cluster is drawn in the interior of the corresponding disk, 2. each edge intersects any disk at most once, and 3. the nesting between clusters is reflected by the representation, i.e., child clusters are properly contained in their parent cluster. The computational complexity of this problem, whose study has been central to the theory of graph visualization since its introduction in 1995 [Qing-Wen Feng, Robert F. Cohen, and Peter Eades. Planarity for clustered graphs. ESA'95], has only recently been settled [Radoslav Fulek and Csaba D. Toth. Atomic Embeddability, Clustered Planarity, and Thickenability. To appear at SODA'20]. Before such a breakthrough, the complexity question was still unsolved even when the graph has a prescribed planar embedding, i.e., for embedded clustered graphs. We show that the C-Planarity Testing problem admits a single-exponential single-parameter FPT algorithm for embedded clustered graphs, when parameterized by the carving-width of the dual graph of the input. This is the first FPT algorithm for this long-standing open problem with respect to a single notable graph-width parameter. Moreover, in the general case, the polynomial dependency of our FPT algorithm is smaller than that of the algorithm by Fulek and Toth. To further strengthen the relevance of this result, we show that the C-Planarity Testing problem retains its computational complexity when parameterized by several other graph-width parameters, which may potentially lead to faster algorithms.
