Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Clustering with Noisy Queries

329 0 0.0 ( 0 )

Download Cite

Added by Arya Mazumdar

Publication date 2017

fields Mathematical Statistics Informatics Engineering

and research's language is English

Authors Arya Mazumdar - Barna Saha

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

No English abstract

rate research

Query Complexity of Clustering with Side Information

93 - Arya Mazumdar , Barna Saha 2017

Suppose, we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form, do two elements $u$ and $v$ belong to the same cluster?. The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and provide strong information theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information aka similarity matrix on reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships such as one computed by some function on attributes of the elements. A natural noisy model is where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $Theta(nk)$ (no similarity matrix) to $O(frac{k^2log{n}}{cH^2(f_+|f_-)})$ where $cH^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretic optimal within an $O(log{n})$ factor. Our algorithms are all efficient, and parameter free, i.e., they work without any knowledge of $k, f_+$ and $f_-$, and only depend logarithmically with $n$. Along the way, our work also reveals intriguing connection to popular community detection models such as the {em stochastic block model}, significantly generalizes them, and opens up many venues for interesting future research.

Machine Learning Data Structures and Algorithms Information Theory

Reliable Distributed Clustering with Redundant Data Assignment

69 - Venkata Gandikota , Arya Mazumdar , Ankit Singh Rawat 2020

In this paper, we present distributed generalized clustering algorithms that can handle large scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations. The assignment scheme leads to distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.

Distributed Parallel and Cluster Computing Data Structures and Algorithms Information Theory

Balancing the Tradeoff Between Clustering Value and Interpretability

73 - Sandhya Saisubramanian , Sainyam Galhotra , Shlomo Zilberstein 2019

Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a $beta$-interpretable clustering algorithm that ensures that at least $beta$ fraction of nodes in each cluster share the same feature value. The tunable parameter $beta$ is user-specified. We also present a more efficient algorithm for scenarios with $beta!=!1$ and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.

Machine Learning Data Structures and Algorithms Machine Learning

Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering

112 - Thibault Lesieur , Caterina De Bacco , Jess Banks 2016

We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of $m$ points in $n$ dimensions, $n,m rightarrow infty$ and $alpha = m/n$ stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of $alpha$ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determine the accuracy achievable by the Bayes-optimal estimation algorithm. In particular, we find that when the number of clusters is sufficiently large, $r > 4 + 2 sqrt{alpha}$, there is a gap between the threshold for information-theoretically optimal performance and the threshold at which known algorithms succeed.

Machine Learning Disordered Systems and Neural Networks Information Theory

Parallel Range, Segment and Rectangle Queries with Augmented Maps

111 - Yihan Sun , Guy E. Blelloch 2018

The range, segment and rectangle query problems are fundamental problems in computational geometry, and have extensive applications in many domains. Despite the significant theoretical work on these problems, efficient implementations can be complicated. We know of very few practical implementations of the algorithms in parallel, and most implementations do not have tight theoretical bounds. We focus on simple and efficient parallel algorithms and implementations for these queries, which have tight worst-case bound in theory and good parallel performance in practice. We propose to use a simple framework (the augmented map) to model the problem. Based on the augmented map interface, we develop both multi-level tree structures and sweepline algorithms supporting range, segment and rectangle queries in two dimensions. For the sweepline algorithms, we propose a parallel paradigm and show corresponding cost bounds. All of our data structures are work-efficient to build in theory and achieve a low parallel depth. The query time is almost linear to the output size. We have implemented all the data structures described in the paper using a parallel augmented map library. Based on the library each data structure only requires about 100 lines of C++ code. We test their performance on large data sets (up to $10^8$ elements) and a machine with 72-cores (144 hyperthreads). The parallel construction achieves 32-68x speedup. Speedup numbers on queries are up to 126-fold. Our sequential implementation outperforms the CGAL library by at least 2x in both construction and queries. Our sequential implementation can be slightly slower than the R-tree in the Boost library in some cases (0.6-2.5x), but has significantly better query performance (1.6-1400x) than Boost.

Computational Geometry Data Structures and Algorithms

comments

Fetching comments

University of Aleppo

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Clustering with Noisy Queries

Ask ChatGPT about the research

No Arabic abstract

No English abstract

Read More