Clustering with Noisy Queries

Published by Arya Mazumdar

Publication date: 2017

Research field: Mathematical statistics; Informatics engineering

Language: English

Authors: Arya Mazumdar, Barna Saha

Machine learning; Data structures and algorithms; Information theory

93 - Arya Mazumdar, Barna Saha 2017

Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form "do two elements $u$ and $v$ belong to the same cluster?". The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of query complexity of interactive clustering, and provide strong information-theoretic lower bounds, as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information, a.k.a. the similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships, such as one computed by some function on attributes of the elements. A natural noisy model is one where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O\left(\frac{k^2 \log n}{\mathcal{H}^2(f_+ \| f_-)}\right)$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an $O(\log n)$ factor. Our algorithms are all efficient and parameter-free, i.e., they work without any knowledge of $k$, $f_+$ and $f_-$, and depend only logarithmically on $n$. Along the way, our work also reveals an intriguing connection to popular community detection models such as the stochastic block model, significantly generalizes them, and opens up many avenues for interesting future research.
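The $\Theta(nk)$ figure for the setting without a similarity matrix can be made concrete with a simple baseline. The following is a minimal sketch, not the authors' algorithm: it assumes a noiseless oracle, and the names `cluster_with_queries` and `same_cluster` are hypothetical. Each element is compared against one representative of every cluster found so far, so at most $n \cdot k$ queries are issued.

```python
from typing import Callable, Hashable, List, Sequence, Tuple


def cluster_with_queries(
    elements: Sequence[Hashable],
    same_cluster: Callable[[Hashable, Hashable], bool],
) -> Tuple[List[List[Hashable]], int]:
    """Cluster elements using only pair-wise oracle queries.

    Assumption: same_cluster(u, v) always answers correctly
    (no noise, no similarity matrix). Returns the clusters and
    the number of queries asked.
    """
    clusters: List[List[Hashable]] = []
    queries = 0
    for u in elements:
        for cluster in clusters:
            # One query against the cluster's first member (its representative).
            queries += 1
            if same_cluster(u, cluster[0]):
                cluster.append(u)
                break
        else:
            # u matched no existing cluster, so it starts a new one.
            clusters.append([u])
    return clusters, queries


if __name__ == "__main__":
    # Hypothetical ground truth used only to simulate the oracle.
    labels = {"a": 0, "b": 1, "c": 0, "d": 2, "e": 1}
    found, asked = cluster_with_queries(
        list(labels), lambda u, v: labels[u] == labels[v]
    )
    print(found, asked)
```

Because every element is compared with at most one representative per cluster, the query count never exceeds $n \cdot k$, and $k$ does not need to be known in advance.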

Machine learning; Data structures and algorithms; Information theory

Reliable Distributed Clustering with Redundant Data Assignment

69 - Venkata Gandikota, Arya Mazumdar, Ankit Singh Rawat 2020

In this paper, we present distributed generalized clustering algorithms that can handle large-scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations. The assignment scheme leads to distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.

Distributed, parallel, and cluster computing; Data structures and algorithms; Information theory

Balancing the Tradeoff Between Clustering Value and Interpretability

73 - Sandhya Saisubramanian, Sainyam Galhotra, Shlomo Zilberstein 2019

Graph clustering groups entities -- the vertices of a graph -- based on their similarity, typically using a complex distance function over a large number of features. Successful integration of clustering approaches in automated decision-support systems hinges on the interpretability of the resulting clusters. This paper addresses the problem of generating interpretable clusters, given features of interest that signify interpretability to an end-user, by optimizing interpretability in addition to common clustering objectives. We propose a $\beta$-interpretable clustering algorithm that ensures that at least a $\beta$ fraction of nodes in each cluster share the same feature value. The tunable parameter $\beta$ is user-specified. We also present a more efficient algorithm for scenarios with $\beta = 1$ and analyze the theoretical guarantees of the two algorithms. Finally, we empirically demonstrate the benefits of our approaches in generating interpretable clusters using four real-world datasets. The interpretability of the clusters is complemented by generating simple explanations denoting the feature values of the nodes in the clusters, using frequent pattern mining.
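The $\beta$-interpretability condition stated in the abstract can be checked directly on a given clustering. The sketch below only verifies the constraint and is not the paper's clustering algorithm. It assumes one feature value per node, and the names `is_beta_interpretable` and `feature` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def is_beta_interpretable(
    clusters: List[List[Hashable]],
    feature: Dict[Hashable, Hashable],
    beta: float,
) -> bool:
    """Return True if, in every non-empty cluster, at least a beta
    fraction of nodes share the same feature value.

    Assumption: each node has exactly one feature value, given by
    the `feature` mapping.
    """
    for cluster in clusters:
        if not cluster:
            continue  # An empty cluster imposes no constraint.
        counts = Counter(feature[v] for v in cluster)
        # Size of the largest group of nodes sharing one feature value.
        top = counts.most_common(1)[0][1]
        if top < beta * len(cluster):
            return False
    return True


if __name__ == "__main__":
    # Hypothetical toy data: node id -> feature value.
    feature = {1: "x", 2: "x", 3: "x", 4: "y", 5: "z", 6: "z"}
    clusters = [[1, 2, 3, 4], [5, 6]]
    print(is_beta_interpretable(clusters, feature, 0.75))
    print(is_beta_interpretable(clusters, feature, 1.0))
```

With $\beta = 1$ the check requires every cluster to be uniform in the feature, which corresponds to the special case the abstract singles out.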

Machine learning; Data structures and algorithms

Phase transitions and optimal algorithms in high-dimensional Gaussian mixture clustering

112 - Thibault Lesieur, Caterina De Bacco, Jess Banks 2016

We consider the problem of Gaussian mixture clustering in the high-dimensional limit where the data consists of $m$ points in $n$ dimensions, $n, m \rightarrow \infty$ and $\alpha = m/n$ stays finite. Using exact but non-rigorous methods from statistical physics, we determine the critical value of $\alpha$ and the distance between the clusters at which it becomes information-theoretically possible to reconstruct the membership into clusters better than chance. We also determine the accuracy achievable by the Bayes-optimal estimation algorithm. In particular, we find that when the number of clusters is sufficiently large, $r > 4 + 2\sqrt{\alpha}$, there is a gap between the threshold for information-theoretically optimal performance and the threshold at which known algorithms succeed.

Machine learning; Disordered systems and neural networks; Information theory

Parallel Range, Segment and Rectangle Queries with Augmented Maps

111 - Yihan Sun, Guy E. Blelloch 2018

The range, segment and rectangle query problems are fundamental problems in computational geometry, and have extensive applications in many domains. Despite the significant theoretical work on these problems, efficient implementations can be complicated. We know of very few practical implementations of the algorithms in parallel, and most implementations do not have tight theoretical bounds. We focus on simple and efficient parallel algorithms and implementations for these queries, which have tight worst-case bounds in theory and good parallel performance in practice. We propose to use a simple framework (the augmented map) to model the problem. Based on the augmented map interface, we develop both multi-level tree structures and sweepline algorithms supporting range, segment and rectangle queries in two dimensions. For the sweepline algorithms, we propose a parallel paradigm and show corresponding cost bounds. All of our data structures are work-efficient to build in theory and achieve a low parallel depth. The query time is almost linear in the output size. We have implemented all the data structures described in the paper using a parallel augmented map library. Based on the library, each data structure only requires about 100 lines of C++ code. We test their performance on large data sets (up to $10^8$ elements) and a machine with 72 cores (144 hyperthreads). The parallel construction achieves 32-68x speedup. Speedup numbers on queries are up to 126-fold. Our sequential implementation outperforms the CGAL library by at least 2x in both construction and queries. Our sequential implementation can be slightly slower than the R-tree in the Boost library in some cases (0.6-2.5x), but has significantly better query performance (1.6-1400x) than Boost.

Computational geometry; Data structures and algorithms
