Parallel Correlation Clustering on Big Graphs

Published by Dimitris S. Papailiopoulos

Publication date: 2015

Research field: Informatics Engineering

Paper language: English

Authors: Xinghao Pan - Dimitris Papailiopoulos - Samet Oymak

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Machine Learning

Given a similarity graph between items, correlation clustering (CC) groups similar items together and dissimilar ones apart. One of the most popular CC algorithms is KwikCluster: an algorithm that serially clusters neighborhoods of vertices, and obtains a 3-approximation ratio. Unfortunately, in practice KwikCluster requires a large number of clustering rounds, a potential bottleneck for large graphs. We present C4 and ClusterWild!, two algorithms for parallel correlation clustering that run in a polylogarithmic number of rounds and provably achieve nearly linear speedups. C4 uses concurrency control to enforce serializability of a parallel clustering process, and guarantees a 3-approximation ratio. ClusterWild! is a coordination-free algorithm that abandons consistency for the benefit of better scaling; this leads to a provably small loss in the 3-approximation ratio. We provide extensive experimental results for both algorithms, where we outperform the state of the art, both in terms of clustering accuracy and running time. We show that our algorithms can cluster billion-edge graphs in under 5 seconds on 32 cores, while achieving a 15x speedup.
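To make the serial baseline concrete, here is a minimal Python sketch of a KwikCluster-style pivot loop: visit vertices in random order, and let each still-unclustered vertex become a pivot that absorbs its unclustered similar neighbors. The function name `kwik_cluster`, the adjacency-dict input format, and the `seed` parameter are assumptions made for illustration; this is not the authors' code, and it does not show C4 or ClusterWild!.

```python
import random


def kwik_cluster(vertices, positive_neighbors, seed=0):
    """Serial pivot clustering sketch (illustrative, not the paper's code).

    vertices: iterable of hashable vertex ids.
    positive_neighbors: dict mapping a vertex to the set of vertices it is
        similar to (hypothetical input format chosen for this example).
    Returns a list of clusters, each a list of vertex ids.
    """
    rng = random.Random(seed)
    order = list(vertices)
    rng.shuffle(order)  # random pivot order

    cluster_of = {}  # vertex -> index of its cluster
    clusters = []
    for pivot in order:
        if pivot in cluster_of:
            continue  # already absorbed by an earlier pivot
        # The pivot takes every similar neighbor that is still unclustered.
        members = [pivot] + [
            u
            for u in positive_neighbors.get(pivot, ())
            if u != pivot and u not in cluster_of
        ]
        for u in members:
            cluster_of[u] = len(clusters)
        clusters.append(members)
    return clusters
```

Each pass through the loop body that creates a cluster is one serial clustering round, which is the sequential dependency the abstract identifies as a bottleneck on large graphs.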

Melanie Cambus, Davin Choo, Havu Miikonen (2021)

Identifying clusters of similar elements in a set is a common task in data analysis. With the immense growth of data and physical limitations on single processor speed, it is necessary to find efficient parallel algorithms for clustering tasks. In this paper, we study the problem of correlation clustering in bounded arboricity graphs with respect to the Massively Parallel Computation (MPC) model. More specifically, we are given a complete graph where the edges are either positive or negative, indicating whether pairs of vertices are similar or dissimilar. The task is to partition the vertices into clusters with as few disagreements as possible. That is, we want to minimize the number of positive inter-cluster edges and negative intra-cluster edges. Consider an input graph $G$ on $n$ vertices such that the positive edges induce a $\lambda$-arboric graph. Our main result is a 3-approximation ($\textit{in expectation}$) algorithm for correlation clustering that runs in $\mathcal{O}(\log \lambda \cdot \textrm{poly}(\log \log n))$ MPC rounds in the $\textit{strongly sublinear memory regime}$. This is obtained by combining structural properties of correlation clustering on bounded arboricity graphs with the insights of Fischer and Noever (SODA '18) on randomized greedy MIS and the $\texttt{PIVOT}$ algorithm of Ailon, Charikar, and Newman (STOC '05). Combined with known graph matching algorithms, our structural property also implies an exact algorithm and algorithms with $\textit{worst case}$ $(1+\epsilon)$-approximation guarantees in the special case of forests, where $\lambda=1$.
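The objective described above (positive inter-cluster edges plus negative intra-cluster edges) can be written down directly. The following Python sketch counts disagreements for a given clustering of a complete signed graph; the function name `count_disagreements` and the input format (a list of positive edges, with every other pair treated as negative) are assumptions for illustration. It checks all pairs, so it is only meant for small examples.

```python
from itertools import combinations


def count_disagreements(vertices, positive_edges, cluster_of):
    """Count correlation-clustering disagreements on a complete signed graph.

    vertices: list of vertex ids.
    positive_edges: iterable of (u, v) pairs marked similar; every other
        pair is treated as negative (assumed input convention).
    cluster_of: dict mapping each vertex to a cluster label.
    """
    positive = {frozenset(edge) for edge in positive_edges}
    cost = 0
    for u, v in combinations(vertices, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in positive
        # A positive pair that is split, or a negative pair that is kept
        # together, each contributes one disagreement.
        if is_positive != same_cluster:
            cost += 1
    return cost
```

For example, on the path 0-1-2-3 with positive edges only between consecutive vertices, the clustering {0, 1}, {2, 3} cuts a single positive edge and keeps no negative pair together, so its cost is 1.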

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms

Event Clustering & Event Series Characterization on Expected Frequency

Conrad M Albrecht, Marcus Freitag, Theodore G van Kessel (2020)

We present an efficient clustering algorithm applicable to one-dimensional data such as a series of timestamps. Given an expected frequency $\Delta T^{-1}$, we introduce an $\mathcal{O}(N)$-efficient method of characterizing $N$ events represented by an ordered series of timestamps $t_1,t_2,\dots,t_N$. In practice, the method proves useful, for example, to identify time intervals of missing data or to locate isolated events. Moreover, we define measures to quantify a series of events by varying $\Delta T$, for example to determine the quality of an Internet of Things service.
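The abstract does not spell out the method, so the following is only a hedged illustration of the general idea: a single linear pass over sorted timestamps that starts a new cluster whenever the gap to the previous event exceeds the expected period $\Delta T$. The function names `split_by_gap` and `missing_intervals`, and the gap rule itself, are assumptions for this sketch rather than the paper's algorithm.

```python
def split_by_gap(timestamps, delta_t):
    """Group sorted timestamps in one O(N) pass.

    A new cluster starts whenever the gap to the previous event is larger
    than delta_t (assumed splitting rule, chosen for illustration).
    """
    clusters = []
    for t in timestamps:
        if clusters and t - clusters[-1][-1] <= delta_t:
            clusters[-1].append(t)  # still within the expected period
        else:
            clusters.append([t])  # gap too large: open a new cluster
    return clusters


def missing_intervals(clusters):
    """Return the (end, start) gaps between consecutive clusters."""
    return [(a[-1], b[0]) for a, b in zip(clusters, clusters[1:])]
```

Under this rule, clusters of length one correspond to isolated events, and the gaps returned by `missing_intervals` are candidate intervals of missing data. Rerunning the split with different values of `delta_t` shows how the grouping changes with the assumed period.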

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms; Networking and Internet Architecture

Parallel K-Medoids++ Spatial Clustering Algorithm Based on MapReduce

Xia Yue, Wang Man, Jun Yue (2016)

Clustering analysis has received considerable attention in spatial data mining for several years. With the rapid development of geospatial information technologies, the size of spatial information data is growing exponentially, which makes clustering massive spatial data a challenging task. In order to improve the efficiency of spatial clustering for large-scale data, many researchers have proposed efficient parallel clustering algorithms. In this paper, a new K-Medoids++ spatial clustering algorithm based on MapReduce for clustering massive spatial data is proposed. The initialization algorithm to decrease the number of iterations is combined with the MapReduce framework. Comparative experiments conducted over different datasets and different numbers of nodes indicate that the proposed K-Medoids spatial clustering algorithm provides better efficiency than traditional K-Medoids and scales well while processing massive spatial data on commodity hardware.
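The abstract does not describe the initialization step in detail. As a hedged illustration only, the sketch below assumes it resembles k-means++-style seeding: the first medoid is drawn uniformly, and each further medoid is drawn with probability proportional to its squared distance from the nearest medoid chosen so far. The function name `seed_medoids`, the 2D point format, and the distance weighting are all assumptions, and the MapReduce distribution is not shown.

```python
import random


def seed_medoids(points, k, seed=0):
    """Pick k initial medoids from 2D points (illustrative seeding sketch).

    Assumes k-means++-style weighting by squared Euclidean distance to the
    nearest already-chosen medoid; this is not confirmed by the source.
    Assumes the points are distinct and k <= len(points).
    """
    rng = random.Random(seed)
    medoids = [rng.choice(points)]
    while len(medoids) < k:
        # Squared distance from each point to its nearest chosen medoid.
        weights = [
            min((px - mx) ** 2 + (py - my) ** 2 for mx, my in medoids)
            for px, py in points
        ]
        # Far-away points are more likely to be picked; points already
        # chosen have weight zero.
        medoids.append(rng.choices(points, weights=weights, k=1)[0])
    return medoids
```

Spreading the initial medoids apart in this way is a common approach to reducing the number of refinement iterations, which matches the stated goal of the initialization step.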

Distributed, Parallel, and Cluster Computing

Local Approximability of Minimum Dominating Set on Planar Graphs

Miikka Hilke, Christoph Lenzen, Jukka Suomela (2014)

We show that there is no deterministic local algorithm (constant-time distributed graph algorithm) that finds a $(7-\epsilon)$-approximation of a minimum dominating set on planar graphs, for any positive constant $\epsilon$. In prior work, the best lower bound on the approximation ratio has been $5-\epsilon$; there is also an upper bound of $52$.

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms

Compression and Sieve: Reducing Communication in Parallel Breadth First Search on Distributed Memory Systems

Huiwei Lv, Guangming Tan, Mingyu Chen (2012)

For parallel breadth first search (BFS) algorithms on large-scale distributed memory systems, communication often costs significantly more than arithmetic and limits the scalability of the algorithm. In this paper we sufficiently reduce the communication cost in distributed BFS by compressing and sieving the messages. First, we leverage a bitmap compression algorithm to reduce the size of messages before communication. Second, we propose a novel distributed directory algorithm, cross directory, to sieve the redundant data in messages. Experiments on a 6,144-core SMP cluster show our algorithm outperforms the baseline implementation in Graph500 by 2.2 times, reduces its communication time by 79.0%, and achieves a performance rate of 12.1 GTEPS (billion edge visits per second).
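The abstract does not name the specific bitmap compression scheme, so the sketch below only illustrates the basic idea of sending a set of frontier vertex ids as a fixed-size bitmap instead of a list of integers. The function names `encode_frontier` and `decode_frontier` and the one-bit-per-vertex layout are assumptions for illustration, and the cross directory algorithm is not shown.

```python
def encode_frontier(vertex_ids, num_vertices):
    """Pack vertex ids into a bitmap with one bit per vertex.

    Assumes ids lie in range(num_vertices). Duplicate ids collapse into a
    single set bit, so repeated entries cost nothing extra.
    """
    bitmap = bytearray((num_vertices + 7) // 8)
    for v in vertex_ids:
        bitmap[v >> 3] |= 1 << (v & 7)  # byte index v // 8, bit v % 8
    return bytes(bitmap)


def decode_frontier(bitmap):
    """Recover the sorted list of vertex ids whose bits are set."""
    return [
        8 * i + bit
        for i, byte in enumerate(bitmap)
        for bit in range(8)
        if (byte >> bit) & 1
    ]
```

A message of this form always takes `num_vertices / 8` bytes, so it pays off when the frontier is dense relative to the id range; for sparse frontiers a plain id list can be smaller.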

Distributed, Parallel, and Cluster Computing; Data Structures and Algorithms
