
Reliable Distributed Clustering with Redundant Data Assignment

Posted by: Ankit Singh Rawat
Publication date: 2020
Research field: Informatics Engineering
Paper language: English





In this paper, we present distributed generalized clustering algorithms that can handle large-scale data across multiple machines in spite of straggling or unreliable machines. We propose a novel data assignment scheme that enables us to obtain global information about the entire data even when some machines fail to respond with the results of the assigned local computations. The assignment scheme leads to distributed algorithms with good approximation guarantees for a variety of clustering and dimensionality reduction problems.
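
To make the redundant-assignment idea concrete, here is a minimal sketch (hypothetical helper names, not the paper's actual construction): every data block is replicated on r machines in a cyclic pattern, so a coordinator can rebuild a summary of the entire data set from whichever machines respond. The local computation below is just per-block sums for a global mean; the paper's algorithms instead produce clustering and dimensionality-reduction summaries with approximation guarantees.

```python
import numpy as np

def redundant_assignment(num_blocks, num_machines, r):
    """Cyclically assign each data block to r distinct machines."""
    assignment = {m: [] for m in range(num_machines)}
    for b in range(num_blocks):
        for j in range(r):
            assignment[(b + j) % num_machines].append(b)
    return assignment

def local_summary(blocks, data_blocks):
    """Toy local computation: per-block sums and sizes (enough for a global mean)."""
    return {b: (data_blocks[b].sum(axis=0), len(data_blocks[b])) for b in blocks}

def aggregate(responses):
    """Merge summaries from responsive machines, counting each block exactly once."""
    seen = {}
    for summary in responses:
        seen.update(summary)
    total = sum(s for s, _ in seen.values())
    count = sum(n for _, n in seen.values())
    return total / count, sorted(seen)

# Toy run: 12 blocks, 6 machines, replication factor 3, two straggling machines.
rng = np.random.default_rng(0)
data_blocks = [rng.normal(size=(20, 2)) for _ in range(12)]
assignment = redundant_assignment(num_blocks=12, num_machines=6, r=3)
stragglers = {1, 4}  # these machines never respond
responses = [local_summary(blocks, data_blocks)
             for m, blocks in assignment.items() if m not in stragglers]
mean, covered = aggregate(responses)
print("blocks covered despite stragglers:", covered)  # all 12 blocks
print("global mean estimate:", mean)
```

With replication factor r, any set of fewer than r straggling machines still leaves every block covered by at least one responsive machine, which is what lets the coordinator recover exact global information in this toy setting.
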




Read also

This paper introduces a new resource allocation problem in distributed computing called distributed serving with mobile servers (DSMS). In DSMS, there are $k$ identical mobile servers residing at the processors of a network. At arbitrary points of time, any subset of processors can invoke one or more requests. To serve a request, one of the servers must move to the processor that invoked the request. Resource allocation is performed in a distributed manner since only the processor that invoked the request initially knows about it. All processors cooperate by passing messages to achieve correct resource allocation. They do this with the goal to minimize the communication cost. Routing servers in large-scale distributed systems requires a scalable location service. We introduce the distributed protocol GNN that solves the DSMS problem on overlay trees. We prove that GNN is starvation-free and correctly integrates locating the servers and synchronizing the concurrent access to servers despite asynchrony, even when the requests are invoked over time. Further, we analyze GNN for one-shot executions, i.e., all requests are invoked simultaneously. We prove that when running GNN on top of a special family of tree topologies, known as hierarchically well-separated trees (HSTs), we obtain a randomized distributed protocol with an expected competitive ratio of $O(\log n)$ on general network topologies with $n$ processors. From a technical point of view, our main result is that GNN optimally solves the DSMS problem on HSTs for one-shot executions, even if communication is asynchronous. Further, we present a lower bound of $\Omega(\max\{k, \log n/\log\log n\})$ on the competitive ratio for DSMS. The lower bound even holds when communication is synchronous and requests are invoked sequentially.
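
The cost model of a one-shot DSMS instance can be illustrated with a small toy sketch (hypothetical names; it brute-forces the offline optimum on a tiny tree rather than implementing the distributed GNN protocol). For simplicity the sketch assumes each request is served by a distinct server: servers sit at the nodes of a tree, the simultaneous requests are matched to servers, and the cost is the total distance the servers travel.

```python
from collections import deque
from itertools import permutations

# A small overlay tree over processors 0..6, given as an adjacency list (unit-cost edges).
TREE = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}

def dist(u, v):
    """Hop distance between processors u and v on the tree (BFS)."""
    seen, queue = {u: 0}, deque([u])
    while queue:
        x = queue.popleft()
        if x == v:
            return seen[x]
        for y in TREE[x]:
            if y not in seen:
                seen[y] = seen[x] + 1
                queue.append(y)
    raise ValueError("disconnected")

def optimal_one_shot_cost(servers, requests):
    """Offline optimum: try every assignment of distinct servers to the requests."""
    return min(sum(dist(s, r) for s, r in zip(chosen, requests))
               for chosen in permutations(servers, len(requests)))

servers = [3, 4, 6]      # k = 3 mobile servers, at processors 3, 4 and 6
requests = [5, 0, 1]     # three requests invoked simultaneously
print("optimal one-shot serving cost:", optimal_one_shot_cost(servers, requests))
```

A distributed protocol such as GNN cannot compute this optimum directly, since each request is initially known only to the processor that invoked it; the competitive ratio measures how far the protocol's message-passing decisions are from this offline benchmark.
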
We show that the $(\mathit{degree}+1)$-list coloring problem can be solved deterministically in $O(D \cdot \log n \cdot \log^2\Delta)$ rounds in the CONGEST model, where $D$ is the diameter of the graph, $n$ the number of nodes, and $\Delta$ the maximum degree. Using the recent polylogarithmic-time deterministic network decomposition algorithm by Rozhoň and Ghaffari [STOC 2020], this implies the first efficient (i.e., $\mathrm{polylog}(n)$-time) deterministic CONGEST algorithm for the $(\Delta+1)$-coloring and the $(\mathit{degree}+1)$-list coloring problem. Previously, the best known algorithm required $2^{O(\sqrt{\log n})}$ rounds and was not based on network decompositions. Our techniques also lead to deterministic $(\mathit{degree}+1)$-list coloring algorithms for the congested clique and the massively parallel computation (MPC) model. For the congested clique, we obtain an algorithm with time complexity $O(\log\Delta \cdot \log\log\Delta)$; for the MPC model, we obtain algorithms with round complexity $O(\log^2\Delta)$ for the linear-memory regime and $O(\log^2\Delta + \log n)$ for the sublinear-memory regime.
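
As background for the problem statement, the reason a $(\mathit{degree}+1)$-list coloring always exists is the simple sequential greedy argument sketched below (a centralized toy, entirely unlike the paper's distributed CONGEST algorithm): since each node's palette has more colors than the node has neighbors, some palette color is always still free.

```python
def greedy_list_coloring(adj, palettes):
    """Sequential greedy: each node has a palette of size >= deg(node) + 1,
    so at least one palette color avoids all colors already taken by neighbors."""
    color = {}
    for v in adj:
        taken = {color[u] for u in adj[v] if u in color}
        color[v] = next(c for c in palettes[v] if c not in taken)
    return color

# A 4-cycle; every node gets a palette of size deg(v) + 1 = 3.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
palettes = {0: [1, 2, 3], 1: [2, 3, 4], 2: [1, 3, 5], 3: [4, 5, 6]}
coloring = greedy_list_coloring(adj, palettes)
assert all(coloring[u] != coloring[v] for u in adj for v in adj[u])
print(coloring)
```
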
Suppose we are given a set of $n$ elements to be clustered into $k$ (unknown) clusters, and an oracle/expert labeler that can interactively answer pair-wise queries of the form: do two elements $u$ and $v$ belong to the same cluster? The goal is to recover the optimum clustering by asking the minimum number of queries. In this paper, we initiate a rigorous theoretical study of this basic problem of the query complexity of interactive clustering, and provide strong information-theoretic lower bounds as well as nearly matching upper bounds. Most clustering problems come with a similarity matrix, which is used by an automated process to cluster similar points together. Our main contribution in this paper is to show the dramatic power of side information, i.e., a similarity matrix, in reducing the query complexity of clustering. A similarity matrix represents noisy pair-wise relationships, such as one computed by some function on attributes of the elements. A natural noisy model is one where similarity values are drawn independently from some arbitrary probability distribution $f_+$ when the underlying pair of elements belong to the same cluster, and from some $f_-$ otherwise. We show that, given such a similarity matrix, the query complexity reduces drastically from $\Theta(nk)$ (no similarity matrix) to $O\big(\frac{k^2\log n}{\mathcal{H}^2(f_+|f_-)}\big)$, where $\mathcal{H}^2$ denotes the squared Hellinger divergence. Moreover, this is also information-theoretically optimal within an $O(\log n)$ factor. Our algorithms are all efficient and parameter-free, i.e., they work without any knowledge of $k$, $f_+$, and $f_-$, and depend only logarithmically on $n$. Along the way, our work also reveals an intriguing connection to popular community detection models such as the stochastic block model, significantly generalizes them, and opens up many avenues for interesting future research.
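
To make the query model concrete, the sketch below (hypothetical helper names, with the oracle simulated from a hidden labelling) implements the trivial strategy that uses $\Theta(nk)$ queries and no side information: each new element is compared against one representative of every cluster discovered so far.

```python
def same_cluster(u, v, hidden_labels):
    """Oracle for pair-wise queries, simulated here from a hidden ground-truth labelling."""
    return hidden_labels[u] == hidden_labels[v]

def cluster_by_queries(elements, hidden_labels):
    """Baseline without a similarity matrix: at most k queries per element."""
    clusters, queries = [], 0
    for x in elements:
        for group in clusters:
            queries += 1
            if same_cluster(x, group[0], hidden_labels):
                group.append(x)
                break
        else:
            clusters.append([x])   # x opens a new cluster
    return clusters, queries

hidden = {i: i % 3 for i in range(12)}                    # unknown k = 3 clustering
clusters, queries = cluster_by_queries(range(12), hidden)
print(clusters)                 # recovers the three clusters exactly
print("queries used:", queries)
```

The paper's point is that noisy side information (the similarity matrix) lets one replace most of these queries, bringing the complexity down to roughly $k^2 \log n / \mathcal{H}^2(f_+|f_-)$.
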
In this paper we consider neighborhood load balancing in the context of selfish clients. We assume that a network of $n$ processors and $m$ tasks is given. The processors may have different speeds and the tasks may have different weights. Every task is controlled by a selfish user. The objective of the user is to allocate his/her task to a processor with minimum load. We revisit the concurrent probabilistic protocol introduced in [6], which works in sequential rounds. In each round, every task is allowed to query the load of one randomly chosen neighboring processor. If that load is smaller, the task migrates to that processor with a suitably chosen probability. Using techniques from spectral graph theory, we obtain upper bounds on the expected convergence time towards approximate and exact Nash equilibria that are significantly better than the previous results in [6]. We show results for uniform tasks on non-uniform processors, and for the general case where the tasks have different weights and the processors have different speeds. To the best of our knowledge, these are the first results for this general setting.
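
The round structure of that protocol is easy to simulate. The sketch below handles only uniform tasks on identical processors arranged in a ring, and the damped migration probability is an illustrative choice rather than necessarily the exact rule analysed in [6].

```python
import random

def one_round(task_location, neighbors, rng):
    """One concurrent round: every task queries one random neighboring processor
    and migrates there with some probability if that processor is less loaded."""
    loads = {p: 0 for p in neighbors}
    for p in task_location.values():
        loads[p] += 1
    moves = {}
    for task, p in task_location.items():
        q = rng.choice(neighbors[p])
        if loads[q] < loads[p]:
            # Damped migration probability (illustrative): the larger the load gap,
            # the likelier the move; damping limits simultaneous overshooting.
            if rng.random() < 1 - loads[q] / loads[p]:
                moves[task] = q
    task_location.update(moves)      # all migrations take effect simultaneously

rng = random.Random(0)
neighbors = {p: [(p - 1) % 8, (p + 1) % 8] for p in range(8)}   # ring of 8 processors
task_location = {t: rng.choice(range(8)) for t in range(64)}     # 64 uniform tasks
for _ in range(30):
    one_round(task_location, neighbors, rng)
loads = sorted(sum(1 for p in task_location.values() if p == q) for q in neighbors)
print("loads after 30 rounds:", loads)   # should drift toward the balanced value of 8
```
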