No Arabic abstract
Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(ncdot mbox{polylog}~n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the quality of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(ncdot mbox{polylog}~n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space ${1, 2, ldots Delta}^d$. Our algorithms use $k epsilon^{-2} poly(d log Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+epsilon)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+epsilon)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.
The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $frac{vert A cap B vert}{vert Acup Bvert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015] who showed any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we can show that there exists a space efficient summary maintaining a $(1pm varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields a $varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply a LSH scheme in dynamic data streams. Our second and main contribution now lies in the design of such a LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement and to the best of our knowledge the first of its kind able to process deletions.
Recently, Hierarchical Clustering (HC) has been considered through the lens of optimization. In particular, two maximization objectives have been defined. Moseley and Wang defined the emph{Revenue} objective to handle similarity information given by a weighted graph on the data points (w.l.o.g., $[0,1]$ weights), while Cohen-Addad et al. defined the emph{Dissimilarity} objective to handle dissimilarity information. In this paper, we prove structural lemmas for both objectives allowing us to convert any HC tree to a tree with constant number of internal nodes while incurring an arbitrarily small loss in each objective. Although the best-known approximations are 0.585 and 0.667 respectively, using our lemmas we obtain approximations arbitrarily close to 1, if not all weights are small (i.e., there exist constants $epsilon, delta$ such that the fraction of weights smaller than $delta$, is at most $1 - epsilon$); such instances encompass many metric-based similarity instances, thereby improving upon prior work. Finally, we introduce Hierarchical Correlation Clustering (HCC) to handle instances that contain similarity and dissimilarity information simultaneously. For HCC, we provide an approximation of 0.4767 and for complementary similarity/dissimilarity weights (analogous to $+/-$ correlation clustering), we again present nearly-optimal approximations.
Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem on graphs using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental analysis of our techniques.
In Bipartite Correlation Clustering (BCC) we are given a complete bipartite graph $G$ with `+ and `- edges, and we seek a vertex clustering that maximizes the number of agreements: the number of all `+ edges within clusters plus all `- edges cut across clusters. BCC is known to be NP-hard. We present a novel approximation algorithm for $k$-BCC, a variant of BCC with an upper bound $k$ on the number of clusters. Our algorithm outputs a $k$-clustering that provably achieves a number of agreements within a multiplicative ${(1-delta)}$-factor from the optimal, for any desired accuracy $delta$. It relies on solving a combinatorially constrained bilinear maximization on the bi-adjacency matrix of $G$. It runs in time exponential in $k$ and $delta^{-1}$, but linear in the size of the input. Further, we show that, in the (unconstrained) BCC setting, an ${(1-delta)}$-approximation can be achieved by $O(delta^{-1})$ clusters regardless of the size of the graph. In turn, our $k$-BCC algorithm implies an Efficient PTAS for the BCC objective of maximizing agreements.