Efficient Similarity Search in Dynamic Data Streams

Added by Chris Schwiegelshohn

Publication date: 2016

Field: Informatics Engineering

Language: English

Authors: Marc Bury, Chris Schwiegelshohn, Mara Sorella

Data Structures and Algorithms

Abstract

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs due to both time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $\frac{\vert A \cap B \vert}{\vert A \cup B \vert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015], who showed that any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we show that there exists a space-efficient summary maintaining a $(1 \pm \varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields an $\varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply, an LSH scheme in dynamic data streams. Our second and main contribution lies in the design of such an LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement, and to the best of our knowledge the first of its kind able to process deletions.
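To make the min-hashing idea in the abstract concrete, here is a minimal Python sketch of the classical insertion-only approach. It is not the paper's dynamic-stream scheme. The function names, the linear hash family modulo a Mersenne prime, and the restriction to non-negative integer items are all assumptions made for illustration.

```python
import random

_PRIME = (1 << 61) - 1  # assumed modulus for the illustrative hash family


def jaccard(a, b):
    """Exact Jaccard index |A ∩ B| / |A ∪ B| (defined as 1.0 for two empty sets)."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def make_hashes(k, seed=0):
    """Draw k hash functions x -> (a*x + b) mod p, stored as (a, b) pairs."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(0, _PRIME)) for _ in range(k)]


def minhash_signature(items, hashes):
    """For each hash function, keep the minimum hash value over all items."""
    sig = [_PRIME] * len(hashes)  # sentinel larger than any hash value
    for x in items:
        for i, (a, b) in enumerate(hashes):
            h = (a * x + b) % _PRIME
            if h < sig[i]:
                sig[i] = h
    return sig


def estimate_jaccard(sig_a, sig_b):
    """Fraction of coordinates on which the two signatures agree."""
    matches = sum(1 for u, v in zip(sig_a, sig_b) if u == v)
    return matches / len(sig_a)


if __name__ == "__main__":
    hashes = make_hashes(256, seed=1)
    A, B = set(range(0, 100)), set(range(50, 150))
    print("exact:", jaccard(A, B))
    print("estimate:", estimate_jaccard(minhash_signature(A, hashes),
                                        minhash_signature(B, hashes)))
```

An insertion only needs a coordinate-wise minimum, so the signature is easy to maintain under additions. If the item attaining a minimum is deleted, the signature alone does not reveal the next-smallest hash value, which is the difficulty with deletions that the abstract points to.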

Clustering High Dimensional Dynamic Data Streams

Vladimir Braverman, Gereon Frahling, Harry Lang (2017)

We present data streaming algorithms for the $k$-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space $\{1, 2, \ldots, \Delta\}^d$. Our algorithms use $k \epsilon^{-2} \mathrm{poly}(d \log \Delta)$ space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of $k$ centers the cost of the coreset $(1+\epsilon)$-approximates the cost of the streamed point set. We also provide algorithms that guarantee only positive weights in the coreset with additional logarithmic factors in the space and time complexities. We can use this positively-weighted coreset to compute a $(1+\epsilon)$-approximation for the $k$-median problem by any efficient offline $k$-median algorithm. All previous algorithms for computing a $(1+\epsilon)$-approximation for the $k$-median problem over dynamic data streams required space and time exponential in $d$. Our algorithms can be generalized to metric spaces of bounded doubling dimension.
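The coreset guarantee above compares two costs for the same set of centers: the $k$-median cost of the full point set and the weighted cost of the coreset. The following Python sketch only evaluates that cost; it does not construct a coreset. The function name `kmedian_cost` and the toy data are hypothetical.

```python
import math


def kmedian_cost(points, centers, weights=None):
    """Sum over points of weight * Euclidean distance to the nearest center."""
    if weights is None:
        weights = [1.0] * len(points)  # unweighted point set
    return sum(w * min(math.dist(p, c) for c in centers)
               for p, w in zip(points, weights))


if __name__ == "__main__":
    points = [(0, 0), (2, 0), (10, 0)]
    centers = [(0, 0), (10, 0)]
    # Hand-picked weighted summary, used here only to show the comparison.
    summary, summary_weights = [(1, 0), (10, 0)], [2.0, 1.0]
    print(kmedian_cost(points, centers))
    print(kmedian_cost(summary, centers, summary_weights))
```

A coreset in the sense of the abstract would keep the two printed values within a $(1+\epsilon)$ factor of each other for every choice of $k$ centers, not just for the one shown.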

Data Structures and Algorithms

Sublinear Estimation of Weighted Matchings in Dynamic Data Streams

Marc Bury, Chris Schwiegelshohn (2015)

This paper presents an algorithm for estimating the weight of a maximum weighted matching by augmenting any estimation routine for the size of an unweighted matching. The algorithm is implementable in any streaming model including dynamic graph streams. We also give the first constant estimation for the maximum matching size in a dynamic graph stream for planar graphs (or any graph with bounded arboricity) using $\tilde{O}(n^{4/5})$ space which also extends to weighted matching. Using previous results by Kapralov, Khanna, and Sudan (2014), we obtain a $\mathrm{polylog}(n)$ approximation for general graphs using $\mathrm{polylog}(n)$ space in random order streams. In addition, we give a space lower bound of $\Omega(n^{1-\varepsilon})$ for any randomized algorithm estimating the size of a maximum matching up to a $1+O(\varepsilon)$ factor for adversarial streams.

Data Structures and Algorithms

Single Pass Spectral Sparsification in Dynamic Streams

Michael Kapralov, Yin Tat Lee, Cameron Musco (2014)

We present the first single pass algorithm for computing spectral sparsifiers of graphs in the dynamic semi-streaming model. Given a single pass over a stream containing insertions and deletions of edges to a graph G, our algorithm maintains a randomized linear sketch of the incidence matrix of G into dimension O((1/epsilon^2) n polylog(n)). Using this sketch, at any point, the algorithm can output a (1 +/- epsilon) spectral sparsifier for G with high probability. While O((1/epsilon^2) n polylog(n)) space algorithms are known for computing cut sparsifiers in dynamic streams [AGM12b, GKP12] and spectral sparsifiers in insertion-only streams [KL11], prior to our work, the best known single pass algorithm for maintaining spectral sparsifiers in dynamic streams required sketches of dimension Omega((1/epsilon^2) n^(5/3)) [AGM14]. To achieve our result, we show that, using a coarse sparsifier of G and a linear sketch of G's incidence matrix, it is possible to sample edges by effective resistance, obtaining a spectral sparsifier of arbitrary precision. Sampling from the sketch requires a novel application of $\ell_2/\ell_2$ sparse recovery, a natural extension of the $\ell_0$ methods used for cut sparsifiers in [AGM12b]. Recent work of [MP12] on row sampling for matrix approximation gives a recursive approach for obtaining the required coarse sparsifiers. Under certain restrictions, our approach also extends to the problem of maintaining a spectral approximation for a general matrix A^T A given a stream of updates to rows in A.

Data Structures and Algorithms

Correlation Clustering in Data Streams

Kook Jin Ahn, Graham Cormode, Sudipto Guha (2018)

Clustering is a fundamental tool for analyzing large data sets. A rich body of work has been devoted to designing data-stream algorithms for the relevant optimization problems such as $k$-center, $k$-median, and $k$-means. Such algorithms need to be both time and space efficient. In this paper, we address the problem of correlation clustering in the dynamic data stream model. The stream consists of updates to the edge weights of a graph on $n$ nodes and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters whereas the end-points of positive-weight edges are typically in the same cluster. We present polynomial-time, $O(n \cdot \mbox{polylog}~n)$-space approximation algorithms for natural problems that arise. We first develop data structures based on linear sketches that allow the quality of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problem. Unfortunately, the standard LP and SDP formulations are not obviously solvable in $O(n \cdot \mbox{polylog}~n)$-space. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
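One common way to measure the quality of a node-partition in correlation clustering is the total weight of disagreements. The Python sketch below computes it exactly from an explicit edge list, with no sketching involved. The function name, the dictionary-based input format, and the choice of this particular objective are assumptions made for illustration.

```python
def disagreement_cost(edges, cluster_of):
    """Total weight of edges that disagree with the partition.

    edges: dict mapping (u, v) to a signed weight.
    cluster_of: dict mapping each node to its cluster label.
    A positive edge disagrees when its endpoints are split;
    a negative edge disagrees when its endpoints share a cluster.
    """
    cost = 0.0
    for (u, v), w in edges.items():
        same = cluster_of[u] == cluster_of[v]
        if w > 0 and not same:
            cost += w
        elif w < 0 and same:
            cost += -w
    return cost


if __name__ == "__main__":
    edges = {(0, 1): 1.0, (1, 2): -2.0, (0, 2): 1.0}
    print(disagreement_cost(edges, {0: 0, 1: 0, 2: 0}))  # all nodes together
    print(disagreement_cost(edges, {0: 0, 1: 0, 2: 1}))  # node 2 separated
```

In the streaming setting described above, the edge weights arrive as updates and are not stored explicitly, so a quantity like this would have to be estimated from a compact summary.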

Data Structures and Algorithms

Dynamic Enumeration of Similarity Joins

Pankaj K. Agarwal, Xiao Hu, Stavros Sintos (2021)

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of $n$ points $A,B$ in $\mathbb{R}^d$, a metric $\phi(\cdot)$, and a distance threshold $r > 0$, report all pairs of points $(a, b) \in A \times B$ with $\phi(a,b) \le r$. Our goal is to store $A,B$ into a dynamic data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from $A$ or $B$. We propose several efficient data structures for answering similarity-join queries in low dimension. For exact enumeration of similarity join, we present near-linear-size data structures for $\ell_1, \ell_\infty$ metrics with $\log^{O(1)} n$ update time and delay. We show that such a data structure is not feasible for the $\ell_2$ metric for $d \ge 4$. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for $\ell_p$ metric, with $\log^{O(1)} n$ delay and update time. In high dimensions, we present an efficient data structure with worst-case delay-guarantee using locality sensitive hashing (LSH).
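As a point of reference for the problem statement above, the Python sketch below is a brute-force baseline that scans all pairs. It is not one of the paper's data structures and gives no delay guarantee. The function names and the default choice of the $\ell_\infty$ metric are assumptions made for illustration.

```python
from itertools import product


def l_inf(a, b):
    """l_infinity distance between two points given as equal-length tuples."""
    return max(abs(x - y) for x, y in zip(a, b))


def similarity_join(A, B, r, metric=l_inf):
    """Lazily yield every pair (a, b) in A x B with metric(a, b) <= r."""
    for a, b in product(A, B):
        if metric(a, b) <= r:
            yield (a, b)


if __name__ == "__main__":
    A = [(0, 0), (5, 5)]
    B = [(1, 1), (9, 9)]
    for pair in similarity_join(A, B, r=1):
        print(pair)
```

The baseline can spend time proportional to $|A| \cdot |B|$ between two reported pairs, which is the kind of unbounded delay that the data structures described above are designed to avoid.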

Data Structures and Algorithms, Databases