Do you want to publish a course? Click here

Dynamic Enumeration of Similarity Joins

76   0   0.0 ( 0 )
 Added by Stavros Sintos
 Publication date 2021
and research's language is English




Ask ChatGPT about the research

This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of $n$ points $A,B$ in $mathbb{R}^d$, a metric $phi(cdot)$, and a distance threshold $r > 0$, report all pairs of points $(a, b) in A times B$ with $phi(a,b) le r$. Our goal is to store $A,B$ into a dynamic data structure that, whenever asked, can enumerate all result pairs with worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from $A$ or $B$. We propose several efficient data structures for answering similarity-join queries in low dimension. For exact enumeration of similarity join, we present near-linear-size data structures for $ell_1, ell_infty$ metrics with $log^{O(1)} n$ update time and delay. We show that such a data structure is not feasible for the $ell_2$ metric for $d ge 4$. For approximate enumeration of similarity join, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for $ell_p$ metric, with $log^{O(1)} n$ delay and update time. In high dimensions, we present an efficient data structure with worst-case delay-guarantee using locality sensitive hashing (LSH).



rate research

Read More

The Jaccard index is an important similarity measure for item sets and Boolean data. On large datasets, an exact similarity computation is often infeasible for all item pairs both due to time and space constraints, giving rise to faster approximate methods. The algorithm of choice used to quickly compute the Jaccard index $frac{vert A cap B vert}{vert Acup Bvert}$ of two item sets $A$ and $B$ is usually a form of min-hashing. Most min-hashing schemes are maintainable in data streams processing only additions, but none are known to work when facing item-wise deletions. In this paper, we investigate scalable approximation algorithms for rational set similarities, a broad class of similarity measures including Jaccard. Motivated by a result of Chierichetti and Kumar [J. ACM 2015] who showed any rational set similarity $S$ admits a locality sensitive hashing (LSH) scheme if and only if the corresponding distance $1-S$ is a metric, we can show that there exists a space efficient summary maintaining a $(1pm varepsilon)$ multiplicative approximation to $1-S$ in dynamic data streams. This in turn also yields a $varepsilon$ additive approximation of the similarity. The existence of these approximations hints at, but does not directly imply a LSH scheme in dynamic data streams. Our second and main contribution now lies in the design of such a LSH scheme maintainable in dynamic data streams. The scheme is space efficient, easy to implement and to the best of our knowledge the first of its kind able to process deletions.
Encryption provides a method to protect data outsourced to a DBMS provider, e.g., in the cloud. However, performing database operations over encrypted data requires specialized encryption schemes that carefully balance security and performance. In this paper, we present a new encryption scheme that can efficiently perform equi-joins over encrypted data with better security than the state-of-the-art. In particular, our encryption scheme reduces the leakage to equality of rows that match a selection criterion and only reveals the transitive closure of the sum of the leakages of each query in a series of queries. Our encryption scheme is provable secure. We implemented our encryption scheme and evaluated it over a dataset from the TPC-H benchmark.
An enumeration kernel as defined by Creignou et al. [Theory Comput. Syst. 2017] for a parameterized enumeration problem consists of an algorithm that transforms each instance into one whose size is bounded by the parameter plus a solution-lifting algorithm that efficiently enumerates all solutions from the set of the solutions of the kernel. We propose to consider two n
A tree decomposition of a graph facilitates computations by grouping vertices into bags that are interconnected in an acyclic structure, hence their importance in a plethora of problems such as query evaluation over databases and inference over probabilistic graphical models. The relative benefit from different tree decompositions is measured by diverse (sometime complex) cost functions that vary from one application to another. For generic cost functions like width and fill-in, an optimal tree decomposition can be efficiently computed in some cases, notably when the number of minimal separators is bounded by a polynomial (due to Bouchitte and Todinca), we refer to this assumption as poly-MS. To cover the variety of cost functions in need, it has recently been proposed to devise algorithms for enumerating many decomposition candidates for applications to choose from using specialized, or even machine-learned, cost functions. We explore the ability to produce a large collection of high quality tree decompositions. We present the first algorithm for ranked enumeration of the proper (non-redundant) tree decompositions, or equivalently minimal triangulations, under a wide class of cost functions that substantially generalizes the above generic ones. On the theoretical side, we establish the guarantee of polynomial delay if poly-MS is assumed, or if we are interested in tree decompositions of a width bounded by a constant. We describe an experimental evaluation on graphs of various domains (including join queries, Bayesian networks, treewidth benchmarks and random), and explore both the applicability of the poly-MS assumption and the performance of our algorithm relative to the state of the art.
154 - Haoyu Zhang , Qin Zhang 2018
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds important applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted significant attention in the past two decades. However, all previous algorithms either cannot scale well to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper we propose a new algorithm for edit similarity joins using a novel string partition based approach. We show mathematically that with high probability our algorithm achieves a perfect accuracy, and runs in linear time plus a data-dependent verification step. Experiments on real world datasets show that our algorithm significantly outperforms the state-of-the-art algorithms for edit similarity joins, and achieves perfect accuracy on all the datasets that we have tested.
comments
Fetching comments Fetching comments
Sign in to be able to follow your search criteria
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا