
Auto-FuzzyJoin: Auto-Program Fuzzy Similarity Joins Without Labeled Examples

Added by Yeye He
Publication date: 2021
Language: English





Fuzzy similarity join is an important database operator widely used in practice. So far the research community has focused exclusively on optimizing fuzzy-join \textit{scalability}. However, practitioners today also struggle to optimize fuzzy-join \textit{quality}, because they face a daunting space of parameters (e.g., distance functions, distance thresholds, tokenization options, etc.), and often have to resort to a manual trial-and-error approach to program these parameters in order to optimize fuzzy-join quality. This key challenge of automatically generating high-quality fuzzy-join programs has received surprisingly little attention thus far. In this work, we study the problem of auto-programming fuzzy joins. Leveraging a geometric interpretation of distance functions, we develop an unsupervised \textsc{Auto-FuzzyJoin} framework that can infer suitable fuzzy-join programs on given input tables, without requiring explicit human input such as labeled training data. Using \textsc{Auto-FuzzyJoin}, users only need to provide two input tables $L$ and $R$, and a desired precision target $\tau$ (say 0.9). \textsc{Auto-FuzzyJoin} leverages the fact that one of the inputs is a reference table to automatically program fuzzy joins that meet the precision target $\tau$ in expectation, while maximizing fuzzy-join recall (defined as the number of correctly joined records). Experiments on both existing benchmarks and a new benchmark with 50 fuzzy-join tasks created from Wikipedia data suggest that the proposed \textsc{Auto-FuzzyJoin} significantly outperforms existing unsupervised approaches, and is surprisingly competitive even against supervised approaches (e.g., Magellan and DeepMatcher) when 50% of ground-truth labels are used as training data.
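For intuition, the sketch below shows what a hypothetical fuzzy-join "program" looks like in Python: the tokenizer options, the distance function, and the distance threshold are exactly the kinds of parameters \textsc{Auto-FuzzyJoin} aims to program automatically. The function names, the Jaccard choice, and the defaults here are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch of a parameterized fuzzy join (illustrative only; not the
# paper's method). Auto-FuzzyJoin's job is to pick the tokenizer settings,
# distance function, and threshold automatically.

def tokenize(s, lower=True, ngram=None):
    """Tokenization options: case-folding, word tokens vs. character n-grams."""
    if lower:
        s = s.lower()
    if ngram:
        return {s[i:i + ngram] for i in range(max(1, len(s) - ngram + 1))}
    return set(s.split())

def jaccard_distance(a, b):
    """One of many possible distance functions over token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def fuzzy_join(L, R, threshold=0.5, **tok_opts):
    """Join reference table L with table R; emit pairs within the threshold."""
    for l in L:
        for r in R:
            if jaccard_distance(tokenize(l, **tok_opts),
                                tokenize(r, **tok_opts)) <= threshold:
                yield (l, r)

L = ["Microsoft Corporation", "Apple Inc."]
R = ["microsoft corp", "apple incorporated"]
# With 3-gram tokens and threshold 0.7, the two correct pairs match and the
# two cross pairs do not; a different parameter choice could ruin either.
print(list(fuzzy_join(L, R, threshold=0.7, ngram=3)))
```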



Related Research

Haoyu Zhang, Qin Zhang (2018)
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity join is a fundamental problem in databases, data mining, and bioinformatics. It finds important applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted significant attention in the past two decades. However, all previous algorithms either cannot scale well to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper we propose a new algorithm for edit similarity joins using a novel string-partition-based approach. We show mathematically that with high probability our algorithm achieves perfect accuracy, and runs in linear time plus a data-dependent verification step. Experiments on real-world datasets show that our algorithm significantly outperforms the state-of-the-art algorithms for edit similarity joins, and achieves perfect accuracy on all the datasets that we have tested.
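To make the problem concrete, here is a minimal brute-force edit similarity join, using the standard dynamic-programming edit distance. It is a sketch of the operation being computed, not the partition-based algorithm described above, which exists precisely to avoid this quadratic verification.

```python
# Naive edit similarity join: examine all O(n^2) pairs and verify each with
# a classic Levenshtein distance. Shown only to define the join semantics.

def edit_distance(s, t):
    """Levenshtein distance via row-by-row dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cs != ct)))   # substitution
        prev = curr
    return prev[-1]

def edit_similarity_join(strings, k):
    """Report all pairs of strings within edit distance k."""
    return [(s, t) for i, s in enumerate(strings)
            for t in strings[i + 1:]
            if edit_distance(s, t) <= k]

# Prints [('ACGTT', 'ACGTA')]: the only pair within one edit.
print(edit_similarity_join(["ACGTT", "ACGTA", "TTTTT"], k=1))
```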
With the wide development of databases in general and data warehouses in particular, it is important to reduce the tasks that a database administrator must perform manually. The aim of auto-administrative systems is to administer and adapt themselves automatically without loss (or even with a gain) in performance. The idea of using data mining techniques to extract useful knowledge for administration from the data themselves has existed for some years. However, little research has been carried out. This idea nevertheless remains a very promising approach, notably in the field of data warehousing, where queries are very heterogeneous and cannot be interpreted easily. The aim of this study is to search for a way of extracting useful knowledge from the stored data themselves to automatically apply performance optimization techniques, and more particularly indexing techniques. We have designed a tool that extracts frequent itemsets from a given workload to compute an index configuration that helps optimize data access time. The experiments we performed showed that the index configurations generated by our tool allowed performance gains of 15% to 25% on a test database and a test data warehouse.
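As a rough illustration of the workload-mining idea, the sketch below counts frequently co-accessed column sets in a hypothetical workload and reports them as candidate composite indexes. The workload format, thresholds, and function names are assumptions for illustration, not the cited tool.

```python
# Sketch: mine frequent column sets from a query workload and propose them
# as candidate (composite) indexes. Purely illustrative support counting.
from collections import Counter
from itertools import combinations

def candidate_indexes(workload, min_support=2, max_width=2):
    """workload: list of column sets touched by each query's predicates."""
    counts = Counter()
    for cols in workload:
        for width in range(1, max_width + 1):
            for combo in combinations(sorted(cols), width):
                counts[combo] += 1
    # Column sets that recur often enough become index candidates.
    return [(cols, n) for cols, n in counts.items() if n >= min_support]

workload = [{"customer_id", "order_date"},
            {"customer_id", "order_date"},
            {"customer_id"},
            {"product_id"}]
# Suggests ('customer_id',), ('order_date',), ('customer_id', 'order_date').
print(candidate_indexes(workload))
```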
Jie Song, Yeye He (2021)
Complex data pipelines are increasingly common in diverse applications such as BI reporting and ML modeling. These pipelines often recur regularly (e.g., daily or weekly), as BI reports need to be refreshed and ML models need to be retrained. However, it is widely reported that in complex production pipelines, upstream data feeds can change in unexpected ways, causing downstream applications to break silently in ways that are expensive to resolve. Data validation has thus become an important topic, as evidenced by notable recent efforts from Google and Amazon, where the objective is to catch data quality issues early as they arise in the pipelines. Our experience on production data suggests, however, that on string-valued data, these existing approaches yield high false-positive rates and frequently require human intervention. In this work, we develop a corpus-driven approach to auto-validate \emph{machine-generated data} by inferring suitable data-validation patterns that accurately describe the underlying data domain, which minimizes false positives while maximizing the data quality issues caught. Evaluations using production data from real data lakes suggest that Auto-Validate is substantially more effective than existing methods. Part of this technology ships as an Auto-Tag feature in Microsoft Azure Purview.
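The following toy sketch conveys the flavor of pattern-based validation: it generalizes each string to a coarse character-class pattern and flags values whose pattern never occurred in the training column. It is a drastically simplified stand-in under assumed names, not the Auto-Validate algorithm.

```python
# Toy pattern inference for data validation: digits -> \d, letters -> \w,
# everything else kept literally. Values with unseen patterns are flagged.
import re
from collections import Counter

def to_pattern(value):
    """Map a string to a coarse character-class pattern."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(r"\d")
        elif ch.isalpha():
            out.append(r"\w")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def learn_patterns(column):
    """Collect the patterns observed in a (clean) training column."""
    return Counter(to_pattern(v) for v in column)

def validate(column, patterns):
    """Return values whose pattern never occurred in the training data."""
    return [v for v in column if to_pattern(v) not in patterns]

train = ["2021-06-01", "2021-06-02", "2021-06-03"]
patterns = learn_patterns(train)
# Flags 'June 5, 2021' because its pattern differs from the learned one.
print(validate(["2021-06-04", "June 5, 2021"], patterns))
```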
Recent work has made significant progress in helping users to automate single data preparation steps, such as string transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). In this work, we propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators. We propose a novel by-target paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target, users provide input tables (e.g., csv or json files), and point us to a target table (e.g., an existing database table or BI dashboard) to demonstrate what the output from the desired pipeline would schematically look like. While the problem is seemingly underspecified, our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space and make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on a large number of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines with up to 10 steps.
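A minimal sketch of the constraint-based pruning insight, assuming a candidate output represented as a list of dictionaries: a synthesized pipeline whose output violates a key or functional dependency observed in the target can be discarded. The helper names below are hypothetical.

```python
# Sketch: prune pipeline candidates that violate constraints (FDs, keys)
# observed in the target table. Illustrative only; not Auto-Pipeline's code.

def satisfies_fd(rows, lhs, rhs):
    """Check the functional dependency lhs -> rhs over a list of dict rows."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        val = tuple(row[c] for c in rhs)
        if seen.setdefault(key, val) != val:
            return False
    return True

def is_key(rows, cols):
    """cols is a key iff cols functionally determine all other columns."""
    rest = [c for c in rows[0] if c not in cols]
    return satisfies_fd(rows, cols, rest)

candidate = [{"id": 1, "total": 10}, {"id": 2, "total": 10}]
print(is_key(candidate, ["id"]))      # True: keep this candidate
print(is_key(candidate, ["total"]))   # False: prune if the target keys on total
```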
This paper considers enumerating answers to similarity-join queries under dynamic updates: Given two sets of $n$ points $A, B$ in $\mathbb{R}^d$, a metric $\phi(\cdot)$, and a distance threshold $r > 0$, report all pairs of points $(a, b) \in A \times B$ with $\phi(a, b) \le r$. Our goal is to store $A, B$ in a dynamic data structure that, whenever asked, can enumerate all result pairs with a worst-case delay guarantee, i.e., the time between enumerating two consecutive pairs is bounded. Furthermore, the data structure can be efficiently updated when a point is inserted into or deleted from $A$ or $B$. We propose several efficient data structures for answering similarity-join queries in low dimensions. For exact enumeration of similarity joins, we present near-linear-size data structures for the $\ell_1, \ell_\infty$ metrics with $\log^{O(1)} n$ update time and delay. We show that such a data structure is not feasible for the $\ell_2$ metric for $d \ge 4$. For approximate enumeration of similarity joins, where the distance threshold is a soft constraint, we obtain a unified linear-size data structure for the $\ell_p$ metric, with $\log^{O(1)} n$ delay and update time. In high dimensions, we present an efficient data structure with a worst-case delay guarantee using locality-sensitive hashing (LSH).
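For the high-dimensional case, the sketch below shows only the basic LSH bucketing idea: points hashing to the same signature become candidate pairs. Random-hyperplane hashing (for angular distance) is an assumption chosen for brevity; the paper's data structures handle $\ell_p$ metrics, dynamic updates, and worst-case delay guarantees, none of which this toy code provides.

```python
# Rough sketch of LSH bucketing for a high-dimensional similarity join:
# points sharing a hash signature are candidate pairs, to be verified
# against the exact distance threshold afterwards.
import random

def hyperplane_hash(point, planes):
    """Signature = sign pattern of dot products with random hyperplanes."""
    return tuple(sum(p * w for p, w in zip(point, plane)) >= 0
                 for plane in planes)

def lsh_candidate_pairs(A, B, d, num_planes=8, seed=0):
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(num_planes)]
    buckets = {}
    for a in A:
        buckets.setdefault(hyperplane_hash(a, planes), []).append(a)
    for b in B:
        for a in buckets.get(hyperplane_hash(b, planes), []):
            yield (a, b)  # candidate pair; verify the exact distance next

A = [(1.0, 0.1), (0.0, 1.0)]
B = [(0.9, 0.2), (-1.0, 0.0)]
# Nearby directions tend to collide; far ones tend not to (probabilistic).
print(list(lsh_candidate_pairs(A, B, d=2)))
```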
