No Arabic abstract
Bad data has a direct impact on 88% of companies, with the average company losing 12% of its revenue due to it. Duplicates - multiple but different representations of the same real-world entities - are among the main reasons for poor data quality. Therefore, finding and configuring the right deduplication solution is essential. Various data matching benchmarks exist which address this issue. However, many of them focus on the quality of matching results and neglect other important factors, such as business requirements. Additionally, they often do not specify how to explore benchmark results, which helps understand matching solution behavior. To address this gap between the mere counting of record pairs vs. a comprehensive means to evaluate data matching approaches, we present the benchmark platform Frost. Frost combines existing benchmarks, established quality metrics, a benchmark dimension for soft KPIs, and techniques to systematically explore and understand matching results. Thus, it can be used to compare multiple matching solutions regarding quality, usability, and economic aspects, but also to compare multiple runs of the same matching solution for understanding its behavior. Frost is implemented and published in the open-source application Snowman, which includes the visual exploration of matching results.
We consider the problem of duplicate detection in noisy and incomplete data: given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different, despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into unique entities, and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups typically works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art traditional baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We also investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.
Reproducibility is a fundamental requirement of the scientific process since it enables outcomes to be replicated and verified. Computational scientific experiments can benefit from improved reproducibility for many reasons, including validation of results and reuse by other scientists. However, designing reproducible experiments remains a challenge and hence the need for developing methodologies and tools that can support this process. Here, we propose a conceptual model for reproducibility to specify its main attributes and properties, along with a framework that allows for computational experiments to be findable, accessible, interoperable, and reusable. We present a case study in ecological niche modeling to demonstrate and evaluate the implementation of this framework.
We present a new approach to e-matching based on relational join; in particular, we apply recent database query execution techniques to guarantee worst-case optimal run time. Compared to the conventional backtracking approach that always searches the e-graph top down, our new relational e-matching approach can better exploit pattern structure by searching the e-graph according to an optimized query plan. We also establish the first data complexity result for e-matching, bounding run time as a function of the e-graph size and output size. We prototyped and evaluated our technique in the state-of-the-art egg e-graph framework. Compared to a conventional baseline, relational e-matching is simpler to implement and orders of magnitude faster in practice.
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is significant adoption of the Resource Description Framework (RDF) format for saving of web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ various mechanisms to implement critical components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is critical for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a crucial optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.