No Arabic abstract
To date, the principal use case for schema matching research has been as a precursor for code generation, i.e., constructing mappings between schema elements with the end goal of data transfer. In this paper, we argue that schema matching plays valuable roles independent of mapping construction, especially as schemata grow to industrial scales. Specifically, in large enterprises human decision makers and planners are often the immediate consumer of information derived from schema matchers, instead of schema mapping tools. We list a set of real application areas illustrating this role for schema matching, and then present our experiences tackling a customer problem in one of these areas. We describe the matcher used, where the tool was effective, where it fell short, and our lessons learned about how well current schema matching technology is suited for use in large enterprises. Finally, we suggest a new agenda for schema matching research based on these experiences.
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide an empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
Learning novel concepts and relations from relational databases is an important problem with many applications in database systems and machine learning. Relational learning algorithms learn the definition of a new relation in terms of existing relations in the database. Nevertheless, the same data set may be represented under different schemas for various reasons, such as efficiency, data quality, and usability. Unfortunately, the output of current relational learning algorithms tends to vary quite substantially over the choice of schema, both in terms of learning accuracy and efficiency. This variation complicates their off-the-shelf application. In this paper, we introduce and formalize the property of schema independence of relational learning algorithms, and study both the theoretical and empirical dependence of existing algorithms on the common class of (de) composition schema transformations. We study both sample-based learning algorithms, which learn from sets of labeled examples, and query-based algorithms, which learn by asking queries to an oracle. We prove that current relational learning algorithms are generally not schema independent. For query-based learning algorithms we show that the (de) composition transformations influence their query complexity. We propose Castor, a sample-based relational learning algorithm that achieves schema independence by leveraging data dependencies. We support the theoretical results with an empirical study that demonstrates the schema dependence/independence of several algorithms on existing benchmark and real-world datasets under (de) compositions.
In data management, and in particular in data integration, data exchange, query optimization, and data privacy, the notion of view plays a central role. In several contexts, such as data integration, data mashups, and data warehousing, the need arises of designing views starting from a set of known correspondences between queries over different schemas. In this paper we deal with the issue of automating such a design process. We call this novel problem view synthesis from schema mappings: given a set of schema mappings, each relating a query over a source schema to a query over a target schema, automatically synthesize for each source a view over the target schema in such a way that for each mapping, the query over the source is a rewriting of the query over the target wrt the synthesized views. We study view synthesis from schema mappings both in the relational setting, where queries and views are (unions of) conjunctive queries, and in the semistructured data setting, where queries and views are (two-way) regular path queries, as well as unions of conjunctions thereof. We provide techniques and complexity upper bounds for each of these cases.
Data is the king in the age of AI. However data integration is often a laborious task that is hard to automate. Schema change is one significant obstacle to the automation of the end-to-end data integration process. Although there exist mechanisms such as query discovery and schema modification language to handle the problem, these approaches can only work with the assumption that the schema is maintained by a database. However, we observe diversified schema changes in heterogeneous data and open data, most of which has no schema defined. In this work, we propose to use deep learning to automatically deal with schema changes through a super cell representation and automatic injection of perturbations to the training data to make the model robust to schema changes. Our experimental results demonstrate that our proposed approach is effective for two real-world data integration scenarios: coronavirus data integration, and machine log integration.
Ontology-based data integration has been one of the practical methodologies for heterogeneous legacy database integrated service construction. However, it is neither efficient nor economical to build the cross-domain ontology on top of the schemas of each legacy database for the specific integration application than to reuse the existed ontologies. Then the question lies in whether the existed ontology is compatible with the cross-domain queries and with all the legacy systems. It is highly needed an effective criteria to evaluate the compatibility as it limits the upbound quality of the integrated services. This paper studies the semantic similarity of schemas from the aspect of properties. It provides a set of in-depth criteria, namely coverage and flexibility to evaluate the compatibility among the queries, the schemas and the existing ontology. The weights of classes are extended to make precise compatibility computation. The use of such criteria in the practical project verifies the applicability of our method.