ترغب بنشر مسار تعليمي؟ اضغط هنا

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution

153   0   0.0 ( 0 )
 نشر من قبل Leopoldo Bertossi
 تاريخ النشر 2016
  مجال البحث الهندسة المعلوماتية
والبحث باللغة English




اسأل ChatGPT حول البحث

Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called matching dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating four components of ER: (a) Building a classifier for duplicate/non-duplicate record pairs built using machine learning (ML) techniques; (b) Use of MDs for supporting the blocking phase of ML; (c) Record merging on the basis of the classifier results; and (d) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for all activities related to data processing, and the specification and enforcement of MDs.



قيم البحث

اقرأ أيضاً

Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called match ing dependencies (MDs) have been proposed for specifying similarity conditions under which attribute values in database records are merged. In this work we show the process and the benefits of integrating three components of ER: (a) Classifiers for duplicate/non-duplicate record pairs built using machine learning (ML) techniques, (b) MDs for supporting both the blocking phase of ML and the merge itself; and (c) The use of the declarative language LogiQL -an extended form of Datalog supported by the LogicBlox platform- for data processing, and the specification and enforcement of MDs.
Entity resolution (ER) is the task of identifying all records in a database that refer to the same underlying entity, and are therefore duplicates of each other. Due to inherent ambiguity of data representation and poor data quality, ER is a challeng ing task for any automated process. As a remedy, human-powered ER via crowdsourcing has become popular in recent years. Using crowd to answer queries is costly and time consuming. Furthermore, crowd-answers can often be faulty. Therefore, crowd-based ER methods aim to minimize human participation without sacrificing the quality and use a computer generated similarity matrix actively. While, some of these methods perform well in practice, no theoretical analysis exists for them, and further their worst case performances do not reflect the experimental findings. This creates a disparity in the understanding of the popular heuristics for this problem. In this paper, we make the first attempt to close this gap. We provide a thorough analysis of the prominent heuristic algorithms for crowd-based ER. We justify experimental observations with our analysis and information theoretic lower bounds.
Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide the state-of-the-art results. Considerable human effort goes into feature engineering and training data creatio n. In this paper, we investigate a new problem: Given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classifier on D_T by reusing and adapting the training data of dataset D_S from same or related domain? Our major contributions include (1) a distributed representation based approach to encode each tuple from diverse datasets into a standard feature space; (2) identification of common scenarios where the reuse of training data can be beneficial; and (3) five algorithms for handling each of the aforementioned scenarios. We have performed comprehensive experiments on 12 datasets from 5 different domains (publications, movies, songs, restaurants, and books). Our experiments show that our algorithms provide significant benefits such as providing superior performance for a fixed training data size.
Unstructured enterprise data such as reports, manuals and guidelines often contain tables. The traditional way of integrating data from these tables is through a two-step process of table detection/extraction and mapping the table layouts to an appro priate schema. This can be an expensive process. In this paper we show that by using semantic technologies (RDF/SPARQL and database dependencies) paired with a simple but powerful way to transform tables with non-relational layouts, it is possible to offer query answering services over these tables with minimal manual work or domain-specific mappings. Our method enables users to exploit data in tables embedded in documents with little effort, not only for simple retrieval queries, but also for structured queries that require joining multiple interrelated tables.
Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeli ng. Many of these efforts are based on batch oriented inference which inhibits a realtime workflow. One important task is entity resolution (ER). ER is the process of determining records (mentions) in a database that correspond to the same real-world entity. Traditional pairwise ER methods can lead to inconsistencies and low accuracy due to localized decisions. Leading ER systems solve this problem by collectively resolving all records using a probabilistic graphical model and Markov chain Monte Carlo (MCMC) inference. However, for large datasets this is an extremely expensive process. One key observation is that, such exhaustive ER process incurs a huge up-front cost, which is wasteful in practice because most users are interested in only a small subset of entities. In this paper, we advocate pay-as-you-go entity resolution by developing a number of query-driven collective ER techniques. We introduce two classes of SQL queries that involve ER operators --- selection-driven ER and join-driven ER. We implement novel variations of the MCMC Metropolis Hastings algorithm to generate biased samples and selectivity-based scheduling algorithms to support the two classes of ER queries. Finally, we show that query-driven ER algorithms can converge and return results within minutes over a database populated with the extraction from a newswire dataset containing 71 million mentions.

الأسئلة المقترحة

التعليقات
جاري جلب التعليقات جاري جلب التعليقات
سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها
mircosoft-partner

هل ترغب بارسال اشعارات عن اخر التحديثات في شمرا-اكاديميا