Entity resolution (ER) is one of the fundamental problems in data integration, where machine learning (ML) based classifiers often provide state-of-the-art results. Considerable human effort goes into feature engineering and training data creation. In this paper, we investigate a new problem: given a dataset D_T for ER with limited or no training data, is it possible to train a good ML classifier on D_T by reusing and adapting the training data of a dataset D_S from the same or a related domain? Our major contributions include (1) a distributed representation based approach to encode each tuple from diverse datasets into a standard feature space; (2) identification of common scenarios where the reuse of training data can be beneficial; and (3) five algorithms for handling each of the aforementioned scenarios. We have performed comprehensive experiments on 12 datasets from 5 different domains (publications, movies, songs, restaurants, and books). Our experiments show that our algorithms provide significant benefits, such as superior performance for a fixed training data size.
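To make contribution (1) more concrete, the following is a minimal sketch of how tuples with different schemas could be mapped into one fixed-length feature space. It is not the paper's implementation. The hash-based `token_vector` is a hypothetical stand-in for trained word embeddings, and the dimension `DIM`, the averaging step, and the absolute-difference pair features are all assumptions made for illustration.

```python
import hashlib

DIM = 16  # assumed embedding size for this sketch (sha256 yields 32 bytes)


def token_vector(token, dim=DIM):
    """Deterministic pseudo-embedding from a hash.

    Hypothetical stand-in for a trained word embedding lookup.
    """
    digest = hashlib.sha256(token.lower().encode("utf-8")).digest()
    # Map each of the first `dim` bytes from [0, 255] to [-1, 1].
    return [(b / 127.5) - 1.0 for b in digest[:dim]]


def encode_tuple(record, dim=DIM):
    """Average the token vectors of all attribute values.

    Attribute names are ignored, so records with different schemas
    land in the same feature space.
    """
    tokens = [tok for value in record.values() for tok in str(value).split()]
    if not tokens:
        return [0.0] * dim
    total = [0.0] * dim
    for tok in tokens:
        for i, x in enumerate(token_vector(tok, dim)):
            total[i] += x
    return [x / len(tokens) for x in total]


def pair_features(a, b):
    """Element-wise absolute difference of two tuple encodings."""
    va, vb = encode_tuple(a), encode_tuple(b)
    return [abs(x - y) for x, y in zip(va, vb)]


# Two records from datasets with different attribute names.
source_record = {"title": "Deep Learning", "year": "2016"}
target_record = {"name": "deep learning", "published": "2016"}
features = pair_features(source_record, target_record)
```

Because every record pair yields a vector of the same length regardless of schema, labeled pairs from D_S and pairs from D_T could in principle be fed to the same classifier, which is the property that reusing training data depends on.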
Entity resolution (ER), an important and common data cleaning problem, is about detecting data duplicate representations for the same external entities, and merging them into single representations. Relatively recently, declarative rules called match
Probabilistic databases play a preeminent role in the processing and management of uncertain data. Recently, many database research efforts have integrated probabilistic models into databases to support tasks such as information extraction and labeli
Entity resolution (ER) is the problem of identifying and merging records that refer to the same real-world entity. In many scenarios, raw records are stored in heterogeneous environments. Specifically, the schemas of records may differ from each ot
Accurate and efficient entity resolution is an open challenge of particular relevance to intelligence organisations that collect large datasets from disparate sources with differing levels of quality and standard. Starting from a first-principles for