Toward a view-based data cleaning architecture

222 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Toshiyuki Shimizu

تاريخ النشر 2019

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Toshiyuki Shimizu - Hiroki Omori - Masatoshi Yoshikawa

قواعد البيانات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.

قيم البحث

176 - Jose Picado , John Davis , Arash Termehchy 2020

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible cle

قواعد البيانات التعلم الآلي

Making View Update Strategies Programmable - Toward Controlling and Sharing Distributed Data -

63 - Yasuhito Asano , Soichiro Hidaka , Zhenjiang Hu 2018

Views are known mechanisms for controlling access of data and for sharing data of different schemas. Despite long and intensive research on views in both the database community and the programming language community, we are facing difficulties to use views in practice. The main reason is that we lack ways to directly describe view update strategies to deal with the inherent ambiguity of view updating. This paper aims to provide a new language-based approach to controlling and sharing distributed data based on views, and establish a software foundation for systematic construction of such data management systems. Our key observation is that a view should be defined through a view update strategy rather than a view definition. We show that Datalog can be used for specifying view update strategies whose unique view definition can be automatically derived, present a novel P2P-based programmable architecture for distributed data management where updatable views are fully utilized for controlling and sharing distributed data, and demonstrate its usefulness through the development of a privacy-preserving ride-sharing alliance system.

قواعد البيانات

Data Mining-based Materialized View and Index Selection in Data Warehouses

171 - Kamel Aouiche , Jer^ome Darmont 2007

Materialized views and indexes are physical structures for accelerating data access that are casually used in data warehouses. However, these data structures generate some maintenance overhead. They also share the same storage space. Most existing st udies about materialized view and index selection consider these structures separately. In this paper, we adopt the opposite stance and couple materialized view and index selection to take view-index interactions into account and achieve efficient storage space sharing. Candidate materialized views and indexes are selected through a data mining process. We also exploit cost models that evaluate the respective benefit of indexing and view materialization, and help select a relevant configuration of indexes and materialized views among the candidates. Experimental results show that our strategy performs better than an independent selection of materialized views and indexes.

قواعد البيانات

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

119 - Ga Young Lee , Lubna Alzamil , Bakhtiyar Doskenov 2021

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done manually with d ata wrangling tools, or it can be completed automatically with a computer program. Data cleaning entails a slew of procedures that, once done, make the data ready for analysis. Given its significance in numerous fields, there is a growing interest in the development of efficient and effective data cleaning frameworks. In this survey, some of the most recent advancements of data cleaning approaches are examined for their effectiveness and the future research directions are suggested to close the gap in each of the methods.

قواعد البيانات

Discovery and Contextual Data Cleaning with Ontology Functional Dependencies

336 - Zheng Zheng , Longtao Zheng , Morteza Alipour Langouri 2021

Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when usedin data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cl eaning with Ontology Functional Dependencies(OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.

قواعد البيانات

سجل دخول لتتمكن من نشر تعليقات

التعليقات

جاري جلب التعليقات

سجل دخول لتتمكن من متابعة معايير البحث التي قمت باختيارها

جامعة إيبلا الخاصة

تفاصيل إضافية المزيد من الجامعات

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Toward a view-based data cleaning architecture

اسأل ChatGPT حول البحث

ﻻ يوجد ملخص باللغة العربية

اقرأ أيضاً