
A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Published by: Ga Young Lee
Publication date: 2021
Research field: Informatics Engineering
Language: English





Data cleaning is the initial stage of any machine learning project and one of the most critical steps in data analysis: it ensures that the dataset is free of incorrect or erroneous values. It can be performed manually with data wrangling tools or automated with a computer program. Data cleaning comprises a range of procedures that, once completed, make the data ready for analysis. Given its significance across numerous fields, there is growing interest in developing efficient and effective data cleaning frameworks. This survey examines some of the most recent data cleaning approaches, evaluates their effectiveness, and suggests future research directions to close the gaps in each method.
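As a concrete illustration (not taken from the survey itself), a few of the typical automated procedures the abstract alludes to can be scripted with a library such as pandas; the small dataset and the column names below are hypothetical:

import pandas as pd

# Hypothetical raw records with common problems: stray whitespace,
# a numeric column stored as strings, an exact duplicate, a missing value.
df = pd.DataFrame({
    "name": [" Alice", "Bob", "Bob", None],
    "age":  ["34", "29", "29", "forty"],
})

df["name"] = df["name"].str.strip()                    # normalize whitespace
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # invalid entries become NaN
df = df.drop_duplicates()                              # remove exact duplicate rows
df = df.dropna()                                       # drop rows with missing values
print(df)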


Read also

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions of the database.
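For illustration only (this is not the paper's method), two of the error types mentioned above, duplicates and violations of an integrity constraint, can be surfaced with a short pandas check; the table and the functional dependency zip -> city are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94105"],
    "city": ["New York", "Newyork", "New York", "San Francisco"],
})

# Exact duplicate rows
print(df[df.duplicated(keep=False)])

# Violations of the constraint zip -> city: a zip code that maps to
# more than one distinct city spelling is inconsistent.
counts = df.groupby("zip")["city"].nunique()
print(counts[counts > 1])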
Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.
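As a rough sketch of the general idea only (the paper's actual architecture is not reproduced here), a view can expose just the portion of a table a data manager needs to inspect, with corrections written back to the base table; the table, column, and values below are hypothetical:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (id INTEGER, unit TEXT)")
con.executemany("INSERT INTO metadata VALUES (?, ?)",
                [(1, "m"), (2, "meters"), (3, "metre")])

# A view restricting the manager's attention to suspect rows only
con.execute("CREATE VIEW suspect AS SELECT * FROM metadata WHERE unit <> 'm'")
print(con.execute("SELECT * FROM suspect").fetchall())  # rows to review

# The manager's correction is applied to the underlying table
con.execute("UPDATE metadata SET unit = 'm' WHERE unit IN ('meters', 'metre')")
print(con.execute("SELECT * FROM suspect").fetchall())  # now empty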
Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
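To illustrate the intuition only (this is not the paper's discovery or repair algorithm), an ontology's synonym relation can be applied before testing equality, so that semantically equivalent values no longer count as FD violations; the synonym table and tuples below are hypothetical:

# Hypothetical synonym relation taken from an ontology
synonyms = {"acetaminophen": "paracetamol"}

def canonical(value):
    v = value.lower()
    return synonyms.get(v, v)

# Two tuples agreeing on the left-hand side of the dependency drug -> compound
rows = [("tylenol", "Acetaminophen"), ("tylenol", "Paracetamol")]

fd_violation  = len({rhs for _, rhs in rows}) > 1             # syntactic: True
ofd_violation = len({canonical(rhs) for _, rhs in rows}) > 1  # semantic: False
print(fd_violation, ofd_violation)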
This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on exploiting the relational data structure to improve the runtime performance of the learning task. The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
  2.1) Turn the ML problem into a DB problem (see the sketch below)
  2.2) Exploit structure of the data and problem
  2.3) Exploit engineering tools of a DB researcher
3) Avenues for future research
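As a toy example of idea 2.1 only (the tutorial's own material is not reproduced here), the sufficient statistics of a one-variable least-squares fit can be computed as SQL aggregates over a join inside the database, instead of exporting the joined table to the learner; the schema and data are hypothetical:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales  (store INTEGER, y REAL)")
con.execute("CREATE TABLE stores (store INTEGER, x REAL)")
con.executemany("INSERT INTO sales  VALUES (?, ?)", [(1, 2.0), (1, 3.0), (2, 5.0)])
con.executemany("INSERT INTO stores VALUES (?, ?)", [(1, 1.0), (2, 2.0)])

# One aggregate query replaces materializing and exporting the join
n, sx, sy, sxx, sxy = con.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) "
    "FROM sales JOIN stores USING (store)").fetchone()

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
print(slope)  # fitted inside the database; no feature matrix exported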
How to manage various data in a unified way is a significant research topic in the field of databases. To address this problem, researchers have proposed multi-model databases that support multiple data models on a uniform platform with a single unified query language. However, since relational databases are predominant in the current market, it is expensive to replace them. Besides, because the theories and technologies of RDBMSs have been refined over decades, it is hard to develop, within a few years, a multi-model database that matches existing RDBMSs in security, query optimization, transaction management, etc. In this paper, we reconsider employing relational databases to store and query multi-model data. Unfortunately, the mismatch between the complexity of multi-model data structures and the simplicity of flat relational tables makes this difficult. To address this challenge, we use reinforcement learning (RL) to learn a relational schema by interacting with an RDBMS. Instead of the classic Q-learning algorithm, we propose a variant, called Double Q-tables, which reduces the dimension of the original Q-table and improves learning efficiency. Experimental results show that our approach can learn a relational schema that outperforms the existing multi-model storage schema in terms of query time and space consumption.
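For orientation only: the classic tabular double Q-learning update (van Hasselt, 2010), the idea the paper's Double Q-tables variant builds on, looks like the sketch below. The paper's dimension-reduction scheme and its schema-design state/action space are not reproduced; the action names here are hypothetical:

import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9
QA, QB = defaultdict(float), defaultdict(float)
ACTIONS = ["flatten_into_table", "split_into_table"]  # hypothetical schema actions

def double_q_update(s, a, r, s_next):
    # Randomly pick one table to update; select the best next action with it,
    # but evaluate that action with the *other* table to reduce maximization bias.
    if random.random() < 0.5:
        best = max(ACTIONS, key=lambda a2: QA[(s_next, a2)])
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, best)] - QA[(s, a)])
    else:
        best = max(ACTIONS, key=lambda a2: QB[(s_next, a2)])
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, best)] - QB[(s, a)])

double_q_update("root", "flatten_into_table", 1.0, "child")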