
A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Published by: Ga Young Lee
Publication date: 2021
Research field: Informatics Engineering
Language: English





Data cleaning is the initial stage of any machine learning project and one of the most critical steps in data analysis: it ensures that the dataset is free of incorrect or erroneous values. It can be performed manually with data wrangling tools or automated with a computer program. Data cleaning comprises a range of procedures that, once completed, make the data ready for analysis. Given its significance across numerous fields, there is growing interest in developing efficient and effective data cleaning frameworks. This survey examines some of the most recent data cleaning approaches, evaluates their effectiveness, and suggests future research directions to close the gaps in each method.
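As a concrete illustration (not taken from the survey itself), a few of the typical automated procedures the abstract alludes to can be scripted with a library such as pandas; the small dataset and the column names below are hypothetical:

import pandas as pd

# Hypothetical raw records with common problems: stray whitespace,
# a numeric column stored as strings, an exact duplicate, a missing value.
df = pd.DataFrame({
    "name": [" Alice", "Bob", "Bob", None],
    "age":  ["34", "29", "29", "forty"],
})

df["name"] = df["name"].str.strip()                    # normalize whitespace
df["age"] = pd.to_numeric(df["age"], errors="coerce")  # invalid entries become NaN
df = df.drop_duplicates()                              # remove exact duplicate rows
df = df.dropna()                                       # drop rows with missing values
print(df)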


Read also

Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is not often available, there may be numerous possible clean versions of the database.
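For illustration only (this is not the paper's method), two of the error types mentioned above, duplicates and violations of an integrity constraint, can be surfaced with a short pandas check; the table and the functional dependency zip -> city are hypothetical:

import pandas as pd

df = pd.DataFrame({
    "zip":  ["10001", "10001", "10001", "94105"],
    "city": ["New York", "Newyork", "New York", "San Francisco"],
})

# Exact duplicate rows
print(df[df.duplicated(keep=False)])

# Violations of the constraint zip -> city: a zip code that maps to
# more than one distinct city spelling is inconsistent.
counts = df.groupby("zip")["city"].nunique()
print(counts[counts > 1])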
Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.
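As a rough sketch of the general idea only (the paper's actual architecture is not reproduced here), a view can expose just the portion of a table a data manager needs to inspect, with corrections written back to the base table; the table, column, and values below are hypothetical:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE metadata (id INTEGER, unit TEXT)")
con.executemany("INSERT INTO metadata VALUES (?, ?)",
                [(1, "m"), (2, "meters"), (3, "metre")])

# A view restricting the manager's attention to suspect rows only
con.execute("CREATE VIEW suspect AS SELECT * FROM metadata WHERE unit <> 'm'")
print(con.execute("SELECT * FROM suspect").fetchall())  # rows to review

# The manager's correction is applied to the underlying table
con.execute("UPDATE metadata SET unit = 'm' WHERE unit IN ('meters', 'metre')")
print(con.execute("SELECT * FROM suspect").fetchall())  # now empty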
Functional Dependencies (FDs) define attribute relationships based on syntactic equality, and, when used in data cleaning, they erroneously label syntactically different but semantically equivalent values as errors. We explore dependency-based data cleaning with Ontology Functional Dependencies (OFDs), which express semantic attribute relationships such as synonyms and is-a hierarchies defined by an ontology. We study the theoretical foundations for OFDs, including sound and complete axioms and a linear-time inference procedure. We then propose an algorithm for discovering OFDs (exact ones and ones that hold with some exceptions) from data that uses the axioms to prune the search space. Towards enabling OFDs as data quality rules in practice, we study the problem of finding minimal repairs to a relation and ontology with respect to a set of OFDs. We demonstrate the effectiveness of our techniques on real datasets, and show that OFDs can significantly reduce the number of false positive errors in data cleaning techniques that rely on traditional FDs.
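To illustrate the intuition only (this is not the paper's discovery or repair algorithm), an ontology's synonym relation can be applied before testing equality, so that semantically equivalent values no longer count as FD violations; the synonym table and tuples below are hypothetical:

# Hypothetical synonym relation taken from an ontology
synonyms = {"acetaminophen": "paracetamol"}

def canonical(value):
    v = value.lower()
    return synonyms.get(v, v)

# Two tuples agreeing on the left-hand side of the dependency drug -> compound
rows = [("tylenol", "Acetaminophen"), ("tylenol", "Paracetamol")]

fd_violation  = len({rhs for _, rhs in rows}) > 1             # syntactic: True
ofd_violation = len({canonical(rhs) for _, rhs in rows}) > 1  # semantic: False
print(fd_violation, ofd_violation)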
This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on exploiting the relational data structure to improve the runtime performance of the learning task. The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
  2.1) Turn the ML problem into a DB problem (see the sketch below)
  2.2) Exploit structure of the data and problem
  2.3) Exploit engineering tools of a DB researcher
3) Avenues for future research
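As a toy example of idea 2.1 only (the tutorial's own material is not reproduced here), the sufficient statistics of a one-variable least-squares fit can be computed as SQL aggregates over a join inside the database, instead of exporting the joined table to the learner; the schema and data are hypothetical:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales  (store INTEGER, y REAL)")
con.execute("CREATE TABLE stores (store INTEGER, x REAL)")
con.executemany("INSERT INTO sales  VALUES (?, ?)", [(1, 2.0), (1, 3.0), (2, 5.0)])
con.executemany("INSERT INTO stores VALUES (?, ?)", [(1, 1.0), (2, 2.0)])

# One aggregate query replaces materializing and exporting the join
n, sx, sy, sxx, sxy = con.execute(
    "SELECT COUNT(*), SUM(x), SUM(y), SUM(x*x), SUM(x*y) "
    "FROM sales JOIN stores USING (store)").fetchone()

slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
print(slope)  # fitted inside the database; no feature matrix exported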
How to manage various data in a unified way is a significant research topic in the field of databases. To address this problem, researchers have proposed multi-model databases that support multiple data models on a uniform platform with a single unified query language. However, since relational databases are predominant in the current market, it is expensive to replace them. Besides, because the theories and technologies of RDBMSs have been refined over decades, it is hard to develop, within a few years, a multi-model database that matches existing RDBMSs in security, query optimization, transaction management, etc. In this paper, we reconsider employing relational databases to store and query multi-model data. Unfortunately, the mismatch between the complexity of multi-model data structures and the simplicity of flat relational tables makes this difficult. To address this challenge, we use reinforcement learning (RL) to learn a relational schema by interacting with an RDBMS. Instead of the classic Q-learning algorithm, we propose a variant, called Double Q-tables, which reduces the dimension of the original Q-table and improves learning efficiency. Experimental results show that our approach can learn a relational schema that outperforms the existing multi-model storage schema in terms of query time and space consumption.
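For orientation only: the classic tabular double Q-learning update (van Hasselt, 2010), the idea the paper's Double Q-tables variant builds on, looks like the sketch below. The paper's dimension-reduction scheme and its schema-design state/action space are not reproduced; the action names here are hypothetical:

import random
from collections import defaultdict

alpha, gamma = 0.1, 0.9
QA, QB = defaultdict(float), defaultdict(float)
ACTIONS = ["flatten_into_table", "split_into_table"]  # hypothetical schema actions

def double_q_update(s, a, r, s_next):
    # Randomly pick one table to update; select the best next action with it,
    # but evaluate that action with the *other* table to reduce maximization bias.
    if random.random() < 0.5:
        best = max(ACTIONS, key=lambda a2: QA[(s_next, a2)])
        QA[(s, a)] += alpha * (r + gamma * QB[(s_next, best)] - QA[(s, a)])
    else:
        best = max(ACTIONS, key=lambda a2: QB[(s_next, a2)])
        QB[(s, a)] += alpha * (r + gamma * QA[(s_next, best)] - QB[(s, a)])

double_q_update("root", "flatten_into_table", 1.0, "child")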