Learning Over Dirty Data Without Cleaning

Added by Jose Picado

Publication date: 2020

Field: Informatics Engineering

Language: English

Authors: Jose Picado, John Davis, Arash Termehchy

Databases, Machine Learning

Real-world datasets are dirty and contain many errors. Examples of these errors are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate models. Users have to spend a great deal of time and effort to repair data errors and create a clean database for learning. Moreover, as the information required to repair these errors is often unavailable, there may be numerous possible cle

Usable & Scalable Learning Over Relational Data With Automatic Language Bias

Jose Picado, Arash Termehchy, Sudhanshu Pathak (2017)

Relational databases are valuable resources for learning novel and interesting relations and concepts. To constrain the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Unfortunately, specifying the language bias is done via trial and error and is guided by the expert's intuitions. We propose AutoBias, a system that leverages information in the schema and content of the database to automatically induce the language bias used by popular relational learning systems. We show that AutoBias delivers the same accuracy as using manually-written language bias while imposing only a slight overhead on the running time of the learning algorithm.

Databases, Machine Learning

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

Ga Young Lee, Lubna Alzamil, Bakhtiyar Doskenov (2021)

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It ensures that the dataset is free of incorrect or erroneous data. It can be done manually with data wrangling tools, or it can be completed automatically with a computer program. Data cleaning entails a slew of procedures that, once done, make the data ready for analysis. Given its significance in numerous fields, there is a growing interest in the development of efficient and effective data cleaning frameworks. In this survey, some of the most recent advances in data cleaning approaches are examined for their effectiveness, and future research directions are suggested to close the gap in each of the methods.

Databases

Toward a view-based data cleaning architecture

Toshiyuki Shimizu, Hiroki Omori, Masatoshi Yoshikawa (2019)

Big data analysis has become an active area of study with the growth of machine learning techniques. To properly analyze data, it is important to maintain high-quality data. Thus, research on data cleaning is also important. It is difficult to automatically detect and correct inconsistent values for data requiring expert knowledge or data created by many contributors, such as integrated data from heterogeneous data sources. An example of such data is metadata for scientific datasets, which should be confirmed by data managers while handling the data. To support the efficient cleaning of data by data managers, we propose a data cleaning architecture in which data managers interactively browse and correct portions of data through views. In this paper, we explain our view-based data cleaning architecture and discuss some remaining issues.

Databases

Learning Models over Relational Data: A Brief Tutorial

Maximilian Schleich, Dan Olteanu, Mahmoud Abo-Khamis (2019)

This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
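The idea of turning the data-intensive part of learning into a batch of aggregates over the feature extraction query can be illustrated with a small sketch. This is not code from the tutorial: the relations `R(key, x)` and `S(key, y)`, the function names, and the single-feature model `y ≈ theta * x` are all assumptions chosen for illustration. The sketch computes the aggregates COUNT, SUM(x·x), and SUM(x·y) over the join of `R` and `S` in two ways: by materializing the join first, and by pushing the aggregates down to each relation and combining per-key partial results.

```python
from collections import defaultdict


def materialized_aggregates(R, S):
    """Baseline: materialize R(key, x) JOIN S(key, y), then aggregate."""
    joined = [(x, y) for (kr, x) in R for (ks, y) in S if kr == ks]
    count = len(joined)
    sum_xx = sum(x * x for x, _ in joined)
    sum_xy = sum(x * y for x, y in joined)
    return count, sum_xx, sum_xy


def pushed_down_aggregates(R, S):
    """Compute the same aggregates without materializing the join."""
    # Per-key partial aggregates over R: [count, SUM(x), SUM(x*x)].
    r_stats = defaultdict(lambda: [0, 0.0, 0.0])
    for key, x in R:
        stats = r_stats[key]
        stats[0] += 1
        stats[1] += x
        stats[2] += x * x

    # Per-key partial aggregates over S: [count, SUM(y)].
    s_stats = defaultdict(lambda: [0, 0.0])
    for key, y in S:
        stats = s_stats[key]
        stats[0] += 1
        stats[1] += y

    # Combine partial aggregates for keys present in both relations.
    count, sum_xx, sum_xy = 0, 0.0, 0.0
    for key, (r_count, r_sum_x, r_sum_xx) in r_stats.items():
        if key in s_stats:
            s_count, s_sum_y = s_stats[key]
            count += r_count * s_count
            sum_xx += r_sum_xx * s_count
            sum_xy += r_sum_x * s_sum_y
    return count, sum_xx, sum_xy


if __name__ == "__main__":
    R = [(1, 1.0), (1, 2.0), (2, 3.0)]
    S = [(1, 2.0), (1, 4.0), (2, 6.0), (3, 9.0)]
    count, sum_xx, sum_xy = pushed_down_aggregates(R, S)
    # Least-squares fit of y = theta * x over the (never materialized) join.
    theta = sum_xy / sum_xx
    print(count, sum_xx, sum_xy, theta)
```

The pushed-down version touches each input tuple once and never builds the join result, whose size can be much larger than the inputs when keys repeat. Real systems generalize this to many features and multi-way joins; the sketch only shows the two-relation, one-feature case.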

Databases

F-IVM: Learning over Fast-Evolving Relational Data

Milos Nikolic, Haozhe Zhang, Ahmet Kara (2020)

F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
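As a rough illustration of learning over data that changes, the following sketch keeps the sufficient statistics of a one-feature ridge regression up to date under inserts and deletes, so the model can be refreshed without rescanning the data. It is not F-IVM's implementation or API: the class `IncrementalRidge1D`, its methods, and the restriction to a single feature without an intercept are assumptions made for illustration only.

```python
class IncrementalRidge1D:
    """Ridge regression y ≈ theta * x maintained under tuple updates.

    Minimizes SUM((y - theta * x)^2) + lam * theta^2, whose closed-form
    solution is theta = SUM(x*y) / (SUM(x*x) + lam).
    """

    def __init__(self, lam):
        if lam <= 0:
            raise ValueError("lam must be positive")
        self.lam = lam
        self.sum_xx = 0.0  # running SUM(x * x)
        self.sum_xy = 0.0  # running SUM(x * y)

    def update(self, x, y, multiplicity=1):
        """Apply an insert (multiplicity > 0) or a delete (multiplicity < 0)."""
        self.sum_xx += multiplicity * x * x
        self.sum_xy += multiplicity * x * y

    def theta(self):
        """Return the current ridge solution from the maintained aggregates."""
        return self.sum_xy / (self.sum_xx + self.lam)


if __name__ == "__main__":
    model = IncrementalRidge1D(lam=1.0)
    model.update(1.0, 2.0)                    # insert (x=1, y=2)
    model.update(2.0, 4.0)                    # insert (x=2, y=4)
    print(model.theta())
    model.update(2.0, 4.0, multiplicity=-1)   # delete (x=2, y=4)
    print(model.theta())
```

Each update costs constant time because only two running sums change. Representing a delete as an update with negative multiplicity is a common convention in incremental view maintenance, adopted here as a design choice of the sketch.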

Databases
