ﻻ يوجد ملخص باللغة العربية
This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
F-IVM is a system for real-time analytics such as machine learning applications over training datasets defined by queries over fast-evolving relational databases. We will demonstrate F-IVM for three such applications: model selection, Chow-Liu trees, and ridge linear regression.
This tutorial overviews principles behind recent works on training and maintaining machine learning models over relational data, with an emphasis on the exploitation of the relational data structure to improve the runtime performance of the learning
Relational databases are valuable resources for learning novel and interesting relations and concepts. In order to constraint the search through the large space of candidate definitions, users must tune the algorithm by specifying a language bias. Un
Causal inference is at the heart of empirical research in natural and social sciences and is critical for scientific discovery and informed decision making. The gold standard in causal inference is performing randomized controlled trials; unfortunate
Real-world datasets are dirty and contain many errors. Examples of these issues are violations of integrity constraints, duplicates, and inconsistencies in representing data values and entities. Learning over dirty databases may result in inaccurate