Research papers, master's theses, and doctoral dissertations on databases

Northlight: Declarative and Optimized Analysis of Atmospheric Datasets in SparkSQL

178 - Justus Henneberg , Felix Schuhknecht , Philipp Reutter 2021

Performing data-intensive analytics is an essential part of modern Earth science. As such, research in atmospheric physics and meteorology frequently requires the processing of very large observational and/or modeled datasets. Typically, these datasets (a) have high dimensionality, i.e., contain various measurements per spatiotemporal point, and (b) are extremely large, containing observations over a long time period. Additionally, (c) the analytical tasks being performed on these datasets are structurally complex. Over the years, the binary format NetCDF has been established as a de-facto standard in distributing and exchanging such multi-dimensional datasets in the Earth science community -- along with tools and APIs to visualize, process, and generate them. Unfortunately, these access methods typically lack either (1) an easy-to-use but rich query interface or (2) an automatic optimization pipeline tailored towards the specialities of these datasets. As such, researchers from the field of Earth sciences (who are typically not computer scientists) unnecessarily struggle to work efficiently with these datasets on a daily basis. Consequently, in this work, we aim at resolving the aforementioned issues. Instead of proposing yet another specialized tool and interface to work with atmospheric datasets, we integrate sophisticated NetCDF processing capabilities into the established SparkSQL dataflow engine -- resulting in our system Northlight. In contrast to comparable systems, Northlight introduces a set of fully automatic optimizations specifically tailored towards NetCDF processing. We experimentally show that Northlight scales gracefully with the selectivity of the analysis tasks and outperforms the comparable state-of-the-art pipeline by up to a factor of 6x.

Databases; Distributed, Parallel, and Cluster Computing

Frequent Itemset Mining with Multiple Minimum Supports: a Constraint-based Approach

347 - Mohamed-Bachir Belaid , Nadjib Lazaar 2021

The problem of discovering frequent itemsets including rare ones has received a great deal of attention. The mining process needs to be flexible enough to extract frequent and rare regularities at once. On the other hand, it has recently been shown that constraint programming is a flexible way to tackle data mining tasks. In this paper, we propose a constraint programming approach for mining itemsets with multiple minimum supports. Our approach provides the user with the possibility to express any kind of constraints on the minimum item supports. An experimental analysis shows the practical effectiveness of our approach compared to the state of the art.
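To make the multiple-minimum-supports setting concrete, here is a minimal brute-force sketch in Python. It is not the constraint programming approach the abstract proposes. It assumes the commonly used rule that an itemset is frequent when its support reaches the lowest minimum support among its items, and the function name, the `mis` mapping, and the toy transactions are all hypothetical.

```python
from itertools import combinations


def frequent_itemsets_mis(transactions, mis, max_size=3):
    """Enumerate itemsets that are frequent under per-item minimum supports.

    transactions: list of sets of items.
    mis: dict mapping each item to its own minimum support (a fraction in [0, 1]).
    Assumption: an itemset is frequent if its relative support is at least
    the smallest minimum support among the items it contains.
    """
    n = len(transactions)
    items = sorted(mis)
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            candidate_set = set(candidate)
            # Relative support: fraction of transactions containing the itemset.
            support = sum(1 for t in transactions if candidate_set <= t) / n
            threshold = min(mis[item] for item in candidate)
            if support >= threshold:
                result[candidate] = support
    return result


# Toy data: "caviar" is rare, so it gets a lower minimum support.
transactions = [
    {"bread", "milk"},
    {"bread", "caviar"},
    {"bread"},
    {"milk"},
]
mis = {"bread": 0.5, "milk": 0.5, "caviar": 0.25}

found = frequent_itemsets_mis(transactions, mis)
for itemset, support in sorted(found.items()):
    print(itemset, support)
```

With a single global threshold of 0.5, the rare pair ("bread", "caviar") would be lost; the per-item threshold keeps it while still rejecting ("bread", "milk"), whose support of 0.25 is below both items' thresholds.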

Artificial Intelligence; Databases

SEACOW: Synopsis Embedded Array Compression using Wavelet Transform

207 - Minsoo Kim , Hyubjin Lee 2021

Recently, multidimensional data is produced in various domains; because a large volume of this data is often used in complex analytical tasks, it must be stored compactly and be able to respond quickly to queries. Existing compression schemes reduce data storage well; however, they might increase overall computational costs while performing queries. Effectively querying compressed data requires a compression scheme carefully designed for the tasks. This study presents a novel compression scheme, SEACOW, for storing and querying multidimensional array data. The scheme is based on wavelet transform and utilizes a hierarchical relationship between sub-arrays in the transformed data to compress the array. The result of the compression embeds a synopsis, improving query processing performance while acting as an index. To perform experiments, we implemented an array database, SEACOW storage, and evaluated query processing performance on real data sets. Our experiments show that 1) SEACOW provides a high compression ratio comparable to existing compression schemes and 2) the synopsis improves analytical query processing performance.
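The idea of a wavelet transform whose output doubles as a synopsis can be illustrated with a one-level, one-dimensional Haar transform. This is only a sketch of the general principle, not SEACOW's actual scheme (which works on multidimensional arrays and is not specified in the abstract); all function names are hypothetical, and the input length is assumed to be even.

```python
def haar_step(values):
    """One level of an (unnormalized) Haar transform on an even-length list.

    Returns (averages, details). The averages form a half-size summary of
    the data; the details hold what is needed to reconstruct it exactly.
    """
    averages = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    details = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return averages, details


def haar_inverse_step(averages, details):
    """Rebuild the original values from averages and details."""
    restored = []
    for a, d in zip(averages, details):
        restored.extend([a + d, a - d])
    return restored


def block_sum_from_synopsis(averages, first_block, last_block):
    """Sum over whole two-element blocks using only the averages."""
    return 2 * sum(averages[first_block:last_block + 1])


values = [4, 6, 10, 12, 8, 8, 0, 2]
averages, details = haar_step(values)

print(averages)                                  # [5.0, 11.0, 8.0, 1.0]
print(details)                                   # [-1.0, -1.0, 0.0, -1.0]
print(haar_inverse_step(averages, details))      # the original values, as floats
print(block_sum_from_synopsis(averages, 1, 2))   # 38.0, equals sum(values[2:6])
```

The small detail coefficients are what a compressor can encode cheaply, while the averages let a block-aligned aggregate be answered without touching the details at all.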

Databases

PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching

146 - Roee Shraga , Avigdor Gal 2021

Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering to the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
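The notion of tracking precision, recall, and f-measure while a match is being built can be sketched as follows. This is a minimal illustration using the standard definitions of the three measures, not PoWareMatch itself; the attribute names, the reference match, and the decision sequence are invented for the example.

```python
def match_quality(match, reference):
    """Precision, recall, and F1 of a set of correspondences against a reference."""
    true_positives = len(match & reference)
    precision = true_positives / len(match) if match else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def quality_trace(decisions, reference):
    """Quality after each decision, treating match creation as a sequence of additions."""
    match = set()
    trace = []
    for correspondence in decisions:
        match.add(correspondence)
        trace.append(match_quality(match, reference))
    return trace


# Hypothetical reference match between two schemata.
reference = {
    ("cust_id", "customerID"),
    ("name", "fullName"),
    ("zip", "postalCode"),
}
# Hypothetical order in which a human matcher adds correspondences.
decisions = [
    ("cust_id", "customerID"),  # correct
    ("name", "postalCode"),     # incorrect
    ("zip", "postalCode"),      # correct
]

trace = quality_trace(decisions, reference)
for step, (p, r, f) in enumerate(trace, start=1):
    print(f"step {step}: precision={p:.2f} recall={r:.2f} f1={f:.2f}")
```

In this toy trace the second decision lowers precision from 1.00 to 0.50 without improving recall, which is the kind of step a filter over human decisions would aim to drop.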

Databases; Human-Computer Interaction; Machine Learning

Evaluation of Distributed Databases in Hybrid Clouds and Edge Computing: Energy, Bandwidth, and Storage Consumption

216 - Yaser Mansouri , Victor Prokhorenko , Faheem Ullah 2021

A benchmark study of modern distributed databases is an important source of information to select the right technology for managing data in the cloud-edge paradigms. To make the right decision, it is required to conduct an extensive experimental study on a variety of hardware infrastructures. While most of the state-of-the-art studies have investigated only response time and scalability of distributed databases, focusing on other various metrics (e.g., energy, bandwidth, and storage consumption) is essential to fully understand the resource consumption of the distributed databases. Also, existing studies have explored the response time and scalability of these databases either in private or public cloud. Hence, there is a paucity of investigation into the evaluation of these databases deployed in a hybrid cloud, which is the seamless integration of public and private cloud. To address these research gaps, in this paper, we investigate energy, bandwidth and storage consumption of the most used and common distributed databases. For this purpose, we have evaluated four open-source databases (Cassandra, Mongo, Redis and MySQL) on the hybrid cloud spanning over local OpenStack and Microsoft Azure, and a variety of edge computing nodes including Raspberry Pi, a cluster of Raspberry Pi, and low and high power servers. Our extensive experimental results reveal several helpful insights for the deployment selection of modern distributed databases in edge-cloud environments.

Databases; Distributed, Parallel, and Cluster Computing

A Survey on Data Cleaning Methods for Improved Machine Learning Model Performance

119 - Ga Young Lee , Lubna Alzamil , Bakhtiyar Doskenov 2021

Data cleaning is the initial stage of any machine learning project and is one of the most critical processes in data analysis. It is a critical step in ensuring that the dataset is devoid of incorrect or erroneous data. It can be done manually with data wrangling tools, or it can be completed automatically with a computer program. Data cleaning entails a slew of procedures that, once done, make the data ready for analysis. Given its significance in numerous fields, there is a growing interest in the development of efficient and effective data cleaning frameworks. In this survey, some of the most recent advancements of data cleaning approaches are examined for their effectiveness and the future research directions are suggested to close the gap in each of the methods.

Databases

A Ring Model for Data Anomalies

123 - Haixiang Li , Xiaoyan Li , Chang Liu 2021

A distributed system keeps consistency by disallowing data anomalies. However, especially in databases, the definitions of data anomalies in the current ANSI standard are controversial. The standard does not include all anomalies and does not introduce the characteristics of anomalies. First, the definitions lack a mathematical formalization and cause ambiguous interpretations. Second, the definitions of anomalies are case-by-case, which prevents a comprehensive understanding of data anomalies. In this paper, we propose a ring anomaly detection method (the bingo model) for distributed systems and apply it to databases. The bingo model introduces anomaly construction and gives the base anomaly formalization method. Based on anomalies, we propose consistency levels. We prove the simplified anomaly rings in the model to classify anomalies and give independent consistency levels. We specify the bingo model to databases and find 22 anomalies in addition to existing anomalies.

Databases

ML Based Lineage in Databases

445 - Michael Leybovich , Oded Shmueli 2021

In this work, we track the lineage of tuples throughout their database lifetime. That is, we consider a scenario in which tuples (records) that are produced by a query may affect other tuple insertions into the DB, as part of a normal workflow. As time goes on, exact provenance explanations for such tuples become deeply nested, increasingly consuming space, and resulting in decreased clarity and readability. We present a novel approach for approximating lineage tracking, using a Machine Learning (ML) and Natural Language Processing (NLP) technique; namely, word embedding. The basic idea is summarizing (and approximating) the lineage of each tuple via a small set of constant-size vectors (the number of vectors per-tuple is a hyperparameter). Therefore, our solution does not suffer from space complexity blow-up over time, and it naturally ranks explanations to the existence of a tuple. We devise an alternative and improved lineage tracking mechanism, that of keeping track of and querying lineage at the column level; thereby, we manage to better distinguish between the provenance features and the textual characteristics of a tuple. We integrate our lineage computations into the PostgreSQL system via an extension (ProvSQL) and experimentally exhibit useful results in terms of accuracy against exact, semiring-based, justifications. In the experiments, we focus on tuples with multiple generations of tuples in their lifelong lineage and analyze them in terms of direct and distant lineage. The experiments suggest a high usefulness potential for the proposed approximate lineage methods and the further suggested enhancements. This especially holds for the column-based vectors method which exhibits high precision and high per-level recall.

Databases; Machine Learning

Augmenting Decision Making via Interactive What-If Analysis

226 - Sneha Gathani , Madelon Hulsebos , James Gale 2021

The fundamental goal of business data analysis is to improve business decisions using data. Business users such as sales, marketing, product, or operations managers often make decisions to achieve key performance indicator (KPI) goals such as increasing customer retention, decreasing cost, and increasing sales. To discover the relationship between data attributes hypothesized to be drivers and those corresponding to KPIs of interest, business users currently need to perform lengthy exploratory analyses, considering multitudes of combinations and scenarios, slicing, dicing, and transforming the data accordingly. Examples include analyzing customer retention across quarters of the year or suggesting optimal media channels across strata of customers. However, the increasing complexity of datasets combined with the cognitive limitations of humans makes it challenging to carry over multiple hypotheses, even for simple datasets. Therefore mentally performing such analyses is hard. Existing commercial tools either provide partial solutions whose effectiveness remains unclear or fail to cater to business users. Here we argue for four functionalities that we believe are necessary to enable business users to interactively learn and reason about the relationships (functions) between sets of data attributes, facilitating data-driven decision making. We implement these functionalities in SystemD, an interactive visual analysis system enabling business users to experiment with the data by asking what-if questions. We evaluate the system through three business use cases: marketing mix modeling analysis, customer retention analysis, and deal closing analysis, and report on feedback from multiple business users. Overall, business users find SystemD intuitive and useful for quick testing and validation of their hypotheses around KPIs of interest as well as in making effective and fast data-driven decisions.

Databases; Human-Computer Interaction; Machine Learning

An End-to-end Point of Interest (POI) Conflation Framework

173 - Raymond Low , Zeynep D. Tekler , Lynette Cheah 2021

Point of interest (POI) data serves as a valuable source of semantic information for places of interest and has many geospatial applications in real estate, transportation, and urban planning. With the availability of different data sources, POI conflation serves as a valuable technique for enriching data quality and coverage by merging the POI data from multiple sources. This study proposes a novel end-to-end POI conflation framework consisting of six steps: data procurement, schema standardisation, taxonomy mapping, POI matching, POI unification, and data verification. The feasibility of the proposed framework was demonstrated in a case study conducted in the eastern region of Singapore, where the POI data from five data sources was conflated to form a unified POI dataset. Based on the evaluation conducted, the resulting unified dataset was found to be more comprehensive and complete than any of the five POI data sources alone. Furthermore, the proposed approach for identifying POI matches between different data sources outperformed all baseline approaches with a matching accuracy of 97.6% and an average run time below 3 minutes when matching over 12,000 POIs to result in 8,699 unique POIs, thereby demonstrating the framework's scalability for large-scale implementation in dense urban contexts.

Machine Learning; Databases
