A Lightweight Algorithm to Uncover Deep Relationships in Data Tables

79 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Jin Cao

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Jin Cao - Yibo Zhao - Linjun Zhang

بنى وهياكل البيانات والخوارزميات قواعد البيانات التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Many data we collect today are in tabular form, with rows as records and columns as attributes associated with each record. Understanding the structural relationship in tabular data can greatly facilitate the data science process. Traditionally, much of this relational information is stored in table schema and maintained by its creators, usually domain experts. In this paper, we develop automated methods to uncover deep relationships in a single data table without expert or domain knowledge. Our method can decompose a data table into layers of smaller tables, revealing its deep structure. The key to our approach is a computationally lightweight forward addition algorithm that we developed to recursively extract the functional dependencies between table columns that are scalable to tables with many columns. With our solution, data scientists will be provided with automatically generated, data-driven insights when exploring new data sets.

قيم البحث

68 - Paolo Ferragina , Giorgio Vinciguerra 2019

The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give out standing improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they lack any guarantees both in their time and space requirements. We propose the Piecewise Geometric Model index (shortly, PGM-index), which achieves guaranteed I/O-optimality in query operations, learns an optimal number of linear models, and its peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITing-tree). We show that the PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by more than four orders of magnitude, while achieving their same or even better query time efficiency. We complement this result by proposing three variants of the PGM-index. First, we design a compressed PGM-index that further reduces its space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGM-index that adapts itself to the distribution of the queries, thus resulting in the first known distribution-aware learned index to date. Finally, given its flexibility in the offered space-time trade-offs, we propose the multicriteria PGM-index that efficiently auto-tune itself in a few seconds over hundreds of millions of keys to the possibly evolving space-time constraints imposed by the application of use. We remark to the reader that this paper is an extended and improved version of our previous paper titled Superseding traditional indexes by orchestrating learning and geometry (arXiv:1903.00507).

بنى وهياكل البيانات والخوارزميات قواعد البيانات استرجاع المعلومات

Data-driven Algorithm Design

140 - Maria-Florina Balcan 2020

Data driven algorithm design is an important aspect of modern data science and algorithm design. Rather than using off the shelf algorithms that only have worst case performance guarantees, practitioners often optimize over large families of parametr ized algorithms and tune the parameters of these algorithms using a training set of problem instances from their domain to determine a configuration with high expected performance over future instances. However, most of this work comes with no performance guarantees. The challenge is that for many combinatorial problems of significant importance including partitioning, subset selection, and alignment problems, a small tweak to the parameters can cause a cascade of changes in the algorithms behavior, so the algorithms performance is a discontinuous function of its parameters. In this chapter, we survey recent work that helps put data-driven combinatorial algorithm design on firm foundations. We provide strong computational and statistical performance guarantees, both for the batch and online scenarios where a collection of typical problem instances from the given application are presented either all at once or in an online fashion, respectively.

بنى وهياكل البيانات والخوارزميات الذكاء الاصطناعي التعلم الآلي

Lightweight Data Indexing and Compression in External Memory

163 - Paolo Ferragina , Travis Gagie , Giovanni Manzini 2009

In this paper we describe algorithms for computing the BWT and for building (compressed) indexes in external memory. The innovative feature of our algorithms is that they are lightweight in the sense that, for an input of size $n$, they use only ${n} $ bits of disk working space while all previous approaches use $Th{n log n}$ bits of disk working space. Moreover, our algorithms access disk data only via sequential scans, thus they take full advantage of modern disk features that make sequential disk accesses much faster than random accesses. We also present a scan-based algorithm for inverting the BWT that uses $Th{n}$ bits of working space, and a lightweight {em internal-memory} algorithm for computing the BWT which is the fastest in the literature when the available working space is $os{n}$ bits. Finally, we prove {em lower} bounds on the complexity of computing and inverting the BWT via sequential scans in terms of the classic product: internal-memory space $times$ number of passes over the disk data.

بنى وهياكل البيانات والخوارزميات

Sherlock: A Deep Learning Approach to Semantic Data Type Detection

95 - Madelon Hulsebos , Kevin Hu , Michiel Bakker 2019

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expres sion matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686,765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1,588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.

التعلم الآلي قواعد البيانات استرجاع المعلومات

Instance-Optimality in the Noisy Value-and Comparison-Model --- Accept, Accept, Strong Accept: Which Papers get in?

202 - Vincent Cohen-Addad , Frederik Mallmann-Trenn , Claire Mathieu 2018

Motivated by crowdsourced computation, peer-grading, and recommendation systems, Braverman, Mao and Weinberg [STOC16] studied the emph{query} and emph{round} complexity of fundamental problems such as finding the maximum (textsc{max}), finding all el ements above a certain value (textsc{threshold-$v$}) or computing the top$-k$ elements (textsc{Top}-$k$) in a noisy environment. For example, consider the task of selecting papers for a conference. This task is challenging due the crowdsourcing nature of peer reviews: the results of reviews are noisy and it is necessary to parallelize the review process as much as possible. We study the noisy value model and the noisy comparison model: In the emph{noisy value model}, a reviewer is asked to evaluate a single element: What is the value of paper $i$? (eg accept). In the emph{noisy comparison model} (introduced in the seminal work of Feige, Peleg, Raghavan and Upfal [SICOMP94]) a reviewer is asked to do a pairwise comparison: Is paper $i$ better than paper $j$? In this paper, we show optimal worst-case query complexity for the textsc{max},textsc{threshold-$v$} and textsc{Top}-$k$ problems. For textsc{max} and textsc{Top}-$k$, we obtain optimal worst-case upper and lower bounds on the round vs query complexity in both models. For textsc{threshold}-$v$, we obtain optimal query complexity and nearly-optimal round complexity, where $k$ is the size of the output) for both models. We then go beyond the worst-case and address the question of the importance of knowledge of the instance by providing, for a large range of parameters, instance-optimal algorithms with respect to the query complexity. Furthermore, we show that the value model is strictly easier than the comparison model.

بنى وهياكل البيانات والخوارزميات قواعد البيانات التعلم الآلي