Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Fast Access to Columnar, Hierarchically Nested Data via Code Transformation

61 0 0.0 ( 0 )

Download Cite

Added by Jim Pivarski

Publication date 2017

fields Informatics Engineering

and research's language is English

Authors Jim Pivarski - Peter Elmer - Brian Bockelman

Programming Languages Databases Information Retrieval

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Big Data query systems represent data in a columnar format for fast, selective access, and in some cases (e.g. Apache Drill), perform calculations directly on the columnar data without row materialization, avoiding runtime costs. However, many analysis procedures cannot be easily or efficiently expressed as SQL. In High Energy Physics, the majority of data processing requires nested loops with complex dependencies. When faced with tasks like these, the conventional approach is to convert the columnar data back into an object form, usually with a performance price. This paper describes a new technique to transform procedural code so that it operates on hierarchically nested, columnar data natively, without row materialization. It can be viewed as a compiler pass on the typed abstract syntax tree, rewriting references to objects as columnar array lookups. We will also present performance comparisons between transformed code and conventional object-oriented code in a High Energy Physics context.

rate research

Query Lifting: Language-integrated query for heterogeneous nested collections

132 - Wilmer Ricciotti , James Cheney 2021

Language-integrated query based on comprehension syntax is a powerful technique for safe database programming, and provides a basis for advanced techniques such as query shredding or query flattening that allow efficient programming with complex nested collections. However, the foundations of these techniques are lacking: although SQL, the most widely-used database query language, supports heterogeneous queries that mix set and multiset semantics, these important capabilities are not supported by known correctness results or implementations that assume homogeneous collections. In this paper we study language-integrated query for a heterogeneous query language $NRC_lambda(Set,Bag)$ that combines set and multiset constructs. We show how to normalize and translate queries to SQL, and develop a novel approach to querying heterogeneous nested collections, based on the insight that ``local query subexpressions that calculate nested subcollections can be ``lifted to the top level analogously to lambda-lifting for local function definitions.

Programming Languages Databases

Towards a Semantics-Aware Code Transformation Toolchain for Heterogeneous Systems

165 - Salvador Tamarit 2017

Obtaining good performance when programming heterogeneous computing platforms poses significant challenges. We present a program transformation environment, implemented in Haskell, where architecture-agnostic scientific C code with semantic annotations is transformed into functionally equivalent code better suited for a given platform. The transformation steps are represented as rules that can be fired when certain syntactic and semantic conditions are fulfilled. These rules are not hard-wired into the rewriting engine: they are written in a C-like language and are automatically processed and incorporated into the rewriting engine. That makes it possible for end-users to add their own rules or to provide sets of rules that are adapted to certain specific domains or purposes.

Programming Languages Distributed Parallel and Cluster Computing Software Engineering

Sherlock: A Deep Learning Approach to Semantic Data Type Detection

95 - Madelon Hulsebos , Kevin Hu , Michiel Bakker 2019

Correctly detecting the semantic type of data columns is crucial for data science tasks such as automated data cleaning, schema matching, and data discovery. Existing data preparation and analysis systems rely on dictionary lookups and regular expression matching to detect semantic types. However, these matching-based approaches often are not robust to dirty data and only detect a limited number of types. We introduce Sherlock, a multi-input deep neural network for detecting semantic types. We train Sherlock on $686,765$ data columns retrieved from the VizNet corpus by matching $78$ semantic types from DBpedia to column headers. We characterize each matched column with $1,588$ features describing the statistical properties, character distributions, word embeddings, and paragraph vectors of column values. Sherlock achieves a support-weighted F$_1$ score of $0.89$, exceeding that of machine learning baselines, dictionary and regular expression benchmarks, and the consensus of crowdsourced annotations.

Machine Learning Databases Information Retrieval

Deductive Optimization of Relational Data Storage

222 - John K. Feser , Samuel Madden , Nan Tang 2019

Optimizing the physical data storage and retrieval of data are two key database management problems. In this paper, we propose a language that can express a wide range of physical database layouts, going well beyond the row- and column-based methods that are widely used in database management systems. We use deductive synthesis to turn a high-level relational representation of a database query into a highly optimized low-level implementation which operates on a specialized layout of the dataset. We build a compiler for this language and conduct experiments using a popular database benchmark, which shows that the performance of these specialized queries is competitive with a state-of-the-art in memory compiled database system.

Programming Languages Databases

The PGM-index: a multicriteria, compressed and learned approach to data indexing

68 - Paolo Ferragina , Giorgio Vinciguerra 2019

The recent introduction of learned indexes has shaken the foundations of the decades-old field of indexing data structures. Combining, or even replacing, classic design elements such as B-tree nodes with machine learning models has proven to give outstanding improvements in the space footprint and time efficiency of data systems. However, these novel approaches are based on heuristics, thus they lack any guarantees both in their time and space requirements. We propose the Piecewise Geometric Model index (shortly, PGM-index), which achieves guaranteed I/O-optimality in query operations, learns an optimal number of linear models, and its peculiar recursive construction makes it a purely learned data structure, rather than a hybrid of traditional and learned indexes (such as RMI and FITing-tree). We show that the PGM-index improves the space of the FITing-tree by 63.3% and of the B-tree by more than four orders of magnitude, while achieving their same or even better query time efficiency. We complement this result by proposing three variants of the PGM-index. First, we design a compressed PGM-index that further reduces its space footprint by exploiting the repetitiveness at the level of the learned linear models it is composed of. Second, we design a PGM-index that adapts itself to the distribution of the queries, thus resulting in the first known distribution-aware learned index to date. Finally, given its flexibility in the offered space-time trade-offs, we propose the multicriteria PGM-index that efficiently auto-tune itself in a few seconds over hundreds of millions of keys to the possibly evolving space-time constraints imposed by the application of use. We remark to the reader that this paper is an extended and improved version of our previous paper titled Superseding traditional indexes by orchestrating learning and geometry (arXiv:1903.00507).

Data Structures and Algorithms Databases Information Retrieval

comments

Fetching comments

Sohag University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Fast Access to Columnar, Hierarchically Nested Data via Code Transformation

Ask ChatGPT about the research

No Arabic abstract

Read More