Designing Succinct Secondary Indexing Mechanism by Exploiting Column Correlations (Extended Version)

142 0 0.0 ( 0 )

Download Cite

Added by Yingjun Wu

Publication date 2019

fields Informatics Engineering

and research's language is English

Authors Yingjun Wu - Jia Yu - Yuanyuan Tian

Databases

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Database administrators construct secondary indexes on data tables to accelerate query processing in relational database management systems (RDBMSs). These indexes are built on top of the most frequently queried columns according to the data statistics. Unfortunately, maintaining multiple secondary indexes in the same database can be extremely space consuming, causing significant performance degradation due to the potential exhaustion of memory space. In this paper, we demonstrate that there exist many opportunities to exploit column correlations for accelerating data access. We propose HERMIT, a succinct secondary indexing mechanism for modern RDBMSs. HERMIT judiciously leverages the rich soft functional dependencies hidden among columns to prune out redundant structures for indexed key access. Instead of building a complete index that stores every single entry in the key columns, HERMIT navigates any incoming key access queries to an existing index built on the correlated columns. This is achieved through the Tiered Regression Search Tree (TRS-Tree), a succinct, ML-enhanced data structure that performs fast curve fitting to adaptively and dynamically capture both column correlations and outliers. Our extensive experimental study in two different RDBMSs have confirmed that HERMIT can significantly reduce space consumption with limited performance overhead in terms of query response time and index maintenance time, especially when supporting complex range queries.

rate research

Multi-version Indexing in Flash-based Key-Value Stores

256 - Pulkit A. Misra , Jeffrey S. Chase , Johannes Gehrke 2019

Maintaining multip

Databases Operating Systems

ProGS: Property Graph Shapes Language (Extended Version)

88 - Philipp Seifer , Ralf Lammel , Steffen Staab 2021

Property graphs constitute data models for representing knowledge graphs. They allow for the convenient representation of facts, including facts about facts, represented by triples in subject or object position of other triples. Knowledge graphs such as Wikidata are created by a diversity of contributors and a range of sources leaving them prone to two types of errors. The first type of error, falsity of facts, is addressed by property graphs through the representation of provenance and validity, making triples occur as first-order objects in subject position of metadata triples. The second type of error, violation of domain constraints, has not been addressed with regard to property graphs so far. In RDF representations, this error can be addressed by shape languages such as SHACL or ShEx, which allow for checking whether graphs are valid with respect to a set of domain constraints. Borrowing ideas from the syntax and semantics definitions of SHACL, we design a shape language for property graphs, ProGS, which allows for formulating shape constraints on property graphs including their specific constructs, such as edges with identities and key-value annotations to both nodes and edges. We define a formal semantics of ProGS, investigate the resulting complexity of validating property graphs against sets of ProGS shapes, compare with corresponding results for SHACL, and implement a prototypical validator that utilizes answer set programming.

Databases Artificial Intelligence Programming Languages

Snapshot Semantics for Temporal Multiset Relations (Extended Version)

200 - Anton Dignos , Boris Glavic , Xing Niu 2019

Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations and significantly outperforms such implementations on queries that involve aggregation.

Databases

198 - Feilong Liu , Ario Salmasi , Spyros Blanas 2018

Parallel aggregation is a ubiquitous operation in data analytics that is expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow. Parallel aggregation starts with an optional local pre-aggregation step and then repartitions the intermediate result across the network. While local pre-aggregation works well for low-cardinality aggregations, the network communication cost remains significant for high-cardinality aggregations even after local pre-aggregation. The problem is that the repartition-based algorithm for high-cardinality aggregation does not fully utilize the network. In this work, we first formulate a mathematical model that captures the performance of parallel aggregation. We prove that finding optimal aggregation plans from a known data distribution is NP-hard, assuming the Small Set Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling Protocol that decomposes parallel aggregation into phases. GRASP is distribution-aware as it aggregates the most similar partitions in each phase to reduce the transmitted data size in subsequent phases. In addition, GRASP takes the available network bandwidth into account when scheduling aggregations in each phase to maximize network utilization. The experimental evaluation on real data shows that GRASP outperforms repartition-based aggregation by 3.5x and LOOM by 2.0x.

Databases

Gcube Indexing

368 - M.Laxmaiah 2013

Spatial Online Analytical Processing System involves the non-categorical attribute information also whereas standard Online Analytical Processing System deals with only categorical attributes. Providing spatial information to the data warehouse (DW); two major challenges faced are;1.Defining and Aggregation of Spatial/Continues values and 2.Representation, indexing, updating and efficient query processing. In this paper, we present GCUBE(Geographical Cube) storage and indexing procedure to aggregate the spatial information/Continuous values. We employed the proposed approach storing and indexing using synthetic and real data sets and evaluated its build, update and Query time. It is observed that the proposed procedure offers significant performance advantage.

Databases