Analysis of Indexing Structures for Immutable Data

312 0 0.0 ( 0 )

Download Cite

Added by Cong Yue

Publication date 2020

fields Informatics Engineering

and research's language is English

Authors Cong Yue - Zhongle Xie - Meihui Zhang

Databases

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

In emerging applications such as blockchains and collaborative data analytics, there are strong demands for data immutability, multi-version accesses, and tamper-evident controls. This leads to three new index structures for immutable data, namely Merkle Patricia Trie (MPT), Merkle Bucket Tree (MBT), and Pattern-Oriented-Split Tree (POS-Tree). Although these structures have been adopted in real applications, there is no systematic evaluation of their pros and cons in the literature. This makes it difficult for practitioners to choose the right index structure for their applications, as there is only a limited understanding of the characteristics of each index. To alleviate the above deficiency, we present a comprehensive analysis of the existing index structures for immutable data, evaluating both their asymptotic and empirical performance. Specifically, we show that MPT, MBT, and POS-Tree are all instances of a recently proposed framework, dubbed my{Structurally Invariant and Reusable Indexes (SIRI)}. We propose to evaluate the SIRI instances based on five essential metrics: their efficiency for four index operations (i.e., lookup, update, comparison, and merge), as well as their my{deduplication ratios} (i.e., the size of the index with deduplication over the size without deduplication). We establish the worst-case guarantees of each index in terms of these five metrics, and we experimentally evaluate all indexes in a large variety of settings. Based on our theoretical and empirical analysis, we conclude that POS-Tree is a favorable choice for indexing immutable data.

rate research

MESSI: In-Memory Data Series Indexing

74 - Botao Peng , Panagiota Fatourou , Themis Palpanas 2020

Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-core and multi-socket architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in _50msec (30-75msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.

Databases

ParIS+: Data Series Indexing on Multi-Core Architectures

69 - Botao Peng , Panagiota Fatourou , Themis Palpanas 2020

Data series similarity search is a core operation for several data series analysis applications across many different domains. Nevertheless, even state-of-the-art techniques cannot provide the time performance required for large data series collections. We propose ParIS and ParIS+, the first disk-based data series indices carefully designed to inherently take advantage of multi-core architectures, in order to accelerate similarity search processing times. Our experiments demonstrate that ParIS+ completely removes the CPU latency during index construction for disk-resident data, and for exact query answering is up to 1 order of magnitude faster than the current state of the art index scan method, and up to 3 orders of magnitude faster than the optimized serial scan method. ParIS+ (which is an evolution of the ADS+ index) owes its efficiency to the effective use of multi-core and multi-socket architectures, in order to distribute and execute in parallel both index construction and query answering, and to the exploitation of the Single Instruction Multiple Data (SIMD) capabilities of modern CPUs, in order to further parallelize the execution of instructions inside each core.

Databases

Gcube Indexing

377 - M.Laxmaiah 2013

Spatial Online Analytical Processing System involves the non-categorical attribute information also whereas standard Online Analytical Processing System deals with only categorical attributes. Providing spatial information to the data warehouse (DW); two major challenges faced are;1.Defining and Aggregation of Spatial/Continues values and 2.Representation, indexing, updating and efficient query processing. In this paper, we present GCUBE(Geographical Cube) storage and indexing procedure to aggregate the spatial information/Continuous values. We employed the proposed approach storing and indexing using synthetic and real data sets and evaluated its build, update and Query time. It is observed that the proposed procedure offers significant performance advantage.

Databases

ForkBase: Immutable, Tamper-evident Storage Substrate for Branchable Applications

181 - Qian Lin , Kaiyuan Yang , Tien Tuan Anh Dinh 2020

Data collaboration activities typically require systematic or protocol-based coordination to be scalable. Git, an effective enabler for collaborative coding, has been attested for its success in countless projects around the world. Hence, applying the Git philosophy to general data collaboration beyond coding is motivating. We call it Git for data. However, the original Git design handles data at the file granule, which is considered too coarse-grained for many database applications. We argue that Git for data should be co-designed with database systems. To this end, we developed ForkBase to make Git for data practical. ForkBase is a distributed, immutable storage system designed for data version management and data collaborative operation. In this demonstration, we show how ForkBase can greatly facilitate collaborative data management and how its novel data deduplication technique can improve storage efficiency for archiving massive da

Databases

Structural Indexing for Conjunctive Path Queries

335 - Yuya Sasaki , George Fletcher , Makoto Onizuka 2020

Structural indexing is an approach to accelerating query evaluation, whereby data objects are partitioned and indexed reflecting the precise expressive power of a given query language. Each partition block of the index holds exactly those objects that are indistinguishable with respect to queries expressible in the language. Structural indexes have proven successful for XML, RDF, and relational data management. In this paper we study structural indexing for conjunctive path queries (CPQ). CPQ forms the core of contemporary graph query languages such as SPARQL, Cypher, PGQL, and G-CORE. CPQ plays the same fundamental role with respect to contemporary graph query languages as the classic conjunctive queries play for SQL. We develop the first practical structural indexes for this important query language. In particular, we propose a structural index based on k-path-bisimulation, tightly coupled to the expressive power of CPQ, and develop algorithms for efficient query processing with our index. Furthermore, we study workload-aware structural indexes to reduce both the construction and space costs according to a given workload. We demonstrate through extensive experiments using real and synthetic graphs that our methods accelerate query processing by up to multiple orders of magnitude over the state-of-the-art methods, without increasing index size.

Databases