Optimizing ETL Dataflow Using Shared Caching and Parallelization Methods

420 0 0.0 ( 0 )

Download Cite

Added by Xiufeng Liu

Publication date 2014

fields Informatics Engineering

and research's language is English

Authors Xiufeng Liu

Databases

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

Extract-Transform-Load (ETL) handles large amount of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and the resources required by ETL dataflows, this paper presents a framework to optimize dataflows using shared cache and parallelization techniques. The framework classifies the components in an ETL dataflow into different categories based on their data operation properties. The framework then partitions the dataflow based on the classification at different granularities. Furthermore, the framework applies optimization techniques such as cache re-using, pipelining and multi-threading to the already-partitioned dataflows. The proposed techniques reduce system memory footprint and the frequency of copying data between different components, and also take full advantage of the computing power of multi-core processors. The experimental results show that the proposed optimization framework is 4.7 times faster than the ordinary ETL dataflows (without using the proposed optimization techniques), and outperforms the similar tool (Kettle).

rate research

Parallelization of XPath Queries using Modern XQuery Processors

68 - Shigeyuki Sato , Wei Hao , 2018

A practical and promising approach to parallelizing XPath queries was proposed by Bordawekar et al. in 2009, which enables parallelization on top of existing XML database engines. Although they experimentally demonstrated the speedup by their approach, their practice has already been out of date because the software environment has largely changed with the capability of XQuery processing. In this work, we implement their approach in two ways on top of a state-of-the-art XML database engine and experimentally demonstrate that our implementations can bring significant speedup on a commodity server.

Databases Programming Languages

Two-level Data Staging ETL for Transaction Data

504 - Xiufeng Liu 2014

In data warehousing, Extract-Transform-Load (ETL) extracts the data from data sources into a central data warehouse regularly for the support of business decision-makings. The data from transaction processing systems are featured with the high frequent changes of insertion, update, and deletion. It is challenging for ETL to propagate the changes to the data warehouse, and maintain the change history. Moreover, ETL jobs typically run in a sequential order when processing the data with dependencies, which is not optimal, eg, when processing early-arriving data. In this paper, we propose a two-level data staging ETL for handling transaction data. The proposed method detects the changes of the data from transactional processing systems, identifies the corresponding operation codes for the changes, and uses two staging databases to facilitate the data processing in an ETL process. The proposed ETL provides the one-stop method for fast-changing, slowly-changing and early-arriving data processing.

Databases

Optimizing the hybrid parallelization of BHAC

145 - Salvatore Cielo , Oliver Porth , Luigi Iapichino 2021

We present our experience with the modernization on the GR-MHD code BHAC, aimed at improving its novel hybrid (MPI+OpenMP) parallelization scheme. In doing so, we showcase the use of performance profiling tools usable on x86 (Intel-based) architectures. Our performance characterization and threading analysis provided guidance in improving the concurrency and thus the efficiency of the OpenMP parallel regions. We assess scaling and communication patterns in order to identify and alleviate MPI bottlenecks, with both runtime switches and precise code interventions. The performance of optimized version of BHAC improved by $sim28%$, making it viable for scaling on several hundreds of supercomputer nodes. We finally test whether porting such optimizations to different hardware is likewise beneficial on the new architecture by running on ARM A64FX vector nodes.

Distributed Parallel and Cluster Computing Instrumentation and Methods for Astrophysics

Secretive Coded Caching with Shared Caches

172 - Shreya Shrestha Meel , B. Sundar Rajan 2021

We consider the problem of emph{secretive coded caching} in a shared cache setup where the number of users accessing a particular emph{helper cache} is more than one, and every user can access exactly one helper cache. In secretive coded caching, the constraint of emph{perfect secrecy} must be satisfied. It requires that the users should not gain, either from their caches or from the transmissions, any information about the content of the files that they did not request from the server. In order to accommodate the secrecy constraint, our problem setup requires, in addition to a helper cache, a dedicated emph{user cache} of minimum capacity of 1 unit to every user. This is where our formulation differs from the original work on shared caches (``Fundamental Limits of Coded Caching With Multiple Antennas, Shared Caches and Uncoded Prefetching by E.~Parrinello, A.~{{U}}nsal and P.~Elia in Trans. Inf. Theory, 2020). In this work, we propose a secretively achievable coded caching scheme with shared caches under centralized placement. When our scheme is applied to the dedicated cache setting, it matches the scheme by Ravindrakumar emph{et al.} (``Private Coded Caching, in Trans. Inf. Forensics and Security, 2018).

Information Theory Information Theory

Theory and Practice of Transactional Method Caching

331 - Daniel Pfeifer , Peter C. Lockemann 2005

Nowadays, tiered architectures are widely accepted for constructing large scale information systems. In this context application servers often form the bottleneck for a systems efficiency. An application server exposes an object oriented interface consisting of set of methods which are accessed by potentially remote clients. The idea of method caching is to store results of read-only method invocations with respect to the application servers interface on the client side. If the client invokes the same method with the same arguments again, the corresponding result can be taken from the cache without contacting the server. It has been shown that this approach can considerably improve a real world systems efficiency. This paper extends the concept of method caching by addressing the case where clients wrap related method invocations in ACID transactions. Demarcating sequences of method calls in this way is supported by many important application server standards. In this context the paper presents an architecture, a theory and an efficient protocol for maintaining full transactional consistency and in particular serializability when using a method cache on the client side. In order to create a protocol for scheduling cached method results, the paper extends a classical transaction formalism. Based on this extension, a recovery protocol and an optimistic serializability protocol are derived. The latter one differs from traditional transactional cache protocols in many essential ways. An efficiency experiment validates the approach: Using the cache a systems performance and scalability are considerably improved.

Databases