No Arabic abstract
A practical and promising approach to parallelizing XPath queries was proposed by Bordawekar et al. in 2009, which enables parallelization on top of existing XML database engines. Although they experimentally demonstrated the speedup by their approach, their practice has already been out of date because the software environment has largely changed with the capability of XQuery processing. In this work, we implement their approach in two ways on top of a state-of-the-art XML database engine and experimentally demonstrate that our implementations can bring significant speedup on a commodity server.
The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache VXQuery, an open-source scalable XQuery processor. The system builds upon two other open-source frameworks -- Hyracks, a parallel execution engine, and Algebricks, a language agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specifics (data model, data-model dependent functions and optimizations, and a parser). We describe the architecture of Apache VXQuery, its integration with Hyracks and Algebricks, and the XQuery optimization rules applied to the query plan to improve path expression efficiency and to enable query parallelism. An experimental evaluation using a real 500GB dataset with various selection, aggregation and join XML queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up. Our experiments show that it is about 3x faster than Saxon (an open-source and commercial XQuery processor) on a 4-core, single node implementation, and around 2.5x faster than Apache MRQL (a MapReduce-based parallel query processor) on an eight (4-core) node cluster.
Extract-Transform-Load (ETL) handles large amount of data and manages workload through dataflows. ETL dataflows are widely regarded as complex and expensive operations in terms of time and system resources. In order to minimize the time and the resources required by ETL dataflows, this paper presents a framework to optimize dataflows using shared cache and parallelization techniques. The framework classifies the components in an ETL dataflow into different categories based on their data operation properties. The framework then partitions the dataflow based on the classification at different granularities. Furthermore, the framework applies optimization techniques such as cache re-using, pipelining and multi-threading to the already-partitioned dataflows. The proposed techniques reduce system memory footprint and the frequency of copying data between different components, and also take full advantage of the computing power of multi-core processors. The experimental results show that the proposed optimization framework is 4.7 times faster than the ordinary ETL dataflows (without using the proposed optimization techniques), and outperforms the similar tool (Kettle).
Responsive Web Design (RWD) enables web applications to adapt to the characteristics of different devices such as screen size which is important for mobile browsing. Today, the only W3C standard to support this adaptability is CSS media queries. However, using media queries it is impossible to create applications in a modular way, because responsive elements then always depend on the global context. Hence, responsive elements can only be reused if the global context is exactly the same, severely limiting their reusability. This makes it extremely challenging to develop large responsive applications, because the lack of true modularity makes certain requirement changes either impossible or expensive to realize. In this paper we extend RWD to also include responsive modules, i.e., modules that adapt their design based on their local context independently of the global context. We present the ELQ project which implements our approach. ELQ is a novel implementation of so-called element queries which generalize media queries. Importantly, our design conforms to existing web specifications, enabling adoption on a large scale. ELQ is designed to be heavily extensible using plugins. Experimental results show speed-ups of the core algorithms of up to 37x compared to previous approaches.
Large-scale graph-structured data arising from social networks, databases, knowledge bases, web graphs, etc. is now available for analysis and mining. Graph-mining often involves relationship queries, which seek a ranked list of interesting interconnections among a given set of entities, corresponding to nodes in the graph. While relationship queries have been studied for many years, using various terminologies, e.g., keyword-search, Steiner-tree in a graph etc., the solutions proposed in the literature so far have not focused on scaling relationship queries to large graphs having billions of nodes and edges, such are now publicly available in the form of linked-open-data. In this paper, we present an algorithm for distributed keyword search (DKS) on large graphs, based on the graph-parallel computing paradigm Pregel. We also present an analytical proof that our algorithm produces an optimally ranked list of answers if run to completion. Even if terminated early, our algorithm produces approximate answers along with bounds. We describe an optimized implementation of our DKS algorithm along with time-complexity analysis. Finally, we report and analyze experiments using an implementation of DKS on Giraph the graph-parallel computing framework based on Pregel, and demonstrate that we can efficiently process relationship queries on large-scale subsets of linked-open-data.
The k-regret query aims to return a size-k subset S of a database D such that, for any query user that selects a data object from this size-k subset S rather than from database D, her regret ratio is minimized. The regret ratio here is modeled by the relative difference in the optimality between the locally optimal object in S and the globally optimal object in D. The optimality of a data object in turn is modeled by a utility function of the query user. Unlike traditional top-k queries, the k-regret query does not minimize the regret ratio for a specific utility function. Instead, it considers a family of infinite utility functions F, and aims to find a size-k subset that minimizes the maximum regret ratio of any utility function in F. Studies on k-regret queries have focused on the family of additive utility functions, which have limitations in modeling individuals preferences and decision making processes, especially for a common observation called the diminishing marginal rate of substitution (DMRS). We introduce k-regret queries with multiplicative utility functions, which are more expressive in modeling the DMRS, to overcome those limitations. We propose a query algorithm with bounded regret ratios. To showcase the applicability of the algorithm, we apply it to a special family of multiplicative utility functions, the Cobb-Douglas family of utility functions, and a closely related family of utility functions, the Constant Elasticity of Substitution family of utility functions, both of which are frequently used utility functions in microeconomics. After a further study of the query properties, we propose a heuristic algorithm that produces even smaller regret ratios in practice. Extensive experiments on the proposed algorithms confirm that they consistently achieve small maximum regret ratios.