Bitvector-aware Query Optimization for Decision Support Queries (extended version)

Added by Bailu Ding

Publication date 2020

Field: Informatics Engineering

Research language: English

Authors Bailu Ding - Surajit Chaudhuri - Vivek Narasayya

Databases

Bitvector filtering is an important query processing technique that can significantly reduce the cost of execution, especially for complex decision support queries with multiple joins. Despite its wide application, however, its implications for query optimization are not well understood. In this work, we study how bitvector filters impact query optimization. We show that incorporating bitvector filters into query optimization straightforwardly can increase the plan space complexity by an exponential factor in the number of relations in the query. We analyze the plans with bitvector filters for star and snowflake queries in the plan space of right deep trees without cross products. Surprisingly, with some simplifying assumptions, we prove that the minimal-cost plan with bitvector filters can be found among a set of candidate plans whose size is linear in the number of relations in the query. This greatly reduces the plan space complexity for such queries from exponential to linear. Motivated by our analysis, we propose an algorithm that accounts for the impact of bitvector filters in query optimization. Our algorithm optimizes the join order for an arbitrary decision support query by choosing from a set of candidate plans whose size is linear in the number of relations in the query. We implement our algorithm in Microsoft SQL Server as a transformation rule. Our evaluation on both industry standard benchmarks and a customer workload shows that, compared with the original Microsoft SQL Server, our technique reduces the total CPU execution time by 22%-64% for the workloads, with up to two orders of magnitude reduction in CPU execution time for individual queries.
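To make the filtering idea concrete, here is a minimal sketch of a bitvector filter applied during a hash join. It is an illustration under stated assumptions, not the paper's or SQL Server's implementation: join keys are assumed to be non-negative integers, the filter uses a single modulo hash, and the names `BitvectorFilter` and `hash_join_with_filter` are hypothetical.

```python
from typing import Dict, List, Tuple


class BitvectorFilter:
    """Single-hash bitmap: false positives are possible, false negatives are not."""

    def __init__(self, num_bits: int = 1024) -> None:
        self.num_bits = num_bits
        self.bits = 0  # a Python int used as a bit array

    def add(self, key: int) -> None:
        self.bits |= 1 << (key % self.num_bits)

    def might_contain(self, key: int) -> bool:
        return (self.bits >> (key % self.num_bits)) & 1 == 1


def hash_join_with_filter(
    dim_rows: List[Tuple[int, str]],
    fact_rows: List[Tuple[int, int]],
) -> Tuple[List[Tuple[int, int, str]], int]:
    """Join fact rows to dimension rows on the key; also report filter survivors."""
    # Build side: populate the hash table and the bitvector together.
    table: Dict[int, List[str]] = {}
    bitvector = BitvectorFilter()
    for key, payload in dim_rows:
        table.setdefault(key, []).append(payload)
        bitvector.add(key)

    # Probe side: the bitvector discards most non-matching rows early,
    # before they reach the (more expensive) hash table probe.
    survivors = [(k, v) for k, v in fact_rows if bitvector.might_contain(k)]
    joined = [(k, v, d) for k, v in survivors for d in table.get(k, [])]
    return joined, len(survivors)


dim = [(1, "a"), (2, "b")]
fact = [(1, 10), (3, 30), (2, 20), (1025, 99)]  # 1025 collides with 1 (mod 1024)
joined, survivor_count = hash_join_with_filter(dim, fact)
```

In this toy input the row with key 3 is dropped by the filter, while key 1025 passes as a false positive and is only eliminated by the hash table probe. How many rows a filter removes depends on where it is applied in the plan, which is why the join order and the filters interact.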

Feilong Liu, Ario Salmasi, Spyros Blanas (2018)

Parallel aggregation is a ubiquitous operation in data analytics that is expressed as GROUP BY in SQL, reduce in Hadoop, or segment in TensorFlow. Parallel aggregation starts with an optional local pre-aggregation step and then repartitions the intermediate result across the network. While local pre-aggregation works well for low-cardinality aggregations, the network communication cost remains significant for high-cardinality aggregations even after local pre-aggregation. The problem is that the repartition-based algorithm for high-cardinality aggregation does not fully utilize the network. In this work, we first formulate a mathematical model that captures the performance of parallel aggregation. We prove that finding optimal aggregation plans from a known data distribution is NP-hard, assuming the Small Set Expansion conjecture. We propose GRASP, a GReedy Aggregation Scheduling Protocol that decomposes parallel aggregation into phases. GRASP is distribution-aware as it aggregates the most similar partitions in each phase to reduce the transmitted data size in subsequent phases. In addition, GRASP takes the available network bandwidth into account when scheduling aggregations in each phase to maximize network utilization. The experimental evaluation on real data shows that GRASP outperforms repartition-based aggregation by 3.5x and LOOM by 2.0x.
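The repartition-based baseline described above can be sketched as follows. This is a single-process simulation under assumptions made here for illustration (integer group keys, SUM as the aggregate, modulo partitioning, and the hypothetical names `local_preaggregate` and `repartition`); it does not implement GRASP.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def local_preaggregate(rows: Iterable[Tuple[int, int]]) -> Counter:
    """Sum values per group key on one node before anything is sent."""
    partial: Counter = Counter()
    for key, value in rows:
        partial[key] += value
    return partial


def repartition(partials: List[Counter], num_nodes: int) -> Tuple[List[Counter], int]:
    """Route each partial aggregate to the node that owns its key and merge there."""
    buckets: List[Counter] = [Counter() for _ in range(num_nodes)]
    tuples_routed = 0
    for partial in partials:
        for key, value in partial.items():
            buckets[key % num_nodes][key] += value
            tuples_routed += 1
    return buckets, tuples_routed


node_inputs = [
    [(1, 1), (1, 2), (2, 5)],  # rows held by node 0
    [(1, 4), (3, 1)],          # rows held by node 1
]
partials = [local_preaggregate(rows) for rows in node_inputs]
buckets, tuples_routed = repartition(partials, num_nodes=2)
```

Here pre-aggregation shrinks five input rows to four partial tuples. With many distinct keys per node the partials are nearly as large as the inputs, which is the high-cardinality case where the routing step dominates.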

Databases

Learning GraphQL Query Costs (Extended Version)

Georgios Mavroudeas, Guillaume Baudart, Alan Cha (2021)

GraphQL is a query language for APIs and a runtime for executing those queries, fetching the requested data from existing microservices, REST APIs, databases, or other sources. Its expressiveness and its flexibility have made it an attractive candidate for API providers in many industries, especially through the web. A major drawback to blindly servicing a client's query in GraphQL is that the cost of a query can be unexpectedly large, creating computation and resource overload for the provider, and API rate-limit overages and infrastructure overload for the client. To mitigate these drawbacks, it is necessary to efficiently estimate the cost of a query before executing it. Estimating query cost is challenging, because GraphQL queries have a nested structure, GraphQL APIs follow different design conventions, and the underlying data sources are hidden. Estimates based on worst-case static query analysis have had limited success because they tend to grossly overestimate cost. We propose a machine-learning approach to efficiently and accurately estimate the query cost. We also demonstrate the power of this approach by testing it on query-response data from publicly available commercial APIs. Our framework is efficient and predicts query costs with high accuracy, consistently outperforming the static analysis by a large margin.
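A rough sketch of why worst-case static analysis overestimates: it assumes every list field returns as many items as its limit allows, so the bounds multiply along the nesting. The query representation below (`QueryField`, its `limit` attribute) and the cost measure (number of returned objects) are assumptions made for this illustration, not the paper's model or any particular API's cost rules.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QueryField:
    name: str
    limit: int = 1  # maximum items this field may return (1 for a non-list field)
    children: List["QueryField"] = field(default_factory=list)


def worst_case_cost(node: QueryField) -> int:
    """Upper bound on returned objects if every list is filled to its limit."""
    per_item = 1 + sum(worst_case_cost(child) for child in node.children)
    return node.limit * per_item


# Hypothetical query: users(first: 10) { posts(first: 5) { comments(first: 3) } }
query = QueryField(
    "users",
    limit=10,
    children=[
        QueryField("posts", limit=5, children=[QueryField("comments", limit=3)])
    ],
)
bound = worst_case_cost(query)
```

For this query the bound is 10 * (1 + 5 * (1 + 3)) = 210 objects, even if most users have few posts and most posts have no comments. A learned estimator can use observed response sizes instead of the limits.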

Software Engineering Databases Machine Learning

DBL: Efficient Reachability Queries on Dynamic Graphs (Complete Version)

Qiuyi Lyu, Yuchen Li, Bingsheng He (2021)

The reachability query is a fundamental problem on graphs, which has been extensively studied in academia and industry. Since graphs are subject to frequent updates in many applications, it is essential to support efficient graph updates while offering good performance in reachability queries. Existing solutions compress the original graph with the Directed Acyclic Graph (DAG) and propose efficient query processing and index update techniques. However, they focus on optimizing the scenarios where the Strongly Connected Components (SCCs) remain unchanged and have overlooked the prohibitively high cost of the DAG maintenance when SCCs are updated. In this paper, we propose DBL, an efficient DAG-free index to support the reachability query on dynamic graphs with insertion-only updates. DBL builds on two complementary indexes: Dynamic Landmark (DL) label and Bidirectional Leaf (BL) label. The former leverages landmark nodes to quickly determine reachable pairs whereas the latter prunes unreachable pairs by indexing the leaf nodes in the graph. We evaluate DBL against the state-of-the-art approaches on dynamic reachability index with extensive experiments on real-world datasets. The results demonstrate that DBL achieves orders of magnitude speedup in terms of index update, while still producing competitive query efficiency.
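The general landmark idea (if u reaches a landmark and that landmark reaches v, then u reaches v) can be sketched as below for insertion-only updates. This is a generic illustration with a BFS fallback, not the DL or BL label structure of DBL; the class name `LandmarkReachability` and its set-based storage are assumptions made here.

```python
from collections import defaultdict, deque


class LandmarkReachability:
    def __init__(self, landmarks):
        self.succ = defaultdict(set)  # outgoing edges
        self.pred = defaultdict(set)  # incoming edges
        # Per landmark: nodes it can reach, and nodes that can reach it.
        self.reach_from = {l: {l} for l in landmarks}
        self.reach_to = {l: {l} for l in landmarks}

    @staticmethod
    def _expand(start, adjacency, visited):
        """Add to `visited` every node reachable from `start` via `adjacency`."""
        if start in visited:
            return
        visited.add(start)
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(nxt)

    def add_edge(self, a, b):
        """Insert edge a -> b and extend the landmark sets it affects."""
        self.succ[a].add(b)
        self.pred[b].add(a)
        for l in self.reach_from:
            if a in self.reach_from[l]:  # l reaches a, so l now reaches b onward
                self._expand(b, self.succ, self.reach_from[l])
            if b in self.reach_to[l]:    # b reaches l, so a and its ancestors do too
                self._expand(a, self.pred, self.reach_to[l])

    def reachable(self, u, v):
        if u == v:
            return True
        for l in self.reach_from:
            if u in self.reach_to[l] and v in self.reach_from[l]:
                return True  # answered from the landmark sets alone
        seen = set()
        self._expand(u, self.succ, seen)  # fallback: plain BFS
        return v in seen


index = LandmarkReachability(landmarks=[2])
for a, b in [(1, 2), (2, 3), (4, 5)]:
    index.add_edge(a, b)
```

Because edges are only inserted, the landmark sets only grow, so each update extends them without recomputing from scratch and without maintaining a condensed DAG.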

Databases

Safest Nearby Neighbor Queries in Road Networks (Full Version)

Punam Biswas, Tanzima Hashem, Muhammad Aamir Cheema (2021)

Traditional route planning and $k$ nearest neighbors queries only consider distance or travel time and ignore road safety altogether. However, many travellers prefer to avoid risky or unpleasant road conditions such as roads with high crime rates (e.g., robberies, kidnapping, riots, etc.) and bumpy roads. To facilitate safe travel, we introduce a novel query for road networks called the $k$ safest nearby neighbors ($k$SNN) query. Given a query location $v_l$, a distance constraint $d_c$ and a point of interest $p_i$, we define the safest path from $v_l$ to $p_i$ as the path with the highest path safety score among all the paths from $v_l$ to $p_i$ with length less than $d_c$. The path safety score is computed considering the road safety of each road segment on the path. Given a query location $v_l$, a distance constraint $d_c$ and a set of POIs $P$, a $k$SNN query returns $k$ POIs with the $k$ highest path safety scores in $P$ along with their respective safest paths from the query location. We develop two novel indexing structures called the $Ct$-tree and the safety score based Voronoi diagram (SNVD). We propose two efficient query processing algorithms, each exploiting one of the proposed indexes to effectively refine the search space using the properties of the index. Our extensive experimental study on real datasets demonstrates that our solution is on average an order of magnitude faster than the baselines.

Databases

Snapshot Semantics for Temporal Multiset Relations (Extended Version)

Anton Dignos, Boris Glavic, Xing Niu (2019)

Snapshot semantics is widely used for evaluating queries over temporal data: temporal relations are seen as sequences of snapshot relations, and queries are evaluated at each snapshot. In this work, we demonstrate that current approaches for snapshot semantics over interval-timestamped multiset relations are subject to two bugs regarding snapshot aggregation and bag difference. We introduce a novel temporal data model based on K-relations that overcomes these bugs and prove it to correctly encode snapshot semantics. Furthermore, we present an efficient implementation of our model as a database middleware and demonstrate experimentally that our approach is competitive with native implementations and significantly outperforms such implementations on queries that involve aggregation.
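As a small illustration of snapshot semantics over multisets, the sketch below evaluates a query independently at each time point. It shows only the reference semantics, not the paper's K-relation model or its middleware; half-open intervals `[begin, end)`, integer time points, and the names `snapshot` and `snapshot_eval` are assumptions made here.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple

TemporalRelation = List[Tuple[str, int, int]]  # (tuple value, begin, end)


def snapshot(relation: TemporalRelation, t: int) -> Counter:
    """Multiset of tuples valid at time t, with intervals read as [begin, end)."""
    return Counter(value for value, begin, end in relation if begin <= t < end)


def snapshot_eval(
    query: Callable[..., object],
    relations: List[TemporalRelation],
    time_points: Iterable[int],
) -> Dict[int, object]:
    """Evaluate a non-temporal query on the snapshot at each time point."""
    return {t: query(*(snapshot(r, t) for r in relations)) for t in time_points}


works = [("ann", 0, 4), ("ann", 2, 6), ("bob", 1, 3)]
leave = [("ann", 3, 5)]

# Snapshot aggregation: COUNT(*) at each time point, including empty snapshots.
counts = snapshot_eval(lambda r: sum(r.values()), [works], range(0, 7))

# Snapshot bag difference: multiplicities are subtracted per snapshot.
difference = snapshot_eval(lambda r, s: r - s, [works, leave], range(0, 7))
```

Note that the count at time 6 is 0 rather than absent, and that at time 3 one of the two `ann` duplicates survives the difference. An interval-based implementation has to reproduce exactly these per-snapshot results.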

Databases