Efficient Approximation of Well-Designed SPARQL Queries


Added by Xiaowang Zhang

Publication date 2016

fields Informatics Engineering

Language English

Authors Zhenyu Song - Zhiyong Feng - Xiaowang Zhang

Databases


Query response time often influences user experience in the real world. However, answering a query with all of its exact solutions can take considerable time, especially when the query contains OPT operations, since OPT is the least conventional operator in SPARQL. It therefore becomes essential to trade off query response time against the accuracy of the solutions. In this paper, based on the depth of the OPT operations occurring in a query, we propose an approach that obtains all of its approximate queries with smaller OPT depth. We focus on queries with well-designed patterns, since the OPT operation in a well-designed pattern is truly optional. First, we transform a well-designed pattern in OPT normal form into a well-designed tree whose inner nodes are labeled by OPT operations and whose leaf nodes are labeled by patterns containing other operations, such as AND and FILTER. Second, based on this well-designed tree, we remove optional well-designed subtrees with smaller OPT depth and thus obtain approximate queries with different OPT depths. Finally, we evaluate the efficiency of the approximate queries against their degree of approximation.
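The tree-pruning idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the `Leaf` and `Opt` classes, the depth definition (one plus the deeper child), and the `truncate` rule (keep only the mandatory left side once the depth budget is used up) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    """Hypothetical leaf: an OPT-free pattern (AND/FILTER only), kept as text."""
    pattern: str


@dataclass(frozen=True)
class Opt:
    """Hypothetical inner node: (left OPT right); right is the optional part."""
    left: "Node"
    right: "Node"


Node = Union[Leaf, Opt]


def opt_depth(node: Node) -> int:
    """Nesting depth of OPT (assumed definition: 1 + the deeper child)."""
    if isinstance(node, Leaf):
        return 0
    return 1 + max(opt_depth(node.left), opt_depth(node.right))


def truncate(node: Node, k: int) -> Node:
    """Return an approximate tree whose OPT depth is at most k.

    When the budget k is exhausted, the optional right subtree is dropped
    and only the mandatory left side is kept.
    """
    if isinstance(node, Leaf):
        return node
    if k == 0:
        return truncate(node.left, 0)
    return Opt(truncate(node.left, k - 1), truncate(node.right, k - 1))


def render(node: Node) -> str:
    """Print the tree back as a pattern string."""
    if isinstance(node, Leaf):
        return node.pattern
    return f"({render(node.left)} OPT {render(node.right)})"


tree = Opt(Opt(Leaf("A"), Leaf("B")), Opt(Leaf("C"), Leaf("D")))
approximations = [render(truncate(tree, k)) for k in range(opt_depth(tree) + 1)]
```

Here `approximations` lists one candidate query per depth budget, from the cheapest (no OPT at all) up to the original pattern, which is the kind of time/accuracy ladder the abstract describes.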


PIWD: A Plugin-based Framework for Well-Designed SPARQL

Xiaowang Zhang, Zhenyu Song, Zhiyong Feng, 2016

In real-world datasets (e.g., the DBpedia query log), queries built on well-designed patterns containing only AND and OPT operators (WDAO-patterns, for short) account for a large proportion of all SPARQL queries. In this paper, we present PIWD, a plugin-based framework for all SELECT queries built on WDAO-patterns. The framework is based on a parse tree called a well-designed AND-OPT tree (WDAO-tree, for short), whose leaves are basic graph patterns (BGPs) and whose inner nodes are OPT operators. We prove that the parse tree of any WDAO-pattern can be equivalently transformed into a WDAO-tree. Based on the proposed framework, any query engine that evaluates BGPs can be employed to evaluate queries built on WDAO-patterns in a convenient way. Theoretically, the query evaluation of WDAO-patterns, like that of BGPs, can be reduced to subgraph homomorphism, since the query evaluation of BGPs is equivalent to subgraph homomorphism. Finally, our preliminary experiments on gStore and RDF-3X show that PIWD can answer all queries built on WDAO-patterns effectively and efficiently.
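The plugin idea (leaves delegated to any BGP engine, inner nodes handled as OPT) can be sketched as follows. This is an illustration under the standard SPARQL semantics of OPT as a left outer join over compatible mappings, not PIWD's actual code. The nested-tuple tree encoding and the `eval_bgp` callback are hypothetical stand-ins for a real engine such as gStore or RDF-3X.

```python
def compatible(m1, m2):
    """Two solution mappings are compatible if they agree on shared variables."""
    return all(m2[var] == val for var, val in m1.items() if var in m2)


def left_outer_join(omega1, omega2):
    """OPT semantics: extend each left mapping with every compatible right
    mapping, or keep it unchanged when no right mapping is compatible."""
    out = []
    for m1 in omega1:
        matches = [{**m1, **m2} for m2 in omega2 if compatible(m1, m2)]
        out.extend(matches if matches else [m1])
    return out


def evaluate(tree, eval_bgp):
    """Evaluate a tree whose leaves are BGPs (strings here) and whose inner
    nodes are (left, right) pairs standing for `left OPT right`.

    `eval_bgp` is the pluggable part: any function that maps a BGP to a
    list of solution mappings (dicts from variable name to value).
    """
    if isinstance(tree, str):
        return eval_bgp(tree)
    left, right = tree
    return left_outer_join(evaluate(left, eval_bgp), evaluate(right, eval_bgp))


# Toy "engine": precomputed answers for two hypothetical BGPs.
canned = {
    "P1": [{"?x": "a"}, {"?x": "b"}],
    "P2": [{"?x": "a", "?y": "1"}],
}
solutions = evaluate(("P1", "P2"), canned.__getitem__)
```

Swapping `canned.__getitem__` for a function that calls a real BGP engine is the only change needed to plug in a different backend.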

Databases

TrQuery: An Embedding-based Framework for Recommanding SPARQL Queries

Lijing Zhang, Xiaowang Zhang, Zhiyong Feng, 2018

In this paper, we present an embedding-based framework (TrQuery) for recommending solutions of a SPARQL query, including approximate solutions when exact solutions are not available due to the incompleteness or inconsistency of real-world RDF data. Within this framework, embedding is applied together with edit distance to score solutions, so that we can obtain more fine-grained recommendations than those based on edit distance alone. For instance, the graphs of two solutions with a similar structure can be distinguished in our proposed framework, whereas edit distance, which depends only on structural difference, cannot distinguish them. To this end, we propose a novel score model, built on the vector space generated by the embedding system, to compute the similarity between an approximate subgraph matching and a whole graph matching. Finally, we evaluate our approach on the large RDF datasets DBpedia and YAGO, and the experimental results show that TrQuery performs well in terms of both effectiveness and efficiency.

Databases Artificial Intelligence

MapSQ: A MapReduce-based Framework for SPARQL Queries on GPU

Jiaying Feng, Xiaowang Zhang, Zhiyong Feng, 2017

In this paper, we present a MapReduce-based framework (named MapSQ) for evaluating SPARQL queries on GPU efficiently and with high performance over large-scale RDF datasets. First, we develop a MapReduce-based join algorithm to handle SPARQL queries in parallel. Second, we present a coprocessing strategy to manage query evaluation, in which the CPU assigns subqueries and the GPU computes the join of subqueries. Finally, we implement the proposed framework and evaluate it against two popular and recent SPARQL query engines, gStore and gStoreD, on the LUBM benchmark. The experiments demonstrate that MapSQ is highly efficient and effective (up to 50% speedup).
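The general shape of a MapReduce-style join can be sketched in plain Python. This is a CPU-only illustration of the map/reduce pattern, not MapSQ's GPU algorithm. The function names, the `"L"`/`"R"` tags, and the assumption that the two inputs share exactly one variable are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def map_phase(bindings, side, join_var):
    """Map: key every solution mapping by its value for the join variable,
    and tag it with the side ("L" or "R") it came from."""
    for binding in bindings:
        yield binding[join_var], (side, binding)


def reduce_phase(pairs):
    """Reduce: group by key, then combine left and right mappings per key.
    Each key group is independent, so groups could run in parallel."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, (side, binding) in pairs:
        groups[key][side].append(binding)
    joined = []
    for group in groups.values():
        for left, right in product(group["L"], group["R"]):
            joined.append({**left, **right})
    return joined


def mr_join(left, right, join_var):
    """Join two lists of mappings that share only `join_var` (assumption)."""
    pairs = list(map_phase(left, "L", join_var))
    pairs += list(map_phase(right, "R", join_var))
    return reduce_phase(pairs)


left = [{"?x": "a", "?y": 1}, {"?x": "b", "?y": 2}]
right = [{"?x": "a", "?z": 10}, {"?x": "a", "?z": 11}, {"?x": "c", "?z": 12}]
joined = mr_join(left, right, "?x")
```

In a coprocessing setup like the one the abstract describes, the grouping step would correspond to assigning work and the per-key combination to the parallel join, though the actual division of labor in MapSQ may differ.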

Databases

GSmart: An Efficient SPARQL Query Engine Using Sparse Matrix Algebra -- Full Version

Yuedan Chen, M. Tamer Ozsu, Guoqing Xiao, 2021

Efficient execution of SPARQL queries over large RDF datasets is a topic of considerable interest due to increased use of RDF to encode data. Most of this work has followed either relational or graph-based approaches. In this paper, we propose an alternative query engine, called gSmart, based on matrix algebra. This approach can potentially better exploit the computing power of high-performance heterogeneous architectures that we target. gSmart incorporates: (1) grouped incident edge-based SPARQL query evaluation, in which all unevaluated edges of a vertex are evaluated together using a series of matrix operations to fully utilize query constraints and narrow down the solution space; (2) a graph query planner that determines the order in which vertices in query graphs should be evaluated; (3) memory- and computation-efficient data structures including the light-weight sparse matrix (LSpM) storage for RDF data and the tree-based representation for evaluation results; (4) a multi-stage data partitioner to map the incident edge-based query evaluation into heterogeneous HPC architectures and develop multi-level parallelism; and (5) a parallel executor that uses the fine-grained processing scheme, pre-pruning technique, and tree-pruning technique to lower inter-node communication and enable high throughput. Evaluations of gSmart on a CPU+GPU HPC architecture show execution time speedups of up to 46920.00x compared to the existing SPARQL query engines on a single node machine. Additionally, gSmart on the Tianhe-1A supercomputer achieves a maximum speedup of 6.90x scaling from 2 to 16 CPU+GPU nodes.

Databases Distributed Parallel and Cluster Computing

Efficient Genomic Interval Queries Using Augmented Range Trees

Chengsheng Mao, Alal Eran, Yuan Luo, 2018

Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but they cannot support efficient queries for more refined relations, such as all of Allen's relations. We design and implement a novel approach to address this unmet need. By rewriting Allen's interval relations, we transform an interval query into a range query, and then adapt and use range trees for querying. We implement two types of range trees, a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC), and compare them with the conventional interval tree (IT). Theoretical analysis shows that, among the three trees, RTFC achieves the best time complexity for interval queries regarding all of Allen's relations. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, and that RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required for interpreting genome-wide variation in large populations.
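The rewriting step, from an Allen relation to a 2-dimensional range query, can be sketched for a few relations. This is an illustration only: it treats each stored interval as a point (start, end), covers four of the 13 relations (those expressible with strict bounds), and answers the range query with a linear scan where the paper uses range trees. The function names are hypothetical.

```python
import math


def allen_to_range(relation, qs, qe):
    """Rewrite "X <relation> Q", for query interval Q = (qs, qe), as open
    bounds ((s_lo, s_hi), (e_lo, e_hi)) on the start s and end e of X."""
    inf = math.inf
    if relation == "before":    # X ends before Q starts: e < qs
        return (-inf, inf), (-inf, qs)
    if relation == "during":    # X lies strictly inside Q: qs < s, e < qe
        return (qs, inf), (-inf, qe)
    if relation == "overlaps":  # s < qs < e < qe
        return (-inf, qs), (qs, qe)
    if relation == "contains":  # Q lies strictly inside X: s < qs, qe < e
        return (-inf, qs), (qe, inf)
    raise ValueError(f"relation not covered by this sketch: {relation}")


def range_query(intervals, relation, qs, qe):
    """Answer the rewritten query with a linear scan. A range tree would
    answer the same 2D range query without visiting every interval."""
    (s_lo, s_hi), (e_lo, e_hi) = allen_to_range(relation, qs, qe)
    return [(s, e) for s, e in intervals
            if s_lo < s < s_hi and e_lo < e < e_hi]


intervals = [(1, 3), (4, 6), (2, 8), (0, 10), (5, 12)]
inside = range_query(intervals, "during", 3, 9)
```

Relations that involve equality of endpoints (meets, starts, finishes, equals) need closed or degenerate bounds and are left out of this sketch.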

Databases Data Structures and Algorithms Quantitative Methods