No Arabic abstract
The RDF graph-based data model has seen ever-broadening adoption in recent years, prompting the standardization of the SPARQL query language for RDF, and the development of local and distributed engines for processing SPARQL queries. This survey paper provides a comprehensive review of techniques, engines and benchmarks for querying RDF knowledge graphs. While other reviews on this topic tend to focus on the distributed setting, the main focus of the work is on providing a comprehensive survey of state-of-the-art storage, indexing and query processing techniques for efficiently evaluating SPARQL queries in a local setting (on one machine). To keep the survey self-contained, we also provide a short discussion on graph partitioning techniques used in the distributed setting. We conclude by discussing contemporary research challenges for further improving SPARQL query engines. An online extended version also provides a survey of over one hundred SPARQL query engines and the techniques they use, along with twelve benchmarks and their features.
Graph query languages feature mainly two kinds of queries when applied to a graph database: those inspired by relational databases which return tables such as SELECT queries and those which return graphs such as CONSTRUCT queries in SPARQL. The latter are object of study in the present paper. For this purpose, a core graph query language GrAL is defined with focus on CONSTRUCT queries. Queries in GrAL form the final step of a recursive process involving so-called GrAL patterns. By evaluating a query over a graph one gets a graph, while by evaluating a pattern over a graph one gets a set of matches which involves both a graph and a table. CONSTRUCT queries are based on CONSTRUCT patterns, and sub-CONSTRUCT patterns come for free from the recursive definition of patterns. The semantics of GrAL is based on RDF graphs with a slight modification which consists in accepting isolated nodes. Such an extension of RDF graphs eases the definition of the evaluation semantics, which is mainly captured by a unique operation called Merge. Besides, we define aggregations as part of GrAL expressions, which leads to an original local processing of aggregations.
Knowledge Graphs (KGs) have emerged as the de-facto standard for modeling and querying datasets with a graph-like structure in the Semantic Web domain. Our focus is on the performance challenges associated with querying KGs. We developed three informationally equivalent JSON-based representations for KGs, namely, Subject-based Name/Value (JSON-SNV), Documents of Triples (JSON-DT), and Chain-based Name/Value (JSON-CNV). We analyzed the effects of these representations on query performance by storing them on two prominent document-based Data Management Systems (DMSs), namely, MongoDB and Couchbase and executing a set of benchmark queries over them. We also compared the execution times with row-store Virtuoso, column-store Virtuoso, and mbox{Blazegraph} as three major DMSs with different architectures (aka, RDF-stores). Our results indicate that the representation type has a significant performance impact on query execution. For instance, the JSON-SNV outperforms others by nearly one order of magnitude to execute subject-subject join queries. This and the other results presented in this paper can assist in more accurate benchmarking of the emerging DMSs.
The recent advancements of the Semantic Web and Linked Data have changed the working of the traditional web. There is significant adoption of the Resource Description Framework (RDF) format for saving of web-based data. This massive adoption has paved the way for the development of various centralized and distributed RDF processing engines. These engines employ various mechanisms to implement critical components of the query processing engines such as data storage, indexing, language support, and query execution. All these components govern how queries are executed and can have a substantial effect on the query runtime. For example, the storage of RDF data in various ways significantly affects the data storage space required and the query runtime performance. The type of indexing approach used in RDF engines is critical for fast data lookup. The type of the underlying querying language (e.g., SPARQL or SQL) used for query execution is a crucial optimization component of the RDF storage solutions. Finally, query execution involving different join orders significantly affects the query response time. This paper provides a comprehensive review of centralized and distributed RDF engines in terms of storage, indexing, language support, and query execution.
The class of queries for detecting path is an important as those can extract implicit binary relations over the nodes of input graphs. Most of the path querying languages used by the RDF community, like property paths in W3C SPARQL 1.1 and nested regular expressions in nSPARQL are based on the regular expressions. Federated queries allow for combining graph patterns and relational database that enables the evaluations over several heterogeneous data resources within a single query. Federated queries in W3C SPARQL 1.1 currently evaluated over different SPARQL endpoints. In this paper, we present a federated path querying language as an extension of regular path querying language for supporting RDF graph integration with relational database. The federated path querying language is absolutely more expressive than nested regular expressions and negation-free property paths. Its additional expressivity can be used for capturing the conjunction and federation of nested regular path queries. Despite the increase in expressivity, we also show that federated path queries are still enjoy a low computational complexity and can be evaluated efficiently.
Finding a good query plan is key to the optimization of query runtime. This holds in particular for cost-based federation engines, which make use of cardinality estimations to achieve this goal. A number of studies compare SPARQL federation engines across different performance metrics, including query runtime, result set completeness and correctness, number of sources selected and number of requests sent. Albeit informative, these metrics are generic and unable to quantify and evaluate the accuracy of the cardinality estimators of cost-based federation engines. To thoroughly evaluate cost-based federation engines, the effect of estimated cardinality errors on the overall query runtime performance must be measured. In this paper, we address this challenge by presenting novel evaluation metrics targeted at a fine-grained benchmarking of cost-based federated SPARQL query engines. We evaluate five cost-based federated SPARQL query engines using existing as well as novel evaluation metrics by using LargeRDFBench queries. Our results provide a detailed analysis of the experimental outcomes that reveal novel insights, useful for the development of future cost-based federated SPARQL query processing engines.