ﻻ يوجد ملخص باللغة العربية
The wide use of XML for document management and data exchange has created the need to query large repositories of XML data. To efficiently query such large data collections and take advantage of parallelism, we have implemented Apache VXQuery, an open-source scalable XQuery processor. The system builds upon two other open-source frameworks -- Hyracks, a parallel execution engine, and Algebricks, a language agnostic compiler toolbox. Apache VXQuery extends these two frameworks and provides an implementation of the XQuery specifics (data model, data-model dependent functions and optimizations, and a parser). We describe the architecture of Apache VXQuery, its integration with Hyracks and Algebricks, and the XQuery optimization rules applied to the query plan to improve path expression efficiency and to enable query parallelism. An experimental evaluation using a real 500GB dataset with various selection, aggregation and join XML queries shows that Apache VXQuery performs well both in terms of scale-up and speed-up. Our experiments show that it is about 3x faster than Saxon (an open-source and commercial XQuery processor) on a 4-core, single node implementation, and around 2.5x faster than Apache MRQL (a MapReduce-based parallel query processor) on an eight (4-core) node cluster.
A practical and promising approach to parallelizing XPath queries was proposed by Bordawekar et al. in 2009, which enables parallelization on top of existing XML database engines. Although they experimentally demonstrated the speedup by their approac
The need for modern data analytics to combine relational, procedural, and map-reduce-style functional processing is widely recognized. State-of-the-art systems like Spark have added SQL front-ends and relational query optimization, which promise an i
Much like on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, leaving clusters idle
Dataframes are a popular abstraction to represent, prepare, and analyze data. Despite the remarkable success of dataframe libraries in Rand Python, dataframes face performance issues even on moderately large datasets. Moreover, there is significant a
Resource Description Framework (RDF) has been widely used to represent information on the web, while SPARQL is a standard query language to manipulate RDF data. Given a SPARQL query, there often exist many joins which are the bottlenecks of efficienc