As with on-premises systems, the natural choice for running database analytics workloads in the cloud is to provision a cluster of nodes to run a database instance. However, analytics workloads are often bursty or low volume, which leaves clusters idle much of the time, so customers pay for compute resources even when they are unused. The ability of cloud function services, such as AWS Lambda or Azure Functions, to run small, fine-grained tasks makes them appear to be a natural choice for query processing in such settings. But implementing an analytics system on cloud functions comes with its own set of challenges. These include managing hundreds of tiny, stateless, resource-constrained workers, handling stragglers, and shuffling data through opaque cloud services. In this paper we present Starling, a query execution engine built on cloud function services that employs a number of techniques to mitigate these challenges, providing interactive query latency at a lower total cost than provisioned systems with low-to-moderate utilization. In particular, on a 1TB TPC-H dataset in cloud storage, Starling is less expensive than the best provisioned systems for workloads in which queries arrive 1 minute or more apart. Starling also has lower latency than competing systems reading from cloud object stores and can scale to larger datasets.
The Resource Description Framework (RDF) has been widely used to represent information on the web, and SPARQL is the standard query language for manipulating RDF data. A SPARQL query often involves many joins, which are the main bottleneck in query processing. In addition, real RDF datasets are often highly sparse: a resource typically relates to only a few other resources even when the total number of resources is large. In this paper, we propose a sparse matrix-based (SM-based) SPARQL query processing approach over RDF datasets that considers both join optimization and data sparsity. First, we present an SM-based storage scheme for RDF datasets that improves storage efficiency by storing only valid edges, and then introduce a predicate-based hash index on this storage. Second, we develop a scalable SM-based join algorithm for SPARQL query processing. Finally, we analyze the overall cost by accumulating all intermediate results and design a query plan generation algorithm. We also extend our SM-based join algorithm to GPUs to parallelize SPARQL query processing. We have evaluated our approach against state-of-the-art RDF engines on benchmark RDF datasets, and the experimental results show that our proposal can significantly improve SPARQL query processing with high scalability.
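The storage and join ideas in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name SparsePredicateStore, the function chain_join, and the dictionary-of-rows layout are assumptions chosen for illustration. The sketch stores only existing edges, keys them by predicate (standing in for the predicate-based hash index), and joins two triple patterns that share a variable.

```python
from collections import defaultdict


class SparsePredicateStore:
    """Hypothetical sparse layout: only edges that exist are stored."""

    def __init__(self):
        # predicate -> {subject -> set of objects}.
        # The outer dict plays the role of a predicate-based hash index;
        # each inner dict is one sparse adjacency matrix in row form.
        self.rows = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        self.rows[p][s].add(o)

    def edges(self, p):
        """Yield every (subject, object) pair stored for predicate p."""
        for s, objs in self.rows.get(p, {}).items():
            for o in objs:
                yield s, o

    def objects(self, p, s):
        """Return the non-empty entries of row s in the matrix for p."""
        return self.rows.get(p, {}).get(s, set())


def chain_join(store, p1, p2):
    """Evaluate the pattern  ?x p1 ?y . ?y p2 ?z  by row lookups only."""
    return sorted(
        (x, y, z)
        for x, y in store.edges(p1)
        for z in store.objects(p2, y)
    )


store = SparsePredicateStore()
store.add("alice", "knows", "bob")
store.add("alice", "knows", "carol")
store.add("bob", "worksAt", "acme")
store.add("carol", "worksAt", "initech")
store.add("dave", "worksAt", "acme")

result = chain_join(store, "knows", "worksAt")
```

Because the join only touches rows that actually contain entries, its cost follows the number of stored edges rather than the total number of resources, which is the property the sparsity argument relies on.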
Efficient execution of SPARQL queries over large RDF datasets is a topic of considerable interest owing to the increased use of RDF to encode data. Most of this work has followed either relational or graph-based approaches. In this paper, we propose an alternative query engine, called gSmart, based on matrix algebra. This approach can potentially better exploit the computing power of the high-performance heterogeneous architectures that we target. gSmart incorporates: (1) grouped incident edge-based SPARQL query evaluation, in which all unevaluated edges of a vertex are evaluated together using a series of matrix operations to fully utilize query constraints and narrow down the solution space; (2) a graph query planner that determines the order in which vertices in query graphs should be evaluated; (3) memory- and computation-efficient data structures, including the light-weight sparse matrix (LSpM) storage for RDF data and the tree-based representation for evaluation results; (4) a multi-stage data partitioner to map the incident edge-based query evaluation onto heterogeneous HPC architectures and develop multi-level parallelism; and (5) a parallel executor that uses a fine-grained processing scheme, a pre-pruning technique, and a tree-pruning technique to lower inter-node communication and enable high throughput. Evaluations of gSmart on a CPU+GPU HPC architecture show execution time speedups of up to 46920.00x compared to existing SPARQL query engines on a single-node machine. Additionally, gSmart on the Tianhe-1A supercomputer achieves a maximum speedup of 6.90x when scaling from 2 to 16 CPU+GPU nodes.
Facility location queries identify the best locations to set up new facilities that provide service to their users. Most existing work in this space assumes that user locations are static. This assumption is too restrictive for planning many modern real-life services, such as fuel stations, ATMs, convenience stores, and cellphone base stations, that are widely accessed by mobile users. The placement of such services should, therefore, factor in the mobility patterns or trajectories of the users rather than simply their static locations. In this work, we introduce the TOPS (Trajectory-Aware Optimal Placement of Services) query, which locates the best k sites on a road network. The aim is to optimize a wide class of objective functions defined over the user trajectories. We show that the problem is NP-hard and that even the greedy heuristic, with an approximation bound of (1-1/e), fails to scale on urban-scale datasets. To overcome this challenge, we develop a multi-resolution clustering-based indexing framework called NetClus. Empirical studies on real road network trajectory datasets show that NetClus offers solutions that are comparable in quality to those of the greedy heuristic, while having practical response times and low memory footprints. Additionally, the NetClus framework can absorb dynamic updates in mobility patterns and handle constraints such as site costs, capacity, and existing services, thereby providing an effective solution for modern urban-scale scenarios.
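The greedy heuristic mentioned in this abstract can be sketched as follows, assuming the simplest objective in the class it describes: the number of trajectories that pass at least one chosen site. The function name greedy_placement, the coverage dictionary, and the site and trajectory identifiers are hypothetical, and this is the baseline heuristic only, not NetClus.

```python
def greedy_placement(coverage, k):
    """Pick up to k sites, each time taking the largest marginal gain.

    coverage maps a candidate site to the set of trajectory ids it serves
    (an assumed input format). Returns (chosen sites, covered trajectories).
    """
    chosen = []
    covered = set()
    candidates = sorted(coverage)  # fixed order makes tie-breaking deterministic
    for _ in range(k):
        best_site, best_gain = None, 0
        for site in candidates:
            if site in chosen:
                continue
            # Marginal gain: trajectories this site adds beyond those covered.
            gain = len(coverage[site] - covered)
            if gain > best_gain:
                best_site, best_gain = site, gain
        if best_site is None:
            break  # no remaining site covers a new trajectory
        chosen.append(best_site)
        covered |= coverage[best_site]
    return chosen, covered


coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
sites, served = greedy_placement(coverage, 2)
```

Each of the k rounds rescans every candidate site against every trajectory it serves, which suggests why this baseline becomes expensive when both sets are large.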
Finding patterns in data and retrieving information from those patterns is an important task in information retrieval. Complex search requirements that cannot be met by simple string matching, and that require exploring patterns in the data, call for a query engine that supports searching via structured queries. In this article, we build a structured query engine that supports searching data through structured queries along the lines of ElasticSearch. We show how we achieved real-time indexing and retrieval of data through a RESTful API, and how complex queries can be created and processed using efficient data structures that we designed for storing the data in a structured way. Finally, we conclude with an example of a movie recommendation system built on top of this query engine.
Spreadsheets are end-user programs and domain models that are heavily employed in administration, financial forecasting, education, and science because of their intuitive, flexible, and direct approach to computation. As a result, institutions are swamped by millions of spreadsheets that are becoming increasingly difficult to manage, access, and control. This note presents the XLSearch system, a novel search engine for spreadsheets. It indexes spreadsheet formulae and efficiently answers formula queries via unification (a complex query language that allows metavariables in both the query and the index). But a web-based search engine is only one application of the underlying technology: spreadsheet formula export to web standards like MathML, combined with formula indexing, can be used to find similar spreadsheets or common formula errors.
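To make the idea of unification with metavariables on both sides concrete, here is a minimal sketch of textbook first-order unification over formula trees. It is not the XLSearch implementation: the tuple encoding of formulae, the "?" prefix for metavariables, and the function names are assumptions made for illustration.

```python
def is_var(t):
    """Metavariables are written as strings starting with '?' (assumed)."""
    return isinstance(t, str) and t.startswith("?")


def walk(t, subst):
    """Follow variable bindings until reaching a non-bound term."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t


def occurs(v, t, subst):
    """True if variable v appears inside term t under subst."""
    t = walk(t, subst)
    if t == v:
        return True
    if isinstance(t, tuple):
        return any(occurs(v, a, subst) for a in t)
    return False


def unify(a, b, subst=None):
    """Return a substitution making a and b equal, or None if none exists."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return None if occurs(a, b, subst) else {**subst, a: b}
    if is_var(b):
        return None if occurs(b, a, subst) else {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None


# Query: SUM(?x, MUL(?x, B1)); indexed formula: SUM(A1, MUL(A1, ?y)).
query = ("SUM", "?x", ("MUL", "?x", "B1"))
indexed = ("SUM", "A1", ("MUL", "A1", "?y"))
match = unify(query, indexed)
```

Here the query binds its metavariable to a cell of the indexed formula while the indexed formula's own metavariable is bound to a cell named in the query, which is what distinguishes unification from one-sided pattern matching.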