We investigate trade-offs in static and dynamic evaluation of hierarchical queries with arbitrary free variables. In the static setting, the trade-off is between the time to partially compute the query result and the delay needed to enumerate its tuples. In the dynamic setting, we additionally consider the time needed to update the query result in the presence of single-tuple inserts and deletes to the input database. Our approach observes the degree of values in the database and uses different computation and maintenance strategies for high-degree and low-degree values. For the latter it partially computes the result, while for the former it computes enough information to allow for on-the-fly enumeration. The main result of this work defines the preprocessing time, the update time, and the enumeration delay as functions of the light/heavy threshold and of the factorization width of the hierarchical query. By choosing this threshold appropriately, our approach can recover a number of prior results when restricted to hierarchical queries. For a restricted class of hierarchical queries, our approach can achieve worst-case optimal update time and enumeration delay conditioned on the Online Matrix-Vector Multiplication Conjecture.
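To make the degree-based strategy concrete, the following Python sketch shows how values of a join column can be split into heavy and light parts by a threshold; this is only an illustration of the partitioning idea, not the paper's algorithm, and the function and relation names are hypothetical.

```python
from collections import defaultdict

def partition_by_degree(relation, column, threshold):
    """Split the values of `column` into heavy (degree > threshold)
    and light (degree <= threshold) sets.
    Illustrative helper only, not the paper's algorithm."""
    degree = defaultdict(int)
    for tup in relation:
        degree[tup[column]] += 1
    heavy = {v for v, d in degree.items() if d > threshold}
    light = set(degree) - heavy
    return heavy, light

# Example: with a threshold around sqrt(N), heavy values are few but dense
# and light values are many but sparse, so the two parts can be handled
# with different materialization and maintenance strategies.
R = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 2, "b": 4}]
heavy, light = partition_by_degree(R, "a", threshold=1)
```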
We study fundamental issues involved in top-k query evaluation in probabilistic databases. We consider simple probabilistic databases, in which probabilities are associated with individual tuples, and general probabilistic databases, in which, additionally, exclusivity relationships between tuples can be represented. In contrast to other recent research in this area, we do not limit ourselves to injective scoring functions. We formulate three intuitive postulates that the semantics of top-k queries in probabilistic databases should satisfy, and introduce a new semantics, Global-Topk, that satisfies those postulates to a large degree. We also show how to evaluate queries under the Global-Topk semantics. For simple databases we design dynamic-programming-based algorithms, and for general databases we show polynomial-time reductions to the simple cases. For example, we demonstrate that for a fixed k the time complexity of top-k query evaluation is as low as linear, under the assumption that probabilistic databases are simple and scoring functions are injective.
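As a rough illustration of a dynamic-programming formulation for the simple, injective case (a sketch under those assumptions, not necessarily the paper's exact algorithm), the probability that a tuple belongs to the top-k answer can be computed by tracking the distribution over how many higher-scored tuples appear:

```python
def global_topk_probabilities(tuples, k):
    """tuples: list of (score, prob) with distinct scores (injective scoring).
    Returns, per tuple, the probability that it belongs to the top-k answer.
    Minimal DP sketch; names are illustrative."""
    ordered = sorted(tuples, key=lambda t: t[0], reverse=True)
    q = [1.0]  # q[j] = probability that exactly j of the tuples seen so far appear
    result = []
    for score, p in ordered:
        # the tuple is in the top-k iff it appears and fewer than k
        # higher-scored tuples appear
        result.append((score, p * sum(q[:k])))
        # extend the DP table with the current tuple
        q = [(q[j] if j < len(q) else 0.0) * (1 - p)
             + (q[j - 1] * p if j >= 1 else 0.0)
             for j in range(len(q) + 1)]
    return result

# Example: three independent tuples, k = 2
probs = global_topk_probabilities([(30, 0.9), (20, 0.5), (10, 0.4)], k=2)
```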
Traditional indexing techniques commonly employed in database systems perform poorly on multidimensional array scientific data. Bitmap indices are widely used in commercial databases for processing complex queries, owing to their effective use of bit-wise operations and their space efficiency. However, bitmap indices natively apply only to relational or linearized datasets, a limitation that is especially notable for binned or compressed indices. We propose a new method for multidimensional array indexing that overcomes these dimensionality-induced inefficiencies. The hierarchical indexing method is based on $n$-dimensional sparse trees for dimension partitioning, with a bounded number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on range queries involving both dimensions and attributes, as it prunes the search space early, avoids reading the entire index data, and performs at most a single index traversal. Moreover, the indexing is easily extensible to membership queries. The indexing method was implemented on top of the state-of-the-art bitmap indexing library FastBit. We show that the hierarchical bitmap index outperforms conventional bitmap indexing built on an auxiliary attribute for each dimension. Furthermore, the adaptive binning significantly reduces the number of bins and thereby the memory requirements.
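The sketch below illustrates adaptively binned bitmap indexing over a single attribute, with a range query answered by OR-ing the bitmaps of overlapping bins; it is only an illustration of the binning idea and does not use FastBit's API or the paper's hierarchical $n$-dimensional structure.

```python
import numpy as np

def build_binned_bitmap_index(values, n_bins):
    """Equi-depth ('adaptive') binning: each bin covers roughly the same
    number of values. Returns bin edges and one boolean bitmap per bin.
    Illustrative sketch only."""
    edges = np.quantile(values, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, values, side="right") - 1, 0, n_bins - 1)
    bitmaps = [(bin_ids == b) for b in range(n_bins)]
    return edges, bitmaps

def range_query(edges, bitmaps, lo, hi):
    """Candidate rows whose value may fall in [lo, hi]: OR the bitmaps of
    all bins overlapping the range (boundary bins would still need a
    candidate check against the raw data)."""
    mask = np.zeros_like(bitmaps[0])
    for b in range(len(bitmaps)):
        if edges[b + 1] >= lo and edges[b] <= hi:
            mask |= bitmaps[b]
    return mask

values = np.random.rand(1000)
edges, bitmaps = build_binned_bitmap_index(values, n_bins=16)
hits = range_query(edges, bitmaps, 0.2, 0.4)
```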
Computing the shortest path between two given locations in a road network is an important problem that finds applications in various map services and commercial navigation products. The state-of-the-art solutions for the problem can be divided into two categories: spatial-coherence-based methods and vertex-importance-based approaches. The two categories of techniques, however, have not been compared systematically under the same experimental framework, as they were developed from two independent lines of research that do not refer to each other. This renders it difficult for a practitioner to decide which technique should be adopted for a specific application. Furthermore, the experimental evaluation of the existing techniques, as presented in previous work, falls short in several aspects. Some methods were tested only on small road networks with up to one hundred thousand vertices; some approaches were evaluated using distance queries (instead of shortest path queries), namely, queries that ask only for the length of the shortest path; a state-of-the-art technique was examined based on a faulty implementation that led to incorrect query results. To address the above issues, this paper presents a comprehensive comparison of the most advanced spatial-coherence-based and vertex-importance-based approaches. Using a variety of real road networks with up to twenty million vertices, we evaluated each technique in terms of its preprocessing time, space consumption, and query efficiency (for both shortest path and distance queries). Our experimental results reveal the characteristics of different techniques, based on which we provide guidelines on selecting appropriate methods for various scenarios.
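To make the distinction between distance queries and shortest path queries concrete, here is a plain textbook Dijkstra sketch (not one of the surveyed preprocessing-based techniques); a distance query needs only the returned length, while a shortest path query also needs the path reconstructed from the predecessor map. The graph representation is an assumed adjacency dictionary.

```python
import heapq

def dijkstra(graph, source, target):
    """graph: dict mapping vertex -> list of (neighbor, weight).
    Returns (distance, path). Plain Dijkstra, for illustration only."""
    dist = {source: 0.0}
    pred = {}
    pq = [(0.0, source)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == target:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                pred[v] = u
                heapq.heappush(pq, (nd, v))
    if target not in dist:
        return float("inf"), []
    # reconstruct the path: this extra step is what distinguishes a
    # shortest path query from a distance query
    path, v = [target], target
    while v != source:
        v = pred[v]
        path.append(v)
    return dist[target], path[::-1]
```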
This tutorial gives an overview of the principles behind recent work on training and maintaining machine learning models over relational data, with an emphasis on exploiting the relational structure of the data to improve the runtime performance of the learning task. The tutorial has the following parts:
1) Database research for data science
2) Three main ideas to achieve performance improvements
  2.1) Turn the ML problem into a DB problem
  2.2) Exploit the structure of the data and the problem
  2.3) Exploit the engineering tools of a DB researcher
3) Avenues for future research
In order to achieve near-real-time insights, scientific workflows tend to be organized in a flexible and dynamic way. Data-driven triggering of tasks has been explored as a way to support workflows that evolve based on the data. However, the overhead introduced by such dynamic triggering of tasks is an under-studied topic. This paper discusses different facets of dynamic task triggers. In particular, we explore different ways of constructing a data-driven dynamic workflow and then evaluate the overheads introduced by these design decisions. We evaluate workflows with varying data size, percentage of interesting data, temporal data distribution, and number of tasks triggered. Finally, based on an analysis of the evaluation results, we provide advice for users looking to construct data-driven scientific workflows.
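As a minimal sketch of the data-driven triggering pattern (all names are hypothetical and this is not the paper's system), a predicate over newly produced data decides whether a downstream task is launched, and every launch contributes triggering overhead of the kind the evaluation measures:

```python
def data_driven_trigger(stream, is_interesting, launch_task):
    """Scan newly produced data items and launch a downstream task for each
    item satisfying the trigger predicate. Each launch incurs scheduling
    overhead. Hypothetical names; illustrative sketch only."""
    triggered = 0
    for item in stream:
        if is_interesting(item):
            launch_task(item)
            triggered += 1
    return triggered

# Example: vary the fraction of 'interesting' records to study overhead
records = [{"value": v} for v in range(100)]
count = data_driven_trigger(
    records,
    is_interesting=lambda r: r["value"] % 10 == 0,  # 10% of the data triggers
    launch_task=lambda r: None,                     # stand-in for a task launch
)
```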