No Arabic abstract
Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allens interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient query for more refined relations such as all Allens relations. We design and implement a novel approach to address this unmet need. Through rewriting Allens interval relations, we transform an interval query to a range query, then adapt and utilize the range trees for querying. We implement two types of range trees: a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC) and compare them with the conventional interval tree (IT). Theoretical analysis shows that RTFC can achieve the best time complexity for interval queries regarding all Allens relations among the three trees. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allens relations, RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allens relations between genomic intervals, such as those required by interpreting genome-wide variation in large populations.
The range, segment and rectangle query problems are fundamental problems in computational geometry, and have extensive applications in many domains. Despite the significant theoretical work on these problems, efficient implementations can be complicated. We know of very few practical implementations of the algorithms in parallel, and most implementations do not have tight theoretical bounds. We focus on simple and efficient parallel algorithms and implementations for these queries, which have tight worst-case bound in theory and good parallel performance in practice. We propose to use a simple framework (the augmented map) to model the problem. Based on the augmented map interface, we develop both multi-level tree structures and sweepline algorithms supporting range, segment and rectangle queries in two dimensions. For the sweepline algorithms, we propose a parallel paradigm and show corresponding cost bounds. All of our data structures are work-efficient to build in theory and achieve a low parallel depth. The query time is almost linear to the output size. We have implemented all the data structures described in the paper using a parallel augmented map library. Based on the library each data structure only requires about 100 lines of C++ code. We test their performance on large data sets (up to $10^8$ elements) and a machine with 72-cores (144 hyperthreads). The parallel construction achieves 32-68x speedup. Speedup numbers on queries are up to 126-fold. Our sequential implementation outperforms the CGAL library by at least 2x in both construction and queries. Our sequential implementation can be slightly slower than the R-tree in the Boost library in some cases (0.6-2.5x), but has significantly better query performance (1.6-1400x) than Boost.
Traditional indexing techniques commonly employed in da-ta-ba-se systems perform poorly on multidimensional array scientific data. Bitmap indices are widely used in commercial databases for processing complex queries, due to their effective use of bit-wise operations and space-efficiency. However, bitmap indices apply natively to relational or linearized datasets, which is especially notable in binned or compressed indices. We propose a new method for multidimensional array indexing that overcomes the dimensionality-induced inefficiencies. The hierarchical indexing method is based on $n$-di-men-sional sparse trees for dimension partitioning, with bound number of individual, adaptively binned indices for attribute partitioning. This indexing performs well on range involving both dimensions and attributes, as it prunes the search space early, avoids reading entire index data, and does at most a single index traversal. Moreover, the indexing is easily extensible to membership queries. The indexing method was implemented on top of a state of the art bitmap indexing library Fastbit. We show that the hierarchical bitmap index outperforms conventional bitmap indexing built on auxiliary attribute for each dimension. Furthermore, the adaptive binning significantly reduces the amount of bins and therefore memory requirements.
Learning a regression function using censored or interval-valued output data is an important problem in fields such as genomics and medicine. The goal is to learn a real-valued prediction function, and the training output labels indicate an interval of possible values. Whereas most existing algorithms for this task are linear models, in this paper we investigate learning nonlinear tree models. We propose to learn a tree by minimizing a margin-based discriminative objective function, and we provide a dynamic programming algorithm for computing the optimal solution in log-linear time. We show empirically that this algorithm achieves state-of-the-art speed and prediction accuracy in a benchmark of several data sets.
Graphs are widely used to model data in many application domains. Thanks to the wide spread use of GPS-enabled devices, many applications assign a spatial attribute to graph vertices (e.g., geo-tagged social media). Users may issue a Reachability Query with Spatial Range Predicate (abbr. RangeReach). RangeReach finds whether an input vertex can reach any spatial vertex that lies within an input spatial range. An example of a RangeReach query is: Given a social graph, find whether Alice can reach any of the venues located within the geographical area of Arizona State University. The paper proposes GeoReach an approach that adds spatial data awareness to a graph database management system (GDBMS). GeoReach allows efficient execution of RangeReach queries, yet without compromising a lot on the overall system scalability (measured in terms of storage size and initialization/maintenance time). To achieve that, GeoReach is equipped with a light-weight data structure, namely SPA-Graph, that augments the underlying graph data with spatial indexing directories. When a RangeReach query is issued, the system employs a pruned-graph traversal approach. Experiments based on real system implementation inside Neo4j proves that GEOREACH exhibits up to two orders of magnitude better query response time and up to four times less storage than the state-of-the-art spatial and reachability indexing approaches.
We consider the task of enumerating and counting answers to $k$-ary conjunctive queries against relational databases that may be updated by inserting or deleting tuples. We exhibit a new notion of q-hierarchical conjunctive queries and show that these can be maintained efficiently in the following sense. During a linear time preprocessing phase, we can build a data structure that enables constant delay enumeration of the query results; and when the database is updated, we can update the data structure and restart the enumeration phase within constant time. For the special case of self-join free conjunctive queries we obtain a dichotomy: if a query is not q-hierarchical, then query enumeration with sublinear$^ast$ delay and sublinear update time (and arbitrary preprocessing time) is impossible. For answering Boolean conjunctive queries and for the more general problem of counting the number of solutions of k-ary queries we obtain complete dichotomies: if the querys homomorphic core is q-hierarchical, then size of the the query result can be computed in linear time and maintained with constant update time. Otherwise, the size of the query result cannot be maintained with sublinear update time. All our lower bounds rely on the OMv-conjecture, a conjecture on the hardness of online matrix-vector multiplication that has recently emerged in the field of fine-grained complexity to characterise the hardness of dynamic problems. The lower bound for the counting problem additionally relies on the orthogonal vectors conjecture, which in turn is implied by the strong exponential time hypothesis. $^ast)$ By sublinear we mean $O(n^{1-varepsilon})$ for some $varepsilon>0$, where $n$ is the size of the active domain of the current database.