The range, segment and rectangle query problems are fundamental problems in computational geometry, and have extensive applications in many domains. Despite the significant theoretical work on these problems, efficient implementations can be complicated. We know of very few practical parallel implementations of these algorithms, and most implementations do not have tight theoretical bounds. We focus on simple and efficient parallel algorithms and implementations for these queries, which have tight worst-case bounds in theory and good parallel performance in practice. We propose to use a simple framework (the augmented map) to model the problem. Based on the augmented map interface, we develop both multi-level tree structures and sweepline algorithms supporting range, segment and rectangle queries in two dimensions. For the sweepline algorithms, we propose a parallel paradigm and show corresponding cost bounds. All of our data structures are work-efficient to build in theory and achieve a low parallel depth. The query time is almost linear in the output size. We have implemented all the data structures described in the paper using a parallel augmented map library. Using the library, each data structure requires only about 100 lines of C++ code. We test their performance on large data sets (up to $10^8$ elements) and a machine with 72 cores (144 hyperthreads). The parallel construction achieves 32-68x speedup. Speedup numbers on queries are up to 126-fold. Our sequential implementation outperforms the CGAL library by at least 2x in both construction and queries. Our sequential implementation can be slightly slower than the R-tree in the Boost library in some cases (0.6-2.5x), but has significantly better query performance (1.6-1400x) than Boost.
Efficient large-scale annotation of genomic intervals is essential for personal genome interpretation in the realm of precision medicine. There are 13 possible relations between two intervals according to Allen's interval algebra. Conventional interval trees are routinely used to identify the genomic intervals satisfying a coarse relation with a query interval, but cannot support efficient queries for more refined relations such as all of Allen's relations. We design and implement a novel approach to address this unmet need. By rewriting Allen's interval relations, we transform an interval query into a range query, then adapt and utilize range trees for querying. We implement two types of range trees, a basic 2-dimensional range tree (2D-RT) and an augmented range tree with fractional cascading (RTFC), and compare them with the conventional interval tree (IT). Theoretical analysis shows that, among the three trees, RTFC achieves the best time complexity for interval queries regarding all of Allen's relations. We also perform comparative experiments on the efficiency of RTFC, 2D-RT and IT in querying noncoding element annotations in a large collection of personal genomes. Our experimental results show that 2D-RT is more efficient than IT for interval queries regarding most of Allen's relations, and that RTFC is even more efficient than 2D-RT. The results demonstrate that RTFC is an efficient data structure for querying large-scale datasets regarding Allen's relations between genomic intervals, such as those required for interpreting genome-wide variation in large populations.
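The rewriting step can be illustrated with a minimal sketch. It assumes each stored interval is treated as a 2D point (start, end) with start < end, and that a relation "x REL q" becomes a pair of open bounds on those two coordinates. Only four of the 13 relations are shown, and the names `allen_to_range` and `range_query` are hypothetical. The query below uses a sorted list plus a linear filter on the end coordinate as a stand-in; it is not a range tree and not the 2D-RT or RTFC structure from the abstract.

```python
from bisect import bisect_left, bisect_right
import math


def allen_to_range(relation, qs, qe):
    """Rewrite 'x REL q' as exclusive bounds ((s_lo, s_hi), (e_lo, e_hi))
    on the start s and end e of a stored interval x, for query q = (qs, qe)."""
    inf = math.inf
    if relation == "during":    # qs < s and e < qe
        return (qs, inf), (-inf, qe)
    if relation == "contains":  # s < qs and qe < e
        return (-inf, qs), (qe, inf)
    if relation == "overlaps":  # s < qs < e < qe
        return (-inf, qs), (qs, qe)
    if relation == "before":    # e < qs (s is then unconstrained)
        return (-inf, inf), (-inf, qs)
    raise ValueError(f"relation not covered by this sketch: {relation}")


def range_query(sorted_intervals, s_range, e_range):
    """Report intervals whose (start, end) point lies in the open rectangle.
    Stand-in only: binary search on start, then a linear filter on end."""
    starts = [s for s, _ in sorted_intervals]
    lo = bisect_right(starts, s_range[0])  # first start strictly above s_lo
    hi = bisect_left(starts, s_range[1])   # first start not below s_hi
    return [(s, e) for s, e in sorted_intervals[lo:hi]
            if e_range[0] < e < e_range[1]]


intervals = sorted([(1, 5), (2, 3), (4, 8), (0, 10), (6, 7)])
query = (3, 8)
during = range_query(intervals, *allen_to_range("during", *query))
overlaps = range_query(intervals, *allen_to_range("overlaps", *query))
contains = range_query(intervals, *allen_to_range("contains", *query))
```

With the sample data, (2, 3) is excluded from "overlaps" because its end equals the query start, which is Allen's "meets" relation, and (4, 8) is excluded from "during" because it shares the query's end ("finishes"). Keeping all bounds strict is what separates the refined relations from one another.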
Ordered (key-value) maps are an important and widely-used data type for large-scale data processing frameworks. Beyond simple search, insertion and deletion, more advanced operations such as range extraction, filtering, and bulk updates form a critical part of these frameworks. We describe an interface for ordered maps that is augmented to support fast range queries and sums, and introduce a parallel and concurrent library called PAM (Parallel Augmented Maps) that implements the interface. The interface includes a wide variety of functions on augmented maps, ranging from basic insertion and deletion to more interesting functions such as union, intersection, filtering, extracting ranges, splitting, and range-sums. We describe algorithms for these functions that are efficient both in theory and in practice. As examples of the use of the interface and the performance of PAM, we apply the library to four applications: simple range sums, interval trees, 2D range trees, and ranked word index searching. The interface greatly simplifies the implementation of these data structures compared with direct implementations. Sequentially, the code achieves performance that matches or exceeds existing libraries designed specifically for a single application, and in parallel our implementation achieves speedups ranging from 40 to 90 on 72 cores with 2-way hyperthreading.
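The idea of augmenting an ordered map for range sums can be sketched as follows. This is a sequential, static toy in Python, not PAM's C++ interface: every name here (`Node`, `build`, `range_sum`) is hypothetical, and the augmentation is fixed to addition, whereas an augmented map in general allows any associative combine function. Each node caches the sum of the values in its subtree, so a range sum touches only the nodes along two root-to-leaf paths.

```python
class Node:
    """Tree node whose `aug` field caches the sum of values in its subtree."""
    def __init__(self, key, val, left=None, right=None):
        self.key, self.val, self.left, self.right = key, val, left, right
        self.aug = val + aug_of(left) + aug_of(right)


def aug_of(node):
    return node.aug if node is not None else 0


def build(items):
    """Build a balanced tree from a list of (key, value) pairs sorted by key."""
    if not items:
        return None
    mid = len(items) // 2
    key, val = items[mid]
    return Node(key, val, build(items[:mid]), build(items[mid + 1:]))


def sum_ge(node, lo):
    """Sum of values whose key is >= lo."""
    if node is None:
        return 0
    if node.key < lo:
        return sum_ge(node.right, lo)
    # The whole right subtree qualifies, so use its cached sum.
    return sum_ge(node.left, lo) + node.val + aug_of(node.right)


def sum_le(node, hi):
    """Sum of values whose key is <= hi."""
    if node is None:
        return 0
    if node.key > hi:
        return sum_le(node.left, hi)
    # The whole left subtree qualifies, so use its cached sum.
    return aug_of(node.left) + node.val + sum_le(node.right, hi)


def range_sum(node, lo, hi):
    """Sum of values with lo <= key <= hi, in time proportional to tree height."""
    if node is None:
        return 0
    if node.key < lo:
        return range_sum(node.right, lo, hi)
    if node.key > hi:
        return range_sum(node.left, lo, hi)
    return sum_ge(node.left, lo) + node.val + sum_le(node.right, hi)


root = build([(1, 10), (3, 30), (5, 50), (7, 70), (9, 90)])
total_3_to_7 = range_sum(root, 3, 7)  # 30 + 50 + 70
```

Because the query combines cached subtree sums instead of subtracting prefix sums, the same shape works for combine functions that have no inverse, such as max.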
In this paper, we first consider the subpath convex hull query problem: Given a simple path $\pi$ of $n$ vertices, preprocess it so that the convex hull of any query subpath of $\pi$ can be quickly obtained. Previously, Guibas, Hershberger, and Snoeyink [SODA 90] proposed a data structure of $O(n)$ space and $O(\log n\log\log n)$ query time; reducing the query time to $O(\log n)$ increases the space to $O(n\log\log n)$. We present an improved result that uses $O(n)$ space while achieving $O(\log n)$ query time. Like the previous work, our query algorithm returns a compact interval tree representing the convex hull so that standard binary-search-based queries on the hull can be performed in $O(\log n)$ time each. Our new result leads to improvements for several other problems. In particular, with the help of the above result, we present new algorithms for the ray-shooting problem among segments. Given a set of $n$ (possibly intersecting) line segments in the plane, preprocess it so that the first segment hit by a query ray can be quickly found. We give a data structure of $O(n\log n)$ space that can answer each query in $O(\sqrt{n}\log n)$ time. If the segments are nonintersecting or if the segments are lines, then the space can be reduced to $O(n)$. All these are classical problems that have been studied extensively. Previously, data structures of $\widetilde{O}(\sqrt{n})$ query time (the notation $\widetilde{O}$ suppresses a polylogarithmic factor) were known in the early 1990s; nearly no progress has been made for over two decades. For all problems, our results provide improvements by reducing the space of the data structures by at least a logarithmic factor while the preprocessing and query times are the same as before or even better.
Let $\mathcal{P}$ be a polygonal domain of $h$ holes and $n$ vertices. We study the problem of constructing a data structure that can compute a shortest path between $s$ and $t$ in $\mathcal{P}$ under the $L_1$ metric for any two query points $s$ and $t$. To do so, a standard approach is to first find a set of $n_s$ gateways for $s$ and a set of $n_t$ gateways for $t$ such that there exists a shortest $s$-$t$ path containing a gateway of $s$ and a gateway of $t$, and then compute a shortest $s$-$t$ path using these gateways. Previous algorithms all take quadratic $O(n_s\cdot n_t)$ time to solve this problem. In this paper, we propose a divide-and-conquer technique that solves the problem in $O(n_s + n_t \log n_s)$ time. As a consequence, we construct a data structure of $O(n+(h^2\log^3 h/\log\log h))$ size in $O(n+(h^2\log^4 h/\log\log h))$ time such that each query can be answered in $O(\log n)$ time.
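For orientation, the quadratic baseline mentioned above can be sketched as a plain double loop over gateway pairs. This shows only the naive $O(n_s\cdot n_t)$ combination step, not the divide-and-conquer technique of the paper. The function name and all three inputs are hypothetical: `dist_s[i]` is assumed to be the distance from $s$ to its $i$-th gateway, `dist_t[j]` the distance from the $j$-th gateway of $t$ to $t$, and `gate_dist[i][j]` a precomputed shortest-path distance between the two gateways. The numbers are made up for illustration.

```python
import math


def best_gateway_pair(dist_s, dist_t, gate_dist):
    """Try every (gateway of s, gateway of t) pair and keep the cheapest.
    Returns (length, i, j); runs in O(n_s * n_t) time."""
    best = (math.inf, None, None)
    for i, ds in enumerate(dist_s):
        for j, dt in enumerate(dist_t):
            total = ds + gate_dist[i][j] + dt
            if total < best[0]:
                best = (total, i, j)
    return best


# Made-up example with n_s = 2 and n_t = 3 gateways.
dist_s = [2, 5]
dist_t = [4, 1, 3]
gate_dist = [[7, 9, 3],
             [1, 6, 2]]
length, i, j = best_gateway_pair(dist_s, dist_t, gate_dist)
```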
Drawing network maps automatically comprises two challenging steps, namely laying out the map and placing non-overlapping labels. In this paper we tackle the problem of labeling an already existing network map considering the application of metro maps. We present a flexible and versatile labeling model. Despite its simplicity, we prove that it is NP-complete to label a single line of the network. For a restricted variant of that model, we then introduce an efficient algorithm that optimally labels a single line with respect to a given weighting function. Based on that algorithm, we present a general and sophisticated workflow for multiple metro lines, which is experimentally evaluated on real-world metro maps.