We study two fundamental problems dealing with curves in the plane, namely, the nearest-neighbor problem and the center problem. Let $\mathcal{C}$ be a set of $n$ polygonal curves, each of size $m$. In the nearest-neighbor problem, the goal is to construct a compact data structure over $\mathcal{C}$, such that, given a query curve $Q$, one can efficiently find the curve in $\mathcal{C}$ closest to $Q$. In the center problem, the goal is to find a curve $Q$, such that the maximum distance between $Q$ and the curves in $\mathcal{C}$ is minimized. We use the well-known discrete Fréchet distance function, both under~$L_\infty$ and under $L_2$, to measure the distance between two curves. For the nearest-neighbor problem, despite discouraging previous results, we identify two important cases for which it is possible to obtain practical bounds, even when $m$ and $n$ are large. In these cases, either $Q$ is a line segment or $\mathcal{C}$ consists of line segments, and the bounds on the size of the data structure and query time are nearly linear in the size of the input and query curve, respectively. The returned answer is either exact under $L_\infty$, or approximated to within a factor of $1+\varepsilon$ under~$L_2$. We also consider the variants in which the location of the input curves is only fixed up to translation, and obtain similar bounds, under $L_\infty$. As for the center problem, we study the case where the center is a line segment, i.e., we seek the line segment that represents the given set as well as possible. We present near-linear time exact algorithms under $L_\infty$, even when the location of the input curves is only fixed up to translation. Under $L_2$, we present a roughly $O(n^2m^3)$-time exact algorithm.
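As an illustration of the distance measure used above, the following is a minimal Python sketch of the standard dynamic-programming computation of the discrete Fréchet distance between two point sequences, under either $L_\infty$ or $L_2$. It is not the data structure or the algorithms of the paper; the function name `discrete_frechet` and the `norm` parameter are assumptions made for this example.

```python
import math


def discrete_frechet(P, Q, norm="l2"):
    """Discrete Frechet distance between two non-empty point sequences.

    P and Q are sequences of equal-dimension coordinate tuples.
    norm is "l2" (Euclidean) or "linf" (maximum coordinate difference).
    """
    if norm == "l2":
        point_dist = math.dist
    elif norm == "linf":
        def point_dist(a, b):
            return max(abs(x - y) for x, y in zip(a, b))
    else:
        raise ValueError("norm must be 'l2' or 'linf'")

    prev = []  # DP row for the previous point of P
    for i, p in enumerate(P):
        cur = []
        for j, q in enumerate(Q):
            d = point_dist(p, q)
            if i == 0 and j == 0:
                reach = 0.0                      # start of both curves
            elif i == 0:
                reach = cur[j - 1]               # only Q can advance
            elif j == 0:
                reach = prev[0]                  # only P can advance
            else:
                # cheapest way to arrive: advance P, advance both, or advance Q
                reach = min(prev[j], prev[j - 1], cur[j - 1])
            cur.append(max(d, reach))
        prev = cur
    return prev[-1]
```

The table has $|P|\cdot|Q|$ entries, so a single distance computation takes quadratic time; keeping only two rows reduces the memory to $O(|Q|)$.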
In the $(1+\varepsilon,r)$-approximate near-neighbor problem for curves (ANNC) under some distance measure $\delta$, the goal is to construct a data structure for a given set $\mathcal{C}$ of curves that supports approximate near-neighbor queries: Given a query curve $Q$, if there exists a curve $C\in\mathcal{C}$ such that $\delta(Q,C)\le r$, then return a curve $C\in\mathcal{C}$ with $\delta(Q,C)\le(1+\varepsilon)r$. There exists an efficient reduction from the $(1+\varepsilon)$-approximate nearest-neighbor problem to ANNC, where in the former problem the answer to a query is a curve $C\in\mathcal{C}$ with $\delta(Q,C)\le(1+\varepsilon)\cdot\delta(Q,C^*)$, where $C^*$ is the curve of $\mathcal{C}$ closest to $Q$. Given a set $\mathcal{C}$ of $n$ curves, each consisting of $m$ points in $d$ dimensions, we construct a data structure for ANNC that uses $n\cdot O(\frac{1}{\varepsilon})^{md}$ storage space and has $O(md)$ query time (for a query curve of length $m$), where the similarity between two curves is their discrete Fréchet or dynamic time warping distance. Our method is simple to implement, deterministic, and results in an exponential improvement in both query time and storage space compared to all previous bounds. Further, we also consider the asymmetric version of ANNC, where the length of the query curves is $k \ll m$, and obtain essentially the same storage and query bounds as above, except that $m$ is replaced by $k$. Finally, we apply our method to a version of approximate range counting for curves and achieve similar bounds.
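To make the query contract above concrete, here is a minimal Python sketch of a linear-scan baseline that satisfies the $(1+\varepsilon,r)$ guarantee, using dynamic time warping as the distance $\delta$. It is a reference for the semantics only, not the data structure described in the abstract: it scans all $n$ curves and so does not achieve the $O(md)$ query time. The names `dtw` and `near_neighbor_query` are assumptions made for this example.

```python
import math


def dtw(P, Q):
    """Dynamic time warping distance between two non-empty point sequences.

    Same alignment structure as the discrete Frechet distance, but the
    matched point distances are summed instead of maximized.
    """
    inf = math.inf
    prev = [inf] * (len(Q) + 1)
    prev[0] = 0.0  # empty prefix aligned with empty prefix
    for p in P:
        cur = [inf] * (len(Q) + 1)
        for j, q in enumerate(Q, start=1):
            cur[j] = math.dist(p, q) + min(prev[j], prev[j - 1], cur[j - 1])
        prev = cur
    return prev[-1]


def near_neighbor_query(curves, Q, r, eps, delta=dtw):
    """Linear-scan reference for the (1 + eps, r) near-neighbor contract.

    If some curve lies within distance r of Q, the returned curve is
    guaranteed to lie within (1 + eps) * r. If no curve lies within r,
    the result may be None or any curve within (1 + eps) * r.
    """
    for C in curves:
        if delta(Q, C) <= (1 + eps) * r:
            return C
    return None
```

Returning `None` when every curve is farther than $(1+\varepsilon)r$ is one reasonable convention; the problem statement only constrains the answer when a curve within distance $r$ exists.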
Non-parametric neural language models (NLMs) learn predictive distributions of text using an external datastore, which allows them to learn by explicitly memorizing the training datapoints. While effective, these models often require retrieval from a large datastore at test time, significantly increasing the inference overhead and thus limiting the deployment of non-parametric NLMs in practical applications. In this paper, we take the recently proposed $k$-nearest neighbors language model (Khandelwal et al., 2019) as an example, exploring methods to improve its efficiency along various dimensions. Experiments on the standard WikiText-103 benchmark and domain-adaptation datasets show that our methods achieve up to a 6x inference speed-up while retaining comparable performance. The empirical analysis we present may provide guidelines for future research seeking to develop or deploy more efficient non-parametric NLMs.
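For readers unfamiliar with the $k$-nearest neighbors language model, the sketch below shows its prediction step as described by Khandelwal et al. (2019): the retrieved neighbors define a distribution proportional to $\exp(-d)$, aggregated per target token, which is then interpolated with the parametric model's distribution. The retrieval itself, which is the costly part the abstract is concerned with, is assumed to have already happened. The function name `knn_lm_probs`, the dictionary-based representation, and the default value of `lam` are assumptions made for this example.

```python
import math
from collections import defaultdict


def knn_lm_probs(p_lm, neighbors, lam=0.25):
    """Interpolate a parametric LM distribution with a kNN distribution.

    p_lm:      dict mapping token -> probability under the parametric LM.
    neighbors: non-empty list of (distance, target_token) pairs retrieved
               from the datastore for the current context.
    lam:       interpolation weight given to the kNN distribution.
    """
    # Softmax over negative distances; subtracting the minimum distance
    # avoids underflow without changing the normalized weights.
    d_min = min(d for d, _ in neighbors)
    weights = [math.exp(-(d - d_min)) for d, _ in neighbors]
    total = sum(weights)

    # Neighbors that share a target token pool their probability mass.
    p_knn = defaultdict(float)
    for w, (_, token) in zip(weights, neighbors):
        p_knn[token] += w / total

    vocab = set(p_lm) | set(p_knn)
    return {
        t: lam * p_knn.get(t, 0.0) + (1.0 - lam) * p_lm.get(t, 0.0)
        for t in vocab
    }
```

Every prediction requires a fresh set of `neighbors`, which is why the size of the datastore and the cost of each lookup dominate inference time.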
We show new applications of the nearest-neighbor chain algorithm, a technique that originated in agglomerative hierarchical clustering. We apply it to a diverse class of geometric problems: we construct the greedy multi-fragment tour for Euclidean TSP in $O(n\log n)$ time in any fixed dimension and for Steiner TSP in planar graphs in $O(n\sqrt{n}\log n)$ time; we compute motorcycle graphs (which are a central part in straight skeleton algorithms) in $O(n^{4/3+\varepsilon})$ time for any $\varepsilon>0$; we introduce a narcissistic variant of the $k$-attribute stable matching model, and solve it in $O(n^{2-4/(k(1+\varepsilon)+2)})$ time; we give a linear-time $2$-approximation for a 1D geometric set cover problem with applications to radio station placement.
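The nearest-neighbor chain technique is easiest to see in its original clustering setting. The sketch below is a minimal Python version for complete-linkage agglomerative clustering, not one of the geometric applications listed above. It follows nearest neighbors along a stack until two clusters are mutual nearest neighbors, merges them, and continues from the remainder of the stack. The function name `nn_chain_clustering` and the quadratic-size distance table are assumptions made for this example.

```python
import math


def nn_chain_clustering(points):
    """Complete-linkage clustering via nearest-neighbor chains.

    Returns a list of merges (cluster_a, cluster_b, distance). Input points
    get ids 0..n-1; each merge creates the next unused id.
    """
    n = len(points)
    # dist[a][b] holds the linkage distance between active clusters a and b.
    dist = {
        i: {j: math.dist(points[i], points[j]) for j in range(n) if j != i}
        for i in range(n)
    }
    active = set(range(n))
    merges, chain, next_id = [], [], n

    while len(active) > 1:
        if not chain:
            chain.append(min(active))  # any active cluster may start a chain
        top = chain[-1]
        prev = chain[-2] if len(chain) > 1 else None
        # Nearest active cluster to top; ties prefer the chain predecessor.
        nearest = min(dist[top], key=lambda c: (dist[top][c], c != prev, c))
        if nearest != prev:
            chain.append(nearest)
            continue

        # top and prev are mutual nearest neighbors: merge them.
        chain.pop()
        chain.pop()
        merges.append((min(top, prev), max(top, prev), dist[top][prev]))
        active -= {top, prev}
        new, next_id = next_id, next_id + 1
        dist[new] = {}
        for c in active:
            # Complete linkage: distance to a merged cluster is the maximum.
            d = max(dist[top][c], dist[prev][c])
            dist[new][c] = dist[c][new] = d
            del dist[c][top], dist[c][prev]
        del dist[top], dist[prev]
        active.add(new)
    return merges
```

Merges are not necessarily reported in increasing order of distance. For a reducible linkage such as complete linkage, sorting them by distance recovers the order in which the greedy algorithm would have performed them, which is the property the technique relies on.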
In 2014, we proposed the Nearest Descent (ND) method, which generates an efficient graph called the in-tree (IT). Thanks to several useful features, this IT structure is well suited to data clustering. Although IT contains some redundant edges, these usually have salient features and are therefore not hard to remove. Subsequently, to prevent the seemingly redundant edges from occurring at all, we proposed Nearest Neighbor Descent (NND), which adds a neighborhood constraint to ND. As a result, clusters emerge automatically, with no need to remove redundant edges. However, NND is still not perfect, since it introduces a new and worse problem: over-partitioning. In this paper, we propose a method called Hierarchical Nearest Neighbor Descent (H-NND), which overcomes the over-partitioning problem of NND by using a hierarchical strategy. Specifically, H-NND uses ND to merge the over-segmented sub-graphs or clusters that NND produces. Like ND, H-NND generates the IT structure, in which redundant edges appear once again. This seemingly returns us to the situation that ND faces. However, compared with ND, the redundant edges in the IT structure generated by H-NND are generally more salient, and are thus easier to identify reliably, even with the simplest edge-removal method, which uses edge length as its only measure. In other words, the IT structure constructed by H-NND is better suited to data clustering. We demonstrate this on several clustering datasets of varying shapes, dimensions and attributes. In addition, H-NND generally takes less computation time than ND to construct the IT data structure for the input data.
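The following Python sketch illustrates the general idea of an in-tree followed by length-based edge removal. It rests on assumptions the abstract does not spell out: each point receives a potential from an exponential kernel, and each point is linked to its nearest point of lower potential. The exact definitions used by ND, NND and H-NND may differ, and the names `in_tree`, `cut_longest_edges` and `sigma` are hypothetical.

```python
import math


def in_tree(points, sigma=1.0):
    """Build an in-tree: parent[i] is the nearest point of lower potential.

    Assumed potential: minus the sum of exp(-distance / sigma) over all
    points, so points in dense regions have low potential. Ties in
    potential are broken by index, which rules out cycles.
    """
    n = len(points)
    potential = [
        -sum(math.exp(-math.dist(p, q) / sigma) for q in points)
        for p in points
    ]
    parent = [None] * n
    for i in range(n):
        lower = [j for j in range(n)
                 if (potential[j], j) < (potential[i], i)]
        if lower:  # the point of globally lowest potential is the root
            parent[i] = min(lower,
                            key=lambda j: math.dist(points[i], points[j]))
    return parent


def cut_longest_edges(points, parent, num_clusters):
    """Remove the num_clusters - 1 longest edges; label points by tree root."""
    parent = list(parent)
    edges = sorted(
        (i for i, p in enumerate(parent) if p is not None),
        key=lambda i: math.dist(points[i], points[parent[i]]),
        reverse=True,
    )
    for i in edges[:num_clusters - 1]:
        parent[i] = None  # point i becomes the root of its own tree

    def root(i):
        while parent[i] is not None:
            i = parent[i]
        return i

    return [root(i) for i in range(len(points))]
```

In this toy formulation, edges that join different dense regions tend to be much longer than edges within a region, which is the kind of saliency that makes a length-only removal rule workable.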
The use of summary statistics beyond the two-point correlation function to analyze the non-Gaussian clustering on small scales is an active field of research in cosmology. In this paper, we explore a set of new summary statistics: the $k$-Nearest Neighbor Cumulative Distribution Functions ($k{\rm NN}$-${\rm CDF}$). This is the empirical cumulative distribution function of distances from a set of volume-filling, Poisson distributed random points to the $k$-nearest data points, and is sensitive to all connected $N$-point correlations in the data. The $k{\rm NN}$-${\rm CDF}$ can be used to measure counts-in-cells, void probability distributions and higher $N$-point correlation functions, all using the same formalism exploiting fast searches with spatial tree data structures. We demonstrate how it can be computed efficiently from various data sets, both for discrete points and in its generalization to continuous fields. We use data from a large suite of $N$-body simulations to explore the sensitivity of this new statistic to various cosmological parameters, compared to the two-point correlation function, while using the same range of scales. We demonstrate that the use of $k{\rm NN}$-${\rm CDF}$ improves the constraints on the cosmological parameters by more than a factor of $2$ when applied to the clustering of dark matter in the range of scales between $10h^{-1}{\rm Mpc}$ and $40h^{-1}{\rm Mpc}$. We also show that the relative improvement is even greater when applied on the same scales to the clustering of halos in the simulations at a fixed number density, both in real space and in redshift space. Since the $k{\rm NN}$-${\rm CDF}$ are sensitive to all higher order connected correlation functions in the data, the gains over traditional two-point analyses are expected to grow as progressively smaller scales are included in the analysis of cosmological data.
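The definition of the statistic can be sketched in a few lines of Python: for each random query point, take the distance to its $k$-th nearest data point, then form the empirical CDF of those distances. This sketch uses a brute-force search for clarity, where the abstract relies on spatial tree data structures for speed, and it ignores periodic boundaries. The function names `knn_distances` and `empirical_cdf` are assumptions made for this example.

```python
import bisect
import math


def knn_distances(data, queries, k):
    """Sorted distances from each query point to its k-th nearest data point.

    Brute force: O(len(queries) * len(data) * log(len(data))) time.
    A spatial tree would replace the inner sort in practice.
    """
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    return sorted(
        sorted(math.dist(q, p) for p in data)[k - 1]
        for q in queries
    )


def empirical_cdf(sorted_dists, r):
    """Fraction of query points whose k-th nearest data point lies within r."""
    return bisect.bisect_right(sorted_dists, r) / len(sorted_dists)
```

With `k=1`, the quantity `1 - empirical_cdf(dists, r)` estimates the probability that a sphere of radius `r` around a random point contains no data points, which is how the void probability arises from the same computation.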