
Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

Published by C. Seshadhri
Publication date: 2012
Research field: Informatics Engineering
Research language: English





Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $\delta > 0$, given streaming access to an array of length $n$ provides a $(1+\delta)$-multiplicative approximation to the \emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((\log^2 n)/\delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $\delta n$ for any $\delta > 0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by \cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$\delta n$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(\log^2 n)/\delta)$, where $k$ is the maximum number of times any one symbol appears in $y$.
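For orientation, the quantities named in the abstract can be computed exactly with a short single-pass routine. The sketch below is not the paper's polylogarithmic-space algorithm: it is the standard exact method, whose memory can grow up to the length of the LIS. It assumes "increasing" means strictly increasing, and the function names are illustrative choices, not taken from the paper. The second function uses the textbook reduction from LCS to LIS, which reads $x$ as a stream and looks up positions in $y$; the reduced sequence has length at most $kn$, which suggests why $k$ appears in the space bound.

```python
from bisect import bisect_left
from collections import defaultdict


def lis_length(stream):
    """Exact length of the longest strictly increasing subsequence.

    Makes one left-to-right pass, but is NOT space efficient:
    `tails` can grow as large as the LIS itself.
    """
    # tails[j] = smallest possible last value of an increasing
    # subsequence of length j + 1 among the items seen so far.
    tails = []
    for value in stream:
        j = bisect_left(tails, value)
        if j == len(tails):
            tails.append(value)   # extends the longest subsequence so far
        else:
            tails[j] = value      # found a smaller tail for length j + 1
    return len(tails)


def distance_to_monotonicity(array):
    """n minus the LIS length: items to delete to leave an increasing array."""
    return len(array) - lis_length(array)


def lcs_length_via_lis(x, y):
    """LCS(x, y) via reduction to LIS: stream over x, random access to y."""
    positions = defaultdict(list)
    for i, symbol in enumerate(y):
        positions[symbol].append(i)

    def matches():
        for symbol in x:
            # Emit matching positions of y in descending order, so a strictly
            # increasing subsequence uses at most one match per symbol of x.
            yield from reversed(positions.get(symbol, ()))

    return lis_length(matches())


def indel_distance(x, y):
    """E(x, y) = n - LCS(x, y) for equal-length strings (no substitutions)."""
    assert len(x) == len(y)
    return len(x) - lcs_length_via_lis(x, y)
```

For example, `distance_to_monotonicity([1, 5, 2, 3, 4])` is 1, since deleting the 5 leaves an increasing array.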




Read also

Xin Li, Yu Zheng, 2021
In this paper, we study edit distance (ED) and longest common subsequence (LCS) in the asymmetric streaming model, introduced by Saks and Seshadhri [SS13]. As an intermediate model between the random access model and the streaming model, this model allows one to have streaming access to one string and random access to the other string. Our first main contribution is a systematic study of space lower bounds for ED and LCS in the asymmetric streaming model. Previously, there are no explicitly stated results in this context, although some lower bounds about LCS can be inferred from the lower bounds for longest increasing subsequence (LIS) in [SW07][GG10][EJ08]. Yet these bounds only work for large alphabet size. In this paper, we develop several new techniques to handle ED in general and LCS for small alphabet size, thus establishing strong lower bounds for both problems. In particular, our lower bound for ED provides an exponential separation between edit distance and Hamming distance in the asymmetric streaming model. Our lower bounds also extend to LIS and longest non-decreasing sequence (LNS) in the standard streaming model. Together with previous results, our bounds provide an almost complete picture for these two problems. As our second main contribution, we give improved algorithms for ED and LCS in the asymmetric streaming model. For ED, we improve the space complexity of the constant factor approximation algorithms in [FHRS20][CJLZ20] from $\tilde{O}(\frac{n^\delta}{\delta})$ to $O(\frac{d^\delta}{\delta}\;\mathsf{polylog}(n))$, where $n$ is the length of each string and $d$ is the edit distance between the two strings. For LCS, we give the first $1/2+\epsilon$ approximation algorithm with space $n^{\delta}$ for any constant $\delta>0$, over a binary alphabet.
We study the problem of estimating the edit distance between two $n$-character strings. While exact computation in the worst case is believed to require near-quadratic time, previous work showed that in certain regimes it is possible to solve the following \emph{gap edit distance} problem in sub-linear time: distinguish between inputs of distance $\le k$ and $>k^2$. Our main result is a very simple algorithm for this benchmark that runs in time $\tilde O(n/\sqrt{k})$, and in particular settles the open problem of obtaining a truly sublinear time for the entire range of relevant $k$. Building on the same framework, we also obtain a $k$-vs-$k^2$ algorithm for the one-sided preprocessing model with $\tilde O(n)$ preprocessing time and $\tilde O(n/k)$ query time (improving over a recent $\tilde O(n/k+k^2)$-query time algorithm for the same problem [GRS20]).
A (1 + eps)-approximate distance oracle for a graph is a data structure that supports approximate point-to-point shortest-path-distance queries. The most relevant measures for a distance-oracle construction are: space, query time, and preprocessing time. There are strong distance-oracle constructions known for planar graphs (Thorup, JACM04) and, subsequently, minor-excluded graphs (Abraham and Gavoille, PODC06). However, these require Omega(eps^{-1} n lg n) space for n-node graphs. We argue that a very low space requirement is essential. Since modern computer architectures involve hierarchical memory (caches, primary memory, secondary memory), a high memory requirement in effect may greatly increase the actual running time. Moreover, we would like data structures that can be deployed on small mobile devices, such as handhelds, which have relatively small primary memory. In this paper, for planar graphs, bounded-genus graphs, and minor-excluded graphs we give distance-oracle constructions that require only O(n) space. The big O hides only a fixed constant, independent of epsilon and independent of genus or size of an excluded minor. The preprocessing times for our distance oracle are also faster than those for the previously known constructions. For planar graphs, the preprocessing time is O(n lg^2 n). However, our constructions have slower query times. For planar graphs, the query time is O(eps^{-2} lg^2 n). For our linear-space results, we can in fact ensure, for any delta > 0, that the space required is only 1 + delta times the space required just to represent the graph itself.
Efficiently computing a robust measure of similarity or dissimilarity between graphs is a major challenge in Pattern Recognition. The Graph Edit Distance (GED) is a flexible measure of dissimilarity between graphs which arises in error-tolerant graph matching. It is defined from an optimal sequence of edit operations (edit path) transforming one graph into another. Unfortunately, the exact computation of this measure is NP-hard. In the last decade, several approaches have been proposed to approximate the GED in polynomial time, mainly by solving linear programming problems. Among them, the bipartite GED has received much attention. It is deduced from a linear sum assignment of the nodes of the two graphs, which can be efficiently computed by Hungarian-type algorithms. However, edit operations on nodes and edges are not handled simultaneously, which limits the accuracy of the approximation. To overcome this limitation, we propose to extend the linear assignment model to a quadratic one, for directed or undirected graphs having labeled nodes and edges. This is realized through the definition of a family of edit paths induced by assignments between nodes. We formally show that the GED, restricted to the paths in this family, is equivalent to a quadratic assignment problem. Since this problem is NP-hard, we propose to compute an approximate solution by an adaptation of the Integer Projected Fixed Point method. Experiments show that the proposed approach is generally able to reach a more accurate approximation of the optimal GED than the bipartite GED, with a computational cost that is still affordable for graphs of nontrivial sizes.
We give cell-probe bounds for the computation of edit distance, Hamming distance, convolution and longest common subsequence in a stream. In this model, a fixed string of $n$ symbols is given and one $\delta$-bit symbol arrives at a time in a stream. After each symbol arrives, the distance between the fixed string and a suffix of most recent symbols of the stream is reported. The cell-probe model is perhaps the strongest model of computation for showing data structure lower bounds, subsuming in particular the popular word-RAM model.
* We first give an $\Omega((\delta \log n)/(w+\log\log n))$ lower bound for the time to give each output for both online Hamming distance and convolution, where $w$ is the word size. This bound relies on a new encoding scheme and for the first time holds even when $w$ is as small as a single bit.
* We then consider the online edit distance and longest common subsequence problems in the bit-probe model ($w=1$) with a constant sized input alphabet. We give a lower bound of $\Omega(\sqrt{\log n}/(\log\log n)^{3/2})$ which applies for both problems. This second set of results relies both on our new encoding scheme as well as a carefully constructed hard distribution.
* Finally, for the online edit distance problem we show that there is an $O((\log n)^2/w)$ upper bound in the cell-probe model. This bound gives a contrast to our new lower bound and also establishes an exponential gap between the known cell-probe and RAM model complexities.