Lower Bounds and Improved Algorithms for Asymmetric Streaming Edit Distance and Longest Common Subsequence

155 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Yu Zheng

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Xin Li - Yu Zheng

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

In this paper, we study edit distance (ED) and longest common subsequence (LCS) in the asymmetric streaming model, introduced by Saks and Seshadhri [SS13]. As an intermediate model between the random access model and the streaming model, this model allows one to have streaming access to one string and random access to the other string. Our first main contribution is a systematic study of space lower bounds for ED and LCS in the asymmetric streaming model. Previously, there are no explicitly stated results in this context, although some lower bounds about LCS can be inferred from the lower bounds for longest increasing subsequence (LIS) in [SW07][GG10][EJ08]. Yet these bounds only work for large alphabet size. In this paper, we develop several new techniques to handle ED in general and LCS for small alphabet size, thus establishing strong lower bounds for both problems. In particular, our lower bound for ED provides an exponential separation between edit distance and Hamming distance in the asymmetric streaming model. Our lower bounds also extend to LIS and longest non-decreasing sequence (LNS) in the standard streaming model. Together with previous results, our bounds provide an almost complete picture for these two problems. As our second main contribution, we give improved algorithms for ED and LCS in the asymmetric streaming model. For ED, we improve the space complexity of the constant factor approximation algorithms in [FHRS20][CJLZ20] from $tilde{O}(frac{n^delta}{delta})$ to $O(frac{d^delta}{delta};mathsf{polylog}(n))$, where $n$ is the length of each string and $d$ is the edit distance between the two strings. For LCS, we give the first $1/2+epsilon$ approximation algorithm with space $n^{delta}$ for any constant $delta>0$, over a binary alphabet.

قيم البحث

214 - Michael Saks , C. Seshadhri 2012

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $delta > 0$, given streaming access to an array of length $n$ provides a $(1+delta)$-multiplicative approximation to the emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((log^2 n)/delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $delta n$ for any $delta >0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$delta n$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(log^2 n)/delta)$, where $k$ is the maximum number of times any one symbol appears in $y$.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Heuristic algorithms for the Longest Filled Common Subsequence Problem

121 - Radu Stefan Mincu University ofn Bucharest 2019

At CPM 2017, Castelli et al. define and study a new variant of the Longest Common Subsequence Problem, termed the Longest Filled Common Subsequence Problem (LFCS). For the LFCS problem, the input consists of two strings $A$ and $B$ and a multiset of characters $mathcal{M}$. The goal is to insert the characters from $mathcal{M}$ into the string $B$, thus obtaining a new string $B^*$, such that the Longest Common Subsequence (LCS) between $A$ and $B^*$ is maximized. Casteli et al. show that the problem is NP-hard and provide a 3/5-approximation algorithm for the problem. In this paper we study the problem from the experimental point of view. We introduce, implement and test new heuristic algorithms and compare them with the approximation algorithm of Casteli et al. Moreover, we introduce an Integer Linear Program (ILP) model for the problem and we use the state of the art ILP solver, Gurobi, to obtain exact solution for moderate sized instances.

بنى وهياكل البيانات والخوارزميات

Variants of Constrained Longest Common Subsequence

351 - Paola Bonizzoni , Gianluca Della Vedova , Riccardo Dondi 2009

In this work, we consider a variant of the classical Longest Common Subsequence problem called Doubly-Constrained Longest Common Subsequence (DC-LCS). Given two strings s1 and s2 over an alphabet A, a set C_s of strings, and a function Co from A to N , the DC-LCS problem consists in finding the longest subsequence s of s1 and s2 such that s is a supersequence of all the strings in Cs and such that the number of occurrences in s of each symbol a in A is upper bounded by Co(a). The DC-LCS problem provides a clear mathematical formulation of a sequence comparison problem in Computational Biology and generalizes two other constrained variants of the LCS problem: the Constrained LCS and the Repetition-Free LCS. We present two results for the DC-LCS problem. First, we illustrate a fixed-parameter algorithm where the parameter is the length of the solution. Secondly, we prove a parameterized hardness result for the Constrained LCS problem when the parameter is the number of the constraint strings and the size of the alphabet A. This hardness result also implies the parameterized hardness of the DC-LCS problem (with the same parameters) and its NP-hardness when the size of the alphabet is constant.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

Polynomial Pass Lower Bounds for Graph Streaming Algorithms

90 - Sepehr Assadi , Yu Chen , Sanjeev Khanna 2019

We present new lower bounds that show that a polynomial number of passes are necessary for solving some fundamental graph problems in the streaming model of computation. For instance, we show that any streaming algorithm that finds a weighted minimum $s$-$t$ cut in an $n$-vertex undirected graph requires $n^{2-o(1)}$ space unless it makes $n^{Omega(1)}$ passes over the stream. To prove our lower bounds, we introduce and analyze a new four-player communication problem that we refer to as the hidden-pointer chasing problem. This is a problem in spirit of the standard pointer chasing problem with the key difference that the pointers in this problem are hidden to players and finding each one of them requires solving another communication problem, namely the set intersection problem. Our lower bounds for graph problems are then obtained by reductions from the hidden-pointer chasing problem. Our hidden-pointer chasing problem appears flexible enough to find other applications and is therefore interesting in its own right. To showcase this, we further present an interesting application of this problem beyond streaming algorithms. Using a reduction from hidden-pointer chasing, we prove that any algorithm for submodular function minimization needs to make $n^{2-o(1)}$ value queries to the function unless it has a polynomial degree of adaptivity.

بنى وهياكل البيانات والخوارزميات

A Linear-Time $n^{0.4}$-Approximation for Longest Common Subsequence

106 - Karl Bringmann , Vincent Cohen-Addad , 2021

We consider the classic problem of computing the Longest Common Subsequence (LCS) of two strings of length $n$. While a simple quadratic algorithm has been known for the problem for more than 40 years, no faster algorithm has been found despite an ex tensive effort. The lack of progress on the problem has recently been explained by Abboud, Backurs, and Vassilevska Williams [FOCS15] and Bringmann and Kunnemann [FOCS15] who proved that there is no subquadratic algorithm unless the Strong Exponential Time Hypothesis fails. This has led the community to look for subquadratic approximation algorithms for the problem. Yet, unlike the edit distance problem for which a constant-factor approximation in almost-linear time is known, very little progress has been made on LCS, making it a notoriously difficult problem also in the realm of approximation. For the general setting, only a naive $O(n^{varepsilon/2})$-approximation algorithm with running time $tilde{O}(n^{2-varepsilon})$ has been known, for any constant $0 < varepsilon le 1$. Recently, a breakthrough result by Hajiaghayi, Seddighin, Seddighin, and Sun [SODA19] provided a linear-time algorithm that yields a $O(n^{0.497956})$-approximation in expectation; improving upon the naive $O(sqrt{n})$-approximation for the first time. In this paper, we provide an algorithm that in time $O(n^{2-varepsilon})$ computes an $tilde{O}(n^{2varepsilon/5})$-approximation with high probability, for any $0 < varepsilon le 1$. Our result (1) gives an $tilde{O}(n^{0.4})$-approximation in linear time, improving upon the bound of Hajiaghayi, Seddighin, Seddighin, and Sun, (2) provides an algorithm whose approximation scales with any subquadratic running time $O(n^{2-varepsilon})$, improving upon the naive bound of $O(n^{varepsilon/2})$ for any $varepsilon$, and (3) instead of only in expectation, succeeds with high probability.

بنى وهياكل البيانات والخوارزميات

سجل دخول لتتمكن من نشر تعليقات