Approximating Edit Distance Within Constant Factor in Truly Sub-Quadratic Time

59 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Diptarka Chakraborty

تاريخ النشر 2018

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Diptarka Chakraborty - Debarati Das - Elazar Goldenberg

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer, and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within an approximation factor $text{poly}(log n)$. In this paper, we provide an algorithm with running time $tilde{O}(n^{2-2/7})$ that approximates the edit distance within a constant factor.

قيم البحث

126 - Joshua Brakensiek , Aviad Rubinstein 2019

We show that the edit distance between two strings of length $n$ can be computed within a factor of $f(epsilon)$ in $n^{1+epsilon}$ time as long as the edit distance is at least $n^{1-delta}$ for some $delta(epsilon) > 0$.

بنى وهياكل البيانات والخوارزميات

A Quadratic Assignment Formulation of the Graph Edit Distance

140 - Sebastien Bougleux , Luc Brun , Vincenzo Carletti 2015

Computing efficiently a robust measure of similarity or dissimilarity between graphs is a major challenge in Pattern Recognition. The Graph Edit Distance (GED) is a flexible measure of dissimilarity between graphs which arises in error-tolerant graph matching. It is defined from an optimal sequence of edit operations (edit path) transforming one graph into an other. Unfortunately, the exact computation of this measure is NP-hard. In the last decade, several approaches have been proposed to approximate the GED in polynomial time, mainly by solving linear programming problems. Among them, the bipartite GED has received much attention. It is deduced from a linear sum assignment of the nodes of the two graphs, which can be efficiently computed by Hungarian-type algorithms. However, edit operations on nodes and edges are not handled simultaneously, which limits the accuracy of the approximation. To overcome this limitation, we propose to extend the linear assignment model to a quadratic one, for directed or undirected graphs having labelized nodes and edges. This is realized through the definition of a family of edit paths induced by assignments between nodes. We formally show that the GED, restricted to the paths in this family, is equivalent to a quadratic assignment problem. Since this problem is NP-hard, we propose to compute an approximate solution by an adaptation of the Integer Projected Fixed Point method. Experiments show that the proposed approach is generally able to reach a more accurate approximation of the optimal GED than the bipartite GED, with a computational cost that is still affordable for graphs of non trivial sizes.

بنى وهياكل البيانات والخوارزميات

Simulating Population Protocols in Sub-Constant Time per Interaction

61 - Petra Berenbrink , David Hammer , Dominik Kaaser 2020

We consider the problem of efficiently simulating population protocols. In the population model, we are given a distributed system of $n$ agents modeled as identical finite-state machines. In each time step, a pair of agents is selected uniformly at random to interact. In an interaction, agents update their states according to a common transition function. We empirically and analytically analyze two classes of simulators for this model. First, we consider sequential simulators executing one interaction after the other. Key to the performance of these simulators is the data structure storing the agents states. For our analysis, we consider plain arrays, binary search trees, and a novel Dynamic Alias Table data structure. Secondly, we consider batch processing to efficiently update the states of multiple independent agents in one step. For many protocols considered in literature, our simulator requires amortized sub-constant time per interaction and is fast in practice: given a fixed time budget, the implementation of our batched simulator is able to simulate population protocols several orders of magnitude larger compared to the sequential competitors, and can carry out $2^{50}$ interactions among the same number of agents in less than 400s.

بنى وهياكل البيانات والخوارزميات

A Simple Sublinear Algorithm for Gap Edit Distance

339 - Joshua Brakensiek , Moses Charikar , Aviad Rubinstein 2020

We study the problem of estimating the edit distance between two $n$-character strings. While exact computation in the worst case is believed to require near-quadratic time, previous work showed that in certain regimes it is possible to solve the fol lowing {em gap edit distance} problem in sub-linear time: distinguish between inputs of distance $le k$ and $>k^2$. Our main result is a very simple algorithm for this benchmark that runs in time $tilde O(n/sqrt{k})$, and in particular settles the open problem of obtaining a truly sublinear time for the entire range of relevant $k$. Building on the same framework, we also obtain a $k$-vs-$k^2$ algorithm for the one-sided preprocessing model with $tilde O(n)$ preprocessing time and $tilde O(n/k)$ query time (improving over a recent $tilde O(n/k+k^2)$-query time algorithm for the same problem [GRS20].

بنى وهياكل البيانات والخوارزميات

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

164 - Michael Saks , C. Seshadhri 2012

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $delta > 0$, given streaming access to an array of length $n$ provides a $(1+delta)$-multiplicative approximation to the emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((log^2 n)/delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $delta n$ for any $delta >0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$delta n$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(log^2 n)/delta)$, where $k$ is the maximum number of times any one symbol appears in $y$.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

سجل دخول لتتمكن من نشر تعليقات