A Simple Sublinear Algorithm for Gap Edit Distance

340 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Joshua Brakensiek

تاريخ النشر 2020

مجال البحث الهندسة المعلوماتية

والبحث باللغة English

تأليف Joshua Brakensiek - Moses Charikar - Aviad Rubinstein

بنى وهياكل البيانات والخوارزميات

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

We study the problem of estimating the edit distance between two $n$-character strings. While exact computation in the worst case is believed to require near-quadratic time, previous work showed that in certain regimes it is possible to solve the following {em gap edit distance} problem in sub-linear time: distinguish between inputs of distance $le k$ and $>k^2$. Our main result is a very simple algorithm for this benchmark that runs in time $tilde O(n/sqrt{k})$, and in particular settles the open problem of obtaining a truly sublinear time for the entire range of relevant $k$. Building on the same framework, we also obtain a $k$-vs-$k^2$ algorithm for the one-sided preprocessing model with $tilde O(n)$ preprocessing time and $tilde O(n/k)$ query time (improving over a recent $tilde O(n/k+k^2)$-query time algorithm for the same problem [GRS20].

قيم البحث

235 - Sepehr Assadi , Michael Kapralov , Sanjeev Khanna 2018

In the subgraph counting problem, we are given a input graph $G(V, E)$ and a target graph $H$; the goal is to estimate the number of occurrences of $H$ in $G$. Our focus here is on designing sublinear-time algorithms for approximately counting occurr ences of $H$ in $G$ in the setting where the algorithm is given query access to $G$. This problem has been studied in several recent papers which primarily focused on specific families of graphs $H$ such as triangles, cliques, and stars. However, not much is known about approximate counting of arbitrary graphs $H$. This is in sharp contrast to the closely related subgraph enumeration problem that has received significant attention in the database community as the database join problem. The AGM bound shows that the maximum number of occurrences of any arbitrary subgraph $H$ in a graph $G$ with $m$ edges is $O(m^{rho(H)})$, where $rho(H)$ is the fractional edge-cover of $H$, and enumeration algorithms with matching runtime are known for any $H$. We bridge this gap between subgraph counting and subgraph enumeration by designing a sublinear-time algorithm that can estimate the number of any arbitrary subgraph $H$ in $G$, denoted by $#H$, to within a $(1pm epsilon)$-approximation w.h.p. in $O(frac{m^{rho(H)}}{#H}) cdot poly(log{n},1/epsilon)$ time. Our algorithm is allowed the standard set of queries for general graphs, namely degree queries, pair queries and neighbor queries, plus an additional edge-sample query that returns an edge chosen uniformly at random. The performance of our algorithm matches those of Eden et.al. [FOCS 2015, STOC 2018] for counting triangles and cliques and extend them to all choices of subgraph $H$ under the additional assumption of edge-sample queries. We further show that our algorithm works for the more general database join size estimation problem and prove a matching lower bound for this problem.

بنى وهياكل البيانات والخوارزميات

A Quadratic Assignment Formulation of the Graph Edit Distance

140 - Sebastien Bougleux , Luc Brun , Vincenzo Carletti 2015

Computing efficiently a robust measure of similarity or dissimilarity between graphs is a major challenge in Pattern Recognition. The Graph Edit Distance (GED) is a flexible measure of dissimilarity between graphs which arises in error-tolerant graph matching. It is defined from an optimal sequence of edit operations (edit path) transforming one graph into an other. Unfortunately, the exact computation of this measure is NP-hard. In the last decade, several approaches have been proposed to approximate the GED in polynomial time, mainly by solving linear programming problems. Among them, the bipartite GED has received much attention. It is deduced from a linear sum assignment of the nodes of the two graphs, which can be efficiently computed by Hungarian-type algorithms. However, edit operations on nodes and edges are not handled simultaneously, which limits the accuracy of the approximation. To overcome this limitation, we propose to extend the linear assignment model to a quadratic one, for directed or undirected graphs having labelized nodes and edges. This is realized through the definition of a family of edit paths induced by assignments between nodes. We formally show that the GED, restricted to the paths in this family, is equivalent to a quadratic assignment problem. Since this problem is NP-hard, we propose to compute an approximate solution by an adaptation of the Integer Projected Fixed Point method. Experiments show that the proposed approach is generally able to reach a more accurate approximation of the optimal GED than the bipartite GED, with a computational cost that is still affordable for graphs of non trivial sizes.

بنى وهياكل البيانات والخوارزميات

Space efficient streaming algorithms for the distance to monotonicity and asymmetric edit distance

164 - Michael Saks , C. Seshadhri 2012

Approximating the length of the longest increasing sequence (LIS) of an array is a well-studied problem. We study this problem in the data stream model, where the algorithm is allowed to make a single left-to-right pass through the array and the key resource to be minimized is the amount of additional memory used. We present an algorithm which, for any $delta > 0$, given streaming access to an array of length $n$ provides a $(1+delta)$-multiplicative approximation to the emph{distance to monotonicity} ($n$ minus the length of the LIS), and uses only $O((log^2 n)/delta)$ space. The previous best known approximation using polylogarithmic space was a multiplicative 2-factor. Our algorithm can be used to estimate the length of the LIS to within an additive $delta n$ for any $delta >0$ while previous algorithms could only achieve additive error $n(1/2-o(1))$. Our algorithm is very simple, being just 3 lines of pseudocode, and has a small update time. It is essentially a polylogarithmic space approximate implementation of a classic dynamic program that computes the LIS. We also give a streaming algorithm for approximating $LCS(x,y)$, the length of the longest common subsequence between strings $x$ and $y$, each of length $n$. Our algorithm works in the asymmetric setting (inspired by cite{AKO10}), in which we have random access to $y$ and streaming access to $x$, and runs in small space provided that no single symbol appears very often in $y$. More precisely, it gives an additive-$delta n$ approximation to $LCS(x,y)$ (and hence also to $E(x,y) = n-LCS(x,y)$, the edit distance between $x$ and $y$ when insertions and deletions, but not substitutions, are allowed), with space complexity $O(k(log^2 n)/delta)$, where $k$ is the maximum number of times any one symbol appears in $y$.

بنى وهياكل البيانات والخوارزميات الرياضيات المتقطعة

A Simple Algorithm for Computing BOCP

185 - Jack Wang 2012

In this article, we devise a concise algorithm for computing BOCP. Our method is simple, easy-to-implement but without loss of efficiency. Given two circular-arc polygons with $m$ and $n$ edges respectively, our method runs in $O(m+n+(l+k)log l)$ tim e, using $O(m+n+k)$ space, where $k$ is the number of intersections, and $l$ is the number of {edge}s. Our algorithm has the power to approximate to linear complexity when $k$ and $l$ are small. The superiority of the proposed algorithm is also validated through empirical study.

بنى وهياكل البيانات والخوارزميات الهندسة الحسابية الرسم الحاسوبي

A Simple Deterministic Algorithm for Edge Connectivity

233 - Thatchaphol Saranurak 2020

We show a deterministic algorithm for computing edge connectivity of a simple graph with $m$ edges in $m^{1+o(1)}$ time. Although the fastest deterministic algorithm by Henzinger, Rao, and Wang [SODA17] has a faster running time of $O(mlog^{2}mloglog m)$, we believe that our algorithm is conceptually simpler. The key tool for this simplication is the expander decomposition. We exploit it in a very straightforward way compared to how it has been previously used in the literature.

بنى وهياكل البيانات والخوارزميات

سجل دخول لتتمكن من نشر تعليقات