Approximate Trace Reconstruction via Median String (in Average-Case)


Published by: Diptarka Chakraborty

Publication date: 2021

Research field: Informatics Engineering

Language: English

Authors: Diptarka Chakraborty - Debarati Das -

Data Structures and Algorithms


Abstract

We consider an \emph{approximate} version of the trace reconstruction problem, where the goal is to recover an unknown string $s\in\{0,1\}^n$ from $m$ traces (each trace is generated independently by passing $s$ through a probabilistic insertion-deletion channel with rate $p$). We present a deterministic near-linear time algorithm for the average-case model, where $s$ is random, that uses only \emph{three} traces. It runs in near-linear time $\tilde O(n)$ and with high probability reports a string within edit distance $O(\epsilon p n)$ from $s$ for $\epsilon=\tilde O(p)$, which significantly improves over the straightforward bound of $O(pn)$. Technically, our algorithm computes a $(1+\epsilon)$-approximate median of the three input traces. To prove its correctness, our probabilistic analysis shows that an approximate median is indeed close to the unknown $s$. To achieve a near-linear time bound, we have to bypass the well-known dynamic programming algorithm that computes an optimal median in time $O(n^3)$.
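To make the objects in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It simulates one plausible insertion-deletion channel and evaluates the median objective, taken here to be the sum of edit distances from a candidate string to the traces. The channel details (one possible random-bit insertion before each position, the same rate `p` for insertions and deletions), the use of Levenshtein distance with substitutions, and all function names are assumptions made for illustration.

```python
import random


def indel_channel(s, p, rng):
    """Pass bit-string s through a simple insertion-deletion channel.

    Assumed model: before each bit, a uniformly random bit is inserted
    with probability p; the bit itself is then deleted with probability p.
    """
    out = []
    for bit in s:
        if rng.random() < p:       # insertion event
            out.append(rng.choice("01"))
        if rng.random() >= p:      # bit survives (no deletion)
            out.append(bit)
    return "".join(out)


def edit_distance(a, b):
    """Quadratic-time dynamic program for Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # match or substitute
        prev = cur
    return prev[-1]


def median_cost(candidate, traces):
    """Median objective: total edit distance from candidate to all traces."""
    return sum(edit_distance(candidate, t) for t in traces)


if __name__ == "__main__":
    rng = random.Random(0)
    s = "".join(rng.choice("01") for _ in range(40))
    traces = [indel_channel(s, 0.1, rng) for _ in range(3)]
    # Compare the objective at the hidden string with that at each trace.
    print(median_cost(s, traces), [median_cost(t, traces) for t in traces])
```

A median string is any string minimising `median_cost`; the sketch only evaluates the objective and does not search for a minimiser, which is the step the abstract says must avoid the cubic-time dynamic program.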


Read also

Circular Trace Reconstruction

Shyam Narayanan, Michael Ren (2020)

Trace reconstruction is the problem of learning an unknown string $x$ from independent traces of $x$, where traces are generated by independently deleting each bit of $x$ with some deletion probability $q$. In this paper, we initiate the study of Circular trace reconstruction, where the unknown string $x$ is circular and traces are now rotated by a random cyclic shift. Trace reconstruction is related to many computational biology problems studying DNA, which is a primary motivation for this problem as well, as many types of DNA are known to be circular. Our main results are as follows. First, we prove that we can reconstruct arbitrary circular strings of length $n$ using $\exp\big(\tilde{O}(n^{1/3})\big)$ traces for any constant deletion probability $q$, as long as $n$ is prime or the product of two primes. For $n$ of this form, this nearly matches what was the best known bound of $\exp\big(O(n^{1/3})\big)$ for standard trace reconstruction when this paper was initially released. We note, however, that Chase very recently improved the standard trace reconstruction bound to $\exp\big(\tilde{O}(n^{1/5})\big)$. Next, we prove that we can reconstruct random circular strings with high probability using $n^{O(1)}$ traces for any constant deletion probability $q$. Finally, we prove a lower bound of $\tilde{\Omega}(n^3)$ traces for arbitrary circular strings, which is greater than the best known lower bound of $\tilde{\Omega}(n^{3/2})$ in standard trace reconstruction.
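The trace model described above can be sketched in a few lines. This is an illustrative simulation only, assuming that the cyclic shift is uniform over the $n$ rotations and is applied before the deletions; the function names are hypothetical.

```python
import random


def deletion_trace(x, q, rng):
    """Delete each symbol of x independently with probability q."""
    return "".join(b for b in x if rng.random() >= q)


def circular_trace(x, q, rng):
    """Rotate x by a uniformly random cyclic shift, then apply deletions.

    Assumption: the shift is drawn uniformly from the len(x) rotations
    and happens before the deletion channel.
    """
    k = rng.randrange(len(x)) if x else 0
    rotated = x[k:] + x[:k]
    return deletion_trace(rotated, q, rng)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(3):
        print(circular_trace("1101000", 0.2, rng))
```

With `q = 0` every trace is simply a rotation of `x`, which shows why a circular string can only ever be recovered up to rotation.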

Data Structures and Algorithms, Number Theory

Average-Case Analysis of Online Topological Ordering

Deepak Ajwani, Tobias Friedrich (2008)

Many applications like pointer analysis and incremental compilation require maintaining a topological ordering of the nodes of a directed acyclic graph (DAG) under dynamic updates. All known algorithms for this problem are either only analyzed for worst-case insertion sequences or only evaluated experimentally on random DAGs. We present the first average-case analysis of online topological ordering algorithms. We prove an expected runtime of O(n^2 polylog(n)) under insertion of the edges of a complete DAG in a random order for the algorithms of Alpern et al. (SODA, 1990), Katriel and Bodlaender (TALG, 2006), and Pearce and Kelly (JEA, 2006). This is much less than the best known worst-case bound O(n^{2.75}) for this problem.
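For readers unfamiliar with the problem, here is a small sketch of maintaining a topological order under edge insertions. It follows the general idea usually associated with Pearce and Kelly's approach (search only the region between the two endpoints in the current order, then reuse the positions of the affected nodes). It is a simplified illustration under that assumption, not a faithful reproduction of any of the analysed algorithms; the class and method names are hypothetical.

```python
class OnlineTopo:
    """Maintain a topological order of a DAG on nodes 0..n-1 under edge inserts."""

    def __init__(self, n):
        self.succ = [set() for _ in range(n)]
        self.pred = [set() for _ in range(n)]
        self.ord = list(range(n))  # ord[v] = current position of node v

    def _search(self, start, adj, in_range):
        """Iterative DFS from start along adj, restricted to nodes passing in_range."""
        seen, stack = {start}, [start]
        while stack:
            v = stack.pop()
            for w in adj[v]:
                if w not in seen and in_range(w):
                    seen.add(w)
                    stack.append(w)
        return seen

    def add_edge(self, x, y):
        """Insert edge x -> y; raise ValueError if it would create a cycle."""
        if x == y:
            raise ValueError("self-loop")
        lb, ub = self.ord[y], self.ord[x]
        if lb < ub:  # the order is violated only when y currently precedes x
            fwd = self._search(y, self.succ, lambda w: self.ord[w] <= ub)
            if x in fwd:
                raise ValueError("edge would create a cycle")
            bwd = self._search(x, self.pred, lambda w: self.ord[w] > lb)
            # Ancestors of x go first, then descendants of y, each group keeping
            # its relative order; they share the positions they already occupy.
            nodes = (sorted(bwd, key=self.ord.__getitem__)
                     + sorted(fwd, key=self.ord.__getitem__))
            slots = sorted(self.ord[v] for v in nodes)
            for v, s in zip(nodes, slots):
                self.ord[v] = s
        self.succ[x].add(y)
        self.pred[y].add(x)


if __name__ == "__main__":
    g = OnlineTopo(4)
    for edge in [(3, 2), (2, 1), (1, 0)]:
        g.add_edge(*edge)
    print(g.ord)
```

Only nodes whose positions lie between the two endpoints are touched on each insertion, which is the kind of locality that makes the average-case cost differ from the worst case.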

Data Structures and Algorithms

Complexity Issues of String to Graph Approximate Matching

Riccardo Dondi, Giancarlo Mauri, Italo Zoppis (2020)

The problem of matching a query string to a directed graph, whose vertices are labeled by strings, has applications in different fields, from data mining to computational biology. Several variants of the problem have been considered, depending on whether the match is exact or approximate and, in the latter case, which edit operations are considered and where they are allowed. In this paper we present results on the complexity of the approximate matching problem, where edit operations are symbol substitutions and are allowed only on the graph labels or both on the graph labels and the query string. We introduce a variant of the problem that asks whether there exists a path in a graph that represents a query string with any number of edit operations, and we show that it is NP-complete, even when labels have length one and the alphabet is binary. Moreover, when it is parameterized by the length of the input string and graph labels have length one, we show that the problem is fixed-parameter tractable and it is unlikely to admit a polynomial kernel. The NP-completeness of this problem leads to the inapproximability (within any factor) of the approximate matching when edit operations are allowed only on the graph labels. Moreover, we show that the variants of approximate string matching to graph we consider are not fixed-parameter tractable, when the parameter is the number of edit operations, even for graphs that have distance one from a DAG. The reduction for this latter result allows us to prove the inapproximability of the variant where edit operations can be applied both on the query string and on graph labels.

Data Structures and Algorithms, Computational Complexity, Genomics

Near Neighbor Search via Efficient Average Distortion Embeddings

Deepanshu Kush, Aleksandar Nikolov, Haohua Tang (2021)

A recent series of papers by Andoni, Naor, Nikolov, Razenshteyn, and Waingarten (STOC 2018, FOCS 2018) has given approximate near neighbour search (NNS) data structures for a wide class of distance metrics, including all norms. In particular, these data structures achieve approximation on the order of $p$ for $\ell_p^d$ norms with space complexity nearly linear in the dataset size $n$ and polynomial in the dimension $d$, and query time sub-linear in $n$ and polynomial in $d$. The main shortcoming is the exponential in $d$ pre-processing time required for their construction. In this paper, we describe a more direct framework for constructing NNS data structures for general norms. More specifically, we show via an algorithmic reduction that an efficient NNS data structure for a given metric is implied by an efficient average distortion embedding of it into $\ell_1$ or into Euclidean space. In particular, the resulting data structures require only polynomial pre-processing time, as long as the embedding can be computed in polynomial time. As a concrete instantiation of this framework, we give an NNS data structure for $\ell_p$ with efficient pre-processing that matches the approximation factor, space and query complexity of the aforementioned data structure of Andoni et al. On the way, we resolve a question of Naor (Analysis and Geometry in Metric Spaces, 2014) and provide an explicit, efficiently computable embedding of $\ell_p$, for $p \ge 2$, into $\ell_2$ with (quadratic) average distortion on the order of $p$. We expect our approach to pave the way for constructing efficient NNS data structures for all norms.

Data Structures and Algorithms, Computational Geometry, Metric Geometry

Revisiting the Majority Problem: Average-Case Analysis with Arbitrarily Many Colours

Anthony Kleerekoper (2016)

The majority problem is a special case of the heavy hitters problem. Given a collection of coloured balls, the task is to identify the majority colour or state that no such colour exists. Whilst the special case of two colours has been well studied, the average-case performance for arbitrarily many colours has not. In this paper, we present a heuristic analysis of the average-case performance of three deterministic algorithms that appear in the literature. We empirically validate our analysis with large-scale simulations.
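As an illustration of the problem statement, the sketch below uses the well-known Boyer-Moore majority vote, a deterministic two-pass method. The abstract does not say which three algorithms are analysed, so this choice is an assumption made only to show what a solution looks like; the function name is hypothetical.

```python
def majority(balls):
    """Return the colour held by more than half of the balls, or None.

    balls must be a sequence (it is traversed twice).
    """
    candidate, count = None, 0
    # Pass 1: pair off balls of different colours; any true majority survives.
    for colour in balls:
        if count == 0:
            candidate, count = colour, 1
        elif colour == candidate:
            count += 1
        else:
            count -= 1
    # Pass 2: the survivor is only a candidate, so verify it explicitly.
    if candidate is not None and 2 * sum(1 for c in balls if c == candidate) > len(balls):
        return candidate
    return None


if __name__ == "__main__":
    print(majority(["red", "green", "red", "blue", "red"]))
    print(majority(["red", "green", "blue"]))
```

The first pass uses at most one colour comparison per ball, which is the cost measure typically counted when such algorithms are compared.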

Data Structures and Algorithms
