Advanced search powered by artificial intelligence

New community

Subscribe to the gold package and get unlimited access to Shamra Academy

Longest common subsequences between words of very unequal length

125 0 0.0 ( 0 )

Download Cite

Added by Boris Bukh

Publication date 2020

fields

and research's language is English

Authors Boris Bukh - Zichao Dong

Probability Combinatorics

visit our facebook page

‎Shamra Academia - شمرا أكاديميا‎

Ask ChatGPT about the research

Abstract in Arabic Abstract in English

We consider the expected length of the longest common subsequence between two random words of lengths $n$ and $(1-varepsilon)kn$ over $k$-symbol alphabet. It is well-known that this quantity is asymptotic to $gamma_{k,varepsilon} n$ for some constant $gamma_{k,varepsilon}$. We show that $gamma_{k,varepsilon}$ is of the order $1-cvarepsilon^2$ uniformly in $k$ and $varepsilon$. In addition, for large $k$, we give evidence that $gamma_{k,varepsilon}$ approaches $1-tfrac{1}{4}varepsilon^2$, and prove a matching lower bound.

rate research

Periodic words, common subsequences and frogs

354 - Boris Bukh , Christopher Cox 2019

Let $W^{(n)}$ be the $n$-letter word obtained by repeating a fixed word $W$, and let $R_n$ be a random $n$-letter word over the same alphabet. We show several results about the length of the longest common subsequence (LCS) between $W^{(n)}$ and $R_n$; in particular, we show that its expectation is $gamma_W n-O(sqrt{n})$ for an efficiently-computable constant $gamma_W$. This is done by relating the problem to a new interacting particle system, which we dub frog dynamics. In this system, the particles (`frogs) hop over one another in the order given by their labels. Stripped of the labeling, the frog dynamics reduces to a variant of the PushTASEP. In the special case when all symbols of $W$ are distinct, we obtain an explicit formula for the constant $gamma_W$ and a closed-form expression for the stationary distribution of the associated frog dynamics. In addition, we propose new conjectures about the asymptotic of the LCS of a pair of random words. These conjectures are informed by computer experiments using a new heuristic algorithm to compute the LCS. Through our computations, we found periodic words that are more random-like than a random word, as measured by the LCS.

Probability Combinatorics

On the limiting law of the length of the longest common and increasing subsequences in random words

488 - Jean-Christophe Breton , Christian Houdre 2015

Let $X=(X_i)_{ige 1}$ and $Y=(Y_i)_{ige 1}$ be two sequences of independent and identically distributed (iid) random variables taking their values, uniformly, in a common totally ordered finite alphabet. Let LCI$_n$ be the length of the longest common and (weakly) increasing subsequence of $X_1cdots X_n$ and $Y_1cdots Y_n$. As $n$ grows without bound, and when properly centered and normalized, LCI$_n$ is shown to converge, in distribution, towards a Brownian functional that we identify.

Probability

Optimal alignments of longest common subsequences and their path properties

350 - Juri Lember , Heinrich Matzinger , Anna Vollmer 2014

We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment between two finite sequences is optimal if it corresponds to the longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. High differences between the extremal alignments imply the high variety of all optimal alignments. We present several simulations indicating that the homologous (having the same common ancestor) sequences have typically the distance between the extremal alignments of much smaller size than independent sequences. In particular, the simulations suggest that for the homologous sequences, the growth of the distance between the extremal alignments is logarithmical. The main theoretical results of the paper prove that (under some assumptions) this is the case, indeed. The paper suggests that the properties of the optimal alignment paths characterize the relatedness of the sequences.

Statistics Theory Statistics Theory

A Note on the Longest Common Compatible Prefix Problem for Partial Words

610 - Maxime Crochemore , Costas S. Iliopoulos , Tomasz Kociumaka 2013

For a partial word $w$ the longest common compatible prefix of two positions $i,j$, denoted $lccp(i,j)$, is the largest $k$ such that $w[i,i+k-1]uparrow w[j,j+k-1]$, where $uparrow$ is the compatibility relation of partial words (it is not an equivalence relation). The LCCP problem is to preprocess a partial word in such a way that any query $lccp(i,j)$ about this word can be answered in $O(1)$ time. It is a natural generalization of the longest common prefix (LCP) problem for regular words, for which an $O(n)$ preprocessing time and $O(1)$ query time solution exists. Recently an efficient algorithm for this problem has been given by F. Blanchet-Sadri and J. Lazarow (LATA 2013). The preprocessing time was $O(nh+n)$, where $h$ is the number of holes in $w$. The algorithm was designed for partial words over a constant alphabet and was quite involved. We present a simple solution to this problem with slightly better runtime that works for any linearly-sortable alphabet. Our preprocessing is in time $O(nmu+n)$, where $mu$ is the number of blocks of holes in $w$. Our algorithm uses ideas from alignment algorithms and dynamic programming.

Data Structures and Algorithms Formal Languages and Automata Theory

Analysis of the Relationships among Longest Common Subsequences, Shortest Common Supersequences and Patterns and its application on Pattern Discovery in Biological Sequences

539 - Kang Ning , Hoong Kee Ng , Hon Wai Leong 2009

For a set of mulitple sequences, their patterns,Longest Common Subsequences (LCS) and Shortest Common Supersequences (SCS) represent different aspects of these sequences profile, and they can all be used for biological sequence comparisons and analysis. Revealing the relationship between the patterns and LCS,SCS might provide us with a deeper view of the patterns of biological sequences, in turn leading to better understanding of them. However, There is no careful examinaton about the relationship between patterns, LCS and SCS. In this paper, we have analyzed their relation, and given some lemmas. Based on their relations, a set of algorithms called the PALS (PAtterns by Lcs and Scs) algorithms are propsoed to discover patterns in a set of biological sequences. These algorithms first generate the results for LCS and SCS of sequences by heuristic, and consequently derive patterns from these results. Experiments show that the PALS algorithms perform well (both in efficiency and in accuracy) on a variety of sequences. The PALS approach also provides us with a solution for transforming between the heuristic results of SCS and LCS.

Data Structures and Algorithms Discrete Mathematics Information Retrieval

comments

Fetching comments

Kalamoon Private University

Additional details More universities

يمكنك البدء بجني المال وتحقيق ربح مادي من أبحاثك العلمية، المزيد

Longest common subsequences between words of very unequal length

Ask ChatGPT about the research

No Arabic abstract

Read More