No Arabic abstract
We investigate the behavior of optimal alignment paths for homologous (related) and independent random sequences. An alignment between two finite sequences is optimal if it corresponds to the longest common subsequence (LCS). We prove the existence of lowest and highest optimal alignments and study their differences. High differences between the extremal alignments imply the high variety of all optimal alignments. We present several simulations indicating that the homologous (having the same common ancestor) sequences have typically the distance between the extremal alignments of much smaller size than independent sequences. In particular, the simulations suggest that for the homologous sequences, the growth of the distance between the extremal alignments is logarithmical. The main theoretical results of the paper prove that (under some assumptions) this is the case, indeed. The paper suggests that the properties of the optimal alignment paths characterize the relatedness of the sequences.
We consider the expected length of the longest common subsequence between two random words of lengths $n$ and $(1-varepsilon)kn$ over $k$-symbol alphabet. It is well-known that this quantity is asymptotic to $gamma_{k,varepsilon} n$ for some constant $gamma_{k,varepsilon}$. We show that $gamma_{k,varepsilon}$ is of the order $1-cvarepsilon^2$ uniformly in $k$ and $varepsilon$. In addition, for large $k$, we give evidence that $gamma_{k,varepsilon}$ approaches $1-tfrac{1}{4}varepsilon^2$, and prove a matching lower bound.
Let $X=(X_i)_{ige 1}$ and $Y=(Y_i)_{ige 1}$ be two sequences of independent and identically distributed (iid) random variables taking their values, uniformly, in a common totally ordered finite alphabet. Let LCI$_n$ be the length of the longest common and (weakly) increasing subsequence of $X_1cdots X_n$ and $Y_1cdots Y_n$. As $n$ grows without bound, and when properly centered and normalized, LCI$_n$ is shown to converge, in distribution, towards a Brownian functional that we identify.
For a set of mulitple sequences, their patterns,Longest Common Subsequences (LCS) and Shortest Common Supersequences (SCS) represent different aspects of these sequences profile, and they can all be used for biological sequence comparisons and analysis. Revealing the relationship between the patterns and LCS,SCS might provide us with a deeper view of the patterns of biological sequences, in turn leading to better understanding of them. However, There is no careful examinaton about the relationship between patterns, LCS and SCS. In this paper, we have analyzed their relation, and given some lemmas. Based on their relations, a set of algorithms called the PALS (PAtterns by Lcs and Scs) algorithms are propsoed to discover patterns in a set of biological sequences. These algorithms first generate the results for LCS and SCS of sequences by heuristic, and consequently derive patterns from these results. Experiments show that the PALS algorithms perform well (both in efficiency and in accuracy) on a variety of sequences. The PALS approach also provides us with a solution for transforming between the heuristic results of SCS and LCS.
Let $W^{(n)}$ be the $n$-letter word obtained by repeating a fixed word $W$, and let $R_n$ be a random $n$-letter word over the same alphabet. We show several results about the length of the longest common subsequence (LCS) between $W^{(n)}$ and $R_n$; in particular, we show that its expectation is $gamma_W n-O(sqrt{n})$ for an efficiently-computable constant $gamma_W$. This is done by relating the problem to a new interacting particle system, which we dub frog dynamics. In this system, the particles (`frogs) hop over one another in the order given by their labels. Stripped of the labeling, the frog dynamics reduces to a variant of the PushTASEP. In the special case when all symbols of $W$ are distinct, we obtain an explicit formula for the constant $gamma_W$ and a closed-form expression for the stationary distribution of the associated frog dynamics. In addition, we propose new conjectures about the asymptotic of the LCS of a pair of random words. These conjectures are informed by computer experiments using a new heuristic algorithm to compute the LCS. Through our computations, we found periodic words that are more random-like than a random word, as measured by the LCS.
Elliptically contoured distributions can be considered to be the distributions for which the contours of the density functions are proportional ellipsoids. We generalize elliptically contoured densities to ``star-shaped distributions with concentric star-shaped contours and show that many results in the former case continue to hold in the more general case. We develop a general theory in the framework of abstract group invariance so that the results can be applied to other cases as well, especially those involving random matrices.