RNA motifs typically consist of short, modular patterns that include base pairs formed within and between modules. Estimating the abundance of these patterns is of fundamental importance for assessing the statistical significance of matches in genomewide searches, and for predicting whether a given function has evolved many times in different species or arose from a single common ancestor. In this manuscript, we review in an integrated and self-contained manner some basic concepts of automata theory, generating functions and transfer matrix methods that are relevant to pattern analysis in biological sequences. We formalize, in a general framework, the concept of Markov chain embedding to analyze patterns in random strings produced by a memoryless source. This conceptualization, together with the capability of automata to recognize complicated patterns, allows a systematic analysis of problems related to the occurrence and frequency of patterns in random strings. The applications we present focus on the concept of synchronization of automata, as well as automata used to search for a finite number of keywords (including sets of patterns generated according to base pairing rules) in a general text.
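The Markov chain embedding idea described above can be illustrated with a minimal sketch. It assumes a single keyword and a memoryless source, and builds a KMP-style deterministic automaton whose state is the length of the longest keyword prefix matched so far. The names `build_automaton` and `occurrence_probability` are hypothetical, chosen for this illustration, and are not taken from the manuscript.

```python
def build_automaton(pattern, alphabet):
    """Return the transition table of a DFA that recognizes `pattern`.

    State k means "the last k symbols read equal pattern[:k]".
    State len(pattern) is made absorbing: the keyword has occurred.
    """
    m = len(pattern)
    delta = [dict() for _ in range(m + 1)]
    for state in range(m + 1):
        for c in alphabet:
            if state == m:
                delta[state][c] = m  # absorbing accept state
                continue
            # Longest k such that pattern[:k] is a suffix of pattern[:state] + c.
            s = pattern[:state] + c
            k = len(s)
            while k > 0 and s[-k:] != pattern[:k]:
                k -= 1
            delta[state][c] = k
    return delta


def occurrence_probability(pattern, probs, n):
    """Probability that `pattern` occurs at least once in a random string
    of length n whose letters are i.i.d. with distribution `probs`."""
    delta = build_automaton(pattern, list(probs))
    m = len(pattern)
    dist = [0.0] * (m + 1)
    dist[0] = 1.0  # start in the empty-prefix state
    for _ in range(n):
        new_dist = [0.0] * (m + 1)
        for state, mass in enumerate(dist):
            if mass == 0.0:
                continue
            for c, pc in probs.items():
                new_dist[delta[state][c]] += mass * pc
        dist = new_dist  # one step of the embedded Markov chain
    return dist[m]
```

For example, `occurrence_probability("AA", {"A": 0.5, "B": 0.5}, 3)` evaluates the chance that `AA` appears in a uniformly random binary string of length 3. Each loop iteration is one multiplication by the transfer matrix induced by the automaton and the letter probabilities.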
We consider a Markov chain that iteratively generates a sequence of random finite words in such a way that the $n^{\mathrm{th}}$ word is uniformly distributed over the set of words of length $2n$ in which $n$ letters are $a$ and $n$ letters are $b$: at each step an $a$ and a $b$ are shuffled in uniformly at random among the letters of the current word. We obtain a concrete characterization of the Doob-Martin boundary of this Markov chain. Writing $N(u)$ for the number of letters $a$ (equivalently, $b$) in the finite word $u$, we show that a sequence $(u_n)_{n \in \mathbb{N}}$ of finite words converges to a point in the boundary if, for an arbitrary word $v$, there is convergence as $n$ tends to infinity of the probability that the selection of $N(v)$ letters $a$ and $N(v)$ letters $b$ uniformly at random from $u_n$ and maintaining their relative order results in $v$. We exhibit a bijective correspondence between the points in the boundary and ergodic random total orders on the set $\{a_1, b_1, a_2, b_2, \ldots \}$ that have distributions which are separately invariant under finite permutations of the indices of the $a$s and those of the $b$s. We establish a further bijective correspondence between the set of such random total orders and the set of pairs $(\mu, \nu)$ of diffuse probability measures on $[0,1]$ such that $\frac{1}{2}(\mu+\nu)$ is Lebesgue measure: the restriction of the random total order to $\{a_1, b_1, \ldots, a_n, b_n\}$ is obtained by taking $X_1, \ldots, X_n$ (resp. $Y_1, \ldots, Y_n$) i.i.d. with common distribution $\mu$ (resp. $\nu$), letting $(Z_1, \ldots, Z_{2n})$ be $\{X_1, Y_1, \ldots, X_n, Y_n\}$ in increasing order, and declaring that the $k^{\mathrm{th}}$ smallest element in the restricted total order is $a_i$ (resp. $b_j$) if $Z_k = X_i$ (resp. $Z_k = Y_j$).
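A minimal simulation sketch of the chain and of the subword-sampling probability that appears in the convergence criterion. It assumes one plausible reading of "shuffled in": the new $a$ is inserted at a uniformly random slot, then the new $b$ at a uniformly random slot of the resulting word, which keeps the word uniformly distributed. The function names are hypothetical and chosen only for this illustration.

```python
import random
from itertools import combinations
from math import comb


def shuffle_in_step(word, rng):
    """Insert one 'a' and then one 'b', each at a uniformly random slot."""
    letters = list(word)
    for letter in ("a", "b"):
        letters.insert(rng.randrange(len(letters) + 1), letter)
    return "".join(letters)


def run_chain(n, rng):
    """Run n steps from the empty word; the result has n a's and n b's."""
    word = ""
    for _ in range(n):
        word = shuffle_in_step(word, rng)
    return word


def subword_probability(u, v):
    """Probability that choosing N(v) a's and N(v) b's uniformly at random
    from u, keeping their relative order, yields v.

    Assumes v has equally many a's and b's and N(v) <= N(u).
    Exhaustive enumeration, so only suitable for short words.
    """
    a_pos = [i for i, c in enumerate(u) if c == "a"]
    b_pos = [i for i, c in enumerate(u) if c == "b"]
    k = v.count("a")
    hits = 0
    for chosen_a in combinations(a_pos, k):
        for chosen_b in combinations(b_pos, k):
            picked = sorted(chosen_a + chosen_b)
            if "".join(u[i] for i in picked) == v:
                hits += 1
    return hits / (comb(len(a_pos), k) * comb(len(b_pos), k))
```

For instance, `subword_probability(run_chain(8, random.Random(1)), "ab")` estimates nothing by sampling; it computes the selection probability exactly for one simulated word $u_8$.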
This review paper introduces Markov chains and their convergence rates, an important and interesting mathematical topic with significant applications to the widely used Markov chain Monte Carlo (MCMC) algorithms. We first discuss eigenvalue analysis for Markov chains on finite state spaces. Then, using the coupling construction, we prove two quantitative bounds based on minorization and drift conditions, and provide descriptive and intuitive examples to show how these theorems can be applied in practice. This paper is meant to provide a general overview of the subject and spark interest in new Markov chain research areas.
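As a small illustration of eigenvalue analysis on a finite state space, the sketch below uses a two-state chain with transition matrix $P = \begin{pmatrix} 1-p & p \\ q & 1-q \end{pmatrix}$, an example chosen here and not taken from the paper. Its eigenvalues are $1$ and $\lambda = 1-p-q$, its stationary distribution is $\pi = (q/(p+q),\, p/(p+q))$, and the total variation distance from $\pi$ after $t$ steps started in state 0 equals $\frac{p}{p+q}|\lambda|^t$. The code compares this closed form with direct iteration.

```python
def tv_distance_by_iteration(p, q, t):
    """Total variation distance to stationarity after t steps of the
    two-state chain with P = [[1-p, p], [q, 1-q]], started in state 0."""
    pi = (q / (p + q), p / (p + q))
    dist = (1.0, 0.0)
    for _ in range(t):
        # One step of the chain: row vector times transition matrix.
        dist = (dist[0] * (1 - p) + dist[1] * q,
                dist[0] * p + dist[1] * (1 - q))
    return 0.5 * (abs(dist[0] - pi[0]) + abs(dist[1] - pi[1]))


def tv_distance_by_eigenvalue(p, q, t):
    """Closed form from the spectral decomposition: the second
    eigenvalue 1 - p - q sets the geometric rate of convergence."""
    second_eigenvalue = 1 - p - q
    return (p / (p + q)) * abs(second_eigenvalue) ** t
```

The two functions agree up to floating-point error, so the time needed to bring the distance below a tolerance $\varepsilon$ grows like $\log(1/\varepsilon) / \log(1/|\lambda|)$ for this chain.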
We investigate the problem of deterministic pattern matching in multiple streams. In this model, one symbol arrives at a time and is associated with one of s streaming texts. The task at each time step is to report if there is a new match between a fixed pattern of length m and a newly updated stream. As is usual in the streaming context, the goal is to use as little space as possible while still reporting matches quickly. We give almost matching upper and lower space bounds for three distinct pattern matching problems. For exact matching we show that the problem can be solved in constant time per arriving symbol and O(m+s) words of space. For the k-mismatch and k-difference problems we give O(k) time solutions that require O(m+ks) words of space. In all three cases we also give space lower bounds which show our methods are optimal up to a single logarithmic factor. Finally we set out a number of open problems related to this new model for pattern matching.
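A baseline sketch of exact matching in this model, assuming a non-empty pattern and streams indexed from 0. It shares one Knuth-Morris-Pratt failure table across all streams ($O(m)$ words) and keeps a single integer state per stream ($O(s)$ words). This is a swapped-in textbook technique, not the method of the paper: its time per arriving symbol is constant only in the amortized sense within each stream, not in the worst case. The class name `MultiStreamMatcher` is hypothetical.

```python
class MultiStreamMatcher:
    """Exact pattern matching over several interleaved streams."""

    def __init__(self, pattern, num_streams):
        self.pattern = pattern
        self.fail = self._failure_table(pattern)  # shared, O(m) space
        self.state = [0] * num_streams            # one counter per stream

    @staticmethod
    def _failure_table(p):
        """fail[i] = length of the longest proper border of p[:i+1]."""
        fail = [0] * len(p)
        k = 0
        for i in range(1, len(p)):
            while k > 0 and p[i] != p[k]:
                k = fail[k - 1]
            if p[i] == p[k]:
                k += 1
            fail[i] = k
        return fail

    def feed(self, stream, symbol):
        """Append `symbol` to `stream`; return True if the pattern now
        ends at the newest symbol of that stream."""
        k = self.state[stream]
        if k == len(self.pattern):
            k = self.fail[k - 1]  # previous step was a full match
        while k > 0 and symbol != self.pattern[k]:
            k = self.fail[k - 1]
        if symbol == self.pattern[k]:
            k += 1
        self.state[stream] = k
        return k == len(self.pattern)
```

A caller would invoke `feed(stream, symbol)` once per arriving symbol, in arrival order, and report a match whenever it returns `True`.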
The game of memory is played with a deck of n pairs of cards. The cards in each pair are identical. The deck is shuffled and the cards laid face down. A move consists of flipping over first one card then another. The cards are removed from play if they match. Otherwise, they are flipped back over and the next move commences. A game ends when all pairs have been matched. We determine that, when the game is played optimally, as n tends to infinity: 1) The expected number of moves is (3 - 2 ln 2)n + 7/8 - 2 ln 2 (approximately 1.61 n), 2) The expected number of times two matching cards are unwittingly flipped over is ln 2, and 3) The expected number of flips until two matching cards have been seen is asymptotically $\sqrt{\pi n}$.
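The third asymptotic can be checked numerically with a birthday-type computation. The sketch below makes a simplifying assumption that is not part of the abstract: every flip reveals a previously unseen card from the uniformly shuffled deck, so the count ignores any flips of already known cards that optimal play might include. Under that assumption the expectation is an exact finite sum, and the function name is hypothetical.

```python
import math


def expected_reveals_until_pair_seen(n):
    """Expected number of cards revealed until two matching cards have
    both been seen, for a shuffled deck of n pairs, assuming each flip
    shows a previously unseen card.

    Uses E[T] = sum over k >= 0 of P(T > k), where P(T > k) is the
    probability that the first k revealed cards are pairwise non-matching.
    """
    total = 0.0
    p_no_match = 1.0  # P(T > 0)
    for k in range(n + 1):
        total += p_no_match
        # After k non-matching cards, 2n - k cards remain and exactly k
        # of them are partners of cards already seen.
        p_no_match *= (2 * n - 2 * k) / (2 * n - k)
    return total


def ratio_to_asymptotic(n):
    """Ratio of the exact expectation to sqrt(pi * n)."""
    return expected_reveals_until_pair_seen(n) / math.sqrt(math.pi * n)
```

Evaluating `ratio_to_asymptotic(n)` for increasing `n` shows the ratio approaching 1, consistent with the stated $\sqrt{\pi n}$ growth.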
We present a Markov chain on the $n$-dimensional hypercube $\{0,1\}^n$ which satisfies $t_{{\rm mix}}(\epsilon) = n[1 + o(1)]$. This Markov chain alternates between random and deterministic moves and we prove that the chain has cut-off with a window of size at most $O(n^{0.5+\delta})$ where $\delta>0$. The deterministic moves correspond to a linear shift register.