No Arabic abstract
In parametric sequence alignment, optimal alignments of two sequences are computed as a function of the penalties for mismatches and spaces, producing many different optimal alignments. Here we give a 3/(2^{7/3}pi^{2/3})n^{2/3} +O(n^{1/3} log n) lower bound on the maximum number of distinct optimal alignment summaries of length-n binary sequences. This shows that the upper bound given by Gusfield et. al. is tight over all alphabets, thereby disproving the square root of n conjecture. Thus the maximum number of distinct optimal alignment summaries (i.e. vertices of the alignment polytope) over all pairs of length-n sequences is Theta(n^{2/3}).
We present a combined mean-field and simulation approach to different models describing the dynamics of classes formed by elements that can appear, disappear or copy themselves. These models, related to a paradigm duplication-innovation model known as Chinese Restaurant Process, are devised to reproduce the scaling behavior observed in the genome-wide repertoire of protein domains of all known species. In view of these data, we discuss the qualitative and quantitative differences of the alternative model formulations, focusing in particular on the roles of element loss and of the specificity of empirical domain classes.
Let $(X, mathcal{B},mu,T)$ be an ergodic measure preserving system, $A in mathcal{B}$ and $epsilon>0$. We study the largeness of sets of the form begin{equation*} begin{split} S = left{ ninmathbb{N}colonmu(Acap T^{-f_1(n)}Acap T^{-f_2(n)}Acapldotscap T^{-f_k(n)}A)> mu(A)^{k+1} - epsilon right} end{split} end{equation*} for various families ${f_1,dots,f_k}$ of sequences $f_icolon mathbb{N} to mathbb{N}$. For $k leq 3$ and $f_{i}(n)=i f(n)$, we show that $S$ has positive density if $f(n)=q(p_n)$ where $q in mathbb{Z}[x]$ satisfies $q(1)$ or $q(-1) =0$ and $p_n$ denotes the $n$-th prime; or when $f$ is a certain Hardy field sequence. If $T^q$ is ergodic for some $q in mathbb{N}$, then for all $r in mathbb{Z}$, $S$ is syndetic if $f(n) = qn + r$. For $f_{i}(n)=a_{i}n$, where $a_{i}$ are distinct integers, we show that $S$ can be empty for $kgeq 4$, and for $k = 3$ we found an interesting relation between the largeness of $S$ and the abundance of solutions to certain linear equations in sparse sets of integers. We also provide some partial results when the $f_{i}$ are distinct polynomials.
Much evolutionary information is stored in the fluctuations of protein length distributions. The genome size and non-coding DNA content can be calculated based only on the protein length distributions. So there is intrinsic relationship between the coding DNA size and non-coding DNA size. According to the correlations and quasi-periodicity of protein length distributions, we can classify life into three domains. Strong evidences are found to support the order in the structures of protein length distributions.
The Cambrian explosion is a grand challenge to science today and involves multidisciplinary study. This event is generally believed as a result of genetic innovations, environmental factors and ecological interactions, even though there are many conflicts on nature and timing of metazoan origins. The crux of the matter is that an entire roadmap of the evolution is missing to discern the biological complexity transition and to evaluate the critical role of the Cambrian explosion in the overall evolutionary context. Here we calculate the time of the Cambrian explosion by an innovative and accurate C-value clock; our result (560 million years ago) quite fits the fossil records. We clarify that the intrinsic reason of genome evolution determined the Cambrian explosion. A general formula for evaluating genome size of different species has been found, by which major questions of the C-value enigma can be solved and the genome size evolution can be illustrated. The Cambrian explosion is essentially a major transition of biological complexity, which corresponds to a turning point in genome size evolution. The observed maximum prokaryotic complexity is just a relic of the Cambrian explosion and it is supervised by the maximum information storage capability in the observed universe. Our results open a new prospect of studying metazoan origins and molecular evolution.
We study the space complexity of sketching cuts and Laplacian quadratic forms of graphs. We show that any data structure which approximately stores the sizes of all cuts in an undirected graph on $n$ vertices up to a $1+epsilon$ error must use $Omega(nlog n/epsilon^2)$ bits of space in the worst case, improving the $Omega(n/epsilon^2)$ bound of Andoni et al. and matching the best known upper bound achieved by spectral sparsifiers. Our proof is based on a rigidity phenomenon for cut (and spectral) approximation which may be of independent interest: any two $d-$regular graphs which approximate each others cuts significantly better than a random graph approximates the complete graph must overlap in a constant fraction of their edges.