Background: Alignment of biological sequences such as DNA, RNA or proteins is one of the most widely used tools in computational bioscience. All existing alignment algorithms rely on heuristic scoring schemes based on biological expertise. Therefore, these algorithms do not provide model-independent, objective measures of how similar two (or more) sequences actually are. Although information theory provides such a similarity measure, the mutual information (MI), previous attempts to connect sequence alignment and information theory have not produced realistic estimates of the MI from a given alignment. Results: Here we describe a simple and flexible approach for obtaining robust estimates of MI from global alignments. For mammalian mitochondrial DNA, our approach gives pairwise MI estimates for commonly used global alignment algorithms that are strikingly close to estimates obtained by an entirely unrelated approach: concatenating and zipping the sequences. Conclusions: This remarkable consistency may help establish MI as a reliable tool for evaluating the quality of global alignments, judging the relative merits of different alignment algorithms, and estimating the significance of specific alignments. We expect that our approach can be extended to establish further connections between information theory and sequence alignment, including applications to local and multiple alignment procedures.
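The "concatenating and zipping" estimate mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say which compressor or formula the authors used, so everything here is an assumption: the sketch uses Python's zlib and the common compression-based approximation MI(A;B) ≈ C(A) + C(B) − C(AB), where C is the compressed size in bits. The helper names (`compressed_bits`, `mi_estimate_bits`, `mutate`) and the synthetic sequences are hypothetical and serve only as illustration.

```python
import random
import zlib


def compressed_bits(seq: str) -> int:
    """Size in bits of the zlib-compressed sequence (a crude complexity proxy)."""
    return 8 * len(zlib.compress(seq.encode("ascii"), 9))


def mi_estimate_bits(a: str, b: str) -> int:
    """Compression-based MI estimate: C(A) + C(B) - C(AB).

    Assumption: zlib stands in for whatever compressor the authors used.
    Its 32 KB window limits this sketch to fairly short sequences.
    """
    return compressed_bits(a) + compressed_bits(b) - compressed_bits(a + b)


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Redraw each base uniformly from ACGT with probability `rate`."""
    return "".join(rng.choice("ACGT") if rng.random() < rate else c for c in seq)


rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(5000))
related = mutate(a, 0.05, rng)  # synthetic "homolog" of a
unrelated = "".join(rng.choice("ACGT") for _ in range(5000))

mi_related = mi_estimate_bits(a, related)
mi_unrelated = mi_estimate_bits(a, unrelated)
print("related pair:  ", mi_related, "bits")
print("unrelated pair:", mi_unrelated, "bits")
```

In this toy setting the related pair should yield a clearly larger estimate than the unrelated pair, because the compressor can reuse long matches from the first sequence when encoding the second. A general-purpose compressor such as zlib is far from optimal on DNA, so the absolute values are only rough indications.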
Existing sequence alignment algorithms use heuristic scoring schemes which cannot be used as objective distance metrics. Therefore one relies on measures like the p- or log-det distances, or makes explicit, and often simplistic, assumptions about seq
We analyze the statistical properties of Poincaré recurrences of Homo sapiens, mammalian and other DNA sequences taken from the Ensembl Genome database with up to fifteen billion base pairs. We show that the probability of Poincaré recurrences decays i
Massively parallel sequencing techniques have revolutionized biological and medical sciences by providing unprecedented insight into the genomes of humans, animals, and microbes. Modern sequencing platforms generate enormous amounts of genomic data i
Protein-fragment seqlets typically feature about 10 amino acid residue positions that are fixed to within conservative substitutions but usually separated by a number of prescribed gaps with arbitrary residue content. By quantifying a general amino a
In the last decade a number of algorithms and associated software have been developed to align next generation sequencing (NGS) reads with relevant reference genomes. The accuracy of these programs may vary significantly, especially when the NGS read