Transcriptome assembly from RNA-Seq reads is an active area of bioinformatics research. The ever-declining cost and increasing depth of RNA-Seq have provided unprecedented opportunities to better identify expressed transcripts. However, nonlinear transcript structures and the ultra-high throughput of RNA-Seq reads pose significant algorithmic and computational challenges to existing transcriptome assembly approaches, whether reference-guided or de novo. While reference-guided approaches offer good sensitivity, they rely on the alignment results of splice-aware aligners and are thus unsuitable for species with incomplete reference genomes. In contrast, de novo approaches do not depend on a reference genome but face a computationally daunting task that stems from the complexity of the graph built for the whole transcriptome. In response to these challenges, we present a hybrid approach that exploits an incomplete reference genome without relying on splice-aware aligners. We have designed a split-and-align procedure to efficiently localize reads to individual genomic loci, followed by an accurate de novo assembly of the reads falling into each locus. Using extensive simulation data, we demonstrate high accuracy and precision in transcriptome reconstruction in comparison with selected transcriptome assembly tools. Our method is implemented in assemblySAM, a GUI software freely available at http://sammate.sourceforge.net.
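The idea of localizing reads by splitting them can be illustrated with a toy sketch. The abstract does not describe the actual split-and-align procedure in detail, so everything below is an assumption made for illustration: the function names, the fixed piece length, the exact-match lookup standing in for a real (non-splice-aware) aligner, and the majority vote over loci are all hypothetical.

```python
from collections import Counter

def split_read(read, piece_len):
    """Cut a read into consecutive, non-overlapping pieces of piece_len bases."""
    return [read[i:i + piece_len]
            for i in range(0, len(read) - piece_len + 1, piece_len)]

def locate_pieces(read, genome, piece_len):
    """Return the first exact-match position of each piece (-1 if absent).

    Exact matching stands in for an ordinary aligner; because each short
    piece is placed on its own, a read spanning a splice junction can
    still be localized without splice-aware alignment.
    """
    return [genome.find(piece) for piece in split_read(read, piece_len)]

def assign_locus(read, genome, loci, piece_len=4):
    """Assign a read to the locus that collects the most piece hits.

    loci maps a hypothetical locus name to a half-open (start, end) interval.
    Returns None when no piece can be placed.
    """
    votes = Counter()
    for pos in locate_pieces(read, genome, piece_len):
        if pos < 0:
            continue
        for name, (start, end) in loci.items():
            if start <= pos < end:
                votes[name] += 1
    return votes.most_common(1)[0][0] if votes else None

# Toy genome with two loci; the first read joins two distant segments of geneA.
genome = "AAAACCCCGGGGTTTTACGTACGTTGCATGCA"
loci = {"geneA": (0, 16), "geneB": (16, 32)}
print(assign_locus("AAAATTTT", genome, loci))
print(assign_locus("ACGTTGCA", genome, loci))
```

Reads grouped this way could then be handed, locus by locus, to a de novo assembler, which keeps each assembly graph small.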
Tiling arrays make large-scale exploration of the genome possible thanks to probes that cover the whole genome at very high density, with up to 2,000,000 probes. The biological questions usually addressed are either the difference in expression between two conditions or the detection of transcribed regions. In this work we propose to consider both questions simultaneously as an unsupervised classification problem by modeling the joint distribution of the two conditions. In contrast to previous methods, we account for all available information on the probes as well as biological knowledge such as annotation and spatial dependence between probes. Since probes are not biologically relevant units, we propose a classification rule for non-connected regions covered by several probes. Applications to transcriptomic and ChIP-chip data of Arabidopsis thaliana obtained with a NimbleGen tiling array highlight the importance of precise modeling and of the region classification.
The Oxford Nanopore MinION sequencer is currently the smallest sequencing device available. While it can produce very long reads (reads of up to 100 kbp have been reported), it is prone to high sequencing error rates of up to 30%. Since most of these errors are insertions or deletions, it is very difficult to adapt popular seed-based algorithms designed for aligning data sets with much lower error rates. Base calling of MinION reads is typically done using hidden Markov models. In this paper, we propose to represent each sequencing read by an ensemble of sequences sampled from such a probabilistic model. This approach can improve the sensitivity and false positive rate of seeding an alignment compared to using a single representative base call sequence for each read.
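The ensemble idea can be sketched in a few lines. This is a simplified illustration, not the paper's method: independent per-position base probabilities stand in for the hidden Markov model (so insertions and deletions are not modeled), and the function names and the exact k-mer seed matching are assumptions.

```python
import random

BASES = "ACGT"

def sample_ensemble(position_probs, n_samples, rng):
    """Draw n_samples sequences from per-position base distributions.

    position_probs is a list of [P(A), P(C), P(G), P(T)] rows; this is a
    stand-in for sampling from a base caller's probabilistic model.
    """
    return ["".join(rng.choices(BASES, weights=p)[0] for p in position_probs)
            for _ in range(n_samples)]

def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def seed_hits(ensemble, reference, k):
    """Return k-mers shared by the reference and any ensemble member."""
    read_kmers = set().union(*(kmers(s, k) for s in ensemble))
    return read_kmers & kmers(reference, k)

# An uncertain second base: a single base call may miss the seed,
# while some ensemble members can still contain it.
probs = [[1, 0, 0, 0], [0, 0.5, 0.5, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
ensemble = sample_ensemble(probs, n_samples=10, rng=random.Random(0))
print(sorted(seed_hits(ensemble, "TTACGTT", k=3)))
```

A seed is counted if any sampled sequence contains it, which is where the potential sensitivity gain over a single base call comes from; a real implementation would also need to control the extra false positives this introduces.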
The number of completely sequenced chloroplast genomes increases rapidly every day, making it possible to build large-scale phylogenetic trees of plant species. Considering a subset of close plant species defined according to their chloroplasts, the phylogenetic tree that can be inferred from their core genes is not necessarily well supported, due to the possible occurrence of problematic genes (e.g., genes affected by homoplasy, incomplete lineage sorting, or horizontal gene transfers) which may blur the phylogenetic signal. However, a trustworthy phylogenetic tree can still be obtained provided the number of such blurring genes is reduced. The problem is thus to determine the largest subset of core genes that produces the best-supported tree. Because the number of possible combinations is overwhelming, this article focuses on how to discard problematic genes and extract the largest subset of sequences that yields the best-supported species tree. Due to the computational complexity, a Binary Particle Swarm Optimization (BPSO) is proposed in sequential and distributed fashions. Obtained results from bo
One of the most computationally intensive tasks in computational biology is de novo genome assembly, the decoding of the sequence of an unknown genome from redundant and erroneous short sequences. A common assembly paradigm identifies overlapping sequences, simplifies their layout, and creates a consensus. Despite the many algorithms developed in the literature, the efficient assembly of large genomes is still an open problem. In this work, we introduce new distributed-memory parallel algorithms for the overlap detection and layout simplification steps of de novo genome assembly, and implement them in the diBELLA 2D pipeline. Our distributed-memory algorithms for both overlap detection and layout simplification are based on linear-algebra operations over semirings using 2D distributed sparse matrices. Our layout step consists of performing a transitive reduction from the overlap graph to a string graph. We provide a detailed communication analysis of the main stages of our new algorithms. diBELLA 2D achieves near-linear scaling with over 80% parallel efficiency for the human genome, reducing the runtime for overlap detection by 1.2-1.3x for the human genome and 1.5-1.9x for C. elegans compared to the state of the art. Our transitive reduction algorithm outperforms an existing distributed-memory implementation by 10.5-13.3x for the human genome and 18-29x for C. elegans. Our work paves the way for efficient de novo assembly of large genomes using long reads in distributed memory.
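As a rough, single-machine illustration of transitive reduction expressed through a semiring operation, the sketch below squares a sparse adjacency structure over the (min, +) semiring and drops every direct edge that a two-hop path explains. It is not the diBELLA 2D algorithm: the dict-of-dicts representation, the edge weights (taken here to be overhang lengths), the `fuzz` tolerance, and the single non-iterated pass are all assumptions for illustration.

```python
import math

def min_plus_square(adj):
    """Square a sparse matrix over the (min, +) semiring.

    adj[u][v] is the weight of edge u -> v. The result holds, for each
    pair (u, w), the cheapest two-hop path weight min_v adj[u][v] + adj[v][w].
    """
    two_hop = {}
    for u, nbrs in adj.items():
        row = {}
        for v, w_uv in nbrs.items():
            for w, w_vw in adj.get(v, {}).items():
                cand = w_uv + w_vw
                if cand < row.get(w, math.inf):
                    row[w] = cand
        two_hop[u] = row
    return two_hop

def transitive_reduction(adj, fuzz=0):
    """Remove edge u -> w when a two-hop path is no longer than it plus fuzz."""
    two_hop = min_plus_square(adj)
    return {
        u: {w: s for w, s in nbrs.items()
            if two_hop[u].get(w, math.inf) > s + fuzz}
        for u, nbrs in adj.items()
    }

# Read a overlaps b and c, and b overlaps c: edge a -> c is implied by a -> b -> c.
overlap_graph = {"a": {"b": 3, "c": 7}, "b": {"c": 4}, "c": {}}
print(transitive_reduction(overlap_graph))
```

Writing the step as a sparse matrix product is what makes it a candidate for distribution: the same computation can be carried out by a distributed sparse matrix multiplication with the semiring's operations substituted for the usual multiply and add.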
MicroRNAs (miRNAs) are a class of non-coding RNAs that regulate gene expression. Identifying the total number of miRNAs, even in completely sequenced organisms, is still an open problem. However, researchers have been using techniques that can predict a limited number of miRNAs in an organism. In this paper, we use a homology-based approach for the comparative analysis of miRNAs of the Hexapoda group. We use the Apis mellifera, Bombyx mori, Anopheles gambiae and Drosophila melanogaster miRNA datasets from the miRBase repository. We perform pairwise as well as multiple alignments of the miRNAs available in the repository to identify and analyse conserved regions among related species. Unfortunately, to the best of our knowledge, the miRNA-related literature does not provide an in-depth analysis of hexapods. We attempt to derive the commonality among the miRNAs and to identify conserved regions that are not yet available in miRNA repositories. The results are a good approximation, with a small number of mismatches. They are encouraging and may facilitate miRNA biogenesis for