No Arabic abstract
A major hindrance to studies of microbial diversity has been that the vast majority of microbes cannot be cultured in the laboratory and thus are not amenable to traditional methods of characterization. Environmental shotgun sequencing (ESS) overcomes this hurdle by sequencing the DNA from the organisms present in a microbial community. The interpretation of this metagenomic data can be greatly facilitated by associating every sequence read with its source organism. We report the development of CompostBin, a DNA composition-based algorithm for analyzing metagenomic sequence reads and distributing them into taxon-specific bins. Unlike previous methods that seek to bin assembled contigs and often require training on known reference genomes, CompostBin has the ability to accurately bin raw sequence reads without need for assembly or training. It applies principal component analysis to project the data into an informative lower-dimensional space, and then uses the normalized cut clustering algorithm on this filtered data set to classify sequences into taxon-specific bins. We demonstrate the algorithms accuracy on a variety of simulated data sets and on one metagenomic data set with known species assignments. CompostBin is a work in progress, with several refinements of the algorithm planned for the future.
Predicting DNA-protein binding is an important and classic problem in bioinformatics. Convolutional neural networks have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. However, none of the studies has utilized graph convolutional networks for motif inference. In this work, we propose to use graph convolutional networks for motif inference. We build a sequence k-mer graph for the whole dataset based on k-mer co-occurrence and k-mer sequence relationship and then learn DNA Graph Convolutional Network (DNA-GCN) for the whole dataset. Our DNA-GCN is initialized with a one-hot representation for all nodes, and it then jointly learns the embeddings for both k-mers and sequences, as supervised by the known labels of sequences. We evaluate our model on 50 datasets from ENCODE. DNA-GCN shows its competitive performance compared with the baseline model. Besides, we analyze our model and design several different architectures to help fit different datasets.
Motivation: The MinION device by Oxford Nanopore is the first portable sequencing device. MinION is able to produce very long reads (reads over 100~kBp were reported), however it suffers from high sequencing error rate. In this paper, we show that the error rate can be reduced by improving the base calling process. Results: We present the first open-source DNA base caller for the MinION sequencing platform by Oxford Nanopore. By employing carefully crafted recurrent neural networks, our tool improves the base calling accuracy compared to the default base caller supplied by the manufacturer. This advance may further enhance applicability of MinION for genome sequencing and various clinical applications. Availability: DeepNano can be downloaded at http://compbio.fmph.uniba.sk/deepnano/. Contact:
[email protected]
In the last decade a number of algorithms and associated software have been developed to align next generation sequencing (NGS) reads with relevant reference genomes. The accuracy of these programs may vary significantly, especially when the NGS reads are quite different from the available reference genome. We propose a benchmark to assess accuracy of short reads mapping based on the pre-computed global alignment of related genome sequences. In this paper we propose a benchmark to assess accuracy of the short reads mapping based on the pre-computed global alignment of closely related genome sequences. We outline the method and also present a short report of an experiment performed on five popular alignment tools based on the pairwise alignments of Escherichia coli O6 CFT073 genome with genomes of seven other bacteria.
The understanding of mechanisms that control epigenetic changes is an important research area in modern functional biology. Epigenetic modifications such as DNA methylation are in general very stable over many cell divisions. DNA methylation can however be subject to specific and fast changes over a short time scale even in non-dividing (i.e. not-replicating) cells. Such dynamic DNA methylation changes are caused by a combination of active demethylation and de novo methylation processes which have not been investigated in integrated models. Here we present a hybrid (hidden) Markov model to describe the cycle of methylation and demethylation over (short) time scales. Our hybrid model decribes several molecular events either happening at deterministic points (i.e. describing mechanisms that occur only during cell division) and other events occurring at random time points. We test our model on mouse embryonic stem cells using time-resolved data. We predict methylation changes and estimate the efficiencies of the different modification steps related to DNA methylation and demethylation.
RNA-seq has rapidly become the de facto technique to measure gene expression. However, the time required for analysis has not kept up with the pace of data generation. Here we introduce Sailfish, a novel computational method for quantifying the abundance of previously annotated RNA isoforms from RNA-seq data. Sailfish entirely avoids mapping reads, which is a time-consuming step in all current methods. Sailfish provides quantification estimates much faster than existing approaches (typically 20-times faster) without loss of accuracy.