No Arabic abstract
As genome sequencing tools and techniques improve, researchers are able to incrementally assemble more accurate reference genomes, which enable sensitivity in read mapping and downstream analysis such as variant calling. A more sensitive downstream analysis is critical for a better understanding of the genome donor (e.g., health characteristics). Therefore, read sets from sequenced samples should ideally be mapped to the latest available reference genome that represents the most relevant population. Unfortunately, the increasingly large amount of available genomic data makes it prohibitively expensive to fully re-map each read set to its respective reference genome every time the reference is updated. There are several tools that attempt to accelerate the process of updating a read data set from one reference to another (i.e., remapping). However, if a read maps to a region in the old reference that does not appear with a reasonable degree of similarity in the new reference, the read cannot be remapped. We find that, as a result of this drawback, a significant portion of annotations are lost when using state-of-the-art remapping tools. To address this major limitation in existing tools, we propose AirLift, a fast and comprehensive technique for remapping alignments from one genome to another. Compared to the state-of-the-art method for remapping reads (i.e., full mapping), AirLift reduces 1) the number of reads that need to be fully mapped to the new reference by up to 99.99% and 2) the overall execution time to remap read sets between two reference genom
Gene-gene interactions have long been recognized to be fundamentally important to understand genetic causes of complex disease traits. At present, identifying gene-gene interactions from genome-wide case-control studies is computationally and methodologically challenging. In this paper, we introduce a simple but powerful method, named `BOolean Operation based Screening and Testing(BOOST). To discover unknown gene-gene interactions that underlie complex diseases, BOOST allows examining all pairwise interactions in genome-wide case-control studies in a remarkably fast manner. We have carried out interaction analyses on seven data sets from the Wellcome Trust Case Control Consortium (WTCCC). Each analysis took less than 60 hours on a standard 3.0 GHz desktop with 4G memory running Windows XP system. The interaction patterns identified from the type 1 diabetes data set display significant difference from those identified from the rheumatoid arthritis data set, while both data sets share a very similar hit region in the WTCCC report. BOOST has also identified many undiscovered interactions between genes in the major histocompatibility complex (MHC) region in the type 1 diabetes data set. In the coming era of large-scale interaction mapping in genome-wide case-control studies, our method can serve as a computationally and statistically useful tool.
We show, that the specific distribution of genes length, which is observed in natural genomes, might be a result of a growth process, in which a single length scale $L(t)$ develops that grows with time as $t^{1/3}$. This length scale could be associated with the length of the longest gene in an evolving genome. The growth kinetics of the genes resembles the one observed in physical systems with conserved ordered parameter. We show, that in genome this conservation is guaranteed by compositional compensation along DNA strands of the purine-like trends introduced by genes. The presented mathematical model is the modified Bak-Sneppen model of critical self-organization applied to the one-dimensional system of $N$ spins. The spins take discrete values, which represent genes length.
Microbes are essentially yet convolutedly linked with human lives on the earth. They critically interfere in different physiological processes and thus influence overall health status. Studying microbial species is used to be constrained to those that can be cultured in the lab. But it excluded a huge portion of the microbiome that could not survive on lab conditions. In the past few years, the culture-independent metagenomic sequencing enabled us to explore the complex microbial community coexisting within and on us. Metagenomics has equipped us with new avenues of investigating the microbiome, from studying a single species to a complex community in a dynamic ecosystem. Thus, identifying the involved microbes and their genomes becomes one of the core tasks in metagenomic sequencing. Metagenome-assembled genomes are groups of contigs with similar sequence characteristics from de novo assembly and could represent the microbial genomes from metagenomic sequencing. In this paper, we reviewed a spectrum of tools for producing and annotating metagenome-assembled genomes from metagenomic sequencing data and discussed their technical and biological perspectives.
Motivation: Seed location filtering is critical in DNA read mapping, a process where billions of DNA fragments (reads) sampled from a donor are mapped onto a reference genome to identify genomic variants of the donor. State-of-the-art read mappers 1) quickly generate possible mapping locations for seeds (i.e., smaller segments) within each read, 2) extract reference sequences at each of the mapping locations, and 3) check similarity between each read and its associated reference sequences with a computationally-expensive algorithm (i.e., sequence alignment) to determine the origin of the read. A seed location filter comes into play before alignment, discarding seed locations that alignment would deem a poor match. The ideal seed location filter would discard all poor match locations prior to alignment such that there is no wasted computation on unnecessary alignments. Results: We propose a novel seed location filtering algorithm, GRIM-Filter, optimized to exploit 3D-stacked memory systems that integrate computation within a logic layer stacked under memory layers, to perform processing-in-memory (PIM). GRIM-Filter quickly filters seed locations by 1) introducing a new representation of coarse-grained segments of the reference genome, and 2) using massively-parallel in-memory operations to identify read presence within each coarse-grained segment. Our evaluations show that for a sequence alignment error tolerance of 0.05, GRIM-Filter 1) reduces the false negative rate of filtering by 5.59x--6.41x, and 2) provides an end-to-end read mapper speedup of 1.81x--3.65x, compared to a state-of-the-art read mapper employing the best previous seed location filtering algorithm. Availability: The code is available online at: https://github.com/CMU-SAFARI/GRIM
Motivation: Uncovering the genomic causes of cancer, known as cancer driver genes, is a fundamental task in biomedical research. Cancer driver genes drive the development and progression of cancer, thus identifying cancer driver genes and their regulatory mechanism is crucial to the design of cancer treatment and intervention. Many computational methods, which take the advantages of computer science and data science, have been developed to utilise multiple types of genomic data to reveal cancer drivers and their regulatory mechanism behind cancer development and progression. Due to the complexity of the mechanistic insight of cancer genes in driving cancer and the fast development of the field, it is necessary to have a comprehensive review about the current computational methods for discovering different types of cancer drivers. Results: We survey computational methods for identifying cancer drivers from genomic data. We categorise the methods into three groups, methods for single driver identification, methods for driver module identification, and methods for identifying personalised cancer drivers. We also conduct a case study to compare the performance of the current methods. We further analyse the advantages and limitations of the current methods, and discuss the challenges and future directions of the topic. In addition, we investigate the resources for discovering and validating cancer drivers in order to provide a one-stop reference of the tools to facilitate cancer driver discovery. The ultimate goal of the paper is to help those interested in the topic to establish a solid background to carry out further research in the field.