
MetaCache-GPU: Ultra-Fast Metagenomic Classification

Published by Robin Kobus
Publication date: 2021
Language: English
Author: Robin Kobus





The cost of DNA sequencing has dropped exponentially over the past decade, making genomic data accessible to a growing number of scientists. In bioinformatics, localization of short DNA sequences (reads) within large genomic sequences is commonly facilitated by constructing index data structures which allow for efficient querying of substrings. Recent metagenomic classification pipelines annotate reads with taxonomic labels by analyzing their $k$-mer histograms with respect to a reference genome database. CPU-based index construction is often performed in a preprocessing phase due to the relatively high cost of building irregular data structures such as hash maps. However, the rapidly growing amount of available reference genomes establishes the need for index construction and querying at interactive speeds. In this paper, we introduce MetaCache-GPU, an ultra-fast metagenomic short read classifier specifically tailored to fit the characteristics of CUDA-enabled accelerators. Our approach employs a novel hash table variant featuring efficient minhash fingerprinting of reads for locality-sensitive hashing and their rapid insertion using warp-aggregated operations. Our performance evaluation shows that MetaCache-GPU is able to build large reference databases in a matter of seconds, enabling instantaneous operability, while popular CPU-based tools such as Kraken2 require over an hour for index construction on the same data. In the context of an ever-growing number of reference genomes, MetaCache-GPU is the first metagenomic classifier that makes analysis pipelines with on-demand composition of large-scale reference genome sets practical. The source code is publicly available at https://github.com/muellan/metacache.
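The minhash fingerprinting and hash-table lookup mentioned in the abstract can be illustrated with a minimal CPU-side sketch in Python. This is not the MetaCache-GPU implementation: the function names, the window scheme, the parameters `k`, `s`, and `window_len`, and the use of BLAKE2b as the hash function are all assumptions made for illustration, and the sketch omits canonical (reverse-complement) k-mers and the warp-aggregated GPU insertion entirely.

```python
import hashlib


def kmers(seq, k):
    """Yield every length-k substring of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_sketch(seq, k, s):
    """Return the s smallest distinct k-mer hashes of seq (bottom-s minhash)."""
    return sorted({hash_kmer(km) for km in kmers(seq, k)})[:s]


def build_index(references, k, s, window_len):
    """Hypothetical index: sketch feature -> list of (reference name, window id)."""
    index = {}
    stride = window_len - k + 1  # consecutive windows overlap by k - 1 bases
    for name, seq in references.items():
        starts = range(0, max(len(seq) - k + 1, 1), stride)
        for window_id, start in enumerate(starts):
            window = seq[start:start + window_len]
            for feature in minhash_sketch(window, k, s):
                index.setdefault(feature, []).append((name, window_id))
    return index


def classify(read, index, k, s):
    """Return the reference with the most sketch-feature hits, or None."""
    votes = {}
    for feature in minhash_sketch(read, k, s):
        for name, _window_id in index.get(feature, ()):
            votes[name] = votes.get(name, 0) + 1
    return max(votes, key=votes.get) if votes else None
```

A read is classified by sketching it with the same `k` and `s` used for the index and counting which reference its features hit most often; a read that shares no sketch feature with any reference window yields `None`.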


Read also

Arid zones contain a diverse set of microbes capable of surviving under dry conditions, some of which can form relationships with plants under drought stress to improve plant health. We studied the squash (Cucurbita pepo L.) root microbiome at historically arid and humid sites, both in situ and in a common garden experiment. Plants were grown in soils from sites with different drought levels, using in situ collected soils as the microbial source. We described and analyzed bacterial diversity by 16S rRNA gene sequencing (N=48) from the soil, rhizosphere, and endosphere. Proteobacteria were the most abundant phylum in both humid and arid samples, while Actinobacteriota abundance was higher in arid ones. Beta-diversity analyses separated arid from humid microbiomes, a split that aridity and soil pH levels could explain. These differences between humid and arid microbiomes were maintained in the common garden experiment, showing that it is possible to transplant in situ diversity to the greenhouse. We detected a total of 1009 bacterial genera, 199 of them exclusively associated with roots under arid conditions. With shotgun metagenomic sequencing of rhizospheres (N=6), we identified 2969 protein families in the squash core metagenome and found more protein families exclusive to arid samples (924) than to humid samples (158). We found that arid conditions enriched genes involved in protein degradation and folding, oxidative stress, compatible solute synthesis, and ion pumps associated with osmotic regulation. Plant phenotyping allowed us to correlate bacterial communities with plant growth. Our study revealed that it is possible to evaluate microbiome diversity ex situ and identify critical species and genes involved in plant-microbe interactions in historically arid locations.
Microbes are essentially, yet intricately, linked with human life on Earth. They critically interfere in different physiological processes and thus influence overall health status. The study of microbial species used to be constrained to those that can be cultured in the lab, which excluded a huge portion of the microbiome that cannot survive under lab conditions. In the past few years, culture-independent metagenomic sequencing has enabled us to explore the complex microbial community coexisting within and on us. Metagenomics has equipped us with new avenues for investigating the microbiome, from studying a single species to a complex community in a dynamic ecosystem. Thus, identifying the involved microbes and their genomes becomes one of the core tasks in metagenomic sequencing. Metagenome-assembled genomes are groups of contigs with similar sequence characteristics from de novo assembly and could represent the microbial genomes from metagenomic sequencing. In this paper, we reviewed a spectrum of tools for producing and annotating metagenome-assembled genomes from metagenomic sequencing data and discussed their technical and biological perspectives.
Specified Certainty Classification (SCC) is a new paradigm for employing classifiers whose outputs carry uncertainties, typically in the form of Bayesian posterior probabilities. By allowing the classifier output to be less precise than one of a set of atomic decisions, SCC allows all decisions to achieve a specified level of certainty, as well as provides insights into classifier behavior by examining all decisions that are possible. Our primary illustration is read classification for reference-guided genome assembly, but we demonstrate the breadth of SCC by also analyzing COVID-19 vaccination data.
Tomas Ekeberg, Stefan Engblom, 2014
The classical method of determining the atomic structure of complex molecules by analyzing diffraction patterns is currently undergoing drastic developments. Modern techniques for producing extremely bright and coherent X-ray lasers allow a beam of streaming particles to be intercepted and hit by an ultrashort high-energy X-ray beam. Through machine learning methods, the data thus collected can be transformed into a three-dimensional volumetric intensity map of the particle itself. The computational complexity associated with this problem is so high that clusters of data-parallel accelerators are required. We have implemented a distributed and highly efficient algorithm for inversion of large collections of diffraction patterns targeting clusters of hundreds of GPUs. With the expected enormous amount of diffraction data to be produced in the foreseeable future, this is the required scale to approach real-time processing of data at the beam site. Using both real and synthetic data, we look at the scaling properties of the application and discuss the overall computational viability of this exciting and novel imaging technique.
M. Andrecut, 2009
We consider the problem of sparse signal recovery from a small number of random projections (measurements). This is a well-known NP-hard combinatorial optimization problem. A frequently used approach is based on greedy iterative procedures, such as the Matching Pursuit (MP) algorithm. Here, we discuss a fast GPU implementation of the MP algorithm, based on the recently released NVIDIA CUDA API and CUBLAS library. The results show that the GPU version is substantially faster (up to 31 times) than the highly optimized CPU version based on CBLAS (GNU Scientific Library).
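The greedy Matching Pursuit procedure named in this abstract can be sketched in a few lines of plain Python. This is a generic textbook form of MP, not the paper's GPU code: the function name `matching_pursuit`, its parameters, and the assumption that every dictionary atom has unit norm are illustrative choices, and nothing here uses CUDA, CUBLAS, or CBLAS.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def matching_pursuit(atoms, signal, n_iter=10, tol=1e-9):
    """Greedy sparse approximation of signal over a dictionary of atoms.

    atoms  : list of vectors, each assumed to have unit Euclidean norm
    signal : the measurement vector to approximate
    Returns (coefficients, residual).
    """
    residual = [float(x) for x in signal]
    coeffs = [0.0] * len(atoms)
    for _ in range(n_iter):
        # Correlate every atom with the current residual.
        corr = [dot(atom, residual) for atom in atoms]
        # Pick the atom that explains the most residual energy.
        best = max(range(len(atoms)), key=lambda i: abs(corr[i]))
        if abs(corr[best]) < tol:
            break  # nothing left that the dictionary can explain
        # Accumulate its coefficient and subtract its contribution.
        coeffs[best] += corr[best]
        residual = [r - corr[best] * a for r, a in zip(residual, atoms[best])]
    return coeffs, residual
```

Each iteration is dominated by the correlation step, which is a matrix-vector product over the whole dictionary; that dense linear-algebra kernel is the kind of work a BLAS library handles, which is presumably why the abstract pairs MP with CUBLAS.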