Arid zones harbor a diverse set of microbes capable of surviving dry conditions, some of which can form relationships with plants under drought stress to improve plant health. We studied the squash (Cucurbita pepo L.) root microbiome at historically arid and humid sites, both in situ and in a common garden experiment. Plants were grown in soils from sites with different drought levels, using in situ collected soils as the microbial source. We described and analyzed bacterial diversity by 16S rRNA gene sequencing (N=48) of the soil, rhizosphere, and endosphere. Proteobacteria were the most abundant phylum in both humid and arid samples, while Actinobacteriota abundance was higher in arid ones. Beta-diversity analyses showed a split between arid and humid microbiomes, which could be explained by aridity and soil pH levels. These differences between humid and arid microbiomes were maintained in the common garden experiment, showing that it is possible to transplant in situ diversity to the greenhouse. We detected a total of 1009 bacterial genera, 199 of which were exclusively associated with roots under arid conditions. With shotgun metagenomic sequencing of rhizospheres (N=6), we identified 2969 protein families in the squash core metagenome and found more protein families exclusive to arid samples (924) than to humid samples (158). Arid conditions were enriched in genes involved in protein degradation and folding, oxidative stress, compatible solute synthesis, and ion pumps associated with osmotic regulation. Plant phenotyping allowed us to correlate bacterial communities with plant growth. Our study revealed that it is possible to evaluate microbiome diversity ex situ and to identify critical species and genes involved in plant-microbe interactions in historically arid locations.
Motivation: Omics data, such as transcriptomics or phosphoproteomics, are broadly used to get a snapshot of the molecular status of cells. In particular, changes in omics data can be used to estimate the activity of pathways, transcription factors and kinases based on known regulated targets, which we call footprints. The molecular paths driving these activities can then be estimated using causal reasoning on large signaling networks. Results: We have developed FUNKI, a FUNctional toolKIt for footprint analysis. It provides a user-friendly interface for easy and fast analysis of several types of omics data, either from bulk or single-cell experiments. FUNKI also features different options to visualise the results and run post-analyses, and is mirrored as a scripted version in R. Availability: FUNKI is a free and open-source application built on R and Shiny, available on GitHub at https://github.com/saezlab/ShinyFUNKI under the GNU v3.0 license, and also accessible at https://saezlab.shinyapps.io/funki/ Contact: [email protected] Supplementary information: We provide data examples within the app, as well as extensive information about the different variables to select, the results, and the different plots in the help page.
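To make the footprint idea concrete, here is a minimal sketch of how an activity score can be derived from the expression of a regulator's known targets. The gene names, weights, and scoring rule are illustrative assumptions, not FUNKI's actual regulons or statistics:

```python
# Hedged sketch: estimate a transcription factor's activity from the observed
# expression changes of its known targets (the "footprint"). All names and
# numbers below are hypothetical examples.

def footprint_activity(expression, regulon):
    """Mean of target expression changes, signed by the mode of regulation.

    expression: dict gene -> log fold change
    regulon: dict gene -> +1 (activated target) or -1 (repressed target)
    """
    scores = [expression[g] * sign
              for g, sign in regulon.items() if g in expression]
    return sum(scores) / len(scores) if scores else 0.0

# Toy example: a TF that activates geneA/geneB and represses geneC.
expr = {"geneA": 2.0, "geneB": 1.0, "geneC": -1.5}
regulon = {"geneA": +1, "geneB": +1, "geneC": -1}
activity = footprint_activity(expr, regulon)  # (2.0 + 1.0 + 1.5) / 3 = 1.5
```

A positive score suggests the regulator's targets move in the direction it pushes them, i.e. the regulator is likely active in the sample.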
As the global need for large-scale data storage rises exponentially, existing storage technologies are approaching their theoretical and functional limits in terms of density and energy consumption, making DNA-based storage a potential solution for the future of data storage. Several studies have introduced DNA-based storage systems with high information density (petabytes/gram). However, DNA synthesis and sequencing technologies yield erroneous outputs. Algorithmic approaches for correcting these errors depend on reading multiple copies of each sequence and result in excessive reading costs. The unprecedented success of Transformers as a deep learning architecture for language modeling has led to their repurposing for solving a variety of tasks across various domains. In this work, we propose a novel approach for single-read reconstruction using an encoder-decoder Transformer architecture for DNA-based data storage. We address the error correction process as a self-supervised sequence-to-sequence task and use synthetic noise injection to train the model using only the decoded reads. Our approach exploits the inherent redundancy of each decoded file to learn its underlying structure. To demonstrate our proposed approach, we encode text, image and code-script files to DNA, produce errors with a high-fidelity error simulator, and reconstruct the original files from the noisy reads. Our model achieves lower error rates when reconstructing the original data from a single read of each DNA strand compared to state-of-the-art algorithms using 2-3 copies. This is the first demonstration of using deep learning models for single-read reconstruction in DNA-based storage, which allows for the reduction of the overall cost of the process. We show that this approach is applicable to various domains and can be generalized to new domains as well.
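The synthetic noise injection step described above can be sketched as corrupting a clean decoded read with substitutions, insertions and deletions, yielding (noisy, clean) pairs for self-supervised training. The error rates below are illustrative assumptions, not those of any particular synthesis or sequencing pipeline:

```python
import random

# Hedged sketch of synthetic noise injection for training-pair generation.
# Error probabilities are hypothetical placeholders.

def inject_noise(seq, sub=0.01, ins=0.005, dele=0.005, rng=None):
    rng = rng or random.Random(0)
    bases = "ACGT"
    out = []
    for b in seq:
        r = rng.random()
        if r < dele:
            continue                       # deletion: drop the base
        if r < dele + ins:
            out.append(rng.choice(bases))  # insertion before the base
        if rng.random() < sub:
            b = rng.choice([x for x in bases if x != b])  # substitution
        out.append(b)
    return "".join(out)

clean = "ACGTACGTACGT" * 4
noisy = inject_noise(clean, sub=0.05, ins=0.02, dele=0.02)
# (noisy, clean) forms one sequence-to-sequence training pair
```

Because the pairs are generated from the decoded reads themselves, no ground-truth sequencing data is needed to train the reconstruction model.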
With the rapid global spread of COVID-19, more and more data related to this virus is becoming available, including genomic sequence data. The total number of genomic sequences publicly available on platforms such as GISAID is currently several million, and it increases every day. The availability of such Big Data creates a new opportunity for researchers to study this virus in detail. This is particularly important given the dynamics of the COVID-19 variants that emerge and circulate. This rich data source will give us insights into the best ways to perform genomic surveillance for this and future pandemic threats, with the ultimate goal of mitigating or eliminating such threats. Analyzing and processing several million genomic sequences is a challenging task. Although traditional methods for sequence classification have proven effective, they are not designed to deal with these specific types of genomic sequences. Moreover, most of the existing methods also face the issue of scalability. Previous studies tailored to coronavirus genomic data proposed using spike sequences (corresponding to a subsequence of the genome), rather than the complete genomic sequence, to perform different machine learning (ML) tasks such as classification and clustering. However, those methods suffer from scalability issues. In this paper, we propose an approach called Spike2Vec, an efficient and scalable feature vector representation for each spike sequence that can be used for downstream ML tasks. Through experiments, we show that Spike2Vec is not only scalable to several million spike sequences, but also outperforms the baseline models in terms of prediction accuracy, F1-score, etc.
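The core idea of turning a variable-length spike sequence into a fixed-length feature vector can be sketched with k-mer counts. This is a simplification of what a k-mer-based embedding such as Spike2Vec involves, with illustrative parameters:

```python
from itertools import product
from collections import Counter

# Hedged sketch: fixed-length k-mer frequency vector for a protein sequence.
# The alphabet and k are standard choices, but the exact encoding in
# Spike2Vec may differ.

AMINO = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids

def kmer_vector(seq, k=2, alphabet=AMINO):
    index = {"".join(p): i for i, p in enumerate(product(alphabet, repeat=k))}
    vec = [0] * len(index)
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    for kmer, c in counts.items():
        if kmer in index:  # skip k-mers containing ambiguous residues
            vec[index[kmer]] = c
    return vec

v = kmer_vector("MFVFLVLLPLVSSQ", k=2)  # 20^2 = 400-dimensional vector
```

Every sequence maps to the same 400-dimensional space regardless of its length, so the vectors can feed any standard classifier or clustering algorithm.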
Routine single-sample haplotype-resolved assembly remains an unresolved problem. Here we describe a new algorithm that combines PacBio HiFi reads and Hi-C chromatin interaction data to produce a haplotype-resolved assembly without the sequencing of parents. Applied to human and other vertebrate samples, our algorithm consistently outperforms existing single-sample assembly pipelines and generates assemblies of comparable quality to the best pedigree-based assemblies.
The analysis of differential gene expression from RNA-Seq data has become standard in several research areas, mainly involving bioinformatics. The computational analysis of these data involves many data types and file formats, and a wide variety of computational tools that can be applied alone or combined into pipelines. This paper presents a review of the differential expression analysis pipeline, addressing its steps and their respective objectives, the principal methods available at each step, and their properties, providing an organized overview of this context. In particular, this review addresses the main aspects involved in differentially expressed gene (DEG) analysis from RNA sequencing (RNA-Seq) data, considering the computational methods and their properties. In addition, a timeline of the evolution of computational methods for DEG analysis is presented and discussed, and the relationships between the main computational tools are presented as an interaction network. A discussion of the challenges and gaps in DEG analysis is also included in this review.
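As a concrete illustration of the final step most DEG pipelines share, here is a minimal sketch of calling differentially expressed genes by thresholding adjusted p-values and fold changes. The thresholds and input table are illustrative assumptions, not prescriptions from any specific tool:

```python
# Hedged sketch of DEG calling: keep genes whose adjusted p-value is below a
# significance cutoff and whose |log2 fold change| exceeds an effect-size
# cutoff. Thresholds are common defaults, shown here as hypothetical values.

def call_degs(results, padj_cut=0.05, lfc_cut=1.0):
    """results: list of (gene, log2_fold_change, adjusted_p_value) tuples."""
    return [gene for gene, lfc, padj in results
            if padj < padj_cut and abs(lfc) >= lfc_cut]

table = [("geneA", 2.3, 0.001),   # strong, significant up-regulation
         ("geneB", 0.4, 0.0001),  # significant but small effect
         ("geneC", -1.8, 0.200)]  # large effect but not significant
degs = call_degs(table)  # ['geneA']
```

Real pipelines (e.g. DESeq2, edgeR, limma) produce such result tables after normalization and statistical testing; the filtering logic itself is this simple.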
We compute the allele frequencies of the alpha (B.1.1.7), beta (B.1.351) and delta (B.1.617.2) variants of SARS-CoV-2 from almost two million genome sequences in the GISAID repository. We find that the frequencies of a majority of the defining mutations in alpha rose towards the end of 2020 but drifted apart during spring 2021, with a similar pattern followed by delta during the summer of 2021. For beta we find a more complex scenario, with the frequencies of some mutations rising and some remaining close to zero. Our results indicate that what is generally reported as a single variant is in fact a collection of variants with different genetic characteristics. For all three variants we further find some alleles with clearly deviating time series.
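The frequency time series described above can be sketched as counting, per time bin, how many sequences carry a given defining mutation. The record format below is an illustrative assumption, not GISAID's actual schema:

```python
from collections import defaultdict

# Hedged sketch: weekly frequency of one defining mutation from tabulated
# (week, has_mutation) records. Field names and data are hypothetical.

def weekly_frequency(records):
    totals, hits = defaultdict(int), defaultdict(int)
    for week, has_mut in records:
        totals[week] += 1
        hits[week] += int(has_mut)
    return {w: hits[w] / totals[w] for w in sorted(totals)}

records = [("2021-W01", True), ("2021-W01", False),
           ("2021-W02", True), ("2021-W02", True)]
freqs = weekly_frequency(records)  # {'2021-W01': 0.5, '2021-W02': 1.0}
```

Computing one such series per defining mutation and overlaying them is what reveals whether the mutations of a "single" variant actually move together.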
Multilabel learning is an important topic in machine learning research. Evaluating models in multilabel settings requires specific cross-validation methods designed for multilabel data. In this article, we show a weakness in an evaluation metric widely used in the literature and we present an improved alternative.
Somya Mani, Tsvi Tlusty (2021)
Contrary to long-held views, recent evidence indicates that de novo birth of genes is not only possible, but surprisingly prevalent: a substantial fraction of eukaryotic genomes is composed of orphan genes, which show no homology with any conserved genes. And a remarkably large proportion of orphan genes likely originated de novo from non-genic regions. Here, using a parsimonious mathematical model, we investigate the probability and timescale of de novo gene birth due to spontaneous mutations. We trace how an initially non-genic locus accumulates beneficial mutations to become a gene. We sample across a wide range of biologically feasible distributions of fitness effects (DFE) of mutations, and calculate the conditions conducive to gene birth. We find that in a time frame of millions of years, gene birth is highly likely for a wide range of DFEs. Moreover, when we allow DFEs to fluctuate, which is expected given the long time frame, gene birth in the model becomes practically inevitable. This supports the idea that gene birth is a ubiquitous process and should occur in a wide variety of organisms. Our results also demonstrate that intergenic regions are not inactive and silent but are more like dynamic storehouses of potential genes.
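The model's core idea can be caricatured as a locus accumulating mutations whose effects are drawn from a DFE, with "gene birth" declared once cumulative fitness crosses a threshold. This toy simulation is a hedged sketch under assumed parameters, not the authors' actual model:

```python
import random

# Hedged toy sketch: count mutational steps until a non-genic locus crosses a
# fitness threshold. The DFE, threshold, and fixation rule are hypothetical.

def time_to_gene_birth(dfe, threshold=1.0, max_steps=10_000, rng=None):
    rng = rng or random.Random(42)
    fitness, steps = 0.0, 0
    while fitness < threshold and steps < max_steps:
        effect = dfe(rng)                       # draw a fitness effect
        if effect > 0 or rng.random() < 0.1:    # deleterious changes rarely fix
            fitness = max(0.0, fitness + effect)
        steps += 1
    return steps if fitness >= threshold else None  # None: no birth observed

# Illustrative DFE: mostly small deleterious effects, occasional beneficial ones
dfe = lambda rng: rng.gauss(-0.01, 0.05)
t = time_to_gene_birth(dfe)  # steps to birth, or None within max_steps
```

Sampling many DFEs and recording how often and how fast the threshold is reached mirrors, in miniature, the probability-and-timescale question the model addresses.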
Microbes are intricately linked with human life on Earth. They critically participate in different physiological processes and thus influence overall health status. The study of microbial species used to be constrained to those that could be cultured in the lab, excluding the huge portion of the microbiome that cannot survive under lab conditions. In the past few years, culture-independent metagenomic sequencing has enabled us to explore the complex microbial communities coexisting within and on us. Metagenomics has equipped us with new avenues for investigating the microbiome, from studying a single species to a complex community in a dynamic ecosystem. Thus, identifying the microbes involved and their genomes has become one of the core tasks in metagenomic sequencing. Metagenome-assembled genomes are groups of contigs with similar sequence characteristics from de novo assembly and can represent the microbial genomes from metagenomic sequencing. In this paper, we review a spectrum of tools for producing and annotating metagenome-assembled genomes from metagenomic sequencing data and discuss their technical and biological perspectives.