The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed to detect doublets in scRNA-seq data. We developed an R package, DoubletCollection, to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface for performing and visualizing downstream analysis after doublet detection. Here, we present a protocol for using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.
According to the National Cancer Institute, there were 9.5 million cancer-related deaths in 2018. A challenge in improving treatment is resistance in genetically unstable cells. The purpose of this study is to evaluate unsupervised machine learning for classifying treatment-resistant phenotypes in heterogeneous tumors by analyzing single-cell RNA sequencing (scRNAseq) data with a pipeline and evaluation metrics. scRNAseq quantifies mRNA in individual cells and characterizes cell phenotypes. One scRNAseq dataset was analyzed (tumor and non-tumor cells of different molecular subtypes and patient identifications). The pipeline consisted of data filtering, dimensionality reduction with Principal Component Analysis, projection with Uniform Manifold Approximation and Projection, clustering with nine approaches (Ward, BIRCH, Gaussian Mixture Model, DBSCAN, Spectral, Affinity Propagation, Agglomerative Clustering, Mean Shift, and K-Means), and evaluation. Seven models separated tumor from non-tumor cells and distinguished molecular subtypes, while six models distinguished patient identifications (13 of which were present in the dataset); K-Means, Ward, and BIRCH often ranked highest, with ~80% accuracy on the tumor versus non-tumor task and ~60% for molecular subtype and patient ID. An optimized classification pipeline using the K-Means, Ward, and BIRCH models was judged most effective for further analysis. In clinical research, where there is currently no standard protocol for scRNAseq analysis, clusters generated by this pipeline can be used to understand cancer cell behavior and malignant growth, which directly affect the success of treatment.
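The abstract reports accuracy for unsupervised clusterings but does not say how clusters were matched to known labels. One common convention, assumed here purely for illustration, is to assign each cluster its majority label and count the cells that agree. The function name and the toy data below are hypothetical.

```python
from collections import Counter, defaultdict


def majority_vote_accuracy(cluster_ids, true_labels):
    """Map each cluster to its most common true label, then score accuracy."""
    if len(cluster_ids) != len(true_labels):
        raise ValueError("cluster_ids and true_labels must have the same length")
    if not cluster_ids:
        raise ValueError("inputs must be non-empty")
    # Count the true labels that fall inside each cluster.
    counts = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, true_labels):
        counts[cluster][label] += 1
    # A cell counts as correct when it carries its cluster's majority label.
    correct = sum(c.most_common(1)[0][1] for c in counts.values())
    return correct / len(true_labels)


# Toy example (made-up data): two clusters, ten cells.
clusters = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
labels = ["tumor", "tumor", "tumor", "normal",
          "normal", "normal", "normal", "normal", "tumor", "normal"]
accuracy = majority_vote_accuracy(clusters, labels)
print(accuracy)
```

This score rises trivially as the number of clusters grows, so it is only comparable across models that produce a similar number of clusters.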
Coronavirus disease 2019 (COVID-19) has impacted almost every part of human life worldwide, posing a massive threat to human health. There is no specific drug for COVID-19, highlighting the urgent need for the development of effective therapeutics. To identify potentially repurposable drugs, we employed a systematic approach to mine candidates from U.S. FDA-approved drugs and preclinical small-molecule compounds by integrating the gene expression perturbation data for chemicals from the Library of Integrated Network-Based Cellular Signatures project with a publicly available single-cell RNA sequencing dataset from mild and severe COVID-19 patients. We identified 281 FDA-approved drugs that have the potential to be effective against SARS-CoV-2 infection, 16 of which are currently undergoing clinical trials to evaluate their efficacy against COVID-19. We experimentally tested the inhibitory effects of tyrphostin-AG-1478 and brefeldin-a on the replication of influenza A virus, a single-stranded ribonucleic acid (ssRNA) virus. In conclusion, we have identified a list of repurposable anti-SARS-CoV-2 drugs using a systems biology approach.
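The abstract does not state how disease and drug signatures were combined. A minimal sketch of one generic idea, signature reversal, is given here as an assumption: a drug whose perturbation signature is anticorrelated with the disease signature over shared genes is treated as a candidate. The gene names, values, and function name are all hypothetical.

```python
import math


def reversal_score(disease_lfc, drug_signature):
    """Pearson correlation over shared genes; negative values suggest reversal."""
    genes = sorted(set(disease_lfc) & set(drug_signature))
    if len(genes) < 2:
        raise ValueError("need at least two shared genes")
    x = [disease_lfc[g] for g in genes]
    y = [drug_signature[g] for g in genes]
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("signatures must vary across the shared genes")
    return cov / math.sqrt(var_x * var_y)


# Made-up log fold changes for three placeholder genes.
disease = {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5}
drugs = {
    "drug_reversing": {"GENE_A": -2.0, "GENE_B": 1.0, "GENE_C": -0.5},
    "drug_mimicking": {"GENE_A": 2.0, "GENE_B": -1.0, "GENE_C": 0.5},
}
# Most negative score first: the strongest candidate for reversal.
ranked = sorted(drugs, key=lambda name: reversal_score(disease, drugs[name]))
print(ranked)
```

Ranking by a single correlation ignores dose, cell line, and time point, which a real analysis of perturbation data would need to account for.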
The analysis of differential gene expression from RNA-Seq data has become standard in several research areas, mainly those involving bioinformatics. The computational analysis of these data involves many data types and file formats, and a wide variety of computational tools that can be applied alone or combined into pipelines. This paper presents a review of the differential expression analysis pipeline, covering its steps and their respective objectives, and the principal methods available at each step and their properties, in order to give an organized overview of the field. In particular, this review addresses the main aspects of differentially expressed gene (DEG) analysis from RNA sequencing (RNA-Seq) data, considering the computational methods and their properties. In addition, a timeline of the evolution of computational methods for DEG analysis is presented and discussed, and the relationships between the main computational tools are presented as an interaction network. The challenges and gaps in DEG analysis are also discussed.
The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in metazoan development to characterizing distinct cell types in heterogeneous populations such as cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and presents a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single-cell data; roughly 3% of the spectrum is artifactual. Based on the universal distributions and a technique for detecting sparsity-induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing it with standard single-cell data analysis techniques in a variety of examples with marked cell populations.
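The abstract does not name the universal distributions it uses. As an illustration only, the sketch below assumes the Marchenko-Pastur law, the classical Random Matrix Theory result for the eigenvalues of a sample covariance matrix built from independent noise with a given variance. Eigenvalues above its upper edge are flagged as candidates for signal. The function names and numbers are hypothetical, and the sparsity-induced localization test described in the abstract is not implemented.

```python
import math


def marchenko_pastur_edges(n_cells, n_genes, variance=1.0):
    """Return the (lower, upper) edges of the Marchenko-Pastur bulk.

    Assumes an n_cells x n_genes matrix of independent, zero-mean entries
    with the given variance, and the gene-by-gene sample covariance matrix.
    """
    if n_cells <= 0 or n_genes <= 0:
        raise ValueError("n_cells and n_genes must be positive")
    gamma = n_genes / n_cells  # aspect ratio of the data matrix
    lower = variance * (1 - math.sqrt(gamma)) ** 2
    upper = variance * (1 + math.sqrt(gamma)) ** 2
    return lower, upper


def signal_candidates(eigenvalues, n_cells, n_genes, variance=1.0):
    """Keep eigenvalues above the upper bulk edge (candidates, not proof)."""
    _, upper = marchenko_pastur_edges(n_cells, n_genes, variance)
    return [value for value in eigenvalues if value > upper]


# Made-up spectrum for 4000 cells and 1000 genes with unit variance.
lower_edge, upper_edge = marchenko_pastur_edges(4000, 1000)
candidates = signal_candidates([0.3, 1.0, 2.1, 3.5, 9.0], 4000, 1000)
print(lower_edge, upper_edge, candidates)
```

In finite data sets the largest noise eigenvalue fluctuates around the upper edge, so a hard threshold at the edge is a simplification.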
Microbes are fundamentally and intricately linked with human life on Earth. They play critical roles in different physiological processes and thus influence overall health status. The study of microbial species used to be constrained to those that could be cultured in the lab, which excluded the large portion of the microbiome that cannot survive under lab conditions. In the past few years, culture-independent metagenomic sequencing has enabled us to explore the complex microbial communities coexisting within and on us. Metagenomics has opened new avenues for investigating the microbiome, from studying a single species to studying a complex community in a dynamic ecosystem. Identifying the microbes involved and their genomes has therefore become one of the core tasks in metagenomic sequencing. Metagenome-assembled genomes are groups of contigs with similar sequence characteristics, obtained from de novo assembly, that can represent the microbial genomes in a metagenomic sample. In this paper, we review a spectrum of tools for producing and annotating metagenome-assembled genomes from metagenomic sequencing data and discuss their technical and biological perspectives.
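To make "similar sequence characteristics" concrete, the sketch below computes two features often used when grouping contigs: GC content and a tetranucleotide frequency profile. It illustrates the general idea under simplifying assumptions and is not the method of any specific tool in the review. The contig sequences and function names are made up, and real binning tools typically also use read coverage and merge reverse-complement k-mers.

```python
import math
from collections import Counter
from itertools import product


def kmer_profile(sequence, k=4):
    """Return normalized k-mer frequencies in a fixed ACGT k-mer order."""
    sequence = sequence.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    # Only k-mers made of A, C, G, T are counted; windows with N are ignored.
    total = sum(counts[m] for m in kmers)
    if total == 0:
        raise ValueError("sequence has no valid k-mers")
    return [counts[m] / total for m in kmers]


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases."""
    sequence = sequence.upper()
    acgt = sum(sequence.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no unambiguous bases")
    return (sequence.count("G") + sequence.count("C")) / acgt


def euclidean(a, b):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up contigs: two AT-rich, one GC-rich.
contig_1 = "ATATATATATATATAT"
contig_2 = "TATATATATATATATA"
contig_3 = "GCGCGCGCGGCCGCGC"
d12 = euclidean(kmer_profile(contig_1), kmer_profile(contig_2))
d13 = euclidean(kmer_profile(contig_1), kmer_profile(contig_3))
print(d12 < d13, gc_content(contig_1), gc_content(contig_3))
```

Contigs whose profiles lie close together would be placed in the same group by a clustering step, which is omitted here.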