Predicting Molecular Phenotypes with Single Cell RNA Sequencing Data: an Assessment of Unsupervised Machine Learning Models

435 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Anastasia Dunca

تاريخ النشر 2021

مجال البحث علم الأحياء الهندسة المعلوماتية

والبحث باللغة English

تأليف Anastasia Dunca - Frederick R. Adler

الجينوم التعلم الآلي سلوك الخلية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

According to the National Cancer Institute, there were 9.5 million cancer-related deaths in 2018. A challenge in improving treatment is resistance in genetically unstable cells. The purpose of this study is to evaluate unsupervised machine learning on classifying treatment-resistant phenotypes in heterogeneous tumors through analysis of single cell RNA sequencing(scRNAseq) data with a pipeline and evaluation metrics. scRNAseq quantifies mRNA in cells and characterizes cell phenotypes. One scRNAseq dataset was analyzed (tumor/non-tumor cells of different molecular subtypes and patient identifications). The pipeline consisted of data filtering, dimensionality reduction with Principal Component Analysis, projection with Uniform Manifold Approximation and Projection, clustering with nine approaches (Ward, BIRCH, Gaussian Mixture Model, DBSCAN, Spectral, Affinity Propagation, Agglomerative Clustering, Mean Shift, and K-Means), and evaluation. Seven models divided tumor versus non-tumor cells and molecular subtype while six models classified different patient identification (13 of which were presented in the dataset); K-Means, Ward, and BIRCH often ranked highest with ~80% accuracy on the tumor versus non-tumor task and ~60% for molecular subtype and patient ID. An optimized classification pipeline using K-Means, Ward, and BIRCH models was evaluated to be most effective for further analysis. In clinical research where there is currently no standard protocol for scRNAseq analysis, clusters generated from this pipeline can be used to understand cancer cell behavior and malignant growth, directly affecting the success of treatment.

قيم البحث

99 - Nan Miles Xi , Jingyi Jessica Li 2021

The existence of doublets is a key confounder in single-cell RNA sequencing (scRNA-seq) data analysis. Computational methods have been developed for detecting doublets from scRNA-seq data. We developed an R package DoubletCollection to integrate the installation and execution of eight doublet-detection methods. DoubletCollection also provides a unified interface to perform and visualize downstream analysis after doublet detection. Here, we present a protocol of using DoubletCollection to benchmark doublet-detection methods. This protocol can automatically accommodate new doublet-detection methods in the fast-growing scRNA-seq field.

الجينوم

Identification of Repurposable Drugs and Adverse Drug Reactions for Various Courses of COVID-19 Based on Single-Cell RNA Sequencing Data

136 - Zhihan Wang , Kai Guo , Pan Gao 2020

Coronavirus disease 2019 (COVID-19) has impacted almost every part of human life worldwide, posing a massive threat to human health. There is no specific drug for COVID-19, highlighting the urgent need for the development of effective therapeutics. T o identify potentially repurposable drugs, we employed a systematic approach to mine candidates from U.S. FDA-approved drugs and preclinical small-molecule compounds by integrating the gene expression perturbation data for chemicals from the Library of Integrated Network-Based Cellular Signatures project with a publicly available single-cell RNA sequencing dataset from mild and severe COVID-19 patients. We identified 281 FDA-approved drugs that have the potential to be effective against SARS-CoV-2 infection, 16 of which are currently undergoing clinical trials to evaluate their efficacy against COVID-19. We experimentally tested the inhibitory effects of tyrphostin-AG-1478 and brefeldin-a on the replication of the single-stranded ribonucleic acid (ssRNA) virus influenza A virus. In conclusion, we have identified a list of repurposable anti-SARS-CoV-2 drugs using a systems biology approach.

الجينوم

Learning interpretable models of phenotypes from whole genome sequences with the Set Covering Machine

523 - Alexandre Drouin , Sebastien Gigu`ere , Vladana Sagatovich 2014

The increased affordability of whole genome sequencing has motivated its use for phenotypic studies. We address the problem of learning interpretable models for discrete phenotypes from whole genomes. We propose a general approach that relies on the Set Covering Machine and a k-mer representation of the genomes. We show results for the problem of predicting the resistance of Pseudomonas Aeruginosa, an important human pathogen, against 4 antibiotics. Our results demonstrate that extremely sparse models which are biologically relevant can be learnt using this approach.

الجينوم الهندسة الحاسوبية، المالية،العلوم التعلم الآلي

Quasi-universality in single-cell sequencing data

137 - Luis Aparicio , Mykola Bordyuh , Andrew J. Blumberg 2018

The development of single-cell technologies provides the opportunity to identify new cellular states and reconstruct novel cell-to-cell relationships. Applications range from understanding the transcriptional and epigenetic processes involved in meta zoan development to characterizing distinct cells types in heterogeneous populations like cancers or immune cells. However, analysis of the data is impeded by its unknown intrinsic biological and technical variability together with its sparseness; these factors complicate the identification of true biological signals amidst artifact and noise. Here we show that, across technologies, roughly 95% of the eigenvalues derived from each single-cell data set can be described by universal distributions predicted by Random Matrix Theory. Interestingly, 5% of the spectrum shows deviations from these distributions and present a phenomenon known as eigenvector localization, where information tightly concentrates in groups of cells. Some of the localized eigenvectors reflect underlying biological signal, and some are simply a consequence of the sparsity of single cell data; roughly 3% is artifactual. Based on the universal distributions and a technique for detecting sparsity induced localization, we present a strategy to identify the residual 2% of directions that encode biological information and thereby denoise single-cell data. We demonstrate the effectiveness of this approach by comparing with standard single-cell data analysis techniques in a variety of examples with marked cell populations.

الأساليب الكمية الميكانيكا الإحصائية الاحتمالات

A Manifold Proximal Linear Method for Sparse Spectral Clustering with Application to Single-Cell RNA Sequencing Data Analysis

104 - Zhongruo Wang , Bingyuan Liu , Shixiang Chen 2020

Spectral clustering is one of the fundamental unsupervised learning methods widely used in data analysis. Sparse spectral clustering (SSC) imposes sparsity to the spectral clustering and it improves the interpretability of the model. This paper consi ders a widely adopted model for SSC, which can be formulated as an optimization problem over the Stiefel manifold with nonsmooth and nonconvex objective. Such an optimization problem is very challenging to solve. Existing methods usually solve its convex relaxation or need to smooth its nonsmooth part using certain smoothing techniques. In this paper, we propose a manifold proximal linear method (ManPL) that solves the original SSC formulation. We also extend the algorithm to solve the multiple-kernel SSC problems, for which an alternating ManPL algorithm is proposed. Convergence and iteration complexity results of the proposed methods are established. We demonstrate the advantage of our proposed methods over existing methods via the single-cell RNA sequencing data analysis.

التعلم الالي التعلم الآلي التحسين والتحكم