MutaGAN: A Seq2seq GAN Framework to Predict Mutations of Evolving Protein Populations

Published by Daniel Berman

Publication date: 2020

Research field: Biology, Informatics Engineering

Paper language: English

Author: Daniel S. Berman

The ability to predict the evolution of a pathogen would significantly improve the ability to control, prevent, and treat disease. Despite significant progress in other problem spaces, deep learning has yet to contribute to the problem of predicting mutations of evolving populations. To address this gap, we developed a novel machine learning framework using generative adversarial networks (GANs) with recurrent neural networks (RNNs) to accurately predict genetic mutations and evolution of future biological populations. Using a generalized time-reversible phylogenetic model of protein evolution with bootstrapped maximum likelihood tree estimation, we trained a sequence-to-sequence generator within an adversarial framework, named MutaGAN, to generate complete protein sequences augmented with possible mutations of future virus populations. Influenza virus sequences were identified as an ideal test case for this deep learning framework because influenza is a significant human pathogen with new strains emerging annually, and global surveillance efforts have generated a large amount of publicly available data from the National Center for Biotechnology Information's (NCBI) Influenza Virus Resource (IVR). MutaGAN generated child sequences from a given parent protein sequence with a median Levenshtein distance of 2.00 amino acids. Additionally, the generator was able to augment the majority of parent proteins with at least one mutation identified within the global influenza virus population. These results demonstrate the power of the MutaGAN framework to aid in pathogen forecasting, with implications for broad utility in evolutionary prediction for any protein population.
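
The Levenshtein metric quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function name `levenshtein`, the toy parent/child fragments, and the choice of which sequence pairs to compare are all assumptions made here for illustration.

```python
from statistics import median


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


# Toy parent/child pairs (made-up fragments, not real influenza sequences).
pairs = [
    ("MKTIIALSYIFCLVFA", "MKTIIALSYILCLVFA"),   # one substitution
    ("MKTIIALSYIFCLVFA", "MKAIIALSYIFCLVLA"),   # two substitutions
    ("MKTIIALSYIFCLVFA", "MKTIIALSYIFCLVFAQ"),  # one insertion
]
distances = [levenshtein(parent, child) for parent, child in pairs]
median_distance = median(distances)
```

On amino-acid strings, each unit of distance corresponds to one substituted, inserted, or deleted residue, so a median of 2.00 would mean the typical compared pair differs by two residue-level edits.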

Xinsong Du, Jae Min, Mattia Prosperi (2019)

Febrile neutropenia (FN) has been associated with high mortality, especially among adults with cancer. Understanding the patient- and provider-level heterogeneity in FN hospital admissions has the potential to inform personalized interventions focused on increasing survival of individuals with FN. We leverage machine learning techniques to disentangle the complex interactions among multi-domain risk factors in a population with FN. Data from the Healthcare Cost and Utilization Project (HCUP) National Inpatient Sample and Nationwide Inpatient Sample (NIS) were used to build machine learning-based models of mortality for adult cancer patients who were diagnosed with FN during a hospital admission. In particular, the importance of risk factors from different domains (including demographic, clinical, and hospital-associated information) was studied. A set of more interpretable (decision tree, logistic regression) as well as more black-box (random forest, gradient boosting, neural networks) models were analyzed and compared via multiple cross-validation. Our results demonstrate that a linear prediction score of FN mortality among adults with cancer, based on admission information, is effective in classifying high-risk patients; clinical diagnoses are the domain with the highest predictive power. A number of the risk variables (e.g., sepsis, kidney failure) identified in this study are clinically actionable and may inform future studies; studies looking at patients' prior medical history are warranted.

Quantitative Methods, Machine Learning

Pre-training of Graph Neural Network for Modeling Effects of Mutations on Protein-Protein Binding Affinity

Xianggen Liu, Yunan Luo, Sen Song (2020)

Modeling the effects of mutations on binding affinity plays a crucial role in protein engineering and drug design. In this study, we develop a novel deep learning-based framework, named GraphPPI, to predict the binding affinity changes upon mutations based on the features provided by a graph neural network (GNN). In particular, GraphPPI first employs a well-designed pre-training scheme that forces the GNN to capture, in an unsupervised manner, the features that are predictive of the effects of mutations on binding affinity, and then integrates these graphical features with gradient-boosting trees to perform the prediction. Experiments showed that, without any annotated signals, GraphPPI can capture meaningful patterns of the protein structures. GraphPPI also achieved new state-of-the-art performance in predicting the binding affinity changes upon both single- and multi-point mutations on five benchmark datasets. In-depth analyses also showed that GraphPPI can accurately estimate the effects of mutations on the binding affinity between SARS-CoV-2 and its neutralizing antibodies. These results establish GraphPPI as a powerful and useful computational tool in the study of protein design.

Biomolecules, Machine Learning, Quantitative Methods

Comparison of Machine Learning Classifiers to Predict Patient Survival and Genetics of GBM: Towards a Standardized Model for Clinical Implementation

Luca Pasquini, Antonio Napolitano, Martina Lucignani (2021)

Radiomic models have been shown to outperform clinical data for outcome prediction in glioblastoma (GBM). However, clinical implementation is limited by a lack of parameter standardization. We aimed to compare nine machine learning classifiers, with different optimization parameters, to predict overall survival (OS), isocitrate dehydrogenase (IDH) mutation, O-6-methylguanine-DNA-methyltransferase (MGMT) promoter methylation, epidermal growth factor receptor (EGFR) VII amplification and Ki-67 expression in GBM patients, based on radiomic features from conventional and advanced MR. A total of 156 adult patients with a pathologic diagnosis of GBM were included. Three tumoral regions were analyzed: contrast-enhancing tumor, necrosis and non-enhancing tumor, selected by manual segmentation. Radiomic features were extracted with a custom version of Pyradiomics and selected through the Boruta algorithm. A Grid Search algorithm was applied during K-fold cross-validation (K=10) repeated 4 times to find the parameters giving the highest mean and lowest spread of accuracy. Once optimal parameters were identified, model performance was assessed in terms of the area under the receiver operating characteristic curve (AUC-ROC). Metaheuristic and ensemble classifiers showed the best performance across tasks. xGB obtained maximum accuracy for OS (74.5%), AB for IDH mutation (88%), MGMT methylation (71.7%), Ki-67 expression (86.6%), and EGFR amplification (81.6%). The best-performing features shed light on possible correlations between MR and tumor histology.
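
The parameter search described in this abstract (a grid search scored over 10-fold cross-validation repeated 4 times) can be sketched in plain Python. This is only an illustration under stated assumptions. The names `repeated_kfold`, `grid_search`, and `toy_score` are hypothetical, the scorer is a stand-in for fitting a real classifier, and ranking by highest mean with ties broken by lowest spread is one possible reading of "highest mean and lowest spread".

```python
import random
from statistics import mean, stdev


def repeated_kfold(n_samples, k=10, repeats=4, seed=0):
    """Yield (train_idx, test_idx) pairs for `repeats` shuffled K-fold splits."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k near-equal folds
        for i, test_idx in enumerate(folds):
            train_idx = [s for j, fold in enumerate(folds) if j != i for s in fold]
            yield train_idx, test_idx


def grid_search(param_grid, score_fn, n_samples, k=10, repeats=4, seed=0):
    """Return (params, mean_score, spread) with highest mean, then lowest spread."""
    results = []
    for params in param_grid:
        scores = [score_fn(params, train, test)
                  for train, test in repeated_kfold(n_samples, k, repeats, seed)]
        results.append((params, mean(scores), stdev(scores)))
    return max(results, key=lambda r: (r[1], -r[2]))


# Hypothetical scorer: a real one would fit a classifier on train_idx and
# return its accuracy on test_idx. This stand-in simply prefers depth 3.
def toy_score(params, train_idx, test_idx):
    return 1.0 - abs(params["depth"] - 3) / 10


grid = [{"depth": d} for d in (1, 3, 5)]
best_params, best_mean, best_spread = grid_search(grid, toy_score, n_samples=156)
```

Reusing the same seed for every parameter setting means all candidates are scored on identical splits, which keeps the comparison between settings fair.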

Quantitative Methods, Machine Learning, Genomics

Phylogenetic Profiles as a Unified Framework for Measuring Protein Structure, Function and Evolution

Kyung Dae Ko, Yoojin Hong (2008)

The sequence of amino acids in a protein is believed to determine its native state structure, which in turn is related to the functionality of the protein. In addition, information pertaining to evolutionary relationships is contained in homologous sequences. One powerful method for inferring these sequence attributes is through comparison of a query sequence with reference sequences that contain significant homology and whose structure, function, and/or evolutionary relationships are already known. In spite of decades of concerted work, there is no simple framework for deducing structure, function, and evolutionary (SF&E) relationships directly from sequence information alone, especially when the pairwise identity is below a threshold of roughly 25% [1,2]. However, recent research has shown that sequence identity as low as 8% is sufficient to yield common structure/function relationships, and sequence identities as large as 88% may yet result in distinct structure and function [3,4]. Starting with the basic premise that protein sequence encodes information about SF&E, one might ask how one could tease out these measures in an unbiased manner. Here we present a unified framework for inferring SF&E from sequence information using a knowledge-based approach which generates phylogenetic profiles in an unbiased manner. We illustrate the power of phylogenetic profiles generated using the Gestalt Domain Detection Algorithm Basic Local Alignment Tool (GDDA-BLAST) to derive structural domains, functional annotation, and evolutionary relationships for a host of ion channels and human proteins of unknown function. These data are in excellent accord with published data and new experiments. Our results suggest that there is a wealth of previously unexplored information in protein sequence.
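
The pairwise identity figures mentioned in this abstract can be made concrete with a small Python sketch. It assumes the two sequences are already aligned and counts identity over gap-free columns only. Other denominators are in common use, and the function name `percent_identity` and the toy fragments are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned, gap-free columns whose residues match."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != gap and y != gap]
    if not columns:
        return 0.0
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Toy aligned fragments (made up for illustration): 7 gap-free columns, 6 matches.
identity = percent_identity("MKT-IALSY", "MRTAIAL-Y")
```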

Quantitative Methods, Populations and Evolution

Nanopores -- a Versatile Tool to Study Protein Dynamics

Sonja Schmid, Cees Dekker (2020)

Proteins are the active workhorses in our body. These biomolecules perform all vital cellular functions, from DNA replication and general biosynthesis to metabolic signaling and environmental sensing. While static 3D structures are now readily available, observing the functional cycle of proteins - involving conformational changes and interactions - remains very challenging, e.g., due to ensemble averaging. However, time-resolved information is crucial to gain a mechanistic understanding of protein function. Single-molecule techniques such as FRET and force spectroscopies provide answers but can be limited by the required labelling, a narrow time bandwidth, and more. Here, we describe electrical nanopore detection as a tool for probing protein dynamics. With a time bandwidth ranging from microseconds to hours, it covers an exceptionally wide range of timescales that is very relevant for protein function. First, we discuss the working principle of label-free nanopore experiments, various pore designs, instrumentation, and the characteristics of nanopore signals. In the second part, we review a few nanopore experiments that solved research questions in protein science, and we compare nanopores to other single-molecule techniques. We hope to make electrical nanopore sensing more accessible to the biochemical community, and to inspire new creative solutions to resolve a variety of protein dynamics - one molecule at a time.

Quantitative Methods