Machine learning modeling of family wide enzyme-substrate specificity screens

96 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Connor Coley

تاريخ النشر 2021

مجال البحث علم الأحياء الهندسة المعلوماتية

والبحث باللغة English

تأليف Samuel Goldman - Ria Das - Kevin K. Yang

الجزيئات الحيوية التعلم الآلي

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.

قيم البحث

271 - Sean Mullane , Ruoyan Chen , Sri Vaishnavi Vemulapalli 2019

The biological function of a protein stems from its 3-dimensional structure, which is thermodynamically determined by the energetics of interatomic forces between its amino acid building blocks (the order of amino acids, known as the sequence, define s a protein). Given the costs (time, money, human resources) of determining protein structures via experimental means such as X-ray crystallography, can we better describe and compare protein 3D structures in a robust and efficient manner, so as to gain meaningful biological insights? We begin by considering a relatively simple problem, limiting ourselves to just protein secondary structural elements. Historically, many computational methods have been devised to classify amino acid residues in a protein chain into one of several discrete secondary structures, of which the most well-characterized are the geometrically regular $alpha$-helix and $beta$-sheet; irregular structural patterns, such as turns and loops, are less understood. Here, we present a study of Deep Learning techniques to classify the loop-like end cap structures which delimit $alpha$-helices. Previous work used highly empirical and heuristic methods to manually classify helix capping motifs. Instead, we use structural data directly--including (i) backbone torsion angles computed from 3D structures, (ii) macromolecular feature sets (e.g., physicochemical properties), and (iii) helix cap classification data (from CAPS-DB)--as the ground truth to train a bidirectional long short-term memory (BiLSTM) model to classify helix cap residues. We tried different network architectures and scanned hyperparameters in order to train and assess several models; we also trained a Support Vector Classifier (SVC) to use as a baseline. Ultimately, we achieved 85% class-balanced accuracy with a deep BiLSTM model.

الجزيئات الحيوية التعلم الآلي الأساليب الكمية

Review of Machine-Learning Methods for RNA Secondary Structure Prediction

113 - Qi Zhao , Zheng Zhao , Xiaoya Fan 2020

Secondary structure plays an important role in determining the function of non-coding RNAs. Hence, identifying RNA secondary structures is of great value to research. Computational prediction is a mainstream approach for predicting RNA secondary stru cture. Unfortunately, even though new methods have been proposed over the past 40 years, the performance of computational prediction methods has stagnated in the last decade. Recently, with the increasing availability of RNA structure data, new methods based on machine-learning technologies, especially deep learning, have alleviated the issue. In this review, we provide a comprehensive overview of RNA secondary structure prediction methods based on machine-learning technologies and a tabularized summary of the most important methods in this field. The current pending issues in the field of RNA secondary structure prediction and future trends are also discussed.

الجزيئات الحيوية التعلم الآلي التعلم الالي

Toward Drug-Target Interaction Prediction via Ensemble Modeling and Transfer Learning

125 - Po-Yu Kao , Shu-Min Kao , Nan-Lan Huang 2021

Drug-target interaction (DTI) prediction plays a crucial role in drug discovery, and deep learning approaches have achieved state-of-the-art performance in this field. We introduce an ensemble of deep learning models (EnsembleDLM) for DTI prediction. EnsembleDLM only uses the sequence information of chemical compounds and proteins, and it aggregates the predictions from multiple deep neural networks. This approach not only achieves state-of-the-art performance in Davis and KIBA datasets but also reaches cutting-edge performance in the cross-domain applications across different bio-activity types and different protein classes. We also demonstrate that EnsembleDLM achieves a good performance (Pearson correlation coefficient and concordance index > 0.8) in the new domain with approximately 50% transfer learning data, i.e., the training set has twice as much data as the test set.

الجزيئات الحيوية التعلم الآلي الأساليب الكمية

piSAAC: Extended notion of SAAC feature selection novel method for discrimination of Enzymes model using different machine learning algorithm

33 - Zaheer Ullah Khan , Dechang Pi , Izhar Ahmed Khan 2020

Enzymes and proteins are live driven biochemicals, which has a dramatic impact over the environment, in which it is active. So, therefore, it is highly looked-for to build such a robust and highly accurate automatic and computational model to accurat ely predict enzymes nature. In this study, a novel split amino acid composition model named piSAAC is proposed. In this model, protein sequence is discretized in equal and balanced terminus to fully evaluate the intrinsic correlation properties of the sequence. Several state-of-the-art algorithms have been employed to evaluate the proposed model. A 10-folds cross-validation evaluation is used for finding out the authenticity and robust-ness of the model using different statistical measures e.g. Accuracy, sensitivity, specificity, F-measure and area un-der ROC curve. The experimental results show that, probabilistic neural network algorithm with piSAAC feature extraction yields an accuracy of 98.01%, sensitivity of 97.12%, specificity of 95.87%, f-measure of 0.9812and AUC 0.95812, over dataset S1, accuracy of 97.85%, sensitivity of 97.54%, specificity of 96.24%, f-measure of 0.9774 and AUC 0.9803 over dataset S2. Evident from these excellent empirical results, the proposed model would be a very useful tool for academic research and drug designing related application areas.

الجزيئات الحيوية التعلم الآلي

ProGen: Language Modeling for Protein Generation

177 - Ali Madani , Bryan McCann , Nikhil Naik 2020

Generative modeling for protein engineering is key to solving fundamental problems in synthetic biology, medicine, and material science. We pose protein engineering as an unsupervised sequence generation problem in order to leverage the exponentially growing set of proteins that lack costly, structural annotations. We train a 1.2B-parameter language model, ProGen, on ~280M protein sequences conditioned on taxonomic and keyword tags such as molecular function and cellular component. This provides ProGen with an unprecedented range of evolutionary sequence diversity and allows it to generate with fine-grained control as demonstrated by metrics based on primary sequence similarity, secondary structure accuracy, and conformational energy.

الجزيئات الحيوية التعلم الآلي التعلم الالي