Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design

306 0 0.0 ( 0 )

تحميل البحث استخدام كمرجع

نشر من قبل Igor Melnyk

تاريخ النشر 2021

مجال البحث الهندسة المعلوماتية علم الأحياء

والبحث باللغة English

تأليف Yue Cao - Payel Das - Vijil Chenthamarakshan

التعلم الآلي الجزيئات الحيوية

قم بزيارة صفحتنا على فيسبوك

‎Shamra Academia - شمرا أكاديميا‎

اسأل ChatGPT حول البحث

الملخص بالعربية الملخص بالإنكليزية

Designing novel protein sequences for a desired 3D topological fold is a fundamental yet non-trivial task in protein engineering. Challenges exist due to the complex sequence--fold relationship, as well as the difficulties to capture the diversity of the sequences (therefore structures and functions) within a fold. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific target fold. To model the complex sequence--structure relationship, Fold2Seq jointly learns a sequence embedding using a transformer and a fold embedding from the density of secondary structural elements in 3D voxels. On test sets with single, high-resolution and complete structure inputs for individual folds, our experiments demonstrate improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design, when compared to existing state-of-the-art methods that include data-driven deep generative models and physics-based RosettaDesign. The unique advantages of fold-based Fold2Seq, in comparison to a structure-based deep model and RosettaDesign, become more evident on three additional real-world challenges originating from low-quality, incomplete, or ambiguous input structures. Source code and data are available at https://github.com/IBM/fold2seq.

قيم البحث

364 - Tristan Bepler , Bonnie Berger 2019

Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function . Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We newly approach this problem through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings --- one per amino acid position --- that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods and even a top-performing structure-based alignment method when predicting structural similarity, our goal. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state-of-the-art in transmembrane domain prediction.

التعلم الآلي الجزيئات الحيوية التعلم الالي

Efficient fold-change detection based on protein-protein interactions

406 - Wouter Buijsman , Michael Sheinman 2012

Various biological sensory systems exhibit a response to a relative change of the stimulus, often referred to as fold-change detection. In the last few years fold-change detecting mechanisms, based on transcriptional networks, have been proposed. Her e we present fold-change detecting mechanism, based on protein-protein interactions, consisting of two interacting proteins. This mechanism, in contrast to previously proposed mechanisms, does not consume chemical energy and is not subject to transcriptional and translational noise. We show by analytical and numerical calculations, that the mechanism can have a fast, precise and efficient response for parameters that are relevant to eukaryotic cells.

الفيزياء البيولوجية الشبكات الجزيئية

Generative Capacity of Probabilistic Protein Sequence Models

140 - Francisco McGee , Quentin Novinger , Ronald M. Levy 2020

Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict the effect of mutations. Despite encouraging results, quantitative characterizati on and comparison of GPSM-generated probability distributions is still lacking. It is currently unclear whether GPSMs can faithfully reproduce the complex multi-residue mutation patterns observed in natural sequences arising due to epistasis. We develop a set of sequence statistics to assess the generative capacity of three GPSMs of recent interest: the pairwise Potts Hamiltonian, the VAE, and the site-independent model, using natural and synthetic datasets. We show that the generative capacity of the Potts Hamiltonian model is the largest, in that the higher order mutational statistics generated by the model agree with those observed for natural sequences. In contrast, we show that the VAEs generative capacity lies between the pairwise Potts and site-independent models. Importantly, our work measures GPSM generative capacity in terms of higher-order sequence covariation statistics which we have developed, and provides a new framework for evaluating and interpreting GPSM accuracy that emphasizes the role of epistasis.

التعلم الآلي تحليل البيانات والإحصاءات والاحتمال الأساليب الكمية

Autofocused oracles for model-based design

64 - Clara Fannjiang , Jennifer Listgarten 2020

Data-driven design is making headway into a number of application areas, including protein, small-molecule, and materials engineering. The design goal is to construct an object with desired properties, such as a protein that binds to a therapeutic ta rget, or a superconducting material with a higher critical temperature than previously observed. To that end, costly experimental measurements are being replaced with calls to high-capacity regression models trained on labeled data, which can be leveraged in an in silico search for design candidates. However, the design goal necessitates moving into regions of the design space beyond where such models were trained. Therefore, one can ask: should the regression model be altered as the design algorithm explores the design space, in the absence of new data? Herein, we answer this question in the affirmative. In particular, we (i) formalize the data-driven design problem as a non-zero-sum game, (ii) develop a principled strategy for retraining the regression model as the design algorithm proceeds---what we refer to as autofocusing, and (iii) demonstrate the promise of autofocusing empirically.

التعلم الآلي الأساليب الكمية التعلم الالي

Selection originating from protein stability/foldability: Relationships between protein folding free energy, sequence ensemble, and fitness