Research papers, master's and doctoral theses on Biomolecules

Emergence of functional information from multivariate correlations

220 - Christoph Adami , Nitash C G 2021

The information content of symbolic sequences (such as nucleic- or amino acid sequences, but also neuronal firings or strings of letters) can be calculated from an ensemble of such sequences, but because information cannot be assigned to single sequences, we cannot correlate information to other observables attached to the sequence. Here we show that an information score obtained from multivariate (multiple-variable) correlations within sequences of a training ensemble can be used to predict observables of out-of-sample sequences with an accuracy that scales with the complexity of correlations, showing that functional information emerges from a hierarchy of multi-variable correlations.
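As a rough illustration of how information content can be calculated from an ensemble, the sketch below sums, over aligned positions, the gap between the maximal and the observed per-site entropy. It is only a first-order estimate that ignores the multivariate correlations the abstract is about, and it is not the authors' score; the function names `site_entropy` and `ensemble_information` and the toy ensemble are hypothetical.

```python
import math
from collections import Counter

def site_entropy(column, alphabet_size):
    """Shannon entropy of one aligned column, using log base alphabet_size."""
    counts = Counter(column)
    total = len(column)
    return -sum((c / total) * math.log(c / total, alphabet_size)
                for c in counts.values())

def ensemble_information(sequences, alphabet_size=4):
    """First-order information: sum over sites of (1 - per-site entropy).

    Assumes the sequences are already aligned and of equal length.
    Correlations between sites are deliberately ignored here.
    """
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to equal length")
    return sum(1.0 - site_entropy(col, alphabet_size)
               for col in zip(*sequences))

# Toy ensemble (made up): three conserved sites, one fully variable site.
ensemble = ["ACGT", "ACGA", "ACGC", "ACGG"]
info = ensemble_information(ensemble)
```

Note that the value is a property of the whole ensemble: a single sequence passed alone would trivially look fully conserved, which is the limitation the abstract points to.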

Biomolecules, Information Theory

PDBench: Evaluating Computational Methods for Protein Sequence Design

560 - Leonardo V. Castorina , Rokas Petrenas , Katric Subr 2021

Proteins perform critical processes in all living systems: converting solar energy into chemical energy, replicating DNA, forming the basis of high-performance materials, sensing and much more. While an incredible range of functionality has been sampled in nature, it accounts for a tiny fraction of the possible protein universe. If we could tap into this pool of unexplored protein structures, we could search for novel proteins with useful properties that we could apply to tackle the environmental and medical challenges facing humanity. This is the purpose of protein design. Sequence design is an important aspect of protein design, and many successful methods to do this have been developed. Recently, deep-learning methods that frame it as a classification problem have emerged as a powerful approach. Beyond their reported improvement in performance, their primary advantage over physics-based methods is that the computational burden is shifted from the user to the developers, thereby increasing accessibility to the design method. Despite this trend, the tools for assessment and comparison of such models remain quite generic. The goal of this paper is both to address the timely problem of evaluation and to shine a spotlight, within the Machine Learning community, on specific assessment criteria that will accelerate impact. We present a carefully curated benchmark set of proteins and propose a number of standard tests to assess the performance of deep-learning-based methods. Our robust benchmark provides biological insight into the behaviour of design methods, which is essential for evaluating their performance and utility. We compare five existing models with two novel models for sequence prediction. Finally, we test the designs produced by these models with AlphaFold2, a state-of-the-art structure-prediction algorithm, to determine if they are likely to fold into the intended 3D shapes.

Biomolecules, Machine Learning

A novel hotspot of gelsolin instability and aggregation propensity triggers a new mechanism of amyloidosis

146 - Michela Bollati , Luisa Diomede , Toni Giorgino 2021

The multidomain protein gelsolin (GSN) is composed of six homologous modules, sequentially named G1 to G6. Single point substitutions in this protein are responsible for AGel amyloidosis, a hereditary disease characterized by progressive corneal lattice dystrophy, cutis laxa, and polyneuropathy. Several different amyloidogenic variants of GSN have been identified over the years, but only the most common D187N/Y mutants, in G2, have been thoroughly characterized, and the underlying functional mechanistic link between mutation, altered protein structure, susceptibility to aberrant furin cleavage and aggregative potential resolved. Little is known about the recently identified mutations A551P, E553K and M517R hosted at the interface between G4 and G5, whose aggregation process likely follows an alternative pathway. We demonstrate that these three substitutions impair temperature and pressure stability of GSN but do not increase its susceptibility to furin cleavage, the first event of the canonical aggregation pathway. The variants are also characterized by a higher tendency to aggregate in the unproteolysed forms and show a higher proteotoxicity in a C. elegans-based assay. Structural studies point to a destabilization of the interface between G4 and G5 due to three different structural determinants: beta-strand breaking, steric hindrance and/or charge repulsion, all implying the impairment of interdomain contacts. All available evidence suggests that the rearrangement of the protein global architecture triggers a furin-independent aggregation of the protein, supporting the existence of a non-canonical pathway of gelsolin amyloidosis pathogenesis.

Biomolecules

Vibrational density of states capture the role of dynamic allostery in protein evolution

151 - Tushar Modi , Matthias Heyden , S. Banu Ozkan 2021

Previous studies of the flexibilities of ancestral proteins suggest that proteins evolve their function by altering their native state ensemble. Here we propose a more direct method of visualizing this by measuring the changes in the vibrational density of states (VDOS) of proteins as they evolve. Through analysis of VDOS profiles of ancestral and extant proteins we observe that $\beta$-lactamase and thioredoxins evolve by altering their density of states in the terahertz region. In particular, the shift in VDOS profiles between ancestral and extant proteins suggests that nature utilizes dynamic allostery for functional evolution. Moreover, we also show that the VDOS profile of an individual position can be used to describe the flexibility changes, particularly those without any amino acid substitution.

Biological Physics, Biomolecules

Scaffold-Induced Molecular Graph (SIMG): Effective Graph Sampling Methods for High-Throughput Computational Drug Discovery

138 - Austin Clyde , Ashka Shah , Max Zvyagin 2021

Scaffold-based drug discovery (SBDD) is a technique for drug discovery which pins chemical scaffolds as the framework of design. Scaffolds, or molecular frameworks, organize the design of compounds into local neighborhoods. We formalize scaffold-based drug discovery into a network design. Utilizing docking data from SARS-CoV-2 virtual screening studies and JAK2 kinase assay data, we showcase how a scaffold-based conception of chemical space is intuitive for design. Lastly, we highlight the utility of scaffold-based networks for chemical space as a potential solution to the intractable enumeration problem of chemical space by working inductively on local neighborhoods.

Quantitative Methods, Biomolecules

Emerging vaccine-breakthrough SARS-CoV-2 variants

181 - Rui Wang , Jiahui Chen , Yuta Hozumi 2021

The recent global surge in COVID-19 infections has been fueled by new SARS-CoV-2 variants such as Alpha, Beta, Gamma, and Delta. The molecular mechanism underlying such a surge is elusive due to 4,653 non-degenerate mutations on the spike protein, which is the target of most COVID-19 vaccines. The understanding of the molecular mechanism of transmission and evolution is a prerequisite to foresee the trend of emerging vaccine-breakthrough variants and the design of mutation-proof vaccines and monoclonal antibodies. We integrate the genotyping of 1,489,884 SARS-CoV-2 genome isolates, 130 human antibodies, tens of thousands of mutational data points, topological data analysis, and deep learning to reveal the SARS-CoV-2 evolution mechanism and forecast emerging vaccine-escape variants. We show that infectivity-strengthening and antibody-disruptive co-mutations on the S protein RBD can quantitatively explain the infectivity and virulence of all prevailing variants. We demonstrate that Lambda is as infectious as Delta but is more vaccine-resistant. We analyze emerging vaccine-breakthrough co-mutations in 20 countries, including the United Kingdom, the United States, Denmark, Brazil, and Germany. We envision that natural selection through infectivity will continue to be the main mechanism for viral evolution among unvaccinated populations, while antibody-disruptive co-mutations will fuel the future growth of vaccine-breakthrough variants among fully vaccinated populations. Finally, we have identified the co-mutations that have a high likelihood of becoming dominant: [A411S, L452R, T478K], [L452R, T478K, N501Y], [V401L, L452R, T478K], [K417N, L452R, T478K], [L452R, T478K, E484K, N501Y], and [P384L, K417N, E484K, N501Y]. We predict they, particularly the last four, will break through existing vaccines. We foresee an urgent need to develop new vaccines that target these co-mutations.

Biomolecules, Populations and Evolution

Protein Folding Neural Networks Are Not Robust

212 - Sumit Kumar Jha , Arvind Ramanathan , Rickard Ewetz 2021

Deep neural networks such as AlphaFold and RoseTTAFold predict remarkably accurate structures of proteins compared to other algorithmic approaches. It is known that biologically small perturbations in the protein sequence do not lead to drastic changes in the protein structure. In this paper, we demonstrate that RoseTTAFold does not exhibit such robustness despite its high accuracy, and biologically small perturbations for some input sequences result in radically different predicted protein structures. This raises the challenge of detecting when these predicted protein structures cannot be trusted. We define the robustness measure for the predicted structure of a protein sequence to be the inverse of the root-mean-square distance (RMSD) between the predicted structure and the structure of its adversarially perturbed sequence. We use adversarial attack methods to create adversarial protein sequences, and show that the RMSD in the predicted protein structure ranges from 0.119 Å to 34.162 Å when the adversarial perturbations are bounded by 20 units in the BLOSUM62 distance. This demonstrates very high variance in the robustness measure of the predicted structures. We show that the magnitude of the correlation (0.917) between our robustness measure and the RMSD between the predicted structure and the ground truth is high, that is, the predictions with low robustness measure cannot be trusted. This is the first paper demonstrating the susceptibility of RoseTTAFold to adversarial attacks.
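The robustness measure described above (inverse RMSD between two predicted structures) can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes both structures have the same number of atoms and have already been superimposed, which a real workflow would have to do first (for example with a Kabsch alignment). The function names and coordinates are hypothetical.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square distance between two equally sized lists of (x, y, z).

    Assumption: the structures are already superimposed; no alignment is done.
    """
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

def robustness(coords_original, coords_perturbed):
    """Inverse RMSD; identical structures are treated as infinitely robust."""
    distance = rmsd(coords_original, coords_perturbed)
    return math.inf if distance == 0 else 1.0 / distance

# Made-up coordinates in angstroms: every atom is displaced by 2 along z.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted_after_perturbation = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
distance = rmsd(predicted, predicted_after_perturbation)
score = robustness(predicted, predicted_after_perturbation)
```

Under this definition a large structural change after a small sequence perturbation gives a score near zero, which is the case the abstract flags as untrustworthy.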

Biomolecules, Machine Learning

adabmDCA: Adaptive Boltzmann machine learning for biological sequences

115 - Anna Paola Muntoni , Andrea Pagnani , Martin Weigt 2021

Boltzmann machines are energy-based models that have been shown to provide an accurate statistical description of domains of evolutionarily related protein and RNA families. They are parametrized in terms of local biases accounting for residue conservation, and pairwise terms to model epistatic coevolution between residues. From the model parameters, it is possible to extract an accurate prediction of the three-dimensional contact map of the target domain. More recently, the accuracy of these models has also been assessed in terms of their ability to predict mutational effects and generate in silico functional sequences. Our adaptive implementation of Boltzmann machine learning, adabmDCA, can be generally applied to both protein and RNA families and accomplishes several learning set-ups, depending on the complexity of the input data and on the user requirements. The code is fully available at https://github.com/anna-pa-m/adabmDCA. As an example, we have performed the learning of three Boltzmann machines modeling the Kunitz and Beta-lactamase2 protein domains and TPP-riboswitch RNA domain. The models learned by adabmDCA are comparable to those obtained by state-of-the-art techniques for this task, in terms of the quality of the inferred contact map as well as of the synthetically generated sequences. In addition, the code implements both equilibrium and out-of-equilibrium learning, which allows for an accurate and lossless training when the equilibrium one is prohibitive in terms of computational time, and allows for pruning irrelevant parameters using an information-based criterion.
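To make the parametrization concrete, here is a small sketch of the usual Potts-style energy with local biases and pairwise couplings, E(s) = -Σ_i h_i(s_i) - Σ_{i<j} J_ij(s_i, s_j), where lower energy means higher model probability. It does not use adabmDCA or its file formats; the function `potts_energy`, the dictionary layout, and the parameter values are assumptions made purely for illustration.

```python
def potts_energy(seq, fields, couplings):
    """Energy of a sequence under local fields and pairwise couplings.

    fields[i][a]               : bias for symbol a at position i (hypothetical layout)
    couplings[(i, j)][(a, b)]  : coupling for symbols a, b at positions i < j
    Missing entries are treated as zero.
    """
    energy = 0.0
    for i, a in enumerate(seq):
        energy -= fields[i].get(a, 0.0)
        for j in range(i + 1, len(seq)):
            energy -= couplings.get((i, j), {}).get((a, seq[j]), 0.0)
    return energy

# Toy two-site model over the symbols A and C (values are made up).
fields = [{"A": 1.0, "C": 0.0}, {"A": 0.0, "C": 0.5}]
couplings = {(0, 1): {("A", "C"): 2.0}}

energy_favoured = potts_energy("AC", fields, couplings)
energy_other = potts_energy("CA", fields, couplings)
```

In this toy model the sequence "AC" collects both field terms and the coupling term, so it gets a lower energy than "CA", which collects none.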

Quantitative Methods, Disordered Systems and Neural Networks, Biomolecules

Machine learning modeling of family wide enzyme-substrate specificity screens

95 - Samuel Goldman , Ria Das , Kevin K. Yang 2021

Biocatalysis is a promising approach to sustainably synthesize pharmaceuticals, complex natural products, and commodity chemicals at scale. However, the adoption of biocatalysis is limited by our ability to select enzymes that will catalyze their natural chemical transformation on non-natural substrates. While machine learning and in silico directed evolution are well-posed for this predictive modeling challenge, efforts to date have primarily aimed to increase activity against a single known substrate, rather than to identify enzymes capable of acting on new substrates of interest. To address this need, we curate 6 different high-quality enzyme family screens from the literature that each measure multiple enzymes against multiple substrates. We compare machine learning-based compound-protein interaction (CPI) modeling approaches from the literature used for predicting drug-target interactions. Surprisingly, comparing these interaction-based models against collections of independent (single task) enzyme-only or substrate-only models reveals that current CPI approaches are incapable of learning interactions between compounds and proteins in the current family level data regime. We further validate this observation by demonstrating that our no-interaction baseline can outperform CPI-based models from the literature used to guide the discovery of kinase inhibitors. Given the high performance of non-interaction based models, we introduce a new structure-based strategy for pooling residue representations across a protein sequence. Altogether, this work motivates a principled path forward in order to build and evaluate meaningful predictive models for biocatalysis and other drug discovery applications.

Biomolecules, Machine Learning

Quantum Crystallography: Projectors and kernel subspaces preserving N-representability

138 - Cherif F. Matta , Lou Massa 2021

Consider a projector matrix P, representing the first order reduced density matrix in a basis of orthonormal atom-centric basis functions. A mathematical question arises, and that is, how to break P into its natural component kernel projector matrices, while preserving N-representability of P. The answer relies upon 2-projector triple products, $P_j P P_j$. The triple product solutions, applicable within the quantum crystallography of large molecules, are determined by a new form of the Clinton equations, which, in their original form, have long been used to ensure N-representability of density matrices consistent with X-ray diffraction scattering factors.

Quantum Physics, Biomolecules
