In this report, we present an unsupervised machine learning method for grouping molecular systems according to similarity in their dynamics or structures using Ward's minimum variance objective function. We first apply minimum variance clustering to a set of simulated tripeptides, using the information-theoretic Jensen-Shannon divergence between Markovian transition matrices, in order to gain insight into how point mutations affect protein dynamics. Then, we extend the method to partition two chemoinformatic datasets according to structural similarity to motivate a train/validation/test split for supervised learning that avoids overfitting.
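The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that two transition matrices are compared by averaging the square root of the Jensen-Shannon divergence over corresponding rows, and that Ward's criterion is applied to those pairwise dissimilarities through the Lance-Williams update. The helper names (`sticky_matrix`, `matrix_distance`, `ward_clusters`) and the toy matrices are hypothetical.

```python
import math
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def matrix_distance(P, Q):
    """Assumed dissimilarity: mean over rows of sqrt(JS divergence)."""
    row_terms = (math.sqrt(max(js_divergence(p, q), 0.0)) for p, q in zip(P, Q))
    return sum(row_terms) / len(P)


def ward_clusters(dist, n_clusters):
    """Agglomerative clustering with Ward's criterion (Lance-Williams update)."""
    n = len(dist)
    members = {i: [i] for i in range(n)}
    # Squared dissimilarities between all current clusters.
    d2 = {frozenset((i, j)): dist[i][j] ** 2 for i, j in combinations(range(n), 2)}
    next_id = n
    while len(members) > n_clusters:
        pair = min(d2, key=d2.get)          # closest pair of clusters
        i, j = tuple(pair)
        dij = d2.pop(pair)
        ni, nj = len(members[i]), len(members[j])
        merged = members.pop(i) + members.pop(j)
        for k in members:                   # update distances to the merged cluster
            nk = len(members[k])
            dik = d2.pop(frozenset((i, k)))
            djk = d2.pop(frozenset((j, k)))
            d2[frozenset((next_id, k))] = (
                (ni + nk) * dik + (nj + nk) * djk - nk * dij
            ) / (ni + nj + nk)
        members[next_id] = merged
        next_id += 1
    return sorted(sorted(m) for m in members.values())


def sticky_matrix(self_prob, n_states=3):
    """Toy row-stochastic matrix: stay with self_prob, otherwise jump uniformly."""
    off = (1.0 - self_prob) / (n_states - 1)
    return [[self_prob if r == c else off for c in range(n_states)]
            for r in range(n_states)]


# Hypothetical data: three slowly mixing and three quickly mixing systems.
matrices = [sticky_matrix(p) for p in (0.90, 0.92, 0.94, 0.30, 0.32, 0.34)]
dist = [[matrix_distance(P, Q) for Q in matrices] for P in matrices]
clusters = ward_clusters(dist, n_clusters=2)
print(clusters)
```

On this toy input the two clusters separate the slowly mixing matrices from the quickly mixing ones. Ward's criterion is strictly defined for Euclidean distances, so applying it to a divergence-based dissimilarity is a heuristic here.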
A transition rate model of cargo transport by $N$ molecular motors is proposed. Under the assumption of steady state, the force-velocity curve of a multi-motor system can be derived from the force-velocity curve of a single motor. Our work shows that, under low load, the velocity of a multi-motor system can either decrease or increase with increasing motor number, depending on the single-motor force-velocity curve; most commonly, the velocity decreases. This gives a possible explanation for some recent
In spite of decades of research, much remains to be discovered about folding: the detailed structure of the initial (unfolded) state, vestigial folding instructions remaining only in the unfolded state, the interaction of the molecule with the solvent, instantaneous power at each point within the molecule during folding, the fact that the process is stable in spite of myriad possible disturbances, potential stabilization of the trajectory by chaos, and, of course, the exact physical mechanism (code or instructions) by which the folding process is specified in the amino acid sequence. Simulations based upon microscopic physics have had some spectacular successes and continue to improve, particularly as supercomputer capabilities increase. The simulations, exciting as they are, are still too slow and expensive to deal with the enormous number of molecules of interest. In this paper, we introduce an approximate model based upon physics, empirics, and information science, which is proposed for use in machine learning applications in which very large numbers of sub-simulations must be made. In particular, we focus upon the learning phase of machine learning applications and argue that our model is sufficiently close to the physics that, in spite of its approximate nature, it can facilitate stepping through machine learning solutions to explore the mechanics of folding mentioned above. We particularly emphasize the exploration of energy flow (power) within the molecule during folding, the possibility of energy scale invariance (above a threshold), vestigial information in the unfolded state, and statistical analysis of an ensemble of folding micro-steps as attractive targets for such machine learning analysis.
How to produce expressive molecular representations is a fundamental challenge in AI-driven drug discovery. Graph neural networks (GNNs) have emerged as a powerful technique for modeling molecular data. However, previous supervised approaches usually suffer from the scarcity of labeled data and generalize poorly. Here, we propose a novel Molecular Pre-training Graph-based deep learning framework, named MPG, that learns molecular representations from large-scale unlabeled molecules. In MPG, we propose a powerful MolGNet model and an effective self-supervised strategy for pre-training the model at both the node and graph levels. After pre-training on 11 million unlabeled molecules, we show that MolGNet can capture valuable chemical insights to produce interpretable representations. The pre-trained MolGNet can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction, involving 13 benchmark datasets. Our work demonstrates that MPG is a promising new approach for the drug discovery pipeline.
In multi-resolution simulations, different system components are simultaneously modelled at different levels of resolution, these being smoothly coupled together. In the case of enzyme systems, computationally expensive atomistic detail is needed in the active site to capture the chemistry of substrate binding. Global properties of the rest of the protein also play an essential role, determining the structure and fluctuations of the binding site; however, these can be modelled on a coarser level. Similarly, in the most computationally efficient scheme only the solvent hydrating the active site requires atomistic detail. We present a methodology to couple atomistic and coarse-grained protein models, while solvating the atomistic part of the protein in atomistic water. This allows a free choice of which protein and solvent degrees of freedom to include atomistically, without loss of accuracy in the atomistic description. This multi-resolution methodology can successfully model stable ligand binding, and we further confirm its validity via an exploration of system properties relevant to enzymatic function. In addition to a computational speedup, such an approach can allow the identification of the essential degrees of freedom playing a role in a given process, potentially yielding new insights into biomolecular function.
Metabolic heterogeneity is widely recognised as the next challenge in our understanding of non-genetic variation. A growing body of evidence suggests that metabolic heterogeneity may result from the inherent stochasticity of intracellular events. However, metabolism has been traditionally viewed as a purely deterministic process, on the basis that highly abundant metabolites tend to filter out stochastic phenomena. Here we bridge this gap with a general method for prediction of metabolite distributions across single cells. By exploiting the separation of time scales between enzyme expression and enzyme kinetics, our method produces estimates for metabolite distributions without the lengthy stochastic simulations that would be typically required for large metabolic models. The metabolite distributions take the form of Gaussian mixture models that are directly computable from single-cell expression data and standard deterministic models for metabolic pathways. The proposed mixture models provide a systematic method to predict the impact of biochemical parameters on metabolite distributions. Our method lays the groundwork for identifying the molecular processes that shape metabolic heterogeneity and its functional implications in disease.
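The idea of building a Gaussian mixture from expression data and a deterministic pathway model can be illustrated with a toy sketch. This is not the authors' method. It assumes a single metabolite with constant influx `k_in` and Michaelis-Menten consumption by one enzyme, treats each observed enzyme copy number as a slow, frozen state, and sets each component's variance with a linear noise approximation for a one-variable birth-death process. All function names and parameter values are hypothetical.

```python
import math


def steady_state(k_in, k_cat, K_M, enzyme):
    """Deterministic steady state of ds/dt = k_in - k_cat*E*s/(K_M + s)."""
    v_max = k_cat * enzyme
    if v_max <= k_in:
        raise ValueError("no finite steady state: consumption cannot match influx")
    return k_in * K_M / (v_max - k_in)


def lna_variance(k_in, k_cat, K_M, enzyme):
    """Assumed variance: linear noise approximation, (birth + death) / (2 * slope)."""
    s = steady_state(k_in, k_cat, K_M, enzyme)
    slope = k_cat * enzyme * K_M / (K_M + s) ** 2   # d(consumption)/ds at s
    return k_in / slope                             # birth = death = k_in at steady state


def metabolite_mixture(enzyme_counts, k_in, k_cat, K_M):
    """One Gaussian per observed enzyme level, weighted by its frequency."""
    freq = {}
    for e in enzyme_counts:
        freq[e] = freq.get(e, 0) + 1
    total = len(enzyme_counts)
    return [
        (count / total,
         steady_state(k_in, k_cat, K_M, e),
         lna_variance(k_in, k_cat, K_M, e))
        for e, count in sorted(freq.items())
    ]


def mixture_pdf(components, s):
    """Evaluate the mixture density at metabolite level s."""
    return sum(
        w * math.exp(-(s - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        for w, mu, var in components
    )


# Hypothetical single-cell enzyme copy numbers and kinetic parameters.
enzyme_counts = [10, 10, 12, 15, 15, 15, 20]
components = metabolite_mixture(enzyme_counts, k_in=50.0, k_cat=10.0, K_M=100.0)
for weight, mean, var in components:
    print(f"weight={weight:.3f} mean={mean:.1f} variance={var:.1f}")
```

In this sketch, no stochastic simulation is run: the mixture weights come from the expression data and the component means from the deterministic model, which mirrors the time-scale separation argument only at the simplest possible level.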