Learn molecular representations from large-scale unlabeled molecules for drug discovery

Published by: Pengyong Li

Publication date: 2020

Research field: Informatics Engineering, Biology

Paper language: English

Authors: Pengyong Li, Jun Wang, Yixuan Qiao

Machine Learning, Biomolecules, Quantitative Methods

Abstract

Producing expressive molecular representations is a fundamental challenge in AI-driven drug discovery. Graph neural networks (GNNs) have emerged as a powerful technique for modeling molecular data. However, previous supervised approaches usually suffer from the scarcity of labeled data and generalize poorly. Here, we propose a novel Molecular Pre-training Graph-based deep learning framework, named MPG, that learns molecular representations from large-scale unlabeled molecules. In MPG, we propose a powerful MolGNet model and an effective self-supervised strategy for pre-training the model at both the node and graph levels. After pre-training on 11 million unlabeled molecules, we show that MolGNet can capture valuable chemical insights to produce interpretable representations. The pre-trained MolGNet can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of drug discovery tasks, including molecular property prediction, drug-drug interaction, and drug-target interaction, across 13 benchmark datasets. Our work demonstrates that MPG is a promising new approach for the drug discovery pipeline.

Austin Clyde, Ashka Shah, Max Zvyagin (2021)

Scaffold-based drug discovery (SBDD) is a technique for drug discovery which pins chemical scaffolds as the framework of design. Scaffolds, or molecular frameworks, organize the design of compounds into local neighborhoods. We formalize scaffold-based drug discovery into a network design. Utilizing docking data from SARS-CoV-2 virtual screening studies and JAK2 kinase assay data, we showcase how a scaffold-based conception of chemical space is intuitive for design. Lastly, we highlight the utility of scaffold-based networks for chemical space as a potential solution to the intractable enumeration problem of chemical space by working inductively on local neighborhoods.

Quantitative Methods, Biomolecules

Large-Scale Object Mining for Object Discovery from Unlabeled Video

Aljosa Osep, Paul Voigtlaender, Jonathon Luiten (2019)

This paper addresses the problem of object discovery from unlabeled driving videos captured in a realistic automotive setting. Identifying recurring object categories in such raw video streams is a very challenging problem. Not only do object candidates first have to be localized in the input images, but many interesting object categories occur relatively infrequently. Object discovery will therefore have to deal with the difficulties of operating in the long tail of the object distribution. We demonstrate the feasibility of performing fully automatic object discovery in such a setting by mining object tracks using a generic object tracker. In order to facilitate further research in object discovery, we release a collection of more than 360,000 automatically mined object tracks from 10+ hours of video data (560,000 frames). We use this dataset to evaluate the suitability of different feature representations and clustering strategies for object discovery.

Computer Vision and Pattern Recognition

Do Large Scale Molecular Language Representations Capture Important Structural Information?

Jerret Ross, Brian Belgodere, Vijil Chenthamarakshan (2021)

Predicting chemical properties from the structure of a molecule is of great importance in many applications, including drug discovery and material design. Machine-learning-based molecular property prediction holds the promise of enabling accurate predictions at much lower complexity than, for example, Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, pre-trained transformer-based language models (PTLMs) trained on large unlabeled corpora have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model employs a linear attention mechanism and highly parallelized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively with existing graph-based and fingerprint-based supervised learning baselines on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormer representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to accurately predict quantum chemical properties and beyond.

Machine Learning, Computation and Language, Biomolecules

Combinatorial analysis of interacting RNA molecules

Thomas J. X. Li, Christian M. Reidys (2010)

Recently, several minimum free energy (MFE) folding algorithms for predicting the joint structure of two interacting RNA molecules have been proposed. Their folding targets are interaction structures that can be represented as diagrams with two backbones drawn horizontally on top of each other such that (1) intramolecular and intermolecular bonds are noncrossing and (2) there is no zig-zag configuration. This paper studies joint structures with arc-length at least four in which both interior and exterior stack-lengths are at least two (no isolated arcs). The key idea in this paper is to consider a new type of shape, based on which joint structures can be derived via symbolic enumeration. Our results imply simple asymptotic formulas for the number of joint structures with surprisingly small exponential growth rates. They are of interest in the context of designing prediction algorithms for RNA-RNA interactions.

Combinatorics, Biomolecules, Quantitative Methods

Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video

Aljoša Ošep, Paul Voigtlaender, Jonathon Luiten, Stefan Breuers (2017)

We explore object discovery and detector adaptation based on unlabeled video sequences captured from a mobile platform. We propose a fully automatic approach for object mining from video which builds upon a generic object tracking approach. By applying this method to three large video datasets from autonomous driving and mobile robotics scenarios, we demonstrate its robustness and generality. Based on the object mining results, we propose a novel approach for unsupervised object discovery by appearance-based clustering. We show that this approach successfully discovers interesting objects relevant to driving scenarios. In addition, we perform self-supervised detector adaptation in order to improve detection performance on the KITTI dataset for existing categories. Our approach has direct relevance for enabling large-scale object learning for autonomous driving.

Computer Vision and Pattern Recognition