Structure-based Deep Fusion models were recently shown to outperform several physics- and machine learning-based protein-ligand binding affinity prediction methods. As part of a multi-institutional COVID-19 pandemic response, over 500 million small molecules were computationally screened against four protein structures from the novel coronavirus (SARS-CoV-2), which causes COVID-19. Three enhancements to Deep Fusion were made in order to evaluate more than 5 billion docked poses on SARS-CoV-2 protein targets. First, the Deep Fusion concept was refined by formulating the architecture as one, coherently backpropagated model (Coherent Fusion) to improve binding-affinity prediction accuracy. Secondly, the model was trained using a distributed, genetic hyper-parameter optimization. Finally, a scalable, high-throughput screening capability was developed to maximize the number of ligands evaluated and expedite the path to experimental evaluation. In this work, we present both the methods developed for machine learning-based high-throughput screening and results from using our computational pipeline to find SARS-CoV-2 inhibitors.
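The genetic hyper-parameter optimization mentioned above can be illustrated with a minimal, single-process sketch. It is not the authors' distributed implementation: the function name `evolve`, the selection and mutation scheme, and the toy fitness function used to exercise it are all assumptions made for illustration.

```python
import random


def evolve(fitness, bounds, generations=20, pop_size=12, seed=0):
    """Toy genetic search over real-valued hyper-parameters (higher fitness is better).

    `bounds` maps a hyper-parameter name to a (low, high) range.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    rng = random.Random(seed)

    def sample():
        # Draw one random individual inside the allowed ranges.
        return {k: rng.uniform(lo, hi) for k, (lo, hi) in bounds.items()}

    def crossover(a, b):
        # Uniform crossover: each gene comes from one of the two parents.
        return {k: (a[k] if rng.random() < 0.5 else b[k]) for k in bounds}

    def mutate(ind):
        # Perturb one gene with Gaussian noise, clipped to its range.
        child = dict(ind)
        key = rng.choice(sorted(bounds))
        lo, hi = bounds[key]
        child[key] = min(hi, max(lo, child[key] + rng.gauss(0.0, 0.1 * (hi - lo))))
        return child

    population = [sample() for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # elitism: the top half survives
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents)))
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness), history
```

In a real setting, `fitness` would train and validate a model for each candidate, which is the expensive step a distributed version would parallelize. Because the top half of each generation is carried over unchanged, the best fitness in `history` never decreases.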
The global coronavirus disease (COVID-19) pandemic, caused by the newly identified SARS-CoV-2 coronavirus, continues to claim the lives of thousands of people worldwide. The unavailability of specific medications to treat COVID-19 has led to drug repositioning efforts using various approaches, including computational analyses. Such analyses mostly rely on molecular docking and require the 3D structure of the target protein to be available. In this study, we trained a set of machine learning algorithms on a dataset of RNA-dependent RNA polymerase (RdRp) inhibitors and ran inference on antiviral and anti-inflammatory drugs based solely on ligand information. We also performed virtual screening of the drug candidates predicted by the machine learning models, docking them against the active site of SARS-CoV-2 RdRp, a key component of the virus replication machinery. Based on the ligand information of RdRp inhibitors, the machine learning models identified candidates such as remdesivir and baloxavir marboxil, molecules with documented activity against the RdRp of the novel coronavirus. Among the other identified drug candidates were beclabuvir, a non-nucleoside inhibitor of the hepatitis C virus (HCV) RdRp enzyme, and the HCV protease inhibitors paritaprevir and faldaprevir. Further analysis of these candidates using molecular docking against the SARS-CoV-2 RdRp revealed low binding energies at the enzyme active site. Our approach also identified the anti-inflammatory drugs lupeol, lifitegrast, antrafenine, betulinic acid, and ursolic acid as having potential activity against SARS-CoV-2 RdRp. We propose that the results of this study be considered for further validation as potential therapeutic options against COVID-19.
We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million in-stock molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows that surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU second). We showcase a docking workflow that uses surrogate ML models as a pre-filter. Our workflow screens a compound library ten times faster than the standard technique, with an error rate of less than 0.01% in detecting the underlying best-scoring 0.1% of compounds. Our analysis of the speedup shows that, to screen more molecules under a docking paradigm, another order of magnitude of speedup must come from model accuracy rather than computing speed (which, if increased, would no longer change our screening throughput). We believe this is strong evidence that the community should begin focusing on improving the accuracy of surrogate models in order to screen massive compound libraries 100x or even 1000x faster than current techniques.
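The pre-filter workflow and the speedup argument can be sketched as follows. This is a simplified illustration, not the benchmark's code: the function names, the convention that lower scores are better, and the cost model (total cost = surrogate cost for every molecule plus docking cost for the retained fraction) are assumptions.

```python
def prefilter_screen(library, surrogate_score, dock_score, keep_fraction):
    """Rank the whole library with the cheap surrogate, then dock only the
    best `keep_fraction` of it. Lower scores are assumed to be better."""
    ranked = sorted(library, key=surrogate_score)
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return {mol: dock_score(mol) for mol in ranked[:n_keep]}


def recall_of_top(library, docked, dock_score, top_fraction):
    """Fraction of the true best-docking `top_fraction` of the library that
    survived the pre-filter (evaluation only: it docks everything)."""
    n_top = max(1, int(len(library) * top_fraction))
    true_top = set(sorted(library, key=dock_score)[:n_top])
    return len(true_top & set(docked)) / n_top


def workflow_speedup(keep_fraction, surrogate_cost_ratio):
    """Speedup over docking every molecule.

    `surrogate_cost_ratio` is the cost of one surrogate prediction divided by
    the cost of one docking run. Per molecule, the workflow costs
    surrogate_cost_ratio + keep_fraction docking-equivalents.
    """
    return 1.0 / (keep_fraction + surrogate_cost_ratio)
```

Under this cost model, a surrogate that is a million times cheaper than docking and a retained fraction of 10% give `workflow_speedup(0.1, 1e-6)` just under 10. Making the surrogate faster still barely moves that number, because the docking term dominates. Only a smaller `keep_fraction` helps, and that is safe only if the surrogate ranks compounds accurately enough to keep recall high.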
The novel coronavirus SARS-CoV-2 has caused a global pandemic, with six-digit infection numbers and death tolls in the thousands worldwide every day. Enormous efforts are being undertaken to achieve high immunization coverage in order to reach herd immunity and stop the spread of SARS-CoV-2 infection. Several SARS-CoV-2 vaccines, based on mRNA, viral vectors, or inactivated SARS-CoV-2 virus, have been approved and are being administered worldwide. Recently, however, increased numbers of normally very rare types of thromboses associated with thrombocytopenia have been reported, particularly in the context of the adenoviral vector vaccine ChAdOx1 nCoV-19 from AstraZeneca. While the statistical prevalence of these side effects seems to correlate with this particular vaccine type, i.e. adenoviral vector-based vaccines, the exact molecular mechanisms are still unclear. The present review summarizes current data and hypotheses on molecular and cellular mechanisms into one integrated hypothesis: coagulopathies, including thromboses, thrombocytopenia, and other related side effects, are correlated with an interplay of the two components of the vaccine, i.e. the spike antigen and the adenoviral vector, with the innate and immune system, which under certain circumstances can mimic a limited form of COVID-19 pathology.
We propose a novel numerical method that efficiently and effectively determines the complementarity between portions of protein surfaces. This general procedure, based on representing the molecular iso-electron density surface in terms of 2D Zernike polynomials, allows rapid and quantitative assessment of the geometrical shape complementarity between interacting proteins, which was unfeasible with previous methods. We first tested the method on a large dataset of known protein complexes, obtaining an overall area under the ROC curve of 0.76 in the blind recognition of binding sites, and then applied it to investigate the features of the interaction between the spike protein of SARS-CoV-2 and human cellular receptors. Our results indicate that SARS-CoV-2 uses a dual strategy: in addition to the known interaction with angiotensin-converting enzyme 2, its spike protein could also interact with sialic acid receptors of cells in the upper airways.
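For readers unfamiliar with the basis functions, the sketch below evaluates the standard radial part of the 2D Zernike polynomials and compares two descriptor vectors. The radial formula is the textbook definition. Scoring two surface patches by the Euclidean distance between their vectors of expansion-coefficient magnitudes is an assumption made here for illustration, not a statement of the authors' exact procedure.

```python
from math import factorial, sqrt


def zernike_radial(n, m, r):
    """Radial Zernike polynomial R_n^m(r) for 0 <= r <= 1.

    Defined for n >= |m| with n - |m| even; returns 0.0 otherwise.
    """
    m = abs(m)
    if m > n or (n - m) % 2:
        return 0.0
    total = 0.0
    for k in range((n - m) // 2 + 1):
        numerator = (-1) ** k * factorial(n - k)
        denominator = (
            factorial(k)
            * factorial((n + m) // 2 - k)
            * factorial((n - m) // 2 - k)
        )
        total += numerator / denominator * r ** (n - 2 * k)
    return total


def descriptor_distance(a, b):
    """Euclidean distance between two equal-length descriptor vectors
    (hypothetical scoring: a smaller distance means more similar patches)."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same length")
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

As a quick check, `zernike_radial(2, 0, r)` reduces to 2r² - 1 and `zernike_radial(3, 1, r)` to 3r³ - 2r, matching the closed forms usually tabulated for these polynomials.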
The SARS-CoV-2 spike (S) protein facilitates viral infection, and has been the focus of many structure determination efforts. This paper studies the conformations of loops in the S protein based on the available Protein Data Bank (PDB) structures. Loops, as flexible regions of the protein, are known to be involved in binding and can adopt multiple conformations. We identify the loop regions of the S protein, and examine their structural variability across the PDB. While most loops had essentially one stable conformation, 17 of 44 loop regions were observed to be structurally variable with multiple substantively distinct conformations. Loop modeling methods were then applied to the S protein loop targets, and loops with multiple conformations were found to be more challenging for the methods to predict accurately. Sequence variants and the up/down structural states of the receptor binding domain were also considered in the analysis.
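One simple way to quantify whether a loop adopts multiple distinct conformations is to compare its coordinates across structures by root-mean-square deviation (RMSD) and count the groups that differ by more than a cutoff. The sketch below assumes the structures are already superimposed and share the same atom ordering. The greedy grouping and the `cutoff` parameter are illustrative assumptions, not the criteria used in the paper.

```python
from math import sqrt


def rmsd(coords_a, coords_b):
    """RMSD between two equal-length lists of (x, y, z) points.

    No superposition is performed: inputs are assumed pre-aligned.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal in length")
    squared = sum(
        (p - q) ** 2
        for point_a, point_b in zip(coords_a, coords_b)
        for p, q in zip(point_a, point_b)
    )
    return sqrt(squared / len(coords_a))


def count_conformations(structures, cutoff):
    """Greedy grouping: a structure starts a new group when it is farther than
    `cutoff` from every existing representative. Returns the group count."""
    representatives = []
    for coords in structures:
        if all(rmsd(coords, rep) > cutoff for rep in representatives):
            representatives.append(coords)
    return len(representatives)
```

A loop for which `count_conformations` returns 1 would correspond to the "essentially one stable conformation" case, while higher counts flag structurally variable loops. The result depends on the chosen cutoff and on the order of the input structures.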