SARS-Cov-2 RNA Sequence Classification Based on Territory Information


Published by Jingwei Liu

Publication date: 2021

Research field: Biology, Informatics Engineering

Paper language: English

Authors: Jingwei Liu


Genetic analysis of COVID-19 is critical for determining virus type and virus variant and for evaluating vaccines. In this paper, SARS-CoV-2 RNA sequence analysis relative to region or territory is investigated. A uniform framework of sequence SVM models, with various genetic lengths from short to long and with mixed bases, is developed by projecting each SARS-CoV-2 RNA sequence into spaces of different dimension and then scoring it according to the output probability of pre-trained SVM models, in order to explore the territory or origin information of SARS-CoV-2. Different sample-size ratios of training set to test set are also discussed in the data analysis. Two SARS-CoV-2 RNA classification tasks are constructed from the GISAID database: one for mainland, Hong Kong and Taiwan of China, and the other a 6-class classification task (Africa, Asia, Europe, North America, South & Central America, Oceania) over the 7 continents. For the 3-class classification of China, the Top-1 accuracy can reach 82.45% (train 60%, test 40%); for the 2-class classification of China, the Top-1 accuracy can reach 97.35% (train 80%, test 20%); for the 6-class classification task of the world, with a training-to-test ratio of 20% : 80%, the Top-1 accuracy reaches 30.30%. Some Top-N results are also given.
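The abstract reports Top-1 and Top-N accuracy computed from class probabilities. As a minimal sketch of how such a metric is commonly computed (not the author's code), the snippet below ranks classes by probability and checks whether the true label is among the top N. The function name `top_n_accuracy`, the label strings, and the probability values are all hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from typing import Dict, Sequence


def top_n_accuracy(
    probabilities: Sequence[Dict[str, float]],
    labels: Sequence[str],
    n: int = 1,
) -> float:
    """Fraction of samples whose true label is among the n most probable classes."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("at least one sample is required")
    hits = 0
    for probs, true_label in zip(probabilities, labels):
        # Rank class names from most to least probable.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical per-sequence class probabilities (illustrative values only,
# standing in for the output of some pre-trained probabilistic classifier).
example_probs = [
    {"Mainland": 0.70, "Hong Kong": 0.20, "Taiwan": 0.10},
    {"Mainland": 0.50, "Hong Kong": 0.30, "Taiwan": 0.20},
    {"Mainland": 0.10, "Hong Kong": 0.30, "Taiwan": 0.60},
    {"Mainland": 0.40, "Hong Kong": 0.35, "Taiwan": 0.25},
]
example_labels = ["Mainland", "Hong Kong", "Taiwan", "Taiwan"]

for n in (1, 2, 3):
    print(f"Top-{n} accuracy: {top_n_accuracy(example_probs, example_labels, n):.2f}")
```

With N equal to the number of classes the metric is 1.0 by construction, so Top-N figures are informative only when N is smaller than the class count.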


Sarwan Ali, Bikram Sahoo, Naimat Ullah (2021)

With the rapid spread of the novel coronavirus (COVID-19) across the globe and its continuous mutation, it is of pivotal importance to design a system to identify different known (and unknown) variants of SARS-CoV-2. Identifying particular variants helps to understand and model their spread patterns, design effective mitigation strategies, and prevent future outbreaks. It also plays a crucial role in studying the efficacy of known vaccines against each variant and modeling the likelihood of breakthrough infections. It is well known that the spike protein contains most of the information/variation pertaining to coronavirus variants. In this paper, we use spike sequences to classify different variants of the coronavirus in humans. We show that preserving the order of the amino acids helps the underlying classifiers to achieve better performance. We also show that we can train our model to outperform the baseline algorithms using only a small number of training samples (1% of the data). Finally, we show the importance of the different amino acids which play a key role in identifying variants and how they coincide with those reported by the USA's Centers for Disease Control and Prevention (CDC).

Quantitative Methods, Machine Learning

Predicting inhibitors for SARS-CoV-2 RNA-dependent RNA polymerase using machine learning and virtual screening

Romeo Cozac Elix (2020)

The global coronavirus disease (COVID-19) pandemic caused by the newly identified SARS-CoV-2 coronavirus continues to claim the lives of thousands of people worldwide. The unavailability of specific medications to treat COVID-19 has led to drug repositioning efforts using various approaches, including computational analyses. Such analyses mostly rely on molecular docking and require the 3D structure of the target protein to be available. In this study, we utilized a set of machine learning algorithms and trained them on a dataset of RNA-dependent RNA polymerase (RdRp) inhibitors to run inference analyses on antiviral and anti-inflammatory drugs solely based on the ligand information. We also performed virtual screening analysis of the drug candidates predicted by machine learning models and docked them against the active site of SARS-CoV-2 RdRp, a key component of the virus replication machinery. Based on the ligand information of RdRp inhibitors, the machine learning models were able to identify candidates such as remdesivir and baloxavir marboxil, molecules with documented activity against RdRp of the novel coronavirus. Among the other identified drug candidates were beclabuvir, a non-nucleoside inhibitor of the hepatitis C virus (HCV) RdRp enzyme, and HCV protease inhibitors paritaprevir and faldaprevir. Further analysis of these candidates using molecular docking against the SARS-CoV-2 RdRp revealed low binding energies against the enzyme active site. Our approach also identified anti-inflammatory drugs lupeol, lifitegrast, antrafenine, betulinic acid, and ursolic acid to have potential activity against SARS-CoV-2 RdRp. We propose that the results of this study be considered for further validation as potential therapeutic options against COVID-19.

Quantitative Methods, Biomolecules

Effective and scalable clustering of SARS-CoV-2 sequences

Sarwan Ali, Tamkanat-E-Ali, Muhammad Asad Khan (2021)

SARS-CoV-2, like any other virus, continues to mutate as it spreads, according to an evolutionary process. Unlike any other virus, the number of currently available sequences of SARS-CoV-2 in public databases such as GISAID is already several million. This amount of data has the potential to uncover the evolutionary dynamics of a virus like never before. However, a million is already several orders of magnitude beyond what can be processed by the traditional methods designed to reconstruct a virus's evolutionary history, such as those that build a phylogenetic tree. Hence, new and scalable methods will need to be devised in order to make use of the ever increasing number of viral sequences being collected. Since identifying variants is an important part of understanding the evolution of a virus, in this paper, we propose an approach based on clustering sequences to identify the current major SARS-CoV-2 variants. Using a k-mer based feature vector generation and efficient feature selection methods, our approach is effective in identifying variants, as well as being efficient and scalable to millions of sequences. Such a clustering method allows us to show the relative proportion of each variant over time, giving the rate of spread of each variant in different locations, something which is important for vaccine development and distribution. We also compute the importance of each amino acid position of the spike protein in identifying a given variant in terms of information gain. Positions of high variant-specific importance tend to agree with those reported by the USA's Centers for Disease Control and Prevention (CDC), further demonstrating our approach.

Populations and Evolution, Machine Learning
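The clustering abstract above mentions k-mer based feature vector generation. A minimal sketch of one common way to build such a vector (counting every length-k substring over a fixed alphabet) is shown below. It is an illustration under assumptions, not the authors' implementation: the function name `kmer_vector`, the default `k=3`, and the nucleotide alphabet `"ACGT"` are hypothetical choices, and a 20-letter amino acid alphabet could be passed instead for spike protein sequences.

```python
from collections import Counter
from itertools import product
from typing import List


def kmer_vector(sequence: str, k: int = 3, alphabet: str = "ACGT") -> List[int]:
    """Count occurrences of every possible k-mer over `alphabet` in `sequence`.

    The returned list has len(alphabet) ** k entries, ordered like
    itertools.product(alphabet, repeat=k). Windows containing characters
    outside the alphabet (for example an ambiguous base 'N') are not counted.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Fixed vocabulary so that every sequence maps to a vector of equal length.
    vocabulary = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k over the sequence and tally each substring.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[kmer] for kmer in vocabulary]


# Toy sequence, for illustration only.
vector = kmer_vector("ACGTAC", k=2)
print(len(vector), sum(vector))
```

Because the vector length grows as len(alphabet) ** k, larger k or a protein alphabet quickly produces high-dimensional, sparse vectors, which is one reason a feature selection step is often paired with this representation.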

Pathogenesis, Symptomatology, and Transmission of SARS-CoV-2 through analysis of Viral Genomics and Structure

Halie M. Rando, Adam L. MacLean, Alexandra J. Lee (2021)

The novel coronavirus SARS-CoV-2, which emerged in late 2019, has since spread around the world infecting tens of millions of people with coronavirus disease 2019 (COVID-19). While this viral species was unknown prior to January 2020, its similarity to other coronaviruses that infect humans has allowed for rapid insight into the mechanisms that it uses to infect human hosts, as well as the ways in which the human immune system can respond. Here, we contextualize SARS-CoV-2 among other coronaviruses and identify what is known and what can be inferred about its behavior once inside a human host. Because the genomic content of coronaviruses, which specifies the virus's structure, is highly conserved, early genomic analysis provided a significant head start in predicting viral pathogenesis. The pathogenesis of the virus offers insights into symptomatology, transmission, and individual susceptibility. Additionally, prior research into interactions between the human immune system and coronaviruses has identified how these viruses can evade the immune system's protective mechanisms. We also explore systems-level research into the regulatory and proteomic effects of SARS-CoV-2 infection and the immune response. Understanding the structure and behavior of the virus serves to contextualize the many facets of the COVID-19 pandemic and can influence efforts to control the virus and treat the disease.

Quantitative Methods

Screening and evaluation of potential clinically significant HIV drug combinations against SARS-CoV-2 virus

Draško Tomic (2020)

In this study, we investigated the inhibition of SARS-CoV-2 spike glycoprotein with HIV drugs and their combinations. This glycoprotein is essential for the reproduction of the SARS-CoV-2 virus, so its inhibition opens new avenues for the treatment of patients with COVID-19 disease. In doing so, we used the VINI in silico model of cancer, whose high accuracy in finding effective drugs and their combinations was confirmed in vitro by comparison with existing results from NCI-60 bases, and in vivo by comparison with existing clinical trial results. In the first step, the VINI model calculated the inhibition efficiency of SARS-CoV-2 spike glycoprotein with 44 FDA-approved antiviral drugs. Of these drugs, HIV drugs have been shown to be effective, while others mainly have shown weak or no efficiency. Subsequently, the VINI model calculated the inhibition efficiency of all possible double and triple HIV drug combinations, and among them identified ten with the highest inhibition efficiency. These ten combinations were analyzed by Medscape drug-drug interaction software and LexiComp Drug Interactions. All combinations except the combination of cobicistat_abacavir_rilpivirine appear to have serious interactions (risk rating category D) when dosage adjustments/reductions are required for possible toxicity. Finally, the VINI model compared the inhibition efficiency of the cobicistat_abacavir_rilpivirine combination with cocktails and individual drugs already used or planned to be tested against SARS-CoV-2. The cobicistat_abacavir_rilpivirine combination demonstrated the highest inhibition of SARS-CoV-2 spike glycoprotein over the others. Thus, this combination seems to be a promising candidate for further in vitro testing and clinical trials.

Quantitative Methods