Functional annotation of creeping bentgrass protein sequences based on convolutional neural network

Added by Jun He Prof.

Publication date 2021

fields Biology

Research language: English

Authors Han-Yu Jiang - Jun He

Genomics

Background: Creeping bentgrass (Agrostis stolonifera) is a perennial cool-season turfgrass of the family Gramineae, but it has shallow adventitious roots and poor disease resistance. Little is known about the induced systemic resistance (ISR) mechanism of turfgrass or the signal transduction involved in inducing disease resistance; in particular, the functions of a large number of disease-resistance-related proteins urgently need to be explored. Results: In this work, the protein sequences of creeping bentgrass were determined and annotated with a functional prediction model based on a convolutional neural network. Creeping bentgrass seedlings were grown with BDO treatment, and the ISR response was induced by infection with Rhizoctonia solani. We performed transcriptome analysis by Illumina sequencing and obtained high-quality unigenes. A minority of the assembled unigenes were functionally annotated by database alignment, while a large part of the obtained amino acid sequences remained unannotated. To handle the unannotated sequences, a prediction model was trained on a data set of GO families in three domains and achieved good performance, in particular a higher false positive control rate. With this model, we analyzed the unannotated protein sequences of the creeping bentgrass transcriptome and annotated the proteins related to disease-resistance response and signal transduction. Conclusions: The results provide good candidate proteins with specific functions. They prevent the transcriptome sequencing data of creeping bentgrass from going to waste, and they will save research time and labor in future analyses of the ISR characteristics of creeping bentgrass. The work also provides a reference for sequence analysis in turfgrass disease-resistance research.
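As an illustration of how an amino acid sequence can be presented to a convolutional neural network, the following is a minimal sketch that assumes one-hot encoding over the 20 standard residues. The abstract does not state which encoding the authors used, so the function name `one_hot_encode` and the handling of unknown symbols are assumptions made here for illustration only.

```python
# Hypothetical helper, not taken from the paper: one-hot encoding of a
# protein sequence over the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence):
    """Return one 20-element row per residue.

    Symbols outside the 20 standard residues (for example 'X') are
    encoded as all-zero rows; this is an assumption of the sketch.
    """
    rows = []
    for residue in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(residue)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    return rows


# Toy sequence: three standard residues followed by an unknown one.
encoded = one_hot_encode("MKTX")
```

Each row of `encoded` could then serve as one position of the input to a 1D convolution, with the 20 columns acting as input channels.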

Probabilistic annotation of protein sequences based on functional classifications

Emmanuel D. Levy, Christos A. Ouzounis, Walter R. Gilks 2007

BACKGROUND: One of the most evident achievements of bioinformatics is the development of methods that transfer biological knowledge from characterised proteins to uncharacterised sequences. This mode of protein function assignment is mostly based on the detection of sequence similarity and the premise that functional properties are conserved during evolution. Most automatic approaches developed to date rely on the identification of clusters of homologous proteins and the mapping of new proteins onto these clusters, which are expected to share functional characteristics. RESULTS: Here, we invert the logic of this process by mapping sequences directly to a functional classification instead of mapping functions to a sequence clustering. In this mode, the starting point is a database of proteins labelled according to a functional classification scheme, and the subsequent use of sequence similarity makes it possible to define the membership of new proteins in these functional classes. In this framework, we define the Correspondence Indicators as measures of the relationship between sequence and function, and we further formulate two Bayesian approaches to estimate the probability that a sequence of unknown function belongs to a functional class. This approach allows the parametrisation of different sequence search strategies and provides a direct measure of annotation error rates. We validate this approach with a database of enzymes labelled by their corresponding four-digit EC numbers and analyse specific cases. CONCLUSION: The performance of this method is significantly higher than that of the simple strategy of transferring the annotation from the highest-scoring BLAST match, and the method is expected to find applications in automated functional annotation pipelines.
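The general idea of estimating the probability that a sequence belongs to a functional class can be sketched with plain Bayes' rule. This is not the paper's method: the Correspondence Indicators and the two Bayesian formulations are not reproduced here, and the function name and all numbers below are invented for illustration.

```python
def posterior_class_probability(prior, p_hit_given_member, p_hit_given_nonmember):
    """Bayes' rule for P(sequence is in the class | a similarity hit was seen).

    prior                  -- P(member) before looking at the hit
    p_hit_given_member     -- P(hit | member)
    p_hit_given_nonmember  -- P(hit | not member)
    """
    evidence = (p_hit_given_member * prior
                + p_hit_given_nonmember * (1.0 - prior))
    if evidence == 0:
        return 0.0
    return p_hit_given_member * prior / evidence


# Made-up numbers: a rare class (1% prior), a hit that members produce
# 90% of the time and non-members only 0.1% of the time.
p = posterior_class_probability(
    prior=0.01, p_hit_given_member=0.9, p_hit_given_nonmember=0.001
)
```

With these illustrative inputs the posterior is about 0.90, and one minus the posterior gives a direct estimate of the annotation error rate for that assignment.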

Quantitative Methods

Functional Protein Structure Annotation Using a Deep Convolutional Generative Adversarial Network

Ethan Moyer, Jeff Winchell, Isamu Isozaki 2021

Identifying novel functional protein structures is at the heart of molecular engineering and molecular biology, requiring an often computationally exhaustive search. We introduce the use of a Deep Convolutional Generative Adversarial Network (DCGAN) to classify protein structures based on their functionality by encoding each sample in a grid object structure using three features in each object: the generic atom type, the position atom type, and its occupancy relative to a given atom. We train the DCGAN on 3-dimensional (3D) decoy and native protein structures in order to generate and discriminate 3D protein structures. At the end of our training, the loss converges to a local minimum and our DCGAN can annotate functional proteins robustly against adversarial protein samples. In the future, we hope to extend the novel structures produced by the generator in our DCGAN with more samples in order to explore more granular functionality across varying functions. We hope that our effort will advance the field of protein structure prediction.

Biomolecules Machine Learning

DeepSF: deep convolutional neural network for mapping protein sequences to folds

Jie Hou, Badri Adhikari, Jianlin Cheng 2017

Motivation: Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods have been developed to classify protein sequences into a small number of folds due to methodological limitations, and these are not generally useful in practice. Results: We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of the sequence-structure relationship. Unlike traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 77.0%. We compare our method with a top profile-profile alignment method, HHSearch, on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 14.5%-29.1% higher than HHSearch on template-free modeling targets and 4.5%-16.7% higher on hard template-based modeling targets for the top 1, 5, and 10 predicted folds. The hidden features extracted from the sequence by our method are robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking.
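One common way for a 1D convolutional network to accept sequences of any length is to pool over the position axis after the convolution, so that the number of output features depends only on the number of filters. The sketch below shows this with global max pooling in plain Python. It is an illustration under that assumption, not the DeepSF architecture, and the function name, the toy two-channel inputs and the filters are all hypothetical.

```python
def conv1d_global_max(encoded, filters):
    """Slide each filter along the sequence and keep its maximum response.

    encoded -- list of rows, one per position, each with C channel values
    filters -- list of kernels; each kernel is a list of k rows of C weights
    Returns one feature per filter, regardless of sequence length.
    """
    features = []
    for kernel in filters:
        k = len(kernel)
        if len(encoded) < k:
            raise ValueError("sequence is shorter than the kernel")
        best = float("-inf")
        for start in range(len(encoded) - k + 1):
            score = sum(
                encoded[start + i][c] * kernel[i][c]
                for i in range(k)
                for c in range(len(kernel[i]))
            )
            best = max(best, score)
        features.append(best)
    return features


# Toy inputs with two channels instead of 20 amino acids.
short_seq = [[1, 0], [0, 1], [1, 0]]
long_seq = [[1, 0], [0, 1], [1, 0], [1, 0], [0, 1]]
toy_filters = [
    [[1, 0], [0, 1]],  # responds to channel 0 followed by channel 1
    [[0, 1], [0, 1]],  # responds to channel 1 twice in a row
]

short_features = conv1d_global_max(short_seq, toy_filters)
long_features = conv1d_global_max(long_seq, toy_filters)
```

Both inputs yield a feature vector with one entry per filter, which is what allows a fixed-size classification layer to follow a variable-length input.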

Machine Learning Biomolecules

OnionNet-2: A Convolutional Neural Network Model for Predicting Protein-Ligand Binding Affinity based on Residue-Atom Contacting Shells

Zechen Wang, Liangzhen Zheng, Yang Liu 2021

One key task in virtual screening is to accurately predict the binding affinity ($\Delta G$) of protein-ligand complexes. Recently, deep learning (DL) has significantly increased the prediction accuracy of scoring functions owing to its extraordinary ability to extract useful features from raw data. Nevertheless, more effort is still needed in many respects to increase prediction accuracy and decrease computational cost. In this study, we propose a simple scoring function (called OnionNet-2) based on a convolutional neural network to predict $\Delta G$. The protein-ligand interactions are characterized by the number of contacts between protein residues and ligand atoms in multiple distance shells. Compared to published models, OnionNet-2 is shown to perform best on two widely used benchmarks, CASF-2016 and CASF-2013. The OnionNet-2 model was further verified on non-experimental decoy structures from a docking program and on the CSAR NRC-HiQ data set (a high-quality data set provided by CSAR), where it also performed well. Thus, our study provides a simple but efficient scoring function for predicting protein-ligand binding free energy.
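The notion of counting contacts in multiple distance shells can be sketched as follows. This is a simplified illustration, not the OnionNet-2 featurization: it ignores residue and atom types, and the function name, the shell boundaries and the coordinates are all assumptions chosen for the example.

```python
import math


def shell_contact_counts(residue_atoms, ligand_atoms, shell_edges):
    """Count protein-atom/ligand-atom pairs per distance shell.

    shell_edges -- ascending boundaries; shell i covers the half-open
                   interval [shell_edges[i], shell_edges[i + 1]).
    Pairs farther apart than the last boundary are ignored.
    """
    counts = [0] * (len(shell_edges) - 1)
    for r in residue_atoms:
        for a in ligand_atoms:
            d = math.dist(r, a)  # Euclidean distance (Python 3.8+)
            for i in range(len(counts)):
                if shell_edges[i] <= d < shell_edges[i + 1]:
                    counts[i] += 1
                    break
    return counts


# Toy coordinates (arbitrary units), purely illustrative.
residue = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
ligand = [(1.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
counts = shell_contact_counts(residue, ligand, [0.0, 2.0, 4.0, 6.0])
```

Here the four atom pairs lie at distances of 1, 5, 2 and roughly 5.83, so the three shells receive 1, 1 and 2 contacts respectively. A typed version would keep a separate count for each residue-type and atom-type combination in each shell.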

Quantitative Methods

DNA-GCN: Graph convolutional networks for predicting DNA-protein binding

Yuhang Guo, Xiao Luo, Liang Chen 2021

Predicting DNA-protein binding is an important and classic problem in bioinformatics. Convolutional neural networks have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. However, no previous study has used graph convolutional networks for motif inference. In this work, we propose to use graph convolutional networks for motif inference. We build a sequence k-mer graph for the whole dataset based on k-mer co-occurrence and k-mer sequence relationships, and then learn a DNA Graph Convolutional Network (DNA-GCN) for the whole dataset. Our DNA-GCN is initialized with a one-hot representation for all nodes, and it then jointly learns the embeddings for both k-mers and sequences, supervised by the known labels of the sequences. We evaluate our model on 50 datasets from ENCODE. DNA-GCN shows competitive performance compared with the baseline model. In addition, we analyze our model and design several different architectures to fit different datasets.
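A k-mer co-occurrence graph of the kind mentioned above can be sketched by counting how often two k-mers appear close together within the same sequence. The code below uses raw counts inside a sliding window. The abstract does not give the edge-weighting scheme used by DNA-GCN, so the window definition, the raw-count weights and the function names are assumptions.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping substrings of length k, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def cooccurrence_edges(sequences, k, window):
    """Count k-mer pairs that fall within `window` consecutive k-mer positions.

    Edges are undirected: each pair is stored as a sorted tuple.
    """
    edges = Counter()
    for seq in sequences:
        tokens = kmers(seq, k)
        for i, a in enumerate(tokens):
            for b in tokens[i + 1:i + window]:
                if a != b:
                    edges[tuple(sorted((a, b)))] += 1
    return edges


# Two toy DNA sequences; with window=2 only adjacent k-mers are linked.
edges = cooccurrence_edges(["ACGT", "ACGA"], k=2, window=2)
```

In this toy case the pair ("AC", "CG") occurs in both sequences and so gets weight 2, while ("CG", "GT") and ("CG", "GA") each get weight 1. The resulting weighted edge list could then be turned into an adjacency matrix for a graph convolutional layer.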

Genomics Machine Learning