As infection with the 2019-nCoV coronavirus is quickly developing into a global pneumonia epidemic, careful analysis of its transmission and cellular mechanisms is sorely needed. In this report, we re-analyzed the computational approaches and findings presented in two recent manuscripts by Ji et al. (https://doi.org/10.1002/jmv.25682) and by Pradhan et al. (https://doi.org/10.1101/2020.01.30.927871), which concluded that snakes are the intermediate hosts of 2019-nCoV and that the 2019-nCoV spike protein insertions share a unique similarity to HIV-1. Results from our re-implementation of the analyses, built on larger-scale datasets using state-of-the-art bioinformatics methods and databases, do not support the conclusions proposed by these manuscripts. Based on our analyses and existing data on coronaviruses, we conclude that the intermediate hosts of 2019-nCoV are more likely to be mammals and birds than snakes, and that the novel insertions observed in the spike protein evolved naturally from bat coronaviruses.
In gene expression, various kinds of proteins need to bind to specific loci on DNA. It is still not clear how these proteins find their target loci. In this study, the mean first-passage time (FPT) of a protein binding to its target locus on the DNA chain is analyzed with a chain-space coupled model. Our results show that the 1-dimensional diffusion constant has a critical value at which the mean time a protein takes to find its target locus is almost independent of both the binding rate of the protein to the DNA chain and the detachment rate from the DNA chain. This implies that the frequency of protein binding to DNA and the sliding time on the DNA chain have little influence on the search efficiency, and therefore that whether 1-dimensional sliding on the DNA chain increases the search efficiency depends on the 1-dimensional diffusion constant of the protein on the DNA chain. This study also finds that only protein binding to DNA loci close to the target locus helps to increase the search efficiency, while binding to loci far from the target locus might delay the target binding process. As expected, the mean FPT increases with the distance between the initial position of the protein in cell space and its target locus on the DNA chain. We also calculate the direct binding probability, which can be regarded as an index of whether 1-dimensional sliding along the DNA chain helps to increase the search efficiency. Our results show that the influence of 1-dimensional sliding along the DNA chain on the search process depends on both diffusion constants of the protein: the one in cell space and the one on the 1-dimensional DNA chain.
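To make the notion of a mean first-passage time concrete, the following minimal sketch computes it for a toy problem that is much simpler than the chain-space coupled model described above: an unbiased random walk on a 1-dimensional lattice with the target at site 0 and a reflecting wall at the far end. The lattice, the function name `mean_fpt_1d`, and the boundary conditions are all illustrative assumptions and are not taken from the study.

```python
def mean_fpt_1d(n_sites):
    """Mean number of steps to first reach site 0 from each site 0..n_sites.

    Toy model (an assumption, not the paper's model): an unbiased
    nearest-neighbour random walk with an absorbing target at site 0
    and a reflecting wall at site n_sites.

    The mean FPT T(n) satisfies
        T(0) = 0
        T(n) = 1 + (T(n-1) + T(n+1)) / 2   for 0 < n < n_sites
        T(N) = 1 + T(N-1)                  at the reflecting wall N = n_sites
    Writing d(n) = T(n) - T(n-1) gives d(N) = 1 and d(n) = 2 + d(n+1).
    """
    # Backward recursion for the increments d(n).
    d = [0] * (n_sites + 1)
    d[n_sites] = 1
    for n in range(n_sites - 1, 0, -1):
        d[n] = 2 + d[n + 1]

    # Accumulate the increments to obtain T(n).
    t = [0] * (n_sites + 1)
    for n in range(1, n_sites + 1):
        t[n] = t[n - 1] + d[n]
    return t


if __name__ == "__main__":
    print(mean_fpt_1d(5))  # [0, 9, 16, 21, 24, 25]
```

Even in this toy setting the mean FPT grows monotonically with the starting distance from the target, following T(n) = n(2N - n), which mirrors the qualitative dependence on initial distance reported above.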
Proteins are key building blocks of virtually all life, providing the material foundation of spider silk, cells, and hair, but also offering other functions from enzymes to drugs, and pathogens like viruses. Based on a nanomechanical analysis of the structure and motions of atoms and molecules at multiple scales, we report sonifi
Proteins employ the information stored in the genetic code and translated into their sequences to carry out well-defined functions in the cellular environment. The possibility of encoding such functions is controlled by the balance between the amount of information supplied by the sequence and the amount left after the protein has folded into its structure. We developed a computational algorithm to evaluate the amount of information necessary to specify the protein structure, taking into account the thermodynamic properties of protein folding. We thus show that the information remaining in the protein sequence after encoding its structure (the information gap) is very close to what is needed to encode its function and interactions. Then, by predicting the information gap directly from the protein sequence, we show that it may be possible to use these insights from information theory to discriminate between ordered and disordered proteins, to identify unknown functions, and to optimize designed protein sequences.
Inferring the structural properties of a protein from its amino acid sequence is a challenging yet important problem in biology. Structures are not known for the vast majority of protein sequences, but structure is critical for understanding function. Existing approaches for detecting structural similarity between proteins from sequence are unable to recognize and exploit structural patterns when sequences have diverged too far, limiting our ability to transfer knowledge between structurally related proteins. We take a new approach to this problem, through the lens of representation learning. We introduce a framework that maps any protein sequence to a sequence of vector embeddings (one per amino acid position) that encode structural information. We train bidirectional long short-term memory (LSTM) models on protein sequences with a two-part feedback mechanism that incorporates information from (i) global structural similarity between proteins and (ii) pairwise residue contact maps for individual proteins. To enable learning from structural similarity information, we define a novel similarity measure between arbitrary-length sequences of vector embeddings based on a soft symmetric alignment (SSA) between them. Our method is able to learn useful position-specific embeddings despite lacking direct observations of position-level correspondence between sequences. We show empirically that our multi-task framework outperforms other sequence-based methods, and even a top-performing structure-based alignment method, at our target task of predicting structural similarity. Finally, we demonstrate that our learned embeddings can be transferred to other protein sequence problems, improving the state of the art in transmembrane domain prediction.
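The soft symmetric alignment idea can be illustrated with a short sketch. The version below is one plausible realization, written under stated assumptions: it uses the L1 distance between embedding vectors, softmax weights over negated distances in both directions, and a combined weight a = alpha + beta - alpha * beta. These specific choices, and the function name `ssa_similarity`, are assumptions made for illustration, since the abstract does not spell out the formula.

```python
import math


def l1_distance(u, v):
    """L1 (Manhattan) distance between two equal-length vectors."""
    return sum(abs(a - b) for a, b in zip(u, v))


def softmax(xs):
    """Numerically stable softmax of a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def ssa_similarity(z1, z2):
    """Soft symmetric alignment score between two non-empty embedding sequences.

    z1 and z2 are lists of vectors (one per sequence position) and may have
    different lengths. The score is the negated, alignment-weighted mean
    distance, so it is at most 0 and larger means more similar.
    """
    n, m = len(z1), len(z2)
    dist = [[l1_distance(u, v) for v in z2] for u in z1]

    # alpha[i][j]: how strongly position i of z1 aligns to position j of z2.
    alpha = [softmax([-d for d in row]) for row in dist]
    # beta[j][i]: how strongly position j of z2 aligns to position i of z1.
    beta = [softmax([-dist[i][j] for i in range(n)]) for j in range(m)]

    total_weight = 0.0
    weighted_dist = 0.0
    for i in range(n):
        for j in range(m):
            # Symmetric combination: high if either direction aligns i with j.
            a = alpha[i][j] + beta[j][i] - alpha[i][j] * beta[j][i]
            total_weight += a
            weighted_dist += a * dist[i][j]
    return -weighted_dist / total_weight
```

Because the weights are computed from the embeddings themselves, no position-level correspondence between the two sequences has to be supplied, and the score is unchanged when the two arguments are swapped.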