Public repositories for genome and proteome annotations, such as the Gene Ontology (GO), rarely store negative annotations, i.e. proteins not possessing a given function. This leaves the set of negative examples undefined or ill-defined, although it is crucial for training the majority of machine learning methods that infer protein functions. Automated techniques to choose reliable negative proteins are therefore required to train accurate function prediction models. This study proposes the first extensive analysis of the temporal evolution of protein annotations in the GO repository. Novel annotations registered through the years have been analyzed to verify the presence of annotation patterns in the GO hierarchy. Our research supplied fundamental clues about proteins likely to be unreliable as negative examples, which we incorporated into a novel algorithm of our own construction, validated on two organisms in a genome-wide fashion against approaches proposed to choose negative examples in the context of functional prediction.
Stratifying cancer patients based on their gene expression levels can improve diagnosis, survival analysis and treatment planning. However, such data is extremely high-dimensional, as it contains expression values for over 20000 genes per patient, while the number of samples in the datasets is low. To deal with such settings, we propose to incorporate prior biological knowledge about genes from ontologies into the machine learning system for the task of classifying patients from their gene expression data. We use ontology embeddings that capture the semantic similarities between the genes to direct a Graph Convolutional Network and thereby sparsify the network connections. We show that this approach provides an advantage for predicting clinical targets from high-dimensional low-sample data.
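One way to picture how embedding similarity could sparsify a graph convolution is sketched below. It is a minimal illustration and not the authors' implementation. The k-nearest-neighbour rule, the cosine similarity, and the helper names `cosine`, `knn_adjacency`, `normalize` and `propagate` are all assumptions introduced here. Each gene is a node, edges connect only genes with similar ontology embeddings, and one propagation step averages a patient's expression values over those sparse neighbourhoods.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sqrt(sum(a * a for a in u))
    nv = sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_adjacency(embeddings, k):
    """Hypothetical sparsification rule: link each gene to its k most
    similar genes (by embedding cosine), symmetrize, add self-loops."""
    n = len(embeddings)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        sims = sorted(
            ((cosine(embeddings[i], embeddings[j]), j) for j in range(n) if j != i),
            reverse=True,
        )
        for _, j in sims[:k]:
            adj[i][j] = adj[j][i] = 1.0
    for i in range(n):
        adj[i][i] = 1.0
    return adj

def normalize(adj):
    """Symmetric normalization D^{-1/2} A D^{-1/2}, as commonly used in GCNs."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    return [[adj[i][j] / sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def propagate(norm_adj, expression):
    """One graph-convolution propagation step (no learned weights) over a
    single patient's expression vector, one value per gene."""
    return [sum(w * x for w, x in zip(row, expression)) for row in norm_adj]

# Toy data (made up): four genes with 2-dimensional embeddings.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adj = knn_adjacency(embeddings, k=1)
smoothed = propagate(normalize(adj), [2.0, 4.0, 10.0, 20.0])
```

In this toy case genes 0 and 1 form one neighbourhood and genes 2 and 3 another, so the propagation step mixes expression values only within each pair. A full model would add learned weight matrices and nonlinearities on top of this sparse structure.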
Limit analysis is a computationally efficient tool to assess the resistance and the failure mode of structures, but it does not provide any information on the displacement capacity, which is one of the factors that most affect seismic safety. For this reason, many researchers have not considered limit analysis a viable tool for the seismic assessment of structures, which has prevented its widespread adoption. In this paper this common belief is questioned, and the authors show that limit analysis can be useful in the evaluation of the seismic performance of frame structures. In particular, to overcome the inability of a limit analysis approach to evaluate the displacements of a structure, an approximate capacity curve is reconstructed. The curve is based on a limit analysis strategy that takes second-order effects into account and evaluates the displacement capacity by considering a post-peak softening branch and a threshold on the allowed plastic rotations. Then, based on this simplified capacity curve, an equivalent single-degree-of-freedom system is defined in order to assess the seismic performance of frame structures. The proposed simplified strategy is implemented in dedicated software, and the results obtained are validated against well-established approaches based on nonlinear static analyses, showing the reliability and the computational efficiency of this methodology.
Alignment-free sequence analysis approaches provide important alternatives to multiple sequence alignment (MSA) in biological sequence analysis because they have low computational complexity and do not depend on a high level of sequence identity. However, most existing alignment-free methods do not use the full information content of sequences and thus cannot accurately reveal similarities and differences among DNA sequences. We present a novel alignment-free computational method for sequence analysis based on the Ramanujan-Fourier transform (RFT), in which the complete information of DNA sequences is retained. We represent DNA sequences as four binary indicator sequences and apply the RFT to the indicator sequences to convert them into the frequency domain. The Euclidean distance between the complete RFT coefficients of DNA sequences is used as the similarity measure. To handle RFT coefficient vectors of different lengths in Euclidean space, we pad the shorter binary indicator sequences with zeros so that all binary sequences have the length of the longest sequence in the data being compared. Thus, the DNA sequences are compared in a frequency space of the same dimension without information loss. We demonstrate the usefulness of the proposed method by presenting experimental results on hierarchical clustering of genes and genomes. The proposed method opens a new channel to biological sequence analysis, classification, and structural module identification.
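The pipeline described above (indicator sequences, zero padding, RFT, Euclidean distance) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code. It assumes the finite-length RFT coefficient X(q) = (1 / (phi(q) N)) * sum_{n=1..N} x(n) c_q(n) for q = 1..N, where c_q(n) is the Ramanujan sum. It also assumes that the coefficients of the four indicator sequences are concatenated before taking the distance. All function names are hypothetical.

```python
from math import gcd, sqrt

def mobius(n):
    """Moebius function mu(n), computed by trial division."""
    result = 1
    p = 2
    while p * p <= n:
        if n % p == 0:
            n //= p
            if n % p == 0:      # squared prime factor
                return 0
            result = -result
        p += 1
    if n > 1:                   # one remaining prime factor
        result = -result
    return result

def totient(n):
    """Euler's phi(n) by direct counting (fine for short toy sequences)."""
    return sum(1 for k in range(1, n + 1) if gcd(k, n) == 1)

def ramanujan_sum(q, n):
    """c_q(n) = sum over divisors d of gcd(q, n) of mu(q / d) * d."""
    g = gcd(q, n)
    return sum(mobius(q // d) * d for d in range(1, g + 1) if g % d == 0)

def indicators(seq, length):
    """Four binary indicator sequences (A, C, G, T), zero-padded to `length`."""
    seq = seq.upper()
    pad = [0] * (length - len(seq))
    return {b: [1 if c == b else 0 for c in seq] + pad for b in "ACGT"}

def rft(x):
    """Assumed finite-length RFT coefficients X(1..N) of sequence x(1..N)."""
    n_len = len(x)
    return [
        sum(x[n - 1] * ramanujan_sum(q, n) for n in range(1, n_len + 1))
        / (totient(q) * n_len)
        for q in range(1, n_len + 1)
    ]

def rft_distance(a, b):
    """Euclidean distance between the concatenated RFT coefficients of the
    four indicator sequences, after padding both to the longer length."""
    length = max(len(a), len(b))
    ia, ib = indicators(a, length), indicators(b, length)
    total = 0.0
    for base in "ACGT":
        ca, cb = rft(ia[base]), rft(ib[base])
        total += sum((u - v) ** 2 for u, v in zip(ca, cb))
    return sqrt(total)
```

A pairwise matrix of `rft_distance` values could then be passed to any standard hierarchical clustering routine. The direct summation here is quadratic in sequence length, so it suits only short toy inputs.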
BACKGROUND: The uncoupling protein (UCP) genes belong to the superfamily of electron transport carriers of the mitochondrial inner membrane. Members of the uncoupling protein family are involved in thermogenesis, and determining the functional evolution of UCP genes is important for understanding the evolution of thermoregulation in vertebrates. RESULTS: Sequence similarity searches of genome and scaffold data identified homologues of UCP in eutherians and teleosts, as well as the first squamate uncoupling proteins. Phylogenetic analysis was used to characterize the evolutionary history of the family by identifying two duplications early in vertebrate evolution and two losses in the avian lineage (excluding duplications within a species, losses due to incompletely sequenced taxa, and losses and duplications inferred through mismatch of species and gene trees). Estimates of synonymous and nonsynonymous substitution rates (dN/dS) and more complex branch and site models suggest that the duplication events were not associated with positive Darwinian selection and that UCP is constrained by strong purifying selection, except for a single site which has undergone positive Darwinian selection, indicating that the UCP gene family is highly conserved. CONCLUSION: We present a phylogeny describing the evolutionary history of the UCP gene family and show that the genes have evolved through duplications followed by purifying selection, except for a single site in the mitochondrial matrix between the 5th and 6th alpha-helices which has undergone positive selection.
Annotations in Visual Analytics (VA) have become a common means of supporting analysis by integrating additional information into the VA system. That additional information often depends on the current process step in the visual analysis. For example, the data preprocessing step involves data structuring operations, while the data exploration step focuses on user interaction and input. Describing suitable annotations to meet the goals of the different steps is challenging. To tackle this issue, we identify individual annotations for each step and outline their gathering and design properties for the visual analysis of heterogeneous clinical data. We integrate our annotation design into a visual analysis tool to show its applicability to data from the ophthalmic domain. In interviews and application sessions with experts, we assess its usefulness for the analysis of patients with different medications.