Modern biological techniques such as Hi-C make it possible to measure the probabilities that different chromosomal regions are close in space. These probabilities can be visualised as matrices called contact maps. In this paper, we introduce a multifractal analysis of chromosomal contact maps. Our analysis reveals that Hi-C maps are bifractal, i.e. complex geometrical objects characterized by two distinct fractal dimensions. To rationalize this observation, we introduce a model that describes chromosomes as a hierarchical set of nested domains, and we solve it exactly. The predicted multifractal spectrum is in excellent quantitative agreement with experimental data. Moreover, we show that our theory yields a more robust estimate of the scaling exponent of the contact probability than existing methods. By applying this method to experimental data, we detect subtle conformational changes among chromosomes during differentiation of human stem cells.
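For orientation, the scaling exponent of the contact probability mentioned above is conventionally estimated by averaging a contact map along its diagonals and fitting a power law $P(s) \sim s^{-\alpha}$. The sketch below shows that conventional baseline only, not the multifractal estimator the abstract proposes. All function names, the plain list-of-lists map format, and the synthetic test map are assumptions made for illustration.

```python
import math


def contact_probability_by_distance(contact_map):
    """Average contact frequency P(s) over all pairs (i, j) with j - i = s."""
    n = len(contact_map)
    p_of_s = {}
    for s in range(1, n):
        values = [contact_map[i][i + s] for i in range(n - s)]
        p_of_s[s] = sum(values) / len(values)
    return p_of_s


def fit_scaling_exponent(p_of_s):
    """Least-squares slope of log P(s) against log s.

    Returns alpha in P(s) ~ s**(-alpha). Distances with P(s) <= 0 are
    skipped because their logarithm is undefined.
    """
    points = [(math.log(s), math.log(p)) for s, p in p_of_s.items() if p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx


def synthetic_map(n, alpha):
    """Hypothetical noise-free map with contacts decaying as |i - j|**(-alpha)."""
    return [
        [1.0 if i == j else abs(i - j) ** (-alpha) for j in range(n)]
        for i in range(n)
    ]


# Usage: recover the exponent that was used to build a synthetic map.
p_of_s = contact_probability_by_distance(synthetic_map(64, 1.0))
alpha_estimate = fit_scaling_exponent(p_of_s)
```

On real Hi-C data a direct log-log fit of this kind is sensitive to noise and to the chosen fitting range, which is presumably the kind of fragility a more robust estimator is meant to address.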
Several experiments show that the three-dimensional (3D) organization of chromosomes affects genetic processes such as transcription and gene regulation. To better understand this connection, researchers developed the Hi-C method, which detects the pairwise physical contacts of all chromosomal loci. Hi-C data show that chromosomes are composed of 3D compartments that span a variety of scales. However, it is challenging to detect these cross-scale structures systematically. Most studies have therefore designed methods for specific scales, chiefly to study topologically associated domains (TADs) and A/B compartments. To go beyond this limitation, we tailor a network community detection method that finds communities in compact fractal globule polymer systems. Our method allows us to scan continuously through all scales with a single resolution parameter. We found the following: (i) Polymer segments belonging to the same 3D community do not have to be in consecutive order along the polymer chain. In other words, several TADs may belong to the same 3D community. (ii) CTCF, a loop-stabilizing protein that is ascribed a major role in TAD formation, is well correlated with community borders only at one level of organization. (iii) TADs and A/B compartments are traditionally treated as two weakly related 3D structures and detected with different algorithms. With our method, we detect both by simply adjusting the resolution parameter. We therefore argue that they represent two specific levels of a continuous spectrum of 3D communities, rather than different structural entities.
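To illustrate how a single resolution parameter can select between coarse and fine communities, the sketch below evaluates a generalized modularity with resolution $\gamma$ on a toy graph. It is an assumption-laden illustration: it uses the standard degree-based null term $k_i k_j / 2m$, not the polymer-tailored null model described in the abstract, and the function name and toy network are hypothetical.

```python
def generalized_modularity(adjacency, labels, gamma):
    """Modularity with a resolution parameter gamma.

    Q = (1 / 2m) * sum_ij [A_ij - gamma * k_i * k_j / (2m)] * delta(c_i, c_j)

    adjacency: symmetric matrix (list of lists) of edge weights.
    labels:    community label for each node.
    gamma:     resolution; larger values favour smaller communities.
    """
    n = len(adjacency)
    degrees = [sum(row) for row in adjacency]
    two_m = sum(degrees)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if labels[i] == labels[j]:
                total += adjacency[i][j] - gamma * degrees[i] * degrees[j] / two_m
    return total / two_m


def build_adjacency(n, edges):
    """Symmetric 0/1 adjacency matrix from an undirected edge list."""
    adjacency = [[0] * n for _ in range(n)]
    for a, b in edges:
        adjacency[a][b] = 1
        adjacency[b][a] = 1
    return adjacency


# Toy network (hypothetical): two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adjacency = build_adjacency(6, edges)

one_community = [0, 0, 0, 0, 0, 0]
two_communities = [0, 0, 0, 1, 1, 1]

# Compare the two candidate partitions at a high and a low resolution.
for gamma in (1.0, 0.1):
    q_one = generalized_modularity(adjacency, one_community, gamma)
    q_two = generalized_modularity(adjacency, two_communities, gamma)
    print(gamma, q_one, q_two)
```

In this toy case the split into two triangles scores higher at $\gamma = 1$, while the merged partition scores higher at $\gamma = 0.1$, so sweeping $\gamma$ moves the preferred partition across scales.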
We combined genetic crossover, one of the operations of genetic algorithms, with the replica-exchange method in parallel molecular dynamics simulations. The combined method can search the global conformational space by exchanging corresponding parts between a pair of conformations of a protein. In this study, we applied this method to an $\alpha$-helical protein, the Trp-cage mini protein, which has 20 amino-acid residues. The conformations obtained from the simulations are in good agreement with the experimental results.
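The two ingredients named above can be sketched in a few lines. The exchange probability is the standard Metropolis criterion for swapping configurations between two temperatures. The crossover function is a simplified assumption: it represents each conformation as a flat list of dihedral angles and swaps one contiguous segment, without the relaxation or acceptance steps a real simulation would need. Function names and units are hypothetical choices, not taken from the abstract.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol K)


def exchange_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis probability of swapping the configurations of two replicas.

    P = min(1, exp[(beta_i - beta_j) * (E_i - E_j)]), with beta = 1 / (k_B T).
    Energies are in kcal/mol and temperatures in kelvin.
    """
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    delta = (beta_i - beta_j) * (energy_i - energy_j)
    return 1.0 if delta >= 0 else math.exp(delta)


def crossover(parent_a, parent_b, start, length):
    """Swap the segment [start, start + length) between two conformations.

    Each conformation is assumed to be a list of dihedral angles of equal
    length, so 'corresponding parts' are the same index range in both.
    """
    stop = start + length
    child_a = parent_a[:start] + parent_b[start:stop] + parent_a[stop:]
    child_b = parent_b[:start] + parent_a[start:stop] + parent_b[stop:]
    return child_a, child_b


# Usage with made-up numbers: a cold replica holding a high-energy structure
# always accepts a swap with a hot replica holding a lower-energy one.
p_swap = exchange_probability(-10.0, 300.0, -20.0, 400.0)
child_a, child_b = crossover([1, 2, 3, 4, 5], [10, 20, 30, 40, 50], 1, 2)
```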
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues that form contacts between interacting proteins. Here we address two underlying questions: How are the performances of the two inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and inter-block couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets are available, and that an iterative pairing algorithm (IPA) makes predictions possible even without a training set. Noise in the training data degrades performance. In both tasks we find a quadratic scaling relating dataset quality and size, consistent with noise adding in square-root fashion and signal adding linearly as the dataset grows. This implies that it is generally beneficial to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
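The scaling argument stated above can be written out as a short worked relation. This is one possible reading, under notation that is not in the abstract: $M$ denotes the dataset size, $q$ is a hypothetical quality parameter (the average signal contributed per sequence), and $\sigma$ is a hypothetical per-sequence noise amplitude.

```latex
% Assumed notation: signal adds linearly, noise adds in square-root fashion.
\begin{align}
  S(M) &\propto q\,M, &
  N(M) &\propto \sigma\sqrt{M}, \\
  \mathrm{SNR}(M) &= \frac{S(M)}{N(M)} \propto \frac{q}{\sigma}\,\sqrt{M}.
\end{align}
% Holding the signal-to-noise ratio fixed gives a quadratic relation
% between dataset quality and dataset size:
\begin{equation}
  q^{2} M = \text{const}
  \quad\Longrightarrow\quad
  M \propto \frac{1}{q^{2}} .
\end{equation}
```

Under this reading, halving the quality can be offset by a fourfold larger dataset, which is in line with the conclusion that adding imperfect data is generally worthwhile.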
In eukaryotic genes the protein-coding sequence is split into several fragments, the exons, separated by non-coding DNA stretches, the introns. Prokaryotes do not have introns in their genomes. We report calculations of the stability domains of actin genes for various organisms in the animal, plant and fungi kingdoms. Actin genes were chosen because they have been highly conserved during evolution. In these genes all introns were removed so as to mimic ancient genes at the time of early eukaryotic development, i.e. before intron insertion. Common stability boundaries are found in evolutionarily distant organisms, which implies that these boundaries date from the early origin of eukaryotes. In general, boundaries correspond to intron positions in the actins of vertebrates and other animals, but much less so in those of plants and fungi. The sharpest boundary is found at a locus where fungi, algae and animals have introns in positions separated by only one nucleotide, which identifies a hot spot for insertion. These results suggest that some introns may have been incorporated into the genomes through a thermodynamically driven mechanism, in agreement with previous observations on human genes. They also suggest a different mechanism for intron insertion in plants and animals.
Many non-coding RNAs are known to play a role in the cell that is directly linked to their structure. Structure prediction based on sequence alone is, however, a challenging task. On the other hand, thanks to the low cost of sequencing technologies, a very large number of homologous sequences are becoming available for many RNA families. In the protein community, the idea has emerged over the last decade of exploiting the covariance of mutations within a family to predict protein structure using the direct coupling analysis (DCA) method. The application of DCA to RNA systems has been limited so far. Here we assess the DCA method on 17 riboswitch families, comparing it with the commonly used mutual information analysis and with the state-of-the-art R-scape covariance method. We also compare different flavors of DCA, including mean-field, pseudo-likelihood, and a proposed stochastic procedure (Boltzmann learning) for solving the DCA inverse problem exactly. Boltzmann learning outperforms the other methods in predicting contacts observed in high-resolution crystal structures.
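As a point of reference for the baseline mentioned above, mutual information between two alignment columns can be computed directly from observed frequencies. The sketch below is a minimal illustration, not the pipeline used in the study: it assumes an alignment given as equal-length strings, applies no pseudocounts, sequence reweighting, or gap handling, and uses hypothetical function names and a toy alignment.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(alignment, col_i, col_j):
    """Mutual information (in nats) between two columns of an alignment.

    MI = sum_ab f_ij(a, b) * log[ f_ij(a, b) / (f_i(a) * f_j(b)) ],
    with frequencies estimated as raw counts divided by the number of
    sequences.
    """
    n = len(alignment)
    counts_i = Counter(seq[col_i] for seq in alignment)
    counts_j = Counter(seq[col_j] for seq in alignment)
    counts_ij = Counter((seq[col_i], seq[col_j]) for seq in alignment)
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        # p_ab / (p_a * p_b) rewritten in terms of raw counts.
        mi += p_ab * math.log(count * n / (counts_i[a] * counts_j[b]))
    return mi


def rank_column_pairs(alignment):
    """All column pairs sorted by decreasing mutual information."""
    length = len(alignment[0])
    scored = [
        (mutual_information(alignment, i, j), i, j)
        for i, j in combinations(range(length), 2)
    ]
    return sorted(scored, reverse=True)


# Toy alignment (hypothetical): columns 0 and 2 covary as base pairs,
# while column 1 is fully conserved.
toy_alignment = ["GAC", "CAG", "AAU", "UAA"]
ranking = rank_column_pairs(toy_alignment)
```

In the toy alignment the covarying pair of columns receives the highest score and the conserved column carries no information. Unlike DCA, this score does not separate direct from indirect correlations.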