We analyze publicly available data from Affymetrix microarray spike-in experiments on the human HGU133 chip set, in which sequences are added in solution at known concentrations. The spike-in set contains sequences of bacterial, human and artificial origin. Our analysis is based on a recently introduced molecular-based model [E. Carlon and T. Heim, Physica A 362, 433 (2006)] which takes into account both probe-target hybridization and target-target partial hybridization in solution. The hybridization free energies are obtained from the nearest-neighbor model with experimentally determined parameters. The molecular-based model suggests a rescaling that should result in a collapse of the data at different concentrations onto a single universal curve. We indeed find such a collapse, with the same parameters as obtained before for the older HGU95 chip set. The quality of the collapse varies according to the probe set considered. Artificial sequences, chosen by Affymetrix to be as different as possible from any human genome sequence, generally show a much better collapse and thus a better agreement with the model than all other sequences. This suggests that the observed deviations from the predicted collapse are related to the choice of probes or have a biological origin, rather than being a problem with the proposed model.
In the past couple of years several studies have shown that hybridization in Affymetrix DNA microarrays can be rather well understood on the basis of simple models of physical chemistry. In the majority of cases a Langmuir isotherm was used to fit experimental data. Although there is a general consensus about this approach, some discrepancies between different studies are evident. For instance, some authors have fitted the hybridization affinities from the microarray fluorescence intensities, while others used affinities obtained from melting experiments in solution. The former approach yields fitted affinities that at first sight are only partially consistent with solution values. In this paper we show that this discrepancy is only superficial: a sufficiently complete model provides effective affinities which are fully consistent with those fitted to experimental data. This link provides new insight into the relevant processes underlying the functioning of DNA microarrays.
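For reference, the Langmuir isotherm mentioned above can be sketched as follows. This is a minimal illustration, not the model fitted in the paper: the function name, the default temperature of 318.15 K, the saturation intensity `i_max`, and the choice of units (molar concentration, free energy in kcal/mol) are all assumptions made for the example.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol K)

def langmuir_intensity(conc, delta_g, temp_k=318.15, i_max=1.0):
    """Langmuir isotherm for a single probe (illustrative sketch).

    conc    : target concentration in mol/L (assumed unit)
    delta_g : hybridization free energy in kcal/mol, negative when
              binding is favorable (assumed sign convention)
    temp_k  : temperature in kelvin (assumed default)
    i_max   : saturation intensity (hypothetical parameter)
    """
    # Equilibrium constant from the free energy: K = exp(-dG / RT)
    k_eq = math.exp(-delta_g / (R_KCAL * temp_k))
    x = conc * k_eq
    # Fraction of occupied probes, scaled by the saturation intensity
    return i_max * x / (1.0 + x)
```

In this form the intensity grows linearly with concentration when `conc * k_eq` is much smaller than 1 and saturates at `i_max` when it is much larger, which is the behavior usually fitted to spike-in data.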
DNA microarrays are devices that are able, in principle, to detect and quantify the presence of specific nucleic acid sequences in complex biological mixtures. The measurement consists of detecting fluorescence signals from several spots on the microarray surface onto which different probe sequences are grafted. One of the problems of the data analysis is that the signal contains a noisy background component due to non-specific binding. This paper presents a physical model for background estimation in Affymetrix GeneChips. It combines two different approaches. The first is based on the sequence composition, specifically its sequence-dependent hybridization affinity. The second is based on the strong correlation of intensities from locations which are the physical neighbors of a specific spot on the chip. Both effects are incorporated in a background functional which contains 24 free parameters, fixed by minimization on a training data set. In all data analyzed, the sequence-specific parameters obtained by minimization are found to correlate strongly with empirically determined stacking free energies for RNA/DNA hybridization in solution. Moreover, there is an overall agreement with experimental background data, and we show that the physics-based model proposed in this paper performs on average better than purely statistical approaches for background calculations. The model thus provides an interesting alternative for background subtraction schemes in Affymetrix GeneChips.
Identifying protein-protein interactions is crucial for a systems-level understanding of the cell. Recently, algorithms based on inverse statistical physics, e.g. Direct Coupling Analysis (DCA), have made it possible to use evolutionarily related sequences to address two conceptually related inference tasks: finding pairs of interacting proteins, and identifying pairs of residues which form contacts between interacting proteins. Here we address two underlying questions: How are the performances of both inference tasks related? How does performance depend on dataset size and quality? To this end, we formalize both tasks using Ising models defined over stochastic block models, with individual blocks representing single proteins and inter-block couplings representing protein-protein interactions; controlled synthetic sequence data are generated by Monte Carlo simulations. We show that DCA is able to address both inference tasks accurately when sufficiently large training sets are available, and that an iterative pairing algorithm (IPA) makes predictions possible even without a training set. Noise in the training data deteriorates performance. In both tasks we find a quadratic scaling relating dataset quality and size that is consistent with noise adding in square-root fashion and signal adding linearly when increasing the dataset. This implies that it is generally good to incorporate more data even if its quality is imperfect, thereby shedding light on the empirically observed performance of DCA applied to natural protein sequences.
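One way to read the quadratic scaling is the back-of-envelope sketch below. It is an assumption made for illustration, not the authors' derivation: the signal is taken to grow as `quality * n_samples`, the noise as `sqrt(n_samples)`, and `quality` is a hypothetical number between 0 and 1 standing for the fraction of reliable training data.

```python
import math

def signal_to_noise(n_samples, quality):
    """Toy signal-to-noise ratio (illustrative assumption).

    Signal adds linearly with the reliable part of the data,
    noise adds like the square root of the dataset size.
    """
    signal = quality * n_samples
    noise = math.sqrt(n_samples)
    return signal / noise  # equals quality * sqrt(n_samples)

def samples_needed(target_snr, quality):
    """Dataset size giving a fixed signal-to-noise ratio in the toy model."""
    return (target_snr / quality) ** 2
```

Under these assumptions, halving the quality requires four times as much data to reach the same signal-to-noise ratio, so a larger dataset of imperfect quality can still outperform a smaller clean one.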
The complementary strands of DNA molecules can be separated when stretched apart by a force; the unzipping signal is correlated to the base content of the sequence but is affected by thermal and instrumental noise. We consider here the ideal case where opening events are known with very good time resolution (very large bandwidth), and study how the sequence can be reconstructed from the unzipping data. Our approach relies on the use of statistical Bayesian inference and of the Viterbi decoding algorithm. Performance is studied numerically on Monte Carlo generated data, and analytically. We show how multiple unzippings of the same molecule may be exploited to improve the quality of the prediction, and calculate analytically the number of required unzippings as a function of the bandwidth, the sequence content, and the elasticity parameters of the unzipped strands.
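The Viterbi decoding step can be illustrated with a generic hidden Markov model sketch. This is the textbook algorithm in log space, not the unzipping model of the paper: the two hidden states ("W" for a weak pair, "S" for a strong pair), the "fast"/"slow" observations, and all probabilities below are hypothetical values chosen only for the example.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path (generic HMM, log space)."""
    # best[t][s]: log-probability of the best path ending in state s at step t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = []  # back[t][s]: best predecessor of state s at step t + 1
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: weak pairs tend to open fast, strong pairs slowly.
STATES = ["W", "S"]
LOG_START = {"W": math.log(0.5), "S": math.log(0.5)}
LOG_TRANS = {p: {s: math.log(0.5) for s in STATES} for p in STATES}
LOG_EMIT = {
    "W": {"fast": math.log(0.9), "slow": math.log(0.1)},
    "S": {"fast": math.log(0.2), "slow": math.log(0.8)},
}

decoded = viterbi(["fast", "slow", "slow", "fast"],
                  STATES, LOG_START, LOG_TRANS, LOG_EMIT)
```

Working in log space avoids numerical underflow on long sequences, which matters when the decoded molecule has many base pairs.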
Various approaches have explored the covariation of residues in multiple-sequence alignments of homologous proteins to extract functional and structural information. Among those are principal component analysis (PCA), which identifies the most correlated groups of residues, and direct coupling analysis (DCA), a global inference method based on the maximum entropy principle, which aims at predicting residue-residue contacts. In this paper, inspired by the statistical physics of disordered systems, we introduce the Hopfield-Potts model to naturally interpolate between these two approaches. The Hopfield-Potts model allows us to identify relevant patterns of residues from the knowledge of the eigenmodes and eigenvalues of the residue-residue correlation matrix. We show how the computation of such statistical patterns makes it possible to accurately predict residue-residue contacts with a much smaller number of parameters than DCA. This dimensional reduction allows us to avoid overfitting and to extract contact information from multiple-sequence alignments of reduced size. In addition, we show that low-eigenvalue correlation modes, discarded by PCA, are important to recover structural information: the corresponding patterns are highly localized, that is, they are concentrated in few sites, which we find to be in close contact in the three-dimensional protein fold.
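A common way to quantify how localized a pattern is over the sites of an alignment is the inverse participation ratio. The sketch below is an illustrative assumption about how such a measure could be computed; the abstract does not state which localization measure is used, and the function name is hypothetical.

```python
def inverse_participation_ratio(pattern):
    """Inverse participation ratio of a vector of per-site weights.

    Returns 1.0 for a pattern concentrated on a single site and
    1/N for a pattern spread uniformly over N sites.
    """
    norm_sq = sum(v * v for v in pattern)
    if norm_sq == 0:
        raise ValueError("pattern must have at least one non-zero entry")
    # Normalizing by the squared norm makes the result scale-invariant
    return sum(v ** 4 for v in pattern) / norm_sq ** 2
```

For example, a pattern with weight on only one of four sites gives 1.0, while equal weights on all four sites give 0.25, so larger values indicate patterns concentrated on fewer sites.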