Motivation: Time course data obtained from biological samples subject to specific treatments can be very useful for revealing complex and novel biological phenomena. Although an increasing number of time course microarray datasets are becoming available, most of them contain few biological replicates and time points. So far, few computational methods can effectively reveal differentially expressed genes and their patterns in such data. Results: We have proposed a new two-step nonparametric statistical procedure, LRSA, to reveal differentially expressed genes and their expression trends in temporal microarray data. We have also employed external controls as a surrogate to estimate false discovery rates and thus to guide the discovery of differentially expressed genes. Our results showed that LRSA reveals substantially more differentially expressed genes and has much lower false discovery rates than two other methods, STEM and ANOVA, in both real and simulated data. Our computational results are confirmed using real-time PCR.
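The idea of using external controls as a surrogate null sample can be illustrated with a minimal sketch. This is not the LRSA procedure itself: the function name `estimate_fdr`, the score values, and the assumption that a larger score means stronger evidence of differential expression are all hypothetical choices made for illustration. The sketch assumes that the fraction of external controls passing a threshold approximates the fraction of non-differential genes that would pass it.

```python
def estimate_fdr(gene_scores, control_scores, threshold):
    """Estimate the false discovery rate at a score threshold.

    Assumption (not from the abstract): external controls are truly
    non-differential, so their pass rate approximates the null pass rate.
    """
    if not control_scores:
        raise ValueError("need at least one external control score")
    # Number of genes called differentially expressed at this threshold.
    called = sum(1 for s in gene_scores if s >= threshold)
    if called == 0:
        return 0.0
    # Fraction of controls that pass the threshold.
    null_rate = sum(1 for s in control_scores if s >= threshold) / len(control_scores)
    # Expected number of false calls if every gene behaved like a control.
    expected_false = null_rate * len(gene_scores)
    return min(1.0, expected_false / called)


# Hypothetical scores, for illustration only.
gene_scores = [0.2, 0.5, 1.1, 2.4, 3.0, 3.7, 0.9, 4.2, 0.1, 2.9]
control_scores = [0.1, 0.3, 0.4, 0.8, 2.5]
fdr = estimate_fdr(gene_scores, control_scores, threshold=2.0)
```

Treating every gene as a potential null makes this estimate conservative; a threshold would typically be chosen as the least stringent one whose estimated rate stays below a target level.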
CD8 T cells are specialized immune cells that play an important role in the regulation of antiviral immune response and the generation of protective immunity. In this paper we investigate the differentiation of memory CD8 T cells in the immune response using a short time course microarray experiment. Structurally, this experiment is similar to many in that it involves measurements taken on independent samples, in one biological group, at a small number of irregularly spaced time points, and exhibiting patterns of temporal nonstationarity. To analyze this CD8 T-cell experiment, we develop a hierarchical state space model so that we can: (1) detect temporally differentially expressed genes, (2) identify the direction of successive changes over time, and (3) assess the magnitude of successive changes over time. We incorporate hidden Markov models into our model to utilize the information embedded in the time series and set up the proposed hierarchical state space model in an empirical Bayes framework to utilize the population information from the large-scale data. Analysis of the CD8 T-cell experiment using the proposed model results in biologically meaningful findings. Temporal patterns involved in the differentiation of memory CD8 T cells are summarized separately and performance of the proposed model is illustrated in a simulation study.
Gene expression analysis aims at identifying the genes able to accurately predict biological parameters such as disease subtype or progression. While accurate prediction can be achieved by means of many different techniques, gene identification, due to gene correlation and the limited number of available samples, is a much more elusive problem. Small changes in the expression values often produce different gene lists, and solutions which are both sparse and stable are difficult to obtain. We propose a two-stage regularization method able to learn linear models characterized by a high prediction performance. By varying a suitable parameter, these linear models make it possible to trade sparsity for the inclusion of correlated genes and to produce gene lists which are almost perfectly nested. Experimental results on synthetic and microarray data confirm the interesting properties of the proposed method and its potential as a starting point for further biological investigations.
We develop a novel peak detection algorithm for the analysis of comprehensive two-dimensional gas chromatography time-of-flight mass spectrometry (GC$\times$GC-TOF MS) data using normal-exponential-Bernoulli (NEB) and mixture probability models. The algorithm first performs baseline correction and denoising simultaneously using the NEB model, which also defines peak regions. Peaks are then picked using a mixture of probability distributions to deal with the co-eluting peaks. Peak merging is further carried out based on the mass spectral similarities among the peaks within the same peak group. The algorithm is evaluated using experimental data to study the effect of different cutoffs of the conditional Bayes factors and the effect of different mixture models including Poisson, truncated Gaussian, Gaussian, Gamma and exponentially modified Gaussian (EMG) distributions, and the optimal version is introduced using a trial-and-error approach. We then compare the new algorithm with two existing algorithms in terms of compound identification. Data analysis shows that the developed algorithm can detect the peaks with lower false discovery rates than the existing algorithms, and a less complicated peak picking model is a promising alternative to the more complicated and widely used EMG mixture models.
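The peak merging step relies on mass spectral similarity, but the abstract does not say which similarity measure or cutoff is used. As a hedged sketch, the example below assumes cosine similarity between spectra stored as m/z-to-intensity dictionaries and a hypothetical cutoff of 0.95; the names `spectral_cosine` and `should_merge` are illustrative and not taken from the paper.

```python
import math


def spectral_cosine(spec_a, spec_b):
    """Cosine similarity between two spectra given as {m/z: intensity} dicts."""
    shared = set(spec_a) & set(spec_b)
    dot = sum(spec_a[mz] * spec_b[mz] for mz in shared)
    norm_a = math.sqrt(sum(v * v for v in spec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in spec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def should_merge(spec_a, spec_b, min_similarity=0.95):
    """Decide whether two peaks look like the same compound.

    The 0.95 cutoff is an assumed value for illustration only.
    """
    return spectral_cosine(spec_a, spec_b) >= min_similarity


# Hypothetical spectra, for illustration only.
peak_1 = {43: 100.0, 57: 80.0, 71: 40.0}
peak_2 = {43: 50.0, 57: 40.0, 71: 20.0}  # same pattern at half the intensity
peak_3 = {91: 100.0, 65: 30.0}           # shares no m/z value with peak_1
```

Because cosine similarity ignores overall intensity, `peak_1` and `peak_2` are treated as the same compound even though their abundances differ, while `peak_3` is kept separate.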
To analyse a very large data set containing lengthy variables, we adopt a sequential estimation idea and propose a parallel divide-and-conquer method. We conduct several conventional sequential estimation procedures separately, and properly integrate their results while maintaining the desired statistical properties. Additionally, using a criterion from the statistical experiment design, we adopt an adaptive sample selection, together with an adaptive shrinkage estimation method, to simultaneously accelerate the estimation procedure and identify the effective variables. We confirm the cogency of our methods through theoretical justifications and numerical results derived from synthesized data sets. We then apply the proposed method to three real data sets, including those pertaining to appliance energy use and particulate matter concentration.
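The divide-and-conquer idea of estimating on separate blocks and then integrating the results can be sketched for the simplest possible case. The example below is not the proposed sequential procedure and omits the adaptive sample selection and shrinkage steps; it assumes a one-variable least-squares model without intercept, and the helper names `block_stats`, `combine`, and `split` are hypothetical. Each block contributes only its sufficient statistics, so blocks could be processed in parallel.

```python
def block_stats(xs, ys):
    """Sufficient statistics of one block for the model y = beta * x."""
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    return sxx, sxy


def combine(blocks):
    """Integrate block results into one slope estimate.

    Equivalent to averaging the per-block slopes sxy / sxx with
    weights proportional to sxx.
    """
    total_sxx = sum(sxx for sxx, _ in blocks)
    total_sxy = sum(sxy for _, sxy in blocks)
    return total_sxy / total_sxx


def split(seq, n_blocks):
    """Cut a sequence into at most n_blocks contiguous pieces."""
    size = -(-len(seq) // n_blocks)  # ceiling division
    return [seq[i:i + size] for i in range(0, len(seq), size)]


# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
blocks = [block_stats(bx, by) for bx, by in zip(split(xs, 3), split(ys, 3))]
beta_parallel = combine(blocks)
beta_full = combine([block_stats(xs, ys)])
```

In this simple model the combined estimate matches the full-data estimate up to floating-point rounding, which is the kind of property the integration step is meant to preserve; more general estimators usually need a more careful weighting scheme.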
This paper describes the VARTOOLS program, which is an open-source command-line utility, written in C, for analyzing astronomical time-series data, especially light curves. The program provides a general-purpose set of tools for processing light curves including signal identification, filtering, light curve manipulation, time