Gene expression analysis aims at identifying the genes able to accurately predict biological parameters such as disease subtype or progression. While accurate prediction can be achieved by means of many different techniques, gene identification is a much more elusive problem because of gene correlation and the limited number of available samples. Small changes in the expression values often produce different gene lists, and solutions that are both sparse and stable are difficult to obtain. We propose a two-stage regularization method able to learn linear models characterized by high prediction performance. By varying a suitable parameter, these linear models make it possible to trade sparsity for the inclusion of correlated genes and to produce gene lists that are almost perfectly nested. Experimental results on synthetic and microarray data confirm the interesting properties of the proposed method and its potential as a starting point for further biological investigations.
In system identification, estimating the parameters of a model from limited observations results in poor identifiability. To cope with this issue, we propose a new method that simultaneously selects and estimates the sensitive parameters as key model parameters and fixes the remaining parameters to a set of typical values. Our method is formulated as a nonlinear least squares estimator with L1-regularization on the deviation of the parameters from a set of typical values. First, we provide consistency and oracle properties of the proposed estimator as a theoretical foundation. Second, we provide a novel approach based on Levenberg-Marquardt optimization to numerically find the solution to the formulated problem. Third, to show its effectiveness, we present an application identifying a biomechanical parametric model of a head position tracking task for 10 human subjects from limited data. In a simulation study, the variances of the estimated parameters are decreased by 96.1% compared with those of the parameters estimated without L1-regularization. In an experimental study, our method improves model interpretation by reducing the number of parameters to be estimated while maintaining the variance accounted for (VAF) above 82.5%. Moreover, the variances of the estimated parameters are reduced by 71.1% compared with those of the parameters estimated without L1-regularization. Our method is 54 times faster than standard simplex-based optimization for solving the regularized nonlinear regression.
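The idea of penalizing the deviation of parameters from typical values can be illustrated with a minimal sketch. It is not the estimator of the abstract: it assumes a linear model instead of a nonlinear one and swaps the Levenberg-Marquardt based solver for plain proximal gradient descent (ISTA). The names `fit_l1_deviation`, `theta_typical`, and `lam`, as well as the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np


def soft_threshold(v, t):
    """Proximal operator of t * ||v||_1: shrink each entry toward zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fit_l1_deviation(X, y, theta_typical, lam, n_iter=2000):
    """Minimize 0.5 * ||y - X @ theta||^2 + lam * ||theta - theta_typical||_1.

    Assumption: linear model solved by ISTA, used here only as a stand-in
    for the nonlinear, Levenberg-Marquardt based method of the abstract.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    theta_typical = np.asarray(theta_typical, dtype=float)
    # Step size 1/L, where L is the Lipschitz constant of the smooth gradient.
    step = 1.0 / np.linalg.norm(X, 2) ** 2
    theta = theta_typical.copy()
    for _ in range(n_iter):
        grad = X.T @ (X @ theta - y)
        z = theta - step * grad
        # The penalty acts on the deviation, so shrink (z - typical) and shift back.
        theta = theta_typical + soft_threshold(z - theta_typical, step * lam)
    return theta


# Synthetic illustration: only the third parameter truly differs from its typical value.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 4))
theta_true = np.array([1.0, 2.0, 0.5, -1.0])
theta_typ = np.array([1.0, 2.0, 0.0, -1.0])
y_demo = X_demo @ theta_true + 0.01 * rng.normal(size=30)
theta_hat = fit_l1_deviation(X_demo, y_demo, theta_typ, lam=1.0)
print("deviation from typical values:", theta_hat - theta_typ)
```

Parameters whose deviation is shrunk exactly to zero stay fixed at their typical values, which is how an L1 penalty of this kind performs selection and estimation at the same time.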
Motivation: Time course data obtained from biological samples subject to specific treatments can be very useful for revealing complex and novel biological phenomena. Although an increasing number of time course microarray datasets are becoming available, most of them contain few biological replicates and time points. So far, few computational methods can effectively reveal differentially expressed genes and their patterns in such data. Results: We have proposed a new two-step nonparametric statistical procedure, LRSA, to reveal differentially expressed genes and their expression trends in temporal microarray data. We have also employed external controls as a surrogate to estimate false discovery rates and thus to guide the discovery of differentially expressed genes. Our results show that LRSA reveals substantially more differentially expressed genes and has much lower false discovery rates than two other methods, STEM and ANOVA, in both real and simulated data. Our computational results are confirmed using real-time PCR.
This paper presents a new variational data assimilation (VDA) approach for the formal treatment of bias in both model outputs and observations. This approach relies on the Wasserstein metric stemming from the theory of optimal mass transport to penalize the distance between the probability histograms of the analysis state and an a priori reference dataset, which is likely to be more uncertain but less biased than both model and observations. Unlike previous bias-aware VDA approaches, the new Wasserstein metric VDA (WM-VDA) dynamically treats systematic biases of unknown magnitude and sign in both model and observations through assimilation of the reference data in the probability domain and can fully recover the probability histogram of the analysis state. The performance of WM-VDA is compared with the classic three-dimensional VDA (3D-Var) scheme on first-order linear dynamics and the chaotic Lorenz attractor. Under positive systematic biases in both model and observations, we consistently demonstrate a significant reduction in the forecast bias and unbiased root mean squared error.
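To make the role of the Wasserstein metric more concrete, the following minimal sketch computes the 1-Wasserstein distance between two one-dimensional histograms defined on the same equally spaced bins, using the difference of their cumulative distributions. It does not reproduce the WM-VDA cost function; the choice of the order-1 distance, the function name `wasserstein1_hist`, and the `bin_width` parameter are assumptions made for illustration.

```python
import numpy as np


def wasserstein1_hist(p, q, bin_width=1.0):
    """1-Wasserstein distance between two histograms on identical, equally spaced bins.

    In one dimension this equals the integral of |CDF_p - CDF_q|, which is
    approximated here by a sum over bins (assumption: shared bin edges).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    if p.shape != q.shape:
        raise ValueError("histograms must share the same bins")
    # Normalize counts to probability mass so both histograms sum to one.
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(np.abs(np.cumsum(p) - np.cumsum(q))) * bin_width)


# A histogram shifted by one bin (a pure bias) is at distance one bin width.
reference = [0, 1, 2, 1, 0, 0]
biased = [0, 0, 1, 2, 1, 0]
print(wasserstein1_hist(reference, biased))
```

Because a rigid shift of the histogram translates directly into the distance, a penalty of this kind is sensitive to systematic bias, which is the property the abstract relies on.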
Measuring the veracity or reliability of noisy data is of utmost importance, especially in scenarios where the information is gathered through automated systems. In a recent paper, Chakraborty et al. (2019) introduced a veracity scoring technique for geostatistical data. The authors used high-quality "reference" data to measure the veracity of the varying-quality observations and incorporated the veracity scores in their analysis of mobile-sensor generated noisy weather data to generate efficient predictions of the ambient temperature process. In this paper, we consider the scenario in which no reference data is available, and hence the veracity scores (referred to as VS) are defined based on "local" summaries of the observations. We develop a VS-based estimation method for the parameters of a spatial regression model. Under a non-stationary noise structure and fairly general assumptions on the underlying spatial process, we show that the VS-based estimators of the regression parameters are consistent. Moreover, we establish the advantage of the VS-based estimators over the ordinary least squares (OLS) estimator by analyzing their asymptotic mean squared errors. We illustrate the merits of the VS-based technique through simulations and apply the methodology to a real data set on mass percentages of ash in coal seams in Pennsylvania.
In recent years, with the development of microarray technology, the discovery of useful knowledge from microarray data has become very important. Biclustering is a very useful data mining technique for discovering genes that behave similarly. In microarray data, several objectives have to be optimized simultaneously, and these objectives are often in conflict with each other. A multi-objective model is capable of solving such problems. We propose a hybrid algorithm based on Multi Objective Particle Swarm Optimization for discovering biclusters in gene expression data. In our method, we aim for a low level of overlap among the biclusters and try to cover all elements of the gene expression matrix. Experimental results on a benchmark database show a significant improvement in both the overlap among biclusters and the coverage of elements in the gene expression matrix.
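The two quantities reported above, coverage and overlap, can be sketched as follows. The definitions used here are assumptions, since the abstract does not state them: coverage is taken as the fraction of matrix cells belonging to at least one bicluster, and overlap as the fraction belonging to two or more. The function name `coverage_and_overlap` and the representation of a bicluster as a pair of row and column index collections are hypothetical.

```python
def coverage_and_overlap(n_rows, n_cols, biclusters):
    """Return (coverage, overlap) for biclusters given as (row_indices, col_indices) pairs.

    Assumed definitions: coverage = share of cells in >= 1 bicluster,
    overlap = share of cells in >= 2 biclusters.
    """
    # Count how many biclusters contain each cell of the expression matrix.
    counts = [[0] * n_cols for _ in range(n_rows)]
    for rows, cols in biclusters:
        for i in rows:
            for j in cols:
                counts[i][j] += 1
    total = n_rows * n_cols
    covered = sum(1 for row in counts for c in row if c >= 1)
    overlapped = sum(1 for row in counts for c in row if c >= 2)
    return covered / total, overlapped / total


# Two 2x2 biclusters in a 4x4 matrix that share the single cell (1, 1).
example = [({0, 1}, {0, 1}), ({1, 2}, {1, 2})]
print(coverage_and_overlap(4, 4, example))
```

The two measures pull in opposite directions, since adding biclusters raises coverage but tends to raise overlap as well, which is why a multi-objective formulation is a natural fit.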