No Arabic abstract
For data with high-dimensional covariates but small to moderate sample sizes, the analysis of single datasets often generates unsatisfactory results. The integrative analysis of multiple independent datasets provides an effective way of pooling information and outperforms single-dataset analysis and some alternative multi-datasets approaches including meta-analysis. Under certain scenarios, multiple datasets are expected to share common important covariates, that is, the multiple models have similarity in sparsity structures. However, the existing methods do not have a mechanism to {it promote} the similarity of sparsity structures in integrative analysis. In this study, we consider penalized variable selection and estimation in integrative analysis. We develop an $L_0$-penalty based approach, which is the first to explicitly promote the similarity of sparsity structures. Computationally it is realized using a coordinate descent algorithm. Theoretically it has the much desired consistency properties. In simulation, it significantly outperforms the competing alternative when the models in multiple datasets share common important covariates. It has better or similar performance as the alternative when the sparsity structures share no similarity. Thus it provides a safe choice for data analysis. Applying the proposed method to three lung cancer datasets with gene expression measurements leads to models with significantly more similar sparsity structures and better prediction performance.
We propose a distributed quadratic inference function framework to jointly estimate regression parameters from multiple potentially heterogeneous data sources with correlated vector outcomes. The primary goal of this joint integrative analysis is to estimate covariate effects on all outcomes through a marginal regression model in a statistically and computationally efficient way. We develop a data integration procedure for statistical estimation and inference of regression parameters that is implemented in a fully distributed and parallelized computational scheme. To overcome computational and modeling challenges arising from the high-dimensional likelihood of the correlated vector outcomes, we propose to analyze each data source using Qu, Lindsay and Li (2000)s quadratic inference functions, and then to jointly reestimate parameters from each data source by accounting for correlation between data sources using a combined meta-estimator in a similar spirit to Hansen (1982)s generalised method of moments. We show both theoretically and numerically that the proposed method yields efficiency improvements and is computationally fast. We illustrate the proposed methodology with the joint integrative analysis of the association between smoking and metabolites in a large multi-cohort study and provide an R package for ease of implementation.
For survival data with high-dimensional covariates, results generated in the analysis of a single dataset are often unsatisfactory because of the small sample size. Integrative analysis pools raw data from multiple independent studies with comparable designs, effectively increases sample size, and has better performance than meta-analysis and single-dataset analysis. In this study, we conduct integrative analysis of survival data under the accelerated failure time (AFT) model. The sparsity structures of multiple datasets are described using the homogeneity and heterogeneity models. For variable selection under the homogeneity model, we adopt group penalization approaches. For variable selection under the heterogeneity model, we use composite penalization and sparse group penalization approaches. As a major advancement from the existing studies, the asymptotic selection and estimation properties are rigorously established. Simulation study is conducted to compare different penalization methods and against alternatives. We also analyze four lung cancer prognosis datasets with gene expression measurements.
This paper focuses on exploring the sparsity of the inverse covariance matrix $bSigma^{-1}$, or the precision matrix. We form blocks of parameters based on each off-diagonal band of the Cholesky factor from its modified Cholesky decomposition, and penalize each block of parameters using the $L_2$-norm instead of individual elements. We develop a one-step estimator, and prove an oracle property which consists of a notion of block sign-consistency and asymptotic normality. In particular, provided the initial estimator of the Cholesky factor is good enough and the true Cholesky has finite number of non-zero off-diagonal bands, oracle property holds for the one-step estimator even if $p_n gg n$, and can even be as large as $log p_n = o(n)$, where the data $y$ has mean zero and tail probability $P(|y_j| > x) leq Kexp(-Cx^d)$, $d > 0$, and $p_n$ is the number of variables. We also prove an operator norm convergence result, showing the cost of dimensionality is just $log p_n$. The advantage of this method over banding by Bickel and Levina (2008) or nested LASSO by Levina emph{et al.} (2007) is that it allows for elimination of weaker signals that precede stronger ones in the Cholesky factor. A method for obtaining an initial estimator for the Cholesky factor is discussed, and a gradient projection algorithm is developed for calculating the one-step estimate. Simulation results are in favor of the newly proposed method and a set of real data is analyzed using the new procedure and the banding method.
Least-squares reverse time migration is well-known for its capability to generate artifact-free true-amplitude subsurface images through fitting observed data in the least-squares sense. However, when applied to realistic imaging problems, this approach is faced with issues related to overfitting and excessive computational costs induced by many wave-equation solves. The fact that the source function is unknown complicates this situation even further. Motivated by recent results in stochastic optimization and transform-domain sparsity-promotion, we demonstrate that the computational costs of inversion can be reduced significantly while avoiding imaging artifacts and restoring amplitudes. While powerful, these new approaches do require accurate information on the source-time function, which is often lacking. Without this information, the imaging quality deteriorates rapidly. We address this issue by presenting an approach where the source-time function is estimated on the fly through a technique known as variable projection. Aside from introducing negligible computational overhead, the proposed method is shown to perform well on imaging problems with noisy data and problems that involve complex settings such as salt. In either case, the presented method produces high resolution high-amplitude fidelity images including an estimates for the source-time function. In addition, due to its use of stochastic optimization, we arrive at these images at roughly one to two times the cost of conventional reverse time migration involving all data.
In the high-dimensional sparse modeling literature, it has been crucially assumed that the sparsity structure of the model is homogeneous over the entire population. That is, the identities of important regressors are invariant across the population and across the individuals in the collected sample. In practice, however, the sparsity structure may not always be invariant in the population, due to heterogeneity across different sub-populations. We consider a general, possibly non-smooth M-estimation framework, allowing a possible structural change regarding the identities of important regressors in the population. Our penalized M-estimator not only selects covariates but also discriminates between a model with homogeneous sparsity and a model with a structural change in sparsity. As a result, it is not necessary to know or pretest whether the structural change is present, or where it occurs. We derive asymptotic bounds on the estimation loss of the penalized M-estimators, and achieve the oracle properties. We also show that when there is a structural change, the estimator of the threshold parameter is super-consistent. If the signal is relatively strong, the rates of convergence can be further improved and asymptotic distributional properties of the estimators including the threshold estimator can be established using an adaptive penalization. The proposed methods are then applied to quantile regression and logistic regression models and are illustrated via Monte Carlo experiments.