No Arabic abstract
Though Gaussian graphical models have been widely used in many scientific fields, limited progress has been made to link graph structures to external covariates because of substantial challenges in theory and computation. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our framework facilitates estimation of both population- and subject-level gene regulatory networks, and detection of how subject-level networks vary with genetic variants and clinical conditions. Our framework accommodates high dimensional responses and covariates, and encourages covariate effects on both the mean and the precision matrix to be sparse. In particular for the precision matrix, we stipulate simultaneous sparsity, i.e., group sparsity and element-wise sparsity, on effective covariates and their effects on network edges, respectively. We establish variable selection consistency first under the case with known mean parameters and then a more challenging case with unknown means depending on external covariates, and show in both cases that the convergence rate of the estimated precision parameters is faster than that obtained by lasso or group lasso, a desirable property for the sparse group lasso estimation. The utility and efficacy of our proposed method is demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients.
Modeling of longitudinal data often requires diffusion models that incorporate overall time-dependent, nonlinear dynamics of multiple components and provide sufficient flexibility for subject-specific modeling. This complexity challenges parameter inference and approximations are inevitable. We propose a method for approximate maximum-likelihood parameter estimation in multivariate time-inhomogeneous diffusions, where subject-specific flexibility is accounted for by incorporation of multidimensional mixed effects and covariates. We consider $N$ multidimensional independent diffusions $X^i = (X^i_t)_{0leq tleq T^i}, 1leq ileq N$, with common overall model structure and unknown fixed-effects parameter $mu$. Their dynamics differ by the subject-specific random effect $phi^i$ in the drift and possibly by (known) covariate information, different initial conditions and observation times and duration. The distribution of $phi^i$ is parametrized by an unknown $vartheta$ and $theta = (mu, vartheta)$ is the target of statistical inference. Its maximum likelihood estimator is derived from the continuous-time likelihood. We prove consistency and asymptotic normality of $hat{theta}_N$ when the number $N$ of subjects goes to infinity using standard techniques and consider the more general concept of local asymptotic normality for less regular models. The bias induced by time-discretization of sufficient statistics is investigated. We discuss verification of conditions and investigate parameter estimation and hypothesis testing in simulations.
$ell_1$-penalized quantile regression is widely used for analyzing high-dimensional data with heterogeneity. It is now recognized that the $ell_1$-penalty introduces non-negligible estimation bias, while a proper use of concave regularization may lead to estimators with refined convergence rates and oracle properties as the signal strengthens. Although folded concave penalized $M$-estimation with strongly convex loss functions have been well studied, the extant literature on quantile regression is relatively silent. The main difficulty is that the quantile loss is piecewise linear: it is non-smooth and has curvature concentrated at a single point. To overcome the lack of smoothness and strong convexity, we propose and study a convolution-type smoothed quantile regression with iteratively reweighted $ell_1$-regularization. The resulting smoothed empirical loss is twice continuously differentiable and (provably) locally strongly convex with high probability. We show that the iteratively reweighted $ell_1$-penalized smoothed quantile regression estimator, after a few iterations, achieves the optimal rate of convergence, and moreover, the oracle rate and the strong oracle property under an almost necessary and sufficient minimum signal strength condition. Extensive numerical studies corroborate our theoretical results.
The issue of honesty in constructing confidence sets arises in nonparametric regression. While optimal rate in nonparametric estimation can be achieved and utilized to construct sharp confidence sets, severe degradation of confidence level often happens after estimating the degree of smoothness. Similarly, for high-dimensional regression, oracle inequalities for sparse estimators could be utilized to construct sharp confidence sets. Yet the degree of sparsity itself is unknown and needs to be estimated, causing the honesty problem. To resolve this issue, we develop a novel method to construct honest confidence sets for sparse high-dimensional linear regression. The key idea in our construction is to separate signals into a strong and a weak group, and then construct confidence sets for each group separately. This is achieved by a projection and shrinkage approach, the latter implemented via Stein estimation and the associated Stein unbiased risk estimate. Our confidence set is honest over the full parameter space without any sparsity constraints, while its diameter adapts to the optimal rate of $n^{-1/4}$ when the true parameter is indeed sparse. Through extensive numerical comparisons, we demonstrate that our method outperforms other competitors with big margins for finite samples, including oracle methods built upon the true sparsity of the underlying model.
The purpose of this paper is to construct confidence intervals for the regression coefficients in the Fine-Gray model for competing risks data with random censoring, where the number of covariates can be larger than the sample size. Despite strong motivation from biomedical applications, a high-dimensional Fine-Gray model has attracted relatively little attention among the methodological or theoretical literature. We fill in this gap by developing confidence intervals based on a one-step bias-correction for a regularized estimation. We develop a theoretical framework for the partial likelihood, which does not have independent and identically distributed entries and therefore presents many technical challenges. We also study the approximation error from the weighting scheme under random censoring for competing risks and establish new concentration results for time-dependent processes. In addition to the theoretical results and algorithms, we present extensive numerical experiments and an application to a study of non-cancer mortality among prostate cancer patients using the linked Medicare-SEER data.
Labeling patients in electronic health records with respect to their statuses of having a disease or condition, i.e. case or control statuses, has increasingly relied on prediction models using high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle currently is a lack of valid statistical inference methods for the case probability. In this paper, considering high-dimensional sparse logistic regression models for prediction, we propose a novel bias-corrected estimator for the case probability through the development of linearization and variance enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labelling. We demonstrate the proposed method via extensive simulation studies and application to real-world electronic health record data.