This paper introduces and analyzes a stochastic search method for parameter estimation in linear regression models in the spirit of Beran and Millar (1987). The idea is to generate a random finite subset of the parameter space that automatically contains points very close to the unknown true parameter. The motivation for this procedure comes from recent work of Duembgen, Samworth and Schuhmacher (2011) on regression models with log-concave error distributions.
Skepticism about the assumption of no unmeasured confounding, also known as exchangeability, is often warranted in making causal inferences from observational data, because exchangeability hinges on an investigator's ability to accurately measure covariates that capture all potential sources of confounding. In practice, the most one can hope for is that covariate measurements are proxies of the true underlying confounding mechanism operating in a given observational study. In this paper, we consider the framework of proximal causal inference introduced by Tchetgen Tchetgen et al. (2020), which, while explicitly acknowledging covariate measurements as imperfect proxies of confounding mechanisms, offers an opportunity to learn about causal effects in settings where exchangeability on the basis of measured covariates fails. We make a number of contributions to proximal inference, including (i) an alternative set of conditions for nonparametric proximal identification of the average treatment effect; (ii) general semiparametric theory for proximal estimation of the average treatment effect, including efficiency bounds for key semiparametric models of interest; and (iii) a characterization of proximal doubly robust and locally efficient estimators of the average treatment effect. Moreover, we provide analogous identification and efficiency results for the average treatment effect on the treated. Our approach is illustrated via simulation studies and a data application evaluating the effectiveness of right heart catheterization for critically ill patients in the intensive care unit.
The Youden index is a popular summary statistic for the receiver operating characteristic curve. It gives the optimal cutoff point of a biomarker for distinguishing diseased from healthy individuals. In this paper, we propose to model the distributions of a biomarker for individuals in the healthy and diseased groups via a semiparametric density ratio model. Based on this model, we use the maximum empirical likelihood method to estimate the Youden index and the optimal cutoff point. We further establish the asymptotic normality of the proposed estimators and construct valid confidence intervals for the Youden index and the corresponding optimal cutoff point. The proposed method automatically covers both the case with no lower limit of detection (LLOD) and the case with a fixed and finite LLOD for the biomarker. Extensive simulation studies and a real data example are used to illustrate the effectiveness of the proposed method and its advantages over existing methods.
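For orientation, the Youden index is the maximum over cutoffs of sensitivity plus specificity minus one, and the maximizing cutoff is the optimal cutoff point. The sketch below computes the plain nonparametric plug-in version from two samples. It is not the maximum empirical likelihood estimator under the density ratio model described above, and it ignores any LLOD. The function name `empirical_youden` and the convention that larger biomarker values indicate disease are assumptions made for illustration.

```python
import numpy as np


def empirical_youden(healthy, diseased):
    """Return (J, cutoff) maximizing sensitivity + specificity - 1.

    Assumption: a subject is classified as diseased when the biomarker
    value is strictly greater than the cutoff.
    """
    healthy = np.asarray(healthy, dtype=float)
    diseased = np.asarray(diseased, dtype=float)

    # The empirical criterion is a step function that only changes at
    # observed values, so it suffices to search over the pooled sample.
    candidates = np.unique(np.concatenate([healthy, diseased]))

    # Specificity: share of healthy subjects at or below the cutoff.
    spec = np.array([(healthy <= c).mean() for c in candidates])
    # Sensitivity: share of diseased subjects above the cutoff.
    sens = np.array([(diseased > c).mean() for c in candidates])

    j = sens + spec - 1.0
    best = int(np.argmax(j))  # first maximizer if there are ties
    return float(j[best]), float(candidates[best])
```

With perfectly separated samples such as `[1, 2, 3]` and `[4, 5, 6]`, the index equals 1 and the returned cutoff is the largest healthy value.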
We develop a unified approach to hypothesis testing for various types of widely used functional linear models, such as scalar-on-function, function-on-function and function-on-scalar models. In addition, the proposed test applies to models of mixed types, such as models with both functional and scalar predictors. In contrast with most existing methods, which rest on the large-sample distributions of test statistics, the proposed method leverages the technique of bootstrapping max statistics and exploits the variance decay property inherent in functional data to improve the empirical power of tests, especially when the sample size is limited and the signal is relatively weak. Theoretical guarantees on the validity and consistency of the proposed test are provided uniformly for a class of test statistics.
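The general idea of bootstrapping a max statistic can be illustrated with a generic Gaussian multiplier bootstrap for testing that a mean vector is zero. This is a simplified sketch and not the test proposed above: it works on raw coordinates, does not use the variance decay property, and the function name `max_stat_bootstrap_pvalue` and its parameters are hypothetical.

```python
import numpy as np


def max_stat_bootstrap_pvalue(x, n_boot=500, seed=0):
    """Multiplier-bootstrap p-value for H0: E[x] = 0, using a max statistic.

    x is an (n, p) array of n observations of p coordinates (for example,
    estimated basis coefficients; that interpretation is an assumption).
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = x.shape[0]

    # Observed statistic: largest absolute scaled coordinate mean.
    observed = np.sqrt(n) * np.abs(x.mean(axis=0)).max()

    # Center the data so the bootstrap mimics the null distribution.
    centered = x - x.mean(axis=0)

    # Each bootstrap draw reweights the centered rows by i.i.d. N(0, 1)
    # multipliers and recomputes the max statistic.
    weights = rng.standard_normal((n_boot, n))
    boot = np.abs(weights @ centered).max(axis=1) / np.sqrt(n)

    # Add-one correction keeps the p-value strictly positive.
    return float((1 + np.sum(boot >= observed)) / (1 + n_boot))
```

A small p-value indicates that at least one coordinate mean is far from zero relative to the bootstrap distribution of the maximum.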
Though Gaussian graphical models have been widely used in many scientific fields, limited progress has been made in linking graph structures to external covariates because of substantial challenges in theory and computation. We propose a Gaussian graphical regression model, which regresses both the mean and the precision matrix of a Gaussian graphical model on covariates. In the context of co-expression quantitative trait locus (QTL) studies, our framework facilitates estimation of both population- and subject-level gene regulatory networks, and detection of how subject-level networks vary with genetic variants and clinical conditions. Our framework accommodates high-dimensional responses and covariates, and encourages covariate effects on both the mean and the precision matrix to be sparse. In particular, for the precision matrix, we stipulate simultaneous sparsity, i.e., group sparsity and element-wise sparsity, on effective covariates and their effects on network edges, respectively. We establish variable selection consistency first in the case with known mean parameters and then in the more challenging case with unknown means depending on external covariates, and show in both cases that the convergence rate of the estimated precision parameters is faster than that obtained by lasso or group lasso, a desirable property for the sparse group lasso estimation. The utility and efficacy of our proposed method are demonstrated through simulation studies and an application to a co-expression QTL study with brain cancer patients.
In this paper, we focus on variable selection techniques for a class of semiparametric spatial regression models, which allow one to study the effects of explanatory variables in the presence of spatial information. The spatial smoothing problem in the nonparametric part is tackled by means of bivariate splines over triangulations, which can deal efficiently with data distributed over irregularly shaped regions. In addition, we develop a unified procedure for variable selection to identify significant covariates under a double penalization framework, and we show that the penalized estimators enjoy the oracle property. The proposed method can simultaneously identify non-zero spatially distributed covariates and solve the problem of leakage across complex domains of the functional spatial component. A sandwich formula is also developed to estimate the standard deviations of the proposed coefficient estimators. Finally, Monte Carlo simulation examples and a real data example are provided to illustrate the proposed methodology. All technical proofs are given in the supplementary materials.