Research papers and master's and doctoral theses published by Zijian Guo

SIHR: An R Package for Statistical Inference in High-dimensional Linear and Logistic Regression Models

102 - Prabrisha Rakshit, T. Tony Cai, Zijian Guo 2021

We introduce and illustrate through numerical examples the R package SIHR, which handles statistical inference for (1) linear and quadratic functionals in high-dimensional linear regression and (2) linear functionals in high-dimensional logistic regression. The focus of the proposed algorithms is on point estimation, confidence interval construction and hypothesis testing. The inference methods are extended to multiple regression models. We include real data applications to demonstrate the package's performance and practicality.

Computation, Methodology, Statistics

Surrogate Assisted Semi-supervised Inference for High Dimensional Risk Prediction

212 - Jue Hou, Zijian Guo, Tianxi Cai 2021

Risk modeling with EHR data is challenging due to a lack of direct observations on the disease outcome, and the high dimensionality of the candidate predictors. In this paper, we develop a surrogate assisted semi-supervised-learning (SAS) approach to risk modeling with high dimensional predictors, leveraging a large unlabeled data set on candidate predictors and surrogates of outcome, as well as a small labeled data set with annotated outcomes. The SAS procedure borrows information from surrogates along with candidate predictors to impute the unobserved outcomes via a sparse working imputation model with moment conditions to achieve robustness against mis-specification in the imputation model and a one-step bias correction to enable interval estimation for the predicted risk. We demonstrate that the SAS procedure provides valid inference for the predicted risk derived from a high dimensional working model, even when the underlying risk prediction model is dense and the risk model is mis-specified. We present an extensive simulation study to demonstrate the superiority of our SSL approach compared to existing supervised methods. We apply the method to derive genetic risk prediction of type-2 diabetes mellitus using an EHR biobank cohort.

Statistics Theory, Methodology, Machine Learning

Causal Inference with Invalid Instruments: Post-selection Problems and A Solution Using Searching and Sampling

248 - Zijian Guo 2021

Instrumental variable methods are among the most commonly used causal inference approaches to account for unmeasured confounders in observational studies. The presence of invalid instruments is a major concern for practical applications and a fast-growing area of research is inference for the causal effect with possibly invalid instruments. The existing inference methods rely on correctly separating valid and invalid instruments in a data dependent way. In this paper, we illustrate post-selection problems of these existing methods. We construct uniformly valid confidence intervals for the causal effect, which are robust to the mistakes in separating valid and invalid instruments. Our proposal is to search for the causal effect such that a sufficient number of candidate instruments can be taken as valid. We further devise a novel sampling method, which, together with searching, leads to a more precise confidence interval. Our proposed searching and sampling confidence intervals are shown to be uniformly valid under the finite-sample majority and plurality rules. We compare our proposed methods with existing inference methods over a large set of simulation studies and apply them to study the effect of the triglyceride level on the glucose level over a mouse data set.

Methodology, Statistics Theory

Inference for the Case Probability in High-dimensional Logistic Regression

330 - Zijian Guo, Prabrisha Rakshit, Daniel S. Herman 2020

Labeling patients in electronic health records with respect to their statuses of having a disease or condition, i.e. case or control statuses, has increasingly relied on prediction models using high-dimensional variables derived from structured and unstructured electronic health record data. A major hurdle currently is a lack of valid statistical inference methods for the case probability. In this paper, considering high-dimensional sparse logistic regression models for prediction, we propose a novel bias-corrected estimator for the case probability through the development of linearization and variance enhancement techniques. We establish asymptotic normality of the proposed estimator for any loading vector in high dimensions. We construct a confidence interval for the case probability and propose a hypothesis testing procedure for patient case-control labelling. We demonstrate the proposed method via extensive simulation studies and application to real-world electronic health record data.

Methodology, Statistics Theory

Inference for High-dimensional Maximin Effects in Heterogeneous Regression Models Using a Sampling Approach

266 - Zijian Guo 2020

Heterogeneity is an important feature of modern data sets and a central task is to extract information from large-scale and heterogeneous data. In this paper, we consider multiple high-dimensional linear models and adopt the definition of maximin effect (Meinshausen, Bühlmann, AoS, 43(4), 1801–1830) to summarize the information contained in this heterogeneous model. We define the maximin effect for a targeted population whose covariate distribution is possibly different from that of the observed data. We further introduce a ridge-type maximin effect to simultaneously account for reward optimality and statistical stability. To identify the high-dimensional maximin effect, we estimate the regression covariance matrix by a debiased estimator and use it to construct the aggregation weights for the maximin effect. A main challenge for statistical inference is that the estimated weights might have a mixture distribution and the resulting maximin effect estimator is not necessarily asymptotically normal. To address this, we devise a novel sampling approach to construct the confidence interval for any linear contrast of high-dimensional maximin effects. The coverage and precision properties of the proposed confidence interval are studied. The proposed method is demonstrated over simulations and a genetic data set on yeast colony growth under different environments.

Methodology, Statistics Theory, Machine Learning
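
As a rough illustration of the "aggregation weights" mentioned in the maximin-effect abstract above, the sketch below computes the maximin effect for two groups in the population setting of Meinshausen and Bühlmann, where the effect is the convex combination of the group coefficient vectors that minimizes a quadratic form. It assumes the coefficient vectors and the covariate covariance matrix are known exactly; the function names `quad_form` and `maximin_two_groups` are hypothetical, and nothing here reproduces the paper's debiased estimation or sampling-based confidence intervals.

```python
def quad_form(u, sigma, v):
    """Return u^T Sigma v for vectors and a matrix given as plain Python lists."""
    return sum(u[i] * sigma[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))


def maximin_two_groups(b1, b2, sigma):
    """Population maximin effect for two groups (illustrative sketch only).

    Minimizes gamma^2 * G11 + 2 * gamma * (1 - gamma) * G12 + (1 - gamma)^2 * G22
    over gamma in [0, 1], where Gkl = b_k^T Sigma b_l, and returns the weight
    gamma on group 1 together with the aggregated coefficient vector.
    """
    g11 = quad_form(b1, sigma, b1)
    g22 = quad_form(b2, sigma, b2)
    g12 = quad_form(b1, sigma, b2)
    # denom equals (b1 - b2)^T Sigma (b1 - b2), which is non-negative.
    denom = g11 + g22 - 2.0 * g12
    if denom <= 1e-12:
        # The two groups are indistinguishable under Sigma; any weight works.
        gamma = 0.5
    else:
        # Unconstrained minimizer of the quadratic, clipped to the simplex.
        gamma = (g22 - g12) / denom
        gamma = min(1.0, max(0.0, gamma))
    beta = [gamma * x + (1.0 - gamma) * y for x, y in zip(b1, b2)]
    return gamma, beta


identity = [[1.0, 0.0], [0.0, 1.0]]
gamma_a, beta_a = maximin_two_groups([1.0, 0.0], [0.0, 1.0], identity)
gamma_b, beta_b = maximin_two_groups([2.0, 0.0], [1.0, 0.0], identity)
```

In the second call both groups share the same direction, so the weight collapses onto the group with the smaller effect, which is the conservative behaviour the maximin criterion is designed to produce.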

Causal Inference for Nonlinear Outcome Models with Possibly Invalid Instrumental Variables

375 - Sai Li, Zijian Guo 2020

Instrumental variable methods are widely used for inferring the causal effect of an exposure on an outcome when the observed relationship is potentially affected by unmeasured confounders. Existing instrumental variable methods for nonlinear outcome models require stringent identifiability conditions. We develop a robust causal inference framework for nonlinear outcome models, which relaxes the conventional identifiability conditions. We adopt a flexible semi-parametric potential outcome model and propose new identifiability conditions for identifying the model parameters and causal effects. We devise a novel three-step inference procedure for the conditional average treatment effect and establish the asymptotic normality of the proposed point estimator. We construct confidence intervals for the causal effect by the bootstrap method. The proposed method is demonstrated in a large set of simulation studies and is applied to study the causal effects of lipid levels on whether the glucose level is normal or high over a mice dataset.

Methodology

Doubly Debiased Lasso: High-Dimensional Inference under Hidden Confounding

158 - Zijian Guo, Domagoj Cevid, Peter Bühlmann 2020

Inferring causal relationships or related associations from observational data can be invalidated by the existence of hidden confounding. We focus on a high-dimensional linear regression setting, where the measured covariates are affected by hidden confounding and propose the Doubly Debiased Lasso estimator for individual components of the regression coefficient vector. Our advocated method simultaneously corrects both the bias due to estimation of high-dimensional parameters and the bias caused by the hidden confounding. We establish its asymptotic normality and also prove that it is efficient in the Gauss-Markov sense. The validity of our methodology relies on a dense confounding assumption, i.e. that every confounding variable affects many covariates. The finite sample performance is illustrated with an extensive simulation study and a genomic application.

Methodology, Statistics Theory

Group Inference in High Dimensions with Applications to Hierarchical Testing

283 - Zijian Guo, Claude Renaux, Peter Bühlmann 2019

High-dimensional group inference is an essential part of statistical methods for analysing complex data sets, including hierarchical testing, tests of interaction, detection of heterogeneous treatment effects and inference for local heritability. Group inference in regression models can be measured with respect to a weighted quadratic functional of the regression sub-vector corresponding to the group. Asymptotically unbiased estimators of these weighted quadratic functionals are constructed and a novel procedure using these estimators for inference is proposed. We derive its asymptotic Gaussian distribution which enables the construction of asymptotically valid confidence intervals and tests which perform well in terms of length or power. The proposed test is computationally efficient even for a large group, statistically valid for any group size and achieves good power performance for testing large groups with many small regression coefficients. We apply the methodology to several interesting statistical problems and demonstrate its strength and usefulness on simulated and real data.

Methodology

Local Inference in Additive Models with Decorrelated Local Linear Estimator

105 - Zijian Guo, Cun-Hui Zhang 2019

Additive models, as a natural generalization of linear regression, have played an important role in studying nonlinear relationships. Despite a rich literature and many recent advances on the topic, the statistical inference problem in additive models is still relatively poorly understood. Motivated by the inference for the exposure effect and other applications, we tackle in this paper the statistical inference problem for $f_1'(x_0)$ in additive models, where $f_1$ denotes the univariate function of interest and $f_1'(x_0)$ denotes its first order derivative evaluated at a specific point $x_0$. The main challenge for this local inference problem is the understanding and control of the additional uncertainty due to the need of estimating other components in the additive model as nuisance functions. To address this, we propose a decorrelated local linear estimator, which is particularly useful in reducing the effect of the nuisance function estimation error on the estimation accuracy of $f_1'(x_0)$. We establish the asymptotic limiting distribution for the proposed estimator and then construct confidence interval and hypothesis testing procedures for $f_1'(x_0)$. The variance level of the proposed estimator is of the same order as that of the local least squares in nonparametric regression, or equivalently the additive model with one component, while the bias of the proposed estimator is jointly determined by the statistical accuracies in estimating the nuisance functions and the relationship between the variable of interest and the nuisance variables. The method is developed for general additive models and is demonstrated in the high-dimensional sparse setting.

Statistics Theory, Methodology
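
For readers unfamiliar with local linear estimation, the sketch below shows the classical single-covariate local linear fit, whose slope coefficient estimates a first derivative at a point $x_0$. This is only the standard building block that the abstract above compares against ("local least squares in nonparametric regression"), not the decorrelated estimator proposed in the paper. The function name `local_linear_fit`, the Gaussian kernel, and the bandwidth value are assumptions made for illustration.

```python
import math


def local_linear_fit(xs, ys, x0, bandwidth):
    """Kernel-weighted least squares of y on (1, x - x0).

    Returns (intercept, slope): the intercept estimates f(x0) and the slope
    estimates f'(x0). A Gaussian kernel is assumed here for simplicity.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in xs]
    devs = [x - x0 for x in xs]
    # Weighted moments entering the 2x2 normal equations.
    s0 = sum(weights)
    s1 = sum(w * d for w, d in zip(weights, devs))
    s2 = sum(w * d * d for w, d in zip(weights, devs))
    t0 = sum(w * y for w, y in zip(weights, ys))
    t1 = sum(w * d * y for w, d, y in zip(weights, devs, ys))
    det = s0 * s2 - s1 * s1
    if det <= 0.0:
        raise ValueError("design is degenerate near x0; try a larger bandwidth")
    intercept = (s2 * t0 - s1 * t1) / det
    slope = (s0 * t1 - s1 * t0) / det
    return intercept, slope


# Noise-free checks on a grid that is symmetric around x0 = 1.
grid = [i / 10 for i in range(21)]
line_fit = local_linear_fit(grid, [2.0 * x + 1.0 for x in grid], 1.0, 0.3)
quad_fit = local_linear_fit(grid, [x * x for x in grid], 1.0, 0.3)
```

On the linear test function the fit recovers the slope up to rounding error; on the quadratic, the symmetric grid makes the slope estimate match the true derivative at $x_0 = 1$, while the intercept carries the usual curvature bias.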
