While a randomized controlled trial (RCT) readily measures the average treatment effect (ATE), this measure may need to be generalized to the target population to account for sampling bias in the RCT population. Identifying this target population treatment effect requires covariates observed in both data sets that capture all treatment effect modifiers whose distribution is shifted between the two sets. Standard estimators then use either weighting (IPSW), outcome modeling (G-formula), or combine the two in doubly robust approaches (AIPSW). However, such covariates are often not available in both sets. Therefore, after completing existing proofs on the complete-case consistency of those three estimators, we compute the expected bias induced by a missing covariate, assuming a Gaussian distribution and a semi-parametric linear model. This enables sensitivity analysis for each missing covariate pattern, giving the sign of the expected bias. We also show that there is no gain in imputing a partially unobserved covariate. Finally, we study the replacement of a missing covariate by a proxy. We illustrate all these results on simulations, on semi-synthetic benchmarks using data from the Tennessee Student/Teacher Achievement Ratio (STAR), and on a real-world example from critical care medicine.
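To make the weighting and outcome-modeling estimators named above concrete, here is a minimal sketch on simulated data. It is not the authors' implementation: the data-generating process (a single Gaussian covariate shifted between trial and target, with a linear outcome), the function names `fit_logistic` and `simulate`, the sample sizes, and the use of normalized weights are all assumptions chosen for illustration.

```python
import numpy as np


def fit_logistic(x, s, n_iter=25):
    """Newton-Raphson logistic regression of a binary label s on [1, x]."""
    Z = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ beta))
        grad = Z.T @ (s - p)
        hess = Z.T @ (Z * (p * (1.0 - p))[:, None])
        beta += np.linalg.solve(hess, grad)
    return beta


def ols_predict(x_fit, y_fit, x_new):
    """Fit y ~ 1 + x by least squares and predict at x_new."""
    Z = np.column_stack([np.ones_like(x_fit), x_fit])
    coef, *_ = np.linalg.lstsq(Z, y_fit, rcond=None)
    return coef[0] + coef[1] * x_new


def simulate(seed=0, n=20_000, m=20_000):
    """Hypothetical setting: X ~ N(0,1) in the RCT, X ~ N(1,1) in the target.

    The effect is 1 + 2X, so the RCT ATE is 1 and the target ATE is 3.
    """
    rng = np.random.default_rng(seed)
    x_rct = rng.normal(0.0, 1.0, n)
    x_tgt = rng.normal(1.0, 1.0, m)
    a = rng.binomial(1, 0.5, n)  # randomized treatment in the trial
    y = x_rct + a * (1.0 + 2.0 * x_rct) + rng.normal(0.0, 1.0, n)

    # Unadjusted difference in means: estimates the RCT ATE, not the target ATE.
    naive = y[a == 1].mean() - y[a == 0].mean()

    # G-formula: fit one outcome model per arm on the RCT,
    # then average the predicted contrast over the target covariates.
    mu1 = ols_predict(x_rct[a == 1], y[a == 1], x_tgt)
    mu0 = ols_predict(x_rct[a == 0], y[a == 0], x_tgt)
    g_formula = float((mu1 - mu0).mean())

    # IPSW: model trial membership on the stacked sample, then weight each
    # trial unit by its odds of belonging to the target rather than the trial.
    x_all = np.concatenate([x_rct, x_tgt])
    s = np.concatenate([np.ones(n), np.zeros(m)])
    beta = fit_logistic(x_all, s)
    p_rct = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x_rct)))
    w = (1.0 - p_rct) / p_rct
    ipsw = float(
        np.average(y[a == 1], weights=w[a == 1])
        - np.average(y[a == 0], weights=w[a == 0])
    )

    return {"naive_rct": float(naive), "g_formula": g_formula, "ipsw": ipsw}


if __name__ == "__main__":
    print(simulate())
```

In this toy setting the shifted covariate is observed in both samples, so both estimators should recover a value near the target effect while the unadjusted difference in means stays near the trial effect. Dropping the covariate from both models would reproduce the missing-covariate situation the abstract analyzes.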
We focus on the problem of generalizing a causal effect estimated on a randomized controlled trial (RCT) to a target population described by a set of covariates from observational data. Available methods such as inverse propensity weighting are not designed to handle missing values, which are nonetheless common in both data sources. Beyond coupling the assumptions for causal effect identifiability with those for the missing values mechanism and defining appropriate estimation strategies, one difficulty is to account for the specific structure of the data: two sources, with treatment and outcome available only in the RCT. We propose and compare three multiple imputation strategies (separate imputation, joint imputation with fixed effect, joint imputation without source information), as well as a technique that uses estimators that can handle missing values directly without imputing them. These methods are assessed in an extensive simulation study, which shows the empirical superiority of fixed effect multiple imputation followed by any complete-data generalization estimator. This work is motivated by the analysis of a large registry of over 20,000 major trauma patients and an RCT studying the effect of tranexamic acid administration on mortality. The analysis illustrates how the handling of missing values can affect the conclusion about the effect generalized from the RCT to the target population.
In the field of disparities research, there has been growing interest in developing a counterfactual-based decomposition analysis to identify underlying mediating mechanisms that help reduce disparities in populations. Despite rapid development in the area, most prior studies have been limited to regression-based methods, undermining the possibility of addressing complex models with multiple mediators and/or heterogeneous effects. We propose an estimation method that effectively addresses complex models. Moreover, we develop a novel sensitivity analysis for possible violations of identification assumptions. The proposed method and sensitivity analysis are demonstrated with data from the Midlife Development in the US study to investigate the degree to which disparities in cardiovascular health at the intersection of race and gender would be reduced if the distributions of education and perceived discrimination were the same across intersectional groups.
Missing data and confounding are two problems researchers face in observational studies for comparative effectiveness. Williamson et al. (2012) recently proposed a unified approach to handle both issues concurrently using a multiply-robust (MR) methodology under the assumption that confounders are missing at random. Their approach considers a union of models in which any submodel has a parametric component while the remaining models are unrestricted. We show that while their estimating function is MR in theory, multiply robust inference is complicated by the fact that parametric models for different components of the union model are not variation independent, and therefore the MR property is unlikely to hold in practice. To address this, we propose an alternative transparent parametrization of the likelihood function, which makes explicit the model dependencies between the various nuisance functions needed to evaluate the MR efficient score. The proposed method is genuinely doubly-robust (DR) in that it is consistent and asymptotically normal if one of two sets of modeling assumptions holds. We evaluate the performance and doubly robust property of the DR method via a simulation study.
Missing data is a common problem in clinical data collection and complicates the statistical analysis of such data. To overcome problems caused by incomplete data, we propose a new imputation method called projective resampling imputation mean estimation (PRIME), which can also address the "curse of dimensionality" problem in imputation with less information loss. We vary sample sizes, missing-data rates, covariate correlations, and noise levels in simulation studies, and all results show that PRIME outperforms other methods such as iterative least-squares estimation (ILSE), maximum likelihood (ML), and complete-case analysis (CC). Moreover, we conduct a study of influential factors in cardiac surgery-associated acute kidney injury (CSA-AKI), which shows that our method performs better than the other models. Finally, we prove that PRIME is consistent under some regularity conditions.
Causal inference has become increasingly reliant on observational studies with rich covariate information. To build tractable causal models, including propensity score models, it is imperative to first extract important features from high dimensional data. Unlike the familiar task of variable selection for prediction modeling, our feature selection procedure aims to control for confounding while maintaining efficiency in the resulting causal effect estimate. Previous empirical studies suggest that one should aim to include all predictors of the outcome, rather than of the treatment, in the propensity score model. In this paper, we formalize this intuition through rigorous proofs, and propose the causal ball screening for selecting these variables from modern ultra-high dimensional data sets. A distinctive feature of our proposal is that we do not require any modeling of the outcome regression, thus providing robustness against misspecification of the functional form or violation of smoothness conditions. Our theoretical analyses show that the proposed procedure enjoys a number of oracle properties including model selection consistency, normality and efficiency. Synthetic and real data analyses show that our proposal compares favorably with existing methods in a range of realistic settings.